Productionising data science

Last week I went to a very interesting meetup about data science.

It was focused on productionising data science and was presented by Dr. Pascal Bugnion & Dr. Scott Stevenson from ASI (http://www.theasi.co).

The insights provided were well structured and interesting, so I decided to summarise all the suggested ‘recipes’.

Data science vs production

The main problem many companies face is that the output of data scientists is often far from ‘ideal’ for further use in production. The talk discussed how these problems can be addressed.

The first (and easier) step of data science is providing general insights; the next step is creating predictive models; and the last is employing these predictions in production. Unfortunately, in most cases the solutions produced by data scientists take a long time to run, are linear, and are optimised only for accuracy, without much thought given to latency or scalability. In production, however, speed, resilience and the ability to retrain are usually what matter most!

In most companies, once a data scientist has developed something, they simply hand the solution over to data engineers who need to make it ‘work’. Usually this code has to be rewritten completely, because the data scientist’s goal differs from the production one (they optimise for accuracy and forget about scalability).

Solutions

The best approach here, of course, is effective management and team organisation – for example, teams can be cross-functional, share knowledge and work towards a common goal together.

However, there are some best practices that can be adopted by all data scientists to make their code more production-ready!

I will just list some of the solutions that were suggested:

1. Coding setup:

  • make sure you share your environment setup with the production team (pip freeze was suggested; having a ‘requirements.txt’ for Python is great);
  • implement model versioning, so that you can track changes and always have a way to roll back to a more successful model;
  • make sure you have tests in place to check that your predictions bring valuable results, including tests for corner cases (see the sketch after this list).
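For the setup point, running `pip freeze > requirements.txt` captures your exact package versions. For the testing point, here is a minimal sketch using pytest; the predict function, its module and its input contract are hypothetical stand-ins for your own model code:

```python
# test_predict.py - a corner-case test sketch; `model.predict` is a
# hypothetical function taking a feature dict and returning a probability.
import pytest

from model import predict


def test_prediction_is_a_probability():
    # A sane prediction should land in [0, 1].
    score = predict({"age": 35, "income": 52000})
    assert 0.0 <= score <= 1.0


def test_missing_feature_fails_loudly():
    # Corner case: incomplete input should raise,
    # not silently return garbage.
    with pytest.raises(KeyError):
        predict({"age": 35})
```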

2. Coding style:

  • make your scripts well structured (use classes and an object-oriented structure so that methods can be reused without re-running the whole pipeline – e.g. keep loading data, cleansing and retraining as separate steps);
  • make sure that your training and prediction code are separate and not linked to each other (so that the prediction part can be completely independent and not ‘aware’ of training);
  • don’t hardcode paths to your file system, or at least put them in a variable that can be easily spotted and changed by the production team (see the sketch below).
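A minimal sketch of this structure, assuming a churn-prediction example (the class names, feature columns and file formats are illustrative assumptions, not the presenters’ code):

```python
# pipeline.py - a minimal sketch: separate steps, separate training and
# prediction, and paths kept in easily spotted variables.
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression

DATA_PATH = "data/customers.csv"  # easy to spot and change, never buried deep in the code
MODEL_PATH = "model.pkl"


class TrainingPipeline:
    """Loading, cleansing and training kept as separate, reusable steps."""

    def load(self):
        return pd.read_csv(DATA_PATH)

    def clean(self, df):
        return df.dropna(subset=["age", "income", "churned"])

    def train(self, df):
        clf = LogisticRegression()
        clf.fit(df[["age", "income"]], df["churned"])
        with open(MODEL_PATH, "wb") as f:
            pickle.dump(clf, f)


class Predictor:
    """Prediction knows nothing about training - it only needs the saved model."""

    def __init__(self, model_path=MODEL_PATH):
        with open(model_path, "rb") as f:
            self.clf = pickle.load(f)

    def predict(self, features):
        df = pd.DataFrame([features])
        return float(self.clf.predict_proba(df[["age", "income"]])[0, 1])
```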


3. Production:

  • THINK ABOUT PRODUCTION (keep in mind that your prediction should be not only accurate but also efficient);
  • your prediction code should expect data in the same format it will receive from production (e.g. JSON) – see the sketch below.
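As a sketch of that last point, reusing the hypothetical Predictor from the earlier example (the payload shape is an illustrative assumption), the prediction entry point can consume exactly the JSON that production will send:

```python
# predict_json.py - accept the same JSON payload production systems send;
# the "features" key and the pipeline module are illustrative assumptions.
import json

from pipeline import Predictor

predictor = Predictor()


def predict_from_json(raw):
    """Parse a production-format JSON payload and score it."""
    payload = json.loads(raw)
    return predictor.predict(payload["features"])


if __name__ == "__main__":
    print(predict_from_json('{"features": {"age": 35, "income": 52000}}'))
```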

‘Bonus tips’

Once all these tips are followed, there were a couple of additional steps suggested by the data engineers:

  1. API for prediction:
    • It is great to create an API for your prediction model with several endpoints (e.g. a Flask app with /predict and /retrain). You can then run several instances behind a load balancer to handle requests efficiently, and stats from these APIs can feed internal dashboards to track e.g. performance (see the sketch after this list).
  2. Think about automation:
    • retraining (e.g. use cron jobs to schedule retraining every day);
    • testing (think of performance metrics and test cases to check how the system performs).
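A minimal sketch of such an API, again assuming the hypothetical TrainingPipeline and Predictor classes from the earlier sketch (the endpoints match the talk’s /predict and /retrain example, everything else is illustrative):

```python
# app.py - a minimal Flask API exposing prediction and retraining.
from flask import Flask, jsonify, request

from pipeline import Predictor, TrainingPipeline

app = Flask(__name__)
predictor = Predictor()


@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()
    return jsonify({"score": predictor.predict(features)})


@app.route("/retrain", methods=["POST"])
def retrain():
    global predictor
    pipeline = TrainingPipeline()
    pipeline.train(pipeline.clean(pipeline.load()))
    predictor = Predictor()  # reload the freshly saved model
    return jsonify({"status": "retrained"})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Daily retraining can then be automated with a cron job that POSTs to /retrain, and the endpoints give you a natural place to log the stats that feed internal dashboards.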

In addition to what was said in the presentation, it would be great if every data scientist knew more about software development best practices, which make this process transparent, organised and easy for teamwork.

And of course, using source control sounds like a must! :) Not only for handing over to the production team, but also for versioning and documenting changes.

  • Items 1 and 2 are very much about portability, collaboration and elegance, with the side effect of making work and collaboration more enjoyable and less of a headache. Item 3 and the ‘Bonus tips’ should always be seen as the North Star: they touch on good working practices like automation, modular architecture and telemetry.

    Many data scientists place their focus on the wrong area – essentially machine learning and statistics – and forget the infrastructure component that is vital for contributing effectively to a team and product. For me, infrastructure covers things like a DevOps and agile mindset, an emphasis on user stories, and use of Docker, Jenkins, Git, Linux, Vagrant, Jira and the AWS ecosystem. When I get a CV with these, I’m confident I have a production-ready candidate and an experienced coder.