Productionising data science

Last week I went to a very interesting meetup about data science.

It focused on productionising data science and was presented by Dr. Pascal Bugnion and Dr. Scott Stevenson from ASI (http://www.theasi.co).

The insights provided were well structured and interesting, so I decided to summarise all the suggested ‘recipes’.

Data science vs production

The main problem many companies face is that the work of data scientists is often far from ‘ideal’ for further use in production. The talk discussed how these problems can be addressed.

The first (and easier) step of data science is providing general insights; the next step is creating predictive models; and the last is employing these predictions in production. Unfortunately, in most cases solutions produced by data scientists take a long time to run, are linear, and are optimised only for accuracy, without much thought for latency or scalability. In production, however, speed, resilience and the ability to retrain usually matter most!

In most companies, after a data scientist has developed something, they simply hand the solution over to data engineers who need to make it ‘work’. Usually this code has to be rewritten completely, because the data scientist’s goal differs from the production one: they optimise for accuracy and forget about scalability.

Solutions

The best approach here, of course, is effective management and team organisation: for example, teams can be cross-functional, share knowledge, and work together towards a common goal.

However, there are some best practices that can be adopted by all data scientists to make their code ‘readier’ for production!

I will just list some of the solutions that were suggested:

1. Coding setup:

  • make sure that you share your setup with the production team (pip freeze was suggested; having a ‘requirements.txt’ for Python is also great);
  • implement model versioning, so that you can track changes and always have a way to roll back to a more successful model;
  • make sure that you have some tests in place to check that your predictions bring valuable results, including tests for corner cases.
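To make the testing point concrete, here is a minimal sketch of sanity tests around a prediction function, assuming a hypothetical `predict()` that scores a feature dict between 0 and 1 (the function itself is a stand-in, not a real model):

```python
# Hypothetical predict() standing in for a trained model; in a real
# pipeline the model would be loaded from a versioned artifact.
def predict(features):
    return min(1.0, max(0.0, 0.1 * features.get("visits", 0)))

def test_output_in_valid_range():
    assert 0.0 <= predict({"visits": 3}) <= 1.0

def test_corner_case_empty_input():
    # Missing features should not crash the model, just score neutrally.
    assert predict({}) == 0.0

def test_corner_case_extreme_input():
    # Very large inputs must still produce a bounded score.
    assert predict({"visits": 10**9}) == 1.0

if __name__ == "__main__":
    test_output_in_valid_range()
    test_corner_case_empty_input()
    test_corner_case_extreme_input()
    print("all tests passed")
```

Even a handful of checks like these give the production team confidence that a retrained model still behaves sensibly on corner cases.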

2. Coding style:

  • make your scripts well structured (use classes and an object-oriented design so that some methods can be reused without re-running the whole pipeline, e.g. keep data loading, cleansing and retraining separate);
  • keep your training and prediction code separate and not linked to each other (so that the prediction part can be completely independent and not ‘aware’ of training);
  • don’t hardcode paths to your file system, or at least put them in a variable that can be easily spotted and changed by the production team.
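The three style points above can be sketched in one small class. This is an illustrative toy (the class name, data and ‘training’ rule are all made up), but it shows the shape: loading, cleansing, training and prediction are separate reusable methods, prediction only depends on the stored model state, and the data path is a visible variable rather than a hardcoded string:

```python
# Illustrative pipeline sketch; names and logic are hypothetical.
class ChurnModel:
    def __init__(self, data_path):
        self.data_path = data_path  # easy for engineers to spot and change
        self.weights = None

    def load_data(self):
        # In a real pipeline this would read from self.data_path.
        return [(1.0, 0), (2.0, 1), (3.0, 1)]

    def clean(self, rows):
        # Drop rows with missing values; reusable on its own.
        return [r for r in rows if r[0] is not None]

    def train(self, rows):
        # Toy 'training': threshold at the mean of positive examples.
        positives = [x for x, y in rows if y == 1]
        self.weights = sum(positives) / len(positives)
        return self.weights

    def predict(self, x):
        # Prediction only needs the stored weights, not the training data.
        return 1 if x >= self.weights else 0

if __name__ == "__main__":
    model = ChurnModel(data_path="/tmp/churn.csv")
    rows = model.clean(model.load_data())
    model.train(rows)
    print(model.predict(2.5))  # -> 1
```

Because `predict` only touches `self.weights`, the prediction part could be shipped on its own with a serialised model, completely unaware of how training happened.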


3. Production:

  • THINK ABOUT PRODUCTION (keep in mind that your predictions should be not only accurate but also efficient);
  • your prediction code should expect data in the same format it will receive from production (e.g. JSON).
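As a small sketch of that last point, here is a prediction function that consumes the same JSON payload production would send. The payload shape and scoring rule are assumptions for illustration only:

```python
import json

# Hypothetical payload shape: {"features": {"age": ..., "clicks": ...}}
def predict_from_json(payload):
    data = json.loads(payload)
    features = data["features"]
    # Toy scoring rule standing in for a real model.
    score = 0.5 * features.get("age", 0) / 100 + 0.5 * features.get("clicks", 0) / 10
    return json.dumps({"score": round(score, 3)})

if __name__ == "__main__":
    print(predict_from_json('{"features": {"age": 40, "clicks": 5}}'))
```

Agreeing on the input format early means no translation layer has to be bolted on when the model moves to production.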

‘Bonus tips’

Once all these tips are followed, there are a couple of additional steps suggested by data engineers:

  1. API for prediction:
    • It is great to create an API for your prediction model with several endpoints (e.g. a Flask app with /predict and /retrain). You can then run several instances behind a load balancer to manage traffic in the most efficient way. Stats from these APIs can also feed internal dashboards to track e.g. performance.
  2. Think about automation:
    • retraining (e.g. use cron jobs to schedule retraining every day);
    • testing (think about performance metrics and test cases to check how the system performs).
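The API idea above can be sketched as a minimal Flask app with the two endpoints the talk mentioned. The model logic is a hypothetical stand-in (a single threshold), and in a real service /retrain would kick off an actual training job rather than just updating state:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
model = {"threshold": 0.5}  # toy 'model' state; a real app would load an artifact

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()
    score = features.get("value", 0)
    return jsonify({"prediction": int(score >= model["threshold"])})

@app.route("/retrain", methods=["POST"])
def retrain():
    # In a real service this would trigger a training job and swap the model.
    model["threshold"] = request.get_json().get("threshold", 0.5)
    return jsonify({"status": "retrained"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Running several such instances behind a load balancer, and logging request stats per endpoint, gives you both scalability and the raw data for an internal performance dashboard.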

In addition to what was said in the presentation, it would be great if every data scientist knew more about best practices in software development, which make the whole process transparent, organised and easy for teamwork.

And of course, using source control is a must! :) Not only for handing over to the production team, but also for versioning and documenting changes.

Category: Machine Learning