Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components: a learner for generating models based on training data, modules for analyzing and validating both data and models, and finally infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt. We present the anatomy of a general-purpose machine learning platform and one implementation of such a platform at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes service disruptions. We present the case study of one deployment of the platform in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying the platform led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.
WHY IT MATTERS: This post is a bit theoretical for anyone who is not bathing in digital transformation all day as I (mostly) do. It explores what could come next but, more importantly, stresses a key fact: we now have a new platform, the mobile phone, in our pockets that has the potential to transform our world. We have started to see this with eCommerce (physical retail stores become irrelevant and we can price compare from anywhere), but we have yet to see the real impact of eCommerce+mobile+cloud+5G. THIS is what I think about most days...