Bringing software development principles to Machine Learning
How rethinking Machine Learning in terms of software development principles led to a new wave of Machine Learning tools
Reason to bring up this conversation
There has been an immense buzz around MLOps and the second wave of MLOps tools. But MLOps is an umbrella term that covers a lot of distinct steps, and looking only at the umbrella term gives a false picture of the space. In this blog, we walk through the reasons behind the MLOps wave and the companies doing well in it, and use a software development analogy to position each of them in the space.
Evolution
Back in the 1990s, software development largely followed a platform approach: one platform to build things on, with processes brought into play around it.
In the early 2010s, when Databricks and other ML platforms came out, the approach was similar: bring a process into play for how a team should go about building Machine Learning.
Most successful companies are opinionated on how a particular thing needs to be done and have successfully embedded that behavior into customers.
The reason it worked is that Machine Learning, or Data Science, is all about data. Once you build a data lake, using the tools built on top of it to perform analytics is a no-brainer.
Coming back to the evolution of software development: developers moved out of a process-based framework and into a value-based one.
This led to the evolution of DevOps tools: no single end-to-end platform, but multiple SaaS products that each solve a specific need in the development lifecycle.
This is the same core behavior now seen in the MLOps space, and it is what led to the second wave of MLOps.
Comparison
Typical Machine Learning model building pipeline
Hardware-related problems that are seen in Machine Learning
1. Data Management
Software Development
Databases, both structured and unstructured:
Postgres, Mongo, etc.
Machine Learning Development
ML data is huge in size. Tabular data can still be stored in databases, but for unstructured data a lot of tools are coming up specifically to cater to its needs.
Data lakes: Snowflake, Databricks data lake, Pinecone (unstructured data indexing), Activeloop (early stage)
2. Code Development
Software Development
Standard code editors:
VS Code, IntelliJ, Eclipse, etc.
Machine Learning Development
ML development is a mix of data analysis and structured code for future experimentation, so it needs a combination of EDA-compatible editors and normal code editors.
Data-centric code editors: Jupyter Notebooks, Deepnote, etc.
3. Code Versioning
Software Development
Git has been the de facto tool for managing code versions. Several players have built on top of Git and already made it big in software development:
GitHub, GitLab, Bitbucket
Machine Learning Development
Versioning notebooks is not as easy as versioning plain code: notebooks carry metadata that bloats the commit history. Some tools have been built specifically to cater to these code and notebook versioning needs.
DAGsHub, GitHub to some extent for non-notebook code
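To make the notebook problem concrete: an .ipynb file is JSON, and each code cell stores its outputs and execution count next to the source. Below is a minimal sketch, assuming the standard notebook JSON layout with a top-level `cells` list, that strips those fields before a commit. The function name `strip_notebook` and the sample notebook are made up for illustration; this is not how any of the tools above work internally.

```python
import json


def strip_notebook(notebook_json: str) -> str:
    """Return notebook JSON with code-cell outputs and execution counts removed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []             # drop rendered results
            cell["execution_count"] = None   # drop the run counter
    # Stable key order keeps diffs small between commits.
    return json.dumps(nb, indent=1, sort_keys=True)


# Tiny hand-written notebook, used only for demonstration.
raw = json.dumps({
    "cells": [
        {
            "cell_type": "code",
            "source": ["1 + 1"],
            "metadata": {},
            "execution_count": 7,
            "outputs": [{
                "output_type": "execute_result",
                "data": {"text/plain": ["2"]},
                "metadata": {},
                "execution_count": 7,
            }],
        },
        {"cell_type": "markdown", "source": ["# Notes"], "metadata": {}},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

clean = json.loads(strip_notebook(raw))
```

After stripping, only the source of each cell changes between commits, which is what makes the history readable again.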
Now let’s talk about some components that are part of ML but not of software development, because ML is all about experimentation and iteration.
4. Model Training with Hyperparameter Tuning
Software Development
None
Machine Learning Development
Currently, most companies do this manually because it is compute-intensive. However, SaaS companies are emerging that offer model training as a service, running more training jobs at scale with better resource utilization.
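For readers new to the idea, here is a rough sketch of what hyperparameter tuning automates. It uses plain grid search, which is only one of several strategies, and `train_and_score` is a made-up stand-in for a real training job (the toy objective and its optimum are our assumptions, not real results).

```python
import itertools


def train_and_score(learning_rate: float, batch_size: int) -> float:
    """Hypothetical stand-in for a training job; returns a fake validation score."""
    # Toy objective that peaks at learning_rate=0.01 and batch_size=64.
    return -abs(learning_rate - 0.01) * 100 - abs(batch_size - 64) / 64


def grid_search(grid: dict) -> tuple:
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search({
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
})
```

Even this tiny grid needs nine training runs. With real jobs that take hours each, the cost of doing this by hand is what makes a managed service attractive.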
5. Model Training Monitoring
Software Development
None
Machine Learning Development
This has been the most successful MLOps space in the recent past. There is a strong need to monitor how model training is going, because training is a time-consuming (usually 5 to 24 hours, depending on data size) and painful process. Data scientists need to be aware of what’s happening during training. And after training, it’s always better to have saved logs, so that trained models can be compared before one is picked for production.
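As a rough illustration of the saved-logs idea, the sketch below appends one JSON record per epoch to a file and later compares two runs. The `RunLogger` class, the file names, and the loss values are all made up for demonstration; real training-monitoring products do far more than this.

```python
import json
import os
import tempfile


class RunLogger:
    """Appends one JSON record per epoch to a log file (hypothetical helper)."""

    def __init__(self, path: str):
        self.path = path

    def log(self, epoch: int, **metrics: float) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"epoch": epoch, **metrics}) + "\n")


def best_epoch(path: str, metric: str = "val_loss") -> dict:
    """Return the logged record with the lowest value of the given metric."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return min(records, key=lambda record: record[metric])


log_dir = tempfile.mkdtemp()
run_a = os.path.join(log_dir, "run_a.jsonl")
run_b = os.path.join(log_dir, "run_b.jsonl")

# Made-up validation losses standing in for two training runs.
for epoch, loss in enumerate([0.9, 0.6, 0.7]):
    RunLogger(run_a).log(epoch, val_loss=loss)
for epoch, loss in enumerate([0.8, 0.5, 0.4]):
    RunLogger(run_b).log(epoch, val_loss=loss)

best = {"run_a": best_epoch(run_a), "run_b": best_epoch(run_b)}
winner = min(best, key=lambda name: best[name]["val_loss"])
```

Because every epoch is written to disk as it finishes, the same file serves both purposes: watching a run while it trains and comparing runs afterwards.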
6. Testing
Software Development
There has been quite some movement around Test-Driven Development (TDD): "Test-driven development is an iterative development process. In TDD, developers write a test before they write just enough production code to fulfill that test and the subsequent refactoring. Developers use the specifications and first write tests describing how the code should behave. It is a rapid cycle of testing, coding, and refactoring."
Source: Wikipedia
There have been some new tools coming up in this area.
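The test-first cycle described above can be sketched in a few lines. The `slugify` function is a made-up example chosen only because it is small; the point is the order in which things are written.

```python
# Step 1: write the test first. It describes how the code should behave.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  MLOps  tools ") == "mlops-tools"


# Step 2: write just enough production code to make the test pass.
def slugify(text: str) -> str:
    # split() with no arguments also collapses repeated whitespace.
    return "-".join(text.lower().split())


# Step 3: run the test, then refactor with the test as a safety net.
test_slugify()
```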
Machine Learning Development
The majority of the testing depends on the business problem, and the process is currently done manually. We haven’t come across any niche product that simply provides a SaaS to test all the models that can go into production. Some parts of this are covered by training monitoring companies and some parts by deployment companies.
7. Testing for Different Hardware
Software Development
With the evolution of different browsers and hardware, testing code across them before shipping has become an absolute need for companies. BrowserStack paved the way for this, and many more companies are now following.
Machine Learning Development
This is one of the latest MLOps spaces, and it has seen great funding given the different types of GPU hardware architectures and different deployment needs such as edge. Automatically optimizing or converting a trained model file to work on different hardware saves a lot of data scientists’ time. Yet this market is still at a nascent stage.
8. Deployment
Software Development
Thanks to CI/CD pipelines and the DevOps teams who work on them, this process has been streamlined for some time.
GitHub Actions, GitLab pipelines, Jenkins, CircleCI
Machine Learning Development
This is another MLOps space that has drawn a lot of attention. ML has slowly started getting into production, and bringing ML models into production has been a huge problem because the development process was not streamlined. Deployment involves wrapping the model code in an API and then putting it in the cloud for customers to use. There are lots of open-source and SaaS tools in this space.
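To show what wrapping a model in an API means at its simplest, here is a sketch using only Python’s standard library. The `predict` function is a placeholder linear model with made-up weights, and the JSON request shape (`{"features": [...]}`) is our assumption; production deployments would typically use a dedicated serving tool rather than this.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: list) -> float:
    """Placeholder model: fixed weights standing in for a trained model."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


def handle_predict(body: bytes) -> tuple:
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        features = json.loads(body)["features"]
        payload = {"prediction": predict(features)}
        return 200, json.dumps(payload).encode("utf-8")
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'features' list"}
        return 400, json.dumps(error).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Exposes handle_predict over HTTP POST."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Start the server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Keeping `handle_predict` separate from the HTTP handler means the request logic can be tested without starting a server.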
9. Scaled Deployment
Software Development
Scaled deployment and deployment are very close here, as nowadays almost every DevOps team is adopting Kubernetes to scale deployments automatically and to perform canary deployments and A/B testing. Other tools, like Terraform and Pulumi, help with setting up the compute in the first place.
Machine Learning Development
Similar to the above, the work doesn’t end with exposing an API. It’s all about load balancing and autoscaling the APIs. Most of the deployment tools use Kubernetes or Docker Swarm to make this possible.
10. API Monitoring
Similar for both
11. Model Monitoring
Software Development
None
Machine Learning Development
This has become crucial for ML models in production: businesses want to be sure of what the model has been doing in production, and a feedback loop is much needed to improve the model.
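One simple check a model monitor might run is whether the model’s live outputs have moved away from what was seen during training. The sketch below measures that shift in baseline standard deviations. The threshold of 2.0 and all the numbers are made up, and real monitoring products use richer statistics than this.

```python
import statistics


def drift_score(baseline: list, live: list) -> float:
    """How many baseline standard deviations the live mean has moved."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return abs(statistics.mean(live) - base_mean) / base_std


def has_drifted(baseline: list, live: list, threshold: float = 2.0) -> bool:
    """Flag drift when the shift exceeds the (assumed) threshold."""
    return drift_score(baseline, live) > threshold


# Made-up prediction scores: training-time baseline and two production weeks.
training_scores = [0.48, 0.52, 0.50, 0.47, 0.53]
stable_week = [0.49, 0.51, 0.50, 0.52]
shifted_week = [0.70, 0.72, 0.69, 0.71]

stable_alert = has_drifted(training_scores, stable_week)
shifted_alert = has_drifted(training_scores, shifted_week)
```

An alert like `shifted_alert` is the start of the feedback loop: it tells the team the model is seeing something new and may need retraining.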
Now let’s move on to ML-specific tools and use cases that deserve special mention. These are not really comparable to software development, as they evolved purely as part of Machine Learning (Deep Learning specifically).
1. GPU Resource Allocation and Scheduling
GPUs are a scarce resource, and architectures before Ampere lack virtualization support. There has been some activity around allocating GPU resources better within teams and sharing a single GPU across multiple jobs to increase GPU utilization.
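To give a feel for the allocation problem, here is a toy sketch that packs jobs onto as few GPUs as it can by memory requirement, using the first-fit decreasing heuristic. The job names and memory figures are invented, and real schedulers also have to account for compute contention, priorities, and isolation, which this ignores.

```python
def pack_jobs(jobs: dict, gpu_memory_gb: int) -> list:
    """Assign jobs (name -> GB needed) to GPUs using first-fit decreasing."""
    gpus = []  # each entry: {"free": remaining GB, "jobs": [job names]}
    # Place the largest jobs first; they are the hardest to fit.
    for name, need in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        if need > gpu_memory_gb:
            raise ValueError(f"job {name!r} does not fit on a single GPU")
        for gpu in gpus:
            if gpu["free"] >= need:
                gpu["free"] -= need
                gpu["jobs"].append(name)
                break
        else:
            # No existing GPU has room, so allocate a new one.
            gpus.append({"free": gpu_memory_gb - need, "jobs": [name]})
    return [gpu["jobs"] for gpu in gpus]


# Four hypothetical jobs sharing 16 GB GPUs.
placement = pack_jobs(
    {"bert": 12, "resnet": 6, "lstm": 4, "gan": 10},
    gpu_memory_gb=16,
)
```

Run one job per GPU and these four jobs need four cards. Packed this way they fit on two, which is the utilization gain these tools are after.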
2. Federated Learning
Edge deployment has evolved in ML as a new deployment model for faster inference and data security. With it, the ability for teams to train on non-central hardware has become a need. This is still very niche, but there are startups entering the space.
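One widely used way to combine models trained on non-central hardware is federated averaging: each device trains locally, and only the model weights are sent back and averaged, weighted by how much data each device saw. The sketch below shows just that averaging step with made-up weights and sample counts. It is an illustration of the general idea, not the method of any particular startup.

```python
def federated_average(client_updates: list) -> list:
    """Average client weights, weighted by each client's sample count.

    client_updates is a list of (weights, num_samples) pairs.
    """
    total_samples = sum(n for _, n in client_updates)
    size = len(client_updates[0][0])
    averaged = [0.0] * size
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_samples
    return averaged


# Two hypothetical edge devices; the second saw three times as much data.
updates = [
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 300),
]
global_weights = federated_average(updates)
```

The raw data never leaves the devices, which is where the data security benefit comes from.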
3. Deployment on CPUs
As GPUs are scarce and investing in them is expensive, companies have been working on simulating GPU computations on CPUs in order to deploy GPU-intensive models on CPUs. However, these are pure research plays.
End Notes
It is very easy to get carried away with umbrella terms. Only when you dig deeper do you understand what exactly is happening in the market.
Bringing processes into play is only possible if you are the first mover in the market. If you want to build something valuable for ML developers, it’s best to build a value-add tool rather than a process-embedding tool.
The funding backs this up: the companies above, each in its own MLOps niche, have raised hundreds of millions of dollars and successfully captured their niche.
Happy to hear your thoughts.
You can connect with me on LinkedIn
Cheers!!