Bringing software development principles to Machine Learning

How rethinking Machine Learning in terms of software development principles led to a new wave of Machine Learning tools

Jul 06, 2022

Reason to bring up this conversation

There has been immense buzz around MLOps and the second wave of MLOps tools. But MLOps is an umbrella term that covers many distinct steps, and looking only at the umbrella term gives a false picture of the space. In this blog, we walk through the reasons behind the MLOps wave and the companies doing well in it, and position each of them in the space using a software development analogy.

Evolution

Back in the 1990s, in the early days of mainstream software development, there was a platform approach to building things and bringing processes into play.

In the early 2010s, when Databricks and other ML platforms came out, the approach was similar: put a process in place for how a team should go about building Machine Learning.

Most successful companies are opinionated about how a particular thing needs to be done and have successfully embedded that behavior in their customers.

The reason it worked is that Machine Learning or Data Science is all about data. Once you build a data lake and the tools on top of it to perform analytics, using that platform is a no-brainer.

Going back to the evolution of software development: developers moved out of a process framework and into a value-based framework.

This led to the evolution of DevOps tools: no single end-to-end platform, but multiple SaaS products that each solve a specific need in the development lifecycle.

This is the same core behavior seen in the current MLOps space, and it is what led to the second wave of MLOps.

Comparison

Typical Machine Learning model building pipeline

Hardware-related problems that are seen in Machine Learning

1. Data Management

Software Development

Databases, both structured and unstructured:

Postgres, Mongo, etc.

Machine Learning Development

ML data is huge in size. Some of it, such as tabular data, can still be stored in databases. For unstructured data, a lot of tools are coming up specifically to cater to its needs.

Data lakes: Snowflake, Databricks Data Lake, Pinecone (unstructured data indexing), Activeloop (early stage)

2. Code Development

Software Development

Standard code editors:

VS Code, IntelliJ, Eclipse, etc.

Machine Learning Development

Since ML development is a mix of data analysis and structured code for future experimentation, it uses a combination of EDA-friendly editors and regular code editors.

Data-centric code editors: Jupyter Notebooks, Deepnote, etc.

3. Code Versioning

Software Development

Git has been the de facto tool for managing code versions. Several players have built on Git and already made it big in software development.

GitHub, GitLab, BitBucket

Machine Learning Development

Versioning notebooks is not as easy, because notebooks come with metadata that makes the commit history bigger. Some tools have been built specifically to cater to code and notebook versioning needs.

DAGsHub, GitHub to some extent for non-notebook code
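To make the notebook problem concrete: a notebook file is JSON, and each code cell stores its outputs and an execution counter next to the source, so re-running a cell changes the file even when the code did not. Below is a minimal sketch, assuming the standard nbformat v4 layout, of stripping those fields before a commit. The function name `strip_notebook` and the sample notebook are ours, for illustration only; this is not a description of how any of the tools above work internally.

```python
import json


def strip_notebook(nb: dict) -> dict:
    """Return a copy of a notebook dict with outputs and execution counts removed."""
    cleaned = json.loads(json.dumps(nb))  # cheap deep copy for JSON-compatible data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook in the nbformat v4 layout (illustrative only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print('hello')"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hello\n"]}],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
}

stripped = strip_notebook(notebook)
print(json.dumps(stripped, indent=1))
```

With the outputs and counters gone, a diff between two commits only shows changes to the source of the cells.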

Now for some components that are part of ML but not of software development, since ML is all about experimentation and iteration.

4. Model training with Hyperparameter tuning

Software Development

None

Machine Learning Development

Currently, most companies do this manually because it is compute-intensive. However, SaaS companies are emerging that offer model training as a service, running more training jobs at scale with better resource utilization.

GridAI, Determined AI, NetBook
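As a rough illustration of what a tuning job automates, here is a minimal grid search sketch. The `validation_loss` function is a made-up stand-in for a real (and expensive) training run, and the search space values are assumptions chosen for the example. Hosted services typically run such trials in parallel and may use smarter search strategies than an exhaustive grid.

```python
import itertools


def validation_loss(learning_rate: float, batch_size: int) -> float:
    """Stand-in for a real training run; lowest at learning_rate=0.01, batch_size=64."""
    return (learning_rate - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64


def grid_search(grid: dict) -> tuple[dict, float]:
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_loss = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = validation_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


search_space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best_params, best_loss = grid_search(search_space)
print(best_params, best_loss)
```

Even this tiny grid needs nine training runs, which is why doing it manually on scarce compute is painful.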

5. Model training Monitoring

Software Development

None

Machine Learning Development

This has been the most successful MLOps space in the recent past, given the strong need to monitor how model training is progressing. Model training is a time-consuming (usually 5 to 24 hours, depending on data size) and painful process, so data scientists need to know what is happening during training. After training, it is always useful to have saved logs for comparing trained models and deciding which can be used in production.

Weights and Biases, Comet ML, Neptune
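The core idea behind these tools can be sketched in a few lines: log metrics per epoch while training, then compare runs afterwards. The example below is a minimal sketch under our own assumptions. The `RunLogger` class, the JSON Lines file format, and the loss values are hypothetical and are not the API of any product listed above.

```python
import json
import tempfile
from pathlib import Path


class RunLogger:
    """Append per-epoch metrics for one training run to a JSON Lines file."""

    def __init__(self, log_dir: Path, run_name: str):
        self.path = log_dir / f"{run_name}.jsonl"

    def log(self, epoch: int, **metrics: float) -> None:
        with self.path.open("a") as f:
            f.write(json.dumps({"epoch": epoch, **metrics}) + "\n")


def best_metric(path: Path, key: str) -> float:
    """Return the lowest logged value of `key` in a run's log file."""
    with path.open() as f:
        return min(json.loads(line)[key] for line in f)


log_dir = Path(tempfile.mkdtemp())

# Made-up loss curves standing in for two real training runs.
fake_runs = {
    "run_a": [0.9, 0.6, 0.5],
    "run_b": [0.8, 0.4, 0.45],
}
for name, losses in fake_runs.items():
    logger = RunLogger(log_dir, name)
    for epoch, loss in enumerate(losses):
        logger.log(epoch, val_loss=loss)

results = {name: best_metric(log_dir / f"{name}.jsonl", "val_loss") for name in fake_runs}
best_run = min(results, key=results.get)
print(results, best_run)
```

Because the logs are saved to disk, the comparison can happen long after the training jobs have finished.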

6. Testing

Software Development

There has been quite some movement around Test-Driven Development (TDD):

"Test-driven development is an iterative development process. In TDD, developers write a test before they write just enough production code to fulfill that test and the subsequent refactoring. Developers use the specifications and first write tests describing how the code should behave. It is a rapid cycle of testing, coding, and refactoring."

Source: Wikipedia

There have been some new tools coming up in this area.

TestRail, Avo Assure
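For readers who have not seen the TDD cycle in practice, here is a minimal sketch using Python's built-in `unittest`. The `slugify` function and its expected behavior are our own hypothetical example: the tests are written first, and the function is then written to do just enough to satisfy them.

```python
import unittest


# Step 1: write the tests first. They describe the behavior we want.
class TestSlugify(unittest.TestCase):
    def test_lowercases_and_joins_words_with_hyphens(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  MLOps   Tools "), "mlops-tools")


# Step 2: write just enough production code to make the tests pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())


# Step 3: run the tests, then refactor while they keep passing.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSlugify)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If a later refactor breaks the behavior, the same tests fail immediately, which is what makes the cycle rapid.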

Machine Learning Development

Most of the testing depends on the business problem, and this process is currently done manually. We haven't come across any niche product that offers a simple SaaS to monitor all the models that can go into production. Some parts of this are covered by training monitoring tools and some by deployment companies.

7. Testing for different hardware

Software Development

With the proliferation of browsers and hardware, testing code across them before shipping has become an absolute need for companies. BrowserStack paved the way, and many more companies are now following.

BrowserStack, LambdaTest

Machine Learning Development

This is one of the latest MLOps spaces, and it has seen great funding given the different GPU hardware architectures and different deployment needs such as edge. Automatically optimizing or converting a trained model file to work on different hardware saves data scientists a lot of time. Yet this market is still at a nascent stage.

OctoML, DeciAI

8. Deployment

Software Development

Thanks to CI/CD pipelines and the DevOps teams who work on them, this process has been streamlined for some time.

GitHub Actions, GitLab pipelines, Jenkins, CircleCI

Machine Learning Development

This is another MLOps space that has drawn a lot of attention, because ML has slowly started getting into production, and bringing ML models into production has been a huge problem as the development process was not streamlined. It involves wrapping the code in an API and then putting it in the cloud for customers to use. There are lots of open-source and SaaS tools in this space.

BentoML, Seldon, Nimblebox, Truefoundry
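Wrapping a model in an API can be sketched with nothing but the Python standard library. This is a minimal sketch under our own assumptions: `predict` is a hypothetical stand-in for a trained model, and the JSON request shape, the handler names, and the port are ours. The tools above add packaging, scaling, and cloud deployment on top of this basic idea.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: list[float]) -> float:
    """Stand-in for a trained model: a fixed linear function (hypothetical weights)."""
    weights = [0.5, -0.25]
    bias = 1.0
    return sum(w * x for w, x in zip(weights, features)) + bias


def handle_predict(body: bytes) -> tuple[int, dict]:
    """Turn a raw JSON request body into an HTTP status code and a JSON-ready payload."""
    try:
        features = json.loads(body)["features"]
        return 200, {"prediction": predict([float(x) for x in features])}
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected JSON like {"features": [1.0, 2.0]}'}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()


# Call serve() to run the API locally; the line below only exercises the request logic.
print(handle_predict(b'{"features": [2.0, 4.0]}'))
```

Keeping the request logic in `handle_predict`, separate from the HTTP handler, makes it easy to test without starting a server.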

9. Scaled Deployment

Software Development

Scaled deployment and deployment are very close here, as nowadays almost every DevOps team is adopting Kubernetes to scale deployments automatically and to perform canary deployments and A/B testing. Other tools, such as Terraform and Pulumi, help with first setting up the compute.

Machine Learning Development

As above, it doesn't end with exposing an API. It's about load balancing and autoscaling the APIs. Most deployment tools use Kubernetes or Docker Swarm to make this possible.

KFServing, Seldon, Cortex Dev

10. API Monitoring

Similar for both

11. Model Monitoring

Software Development

None

Machine Learning Development

This has become crucial for ML models in production: businesses want to be sure of what the model is doing in production, and a feedback loop is much needed to improve the model.

Fiddler AI, Aporia, Superwise
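One common check in model monitoring is whether live input data has drifted away from the training data. Below is a minimal sketch of the Population Stability Index (PSI), one widely used drift measure. The equal-width binning on the reference range, the small floor for empty buckets, and the sample data are our simplifying assumptions, and we are not claiming any of the products above compute drift this way.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 5) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def bucket_shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps the logarithm defined for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up feature values: the live data is the training data shifted by 0.5.
training = [i / 100 for i in range(100)]
live_shifted = [x + 0.5 for x in training]

print(psi(training, training))      # identical distributions
print(psi(training, live_shifted))  # shifted distribution
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and the business problem.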

Now moving on to ML-specific tools and use cases that deserve special mention. These are not really comparable to software development, as they have evolved purely as part of Machine Learning (Deep Learning specifically).

1. GPU resource allocation and Scheduling

GPUs are a scarce resource, and architectures before Ampere lack virtualization support. So there has been some activity around better allocating resources within teams and making better use of a single GPU for multiple jobs to increase GPU utilization.

Run AI

2. Federated Learning

With edge deployment emerging in ML as a new deployment model for faster inference and data security, teams increasingly need the ability to train on non-central hardware. This is still very niche, but startups are entering the space.

FedML
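The central step in many federated learning setups is aggregating models trained on separate devices without moving their data. Here is a minimal sketch of a federated-averaging-style aggregation, where each client's weights are weighted by how many samples it trained on. The weight vectors and sample counts are hypothetical, and real systems add communication, security, and many rounds of training on top of this.

```python
def federated_average(client_updates: list[tuple[list[float], int]]) -> list[float]:
    """Average client weight vectors, weighting each by its local sample count."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total
        for i in range(dim)
    ]


# Hypothetical weights from three edge devices after one round of local training.
updates = [
    ([1.0, 2.0], 10),
    ([3.0, 4.0], 30),
    ([5.0, 6.0], 60),
]
global_weights = federated_average(updates)
print(global_weights)
```

Only the weights and sample counts leave each device; the raw training data stays where it was collected.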

3. Deployment on CPUs

GPUs are scarce and investing in them is expensive, so companies have been working on simulating GPU computations on CPUs to deploy GPU-intensive models on CPUs. However, these are pure research plays.

Neural Magic

End Notes

It is easy to get carried away by umbrella terms; only when you dig deeper do you understand what exactly is happening in the market.

Bringing processes into play is only possible if you are the first mover in the market. If you want to build something valuable for ML developers, it's better to build a value-add tool than a process-embedding tool.

This is borne out by the fact that the companies above, each in its own MLOps niche, have raised hundreds of millions of dollars in funding and successfully captured their niche.

Happy to hear your thoughts.

You can connect with me on LinkedIn

Cheers!!
