Designing Machine Learning Systems by Chip Huyen – Book Notes
Find answers to some of the most important questions in developing machine learning systems.
Designing Machine Learning Systems teaches how intelligent systems are built using machine learning within organizations and tech companies. Huyen provides a holistic approach to solving problems with machine learning without getting lost in teaching specific tools or platforms.
Reading this book will help data science enthusiasts learn about the complete process of building ML solutions, beyond the model training that academic settings usually emphasize. These notes are only an overview of the many questions addressed thoroughly in the book's full chapters.
Key Questions Answered
When should ML solutions be used?
Effective ML solutions require developers to weigh the reasons for and against ML before starting. Criteria favoring an ML solution include a problem that can be framed as a prediction task, a complex pattern existing in the data, plentiful high-quality data being available, and a low cost for a wrong prediction. Netflix can recommend the wrong movie to users at almost no cost, but when Tesla's self-driving system makes a poor prediction, it can cost a life. Reasons against an ML solution include an unethical use case, a simpler alternative such as if-else logic, or a high cost for a wrong prediction.
In what order should an ML solution be developed?
1. Solving the problem with a heuristic instead of ML
2. Implementing the simplest ML model
3. Optimizing the simple model
4. Implementing a complex model
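As a minimal sketch of steps 1 and 2 (using made-up data and a made-up heuristic rule), a hand-written rule serves as the first baseline, and the simplest ML model only earns its place if it beats that rule:

```python
# Sketch of steps 1-2: heuristic baseline first, then the simplest ML model.
# The dataset and the threshold rule are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                    # e.g., tenure, usage, support tickets
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: heuristic -- predict the positive class whenever the first feature is positive.
heuristic_preds = (X_test[:, 0] > 0).astype(int)
print("heuristic accuracy:", accuracy_score(y_test, heuristic_preds))

# Step 2: simplest ML model -- only move on to steps 3-4 if it beats the heuristic.
model = LogisticRegression().fit(X_train, y_train)
print("logistic regression accuracy:", accuracy_score(y_test, model.predict(X_test)))
```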
What is the best performance metric?
The best performance metric is one that highly correlates with the chosen business metric. In business, ML solutions must be tied directly or indirectly to increasing profits for the use case to be valid. Additionally, performance metrics are meaningless without a baseline for comparison, such as the performance of prior business logic or human performance at the task.
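To illustrate the baseline point with a trivial example, a majority-class predictor (scikit-learn's DummyClassifier, on synthetic data) stands in here for "prior business logic"; a model score only means something next to a number like this:

```python
# Sketch: a performance metric is only meaningful relative to a baseline.
# Synthetic, mildly imbalanced data; DummyClassifier stands in for prior business logic.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] - X[:, 2] > 1.0).astype(int)          # roughly 75/25 class split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

print("majority-class baseline accuracy:", accuracy_score(y_te, baseline.predict(X_te)))
print("model accuracy:                  ", accuracy_score(y_te, model.predict(X_te)))
```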
Is it better to invest in data or algorithms?
Generative AI shows that the success of ML solutions relies increasingly on plentiful, high-quality data rather than more powerful algorithms. A clever model with poor training data won't work well, so investing in high-quality data is more important.
At what point does training data need to be structured?
There is no hard line between unstructured and structured data. All data in an ML solution eventually becomes structured, and it doesn't much matter where in the pipeline that happens. At any point between data collection and model training, a schema can be applied that converts the unstructured data into a format the ML model can handle. In general, structured data is stored in data warehouses and unstructured data in data lakes; with cloud storage being so plentiful, cloud data warehousing is common.
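As a small illustration (with made-up JSON event records and field names), structure can be imposed as late as the point where the training table is built:

```python
# Sketch: raw, unstructured records (here, JSON strings) get a schema applied
# only when they are turned into a model-ready table. Field names are hypothetical.
import json
import pandas as pd

raw_events = [                       # as they might sit untouched in a data lake
    '{"user_id": "u1", "event": "click", "duration_ms": 320}',
    '{"user_id": "u2", "event": "purchase", "duration_ms": 1250}',
]

SCHEMA = {"user_id": "string", "event": "category", "duration_ms": "int64"}

df = pd.DataFrame(json.loads(line) for line in raw_events).astype(SCHEMA)
print(df.dtypes)                     # structured, typed columns ready for training
```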
What should be done with missing values in the data?
It is incredibly uncommon for data to be missing purely by random chance. For this reason, the common practice of deleting rows or columns with missing data can lead to biased data and worse model performance. Instead, fill in missing values with the variable's mean, median, or mode. You can also create a new variable indicating when a value is missing, which sometimes carries predictive power.
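A minimal sketch of this approach using scikit-learn's SimpleImputer on toy data; add_indicator=True appends the missingness-indicator columns mentioned above:

```python
# Sketch: median imputation plus 0/1 missing-value indicator columns.
# Toy data; in practice the imputer is fit on training data only.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 20.0],
              [np.nan, 25.0],
              [3.0, np.nan],
              [4.0, 30.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_filled = imputer.fit_transform(X)
print(X_filled)   # imputed columns followed by the missingness indicators
```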
What is the number one strategy for avoiding data leakage?
Almost all data leakage problems can be avoided by splitting data into training and testing sets as early as possible, before any exploratory data analysis, feature scaling, imputation of missing values, resampling, or other data transformations. The strongest features in your model are also the most likely to have leaked information about the test data or labels, so inspect them. Because you split first, you then need functions or transformation pipelines that apply the transformations fitted on the training data to the validation, test, and production data.
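A minimal sketch of that discipline with scikit-learn and synthetic data: split first, then keep every transformation inside a pipeline that is fit on the training split only and merely applied to everything else:

```python
# Sketch: split before any transformation, then fit all preprocessing on the
# training split only. The same fitted pipeline is reused on test/production data.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.05] = np.nan          # sprinkle in missing values
y = rng.integers(0, 2, size=500)

# 1) Split first -- before EDA, scaling, imputation, or resampling.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2) All transformations live in a pipeline fit on training data only.
pipeline = make_pipeline(SimpleImputer(strategy="median"),
                         StandardScaler(),
                         LogisticRegression())
pipeline.fit(X_train, y_train)
print("test accuracy:", pipeline.score(X_test, y_test))
```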
How do you pick the highest-performing model for the task?
Start with the simplest models, then progress to more state-of-the-art ones, putting equal effort into training each. Track your experiments throughout: record the performance metrics and hyperparameters of every model you train. Experiment tracking lets you reproduce your training runs and compare performance easily. From there, pick your top three models, spend additional resources improving them, and choose a final model.
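A bare-bones sketch of experiment tracking without any dedicated tool (synthetic data; tools like MLflow or Weights & Biases do the same thing with more features): log each run's name, hyperparameters, and metric, then sort to find the top candidates:

```python
# Sketch: minimal experiment tracking -- log hyperparameters and metrics per run.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

candidates = [
    ("logreg_C=1", LogisticRegression(C=1.0, max_iter=1000)),
    ("logreg_C=0.1", LogisticRegression(C=0.1, max_iter=1000)),
    ("rf_100_trees", RandomForestClassifier(n_estimators=100, random_state=0)),
]

runs = []
for name, model in candidates:
    score = cross_val_score(model, X, y, cv=5).mean()
    runs.append({"run": name, "params": model.get_params(), "cv_accuracy": score})

log = pd.DataFrame(runs).sort_values("cv_accuracy", ascending=False)
print(log[["run", "cv_accuracy"]])   # compare runs; keep the full log for reproducibility
```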
What's the best location for inference computing?
Running ML on edge devices rather than in the cloud offers many benefits and is a growing practice in the industry. Benefits include cheaper predictions, the ability to make predictions without an internet connection, no network latency, and the fact that sensitive data never has to leave the device, which makes it easier to comply with regulations like GDPR.
What happens after a model is deployed?
Building an ML solution continues after deployment, because production systems have many failure points. A common problem arises when real-world data begins to look different from the training data. This is called a data distribution shift, and it degrades prediction performance. You can spot distribution shifts by running a hypothesis test to determine whether the current input data comes from a different distribution than the training data, much like a quality-control chart.
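One common way to run such a test is a two-sample Kolmogorov–Smirnov test per feature, comparing a recent window of production inputs against the training data; the sketch below uses synthetic data and an arbitrary significance threshold:

```python
# Sketch: flag a possible distribution shift with a two-sample KS test per feature.
# Synthetic data; the 0.01 threshold and window size are arbitrary choices.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)      # training distribution
recent_feature = rng.normal(loc=0.4, scale=1.0, size=1000)     # recent production window

stat, p_value = ks_2samp(train_feature, recent_feature)
if p_value < 0.01:
    print(f"possible distribution shift (KS={stat:.3f}, p={p_value:.2e})")
else:
    print("no shift detected for this feature")
```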
What infrastructure does an ML solution require?
A broad spectrum of infrastructure needs exists for ML solutions, ranging from small companies getting by with Jupyter Notebooks to large companies needing highly specialized technology. Most companies will likely fall in the middle and can get by procuring third-party technologies like popular cloud storage, cloud computing, and cloud development environments.
How frequently should ML models be retrained/updated?
Update your model as often as you can! Start by updating it manually, then develop a script that detects distribution shifts automatically and triggers retraining.
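A rough sketch of what that script might look like, reusing the shift test from the previous section; the drift-check and retrain hooks here are hypothetical placeholders for your own code:

```python
# Sketch: automate retraining by checking for drift on a schedule.
# check_feature_drift and the retrain callback are hypothetical stand-ins.
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_col: np.ndarray, recent_col: np.ndarray,
                        alpha: float = 0.01) -> bool:
    """Return True if the recent window looks drawn from a different distribution."""
    return ks_2samp(train_col, recent_col).pvalue < alpha

def maybe_retrain(train_X: np.ndarray, recent_X: np.ndarray, retrain) -> bool:
    """Retrain when any feature drifts; call this from a scheduled job (e.g., cron)."""
    drifted = any(check_feature_drift(train_X[:, j], recent_X[:, j])
                  for j in range(train_X.shape[1]))
    if drifted:
        retrain()          # e.g., refit the pipeline on the most recent labeled data
    return drifted
```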
Huyen, C. (2022). Designing machine learning systems: An iterative process for production-ready applications. O’Reilly Media, Inc.