Notes about “Structuring Machine Learning Projects” by Andrew Ng (Part I)
Over the next few days I will be releasing my notes about the course “Structuring Machine Learning Projects”. Some quick points:
- This is by far the least technical course in the “Deep Learning” specialization.
- It is aimed at aspiring technical leaders in AI.
- The idea is to provide the tools to set the direction for a team’s work.
.
Do you need to know ML concepts before taking the course?
Not necessarily:
- If you are already familiar with the ML world, going through the videos is a walk in the park.
- If you are not, you will need some googling during each video to understand certain terms and expressions, but even then the basic ideas and concepts are explained in a very didactic way.
.
What are the objectives of this course?
These are copied from the specialization description:
- Understand how to diagnose errors in a machine learning system
- Be able to prioritize the most promising directions for reducing error
- Understand complex ML settings, such as mismatched training/test sets, and comparing to and/or surpassing human-level performance
- Know how to apply end-to-end learning, transfer learning, and multi-task learning
Introduction to ML Strategy
1. Why ML Strategy?
What is machine learning strategy? When you are working on a deep learning system, there are many ideas you could try in order to improve it.
Take away
- Machine learning strategy is about choosing the right direction: the most promising things to try.
.
2. Orthogonalization
Orthogonalization is a system design property that ensures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. This makes it easier to verify each component independently of the others, which reduces testing and development time.
When a supervised learning system is designed, these are the 4 assumptions that need to be true:
# | Assumption | If it doesn’t hold |
---|---|---|
1 | Fit the training set well on the cost function | A bigger neural network or a better optimization algorithm might help. |
2 | Fit the dev set well on the cost function | Regularization or a bigger training set might help. |
3 | Fit the test set well on the cost function | A bigger dev set might help. |
4 | Performs well in the real world | The dev/test set is not set up correctly, or the cost function is not measuring the right thing. |
It’s helpful if you can look at your system and say exactly which piece is wrong: it does not do well on the training set, it does not do well on the dev set, it does not do well on the test set, or it does well on the test set but not in the real world.
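As a toy illustration of this diagnostic mindset (my own sketch, not code from the course), you can think of the four assumptions as being checked in order, with each failure pointing to its own independent knob:

```python
# A minimal sketch of the orthogonalization idea: each symptom has its own,
# independent "knob" to turn (suggestions taken from the table above).

def suggest_next_step(fits_train: bool, fits_dev: bool, fits_test: bool, works_in_real_world: bool) -> str:
    """Return a suggested remedy based on the first assumption that fails."""
    if not fits_train:
        return "Try a bigger network or a better optimization algorithm."
    if not fits_dev:
        return "Try regularization or a bigger training set."
    if not fits_test:
        return "Try a bigger dev set."
    if not works_in_real_world:
        return "Change the dev/test set or the cost function (metric)."
    return "All four assumptions hold; keep iterating."

# Example: fits the training and dev sets, but not the test set.
print(suggest_next_step(fits_train=True, fits_dev=True, fits_test=False, works_in_real_world=True))
```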
Take away
- Orthogonalization assures that modifying a component will not create or propagate side effects to other components of the system.
- Through orthogonalization, we are looking for a way to diagnose exactly what the bottleneck in the system’s performance is; once the specific problem is identified, we can improve that aspect of its performance.
.
Setting up your goal
3. Single number evaluation metric
When building your machine learning system, progress will be much faster if you work with a single real-number evaluation metric that lets you quickly tell whether the new thing you just tried works better or worse than your last idea.
.
Example: How to create a single number evaluation metric
Let’s compare two classifiers, A and B, used to detect cat images:
The evaluation metrics are precision and recall. For classifier A, 95% of the images it labels as cats actually contain a cat (precision), and it correctly detects 90% of the cat images (recall). For classifier B, the precision is 98% and the recall is 85%.
Classifier | Precision (p) | Recall (r) |
---|---|---|
A | 95% | 90% |
B | 98% | 85% |
The problem with using precision and recall as two separate evaluation metrics is that you cannot be sure which classifier is better, since in this case both have good precision and recall. The F1-score, the harmonic mean of the two, combines precision and recall into a single number:
F1-Score = 2 / (1/p + 1/r)
So let’s evaluate our case using the F1-score:
Classifier | Precision (p) | Recall (r) | F1-Score |
---|---|---|---|
A | 95% | 90% | 92.4% |
B | 98% | 85% | 91.0% |
Classifier A is the better choice. The F1-score is not the only single-number evaluation metric that can be used; the average of precision and recall, for example, could also indicate which classifier to choose.
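As a quick check (a minimal sketch using only the precision and recall values from the table above), the F1-scores can be computed directly from the formula:

```python
# Compute F1 as the harmonic mean of precision (p) and recall (r),
# using the numbers from the table above.

def f1_score(p: float, r: float) -> float:
    """F1 = 2 / (1/p + 1/r)."""
    return 2 / (1 / p + 1 / r)

classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}
for name, (p, r) in classifiers.items():
    print(f"Classifier {name}: F1 = {f1_score(p, r):.1%}")
# Classifier A: F1 = 92.4%
# Classifier B: F1 = 91.0%
```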
Take away
- Set up a single real number evaluation metric.
.
4. Satisficing and optimizing metric
When there are several metrics to evaluate the performance of a classifier, they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. It is important to note that these evaluation metrics must be evaluated on a training set, a development set, or a test set.
Example: Cat vs Non-cat
Classifier | Accuracy | Running time |
---|---|---|
A | 90% | 80 ms |
B | 92% | 95 ms |
C | 95% | 1500 ms |
In this case, accuracy and running time are the evaluation metrics. Accuracy is the optimizing metric, because you want the classifier to detect cat images as accurately as possible. Running time, which in this example must be under 100 ms, is the satisficing metric: it only has to meet the threshold that was set.
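A minimal sketch of how this selection rule could look in code (the numbers are taken from the table above; the 100 ms threshold is the satisficing constraint):

```python
# Pick the classifier with the best optimizing metric (accuracy) among those
# that satisfy the satisficing metric (running time <= 100 ms).

classifiers = {
    "A": {"accuracy": 0.90, "running_time_ms": 80},
    "B": {"accuracy": 0.92, "running_time_ms": 95},
    "C": {"accuracy": 0.95, "running_time_ms": 1500},
}

MAX_RUNNING_TIME_MS = 100  # satisficing threshold

feasible = {name: m for name, m in classifiers.items() if m["running_time_ms"] <= MAX_RUNNING_TIME_MS}
best = max(feasible, key=lambda name: feasible[name]["accuracy"])
print(best)  # B: the most accurate classifier that runs under 100 ms
```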
Take away
- When there are several metrics to evaluate the performance of a classifier, they are called evaluation metrics.
- An optimizing metric should be as good as possible.
- A satisficing metric only has to meet the threshold that was set.
.
5. Train/dev/test distributions
How you set up the training, development, and test sets has a huge impact on productivity. It is important to choose the development and test sets from the same distribution, drawn randomly from all the data.
Example 1: Building a cat classifier
You are building a system with operations in these regions: the U.S., the U.K., other European countries, South America, India, China, other Asian countries, and Australia.
How do you set up your dev set and your test set?
- One way is to put four of these regions (or any arbitrary subset) into the dev set and the other four regions into the test set. This is a very bad idea, because the dev and test sets would then come from different distributions.
- A better way is to shuffle the data from all the regions together and draw the dev and test sets randomly from it, so that both come from the same distribution (see the sketch after this list).
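A minimal sketch of that better way, assuming a hypothetical pooled dataset with examples from every region: shuffle everything together, then carve out the dev and test sets.

```python
# Draw dev and test sets from the same distribution: pool the data from all
# regions, shuffle it, then split (hypothetical data, not from the course).

import random

regions = ["US", "UK", "Europe-other", "South America", "India", "China", "Asia-other", "Australia"]

# Hypothetical pooled dataset: (example_id, region) pairs, 1,000 per region.
data = [(i, region) for region in regions for i in range(1000)]

random.seed(0)
random.shuffle(data)  # every split now reflects the same mix of regions

dev_set = data[:2500]
test_set = data[2500:5000]
train_set = data[5000:]
```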
Example 2: Given an input X describing a loan application, can you predict whether or not the applicant will repay the loan?
How did the team set up the dev set and the test set?
- The dev set came from loan applications from average-income zip codes.
- The test set came from low-income zip codes.
Results:
- The classifier did not work well, because the data distributions for average-income and low-income zip codes are very different.
- The team lost about three months of work and had to go back to the starting point and redo a lot of it.
Take away
- Choose development and test sets that reflect the data you expect to get in the future and consider important to do well on.
- Setting up the dev set, together with the evaluation metric, really defines the target you want to aim at.
- By drawing the dev set and the test set from the same distribution, you are aiming at the target you hope your machine learning team will hit. The way you choose your training set will affect how well you can actually hit that target.
.
6. Size of the dev and test sets
Modern era – Big data (millions of examples)
- Because a large amount of data is available, it is no longer necessary to reserve a large fraction of it for the dev and test sets; most of the examples can go to the training set.
- 1% of a dataset with one million examples is 10,000 examples, and that might be plenty for a dev set or a test set.
- 98% training set / 1% dev set / 1% test set (see the sketch after this list)
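For a concrete sense of scale (a minimal sketch assuming a dataset of one million examples), the 98/1/1 split works out as follows:

```python
# Sizes of a 98% / 1% / 1% split for a dataset of one million examples.

n_examples = 1_000_000

n_dev = n_examples // 100               # 1%  -> 10,000 examples
n_test = n_examples // 100              # 1%  -> 10,000 examples
n_train = n_examples - n_dev - n_test   # 98% -> 980,000 examples

print(n_train, n_dev, n_test)  # 980000 10000 10000
```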
Take away
- In the era of big data, the old rule of thumb of a 70/30 split no longer applies.
- Set the size of the test set so that it gives high confidence in the overall performance of the system.
- The test set helps evaluate the performance of the final classifier; it can be much less than 30% of the whole dataset.
- The development set has to be big enough to evaluate different ideas.
.
7. When to change dev/test sets and metrics
Example: Cat vs Non-cat
We have two algorithms, and the evaluation metric is classification error (“the algorithm with the lower error wins”).
Algorithm | Error |
---|---|
A | 3% |
B | 5% |
For some reason, Algorithm A lets through a lot of pornographic images. Algorithm B has 5% error, so it misclassifies more images, but it does not let through pornographic images. From the company’s point of view, as well as from a user-acceptance point of view, Algorithm B is actually the better algorithm.
Because the evaluation metric fails to correctly rank-order the algorithms according to these preferences, the evaluation metric should be changed.
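One way to change it (a sketch of the idea with a hypothetical weight of 10, not a prescription from the course) is to use a weighted error that penalizes letting a pornographic image through much more heavily than an ordinary misclassification:

```python
# Weighted classification error: mistakes on pornographic images count 10x
# more than ordinary mistakes (the weight of 10 is a hypothetical choice).

def weighted_error(predictions, labels, is_porn, porn_weight=10.0):
    """Weighted misclassification rate over a dev set."""
    weights = [porn_weight if porn else 1.0 for porn in is_porn]
    errors = [w * (pred != label) for w, pred, label in zip(weights, predictions, labels)]
    return sum(errors) / sum(weights)

# Tiny toy example: one ordinary mistake vs. one mistake on a pornographic image.
labels      = [1, 0, 1, 0]
predictions = [1, 1, 1, 1]                  # two mistakes in total
is_porn     = [False, False, False, True]   # the last example is pornographic
print(weighted_error(predictions, labels, is_porn))  # ~0.85, dominated by the porn mistake
```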
Take away
- Correctly define an evaluation metric that better rank-orders the classifiers.
.
Next notes
These notes cover just the first part of the course “Structuring Machine Learning Projects”, part of the “Deep Learning” specialization. Over the next few days I will share the remaining parts.
I hope these notes encourage you to take the course!