Notes about Structuring ML projects (Part IV)
If you didn’t have the chance to check the first part, the second part and the third part of my notes about the course Structuring Machine Learning projects, I encourage you to do so as soon as possible. In any case, you can read this fourth part on its own.
Mismatched training and dev/test set
Training and testing on different distributions
In this Cat vs Non-cat example, we want to create a mobile application that will classify and recognize pictures of cats taken and uploaded by users.
There are two sources of data used to develop the mobile app. The first distribution is small: 10,000 pictures uploaded from the mobile application. Since they come from amateur users, the pictures are not well framed and are blurrier. The second source is the web, with 200,000 cat pictures that are professionally framed and in high resolution.
The problem is that you have two different distributions:
- a small data set of pictures uploaded by users; this is the distribution that matters for the mobile app.
- a bigger data set from the web.
The guideline is to choose development and test sets that reflect the data you expect to get in the future and consider important to do well on. The data is split as follows:
So the training set will include 200,000 images from the web and 5,000 from the mobile app. The dev set will be 2,500 images from the mobile app, and the test set will be 2,500 images also from the mobile app.
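To make this split concrete, here is a minimal sketch in Python, assuming the data is just two lists of image file paths (the names and variables are placeholders, not part of the course):

```python
import random

# Placeholder lists of image file paths (illustrative only).
web_images = [f"web_{i}.jpg" for i in range(200_000)]       # professionally framed, high resolution
mobile_images = [f"mobile_{i}.jpg" for i in range(10_000)]  # blurrier, user-uploaded pictures

random.seed(0)
random.shuffle(mobile_images)

# Training set: all 200,000 web images plus 5,000 mobile images.
train_set = web_images + mobile_images[:5_000]

# Dev and test sets: 2,500 mobile images each, so both reflect
# the distribution we actually care about (the mobile app).
dev_set = mobile_images[5_000:7_500]
test_set = mobile_images[7_500:]

print(len(train_set), len(dev_set), len(test_set))  # 205000 2500 2500
```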
The advantage of this way of splitting up your data into train, dev, and test, is that you’re now aiming the target where you want it to be.
You’re telling your team: the dev set has data uploaded from the mobile app, and that is the distribution of images you really care about, so let’s try to build a machine learning system that does really well on it. The disadvantage, of course, is that your training distribution is now different from your dev and test set distributions. But it turns out that this split of your data into train, dev and test will give you better performance over the long term.
Take away
- In the deep learning era, more and more teams are training on data that comes from a different distribution than their dev and test sets. There are some subtleties and best practices for dealing with the case where the training and dev/test distributions differ from each other.
- The advantage of this way of splitting up is that the target is well defined.
- The disadvantage is that the training distribution is different from the development and test set distributions. However, this way of splitting the data gives better performance in the long term.
- Should you always use all the data you have? The answer is subtle; it is not always yes.
Bias and Variance with mismatched data distributions
Example: cat classifier with mismatched data distributions
When the training set is from a different distribution than the development and test sets, the method to analyze bias and variance changes.
| Error (%) | Scenario A | Scenario B | Scenario C | Scenario D | Scenario E | Scenario F |
|---|---|---|---|---|---|---|
| Human (proxy for Bayes error) | 0 | 0 | 0 | 0 | 0 | 4 |
| Training error | 1 | 1 | 1 | 10 | 10 | 7 |
| Training-development error | - | 9 | 1.5 | 11 | 11 | 10 |
| Development error | 10 | 10 | 10 | 12 | 20 | 6 |
| Test error | - | - | - | - | - | 6 |
Scenario A
The training error is 1% and the development error is 10%. If the development data came from the same distribution as the training set, we would conclude there is a large variance problem and that the algorithm is not generalizing well from the training set.
However, since the training data and the development data come from different distributions, this conclusion cannot be drawn. There isn’t necessarily a variance problem; it might simply be that the development set contains images that are more difficult to classify accurately.
When the training, development and test set distributions are different, two things change at the same time. First, the algorithm saw the training data but not the development data. Second, the distribution of the data in the development set is different.
It is difficult to know which of these two changes produces the 9% increase in error between the training set and the development set. To resolve this issue, we define a new subset called the training-development set. This new subset has the same distribution as the training set, but it is not used for training the neural network.
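A minimal sketch of how a training-development set can be carved out of the training data, assuming the training set is a simple list of examples (the 5% fraction and the names are illustrative assumptions):

```python
import random

def make_train_dev_split(train_set, fraction=0.05, seed=0):
    """Hold out part of the training data as a training-development set.

    The held-out slice has the same distribution as the training set,
    but the network is never trained on it.
    """
    shuffled = list(train_set)
    random.Random(seed).shuffle(shuffled)
    n_hold_out = int(fraction * len(shuffled))
    # Return (reduced training set, training-development set).
    return shuffled[n_hold_out:], shuffled[:n_hold_out]

# Example with placeholder items:
train_set = [f"img_{i}.jpg" for i in range(1000)]
reduced_train, train_dev = make_train_dev_split(train_set)
print(len(reduced_train), len(train_dev))  # 950 50
```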
Scenario B
The gap between the training error and the training-development error is 8%. Since the training set and the training-development set come from the same distribution, the only difference between them is that the neural network saw the training data during training but never saw the training-development data. The network is not generalizing well to unseen data from the same distribution.
Therefore, we really have a variance problem.
Scenario C
In this case, we have a data mismatch problem: the error barely increases between the training set and the training-development set, but jumps on the development set, which comes from a different distribution.
Scenario D
In this case, the avoidable bias is high, since the difference between the Bayes error (estimated by human-level performance) and the training error is 10%.
Scenario E
In this case, there are two problems. The first is high avoidable bias, since the difference between the Bayes error and the training error is 10%; the second is a data mismatch problem, since the error jumps again between the training-development set and the development set.
Scenario F
Development should never be done on the test set. However, the difference between the development error and the test error indicates the degree of overfitting to the development set. Here both are 6%, so the model is not overfitting the development set.
General formulation
In general, the errors form a ladder: human-level error (a proxy for Bayes error), training error, training-development error, development error and test error. The gap between human-level and training error is the avoidable bias, the gap between training and training-development error is the variance, the gap between training-development and development error is the data mismatch, and the gap between development and test error is the degree of overfitting to the development set.
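As a sketch of this formulation, the gaps along the error ladder can be computed directly. The function name and reporting format below are illustrative, and the numbers come from Scenario E in the table above:

```python
def diagnose(human_error, training_error, training_dev_error, dev_error, test_error=None):
    """Report the gaps along the error ladder; each gap points to one kind of problem."""
    report = {
        "avoidable bias": training_error - human_error,      # gap to the Bayes error proxy
        "variance": training_dev_error - training_error,     # generalization within the same distribution
        "data mismatch": dev_error - training_dev_error,     # effect of the change of distribution
    }
    if test_error is not None:
        report["overfitting to the dev set"] = test_error - dev_error
    return report

# Scenario E: high avoidable bias (0.10) and data mismatch (0.09).
print(diagnose(human_error=0.0, training_error=0.10,
               training_dev_error=0.11, dev_error=0.20))
```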
Addressing data mismatch
These are general guidelines to address data mismatch:
- Perform manual error analysis to understand the error differences between the training and development/test sets. Development should never be done on the test set, to avoid overfitting to it.
- Make the training data more similar to the development and test sets, or collect more data similar to them. One way to make the training data more similar to your development set is artificial data synthesis (see the sketch after this list). Be aware, however, that you might accidentally simulate data from only a tiny subset of the space of all possible examples.
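As an illustration of artificial data synthesis for this cat-picture example, one could degrade the crisp web images so that they look more like blurry phone uploads. The library (Pillow) and the specific transformations below are my own assumptions, not something prescribed by the course:

```python
import random
from PIL import Image, ImageFilter  # Pillow; an assumed choice of library

def simulate_mobile_photo(web_image_path, out_path, seed=0):
    """Degrade a high-quality web image so it resembles a blurry, user-uploaded photo.

    The transformations (downscaling, blur, JPEG compression) are illustrative;
    the risk is that they cover only a tiny subset of real mobile-photo artifacts.
    """
    rng = random.Random(seed)
    img = Image.open(web_image_path).convert("RGB")

    # Downscale then upscale to mimic a low-resolution capture.
    w, h = img.size
    scale = rng.uniform(0.3, 0.6)
    img = img.resize((int(w * scale), int(h * scale))).resize((w, h))

    # Gaussian blur to mimic poor focus or camera shake.
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.5, 2.0)))

    # Heavier JPEG compression, as produced by many mobile uploads.
    img.save(out_path, "JPEG", quality=60)
```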
Take away
- Carry out manual error analysis to try to understand the differences between the training and dev/test sets.
- Make the training data more similar to the dev/test sets, or collect more data similar to them.