Let’s use a simple example to explain some terms. For a dataset of points in a 2-D plane, there may be many curves that fit these points (each curve corresponds to a machine learning model).
Each machine learning algorithm has its own inductive bias. The question is: for the two models in the above image, which one should we choose? There is a simple principle for this kind of problem.
Occam’s Razor
The Occam’s razor principle tells us that if several models fit our data similarly well, we should choose the simpler one as our final model. In the above image, curve A is much simpler than curve B, so we choose curve A as our model.
No Free Lunch Theorem
Although Occam’s razor leads us to believe curve A is the better model for this dataset, there is still a possibility that curve B fits the test dataset better than curve A. This is the essence of the no free lunch theorem: if model A is better than model B on some dataset, then there must be another dataset on which model B performs better than model A. This can be proved mathematically.
According to the no free lunch theorem, there is no universally best machine learning algorithm. Machine learning algorithms can only be compared with respect to the problem you want to solve and the dataset you have.
Performance Measurement
Precision and recall are two commonly used performance measures.
| Actual Label | Predicted as Positive | Predicted as Negative |
| --- | --- | --- |
| Positive | True Positive (TP) | False Negative (FN) |
| Negative | False Positive (FP) | True Negative (TN) |
According to the data label and the predicted result, we can divide data samples into four subsets (TP, FP, TN, FN), as shown in the table above.
Precision is defined as: $P = \frac{TP}{TP+FP}$
Recall is defined as: $R=\frac{TP}{TP+FN}$
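As a quick illustration, here is a minimal Python sketch that counts the four subsets and computes precision and recall. The `y_true` and `y_pred` arrays are hypothetical examples made up for this sketch, not data from the text.

```python
y_true = [1, 1, 1, 0, 0, 1, 0, 0]  # 1 = Positive, 0 = Negative (hypothetical)
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]  # model predictions (hypothetical)

# Count each of the four subsets from the confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

precision = tp / (tp + fp)  # P = TP / (TP + FP)
recall = tp / (tp + fn)     # R = TP / (TP + FN)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```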
Usually, the higher the precision, the lower the recall. So we use the F1 score to combine precision and recall into a single metric:
$F_1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times TP}{N + TP - TN}$
where $N$ is the total number of samples. More generally, the $F_\beta$ score weights the two differently:
$F_\beta = \frac{(1+\beta^2) \times P \times R}{\beta^2 \times P + R}$
When $\beta = 1$, $F_\beta$ reduces to $F_1$. When $\beta > 1$, recall has more influence; when $\beta < 1$, precision has more influence.
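A small sketch of these formulas in Python follows; the `f_beta` helper name and the sample values of `p` and `r` are made up for illustration.

```python
def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 0.75, 0.60                 # hypothetical precision and recall
print(f_beta(p, r))               # beta = 1 gives the ordinary F1 (~0.667)
print(f_beta(p, r, beta=2.0))     # beta > 1: result pulled toward recall
print(f_beta(p, r, beta=0.5))     # beta < 1: result pulled toward precision
```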
ROC and AUC are another way to measure models. For the ROC curve, the x-axis is the False Positive Rate (FPR) and the y-axis is the True Positive Rate (TPR).
$TPR = \frac{TP}{TP+FN}$
$FPR = \frac{FP}{FP+TN}$
AUC is the area under the ROC curve. If one model’s ROC curve lies entirely below another model’s ROC curve, we say the latter model has better performance.
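To make this concrete, here is a minimal Python sketch that sweeps thresholds over hypothetical model scores to build the ROC curve, then computes AUC with the trapezoidal rule. The `y_true` and `scores` arrays are made up, and the O(n^2) threshold sweep is chosen for clarity, not efficiency.

```python
y_true = [1, 1, 0, 1, 0, 0, 1, 0]                      # hypothetical labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1]    # higher = more positive

def roc_point(threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)

# Use every score as a threshold, then anchor the curve at (0,0) and (1,1).
points = sorted(roc_point(t) for t in scores)
points = [(0.0, 0.0)] + points + [(1.0, 1.0)]

# Trapezoidal rule: sum the area of each strip between adjacent points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"AUC = {auc:.3f}")
```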