Wednesday, October 3, 2018

Published 5:33 PM

Why Is Overfitting Dangerous?

Imagine you have some measurements and you want to approximate an underlying model for them. Shouldn't you pick the one that most closely matches the data?

Sample Data

Imagine that you run an experiment on weight gain. You feed subjects more calories than they burn each day, and you end up with the following results:



How do you model it?

EDIT: Throughout, the y-axis should read lbs/week rather than lbs/day; I mistyped.

Modeling

Since 3500 excess calories should cause a gain of ~1 pound, you hypothesize a linear relationship (pounds per day = excess calories per day / 3500). A linear fit to the data is given below:
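A fit like this is easy to sketch with NumPy. The data below is invented for illustration (it is not the experiment's actual data), but it shows the shape of the calculation:

```python
import numpy as np

# Hypothetical measurements: excess calories/day vs. weight gain (lbs/day).
# These numbers are made up for illustration, not the post's data.
excess_cal = np.array([250.0, 500.0, 750.0, 1000.0, 1250.0, 1500.0])
rng = np.random.default_rng(0)
gain = excess_cal / 3500 + rng.normal(0, 0.05, excess_cal.size)  # noisy linear data

# Degree-1 (linear) least-squares fit
slope, intercept = np.polyfit(excess_cal, gain, deg=1)
print(f"fitted slope: {slope:.6f}  (hypothesized 1/3500 = {1/3500:.6f})")
```

The fitted slope should land near the hypothesized 1/3500, with the gap driven by the noise in the measurements.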



That looks pretty good, but it doesn't pass through every point. You try other models and notice that a 10th-order polynomial matches the data much more closely:
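There's a reason the high-order polynomial looks so good: with few enough points, it can hit every one of them. A sketch of why, again with invented data — once the degree reaches one less than the number of points, the least-squares fit simply interpolates:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Same kind of invented data as before: 6 noisy, roughly linear measurements
x = np.array([250.0, 500.0, 750.0, 1000.0, 1250.0, 1500.0])
rng = np.random.default_rng(1)
y = x / 3500 + rng.normal(0, 0.05, x.size)

# Degree 5 has 6 coefficients -- exactly enough to pass through 6 points.
# (Polynomial.fit rescales x internally, which keeps the solve well-conditioned.)
p = Polynomial.fit(x, y, deg=5)
print("max residual at the data points:", np.abs(y - p(x)).max())  # ~0
```

Anything above degree 5 (the 10th-order fit included) is under-determined on six points, but it can still thread through all of them the same way.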



Shouldn't you just use that?

Problem

You decide to collect two more data points, then test how well your earlier models predict them. Here's the linear one:



Looks pretty good. Here's the 10th-order one:



Oh...

Conclusion

With polynomial fits, you can always match n data points exactly once the order reaches n − 1. That doesn't mean the fit is good. In this case our hypothesis was linear, so we should have stuck with it unless the results were clearly nonlinear. The data we collected contains noise and measurement uncertainty, so we shouldn't "overfit" to that raw data set.
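The failure mode above can be reproduced in a few lines. This sketch (invented data again) fits both models to six points, then scores them on two new measurements neither model ever saw:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)
line = lambda x: x / 3500  # the hypothesized linear model (lbs/day)

# Six original (training) measurements, two new held-out ones
x_train = np.array([250.0, 500.0, 750.0, 1000.0, 1250.0, 1500.0])
y_train = line(x_train) + rng.normal(0, 0.05, x_train.size)
x_new = np.array([1750.0, 2000.0])
y_new = line(x_new) + rng.normal(0, 0.05, x_new.size)

linear = Polynomial.fit(x_train, y_train, deg=1)
wiggly = Polynomial.fit(x_train, y_train, deg=5)  # interpolates all six points

def rmse(model, x, y):
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

# The interpolating fit is perfect on the training data but tends to fall
# apart on points it never saw, while the linear fit stays close.
print("linear RMSE on new points:", rmse(linear, x_new, y_new))
print("wiggly RMSE on new points:", rmse(wiggly, x_new, y_new))
```

The exact numbers depend on the noise, but the pattern is the usual one: in-sample error keeps dropping as the order goes up, while out-of-sample error gets worse.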
