### The perils of fitting

May. 14th, 2009 04:20 pm **lederhosen**

One of the standard techniques throughout science (well, a whole family of techniques really) is 'fitting'.


The way it goes is like this: we have something that we consider important (cost, blood pressure, whatever) that's affected by a whole bunch of other things (how far we drive, how many people we employ, what dose of which drugs the patient gets - we call these "predictive variables").

The relationship is complex enough that we can't just predict it from first principles, so what we do instead is we take a bunch of observations in which we record both the predictive variables and the results. Then we try to come up with a mathematical model that matches our observations.

For instance: suppose I go to the shops and buy myself an apple and an orange. The receipt isn't itemised, but the total cost is $1.50. Next day I go back, and buy myself an apple and two oranges, and it costs me $2.00.

From this, you might reasonably conclude that an apple costs $1.00 and an orange costs 50 cents. If I were buying a dozen different things every trip, you'd need more observations before you could figure out the prices (at least one for each predictive variable we're considering). You would probably also want to use a computer to work the numbers, but there are plenty of packages that will do that - it's really quite easy to do, which is part of why fitting is such a popular technique.
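If you want to see the arithmetic spelled out, here's a minimal sketch of that two-observation fit (using NumPy; the variable names are mine, not part of any standard recipe):

```python
import numpy as np

# Each row is one shopping trip: [apples bought, oranges bought]
purchases = np.array([[1.0, 1.0],   # day 1: an apple and an orange
                      [1.0, 2.0]])  # day 2: an apple and two oranges
totals = np.array([1.50, 2.00])     # the (un-itemised) receipt totals

# Two observations, two unknowns: the system has exactly one solution
apple_price, orange_price = np.linalg.solve(purchases, totals)
print(apple_price, orange_price)    # apple comes out at $1.00, orange at 50c
```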

Unfortunately, like many easy things, it's a little too good to be true. The catch is that while what we *want* to do is predict the future, what we're really doing here is describing the past. Up to a certain point, that's useful - we can expect the future will behave something like the past. But it's possible to *overdescribe* the past...

On Monday, I go to a new greengrocer and buy myself an apple and ten oranges, for a total of $6.00. The next day I buy myself an apple and *nine* oranges... and it costs me $6.10. What do you make of this?

Under the 'fitting' approach, there's only one way to interpret this: apples cost $7.00, and oranges cost minus ten cents each. WTF?
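You can check that the fit really does come out that way - the same exact-solve sketch as before, fed the new receipts:

```python
import numpy as np

purchases = np.array([[1.0, 10.0],  # Monday: an apple and ten oranges
                      [1.0, 9.0]])  # Tuesday: an apple and nine oranges
totals = np.array([6.00, 6.10])

# Fitting the two observations exactly forces the absurd answer
apple_price, orange_price = np.linalg.solve(purchases, totals)
print(apple_price, orange_price)    # apple = $7.00, orange = minus 10 cents
```

Subtracting the two equations gives one orange = $6.00 - $6.10 = -$0.10, and everything else follows: the fit has no choice but to blame the price difference on the one variable that changed.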

In fact, the explanation is much more reasonable: apples still cost *about* $1.00 each, and oranges cost *about* 50 cents. But this relationship isn't exact - oranges are actually sold by weight, and the ones I bought on Tuesday were a little heavier, so they cost a little more - about 56 cents each.

The problem here is that we forgot that the relationship between predictive and output variables is not exact - we don't know all the factors that affect our output variable. The more predictive variables we have, the easier it is to mistakenly credit them with effects that are really due to factors outside our knowledge. (You can see a LOT of this happening in electioneering or sports coverage - see Edward Tufte's debunking of 'bellwether districts'.) Past a certain point, over-fitting becomes a sophisticated form of superstition - one to which scientists are especially susceptible.
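One standard defence is to collect more observations than you have unknowns and fit by least squares, so that no single noisy receipt can dominate. A sketch under made-up assumptions (the baskets, the true prices, and the amount of weight-induced wobble are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Six trips with varying baskets: [apples, oranges] per trip
purchases = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                      [1.0, 10.0], [1.0, 9.0], [3.0, 4.0]])

# True prices: apples $1.00 flat; oranges are sold by weight, so each
# trip's effective per-orange price wobbles around 50 cents
orange_price_per_trip = 0.50 + rng.normal(0.0, 0.03, size=len(purchases))
totals = purchases[:, 0] * 1.00 + purchases[:, 1] * orange_price_per_trip

# Overdetermined system: least squares averages the wobble out instead
# of contorting the prices to explain it exactly
fit, *_ = np.linalg.lstsq(purchases, totals, rcond=None)
print(fit)  # roughly [1.0, 0.5] - not [7.0, -0.1]
```

The fitted prices land near the true ones because the noise, no longer forced into the coefficients, ends up in the residuals where it belongs.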

(Jotting this down because I suspect I'm going to have to deal with these concepts at some length in the near future...)