Thursday, February 14, 2019

What Does R^2 Mean in Linear Regression?

R^2 shows up almost any time you see a linear fit or a linear regression. You'll often hear that it represents the '% of variance explained by the model'. What does that mean?


Data

I've generated a fake blood pressure data set. The set contains blood pressure (systolic; BP throughout), distance from a freeway broken into 4 categories, and income level broken into 2 categories. The BP is equal to 130 - 10*[income level (0 or 1)] - 5*[distance to road (0, 1, 2, or 3)].
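Here is a minimal sketch of how a data set like this could be built with NumPy. The helper name make_bp_data, the 100 subjects per distance bin, and the equal bin sizes are my assumptions for illustration; the post doesn't give a sample size. Income level 1 is treated as high income.

```python
import numpy as np

def make_bp_data(high_income_share=(0.2, 0.2, 0.2, 0.2), n_per_bin=100):
    """Return (distance, income, bp) arrays, one entry per simulated subject.

    n_per_bin and equal-sized bins are assumptions, not values from the post.
    """
    distance_parts, income_parts = [], []
    for d, share in enumerate(high_income_share):
        n_high = round(n_per_bin * share)  # high-income subjects in this bin
        distance_parts.append(np.full(n_per_bin, float(d)))
        income_parts.append(np.r_[np.ones(n_high), np.zeros(n_per_bin - n_high)])
    distance = np.concatenate(distance_parts)
    income = np.concatenate(income_parts)
    bp = 130 - 10 * income - 5 * distance  # the formula above, no noise
    return distance, income, bp

distance, income, bp = make_bp_data()
print(len(bp), bp.min(), bp.max())  # sample size, lowest BP, highest BP
```

Passing a different high_income_share tuple changes the income mix in each distance bin without touching the BP formula.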

Run with perfect data

To start with, there is no noise in the data and each 'distance to road' group is 80% low-income and 20% high-income. That is, there is no correlation between 'distance to road' and 'income level'. Fitting three regression models gives these results:

Model                                             R^2
BP = C*(distance to road)                         0.66
BP = C*(income level)                             0.34
BP = C1*(distance to road) + C2*(income level)    1.00


Considering only one of the variables gives you an R^2 of either 0.66 or 0.34. Considering both gives you an R^2 of 1. This is what '% of variance explained by the model' means. BP is the sum of two components. A model that includes only one of them explains less than 100% of the variance in BP. A model that includes both explains 100% of it.

Because the two components are uncorrelated, the two single-variable R^2 values sum to 1: each one is equal to 100% minus the other.
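To make 'variance explained' concrete, here is a rough sketch that computes R^2 as 1 minus (residual sum of squares / total sum of squares) for the three models. The r_squared helper, the intercept term in each fit, and the 100 subjects per bin are my assumptions, not details from the original runs.

```python
import numpy as np

def r_squared(y, *predictors):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ coef) ** 2)    # variance the model misses
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variance in y
    return 1 - ss_res / ss_tot

# Perfect data: 4 equal distance bins (assumed 100 each), 20% high income in each.
distance = np.repeat([0.0, 1.0, 2.0, 3.0], 100)
income = np.tile(np.r_[np.ones(20), np.zeros(80)], 4)
bp = 130 - 10 * income - 5 * distance

print(round(r_squared(bp, distance), 2))          # 0.66
print(round(r_squared(bp, income), 2))            # 0.34
print(round(r_squared(bp, distance, income), 2))  # 1.0
```

With these assumptions the variance of BP splits into 31.25 from distance and 16 from income, and 31.25 / 47.25 and 16 / 47.25 are where 0.66 and 0.34 come from.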

Run with partially correlated data

Now the data set is adjusted so that as distance from the road increases, average income increases. In the 0 distance bin, 8% of subjects are high-income. In the 1, 2, and 3 distance bins, 16%, 24%, and 32% are high-income respectively. Fitting the same three regression models gives these results:

Model                                             R^2
BP = C*(distance to road)                         0.73
BP = C*(income level)                             0.48
BP = C1*(distance to road) + C2*(income level)    1.00

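Notice the change here. The individual R^2 values no longer sum to 1; they add up to about 1.2. Because 'distance to road' and 'income level' are correlated, each single-variable model picks up part of the other variable's effect, so each one looks more important in isolation than it really is.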
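A small sketch of the correlated case, again assuming 100 subjects per distance bin and income level 1 = high income. For a one-predictor fit with an intercept, R^2 equals the squared correlation coefficient, so the hypothetical r2_single helper below uses np.corrcoef instead of running a regression.

```python
import numpy as np

# Share of high-income subjects in distance bins 0, 1, 2, 3 (from the text above).
shares = [0.08, 0.16, 0.24, 0.32]
n_per_bin = 100  # assumed sample size per bin

distance = np.repeat([0.0, 1.0, 2.0, 3.0], n_per_bin)
income = np.concatenate([
    np.r_[np.ones(round(n_per_bin * s)), np.zeros(n_per_bin - round(n_per_bin * s))]
    for s in shares
])
bp = 130 - 10 * income - 5 * distance

def r2_single(x, y):
    """R^2 of a one-predictor linear fit with intercept (squared correlation)."""
    return np.corrcoef(x, y)[0, 1] ** 2

r2_distance = r2_single(distance, bp)
r2_income = r2_single(income, bp)
print(round(r2_distance, 2), round(r2_income, 2))      # 0.73 0.48
print(r2_distance + r2_income > 1)                     # True
print(round(np.corrcoef(distance, income)[0, 1], 2))   # 0.22
```

The last line is the correlation between the two predictors themselves. It is only about 0.22, but that is enough to push the single-variable R^2 values past a combined 100%.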

Run with noisy data

Now the original, perfect data set is adjusted by adding random noise to the BP values. The new model is BP = 130 - 10*[income level (0 or 1)] - 5*[distance to road (0, 1, 2, or 3)] + noise, where the noise has a standard deviation of 5. Fitting the same three regression models gives these results:

Model                                             R^2
BP = C*(distance to road)                         0.43
BP = C*(income level)                             0.22
BP = C1*(distance to road) + C2*(income level)    0.65

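The complete model no longer has an R^2 of 1. This is because it does not explain the variance introduced by the noise that was added to the results. The two single-variable values still add up to the complete model's 0.65, because the two variables are uncorrelated in this data set.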
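As a rough check on the 0.65, here is a sketch that adds noise and refits the complete model. The Gaussian shape of the noise, the seed, and the 10,000 subjects per bin are my assumptions (the large sample just keeps the estimate steady from run to run).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_per_bin = 10_000              # assumed; large so the estimate is stable

distance = np.repeat([0.0, 1.0, 2.0, 3.0], n_per_bin)
n_high = n_per_bin // 5         # 20% high income in every bin
income = np.tile(np.r_[np.ones(n_high), np.zeros(n_per_bin - n_high)], 4)
signal = 130 - 10 * income - 5 * distance
bp = signal + rng.normal(0.0, 5.0, size=signal.size)  # Gaussian noise (assumed)

# Fit BP = C0 + C1*distance + C2*income by least squares.
X = np.column_stack([np.ones_like(bp), distance, income])
coef, *_ = np.linalg.lstsq(X, bp, rcond=None)
r2_full = 1 - np.sum((bp - X @ coef) ** 2) / np.sum((bp - bp.mean()) ** 2)

# The noise adds 5**2 = 25 to the variance, and no predictor can explain it.
r2_ceiling = signal.var() / (signal.var() + 5.0 ** 2)
print(round(r2_ceiling, 2))  # 0.65
print(r2_full)               # close to the ceiling; exact value depends on the seed
```

The ceiling is 47.25 / (47.25 + 25): the variance the two variables account for, divided by that same variance plus the noise variance.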

Does correlation imply causation?

Another phrase you'll often hear is 'correlation does not imply causation'. What does that mean? The rough summary is that two variables having a high r^2 when plotted against each other doesn't necessarily mean that one variable affects the other. We can see it clearly with an example.

Take our original, perfect data set and assume that all high-income subjects paint their houses blue and all low-income subjects paint their houses red. Add 'house color' to the data (0 for red and 1 for blue). A regression model of BP = C*(house color) gives an R^2 of 0.34, the same as the income-only model, because house color is just income under another name. House color is not the cause of the blood pressure drop; income is. Income explains both the blood pressure drop and the house color, so house color and blood pressure are related even though neither one causes the other.
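A short sketch of the house color example, with the same assumed 100 subjects per bin. The 'repainting' step is my own addition for illustration: it shuffles house colors at random while leaving BP alone, which is possible only because BP was never computed from color in the first place.

```python
import numpy as np

distance = np.repeat([0.0, 1.0, 2.0, 3.0], 100)
income = np.tile(np.r_[np.ones(20), np.zeros(80)], 4)
bp = 130 - 10 * income - 5 * distance  # house color is not in this formula

house_color = income.copy()  # 1 = blue (high income), 0 = red (low income)
r2_color = np.corrcoef(house_color, bp)[0, 1] ** 2
print(round(r2_color, 2))    # 0.34

# Repaint houses at random: BP is unchanged, and the relationship disappears.
rng = np.random.default_rng(0)
repainted = rng.permutation(house_color)
r2_repainted = np.corrcoef(repainted, bp)[0, 1] ** 2
print(r2_repainted)          # near 0; exact value depends on the seed
```

If house color really drove blood pressure, changing the colors would have to change BP. Here it can't, so the 0.34 is entirely borrowed from income.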
