Monday, December 31, 2018

Published 6:07 PM

What Does It Mean When a Study Controls for a Variable?

If you read scientific studies, you'll often see the phrases 'controlled for' or 'adjusted for' to describe variables like age, income, or gender that are accounted for in their statistical analysis. What does this mean?

Overview

There isn't one magic way to control for a variable in a study. When I see the phrase, though, it usually means the variable was included in a regression. If you aren't familiar with regression and didn't click the link, a really rough summary of linear regression is that you pick a set of input variables (X1, X2, ...) and one result (y), then find the coefficients c1, c2, ... that make y = c1*X1 + c2*X2 + ... fit the data as closely as possible.
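To make that summary concrete, here is a minimal sketch of fitting y = c1*X1 + c2*X2 with least squares. The numbers are made up for illustration, and NumPy's `lstsq` is just one convenient way to do the fit; it is not necessarily what any given study uses.

```python
import numpy as np

# Two made-up input variables (columns X1, X2) for four observations.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])

# Build the result from known coefficients so the fit can be checked:
# y = 2*X1 - 1*X2, with no noise.
true_c = np.array([2.0, -1.0])
y = X @ true_c

# Least squares picks the c values that minimize the squared error.
c, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(c, 6))
```

Because the data here has no noise, the fit recovers the coefficients used to build y. With real data, the coefficients are estimates and come with a range.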

For this example, I created a simple, fake data set to test if living near a freeway increases blood pressure. The properties of the data set are:
  • 5000 total subjects
  • blood pressure is stored as the systolic (higher) value
  • 4 distance bins: 0-200m, 200-500m, 500-1000m, >1000m
  • 2 income bins: 0 (low) and 1 (high)
  • being in the high income bin lowers blood pressure by 15 points
  • the 0-200m bin is the reference for distance, with a baseline blood pressure of 130; relative to it, 200-500m lowers blood pressure by 5 points, 500-1000m by 10 points, and >1000m by 15 points
  • each distance bin has 1250 subjects, and the average income differs between bins
  • each blood pressure reading has some normally distributed noise added to it
To explore this yourself, check the notebook I created here:

https://colab.research.google.com/drive/17DCsEwOFx4-CBAZa4Q1GRIFuNCN_YTIf

Note that the numbers will vary slightly due to the randomness in the data set generation. Also feel free to play with different assumptions here.
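If you'd rather not open the notebook, a data set with the properties listed above could be generated roughly like this. This is a sketch and not the notebook's code: the noise level (standard deviation of 10) and the share of high-income subjects in each distance bin are assumptions picked for illustration, and `make_data` is a hypothetical helper name.

```python
import numpy as np

def make_data(seed=0, noise_sd=10.0):
    """Generate a fake data set with the properties listed above."""
    rng = np.random.default_rng(seed)
    # Distance bin per subject: 0 = 0-200m, 1 = 200-500m, 2 = 500-1000m, 3 = >1000m
    dist = np.repeat(np.arange(4), 1250)
    # ASSUMPTION: share of high-income subjects in each distance bin
    p_high = np.array([0.05, 0.20, 0.35, 0.50])
    income = (rng.random(dist.size) < p_high[dist]).astype(int)
    # 130 baseline, -5 per distance bin, -15 for high income, plus normal noise
    # (noise_sd is also an ASSUMPTION)
    bp = 130 - 5 * dist - 15 * income + rng.normal(0, noise_sd, dist.size)
    return dist, income, bp

dist, income, bp = make_data()
print(len(bp), "subjects, mean blood pressure", round(bp.mean(), 1))
```

Changing `seed` gives a different random draw, which is why the numbers shift slightly from run to run.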

Sample experiment

Imagine you've collected the following information for 5,000 people:
  • blood pressure
  • distance from their home to the freeway
  • income
With this data, you want to see if blood pressure is correlated with how close they live to the freeway. To keep this simple going forward, bin the data into 4 distance bins and 2 income bins (0 is low income, 1 is high income).

Breaking the blood pressure data up by distance bin, you get the following plot:

If you've never used box plots, the way to read this is that half of the points for each bin are inside the rectangle for that bin, the vast majority of points are within the two lines, and 50% of the points are below the horizontal line in the middle of the box.
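The box for each bin is defined by three numbers: the 25th percentile, the median, and the 75th percentile. Here is a small sketch that computes them per distance bin. The data is regenerated with assumed values (noise standard deviation of 10 and high-income shares of 5%, 20%, 35%, and 50%), so the output will not match the notebook exactly.

```python
import numpy as np

# Regenerate the fake data (noise level and income shares are ASSUMPTIONS).
rng = np.random.default_rng(0)
dist = np.repeat(np.arange(4), 1250)
p_high = np.array([0.05, 0.20, 0.35, 0.50])
income = (rng.random(dist.size) < p_high[dist]).astype(int)
bp = 130 - 5 * dist - 15 * income + rng.normal(0, 10, dist.size)

labels = ["0-200m", "200-500m", "500-1000m", ">1000m"]
quartiles = {}
for b, label in enumerate(labels):
    # Bottom of box, middle line, top of box
    q1, median, q3 = np.percentile(bp[dist == b], [25, 50, 75])
    quartiles[label] = (q1, median, q3)
    print(f"{label:>10}: box {q1:.1f} to {q3:.1f}, median {median:.1f}")
```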

Clearly blood pressure drops the farther you live from the road. Done, right? There's a problem though. Living by the road is awful. If someone has a high income, they might live farther from the road than the average person. Breaking the blood pressure data up by distance and income bins, you get the following plot:


Income is a huge factor. To check the idea that income is higher on average in the subjects that live furthest from the road, you generate this plot to see how many subjects in each distance bin have each income level:


Almost no subjects in the 0-200m bin have a high income, while 50% of the ones in the >1000m bin do. Since income has a larger effect in the plots above, can you just assume income is the explanation here and move on?
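The same check can be done as a table of counts instead of a plot. This sketch regenerates the fake data with assumed values (noise standard deviation of 10 and high-income shares of 5%, 20%, 35%, and 50%) and counts subjects by distance bin and income level.

```python
import numpy as np

# Regenerate the fake data (noise level and income shares are ASSUMPTIONS).
rng = np.random.default_rng(0)
dist = np.repeat(np.arange(4), 1250)
p_high = np.array([0.05, 0.20, 0.35, 0.50])
income = (rng.random(dist.size) < p_high[dist]).astype(int)

labels = ["0-200m", "200-500m", "500-1000m", ">1000m"]
# counts[b, i] = number of subjects in distance bin b with income level i
counts = np.array([[np.sum((dist == b) & (income == i)) for i in (0, 1)]
                   for b in range(4)])

for label, (low, high) in zip(labels, counts):
    print(f"{label:>10}: low {low:4d}  high {high:4d}  ({high / (low + high):.0%} high)")
```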

Simple regression

Assuming that income is the only variable, you run a simple linear regression with the blood pressure data and the income bins. The results are:

Term                  | Value | Range          | p
low-income (baseline) | 123.5 | 123.0 to 124.0 | 0.000
high-income           | -19.6 | -20.6 to -18.6 | 0.000

If you aren't familiar with what that means...
  • the average blood pressure for the low-income group is estimated at 123.5, with a range of 123.0 to 124.0
  • having a high income lowered blood pressure by between 18.6 and 20.6 points
  • both terms were statistically significant (the p values round to 0.000)
  • a simple model for blood pressure is 'blood pressure = 123.5 - 19.6*(0 for low-income, 1 for high-income)'
If you are familiar with what that means, check the notebook I linked at the beginning to see the full results. Just to check, you do the same thing for distance bins:

Term              | Value | Range          | p
0-200m (baseline) | 128.6 | 127.7 to 129.5 | 0.000
200-500m          | -6.8  | -8.1 to -5.5   | 0.000
500-1000m         | -13.9 | -15.2 to -12.6 | 0.000
>1000m            | -22.1 | -23.3 to -20.8 | 0.000

Both seem to work. What can you do?
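The two single-variable fits can be sketched like this. The data is regenerated with assumed values (noise standard deviation of 10 and high-income shares of 5%, 20%, 35%, and 50%), and the fit uses plain least squares from NumPy rather than whatever the notebook uses, so it gives coefficients only, with no ranges or p values. `fit` is a hypothetical helper.

```python
import numpy as np

# Regenerate the fake data (noise level and income shares are ASSUMPTIONS).
rng = np.random.default_rng(0)
dist = np.repeat(np.arange(4), 1250)
p_high = np.array([0.05, 0.20, 0.35, 0.50])
income = (rng.random(dist.size) < p_high[dist]).astype(int)
bp = 130 - 5 * dist - 15 * income + rng.normal(0, 10, dist.size)

def fit(columns, y):
    """Ordinary least squares with an intercept; returns [baseline, coefficients...]."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Income only: baseline is the low-income group.
income_fit = fit([income], bp)

# Distance only: one 0/1 column per non-baseline bin (0-200m is the baseline).
dummies = [(dist == b).astype(float) for b in (1, 2, 3)]
distance_fit = fit(dummies, bp)

print("income only:  ", np.round(income_fit, 1))
print("distance only:", np.round(distance_fit, 1))
```

Neither fit knows about the other variable, so each one absorbs part of the other's effect. Both sets of coefficients come out larger in magnitude than the 5-points-per-bin and 15-point effects built into the data.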

Control for income

You now run the same analysis with both distance bins and income bins. This allows you to 'control for' income in your initial test ('blood pressure is correlated with how close they live to the freeway'). Here are the results:

Term                             | Value | Range          | p
0-200m and low-income (baseline) | 129.8 | 129.0 to 130.6 | 0.000
200-500m                         | -5.0  | -6.2 to -3.8   | 0.000
500-1000m                        | -9.7  | -10.9 to -8.5  | 0.000
>1000m                           | -15.7 | -17.0 to -14.5 | 0.000
high-income                      | -15.1 | -16.0 to -14.1 | 0.000


Nice. All bins are important factors. Assuming these are the only factors, this answers our question. Controlling for income, living near the freeway increases blood pressure by ~5 points per bin. A high income lowers blood pressure by 15 points when binned this way.
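In code, 'controlling for' income just means adding the income column to the same fit. A minimal sketch, again with regenerated data (noise standard deviation of 10 and high-income shares of 5%, 20%, 35%, and 50% are assumptions) and plain least squares from NumPy:

```python
import numpy as np

# Regenerate the fake data (noise level and income shares are ASSUMPTIONS).
rng = np.random.default_rng(0)
dist = np.repeat(np.arange(4), 1250)
p_high = np.array([0.05, 0.20, 0.35, 0.50])
income = (rng.random(dist.size) < p_high[dist]).astype(int)
bp = 130 - 5 * dist - 15 * income + rng.normal(0, 10, dist.size)

# Columns: baseline, 200-500m, 500-1000m, >1000m, high-income
dummies = [(dist == b).astype(float) for b in (1, 2, 3)]
X = np.column_stack([np.ones(bp.size)] + dummies + [income])
coef, *_ = np.linalg.lstsq(X, bp, rcond=None)

terms = ["baseline", "200-500m", "500-1000m", ">1000m", "high-income"]
for term, value in zip(terms, coef):
    print(f"{term:>12}: {value:6.1f}")
```

With both variables in the model, the distance coefficients should land near -5, -10, and -15, and the income coefficient near -15, which are the values used to build the data.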

Once more, in case it isn't clear: the blood pressure data were fit using both distance to road and income as variables, and the result was:

'blood pressure = 129.8 - (0 if <200m from road, 5.0 if 200-500m from road, 9.7 if 500-1000m from road, 15.7 if >1000m from road) - (0 if low-income, 15.1 if high-income)'
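That formula translates directly into a small function. The coefficients are the ones from the table above; the function and constant names are just illustrative.

```python
# Coefficients taken from the fit with both distance and income.
BASELINE = 129.8  # 0-200m and low-income
DISTANCE_EFFECT = {
    "0-200m": 0.0,
    "200-500m": -5.0,
    "500-1000m": -9.7,
    ">1000m": -15.7,
}
HIGH_INCOME_EFFECT = -15.1

def predict_bp(distance_bin, high_income):
    """Predicted blood pressure for a distance bin and income level."""
    income_term = HIGH_INCOME_EFFECT if high_income else 0.0
    return BASELINE + DISTANCE_EFFECT[distance_bin] + income_term

print(f"{predict_bp('200-500m', True):.1f}")   # 109.7
print(f"{predict_bp('>1000m', False):.1f}")    # 114.1
```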

Additional method

One additional method for controlling for a variable that I'll briefly cover here is simply converting the variable into a constant. You'll often see this as a study targeting a very specific group, 'black women aged 50-55' for example. By limiting age, gender, and race, you control for those variables.

With the data set here, you can do this by considering only low-income subjects. Running the same analysis on blood pressure vs distance bin with only those subjects yields:

Term              | Value | Range          | p
0-200m (baseline) | 129.8 | 129.0 to 130.7 | 0.000
200-500m          | -5.0  | -6.2 to -3.7   | 0.000
500-1000m         | -9.5  | -10.9 to -8.2  | 0.000
>1000m            | -15.8 | -17.3 to -14.5 | 0.000
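Restricting to one income group is a one-line filter. In this sketch the data is regenerated with assumed values (noise standard deviation of 10 and high-income shares of 5%, 20%, 35%, and 50%). With only distance bins in the model, the regression coefficients are the same as differences in bin averages, so the sketch compares averages directly.

```python
import numpy as np

# Regenerate the fake data (noise level and income shares are ASSUMPTIONS).
rng = np.random.default_rng(0)
dist = np.repeat(np.arange(4), 1250)
p_high = np.array([0.05, 0.20, 0.35, 0.50])
income = (rng.random(dist.size) < p_high[dist]).astype(int)
bp = 130 - 5 * dist - 15 * income + rng.normal(0, 10, dist.size)

# Keep only low-income subjects, which turns income into a constant.
low = income == 0
means = [bp[low & (dist == b)].mean() for b in range(4)]

# Effect of each distance bin relative to the 0-200m baseline.
diffs = [m - means[0] for m in means[1:]]

print("baseline:", round(means[0], 1))
print("effects: ", [round(d, 1) for d in diffs])
```

The trade-off is that the high-income subjects are thrown away, so the ranges get wider, and the result only describes the low-income group.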


Conclusion

As I stated in the beginning, I created a fake data set where each step away from the road decreased blood pressure by 5 points per bin, and the high income bin's blood pressures were 15 points lower. When not controlling for income, the result is a decrease of roughly 7 points per bin (-6.8, -13.9, and -22.1 instead of -5, -10, and -15). Worse, every range lies entirely beyond the true value, so the correct result is not included in any of the confidence intervals.
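The size of that error can be worked out by hand. Without income in the model, the distance coefficient picks up the income effect multiplied by how much the share of high-income subjects differs between bins. Here is a rough sketch for the >1000m bin, treating 'almost no' high-income subjects at 0-200m as 0, which is an approximation:

```python
true_distance_effect = -15.0  # >1000m vs 0-200m, by construction
income_effect = -15.0         # high vs low income, by construction
share_high_near = 0.0         # APPROXIMATION of "almost no" high-income subjects at 0-200m
share_high_far = 0.5          # 50% high-income at >1000m

# Distance-only estimate = true effect + income effect * difference in income mix
naive = true_distance_effect + income_effect * (share_high_far - share_high_near)
print(naive)  # -22.5
```

That is close to the -22.1 in the distance-only table, so the extra ~7 points come from the income mix and not from distance.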

When controlling for income, the result is a decrease of ~5 points per distance bin and 15 points for the high income bin. It works perfectly with this ideal, contrived data. Hopefully this makes the concept of 'controlling for' a variable in a study really easy to understand.

