## Friday, January 18, 2019

Published 7:06 PM

# Handling Outliers in Linear Regression

What are some of the techniques for handling outliers in linear regression, and how do they compare? I evaluate several in Python.

#### Methods

I'm taking sample data with a few different types of outliers and calculating the slope and intercept using the following methods:
1. simple linear regression
2. RANSAC
3. Theil-Sen estimator
4. least trimmed squares (LTS)
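As a rough sketch of the scikit-learn side, the library-backed fits could look something like the code below. The post does not show its code, so the choice of `LinearRegression`, `RANSACRegressor`, and `TheilSenRegressor` with default settings is an assumption, and the helper name `fit_lines` and its `seed` parameter are hypothetical.

```python
import numpy as np
from sklearn.linear_model import (
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)


def fit_lines(x, y, seed=0):
    """Return {method name: (slope, intercept)} for three scikit-learn estimators.

    Hypothetical helper: default estimator settings are assumed throughout.
    """
    # scikit-learn expects a 2-D feature array, so reshape x into one column.
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float)

    results = {}

    # Ordinary least squares on all points.
    ols = LinearRegression().fit(X, y)
    results["simple linear regression"] = (ols.coef_[0], ols.intercept_)

    # RANSAC wraps a LinearRegression by default; the model refit on the
    # inliers is exposed as `estimator_`.
    ransac = RANSACRegressor(random_state=seed).fit(X, y)
    results["RANSAC"] = (
        ransac.estimator_.coef_[0],
        ransac.estimator_.intercept_,
    )

    # Theil-Sen estimator.
    theil_sen = TheilSenRegressor(random_state=seed).fit(X, y)
    results["Theil-Sen"] = (theil_sen.coef_[0], theil_sen.intercept_)

    return results
```

Calling `fit_lines(x, y)` on 1-D arrays returns a dictionary of `(slope, intercept)` pairs, which makes it easy to print the methods side by side.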

For all but LTS, I used scikit-learn. I wrote a custom implementation of LTS because I could not find a nice one when I searched. The LTS algorithm is basically:
1. randomly sample 60% of the points, perform simple linear regression on them, and repeat 20 times
2. keep the sample from step 1 that gave you the best score
3. replace a point in the sample with another point from the original pool of data, perform simple linear regression, and calculate the score; if the score improved, keep the new point; repeat many times
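The three steps above can be sketched roughly as follows. This is not the original implementation, which the post does not include. The score is assumed to be the sum of squared residuals over the sampled subset (lower is better), and the function names and the `n_swaps` and `seed` parameters are hypothetical.

```python
import numpy as np


def subset_score(x, y, idx):
    """Fit a line to the points in idx; return (score, slope, intercept).

    Assumption: the score is the sum of squared residuals on the subset.
    """
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    resid = y[idx] - (slope * x[idx] + intercept)
    return float(np.sum(resid ** 2)), slope, intercept


def lts_fit(x, y, frac=0.6, n_starts=20, n_swaps=2000, seed=0):
    """Least trimmed squares line fit via random starts and random swaps."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    h = max(2, int(round(frac * n)))  # subset size, 60% of points by default

    # Steps 1 and 2: try several random subsets and keep the best-scoring one.
    best_idx, best = None, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)
        cand = subset_score(x, y, idx)
        if best is None or cand[0] < best[0]:
            best_idx, best = idx, cand

    # Step 3: swap one subset point for a point outside the subset and
    # keep the swap only if the score improves.
    for _ in range(n_swaps):
        outside = np.setdiff1d(np.arange(n), best_idx)
        if outside.size == 0:
            break
        trial = best_idx.copy()
        trial[rng.integers(h)] = rng.choice(outside)
        cand = subset_score(x, y, trial)
        if cand[0] < best[0]:
            best_idx, best = trial, cand

    return best[1], best[2]  # slope, intercept
```

Because the swaps are random, the result depends on `seed` and on how many swaps are attempted. More swaps cost more time but make it less likely that an outlier survives in the final subset.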

#### Results

I used three outlier types:
1. 20% of points are all way-off in the same direction
2. 20% of points have large, random errors added to them
3. 1 point is massively off; error is 50x the total scale of the data

Overall, simple linear regression produced noticeable errors for all three outlier types. RANSAC, Theil-Sen, and LTS all recovered the slope well. LTS and Theil-Sen gave the best results on this specific data set, while RANSAC's intercept was off by about 2 when the outliers were all in one direction.
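For anyone wanting to reproduce something similar, test data with the three outlier types listed above could be generated along these lines. The post does not give its data, so everything numeric here is an assumption: the number of points, the x range, the small base noise, the offset of 30 for the one-direction case, and the spread of 50 for the random case. Only the 20% fraction, the 50x error on a single point, and the ideal line y = x come from the post. `make_data` is a hypothetical name.

```python
import numpy as np


def make_data(kind, n=100, seed=0):
    """Points near y = x (slope 1, intercept 0) with one outlier pattern.

    kind: "one_direction", "random", or "single".
    All magnitudes below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 100.0, n)
    y = x + rng.normal(scale=0.1, size=n)  # small base noise (assumed)

    n_out = n // 5  # 20% of the points
    idx = rng.choice(n, size=n_out, replace=False)

    if kind == "one_direction":
        # Every outlier is shifted the same way by the same amount.
        y[idx] -= 30.0
    elif kind == "random":
        # Outliers get large errors of random sign and size.
        y[idx] += rng.normal(scale=50.0, size=n_out)
    elif kind == "single":
        # One point is off by 50x the total x range of the data.
        y[idx[0]] -= 50.0 * (x.max() - x.min())
    else:
        raise ValueError(f"unknown outlier kind: {kind!r}")

    return x, y
```

Each call returns `x` and `y` arrays that can be passed straight to any of the fitting methods.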

With an outlier-free slope of 1 and intercept of 0, these are the results:

#### outliers in one direction

| Method | Slope | Intercept |
| --- | --- | --- |
| ideal | 1 | 0 |
| simple linear regression | 0.647 | -1.503 |
| RANSAC | 1.03 | -2.132 |
| Theil-Sen estimator | 0.999 | -0.004 |
| least trimmed squares | 1.0 | -0.003 |

#### random outliers

| Method | Slope | Intercept |
| --- | --- | --- |
| ideal | 1 | 0 |
| simple linear regression | 0.639 | 8.915 |
| RANSAC | 0.997 | -0.111 |
| Theil-Sen estimator | 1.0 | 0.006 |
| least trimmed squares | 1.0 | -0.004 |

#### one big outlier

| Method | Slope | Intercept |
| --- | --- | --- |
| ideal | 1 | 0 |
| simple linear regression | 0.97 | -48.996 |
| RANSAC | 1.0 | 0.004 |
| Theil-Sen estimator | 1.0 | -0.001 |
| least trimmed squares | 1.0 | 0.001 |
