Cook’s Distance Removal in Python

Cook’s distance is a useful approach for identifying outliers and the impact of each observation on a particular regression model. It can aid in the removal of outliers and the investigation of which points contribute the least to the prediction of target variables. We’ll look at regression, outliers, and how Cook’s distance plays a role in developing a good regression model. Later, we will also implement Cook’s distance in Python.

What is Regression?

Regression analysis is a statistical tool for analyzing the connection between independent and dependent variables. The most typical application of regression analysis is forecasting or predicting how a collection of conditions will affect an outcome. Suppose you had a set of data on high school students that included their GPA, gender, age, and SAT scores.

In that case, you could create a basic linear regression model with the independent variables being GPA, gender, and age and the dependent variable being the SAT score. Then, once you have the model, you can estimate what new students will score on the SAT based on the other three factors, assuming the model is a good fit. Another good example of regression analysis is house price prediction based on the number of rooms, area, and other factors.

What Do We Mean by Linear Regression?

Linear regression is the most common, straightforward, but effective supervised learning technique for predicting continuous variables. The goal of linear regression is to determine how an input variable (independent variable) affects an output variable (dependent variable). Given below are the elements of Linear Regression:

  1. The input variable is usually continuous
  2. The output variable is continuous
  3. The assumptions of Linear Regression hold.

The assumptions of linear regression include a linear relationship between the input and output variables, that errors are normally distributed, and that the error term is independent of the input.
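
As a rough illustration of what fitting a linear regression involves, the sketch below estimates the slope and intercept of a single-input model by ordinary least squares in plain Python. The function name fit_line and the sample numbers are made up for this example; in practice we would more likely rely on a library such as statsmodels.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical sample data that lies exactly on the line y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```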

What is Euclidean Distance?

The Euclidean distance is the length of the straight line segment between two points in a plane, which is also the shortest distance between them. If a right triangle is drawn with the two points at the ends of its hypotenuse, the distance equals the square root of the sum of the squares of the triangle’s base and height. It’s commonly used in geometry for a variety of purposes. Euclidean space is the type of space where lines that begin parallel remain parallel and are always the same distance apart.

This closely resembles the space in which humans dwell, so the Euclidean distance between two objects matches what common sense tells you about the shortest distance between them. It is calculated mathematically with Pythagoras’ theorem. The Manhattan distance is an alternative metric for determining the distance between two places.
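
A minimal sketch of the Pythagorean calculation described above, for two points in a plane. The function name euclidean_distance and the sample points are chosen only for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between 2-D points p = (x1, y1) and q = (x2, y2)."""
    base = p[0] - q[0]
    height = p[1] - q[1]
    # Pythagoras' theorem: hypotenuse = sqrt(base^2 + height^2)
    return math.sqrt(base ** 2 + height ** 2)


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```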

What is Manhattan Distance?

Manhattan distance is calculated as if the plane were divided into blocks and you could not travel diagonally. As a result, the Manhattan distance is usually longer than the straight-line distance between two points. If two points in a plane are (x1, y1) and (x2, y2), the Manhattan distance between them is calculated as |x1-x2| + |y1-y2|. This is commonly employed in cities where streets are laid out in blocks, and it is impossible to go diagonally from one location to another.
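
The formula |x1-x2| + |y1-y2| translates directly into code. The sketch below uses a hypothetical function name, manhattan_distance, and the same sample points a 3-4-5 right triangle would give, so the result can be compared with the straight-line distance of 5.

```python
def manhattan_distance(p, q):
    """Block distance between 2-D points p = (x1, y1) and q = (x2, y2)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


print(manhattan_distance((0, 0), (3, 4)))  # 7
```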

What are Outliers?

Outliers in a dataset are numbers or data points that are abnormally high or low compared to other data points or values. An outlier is an observation that deviates from a sample’s overall pattern. Outliers can reduce a model’s accuracy, so they are often candidates for removal. Outliers are typically visualized using box plots. For example, in a class of students, we may expect their ages to be between 5 and 20. A 50-year-old student in the class would be considered an outlier since he does not “belong” to the data’s regular trend.

Plotting the data (typically with a box plot) is perhaps the simplest technique to see any outliers in the dataset. Statistical processes related to quality control can tell you how far out a point is statistically (in terms of standard deviations and confidence levels). However, keep in mind that a point should only be treated as an outlier if you have enough information about the data to explain why it is different from the other data points, thus justifying the term “outlier.” Otherwise, the point must be treated as a random occurrence and kept in the data set, and you must accept the less desirable findings that result from its inclusion.

What is Cook’s Distance?

In data science, Cook’s distance is used to calculate the influence of each data point on a regression model. When performing a least-squares regression analysis, it is a method of identifying influential outliers in a set of predictor variables. R. Dennis Cook, an American statistician, originated this concept, which is why it is named after him. Cook’s distance compares the fitted values of the model with and without the current observation to see how much removing that observation affects the regression model. The greater the influence of a certain observation on the model, the greater the Cook’s distance of that observation.
Mathematically, Cook’s distance is represented as

$$D_i = \frac{d_i^2}{c \cdot M} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where:
$d_i$ is the residual of the ith data point (the observed value minus the fitted value)
$c$ represents the number of coefficients in the given regression model
$M$ is the Mean Squared Error of the regression model, which estimates the variance of the residuals
$h_{ii}$ is the ith leverage value.
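
To make the formula concrete, here is a minimal sketch that evaluates it by hand for a regression with one input variable and an intercept, so c = 2. It takes d_i to be the residual of the ith point and uses the closed-form leverage for simple linear regression, h_ii = 1/n + (x_i - mean(x))^2 / Sxx. The function name cooks_distance_simple is hypothetical; the data is the small X/Y example used in the implementation section of this article.

```python
def cooks_distance_simple(x, y):
    """Cook's distance for each point of a one-input least-squares fit."""
    n = len(x)
    c = 2  # coefficients: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # d_i: residuals (observed minus fitted)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # M: mean squared error, using n - c degrees of freedom
    mse = sum(d ** 2 for d in residuals) / (n - c)
    # h_ii: leverage of each observation
    leverage = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]

    return [(d ** 2 / (c * mse)) * (h / (1 - h) ** 2)
            for d, h in zip(residuals, leverage)]


x = [10, 20, 30, 40, 50, 60]
y = [20, 30, 40, 50, 100, 70]
for xi, di in zip(x, cooks_distance_simple(x, y)):
    print(xi, round(di, 4))
```

On this data the last two observations receive by far the largest values, which reflects how strongly they pull the fitted line.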

Conclusions of Cook’s Distance

  1. A probable outlier is a data point with a Cook’s distance more than three times the mean of all the Cook’s distances.
  2. If there are n observations, any point with a Cook’s distance larger than 4/n is deemed influential.
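
Both rules of thumb can be applied with a few lines of code. The sketch below assumes the Cook’s distances have already been computed; the function name flag_influential and the input values are hypothetical and serve only to show the mechanics.

```python
def flag_influential(distances):
    """Return indices flagged by the 3 x mean rule and by the 4/n rule."""
    n = len(distances)
    mean_rule = 3 * sum(distances) / n  # three times the mean distance
    size_rule = 4 / n                   # 4 divided by the number of observations
    return {
        "over_3x_mean": [i for i, d in enumerate(distances) if d > mean_rule],
        "over_4_over_n": [i for i, d in enumerate(distances) if d > size_rule],
    }


# Hypothetical Cook's distances, for illustration only
example = [0.02, 0.01, 0.03, 0.05, 0.9, 0.04]
print(flag_influential(example))
```

The two rules do not always agree, so it can be worth inspecting any point flagged by either one before deciding what to do with it.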

Implementing Cook’s Distance in Python

Reading the Data
We will create a small two-column DataFrame where ‘X’ represents the independent variable while ‘Y’ represents the dependent variable.

import pandas as pd

#create dataframe
df = pd.DataFrame({'X': [10, 20, 30, 40, 50, 60],
                   'Y': [20, 30, 40, 50, 100, 70]})

Creating the Regression Model

import statsmodels.api as sm

# storing dependent values
Y = df['Y']

# storing independent values
X = df['X']

# add a constant column so the model includes an intercept
X = sm.add_constant(X)

# fit the model and keep the fitted results
model = sm.OLS(Y, X).fit()

Calculate Cook’s distance

import numpy as np
np.set_printoptions(suppress=True)

# create instance of influence
influence = model.get_influence()

# get Cook's distance for each observation
# (cooks_distance returns a tuple: the distances and their p-values)
cooks_distances, p_values = influence.cooks_distance

# print Cook's distances
print(cooks_distances)

Other Outlier Detection Technique

Interquartile Range (IQR)
The interquartile range (IQR) is a measure of data dispersion. It’s especially effective for significantly skewed or otherwise out-of-the-ordinary data. For example, data regarding money (income, property and car prices, savings and assets, and so on) is frequently skewed to the right, with the majority of observations being on the low end and a few scattered on the high end. The interquartile range concentrates on the middle half of the data while disregarding the tails.
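
A common convention for turning the IQR into an outlier test, and the one box plots usually draw, is to flag values more than 1.5 times the IQR below the first quartile or above the third quartile. The sketch below assumes that convention; the function name iqr_outliers, the multiplier default, and the sample ages are illustrative choices, and the quartiles come from the standard library’s statistics.quantiles with its "inclusive" method.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical ages in a class, including one 50-year-old student
ages = [6, 9, 11, 12, 14, 15, 18, 50]
print(iqr_outliers(ages))  # [50]
```

Different quartile definitions give slightly different cutoffs, so results near the boundary can vary between tools.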

Conclusion

We went through the description of Cook’s distance, its related concepts like regression, outliers, and how we can use it to find the influence of each observation in our dataset. Cook’s distance is important to examine the outliers and what impact each observation has on the regression model. Later, we also implemented Cook’s distance using Python on a regression model.

About the author

Simran Kaur

Simran works as a technical writer. The graduate in MS Computer Science from the well known CS hub, aka Silicon Valley, is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Travelling, sketching, and gardening are the hobbies that interest her.