## What is Regression?

Regression analysis is a statistical tool for analyzing the relationship between independent and dependent variables, and it can be extended in many different ways. Its most typical application is forecasting or predicting how a set of conditions will affect an outcome. Suppose you had a set of data on high school students that included their GPA, gender, age, and SAT scores.

In that case, you could create a basic linear regression model with GPA, gender, and age as the independent variables and the SAT score as the dependent variable. Then, once you have the model, and assuming it is a good fit, you can estimate what new students will score on the SAT based on the other three factors. Another good example of regression analysis is predicting house prices from the number of rooms, the area, and other factors.

## What Do We Mean by Linear Regression?

Linear regression is the most common and straightforward, yet effective, supervised learning technique for predicting continuous variables. The goal of linear regression is to determine how an input variable (independent variable) affects an output variable (dependent variable). Linear regression is characterized by the following conditions:

- The input variable is usually continuous
- The output variable is continuous
- The assumptions of Linear Regression hold.

The assumptions of linear regression include a linear relationship between the input and output variables, that errors are normally distributed, and that the error term is independent of the input.
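As a minimal sketch of what fitting a linear regression looks like, the snippet below fits a straight line with NumPy's `polyfit` and then inspects the residuals, which are what the assumptions above are about. The `x` and `y` values are made up for illustration and are not taken from any real dataset.

```python
import numpy as np

# Hypothetical data: one continuous input and one continuous output
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are the part of the output the fitted line does not explain
predicted = slope * x + intercept
residuals = y - predicted

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print("residuals:", np.round(residuals, 3))
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, so checking the assumptions means looking at their spread and shape rather than their average.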

## What is Euclidean Distance?

The Euclidean distance is the length of the straight line segment between two points in a plane, which is the shortest distance between them. If a right triangle is drawn with the two points at the ends of its hypotenuse, the distance equals the square root of the sum of the squares of the triangle’s base and height. It is commonly used in geometry for a variety of purposes. Euclidean space is the type of space where lines that begin parallel remain parallel and are always the same distance apart.

This closely resembles the space in which humans live, so the Euclidean distance between two objects matches what common sense tells you about the shortest distance between them. It is calculated mathematically using Pythagoras’ theorem. The Manhattan distance is an alternative metric for determining the distance between two points.
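A short sketch of the Pythagorean calculation for two points in a plane follows. The function name `euclidean_distance` and the sample points are chosen here for illustration only.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two 2-D points given as (x, y) tuples."""
    dx = p[0] - q[0]  # base of the right triangle
    dy = p[1] - q[1]  # height of the right triangle
    return math.sqrt(dx**2 + dy**2)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```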

## What is Manhattan Distance?

Manhattan distance is calculated as if the plane were divided into blocks and you could not travel diagonally. As a result, the Manhattan distance is generally longer than the most direct route between two points. If two points in a plane are (x1, y1) and (x2, y2), the Manhattan distance between them is calculated as |x1-x2| + |y1-y2|. It models travel in cities where streets are laid out in blocks and it is impossible to go diagonally from one location to another.
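The formula |x1-x2| + |y1-y2| can be sketched in a few lines. The function name `manhattan_distance` and the sample points are illustrative choices, not part of any library.

```python
def manhattan_distance(p, q):
    """Grid distance between two 2-D points given as (x, y) tuples."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

print(manhattan_distance((1, 2), (4, 6)))  # 7
```

For these two points the straight-line (Euclidean) distance is 5, so the block-by-block route is longer, as expected.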

## What are Outliers?

Outliers in a dataset are values or data points that are abnormally high or low compared to the other data points. An outlier is an observation that deviates from a sample’s overall pattern. Outliers can reduce a model’s accuracy, so they need to be identified and examined. They are typically visualized using box plots. For example, in a class of students, we may expect ages to be between 5 and 20. A 50-year-old student in the class would be considered an outlier, since that age does not “belong” to the data’s regular trend.

Plotting the data (typically with a box plot) is perhaps the simplest technique for spotting outliers in a dataset. Statistical methods from quality control can tell you how far out a point lies (in terms of standard deviations and confidence levels). However, keep in mind that a point should only be treated as an outlier if you have enough information about the data to explain why it differs from the other data points, thus justifying the term “outlier.” Otherwise, the point must be treated as a random occurrence and kept in the data set, and you must accept the less desirable findings that result from its inclusion.

## What is Cook’s Distance?

In Data Science, Cook’s distance is used to measure the influence of each data point on a regression model. It is computed as part of a least-squares regression analysis and is a method of identifying influential outliers among the observations. R. Dennis Cook, an American statistician, originated this concept, which is why it is named after him. Cook’s distance compares the fitted model with and without the current observation to see how much removing that observation changes the regression model. The greater the influence of a certain observation on the model, the greater the Cook’s distance of that observation.
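The idea of "remove one observation and see how the model changes" can be sketched directly. The snippet below refits a straight line with each point left out in turn and reports how far the slope moves. It uses the same small X/Y dataset as the Python section of this article. Looking only at the slope is a simplification made for this sketch: Cook’s distance itself summarizes the change in all fitted values, not just one coefficient.

```python
import numpy as np

# Same toy data as the article's Python example
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([20, 30, 40, 50, 100, 70], dtype=float)

# Fit on all observations
full_slope, full_intercept = np.polyfit(x, y, deg=1)

# Refit with each observation removed and record the change in slope
slope_changes = []
for i in range(len(x)):
    keep = np.arange(len(x)) != i  # boolean mask that drops observation i
    slope_i, _ = np.polyfit(x[keep], y[keep], deg=1)
    slope_changes.append(slope_i - full_slope)

for i, change in enumerate(slope_changes):
    print(f"without observation {i}: slope changes by {change:+.3f}")
```

Observations whose removal shifts the fit the most are the ones a measure such as Cook’s distance is designed to surface.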

Mathematically, Cook’s distance is represented as

$$D_i = \frac{r_i^2}{c \, M} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where:

$D_i$ is the Cook’s distance of the $i$-th data point

$r_i$ is the residual of the $i$-th data point

$c$ represents the number of coefficients in the given regression model

$M$ is the Mean Squared Error of the regression model, which estimates the variance of the residuals

$h_{ii}$ is the $i$-th leverage value.
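To make the quantities concrete, here is a sketch that computes Cook’s distance by hand with NumPy, using the standard form D_i = r_i^2 / (c * M) * h_ii / (1 - h_ii)^2, where r_i is the residual of observation i. It uses the same toy X/Y data as the Python section of this article. The variable names are chosen for this sketch, and the Mean Squared Error is assumed to use n - c degrees of freedom.

```python
import numpy as np

# Same toy data as the article's Python example
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([20, 30, 40, 50, 100, 70], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])
n, c = X.shape  # n observations, c coefficients (intercept and slope)

# Ordinary least-squares fit and residuals r_i
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Mean Squared Error M (assumption: n - c degrees of freedom)
mse = residuals @ residuals / (n - c)

# Leverage h_ii: diagonal of the hat matrix H = X (X'X)^-1 X'
hat = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(hat)

# Cook's distance for every observation
cooks_d = residuals**2 / (c * mse) * leverage / (1 - leverage) ** 2
print(np.round(cooks_d, 4))
```

The two factors show why a point can be influential: it needs a large residual, a high leverage, or both.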

## Conclusions of Cook’s Distance

- A probable outlier is a data point whose Cook’s distance is more than three times the mean Cook’s distance of all observations.
- If there are n observations, any point with Cook’s distance larger than 4/n is deemed influential.
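Both rules of thumb above are easy to apply once the distances are available. The helper below is a sketch: `flag_influential` is a hypothetical name, and the `example` values are made up for illustration rather than computed from real data.

```python
import numpy as np

def flag_influential(cooks_d):
    """Return two boolean masks: distances above 4/n, and above 3x the mean."""
    cooks_d = np.asarray(cooks_d, dtype=float)
    n = cooks_d.size
    above_4_over_n = cooks_d > 4 / n
    above_3x_mean = cooks_d > 3 * cooks_d.mean()
    return above_4_over_n, above_3x_mean

# Hypothetical Cook's distances for eight observations
example = [0.01, 0.02, 0.03, 0.01, 0.90, 0.02, 0.04, 0.01]
influential, probable_outlier = flag_influential(example)

print("above 4/n:", np.flatnonzero(influential))
print("above 3x mean:", np.flatnonzero(probable_outlier))
```

These thresholds are conventions rather than formal tests, so flagged points are candidates for a closer look, not automatic removals.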

## Implementing Cook’s Distance in Python

**Reading the Data**

We will create a small DataFrame in which ‘X’ represents the independent variable and ‘Y’ represents the dependent variable.

```python
import pandas as pd

# create dataframe
df = pd.DataFrame({'X': [10, 20, 30, 40, 50, 60],
                   'Y': [20, 30, 40, 50, 100, 70]})
```

**Creating the Regression Model**

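```python
import statsmodels.api as sm

# storing dependent values
Y = df['Y']

# storing independent values, plus a constant column for the intercept
X = df['X']
X = sm.add_constant(X)

# fit the model and keep the fitted results
model = sm.OLS(Y, X).fit()
```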

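**Calculating Cook’s Distance**

```python
import numpy as np

# print small values in plain decimal rather than scientific notation
np.set_printoptions(suppress=True)

# create instance of influence
influence = model.get_influence()

# get Cook's distance for each observation
# (cooks_distance is a tuple: the distances, then their p-values)
cooks_distances = influence.cooks_distance

# print Cook's distances
print(cooks_distances)
```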

## Another Outlier Detection Technique

**Interquartile Range (IQR)**

The interquartile range (IQR) is a measure of data dispersion. It’s especially effective for significantly skewed or otherwise out-of-the-ordinary data. For example, data regarding money (income, property and car prices, savings and assets, and so on) is frequently skewed to the right, with the majority of observations being on the low end and a few scattered on the high end. The interquartile range concentrates on the middle half of the data while disregarding the tails.
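A common way to turn the IQR into an outlier rule is to flag values more than 1.5 × IQR below the first quartile or above the third quartile, which is also how box-plot whiskers are usually drawn. The sketch below assumes that convention. The `prices` values are made up for illustration.

```python
import numpy as np

# Hypothetical right-skewed data with one very large value
prices = np.array([10, 12, 13, 15, 16, 18, 100], dtype=float)

# First and third quartiles, and the spread of the middle half
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a convention, not a law)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("outliers:", outliers)
```

Because the quartiles ignore the tails, the single extreme value barely moves the fences, which is what makes the rule robust for skewed data.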

## Conclusion

We went through the definition of Cook’s distance, related concepts such as regression and outliers, and how Cook’s distance can be used to find the influence of each observation in a dataset. Cook’s distance is useful for examining outliers and the impact each observation has on the regression model. Finally, we implemented Cook’s distance in Python on a regression model.