What is Linear Regression?
Linear regression is a simple yet effective supervised machine learning algorithm for predicting continuous variables. It seeks to determine how the output variable (response variable) varies with the input variable (explanatory variable). Many advanced supervised machine learning algorithms build on linear regression concepts. Linear regression is commonly used in machine learning problems to predict continuous variables where the target and feature variables have a linear relationship.
A simple linear regression has three main requirements: a continuous input variable, a continuous response variable, and satisfaction of the linear regression assumptions.
Assumptions of Linear Regression:
- Input variables (x) have a linear relationship with the target variable (y). Also, the input variables should not be correlated with each other (no multicollinearity).
- The error term is centred around 0, so the expected value of the error term is E(e) = 0.
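These assumptions can be checked empirically. A minimal sketch (with made-up synthetic data) fits a line using NumPy's `polyfit` and verifies that the residuals are centred around 0:

```python
import numpy as np

# Hypothetical synthetic data: y is linear in x plus zero-mean noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=x.shape)

# Fit a simple linear regression line with ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# The residuals (error terms) should have a mean of approximately 0,
# matching the assumption E(e) = 0.
residuals = y - (slope * x + intercept)
```

With an intercept in the model, ordinary least squares guarantees the residual mean is zero up to floating-point error, so this check mainly catches model misspecification on real data.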
How Does Linear Regression Work?
Given a set of data points with inputs (x) and responses (y), a linear regression model attempts to fit the line that passes through the greatest number of points while minimizing the squared distance (cost function) between the points and the fitted line.
As a result, the cost function is ultimately minimized. The cost function for linear regression is usually the Mean Squared Error: MSE = (1/n) Σ (yi − ŷi)², where ŷi is the predicted value for the i-th data point.
The regression equation is written as y = β1x + β0.
The term β0 represents the intercept, β1 represents the slope of the regression line, x represents the input variable, and y represents the predicted value of the response variable.
We know from basic mathematics that a straight line is identified by two parameters: slope and intercept. The linear regression algorithm selects some initial parameters and continuously updates them to minimize the cost function. Below is the image showing the regression line (blue), deviations (green), and the data points (red).
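As a sketch of that update loop, the gradient-descent version below starts from zero slope and intercept and repeatedly adjusts both to reduce the MSE (the data points, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Made-up data that roughly follows y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

slope, intercept = 0.0, 0.0  # initial parameters
lr = 0.01                    # learning rate (illustrative choice)

for _ in range(5000):
    y_pred = slope * x + intercept
    error = y_pred - y
    # Gradients of the MSE with respect to slope and intercept.
    grad_slope = 2.0 * np.mean(error * x)
    grad_intercept = 2.0 * np.mean(error)
    slope -= lr * grad_slope
    intercept -= lr * grad_intercept

print(round(slope, 2), round(intercept, 2))  # prints: 1.99 1.05
```

After enough iterations the parameters converge to the same slope and intercept that the closed-form least-squares solution would give for this data.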
Linear regression can also be extended to multiple input variables, and the approach remains exactly the same. The equation of the line for multiple variables is represented by: y = β0 + β1x1 + β2x2 + … + βnxn.
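For instance, a minimal sketch with two made-up input variables (true coefficients chosen as β0 = 1, β1 = 2, β2 = 3) shows the same least-squares fit recovering all three parameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: two input variables, noise-free target
# constructed as y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Fitting multiple linear regression works exactly like the
# single-variable case.
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # recovers β0 ≈ 1, β1 ≈ 2, β2 ≈ 3
```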
A Demo on Linear Regression
Let us predict a target variable using a single input variable. The below example and dataset are from the scikit-learn official documentation. scikit-learn is a widely used library for developing Machine Learning models.
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
What is Logistic Regression?
Logistic regression is a classification algorithm. It is a decision-making algorithm, which means it seeks out the boundaries between two classes and models the probability of a single class. Because the output is discrete and can take two values, it is typically used for binary classification.
The target variable in linear regression is continuous, which means it can take any real number value, whereas in logistic regression, we want our output to be a probability (between 0 and 1). Logistic regression is derived from linear regression, but it adds an extra layer, the sigmoid function, to ensure that the output remains between 0 and 1.
How Does Logistic Regression Work?
Logistic regression is a simple and widely used machine learning algorithm, especially for binary classification problems. This extension of the linear regression algorithm uses a sigmoid activation function to limit the output variable between 0 and 1. Each data point is first combined linearly as x1*w1 + x2*w2 + …, and passing this value through the activation function gives a result between 0 and 1. Using 0.50 as a deciding value or threshold, any result greater than 0.5 is considered a 1, and any result less than that is considered a 0. The sigmoid activation function is represented as: σ(z) = 1 / (1 + e^(−z)).
We can see from the graph that the output variable is restricted between 0 and 1.
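As a sketch, the sigmoid and the 0.50 threshold can be written in a few lines (the weights and inputs below are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights and inputs: the linear combination x1*w1 + x2*w2.
w = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])
p = sigmoid(np.dot(w, x))  # probability of class 1, always in (0, 1)

# Threshold at 0.50 as described above to get the class label.
label = int(p >= 0.5)
print(round(float(p), 2), label)  # prints: 0.75 1
```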
In scenarios with more than two classes, we use a one-vs-all (one-vs-rest) classification approach. One vs. Rest splits the multi-class dataset into multiple binary classification problems.
A binary classifier is trained on each binary classification problem, and predictions are made using the model with the highest confidence.
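A minimal sketch of this strategy uses scikit-learn's OneVsRestClassifier, which trains one binary LogisticRegression per class and predicts with the most confident classifier (the max_iter setting is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# The iris dataset has three classes, so one-vs-rest fits
# three binary logistic regression models.
X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # one binary classifier per class
print(ovr.predict(X[:2]))    # predictions for the first two samples
```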
Implementing Logistic Regression
Below is the script from scikit-learn official documentation to classify the iris flower based on various features.
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0).fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
[9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...
We went through the introduction of logistic and linear regression, discussed the underlying mathematics, and walked through the implementation of each. We can conclude that linear regression helps predict continuous variables, while logistic regression is used for discrete target variables, which it handles by applying the sigmoid activation function to the linear regression equation.