Python Pandas

Pandas Correlation Matrix

Machine learning techniques are being used in today’s world to predict the prices of houses, perform customer segmentation, assist in agriculture, and in many sectors. While predicting, we need to analyze the data first, and only the correlated features must be considered.

In Python, pandas supports the corr() function, which generates a correlation matrix with correlation coefficients included. In this guide, we will discuss how to generate a correlation matrix from the pandas DataFrame using this function and discuss different parameters that are passed to this function with separate examples. For visualization of the correlation matrix, we will demonstrate how to create HeatMap.

pandas.DataFrame.corr

The correlation coefficient varies between -1 to +1. Two variables are positively correlated if the correlation coefficient is greater than 0. When the correlation coefficient is equal to 0, there is no correlation between the two variables. Two variables are negatively correlated if the correlation coefficient is less than 0.

pandas.DataFrame.corr() function is used to compute the pairwise correlation of DataFrame columns. It will not consider the missing values that exist in the DataFrame.

Syntax

Let’s see the syntax and the parameters passed to this function.

DataFrame.corr(method, min_periods, numeric_only)
  1. There are different correlation methods that exist mathematically. This method parameter will take the correlation method. It can take ‘kendall’, ‘spearman’ and ‘pearson’. By default, it will consider ‘pearson’ as the correlation method.
  2. The minimum number of observations required per pair of columns to have a valid result is set by the min_periods parameter (default = 1). ‘Kendall’ method is not used along with this parameter.
  3. The numeric_only (default = False) include numeric data columns like int, float.

Dataset

In this entire guide, we will use the houses.csv file that holds seven columns and six records. There are five numeric columns [‘beds’, ‘baths’, ‘size’, ‘lot_size’, ‘price’] and two non-numeric columns [‘size_units’, ‘lot_size_units’]. We won’t consider these two columns while creating the correlation matrix.

Let’s import this into the pandas DataFrame.

import pandas

# Import houses.csv file into house_df DataFrame.  
house_df = pandas.read_csv('houses.csv')
print(house_df)

print(Marks.quantile())

Output

Now this DataFrame holds seven columns with six records.

Example 1: No Parameters

Let’s compute the correlation matrix for the house’s DataFrame with Pearson correlation as the default method.

import pandas
house_df = pandas.read_csv('houses.csv')

# Correlation matrix for the above DataFrame
print(house_df.corr())

Output

A correlation coefficient matrix is generated for five columns. Diagonal positions in the matrix are positively correlated, as they are equal. Basically, the correlation matrix holds the correlation coefficients. We will say

  1. The correlation between beds and baths is 0.834058, indicating a strong positive correlation.
  2. The correlation between beds and size is 0.799378, showing a strong positive correlation.
  3. There is a negative correlation between beds and lot_size is -0.170355.
  4. The correlation between beds and price is 0.666335, signifying a moderate positive correlation.

This way, you can compare the correlation for all the columns.

Example 2: Method Parameter

First, we will return the correlation matrix with kendall and then spearman.

import pandas
house_df = pandas.read_csv('houses.csv')

# Correlation matrix - kendall
print(house_df.corr(method='kendall'),"\n")

# Correlation matrix - spearman
print(house_df.corr(method='spearman'),"\n")

Output

The correlation coefficients are different in both the matrices. The coefficients in the ‘spearman’ methods are greater than the coefficients in the ‘kendall’ method.

Example 3: Filtering Positive Correlation Coefficients

In this example, we will first filter the results that are positively correlated (greater than 0), then we will get the data that are strongly positively correlated (greater than 0.8).

import pandas
house_df = pandas.read_csv('houses.csv')

# Positive Correlations from the Correlation matrix
corr_matrix= house_df.corr(numeric_only=True,).unstack()
print(corr_matrix[corr_matrix > 0],"\n")

# Strong Positive Correlations from the Correlation matrix
corr_matrix= house_df.corr(numeric_only=True,).unstack()
print(corr_matrix[corr_matrix > 0.8])

Output

Example 4: Filtering Negative Correlation Coefficients

Let’s filter the results that are negatively correlated (less than 0).

import pandas
house_df = pandas.read_csv('houses.csv')

# Negative Correlations from the Correlation matrix
corr_matrix= house_df.corr(numeric_only=True,).unstack()
print(corr_matrix[corr_matrix < 0],"\n")

Output

Only beds are negatively correlated with lot_size.

Example 5: Visualizing the Correlation Matrix Using Heatmap

Utilize the Heatmap from the seaborn library and plot it using the pyplot from the matplotlib library.

  1. Pass the pandas.DataFrame.corr() function to the heatmap.
  2. Sett the annot parameter to True for displaying the coefficient value in each grid.
  3. The fmt parameter is used to format the coefficient values to 1 decimal points. We have to set it to display one decimal point.
  4. The cmap parameter is set to the color map named “crest.”
import pandas
import seaborn
import matplotlib

house_df = pandas.read_csv('houses.csv')

seaborn.heatmap(house_df.corr(numeric_only=True),fmt=".1f",annot=True,cmap="crest")
matplotlib.pyplot.show()

Output

You can see the color scale that represents the correlation coefficients, and also on each grid, you will see the coefficient.

Example 6: Filtering Negative Correlation Coefficients

The NumPy module supports the corrcoef() function, which will return the correlation coefficient (Pearson product-moment) for two variables. It will only take two variables (DataFrame columns).

  1. Return Pearson product-moment correlation coefficient for ‘beds’ and ‘baths.’
  2. Return Pearson product-moment correlation coefficient for ‘beds’ and ‘size.’
import numpy
import pandas

house_df = pandas.read_csv('houses.csv')

# Return Pearson product-moment correlation coefficient
print(numpy.corrcoef(house_df['beds'], house_df['baths']),"\n")
print(numpy.corrcoef(house_df['beds'], house_df['size']),"\n")

Output

The correlation coefficient for beds and baths is 0.83405766, and for beds and size is 0.79937773.

Conclusion

We saw how to compute the correlation matrix from the pandas DataFrame using the pandas.DataFrame.corr() function. All the parameters that are passed to this parameter are discussed in a separate example. It is possible to visualize the correlation between different variables using the Heatmap. We also discussed how to filter positive and negative correlation coefficients with examples.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain