
Pandas Rolling Correlation

“Rolling correlations are correlations between two time series computed over a rolling window. They let us see whether two correlated time series diverge from one another over time.”

The rolling correlation on a Pandas DataFrame can be computed with the “DataFrame_object.rolling().corr()” method. In this article, we will learn to compute the rolling correlation on a Pandas DataFrame using this basic technique.
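To make the idea concrete, here is a minimal sketch that compares rolling().corr() with a Pearson correlation computed by hand on each window. The two series and the window width of 3 are made-up illustrative values, not data from this article.

```python
import numpy
import pandas

# Illustrative (made-up) series; any two equal-length numeric series would do.
a = pandas.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
b = pandas.Series([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
width = 3

# Built-in rolling correlation.
rolling_corr = a.rolling(width).corr(b)

# The same thing by hand: Pearson correlation over each full window.
manual = [float('nan')] * (width - 1)  # no full window for the first rows
for end in range(width, len(a) + 1):
    window_a = a.iloc[end - width:end]
    window_b = b.iloc[end - width:end]
    manual.append(numpy.corrcoef(window_a, window_b)[0, 1])
manual_corr = pandas.Series(manual)

print(pandas.DataFrame({'rolling': rolling_corr, 'manual': manual_corr}))
```

The two columns should agree up to floating-point rounding, with NaN in the rows where a full window is not yet available.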

Syntax:

On two DataFrames:

DataFrame_object1.rolling(width).corr(DataFrame_object2)

 

(OR)

On two columns in a DataFrame:

DataFrame_object['column1'].rolling(width).corr(DataFrame_object['column2'])
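The two call forms can be sketched side by side as follows. The column name 'x', the values, and the window width are illustrative assumptions. As far as we can tell, the DataFrame form pairs columns by name, so both DataFrames deliberately use the same column name here.

```python
import pandas

# Illustrative (made-up) data; both DataFrames share the column name 'x'.
df1 = pandas.DataFrame({'x': [1, 4, 2, 8, 5, 7]})
df2 = pandas.DataFrame({'x': [2, 3, 1, 9, 4, 8]})
width = 3

# Form 1: DataFrame against DataFrame (columns paired by name).
frame_result = df1.rolling(width).corr(df2)

# Form 2: one column (Series) against another column.
series_result = df1['x'].rolling(width).corr(df2['x'])

print(frame_result)
print(series_result)
```

Form 1 returns a DataFrame and form 2 returns a Series; for a single shared column the values should be the same.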

 
The important thing to remember when specifying the column values is that every column in the DataFrame must have the same number of values. If the lengths are unequal, pandas raises a ValueError when the DataFrame is created and the program stops.
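A short sketch of what happens with unequal column lengths is given below. The column names and values are hypothetical, and the error is caught only so that the message can be printed.

```python
import pandas

# Hypothetical column values: 'quantity' has 3 entries, 'cost' only 2.
try:
    pandas.DataFrame({'quantity': [200, 455, 800],
                      'cost': [2400, 4500]})
    error_message = None
except ValueError as error:
    # pandas refuses to build a DataFrame from columns of unequal length.
    error_message = str(error)

print(error_message)
```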

Example 1: Correlate Column1 vs Column2

Let’s create a DataFrame with 3 columns and 10 rows and correlate the quantity column with the cost column over a rolling window of 2 rows (2 days, if each row represents one day).

import pandas
# Create pandas dataframe for calculating Correlation
# with 3 columns.
analytics=pandas.DataFrame({'Product':[11,22,33,44,55,66,77,88,99,110],
                            'quantity':[200,455,800,900,900,122,400,700,80,500],
                            'cost':[2400,4500,5090,600,8000,7800,1100,2233,500,1100]})


# Correlate quantity with cost column over a rolling window of 2 rows (days).
analytics['Correlated']=analytics['quantity'].rolling(2).corr(analytics['cost'])

print(analytics)

 
Output:

   Product  quantity  cost  Correlated
0       11       200  2400         NaN
1       22       455  4500         1.0
2       33       800  5090         1.0
3       44       900   600        -1.0
4       55       900  8000         NaN
5       66       122  7800         1.0
6       77       400  1100        -1.0
7       88       700  2233         1.0
8       99        80   500         1.0
9      110       500  1100         1.0

 
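The results are placed in the “Correlated” column. The first row is NaN because a window of 2 rows is not yet complete. Row 4 is also NaN because the quantity is constant (900 and 900) within that window, so the correlation is undefined. Since each window holds only two points, every other value is either 1.0 or -1.0.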
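Because two points can only give a correlation of 1.0, -1.0, or NaN, a wider window is usually more informative. The sketch below reuses the data from Example 1 with a window of 3 rows; the width of 3 and the column name 'Correlated_3' are illustrative choices, not part of the original example.

```python
import pandas

# Same data as Example 1.
analytics = pandas.DataFrame({'Product': [11, 22, 33, 44, 55, 66, 77, 88, 99, 110],
                              'quantity': [200, 455, 800, 900, 900, 122, 400, 700, 80, 500],
                              'cost': [2400, 4500, 5090, 600, 8000, 7800, 1100, 2233, 500, 1100]})

# Window of 3 rows instead of 2 (illustrative choice).
analytics['Correlated_3'] = analytics['quantity'].rolling(3).corr(analytics['cost'])

print(analytics)
```

With 3 rows per window, the first two rows are NaN and the remaining values can fall anywhere between -1 and 1.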

Example 2: Visualization

Let’s create a DataFrame with 3 columns and 5 rows and correlate the “Sales” vs “Product_likes”.

Use Seaborn to view the relationship in a graph and SciPy to get the Pearson correlation coefficient.

import pandas
import seaborn
from scipy import stats

# Create pandas dataframe for calculating Correlation
# with 3 columns.
analytics=pandas.DataFrame({'Product name':['tv','steel','plastic','leather','others'],
                            'Product_likes':[100,20,45,67,9],
                            'Sales':[2300,890,1400,1800,200]})

print(analytics)

print()

# See the coefficient of correlation
print(stats.pearsonr(analytics['Sales'], analytics['Product_likes']))

print()

# Now see the Correlation Sales vs Product_likes
seaborn.lmplot(x="Sales", y="Product_likes", data=analytics)

 
Output:

  Product name  Product_likes  Sales
0           tv            100   2300
1        steel             20    890
2      plastic             45   1400
3      leather             67   1800
4       others              9    200

(0.9704208315867275, 0.006079620327457793)

 

The plot and the coefficient (about 0.97, with a p-value of about 0.006) show a strong positive correlation between Sales and Product_likes.
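If SciPy is not available, the same coefficient can be obtained with the pandas Series.corr() method, which uses the Pearson method by default. This is a small sketch on the same data; the variable name coefficient is our own.

```python
import pandas

# Same data as Example 2.
analytics = pandas.DataFrame({'Product name': ['tv', 'steel', 'plastic', 'leather', 'others'],
                              'Product_likes': [100, 20, 45, 67, 9],
                              'Sales': [2300, 890, 1400, 1800, 200]})

# Series.corr computes the Pearson correlation coefficient by default.
coefficient = analytics['Sales'].corr(analytics['Product_likes'])

print(coefficient)
```

The printed value should agree with the first number returned by stats.pearsonr; note that Series.corr() does not report a p-value.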

Let’s now get the rolling correlation for these two columns with a window of 3 rows (3 days).

Code for Example 2:

# Correlate Sales with Product_likes column over a rolling window of 3 rows (days).
analytics['Correlated']=analytics['Sales'].rolling(3).corr(analytics['Product_likes'])
 
print(analytics)

 
Output:

  Product name  Product_likes  Sales  Correlated
0           tv            100   2300         NaN
1        steel             20    890         NaN
2      plastic             45   1400    0.998496
3      leather             67   1800    0.999461
4       others              9    200    0.989855

 
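The first two rows are NaN because a full window of 3 rows is not yet available. From the third row onward, every rolling correlation is above 0.98, so the two columns are highly correlated in each window.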
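If the leading NaN values are a problem, the min_periods parameter of rolling() lowers the number of rows required before a value is reported. The sketch below uses the Example 2 columns; the choice of min_periods=2 and the variable names are illustrative.

```python
import pandas

# Sales and Product_likes columns from Example 2.
analytics = pandas.DataFrame({'Product_likes': [100, 20, 45, 67, 9],
                              'Sales': [2300, 890, 1400, 1800, 200]})

# Default: a value is reported only once all 3 rows of the window exist.
full = analytics['Sales'].rolling(3).corr(analytics['Product_likes'])

# min_periods=2: a value is reported as soon as 2 rows are available.
partial = analytics['Sales'].rolling(3, min_periods=2).corr(analytics['Product_likes'])

print(pandas.DataFrame({'full_window': full, 'min_periods_2': partial}))
```

Values computed from fewer rows than the window width are less reliable, so this is a trade-off rather than a free improvement.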

Example 3: Different DataFrames

Let’s create 2 DataFrames with 1 column each and compute the rolling correlation between them with a window of 5 rows.

import pandas
from scipy import stats

analytics1=pandas.DataFrame({ 'Sales':[2300,890,1400,1800,200,2000,340,56,78,0]})
analytics2=pandas.DataFrame({'Product_likes':[100,20,45,67,9,90,8,1,3,0]})


# See the coefficient of correlation for the above two DataFrames
print(stats.pearsonr(analytics1['Sales'], analytics2['Product_likes']))

# Rolling correlation (window of 5 rows) between Sales and Product_likes
print(analytics1['Sales'].rolling(5).corr(analytics2['Product_likes']))

 
Output:

(0.9806646612423284, 5.97410226154508e-07)
0         NaN
1         NaN
2         NaN
3         NaN
4    0.970421
5    0.956484
6    0.976242
7    0.990068
8    0.996854
9    0.996954
dtype: float64

 
You can see that the two columns, although they come from different DataFrames, are highly correlated in every window.
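One thing to watch when the data comes from different DataFrames: as far as we understand, pandas pairs the rows by index label, not by position. The sketch below reuses the Example 3 values but gives the second series a hypothetical index starting at 100 to show the effect; the variable names are our own.

```python
import pandas

sales = pandas.Series([2300, 890, 1400, 1800, 200, 2000, 340, 56, 78, 0])

# Same likes as Example 3, but with a hypothetical index of 100..109.
likes = pandas.Series([100, 20, 45, 67, 9, 90, 8, 1, 3, 0], index=range(100, 110))

# The two series share no index labels, so no rows can be paired.
misaligned = sales.rolling(5).corr(likes)

# Resetting the index pairs the rows by position again.
aligned = sales.rolling(5).corr(likes.reset_index(drop=True))

print(misaligned.isna().all())
print(aligned)
```

If both DataFrames already use the default 0, 1, 2, ... index, as in Example 3, no extra step is needed.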

Conclusion

This article covered how to compute a correlation over a rolling window on a Pandas DataFrame. Pandas offers the practical “DataFrame.rolling().corr()” method for this purpose. To make the process easier to follow, we worked through three examples, including a visualization with the Seaborn module, and explained each step. You can apply the method to different columns in a single DataFrame or to columns from different DataFrames, depending on your requirements.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain