Pandas Covariance

The Pandas cov() method computes the pairwise covariance between the columns of a DataFrame. The DataFrame that it returns is the covariance matrix of those columns. The computation automatically excludes NA and null entries. This technique is commonly used with time series data to assess how different measurements vary together over time.

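The syntax for this method is as follows:

DataFrame.cov(min_periods=None)

Here, “min_periods” sets the minimum number of observations required for each pair of columns to produce a valid result. Recent Pandas versions also accept the optional “ddof” and “numeric_only” parameters.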
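To see what this minimum-observations setting does in practice, here is a minimal sketch. The DataFrame “df” and its columns “x” and “y” are hypothetical sample data made up for this illustration; only the “min_periods” argument comes from the method itself.

```python
import pandas as pd

# Hypothetical sample data; the None in "y" is stored as NaN.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, None, 6.0, 8.0],
})

# Default behaviour: each pair of columns uses the rows where both have values.
print(df.cov())

# Require at least 4 complete observations per pair of columns.
# "x" paired with itself has 4, but every pair involving "y" has only 3,
# so those entries are returned as NaN.
strict = df.cov(min_periods=4)
print(strict)
```

Raising “min_periods” is a simple way to avoid reporting a covariance that is based on too few rows.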

You will learn this method through the practical code demonstrations in this article.

Example 1

This illustration finds the covariance among the columns of a DataFrame. Let’s start with its practical implementation.

The first task is to find a tool that is compatible with your machine and supports the Python language. For our requirements, we found the “Spyder” tool the most appropriate. So, we download, install, and finally launch the tool. Once the interface is displayed, we open a new file by clicking the “file” button and choosing the “new file” option. A new file with the “.py” extension is opened. The “.py” extension denotes a Python file.

Now, start writing the Python code. Before we begin with our main code, we need to import the necessary libraries into this Python file. For the present topic, we only need a single package, which is “Pandas”. So, we write “import pandas as pd”, which makes the features of Pandas available in our Python file. We can then access them using “pd” instead of “pandas” throughout the script.

Since we have to calculate the covariance among the columns of a DataFrame, we need a Pandas DataFrame on which to exercise this method. To construct a DataFrame, Pandas provides the “pd.DataFrame()” constructor. Since “pd” is our alias for “pandas”, this is how we access it. “DataFrame” is a Pandas class, and calling it creates a DataFrame. We generate a DataFrame using “pd.DataFrame()” and initialize it with three columns: “Alpha”, “Beta”, and “Gamma”.

Our first column “Alpha” stores six values which are “3”, “4”, “1”, “10”, “5”, and “7”. The second column “Beta” holds six values which are “12”, “2”, “8”, “13”, “4”, and “5”. The third and last column “Gamma” has the values “4”, “6”, “12”, “9”, “3”, and “10”. All these columns store integer values and have the same length, which is 6.

Now, to store this DataFrame, we create a variable named “grade”. This “grade” variable is assigned the output of the Pandas “pd.DataFrame()” call. So, when we call “pd.DataFrame()”, a Pandas DataFrame is created and stored in “grade”, and we can access the DataFrame through this variable. To display the DataFrame on the terminal, we use the simple and handy “print()” function. It takes a value, such as a variable or the result of a function call, and displays it on the terminal. We write “print(grade)” and it displays the DataFrame.
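The steps described so far can be sketched as follows. The column names and values are the ones listed above; building the DataFrame from a dictionary of lists is our assumption about how the data is passed to “pd.DataFrame()”.

```python
import pandas as pd

# Build the DataFrame from a dictionary that maps column names to values
# (the dictionary form is an assumption; the values are from the article).
grade = pd.DataFrame({
    "Alpha": [3, 4, 1, 10, 5, 7],
    "Beta": [12, 2, 8, 13, 4, 5],
    "Gamma": [4, 6, 12, 9, 3, 10],
})

# Display the DataFrame on the terminal.
print(grade)
```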

When we click the “Run file” button on the “Spyder” tool or hit the “Shift+Enter” keys, a DataFrame with three columns and six rows is displayed on the terminal.

Now, we perform the main task for which we created this DataFrame: calculating the covariance. To calculate the covariance among all the columns of this DataFrame, Pandas provides the “cov()” method. To utilize it, we call “.cov()” on the DataFrame as “grade.cov()”. This calculates the covariance on the provided DataFrame. Then, we put this call between the parentheses of “print()” to display the covariance calculated for all of its columns. Alternatively, you can store the calculated covariance in a variable and display that variable using “print()”.
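A minimal sketch of this step is shown here. It rebuilds the same “grade” DataFrame so that it runs on its own; the variable name “covariance” is our own choice for the alternative of storing the result before printing it.

```python
import pandas as pd

grade = pd.DataFrame({
    "Alpha": [3, 4, 1, 10, 5, 7],
    "Beta": [12, 2, 8, 13, 4, 5],
    "Gamma": [4, 6, 12, 9, 3, 10],
})

# Option 1: compute and print the covariance matrix in one step.
print(grade.cov())

# Option 2: store the covariance matrix first, then print it.
covariance = grade.cov()
print(covariance)
```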

Running this script gives us the covariance matrix for the columns of the “grade” DataFrame. The diagonal entries are the variances of the individual columns, and the off-diagonal entries are the covariances between each pair of columns. You can see that all the covariance values are positive.
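If you want to check one entry of this matrix by hand, the following sketch recomputes the covariance between “Alpha” and “Beta” as the sum of the products of the deviations from each column’s mean, divided by the number of rows minus one (the sample covariance, which is what “cov()” uses by default). The variable names “alpha”, “beta”, and “manual” are ours.

```python
import pandas as pd

grade = pd.DataFrame({
    "Alpha": [3, 4, 1, 10, 5, 7],
    "Beta": [12, 2, 8, 13, 4, 5],
    "Gamma": [4, 6, 12, 9, 3, 10],
})

alpha = grade["Alpha"]
beta = grade["Beta"]

# Sample covariance: sum of products of deviations, divided by (n - 1).
manual = ((alpha - alpha.mean()) * (beta - beta.mean())).sum() / (len(grade) - 1)

print(manual)                             # approximately 3.4
print(grade.cov().loc["Alpha", "Beta"])   # same value from Pandas
```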

Example 2

Now, we will see what happens when the DataFrame contains some “NaN” (Not a Number) values and we need to calculate its covariance. When the DataFrame has any “NaN” values, the “cov()” function ignores them and, for each pair of columns, calculates the covariance using only the rows where both columns have a value.

For this purpose, we take the previously created DataFrame and modify it according to our requirements. We change one value in each column of the DataFrame to “None”. The second value of the “Alpha” column is changed to “None”, the second value of the “Beta” column is changed to “None”, and the fifth value of the “Gamma” column is also changed to “None”. Then, we simply display the modified DataFrame with the “print()” function.
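The modified DataFrame can be sketched like this. As before, the dictionary form is our assumption; the positions of the “None” values follow the description above.

```python
import pandas as pd

# Same data as before, with one value per column replaced by None.
grade = pd.DataFrame({
    "Alpha": [3, None, 1, 10, 5, 7],     # second value is None
    "Beta": [12, None, 8, 13, 4, 5],     # second value is None
    "Gamma": [4, 6, 12, 9, None, 10],    # fifth value is None
})

# Pandas stores each None in these numeric columns as NaN.
print(grade)
```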

In the displayed output, each “None” that we entered appears as “NaN”, and the columns are shown as floating-point values.

We calculate its covariance now. We simply invoke the “cov()” function on the DataFrame and pass the result as a parameter to “print()” to display the covariance calculated with the “NaN” values present.
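Here is a minimal sketch of this step, rebuilding the DataFrame with the “None” values so that it runs on its own. The variable name “covariance” is ours.

```python
import pandas as pd

grade = pd.DataFrame({
    "Alpha": [3, None, 1, 10, 5, 7],
    "Beta": [12, None, 8, 13, 4, 5],
    "Gamma": [4, 6, 12, 9, None, 10],
})

# NaN entries are skipped pair by pair: each covariance uses only the
# rows where both columns of that pair hold a value.
covariance = grade.cov()
print(covariance)
```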

When we run this script, it displays the covariance calculated for all the columns of the DataFrame with the “NaN” values ignored. With the values used here, the covariance between the “Beta” and “Gamma” columns is now negative, while the other covariances remain positive.

Example 3

You learned how to calculate the covariance among all the columns of a DataFrame with or without “NaN” values. Here, we introduce another way of using the “cov()” function: calculating the covariance between two Pandas series. We use the DataFrame that we created in the first illustration of this guide. From this DataFrame, we create two Pandas series.

To create a series, we employ the “pd.Series()” function. Between its parentheses, you can define the values manually but, in our illustration, we create the series from the previously created DataFrame “grade”. So, we pass the DataFrame column to “pd.Series()” as “pd.Series(grade[‘Alpha’])”. Then, we store this series in a variable “v1”. We create another series with the same steps, this time using the “Gamma” column of the “grade” DataFrame as “pd.Series(grade[‘Gamma’])”, and store it in the variable “v2”. Note that selecting a single column such as “grade[‘Alpha’]” already returns a series, so wrapping it in “pd.Series()” is optional.
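The two series can be created as in this sketch, which rebuilds the “grade” DataFrame from the first example so that it runs on its own.

```python
import pandas as pd

grade = pd.DataFrame({
    "Alpha": [3, 4, 1, 10, 5, 7],
    "Beta": [12, 2, 8, 13, 4, 5],
    "Gamma": [4, 6, 12, 9, 3, 10],
})

# Create one series from the "Alpha" column and one from "Gamma".
v1 = pd.Series(grade["Alpha"])
v2 = pd.Series(grade["Gamma"])
```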

We utilize the “print()” function to print both series “v1” and “v2”. In the last step, we calculate the covariance by invoking the “cov()” method. Write the name of the first series followed by “.cov()”, and put the second series within its parentheses, as “v1.cov(v2)”. Pass this as a parameter to “print()” to display it.
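Put together, the series-to-series calculation can be sketched as follows. The variable name “result” is ours; everything else follows the steps described above.

```python
import pandas as pd

grade = pd.DataFrame({
    "Alpha": [3, 4, 1, 10, 5, 7],
    "Beta": [12, 2, 8, 13, 4, 5],
    "Gamma": [4, 6, 12, 9, 3, 10],
})

v1 = pd.Series(grade["Alpha"])
v2 = pd.Series(grade["Gamma"])

# Display both series.
print(v1)
print(v2)

# Covariance between the two series; this returns a single number
# rather than a matrix.
result = v1.cov(v2)
print(result)
```

The value matches the “Alpha”/“Gamma” entry of the covariance matrix from the first example.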

This prints both series, followed by the calculated covariance between the two Pandas series, which is approximately 0.6 for the values used here.

Conclusion

Calculating the covariance between all the columns of a DataFrame, or between two series created from a DataFrame, can be carried out with a simple and effective Pandas function: “cov()”. This article walked through the practical implementation of the Python code, executed in the “Spyder” tool. The first illustration calculated the covariance among the columns of a Pandas DataFrame. The second example covered the covariance calculation when “NaN” values are present. The last example focused on finding the covariance between two Pandas series. We explained each step in detail to make the method easy to follow.

About the author

Aqsa Yasin

I am a self-motivated information technology professional with a passion for writing. I am a technical writer and love to write for all Linux flavors and Windows.