Pandas Crosstab() Function

Pandas provides a variety of options for combining and summarizing data. Data summarization gives an overview of the data that is informative and easy to interpret. Several functions can help with this task; for instance, you can use the “groupby()” and “pivot_table()” functions. In this post, however, we concentrate on the “crosstab()” function. The “crosstab()” function in Pandas creates a cross-tabulation table that displays how frequently specific combinations of values appear. It is one of several techniques in Pandas that let you restructure your data. This post explains how to use it.

Pandas Crosstab() Function

The syntax for the Pandas “crosstab()” method is given as follows:
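As a reference point, the signature below is written out as it is documented for recent Pandas releases; defaults may differ in other versions, so the short sketch also prints the signature of whatever version is installed. The variable names “signature” and “param_names” are only illustrative.

```python
import inspect

import pandas as pd

# Signature as documented for recent pandas releases (check your own version):
# pd.crosstab(index, columns, values=None, rownames=None, colnames=None,
#             aggfunc=None, margins=False, margins_name="All",
#             dropna=True, normalize=False)

# Ask the installed pandas for its actual signature.
signature = inspect.signature(pd.crosstab)
param_names = list(signature.parameters)

print(signature)
print(param_names)
```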

Let’s understand its parameters first.

Here, the first parameter is “index”, which holds the values to group by in the rows of the output DataFrame. It can be an array, a Series, or a list of arrays or Series. The “columns” parameter holds the values to group by in the columns of the output DataFrame. The “values” parameter is an optional array of values to aggregate, and it must be supplied together with “aggfunc”. The “aggfunc” parameter is the function that takes “values” as its input and returns the aggregated statistic such as the mean, maximum, etc. The “rownames” parameter is the name(s) assigned to the rows in the resultant DataFrame; “rownames=None” is the default setting. The “colnames” parameter is the column name(s) given to the output DataFrame; “colnames=None” is the default setting.
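To make the relationship between “values” and “aggfunc” concrete, here is a minimal sketch. It assumes the same small sample data that is used in the example later in this post, and the variable name “mean_age” is only illustrative. Instead of counting the rows, each cell holds the mean “Age” for that combination of “Gender” and “Ailment”.

```python
import pandas as pd

# Sample data (the same values as the example DataFrame in this post).
tab = pd.DataFrame({
    "Age": [34, 45, 19, 3, 50],
    "Gender": ["female", "male", "female", "male", "female"],
    "Ailment": ["Y", "N", "N", "Y", "N"],
})

# "values" supplies the numbers to aggregate; "aggfunc" says how to aggregate them.
mean_age = pd.crosstab(
    index=tab["Gender"],
    columns=tab["Ailment"],
    values=tab["Age"],
    aggfunc="mean",
)
print(mean_age)
```

Passing “values” without “aggfunc” (or the other way around) raises an error, which is why the two are always used as a pair.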

The “margins” parameter determines whether to add an extra row and column that show the row-wise and column-wise totals. The “margins=False” is the default setting. The “margins_name” parameter is the title of the appended row and column when “margins” is set to “True”; its default is “All”. The “dropna” parameter determines whether to drop the columns whose entries are all NaN. The “dropna=True” is the default configuration. The “normalize” parameter tells whether to normalize the obtained numbers by dividing them by the total. It is “False” by default and also accepts “index”, “columns”, or “all” to choose which total the counts are divided by.
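The effect of “normalize” is easiest to see on a small table. The following sketch again assumes the sample data from the example later in this post; the name “share” is illustrative. With “normalize="index"”, each count is divided by its row total, so every row sums to 1.

```python
import pandas as pd

# Sample data (the same values as the example DataFrame in this post).
tab = pd.DataFrame({
    "Age": [34, 45, 19, 3, 50],
    "Gender": ["female", "male", "female", "male", "female"],
    "Ailment": ["Y", "N", "N", "Y", "N"],
})

# Divide each count by the total of its row, giving per-gender proportions.
share = pd.crosstab(
    index=tab["Gender"],
    columns=tab["Ailment"],
    normalize="index",
)
print(share)
```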

Now, let’s start implementing the Pandas crosstab method with some of its major parameters.

Example: Utilizing the pd.crosstab() Method to Compute the Cross-Tabulation of Data

This illustration is based on calculating a cross-tabulation of the specified data from the given DataFrame. Let’s start learning its practical implementation.

The first step is to pick a tool that runs on your machine and supports Python programming. We use the “Spyder” tool here, although any Python environment works for this example. So, we first download, install, and then launch the tool. Once the interface has loaded, we create a new file by clicking the “File” menu and selecting the “New file” option. A new file with the “.py” extension, which denotes a Python file, is created. Before we get started with our main script, we need to import the required libraries into this Python file.

We only need one package, “Pandas”, for the topic under discussion. Therefore, we write “import pandas as pd”, which imports Pandas into our Python file and lets us refer to it throughout the script as “pd” rather than “pandas”. We first have to generate a DataFrame. Since “pd” is an alias for “pandas”, calling “pd.DataFrame()” invokes the DataFrame constructor from Pandas, which builds a DataFrame from the data that we pass to it.

We create a DataFrame with the “pd.DataFrame()” method and populate it with three columns: “Age”, “Gender”, and “Ailment”. Our first column, “Age”, contains five values: “34”, “45”, “19”, “3”, and “50”. The second column, “Gender”, includes five values: “female”, “male”, “female”, “male”, and “female”. The last column, “Ailment”, holds the values “Y”, “N”, “N”, “Y”, and “N”. The columns hold different kinds of values (integers and strings) but have the same length, which is five.

We assign this DataFrame to a variable named “tab”, which gives us access to the newly created DataFrame. To display the DataFrame on the terminal, we use the “print()” function. It takes a variable, a function call, or an expression as its argument and simply shows the result on the terminal. So, we type “print(tab)” and it displays the DataFrame.
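Putting the steps described so far together, the script could look like the sketch below. The column names and values are the ones listed in the text; building the DataFrame from a dictionary of lists is one common way to do it and is an assumption about the original code.

```python
import pandas as pd

# Build the DataFrame from a dictionary: keys become column names,
# and each list holds that column's five values.
tab = pd.DataFrame({
    "Age": [34, 45, 19, 3, 50],
    "Gender": ["female", "male", "female", "male", "female"],
    "Ailment": ["Y", "N", "N", "Y", "N"],
})

# Display the DataFrame on the terminal.
print(tab)
```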

If you’re unfamiliar with the “Spyder” tool, you might be wondering how to execute the code. To run this Python file, click the “Run file” button or press the “F5” key (“Shift+Enter” runs only the current cell). On the console of the “Spyder” tool, you can now find the DataFrame that we just created.

Our DataFrame is successfully generated. The next and core task is to apply the “pd.crosstab()” function to compute the cross-tabulation. We call the function and provide its parameters. For a basic cross-tabulation between two columns of the DataFrame, we start with the “index” parameter, whose values become the index of the output DataFrame.

So, whatever values we provide to the “index” argument become the index of the resultant DataFrame. We provide “index=tab["Age"]”, which makes the “Age” column of the “tab” DataFrame the index of the output DataFrame. The second parameter that we use here is “columns”, whose values become the columns of the generated DataFrame.

Here, we pass the “Gender” column from the “tab” DataFrame. We assign the output of the “pd.crosstab()” call to a “cross” variable. Finally, we call the “print()” function to display the resultant cross-tabulated DataFrame for the two provided columns: “Age” and “Gender”.
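The call described above can be sketched as follows. The DataFrame is rebuilt here so that the snippet runs on its own; the variable names “tab” and “cross” follow the text.

```python
import pandas as pd

tab = pd.DataFrame({
    "Age": [34, 45, 19, 3, 50],
    "Gender": ["female", "male", "female", "male", "female"],
    "Ailment": ["Y", "N", "N", "Y", "N"],
})

# Rows come from "Age", columns from "Gender"; each cell counts how many
# records share that combination of age and gender.
cross = pd.crosstab(index=tab["Age"], columns=tab["Gender"])
print(cross)
```

Since no “values” or “aggfunc” is given, the cells contain plain frequency counts.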

When we execute this script, the resultant DataFrame is shown on the terminal. In the output DataFrame, you can see a count of the records for each combination of categories. For example, there is only one female of age “19” and no male of this age. Here, “Gender” is the title given to the columns.

Let’s explore some other parameters of this function.

In the previous output, you have seen that a label is given to the rows and columns automatically. But you can also explicitly specify the titles for the rows as well as for the columns. This can be done using the “rownames” and “colnames” arguments in the “pd.crosstab()” method. For instance, we specify the title “ERA” for the “Age” rows through “rownames” and the title “SEXUAL IDENTITY” for the “Gender” columns through “colnames”.
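A minimal sketch of this relabeling is shown below. Both arguments take a list of names, one per row or column level; passing single-element lists is an assumption about how the original script was written.

```python
import pandas as pd

tab = pd.DataFrame({
    "Age": [34, 45, 19, 3, 50],
    "Gender": ["female", "male", "female", "male", "female"],
    "Ailment": ["Y", "N", "N", "Y", "N"],
})

# Override the automatic labels: "ERA" titles the rows, and
# "SEXUAL IDENTITY" titles the columns.
cross = pd.crosstab(
    index=tab["Age"],
    columns=tab["Gender"],
    rownames=["ERA"],
    colnames=["SEXUAL IDENTITY"],
)
print(cross)
```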

In the resulting output, the labels for the rows and the columns are updated to the titles that we specified.

Another parameter that we use here is “margins=True”. We can get the row-wise as well as the column-wise sum of the data by setting this parameter to “True”. Here, we use the “Gender” column for “index” and the “Ailment” column for “columns” from the “tab” DataFrame.
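The margins can be sketched as follows, again with the DataFrame rebuilt so that the snippet is self-contained:

```python
import pandas as pd

tab = pd.DataFrame({
    "Age": [34, 45, 19, 3, 50],
    "Gender": ["female", "male", "female", "male", "female"],
    "Ailment": ["Y", "N", "N", "Y", "N"],
})

# margins=True appends a row and a column (named "All" by default)
# that hold the column-wise and row-wise totals.
cross = pd.crosstab(
    index=tab["Gender"],
    columns=tab["Ailment"],
    margins=True,
)
print(cross)
```

The bottom-right cell of the result holds the grand total, which equals the number of rows in “tab”.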

The output displays an additional “All” column and an additional “All” row, which hold the total count for each row and each column, respectively.

You can also specify the margin name by utilizing the “margins_name” parameter, which takes effect only when “margins=True”. Here, we use the “SUM” title for the calculated margin row and column.
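A short sketch of the renamed margin, under the assumption that the same “Gender” and “Ailment” columns are used as in the previous step:

```python
import pandas as pd

tab = pd.DataFrame({
    "Age": [34, 45, 19, 3, 50],
    "Gender": ["female", "male", "female", "male", "female"],
    "Ailment": ["Y", "N", "N", "Y", "N"],
})

# margins_name replaces the default "All" label of the totals row and column.
cross = pd.crosstab(
    index=tab["Gender"],
    columns=tab["Ailment"],
    margins=True,
    margins_name="SUM",
)
print(cross)
```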

In the output, the margin row and column are now titled “SUM” instead of the default “All”.

Conclusion

This article explained the concept of calculating a cross-tabulation for data analysis. Pandas provides many useful features, and “pd.crosstab()” is one of them. We used this method to compute the cross-tabulation. We described the parameters that can be used in this function and implemented some of them in Python using the “Spyder” tool. We recommend that new learners set aside time for focused practice when learning new Pandas concepts.

About the author

Aqsa Yasin

I am a self-motivated information technology professional with a passion for writing. I am a technical writer and love to write for all Linux flavors and Windows.