Python

Pandas Crosstab() Function

The “crosstab()” function in Pandas creates a cross-tabulation table that displays the frequency with which the specific sets of data appear. The “crosstab()” method is one of the several techniques in Pandas that allows you to restructure your data. This post explains how to utilize the “crosstab()” function.

Crosstab() Syntax:

pandas.crosstab(index,columns,values=None,rownames=None,colnames=None,margins=False,margins_name=’All’,dropna=True,normalize=False)

Parameters:

  1. Here, the first parameter is “index” which is the values that are utilized as the output DataFrame’s index. It can be an array, series, or multiple arrays or series.
  2. The “columns” are the generated DataFrame’s columns.
  3. The “values” are the values which are utilized to calculate the statistic provided by “agg _func”.
  4. The “rownames” is the name(s) that are allocated to the rows in the resultant DataFrame. The “rownames=None” is the default setting for this argument.
  5. The “colnames” is the column name(s) which are given to the output DataFrame. The “colnames=None” is the default setting. This function’s input is “values” and the return element is the aggregated statistic such as the mean, maximum, etc.
  6. The “margins” determine whether to add an extra row and column that shows the sum of the cells row-wise as well as column-wise. The “Margin=False” is the default setting.
  7. The “Margins-name” is the title of the newly appended row and column if the “margins” is configured to “True”.
  8. The “dropna” determines whether to remove the columns with only NaN values. The “dropna=False” is the default configuration.
  9. The “normalize” tells whether to normalize the obtained numbers by dividing by the total.

Example 1:
Create a DataFrame with 5 columns and generate a contingency table (cross tab) by setting the x-axis as “option2” column and y-axis as “name” column.

import pandas

# Create a DataFrame with 5 columns
mydata = pandas.DataFrame([[1,"sravan","cloud technology","Databases",89],
                           [2,"sravan","Python","Frontend programming",98],
                           [3,"sravan","elective 1","Databases",50],
                           [5,"satwik","php","Databases",99],
                           [6,"diba","php","Databases",19],
                           [9,"diba","elective 1","Frontend programming",54],
                   ],
                  columns=["exam id","name","option1","option2","marks"]
                  )
 
print(mydata,"\n")

# Let's use crosstab() function
frequency = pandas.crosstab(mydata.option2,mydata.name)

print(frequency)

Output:

Explanation:
We pass the “option2” column as x-axis and the “name” column as y-axis. Now, you can see in the previous illustration that the frequency table is generated. Let’s analyze it.

  1. “Databases” – “diba” hold only 1, “satwik” also 1, and “sravan” 2.
  2. “Frontend programming” – “diba” holds only 1, “satwik” holds 0, and “sravan” is also 1.

Example 2:
Utilize the same DataFrame and generate a contingency table (cross tab) by setting the x-axis as “option1” and “option2” columns and the y-axis as “name” column.

import pandas

# Create a DataFrame with 5 columns
mydata = pandas.DataFrame([[1,"sravan","cloud technology","Databases",89],
                           [2,"sravan","elective 1","Frontend programming",98],
                           [3,"sravan","elective 1","Databases",50],
                           [5,"satwik","php","Databases",99],
                           [6,"diba","php","Databases",19],
                           [9,"diba","php","Frontend programming",54],
                   ],
                  columns=["exam id","name","option1","option2","marks"]
                  )

# Let's use crosstab() function
frequency = pandas.crosstab([mydata.option1,mydata.option2],mydata.name)

print(frequency)

Output:

Explanation:
We pass the “option1” and “option2” columns as x-axis through a list and the “name” column as y-axis. Now, you can see in the previous illustration that the frequency table is generated.

Example 3: Set the Margins
Now, we create a contingency table for “option2” and “name” columns by setting the margins and providing the margin names as “Total Count”.

import pandas

# Create a DataFrame with 5 columns
mydata = pandas.DataFrame([[1,"sravan","cloud technology","Databases",89],
                           [2,"sravan","elective 1","Frontend programming",98],
                           [3,"sravan","elective 1","Databases",50],
                           [5,"satwik","php","Databases",99],
                           [6,"diba","php","Databases",19],
                           [9,"diba","php","Frontend programming",54],
                   ],
                  columns=["exam id","name","option1","option2","marks"]
                  )

# Let's use crosstab() function by setting margins and margins_name parameter
frequency = pandas.crosstab(mydata.option2,mydata.name,margins=True,margins_name="Total Count")

print(frequency)

Output:

Explanation:
We pass the margins and set them to True and the margins_name to “Total Count”. The margins are created and the total sum of values for the axis are returned as 6.

Across Rows:

  1. For “Databases” – 1+1+2 = 4
  2. For “Frontend programming” – 1+0+1=2

Total count is 6.

Across Columns:

  1. For “diba” – 1+1 = 2
  2. For “satwik” – 1+0 = 1
  3. For “sravan” – 2+1 = 3

Total count is 6.

Example 4: Set the Normalize Values
Now, we create a contingency table for “option2” and “name” columns by normalizing the values.

import pandas

# Create a DataFrame with 5 columns
mydata = pandas.DataFrame([[1,"sravan","cloud technology","Databases",89],
                           [2,"sravan","elective 1","Frontend programming",98],
                           [3,"sravan","elective 1","Databases",50],
                           [5,"satwik","php","Databases",99],
                           [6,"diba","php","Databases",19],
                           [9,"diba","php","Frontend programming",54],
                   ],
                  columns=["exam id","name","option1","option2","marks"]
                  )

# Let's use the crosstab() function by setting the normalize parameter to index.
frequency = pandas.crosstab(mydata.option2,mydata.name,normalize='index')

print(frequency)

Output:

Explanation:
We pass the normalization parameter and set it to “index”. It returns the percentage of each occurrence.

The total count is 6.

  1. In “Databases” – For “diba” – 1 among 4 => The normalized value is 0.25.
  2. In “Databases” – For “satwik” – 1 among 4 => The normalized value is 0.25.
  3. In “Databases” – For “sravan” – 2 among 4 => The normalized value is 0.5.
  4. In “Frontend programming” – For “diba” – 1 among 2 => The normalized value is 0.50.
  5. In “Frontend programming” – For “satwik” – 0 among 2 => The normalized value is 0.00.
  6. In “Frontend programming” – For “sravan” – 1 among 2 => The normalized value is 0.5.

Conclusion

This article is written to give you the idea and make you understand the concept of calculating the cross-tabulation for data analysis. Pandas provides us with a bunch of useful features. Out of which, “pandas.crosstab()” is one. We utilized this method in this learning to compute the cross-tabulation and considered most of the parameters for demonstration.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain