Crosstab() Syntax:
Parameters:
- Here, the first parameter is “index” which is the values that are utilized as the output DataFrame’s index. It can be an array, series, or multiple arrays or series.
- The “columns” are the generated DataFrame’s columns.
- The “values” are the values which are utilized to calculate the statistic provided by “agg _func”.
- The “rownames” is the name(s) that are allocated to the rows in the resultant DataFrame. The “rownames=None” is the default setting for this argument.
- The “colnames” is the column name(s) which are given to the output DataFrame. The “colnames=None” is the default setting. This function’s input is “values” and the return element is the aggregated statistic such as the mean, maximum, etc.
- The “margins” determine whether to add an extra row and column that shows the sum of the cells row-wise as well as column-wise. The “Margin=False” is the default setting.
- The “Margins-name” is the title of the newly appended row and column if the “margins” is configured to “True”.
- The “dropna” determines whether to remove the columns with only NaN values. The “dropna=False” is the default configuration.
- The “normalize” tells whether to normalize the obtained numbers by dividing by the total.
Example 1:
Create a DataFrame with 5 columns and generate a contingency table (cross tab) by setting the x-axis as “option2” column and y-axis as “name” column.
# Create a DataFrame with 5 columns
mydata = pandas.DataFrame([[1,"sravan","cloud technology","Databases",89],
[2,"sravan","Python","Frontend programming",98],
[3,"sravan","elective 1","Databases",50],
[5,"satwik","php","Databases",99],
[6,"diba","php","Databases",19],
[9,"diba","elective 1","Frontend programming",54],
],
columns=["exam id","name","option1","option2","marks"]
)
print(mydata,"\n")
# Let's use crosstab() function
frequency = pandas.crosstab(mydata.option2,mydata.name)
print(frequency)
Output:
Explanation:
We pass the “option2” column as x-axis and the “name” column as y-axis. Now, you can see in the previous illustration that the frequency table is generated. Let’s analyze it.
- “Databases” – “diba” hold only 1, “satwik” also 1, and “sravan” 2.
- “Frontend programming” – “diba” holds only 1, “satwik” holds 0, and “sravan” is also 1.
Example 2:
Utilize the same DataFrame and generate a contingency table (cross tab) by setting the x-axis as “option1” and “option2” columns and the y-axis as “name” column.
# Create a DataFrame with 5 columns
mydata = pandas.DataFrame([[1,"sravan","cloud technology","Databases",89],
[2,"sravan","elective 1","Frontend programming",98],
[3,"sravan","elective 1","Databases",50],
[5,"satwik","php","Databases",99],
[6,"diba","php","Databases",19],
[9,"diba","php","Frontend programming",54],
],
columns=["exam id","name","option1","option2","marks"]
)
# Let's use crosstab() function
frequency = pandas.crosstab([mydata.option1,mydata.option2],mydata.name)
print(frequency)
Output:
Explanation:
We pass the “option1” and “option2” columns as x-axis through a list and the “name” column as y-axis. Now, you can see in the previous illustration that the frequency table is generated.
Example 3: Set the Margins
Now, we create a contingency table for “option2” and “name” columns by setting the margins and providing the margin names as “Total Count”.
# Create a DataFrame with 5 columns
mydata = pandas.DataFrame([[1,"sravan","cloud technology","Databases",89],
[2,"sravan","elective 1","Frontend programming",98],
[3,"sravan","elective 1","Databases",50],
[5,"satwik","php","Databases",99],
[6,"diba","php","Databases",19],
[9,"diba","php","Frontend programming",54],
],
columns=["exam id","name","option1","option2","marks"]
)
# Let's use crosstab() function by setting margins and margins_name parameter
frequency = pandas.crosstab(mydata.option2,mydata.name,margins=True,margins_name="Total Count")
print(frequency)
Output:
Explanation:
We pass the margins and set them to True and the margins_name to “Total Count”. The margins are created and the total sum of values for the axis are returned as 6.
Across Rows:
- For “Databases” – 1+1+2 = 4
- For “Frontend programming” – 1+0+1=2
Total count is 6.
Across Columns:
- For “diba” – 1+1 = 2
- For “satwik” – 1+0 = 1
- For “sravan” – 2+1 = 3
Total count is 6.
Example 4: Set the Normalize Values
Now, we create a contingency table for “option2” and “name” columns by normalizing the values.
# Create a DataFrame with 5 columns
mydata = pandas.DataFrame([[1,"sravan","cloud technology","Databases",89],
[2,"sravan","elective 1","Frontend programming",98],
[3,"sravan","elective 1","Databases",50],
[5,"satwik","php","Databases",99],
[6,"diba","php","Databases",19],
[9,"diba","php","Frontend programming",54],
],
columns=["exam id","name","option1","option2","marks"]
)
# Let's use the crosstab() function by setting the normalize parameter to index.
frequency = pandas.crosstab(mydata.option2,mydata.name,normalize='index')
print(frequency)
Output:
Explanation:
We pass the normalization parameter and set it to “index”. It returns the percentage of each occurrence.
The total count is 6.
- In “Databases” – For “diba” – 1 among 4 => The normalized value is 0.25.
- In “Databases” – For “satwik” – 1 among 4 => The normalized value is 0.25.
- In “Databases” – For “sravan” – 2 among 4 => The normalized value is 0.5.
- In “Frontend programming” – For “diba” – 1 among 2 => The normalized value is 0.50.
- In “Frontend programming” – For “satwik” – 0 among 2 => The normalized value is 0.00.
- In “Frontend programming” – For “sravan” – 1 among 2 => The normalized value is 0.5.
Conclusion
This article is written to give you the idea and make you understand the concept of calculating the cross-tabulation for data analysis. Pandas provides us with a bunch of useful features. Out of which, “pandas.crosstab()” is one. We utilized this method in this learning to compute the cross-tabulation and considered most of the parameters for demonstration.