Python

Cumulative Percentage Pandas

The standard mathematics formula for calculating the cumulative percentage is calculating the cumulative sum for each value in the column and dividing the cumulative sum of each cell in a column by the mathematical sum of all the values in the column and then multiplying it with 100. The question now is how to accomplish this in Python using pandas.

This article is designed to make you learn how to find cumulative percentages in python using pandas. Let’s start with the basic formula or syntax of cumulative percentage and then proceed further.

What is the Syntax of cumulative percentage?

Below is the syntax of the cumulative percentage that you can follow.

Cumulative Percentage Pandas

The Pandas Data frames in python provide two predefined functions, sum() and cumsum(), to calculate the cumulative percentage with python. The sum() function is used to calculate the sum of all the values in the column, and the cumsum() function is used to calculate the cumulative sum of each value in the column.

How does the cumsum() function work to calculate the cumulative sum in python?

The cumsum() is a python built-in panda data frame used to calculate the cumulative sum of array elements. It takes an array as input and calculates a sequence of partial sums or, in other words, a running total. The main purpose of calculating a cumulative sum is to update the sum of a dataset each time a new value is added to the dataset.

There are five steps that you must follow to calculate the cumulative sum with pandas in python, and here they are:

  1. Create a data frame or provide an array of data you want to calculate the cumulative percentage.
  2. Calculate the cumulative sum with the built-in cumsum() function.
  3. Calculate the sum of the array with the built-in sum() function.
  4. Divide the calculated cumsum() value by the calculated sum() value.
  5. Multiply each calculated value by 100, and the resultant value will be the cumulated percentage of a particular value in the array.

Now let us see some examples to understand how cumsum() and sum() work to calculate the column’s cumulative percentage with a pandas data frame in python.

Example 1:
In the first example, we will start calculating the cumulative sum of just one column so that you can understand the function easily. First, create a data frame as ‘data_frame’ and provide the values you need to calculate the cumulative sum, then pass the ‘data_frame’ parameter to pd.DataFrame() while specifying the column values, and finally, use the cumsum() and sum() built-in functions to calculate the cumulative percentage.

import pandas as pd
import numpy as np

data_frame = {
     'Names':['ab','bc','cd','de','ef','fg','gh'],
   'MathScore':[152,187,149,174,128,159,148]}

data_frame = pd.DataFrame(data_frame, columns=['Names','MathScore'])

data_frame['cum_per'] = 100*(data_frame.MathScore.cumsum() / data_frame.MathScore.sum())

data_frame

Here is the output of the above code:

Example 2:
In this example, we will show you the cumulative sum at each step. The code is exactly the same as the above example, with some minor modifications to print the cumulative sum at each step. See the sample code given below:

import pandas as pd
import numpy as np

data_frame = {
     'Names':['ab','bc','cd','de','ef','fg','gh'],
   'MathScore':[152,187,149,174,128,159,148]}

data_frame = pd.DataFrame(data_frame, columns=['Names','MathScore'])

data_frame['Cumulative Sum'] = data_frame.MathScore.cumsum()
_sum = data_frame.MathScore.sum()
data_frame['Cumulative%'] = 100*(cum_sum / _sum)


data_frame

Here is the output. Note that the cumulative sum is shown in each value’s ‘cumulative sum’ column. And the total of all values is 1097. See the last line of the output.

Example 3:
We have seen how to calculate the cumulative percentage of one column at one time. Now let us see how to calculate the sum of two columns simultaneously. The procedure follows the same steps as the above examples.

First, you need to create a data frame and then provide the columns in which you need to calculate the cumulative sum. Then, find the sum and cumulative sum with sum() and cumsum() built-in function, respectively, and multiply the values by 100. Now print the values in the data frame to see the output. The code is given below:

import pandas as pd
import numpy as np

data_frame = {
     'Names':['ab','bc','cd','de','ef','fg','gh'],
   'MathScore':[152,187,149,174,128,159,148],
  'EnglishScore':[134,167,125,189,192,145,186]}

data_frame = pd.DataFrame(data_frame, columns=['Names','MathScore','EnglishScore'])

data_frame['Cum Math Sum'] = data_frame.MathScore.cumsum()
data_frame['Cum Eng Sum'] = data_frame.EnglishScore.cumsum()
Msum = data_frame.MathScore.sum()
Esum = data_frame.EnglishScore.sum()
data_frame['MCumulative%'] = 100*(cum_sum / Msum)
data_frame['ECumulative%'] = 100*(data_frame['Cum Eng Sum'] / Esum)

data_frame

The output of the above code is as follows:

Example 4:
In this example, we will explain how to handle the ‘NaN’ values in an array? And how cumsum() deals with ‘NaN’ values in the data frame. The initial steps are the same as those given in the examples; however, the additional steps are given below in the code:

import pandas as pd
import numpy as np

data_frame = pd.DataFrame({
     'Names':['ab','bc','cd','de','ef','fg','gh'],
   'MathScore':[152,187,149,174,128,159,148]},
   index = ['1','2','3','4','5','6','7'])
print(data_frame)

The output will show an index column now:

If you notice, there is an additional column ‘index’ added to the data frame. The next step is to find the cumulative sum we have calculated in the above examples.

data_frame['Cumulative Sum'] = data_frame.MathScore.cumsum()

print(data_frame)

Here is the cumulative sum before providing a ‘NaN’ value:

After that, let us drop one value in the array and replace it with the ‘Nan’ value. Look at the code below to discover how to replace a value with ‘NaN.

data_frame = data_frame.drop('Cumulative Sum', axis=1)
data_frame.loc['3', 'MathScore'] = np.NAN

print(data_frame)

As you can see in the output below that the value at index 3 is replaced with ‘NaN’:

Let us calculate the cumulative sum again to see how cumsum() responds to the ‘NaN’ value in
the array.

data_frame['Cumulative Sum'] = data_frame.MathScore.cumsum()

print(data_frame)

Here is the output of the new cumulative sum.

If you notice that cumsum() ignored the ‘NaN’ and calculated the cumulative sum of all other values. Moreover, the total value is also different as the one value in the array is ‘NaN,’ and it does not contribute to the total value.

Conclusion:

This article is designed to learn how to find the cumulative percentage with pandas data frame in python. First, we have discussed the basic mathematical formula of cumulative percentage, and then we have seen the two built-in python functions, cumsum() and sum(), to find the cumulative percentage with pandas in python. The cumulative sum is the running total of a sequence of partial sums. With the help of examples, we have shown how to use the cumsum() and sum() built-in python function to calculate the cumulative sum of a column.

About the author

Kalsoom Bibi

Hello, I am a freelance writer and usually write for Linux and other technology related content