
PySpark sum() and avg() Window Functions

After partitioning the rows of a PySpark DataFrame, we can apply aggregate functions over each partition window to return the total sum and the average of the values in every partition. First, let's see how to partition a DataFrame in PySpark.

Partitioning a PySpark DataFrame

It is possible to partition the rows of a DataFrame using the Window class, which is available in the pyspark.sql.window module. Partitioning the rows of a PySpark DataFrame takes two steps.

Steps

  1. Create a PySpark DataFrame with repeated values in at least one column.
  2. Partition the rows using the partitionBy() method of the Window class.

Syntax

partition = Window.partitionBy("column")

We can order the partitioned data by the partitioning column or by any other column, as the sketch below shows.
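For instance, a window that both partitions and orders can be built by chaining orderBy() onto partitionBy(). This is a minimal sketch; the column names assume the DataFrame that is created next:

# a sketch: partition by technology1 and order each partition by age
from pyspark.sql.window import Window

ordered_partition = Window.partitionBy("technology1").orderBy("age")

Now, let's create the DataFrame.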

Example
Here, we will create a PySpark DataFrame that has 5 columns: ['subject_id', 'name', 'age', 'technology1', 'technology2'] with 10 rows.

from pyspark.sql import SparkSession

spark_app = SparkSession.builder.appName('_').getOrCreate()

# 10 rows; subject_id, technology1 and technology2 contain repeated values
students = [(4, 'sravan', 23, 'PHP', 'Testing'),
            (4, 'sravan', 23, 'PHP', 'Testing'),
            (46, 'mounika', 22, '.NET', 'HTML'),
            (4, 'deepika', 21, 'Oracle', 'HTML'),
            (46, 'mounika', 22, 'Oracle', 'Testing'),
            (12, 'chandrika', 22, 'Hadoop', 'C#'),
            (12, 'chandrika', 22, 'Oracle', 'Testing'),
            (4, 'sravan', 23, 'Oracle', 'C#'),
            (4, 'deepika', 21, 'PHP', 'C#'),
            (46, 'mounika', 22, '.NET', 'Testing')]

dataframe_obj = spark_app.createDataFrame(students, ['subject_id', 'name', 'age', 'technology1', 'technology2'])

print("----------Actual DataFrame----------")
dataframe_obj.show()

Output:
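Given the rows above, show() prints a table like the following (rows appear in insertion order):

+----------+---------+---+-----------+-----------+
|subject_id|     name|age|technology1|technology2|
+----------+---------+---+-----------+-----------+
|         4|   sravan| 23|        PHP|    Testing|
|         4|   sravan| 23|        PHP|    Testing|
|        46|  mounika| 22|       .NET|       HTML|
|         4|  deepika| 21|     Oracle|       HTML|
|        46|  mounika| 22|     Oracle|    Testing|
|        12|chandrika| 22|     Hadoop|         C#|
|        12|chandrika| 22|     Oracle|    Testing|
|         4|   sravan| 23|     Oracle|         C#|
|         4|  deepika| 21|        PHP|         C#|
|        46|  mounika| 22|       .NET|    Testing|
+----------+---------+---+-----------+-----------+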

PySpark sum() Window Function

The sum() window function returns the total sum of the values in each partition. So, after partitioning the DataFrame, it is possible to return a per-partition sum.

Syntax

dataframe_obj.withColumn("SUM", sum(col("column")).over(partition))

Parameter:

sum(col("column"))

Here, the sum() function takes the column name as its parameter and returns the sum of the values in that column for each partition. The result is added as a new SUM column using the withColumn() function.
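Note that importing sum from pyspark.sql.functions shadows Python's built-in sum(). A common alternative, shown here as a sketch that is not required for the examples below, is to import the functions module under an alias:

# a sketch: the module alias avoids shadowing Python's built-in sum()
from pyspark.sql import functions as F
from pyspark.sql.window import Window

partition = Window.partitionBy("technology1")
dataframe_obj.withColumn("SUM", F.sum(F.col("subject_id")).over(partition)).show()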

Example 1
Let's partition the DataFrame created above based on the technology1 column and get the total sum of all values in the subject_id column in each partition.

# import the Window class
from pyspark.sql.window import Window

# import sum and col from pyspark.sql.functions
from pyspark.sql.functions import sum, col

# partition the dataframe based on the values in the technology1 column
partition = Window.partitionBy("technology1")

print("----------Partitioned DataFrame----------")

# return the total sum of values in the subject_id column for each partition
dataframe_obj.withColumn("SUM", sum(col("subject_id")).over(partition)).show()

Output:

Explanation
Total partitions: 4

Partition 1:
.NET occurs twice in the first partition, and the sum of the subject_id values is 46+46=92.

Partition 2:
Hadoop occurs once in the second partition, so the sum is 12.

Partition 3:
Oracle occurs four times in the third partition, and the sum of the subject_id values is 4+46+12+4=66.

Partition 4:
PHP occurs three times in the fourth partition, and the sum of the subject_id values is 4+4+4=12.
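These totals can be cross-checked with a plain groupBy() aggregation, which collapses each partition to a single row instead of keeping every row the way the window version does (a minimal sketch):

# cross-check: one row per technology1 group with the summed subject_id
dataframe_obj.groupBy("technology1").sum("subject_id").show()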

Example 2
Using the same DataFrame, let's partition it based on the technology1 column and get the total sum of all values in the age column in each partition.

# import the Window class
from pyspark.sql.window import Window

# import sum and col from pyspark.sql.functions
from pyspark.sql.functions import sum, col

# partition the dataframe based on the values in the technology1 column
partition = Window.partitionBy("technology1")

print("----------Partitioned DataFrame----------")

# return the total sum of values in the age column for each partition
dataframe_obj.withColumn("SUM", sum(col("age")).over(partition)).show()

Output:

Explanation
Total partitions: 4

Partition 1:
.NET occurs twice in the first partition, and the sum of the age values is 22+22=44.

Partition 2:
Hadoop occurs once in the second partition, so the sum is 22.

Partition 3:
Oracle occurs four times in the third partition, and the sum of the age values is 21+22+22+23=88.

Partition 4:
PHP occurs three times in the fourth partition, and the sum of the age values is 23+23+21=67.
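Before moving on to avg(), note that adding orderBy() to the window turns sum() into a running total: with an ordered window, the default frame accumulates from the first row of the partition up to the current row. Here is a minimal sketch using the same data (rows that tie on the ordering column share the same running value):

# a sketch: an ordered window makes sum() a running (cumulative) total
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, col

ordered = Window.partitionBy("technology1").orderBy("age")
dataframe_obj.withColumn("RUNNING_SUM", sum(col("age")).over(ordered)).show()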

PySpark avg() Window Function

The avg() window function returns the average of the values in each partition. So, after partitioning the DataFrame, it is possible to return a per-partition average.

Syntax

dataframe_obj.withColumn("AVERAGE",avg(col("column")).over(partition))

Parameter:

avg(col("column"))

Here, avg() takes the column name as its parameter and returns the average of the values in that column for each partition. The result is added as a new AVERAGE column using the withColumn() function.

Example 1
Let's partition the DataFrame created above based on the technology1 column and get the average of all values in the subject_id column in each partition.

# import the Window class
from pyspark.sql.window import Window

# import avg and col from pyspark.sql.functions
from pyspark.sql.functions import avg, col

# partition the dataframe based on the values in the technology1 column
partition = Window.partitionBy("technology1")

print("----------Partitioned DataFrame----------")

# return the average of values in the subject_id column for each partition
dataframe_obj.withColumn("AVERAGE", avg(col("subject_id")).over(partition)).show()

Output:

Explanation
Total partitions: 4

Partition 1:
.NET occurs twice in the first partition, and the average of the subject_id values is (46+46)/2=46.0.

Partition 2:
Hadoop occurs once in the second partition, so the average is 12.0.

Partition 3:
Oracle occurs four times in the third partition, and the average of the subject_id values is (4+46+12+4)/4=16.5.

Partition 4:
PHP occurs three times in the fourth partition, and the average of the subject_id values is (4+4+4)/3=4.0.
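If the long decimals are unwanted, the window average can be rounded with the round() function from pyspark.sql.functions. This is a minimal sketch; round() is imported under an alias here to avoid shadowing Python's built-in round():

# a sketch: round the per-partition average to two decimal places
from pyspark.sql.window import Window
from pyspark.sql.functions import avg, col, round as spark_round

partition = Window.partitionBy("technology1")
dataframe_obj.withColumn("AVERAGE", spark_round(avg(col("subject_id")).over(partition), 2)).show()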

Example 2
Using the same DataFrame, let's partition it based on the technology1 column and get the average of all values in the age column in each partition.

# import the Window class
from pyspark.sql.window import Window

# import avg and col from pyspark.sql.functions
from pyspark.sql.functions import avg, col

# partition the dataframe based on the values in the technology1 column
partition = Window.partitionBy("technology1")

print("----------Partitioned DataFrame----------")

# return the average of values in the age column for each partition
dataframe_obj.withColumn("AVERAGE", avg(col("age")).over(partition)).show()

Output:

Explanation
Total partitions: 4

Partition 1:
.NET occurs twice in the first partition, and the average of the age values is (22+22)/2=22.0.

Partition 2:
Hadoop occurs once in the second partition, so the average is 22.0.

Partition 3:
Oracle occurs four times in the third partition, and the average of the age values is (21+22+22+23)/4=22.0.

Partition 4:
PHP occurs three times in the fourth partition, and the average of the age values is (23+23+21)/3≈22.33.

Conclusion

In this PySpark partitioning tutorial, we saw how to return the total sum of the values in each partition window using the sum() function and the average using the avg() function, adding the result to the existing DataFrame as a new column. The average can also be obtained with mean(), which behaves like avg(); just import mean from the pyspark.sql.functions module, the same module that provides sum and avg, as the sketch below shows.
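A minimal sketch of the mean() variant, which returns the same result as avg():

# a sketch: mean() behaves the same as avg() over a partition window
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col

partition = Window.partitionBy("technology1")
dataframe_obj.withColumn("AVERAGE", mean(col("age")).over(partition)).show()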
