PySpark RDD – Aggregate Functions

In Python, PySpark is the Spark module used to provide Spark-like processing, and it supports SQL-style operations, including aggregate functions. In this PySpark RDD tutorial, we will see how to perform different aggregation functions on a PySpark RDD.

First, let's create an RDD. RDD stands for Resilient Distributed Dataset, the fundamental data structure in Apache Spark. The RDD class lives in the pyspark.rdd module, though importing it is optional and only needed for type references. To create an RDD in PySpark, we can use the parallelize() method of the SparkContext.

Syntax:

spark_app.sparkContext.parallelize(data)

where data can be one-dimensional (linear data) or two-dimensional (row-column data).
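
For instance, here is a minimal sketch of both kinds of input (the session name linuxhint and the sample values are just placeholders):

# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# one-dimensional (linear) data
linear_rdd = spark_app.sparkContext.parallelize([1, 2, 3, 4, 5])

# two-dimensional (row-column) data
tabular_rdd = spark_app.sparkContext.parallelize([['abc', 90], ['def', 85]])

# collect() pulls the distributed data back to the driver for display
print(linear_rdd.collect())   # [1, 2, 3, 4, 5]
print(tabular_rdd.collect())  # [['abc', 90], ['def', 85]]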

Now let's go through the aggregate functions one by one: sum(), min(), max(), mean(), and count().

1. sum()

sum() is used to return the total (sum) of the values in the RDD. It takes no parameters.

Syntax:

RDD_data.sum()

Example:

In this example, we create an RDD named student_marks with 20 elements and return the sum of all the elements in the RDD.

# import the pyspark module
import pyspark

# import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd (optional; only needed for type references)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22,
                                                    45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform the sum() operation
print(student_marks.sum())

Output:

1112

From the above output, we can see that the sum of the elements in the RDD is 1112.

2. min()

min() is used to return the minimum value from the RDD. It takes no required parameters (an optional key function can be passed to customize the comparison).

Syntax:

RDD_data.min()

Example:

In this example, we create an RDD named student_marks with 20 elements and return the minimum value from the RDD.

# import the pyspark module
import pyspark

# import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd (optional; only needed for type references)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22,
                                                    45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform the min() operation
print(student_marks.min())

Output:

21

From the above output, we can see that the minimum value in the RDD is 21.

3. max()

max() is used to return the maximum value from the RDD. It takes no required parameters (an optional key function can be passed to customize the comparison).

Syntax:

RDD_data.max()

Example:

In this example, we create an RDD named student_marks with 20 elements and return the maximum value from the RDD.

# import the pyspark module
import pyspark

# import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd (optional; only needed for type references)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22,
                                                    45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform the max() operation
print(student_marks.max())

Output:

100

From the above output, we can see that the maximum value in the RDD is 100.
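
As noted above, min() and max() also accept an optional key function. A small sketch, reusing student_marks, where comparing the marks by their string form deliberately changes the result:

# compare the marks as strings rather than as numbers
print(student_marks.max(key=str))   # 90, since '90' sorts after '100'
print(student_marks.min(key=str))   # 100, since '100' sorts before '21'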

4. mean()

mean() is used to return the average (mean) of the values in the RDD. It takes no parameters.

Syntax:

RDD_data.mean()

Example:

In this example, we create an RDD named student_marks with 20 elements and return the average of the elements in the RDD.

# import the pyspark module
import pyspark

# import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd (optional; only needed for type references)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22,
                                                    45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform the mean() operation
print(student_marks.mean())

Output:

55.6

From the above output, we can see that the average value in the RDD is 55.6.

5. count()

count() is used to return the total number of values present in the RDD. It takes no parameters.

Syntax:

RDD_data.count()

Example:

In this example, we create an RDD named student_marks with 20 elements and return the count of elements in the RDD.

# import the pyspark module
import pyspark

# import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd (optional; only needed for type references)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22,
                                                    45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform the count() operation
print(student_marks.count())

Output:

20

From the above output, we can see that the total number of values in the RDD is 20.
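
As a quick sanity check, mean() is equivalent to dividing the result of sum() by the result of count(). A minimal sketch, reusing the student_marks RDD from the examples above:

# the mean is the sum divided by the count
print(student_marks.sum() / student_marks.count())   # 55.6
print(student_marks.mean())                          # 55.6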

Conclusion

In this PySpark tutorial, we saw five different aggregation operations performed on an RDD. sum() returns the total of the values in an RDD, mean() returns their average, and min() and max() return the minimum and maximum values. If you need the total number of elements present in an RDD, use the count() function.
