First, let's create an RDD. RDD stands for Resilient Distributed Dataset, the fundamental data structure in Apache Spark. The RDD class lives in the pyspark.rdd module (importing it is optional, since PySpark returns RDD objects directly). To create an RDD in PySpark, we use the parallelize() method of the SparkContext.
Syntax:

spark_app.sparkContext.parallelize(data)

where spark_app is the SparkSession, and data can be one-dimensional (linear data) or two-dimensional (row-column data).
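For instance, here is a minimal sketch of both forms (the RDD names one_dim and two_dim are illustrative, and collect() is used only to display the contents):

# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# one-dimensional (linear) data
one_dim = spark_app.sparkContext.parallelize([89, 76, 78])

# two-dimensional (row-column) data: each element is a row
two_dim = spark_app.sparkContext.parallelize([[89, 76], [78, 90], [100, 34]])

print(one_dim.collect())   # [89, 76, 78]
print(two_dim.collect())   # [[89, 76], [78, 90], [100, 34]]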
Now let's go through the aggregate functions one by one: sum(), min(), max(), mean(), and count().
1. sum()
sum() is used to return the total (sum) of the values in the RDD. It takes no parameters.
Syntax:

RDD_data.sum()

where RDD_data is the RDD.
Example:
In this example, we create an RDD named student_marks with 20 elements and return the sum of those elements.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession
# import RDD from pyspark.rdd (optional; parallelize() already returns an RDD)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform sum() operation
print(student_marks.sum())
Output:

1112

From the above output, we can see that the total sum of the elements in the RDD is 1112.
2. min()
min() is used to return the minimum value from the RDD. It takes no parameters.
Syntax:

RDD_data.min()
Example:
In this example, we create an RDD named student_marks with 20 elements and return the minimum value from the RDD.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession
# import RDD from pyspark.rdd (optional; parallelize() already returns an RDD)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform min() operation
print(student_marks.min())
Output:

21

From the above output, we can see that the minimum value in the RDD is 21.
3. max()
max() is used to return the maximum value from the RDD. It takes no parameters.
Syntax:

RDD_data.max()
Example:
In this example, we create an RDD named student_marks with 20 elements and return the maximum value from the RDD.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession
# import RDD from pyspark.rdd (optional; parallelize() already returns an RDD)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform max() operation
print(student_marks.max())
Output:

100

From the above output, we can see that the maximum value in the RDD is 100.
4. mean()
mean() is used to return the average (mean) value in the RDD. It takes no parameters.
Syntax:

RDD_data.mean()
Example:
In this example, we create an RDD named student_marks with 20 elements and return the average of the elements in the RDD.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession
# import RDD from pyspark.rdd (optional; parallelize() already returns an RDD)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform mean() operation
print(student_marks.mean())
Output:

55.6

From the above output, we can see that the average value in the RDD is 55.6 (1112 / 20).
5. count()
count() is used to return the total number of values present in the RDD. It takes no parameters.
Syntax:

RDD_data.count()
Example:
In this example, we create an RDD named student_marks with 20 elements and return the count of elements in the RDD.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession
# import RDD from pyspark.rdd (optional; parallelize() already returns an RDD)
from pyspark.rdd import RDD

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform count() operation
print(student_marks.count())
Output:

20

From the above output, we can see that the total number of values in the RDD is 20.
Conclusion
In this PySpark tutorial, we saw five different aggregation operations performed on an RDD: sum() returns the total of the values in an RDD, mean() returns their average, and min() and max() return the minimum and maximum values. If you need the total number of elements present in an RDD, use the count() function.
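As a quick recap, all five operations can be run on the same RDD. Here is a minimal sketch reusing the student_marks data from the examples above:

# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements
student_marks = spark_app.sparkContext.parallelize([89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34])

# perform all five aggregate operations
print(student_marks.sum())    # 1112
print(student_marks.min())    # 21
print(student_marks.max())    # 100
print(student_marks.mean())   # 55.6
print(student_marks.count())  # 20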