
PySpark – Pandas Series: Aggregate Functions

PySpark is the Python API for Apache Spark. Its pyspark.pandas module (available since Spark 3.2) provides a pandas-like API on top of Spark, including a Series data structure that stores the given data in a one-dimensional array.

A pyspark.pandas Series mirrors the pandas Series, but it holds a PySpark column internally.

The pandas-style Series becomes available by importing pandas from the pyspark module.

Before that, you have to install the pyspark module.

Command

pip install pyspark

Syntax to import

from pyspark import pandas

After that, we can create or use the series from the pandas module.

Syntax to create pandas Series

pyspark.pandas.Series()

We can pass a list or list of lists with values.

Let’s create a pandas Series through pyspark that has five numeric values.

# import pandas from the pyspark module
from pyspark import pandas

# create a series with 5 elements
pyspark_series = pandas.Series([90, 56, 78, 54, 0])

print(pyspark_series)

Output

0    90
1    56
2    78
3    54
4     0
dtype: int64

Now, let's move on to the aggregate functions.

Aggregate functions reduce the values of a series to a single result; examples are sum(), min(), mean(), and max(). In this tutorial, we apply them to numeric data such as integers and doubles.

Let’s see them one by one.

pyspark.pandas.Series.sum()

sum() in the pyspark pandas series returns the sum of all values in the series.

Syntax

pyspark_series.sum()

Where pyspark_series is the pyspark pandas series.

Example
Return the sum of the pyspark pandas series created above.

# import pandas from the pyspark module
from pyspark import pandas

# create a series with 5 elements
pyspark_series = pandas.Series([90, 56, 78, 54, 0])

# return the sum
print(pyspark_series.sum())

Output

278

Working:
90 + 56 + 78 + 54 + 0 = 278
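
The addition above can be reproduced without Spark as a quick sanity check. The sketch below uses only the Python standard library (it does not call pyspark.pandas), and the variable names are illustrative. It shows the running total after each element and the final value that sum() on the series is expected to match.

```python
from itertools import accumulate

# Same five values as the series in the example above.
values = [90, 56, 78, 54, 0]

# Running total after each element is added.
running_totals = list(accumulate(values))

# Overall sum, equal to the last running total.
total = sum(values)

print(running_totals)
print(total)
```

Because the last element is 0, the final two running totals are the same.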

pyspark.pandas.Series.mean()

mean() in the pyspark pandas series returns the average (arithmetic mean) of the values in the series.

Syntax

pyspark_series.mean()

Where pyspark_series is the pyspark pandas series.

Example
Return the average of the pyspark pandas series created above.

# import pandas from the pyspark module
from pyspark import pandas

# create a series with 5 elements
pyspark_series = pandas.Series([90, 56, 78, 54, 0])

# return the average
print(pyspark_series.mean())

Output

55.6

Working:
(90 + 56 + 78 + 54 + 0) / 5 = 55.6
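
The same calculation can be written out step by step in plain Python. This is a minimal sketch that assumes nothing beyond the standard library; it does not use pyspark.pandas, and the variable names are illustrative. It divides the total by the number of elements and compares the result with statistics.fmean().

```python
from statistics import fmean

# Same five values as the series in the example above.
values = [90, 56, 78, 54, 0]

total = sum(values)      # sum of all elements
count = len(values)      # number of elements, including the 0

# The mean is the total divided by the element count.
average = total / count

print(average)
print(fmean(values))
```

Note that the 0 still counts as an element, so the divisor is 5, not 4.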

pyspark.pandas.Series.min()

min() in the pyspark pandas series returns the minimum value in the series.

Syntax

pyspark_series.min()

Where pyspark_series is the pyspark pandas series.

Example
Return the minimum value from the pyspark pandas series created above.

# import pandas from the pyspark module
from pyspark import pandas

# create a series with 5 elements
pyspark_series = pandas.Series([90, 56, 78, 54, 0])

# return the minimum
print(pyspark_series.min())

Output

0

Working:
minimum(90, 56, 78, 54, 0) = 0

pyspark.pandas.Series.max()

max() in the pyspark pandas series returns the maximum value in the series.

Syntax

pyspark_series.max()

Where pyspark_series is the pyspark pandas series.

Example
Return the maximum value from the pyspark pandas series created above.

# import pandas from the pyspark module
from pyspark import pandas

# create a series with 5 elements
pyspark_series = pandas.Series([90, 56, 78, 54, 0])

# return the maximum
print(pyspark_series.max())

Output

90

Working:
maximum(90, 56, 78, 54, 0) = 90
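
Unlike sum() and mean(), min() and max() compare the individual values instead of combining them. The plain-Python sketch below illustrates this with the same five numbers; it uses only built-in functions rather than pyspark.pandas, and the variable names are illustrative. Sorting the values makes it easy to see that the minimum and maximum are the first and last elements of the ordered list.

```python
# Same five values as the series in the examples above.
values = [90, 56, 78, 54, 0]

smallest = min(values)   # smallest individual value
largest = max(values)    # largest individual value

# Sorting in ascending order puts the minimum first and the maximum last.
ordered = sorted(values)

print(smallest, largest)
print(ordered)
```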

Conclusion

In this pyspark pandas series tutorial, we saw four aggregate functions applied to a series: sum() returns the sum of all values, mean() returns the average, min() returns the minimum value, and max() returns the maximum value.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain