PySpark – max()

PySpark is the Python API for Apache Spark; it provides Spark-style processing through DataFrames. The max() function in PySpark returns the maximum value from a particular column in the DataFrame. We can get the maximum value in three ways:

Method 1: Using select() method
Method 2: Using agg() method
Method 3: Using groupBy() method

First, we have to create a PySpark DataFrame for demonstration.

Example:

We will create a dataframe with 5 rows and 6 columns and display it using the show() method.

# import the pyspark module
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23,
     'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16,
     'height': 3.79, 'weight': 34, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
     'height': 2.79, 'weight': 17, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9,
     'height': 3.69, 'weight': 28, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37,
     'height': 5.59, 'weight': 54, 'address': 'hyd'},
]

# create the dataframe
df = spark_app.createDataFrame(students)

# display dataframe
df.show()

Method 1: Using select() method

We can get the maximum value from a column in the DataFrame by calling the max() function inside the select() method. To use max(), we have to import it from the pyspark.sql.functions module. Finally, we can call the collect() method to return the result as a list of Row objects.

Syntax:

df.select(max('column_name'))

Where,

df is the input PySpark DataFrame
column_name is the column to get the maximum value

If we want to return the maximum value from multiple columns, we call max() once per column inside the select() method, separating the calls with commas.

Syntax:

df.select(max('column_name'), max('column_name'), ..., max('column_name'))

Where,

df is the input PySpark DataFrame
column_name is the column to get the maximum value

Example 1: Single Column

This example will get the maximum value from the height column in the PySpark dataframe.

# import the pyspark module
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession
# import the max function (this shadows Python's built-in max in this script)
from pyspark.sql.functions import max

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23,
     'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16,
     'height': 3.79, 'weight': 34, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
     'height': 2.79, 'weight': 17, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9,
     'height': 3.69, 'weight': 28, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37,
     'height': 5.59, 'weight': 54, 'address': 'hyd'},
]

# create the dataframe
df = spark_app.createDataFrame(students)

# return the maximum from the height column
df.select(max('height')).collect()

Output:

[Row(max(height)=5.79)]

In the above example, the maximum value from the height column is returned.

Example 2: Multiple Columns

This example will get the maximum value from the height, age, and weight columns in the PySpark dataframe.

Output:

[Row(max(height)=5.79, max(age)=37, max(weight)=67)]

In the above example, the maximum values from the height, age, and weight columns are returned.

Method 2: Using agg() method

We can get the maximum value from a column in the DataFrame using the agg() method. Called directly on a DataFrame, agg() aggregates over all rows without grouping them. It takes a dictionary as a parameter, in which each key is a column name and each value is the name of the aggregate function, i.e., 'max'. Finally, we can use the collect() method to return the result as a list of Row objects.

Syntax:

df.agg({'column_name': 'max'})

Where,

df is the input PySpark DataFrame
column_name is the column to get the maximum value
'max' is the name of the aggregate function, passed as a string, used to return the maximum value

If we want to return the maximum value from multiple columns, we add one key-value pair per column to the dictionary, each mapping a column name to 'max', separated by commas.

Syntax:

df.agg({'column_name': 'max', 'column_name': 'max', ..., 'column_name': 'max'})

Where,

df is the input PySpark DataFrame
column_name is the column to get the maximum value
'max' is the name of the aggregate function, passed as a string, used to return the maximum value

Example 1: Single Column

This example will get the maximum value from the height column in the PySpark dataframe.

# import the pyspark module
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23,
     'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16,
     'height': 3.79, 'weight': 34, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
     'height': 2.79, 'weight': 17, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9,
     'height': 3.69, 'weight': 28, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37,
     'height': 5.59, 'weight': 54, 'address': 'hyd'},
]

# create the dataframe
df = spark_app.createDataFrame(students)

# return the maximum from the height column
df.agg({'height': 'max'}).collect()

Output:

[Row(max(height)=5.79)]

In the above example, the maximum value from the height column is returned.

Example 2: Multiple Columns

This example will get the maximum value from the height, age, and weight columns in the PySpark dataframe.

Output:

[Row(max(weight)=67, max(age)=37, max(height)=5.79)]

In the above example, the maximum values from the height, age, and weight columns are returned.

Method 3: Using groupBy() method

We can get the maximum value of a column within each group using the groupBy() method. This method groups the rows that share the same value in a column, and the maximum is then computed separately for each group. We have to call the max() function after the groupBy() function.

Syntax:

df.groupBy(group_column).max('column_name')

Where,

df is the input PySpark DataFrame
group_column is the column whose values are used to group the rows
column_name is the column to get the maximum value
max is an aggregation function used to return the maximum value.

Example 1:

In this example, we will group the rows by the address column and return the maximum value of the height column for each address.

Output:

There are three unique values in the address field: hyd, guntur, and patna. So the maximum height is computed separately for each of these address groups.

[Row(address='hyd', max(height)=5.59),
Row(address='guntur', max(height)=5.79),
Row(address='patna', max(height)=2.79)]

Example 2:

In this example, we will group the rows by the address column and return the maximum value of the weight column for each address.

Output:

There are three unique values in the address field: hyd, guntur, and patna. So the maximum weight is computed separately for each of these address groups.

[Row(address='hyd', max(weight)=54),
Row(address='guntur', max(weight)=67),
Row(address='patna', max(weight)=17)]

Conclusion:

We discussed how to get the maximum value from a PySpark DataFrame column using the select() and agg() methods. To get the maximum value within groups defined by another column, we used the groupBy() method along with the max() function. See also the PySpark min() article.

About the author

Gottumukkala Sravan Kumar