Apache Spark

PySpark – Greatest() Function

It is possible to return the greatest elements in two or more columns in PySpark DataFrame. PySpark supports the greatest() function which is used to find the highest values in multiple columns across all the rows in a PySpark RDD or in a PySpark DataFrame. It is available in the pyspark.sql.functions module.

Syntax:

dataframe_obj.select(greatest(dataframe_obj.column1,dataframe_obj.column2,..............))

Parameter:

It takes columns as parameters.

We can access the columns using the “.” operator (column1,column2,. represents the column names).

Data:
Here, we create a PySpark DataFrame that has 5 columns -[‘subject_id’,’name’,’age’,’technology1′,’technology2′] with 10 rows.

import pyspark
from pyspark.sql import SparkSession

spark_app = SparkSession.builder.appName('_').getOrCreate()

students =[(4,'sravan',23,'PHP','Testing'),
(4,'sravan',23,'PHP','Testing'),
(46,'mounika',22,'.NET','HTML'),
(4,'deepika',21,'Oracle','HTML'),
(46,'mounika',22,'Oracle','Testing'),
(12,'chandrika',22,'Hadoop','C#'),
(12,'chandrika',22,'Oracle','Testing'),
(4,'sravan',23,'Oracle','C#'),
(4,'deepika',21,'PHP','C#'),
(46,'mounika',22,'.NET','Testing')
]

dataframe_obj = spark_app.createDataFrame( students,['subject_id','name','age','technology1','technology2'])

print("----------DataFrame----------")
dataframe_obj.show()

Output:

Now, we will see the examples to return the greatest values in two or multiple columns from the previous DataFrame.

Example 1:

We created the given DataFrame. Now, we return the greatest values from the subject_id and age columns.

# Import the greatest function from the module - pyspark.sql.functions
from pyspark.sql.functions import greatest

#compare the columns - subject_id and age and return greatest values across each and every row.
dataframe_obj.select(dataframe_obj.subject_id,dataframe_obj.age,greatest(dataframe_obj.subject_id,dataframe_obj.age)).show()

Output:

Explanation:

You can compare the two column values in each row.

greatest(4,23) - 23
greatest(4,23) - 23
greatest(46,22) -46
greatest(4,21) - 21
greatest(46,22) - 46
greatest(12,22) - 22
greatest(12,22) - 22
greatest(4,23) - 23
greatest(4,21) - 21
greatest(46,22) - 46.

Example 2:

We created the given DataFrame. Now, we return the greatest values from the name , technology1, and technology2 columns.

# Import the greatest function from the module - pyspark.sql.functions
from pyspark.sql.functions import greatest

#compare the columns - name,technology1,technology2 and age and return greatest values across each and every row.
dataframe_obj.select(dataframe_obj.name,dataframe_obj.technology1,dataframe_obj.technology2,

greatest(dataframe_obj.name,dataframe_obj.technology1,dataframe_obj.technology2)).show()

Output:

Here, the strings are compared based on the ASCII values.

greatest(sravan,PHP,Testing) - sravan
greatest(sravan,PHP,Testing) - sravan
greatest(mounika, .NET,HTML) - mounika
greatest(deepika, Oracle,HTML) - deepika
greatest(mounika, Oracle,Testing) - mounika
greatest(chandrika, Hadoop, C#) - chandrika
greatest(chandrika,Oracle,Testing) - chandrika
greatest(sravan,Oracle,C#) - sravan
greatest(deepika, PHP, C#) - deepika
greatest(mounika,.NET,Testing) -mounika.

Entire Code:

import pyspark
from pyspark.sql import SparkSession

spark_app = SparkSession.builder.appName('_').getOrCreate()

students =[(4,'sravan',23,'PHP','Testing'),
(4,'sravan',23,'PHP','Testing'),
(46,'mounika',22,'.NET','HTML'),
(4,'deepika',21,'Oracle','HTML'),
(46,'mounika',22,'Oracle','Testing'),
(12,'chandrika',22,'Hadoop','C#'),
(12,'chandrika',22,'Oracle','Testing'),
(4,'sravan',23,'Oracle','C#'),
(4,'deepika',21,'PHP','C#'),
(46,'mounika',22,'.NET','Testing')
]

dataframe_obj = spark_app.createDataFrame( students,['subject_id','name','age','technology1','technology2'])

print("----------DataFrame----------")
dataframe_obj.show()

# Import the greatest function from the module - pyspark.sql.functions
from pyspark.sql.functions import greatest

#compare the columns - subject_id and age and return greatest values across each and every row.
dataframe_obj.select(dataframe_obj.subject_id,dataframe_obj.age,greatest(dataframe_obj.subject_id,dataframe_obj.age)).show()

#compare the columns - name,technology1,technology2 and age and return greatest values across each and every row.
dataframe_obj.select(dataframe_obj.name,dataframe_obj.technology1,dataframe_obj.technology2,

greatest(dataframe_obj.name,dataframe_obj.technology1,dataframe_obj.technology2)).show()

Conclusion

The greatest() function is used to find the highest values in multiple columns across all the rows in a PySpark RDD or in a PySpark DataFrame. It compares the columns with similar data types only. Otherwise, it raises the Analysis Exception – The expressions should all have the same type.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain