PySpark least() Function

In a PySpark DataFrame, it is possible to return the smallest value across two or more columns for each row.

PySpark provides the least() function for this purpose. It compares the given columns row by row and returns the smallest value in each row of a PySpark DataFrame. It is available in the pyspark.sql.functions module.

Syntax

dataframe_obj.select(least(dataframe_obj.column1, dataframe_obj.column2, ...))

Parameters:
It takes two or more columns as parameters. We can access the columns using the '.' operator (column1, column2, ... represent the column names).

Data
Here, we will create a PySpark DataFrame that has 5 columns: ['subject_id', 'name', 'age', 'technology1', 'technology2'] with 10 rows.

from pyspark.sql import SparkSession

# Create (or reuse) the Spark session
spark_app = SparkSession.builder.appName('_').getOrCreate()

# Student data: 10 rows with 5 attributes each
students = [(4, 'sravan', 23, 'PHP', 'Testing'),
            (4, 'sravan', 23, 'PHP', 'Testing'),
            (46, 'mounika', 22, '.NET', 'HTML'),
            (4, 'deepika', 21, 'Oracle', 'HTML'),
            (46, 'mounika', 22, 'Oracle', 'Testing'),
            (12, 'chandrika', 22, 'Hadoop', 'C#'),
            (12, 'chandrika', 22, 'Oracle', 'Testing'),
            (4, 'sravan', 23, 'Oracle', 'C#'),
            (4, 'deepika', 21, 'PHP', 'C#'),
            (46, 'mounika', 22, '.NET', 'Testing')]

# Create the DataFrame with the column names
dataframe_obj = spark_app.createDataFrame(students, ['subject_id', 'name', 'age', 'technology1', 'technology2'])

print("----------DataFrame----------")
dataframe_obj.show()

Output:

Now, we will look at examples that return the least values across two or more columns of this DataFrame.

Example 1
Using the DataFrame created above, we will return the least value of the subject_id and age columns in each row.

# Import the least function from the pyspark.sql.functions module
from pyspark.sql.functions import least

# Compare the subject_id and age columns and return the lower value in each row
dataframe_obj.select(dataframe_obj.subject_id, dataframe_obj.age,
                     least(dataframe_obj.subject_id, dataframe_obj.age)).show()

Output:

Explanation
For each row, least() compares the subject_id and age values and returns the smaller one:

least(4, 23) = 4
least(4, 23) = 4
least(46, 22) = 22
least(4, 21) = 4
least(46, 22) = 22
least(12, 22) = 12
least(12, 22) = 12
least(4, 23) = 4
least(4, 21) = 4
least(46, 22) = 22
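The row-by-row comparison listed above can be mimicked in plain Python, without Spark. This is only a sketch of the idea, not how Spark evaluates least() internally; the names pairs and row_least are chosen here for illustration.

```python
# (subject_id, age) pairs taken from the students data in this article
pairs = [(4, 23), (4, 23), (46, 22), (4, 21), (46, 22),
         (12, 22), (12, 22), (4, 23), (4, 21), (46, 22)]

# min() over each pair plays the role of least() over each row
row_least = [min(subject_id, age) for subject_id, age in pairs]

print(row_least)  # [4, 4, 22, 4, 22, 12, 12, 4, 4, 22]
```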

Example 2
Using the same DataFrame, we will now return the least value of the name, technology1, and technology2 columns in each row.

# Import the least function from the pyspark.sql.functions module
from pyspark.sql.functions import least

# Compare the name, technology1, and technology2 columns and return the lowest value in each row
dataframe_obj.select(dataframe_obj.name, dataframe_obj.technology1, dataframe_obj.technology2,
                     least(dataframe_obj.name, dataframe_obj.technology1, dataframe_obj.technology2)).show()

Output:

Here, the strings are compared character by character based on their ASCII values. Punctuation such as '.' sorts before uppercase letters, and uppercase letters sort before lowercase letters:

least(sravan, PHP, Testing) = PHP
least(sravan, PHP, Testing) = PHP
least(mounika, .NET, HTML) = .NET
least(deepika, Oracle, HTML) = HTML
least(mounika, Oracle, Testing) = Oracle
least(chandrika, Hadoop, C#) = C#
least(chandrika, Oracle, Testing) = Oracle
least(sravan, Oracle, C#) = C#
least(deepika, PHP, C#) = C#
least(mounika, .NET, Testing) = .NET
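To see why the results above come out this way, the ASCII codes of the leading characters can be inspected in plain Python. Python compares strings by code point, which is assumed here to give the same order as Spark for ASCII-only strings like these; the variable names are illustrative.

```python
# ASCII codes of the first characters that decide the comparisons
codes = {ch: ord(ch) for ch in [".", "C", "H", "O", "P", "T", "c", "d", "m", "s"]}
print(codes)

# A few rows of (name, technology1, technology2) from the students data
rows = [("sravan", "PHP", "Testing"),
        ("mounika", ".NET", "HTML"),
        ("deepika", "Oracle", "HTML"),
        ("chandrika", "Hadoop", "C#")]

# min() on strings picks the value whose characters have the lowest codes
smallest = [min(row) for row in rows]
print(smallest)  # ['PHP', '.NET', 'HTML', 'C#']
```

Because every lowercase letter has a higher code than every uppercase letter, the lowercase names never win against the capitalized technology names in this data.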

Entire Code

from pyspark.sql import SparkSession
# Import the least function from the pyspark.sql.functions module
from pyspark.sql.functions import least

# Create (or reuse) the Spark session
spark_app = SparkSession.builder.appName('_').getOrCreate()

# Student data: 10 rows with 5 attributes each
students = [(4, 'sravan', 23, 'PHP', 'Testing'),
            (4, 'sravan', 23, 'PHP', 'Testing'),
            (46, 'mounika', 22, '.NET', 'HTML'),
            (4, 'deepika', 21, 'Oracle', 'HTML'),
            (46, 'mounika', 22, 'Oracle', 'Testing'),
            (12, 'chandrika', 22, 'Hadoop', 'C#'),
            (12, 'chandrika', 22, 'Oracle', 'Testing'),
            (4, 'sravan', 23, 'Oracle', 'C#'),
            (4, 'deepika', 21, 'PHP', 'C#'),
            (46, 'mounika', 22, '.NET', 'Testing')]

# Create the DataFrame with the column names
dataframe_obj = spark_app.createDataFrame(students, ['subject_id', 'name', 'age', 'technology1', 'technology2'])

print("----------DataFrame----------")
dataframe_obj.show()

# Compare the subject_id and age columns and return the lower value in each row
dataframe_obj.select(dataframe_obj.subject_id, dataframe_obj.age,
                     least(dataframe_obj.subject_id, dataframe_obj.age)).show()

# Compare the name, technology1, and technology2 columns and return the lowest value in each row
dataframe_obj.select(dataframe_obj.name, dataframe_obj.technology1, dataframe_obj.technology2,
                     least(dataframe_obj.name, dataframe_obj.technology1, dataframe_obj.technology2)).show()

Conclusion

The least() function is used to find the lowest value across multiple columns for each row of a PySpark DataFrame. It only compares columns of the same or compatible data types. Otherwise, it raises an AnalysisException stating that the expressions should all have the same type.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.