PySpark provides the least() function, which returns the smallest value among two or more columns for each row of a PySpark DataFrame. It is available in the pyspark.sql.functions module.
Syntax

least(column1, column2, ...)

Parameter:

It takes two or more columns as parameters. The columns can be accessed using the '.' operator (dataframe_obj.column1, dataframe_obj.column2, and so on).
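Before running Spark, the row-wise behavior of least() can be sketched in plain Python. The least_rowwise() helper below is a hypothetical illustration, not part of the PySpark API; it models how Spark's least() skips NULL values and returns NULL only when every input is NULL:

```python
def least_rowwise(*values):
    # Model of Spark's least(): the smallest non-null value per row.
    # Hypothetical helper for illustration only; Spark's least()
    # skips NULLs and returns NULL only when all inputs are NULL.
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None

print(least_rowwise(4, 23))         # -> 4
print(least_rowwise(None, 22, 46))  # -> 22
print(least_rowwise(None, None))    # -> None
```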
Data
Here, we will create a PySpark DataFrame that has 5 columns: ['subject_id', 'name', 'age', 'technology1', 'technology2'] with 10 rows.
from pyspark.sql import SparkSession
spark_app = SparkSession.builder.appName('_').getOrCreate()
# Student records: (subject_id, name, age, technology1, technology2)
students = [(4,'sravan',23,'PHP','Testing'),
(4,'sravan',23,'PHP','Testing'),
(46,'mounika',22,'.NET','HTML'),
(4,'deepika',21,'Oracle','HTML'),
(46,'mounika',22,'Oracle','Testing'),
(12,'chandrika',22,'Hadoop','C#'),
(12,'chandrika',22,'Oracle','Testing'),
(4,'sravan',23,'Oracle','C#'),
(4,'deepika',21,'PHP','C#'),
(46,'mounika',22,'.NET','Testing')
]
dataframe_obj = spark_app.createDataFrame(students, ['subject_id','name','age','technology1','technology2'])
print("----------DataFrame----------")
dataframe_obj.show()
Output:
Now, let's look at examples that return the least value across two or more columns of this DataFrame.
Example 1
Using the previous DataFrame, we will return the least value from the subject_id and age columns.
from pyspark.sql.functions import least
# Compare the subject_id and age columns and return the lowest value in each row.
dataframe_obj.select(dataframe_obj.subject_id, dataframe_obj.age, least(dataframe_obj.subject_id, dataframe_obj.age)).show()
Output:
Explanation
For each row, the two column values are compared and the smaller one is returned:
least(4, 23) - 4
least(4, 23) - 4
least(46, 22) - 22
least(4, 21) - 4
least(46, 22) - 22
least(12, 22) - 12
least(12, 22) - 12
least(4, 23) - 4
least(4, 21) - 4
least(46, 22) - 22
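The per-row results above can be cross-checked with Python's built-in min(), which applies the same numeric comparison (a quick sanity check, not Spark code):

```python
# The (subject_id, age) pairs from the DataFrame, in row order.
rows = [(4, 23), (4, 23), (46, 22), (4, 21), (46, 22),
        (12, 22), (12, 22), (4, 23), (4, 21), (46, 22)]

# min() over each pair mirrors least() row by row.
print([min(s, a) for s, a in rows])
# -> [4, 4, 22, 4, 22, 12, 12, 4, 4, 22]
```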
Example 2
Using the previous DataFrame, we will return the least value from the name, technology1, and technology2 columns.
from pyspark.sql.functions import least
# Compare the name, technology1, and technology2 columns and return the lowest value in each row.
dataframe_obj.select(dataframe_obj.name, dataframe_obj.technology1, dataframe_obj.technology2,
least(dataframe_obj.name, dataframe_obj.technology1, dataframe_obj.technology2)).show()
Output:
Here, the strings are compared lexicographically by their character (ASCII) codes, so uppercase letters sort before lowercase ones:
least('sravan', 'PHP', 'Testing') - PHP
least('sravan', 'PHP', 'Testing') - PHP
least('mounika', '.NET', 'HTML') - .NET
least('deepika', 'Oracle', 'HTML') - HTML
least('mounika', 'Oracle', 'Testing') - Oracle
least('chandrika', 'Hadoop', 'C#') - C#
least('chandrika', 'Oracle', 'Testing') - Oracle
least('sravan', 'Oracle', 'C#') - C#
least('deepika', 'PHP', 'C#') - C#
least('mounika', '.NET', 'Testing') - .NET
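The string results can likewise be reproduced with Python's min(), since Python compares strings by character code just as Spark does here (a cross-check, not Spark code):

```python
# Uppercase letters (and symbols like '#' and '.') have smaller
# character codes than lowercase letters, so 'PHP' < 'Testing' < 'sravan'.
print(min('sravan', 'PHP', 'Testing'))   # -> PHP
print(min('mounika', '.NET', 'HTML'))    # -> .NET
print(min('chandrika', 'Hadoop', 'C#'))  # -> C#
```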
Entire Code
from pyspark.sql import SparkSession
spark_app = SparkSession.builder.appName('_').getOrCreate()
# Student records: (subject_id, name, age, technology1, technology2)
students = [(4,'sravan',23,'PHP','Testing'),
(4,'sravan',23,'PHP','Testing'),
(46,'mounika',22,'.NET','HTML'),
(4,'deepika',21,'Oracle','HTML'),
(46,'mounika',22,'Oracle','Testing'),
(12,'chandrika',22,'Hadoop','C#'),
(12,'chandrika',22,'Oracle','Testing'),
(4,'sravan',23,'Oracle','C#'),
(4,'deepika',21,'PHP','C#'),
(46,'mounika',22,'.NET','Testing')
]
dataframe_obj = spark_app.createDataFrame(students, ['subject_id','name','age','technology1','technology2'])
print("----------DataFrame----------")
dataframe_obj.show()
# Import the least function from the module - pyspark.sql.functions
from pyspark.sql.functions import least
# Compare the subject_id and age columns and return the lowest value in each row.
dataframe_obj.select(dataframe_obj.subject_id, dataframe_obj.age, least(dataframe_obj.subject_id, dataframe_obj.age)).show()
# Compare the name, technology1, and technology2 columns and return the lowest value in each row.
dataframe_obj.select(dataframe_obj.name, dataframe_obj.technology1, dataframe_obj.technology2,
least(dataframe_obj.name, dataframe_obj.technology1, dataframe_obj.technology2)).show()
Conclusion
The least() function finds the lowest value across multiple columns for each row of a PySpark DataFrame. All the compared columns must have the same data type; otherwise, Spark raises an AnalysisException.
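The same-type requirement can be illustrated in plain Python, where comparing an int with a str raises a TypeError, much as Spark rejects mismatched column types (an analogy, not the Spark error itself):

```python
# Mixing int and str is not an ordered comparison in Python 3,
# analogous to least() refusing columns of different data types.
try:
    min(4, 'PHP')
except TypeError as exc:
    print('comparison failed:', exc)
```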