In Python, PySpark is a Spark module that provides Spark-like processing using DataFrames. In PySpark, orderBy() arranges the rows of a DataFrame in ascending (sorted) order.
It returns a new dataframe with the rows of the existing dataframe arranged in sorted order; the existing dataframe is left unchanged.
Let’s create a PySpark DataFrame.
Example:
In this example, we create a PySpark DataFrame with 5 rows and 6 columns and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,
'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display dataframe
df.show()
Output:
Method – 1: Using orderBy()
Here, we use the orderBy() function to sort the PySpark DataFrame based on its columns. It accepts one or more column names.
Syntax:
dataframe.orderBy("column_name", ...)
Here,
- dataframe is the input PySpark DataFrame.
- column_name is the column where sorting is applied.
Example:
In this example, we sort the dataframe based on the address and age columns with the orderBy() function and retrieve the sorted rows using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,
'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# sort the dataframe based on address and age columns
# and return the sorted rows
df.orderBy("address", "age").collect()
Output:
[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]
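With multiple columns, orderBy() sorts by the first column and breaks ties with the next, in ascending order by default. The same ordering can be sketched in plain Python with sorted() and a tuple key; this is a local illustration of the semantics on the same student data, not PySpark itself:

```python
# Plain-Python sketch of the multi-column ascending sort that
# orderBy("address", "age") applies to the student rows above.
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37, 'address': 'hyd'},
]

# Sort by address first; ties on address are broken by age.
ordered = sorted(students, key=lambda row: (row['address'], row['age']))

print([row['rollno'] for row in ordered])
# → ['001', '004', '002', '005', '003']
```

The tuple key mirrors how orderBy() compares the first column before consulting the second, which is why the three 'hyd' rows come out ordered by age.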
Method – 2: Using orderBy() with Col Function
Here, we use the orderBy() function to sort the PySpark DataFrame based on the columns. We specify the column name(s) inside orderBy() through the col() function, which is imported from the pyspark.sql.functions module and is used to read a column from the PySpark DataFrame.
Syntax:
dataframe.orderBy(col("column_name"), ...)
Here,
- dataframe is the input PySpark DataFrame.
- column_name is the column where sorting is applied through the col function.
Example:
In this example, we sort the dataframe based on the address and age columns with the orderBy() function and retrieve the sorted rows using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,
'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# sort the dataframe based on address and age columns
# and return the sorted rows
df.orderBy(col("address"), col("age")).collect()
Output:
[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]
Method – 3: Using orderBy() with DataFrame Label
Here, we use the orderBy() function to sort the PySpark DataFrame based on the columns. We specify the column label(s) inside orderBy() using the DataFrame's attribute notation (dataframe.column_name).
Syntax:
dataframe.orderBy(dataframe.column_name, ...)
Here,
- dataframe is the input PySpark DataFrame.
- column_name is the column where sorting is applied.
Example:
In this example, we sort the dataframe based on the address and age columns with the orderBy() function and retrieve the sorted rows using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,
'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# sort the dataframe based on address and age columns
# and return the sorted rows
df.orderBy(df.address, df.age).collect()
Output:
[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]
Method – 4: Using orderBy() with DataFrame Index
Here, we use the orderBy() function to sort the PySpark DataFrame based on the columns. We specify the column position(s) inside orderBy() through the DataFrame column index. In a DataFrame, indexing starts at 0.
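Note which column each index refers to: when the DataFrame is created from a list of Python dictionaries, as here, PySpark infers the schema with the dictionary keys in alphabetical order, which is why the Row output in the earlier methods reads address, age, height, name, rollno, weight. So index 0 is address and index 1 is age. A plain-Python sketch of that key ordering:

```python
# The student rows are dictionaries with these six keys.
keys = ['rollno', 'name', 'age', 'height', 'weight', 'address']

# PySpark sorts dictionary keys alphabetically when inferring the
# schema, so column positions follow sorted key order.
columns = sorted(keys)
print(columns)
# → ['address', 'age', 'height', 'name', 'rollno', 'weight']

# Hence df[0] refers to 'address' and df[1] refers to 'age'.
print(columns[0], columns[1])
```

This is why df[0] and df[1] in the example below reproduce the same address-then-age sort as the earlier methods.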
Syntax:
dataframe.orderBy(dataframe[column_index], ...)
Here,
- dataframe is the input PySpark DataFrame.
- column_index is the column position where sorting is applied.
Example:
In this example, we sort the dataframe based on the address and age columns with the orderBy() function and retrieve the sorted rows using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,
'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# sort the dataframe based on address and age columns
# and return the sorted rows
df.orderBy(df[0], df[1]).collect()
Output:
[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]
Conclusion
In this article, we discussed four ways to use the orderBy() function on a PySpark dataframe in Python: passing column names, using the col() function, using DataFrame column labels, and using column indices. In every case, the rows are sorted in ascending order based on the specified columns.