Apache Spark

PySpark – asc() & desc()

In Python, PySpark is the Spark module that provides Spark-like processing through DataFrames. Let's create a PySpark DataFrame.

Example:

In this example, we create a PySpark DataFrame with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# display the dataframe
df.show()

Output:

PySpark – asc()

In PySpark, asc() is used to arrange the rows of a DataFrame in ascending order.

It returns a new DataFrame with the rows of the existing DataFrame rearranged; it is used with the sort() or orderBy() functions.
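Conceptually, a multi-column ascending sort behaves like Python's built-in sorted() with a tuple key. The following plain-Python sketch (no Spark needed; the three rows are a hypothetical subset of the students data used below) mimics what orderBy(col("address").asc(), col("age").asc()) does:

```python
# Hypothetical subset of the students data, for illustration only
students = [
    {'rollno': '001', 'address': 'guntur', 'age': 23},
    {'rollno': '002', 'address': 'hyd', 'age': 16},
    {'rollno': '004', 'address': 'hyd', 'age': 9},
]

# Equivalent in spirit to df.orderBy(col("address").asc(), col("age").asc()):
# sort by address first, then break ties by age, both ascending.
ordered = sorted(students, key=lambda row: (row['address'], row['age']))

print([row['rollno'] for row in ordered])   # → ['001', '004', '002']
```

Rows with the same address ('hyd') end up ordered by age, just as in the PySpark outputs below.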

Method – 1: Using asc() with Col Function

Here, we use the orderBy() or sort() function to sort the PySpark DataFrame in ascending order based on one or more columns. We specify the column name(s) inside orderBy()/sort() through the col function, which must be imported from the pyspark.sql.functions module. It is used to read a column from the PySpark DataFrame.

Syntax:

dataframe.orderBy(col("column_name").asc(), ..., col("column_name").asc())

dataframe.sort(col("column_name").asc(), ..., col("column_name").asc())

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied through the col function.

Example:

In this example, we sort the DataFrame in ascending order based on the address and age columns with the orderBy() and sort() functions, and display the sorted DataFrame using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and display the sorted dataframe
print(df.orderBy(col("address").asc(), col("age").asc()).collect())
print()
print(df.sort(col("address").asc(), col("age").asc()).collect())

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

Method – 2: Using asc() with DataFrame Label

Here, we use the orderBy() or sort() function to sort the PySpark DataFrame in ascending order based on one or more columns. We specify the column name(s)/label(s) inside orderBy()/sort() using the DataFrame column name/label.

Syntax:

dataframe.orderBy(dataframe.column_name.asc(), ..., dataframe.column_name.asc())

dataframe.sort(dataframe.column_name.asc(), ..., dataframe.column_name.asc())

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied.

Example:

In this example, we sort the DataFrame in ascending order based on the address and age columns with the orderBy() and sort() functions, and display the sorted DataFrame using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and display the sorted dataframe
print(df.orderBy(df.address.asc(), df.age.asc()).collect())
print()
print(df.sort(df.address.asc(), df.age.asc()).collect())

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

Method – 3: Using asc() with DataFrame Index

Here, we use the orderBy() or sort() function to sort the PySpark DataFrame in ascending order based on one or more columns. We specify the column index/indices inside orderBy()/sort() using the DataFrame column position. In a DataFrame, indexing starts at 0.

Syntax:

dataframe.orderBy(dataframe[column_index].asc(), ..., dataframe[column_index].asc())

dataframe.sort(dataframe[column_index].asc(), ..., dataframe[column_index].asc())

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_index is the column position where sorting is applied.

Example:

In this example, we sort the DataFrame in ascending order based on the address and age columns with the orderBy() and sort() functions, and display the sorted DataFrame using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address (index 0) and age (index 1) columns
# and display the sorted dataframe
print(df.orderBy(df[0].asc(), df[1].asc()).collect())
print()
print(df.sort(df[0].asc(), df[1].asc()).collect())

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

PySpark – desc()

In PySpark, desc() is used to arrange the rows of a DataFrame in descending order.

It returns a new DataFrame with the rows of the existing DataFrame rearranged; it is used with the sort() or orderBy() functions.

Method – 1: Using desc() with Col Function

Here, we use the orderBy() or sort() function to sort the PySpark DataFrame in descending order based on one or more columns. We specify the column name(s) inside orderBy()/sort() through the col function, which must be imported from the pyspark.sql.functions module. It is used to read a column from the PySpark DataFrame.

Syntax:

dataframe.orderBy(col("column_name").desc(), ..., col("column_name").desc())

dataframe.sort(col("column_name").desc(), ..., col("column_name").desc())

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied through the col function.

Example:

In this example, we sort the DataFrame in descending order based on the address and age columns with the orderBy() and sort() functions, and display the sorted DataFrame using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns in descending order
# and display the sorted dataframe
print(df.orderBy(col("address").desc(), col("age").desc()).collect())
print()
print(df.sort(col("address").desc(), col("age").desc()).collect())

Output:

[Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67)]

[Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67)]

Method – 2: Using desc() with DataFrame Label

Here, we use the orderBy() or sort() function to sort the PySpark DataFrame in descending order based on one or more columns. We specify the column name(s)/label(s) inside orderBy()/sort() using the DataFrame column name/label.

Syntax:

dataframe.orderBy(dataframe.column_name.desc(), ..., dataframe.column_name.desc())

dataframe.sort(dataframe.column_name.desc(), ..., dataframe.column_name.desc())

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied.

Example:

In this example, we sort the DataFrame in descending order based on the address and age columns with the orderBy() and sort() functions, and display the sorted DataFrame using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns in descending order
# and display the sorted dataframe
print(df.orderBy(df.address.desc(), df.age.desc()).collect())
print()
print(df.sort(df.address.desc(), df.age.desc()).collect())

Output:

[Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67)]

[Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67)]

Method – 3: Using desc() with DataFrame Index

Here, we use the orderBy() or sort() function to sort the PySpark DataFrame in descending order based on one or more columns. We specify the column index/indices inside orderBy()/sort() using the DataFrame column position. In a DataFrame, indexing starts at 0.

Syntax:

dataframe.orderBy(dataframe[column_index].desc(), ..., dataframe[column_index].desc())

dataframe.sort(dataframe[column_index].desc(), ..., dataframe[column_index].desc())

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_index is the column position where sorting is applied.

Example:

In this example, we sort the DataFrame in descending order based on the address and age columns with the orderBy() and sort() functions, and display the sorted DataFrame using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address (index 0) and age (index 1) columns
# in descending order and display the sorted dataframe
print(df.orderBy(df[0].desc(), df[1].desc()).collect())
print()
print(df.sort(df[0].desc(), df[1].desc()).collect())

Output:

[Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67)]

[Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67)]

Miscellaneous

We can also apply both functions to different columns of a PySpark DataFrame at the same time.

Example:

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort by address descending and age ascending, then display
print(df.orderBy(col("address").desc(), col("age").asc()).collect())
print()
# sort by address ascending and age descending, then display
print(df.sort(col("address").asc(), col("age").desc()).collect())

Output:

[Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17), Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28), Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34), Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54), Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67)]

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67), Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54), Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34), Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28), Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]
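Conceptually, mixing sort directions corresponds to a stable multi-pass sort. This plain-Python sketch (no Spark needed; the three rows are a hypothetical subset of the students data) mimics orderBy(col("address").desc(), col("age").asc()): since Python's sort is stable, we sort by the secondary key first and then by the primary key in reverse.

```python
# Hypothetical subset of the students data, for illustration only
students = [
    {'rollno': '001', 'address': 'guntur', 'age': 23},
    {'rollno': '002', 'address': 'hyd', 'age': 16},
    {'rollno': '004', 'address': 'hyd', 'age': 9},
]

# Equivalent in spirit to df.orderBy(col("address").desc(), col("age").asc()):
# pass 1 orders by the secondary key (age ascending); pass 2 stably
# reorders by the primary key (address descending), preserving age order
# within each address.
rows = sorted(students, key=lambda r: r['age'])                # age ascending
rows = sorted(rows, key=lambda r: r['address'], reverse=True)  # address descending

print([r['rollno'] for r in rows])   # → ['004', '002', '001']
```

The two-pass trick is needed because a string key cannot be negated the way a numeric key can to flip its direction.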

Conclusion

In this article, we discussed three ways to use the asc() and desc() functions with the sort() and orderBy() functions on a PySpark DataFrame in Python: through the col function, through the DataFrame column label, and through the column index. With them, we can sort the data in a PySpark DataFrame in ascending order using asc() and in descending order using desc(), based on the columns present in the DataFrame.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.