Apache Spark

PySpark dense_rank() Window Function

dense_rank() in PySpark is a window function that sets the ranks to the rows in each partition.

The rank starts from 1, and if any values are the same in two or more rows, then the same rank will be assigned.

Here, the values are from the column where we specified the column inside orderBy(). So, based on the values in this column, dense_rank() checks the values and assigns the rank number.

The main thing is that dense_rank() is not available directly. We have to import it from the pyspark.sql.functions module.

In PySpark, it is possible to partition the rows in the DataFrame using the Window function. It is available in the pyspark.sql.window module.

The advantage of partition is to achieve data consistency and grouping. After performing the partition, we can set the ranks to the rows in each partition using the dense_rank() function.

It is possible to partition the rows in a DataFrame based on the values in a particular column. So, all the similar values are set to each partition.

We must follow the following steps to partition and apply dense_rank on partitioned data:

Steps:

  1. Create a PySpark DataFrame with similar values in at least one column.
  2. Partition the data using the partitionBy() method available in the Window function and order them based on the column using the orderBy() function.

    Syntax:
    partition=Window.partitionBy(“column”).orderBy(“column”)

    We can order the partitioned data with the partitioned column or any other column.
  3. Now, you can use dense_rank() function on the partitioned rows using over() function.
    So, we will add a column to store the row number using the withColumn() function.

    Syntax:
    dataframe_obj.withColumn(“NAME”,dense_rank().over(partition))

    Here, NAME specifies row name, and dataframe_obj is our PySpark DataFrame.

    Let’s implement the code.

Example 1
Here, we will create a PySpark DataFrame that has 5 columns: [‘subject_id’,’name’,’age’,’technology1′,’technology2′] with 10 rows and partition the rows based on technology1 using the Window function.

Finally, we will give Ranks to the rows in all partitions by adding a new column named “DENSE RANK”.

import pyspark
 
from pyspark.sql import *
 
spark_app = SparkSession.builder.appName('_').getOrCreate()
 
 
students =[(4,'sravan',23,'PHP','Testing'),
           (4,'sravan',23,'PHP','Testing'),
           (46,'mounika',22,'.NET','HTML'),
           (4,'deepika',21,'Oracle','HTML'),
           (46,'mounika',22,'Oracle','Testing'),
           (12,'chandrika',22,'Hadoop','C#'),
           (12,'chandrika',22,'Oracle','Testing'),
           (4,'sravan',23,'Oracle','C#'),
           (4,'deepika',21,'PHP','C#'),
           (46,'mounika',22,'.NET','Testing')
              ]
 
dataframe_obj = spark_app.createDataFrame( students,['subject_id','name','age','technology1','technology2'])
 
print("----------Actual DataFrame----------")
dataframe_obj.show()
 
# import the Window Function
from pyspark.sql.window import Window
 
#import the dense_rank from pyspark.sql.functions
from pyspark.sql.functions import dense_rank
 
#partition the dataframe based on the values in technology1 column and
#order the rows in each partition based on subject_id column
partition = Window.partitionBy("technology1").orderBy('subject_id')
 
print("----------Partitioned DataFrame----------")
 
#Now mention dense_rank for each row in DENSE RANK column
dataframe_obj.withColumn("DENSE RANK",dense_rank().over(partition)).show()

Output:

Explanation
The first output represents the actual data present in the DataFrame.

In the second output, the partition is done based on the technology1 column.

Total partitions: – 4

Partition 1:
.NET occurred two times in the first partition. The values in the subject_id column are the same. Hence the dense_rank() function returns the same rank-1 for both .NET values.

Partition 2:
Hadoop occurred one time in the second partition. So, the rank is 1.

Partition 3:
Oracle occurred four times in the third partition.
For the first two Oracle, the subject_id values are the same, so for both, rank-1 is assigned.
For the third Oracle, the rank is 2.
For the last Oracle, the rank is 3.

Partition 4:
PHP occurred three times in the fourth partition, and the values in the subject_id column are the same for all. Hence Rank is 1 for all PHP.

Example 2
Here, using the Window function, we will partition the rows based on technology2.

Finally, we will rank the rows based on the subject_id column in all partitions by adding a new column named “DENSE RANK”.

(If you didn’t create the PySpark DataFrame, create it using the first example.)

# import the Window Function
from pyspark.sql.window import Window
 
#import the dense_rank from pyspark.sql.functions
from pyspark.sql.functions import dense_rank
 
#partition the dataframe based on the values in technology2 column and
#order the rows in each partition based on subject_id column
partition = Window.partitionBy("technology2").orderBy('subject_id')
 
print("----------Partitioned DataFrame----------")
 
#Now mention dense_rank for each row in DENSE RANK column
dataframe_obj.withColumn("DENSE RANK",dense_rank().over(partition)).show()

Output:

Explanation
The output represents the actual data present in the dataframe.

In the second output, the partition is done based on the technology2 column and ordered based on the subject_id column.

Total partitions: – 3

Partition 1:
C# occurred three times in the first partition. Also, the first two values in the subject_id column are the same. Hence, the dense_rank() function returns the same rank-1 for the first two C#, and for the third C#, the rank is 2.

Partition 2:
HTML occurred twice in the second partition, and the subject_id values are different for these two. So, rank-1 is assigned for the first HTML, and rank-2 is assigned for the second.

Partition 3:
Testing occurred five times in the third partition.
For the first two testings, the subject_id values are the same, so for both, rank-1 is assigned.
For the third testing, the rank is 2.
For the last testing, the rank was 3.

Example 3
Here, using the Window function, we will partition the rows based on technology2.

Finally, we will rank the rows based on the age column in all partitions by adding a new column named “DENSE RANK”.

(If you didn’t create the PySpark DataFrame, create it using the first example.)

# import the Window Function
from pyspark.sql.window import Window
 
#import the dense_rank from pyspark.sql.functions
from pyspark.sql.functions import dense_rank
 
#partition the dataframe based on the values in technology2 column and
#order the rows in each partition based on age column
partition = Window.partitionBy("technology2").orderBy('age')
 
print("----------Partitioned DataFrame----------")
 
#Now mention dense_rank for each row in DENSE RANK column
dataframe_obj.withColumn("DENSE RANK",dense_rank().over(partition)).show()

Output:

Explanation
Output represents the actual data present in the dataframe.
In the second output, the partition is done based on the technology2 column and ordered based on the age column.

Total partitions: – 3

Partition 1:
C# occurred three times in the first partition. All the values in the age column are different. Hence, the dense_rank() function returns the ranks 1,2 and 3 for three C# values.

Partition 2:
HTML occurred two times in the first partition. All the values in the age column are different. Hence the dense_rank() function returns the ranks 1 and 2 for two HTML values.

Partition 3:
Testing occurred five times in the third partition.
For the first three testing values, the age values are the same. So, for three, rank-1 is assigned.
The age values for the fourth and fifth testing are the same, so rank-2 is given to both.

Conclusion

We discussed partitioning the rows in the PySpark DataFrame and setting the rank in each partition using the dense_rank() Window function. dense_rank in PySpark is a window function that sets the ranks to the rows in each partition. The rank starts from 1, and if any values are the same in two or more rows, then the same rank will be assigned.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain