
PySpark between() Function

The between() function in PySpark checks whether each value in a column falls within a specified range. It can be used with the select() method. It returns true for every value inside the range, with both the lower and upper bounds included. For values outside the range, it returns false.

Syntax

dataframe_obj.select(dataframe_obj.age.between(low,high))

Here, dataframe_obj is the PySpark DataFrame and age is the column being checked.

Parameters:
It takes two parameters.

  1. low is the lower bound of the range (inclusive).
  2. high is the upper bound of the range (inclusive).
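
As a rough mental model of what the two parameters mean, the per-value check can be sketched in plain Python. This is only an illustration of the inclusive comparison; the helper name in_range is hypothetical and is not part of PySpark.

```python
# Plain-Python model of the check applied to each value.
# `in_range` is a hypothetical helper, not a PySpark function.
def in_range(value, low, high):
    """Return True when low <= value <= high (both bounds inclusive)."""
    return low <= value <= high

ages = [23, 22, 21]
flags = [in_range(age, 10, 21) for age in ages]
print(flags)  # [False, False, True]
```

Note that 21 counts as inside the range 10 to 21 because the upper bound itself is included.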

Return:
It returns a Boolean column with one value per row: true if the value is in the range and false otherwise. We will look at different examples.

Example 1
Here, we will check which values in the age column are in the range of 10 to 21.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session.
spark_app = SparkSession.builder.appName('_').getOrCreate()

# Sample rows: (subject_id, name, age, technology1, technology2).
students = [(4, 'sravan', 23, 'PHP', 'Testing'),
            (2, 'sravan', 23, 'Oracle', 'Testing'),
            (46, 'mounika', 22, '.NET', 'HTML'),
            (12, 'deepika', 21, 'Oracle', 'HTML'),
            (46, 'mounika', 22, 'Oracle', 'Testing'),
            (12, 'chandrika', 23, 'Hadoop', 'C#'),
            (12, 'chandrika', 22, 'Oracle', 'Testing'),
            (45, 'sravan', 23, 'Oracle', 'C#'),
            (4, 'deepika', 21, 'PHP', 'C#'),
            (46, 'mounika', 22, '.NET', 'Testing')]

dataframe_obj = spark_app.createDataFrame(students, ['subject_id', 'name', 'age', 'technology1', 'technology2'])

print("---Actual Dataframe---")
dataframe_obj.show()

# Show each age next to a Boolean flag that is true when 10 <= age <= 21.
print("---The values in the age column between 10 and 21---")
dataframe_obj.select(dataframe_obj.age, dataframe_obj.age.between(10, 21)).show()

Output:

You can see that only the rows with age 21 return true, because between() includes both bounds. The rows with ages 22 and 23 return false.

Example 2
Here, we will check which values in the subject_id column are in the range of 40 to 46.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session.
spark_app = SparkSession.builder.appName('_').getOrCreate()

# Sample rows: (subject_id, name, age, technology1, technology2).
students = [(4, 'sravan', 23, 'PHP', 'Testing'),
            (2, 'sravan', 23, 'Oracle', 'Testing'),
            (46, 'mounika', 22, '.NET', 'HTML'),
            (12, 'deepika', 21, 'Oracle', 'HTML'),
            (46, 'mounika', 22, 'Oracle', 'Testing'),
            (12, 'chandrika', 23, 'Hadoop', 'C#'),
            (12, 'chandrika', 22, 'Oracle', 'Testing'),
            (45, 'sravan', 23, 'Oracle', 'C#'),
            (4, 'deepika', 21, 'PHP', 'C#'),
            (46, 'mounika', 22, '.NET', 'Testing')]

dataframe_obj = spark_app.createDataFrame(students, ['subject_id', 'name', 'age', 'technology1', 'technology2'])

print("---Actual Dataframe---")
dataframe_obj.show()

# Show each subject_id next to a Boolean flag that is true when 40 <= subject_id <= 46.
print("---The values in the subject_id column between 40 and 46---")
dataframe_obj.select(dataframe_obj.subject_id, dataframe_obj.subject_id.between(40, 46)).show()

Output:

You can see that the rows with subject_id 45 or 46 return true, because they lie between 40 and 46 and the upper bound 46 is included. The remaining rows (subject_id 2, 4, and 12) return false.

Example 3
Here, we will check which values in the subject_id column are in the range of 60 to 100.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session.
spark_app = SparkSession.builder.appName('_').getOrCreate()

# Sample rows: (subject_id, name, age, technology1, technology2).
students = [(4, 'sravan', 23, 'PHP', 'Testing'),
            (2, 'sravan', 23, 'Oracle', 'Testing'),
            (46, 'mounika', 22, '.NET', 'HTML'),
            (12, 'deepika', 21, 'Oracle', 'HTML'),
            (46, 'mounika', 22, 'Oracle', 'Testing'),
            (12, 'chandrika', 23, 'Hadoop', 'C#'),
            (12, 'chandrika', 22, 'Oracle', 'Testing'),
            (45, 'sravan', 23, 'Oracle', 'C#'),
            (4, 'deepika', 21, 'PHP', 'C#'),
            (46, 'mounika', 22, '.NET', 'Testing')]

dataframe_obj = spark_app.createDataFrame(students, ['subject_id', 'name', 'age', 'technology1', 'technology2'])

print("---Actual Dataframe---")
dataframe_obj.show()

# Show each subject_id next to a Boolean flag that is true when 60 <= subject_id <= 100.
print("---The values in the subject_id column between 60 and 100---")
dataframe_obj.select(dataframe_obj.subject_id, dataframe_obj.subject_id.between(60, 100)).show()

Output:

You can see that no values in the subject_id column fall within the specified range, since the largest subject_id is 46. So, false is returned for all rows.

Conclusion

In this PySpark tutorial, we discussed the between() function, which checks whether the values in a column fall within a specified range. It can be used with the select() method. It returns true for all values inside the range, including the lower and upper bounds themselves, and false for values outside the range.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.