Apache Spark

PySpark – expr()

PySpark is the Python API for Apache Spark, and it provides Spark-style data processing in Python through the DataFrame. The expr() function in PySpark takes an SQL expression as a string and evaluates it against the DataFrame's columns. expr() can be used in many scenarios.

Actual Syntax

expr("SQL expression")

It takes an SQL expression as a string and returns a Column that performs the operations described in the expression.
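Because expr() returns a Column object, its result can be used anywhere PySpark expects a column, such as in select(), withColumn(), or filter(). As a minimal sketch (the variable name here is just illustrative):

from pyspark.sql.functions import expr

#expr() turns an SQL string into a Column object
age_plus_one = expr("age + 1")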

Let's discuss these scenarios one by one. Before that, we have to create a PySpark DataFrame for demonstration.

Example
We are going to create a DataFrame with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
 
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
 
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
 
# create the dataframe
df = spark_app.createDataFrame(students)
 
#display dataframe
df.show()

Output
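Running the code above prints a table roughly like the following (the columns appear in alphabetical order because the DataFrame is built from dictionaries):

+-------+---+------+----------------+------+------+
|address|age|height|            name|rollno|weight|
+-------+---+------+----------------+------+------+
| guntur| 23|  5.79|          sravan|   001|    67|
|    hyd| 16|  3.79|          ojaswi|   002|    34|
|  patna|  7|  2.79|gnanesh chowdary|   003|    17|
|    hyd|  9|  3.69|          rohith|   004|    28|
|    hyd| 37|  5.59|         sridevi|   005|    54|
+-------+---+------+----------------+------+------+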

Scenario 1: Concatenating One or More Columns

In this scenario, we will concatenate one or more columns in the PySpark DataFrame using expr().

We use the withColumn() method to store the concatenated columns under a new column name.

Syntax

dataframe.withColumn("column_name", expression)

Parameters
It takes two parameters.

  1. column_name is the name given to the new column that holds the concatenated values.
  2. expression is created using expr() to concatenate the columns.

Here, the expression is:

expr("column1 || 'separator' || column2 || 'separator' || ... || columnN")

Here, column1 through columnN are the names of the columns to be concatenated, with a separator (a string literal) placed between them. The separator can be any string, including special characters such as "-".

Example
In this example, we are concatenating three columns – rollno, name, and address – separated by "-", and the new column holding the concatenated values is named "rollno with name and address".

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the expr function
from pyspark.sql.functions import expr

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#concatenate the rollno, name and address columns using expr()
df.withColumn("rollno with name and address", expr("rollno || '-' || name || '-' || address")).show()

Output

We can see that the rollno, name, and address columns are concatenated with the "-" separator.
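The same result can also be produced with SQL's concat_ws function inside expr(), which takes the separator once instead of repeating it between every pair of columns. A minimal sketch, reusing the df defined above:

#alternative: concat_ws takes the separator once, followed by the columns
df.withColumn("rollno with name and address", expr("concat_ws('-', rollno, name, address)")).show()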

Scenario 2: Add a New Column Based on Conditions

Here, we will use expr() to specify conditions and, along with withColumn(), assign the values returned by those conditions to a new column.

We can specify the conditions using CASE WHEN.

Structure

CASE WHEN condition1 THEN action1
     WHEN condition2 THEN action2
     ...
     ELSE default_action
END

Inside WHEN, we provide the condition, and inside THEN, we specify the action to perform.

If a condition is true, its THEN action is returned. If every condition fails, the ELSE action is returned, and END closes the expression.

This conditional statement is provided inside expr() as an SQL Expression.

Syntax

dataframe.withColumn("column", expr("CASE WHEN condition1  THEN  action " +"WHEN condition2  THEN action ……………..  ELSE action  END"))

Where,

  1. column is the new column that is created; values are assigned to it based on the conditions
  2. expr() takes the CASE WHEN conditional logic

Example
In this example, we are creating a new column, "upgraded Height", and assigning values to it based on the following conditions applied to the height column.

If the height is greater than 5, assign 10 in the "upgraded Height" column for that row.

If the height is less than 5, assign 5 in the "upgraded Height" column for that row.

Otherwise, assign 7 in the "upgraded Height" column for that row.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the expr function
from pyspark.sql.functions import expr

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#if height is greater than 5, assign 10 in the upgraded Height column
#if height is less than 5, assign 5 in the upgraded Height column
#otherwise, assign 7 in the upgraded Height column
df.withColumn("upgraded Height", expr("CASE WHEN height > 5 THEN 10 WHEN height < 5 THEN 5 ELSE 7 END")).show()

Output

We can see that values are allocated in the "upgraded Height" column based on the conditions.
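The same CASE WHEN expression also works inside select(). As an illustrative sketch (the "category" column name and the string labels are assumptions, not part of the example above), each row can be labelled instead of assigned a number:

#label each row based on the height column
df.select(df.height, expr("CASE WHEN height > 5 THEN 'tall' ELSE 'short' END as category")).show()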

Scenario 3: Perform Arithmetic Operations

It is possible to perform arithmetic operations like addition, subtraction, multiplication, and division with existing columns using expr().

Here, expr() can be used with the select() method to display the updated columns.

Syntax

dataframe.select(expr("expression"))

The expression used to perform an arithmetic operation takes the form:

"column arithmetic_operator value as new_column"

Here, column is the name of the column on which the arithmetic operation is performed; value is the number applied to each row in the column; and new_column is the name of the column that stores the result.

Example
In this example, we will perform two arithmetic operations – addition and subtraction – on the age column.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the expr function
from pyspark.sql.functions import expr

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#add 10 to the age column
df.select(df.age, expr("age + 10 as age_after_10")).show()

#subtract 10 from the age column
df.select(df.age, expr("age - 10 as age_before_10")).show()

Output

In the addition operation, we add 10 to the age column and store the result in the "age_after_10" column.

In the subtraction operation, we subtract 10 from the age column and store the result in the "age_before_10" column.
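Several expr() expressions can also be combined in a single select(). A minimal sketch (the new column names double_age and half_weight are just illustrative):

#perform several arithmetic operations in one select()
df.select(expr("age * 2 as double_age"), expr("weight / 2 as half_weight")).show()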

Conclusion

In this PySpark tutorial, we saw how to use expr() in three different scenarios: concatenating multiple columns, adding a new column based on conditions, and performing arithmetic operations on existing columns.
