Apache Spark

PySpark – lit()

In Python, PySpark is a Spark module that provides Spark-like processing through DataFrames. The lit() function is used to create a new column by adding constant values to that column in a PySpark DataFrame. Before moving to the syntax, we will create a PySpark DataFrame.

Example:

Here, we are going to create a PySpark DataFrame with 5 rows and 6 columns.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
 {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
 {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
 {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
 {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display the dataframe
df.show()

Output:

lit() – Syntax

lit("value").alias("column_name")

Where,

  1. column_name is the name of the new column.
  2. value is the constant value added to the new column.

We have to import this method from the pyspark.sql.functions module.

Note: We can add multiple columns at a time.

The lit() method is used within the select() method.

select() is used to display the columns from the DataFrame, and along with them we can add one or more new columns using the lit() method.

Syntax:

dataframe.select(col("column"), ..., lit("value").alias("new_column"))

Where,

  1. column is the existing column name to display.
  2. new_column is the new column name to be added.
  3. value is the constant value added to the new column.

Example 1:

In this example, we are going to add a new column named PinCode, fill it with the constant value 522112, and select the rollno column along with PinCode from the PySpark DataFrame.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col,lit function
from pyspark.sql.functions import col,lit

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
 {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
 {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
 {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
 {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# add a new column PinCode with Constant value - 522112
final = df.select(col("rollno"),lit("522112").alias("PinCode"))

#display the final dataframe
final.show()

Output:

Example 2:

In this example, we are going to add two new columns named PinCode and City, fill them with the constant values 522112 and Guntur, and select the rollno column along with PinCode and City from the PySpark DataFrame.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col,lit function
from pyspark.sql.functions import col,lit

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
 {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
 {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
 {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
 {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# add new columns: PinCode with constant value - 522112
# and City with constant value - Guntur
final = df.select(col("rollno"),lit("522112").alias("PinCode"),lit("Guntur").alias("City"))

#display the final dataframe
final.show()

Output:

We can also fill the new column with values from an existing column. We just need to pass the column inside lit() as the value parameter.

Syntax:

dataframe.select(col("column"), ..., lit(dataframe.column).alias("new_column"))

Where,

  1. dataframe is the input PySpark DataFrame.
  2. column is the existing column name to display.
  3. new_column is the new column name to be added.
  4. dataframe.column is the existing column whose values are added to the new column.

Example:

In this example, we are going to add a column named “PinCode City” and assign it the values from the address column.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col,lit function
from pyspark.sql.functions import col,lit

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
 {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
 {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
 {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
 {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# add a new column "PinCode City" from the address column
final = df.select(col("rollno"),lit(df.address).alias("PinCode City"))

#display the final dataframe
final.show()

Output:

We can also add an existing column's values through the column index; column indexing starts at 0.
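Note that when a DataFrame is built from a list of dictionaries, as above, PySpark sorts the keys alphabetically while inferring the schema, so a column's index may not match the order in which the keys were written. A quick way to confirm an index is to inspect the column names; the sketch below mimics that sorting in plain Python using the column names from our example:

```python
# Schema inference from dicts sorts the keys alphabetically,
# so the column order is not the order the keys were written in.
columns = sorted(['rollno', 'name', 'age', 'height', 'weight', 'address'])
print(columns)  # ['address', 'age', 'height', 'name', 'rollno', 'weight']
print(columns.index('address'))  # 0 -- so df[0] refers to the address column
```

On a real DataFrame, df.columns returns the same ordered list, so df.columns.index('address') gives the index to use with df[index].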

Example:

In this example, we are going to add a column named “PinCode City” and assign it the values from the address column, referenced through the column index. Because the DataFrame is built from dictionaries, PySpark sorts the column names alphabetically when inferring the schema (address, age, height, name, rollno, weight), so the address column is at index 0.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col,lit function
from pyspark.sql.functions import col,lit

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
 {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
 {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
 {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
 {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# add a new column "PinCode City" from the address column (index 0)
final = df.select(col("rollno"),lit(df[0]).alias("PinCode City"))

#display the final dataframe
final.show()

Output:

Conclusion

In this tutorial, we discussed the lit() method for creating a new column with constant values. We can also assign values from an existing column by specifying that column in place of the value parameter, either by column name or by column index.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.