Apache Spark

PySpark – withColumn method

In Python, PySpark is the Spark module that provides Spark-like processing on DataFrames.

The withColumn() method in PySpark is used to perform operations on PySpark DataFrame columns. These operations include:

  1. Change the data type of the column
  2. Modify the values in the column
  3. Add a new column from the existing column

Before moving to the methods, we will create a PySpark DataFrame.

Example:

Here we will create a PySpark dataframe with 5 rows and 6 columns.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display the dataframe
df.show()

Output:

(Screenshot: the DataFrame displayed by df.show().)

Change the data type of the column

We can change the data type of a particular column using the withColumn() method.

Syntax:

Dataframe.withColumn("column_name",col("column_name ").cast("datatype"))

Parameters:

1. column_name is the column whose data type is changed

2. The col() function is used to get the column

3. cast() changes the column data type from one type to another by accepting the data type name as a parameter. The data types include String, Integer, and Double (a variant that passes a DataType object instead of a string name is sketched right after this list).
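
As a minimal sketch, cast() also accepts a DataType object from pyspark.sql.types in place of the type-name string; the snippet below assumes the students DataFrame (df) created above.

#minimal sketch: casting with DataType objects instead of string names
#assumes the students DataFrame (df) created above
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import col

#cast height to integer using IntegerType()
df_int = df.withColumn("height", col("height").cast(IntegerType()))

#cast age to string using StringType()
df_str = df.withColumn("age", col("age").cast(StringType()))

df_int.printSchema()
df_str.printSchema()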

Example:

In this example, the height column is of float data type. We change it to Integer using the above method, display the schema with the printSchema() method, and display the DataFrame with the collect() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#convert float type to integer type of height column
df=df.withColumn("height",col("height").cast("Integer"))

#display modified dataframe
print(df.collect())

#let's display the schema
df.printSchema()

Output:

[Row(address='guntur', age=23, height=5, name='sravan', rollno='001', weight=67), Row(address='hyd', age=16, height=3, name='ojaswi', rollno='002', weight=34), Row(address='patna', age=7, height=2, name='gnanesh chowdary', rollno='003', weight=17), Row(address='hyd', age=9, height=3, name='rohith', rollno='004', weight=28), Row(address='hyd', age=37, height=5, name='sridevi', rollno='005', weight=54)]

root

|-- address: string (nullable = true)

|-- age: long (nullable = true)

|-- height: integer (nullable = true)

|-- name: string (nullable = true)

|-- rollno: string (nullable = true)

|-- weight: long (nullable = true)

Modify the values in the column

We can modify the values of a particular column using the withColumn() method.

Syntax:

Dataframe.withColumn("column_name",col("column_name "))

Parameters:

1. column_name is the column whose values are modified

2. The col() function is used to get the column so that an expression can be applied to its values, as in the sketch after this list
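
Any arithmetic expression built with col() can be used here; the minimal sketch below, assuming the students DataFrame (df) created above, modifies column values with multiplication and division.

#minimal sketch: arithmetic modifications on column values
#assumes the students DataFrame (df) created above
from pyspark.sql.functions import col

#double each value in the weight column
df_double = df.withColumn("weight", col("weight") * 2)

#halve each value in the age column (the result becomes a double)
df_half = df.withColumn("age", col("age") / 2)

print(df_double.collect())
print(df_half.collect())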

Example:

In this example, we will subtract 10 from each value in the weight column.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#decrease each value in weight column by 10
df=df.withColumn("weight",col("weight")-10)

#display modified dataframe
print(df.collect())

#let's display the schema
df.printSchema()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=57), Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=24), Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=7), Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=18), Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=44)]

root

|-- address: string (nullable = true)

|-- age: long (nullable = true)

|-- height: double (nullable = true)

|-- name: string (nullable = true)

|-- rollno: string (nullable = true)

|-- weight: long (nullable = true)

Add a new column from the existing column

We can add a new column from an existing column using the withColumn() method.

Syntax:

Dataframe.withColumn("new_column ",col("column_name "))

Parameters:

1. new_column is the name of the new column to be added

2. The col() function is used to get the values of an existing column that populate new_column, as in the sketch after this list
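
The column expression can also combine more than one existing column; the minimal sketch below, assuming the students DataFrame (df) created above and using the made-up column name "ratio", derives a new column from weight and height.

#minimal sketch: deriving a new column from two existing columns
#assumes the students DataFrame (df) created above; "ratio" is a made-up column name
from pyspark.sql.functions import col

#new column holding weight divided by height
df_new = df.withColumn("ratio", col("weight") / col("height"))

print(df_new.collect())
df_new.printSchema()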

Example:

This example will create a new column, "Power", and fill it by multiplying each value in the weight column by 2.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#add a column named Power
#computed from the weight column multiplied by 2
df = df.withColumn("Power", col("weight") * 2)

#display modified dataframe
print(df.collect())

#let's display the schema
df.printSchema()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67, Power=134), Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34, Power=68), Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17, Power=34), Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28, Power=56), Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54, Power=108)]

root

|-- address: string (nullable = true)

|-- age: long (nullable = true)

|-- height: double (nullable = true)

|-- name: string (nullable = true)

|-- rollno: string (nullable = true)

|-- weight: long (nullable = true)

|-- Power: long (nullable = true)

Conclusion:

This article discussed how to change the data types of columns, modify the values in existing columns, and add a new column using the withColumn() method.
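
Since withColumn() returns a new DataFrame, the three operations can also be chained into a single expression; the minimal sketch below reuses the students DataFrame (df) from the examples above.

#minimal sketch: chaining the three withColumn() operations from this article
#assumes the students DataFrame (df) created in the examples above
from pyspark.sql.functions import col

df_all = (df.withColumn("height", col("height").cast("Integer"))  #change the data type
            .withColumn("weight", col("weight") - 10)             #modify the values
            .withColumn("Power", col("weight") * 2))              #add a new column
#note: Power is computed from the already modified weight values (weight - 10)

df_all.printSchema()
print(df_all.collect())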
