withColumn() in PySpark is used to do the operations on the PySpark dataframe columns. The Operations includes
- Change the data type of the column
- Modify the values in the column
- Add a new column from the existing column
Before moving to the methods, we will create PySpark DataFrame
Example:
Here we will create a PySpark dataframe with 5 rows and 6 columns.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#display the dataframe
df.show()
Output:
Change the data type of the column
We can change the data type of a particular column using the withColumn() method.
Syntax:
Parameters:
1. column_name is the column whose data type is changed
2. col() function is used to get the column name
3. cast() is used to change the column datatype from one type to another, by accepting the datatype name as a parameter. The data types include String, Integer,Double.
Example:
In this example, the height is of float data type. we can change it to Integer by using the above method and displaying the schema using the printSchema() method and dataframe using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#convert float type to integer type of height column
df=df.withColumn("height",col("height").cast("Integer"))
#display modified dataframe
print(df.collect())
#llets display the schema
df.printSchema()
Output:
root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- height: integer (nullable = true)
|-- name: string (nullable = true)
|-- rollno: string (nullable = true)
|-- weight: long (nullable = true)
Modify the values in the column
We can modify the values of a particular column using the withColumn() method.
Syntax:
Parameters:
1. column_name is the column whose data type is changed
2. col() function is used to change the values in the column name
Example:
In this example, we will subtract each value in the weight column by 10.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#decrease each value in weight column by 10
df=df.withColumn("weight",col("weight")-10)
#display modified dataframe
print(df.collect())
#llets display the schema
df.printSchema()
Output:
root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- height: double (nullable = true)
|-- name: string (nullable = true)
|-- rollno: string (nullable = true)
|-- weight: long (nullable = true)
Add a new column from the existing column
We can add a new column from an existing column using the withColumn() method.
Syntax:
Parameters:
1. new_column is the column
2. col() function is used to add its column values to the new_column
Example:
This example will create a new column – “Power” and add values to this column, multiplying each value in the weight column by 10.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#Add column named Power
#from the weight column multiplied by 2
df=df.withColumn("Power",col("weight")* 2)
#display modified dataframe
print(df.collect())
#llets display the schema
df.printSchema()
Output:
root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- height: double (nullable = true)
|-- name: string (nullable = true)
|-- rollno: string (nullable = true)
|-- weight: long (nullable = true)
|-- Power: long (nullable = true)
Conclusion:
This article discussed how to change the data types, modify the values in the existing columns, and add a new column using the withcolumn() method.