Example:
We will create a dataframe with 5 rows and 6 columns and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display dataframe
df.show()
Output:
Now, display the dataframe schema using the printSchema() method to check the columns before removing any.
This method returns the column names along with their data types.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display the schema
df.printSchema()
Output:
root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- height: double (nullable = true)
|-- name: string (nullable = true)
|-- rollno: string (nullable = true)
|-- weight: long (nullable = true)
Method 1: Drop a single column
We can remove one column at a time by passing its name to the drop() function.
Syntax:
df.drop('column_name')
Where,
- df is the input PySpark DataFrame
- column_name is the column to be dropped.
Example:
In this example, we will drop the name column and display the resulting dataframe and its schema.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#drop the name column
df = df.drop('name')
#check the dataframe
print(df.collect())
#display the schema
#after removing the name column
df.printSchema()
Output:
root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- height: double (nullable = true)
|-- rollno: string (nullable = true)
|-- weight: long (nullable = true)
In the above example, we can see that the name column is no longer present in the dataframe.
Method 2: Drop multiple columns
To remove multiple columns at once, place * before a tuple of column names inside drop().
Syntax:
df.drop(*('column_name1', 'column_name2', ...))
Where,
- df is the input PySpark DataFrame
- column_name1, column_name2, ... are the columns to be dropped.
Example:
In this example, we will drop the name, height, and weight columns and display the resulting dataframe along with its schema.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#drop the name, height and weight columns
df = df.drop(*('name', 'height', 'weight'))
#check the dataframe
print(df.collect())
#display the schema
#after removing the name, height and weight columns
df.printSchema()
Output:
root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- rollno: string (nullable = true)
In the above example, we can see that the name, height, and weight columns are no longer present in the dataframe.
Method 3: Drop multiple columns from a list
We can also remove multiple columns by collecting their names in a list and unpacking the list with * inside drop().
Syntax:
df.drop(*list)
Here, the list holds the multiple columns to be dropped.
Where,
- df is the input PySpark DataFrame
- list is the list of column names to be dropped.
Example:
In this example, we will drop the name, height, and weight columns through list1 and display the resulting dataframe along with its schema.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#list of the columns to be dropped
list1 = ['name', 'height', 'weight']
#drop the columns in list1
df = df.drop(*list1)
#check the dataframe
print(df.collect())
#display the schema
#after removing the name, height and weight columns
df.printSchema()
Output:
root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- rollno: string (nullable = true)
In the above example, we can see that the name, height, and weight columns are no longer present in the dataframe.
Conclusion:
We discussed how to drop a single column using the drop() function, and how to remove multiple columns at a time by unpacking either several column names or a list of columns with *.