
PySpark – drop(), Drop Column

In Python, PySpark is a Spark module that provides DataFrame-based data processing similar to Spark itself. The drop() method in PySpark removes columns from a DataFrame, and it can remove more than one column at a time. We can drop columns from a DataFrame in three ways. Before that, we have to create a PySpark DataFrame for demonstration.

Example:

We will create a dataframe with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23,
     'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16,
     'height': 3.79, 'weight': 34, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
     'height': 2.79, 'weight': 17, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9,
     'height': 3.69, 'weight': 28, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37,
     'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display dataframe
df.show()

Output:

The show() method displays the dataframe as a table with the columns address, age, height, name, rollno, and weight, one row per student.

Now, display the dataframe schema using the printSchema() method to check the columns before removing any of them.

This method returns the column names along with their data types.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23,
     'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16,
     'height': 3.79, 'weight': 34, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
     'height': 2.79, 'weight': 17, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9,
     'height': 3.69, 'weight': 28, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37,
     'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display the schema
df.printSchema()

Output:

root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- height: double (nullable = true)
|-- name: string (nullable = true)
|-- rollno: string (nullable = true)
|-- weight: long (nullable = true)
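
As a quick aside (not shown in the original example), the column names and data types can also be read directly as Python values from the df.columns and df.dtypes attributes:

#list of column names
print(df.columns)

#list of (column name, data type) tuples
print(df.dtypes)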

Method 1: Drop a single column

We can remove a single column at a time by passing the column name to the drop() function.

Syntax:

df.drop('column_name')

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to be dropped.
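
Two behaviors of drop() are worth noting (they are not shown in the original example): it does not modify the DataFrame in place but returns a new DataFrame, which is why the examples reassign the result back to df, and it silently ignores a column name that is not present in the schema. A minimal sketch, reusing the students dataframe created above; the column name 'salary' is just a made-up name that does not exist in the dataframe:

#drop() returns a new dataframe; the original df keeps the name column
df_without_name = df.drop('name')
print(df.columns)
print(df_without_name.columns)

#dropping a column that does not exist is a no-op
print(df.drop('salary').columns)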

Example:

In this example, we will drop the name column and display the resultant dataframe and the schema.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23,
     'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16,
     'height': 3.79, 'weight': 34, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
     'height': 2.79, 'weight': 17, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9,
     'height': 3.69, 'weight': 28, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37,
     'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#drop name column
df=df.drop('name')

#check the dataframe
print(df.collect())

#display the schema
#after removing name column
df.printSchema()

Output:

[Row(address='guntur', age=23, height=5.79, rollno='001', weight=67), Row(address='hyd', age=16, height=3.79, rollno='002', weight=34), Row(address='patna', age=7, height=2.79, rollno='003', weight=17), Row(address='hyd', age=9, height=3.69, rollno='004', weight=28), Row(address='hyd', age=37, height=5.59, rollno='005', weight=54)]

root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- height: double (nullable = true)
|-- rollno: string (nullable = true)
|-- weight: long (nullable = true)

In the above output, we can see that the name column is no longer present in the dataframe.

Method 2: Drop multiple columns

We can also remove multiple columns at a time with drop(). To do this, we place the column names to be removed inside parentheses () and add * before them so that they are unpacked into the drop() function.

Syntax:

df.drop(*('column_name', 'column_name', ..., 'column_name'))

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to be dropped.
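
As an aside (not part of the original example), drop() accepts a variable number of column names, so the names can also be passed directly as separate arguments instead of unpacking a tuple with *:

#passing the column names directly as separate arguments
df2 = df.drop('name', 'height', 'weight')
print(df2.columns)

#equivalent to unpacking a tuple with *
df3 = df.drop(*('name', 'height', 'weight'))
print(df3.columns)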

Example:

In this example, we will drop the name, height, and weight columns and display the resultant dataframe along with the schema.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23,
     'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16,
     'height': 3.79, 'weight': 34, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
     'height': 2.79, 'weight': 17, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9,
     'height': 3.69, 'weight': 28, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37,
     'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#drop name,height and weight column
df=df.drop(*('name','height','weight'))

#check the dataframe
print(df.collect())

#display the schema
#after removing the name, height and weight columns
df.printSchema()

Output:

[Row(address='guntur', age=23, rollno='001'), Row(address='hyd', age=16, rollno='002'), Row(address='patna', age=7, rollno='003'), Row(address='hyd', age=9, rollno='004'), Row(address='hyd', age=37, rollno='005')]

root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- rollno: string (nullable = true)

In the above output, we can see that the name, height, and weight columns are no longer present in the dataframe.

Method 3: Drop multiple columns from a list

We can also remove multiple columns by placing the column names in a list [] and adding * before the list so that the names are unpacked and passed to the drop() function.

Syntax:

df.drop(*list)

Here, the list holds the names of the columns to be removed:

list = ['column_name', 'column_name', ..., 'column_name']

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to be dropped.
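
Building the list programmatically is often convenient. The sketch below (an assumption, not part of the original example) keeps only a chosen set of columns by deriving the list of columns to drop from the dataframe's own df.columns attribute:

#columns we want to keep (a hypothetical selection)
keep = ['rollno', 'name', 'age']

#derive the columns to drop from the dataframe's column list
to_drop = [c for c in df.columns if c not in keep]

#drop everything that is not in the keep list
df_small = df.drop(*to_drop)
print(df_small.columns)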

Example:

In this example, we will drop the name, height, and weight columns passed through list1 and display the resultant dataframe along with the schema.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23,
     'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16,
     'height': 3.79, 'weight': 34, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
     'height': 2.79, 'weight': 17, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9,
     'height': 3.69, 'weight': 28, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37,
     'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#list of columns
list1=['name','height','weight']

#drop columns from the list1
df=df.drop(*list1)

#check the dataframe
print(df.collect())

#display the schema
#after removing the name, height and weight columns
df.printSchema()

Output:

[Row(address='guntur', age=23, rollno='001'), Row(address='hyd', age=16, rollno='002'), Row(address='patna', age=7, rollno='003'), Row(address='hyd', age=9, rollno='004'), Row(address='hyd', age=37, rollno='005')]

root
|-- address: string (nullable = true)
|-- age: long (nullable = true)
|-- rollno: string (nullable = true)

In the above output, we can see that the name, height, and weight columns are no longer present in the dataframe.

Conclusion:

We discussed how to drop columns using the drop() function, and we also saw how to remove multiple columns at a time, either by passing several column names to drop() or by passing a list of columns.
