Apache Spark

PySpark – select clause

In Python, PySpark is a Spark module that provides a similar kind of processing to Spark, using DataFrames.

select() in PySpark is used to select columns from a DataFrame; it returns a new DataFrame containing only those columns.

We can select columns in several ways; let's discuss them one by one. Before that, we have to create a PySpark DataFrame for demonstration.

Example:

We will create a dataframe with 5 rows and 6 columns and display it using the show() method.

# import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display dataframe
df.show()

Output:

(The show() method displays the DataFrame with all 5 rows and 6 columns.)

Method 1: Using column names

Here we pass the column names directly to the select() method as strings. It returns the data present in those columns; we can give multiple columns at once.

Syntax:

dataframe.select("column_name", …)

Example:

In this example, we are going to select the name and address columns from the PySpark DataFrame and display them using the collect() method.

# import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display name and address columns
df.select("name","address").collect()

Output:

[Row(name='sravan', address='guntur'),
Row(name='ojaswi', address='hyd'),
Row(name='gnanesh chowdary', address='patna'),
Row(name='rohith', address='hyd'),
Row(name='sridevi', address='hyd')]

Method 2: Using column names with the DataFrame

Here we pass the columns to the select() method as attributes of the DataFrame (dataframe.column_name). It returns the data present in those columns; we can give multiple columns at once.

Syntax:

dataframe.select(dataframe.column_name, …)

Example:

In this example, we are going to select the name and address columns from the PySpark DataFrame and display them using the collect() method.

# import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display name and address columns
df.select(df.name,df.address).collect()

Output:

[Row(name='sravan', address='guntur'),
Row(name='ojaswi', address='hyd'),
Row(name='gnanesh chowdary', address='patna'),
Row(name='rohith', address='hyd'),
Row(name='sridevi', address='hyd')]

Method 3: Using the [] operator

Here we pass the columns to the select() method using the [] operator on the DataFrame (dataframe["column_name"]). It returns the data present in those columns; we can give multiple columns at once.

Syntax:

dataframe.select(dataframe["column_name"], …)

Example:

In this example, we are going to select the name and address columns from the PySpark DataFrame and display them using the collect() method.

# import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display name and address columns
df.select(df["name"],df["address"]).collect()

Output:

[Row(name='sravan', address='guntur'),
Row(name='ojaswi', address='hyd'),
Row(name='gnanesh chowdary', address='patna'),
Row(name='rohith', address='hyd'),
Row(name='sridevi', address='hyd')]

Method 4: Using the col() function

Here we pass the column names inside the col() function to the select() method. This function is available in pyspark.sql.functions and returns the data present in those columns; we can give multiple columns at a time inside the select() method.

Syntax:

dataframe.select(col("column_name"), …)

Example:

In this example, we are going to select the name and address columns from the PySpark DataFrame and display them using the collect() method.

# import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
# import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display name and address columns
#with col function
df.select(col("name"),col("address")).collect()

Output:

[Row(name='sravan', address='guntur'),
Row(name='ojaswi', address='hyd'),
Row(name='gnanesh chowdary', address='patna'),
Row(name='rohith', address='hyd'),
Row(name='sridevi', address='hyd')]

Conclusion

In this article, we discussed how to select data from a PySpark DataFrame, covering four ways to do it: by column name, via DataFrame attributes, with the [] operator, and with the col() function, displaying the results with the collect() method.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.