The select() method in PySpark is used to select columns from a DataFrame. Columns can be selected in several ways; let's discuss them one by one. Before that, we have to create a PySpark DataFrame for demonstration.
Example:
We will create a DataFrame with 5 rows and 6 columns and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display dataframe
df.show()
Output:
Method 1: Using column names
Here we pass the column names directly to the select() method. It returns the data present in those columns; we can pass multiple columns at once.
Syntax:
dataframe.select("column_name1", "column_name2", ...)
Example:
In this example, we select the name and address columns from the PySpark DataFrame and display them using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display name and address columns
df.select("name","address").collect()
Output:
[Row(name='sravan', address='guntur'),
Row(name='ojaswi', address='hyd'),
Row(name='gnanesh chowdary', address='patna'),
Row(name='rohith', address='hyd'),
Row(name='sridevi', address='hyd')]
Method 2: Using column names with the DataFrame
Here we pass the columns as attributes of the DataFrame to the select() method. It returns the data present in those columns; we can pass multiple columns at once.
Syntax:
dataframe.select(dataframe.column_name1, dataframe.column_name2, ...)
Example:
In this example, we select the name and address columns from the PySpark DataFrame and display them using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display name and address columns
df.select(df.name,df.address).collect()
Output:
[Row(name='sravan', address='guntur'),
Row(name='ojaswi', address='hyd'),
Row(name='gnanesh chowdary', address='patna'),
Row(name='rohith', address='hyd'),
Row(name='sridevi', address='hyd')]
Method 3: Using the [] operator
Here we pass column names to the select() method using the [] operator on the DataFrame. It returns the data present in those columns; we can pass multiple columns at once.
Syntax:
dataframe.select(dataframe["column_name1"], dataframe["column_name2"], ...)
Example:
In this example, we select the name and address columns from the PySpark DataFrame and display them using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display name and address columns
df.select(df["name"],df["address"]).collect()
Output:
[Row(name='sravan', address='guntur'),
Row(name='ojaswi', address='hyd'),
Row(name='gnanesh chowdary', address='patna'),
Row(name='rohith', address='hyd'),
Row(name='sridevi', address='hyd')]
Method 4: Using the col function
Here we pass column names wrapped in the col() function to the select() method. col() is available in the pyspark.sql.functions module and returns a Column referring to the named column; we can pass multiple columns at a time to select().
Syntax:
dataframe.select(col("column_name1"), col("column_name2"), ...)
Example:
In this example, we select the name and address columns from the PySpark DataFrame and display them using the collect() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display name and address columns
#with col function
df.select(col("name"),col("address")).collect()
Output:
[Row(name='sravan', address='guntur'),
Row(name='ojaswi', address='hyd'),
Row(name='gnanesh chowdary', address='patna'),
Row(name='rohith', address='hyd'),
Row(name='sridevi', address='hyd')]
Conclusion
In this article, we discussed how to select data from a DataFrame and covered four ways to select it by column name, retrieving the results with the collect() method.