Example
In this example, we will create a PySpark DataFrame with 5 rows and 6 columns and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
# create a SparkSession with the app name linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display dataframe
df.show()
Output:
PySpark – concat()
concat() joins the values of two or more columns in the given PySpark DataFrame into a new column.
By using the select() method, we can view the concatenated column, and by using the alias() method, we can give it a name.
Syntax
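The general form, following the examples below, is:
dataframe.select(concat(dataframe.column1, dataframe.column2, …).alias("new_column")).show()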
where,
- dataframe is the input PySpark DataFrame.
- concat() takes the columns to be concatenated; each column is referenced as dataframe.column.
- new_column is the name given to the concatenated column.
Example 1
In this example, we will concatenate the height and weight columns into a new column named "Body Index". Finally, we will select only this column and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import concat function
from pyspark.sql.functions import concat
# create a SparkSession with the app name linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# concatenate height and weight into a new column named "Body Index"
df.select(concat(df.height, df.weight).alias("Body Index")).show()
Output:
Example 2
In this example, we will concatenate the rollno, name, and address columns into a new column named "Details". Finally, we will select only this column and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import concat function
from pyspark.sql.functions import concat
# create a SparkSession with the app name linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# concatenate rollno, name, and address into a new column named "Details"
df.select(concat(df.rollno, df.name, df.address).alias("Details")).show()
Output:
PySpark – concat_ws()
concat_ws() joins the values of two or more columns in the given PySpark DataFrame into a new column, separating each column's values with the given separator.
By using the select() method, we can view the concatenated column, and by using the alias() method, we can give it a name.
Syntax
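The general form, following the examples below, is:
dataframe.select(concat_ws("separator", dataframe.column1, dataframe.column2, …).alias("new_column")).show()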
where,
- dataframe is the input PySpark DataFrame.
- concat_ws() takes a separator followed by the columns to be concatenated; each column is referenced as dataframe.column.
- new_column is the name given to the concatenated column.
- separator can be any string, such as a space or a special character.
Example 1
In this example, we will concatenate the height and weight columns into a new column named "Body Index", with the values separated by "_". Finally, we will select only this column and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import concat_ws function
from pyspark.sql.functions import concat_ws
# create a SparkSession with the app name linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# concatenate height and weight into a new column named "Body Index", separated by "_"
df.select(concat_ws("_", df.height, df.weight).alias("Body Index")).show()
Output:
Example 2
In this example, we will concatenate the rollno, name, and address columns into a new column named "Details", with the values separated by "***". Finally, we will select only this column and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import concat_ws function
from pyspark.sql.functions import concat_ws
# create a SparkSession with the app name linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# concatenate rollno, name, and address into a new column named "Details", separated by "***"
df.select(concat_ws("***", df.rollno, df.name, df.address).alias("Details")).show()
Output:
Conclusion
We can concatenate two or more columns by using the concat() and concat_ws() methods. The main difference between them is that concat_ws() lets us specify a separator between the concatenated values.
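As a quick side-by-side sketch (reusing the df built in the examples above), the snippet below applies both functions to the same two columns; the column names "joined" and "joined_with_separator" are just illustrative:
#import both functions
from pyspark.sql.functions import concat, concat_ws
# concat() glues the values together directly, e.g. "sravanguntur",
# while concat_ws("-", ...) yields "sravan-guntur"
df.select(concat(df.name, df.address).alias("joined"),
concat_ws("-", df.name, df.address).alias("joined_with_separator")).show()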