
PySpark translate() & overlay()

In Python, PySpark is a Spark module that provides Spark-like processing using DataFrames. We will discuss two PySpark functions: translate() and overlay(). The translate and overlay functions are used to modify strings in DataFrame columns with new content.

Let’s discuss them one by one. Before that, we have to create a PySpark DataFrame for demonstration.

Example

We are going to create a dataframe with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display dataframe
df.show()

Output:

PySpark translate()

translate() is used to replace strings in a PySpark DataFrame column character by character. We have to specify the characters in a string that should be replaced with other characters. It takes three parameters.

Syntax:

translate(column,’actual_characters’,’replacing_characters’)

Where,

  1. column is the name of the column in which the characters are replaced.
  2. actual_characters are the characters present in the strings of the given column.
  3. replacing_characters are the characters that replace the actual_characters one by one.

Note – The number of characters in actual_characters must be equal to the number of characters in replacing_characters.

translate() can be used with the withColumn() method.

Overall Syntax:

dataframe.withColumn(column, translate(column,’actual_characters’,’replacing_characters’))
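
For example, here is a minimal sketch of translate() on a tiny DataFrame (the demo DataFrame and its city column are only illustrative, and the spark_app session created above is assumed):

#a minimal illustrative sketch of translate()
from pyspark.sql.functions import translate

#a tiny DataFrame with one string column named 'city' (illustrative data)
demo = spark_app.createDataFrame([('guntur',), ('hyd',)], ['city'])

#replace the vowels a, e, i, o, u with 1, 2, 3, 4, 5 character by character
#'guntur' becomes 'g5nt5r' and 'hyd' is unchanged because it contains no vowels
demo.withColumn('city', translate('city', 'aeiou', '12345')).show()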

Example 1

In this example, we are translating the characters – gunhy to @$%^& in the address column.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import translate from pyspark.sql.functions
from pyspark.sql.functions import translate

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#translate the characters - gunhy to @$%^&
df.withColumn('address', translate('address', 'gunhy', '@$%^&')).show()

Output:

We can see that in the address column, the following replacements were made:

  1. g is translated to @
  2. u is translated to $
  3. n is translated to %
  4. h is translated to ^
  5. y is translated to &
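
As a quick check, the same mapping can be applied to a single literal value (a minimal sketch using lit() and the df created above):

#verify the mapping on one literal value
from pyspark.sql.functions import lit, translate

#'guntur' becomes '@$%t$r' because g->@, u->$ and n->%, while t and r are unchanged
df.select(translate(lit('guntur'), 'gunhy', '@$%^&')).show()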

Example 2

In this example, we are translating the characters – jaswi to 56434 in the name column.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import translate from pyspark.sql.functions
from pyspark.sql.functions import translate

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#translate the characters - jaswi to 56434 in name column
df.withColumn('name', translate('name', 'jaswi', '56434')).show()

Output:

We can see that in the name column, the following replacements were made:

  1. j is translated to 5
  2. a is translated to 6
  3. s is translated to 4
  4. w is translated to 3
  5. i is translated to 4
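
The same quick check works here (a minimal sketch using lit() and the df created above):

#verify the mapping on one literal value
from pyspark.sql.functions import lit, translate

#'ojaswi' becomes 'o56434' because j->5, a->6, s->4, w->3 and i->4, while o is unchanged
df.select(translate(lit('ojaswi'), 'jaswi', '56434')).show()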

PySpark overlay()

overlay() is used to replace the values in a given column with the values from another column. It takes three parameters and can be used with a select clause.

Syntax:

overlay(replaced_column,replacing_column,position)

Where,

  1. replaced_column is the column in which values are replaced.
  2. replacing_column is the column whose values replace the values in replaced_column.
  3. position specifies the position (starting from 1) in replaced_column at which the values from replacing_column start overwriting the replaced_column values.

Note – Only as many characters as are present in the replacing_column value are overwritten; any remaining characters of the replaced_column value appear unchanged after the replaced portion.

Overall Syntax:

dataframe.select(overlay(replaced_column,replacing_column,position))
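
For example, here is a minimal sketch of overlay() on a tiny DataFrame (the demo DataFrame and its text and word columns are only illustrative, and the spark_app session created above is assumed). In Spark 3.0 and later, overlay() also accepts an optional fourth argument, len, which controls how many characters of replaced_column are overwritten:

#a minimal illustrative sketch of overlay()
from pyspark.sql.functions import overlay

#a tiny DataFrame with two string columns (illustrative data)
demo = spark_app.createDataFrame([('hello world', 'SPARK')], ['text', 'word'])

#overwrite 'text' with 'word' from position 7: 'hello world' becomes 'hello SPARK'
demo.select(overlay('text', 'word', 7)).show()

#with len=0 nothing is overwritten, so 'word' is inserted: 'hello SPARKworld'
demo.select(overlay('text', 'word', 7, 0)).show()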

Example

In this example, we will replace values in

  1. name column with the age column, starting from the 4th character of the name value
  2. rollno column with the name column, starting from the 2nd character

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import overlay from pyspark.sql.functions
from pyspark.sql.functions import overlay

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#replace values in the name column with age from the 4th character
df.select(overlay("name", "age", 4)).show()

#replace values in rollno column with name from 2nd character
df.select(overlay("rollno", "name", 2)).show()

Output:

From this output,

  1. the characters of the age values replace the characters of the name values from the 4th position, and the remaining characters of the name values stay the same.
  2. the characters of the name values replace the characters of the rollno values from the 2nd position; since the rollno values are shorter than the name values, no original rollno characters remain after the replaced portion, so the name values occupy the rest of the result.
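
As a quick check, the same behaviour can be reproduced on literal values taken from the first row (a minimal sketch using lit() and the df created above):

#verify the behaviour on literal values from the first row
from pyspark.sql.functions import lit, overlay

#'sravan' overlaid with '23' from the 4th character gives 'sra23n'
df.select(overlay(lit('sravan'), lit('23'), 4)).show()

#'001' overlaid with 'sravan' from the 2nd character gives '0sravan'
df.select(overlay(lit('001'), lit('sravan'), 2)).show()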

Conclusion

In this tutorial, we saw how to replace strings in DataFrame columns using the translate() and overlay() functions, with simple examples. translate() replaces strings in a PySpark DataFrame column character by character; we specify the characters in a string that should be replaced with other characters. overlay() replaces the values in a given column with the values from another column, starting at a given position.
