
PySpark Substring() Method

PySpark is a general-purpose, in-memory, distributed processing engine that lets you handle data across several machines efficiently. One of its popular methods is substring(), which extracts a substring from a given string column. Let us dive in and learn more about this library and its substring() method.

What is PySpark?

PySpark is one of Spark's supported language APIs. Spark is a big data processing technology that can handle data on a petabyte scale. Using PySpark, you can develop Spark applications to process data and run them on the Spark platform. AWS, for example, offers Spark through its managed EMR service: you can establish an EMR cluster on AWS and use PySpark to process the data. PySpark can read data from CSV, Parquet, and JSON files as well as from databases.
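As a minimal sketch of what such reads look like (the file paths, JDBC URL, table name, and credentials below are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadDemo").getOrCreate()

# Read a CSV file with a header row, letting Spark infer column types
csv_df = spark.read.csv("data/events.csv", header=True, inferSchema=True)

# Parquet and JSON readers follow the same pattern
parquet_df = spark.read.parquet("data/events.parquet")
json_df = spark.read.json("data/events.json")

# Databases are read over JDBC; URL, table, and credentials are placeholders
jdbc_df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "public.events") \
    .option("user", "user") \
    .option("password", "password") \
    .load()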

Because Spark is mostly implemented in Scala, writing Spark applications in Scala or Java gives you access to more of its features than writing them in Python or R; for example, PySpark does not currently support the Dataset API. For someone pursuing data science, however, PySpark is a better option than Scala because many popular data science libraries, such as NumPy, TensorFlow, and Scikit-learn, are written in Python. For smaller datasets, Pandas is typically used, whereas PySpark is employed for larger ones.

In comparison to PySpark, Pandas gives faster results on data that fits on a single machine, since it avoids the overhead of distributed execution. Depending on the memory limitations and the size of the data, you can choose between PySpark and Pandas for the best performance: prefer Pandas whenever the data is small enough to fit into memory.
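Since the two are often used side by side, PySpark can convert DataFrames in both directions; a small sketch:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ConversionDemo").getOrCreate()

# A small in-memory Pandas DataFrame
pdf = pd.DataFrame({"id": [1, 2], "date": ["20210828", "20190725"]})

# Promote it to a distributed PySpark DataFrame when it outgrows one machine
sdf = spark.createDataFrame(pdf)

# Collect a (small!) PySpark DataFrame back into Pandas for local analysis
pdf_again = sdf.toPandas()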

The Resilient Distributed Dataset (RDD) is the underlying mechanism behind Spark data. The data is resilient, which means that if a machine holding the data fails, the data is replicated elsewhere and can be restored. Distributed means that the data is split among 'N' machines, allowing you to speed up a process while also handling massive amounts of data. One of the ramifications of distributed computing is that the data must be synchronized with extreme caution. Spark therefore demands functional programming, which means that the functions must not have any side effects, preventing many of these concerns. As a result, if you wish to alter a table, you must create a new table instead.
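You can see this immutability directly in the DataFrame API: transformations such as withColumn() leave the original DataFrame untouched and return a new one. A quick illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import upper

spark = SparkSession.builder.appName("ImmutabilityDemo").getOrCreate()
df = spark.createDataFrame([(1, "spark")], ["id", "name"])

# withColumn() returns a *new* DataFrame; df itself is unchanged
df2 = df.withColumn("name_upper", upper("name"))

print(df.columns)   # ['id', 'name']
print(df2.columns)  # ['id', 'name', 'name_upper']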

Many programmers are unfamiliar with the concept of functional programming, and PySpark does not do a good job of making the RDD transparent; the API picks up some of the unpleasantness of the RDD environment. Functional programming, for example, means that a function cannot have any side effects, since side effects make it much harder to keep distributed data consistent. Another example is "lazy" evaluation, which allows Spark to wait until it has a comprehensive picture of what you are trying to achieve before optimizing the processing. Spark has quickly become the industry's preferred technology for data processing. It is, however, not the first: before Spark, the dominant processing engine was MapReduce. Spark is widely used in industry on distributed storage and cluster systems like Hadoop, Mesos, and the cloud, so it is important to understand those systems and how they operate.
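Lazy evaluation is easy to observe in practice: transformations such as filter() merely record a plan, and nothing executes until an action such as count() or show() is called. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDemo").getOrCreate()
df = spark.range(1000000)  # a single-column DataFrame of ids 0..999999

# A transformation: builds an execution plan, processes no data yet
evens = df.filter(df.id % 2 == 0)

# An action: triggers execution of the whole optimized plan
print(evens.count())  # 500000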

What is the Substring() Method in PySpark?

The substring() method in PySpark extracts a substring from a string-type DataFrame column by specifying the substring's starting position and length.

SQL Function Substring()

We can get a substring of a string column using the substring() function of the pyspark.sql.functions module by supplying the starting position and the length of the slice we wish to take. Note that the position is 1-based, not 0-based. The signature is:

substring(str, pos, len)

Here is an example of using this method:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring

spark = SparkSession.builder.appName("Demo").getOrCreate()

data = [(1, "20210828"), (2, "20190725")]
columns = ["id", "date"]
df = spark.createDataFrame(data, columns)

# Assign the result back: withColumn() returns a new DataFrame
df = df.withColumn('year', substring('date', 1, 4)) \
    .withColumn('month', substring('date', 5, 2)) \
    .withColumn('day', substring('date', 7, 2))

df.printSchema()
df.show(truncate=False)
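Given the sample data above, df.show(truncate=False) should print output along these lines:

+---+--------+----+-----+---+
|id |date    |year|month|day|
+---+--------+----+-----+---+
|1  |20210828|2021|08   |28 |
|2  |20190725|2019|07   |25 |
+---+--------+----+-----+---+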

Using Substring() with Select()

Using select() in PySpark, we can also extract the substrings of a column:

df.select('date', substring('date', 1, 4).alias('year'), \
    substring('date', 5, 2).alias('month'), \
    substring('date', 7, 2).alias('day')).show(truncate=False)

Using Substring() with SelectExpr()

Here is an example of using the selectExpr() method, which accepts SQL expressions, to get the year, month, and day as substrings of the date column:

df.selectExpr('date', 'substring(date, 1, 4) as year', \
    'substring(date, 5, 2) as month', \
    'substring(date, 7, 2) as day').show(truncate=False)
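Because selectExpr() accepts plain SQL expressions, the same result can also be produced with spark.sql() on a temporary view; a sketch (the view name dates is arbitrary):

df.createOrReplaceTempView("dates")
spark.sql("""
    SELECT date,
           substring(date, 1, 4) AS year,
           substring(date, 5, 2) AS month,
           substring(date, 7, 2) AS day
    FROM dates
""").show(truncate=False)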

Using Substr() from Column Type

We can also get the substring using the substr() method of the pyspark.sql.Column type in PySpark:

df3 = df.withColumn('year', col('date').substr(1, 4)) \
    .withColumn('month', col('date').substr(5, 2)) \
    .withColumn('day', col('date').substr(7, 2))
df3.show(truncate=False)
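Unlike the SQL substring() function, Column.substr() also accepts Column expressions for the start position and length, which allows positions computed per row (when you do this, both arguments must be Columns). A small sketch that takes the last two characters of each date:

from pyspark.sql.functions import length, lit

# substr(length - 1, 2) picks out the final two characters of each value
df.withColumn('day', col('date').substr(length('date') - 1, lit(2))).show(truncate=False)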

Putting It Together

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring

spark = SparkSession.builder.appName("Demo").getOrCreate()

data = [(1, "20210828"), (2, "20190725")]
columns = ["id", "date"]
df = spark.createDataFrame(data, columns)

# substring() from pyspark.sql.functions; assign back, DataFrames are immutable
df = df.withColumn('year', substring('date', 1, 4)) \
    .withColumn('month', substring('date', 5, 2)) \
    .withColumn('day', substring('date', 7, 2))
df.printSchema()
df.show(truncate=False)

# substring() inside select() with aliases
df.select('date', substring('date', 1, 4).alias('year'), \
    substring('date', 5, 2).alias('month'), \
    substring('date', 7, 2).alias('day')).show(truncate=False)

# substring as a SQL expression inside selectExpr()
df.selectExpr('date', 'substring(date, 1, 4) as year', \
    'substring(date, 5, 2) as month', \
    'substring(date, 7, 2) as day').show(truncate=False)

# substr() from the Column type
df3 = df.withColumn('year', col('date').substr(1, 4)) \
    .withColumn('month', col('date').substr(5, 2)) \
    .withColumn('day', col('date').substr(7, 2))
df3.show(truncate=False)

Conclusion

We discussed PySpark, a big data processing system capable of handling petabytes of data, and walked through its substring() method with a few examples.
