Time series analysis is a widely used exploratory data analysis technique that lets us see how data points change over time. Many problems are time series-based, such as ticket sales forecasting and stock price analysis. A time series can exhibit a variety of trends that are hard to tell apart simply by looking at a plot, so clustering the series by trend is a good idea. We’ll look at what a time series is, what clustering is, and how to cluster time series data.
What is a Time Series?
A time series is a collection of data points ordered in time. The data points represent an activity that occurs over a period of time. A common example is the number of shares traded in a given time interval, along with other parameters such as stock prices and their respective trading information at each second. Unlike a continuous-time signal, these data points are recorded at discrete moments in time, so they are usually handled as discrete-time variables. Data for a time series can be collected over any length of time, from a few minutes to several years; there is no lower or upper limit on the collection period. Machine learning and deep learning offer many time series-based prediction problems, such as predicting a company’s stock price, human activity recognition, and flight ticket demand prediction. Good predictions can save money and help companies make careful decisions before investing in something. The example plot below shows the variation of observations with time.
What is Clustering?
Clustering is an unsupervised machine learning technique. In unsupervised learning, conclusions are drawn from data sets that do not have labeled output variables. It’s a type of exploratory data analysis that lets us look at multivariate data sets.
Clustering is a machine learning or mathematical approach in which data points are grouped into a specified number of clusters so that the points inside each cluster share similar features. Clusters are formed so that the distance between the points within a cluster is kept to a minimum. How the clusters are produced depends on the algorithm we choose. Because there is no single criterion for good clustering, the conclusions drawn from the data also depend on which clustering algorithm the user chooses and how they configure it. Clustering can be used to tackle problems such as customer segmentation, recommender systems, and anomaly detection. A prominent approach, which may be familiar to you, is k-means clustering, in which we have no labels and must assign each data point to one of k clusters. The figure below shows how data points with similar features are grouped into the same cluster.
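The assign-and-update loop behind k-means can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, the toy points, and the evenly spaced initialisation are assumptions made here for reproducibility (random or k-means++ initialisation is more usual in practice).

```python
def kmeans(points, k, iterations=20):
    """Minimal k-means on a list of equal-length numeric tuples."""
    # Assumed initialisation: k evenly spaced data points (chosen for determinism).
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data: two visually obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points end up in one cluster and the last three in the other, because each group sits close to its own centroid and far from the other one.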
What is Time Series Clustering?
Time series clustering is an unsupervised data processing approach that groups time series based on their similarity. The goal is to maximize similarity within each cluster while minimizing similarity between clusters. Time-series clustering is a basic technique in data science for anomaly identification and pattern discovery, and it is also used as a subroutine in other, more complicated algorithms. It is particularly helpful when analyzing trends in very large collections of time series, where we cannot differentiate the trends just by looking at the plots. Clustering the series groups different trends into different clusters.
Kernel K-means
The kernel technique refers to implicitly mapping data into a higher-dimensional space in which groups that are not linearly separable in the original space have a distinct separating boundary. Kernel k-means follows the same procedure as k-means, except that distances are computed through a kernel function rather than as Euclidean distances in the original space. As a result, kernel k-means can find non-linear structures, which makes it well suited to many real-world data sets.
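To make the kernel idea concrete, the sketch below computes the squared distance between a point and a cluster mean in the kernel-induced feature space, using only kernel evaluations. The RBF kernel, the gamma value, and the toy clusters are assumptions chosen for illustration; the function names are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel; gamma is an assumed bandwidth parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_distance_sq(x, cluster, kernel=rbf_kernel):
    """Squared feature-space distance from x to the mean of `cluster`.

    ||phi(x) - mean||^2 = K(x, x) - (2 / n) * sum_c K(x, c)
                          + (1 / n^2) * sum_{c, c'} K(c, c')
    """
    n = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, c) for c in cluster) / n
    within_term = sum(kernel(a, b) for a in cluster for b in cluster) / (n * n)
    return self_term - 2.0 * cross_term + within_term


# Toy clusters (assumed data) and a query point near the first one.
cluster_a = [(0.0, 0.0), (0.2, 0.1)]
cluster_b = [(3.0, 3.0), (3.1, 2.9)]
x = (0.1, 0.0)

d_a = kernel_distance_sq(x, cluster_a)
d_b = kernel_distance_sq(x, cluster_b)
print(d_a < d_b)  # x is closer to cluster_a in feature space
```

A kernel k-means assignment step would use this quantity in place of the Euclidean distance to a centroid, since the feature-space mean is never computed explicitly.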
K-means for Time Series Clustering
The most common method of time series clustering is k-means. The usual approach is to arrange the time series data into a 2-D array, with one row per series and one column per time index, and then use a standard clustering algorithm like k-means to cluster the rows. However, the distance measures used by typical clustering algorithms, such as Euclidean distance, are frequently inappropriate for time series. A preferable way is to replace the default distance measure with a measure that compares the trends of the time series. One of the most popular techniques for this is Dynamic Time Warping.
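A small example shows why Euclidean distance on the flattened array can mislead. The three toy series below are assumed data: the first two contain the same bump at different positions, while the third has a different shape at the same position as the first.

```python
import math

# Each row is one series; each column is one time index (toy data, assumed).
series = [
    [0, 0, 1, 2, 1, 0, 0, 0],  # a bump early in the window
    [0, 0, 0, 0, 1, 2, 1, 0],  # the same bump, shifted two steps later
    [0, 0, 1, 1, 1, 0, 0, 0],  # a flatter shape at the early position
]


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d_shifted = euclidean_distance(series[0], series[1])
d_different = euclidean_distance(series[0], series[2])
print(d_shifted > d_different)
```

Because Euclidean distance compares values index by index, the shifted copy of the same trend is judged further away than the series with a genuinely different shape, which is the opposite of what trend-based clustering wants.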
Dynamic Time Warping
Dynamic Time Warping (DTW) allows a system to compare two signals and look for similarities even when one signal is time-shifted relative to the other. Its ability to recognize known speech patterns regardless of the speaker’s tempo makes it useful for speech recognition problems as well. For instance, given two arrays [1, 2, 3] and [4, 5, 6], calculating the distance between them is easy: subtract element-wise and add up the absolute differences. However, this no longer works directly once the arrays differ in length. We can think of these arrays as sequences of signal values. The “dynamic” component means that parts of a signal sequence can be shifted back and forth to look for a match, rather than speeding up or slowing down the entire sequence uniformly. If time warping is stretching or shrinking a rubber band, DTW is stretching or shrinking that rubber band locally to fit the contours of a surface. Below is the visual representation of DTW.
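The element-wise comparison mentioned above can be written out as a short sketch. The helper name elementwise_distance is hypothetical, and it uses absolute differences so that positive and negative gaps do not cancel.

```python
def elementwise_distance(a, b):
    """Sum of absolute element-wise differences; needs equal lengths."""
    if len(a) != len(b):
        raise ValueError("element-wise distance needs equal-length sequences")
    return sum(abs(x - y) for x, y in zip(a, b))


print(elementwise_distance([1, 2, 3], [4, 5, 6]))  # 3 + 3 + 3 = 9
```

Calling it with sequences of different lengths raises a ValueError, which is exactly the situation an alignment method such as DTW is meant to handle.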
Steps for Dynamic Time Warping
- Make an equal number of points in each of the two series.
- Using the Euclidean distance formula, calculate the distance between the first point in the first series and every point in the second series. Save the minimum distance found.
- Move to the second point of the first series and repeat the previous step. Continue point by point until every point in the first series has been processed.
- Take the second series as the reference and repeat the previous two steps.
- Add together all of the stored minimum distances to get an estimate of the similarity between the two series.
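The steps above give an intuitive picture of the matching. For comparison, here is a minimal sketch of the standard dynamic-programming formulation of DTW, which is a different procedure from the simplified steps: it builds a cumulative cost table, keeps the alignment in time order, and does not need the two series to have the same length. The function name dtw_distance and the absolute-difference point cost are choices made for this illustration.

```python
def dtw_distance(a, b):
    """Exact DTW distance between two 1-D sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost of matching two points
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matched an earlier point of b
                cost[i][j - 1],      # b[j-1] also matched an earlier point of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


sig1 = [1, 2, 3, 4]
sig2 = [1, 2, 2, 4, 4, 5]
print(dtw_distance(sig1, sig2))
```

The table has (n + 1) × (m + 1) entries, so the exact computation takes time proportional to the product of the two lengths; approximate variants trade some accuracy for speed on long series.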
Implementation of DTW in Python
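import numpy as np
from fastdtw import fastdtw  # third-party package: pip install fastdtw
from scipy.spatial.distance import euclidean
# Reshape to (n, 1) so every sample is a 1-D vector, which euclidean expects.
sig1 = np.array([1, 2, 3, 4]).reshape(-1, 1)
sig2 = np.array([1, 2, 2, 4, 4, 5]).reshape(-1, 1)
# distance is the approximate DTW cost; path lists the matched index pairs.
distance, path = fastdtw(sig1, sig2, dist=euclidean)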
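To tie DTW back to clustering, the sketch below groups a handful of toy series with a k-medoids style loop driven by DTW distances. k-medoids is swapped in here instead of k-means because averaging series under DTW is not straightforward; the data, the function names, and the hand-picked initial medoids are all assumptions for illustration, and the DTW helper is a plain-Python version rather than the fastdtw call above.

```python
def dtw(a, b):
    """DTW distance via dynamic programming, keeping one table row at a time."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            curr.append(abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def k_medoids(series, medoids, iterations=10):
    """Cluster series around medoids (indices into `series`) using DTW."""
    n = len(series)
    dist = [[dtw(series[i], series[j]) for j in range(n)] for i in range(n)]
    labels = []
    for _ in range(iterations):
        # Assign every series to its nearest medoid.
        labels = [
            min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(n)
        ]
        # Move each medoid to the member with the smallest total distance
        # to the rest of its cluster (keep it if the cluster is empty).
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            best = min(members, key=lambda i: sum(dist[i][j] for j in members)) if members else old
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


# Toy data (assumed): three "bump" series and three "dip" series,
# each shape appearing at different positions in time.
series = [
    [0, 0, 1, 2, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0, 0, 0, 0],
    [2, 2, 1, 0, 0, 1, 2, 2],
    [2, 1, 0, 0, 1, 2, 2, 2],
    [2, 2, 2, 1, 0, 0, 1, 2],
]
labels, medoids = k_medoids(series, medoids=[0, 3])
print(labels)
```

The bumps land in one cluster and the dips in the other regardless of where each shape occurs in time, since DTW absorbs the shifts that would inflate a point-by-point distance.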
Use cases of Time Series Clustering
- Used in anomaly detection to track uncommon trends in series.
- Used in speech recognition.
- Used in outlier detection.
- Used in biological applications, including DNA recognition.
This article covered the definitions of time series and clustering, and how the two combine to cluster time series trends. We went through a popular method for this, Dynamic Time Warping (DTW), along with the steps and implementation involved in using it.