Find relationships between multiple time series

1. Find relationships between multiple time series

This lesson will explore how to compute and visualize correlations in datasets containing multiple time series.

2. Correlations between two variables

One of the most widely used methods to assess the similarity between time series is the correlation coefficient, a measure of the strength of the relationship between two variables. The standard choice is the Pearson coefficient, which should be used when you think that the relationship between your variables of interest is linear. Otherwise, you can use the Kendall tau or Spearman rank coefficients when the relationship is thought to be non-linear. These rank-based measures capture monotonic relationships, where one variable consistently increases or decreases as the other increases.

3. Compute correlations

In Python, you can quickly compute the correlation coefficient between two variables by using the pearsonr, spearmanr or kendalltau functions in the scipy.stats module. All three functions return both the correlation coefficient and the p-value for the two variables x and y.
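As a minimal sketch of these three calls, the example below uses synthetic data (an assumption made for illustration; the arrays x and y are not part of the lesson's dataset) in which y is a noisy linear function of x.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Synthetic data (illustrative assumption): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# Each function returns the coefficient first and the p-value second.
r_pearson, p_pearson = pearsonr(x, y)
r_spearman, p_spearman = spearmanr(x, y)
tau_kendall, p_kendall = kendalltau(x, y)

print(f"Pearson:  r={r_pearson:.3f}, p={p_pearson:.3g}")
print(f"Spearman: r={r_spearman:.3f}, p={p_spearman:.3g}")
print(f"Kendall:  tau={tau_kendall:.3f}, p={p_kendall:.3g}")
```

Because the relationship here is linear by construction, all three coefficients should come out strongly positive.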

4. What is a correlation matrix?

If you want to investigate the dependence between multiple variables at the same time, you will need to compute a correlation matrix. The result is a table containing the correlation coefficients between each pair of variables. Correlation coefficients take values between -1 and 1. A value of 0 indicates no correlation, while values of 1 and -1 indicate perfect positive and perfect negative correlation, respectively.

5. What is a correlation matrix?

Importantly, a correlation matrix will always be "symmetric", i.e., the correlation between x and y will be identical to the correlation between y and x. Finally, the diagonal values will always be equal to 1, since the correlation between the variable x and a copy of itself is 1.
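Both properties can be checked numerically. The sketch below builds a small DataFrame of random numbers (the column names x, y and z are hypothetical) and tests the resulting correlation matrix for symmetry and for a diagonal of ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three unrelated random columns.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x", "y", "z"])

corr = df.corr()

# Symmetry: the matrix equals its own transpose.
is_symmetric = np.allclose(corr.values, corr.values.T)

# Diagonal: every variable correlates perfectly with itself.
diagonal_is_one = np.allclose(np.diag(corr.values), 1.0)

print(corr)
print("symmetric:", is_symmetric)
print("diagonal all ones:", diagonal_is_one)
```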

6. Computing Correlation Matrices with Pandas

The pandas library comes with a .corr() method that allows you to measure the correlation between all pairs of columns in a DataFrame. Using the meat dataset, we selected the columns beef, veal and turkey and called the .corr() method twice, once with the pearson method and once with the spearman method. The results are correlation matrices stored as two new pandas DataFrames called corr_p and corr_s.
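The calls described here might look like the following sketch. The meat DataFrame below is a synthetic stand-in with made-up values (an assumption, since the real dataset is not reproduced in this transcript); only the column names and the names corr_p and corr_s come from the lesson.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the meat dataset (values are invented for illustration).
rng = np.random.default_rng(42)
dates = pd.date_range("2000-01-01", periods=120, freq="MS")
trend = np.linspace(0, 10, 120)
meat = pd.DataFrame(
    {
        "beef": 100 + trend + rng.normal(size=120),
        "veal": 50 - trend + rng.normal(size=120),
        "turkey": 30 + rng.normal(size=120),
        "pork": 80 + 0.5 * trend + rng.normal(size=120),
    },
    index=dates,
)

# Select three columns, then compute the two correlation matrices.
corr_p = meat[["beef", "veal", "turkey"]].corr(method="pearson")
corr_s = meat[["beef", "veal", "turkey"]].corr(method="spearman")

print(corr_p)
print(corr_s)
```

In this synthetic data, beef trends upward while veal trends downward, so their correlation is negative under both methods.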

7. Computing Correlation Matrices with Pandas

If you want to compute the correlation between all time series in your DataFrame, simply remove the references to the columns.
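A minimal sketch of that variant, again on a synthetic stand-in for the meat DataFrame (the values are invented; the name corr_mat is the one used later in this lesson):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the meat dataset (values are invented for illustration).
rng = np.random.default_rng(7)
meat = pd.DataFrame(
    rng.normal(size=(120, 4)),
    columns=["beef", "veal", "turkey", "pork"],
    index=pd.date_range("2000-01-01", periods=120, freq="MS"),
)

# No column selection: every column is compared with every other column.
corr_mat = meat.corr(method="pearson")

print(corr_mat.shape)
```

The resulting matrix has one row and one column per time series in the DataFrame.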

8. Heatmap

Once you have stored your correlation matrix in a new DataFrame, it might be easier to visualize it instead of trying to interpret several correlation coefficients at once. To achieve this, we will introduce the seaborn library, which will be used to produce a heatmap of our correlation matrix. Here we use the .heatmap() function on the object corr_mat from the previous slide

9. Heatmap

to create a heatmap of the correlation matrix. A heatmap is a useful tool for visualizing correlation matrices, but the lack of ordering can make it difficult to read, or even to identify which groups of time series are the most similar.
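A heatmap call along these lines could be sketched as follows. The data is synthetic, and the annot, vmin, vmax and cmap arguments are optional styling choices made for this example rather than settings taken from the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the meat dataset (values are invented for illustration).
rng = np.random.default_rng(3)
meat = pd.DataFrame(
    rng.normal(size=(120, 3)), columns=["beef", "veal", "turkey"]
)
corr_mat = meat.corr(method="pearson")

# Draw the correlation matrix as a colored grid; fix the color scale to [-1, 1].
ax = sns.heatmap(corr_mat, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Pearson correlation matrix")
plt.tight_layout()
```

In an interactive session, you would drop the matplotlib.use line and finish with plt.show() to display the figure.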

10. Clustermap

For this reason, it is recommended to use the .clustermap() function in the seaborn library, which applies hierarchical clustering

11. Clustermap

to your correlation matrix to plot a sorted heatmap, where similar time series are placed closer to one another.
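The sketch below illustrates this on synthetic data with two built-in groups of related series (the column names a1, a2, b1 and b2 are hypothetical). After clustering, the row order stored on the returned grid shows which series were placed next to each other.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: a1 and a2 share one hidden factor, b1 and b2 share another.
rng = np.random.default_rng(5)
factor_a = rng.normal(size=200)
factor_b = rng.normal(size=200)
df = pd.DataFrame(
    {
        "a1": factor_a + 0.3 * rng.normal(size=200),
        "b1": factor_b + 0.3 * rng.normal(size=200),
        "a2": factor_a + 0.3 * rng.normal(size=200),
        "b2": factor_b + 0.3 * rng.normal(size=200),
    }
)
corr_mat = df.corr(method="pearson")

# clustermap reorders rows and columns with hierarchical clustering.
grid = sns.clustermap(corr_mat, vmin=-1, vmax=1, cmap="coolwarm")

# Recover the row order chosen by the clustering.
row_order = [corr_mat.index[i] for i in grid.dendrogram_row.reordered_ind]
print(row_order)
```

Although the columns are interleaved in the DataFrame, the clustered order should place a1 next to a2 and b1 next to b2.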

12. Let's practice!

Time to put this into practice!