
Correlation approaches for stock pairs you have not seen before

Stock Pairs

As described in our other articles, stock pairs are a mean-reversion trading system widely used in the industry. 

In pairs trading, you invest in two stocks that are somehow correlated and go long in one and short in the other (the classic Coca-Cola and Pepsi example). The long and short legs offset each other, so you are hedged against the market – a market-neutral strategy. The idea is simple: you watch stocks whose price series have very similar traces and wait until the price difference is significantly bigger than usual.

Then you just buy the underpriced stock and sell the overpriced one (under/overpriced relative to the other stock).

What should the performance of a market-neutral strategy look like? It should beat the market during very volatile times and crises, but not during recoveries – nicely visible on an out-of-sample backtest of my strategy during the financial crisis of 2008.

Fundamental relationships

When looking for pairs, you should consider stocks in the same industry or sector, or with some other fundamental relationship. More advanced relationships exist, such as supply/demand chains between companies (one company being a significant supplier of some product to another), but those are hard to find automatically for thousands of stocks.

In this article, you will see that many stocks can have correlated traces for some time, even though there is no logical relationship between them.

Even within the same industry, many companies nowadays pursue so many different activities that they could be placed in several industries. The main category should be the correct one, but the market is full of randomness, so the categories can be somewhat random, too.

Unlimited possibilities

At the time of writing, I have more than three years of experience developing stock-pairs strategies, and I want to share some of it with you.

We will cover finding the pairs, which is very difficult given the number of potential pairs. Imagine you use the 5,700 stocks available for trading on the US stock market (some filters on the stock universe are already applied; adding a dollar-volume filter takes it down to about 3,000). How many potential pairs exist? The computation is a simple combination count:

C(n, 2) = n·(n − 1) / 2,

which gives about 16.2 million potential pairs for 5,700 stocks (4.5 million for 3,000). By applying some fundamental categories (145 industries and 12 sectors), we still have over 660,000 potential pairs from the 5,700 stocks. Of course, most of them are not suitable for trading, so next comes the selection process.
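As a quick sanity check, the combination count can be computed with Python's standard library:

```python
from math import comb

def n_pairs(n_stocks: int) -> int:
    """Number of unordered stock pairs: C(n, 2) = n * (n - 1) / 2."""
    return comb(n_stocks, 2)

print(n_pairs(5700))  # 16242150 - roughly 16.2 million potential pairs
print(n_pairs(3000))  # 4498500 - the ~4.5 million mentioned above
```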

Different approaches to find pairs:

  1. Correlation analysis – classical approach but a good one;

  2. Cointegration – statistical testing for a relation between stocks;

  3. Dynamic Time Warping – the methodology for time series analysis which can detect more complex relations;

  4. Variational Autoencoder with Convolutional Neural Networks (CNN-VAE) – my approach to clustering stocks, i.e., computing more complex distances between different time series.

Note that cointegration can also generate signals, but in this article, we are looking for suitable pairs according to their industry and price traces. You can find useful scientific articles on generating signals with different approaches (distance, cointegration, and copula). 

A copula is an approach from probability theory: it models multivariate distributions to capture dependencies between random variables. But before applying any signal-generation methodology, you should find suitable pairs first. My approach to finding signals is not based on any of the methodologies mentioned here (to find an edge, you have to be creative).

Main setting

We will be looking for correlation-like relations between two stocks over a two-year period (2018-2019). Since we want to use at least some basic fundamental relationship, we will use the industry Other Industrial Metals & Mining (available via Alpha Vantage, or web-scraped from Yahoo and other free sources). To have a larger group, we will not apply dollar-volume filters here. 

Still, in real trading, you should avoid illiquid stocks with this strategy (they make your backtests unrealistically good, while in reality you can wait even 30 minutes to buy or sell just a few thousand dollars' worth).

We will drop stocks missing more than 5% of values over the last two years (as of 2020-01-01). The resulting group of potentially similar stocks consists of 21 symbols. 
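The missing-value filter can be sketched as follows; the raw DataFrame, the symbols, and the random prices are hypothetical stand-ins for the downloaded data:

```python
import numpy as np
import pandas as pd

# Hypothetical raw daily close prices; in the article they come from
# Alpha Vantage / Yahoo downloads covering 2018-2019. Symbols are made up.
dates = pd.date_range("2018-01-01", "2019-12-31", freq="B")
prices_raw = pd.DataFrame(
    np.random.default_rng(0).lognormal(size=(len(dates), 3)),
    index=dates, columns=["AAA", "BBB", "CCC"],
)
prices_raw.loc[prices_raw.index[:60], "CCC"] = np.nan  # >5% missing values

# keep only symbols with at most 5% missing values over the two years
keep = prices_raw.isna().mean() <= 0.05
ts_daily = prices_raw.loc[:, keep].ffill()
print(list(ts_daily.columns))  # ['AAA', 'BBB'] - CCC is dropped
```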

Note that we use only the industry classification. To understand more complex relationships, it would be necessary to look deeper into the areas where the companies operate, their similarities, and possibly whether there are contracts between them (unfortunately not possible without complex, high-quality news data covering the last few years).

import pandas as pd, numpy as np, glob, datetime # basics
from scipy.spatial.distance import pdist # distance calculation
from statsmodels.tsa.stattools import coint # cointegration test
import torch # used for neural networks
from cnn_vae import VAE # my own CNN-VAE model, not sharing this one
from tslearn.metrics import dtw # dynamic time warping    

Note that downloading and preparing data is shown in our other articles, so we will not repeat it here. The DataFrame of adjusted close daily prices without missing values is saved in ts_daily. Now let’s find some suitable pairs.

1. Correlation analysis

Correlation is the simplest and quickest way to find relationships between two datasets. We will compute correlations of the percentage daily returns of our stocks. Correlation is defined as

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y),

where X and Y are random variables, cov(X, Y) is their covariance, and σ_X and σ_Y are their standard deviations. The computational form of the sample Pearson correlation coefficient is

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² ),

where xᵢ and yᵢ are the observed values of the random variables (in this case percentage daily returns), n is the length of the time series (in our case 500), and x̄ and ȳ are the sample means. Now we simply calculate the correlations and look at the pairs with the highest values.

corrs = ts_daily.pct_change().corr()  # correlation matrix of daily percentage returns
crrs = {}
for i in range(len(corrs.index)-1):
    for j in range(i+1, len(corrs.index)):
        # upper triangle only - each pair counted once, keyed "SYM1~SYM2"
        crrs['~'.join(corrs.index[[i,j]])] = corrs.iloc[i,j]
corrs = pd.Series(crrs)

The most correlated are BHP~RIO, BHP~TECK, RIO~TECK, and BHP~VALE, all with correlations higher than 0.6. The first one has 0.85, which is pretty high. For better readability, we use scaled prices (min-max scaler) to see that the traces are very similar – but we lose the scale of the volatility in that picture. 

The ratio is computed simply by dividing the prices of the two stocks (there are many other methodologies for finding a signal; here it is when the ratio moves outside the Bollinger band – the distance method). 

The beta coefficient is used in practice so that the market value of both legs is the same (we invest 100 in one stock and beta*100 in the other). Then a 21-day moving average (1 trading month) and a Bollinger band are used to find signals (according to the distance from the MA).
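A minimal sketch of this signal construction – the random-walk prices, the two-standard-deviation band width, and the variable names are illustrative assumptions, not the exact production implementation:

```python
import numpy as np
import pandas as pd

# Illustrative random-walk prices for one pair; real inputs would be
# two columns of ts_daily.
rng = np.random.default_rng(1)
idx = pd.date_range("2018-01-01", periods=500, freq="B")
a = pd.Series(100 + rng.normal(0, 1, 500).cumsum(), index=idx)
b = pd.Series(50 + rng.normal(0, 0.5, 500).cumsum(), index=idx)

# hedge ratio: invest 100 in one stock and beta*100 in the other
ra, rb = a.pct_change().dropna(), b.pct_change().dropna()
beta = ra.cov(rb) / rb.var()

# ratio, 21-day (one trading month) moving average, Bollinger band
ratio = a / b
ma = ratio.rolling(21).mean()
sd = ratio.rolling(21).std()
upper, lower = ma + 2 * sd, ma - 2 * sd  # 2-std width is an assumption

# distance-method signal: trade when the ratio leaves the band
signal = pd.Series(0, index=idx)
signal[ratio > upper] = -1  # ratio too high: short a, long b
signal[ratio < lower] = 1   # ratio too low: long a, short b
```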


Pros:

  1. Easy to compute;

  2. Straightforward conclusions;

  3. Can detect even correlated stocks that have different traces (see the BHP~TECK picture).


Cons:

  1. Can detect only a very basic linear relationship;

  2. Correlation ignores time ordering – you can shuffle the days and get the same result;

  3. Like all the other methodologies mentioned here, it doesn't explain why there is a correlation.

Let's look at this example: the stocks have different traces, but they are still correlated and can possibly form a good pair. The other methodologies cannot find pairs like this, so even simple correlation analysis has its place here.

2. Cointegration Testing

In the Python package statsmodels, there are useful tools for time series analysis. You can find the cointegration test (statsmodels.tsa.stattools.coint), which applies the Engle-Granger two-step method. In the first step, the regression

y_t = α + β·x_t + ε_t

is estimated, where x_t and y_t are the stock prices at time t, and the residual ε_t should form a stationary time series. The second step tests this stationarity with the Augmented Dickey-Fuller test. 

This test can include different regressions, which test for constant stationarity, constant + trend, constant + trend + quadratic trend, and no constant / no trend.

Since stock prices change over time, it is good to include a time trend in this methodology (linear is sufficient) or to test against no-constant/no-trend stationarity. For more info, see the Python documentation of the coint and adfuller tests.

syms = ts_daily.columns
cointegr = {}
for i in range(len(syms)-1):
    for j in range(i+1, len(syms)):
        cointegr[(syms[i], syms[j])] = coint(
            ts_daily[syms[i]],  # stock 1
            ts_daily[syms[j]],  # stock 2
            trend='ct'          # constant + linear time trend
            )[1]  # position 1 holds the p-value of the cointegration test
cointegr = pd.Series(cointegr)

The null hypothesis is ‘there is no cointegration,’ so when the p-value is sufficiently low (typically below 0.05), we can reject it. Sorting from the lowest p-values, we can find some new pairs. 
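As a sketch of this decision rule – the p-values below are made up for illustration; in practice, cointegr is the Series built by the loop above:

```python
import pandas as pd

# Made-up p-values keyed by pair, standing in for the real cointegr Series
cointegr = pd.Series({("CHNR", "GSM"): 0.004,
                      ("CHNR", "WRN"): 0.010,
                      ("BHP", "RIO"): 0.240})

best = cointegr.sort_values()    # lowest p-value = strongest evidence
candidates = best[best < 0.05]   # reject 'no cointegration' at the 5% level
print(candidates.index.tolist())  # [('CHNR', 'GSM'), ('CHNR', 'WRN')]
```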

Admittedly, cointegration is better suited for generating signals than for finding functional pairs, but it is a widely used metric, so I wanted to show it. You can see the best resulting pair according to the lowest p-value. The top 3 pairs are CHNR~GSM, CHNR~WRN, and CHNR~WWR.

Note that in this case we filter neither low-dollar-volume stocks nor the areas where the companies work; when using pairs for trading, make sure there are real dependencies explaining why the prices of these companies should affect each other.

3. Dynamic Time Warping

DTW is an algorithm that measures the similarity between two sequences, mostly time series. It can detect dependencies even when one movement is faster and the other slower – for example, two people walking, one faster, the other slower. 

See the explanation and examples on Wikipedia. Looking at a simple example taken from Wikipedia, DTW can catch the correlation effect even when there is a lag or a slower/faster realization of the change.

Relationship that can be caught by DTW:

This methodology is definitely more complex and can catch real-world correlations between stocks. I have had good experience with it.

Thanks to the package tslearn, the application is simple and straightforward (computation time is also low). A critical note – according to the picture of DTW and its definition, it is better to apply it to prices, not percentage returns. 

We cannot use raw prices because the distances would not be comparable across different pairs. Scaling the prices (min-max scaler) solves this problem (but again, we lose the information about volatilities – it can be used later in the signal-picking process).

# min-max scaler: close prices scaled into (0, 1), so traces are easily comparable
data = (ts_daily - ts_daily.min()) / (ts_daily.max() - ts_daily.min())
res_dtw = {}  # save results into a dictionary
for i in range(len(data.columns)-1):
    for j in range(i+1, len(data.columns)):
        res_dtw['~'.join(data.columns[[i,j]])] = dtw(
            data.iloc[:,i],  # stock 1
            data.iloc[:,j],  # stock 2
            global_constraint="sakoe_chiba",
            sakoe_chiba_radius=10  # how far we look for a lagged or slowed effect - 10 days here
            )
res_dtw = pd.Series(res_dtw)

The results can be sorted, and pairs that have the lowest distance are the most suitable. The top 3 are in this case: TGB~WWR, GSM~VEDL, CHNR~XPL. When you look at the next picture, you can see the correlation between stocks which we are looking for. 

Even on the image of scaled prices, you can see the mean-reversion between these two stocks. The next step is testing whether the dependencies continue on out-of-sample, so it is not just a random overfitted pair (suitable topic for another article).

4. Convolutional autoencoders

I was inspired by facial recognition with variational autoencoders (VAE): if we can distinguish faces, why not distinguish similar stocks? Humans look at plots and can find similar traces, just like faces. A VAE is an unsupervised algorithm that takes some data as input and outputs the same data, but internally it creates a lower-dimensional representation. The VAE uses randomness while creating a latent probability distribution, represented by mean and variance vectors; the mean vector is the lower-dimensional representation of the original input. With this representation, a VAE can also generate artificial outputs similar to the original input – there is a Bayesian approach hidden inside. The theory behind this algorithm is beautiful and non-trivial (deeper knowledge of neural networks and probability theory is necessary). Since the inputs are images, they first have to pass through convolutional layers – hence CNN-VAE.

The idea came from a human-eye analysis of the stock prices – pictures of the price time series. The most important part of this VAE is the representation of a given picture in lower-dimensional space. Working with lower dimensional space is simpler. 

We can apply Euclidean distances to find stocks with similar traces. Much more work is needed inside this methodology than just putting prices inside, but I consider it my know-how.
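The distance step itself is simple. A minimal sketch with random embeddings standing in for the real CNN-VAE bottleneck vectors (the encoder itself is not shared, and the symbols here are just examples):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Random stand-ins for the 20-dimensional CNN-VAE bottleneck vectors,
# one row per symbol
rng = np.random.default_rng(2)
symbols = ["BHP", "RIO", "TECK", "VALE"]
embeddings = rng.normal(size=(len(symbols), 20))

# pairwise Euclidean distances in the latent space
dist = squareform(pdist(embeddings, metric="euclidean"))

# mask the diagonal and pick the closest pair (most similar traces)
masked = dist.copy()
np.fill_diagonal(masked, np.inf)
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print(symbols[i], symbols[j])
```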

This methodology can also be used to further clustering to create a map of the whole stock universe (based on historical prices and similar behavior of the time series). 

Each part of the map consists of stocks with slightly different traces. Since the 20-dimensional bottleneck is still too much for visualization, we use t-SNE dimensionality reduction into 2D space (because of this, some important information can be lost, so we use it just for visualization).
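A minimal sketch of the visualization step, assuming scikit-learn's TSNE and random stand-in embeddings (the real inputs would be the bottleneck vectors of the whole universe):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for 20-dimensional bottleneck vectors of 60 stocks
rng = np.random.default_rng(3)
embeddings = rng.normal(size=(60, 20))

# project to 2D purely for plotting the map; some structure is lost
coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (60, 2)
```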

On the map, we look at three different pairs, each at a different location. The blue pair consists of stocks that are far apart on the map. For each pair, we show the real price and the scaled close price to see the mean reversion. 

First we apply only price similarity, and then we check whether the stocks have something in common fundamentally. Pairs based purely on prices can be very random – the traces can be very similar just due to the stock market's randomness.

This map consists only of stocks' traces from 2018-2019; the map for 2015-2016 would be entirely different because the t-SNE dimensionality reduction is applied to the whole dataset. The model itself is not re-optimized, though, so the same traces map to the same 20-dimensional vectors whether they come from 2015-2016 or 2018-2019. 

Note that we faced a sharp market correction at the end of 2018, followed by swift growth at the beginning of 2019, so many stocks have similar traces during this period.

The red stocks in the center-left part of the map did not face a sharp drawdown at the end of 2018. Their traces are very similar: ORLY – O'Reilly Automotive (Specialty Retail) and CRMT – America's Car-Mart (Auto & Truck Dealerships) have slightly different industries but the same sector, and both sell cars. This pair has a fundamental meaning even though it was found solely from historical prices.

The green pair in the center consists of stocks that were hit by the 2018 correction, then recovered fast but remained volatile until the end of 2019. Again we see similar mean-reversion behavior: PQG – PQ Group Holdings (Specialty Chemicals) and VPG – Vishay Precision Group (Scientific & Technical Instruments). The sector and industry are different, but not that different (not like mining versus tourism). 

Is this relation just random? If you look deeper into these two stocks, both are based in Malvern, Pennsylvania. Reading their descriptions, it looks like there could be a supply-demand relationship: PQG makes performance materials, polymer additives, and a lot more; VPG makes foil technology products and many others. We found at least one potential supply-demand relationship from just a ten-second read of the stock descriptions.

The blue stocks sit in different parts of the map, so they have different traces – MRSN (Biotechnology) fell from mid-2018 and had not recovered by the end of 2019, while PLOW (Auto Parts) also fell but recovered to new highs. As expected, these two stocks have nothing in common. You can also find many stocks with similar traces that have nothing in common fundamentally.

Back to comparison of approaches

This part of the code will not be shown, because it just feeds the data into the black-box model and sorts the Euclidean distances between the VAE bottlenecks (the lower-dimensional representations of our stock data).

In the image, you can see that this methodology can find longer-term mean-reversion relationships as well as short-term ones. The pair with the lowest distance is GSM~VEDL, and you can see the mean-reversion process on a longer horizon (weeks and months). Other top pairs are TGB~XPL, TGB~WWR (the best by DTW, so we can combine methodologies), and WWR~XPL.
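Combining methodologies can be as simple as summing per-method ranks; a sketch with made-up ranks for illustration (1 = best):

```python
import pandas as pd

# Made-up per-method ranks for the same candidate pairs (1 = best)
corr_rank = pd.Series({"TGB~XPL": 2, "TGB~WWR": 1, "GSM~VEDL": 3})
dtw_rank = pd.Series({"TGB~XPL": 3, "TGB~WWR": 1, "GSM~VEDL": 2})
vae_rank = pd.Series({"TGB~XPL": 2, "TGB~WWR": 3, "GSM~VEDL": 1})

# simplest combination: sum the ranks and sort ascending
combined = (corr_rank + dtw_rank + vae_rank).sort_values()
print(combined.index[0])  # TGB~WWR - best combined rank
```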


This article showed you a few different approaches to finding correlation-like relations between stocks and thus potentially suitable pairs. No single best solution exists, but by combining different techniques and adding other fundamental (logical) dependencies, you can find exciting pairs for this market-neutral strategy.

You will always find interesting pairs based on historical prices, but a relation without fundamentals is only random and will vanish soon. Also, beware of trading illiquid stocks, which make your strategy very profitable in backtests but not executable in reality. Stock pairs is a trading strategy that still has an edge because there are thousands of pairs no one trades – an almost unlimited universe – and you can trade intraday, on the open, on the close, and hold positions for hours, days, or weeks.

The stock pairs strategy is my main specialization and joy because of its unlimited possibilities and exciting approaches (you can use fundamentals, prices, news, options, implied volatility – anything). Even if you use just pairs based on historical price relations, the strategy has an edge and is profitable compared to random pairs trading. The real edge, however, lies not only in picking the right pair but also in picking the right signal among the thousands that appear every day.


