
Statistical Arbitrage: Finding Correlated Stock Pairs

Statistical arbitrage, a.k.a. StatArb, is a pairs trading strategy that involves buying and selling a pair of stocks based on an underlying correlation between them. This correlation usually exists within a sector or between competitors; for example, Pepsi (PEP) and Coca-Cola (KO) are a pretty popular pair.

The logic behind the strategy is that paired stocks tend to follow one another, so when they fall out of sync there is a good chance they will fall back in sync, which creates an opportunity for the StatArb trader.
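To make the "out of sync" idea concrete, here's a minimal sketch on synthetic data (not the article's Yahoo data): we build two artificial price series that track each other, take their spread, and z-score it so that large absolute values flag divergence. The 2-standard-deviation threshold is an illustrative assumption, not a rule from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic prices: b tracks a plus noise, mimicking a correlated pair.
a = 100 + np.cumsum(rng.normal(0, 1, 500))
b = a + rng.normal(0, 2, 500)

# The spread between the pair; when it drifts far from its mean,
# a StatArb trader bets it will revert.
spread = a - b
zscore = (spread - spread.mean()) / spread.std()

# Flag days where the pair is "out of sync" (|z| > 2 is an arbitrary cutoff).
divergences = np.abs(zscore) > 2
print("days out of sync:", divergences.sum())
```

A real signal would use a rolling mean and standard deviation rather than the full-sample ones used here for brevity.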

This article tries to detect these relationships by calculating the correlation between stock pairs. To do so, we're going to use Python: NumPy and Pandas to compute the correlation, and Matplotlib to visualize it.

First things first, let's import the libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# we're using Yahoo Finance data; pandas-datareader will fetch it for us
from pandas_datareader.data import DataReader

Let's first grab the required data from Yahoo Finance and put it in a DataFrame.

start_date = "2011-01-01"
symbols = ["PEP", "KO"]
# df is the main dataframe that'll hold the Adjusted closing prices
df = pd.DataFrame()

for symbol in symbols:
    dftemp = DataReader(symbol, "yahoo", start_date)
    # we only need the adjusted close price.
    df[symbol] = dftemp["Adj Close"]

# let's check the data
print("df DataFrame:\n", df.head())
df DataFrame:
                  PEP         KO
2011-01-03  55.135444  27.272059
2011-01-04  54.850336  26.707549
2011-01-05  55.839833  26.548651
2011-01-06  56.049474  26.356299
2011-01-07  55.672124  26.310301

Looks good. Now we need to create a lagged DataFrame, where one stock's prices are shifted by one day. This is done because we're trying to detect which stock follows the other.

# copy df so the lag column doesn't get added to the original DataFrame
dflag = df.copy()
dflag["KO_lag"] = dflag["KO"].shift(1)

# drop the NaN row caused by the shift
dflag = dflag.dropna()
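If `shift` is new to you, here's a tiny standalone example (with made-up prices, not the article's data) showing exactly what `shift(1)` does to a Series:

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 12.0],
              index=pd.date_range("2011-01-03", periods=3))

# shift(1) moves every value down one row; the first row becomes NaN
lagged = s.shift(1)
print(lagged)
```

The NaN in the first row is why the article calls `dropna()` right after shifting.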

Now that we have dflag, let's compute the 5-day rolling correlation. Luckily for us, this can be done with one line of code thanks to Pandas. It produces a series of correlation values where 1 is perfectly correlated, -1 is perfectly inversely correlated, and 0 is not correlated at all.

# computing the rolling correlation in one line
dflag = dflag.assign(correlation = dflag["PEP"].rolling(window=5).corr(dflag["KO_lag"]))

print(dflag.head(10))
                  PEP         KO     KO_lag  correlation
2011-01-04  54.850336  26.707549  27.272059          NaN
2011-01-05  55.839833  26.548651  26.707549          NaN
2011-01-06  56.049474  26.356299  26.548651          NaN
2011-01-07  55.672124  26.310301  26.356299          NaN
2011-01-10  55.387016  26.368845  26.310301    -0.622976
2011-01-11  55.621812  26.214126  26.368845     0.743870
2011-01-12  55.957238  26.360481  26.214126     0.333118
2011-01-13  56.108179  26.511017  26.360481    -0.176748
2011-01-14  55.999162  26.398115  26.511017     0.206867
2011-01-18  55.823065  26.544469  26.398115     0.015542

Note the NaN values there: because we specified a 5-day window, the first 4 days aren't enough to compute a correlation. Let's drop those NaN values to avoid errors.

dflag = dflag.dropna()
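To see what `rolling(window=...).corr(...)` returns in a case where the answer is obvious, here's a small sketch with two perfectly linearly related series (illustrative data, not from the article): every 3-day window should give a correlation of exactly 1.0, with the first two rows NaN.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2 * x + 1  # moves in perfect lockstep with x

# rolling correlation over 3-value windows
rc = x.rolling(window=3).corr(y)
print(rc)
```

The first `window - 1` entries are NaN for the same reason the article's first 4 rows are: there aren't enough observations yet to fill a window.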

The next step is to find the most frequently occurring correlation coefficient, to decide whether this pair is mostly correlated or not. To do so, we construct a histogram, which tallies the number of occurrences of each value.

ax = dflag.hist(column="correlation")
plt.title("Correlation Histogram")
plt.show()


Note how the values aren't normally distributed and lean toward 1.0, which indicates high correlation. We can also plot a line at the most frequently occurring value.

# np.histogram produces two arrays: the counts and the bin edges
count, division = np.histogram(dflag["correlation"])
# argmax gives the index of the tallest bin,
# then we look up the corresponding bin edge in the division array
most_occurring_value = division[count.argmax()]

ax = dflag.hist(column="correlation")
plt.title("Correlation Histogram")
# plotting a vertical line at the most frequent value
plt.axvline(most_occurring_value, color="r", linestyle="dashed", linewidth=2)
plt.show()
print("Most re-occurring corr value = %f" % most_occurring_value)


Most re-occurring corr value = 0.602270
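The histogram-mode trick above is easy to test in isolation. This sketch applies the same `np.histogram` + `argmax` pattern to synthetic correlation values skewed toward 1.0 (the distribution parameters are made up to resemble the article's plot, not derived from its data):

```python
import numpy as np

rng = np.random.default_rng(1)
# fake correlation coefficients leaning toward 1.0, clipped to the valid range
corrs = np.clip(rng.normal(0.6, 0.2, 1000), -1, 1)

# count: occurrences per bin; division: the bin edges (one more than count)
count, division = np.histogram(corrs)
# the left edge of the tallest bin approximates the most frequent value
most_occurring = division[count.argmax()]
print("most frequent bin starts at %.2f" % most_occurring)
```

Note that `division` holds bin edges, so it has one more element than `count`; indexing it with `count.argmax()` gives the left edge of the tallest bin, not its center.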

As you can see, the two stocks seem historically correlated, so let's look at the prices to see how the correlation holds up.

df_normalized = df[["PEP", "KO"]]
# normalize the prices to make them easier to compare
df_normalized = df_normalized / df_normalized.iloc[0]
df_normalized.plot()
plt.title("Normalized Adj Closing Prices")
plt.show()
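The normalization step (dividing every row by the first row) is worth seeing on its own. Here's a tiny example with made-up prices showing that both columns start at 1.0 afterwards, so relative performance becomes directly comparable:

```python
import pandas as pd

prices = pd.DataFrame({"PEP": [55.0, 56.1, 57.2],
                       "KO":  [27.0, 26.5, 27.9]})

# divide every row by the first row, so each series is rebased to 1.0
normalized = prices / prices.iloc[0]
print(normalized)
```

After this, a value of 1.04 means "up 4% since the first day" regardless of whether the stock trades at $27 or $55.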


PEP seems to have performed better than KO over the past couple of years, but their trends are in line.

One major caveat: although pairs tend to move in tandem, there is no rule saying they have to. This means they could fall out of sync for a long time, regardless of the pre-existing historical correlation.

That's it for now. Feel free to leave a comment if you have a better way of implementing this or if I made a mistake.

Here's the Gist of the full script

Happy Trading

Get in touch with the author of this post: @ya7ya

Up Next: Calculating Stock Beta, Alpha and R^2 using Pandas and Statsmodels