Statistical Arbitrage , A.K.A StatArb is a pair trading strategy that invloves buying and selling a pair of stocks based on a underlying correlation between them. This correlation usually exist in a given sector or competitors, for example Pepsi (PEP) and Coca-Cola (KO) is a pretty popular pair.
The logic behind the strategy is that pair stocks tend to follow one another, so when they fall outta sync, there is a high chance that they will fall back in sync, which creates an opportunity for the statarb trader.
This article tries to detect these correlations by calculating correlation between stock pairs, to do so , we're going to be using Python Numpy , Pandas to calculate corr and Matplotlib to visualize it.
First things first, lets import the libs.
import numpy as np import pandas as pd import matplotlib.pyplot as plt # we're using yahoo finance data, pandas datareader will import the data we need from pandas_datareader.data import DataReader
lets first grab the required data from Yahoo Finance and put them in a DataFrame
start_date = "2011-01-01" symbols = ["PEP", "KO"] # df is the main dataframe that'll hold the Adjusted closing prices df = pd.DataFrame() for symbol in symbols: dftemp = DataReader(symbol,"yahoo",start_date) # we only need the adjusted close price. df[symbol] = dftemp["Adj Close"] # lets check the data print "df DataFrame:\n",df.head()
df DataFrame: PEP KO Date 2011-01-03 55.135444 27.272059 2011-01-04 54.850336 26.707549 2011-01-05 55.839833 26.548651 2011-01-06 56.049474 26.356299 2011-01-07 55.672124 26.310301
Looks good, Now we gotta create a lagging dataframe, where 1 stock prices is shifted 1 day, This is done because we're trying to detect which stock follows the other.
dflag = df dflag["KO_lag"] = dflag["KO"].shift(1) # delete NaN values caused by the shift dflag = dflag.dropna()
Now that we got
dflag lets compute the 5 day rolling correlation, luckily for us this can be done with 1 line of code thanks to Numpy and Pandas, This produces an array of correlation values where 1 is perfectly correlated, -1 is prefectly inversely correlated and 0 is not correlated at all.
# computing correlation with 1 line. dflag = dflag.assign(correlation = dflag["PEP"].rolling(window=5).corr(dflag["KO_lag"])) print dflag.head(10)
PEP KO KO_lag correlation Date 2011-01-04 54.850336 26.707549 27.272059 NaN 2011-01-05 55.839833 26.548651 26.707549 NaN 2011-01-06 56.049474 26.356299 26.548651 NaN 2011-01-07 55.672124 26.310301 26.356299 NaN 2011-01-10 55.387016 26.368845 26.310301 -0.622976 2011-01-11 55.621812 26.214126 26.368845 0.743870 2011-01-12 55.957238 26.360481 26.214126 0.333118 2011-01-13 56.108179 26.511017 26.360481 -0.176748 2011-01-14 55.999162 26.398115 26.511017 0.206867 2011-01-18 55.823065 26.544469 26.398115 0.015542
Note the NaN values there because we specified a 5 day window, so The first 4 days isn't enough to compute correlation. Lets drop those NaN values to avoid errors
dflag = dflag.dropna()
Next step is to find the most occuring correlation coef to decide whether this pair is mostly correlated or not. to do so, we need to construct a histogram, which basically tallies the number of occurance for each value.
ax = dflag.hist(column="correlation") plt.title("Correlation Histogram") plt.show()
Note how values aren't normally distributed and leaning toward 1.0 which indicates high correlation. we can also plot a line to the most occuring value.
# this produces 2 arrays of count and the slices count, division = np.histogram(dflag["correlation"]) # argmax is used to get the index of the highest count, # then getting the value in the divison array using that index most_occuring_value = division[count.argmax()] ax = dflag.hist(column="correlation") plt.title("Correlation Histogram") # plotting a line plt.axvline(most_occuring_value, color="r", linestyle="dashed", linewidth=2) plt.show() print "Most re-occuring Corr value = %f" % most_occuring_value
Most re-occuring Corr value = 0.602270
As you can See The 2 stocks seem historically correlated, So lets look at the prices to see how the correlation holds up
df_normalized = df[["PEP","KO"]] # normalized the numbers to make it easier to compare df_normalized = df_normalized/ df_normalized.iloc df_normalized.plot() plt.title("Normalized Adj Closing Prices") plt.show()
PEP Seems to have preformed higher than KO in the past couple of years but their trend are in line.
One major Caveat here is although pairs move in tandem, there is no rule saying they have to, This means sometimes they could fall outta sync for a long time regardless of the pre-existing historical correlation.
Thats it for now, Feel free to leave a comment If you have a better way of implementing this or if I made a mistake.
Here's the Gist of the full script
Happy TradingGet in touch with the author of this post: @ya7ya Up Next: Calculating Stock Beta, Alpha and R^2 using Pandas and Statsmodels