Statistical Arbitrage , A.K.A StatArb is a pair trading strategy that invloves buying and selling a pair of stocks based on a underlying correlation between them. This correlation usually exist in a given sector or competitors, for example Pepsi (PEP) and Coca-Cola (KO) is a pretty popular pair.

The logic behind the strategy is that pair stocks tend to follow one another, so when they fall outta sync, there is a high chance that they will fall back in sync, which creates an opportunity for the statarb trader.

This article tries to detect these correlations by calculating correlation between stock pairs, to do so , we're going to be using Python Numpy , Pandas to calculate corr and Matplotlib to visualize it.

First things first, lets import the libs.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# we're using yahoo finance data, pandas datareader will import the data we need
from pandas_datareader.data import DataReader
```

lets first grab the required data from Yahoo Finance and put them in a DataFrame

```
start_date = "2011-01-01"
symbols = ["PEP", "KO"]
# df is the main dataframe that'll hold the Adjusted closing prices
df = pd.DataFrame()
for symbol in symbols:
dftemp = DataReader(symbol,"yahoo",start_date)
# we only need the adjusted close price.
df[symbol] = dftemp["Adj Close"]
# lets check the data
print "df DataFrame:\n",df.head()
```

```
df DataFrame:
PEP KO
Date
2011-01-03 55.135444 27.272059
2011-01-04 54.850336 26.707549
2011-01-05 55.839833 26.548651
2011-01-06 56.049474 26.356299
2011-01-07 55.672124 26.310301
```

Looks good, Now we gotta create a lagging dataframe, where 1 stock prices is shifted 1 day, This is done because we're trying to detect which stock follows the other.

```
dflag = df
dflag["KO_lag"] = dflag["KO"].shift(1)
# delete NaN values caused by the shift
dflag = dflag.dropna()
```

Now that we got `dflag`

lets compute the 5 day rolling correlation, luckily for us this can be done with 1 line of code thanks to Numpy and Pandas, This produces an array of correlation values where 1 is perfectly correlated, -1 is prefectly inversely correlated and 0 is not correlated at all.

```
# computing correlation with 1 line.
dflag = dflag.assign(correlation = dflag["PEP"].rolling(window=5).corr(dflag["KO_lag"]))
print dflag.head(10)
```

```
PEP KO KO_lag correlation
Date
2011-01-04 54.850336 26.707549 27.272059 NaN
2011-01-05 55.839833 26.548651 26.707549 NaN
2011-01-06 56.049474 26.356299 26.548651 NaN
2011-01-07 55.672124 26.310301 26.356299 NaN
2011-01-10 55.387016 26.368845 26.310301 -0.622976
2011-01-11 55.621812 26.214126 26.368845 0.743870
2011-01-12 55.957238 26.360481 26.214126 0.333118
2011-01-13 56.108179 26.511017 26.360481 -0.176748
2011-01-14 55.999162 26.398115 26.511017 0.206867
2011-01-18 55.823065 26.544469 26.398115 0.015542
```

Note the NaN values there because we specified a 5 day window, so The first 4 days isn't enough to compute correlation. Lets drop those NaN values to avoid errors

`dflag = dflag.dropna()`

Next step is to find the most occuring correlation coef to decide whether this pair is mostly correlated or not. to do so, we need to construct a histogram, which basically tallies the number of occurance for each value.

```
ax = dflag.hist(column="correlation")
plt.title("Correlation Histogram")
plt.show()
```

Note how values aren't normally distributed and leaning toward 1.0 which indicates high correlation. we can also plot a line to the most occuring value.

```
# this produces 2 arrays of count and the slices
count, division = np.histogram(dflag["correlation"])
# argmax is used to get the index of the highest count,
# then getting the value in the divison array using that index
most_occuring_value = division[count.argmax()]
ax = dflag.hist(column="correlation")
plt.title("Correlation Histogram")
# plotting a line
plt.axvline(most_occuring_value, color="r", linestyle="dashed", linewidth=2)
plt.show()
print "Most re-occuring Corr value = %f" % most_occuring_value
```

```
Most re-occuring Corr value = 0.602270
```

As you can See The 2 stocks seem historically correlated, So lets look at the prices to see how the correlation holds up

```
df_normalized = df[["PEP","KO"]]
# normalized the numbers to make it easier to compare
df_normalized = df_normalized/ df_normalized.iloc[0]
df_normalized.plot()
plt.title("Normalized Adj Closing Prices")
plt.show()
```

PEP Seems to have preformed higher than KO in the past couple of years but their trend are in line.

One major Caveat here is although pairs move in tandem, there is no rule saying they have to, This means sometimes they could fall outta sync for a long time regardless of the pre-existing historical correlation.

Thats it for now, Feel free to leave a comment If you have a better way of implementing this or if I made a mistake.

Here's the Gist of the full script

*Happy Trading*

*Get in touch with the author of this post: @ya7ya*Up Next: Calculating Stock Beta, Alpha and R^2 using Pandas and Statsmodels