Pair Trading – Statistical Arbitrage On Cash Stocks

This article is the final project submitted by the author as a part of his coursework in Executive Programme in Algorithmic Trading (EPAT™) at QuantInsti™. Do check our Projects page and have a look at what our students are building.

About the Author

Jonathan has a strong knowledge of mathematical programming and has worked as a process optimization engineer for 3 years. He started trading as a hobby, drawn to algorithmic trading in particular by his passion for math, and it eventually became his full-time job. Jonathan enrolled in the Executive Programme in Algorithmic Trading (EPAT™) in November 2016 and found his place in the world of quantitative analysis in finance. Currently, he is taking several online courses in subjects related to Artificial Intelligence and its applications in finance, and is about to start an online portal in Financial Engineering to share his experience as a Quant Trader.


Project Objective

The objective of this project is to model a statistical arbitrage trading strategy and quantitatively analyze the modeling results. The motivation is to diversify investment across five sectors, namely Technology, Financial, Services, Consumer Goods and Industrial Goods. Furthermore, some stocks, generally in the same sector, move in tandem because their prices are affected by the same market events. However, noise might make them temporarily deviate from the usual pattern, and a trader can take advantage of this apparent deviation with the expectation that the stocks will eventually return to their long-term relationship.

Within each sector, stocks were selected based on high liquidity, small bid/ask spread and the ability to short the stock. However, it is possible to consider other stocks for further analysis. Once the stock universe is defined, pairs can be formed. Every day, when we want to enter a position, all the pairs in the universe are evaluated and the top pairs are selected according to a set of criteria.

Trading Strategy Idea

As the universe of pairs is already defined, correlation analysis should be performed on all possible pairs to filter out those with suitable properties for executing statistical arbitrage. With this correlation test, we are looking for a measurement of the relationship between two stock prices. The logic of the strategy is: for any correlated pair in the established universe, if the pair ratio diverges from its mean beyond a certain threshold, we short the expensive stock and buy the cheap one. Once the prices converge back to the mean, we close the position and profit from the reversal.
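The entry and exit logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function name `ratio_signals`, the z-score thresholds `entry_z` and `exit_z`, and the use of a full-sample mean and standard deviation (which a live strategy would replace with a rolling window to avoid look-ahead bias) are all hypothetical choices.

```python
from statistics import mean, stdev

def ratio_signals(prices_a, prices_b, entry_z=2.0, exit_z=0.5):
    """Return one position label per day for the pair (A, B).

    'short_a_long_b': the ratio A/B is far above its mean (A looks expensive)
    'long_a_short_b': the ratio A/B is far below its mean (A looks cheap)
    'flat'          : no open position
    """
    ratios = [a / b for a, b in zip(prices_a, prices_b)]
    # Assumption: full-sample statistics, used here only to keep the sketch short.
    mu, sigma = mean(ratios), stdev(ratios)

    position = "flat"
    positions = []
    for r in ratios:
        z = (r - mu) / sigma  # distance from the mean in standard deviations
        if position == "flat":
            if z > entry_z:
                position = "short_a_long_b"
            elif z < -entry_z:
                position = "long_a_short_b"
        elif abs(z) < exit_z:
            # The ratio has converged back towards its mean: close the trade.
            position = "flat"
        positions.append(position)
    return positions

# Stock A spikes on one day while stock B stays put.
prices_a = [100] * 10 + [130] + [100] * 10
prices_b = [100] * 21
positions = ratio_signals(prices_a, prices_b)
print(positions[9], positions[10], positions[11])
```

With these toy prices the sketch stays flat until the spike, shorts A against B on the spike day, and closes the position the next day when the ratio is back near its mean.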

The strategy triggers new orders whenever the price ratio of a pair in the filtered universe diverges from its mean. For such a trade to be worthwhile, the pair must also be cointegrated. If the pair ratio is cointegrated, the ratio is mean reverting, and the greater the dispersion from its mean, the higher the probability of a reversal, which makes the trade more attractive. This analysis helps determine the stability of the long-term relationship. The spread time series is tested for stationarity with the Augmented Dickey-Fuller (ADF) test. In other words, if the stocks in a pair are cointegrated, it suggests that the mean and variance of the spread remain constant over time. There is, however, a major issue which makes this simple strategy difficult to implement in practice: the long-term relationship can break down, and the spread can move from one equilibrium to another.
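To make the stationarity check concrete, here is a minimal sketch of the simplest member of the Dickey-Fuller family, written with the Python standard library only. It is a simplification and not the test the author ran: it has a constant but no lagged difference terms (so it is not "augmented"), and the 5% critical value of about -2.86 is the approximate large-sample value for a directly observed series. A spread built from an estimated hedge ratio needs different critical values, and in practice one would typically call `adfuller` from `statsmodels`. The names `dickey_fuller_stat` and `looks_stationary` are made up for this illustration.

```python
import math
import random

def dickey_fuller_stat(series):
    """t-statistic of beta in the regression: diff[t] = alpha + beta * series[t-1] + error."""
    x = series[:-1]
    y = [series[i + 1] - series[i] for i in range(len(series) - 1)]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    residuals = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in residuals) / (n - 2)  # residual variance
    return beta / math.sqrt(sigma2 / sxx)             # beta divided by its standard error

# Approximate large-sample 5% critical value (constant, no trend).
CRITICAL_5PCT = -2.86

def looks_stationary(series):
    """True when the unit-root hypothesis is rejected at roughly the 5% level."""
    return dickey_fuller_stat(series) < CRITICAL_5PCT

# Synthetic data: a mean-reverting AR(1) series and a random walk driven by the same shocks.
rng = random.Random(0)
mean_reverting = [0.0]
random_walk = [0.0]
for _ in range(500):
    shock = rng.gauss(0, 1)
    mean_reverting.append(0.5 * mean_reverting[-1] + shock)
    random_walk.append(random_walk[-1] + shock)

print(looks_stationary(mean_reverting))
print(dickey_fuller_stat(random_walk))
```

A strongly negative statistic, as for the mean-reverting series, is evidence against a unit root. A statistic close to zero, which is what a random walk usually produces, gives no such evidence.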

A training period of at least one year of data is used ahead of the out-of-sample test, and the capital allocated to each sector is decided using a minimum variance portfolio approach. Each sector is traded independently. Data from Yahoo Finance has been used for testing this strategy. Backtesting for each pair uses data for the period 1-Jan-2009 to 31-Dec-2014.
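As a rough illustration of the minimum variance idea, the sketch below computes the textbook unconstrained weights, which are proportional to the inverse covariance matrix applied to a vector of ones and then rescaled to sum to one. The covariance numbers are invented, the helper names `solve_linear` and `min_variance_weights` are hypothetical, and the excerpt does not say which constraints (for example, no negative weights) the author applied, so treat this as one possible reading rather than the project's method.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def min_variance_weights(cov):
    """Unconstrained minimum variance weights; they sum to 1 and may be negative."""
    raw = solve_linear(cov, [1.0] * len(cov))
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical covariance matrix of two sector return series.
cov = [[0.04, 0.01],
       [0.01, 0.09]]
weights = min_variance_weights(cov)
print(weights)
```

In this toy case the less volatile sector receives the larger share of capital, which is the behaviour one expects from a minimum variance allocation.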

Strategy Details

You can read the complete project work of the author, including the Python code for Pairs Trading, by downloading the Ebook provided below.

Highlights from the project include:

  • Pair Trading – Statistical Arbitrage on Cash Stocks
  • Strategy
  • Code Details and In-Sample Backtesting
  • Analyzing Model Output
  • Monte Carlo Analysis and much more…

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to build a promising career in algorithmic trading. Enroll now!

Read more

R Weekly Bulletin Vol – XII

This week’s R bulletin will cover topics on how to resolve some common errors in R.

We will also cover functions like do.call, rename, and lapply. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Find and Replace – Ctrl+F
2. Find Next – F3
3. Find Previous – Shift+F3

Problem Solving Ideas

Resolving the ‘: cannot open the connection’ Error

There can be two reasons for this error to show up when we run an R script:

1. A file/connection can’t be opened because R can’t find it (mostly due to an error in the path).
2. Failure in .onLoad() because a package can’t find a system dependency.


symbol = "AXISBANK"
noDays = 1
dirPath = paste(getwd(), "/", noDays, " Year Historical Data", sep = "")
fileName = paste(dirPath, symbol, ".csv", sep = "")
data = read.csv(fileName)

Warning in file(file, "rt"): cannot open file 'C:/Users/Madhukar/Documents/1 Year Historical DataAXISBANK.csv': No such file or directory
Error in file(file, "rt"): cannot open the connection

We are getting this error because we have specified the wrong path to the “dirPath” object in the code. The right path is shown below. We missed adding a forward slash after “Year Historical Data” in the paste function. This led to the wrong path, and hence the error.

dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "")

After adding the forward slash, we re-ran the code. Below we can see the right dirPath and fileName printed in the R console.


symbol = "AXISBANK"
noDays = 1
dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "")
fileName = paste(dirPath, symbol, ".csv", sep = "")
data = read.csv(fileName)
print(head(data, 3))

Resolving the ‘could not find function’ Error

This error arises when an R package is not loaded properly or due to the misspelling of the function names.

When we run the code shown below, we get a “could not find function "ymed"” error in the console. This is because we have misspelled the “ymd” function as “ymed”. Even with the correct spelling, the call throws a “could not find function "ymd"” error if the package that provides it (lubridate) has not been loaded.


# Read NIFTY price data from the csv file
df = read.csv("NIFTY.csv")

# Format date
dates = ymed(df$DATE)

Error in eval(expr, envir, enclos): could not find function “ymed”

Resolving the “replacement has” Error

This error occurs when one tries to assign a vector of values to an existing object and the lengths do not match up.

In the example below, the stock price data of Axis bank has 245 rows. In the code, we created a sequence “s” of numbers from 1 to 150. When we try to add this sequence to the Axis Bank data set, it throws up a “replacement error” as the lengths of the two do not match. Thus to resolve such errors one should ensure that the lengths match.


symbol = "AXISBANK" ; noDays = 1 ;
dirPath = paste(getwd(),"/",noDays," Year Historical Data/",sep="")
fileName = paste(dirPath,symbol,".csv",sep="")
df = read.csv(fileName)

# Number of rows in the dataframe "df"
n = nrow(df); print(n);

# create a sequence of numbers from 1 to 150
s = seq(1,150,1)

# Add a new column "X" to the existing data frame "df"
df$X = s

Error in `$<-.data.frame`(`*tmp*`, "X", value = c(1, 2, 3, 4, 5, 6, 7, : replacement has 150 rows, data has 245

Functions Demystified

do.call function

The do.call function is used for calling other functions. The function which is to be called is provided as the first argument to do.call, while the second argument is a list of the arguments of the function to be called. The syntax for the function is given as: do.call(function_name, arguments)

Example: Let us first define a simple function that we will call later using do.call.

numbers = function(x, y) {
  sqrt(x^3 + y^3)
}

# Now let us call this 'numbers' function using do.call. We provide the function name as
# the first argument to do.call, and a list of the arguments as the second argument.
do.call(numbers, list(x = 3, y = 2))
[1] 5.91608

rename function

The rename function is part of the dplyr package, and is used to rename the columns of a data frame. The syntax for the rename function is to have the new name on the left-hand side of the = sign, and the old name on the right-hand side. Consider the data frame “df” given in the example below.


Tic = c("IOC", "BPCL", "HINDPETRO", "ABAN")
OP = c(555, 570, 1242, 210)
CP = c(558, 579, 1248, 213)
df = data.frame(Tic, OP, CP)

# Renaming the columns as 'Ticker', 'OpenPrice', and 'ClosePrice'. This can be done in the following 
# manner:

library(dplyr)
renamed_df = rename(df, Ticker = Tic, OpenPrice = OP, ClosePrice = CP)

lapply function

The lapply function is part of the R base package. It takes a list “x” as an input and returns a list of the same length as “x”, each element of which is the result of applying a function to the corresponding element of “x”. The syntax of the function is given as:

lapply(x, FUN)
x is a vector (atomic or list)
FUN is the function to be applied

Example 1:

Let us create a list with 2 elements, OpenPrice and the ClosePrice. We will compute the mean of the values in each element using the lapply function.

x = list(OpenPrice = c(520, 521.35, 521.45), ClosePrice = c(521, 521.1, 522))
lapply(x, mean)

$OpenPrice
[1] 520.9333

$ClosePrice
[1] 521.3667

Example 2:

x = list(a = 1:10, b = 11:15, c = 1:50)
lapply(x, FUN = length)

$a
[1] 10

$b
[1] 5

$c
[1] 50

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Read more

5 Ways Your Life Will Change After GST


By Sushant Ratnaparkhi

It is happening: the Goods and Services Tax (GST) will be implemented across the country from 1st July 2017. Everyone is going to be affected, some in a good way, some in a bad way. Here, I have compiled a list of 5 ways your life will change after GST.

Before we do that, why don’t we take a quick look at what GST is, for the uninitiated. There are two types of taxes in India: direct and indirect. Direct tax is income tax, yes, the one that gets deducted from your salary every month. Indirect taxes are the rest, such as Service Tax, VAT, Excise Duty, Customs Duty, Entertainment Tax and Luxury Tax. These are the taxes that you indirectly end up paying whenever you make a transaction. Since the list of indirect taxes is quite long and they differ vastly from state to state, it becomes very complex for the government and taxpayers to manage them. This leads to inefficiencies, loopholes, and ways for people to escape them.


Read more

[WEBINAR] Classification of Quantitative Trading Strategies

Tuesday 11th July, 7:00 PM IST | 9:30 AM EST | 9:30 PM SGT


There exist thousands of academic research papers written on trading strategies. Learn what these academics found out and how we can use their knowledge in the trading world.

Session Outline

  • Introduction to ‘Quantpedia & QuantInsti™’
  • Overview of research in the field of quantitative trading
  • Taxonomy of quantitative trading strategies
  • Where to look for unique alpha
  • Examples of lesser-known trading strategies
  • Common issues in quant research
  • Questions and Answers


Read more

Trading Using Machine Learning In Python Part-2


By Varun Divakar


At the end of my last blog, I had asked a few questions. Now, I will answer them all at the same time. I will also discuss a way to detect the regime/trend in the market without training the algorithm for trends. But before we go ahead, please use a fix to fetch the data from Google to run the code below.



Is the equation over-fitting?

This was the first question I had asked. The best way to find out whether your model is overfitting is to compare the prediction error that the algorithm makes on the train data with the error it makes on the test data.
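The comparison can be illustrated with a small self-contained Python sketch. Everything in it is an assumption made for illustration and is unrelated to the model in the original post: the synthetic data, the interleaved train/test split (a real price series would be split chronologically), and the two toy models, a straight-line fit and a "memorizer" that returns the label of the nearest training point.

```python
import random

def mse(actual, predicted):
    """Mean squared prediction error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda v: intercept + slope * v

def fit_memorizer(x, y):
    """Predict the label of the closest training point (1-nearest-neighbour)."""
    pairs = list(zip(x, y))
    return lambda v: min(pairs, key=lambda p: abs(p[0] - v))[1]

# Synthetic data: a linear signal plus noise, split into interleaved halves.
rng = random.Random(1)
x = [i / 10 for i in range(200)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def train_test_error(fit):
    """Fit on the train half and return (train error, test error)."""
    model = fit(x_train, y_train)
    train_err = mse(y_train, [model(v) for v in x_train])
    test_err = mse(y_test, [model(v) for v in x_test])
    return train_err, test_err

line_train, line_test = train_test_error(fit_line)
memo_train, memo_test = train_test_error(fit_memorizer)
print("line:      train", line_train, "test", line_test)
print("memorizer: train", memo_train, "test", memo_test)
```

The memorizer reproduces its training data perfectly but does noticeably worse on unseen points, which is the signature of overfitting. The straight line shows similar errors on both halves.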


Read more

Machine Learning For Trading – How To Predict Stock Prices Using Regression?


By Sushant Ratnaparkhi

The other day I was reading an article on how AI has progressed so far and where it is going. I was awestruck and had a hard time digesting the picture the author drew on possibilities in the future.

Here is how I reacted. (No, I am not as good looking as Joey but you get the idea)

And here is one of the possibilities where AI could be applied in the medical field, in a paragraph from the article:

A surgeon could control a machine scalpel with her motor cortex instead of holding one in her hand, and she could receive sensory input from that scalpel so that it would feel like an 11th finger to her. So it would be as if one of her fingers was a scalpel and she could do the surgery without holding any tools, giving her much finer control over her incisions. An inexperienced surgeon performing a tough operation could bring a couple of her mentors into the scene as she operates to watch her work through her eyes and think instructions or advice to her. And if something goes really wrong, one of them could “take the wheel” and connect their motor cortex to her outputs to take control of her hands.

You can read the article here.

At this moment, AI and Machine Learning have already progressed enough and they can predict stock prices with a great level of accuracy. Let me show you how.


What is Machine Learning?

The definition is this, “Machine Learning is where computer algorithms are used to autonomously learn from data and information and improve the existing algorithms”


Read more

R Weekly Bulletin Vol – XI

This week’s R bulletin will cover topics on how to round to the nearest desired number, converting and comparing dates and how to remove last x characters from an element.

We will also cover functions like rank, mutate, transmute, and set.seed. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Comment/uncomment current line/selection – Ctrl+Shift+C
2. Move Lines Up/Down – Alt+Up/Down
3. Delete Line – Ctrl+D

Problem Solving Ideas

Rounding to the nearest desired number

Consider a case where you want to round a given number to the nearest 25. This can be done in the following manner:

round(145/25) * 25
[1] 150

floor(145/25) * 25
[1] 125

ceiling(145/25) * 25
[1] 150

Suppose you are calculating a stop loss or take profit for an NSE stock in which the minimum tick is 5 paisa. In such a case, we divide and multiply by 0.05 to achieve the desired outcome.


Price = 566
Stop_loss = 1/100

# without rounding
SL = Price * Stop_loss
[1] 5.66

# with rounding to the nearest 0.05
SL1 = floor((Price * Stop_loss)/0.05) * 0.05
[1] 5.65

How to remove last n characters from every element

To remove the last n characters we will use the substr function along with the nchar function. The example below illustrates the way to do it.


# In this case, we just want to retain the ticker name which is "TECHM"
symbol = "TECHM.EQ-NSE"
s = substr(symbol,1,nchar(symbol)-7)
[1] “TECHM”

Converting and Comparing dates in different formats

When we pull stock data from Google finance the date appears as “YYYYMMDD”, which is not recognized as a date-time object. To convert it into a date-time object we can use the “ymd” function from the lubridate package.


library(lubridate)
x = ymd(20160724)
[1] “2016-07-24”

Another data provider gives stock data which has the date-time object in the American format (mm/dd/yyyy). When we read the file, the date-time column is read as a character. We need to convert this into a date-time object. We can convert it using the as.Date function and by specifying the format.

dt = "07/24/2016"
y = as.Date(dt, format = "%m/%d/%Y")
[1] “2016-07-24”

# Comparing the two date-time objects (from Google Finance and the data provider) after conversion
identical(x, y)
[1] TRUE

Functions Demystified

rank function

The rank function returns the sample ranks of the values in a vector. Ties (i.e., equal values) and missing values can be handled in several ways.

rank(x, na.last = TRUE, ties.method = c("average", "first", "random", "max", "min"))

x: numeric, complex, character or logical vector
na.last: for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if “keep” they are kept with rank NA
ties.method: a character string specifying how ties are treated


x <- c(3, 5, 1, -4, NA, Inf, 90, 43)
rank(x)
[1] 3 4 2 1 8 7 6 5

rank(x, na.last = FALSE)
[1] 4 5 3 2 1 8 7 6

mutate and transmute functions

The mutate and transmute functions are part of the dplyr package. The mutate function computes new variables using the existing variables of a given data frame. The new variables are added to the existing data frame. On the other hand, the transmute function creates these new variables as a separate data frame.

Consider the data frame “df” given in the example below. Suppose we have 5 observations of 1-minute price data for a stock, and we want to create a new variable by subtracting the mean from the 1-minute closing prices. It can be done in the following manner using the mutate function.


OpenPrice = c(520, 521.35, 521.45, 522.1, 522)
ClosePrice = c(521, 521.1, 522, 522.25, 522.4)
Volume = c(2000, 3500, 1750, 2050, 1300)
df = data.frame(OpenPrice, ClosePrice, Volume)

library(dplyr)
df_new = mutate(df, cpmean_diff = ClosePrice - mean(ClosePrice, na.rm = TRUE))

# If we want the new variable as a separate data frame, we can use the transmute function instead.
df_new = transmute(df, cpmean_diff = ClosePrice - mean(ClosePrice, na.rm = TRUE))

set.seed function

The set.seed function helps generate the same sequence of random numbers every time the program runs. It sets the random number generator to a known state. The function takes a single integer argument, and using the same integer again reproduces the same initial state.


# Initialize the random number generator to a known state and generate five random numbers
[1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928

# Reinitialize to the same known state and generate the same five 'random' numbers
[1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Read more

R Weekly Bulletin Vol – X

This week’s R bulletin will cover topics on grouping data using ntile function, how to open files automatically, and formatting an Excel sheet using R.

We will also cover functions like the choose function, sample function, runif and rnorm function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Fold selected chunk – Alt+L
2. Unfold selected chunk – Shift+Alt+L
3. Fold all – Alt+0

Problem Solving Ideas

Grouping data using ntile function

The ntile function is part of the dplyr package, and is used for grouping data. The syntax for the function is given by:

ntile(x, n)

“x” is the vector of values and
“n” is the number of buckets/groups to divide the data into.


In this example, we first create a data frame from two vectors, one comprising stock symbols and the other comprising their respective prices. We then group the values in the Price column into 2 groups, and the ranks are populated in a new column called “Ntile”. In the last line, we select only those values which fall in the 2nd bucket using the subset function.

Price = c(14742, 33922, 24450, 21800, 5519)

data = data.frame(Ticker, Price)

data$Ntile = ntile(data$Price, 2)

ranked_data = subset(data, subset = (Ntile == 2))

Automatically open the saved files

If you are saving the output returned upon executing an R script, and also want to open the file after running the code, you can use the shell.exec function. This function opens the specified file using the application specified in the Windows file associations.

A file association associates a file with an application capable of opening that file. More commonly, a file association associates a class of files (usually determined by their filename extension, such as .txt) with a corresponding application (such as a text editor).

The example below illustrates the usage of the function.


df = data.frame(Symbols=c("ABAN","BPCL","IOC"),Price=c(212,579,538))
write.csv(df,"Stocks List.csv")
shell.exec("Stocks List.csv")

Quick format of the excel sheet for column width

We can format the excel sheets for column width using the command lines given below. In the example, the first line will load the excel workbook specified by the file name. In the third & the fourth line, the autoSizeColumn function adjusts the width of the columns, which are specified in the “colIndex”, for each of the worksheets. The last line will save the workbook again after making the necessary formatting changes.


wb = loadWorkbook(file_name)
sheets = getSheets(wb)
autoSizeColumn(sheets[[1]], colIndex=1:7)
autoSizeColumn(sheets[[2]], colIndex=1:5)
saveWorkbook(wb, file_name)

Functions Demystified

choose function

The choose function computes the combination nCr. The syntax for the function is given as:

choose(n, r)

n is the number of elements
r is the number of subset elements

nCr = n!/(r! * (n-r)!)


choose(5, 2)
[1] 10

choose(2, 1)
[1] 2

sample function

The sample function randomly selects n items from a given vector. The samples are selected without replacement, which means that the function will not select the same item twice. The syntax for the function is given as:

sample(vector, n)

Example: Consider a vector consisting of yearly revenue growth data for a stock. We select 5 years revenue growth at random using the sample function.

Revenue = c(12, 10.5, 11, 9, 10.75, 11.25, 12.1, 10.5, 9.5, 11.45)
sample(Revenue, 5)
[1] 11.45 12.00 9.50 12.10 10.50

Some statistical processes require sampling with replacement; in such cases you can specify replace = TRUE in the sample function.


x = c(1, 3, 5, 7)
sample(x, 7, replace = TRUE)
[1] 7 1 5 3 7 3 5

runif and rnorm functions

The runif function generates a uniform random number between 0 and 1. The argument of runif function is the number of random values to be generated.


# This will generate 7 uniform random numbers between 0 and 1.
runif(7)
[1] 0.6989614 0.5750565 0.6918520 0.3442109 0.5469400 0.7955652 0.5258890

# This will generate 5 uniform random numbers between 2 and 4.
runif(5, min = 2, max = 4)
[1] 2.899836 2.418774 2.906082 3.728974 2.720633

The rnorm function generates random numbers from normal distribution. The function rnorm stands for the Normal distribution’s random number generator. The syntax for the function is given as:

rnorm(n, mean, sd)


# generates 6 numbers from a normal distribution with a mean of 3 and standard deviation of 0.25
rnorm(6, 3, 0.25)
[1] 3.588193 3.095924 3.240684 3.061176 2.905392 2.891183

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Read more

Project Work: Segregating Human and HFT Algo Orders


By Narasimha Sriharsha Kanduri


Market price is determined by the participants. The trending and range bound markets are the moods of the participants. By analyzing the participants we can have a wide variety of conclusions. Consider the two extreme participants, HFT Algos and humans.

Now let us analyze human oriented orders. Humans cannot replace the orders hundreds of times per second, nor can they adjust the price in fractions so as to profit from them. On the other hand, the Algos can replace the orders even in a fraction of a second and they try to make profit from even the small edges.


The idea of the study is to segregate the human and machine orders based on the minimum time taken to replace an order before its execution. If the minimum replace time is more than the threshold time Ts, the order is considered a human order; otherwise it is considered an HFT algo order. The effectiveness of the segregation is then determined with sample data that has the human and HFT algo order details.
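The threshold rule can be written down as a short Python sketch. The function names, the representation of an order as a list of event timestamps in seconds, and the "unclassified" label for orders that were never replaced are assumptions added for illustration; the excerpt does not specify the data layout or how unreplaced orders are treated.

```python
def min_replace_interval(event_times):
    """Smallest gap in seconds between consecutive events of one order.

    event_times holds the order entry time followed by its replacement times.
    Returns None when the order was never replaced.
    """
    ordered = sorted(event_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return min(gaps) if gaps else None

def classify_order(event_times, threshold_seconds):
    """Label an order 'human' or 'hft_algo' using the minimum replace time."""
    gap = min_replace_interval(event_times)
    if gap is None:
        return "unclassified"  # assumption: the rule needs at least one replacement
    return "human" if gap > threshold_seconds else "hft_algo"

# Hypothetical orders, with Ts = 1 second.
print(classify_order([0.0, 5.0, 12.0], threshold_seconds=1.0))   # slow replacements
print(classify_order([0.0, 0.002, 3.0], threshold_seconds=1.0))  # one 2 ms replacement
```

With labelled sample data, the effectiveness of a given Ts could then be measured by counting how many orders the rule classifies correctly.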


Read more

Raining Data – Cloud Computing Solutions for Retail Traders


By Rashmi Punjabi


Innovation in technology over the past few years has managed to cause a stir in the evolutionary stage of the traditional instruments involved in the financial markets for traders.

But first, let’s get an idea on who these retail traders are.

A retail trader is someone who buys and sells securities for their own account. Such a trader does not represent any organization and is also known as an ‘individual trader’. These traders want to manage their trades and amplify them using the latest technology. Cloud computing is one such technological innovation that has seen good adoption amongst traders in recent years.



Read more