Pair Trading – Statistical Arbitrage On Cash Stocks

This article is the final project submitted by the author as a part of his coursework in Executive Programme in Algorithmic Trading (EPAT™) at QuantInsti™. Do check our Projects page and have a look at what our students are building.

About the Author

Jonathan has a strong knowledge of mathematical programming and has worked as a process optimization engineer for 3 years. He first took up trading as a hobby, drawn to algorithmic trading in particular by his passion for math, and it eventually became his full-time job. Jonathan enrolled in the Executive Programme in Algorithmic Trading (EPAT™) in November 2016 and found his place in the world of quantitative analysis in finance. Currently, he is taking several online courses in subjects related to Artificial Intelligence and its applications in finance, and is about to start an online portal in Financial Engineering to share his experience as a Quant Trader.


Project Objective

The objective of this project is to model a statistical arbitrage trading strategy and quantitatively analyze the modeling results. The motivation is to diversify investment across five sectors, namely Technology, Financial, Services, Consumer Goods and Industrial Goods. Furthermore, some stocks, generally in the same sector, move in tandem because their prices are affected by the same market events. However, noise may make them deviate temporarily from the usual pattern, and a trader can take advantage of this apparent deviation in the expectation that the stocks will eventually return to their long-term relationship.

Within each sector, stocks were selected based on high liquidity, a small bid/ask spread, and the ability to short the stock. However, it is possible to consider other stocks for further analysis. Once the stock universe is defined, pairs can be formed. Each day on which we want to enter a position, all the pairs in the universe are evaluated and the top pairs are selected according to some criteria.

Trading Strategy Idea

With the universe of pairs defined, correlation analysis is performed on all possible pairs to select those with suitable properties for statistical arbitrage. This correlation test measures the strength of the relationship between two stock prices. The logic of the strategy is: for any correlated pair from the established universe, if the pair ratio diverges from its mean by more than a certain threshold, we short the expensive stock and buy the cheap one. Once the ratio converges back to the mean, we close the position and profit from the reversal.
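The entry and exit logic described above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function name, the z-score formulation, and the thresholds `entry_z` and `exit_z` are assumptions, and the prices are made up.

```python
from statistics import mean, stdev

def ratio_signal(prices_a, prices_b, entry_z=2.0, exit_z=0.5):
    """Classify the latest bar of a pair from the z-score of its price ratio.

    entry_z and exit_z are illustrative thresholds, not values from the project.
    """
    ratios = [a / b for a, b in zip(prices_a, prices_b)]
    history, latest = ratios[:-1], ratios[-1]
    # How many standard deviations the latest ratio sits from its historical mean
    z = (latest - mean(history)) / stdev(history)
    if z > entry_z:
        return "short A, long B"   # ratio is rich: A is expensive relative to B
    if z < -entry_z:
        return "long A, short B"   # ratio is cheap: A is cheap relative to B
    if abs(z) < exit_z:
        return "close position"    # ratio is back near its mean
    return "hold"

# Made-up prices: B is flat, A oscillates around 100 and then jumps.
prices_b = [100.0] * 10
prices_a = [100.0, 101.0, 99.0] * 3 + [110.0]
print(ratio_signal(prices_a, prices_b))
```

The sketch assumes the ratio history is not constant (otherwise the standard deviation is zero); a real implementation would guard against that and would use a rolling window.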

The strategy triggers new orders whenever the price ratio of a pair in the filtered universe diverges from its mean. For the trade to be worthwhile at this point, the pair must also be cointegrated. If the pair is cointegrated, the ratio is mean reverting, and the greater the dispersion from its mean, the higher the probability of a reversal, which makes the trade more attractive. This analysis helps determine the stability of the long-term relationship. The spread time series is tested for stationarity with the Augmented Dickey-Fuller (ADF) test. In other words, if the two stocks are cointegrated, the mean and variance of the spread should remain constant over time. There is, however, a major issue that makes this simple strategy difficult to implement in practice: the long-term relationship can break down, and the spread can move from one equilibrium to another.
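To make the stationarity test concrete, here is a simplified sketch in pure Python. It computes the plain Dickey-Fuller t-statistic (with an intercept and no lagged differences), which is the ADF test with zero lags; a full ADF test, as available in statistical packages, also adds lagged difference terms. The function name and the simulated spread are assumptions for illustration, and the value -2.86 is the approximate large-sample 5% critical value for the intercept-only case.

```python
import math
import random

def dickey_fuller_t(series):
    """t-statistic of b in: series[t] - series[t-1] = a + b * series[t-1] + error."""
    x = series[:-1]
    y = [series[i + 1] - series[i] for i in range(len(series) - 1)]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # OLS slope
    a = mean_y - b * mean_x            # OLS intercept
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return b / se_b

# Simulated mean-reverting spread (an AR(1) process with coefficient 0.5).
random.seed(0)
spread = [0.0]
for _ in range(500):
    spread.append(0.5 * spread[-1] + random.gauss(0, 1))

t_stat = dickey_fuller_t(spread)
# A statistic well below the critical value rejects a unit root,
# i.e. the spread looks stationary.
print(t_stat < -2.86)
```

A strongly negative statistic supports mean reversion; a value close to zero means the spread cannot be distinguished from a random walk.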

A training period of at least one year of data is used ahead of the out-of-sample test, and the capital allocated to each sector is decided using a minimum variance portfolio approach. Each sector is traded independently. Data from Yahoo Finance was used to test the strategy. Backtesting for each pair used data from 1-Jan-2009 to 31-Dec-2014.
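As an illustration of the minimum variance idea, the unconstrained minimum variance weights are proportional to the inverse covariance matrix applied to a vector of ones, then normalized to sum to one. The sketch below is an assumption about how such an allocation could be computed, not the project's code; the covariance numbers are invented and the helper names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def min_variance_weights(cov):
    """Weights proportional to inverse(cov) @ ones, scaled to sum to 1."""
    raw = solve(cov, [1.0] * len(cov))
    total = sum(raw)
    return [w / total for w in raw]

# Invented covariance matrix of returns for three sectors.
cov = [
    [0.040, 0.006, 0.004],
    [0.006, 0.090, 0.010],
    [0.004, 0.010, 0.020],
]
print(min_variance_weights(cov))
```

This version places no constraints on the weights, so strongly correlated inputs can produce negative allocations; a long-only version would need a constrained optimizer.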

Strategy Details

You can read the complete project work of the author, including the Python code for Pairs Trading, by downloading the ebook provided below.

Highlights from the project include:

  • Pair Trading – Statistical Arbitrage on Cash Stocks
  • Strategy
  • Code Details and In-Sample Backtesting
  • Analyzing Model Output
  • Monte Carlo Analysis and much more…

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to build a promising career in algorithmic trading. Enroll now!

Read more

Trading Using Machine Learning In Python Part-2

By Varun Divakar


At the end of my last blog, I had asked a few questions. Now, I will answer them all at the same time. I will also discuss a way to detect the regime/trend in the market without training the algorithm for trends. But before we go ahead, please use a fix to fetch the data from Google to run the code below.

Is the equation over-fitting?

This was the first question I had asked. The best way to tell whether your model is overfitting is to compare the prediction error that the algorithm makes on the train data with the error it makes on the test data.
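As a rough illustration of that check, the sketch below fits two models to the same synthetic data and prints the train and test error of each. The data, the interleaved split, and both models are assumptions chosen to make the contrast visible, not the setup used in the original blog: a straight line generalizes, while a one-nearest-neighbour model memorizes the training points.

```python
import random

def mse(model, xs, ys):
    """Mean squared prediction error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

def fit_nearest_neighbour(xs, ys):
    """Predict the target of the closest training point (memorizes the data)."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

# Synthetic data: a linear trend plus noise.
random.seed(1)
xs = [i / 10 for i in range(200)]
ys = [2 * x + random.gauss(0, 1) for x in xs]
train_x, train_y = xs[::2], ys[::2]
test_x, test_y = xs[1::2], ys[1::2]

errors = {}
for name, fit in [("line", fit_line), ("nearest neighbour", fit_nearest_neighbour)]:
    model = fit(train_x, train_y)
    errors[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(name, "train/test MSE:", errors[name])
```

A model whose train error is far below its test error, as with the nearest-neighbour fit here, is showing the classic signature of overfitting. For time series data a chronological split would normally replace the interleaved one used in this toy example.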


Read more

Machine Learning For Trading – How To Predict Stock Prices Using Regression?

By Sushant Ratnaparkhi

The other day I was reading an article on how AI has progressed so far and where it is going. I was awestruck and had a hard time digesting the picture the author drew on possibilities in the future.

Here is how I reacted. (No, I am not as good looking as Joey but you get the idea)

And here is one of the possibilities for applying AI in the medical field, in a paragraph from the article:

A surgeon could control a machine scalpel with her motor cortex instead of holding one in her hand, and she could receive sensory input from that scalpel so that it would feel like an 11th finger to her. So it would be as if one of her fingers was a scalpel and she could do the surgery without holding any tools, giving her much finer control over her incisions. An inexperienced surgeon performing a tough operation could bring a couple of her mentors into the scene as she operates to watch her work through her eyes and think instructions or advice to her. And if something goes really wrong, one of them could “take the wheel” and connect their motor cortex to her outputs to take control of her hands.

You can read the article here.

At this moment, AI and Machine Learning have already progressed enough and they can predict stock prices with a great level of accuracy. Let me show you how.

What is Machine Learning?

The definition is this: “Machine Learning is where computer algorithms are used to autonomously learn from data and information and improve the existing algorithms.”


Read more

Trading Strategy: 52-Weeks High Effect in Stocks

By Milind Paradkar

In today’s algorithmic trading, having a trading edge is one of the most critical elements. It’s plain and simple: if you don’t have an edge, don’t trade! Hence, as a quant, one is always on the lookout for good trading ideas. One good resource for trading strategies that has been gaining wide popularity is the Quantpedia site. Quantpedia has thousands of financial research papers that can be used to create profitable trading strategies.


The “Screener” page on Quantpedia categorizes hundreds of trading strategies based on different parameters like Period, Instruments, Markets, Complexity, Performance, Drawdown, Volatility, Sharpe etc.


Quantpedia has made some of these trading strategies available for free to their users. In this article, we will explore one such trading strategy listed on their site called the “52-Weeks High Effect in Stocks”.

52-Weeks High Effect in Stocks


The Quantpedia page for this trading strategy provides a detailed description which includes the 52-weeks high effect explanation, source research paper, other related papers, a visualization of the strategy performance and also other related trading strategies.

What is 52-Weeks High Effect? 

Let us put down the lucid explanation provided on Quantpedia here –

The “52-week high effect” states that stocks with prices close to their 52-week highs have better subsequent returns than stocks with prices far from their 52-week highs. Investors use the 52-week high as an “anchor” against which they value stocks. When stock prices are near the 52-week high, investors are unwilling to bid the price all the way to the fundamental value. As a result, investors under-react when stock prices approach the 52-week high, and this creates the 52-week high effect.

Source Paper


The source paper, “Industry Information and the 52-Week High Effect”, was authored by Xin Hong, Bradford D. Jordan, and Mark H. Liu.

The financial paper says that traders use the 52-week high as a reference point which they evaluate the potential impact of news against. When good news has pushed a stock’s price near or to a new 52-week high, traders are reluctant to bid the price of the stock higher even if the information warrants it. The information eventually prevails and the price moves up, resulting in a continuation. It works similarly for 52-week lows.

The trading strategy developed by the authors buys stocks in industries in which stock prices are close to 52-week highs and shorts stocks in industries in which stock prices are far from 52-week highs. They found that the industry 52-week high trading strategy is more profitable than the individual 52-week high trading strategy proposed by George and Hwang (2004).

Framing our 52-Weeks High Effect Strategy using R programming

Having understood the 52-weeks High Effect, we will try to backtest a simple trading strategy using R programming. Please note that we are not trying to replicate the exact trading strategy developed by the authors in their research paper.

We test our trading strategy for a 3-year backtest period using daily data on around 140 stocks listed on the National Stock Exchange of India Ltd. (NSE).

Brief about the strategy – The trading strategy reads the daily historical data for each stock in the list and checks if the price of the stock is near its 52-week high at the start of each month. We have shown how to check for this condition in step 4 of the trading strategy formulation process illustrated below. For all the stocks that pass this condition, we form an equal weighted portfolio for that month. We take a long position in these stocks at the start of the month and square off our position at the start of the next month. We follow this process for every month of our backtest period. Finally, we compute and chart the performance metrics of our trading strategy.

Now, let us understand the process of trading strategy formulation in a step-by-step manner. For reference, we have posted the R code snippets of relevant sections of the trading strategy under its respective step.

Step 1: First, we set the backtest period and the upper and lower threshold values for determining whether a stock is near its 52-week high.

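# Setting the lower and upper threshold limits
lower_threshold_limit = 0.90 # (e.g. 0.90 = 90%)
upper_threshold_limit = 0.95 # (e.g. 0.95 = 95%)

# Look-back window in days for the recent-high check in Step 4 (30 days, as described there)
no_max = 30

# Backtesting period in years (e.g. 1 = 1 year); the minimum period selected should be 2 years.
noDays = 4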

Step 2: In this step, we read the historical stock data using the read.csv function in R. We use data from Google Finance, which consists of the Open/High/Low/Close (OHLC) and Volume values.

# Run the program for each stock in the list
for(s in 1:length(symbol)){


dirPath = paste(getwd(),"/4 Year Historical Data/",sep="")
fileName = paste(dirPath,symbol[s],".csv",sep="")
# Read the historical data for this stock
data = read.csv(fileName)
data$TICKER = symbol[s]

# Merge NIFTY prices with Stock data and select the Close Price
data = merge(data,data_nifty, by = "DATE") 
data = data[, c("DATE", "TICKER","CLOSE.x","CLOSE.y")] 
colnames(data) = c("DATE","TICKER","CLOSE","NIFTY")
N = nrow(data)

Step 3: Since we are using the daily data we need to determine the start date of each month. The start date need not necessarily be the 1st of every month because the 1st can be a weekend or a holiday for the stock exchange. Hence, we write an R code which will determine the first date of each month.

# Determine the date on which each month of the backtest period starts

data$First_Day = ""

day = format(ymd(data$DATE),format="%d")
monthYr = format(ymd(data$DATE),format="%Y-%m")
yt = tapply(day,monthYr, min)

first_day = as.Date(paste(row.names(yt),yt,sep="-"))
frows = match(first_day, ymd(data$DATE))
for (f in frows) {data$First_Day[f] = "First Day of the Month"}

data = data[, c("TICKER","DATE", "CLOSE","NIFTY","First_Day")]

Step 4: Check if the stock is near the 52-week high mark. In this part, we first compute the 52-week high price for each stock. We then compute the upper and the lower thresholds using the 52-week high price.


If the lower threshold = 0.90, upper threshold = 0.95 and the 52-week high = 1200, then the threshold range is given by:

Threshold range = (0.90 * 1200) to (0.95 * 1200)

Threshold range = 1080 to 1140

If the stock price at the start of the month falls in this range, we then consider the stock to be near its 52-week high mark. We have also included one additional condition in the step. This condition checks whether the stock price in the past 30 days had reached the current 52-week high price and whether it is within the threshold range now. Such a stock will not be included in our portfolio as we assume that the stock price is in decline after reaching today’s 52-week high price.
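The two conditions described above can be summarized in a compact sketch. It is written in Python rather than the article's R, purely as an illustration: the function name and argument names are hypothetical, while the default values (0.90, 0.95, 252 trading days, 30 days) follow the text.

```python
def near_52_week_high(closes, lower=0.90, upper=0.95, lookback=252, recent=30):
    """Return True if the latest close qualifies as "near the 52-week high".

    closes: daily closing prices, oldest first; the last element is the
    close on the first trading day of the month being evaluated.
    """
    today = closes[-1]
    history = closes[-(lookback + 1):-1]   # the previous `lookback` closes
    high = max(history)                    # 52-week high
    recent_high = max(history[-recent:])   # highest close in the last `recent` days
    in_range = lower * high <= today <= upper * high
    # Exclude stocks whose 52-week high was set within the recent window,
    # since they are assumed to be declining from that high.
    return in_range and recent_high != high

# Made-up series: the 52-week high of 1200 was set long ago, today's close is 1100.
closes = [1000.0] * 100 + [1200.0] + [1000.0] * 151 + [1100.0]
print(near_52_week_high(closes))
```

With a 52-week high of 1200, the qualifying range is 1080 to 1140, so a close of 1100 passes as long as the high was not reached in the last 30 days.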

# Check if the stock is near its 52-week high at the start of each month

data$Near_52_Week_High = "" ; data$Max_52 = numeric(nrow(data))
data$Max_Not = numeric(nrow(data))

frows_tp = frows[frows >= 260]
for (fr in frows_tp){
  # Max price in the last 1 year (252 trading days)
  data$Max_52[fr] = max(data$CLOSE[(fr-252):(fr-1)])
  # Max price in the last no_max days, used to check whether the 52-week high occurred recently
  data$Max_Not[fr] = max(data$CLOSE[(fr-no_max):(fr-1)])
  if ((data$CLOSE[fr] >= lower_threshold_limit * data$Max_52[fr])
      && (data$CLOSE[fr] <= upper_threshold_limit * data$Max_52[fr])
      && (data$Max_Not[fr] != data$Max_52[fr])){
    data$Near_52_Week_High[fr] = "Near 52-Week High"
  } else {
    data$Near_52_Week_High[fr] = "Not Near 52-Week High"
  }
}

Step 5: For all the stocks that fulfill the criteria mentioned in the step above, we create a long-only portfolio. The entry price equals the price at the start of the month. We square off our long position at the start of the next month. We consider the close price of the stock for our entry and exit trades.

# Enter into a long position for stocks at the start of each month

data = subset(data,select=c(TICKER,DATE,CLOSE,NIFTY,First_Day,Max_52,Near_52_Week_High)
             ,subset=(First_Day=="First Day of the Month"))
data$NEXT_CLOSE = lagpad(data$CLOSE, 1)
colnames(data) = c("TICKER","DATE","CLOSE","NIFTY","First_Day","Max_52","Near_52_Week_High",
                   "NEXT_CLOSE")

data$Profit_Loss = numeric(nrow(data)); data$Nifty_change = numeric(nrow(data));

# Profit/loss for the months in which the stock was near its 52-week high
for (i in 1:length(data$CLOSE)) {
  if (data$Near_52_Week_High[i] == "Near 52-Week High"){
    data$Profit_Loss[i] = round(data$CLOSE[i+1] - data$CLOSE[i],2)
    data$Nifty_change[i] = round(Delt(data$NIFTY[i],data$NIFTY[i+1])*100,2)
  }
}

# No trade is taken in the other months, so the profit/loss is zero
for (i in 1:length(data$CLOSE)) {
  if (data$Near_52_Week_High[i] == "Not Near 52-Week High"){
    data$Profit_Loss[i] = 0
    data$Nifty_change[i] = round(Delt(data$NIFTY[i],data$NIFTY[i+1])*100,2)
  }
}

Step 6: In this step, we write an R code which creates a summary sheet of all the trades for each month in the backtest period. A sample summary sheet has been shown below. It also includes the Profit/Loss from every trade undertaken during the month.

# Create a Summary worksheet for all the trades during a particular month

final_data = final_data[-1,]
final_data = subset(final_data,select=c(TICKER,DATE,CLOSE,NEXT_CLOSE,Max_52,
                                        subset=(Near_52_Week_High == "Near 52-Week High"))

colnames(final_data) = c("Ticker","Date","Close_Price","Next_Close_Price",
                         "Max. 52-Week price","Is Stock near 52-Week high",

merged_file = paste(date_values[a],"- Summary.csv")

Monthly Trades Table

Step 7: In the final step, we compute the portfolio performance over the entire backtest period and also plot the equity curve using the PerformanceAnalytics package in R. The portfolio performance is saved in a CSV file.

cum_returns = Return.cumulative(eq_ts, geometric = TRUE)

charts.PerformanceSummary(eq_ts,geometric=TRUE, wealth.index = FALSE)
print(SharpeRatio.annualized(eq_ts, Rf = 0, scale = 12, geometric = TRUE))

A sample summary of the portfolio performance has been shown below. In this case, the input parameters to our trading strategy were as follows:

Plotting the Equity Curve

As can be observed from the equity curve, our trading strategy performed well during the initial period and then suffered drawdowns in the middle of the backtest period. The Sharpe ratio for the trading strategy comes to 0.4098.

Cumulative Return 1.172446

Annualized Sharpe Ratio (Rf=0%) 0.4098261
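For readers who want to see what an annualized Sharpe ratio on monthly returns involves, here is a small Python sketch of one common convention: the annualized geometric return divided by the annualized standard deviation, with a risk-free rate of zero. Whether this matches the PerformanceAnalytics calculation exactly is an assumption, and the returns below are invented.

```python
import math
from statistics import stdev

def annualized_sharpe(period_returns, periods_per_year=12):
    """Annualized geometric return over annualized volatility (risk-free rate = 0)."""
    n = len(period_returns)
    growth = math.prod(1 + r for r in period_returns)       # cumulative growth factor
    annual_return = growth ** (periods_per_year / n) - 1    # geometric annualization
    annual_vol = stdev(period_returns) * math.sqrt(periods_per_year)
    return annual_return / annual_vol

# Invented monthly returns for one year.
monthly_returns = [0.02, 0.0] * 6
print(round(annualized_sharpe(monthly_returns), 4))
```

The `periods_per_year=12` argument plays the same role as `scale = 12` in the R call: it tells the calculation that the return series is monthly.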

This was a simple trading strategy that we developed using the 52-week high effect explanation. One can tweak this trading strategy further to improve its performance and make it more robust or try it out on different markets.

Next Step

You can explore other trading strategies listed on the Quantpedia site under their screener page and if interested you can sign up to get access to hundreds of exciting trading strategies.

Read more

Trading Using Machine Learning In Python Part-1

By Varun Divakar


Machine Learning has many advantages. It is the hot topic right now. For a trader or a fund manager, the pertinent question is “How can I apply this new tool to generate more alpha?”. I will explore one such model that answers this question in a series of blogs.

This blog has been divided into the following segments:

  • Getting the data and making it usable.
  • Creating Hyper-parameters.
  • Splitting the data into test and train sets.
  • Getting the best-fit parameters to create a new function.
  • Making the predictions and checking the performance.
  • Finally, some food for thought.


You can install the necessary packages using pip in the Anaconda Prompt, for example with “pip install numpy pandas …”.


Read more

Multi-Strategy Portfolios: Combining Quantitative Strategies Effectively

By Derek Wong

Developing a successful algorithmic strategy is already a difficult endeavor. However, trading a single strategy can pose its own set of risks, even if the strategy itself is robust and profitable.

So how do we as algorithmic traders understand exactly what our systems are delivering, change our mindset from development to implementation, and increase our risk adjusted returns?

Distribution Analysis of Trading Strategies

Most traders are familiar with looking at standard performance reports which have statistics like CAGR, Sharpe Ratio, and max drawdown. But these single numbers provide only a small glimpse into what the system is actually delivering. By adding return distribution analysis to your toolkit, you will get a better grasp of what the system may produce at a more granular level.

The most common method for classifying a trading system is by entry type, either momentum or mean reversion. In the end this is subjective and constraining, as many strategies incorporate elements of both. For example, a mean reversion strategy may use a filter that has momentum characteristics. After adding the filter, is it still a mean reversion system?

This problem can be solved by using statistical methods to classify strategies by the descriptive statistics of their return distributions, rather than by subjective type or style. By analyzing the skew and looking at the tails of the return distribution, we get a much better indication of what the strategy is actually delivering, which allows us to make a quantitative judgement as to which regime it belongs to.
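As a minimal sketch of such a classification, the snippet below computes the sample skewness and excess kurtosis (a measure of tail weight) of a return series using population moments. The function name and the sample returns are assumptions for illustration. A commonly cited rule of thumb, which the article itself does not state, is that trend-following return profiles tend to show positive skew (many small losses, occasional large gains) and mean reversion profiles the opposite.

```python
from statistics import mean

def skew_and_excess_kurtosis(returns):
    """Skewness and excess kurtosis of a return series (population moments)."""
    n = len(returns)
    mu = mean(returns)
    m2 = sum((r - mu) ** 2 for r in returns) / n   # variance
    m3 = sum((r - mu) ** 3 for r in returns) / n   # third central moment
    m4 = sum((r - mu) ** 4 for r in returns) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Invented profile: nine small losses and one large gain.
returns = [-1.0] * 9 + [9.0]
skew, excess_kurtosis = skew_and_excess_kurtosis(returns)
print(skew > 0, excess_kurtosis > 0)
```

A positive skew with fat tails, as in this toy series, points to a payoff dominated by rare large winners, whatever label the strategy was given during development.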

Strategies as investable securities, changing your mindset.

Most novice traders think of their strategies as standalone systems, maintaining the same concept from ideation to implementation. However, there are two distinct environments: the vacuum of the quantitative research laboratory, and the investment portfolio in which you will execute your strategy. We need to consider the implications of implementation, its effect on our current portfolio, and its fit with our investment mandate. The best way to do that is to treat a strategy under consideration for allocation as an investable security.

At its most fundamental level, any strategy has a singular purpose: to deliver a return series with particular characteristics, usually outsized risk-adjusted returns. If this is the case, then funding a strategy amounts to making a long bet on that particular return series. This is the same as investing in any stock, commodity, or other asset.

There is then basically no difference in motivation between investing in your strategy and investing in any other asset or security. You will allocate the most funds to those that exhibit the most desirable characteristics, and less to those that do not.

Applying portfolio optimization and diversification.

If we accept the logic that investing in a completed strategy is the same as investing in any other asset, then the next logical step is to create a portfolio. No one would recommend that a friend buy only a single stock. So why would you, as a systematic trader, want only one strategy?

We can now rely on two areas that have been heavily researched in academia and practiced in the field for many decades: portfolio optimization and diversification. By applying the key principles that go into creating a portfolio of traditional assets, we can create a portfolio of multiple strategies. The same benefits that you get from a portfolio of traditional assets, such as decreased equity curve volatility and increased risk-adjusted returns, can then be transferred to your set of systematic trading strategies.
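The volatility benefit of combining strategies follows directly from the standard portfolio variance formula. The sketch below, with invented volatilities and correlations and a hypothetical function name, shows how the volatility of an equally weighted pair of strategies depends on the correlation between them.

```python
import math

def portfolio_volatility(weights, vols, corr):
    """Volatility of a weighted combination of return streams.

    variance = sum over i, j of w_i * w_j * vol_i * vol_j * corr[i][j]
    """
    n = len(weights)
    variance = sum(
        weights[i] * weights[j] * vols[i] * vols[j] * corr[i][j]
        for i in range(n)
        for j in range(n)
    )
    return math.sqrt(variance)

# Two strategies, each with 10% volatility, held in equal weights.
weights, vols = [0.5, 0.5], [0.10, 0.10]
for rho in (1.0, 0.5, 0.0):
    corr = [[1.0, rho], [rho, 1.0]]
    print(rho, round(portfolio_volatility(weights, vols, corr), 4))
```

With perfectly correlated strategies there is no reduction at all, while uncorrelated strategies bring the combined volatility down to roughly 7.1%, which is why low correlation between strategies matters as much as their standalone performance.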


QuantInsti™ hosted a webinar, “Multi-Strategy Portfolios: Combining Quantitative Strategies Effectively”, which was held on 16th May 2017 and conducted by Derek Wong, Director of Systematic Trading at Foretrade Investment Management Co. LTD. You can click on the link provided above to access the recorded session of the webinar.

Read more

Mixture Models for Forecasting Asset Returns

By Brian Christopher

Asset return prediction is difficult. Most traditional time series techniques don’t work well for asset returns. One significant reason is that time series analysis (TSA) models require your data to be stationary. If it isn’t stationary, then you must transform your data until it is stationary.

That presents a problem.

In practice, we observe multiple phenomena that violate the rules of stationarity including non-linear processes, volatility clustering, seasonality, and autocorrelation. This renders traditional models mostly ineffective for our purposes.

What are our options?

There are many algorithms to choose from, but few are flexible enough to address the challenges of predicting asset returns:

  • mean and volatility change through time
  • sometimes future returns are correlated with past returns, sometimes not
  • sometimes future volatility is correlated with past volatility, sometimes not
  • non-linear behavior

To recap, we need a model framework that is flexible enough to (1) adapt to non-stationary processes and (2) provide a reasonable approximation of the non-linear process that is generating the data.

Can Mixture Models offer a solution?

They have potential. First, they are based on several well-established concepts.

Markov models – These are used to model sequences where the future state depends only on the current state and not any past states. (memoryless processes)

Hidden Markov models – Used to model processes where the true state is unobserved (hidden) but there are observable factors that give us useful information to guess the true state.

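Expectation-Maximization (E-M) – This is an algorithm that iterates between two steps: estimating the probability that each observation belongs to each class given the current class parameters (the expectation step), and updating the class parameters to maximize the likelihood of the data given those probabilities (the maximization step).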

An easy way to think about applying mixture models to asset return prediction is to consider asset returns as a sequence of states or regimes. Each regime is characterized by its own descriptive statistics including mean and volatility. Example regimes could include low-volatility and high-volatility. We can also assume that asset returns will transition between these regimes based on probability. By framing the problem this way we can use mixture models, which are designed to try to estimate the sequence of regimes, each regime’s mean and variance, and the transition probabilities between regimes.
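The regime picture described above can be illustrated with a small simulation. Everything in this sketch is assumed for illustration: two regimes, their means and volatilities, and the probability of staying in the current regime from one day to the next.

```python
import random
from statistics import stdev

# Hypothetical regimes: (mean daily return, daily volatility)
REGIMES = {0: (0.0005, 0.01),   # low-volatility regime
           1: (-0.0010, 0.03)}  # high-volatility regime
STAY_PROBABILITY = {0: 0.95, 1: 0.95}  # chance of remaining in the same regime

def simulate_regimes(n_days, seed=0):
    """Simulate a two-state Markov chain of regimes and Gaussian returns."""
    rng = random.Random(seed)
    state, states, returns = 0, [], []
    for _ in range(n_days):
        if rng.random() > STAY_PROBABILITY[state]:
            state = 1 - state                 # switch regime
        mu, sigma = REGIMES[state]
        states.append(state)
        returns.append(rng.gauss(mu, sigma))  # return drawn from the regime's distribution
    return states, returns

states, returns = simulate_regimes(1000)
calm = [r for s, r in zip(states, returns) if s == 0]
stressed = [r for s, r in zip(states, returns) if s == 1]
print(len(calm), len(stressed), stdev(stressed) > stdev(calm))
```

In practice the problem runs the other way: only the returns are observed, and the model has to infer the regimes, their parameters, and the transition probabilities from them.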

The most common is the Gaussian mixture model (GMM).

The underlying model assumption is that each regime is generated by a Gaussian process with parameters we can estimate. Under the hood, GMM employs an expectation-maximization algorithm to estimate regime parameters and the most likely sequence of regimes.

GMMs are flexible, generative models that have had success approximating non-linear data. Generative models are special in that they try to mimic the underlying data process such that we can create new data that should look like original data.
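To show what the expectation-maximization loop looks like for a Gaussian mixture, here is a minimal pure-Python sketch for two components in one dimension. It is an illustration under simplifying assumptions, not a production implementation: the starting guess is crude, the toy data is well separated, and the function names are hypothetical. Note also that a plain GMM treats observations as independent, so it assigns each observation to a regime but does not model transitions between regimes; that requires a hidden Markov model.

```python
import math
from statistics import mean, pvariance

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with expectation-maximization."""
    ordered = sorted(data)
    half = len(ordered) // 2
    mus = [mean(ordered[:half]), mean(ordered[half:])]   # crude starting guess
    variances = [pvariance(data)] * 2
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each observation belongs to each component
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means and variances from those probabilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk, 1e-12
            )
    labels = [0 if r[0] >= r[1] else 1 for r in resp]    # most likely component
    return weights, mus, variances, labels

# Toy data: two well-separated clusters centred on -1 and 2.
offsets = [-0.10, -0.05, 0.0, 0.05, 0.10]
data = [-1.0 + d for d in offsets] + [2.0 + d for d in offsets]
weights, mus, variances, labels = fit_two_gaussians(data)
print([round(m, 3) for m in mus], labels)
```

Real return data is far less cleanly separated, and regimes often differ in volatility more than in mean, so results depend heavily on initialization and on the number of components chosen.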


BlackArbs LLC, in collaboration with QuantInsti™, hosted a webinar, “Can we use mixture models to predict market bottoms?”, which was held on 25th April 2017 and conducted by Brian Christopher, Founder of Blackarbs LLC. You can click on the link provided above to access the recorded session of the webinar.

Read more

Forecasting Markets using eXtreme Gradient Boosting (XGBoost)

By Milind Paradkar

In recent years, machine learning has been generating a lot of curiosity for its profitable application to trading. Numerous machine learning models, such as Linear/Logistic regression, Support Vector Machines, Neural Networks, and Tree-based models, are being tried and applied in an attempt to analyze and forecast the markets. Researchers have found that some models have a higher success rate than others. eXtreme Gradient Boosting, also called XGBoost, is one such machine learning model that has received rave reviews from machine learning practitioners.

In this post, we will cover the basics of XGBoost, a winning model in many Kaggle competitions. We then attempt to develop an XGBoost stock forecasting model using the “xgboost” package in R programming.

Basics of XGBoost and related concepts

Developed by Tianqi Chen, the eXtreme Gradient Boosting (XGBoost) model is an implementation of the gradient boosting framework. The gradient boosting algorithm is a machine learning technique used for building predictive tree-based models (see Machine Learning: An Introduction to Decision Trees).

Boosting is an ensemble technique in which new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.

The ensemble technique uses the tree ensemble model which is a set of classification and regression trees (CART). The ensemble approach is used because a single CART, usually, does not have a strong predictive power. By using a set of CART (i.e. a tree ensemble model) a sum of the predictions of multiple trees is considered.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction.
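The idea of fitting each new model to the residuals of the previous ones can be sketched in a few lines. The example below is written in Python rather than the post's R and uses one-split regression trees (stumps) on a single feature with squared-error loss. It is a bare-bones illustration of the boosting loop, not of XGBoost itself, and the function names, number of rounds, and learning rate are assumptions.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one value per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)              # start from the mean prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data: a curved relationship that a single stump cannot capture.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
model = boost(xs, ys)
baseline_mse = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(boosted_mse < baseline_mse)
```

For squared-error loss the residuals are exactly the negative gradient of the loss with respect to the current predictions, which is where the "gradient" in gradient boosting comes from.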

The objective of the XGBoost model is given as:

Obj = L + Ω

where L is the loss function, which controls the predictive power, and
Ω is the regularization component, which controls simplicity and overfitting.

The loss function (L) which needs to be optimized can be Root Mean Squared Error for regression, Logloss for binary classification, or mlogloss for multi-class classification.

The regularization component (Ω) is dependent on the number of leaves and the prediction score assigned to the leaves in the tree ensemble model.
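For reference, the objective and the regularization term can be written out as follows. The notation is taken from the XGBoost documentation rather than from this post, so the symbols are assumptions here: l is the per-observation loss, f_k is the k-th tree, T is the number of leaves in a tree, w_j is the prediction score of leaf j, and γ and λ are regularization parameters.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

The first term of Ω penalizes trees with many leaves, and the second penalizes large leaf scores, which matches the description of a regularization component that depends on the number of leaves and the scores assigned to them.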

It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. The Gradient boosting algorithm supports both regression and classification predictive modeling problems.

Sample XGBoost model:

We will use the “xgboost” R package to create a sample XGBoost model. You can refer to the documentation of the “xgboost” package here.

Install and load the xgboost library –

We install the xgboost library using the install.packages function. To load this package we use the library function. We also load other relevant packages required to run the code.


# Load the relevant libraries
library(quantmod); library(TTR); library(xgboost);


Create the input features and target variable – We take the 5-year OHLC and volume data of a stock and compute the technical indicators (input features) using this dataset. The indicators computed include Relative Strength Index (RSI), Average Directional Index (ADX), and Parabolic SAR (SAR). We create a lag in the computed indicators to avoid the look-ahead bias. This gives us our input features for building the XGBoost model. Since this is a sample model, we have included only a few indicators to build our set of input features.

# Read the stock data 
symbol = "ACC"
fileName = paste(getwd(),"/",symbol,".csv",sep="")
df = read.csv(fileName)
colnames(df) = c("Date","Time","Close","High", "Low", "Open","Volume")

# Define the technical indicators to build the model 
rsi = RSI(df$Close, n=14, maType="WMA")
adx = data.frame(ADX(df[,c("High","Low","Close")]))
sar = SAR(df[,c("High","Low")], accel = c(0.02, 0.2))
trend = df$Close - sar

# create a lag in the technical indicators to avoid look-ahead bias 
rsi = c(NA,head(rsi,-1)) 
adx$ADX = c(NA,head(adx$ADX,-1)) 
trend = c(NA,head(trend,-1))

Our objective is to predict the direction of the daily stock price change (Up/Down) using these input features. This makes it a binary classification problem. We compute the daily price change (here, the close minus the open) and assign a value of 1 if the change is positive. Otherwise, i.e. if the change is zero or negative, we assign a value of 0.

# Create the target variable
price = df$Close-df$Open
class = ifelse(price > 0,1,0)


Combine the input features into a matrix – The input features and the target variable created in the above step are combined to form a single matrix. We use the matrix structure in the XGBoost model since the xgboost library allows data in the matrix format.

# Create a Matrix
model_df = data.frame(class,rsi,adx$ADX,trend)
model = matrix(c(class,rsi,adx$ADX,trend), nrow=length(class))
model = na.omit(model)
colnames(model) = c("class","rsi","adx","trend")


Split the dataset into training data and test data – In the next step, we split the dataset into training and test data. Using this training and test dataset we create the respective input features set and the target variable.

# Split data into train and test sets 
train_size = 2/3
# floor() keeps the breakpoint an integer row index
breakpoint = floor(nrow(model) * train_size)

training_data = model[1:breakpoint,]
test_data = model[(breakpoint+1):nrow(model),]

# Split data training and test data into X and Y
X_train = training_data[,2:4] ; Y_train = training_data[,1]
class(X_train)[1]; class(Y_train)

X_test = test_data[,2:4] ; Y_test = test_data[,1]
class(X_test)[1]; class(Y_test)


Train the XGBoost model on the training dataset –

We use the xgboost function to train the model. The arguments of the xgboost function are shown in the picture below.

The data argument in the xgboost function is for the input features dataset. It accepts a matrix, dgCMatrix, or local data file. The nrounds argument refers to the max number of iterations (i.e. the number of trees added to the model). The obj argument refers to the customized objective function. It returns gradient and second order gradient with given prediction and dtrain.
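To make the idea of a customized objective more concrete, the sketch below computes the gradient and second order gradient of the logistic loss for a few raw prediction scores. It is written in plain Python rather than R, purely for illustration, and it does not call the xgboost library; the helper name logistic_grad_hess and the sample scores are assumptions made for this example.

```python
import math

def logistic_grad_hess(raw_scores, labels):
    """Gradient and second order gradient of the log loss with respect
    to the raw (pre-sigmoid) score.

    Hypothetical helper for illustration; not part of the xgboost API.
    """
    grads, hessians = [], []
    for score, y in zip(raw_scores, labels):
        p = 1.0 / (1.0 + math.exp(-score))  # predicted probability
        grads.append(p - y)                 # first order gradient
        hessians.append(p * (1.0 - p))      # second order gradient
    return grads, hessians

# Made-up raw scores and labels
grads, hessians = logistic_grad_hess([0.0, 2.0, -1.0], [1, 1, 0])
print(grads)
print(hessians)
```

A score of 0 corresponds to a probability of 0.5, so for a positive label the gradient is -0.5 and the second order gradient is 0.25. These two quantities are what a boosting round uses to decide how the next tree should correct the current predictions.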

# Train the xgboost model using the "xgboost" function
dtrain = xgb.DMatrix(data = X_train, label = Y_train)
xgModel = xgboost(data = dtrain, nrounds = 5, objective = "binary:logistic")


Output – The output is the classification error on the training data set.


We can also use the cross-validation function of xgboost, i.e. xgb.cv. In this case, the original sample is randomly partitioned into nfold equal size subsamples. Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining (nfold – 1) subsamples are used as training data. This is repeated so that each of the nfold subsamples is used exactly once as the validation data, and the evaluation metric is reported for each of the nrounds boosting iterations.

# Using cross validation
dtrain = xgb.DMatrix(data = X_train, label = Y_train)
cv = xgb.cv(data = dtrain, nrounds = 10, nfold = 5, objective = "binary:logistic")
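The partitioning behind cross-validation can be sketched in a few lines. The example below is plain Python (not R) and only illustrates how a sample might be split into nfold subsamples, each used once for validation; the function name kfold_indices and the seed are assumptions for this example, and this is not how the xgboost library implements it internally.

```python
import random

def kfold_indices(n_samples, nfold, seed=0):
    """Randomly partition range(n_samples) into nfold near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every nfold-th shuffled index starting at position i
    return [indices[i::nfold] for i in range(nfold)]

folds = kfold_indices(n_samples=10, nfold=5)
for i, validation in enumerate(folds):
    # All other folds together form the training set for this split
    training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print(f"fold {i}: validation={sorted(validation)}, training size={len(training)}")
```

Every observation appears in exactly one validation fold, so each data point is used for validation once and for training (nfold – 1) times.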


Output – The xgb.cv function returns a data.table object containing the cross validation results.

Make predictions on the test data

To make predictions on the unseen data set (i.e. the test data), we apply the trained XGBoost model on it which gives a series of numbers.

# Make the predictions on the test data
preds = predict(xgModel, X_test)

# Determine the size of the prediction vector
print(length(preds))

# Limit display of predictions to the first 6
print(head(preds))

Output –

These numbers do not look like binary classification {0, 1}. We have to, therefore, perform a simple transformation before we are able to use these results. In the example code shown below, we are comparing the predicted number to the threshold of 0.5. The threshold value can be changed depending upon the objective of the modeler, the metrics (e.g. F1 score, Precision, Recall) that the modeler wants to track and optimize.

prediction = as.numeric(preds > 0.5)


Output –

Measuring model performance

Different evaluation metrics can be used to measure the model performance. In our example, we will compute a simple metric, the average error. It compares the predicted score with the threshold of 0.50.

For example: If the predicted score is less than 0.50, then the (preds > 0.5) expression gives a value of 0. If this value is not equal to the actual result from the test data set, then it is taken as a wrong result.

We compare all the preds with the respective data points in the Y_test and compute the average error. The code for measuring the performance is given below. Alternatively, we can use hit rate or create a confusion matrix to measure the model performance.

# Measuring model performance
error_value = mean(as.numeric(preds > 0.5) != Y_test)
print(paste("test-error=", error_value))
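As a complement to the average error, the hit rate and confusion matrix mentioned above can be sketched as follows. The example is in plain Python rather than R and uses made-up predicted scores and labels, not output from the model in this post; the helper names classify and confusion_matrix are hypothetical.

```python
def classify(probabilities, threshold=0.5):
    """Convert predicted scores to 0/1 labels using a threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts

# Made-up scores and actual outcomes, for illustration only
sample_preds = [0.62, 0.48, 0.51, 0.30, 0.75, 0.45]
sample_actual = [1, 0, 0, 0, 1, 1]

cm = confusion_matrix(sample_actual, classify(sample_preds))
error = (cm["FP"] + cm["FN"]) / len(sample_actual)
hit_rate = 1.0 - error
print(cm, "error:", round(error, 3), "hit rate:", round(hit_rate, 3))
```

The confusion matrix separates the two kinds of mistakes (false ups and false downs), which the single average error figure hides. This matters when the threshold is tuned for precision or recall.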


Output –

Plot the feature importance set – We can find the top important features in the model by using the xgb.importance function.

# View feature importance from the learnt model
importance_matrix = xgb.importance(model = xgModel)
print(importance_matrix)
xgb.plot.importance(importance_matrix)


Plot the XGBoost Trees

Finally, we can plot the XGBoost trees using the xgb.plot.tree function. To limit the plot to a specific number of trees, we can use the n_first_tree argument. If NULL, all trees of the model are plotted.

# View the trees from a model
xgb.plot.tree(model = xgModel)

# View only the first tree in the XGBoost model
xgb.plot.tree(model = xgModel, n_first_tree = 1)



This post covered the popular XGBoost model along with sample code in R to forecast the daily direction of the stock price change. Readers can catch some of our previous machine learning blogs (links given below). We will be covering more machine learning concepts and techniques in our coming posts.

Predictive Modeling in R for Algorithmic Trading
Machine Learning and Its Application in Forex Markets

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

Download Data Files

  • ACC.csv


Mean Reversion in Time Series

By Devang Singh

Time series data is simply a collection of observations generated over time. For example, the speed of a race car at each second, daily temperature, weekly sales figures, stock returns per minute, etc. In the financial markets, a time series tracks the movement of specific data points, such as a security’s price over a specified period of time, with data points recorded at regular intervals. A time series can be generated for any variable that is changing over time. Time series analysis comprises techniques for analyzing time series data in an attempt to extract useful statistics and identify characteristics of the data. Time series forecasting is the use of a mathematical model to predict future values based on previously observed values in the time series data.

The graph shown below represents the daily closing price of Aluminium futures over a period of 93 trading days, which is a time series.

Mean Reversion

Mean reversion is the theory which suggests that prices, returns, or various economic indicators tend to move to the historical average or mean over time. This theory has led to many trading strategies which involve the purchase or sale of a financial instrument whose recent performance has differed greatly from its historical average without any apparent reason. For example, suppose the price of gold increases on average by INR 10 every day, and one day it increases by INR 40 without any significant news or factor behind the rise. By the mean reversion principle, we can expect the price of gold to fall in the coming days such that the average change in the price of gold remains the same. In such a case, the mean reversionist would sell gold, speculating that the price will fall in the coming days. He would then make a profit by buying back the same amount of gold he had sold earlier, now at a lower price.

A mean-reverting time series has been plotted below. The horizontal black line represents the mean and the blue curve is the time series, which tends to revert to the mean.

A collection of random variables is defined to be a stochastic or random process. A stochastic process is said to be stationary if its mean and variance are time invariant (constant over time). A stationary time series will be mean reverting in nature, i.e. it will tend to return to its mean and fluctuations around the mean will have roughly equal amplitudes. A stationary time series will not drift too far away from its mean because of its finite constant variance. A non-stationary time series, on the contrary, will have a time varying variance or a time varying mean or both, and will not tend to revert to its mean.

In the financial industry, traders take advantage of stationary time series by placing orders when the price of a security deviates considerably from its historical mean, speculating that the price will revert to its mean. They start by testing for stationarity in a time series. Financial data points, such as prices, are often non-stationary, i.e. they have means and variances that change over time. Non-stationary data is hard to model or forecast reliably, because statistics estimated on one period need not hold in the next.

A non-stationary time series can be converted into a stationary time series by either differencing or detrending the data. A random walk (the movements of an object or changes in a variable that follow no discernible pattern or trend) can be transformed into a stationary series by differencing (computing the difference between Yt and Yt-1). The disadvantage of this process is that it results in losing one observation each time the difference is computed. A non-stationary time series with a deterministic trend can be converted into a stationary time series by detrending (removing the trend). Detrending does not result in loss of observations.

A linear combination of two non-stationary time series can also result in a stationary, mean-reverting time series. Time series (integrated of at least order 1) that can be linearly combined to give a stationary time series are said to be cointegrated.

Shown below is a plot of a non-stationary time series with a deterministic trend (Yt = α + βt + εt) represented by the blue curve and its detrended stationary time series (Yt – βt = α + εt) represented by the red curve.
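The two transformations can be illustrated with a small Python sketch on simulated data. The series below are synthetic (a random walk and a trending series with α = 5 and β = 0.3, values chosen only for this example), and the helper names difference and detrend are hypothetical. The trend slope is estimated by ordinary least squares on the time index.

```python
import random

def difference(series):
    """First difference Y_t - Y_{t-1}; one observation is lost."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

def detrend(series):
    """Subtract a fitted linear time trend (beta * t); no observations are lost."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    beta = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
            / sum((t - t_mean) ** 2 for t in range(n)))
    return [y - beta * t for t, y in enumerate(series)]

rng = random.Random(42)

# Random walk: each value is the previous value plus noise
walk = [0.0]
for _ in range(199):
    walk.append(walk[-1] + rng.gauss(0, 1))

# Deterministic trend plus noise: Y_t = alpha + beta * t + eps_t
trended = [5.0 + 0.3 * t + rng.gauss(0, 1) for t in range(200)]

print(len(walk), len(difference(walk)))      # differencing drops one point
print(len(trended), len(detrend(trended)))   # detrending keeps every point
```

After detrending, the series fluctuates around α, in line with Yt – βt = α + εt above.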

Trading Strategies based on Mean Reversion

One of the simplest mean reversion related trading strategies is to find the average price over a specified period and then determine a high-low range around the average value from where the price tends to revert to the mean. Trading signals are generated when these ranges are crossed: a sell order is placed when the range is crossed on the upper side and a buy order when it is crossed on the lower side. The trader takes contrarian positions, i.e. goes against the movement of prices (or trend), expecting the price to revert to the mean.

This strategy looks too good to be true, and it is: it faces severe obstacles. First, the lookback period of the moving average might contain a few abnormal prices which are not characteristic of the dataset. These cause the moving average to misrepresent the security’s trend or the reversal of a trend. Secondly, it might be evident from the trader’s statistical analysis that the security is overpriced, yet he cannot be sure that other traders have made the same analysis. If other traders do not see the security as overpriced, they will continue buying it, which pushes the price even higher. The strategy results in losses if such a situation arises.
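A minimal sketch of the band-based strategy described above is given below. The lookback length, the band width (1.5 standard deviations) and the prices are assumptions chosen for illustration, and the function name band_signals is hypothetical. The band is computed from past prices only, so the signal for a given day does not use that day's own price in the average.

```python
def band_signals(prices, lookback=5, band=1.5):
    """Return 'buy', 'sell' or 'hold' for each price.

    The range is mean +/- band * standard deviation of the previous
    `lookback` prices (assumed parameters, for illustration only).
    """
    signals = []
    for i, price in enumerate(prices):
        if i < lookback:
            signals.append("hold")  # not enough history yet
            continue
        window = prices[i - lookback:i]  # past prices only
        mean = sum(window) / lookback
        std = (sum((p - mean) ** 2 for p in window) / lookback) ** 0.5
        if price > mean + band * std:
            signals.append("sell")  # range crossed on the upper side
        elif price < mean - band * std:
            signals.append("buy")   # range crossed on the lower side
        else:
            signals.append("hold")
    return signals

# Made-up prices, not market data
prices = [100, 101, 99, 100, 100, 106, 100, 94, 100]
print(band_signals(prices))
```

The spike to 106 produces a sell signal and the drop to 94 a buy signal. Note how the spike itself then enters the lookback window and widens the range, which is one form of the abnormal-price problem mentioned above.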

Pairs Trading is another strategy that relies on the principle of mean reversion. Two co-integrated securities are identified; the spread between the prices of these securities would be stationary and hence mean reverting in nature. An extended version of Pairs Trading is called Statistical Arbitrage, where many co-integrated pairs are identified and split into buy and sell baskets based on the spreads of each pair.

The first step in a Pairs Trading or Stat Arb model is to identify a pair of co-integrated securities. One of the commonly used tests for checking co-integration between a pair of securities is the Augmented Dickey-Fuller Test (ADF Test). It tests the null hypothesis of a unit root being present in a time series sample. A time series which has a unit root, i.e. 1 is a root of the series’ characteristic equation, is non-stationary. The augmented Dickey-Fuller statistic, also known as the t-statistic, is a negative number. The more negative it is, the stronger the rejection of the null hypothesis that there is a unit root at some level of confidence, which would imply that the time series is stationary. The t-statistic is compared with a critical value parameter. If the t-statistic is less than the critical value parameter, the test is positive and the null hypothesis is rejected.

Co-integration check – ADF Test

Consider the Python code shown below for checking co-integration:

We start by importing the relevant libraries and then fetch financial data for two securities using the quandl.get() function. The Quandl library provides financial and economic data directly in Python. In this example, we fetch data for Aluminium and Lead futures from MCX and print the first five rows using the head() function, in order to view the data being pulled by the code.

Using the statsmodels.api library, we compute the Ordinary Least Squares regression on the closing prices of the commodity pair and store the result of the regression in the variable named ‘result’. Next, using the statsmodels.tsa.stattools library, we run the adfuller test on the residuals of the regression and store the output in the array “c_t”. This array contains values like the t-statistic, p-value, and critical value parameters: “c_t[0]” carries the t-statistic, “c_t[1]” contains the p-value and “c_t[4]” stores a dictionary containing critical value parameters for different confidence levels.

Here, we consider a significance level of 0.1 (90% confidence level). For co-integration we check two conditions: first, whether the t-statistic is less than or equal to the critical value parameter (c_t[0] <= c_t[4][‘10%’]), and second, whether the p-value is less than or equal to the significance level (c_t[1] <= 0.1). If both conditions are true, we print “Pair of securities is co-integrated”; otherwise we print “Pair of securities is not cointegrated”.
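The decision rule in the last step can be sketched on its own, without fetching any data. The helper is_cointegrated below is a hypothetical function written for this illustration, and the sample tuple contains made-up numbers laid out in the same order as the result described above (t-statistic, p-value, lags used, number of observations, critical values, information criterion). It is not output from real Aluminium or Lead data.

```python
def is_cointegrated(c_t, significance=0.1, level="10%"):
    """Apply the two conditions described above to an adfuller-style result.

    c_t[0] is the t-statistic, c_t[1] the p-value and c_t[4] the
    dictionary of critical value parameters.
    """
    t_stat, p_value, critical_values = c_t[0], c_t[1], c_t[4]
    return t_stat <= critical_values[level] and p_value <= significance

# Made-up result tuple for illustration only
sample_c_t = (-3.1, 0.026, 1, 91, {"1%": -3.50, "5%": -2.89, "10%": -2.58}, 250.0)

if is_cointegrated(sample_c_t):
    print("Pair of securities is co-integrated")
else:
    print("Pair of securities is not cointegrated")
```

In practice, c_t would be the value returned by the adfuller function when it is run on the regression residuals, as described in the text.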


To know more about Pairs Trading, Statistical Arbitrage and the ADF test you can check out the self-paced online certification course on “Statistical Arbitrage Trading“ offered jointly by QuantInsti and MCX to learn how to trade Statistical Arbitrage strategies using Python and Excel.

Other Links

Statistics behind pairs trading –

ADF test using excel –

Forecasting Stock Returns using ARIMA model

By Milind Paradkar

“Prediction is very difficult, especially about the future”. Many of you must have come across this famous quote by Niels Bohr, a Danish physicist. Prediction is the theme of this blog post. In this post, we will cover the popular ARIMA forecasting model to predict returns on a stock and demonstrate a step-by-step process of ARIMA modeling using R programming.

What is a forecasting model in Time Series?

Forecasting involves predicting values for a variable using its historical data points, or it can also involve predicting the change in one variable given the change in the value of another variable. Forecasting approaches are primarily categorized into qualitative forecasting and quantitative forecasting. Time series forecasting falls under the category of quantitative forecasting, wherein statistical principles and concepts are applied to the historical data of a variable to forecast the future values of the same variable. Some time series forecasting techniques used include:

  • Autoregressive Models (AR)
  • Moving Average Models (MA)
  • Seasonal Regression Models
  • Distributed Lags Models

What is Autoregressive Integrated Moving Average (ARIMA)?

ARIMA stands for Autoregressive Integrated Moving Average. ARIMA is also known as the Box-Jenkins approach. Box and Jenkins claimed that non-stationary data can be made stationary by differencing the series, Yt. The general model for Yt is written as,

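$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$$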

Where, Yt is the differenced time series value, ϕ and θ are unknown parameters and ϵ are independent identically distributed error terms with zero mean. Here, Yt is expressed in terms of its past values and the current and past values of error terms.

The ARIMA model combines three basic methods:

  • AutoRegression (AR) – In auto-regression the values of a given time series data are regressed on their own lagged values, which is indicated by the “p” value in the model.
  • Differencing (I-for Integrated) – This involves differencing the time series data to remove the trend and convert a non-stationary time series to a stationary one. This is indicated by the “d” value in the model. If d = 1, it looks at the difference between two time series entries, if d = 2 it looks at the differences of the differences obtained at d =1, and so forth.
  • Moving Average (MA) – The moving average nature of the model is represented by the “q” value which is the number of lagged values of the error term.

This model is called Autoregressive Integrated Moving Average or ARIMA(p,d,q) of Yt.  We will follow the steps enumerated below to build our model.

Step 1: Testing and Ensuring Stationarity

To model a time series with the Box-Jenkins approach, the series has to be stationary. A stationary time series means a time series without trend, one having a constant mean and variance over time, which makes it easy for predicting values.

Testing for stationarity – We test for stationarity using the Augmented Dickey-Fuller unit root test. The null hypothesis of the test is that the series has a unit root. The p-value resulting from the ADF test has to be less than 0.05 or 5% for us to reject this null hypothesis and treat the time series as stationary. If the p-value is greater than 0.05 or 5%, the null hypothesis cannot be rejected, and you conclude that the time series has a unit root, which means that it is a non-stationary process.

Differencing – To convert a non-stationary process to a stationary process, we apply the differencing method. Differencing a time series means finding the differences between consecutive values of a time series data. The differenced values form a new time series dataset which can be tested to uncover new correlations or other interesting statistical properties.

We can apply the differencing method consecutively more than once, giving rise to the “first differences”, “second order differences”, etc.

We apply the appropriate differencing order (d) to make a time series stationary before we can proceed to the next step.

Step 2: Identification of p and q

In this step, we identify the appropriate order of Autoregressive (AR) and Moving average (MA) processes by using the Autocorrelation function (ACF) and Partial Autocorrelation function (PACF).  Please refer to our blog, “Starting out with Time Series” for an explanation of ACF and PACF functions.

Identifying the p order of AR model

For AR models, the ACF will dampen exponentially and the PACF will be used to identify the order (p) of the AR model. If we have one significant spike at lag 1 on the PACF, then we have an AR model of the order 1, i.e. AR(1). If we have significant spikes at lag 1, 2, and 3 on the PACF, then we have an AR model of the order 3, i.e. AR(3).

Identifying the q order of MA model

For MA models, the PACF will dampen exponentially and the ACF plot will be used to identify the order of the MA process. If we have one significant spike at lag 1 on the ACF, then we have an MA model of the order 1, i.e. MA(1). If we have significant spikes at lag 1, 2, and 3 on the ACF, then we have an MA model of the order 3, i.e. MA(3).
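To see the two patterns numerically, the Python sketch below computes the sample ACF and PACF of a simulated AR(1) series. It is an illustration under stated assumptions: the AR coefficient of 0.6, the series length and the seed are arbitrary, and the PACF is obtained with the Durbin-Levinson recursion rather than with any R function used later in this post.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf(series, max_lag):
    """Partial autocorrelations for lags 1..max_lag (Durbin-Levinson)."""
    rho = acf(series, max_lag)
    phi_prev, result = [], []
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result

# Simulate an AR(1) process: Y_t = 0.6 * Y_{t-1} + noise
rng = random.Random(7)
simulated = [0.0]
for _ in range(2999):
    simulated.append(0.6 * simulated[-1] + rng.gauss(0, 1))

acf_values = acf(simulated, 4)
pacf_values = pacf(simulated, 4)
print([round(v, 2) for v in acf_values])
print([round(v, 2) for v in pacf_values])
```

For an AR(1) series, the ACF should decay gradually across lags while the PACF should show one clear spike at lag 1 and values near zero afterwards, which is the pattern used above to read off p.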

Step 3: Estimation and Forecasting

Once we have determined the parameters (p,d,q), we fit the ARIMA model on a training data set and then use the fitted model to forecast the values of the test data set using a forecasting function. In the end, we cross-check whether our forecasted values are in line with the actual values.

Building ARIMA model using R programming

Now, let us follow the steps explained to build an ARIMA model in R. There are a number of packages available for time series analysis and forecasting. We load the relevant R package for time series analysis and pull the stock data from yahoo finance.


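# Load the relevant libraries (assumed from the functions used below)
library(quantmod); library(tseries); library(forecast); library(xts)

# Pull data from Yahoo finance 
getSymbols('TECHM.NS', from='2012-01-01', to='2015-01-01')

# Select the relevant close price series
stock_prices = TECHM.NS[,4]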

In the next step, we compute the logarithmic returns of the stock as we want the ARIMA model to forecast the log returns and not the stock price. We also plot the log return series using the plot function.

# Compute the log returns for the stock
stock = diff(log(stock_prices),lag=1)
# Drop the leading NA produced by differencing
stock = stock[!is.na(stock)]

# Plot log returns 
plot(stock,type='l', main='log returns plot')
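The log return calculation is small enough to check by hand. The Python sketch below uses three made-up prices (not TECHM.NS data) and a hypothetical helper named log_returns to show that each log return is the log of the ratio of consecutive prices.

```python
import math

def log_returns(prices):
    """Log return for each step: log(P_t / P_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

# Made-up prices that rise by 10% twice
sample_prices = [100.0, 110.0, 121.0]
returns = log_returns(sample_prices)
print(returns)
print(sum(returns), math.log(sample_prices[-1] / sample_prices[0]))
```

Log returns add up across periods: the sum of the two daily log returns equals the log return over the whole span.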

Next, we call the ADF test on the returns series data to check for stationarity. The p-value of 0.01 from the ADF test tells us that the series is stationary. If the series were to be non-stationary, we would have first differenced the returns series to make it stationary.

# Conduct ADF test on log returns series (adf.test from the tseries package)
print(adf.test(stock))

In the next step, we fix a breakpoint which will be used to split the returns dataset into two parts further down the code.

# Split the dataset in two parts - training and testing
breakpoint = floor(nrow(stock)*(2.9/3))

We truncate the original returns series till the breakpoint, and call the ACF and PACF functions on this truncated series.

# Apply the ACF and PACF functions
par(mfrow = c(1,1))
acf.stock = acf(stock[c(1:breakpoint),], main='ACF Plot', lag.max=100)
pacf.stock = pacf(stock[c(1:breakpoint),], main='PACF Plot', lag.max=100)

We can observe these plots and arrive at the Autoregressive (AR) order and Moving Average (MA) order.

We know that for AR models, the ACF will dampen exponentially and the PACF plot will be used to identify the order (p) of the AR model. For MA models, the PACF will dampen exponentially and the ACF plot will be used to identify the order (q) of the MA model. From these plots let us select AR order = 2 and MA order = 2. Thus, our ARIMA parameters will be (2,0,2).

Our objective is to forecast the entire returns series from breakpoint onwards. We will make use of the For Loop statement in R and within this loop we will forecast returns for each data point from the test dataset.

In the code given below, we first initialize a series which will store the actual returns and another series to store the forecasted returns.  In the For Loop, we first form the training dataset and the test dataset based on the dynamic breakpoint.

We call the arima function on the training dataset with the order specified as (2, 0, 2). We use this fitted model to forecast the next data point by using the forecast function (the forecast.Arima method of the forecast package). The level argument is set to 99, which gives a 99% confidence level for the prediction interval around the forecast. It can be changed, but we will only be using the forecasted point estimate from the model. The “h” argument in the forecast function indicates the number of values that we want to forecast, in this case one value, i.e. the next day's return.

We can use the summary function to confirm the results of the ARIMA model are within acceptable limits. In the last part, we append every forecasted return and the actual return to the forecasted returns series and the actual returns series respectively.

# Initializing an xts object for actual log returns
Actual_series = xts(0,as.Date("2014-11-25","%Y-%m-%d"))
# Initializing a dataframe for the forecasted return series
forecasted_series = data.frame(Forecasted = numeric())

for (b in breakpoint:(nrow(stock)-1)) {

  stock_train = stock[1:b, ]
  stock_test = stock[(b+1):nrow(stock), ]

  # Fit the ARIMA model using the determined (p,d,q) parameters
  fit = arima(stock_train, order = c(2, 0, 2),include.mean=FALSE)

  # Plotting an ACF plot of the residuals
  acf(fit$residuals,main="Residuals plot")

  # Forecasting the log returns (the generic forecast() dispatches to the Arima method)
  arima.forecast = forecast(fit, h = 1,level=99)

  # Plotting the forecast
  plot(arima.forecast, main = "ARIMA Forecast")

  # Creating a series of forecasted returns for the forecasted period
  forecasted_series = rbind(forecasted_series,arima.forecast$mean[1])
  colnames(forecasted_series) = c("Forecasted")

  # Creating a series of actual returns for the forecasted period
  Actual_return = stock[(b+1),]
  Actual_series = c(Actual_series,xts(Actual_return))

}
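The structure of this loop, refitting on an expanding window and forecasting one step ahead, can be sketched independently of ARIMA. In the Python example below the forecaster is a deliberately trivial placeholder (it repeats the last observed value) and the returns are made up; the function names walk_forward and last_value_forecaster are hypothetical. Indexing is zero-based, so series[:b] plays the role of stock[1:b] and series[b] the role of stock[b+1].

```python
def walk_forward(series, breakpoint, forecaster):
    """Fit on series[:b] and forecast series[b] for every b from breakpoint on."""
    actual, forecasted = [], []
    for b in range(breakpoint, len(series)):
        train = series[:b]                    # expanding training window
        forecasted.append(forecaster(train))  # one-step-ahead forecast
        actual.append(series[b])              # the next, unseen observation
    return actual, forecasted

def last_value_forecaster(train):
    """Placeholder model: predicts the most recent observation (not ARIMA)."""
    return train[-1]

# Made-up returns, for illustration only
sample_returns = [0.01, -0.02, 0.015, 0.003, -0.007, 0.004]
actual, forecasted = walk_forward(sample_returns, breakpoint=4,
                                  forecaster=last_value_forecaster)
print(actual)
print(forecasted)
```

Each forecast uses only data available before the point being forecast, which is what keeps the comparison with the actual returns free of look-ahead bias.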



Before we move to the last part of the code, let us check the results of the ARIMA model for a sample data point from the test dataset.

From the coefficients obtained, the return equation can be written as:

Yt = 0.6072*Y(t-1)  – 0.8818*Y(t-2) – 0.5447ε(t-1) + 0.8972ε(t-2)

Standard errors are reported for the coefficients, and these need to be within acceptable limits. The Akaike information criterion (AIC) score is a useful indicator for comparing ARIMA models fitted to the same data: the lower the AIC score, the better the model. We can also view the ACF plot of the residuals; a good ARIMA model will have its residual autocorrelations below the threshold limit. The forecasted point return is -0.001326978, which is given in the last row of the output.
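To show how a point forecast follows from the return equation, the Python sketch below plugs lagged returns and lagged residuals into the fitted coefficients. The coefficients are those quoted in the equation above, while the lagged values are made up for illustration, so the result is not the forecast reported in the output. The function name arma_one_step is hypothetical.

```python
def arma_one_step(y_lags, eps_lags, ar_coefs, ma_coefs):
    """One-step-ahead point forecast for a zero-mean ARMA model.

    y_lags and eps_lags are ordered most recent first: [Y(t-1), Y(t-2), ...].
    The expected value of the current error term is zero, so it drops out.
    """
    ar_part = sum(phi * y for phi, y in zip(ar_coefs, y_lags))
    ma_part = sum(theta * e for theta, e in zip(ma_coefs, eps_lags))
    return ar_part + ma_part

ar_coefs = [0.6072, -0.8818]   # coefficients on Y(t-1), Y(t-2)
ma_coefs = [-0.5447, 0.8972]   # coefficients on eps(t-1), eps(t-2)

# Made-up lagged returns and residuals, for illustration only
y_lags = [0.010, -0.004]
eps_lags = [0.002, -0.001]

point_forecast = arma_one_step(y_lags, eps_lags, ar_coefs, ma_coefs)
print(round(point_forecast, 7))
```

With these inputs the autoregressive part contributes 0.0095992 and the moving average part -0.0019866, giving a forecast of about 0.0076126.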

Let us check the accuracy of the ARIMA model by comparing the forecasted returns versus the actual returns. The last part of the code computes this accuracy information.

# Adjust the length of the Actual return series
Actual_series = Actual_series[-1]

# Create a time series object of the forecasted series
forecasted_series = xts(forecasted_series,index(Actual_series))

# Create a plot of the two return series - Actual versus Forecasted
plot(Actual_series,type='l',main='Actual Returns Vs Forecasted Returns')

# Create a table for the accuracy of the forecast
comparison = merge(Actual_series,forecasted_series)
comparison$Accuracy = sign(comparison$Actual_series)==sign(comparison$Forecasted)

# Compute the accuracy percentage metric
Accuracy_percentage = sum(comparison$Accuracy == 1)*100/length(comparison$Accuracy)

If the sign of the forecasted return equals the sign of the actual return, we assign it a positive accuracy score. The accuracy percentage of the model comes to around 55%, which looks like a decent number. One can try running the model for other possible combinations of (p,d,q) or instead use the auto.arima function, which searches for a suitable set of parameters automatically.
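The accuracy metric itself is a simple sign comparison, sketched below in Python with made-up actual and forecasted returns (not the series produced by the model above). The helper names sign and directional_accuracy are hypothetical.

```python
def sign(x):
    """Return 1 for positive, -1 for negative and 0 for zero."""
    return (x > 0) - (x < 0)

def directional_accuracy(actual, forecasted):
    """Percentage of periods where forecast and actual share the same sign."""
    hits = sum(sign(a) == sign(f) for a, f in zip(actual, forecasted))
    return 100.0 * hits / len(actual)

# Made-up return series, for illustration only
actual_returns = [0.012, -0.004, 0.007, -0.010]
forecasted_returns = [0.003, 0.001, 0.002, -0.001]

accuracy = directional_accuracy(actual_returns, forecasted_returns)
print(accuracy)
```

Three of the four signs match here, so the accuracy is 75%. The metric ignores the size of the forecast error and only rewards getting the direction right.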


To conclude, in this post we covered the ARIMA model and applied it to forecasting stock price returns using the R programming language. We also cross-checked our forecasted results against the actual returns. In our upcoming posts, we will cover other time series forecasting techniques and try them in the Python/R programming languages.
