Sentiment Analysis on News Articles using Python

Learn how to perform sentiment analysis on news articles using the Python programming language

by Milind Paradkar

In our previous post on sentiment analysis, we briefly explained sentiment analysis within the context of trading and provided model code in R. The R model was applied to an earnings call conference transcript of an NSE-listed company, and the output of the model was compared with the quarterly earnings numbers and with the one-month stock price movement after the earnings call date. QuantInsti also conducted a webinar on “Quantitative Trading Using Sentiment Analysis”, where Rajib Ranjan Borah, Director & Co-founder, iRageCapital and QuantInsti, covered important aspects of the topic in detail; it is a must-watch for all enthusiasts wanting to learn and apply quantitative trading strategies using sentiment analysis.

Taking these initiatives on sentiment analysis forward, in this blog post we attempt to build a Python model to perform sentiment analysis on news articles published on a financial markets portal. We will build a basic model to extract the polarity (positive or negative) of a news article.

In Rajib’s webinar, one of the slides details the sensitivity of different sectors to company and sectoral news. In that slide, the Pharma sector ranks at the top as the most sensitive sector, and in this blog we will apply our sentiment analysis model to news articles pertaining to select Indian Pharma companies. We will determine the polarity, and then check how the market reacted to this news. For our sample model, we have taken the ten Indian Pharma companies that make up the NIFTY Pharma index.

Building the Model

Now, let us dive straight in and build our model. We use the following Python libraries to build the model:

  • Requests
  • Beautiful Soup
  • Pattern

Step 1: Create a list of the news section URL of the component companies

We identify the component companies of the NIFTY Pharma index and create a dictionary in Python which contains the company names as the keys, while the dictionary values comprise the respective company abbreviations used by the financial portal site to form the news section URL. Using this dictionary we create a Python list of the news section URLs for all the component companies.

import csv
import time
import requests
from bs4 import BeautifulSoup
from pattern.en import ngrams

Base_url = "http://www.moneycontrol.com"

# Build a dictionary of companies and their abbreviated names 
companies = {'cadilahealthcare':'CHC','piramalenterprises':'PH05',
             'glenmarkpharma':'GP08','glaxosmithklinepharmaceuticals':'GSK',
             'sunpharmaceuticalindustries':'SPI','lupinlaboratories':'LL',
             'cipla':'C','aurobindopharma':'AP',
             'drreddyslaboratories':'DRL','divislaboratories':'DL03'}
             
# Create a list of the news section urls of the respective companies 
url_list = ['http://www.moneycontrol.com/company-article/{}/news/{}#{}'.format(k,v,v) for k,v in companies.items()]
print(url_list)

Step 2: Extract the relevant news articles web-links from the company’s news section page

Using the Python list of news section URLs, we run a Python for loop which pings the portal with every URL in the list. We use the requests.get function from the Python Requests library (a simple HTTP library). The requests module allows you to send HTTP/1.1 requests; one can add headers, form data, multipart files, and parameters with simple Python dictionaries, and access the response data in the same way.

The text of the response object is then used to create a Beautiful Soup object. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a given parser to provide ways of navigating, searching, and modifying the parse tree.

HTML parsing basically means taking in the HTML code and extracting relevant information like the title of the page, paragraphs in the page, headings, links, bold text etc.

The news section webpage on the financial portal site contains 20 news articles per page. We target only the first page of the news section, and our objective is to extract the links for all the news articles that appear on the first page using the parsed HTML. We inspect the HTML, and use the find_all method in the code to search for a tag that has the CSS class name as “arial11_summ”. This enables us to extract all the 20 web-links.

Fortunes of the R&D-intensive Indian Pharma sector are driven by sales in the US market and by approvals/rejections of new drugs by the US Food and Drug Administration (USFDA). Hence, we will select only those news articles pertaining to the USFDA and the US market. Using keywords like “US”, “USA”, and “USFDA” in an if statement nested within the for loop, we get our final list of all the relevant news articles.

# Create an empty list which will contain the selected news articles 
List_of_links = []   

# Extract the relevant news articles weblinks from the news section of selected companies
for urls in url_list:
   html = requests.get(urls)
   soup = BeautifulSoup(html.text,'html.parser') # Create a BeautifulSoup object 

   # Retrieve a list of all the links and the titles for the respective links
   word1,word2,word3 = "US","USA","USFDA"
 
   sub_links = soup.find_all('a', class_='arial11_summ')
   for links in sub_links:
      sp = BeautifulSoup(str(links),'html.parser')  # first convert into a string
      tag = sp.a
      if word1 in tag['title'] or word2 in tag['title'] or word3 in tag['title']:
          category_links = Base_url + tag["href"]
          List_of_links.append(category_links)
          time.sleep(3)
          
# Print the select list of news articles weblinks    
#for p in List_of_links: print(p)

Step 3: Remove the duplicate news articles based on news title

It may happen that the financial portal publishes important news articles pertaining to the overall pharma sector on every pharma company’s news section webpage. Hence, it becomes necessary to weed out the duplicate news articles that appear in our Python list before we run our sentiment analysis model. We call the set function on our Python list which we generated in Step 2 to give us a list with no duplicate news articles.

# Remove the duplicate news articles based on News Title
unique_links = list(set(List_of_links))
for q in unique_links: print(q)

# Create a dictionary of positive/negative words related to the Pharma Sector
reader = csv.reader(open('dict.csv', 'r'))
pharma_dict = dict((rows[0],rows[1]) for rows in reader)

# Creating an empty list which will be filled later with news article links, and Polarity values (pos/neg)
df = []
print(df)

Step 4: Extract the main text from the selected news articles

In this step we run a Python for loop, and for every news article URL we call requests.get() on the URL and then convert the text of the response object into a Beautiful Soup object. Finally, we extract the main text using the find and get_text methods from the Beautiful Soup module.

# Open the chosen news articles and extract the main text  
for selected_links in unique_links:
   results_url = selected_links 
   #print(results_url)
   
   results = requests.get(results_url)
   results_text = BeautifulSoup(results.text, 'html.parser')
   extract_text = results_text.find(class_='arti_cont')
   final_text = extract_text.get_text()

Step 5: Pre-processing the extracted text

We will use the n-grams function from the Pattern module to pre-process our extracted text. The ngrams() function returns a list of n-grams (i.e., tuples of n successive words) from the given string. Since we are building a simple model, we use a value of one for the n argument in the n-grams function. The Pattern module contains other useful functions for pre-processing like parse, tokenize, tag etc. which can be explored to conduct an in-depth analysis.

   # Pre-processing the extracted text using the ngrams function from the pattern package   
   final_text1 = ngrams(final_text, n=1, punctuation=".,;:!?()[]{}`''\"@#$^&*+-|=~_", continuous=False)
   #print(final_text1)

Step 6: Compute the Sentiment analysis score using a simple dictionary approach

To compute the overall polarity of a news article we use the dictionary method. In this approach, a list of positive/negative words helps determine the polarity of a given text. This dictionary is created using words that are specific to the Pharma sector. The code checks for positive/negative matching words from the dictionary against the processed text of the news article.

   # Checking if any of the words in the news article text match the words in the Pharma dictionary (pos/neg)
   new_dict = {}
   positive_score,negative_score = 0,0
   
   for x in final_text1:
      if x[0] in pharma_dict:
         new_dict[x[0]] = pharma_dict[x[0]] 
                  
   positive_list = [] ; negative_list = [];
   for key, value in new_dict.items():
      if value == 'positive': positive_list.append(key)
      if value == 'negative': negative_list.append(key)
                           
   # Compute the positive score and the negative score for each news article
   positive_score = len(positive_list) ; negative_score = len(negative_list);
   #print(positive_list) ; print(negative_list)
   #print(positive_score) ; print(negative_score)

Step 7: Create a Python list of model output

The final output from the model is populated in a Python list. The list contains the URL, the positive score and the negative score for each of the selected news articles on which we conducted sentiment analysis.

  # Appending the empty list to create the final Sentiment analysis output
   var_nos = [results_url,positive_score, negative_score]
   df.append(var_nos)
   
# Print the final list of the Sentiment Analysis 
for item in df:
      print(item)

Final Output

[Output: list of article URLs with their positive and negative scores]

Step 8: Plot NIFTY vs NIFTY Pharma returns

Shown below is a plot of NIFTY vs NIFTY Pharma for the months of October-November 2016. On the NIFTY Pharma plot we have drawn arrows highlighting some of the press releases on which we ran our sentiment analysis model. The impact of the uncertainty surrounding the US Presidential election results, and of the negative news for the Indian Pharma sector emanating from the US, is clearly visible on NIFTY Pharma as it fell substantially from the highs made in late October 2016. Thus, our attempt to gauge the direction of the Pharma index using the sentiment analysis model in Python is giving us (more or less) accurate results.

[Chart: NIFTY vs NIFTY Pharma, October-November 2016]

 

Next Step:

One can build more robust sentiment models using other approaches and trade profitably. As a next step we would recommend watching QuantInsti’s webinar on “Quantitative Trading Using Sentiment Analysis” by Rajib Ranjan Borah. Watch it by clicking on the video below:

 

Also, catch our other exciting Python trading blogs and if you are interested in knowing more about our EPAT course feel free to contact our QuantInsti team by clicking here.


  • Download.rar
    • Sentiment Analysis of News Article – Python Code
    • dict(1).csv
    • Nifty and Nifty Pharma(1).csv
    • Pharma vs Nifty.py


Predictive Modeling in R for Algorithmic Trading


By Milind Paradkar


What is Predictive Modeling?

Predictive modeling is a process used in predictive analytics to create a statistical model of future behavior. Predictive analytics is the area of data mining concerned with forecasting probabilities and trends [1].

Predictive modeling in trading is a process wherein we predict the probability of an outcome using a set of predictor variables. In this post, we will illustrate predictive modeling in R.

Who should use it?

Predictive models can be built for different assets like stocks, futures, currencies, commodities etc. For example, we can build a model to predict the next day price change for a stock, or a model to predict the foreign currency exchange rates.

How/Why should we use it?

The power of predictive modeling can be harnessed for making the right investment decisions, and in building profitable portfolios.

Objective of this article

In this article, we will be detailing the step-by-step process for predictive modeling in R used for trading using different technical indicators. This model attempts to predict the next day price change (Up/Down) using these indicators and machine learning algorithms.


Step 1: Feature construction

Our process commences with the construction of a dataset that contains the features which will be used to make the predictions, and the output variable.

First, we build our dataset using raw data comprising a 5-year price series for a stock and an index. The stock and index data consist of Date, Open, High, Low, Last and Volume. Using this data we compute our features based on various technical indicators (listed below).

  1. Force Index (FI)
  2. Williams %R
  3. Relative Strength Index (RSI)
  4. Rate of Change (ROC)
  5. Momentum (MOM)
  6. Average True Range (ATR)

Note: Before you begin, make sure that you have the following packages installed and loaded in your RStudio session: quantmod, pROC, TTR, caret, corrplot, FSelector, rJava, klaR, randomForest, kernlab, rpart

R Code:

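The original post showed this step as a code screenshot. A minimal sketch of the feature construction using the TTR package (the file name, column names and single look-back periods are illustrative assumptions; the original dataset used several periods per indicator, seventeen features in all):

library(TTR)

# Raw data: Date, Open, High, Low, Last, Volume (file name is illustrative)
stock <- read.csv("BAJAJ-AUTO 5 Yr data.csv")
hlc   <- stock[, c("High", "Low", "Last")]
close <- stock$Last

# Indicator features (periods are illustrative)
fi  <- c(NA, diff(close)) * stock$Volume    # Force Index
wpr <- WPR(hlc, n = 14)                     # Williams %R
rsi <- RSI(close, n = 14)                   # Relative Strength Index
roc <- ROC(close, n = 5)                    # Rate of Change
mom <- momentum(close, n = 5)               # Momentum
atr <- ATR(hlc, n = 14)[, "atr"]            # Average True Range

# Output variable: next-day price change class (Up/Down)
class <- c(ifelse(diff(close) > 0, "Up", "Down"), NA)

dataset <- na.omit(data.frame(fi, wpr, rsi, roc, mom, atr, class))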

The computed technical indicators along with the price change class (Up/Down) are combined to form a single dataset.

Step 2: Understanding the dataset using numbers and visuals

The most significant pre-requisite for predictive modeling is to have a good understanding of the dataset. The understanding helps in:

  • Choosing the right data transforms
  • Choosing the right machine learning algorithms
  • Explaining the results obtained from the model
  • Improving the model's accuracy

To attain this understanding of the dataset, you can use descriptive statistics like standard deviation, mean and skewness, along with graphical summaries of the data.

R Code:

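A short sketch of this step, assuming the dataset data frame from Step 1 (the skewness function comes from the e1071 package):

library(e1071)     # for the skewness() function
library(corrplot)

features <- dataset[, -ncol(dataset)]        # drop the class column

summary(features)                            # mean, quartiles etc.
apply(features, 2, sd)                       # standard deviation
apply(features, 2, skewness)                 # skewness

corrplot(cor(features), method = "circle")   # correlation matrix plot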

[Figure: correlation matrix of the features]

Step 3: Feature selection

Feature selection is the process of selecting the subset of features most relevant for model construction, which aids in creating an accurate predictive model. There is a wide range of feature selection algorithms, and they mainly fall into one of three categories:

  1. Filter methods – select features by assigning a score to them using some statistical measure.
  2. Wrapper methods – evaluate different subsets of features, and determine the best subset.
  3. Embedded methods – figure out which of the features give the best accuracy while the model is being trained.

 

In our model, we will use the filter method, utilising the random.forest.importance function from the FSelector package. random.forest.importance rates the importance of each feature in the classification of the outcome, i.e. the class variable. The function returns a data frame containing the name of each attribute and its importance value, based on the mean decrease in accuracy.

R Code:

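A sketch of the call described above (using the dataset built in Step 1):

library(FSelector)

# Importance of each feature via mean decrease in accuracy
weights <- random.forest.importance(class ~ ., dataset, importance.type = 1)
print(weights)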

Output:

[Output: importance value for each attribute]

Now, in order to choose the best features using the importance values returned by random.forest.importance, we use the cutoff.k function which provides k features with the highest importance values.

In our model, we have selected ten out of the seventeen features that were initially extracted from the price data series. Using these ten features we create a dataset that will be used with the machine learning algorithms, as sketched below.
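
A sketch of the selection step (weights is the importance data frame from the previous code; k = 10 matches the ten features mentioned above):

# Keep the 10 attributes with the highest importance values
selected   <- cutoff.k(weights, k = 10)
dataset_rf <- dataset[, c(selected, "class")]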

Step 4: Set the Resampling method

In Predictive modeling we need data for two reasons:

  • To train the model
  • To test the model and determine the accuracy of its predictions

When we have limited data, we can use resampling methods, which split the data into training and testing parts. There are different resampling methods available in R, such as data splitting, the bootstrap method, k-fold cross validation, repeated k-fold cross validation etc.

In our example, we use the k-fold cross validation method, which splits the dataset into k subsets. Each subset is held out in turn while the model trains on all the remaining subsets. This process repeats until an accuracy has been determined for each instance in the dataset, and finally an overall accuracy estimate is computed.

Code:
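
A sketch of the resampling setup with caret (the value of k is illustrative):

library(caret)

# k-fold cross validation with k = 10
trControl <- trainControl(method = "cv", number = 10)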

Step 5: Training different algorithms

There are hundreds of machine learning algorithms available in R, and determining which model to use can be confusing for beginners. Modelers are expected to try different algorithms based on the problem at hand, and with more experience and practice you will be able to determine the right set.

Few kinds of problems:

  • Classification
  • Regression
  • Clustering
  • Rule Extraction

In our example we are dealing with a classification problem, hence we will explore some algorithms that are geared to solve such problems.  We will be using the following:

  • k-Nearest Neighbors (KNN)
  • Classification and Regression Trees (CART)
  • Naive Bayes (NB)
  • Support Vector Machine with a Radial Basis Function kernel (SVM)

We won’t be detailing the working of these algorithms as the objective of the post is to detail the steps in the modeling process, and not the underlying working of these algorithms.

We use the train function from the caret package which fits different predictive models using a grid of tuning parameters. For example:

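The call was originally shown as a screenshot; a sketch of what it described:

# Fit a KNN classifier; preProc standardizes the features and
# trControl supplies the resampling settings defined in Step 4
model_knn <- train(class ~ ., data = dataset_rf, method = "knn",
                   preProc = c("center", "scale"), trControl = trControl)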

In the above example, we are using the KNN algorithm which is specified via the method argument. class is the output variable, dataset_rf is the dataset that is used to train and test the model. The preProc argument defines the data transform method, while the trControl argument defines the computational nuances of the train function.


Step 6: Evaluating the Models

The trained models are evaluated for their accuracy in predicting the outcome using different metrics like Accuracy, Kappa, Root Mean Squared Error (RMSE), R² etc.

We are using the “Accuracy” metric to evaluate our trained models. Accuracy is the percentage of correctly classified instances out of all instances in the test dataset. We have used the resamples function from the caret package which takes the trained objects, and produces a summary using the summary function.
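
The remaining fits and the comparison, sketched with the same settings ("rpart", "nb" and "svmRadial" are caret's method names for CART, Naive Bayes and the radial-kernel SVM):

# Fit the remaining models with the same preprocessing and resampling
model_cart <- train(class ~ ., data = dataset_rf, method = "rpart",
                    preProc = c("center", "scale"), trControl = trControl)
model_nb   <- train(class ~ ., data = dataset_rf, method = "nb",
                    preProc = c("center", "scale"), trControl = trControl)
model_svm  <- train(class ~ ., data = dataset_rf, method = "svmRadial",
                    preProc = c("center", "scale"), trControl = trControl)

# Compare the cross-validated accuracy of the four models
results <- resamples(list(KNN = model_knn, CART = model_cart,
                          NB = model_nb, SVM = model_svm))
summary(results)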

[Output: resamples summary showing Accuracy and Kappa for the four models]


Step 7: Tuning the shortlisted model

As we can observe from the accuracy metric, all the models have an accuracy between 50% and 54%. Ideally, we should try to tune the models with the highest accuracies. However, for the example's sake, we will select the KNN algorithm and try to improve its accuracy by tuning the parameters.

Tuning the parameters

We will tune the KNN algorithm (parameter k) over values ranging from 1 to 10. The tuneGrid argument in the train function helps determine the accuracy over this range when evaluating the model.

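A sketch of the tuning call:

# Evaluate k over 1 to 10 via the tuneGrid argument
grid <- expand.grid(k = 1:10)
model_knn_tuned <- train(class ~ ., data = dataset_rf, method = "knn",
                         preProc = c("center", "scale"),
                         trControl = trControl, tuneGrid = grid)
print(model_knn_tuned)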

Output:

[Output: cross-validated accuracy for k = 1 to 10]

As can be seen from the tuning process, the accuracy of the KNN algorithm has not increased; it is similar to the accuracy obtained earlier. You can try tuning the other algorithms as well, based on their respective tuning parameters, and select the algorithm with the best accuracy.

Conclusion:

In the initial stages, modelers can try working with different technical indicators, create models based on other asset classes, and try out different predictive modeling problems. There are some very good packages which make predictive modeling in R easy, and with experience and practice you will start building robust models to trade profitably in the markets.

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enrol now!

Download Data Files

  • Predictive Modeling program.rar
    • BAJAJ-AUTO 5 Yr data.csv
    • NIFTY 5 Yr data.csv
    • R code – Predictive Modeling program.rar


How to Check Data Quality Using R

 


By Milind Paradkar

Do You Use Clean Data?

Always go for clean data! Why is it that experienced traders/authors stress this point in their trading articles/books so often? As a novice trader, you might be using the freely available data from sources like Google or Yahoo finance. Do such sources provide accurate, quality data?

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 3:30 pm IST, thus comprising 375 minutes. For 14 trading sessions, we should have 5250 data points for each of these stocks. We wrote a simple code in R to perform the check.
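
A minimal sketch of such a check, assuming one CSV file of 1-minute bars per stock in an intraday_data folder (the folder layout is an assumption):

# Expected number of 1-minute bars: 14 sessions x 375 minutes
expected <- 14 * 375

files  <- list.files("intraday_data", pattern = "\\.csv$", full.names = TRUE)
counts <- sapply(files, function(f) nrow(read.csv(f)))

shortfall <- expected - counts[counts < expected]
length(shortfall)                               # stocks with missing bars
head(sort(shortfall, decreasing = TRUE), 10)    # ten worst offenders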

Here is our finding: out of the 143 stocks scanned, 89 stocks had fewer than 5250 data points; that’s more than 60% of our sample set!! The table shown below lists 10 such stocks from those 89.

[Table: 10 of the 89 stocks with missing 1-minute data points]

Let’s take the case of PAGEIND. Google finance has captured only 4348 1-minute data points for the stock, thus missing 902 points!!

Example – Missing the 1306 minute bar on 20160801:


Example – Missing the 1032 minute bar on 20160802:


If a trader is running an intraday strategy which generates buy/sell signals based on 1-minute bars, the strategy is bound to give some false signals.

As can be seen from the quick check above, data quality from free sources or from cheap data vendors is not always guaranteed. Many of the cheap data vendors source the data from Yahoo finance and provide it to their clients. Poor data feed is a big issue faced by many traders and you will find many traders complaining about the same on various trading forums.

Backtesting a trading strategy using such data will give false results. If you are using the data in live trading and there is a server problem with Google or Yahoo finance, it will lead to a delay in the data feed. As a trader, you don’t want to be in a position where you have an open trade and the data feed stops or is delayed. When trading with real money, one is always advised to use quality data from reliable data vendors. After all, Data is Everything!

Next Step

If you’re a retail trader interested in learning various aspects of Algorithmic trading, check out the Executive Programme in Algorithmic Trading (EPAT). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. The course equips you with the required skillsets to be a successful trader.

Download Data Files

  • Do You Use Clean Data.rar
    • 15 Day Intraday Historical Data.zip
    • F&O Stock List.csv
    • R code – Google_Data_Quality_Check.txt
    • R code – Stock price data.txt


Sentiment Analysis in Trading Using R [WORKING MODEL]


By Milind Paradkar

In this post we discuss sentiment analysis in brief and then present a basic model of sentiment analysis in R. Sentiment analysis is the analysis of the feelings (i.e. attitudes, emotions and opinions) expressed in news reports/blog posts/twitter messages etc., using natural language processing tools.

Natural language processing (NLP) in simple terms refers to the use of computers to process sentences/text in a natural language such as English. The objective here is to extract information from unstructured or semi-structured data found in these tweets/blogs/articles. To enable this, NLP makes use of artificial intelligence, computational linguistics, and computer science.

Using NLP models, hundreds of text documents can be processed to ascertain the sentiment in seconds. These days sentiment analysis is a hot topic and has found wide application in areas like business intelligence, politics, finance, policy making etc.

Sentiment analysis in Trading – Sentiments can often drive the direction of the markets. Hence, traders and other participants in the financial markets seek to gauge the sentiment expressed in news reports/tweets/blog posts. Traders build automatic trading systems which extract the sentiment from natural language. These trading systems take long/short positions in the markets based on the trading signals generated. The trading systems can also be combined with other trading systems. The objective at the end of the day is to generate superior returns from the extracted information.

There are various methods and models for sentiment analysis. Let us take a look at a very basic model of sentiment analysis in R.

Model: Sentiment analysis in R

In this model, we implement the “Bag-of-words” approach to sentiment analysis in R. The process identifies positive and negative words (or strings of words) within an article. For this, it makes use of a large dictionary which contains words that carry sentiment. Each word in this dictionary can be assigned a weight. The sum of the positive and negative word scores is the final sentiment score generated by the model.

We will test our model on the management commentary text taken from the latest earnings call transcript of Eicher Motors Ltd. Eicher Motors is a leading Indian automaker which owns Royal Enfield Motors. The objective of our model will be to gauge the opinion expressed in their fourth quarter 2015 earnings call.

Basic sentiment analysis model

To build our sentiment analysis model we use the “tm” and “RWeka” packages in R. We load the libraries and then read the two documents which contain the positive and negative terms. To prepare these documents, we went through four conference call transcripts prior to the fourth quarter of 2015 and picked the positive/negative words from them to populate our dictionary. In addition to these words, we also added some general positive/negative words relating to the motorcycle industry.


We will be considering only the management’s commentary for sentiment analysis in R. We load the text document (fourth quarter 2015) containing the CEO’s prepared commentary into R using the Corpus function. For this, we have stored the commentary document in the TextMining folder in R’s working directory.
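
The code was originally shown as screenshots; a sketch of the loading step (file names follow the download list at the end of this post):

library(tm)
library(RWeka)

# Read the positive and negative term dictionaries
positives <- read.csv("Positive Terms.csv", stringsAsFactors = FALSE)[, 1]
negatives <- read.csv("Negative terms.csv", stringsAsFactors = FALSE)[, 1]

# Load the commentary document (Q4.txt) from the TextMining folder
docs <- Corpus(DirSource("TextMining"))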

Clean the text

The next step is to clean the text. We convert all words to lowercase, remove punctuation, remove numbers, and strip the whitespace. The writeLines function enables us to see the text after the cleansing.
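
A sketch of the cleaning step using tm_map:

# Clean the text: lowercase, remove punctuation and numbers, strip whitespace
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)

writeLines(as.character(docs[[1]]))   # inspect the cleaned text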

In the code below, we tokenize the cleaned text. Tokenization is the process of breaking a stream of text into words or strings of words. We use the NGramTokenizer function here, which creates n-grams of the text.

N-grams are basically sets of co-occurring words within a given text. For example, consider the sentence “The food is delicious”. If n = 2, then the n-grams would be:

  • the food
  • food is
  • is delicious

Thereafter we create a term-document matrix (called “terms” in the code), which is a matrix that lists all occurrences of words in the corpus.

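A sketch of the tokenizer and the term-document matrix:

# Tokenize into unigrams and bigrams and build the term-document matrix
tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
terms <- TermDocumentMatrix(docs, control = list(tokenize = tokenizer))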

Below we check if the positive/negative words in the dictionary are present in the text document.


Now we extract all the positive/negative words from the text document which matched with the words in our dictionary.


The code lines below compute the positive/negative score, and finally the sentiment score.

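A combined sketch of the matching, extraction and scoring steps (Terms is the tm accessor for the matrix’s terms):

# Words in the document that also appear in the dictionaries
found <- Terms(terms)
pos_matches <- intersect(found, positives)
neg_matches <- intersect(found, negatives)

# Positive score, negative score and the final sentiment score
positive_score  <- length(pos_matches)
negative_score  <- length(neg_matches)
sentiment_score <- positive_score - negative_score
sentiment_score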

Sentiment analysis in R: Final sentiment score


The sentiment analysis model found 14 positive words and 4 negative words, and the final sentiment score was 10. This tells us that the quarterly result for Q4 2015 was good from the management’s perspective. The word cloud below shows some of the positive/negative words that were picked from the text document on which we ran the model.

[Figure: word cloud of the matched positive/negative words]

Sentiment analysis in R: Validating our model – let us check the quarterly performance numbers to confirm the positive sentiment score generated by our model. As can be seen, Eicher Motors posted a strong quarter. EBIT grew around 72% y/y on a strong sales volume of 125,690 motorcycles. The strong results came despite a production shutdown of a few days, caused by the floods experienced at its production facility during the quarter.

The chart below shows the stock market’s reaction to Eicher Motors’ strong results on the day of the earnings announcement. The stock opened at around Rs. 17,100, made a big move touching an intraday high of around Rs. 18,500, and finally closed at Rs. 18,175.

[Table: Eicher Motors consolidated results]

[Chart: Eicher Motors intraday price reaction]

Conclusion

This was a basic introduction to sentiment analysis in R. The model above can be made more robust and fine-tuned further. In future posts, we will try to cover other sentiment analysis approaches and attempt to build a model around them.

QuantInsti has been actively participating in conferences on sentiment analysis and was one of the lead marketing and education partners at the recently held “Sentiment Analysis in Finance” conference in Singapore, 2016. Rajib Ranjan Borah, Co-founder & Director of iRageCapital Advisory Pvt. Ltd. & QuantInsti, was one of the esteemed panelists for the session “New Paradigms for Sentiment Analysis Applied to Finance” at the conference.

To know more about QuantInsti and the Executive Programme in Algorithmic Trading (EPAT) course offered by QuantInsti, check our website and the EPAT course page. Feel free to contact our team at contact@quantinsti.com for queries on EPAT.

Download Data File

  • Sentiment analysis in Trading – Files.rar
    • Eicher Motors Sentiment Analysis – R Code.txt
    • Negative terms.csv
    • Positive Terms.csv
    • Q4.txt


Machine Learning and Its Application in Forex Markets – Part 2 [WORKING MODEL]


In our previous post on Machine learning we derived rules for a forex strategy using the SVM algorithm in R. In this post we take a step further, and demonstrate how to backtest our findings.

To recap the last post, we used Parabolic SAR and MACD histogram as our indicators for machine learning.  Parabolic SAR indicator trails price as the trend extends over time. SAR is below prices when prices are rising and above prices when prices are falling. SAR stops and reverses when the price trend reverses and breaks above or below it.

The MACD oscillator comprises the MACD line, the Signal line and the MACD histogram. The MACD line is the 12-day Exponential Moving Average (EMA) less the 26-day EMA. The MACD Signal line is a 9-day EMA of the MACD line. The MACD histogram represents the difference between the MACD line and the MACD Signal line. The histogram is positive when the MACD line is above its Signal line and negative when the MACD line is below its Signal line.

The EUR/USD price series chart below shows Parabolic SAR plotted in blue, and the MACD line, MACD signal line, and the MACD histogram below the EURUSD price series.

[Chart: EUR/USD price series with Parabolic SAR (blue) and the MACD indicator below]

Our intention is to take positions around the MACD line crossovers and the Parabolic SAR reversal points. When the Parabolic SAR gives a buy signal and the MACD lines cross upwards, we buy. When the Parabolic SAR gives a sell signal and the MACD lines cross downwards, we sell.

[Plot: SVM predictions]

After selecting the indicators we ran the SVM algorithm on EUR/USD data, which gave us the plot as shown above. Looking at the SVM predictions, we now frame the rules, and backtest them to see the performance of our strategy.

Short rule = (Price – SAR) < 0.0010 & MACD histogram < 0.0010
Long rule = (Price – SAR) > -0.0050 & MACD histogram > -0.0010

We have used Michael Kapler’s Systematic Investor Toolbox to backtest our model in R. We start by loading the toolbox and the necessary libraries.

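A sketch of the loading step; sourcing sit.gz from the author’s site is the commonly documented way of loading the toolbox (the URL may change):

# Source the Systematic Investor Toolbox (SIT)
con <- gzcon(url("http://www.systematicportfolio.com/sit.gz", "rb"))
source(con)
close(con)

library(quantmod)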

Next we create a new environment and load the historical EUR/USD data using the getSymbols function.

[R code screenshot: create a new environment and load the historical EUR/USD data with getSymbols]

We will check the performance of our rule-based model against a simple ‘buy and hold’ model.  To do that, we first create a ‘buy and hold’ model.

[R code screenshot: specify the prices and store the ‘buy and hold’ model]

Our next step is to compute the indicators for our rule-based model.

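A sketch of the indicator computation with the TTR package (eurusd stands for the loaded xts price series; the name is an assumption):

# Parabolic SAR and MACD histogram for the rule-based model
sar       <- SAR(merge(Hi(eurusd), Lo(eurusd)), accel = c(0.02, 0.2))
macd      <- MACD(Cl(eurusd), nFast = 12, nSlow = 26, nSig = 9)
macd_hist <- macd[, "macd"] - macd[, "signal"]   # MACD histogram
trend     <- Cl(eurusd) - sar                    # price minus SAR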

We run two models here, ‘long short’ model, and another ‘long short’ model using stop loss and take profit. First we create a long short model without stop loss and take profit.

[R code screenshot: set the long and short conditions for the rule-based model]

Next we set the takeprofit and stop loss levels, and create a long short model using these levels. We call this model as ‘stop.loss.take.profit’ model.

[R code screenshot: set the take profit and stop loss levels and store the ‘stop.loss.take.profit’ model]

Let us now run all the three models, and check their relative performance.

[R code screenshot: run the three models and view the equity curves]

[Output: cumulative performance chart and performance table]

As can be seen, the rule-based strategy has a smooth equity curve and gives a better CAGR of 5.97 than the simple ‘buy and hold’ model’s CAGR of 1.18. The maximum drawdown of our strategy is 13.92, compared to the ‘buy and hold’ strategy’s drawdown of 30.11. You can play with the indicator settings, or change the short/long rules or the stop loss/take profit levels, to refine the model further.

Once you understand machine learning algorithms, they can be a great tool for formulating profit-making strategies. To learn more on machine learning you can watch our latest webinar, “Machine Learning in Trading”, which was hosted by QuantInsti and conducted by our guest speaker Tad Slaff, CEO/Co-founder of Inovance.

Machine learning is covered in the Executive Programme in Algorithmic Trading (EPAT) course conducted by QuantInsti. To know more about EPAT check the EPAT course page or feel free to contact our team at contact@quantinsti.com for queries on EPAT.

Download R Code


Machine Learning and Its Application in Forex Markets [WORKING MODEL]


By Milind Paradkar

In the last post we covered the Machine learning (ML) concept in brief. In this post we explain some more ML terms, and then frame rules for a forex strategy using the SVM algorithm in R.

To use ML in trading, we start with historical data (stock price/forex data) and add indicators to build a model in R/Python/Java. We then select the right Machine learning algorithm to make the predictions.

First, let’s look at some of the terms related to ML.

Machine Learning algorithms – There are many ML algorithms (list of algorithms) designed to learn and make predictions on the data. ML algorithms can be either used to predict a category (tackle classification problem) or to predict the direction and magnitude (tackle regression problem).

Examples:

  • Predict the price of a stock 3 months from now, on the basis of the company’s past quarterly results.
  • Predict whether Fed will hike its benchmark interest rate.

Indicators/Features – Indicators can include Technical indicators (EMA, BBANDS, MACD, etc.), Fundamental indicators, or/and Macroeconomic indicators.

Example 1 – RSI(14), Price – SMA(50), and CCI(30). We can use these three indicators to build our model, and then use an appropriate ML algorithm to predict future values.

Example 2 – RSI(14), RSI(5), RSI(10), Price – SMA(50), Price – SMA(10), CCI(30), CCI(15), CCI(5)

In this example we have selected 8 indicators. Some of these indicators may be irrelevant for our model. In order to select the right subset of indicators we make use of feature selection techniques.

Feature selection – It is the process of selecting a subset of relevant features for use in the model. Feature selection techniques fall into 3 broad categories: filter methods, wrapper-based methods and embedded methods. To select the right subset we basically make use of an ML algorithm in some combination. The selected features are known as predictors in machine learning.

 

[Figure: feature selection]

Support Vector Machine (SVM) – SVM is a well-known algorithm for supervised machine learning, and is used to solve both classification and regression problems.

An SVM algorithm works on the given labeled data points, and separates them via a boundary or a hyperplane. SVM tries to maximize the margin around the separating hyperplane. Support vectors are the data points that lie closest to the decision surface.

[Figure: separating hyperplane and support vectors]

Framing rules for a forex strategy using SVM in R – Given our understanding of features and SVM, let us start with the code in R. We have selected the EUR/USD currency pair with a 1 hour time frame dating back to 2010. Indicators used here are MACD (12, 26, 9), and Parabolic SAR with default settings of (0.02, 0.2).

First, we load the necessary libraries in R, and then read the EUR/USD data. We then compute MACD and Parabolic SAR using their respective functions available in the “TTR” package. To compute the trend, we subtract the closing EUR/USD price from the SAR value for each data point. We lag the indicator values to avoid look-ahead bias. We also create an Up/down class based on the price change.

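The code was shown as a screenshot; a sketch of the step (the file name and column names are illustrative assumptions):

library(e1071)
library(TTR)

# Hourly EUR/USD data
fx <- read.csv("EURUSD_1H.csv")

# MACD(12, 26, 9) histogram and Parabolic SAR (0.02, 0.2)
macd      <- MACD(fx$Close, nFast = 12, nSlow = 26, nSig = 9)
macd_hist <- macd[, "macd"] - macd[, "signal"]
sar       <- as.numeric(SAR(fx[, c("High", "Low")], accel = c(0.02, 0.2)))
trend     <- fx$Close - sar                  # close minus SAR

# Lag the indicator values to avoid look-ahead bias
macd_hist <- c(NA, head(macd_hist, -1))
trend     <- c(NA, head(trend, -1))

# Up/Down class based on the price change
class <- ifelse(c(NA, diff(fx$Close)) > 0, "UP", "DOWN")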

Thereafter we merge the indicators and the class into one data frame called model_data. The model data is then divided into training and test data.

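A sketch of the merge and split (the 70/30 split is an assumption):

model_data <- na.omit(data.frame(trend, macd_hist, class))

n     <- nrow(model_data)
train <- model_data[1:round(0.7 * n), ]           # training data
test  <- model_data[(round(0.7 * n) + 1):n, ]     # test data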

We then use the SVM function from the “e1071” package and train the data. We make predictions using the predict function and also plot the pattern. We are getting an accuracy of 53% here.

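A sketch of the fit, prediction and plot:

# Train the SVM and inspect in-sample accuracy
fit  <- svm(factor(class) ~ trend + macd_hist, data = train)
pred <- predict(fit, train)
mean(pred == train$class)                    # ~53% in the original post

# Visualize the predictions over the two features
plot(train$trend, train$macd_hist,
     col = ifelse(pred == "UP", "blue", "red"),
     xlab = "Price - SAR", ylab = "MACD histogram")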

From the plot we see two distinct areas, an upper larger area in red where the algorithm made short predictions, and the lower smaller area in blue where it went long.

[Plot: SVM predictions]

SAR indicator trails price as the trend extends over time. SAR is below prices when prices are rising and above prices when prices are falling. SAR stops and reverses when the price trend reverses and breaks above or below it. We are interested in the crossover of Price and SAR, and hence are taking trend measure as the difference between price and SAR in the code. Similarly, we are using the MACD Histogram values, which is the difference between the MACD Line and Signal Line values.

Looking at the plot we frame our two rules and test these over the test data.
Short rule = (Price–SAR) > -0.0025 & (Price – SAR) < 0.0100 & MACD > -0.0010 & MACD < 0.0010
Long rule = (Price–SAR) > -0.0150 & (Price – SAR) < -0.0050 & MACD > -0.0005

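A sketch of testing the rules on the test data:

# Apply the framed rules to the test data
short_sig <- test$trend > -0.0025 & test$trend < 0.0100 &
             test$macd_hist > -0.0010 & test$macd_hist < 0.0010
long_sig  <- test$trend > -0.0150 & test$trend < -0.0050 &
             test$macd_hist > -0.0005

# A short trade is correct when the next move is down, a long when up
mean(test$class[short_sig] == "DOWN")
mean(test$class[long_sig] == "UP")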

We are getting 54% accuracy for our short trades and 50% accuracy for our long trades. The SVM algorithm seems to be doing a good job here. We stop at this point; in our next post on machine learning we will see how rules like the ones devised above can be coded and backtested to check the viability of a trading strategy.

To learn more on Machine Learning you can watch the latest webinar, “Machine Learning in Trading”, which was hosted by QuantInsti, and conducted by our guest speaker Tad Slaff, CEO/Co-founder Inovance.

Machine learning is covered in the Executive Programme in Algorithmic Trading (EPAT) course conducted by QuantInsti. To know more about EPAT check the EPAT course page or feel free to contact our team at contact@quantinsti.com for queries on EPAT.

Downloadables


Trading with ETF as a Lead Indicator [EXCEL MODEL]

Index tracking trading model

Index tracking trading is a strategy where you observe the price over the previous ‘n’ candlesticks and make your bets accordingly. The intuition is that the MSCI FUTURES follow the ETF. Hence, if the ETF is performing well, we assume the MSCI FUTURES will perform well too, and make buying and selling decisions accordingly.

Who can use it?

People interested in algorithmic trading and those who want to learn about ETF as a lead indicator.

How does it help?

This excel model will help you to:

  • Build a strategy with ETF as a lead indicator
  • Understand the trading logic of strategy implementation
  • Optimize trading parameters

Why should you download the trading model?

As the trading logic is coded in the cells of the sheet, you can deepen your understanding by downloading and analyzing the files at your own convenience. Not just that, you can play around with the numbers to obtain better results. You might find suitable parameters that provide higher profits than specified in the article.

Explanation of the model

In this example we consider the MSCI FUTURES data. We track an ETF and assume that the MSCI has a strong positive beta with the ETF. We observe the 5-minute interval prices of the ETF and MSCI, and buy/sell MSCI based on the ETF returns. If the ETF returns are positive, we buy one lot of MSCI FUTURES. If the ETF returns are negative, we sell one lot of MSCI FUTURES. The ETF we track is Indian SP Equity. In essence, we go long (buy) MSCI FUTURES if the ETF is bullish and go short (sell) MSCI FUTURES if the ETF is bearish.
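
For readers who prefer code to cells, a rough R translation of this decision rule (the etf price vector is an assumption; the workbook itself compares average returns, as described in the column notes below):

# etf: vector of 5-minute ETF closing prices
etf_ret <- diff(log(etf))

# Buy MSCI futures after a positive ETF return, sell after a negative one
signal <- ifelse(etf_ret > 0, "Buy", ifelse(etf_ret < 0, "Sell", NA))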

The data used for the MSCI FUTURES contract is 5-minute interval data from 2nd Feb 2015 to 4th March 2015.

 

Assumptions

  1. For simplification purpose, we ignore bid ask spreads.
  2. Prices are available at 5 minutes interval and we trade at the 5 minute closing price only.
  3. Since this is discrete data, squaring off of the position happens at the end of the candle i.e. at the price available at the end of 5 minutes.
  4. Only the regular session (T) is traded
  5. Transaction cost is $1.10 for MSCI FUTURE.
  6. Margin for each trade is $1500.
  7. Trading quantity is 1 lot (MSCI order size 50) and trading hours are 11:30 a.m. to 5:55 p.m. SGT.

Input parameters

Please note that all the values for the input parameters mentioned below are configurable.

  1. Price at the end of 5 minute interval is considered.
  2. We use ETF as a lead indicator

The market data and trading model are included in the spread sheet from the 12th row onwards. So when the reference is made to column D, it should be obvious that the reference commences from D12 onwards.

Explanation of the columns in the Excel Model

Column C represents the price for ETF.

Column D represents the price for MSCI FUTURES.

Column E represents log returns of ETF data.

Column F represents log returns of MSCI data.

Column G represents average returns of ETF data.

Column H represents average returns of MSCI data.

Column I calculates the trading signal. The formula =IF(G13="", "", IF(G13>H13, "Buy", IF(G13<H13, "Sell", I12))) means: if the entry in cell G13 is blank, keep I13 blank; otherwise, if G13 (the average ETF return) is greater than H13 (the average MSCI return), a buy signal for the MSCI FUTURES contract is generated; else, if G13 is lower than H13, a sell signal for the MSCI FUTURES contract is generated; otherwise the previous signal in I12 is carried forward.

Column J represents the trade price. This is the price at which the trading signal is generated. The formula =IF(I13="", "", IF(I13=I12, J12, D13)) means: if the entry in cell I13 is blank, the entry is blank; otherwise, if I13=I12, the trade price is given by the entry in J12; if I13 is neither blank nor equal to I12, the trade price is given by D13.

Column K represents the Mark to Market. The formula =IF(OR(I13="", I12=""), 0, IF(I13=I12, K12, IF(I13="Buy", K12+J12-J13, IF(I13="Sell", K12+J13-J12, 0)))) means: if cell I12 or I13 is blank, the MTM is zero; otherwise, if I13=I12, the MTM is K12; else, if I13 is not equal to I12 and I13="Buy", the MTM is given by K12+J12-J13; if I13 is not equal to I12 and I13="Sell", the MTM is given by K12+J13-J12.

Column L represents the profit/loss status of the trade. The formula =IF(OR(I13="", I12=""), "", IF(K13<K12, "Loss", IF(K13>K12, "Profit", IF(I13<>I12, "NPNL", "")))) means: if either I13 or I12 is blank, the entry is also blank; otherwise, if K13 is less than K12, the entry is "Loss"; if K13 is greater than K12, the entry is "Profit". The last part, I13<>I12, means that if I13 is not equal to I12 then the entry is No Profit No Loss (NPNL). This makes sense since we have already calculated the profit from the previously squared-off position.

Column M calculates the trade profit/loss. The formula =IF(L22="", 0, K22-K21) means: if L22 is blank (no profit, no loss), the trade profit/loss is zero; otherwise the profit is calculated as the difference in MTM values (K22-K21).

Outputs

The output table has some performance metrics tabulated.

Number of profitable trades is 118 and the number of loss making trades is 77.

Total trades are 277 and the total profit is $1970. Average profit per trade is $7.24. Net profit per trade is $5.04; this is calculated as the average profit per trade minus twice the transaction cost. The number of trading intervals is 23. Monthly returns are calculated as the product of total trades and net profit per trade, divided by the product of the margin for each trade and the number of trading intervals, interpolated to a month.

Now it is your turn!

  • First, download the model
  • Modify the parameters and study the backtesting results
  • Run the model for other historical prices
  • Modify the formula and strategy to add new parameters and indicators! Play with logic! Explore and study!
  • Comment below with your results and suggestions

Click to download the excel (If you have not logged in yet, first login/sign up!)


Momentum Based Strategies for Low and High Frequency Trading – [EXCEL MODEL]


On 3rd December 2015, QuantInsti held a comprehensive webinar session on Momentum Trading Strategies, where Mr. Nitesh Khandelwal, Co-founder, iRage Capital, discussed momentum trading in low and high frequency trading.

This webinar focused on the various aspects of momentum trading strategies for both conventional/low frequency as well as high frequency trading (HFT). Some popular momentum-based strategies were also dug deeper into, to select niche momentum trading strategies. The webinar aimed to evaluate how HFT momentum strategies differ from conventional momentum strategies, both from a logic and a deployment perspective.

The following points were discussed in detail:

  • Popular Momentum Trading Strategies
  • Momentum Trading in HFT
  • Risks in Momentum Trading
  • Sample Model

Now it is your turn!

  • First, download the model
  • Modify the parameters and study the backtesting results
  • Run the model for other historical prices
  • Modify the formula and strategy to add new parameters and indicators! Play with logic! Explore and study!
  • Comment below with your results and suggestions

Download Sample Model of a Momentum Trading Strategy


Candlestick Trading – A Momentum Strategy with Example [EXCEL MODEL]


Candlestick trading is a momentum strategy where you observe the price over the previous ‘n’ candlesticks and make your bets accordingly. The intuition is that if the price has been increasing continuously for, say, 3 candlesticks, then it is highly probable that it will rise further.

Who can use it?

People interested in algorithmic trading and those who want to learn about momentum trading.

How does it help?

This excel model will help you to:

  • Learn how a momentum strategy is implemented
  • Understand the trading logic of strategy implementation
  • Optimize trading parameters
  • Understand the intraday returns of momentum trading

Candlestick Trading Example

In this example we consider the INR FUTURES data on SGX (Singapore Exchange). We implement a momentum strategy on this contract. This momentum strategy is simply based on the premise that a rising market will be followed by a rising market and a falling market will be followed by a falling market. We hope to ride the tide and make some profit before the momentum vanishes. The data used for the INR FUTURES contract is 5-minute interval data from 3rd Feb 2015 to 19th March 2015. The trading strategy is intuitional; unlike mean reversion for pairs trading, there is no hypothesis as such explaining why this strategy would or would not work. We would like to benefit from the market wave and optimize our bet by specifying stop loss and take profit limits. This model is flexible and can be varied to achieve different limits to exit the trade, depending upon the trader’s risk appetite.

[Chart: INR FUTURES price series]

Assumptions

  1. For simplification purpose, we ignore bid ask spreads.
  2. Prices are available at 5 minutes interval and we trade at the 5 minute closing price only.
  3. Since this is discrete data, squaring off of the position happens at the end of the candle i.e. at the price available at the end of 5 minutes.
  4. Only the regular session (T) is traded
  5. Transaction cost is $0.35 for INR FUTURE.
  6. Margin for each trade is $800.
  7. Trading quantity is 1 lot and trading hours are 7:40 a.m. to 2:00 a.m. SGT.

Input parameters

Please note that all the values for the input parameters mentioned below are configurable.

  • High/Low of 3 candles (one candle=every 5 minute price) is considered.
  • A stop loss of 0.08 and profit limit of 0.16 is set.
  • The order size for trading INR FUTURE is INR 2000000.

The market data and trading model are included in the spread sheet from the 12th row onwards. So when the reference is made to column D, it should be obvious that the reference commences from D12 onwards.

Column C represents the price for INR FUTURE.

Column D represents 3 candle high meaning the highest price of the previous 3 candles.

Column E represents 3 candle low meaning the lowest price of the previous 3 candles.

Column F calculates the trading signal. The formula =IF(D13="", "", IF(C13>D13, "Buy", IF(C13<E13, "Sell", ""))) means: if the entry in cell D13 is blank, keep F13 blank; otherwise, if C13 (the INR FUTURES price) is greater than D13 (the 3-candle high), a buy signal for the INR FUTURES contract is generated; else, if C13 is lower than E13 (the 3-candle low), a sell signal for the INR FUTURES contract is generated.

Column G represents the entry price. This is the price at which the trading signal is generated. The formula =IF(H13=H12, G12, IF(OR(H13="Buy", H13="Sell"), C13, "")) means: if the entry in cell H13 is the same as H12, the value in G13 should be the value in G12; otherwise, if H13 is either "Buy" or "Sell", the entry in G13 is the value in C13 (the INR FUTURES price); if H13 is neither "Buy" nor "Sell", leave it blank.

Column H represents the status of the trade. Given our assumptions and input parameters, there are four statuses that can occur: "Buy", "Sell", "TP" (Take Profit) and "SL" (Stop Loss).

The formula =IF(OR(H17=””, H17=”TP”, H17=”SL”), F18, IF(H17=”Buy”, IF(C18<G17+$C$4, “SL”, IF(C18>G17+$C$5, “TP”, H17)), IF(H17=”Sell”, IF(C18>G17-$C$4, “SL”, IF(C18<G17-$C$5, “TP”, H17)), “”))) can be simplified as follows:-

If the entry in H17 is either blank or TP or SL then choose the value in F18 (F column has either Buy or Sell or blank values). Otherwise look into the next If condition.

If the entry in H17 is “Buy”, meaning we have a buy position, and if the price of the contract goes below the stop loss limit then we exit the position at stop loss and if the price of the contract goes above the take profit limit then we exit the position at take profit. Similarly, if the position is “Sell” and the contract price rises above the selling price beyond the stop loss limit then exit the position at stop loss and if the contract price falls below the selling price beyond the take profit limit then exit the position by taking the profit.

Column I represents the profit/loss status of the trade. P/L is calculated only when we have squared off our position. The formula =IF(OR(H13=”SL”, H13=”TP”), IF(H12=”Buy”, C13-G12, IF(H12=”Sell”, G12-C13, 0)), 0) can be summarized as follows:-

The first if condition states that we proceed to the next if condition only if the corresponding status in column H is either "SL" or "TP"; otherwise the entry in the cell is zero.

The next set of if conditions calculates the profit, assuming either the stop loss or the take profit has been hit. If the status in column H is "Buy", then the profit/loss is calculated as C13-G12. Remember that column G has the price at which you traded (in this case bought) and column C has the market data for the INR FUTURES contract, hence the profit/loss is simply the price at which you sold minus the price at which you bought. If the status in column H is "Sell", then the profit/loss is calculated as G12-C13, simply meaning the difference between the price at which you sold (shorted) and the price at which you bought back later, thus squaring off the position.

Column J calculates the cumulative profit.
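
For reference, a rough R translation of the signal logic in columns C to F (the price vector of 5-minute closes is an assumption):

# price: vector of 5-minute INR FUTURES closes
n     <- length(price)
high3 <- c(rep(NA, 3), sapply(4:n, function(i) max(price[(i - 3):(i - 1)])))
low3  <- c(rep(NA, 3), sapply(4:n, function(i) min(price[(i - 3):(i - 1)])))

# Buy on a break above the 3-candle high, sell on a break below the
# 3-candle low; the stop loss (0.08) and take profit (0.16) are then
# applied to the entry price, as columns G and H describe
signal <- ifelse(price > high3, "Buy", ifelse(price < low3, "Sell", NA))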

Outputs

The output table has some performance metrics tabulated. The loss from all loss-making trades is $1588 and the profit from trades that hit TP is $1988, so the total P/L is $1988-$1588 = $400. Loss trades are the trades that resulted in losing money on the trading positions; profitable trades are the successful ones. Average profit is the ratio of total profit to the total number of trades. Net average profit, calculated after subtracting the transaction costs, amounts to $3.03.

Download Excel Model


How to Use Black Scholes Option Pricing Model [EXCEL MODEL]

In this post, we will discuss modelling option pricing using the Black Scholes option pricing model, and plotting the same for a combination of various options. You can put any number of call and/or put options in the model and use the built-in macro (named ‘BS’) to calculate the Black Scholes price of each option. The macro named ‘PayOff’ is used for plotting the profit/loss of the overall combination of option positions against the spot price.

Sheet1, named Payoff, has a table where we specify all the option parameters.

Column B specifies the expiry date of the option.

Column C specifies the option type.

Column D has the strike price of the underlying asset.

Column E shows the premium amount in INR at which the option is bought.

Column F tells us the number of option contracts we have bought.

Column G specifies the volatility.

Column H specifies the Black Scholes price of the option (calculated by the macro “BS”).

Column I is the current spot price of the underlying asset.

Column J shows the time to expiry of the option (calculated using the formula).

Column K specifies the expected PnL of the option (calculated using the formula). It is calculated as the difference between the Black Scholes price and the premium paid, multiplied by the number of option contracts.

Column L shows the actual premium in the market currently, meaning the current premium should you wish to buy the option.

The 13th row calculates the total investment. Since we have bought two call options and two put options at premiums of 120 and 152, our total investment is 120*2+152*2 = 544. The 14th row shows the expected present value. Since the market has moved after the options were bought, the current expected price of each option multiplied by the number of option contracts gives the expected value. Hence the expected payoff is 170.18*2+124.59*2 = 589.5475.

The present value in row 15 is calculated similarly, by taking the product of the actual premium currently in the market and the number of option contracts. Hence the present value is 150*2+120*2 = 540.

The graph below shows the plot of expected payoff for the option portfolio. This is done by taking the expected payoff values from sheet4. More on this later.

[Screenshot: Payoff sheet with the option parameters table]

The BS Price sheet shows the pricing of an option using the Black Scholes model. From the Black-Scholes option pricing model, we know that the price of a call option on a non-dividend stock can be written as:

$$C_t = S_t N(d_1) - Xe^{-r\tau} N(d_2)$$

and the price of put option on a non-dividend stock can be written as:

$$P_t = Xe^{-r\tau} N(-d_2) - S_t N(-d_1)$$

where

$$d_1 = \frac{\ln(S_t/X) + (r + \sigma_s^2/2)\tau}{\sigma_s \sqrt{\tau}}$$

$$d_2 = d_1 - \sigma_s \sqrt{\tau}$$

$$\tau = T - t$$

and where $$N(\cdot)$$ is the cumulative distribution function of the standard normal distribution, $$S_t$$ is the current price of the underlying, $$X$$ is the strike price, $$r$$ is the risk-free interest rate, $$\tau$$ is the time to expiry and $$ln$$ is the natural log.

The call and put values under the Black Scholes framework are calculated in the 13th and 14th rows for the parameters specified in rows 1 to 5.
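
For readers who want to verify the sheet’s numbers outside Excel, a small R version of these formulas (a sketch, not the workbook’s macro):

# Black Scholes price of a European option on a non-dividend stock
bs_price <- function(S, X, r, sigma, tau, type = c("call", "put")) {
  type <- match.arg(type)
  d1 <- (log(S / X) + (r + sigma^2 / 2) * tau) / (sigma * sqrt(tau))
  d2 <- d1 - sigma * sqrt(tau)
  if (type == "call") S * pnorm(d1) - X * exp(-r * tau) * pnorm(d2)
  else X * exp(-r * tau) * pnorm(-d2) - S * pnorm(-d1)
}

# Example: at-the-money call, 3 months to expiry, 20% volatility
bs_price(S = 100, X = 100, r = 0.05, sigma = 0.2, tau = 0.25)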

[Screenshot: BS Price sheet]

The “Back-end BS” sheet has the same set of values as the Payoff sheet from columns A to G. Column H onwards shows the spot price range in the 2nd row. You can change the starting point of the spot price range in cell H2. The increment (presently 10 points) can be changed in cell I2 and then dragged across the range horizontally. The 3rd row shows the Black Scholes call option price for the specified parameters and varying spot price. The 4th row shows the Black Scholes put option price for the specified parameters and varying spot price. Please note that though the post shows the calculation for three options, you can go up to a combination of 10 options by just filling appropriate values in the table in Sheet1. For more than 10 options, you can edit the sheet and the macro.

The 13th row calculates the total payoff from the option position. This is calculated as the difference between the profits from options and the total investment.

[Screenshot: Back-end BS sheet]

In this case the profit from the overall option position is the sum of H3 and H4. The total investment (calculated in the Payoff sheet’s 13th row) of 544 has to be subtracted from the sum of H3 and H4 to obtain the final payoff. Similar calculations are done for all the other columns.

[Screenshot: total payoff calculation]

The Expected Payoff graph in Sheet1 is the plot of total payoff calculated in Sheet3 against the underlying spot price.

[Screenshot: Expected Payoff graph]

There are two macros: one in the BS Price sheet that calculates the Black Scholes option price depending upon the values entered in the Payoff sheet, and one in the Payoff sheet that plots the Expected Payoff graph. Please make sure that the Expiry Date in the Payoff sheet is set beyond the current date, else the Black Scholes price will not return a numerical value for a negative time period.

Download Excel Model
