## Starting Out with Time Series

Time series analysis and forecasting are widely used in the financial markets across asset classes such as stocks, futures & options (F&O), forex, and commodities. Sound knowledge of time series forecasting is therefore essential for aspiring quants. In this post, we will introduce the basic concepts of time series and illustrate how to create time series plots and analyses in the R programming language.

### Time series defined

A time series is a sequence of observations over time, which are usually spaced at regular intervals of time. For example:

• Daily stock prices for the last 5 years
• 1-minute stock price data for the last 90 days
• Quarterly revenues of a company over the last 10 years
• Monthly car sales of an automaker for the last 3 years
• Annual unemployment rate of a state in the last 50 years

### Univariate time series and Multivariate time series

A univariate time series refers to the set of observations over time of a single variable. Correspondingly, a multivariate time series refers to the set of observations over time of several variables.

### Time Series Analysis and Forecasting

In time series analysis, the objective is to apply/develop models which are able to describe the given time series with a fair amount of accuracy. On the other hand, time series forecasting involves forecasting the future values of a given time series using the past observed values. There are various models that are used for forecasting and the viability of a particular model used for forecasting is determined by its performance at predicting the future values.

Some examples of time series forecasting:

• Forecasting the closing price of a stock every day
• Forecasting the quarterly revenues of a company
• Forecasting the monthly number of cars sold.

### Plotting a time series

A plot of time series data gives a clear picture of how the series spreads over the given time period, and makes it easy for the human eye to detect any seasonality or abnormality in the series.

### Plotting a time series in R

To plot a time series in R, we first need to read the data into R. If the data is available in a CSV file or an Excel file, we can read it using the read.csv() function or the read.xlsx() function respectively. Once the data has been read, we can create a time series plot using the plot.ts() function. See the example given below.

We will use the time series data set from the Time Series Data Library (TSDL) created by Rob Hyndman. We will plot the monthly closings of the Dow-Jones industrial index, Aug. 1968 – Oct. 1992. Save the dataset in your current R working directory with the name monthly-closings-of-the-dowjones.csv
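The post works in R, but the same read-and-plot steps translate directly to Python with pandas and Matplotlib. Below is a hedged sketch; a synthetic series stands in for the CSV so the snippet is self-contained, and the file and column names are assumptions matching the dataset described above.

```python
# Python equivalent of reading and plotting a monthly time series.
# In practice you would load the file with:
#   closings = pd.read_csv("monthly-closings-of-the-dowjones.csv",
#                          index_col=0, parse_dates=True).squeeze()
import matplotlib
matplotlib.use("Agg")              # headless backend; drop this when working interactively
import matplotlib.pyplot as plt
import pandas as pd

idx = pd.date_range("1968-08-01", periods=24, freq="MS")   # monthly timestamps
closings = pd.Series(range(850, 874), index=idx, name="Close")

ax = closings.plot(title="Monthly closings of the Dow Jones (sample)")
ax.set_xlabel("Month")
plt.savefig("dowjones.png")
```

The plot.ts() call in R and Series.plot() in pandas both pick up the time index automatically, so no explicit x-axis data is needed.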

### Decomposing time series

A time series generally comprises a trend component and an irregular (noise) component, and, in the case of a seasonal time series, a seasonal component as well. Decomposing a time series means separating it into these constituent components.

Trend – The long-term increase or decrease in the values of a given time series.

Seasonal – The repeating cycle over a specific period (day, week, month, etc.) in a given time series.

Irregular (Noise) – The random variation in the values of a given time series.

### Why do we need to decompose a time series?

As mentioned in the above paragraph, a time series might include a seasonal component or an irregular component. In such a case, we would not get a true picture of the trending property of the time series. Hence, we need to separate out the seasonality effect and/or the noise which will give us a clear picture, and help in further analysis.

### How do we decompose a time series?

There are two structures which can be used for decomposing a given time series.

1. Additive decomposition – If the seasonal variation is relatively constant over time, we can use the additive structure for decomposing a given time series. The additive structure is given as –

Xt = Trend + Random + Seasonal

2. Multiplicative decomposition – If the seasonal variation is increasing over time, we can use the multiplicative structure for decomposing a time series. The multiplicative structure is given as –

Xt = Trend * Random * Seasonal

### Decomposing a time series in R

To decompose a non-seasonal time series in R, we can use a smoothing method for calculating the moving average of a given time series. We can use the SMA() function from the TTR package to smooth out the time series.

To decompose a seasonal time series in R, we can use the decompose() function. This function estimates the trend, seasonal, and irregular (noise) components of a given time series. The decompose function is given as –

decompose(x, type = c("additive", "multiplicative"), filter = NULL)

Arguments
x – A time series
type – The type of seasonal component. Can be abbreviated
filter – A vector of filter coefficients in reverse time order (as for AR or MA coefficients), used for filtering out the seasonal component. If NULL, a moving average with the symmetric window is performed.

When we use the decompose function, we specify the type of the seasonal component (additive or multiplicative) via the type argument; it defaults to additive.
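To make the additive structure concrete, here is a minimal sketch in Python of what such a decomposition does internally: the trend is estimated with a centered moving average over one full seasonal period, and the seasonal component as per-position averages of the detrended series. The synthetic series, its period, and the variable names are assumptions for illustration.

```python
# Minimal additive decomposition: series = trend + seasonal + random.
import numpy as np
import pandas as pd

period = 12
t = np.arange(120)
rng = np.random.default_rng(0)
true_seasonal = 10 * np.sin(2 * np.pi * np.arange(period) / period)
series = pd.Series(0.5 * t + np.tile(true_seasonal, 10) + rng.normal(0, 1, 120))

trend = series.rolling(window=period, center=True).mean()   # smooths out one full cycle
detrended = series - trend
seasonal = detrended.groupby(t % period).mean()             # average each cycle position
seasonal -= seasonal.mean()                                 # seasonal effects sum to ~0
random = detrended - seasonal.to_numpy()[t % period]        # irregular (noise) component
```

R's decompose() follows essentially this recipe for the additive case; for the multiplicative case, the subtractions become divisions.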

### Stationary and non-stationary time series

A stationary time series is one whose mean and variance are both constant over time – more generally, one whose statistical properties do not depend on the time at which the series is observed. Such a series is flat, with no trend, constant variance, a constant mean, constant autocorrelation, and no seasonality, which makes it easy to predict. A non-stationary time series, on the other hand, is one where the mean, the variance, or both change over time.

There are different tests that can be used to check whether a given time series is stationary. These include inspection of the autocorrelation function (ACF) and partial autocorrelation function (PACF), the Ljung-Box test, the Augmented Dickey–Fuller (ADF) test, and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test.

Let us test our sample time series with the Autocorrelation function (ACF), Partial autocorrelation function (PACF) to check if it is stationary.

Autocorrelation function (ACF) – The autocorrelation function measures the correlation between data points of a time series separated by a lag "h". For lag 1, the ACF compares points #1 and #2, #2 and #3, and so on. Similarly, for lag 3, it compares points #1 and #4, #2 and #5, #3 and #6, etc.

R code for ACF –

Partial autocorrelation function (PACF) – In some cases, the effect of autocorrelation at smaller lags will have an influence on the estimate of autocorrelation at longer lags. For example, a strong lag one autocorrelation can cause an autocorrelation with lag three. The Partial Autocorrelation Function (PACF) removes the effect of shorter lag autocorrelation from the correlation estimate at longer lags.

R code for PACF

The values of the ACF and PACF each vary between plus and minus one; values closer to plus or minus one indicate strong correlation. If a time series is stationary, its ACF drops to zero relatively quickly, while the ACF of a non-stationary time series decays slowly. From the ACF graph, we can conclude that the given time series is non-stationary.
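This behaviour is easy to verify numerically. A small Python sketch, using white noise as the stationary series and a random walk as the non-stationary one (both synthetic, for illustration only):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - h], x[h:]) / denom
                     for h in range(max_lag + 1)])

rng = np.random.default_rng(42)
noise = rng.normal(size=1000)      # stationary: white noise
walk = np.cumsum(noise)            # non-stationary: random walk

acf_noise = acf(noise, 10)         # drops to ~0 immediately after lag 0
acf_walk = acf(walk, 10)           # decays very slowly
```

Running this shows exactly the pattern described above: the white-noise ACF is near zero from lag 1 onwards, while the random walk's ACF stays close to one even at lag 10.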

### Conclusion

In this post, we gave an overview of time series, plotting time series data, and decomposing a time series into its constituent components using the R programming language. We also introduced the concepts of stationary and non-stationary time series and the tests that can be carried out to check whether a given time series is stationary. In our upcoming post, we will continue with the concept of stationary time series and see how to convert a non-stationary time series into a stationary one. For further reference, you might like to go through the following: http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html

### Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to build a promising career in algorithmic trading. Enroll now!

## Creating Heatmap Using Python Seaborn


In our previous blog we talked about Data Visualization in Python using Bokeh. Now, let’s take our series on Python data visualization forward, and cover another cool data visualization Python package. In this post we will use the Python Seaborn package to create Heatmaps which can be used for various purposes, including by traders for tracking markets.

### Seaborn for Python Data Visualization

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Because seaborn is built on top of Matplotlib, the graphics can be further tweaked using Matplotlib tools and rendered with any of the Matplotlib backends to generate publication-quality figures. [1]

Types of plots that can be created using seaborn include:

• Distribution plots
• Regression plots
• Categorical plots
• Matrix plots
• Time series plots

The plotting functions operate on Python dataframes and arrays containing a whole dataset, and internally perform the necessary aggregation and statistical model-fitting to produce informative plots.[2]

Source: seaborn.pydata.org

### What is a heatmap?

A heatmap is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colors. The seaborn package allows for creation of annotated heatmaps which can be tweaked using Matplotlib tools as per the creator’s requirement.

Annotated Heatmap

### Python Heatmap Code

We will create a seaborn heatmap for a group of 30 pharmaceutical company stocks listed on the National Stock Exchange of India Ltd (NSE). The heatmap will display the stock symbols and their respective single-day percentage price changes.

We collate the required market data on Pharma stocks and construct a comma-separated values (CSV) file comprising the stock symbols and their respective percentage price changes in the first two columns.

Since we have 30 Pharma companies in our list, we will create a heatmap matrix of 6 rows and 5 columns. Further, we want our heatmap to display the percentage price change for the stocks in a descending order. To that effect we arrange the stocks in a descending order in the CSV file and add two more columns which indicate the position of each stock on X & Y axis of our heatmap.

### Import the required Python packages

We import the following Python packages:

We read the dataset using the read_csv function from pandas and inspect the first ten rows using the print function.

#### Create a Python Numpy array

Since we want to construct a 6 x 5 matrix, we create NumPy arrays of that shape from the "Symbol" and "Change" columns.

#### Create a Pivot in Python

The pivot function is used to create a new derived table from the given dataframe object "df". The function takes three arguments: index, columns, and values. The cell values of the new table are taken from the column given as the values parameter, which in our case is the "Change" column.
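As a toy illustration of this pivot step, consider a 2 x 2 grid. The column names mirror the CSV layout described above and are assumptions:

```python
import pandas as pd

# Four stocks arranged on a 2 x 2 heatmap grid.
df = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD"],
    "Change": [2.5, 1.2, -0.4, -1.8],
    "X": [0, 1, 0, 1],      # column position on the heatmap
    "Y": [0, 0, 1, 1],      # row position on the heatmap
})
result = df.pivot(index="Y", columns="X", values="Change")
```

`result` is now a 2 x 2 table of percentage changes, with Y as the row index and X as the columns, ready to be passed to the heatmap function.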

#### Create an Array to Annotate the Heatmap

In this step we create an array which will be used to annotate the heatmap. We call the flatten method on the "symbol" and "percentage" arrays to flatten each of them in one line. The zip function then pairs each flattened symbol with its percentage change. Inside a Python for loop, we use the format function to combine each stock symbol and its percentage price change into an annotation label.
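A minimal sketch of this annotation step, shown as a 2 x 2 grid to keep it short (the array contents and the exact label format are assumptions; the real code uses the 6 x 5 arrays built earlier):

```python
import numpy as np

symbol = np.array([["AAA", "BBB"], ["CCC", "DDD"]])
percentage = np.array([[2.5, 1.2], [-0.4, -1.8]])

# Build "SYMBOL\n+x.x%" labels by zipping the flattened arrays,
# then reshape back to the grid shape for the heatmap's annot argument.
labels = np.array(
    ["{}\n{:+.1f}%".format(s, p)
     for s, p in zip(symbol.flatten(), percentage.flatten())]
).reshape(symbol.shape)
```

Each cell of `labels` now carries both the ticker and its signed percentage change, stacked on two lines.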

#### Create the Matplotlib figure and define the plot

We create an empty Matplotlib plot and define the figure size. We also add the title to the plot and set the title’s font size, and its distance from the plot using set_position method.

We wish to display only the stock symbols and their respective single-day percentage price change. Hence, we hide the ticks for the X & Y axis, and also remove both the axes from the heatmap plot.

#### Create the Heatmap

In the final step, we create the heatmap using the heatmap function from the Python seaborn package. The heatmap function takes the following arguments:

data – 2D dataset that can be coerced into an ndarray. If a Pandas DataFrame is provided, the index/column information will be used to label the columns and rows.

annot – an array of the same shape as data, used to annotate the heatmap.

cmap – a matplotlib colormap name or object. This maps the data values to the color space.

fmt – string formatting code to use when adding annotations.

linewidths – sets the width of the lines that will divide each cell.
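Putting these arguments together, a hedged sketch of the final call on toy data (the colormap, figure size, and headless backend here are our own choices, not necessarily those of the original script):

```python
import matplotlib
matplotlib.use("Agg")              # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = np.array([[2.5, 1.2], [-0.4, -1.8]])
labels = np.array([["AAA\n+2.5%", "BBB\n+1.2%"],
                   ["CCC\n-0.4%", "DDD\n-1.8%"]])

fig, ax = plt.subplots(figsize=(4, 3))
# fmt="" is required when annot is an array of strings rather than numbers.
sns.heatmap(data, annot=labels, fmt="", cmap="RdYlGn", linewidths=0.5,
            ax=ax, cbar=False, xticklabels=False, yticklabels=False)
fig.savefig("heatmap.png")
```

Swapping in the full 6 x 5 pivot table and annotation array from the earlier steps reproduces the Pharma heatmap described in the post.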

Here’s our final output of the seaborn heatmap for the chosen group of pharmaceutical companies. Looks pretty neat and clean, doesn’t it? A quick glance at this heatmap and one can easily make out how the market is faring for the period.

Readers can download the entire Python code plus the Excel file using the download button provided below and create their own custom heatmaps. With a little tweak in the code, you can create Python heatmaps of any size, for any market index, or for any period. The heatmap can be used in live markets by connecting a real-time data feed to the Excel file that is read in the Python code.

### To Conclude

As illustrated from the heatmap example above, seaborn is easy to use and one can tweak the seaborn plots to one’s requirement. You can refer to the documentation of seaborn for creating other impressive charts that you can put to use for analyzing the markets.

### Next Step

Python Data Visualization is just one of the elements covered in the vast domain of Algorithmic Trading. To understand the patterns, one must be well-versed in the basics. Want to know more about Algorithmic trading? You should click here and check out more about Algorithmic Trading.

• Data Visualization using Seaburn.rar
• Pharma Heatmap using Seaburn.py
• Pharma Heatmap.data

## Put-Call Parity in Python Programming Language

Put Call Parity in Python

We talked about Covered Call Strategy and Long Call Butterfly Strategy in our previous articles on the blog. Now, we shall talk about the Put-call Parity.

Put-call parity principle defines the relationship between the price of a European Put option and European Call option, both having the same underlying asset, strike price and expiration date.

If there is a deviation from put-call parity, it results in an arbitrage opportunity. Traders take advantage of this opportunity to make riskless profits until put-call parity is restored.

The put-call parity principle can be used to validate an option pricing model. If the option prices as computed by the model violate the put-call parity rule, such a model can be considered to be incorrect.

### Understanding Put Call Parity

To understand put-call parity, consider a portfolio "A" comprising of a call option and cash, where the amount of cash held equals Ke^(-rT), the present value of the strike price K. Consider another portfolio "B" comprising of a put option and the underlying asset. S0 is the initial price of the underlying asset and ST is its price at expiration. Let "r" be the risk-free rate and "T" be the time to expiration. Given the risk-free rate "r", the cash in portfolio A will grow to exactly K (the strike price) by time "T".

Portfolio A = Call option + Cash

Portfolio B = Put option + Underlying Asset

If the share price is higher than K the call option will be exercised. Else, cash will be retained. Hence, at “T” portfolio A’s worth will be given by max(ST, K).

If the share price is lower than K, the put option will be exercised. Else, the underlying asset will be retained. Hence, at “T”, portfolio B’s worth will be given by max(ST, K).

Since the two portfolios have identical worth at time "T", the no-arbitrage principle requires that they be equal at any earlier time as well. This gives us the put-call parity equation –

C + Ke^(-rT) = P + S0

When the put-call parity principle is violated, traders will try to take advantage of the arbitrage opportunity: an arbitrage trader goes long on the undervalued portfolio and shorts the overvalued portfolio to make a risk-free profit.
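The parity can be checked numerically. A small Python sketch, where the strike, rate, and prices are illustrative assumptions rather than market data:

```python
import math

K = 100.0          # strike price
r = 0.05           # risk-free rate
T = 1.0            # time to expiration in years

# At expiration, both portfolios are worth max(ST, K) for every terminal price ST.
for ST in [60.0, 80.0, 100.0, 120.0, 140.0]:
    portfolio_a = max(ST - K, 0.0) + K          # call payoff + cash grown to K
    portfolio_b = max(K - ST, 0.0) + ST         # put payoff + underlying asset
    assert portfolio_a == portfolio_b == max(ST, K)

# Today, the same no-arbitrage argument gives C + K*exp(-r*T) = P + S0,
# so given a call price and the spot, the parity-implied put price is:
C, S0 = 10.0, 100.0
P = C + K * math.exp(-r * T) - S0
```

Any market put price materially different from this parity-implied value would signal the arbitrage opportunity described above.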

Python codes used for plotting the charts:

### Next Step

This was a brief explanation of put-call parity wherein we provided the Python code for plotting the constituents of the put-call parity equation. In our future posts we will cover and attempt to illustrate other derivatives concepts using Python. Our Executive Programme in Algorithmic Trading (EPAT) includes dedicated lectures on Python and Derivatives. To know more about EPAT, check the EPAT course page or feel free to contact our team at contact@quantinsti.com for queries on EPAT.

• Put_Call_Parity.rar
• putcallparity.py

## Long Call Butterfly Strategy on Python

We talked about Covered Call Strategy in a previous article. In this post, we will cover the Long Call Butterfly. The Long Call Butterfly is a popular strategy deployed by traders when little price movement is expected in the underlying security. The Long Call Butterfly strategy involves three legs:

• Buying a lower strike In-the-money (ITM) Call option
• Buying a higher strike Out-of-the-money (OTM) Call option
• Selling two At-the-money (ATM) Call options

In this strategy, all Call options have the same expiration date, and the distance between each strike price of the constituent legs must be the same. Let us take an example to understand the working of a Long Call Butterfly, its payoff, and the risk involved in the strategy.

#### Example:

ABC stock is trading at Rs. 225 on Jan 2nd, 2015. To create a Long Call Butterfly we,

1) Buy the 215 strike Jan 29th 2015 Call for Rs.12.50, Lot size – 100 shares

2) Sell 2 lots of 225 strike Jan 29th 2015 Call for Rs.6.50, Lot size – 100 shares

3) Buy the 235 strike Jan 29th 2015 Call for Rs.3.00, Lot size – 100 shares

The net premium flow to take these positions equals:

Rs.1300 – Rs.1550 = –Rs.250

i.e., a net debit of Rs.250 (Rs.1300 received for the two short calls, against Rs.1550 paid for the two long calls).

If the stock price at expiration stands at Rs.230, the lower strike and the middle strike calls will be exercised, while the higher strike call will expire worthless.

The profit made is given by –

Profit = (Profit on the ITM Call – Premium Paid) plus (Premium received – Loss on the ATM Call) minus (Premium Paid on the OTM Call)

Profit = (Rs.1500 – Rs.1250) + (Rs.1300 – Rs.1000) – Rs.300 = Rs.250

The Risk-Reward Profile for Long Call Butterfly is as given below:

1. Maximum Risk – Net debit paid
2. Maximum Reward – (Difference between adjacent strikes – Net debit paid)

#### Python code for Long Call Butterfly Payoff chart:

We are using the same example used above to illustrate how the strategy is coded in python.

We use the np.where function from NumPy to compute the payoff for each leg of the strategy. We then use the matplotlib library to plot the chart: we first create an empty figure and add a subplot to it, then remove the top and right borders and move the X-axis to the center. Using the plt.plot function we plot the payoff for the Long Call Butterfly. Finally, we add the title and labels to the chart. Please note that we haven't plotted the payoffs of the individual constituent legs.
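The payoff computation itself can be sketched as below, using the example's strikes and premiums with a lot size of 100 (variable names are our own):

```python
import numpy as np

lot = 100
s_t = np.arange(200, 251)                          # possible expiration prices

# Long the 215 call @ 12.50, short two 225 calls @ 6.50 each, long the 235 call @ 3.00.
long_itm  = np.where(s_t > 215, s_t - 215, 0) - 12.50
short_atm = 2 * (6.50 - np.where(s_t > 225, s_t - 225, 0))
long_otm  = np.where(s_t > 235, s_t - 235, 0) - 3.00

payoff = (long_itm + short_atm + long_otm) * lot   # per-lot profit/loss at expiration
```

Evaluating `payoff` reproduces the numbers from the example: a maximum reward of Rs.750 at the middle strike (Rs.225), a profit of Rs.250 at Rs.230, and the maximum risk of Rs.250 (the net debit) beyond either wing.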

### Next Step

This was a brief post on Long Call Butterfly and its payoff chart using Python. In our coming posts we will cover more option strategies and illustrate how to plot their payoff chart using Python. If you want to know more strategies and the ways to implement them in live markets, then you should consider enrolling for EPAT by clicking here.

• Long Call Butterfly Payoff.rar
• Long Call Butterfly Payoff(2).py

## Sentiment Analysis on News Articles using Python

Know how to perform sentiment analysis on news articles using Python Programming Language

In our previous post on sentiment analysis, we briefly explained sentiment analysis within the context of trading and provided a model code in R. The R model was applied to an earnings call conference transcript of an NSE-listed company, and the output of the model was compared with the quarterly earnings numbers and with the one-month stock price movement after the earnings call date. QuantInsti also conducted a webinar on "Quantitative Trading Using Sentiment Analysis", in which Rajib Ranjan Borah, Director & Co-founder of iRageCapital and QuantInsti, covered important aspects of the topic in detail; it is a must-watch for all enthusiasts wanting to learn and apply quantitative trading strategies using sentiment analysis.

Taking these initiatives on sentiment analysis forward, in this blog post we attempt to build a Python model to perform sentiment analysis on news articles that are published on a financial markets portal. We will build a basic model to extract the polarity (positive or negative) of the news articles.

In Rajib's webinar, one of the slides details the sensitivity of different sectors to company and sectoral news. In that slide, the Pharma sector ranks at the top as the most sensitive sector, so in this blog we will apply our sentiment analysis model to news articles pertaining to select Indian Pharma companies. We will determine the polarity of each article and then check how the market reacted to the news. For our sample model, we have taken the ten Indian Pharma companies that make up the NIFTY Pharma index.

### Building the Model

Now, let us dive straight in and build our model. We use the following Python libraries to build the model:

• Requests
• Beautiful Soup
• Pattern

#### Step 1: Create a list of the news section URL of the component companies

We identify the component companies of the NIFTY Pharma index and create a Python dictionary which contains the company names as keys, while the values are the respective company abbreviations used by the financial portal site to form the news section URL. Using this dictionary, we create a Python list of the news section URLs for all the component companies.
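A hedged sketch of this step follows; the portal URL pattern and the company abbreviations below are made-up placeholders, not the real ones used by the financial portal:

```python
# Company name -> portal abbreviation (placeholder values).
companies = {
    "Sun Pharma": "SPI",
    "Lupin": "L",
    "Cipla": "C",
}
# Assumed URL template for a company's news section page.
base_url = "http://www.example-portal.com/company/{}/news"
news_urls = [base_url.format(abbr) for abbr in companies.values()]
```

The real dictionary would list all ten NIFTY Pharma constituents with the abbreviations the portal actually uses.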

#### Step 2: Extract the relevant news articles web-links from the company’s news section page

Using the Python list of the news section URLs, we run a Python For loop which pings the portal with every URL in our Python list. We use the requests.get function from the Python requests library (which is a simple HTTP library). The requests module allows you to send HTTP/1.1 requests. One can add headers, form data, multipart files, and parameters with simple Python dictionaries, and also access the response data in the same way.

The text of the response object is then used to create a Beautiful Soup object. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a parser of your choice to provide ways of navigating, searching, and modifying the parse tree.

HTML parsing basically means taking in the HTML code and extracting relevant information like the title of the page, paragraphs in the page, headings, links, bold text etc.

The news section webpage on the financial portal site contains 20 news articles per page. We target only the first page of the news section, and our objective is to extract the links for all the news articles that appear on the first page using the parsed HTML. We inspect the HTML, and use the find_all method in the code to search for a tag that has the CSS class name as “arial11_summ”. This enables us to extract all the 20 web-links.
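A minimal illustration of this extraction step; the HTML snippet is a stand-in for the fetched page (in the real code it comes from requests.get(url).text), and only the CSS class name "arial11_summ" comes from the post:

```python
from bs4 import BeautifulSoup

html = """
<div class="arial11_summ"><a href="/news/article-1">FDA approves drug</a></div>
<div class="arial11_summ"><a href="/news/article-2">Plant inspection update</a></div>
<div class="other"><a href="/news/ignored">Unrelated</a></div>
"""
soup = BeautifulSoup(html, "html.parser")
# find_all on the class picks up only the news-summary blocks.
links = [div.find("a")["href"] for div in soup.find_all(class_="arial11_summ")]
```

On the real page, the same two lines collect all 20 article links from the first page of a company's news section.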

Fortunes of the R&D-intensive Indian Pharma sector are driven by sales in the US market and by approvals/rejections of new drugs by the US Food and Drug Administration (USFDA). Hence, we select only those news articles pertaining to the USFDA and the US market. Using keywords like "US", "USA", and "USFDA" in an if statement nested within the Python for loop, we get our final list of all the relevant news articles.

#### Step 3: Remove the duplicate news articles based on news title

It may happen that the financial portal publishes important news articles pertaining to the overall pharma sector on every pharma company's news section webpage. Hence, it becomes necessary to weed out the duplicate news articles that appear in our Python list before we run our sentiment analysis model. We call the set function on the Python list generated in Step 2 and convert the result back to a list, leaving us with no duplicate news articles.
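A small sketch of the de-duplication (the URLs are placeholders; we use dict.fromkeys rather than the set function so the original ordering is preserved, which plain set() does not guarantee):

```python
urls = [
    "http://portal/news/sector-outlook",    # sector story repeated on two company pages
    "http://portal/news/company-a-result",
    "http://portal/news/sector-outlook",
]
# dict.fromkeys drops duplicates while keeping first-seen order.
unique_urls = list(dict.fromkeys(urls))
```

list(set(urls)) gives the same set of articles, just in arbitrary order.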

#### Step 4: Extract the main text from the selected news articles

In this step we run a Python for loop, and for every news article URL we call requests.get() on the URL and convert the text of the response object into a Beautiful Soup object. Finally, we extract the main text using the find and get_text methods from the Beautiful Soup module.

#### Step 5: Pre-processing the extracted text

We will use the n-grams function from the Pattern module to pre-process our extracted text. The ngrams() function returns a list of n-grams (i.e., tuples of n successive words) from the given string. Since we are building a simple model, we use a value of one for the n argument in the n-grams function. The Pattern module contains other useful functions for pre-processing like parse, tokenize, tag etc. which can be explored to conduct an in-depth analysis.
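Since the Pattern module is not always available on newer Python versions, here is a small stand-alone equivalent of its ngrams() function (our own implementation, shown purely for illustration of what the pre-processing step produces):

```python
def ngrams(text, n=1):
    """Return tuples of n successive words, mimicking pattern's ngrams()."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = ngrams("FDA approves new drug", n=1)    # unigrams, as used in the post
bigrams = ngrams("FDA approves new drug", n=2)   # pairs of successive words
```

With n=1 the output is simply the lowercased word list wrapped in one-element tuples, which is what the dictionary matching in the next step consumes.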

#### Step 6: Compute the Sentiment analysis score using a simple dictionary approach

To compute the overall polarity of a news article we use the dictionary method. In this approach a list of positive/negative words help determine the polarity of a given text. This dictionary is created using the words that are specific to the Pharma sector. The code checks for positive/negative matching words from the dictionary with the processed text from the news article.
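A toy sketch of this dictionary approach; the word lists below are tiny stand-ins for the Pharma-specific dictionary described above, which in the real model is loaded from a file:

```python
# Placeholder sentiment dictionaries (the real ones are Pharma-specific).
positive = {"approval", "approves", "growth", "launch"}
negative = {"recall", "warning", "rejected", "delay"}

def score(tokens):
    """Count positive and negative dictionary matches in the token list."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return pos, neg

tokens = "usfda approves drug launch after earlier warning".split()
pos, neg = score(tokens)
```

The article's polarity is then read off from the two counts, e.g. positive if pos > neg.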

#### Step 7: Create a Python list of model output

The final output from the model is populated in a Python list. The list contains the URL, positive score and the negative score for each of the selected news articles on which we conducted sentiment analysis.

Final Output

#### Step 8: Plot NIFTY vs NIFTY Pharma returns

Shown below is a plot of NIFTY vs NIFTY Pharma for the months of October-November 2016. In our NIFTY Pharma plot we have drawn arrows highlighting some of the press releases on which we ran our sentiment analysis model. The impact of the uncertainty surrounding the US presidential election results, and of the negative news for the Indian Pharma sector emanating from the US, is clearly visible on NIFTY Pharma, which fell substantially from the highs made in late October 2016. Thus, our attempt to gauge the direction of the Pharma index using the sentiment analysis model in Python gives reasonably accurate results, more or less.

### Next Step

One can build more robust sentiment models using other approaches and trade profitably. As a next step we would recommend watching QuantInsti’s webinar on “Quantitative Trading Using Sentiment Analysis” by Rajib Ranjan Borah. Watch it by clicking on the video below:

Also, catch our other exciting Python trading blogs and if you are interested in knowing more about our EPAT course feel free to contact our QuantInsti team by clicking here.

• Sentiment Analysis of News Article – Python Code
• dict(1).csv
• Nifty and Nifty Pharma(1).csv
• Pharma vs Nifty.py

## How to Check Data Quality Using R

### Do You Use Clean Data?

Always go for clean data! Why is it that experienced traders/authors stress this point in their trading articles/books so often? As a novice trader, you might be using the freely available data from sources like Google or Yahoo finance. Do such sources provide accurate, quality data?

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE's trading session starts at 9:15 am and ends at 3:30 pm IST, comprising 375 minutes. For 14 trading sessions, we should therefore have 5250 data points for each of these stocks. We wrote a simple script in R to perform the check.
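The original check was written in R; the same idea in Python with pandas looks roughly like this (synthetic timestamps stand in for the downloaded data, with one bar deliberately dropped to show a gap being caught):

```python
import pandas as pd

# Two sessions of 1-minute bars, 09:15 through 15:29 inclusive (375 bars each).
session1 = pd.date_range("2016-08-01 09:15", "2016-08-01 15:29", freq="1min")
session2 = pd.date_range("2016-08-02 09:15", "2016-08-02 15:29", freq="1min")
stamps = session1.append(session2).delete(10)       # drop one bar from day 1

bars_per_day = pd.Series(1, index=stamps).groupby(stamps.date).count()
expected = 375                                      # full session bar count
missing_days = bars_per_day[bars_per_day < expected]
```

Running the same count against each stock's downloaded timestamps and comparing to 375 bars per session flags exactly the kind of gaps described below.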

Here is our finding: out of the 143 stocks scanned, 89 had fewer than 5250 data points, which is more than 60% of our sample set! The table shown below lists 10 of those 89 stocks.

Let's take the case of PAGEIND. Google Finance captured only 4348 one-minute data points for the stock, missing 902 points!

Example – Missing the 13:06 bar on 20160801:

Example – Missing the 10:32 bar on 20160802:

If a trader is running an intraday strategy which generates buy/sell signals based on 1-minute bars, the strategy is bound to give some false signals.

As can be seen from the quick check above, data quality from free sources or from cheap data vendors is not always guaranteed. Many of the cheap data vendors source the data from Yahoo finance and provide it to their clients. Poor data feed is a big issue faced by many traders and you will find many traders complaining about the same on various trading forums.

Backtesting a trading strategy using such data will give false results. If you are using the data in live trading and there is a server problem with Google or Yahoo Finance, the data feed will be delayed or interrupted. As a trader, you don't want to be in a position where you have an open trade and the data feed stops or is delayed. When trading with real money, one is always advised to use quality data from reliable data vendors. After all, data is everything!

### Next Step

If you’re a retail trader interested in learning various aspects of Algorithmic trading, check out the Executive Programme in Algorithmic Trading (EPAT). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. The course equips you with the required skillsets to be a successful trader.

• Do You Use Clean Data.rar
• 15 Day Intraday Historical Data.zip
• F&O Stock List.csv
• R code – Stock price data.txt

## Sentiment Analysis in Trading Using R [WORKING MODEL]

In this post we discuss sentiment analysis in brief and then present a basic sentiment analysis model in R. Sentiment analysis is the analysis of the feelings (i.e. attitudes, emotions and opinions) which are expressed in the news reports/blog posts/twitter messages etc., using natural language processing tools.

Natural language processing (NLP) in simple terms refers to the use of computers to process sentences/text in a natural language such as English. The objective here is to extract information from unstructured or semi-structured data found in these tweets/blogs/articles. To enable this, NLP makes use of artificial intelligence, computational linguistics, and computer science.

Using NLP models hundreds of text documents can be processed to ascertain the sentiment in seconds. These days sentiment analysis is a hot topic and has found wide application in areas like Business intelligence, Politics, Finance, Policy making etc.

Sentiment analysis in trading – Sentiment can often drive the direction of the markets. Hence, traders and other participants in the financial markets seek to gauge the sentiment expressed in news reports/tweets/blog posts. Traders build automatic trading systems which extract the sentiment from natural language and take long/short positions in the markets based on the trading signals generated. These signals can also be combined with those of other trading systems. The objective, at the end of the day, is to generate superior returns from the extracted information.

There are various methods and models for sentiment analysis. Let us take a look at a very basic model in R.

### Sentiment analysis model in R

In this model, we implement the "Bag-of-words" approach to sentiment analysis. The process identifies positive and negative words (or strings of words) within an article. For this, it makes use of a large dictionary which contains words that carry sentiment. Each word in this dictionary can be assigned a weight. The difference between the positive and negative counts gives the final sentiment score generated by the model.

We will test our model on the management commentary text taken from the latest earnings call transcript of Eicher Motors Ltd. Eicher Motors is a leading Indian automaker company which owns the Royal Enfield Motors. The objective of our model will be to gauge the opinion expressed in their fourth quarter 2015 earnings call.

To build this model we use the "tm" and the "RWeka" packages in R. We load the libraries and then read the two documents which contain the positive and the negative terms. To prepare these documents, we went through the four conference call transcripts prior to the fourth quarter of 2015 and picked the positive/negative words from them to populate our dictionary. In addition to these words, we also added some general positive/negative words relating to the motorcycle industry.

We will be considering only the management’s commentary in our sentiment analysis model. We load the text document (fourth quarter 2015) containing the CEO’s prepared text commentary in R using the Corpus function. For this, we have stored the commentary document in the TextMining folder in the R’s working directory.

The next step is to clean the text. We convert all words to lowercase, remove punctuation, remove numbers, and strip the whitespace. The writeLines function enables us to see the text after the cleansing.
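The tm package applies these transformations to the whole corpus; purely as an illustration, the same cleansing steps can be sketched in base R on a made-up sentence (the sample text is not from the transcript):

```r
# Base-R equivalent of the tm cleaning steps:
# lowercase, remove punctuation, remove numbers, strip whitespace.
clean.text <- function(x) {
  x <- tolower(x)                  # convert to lowercase
  x <- gsub("[[:punct:]]", "", x)  # remove punctuation
  x <- gsub("[[:digit:]]", "", x)  # remove numbers
  x <- gsub("\\s+", " ", x)        # collapse runs of whitespace
  trimws(x)                        # strip leading/trailing spaces
}

clean.text("Revenue grew 25%, a strong   quarter!")
# "revenue grew a strong quarter"
```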

Next, we tokenize the text that was cleaned above. Tokenization is the process of breaking a stream of text into words or strings of words. We use the NGramTokenizer function here, which creates N-grams of the text.

N-grams are sets of co-occurring words within a given text. For example, consider the sentence "The food is delicious". If n = 2, the n-grams would be:

• the food
• food is
• is delicious
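RWeka's NGramTokenizer does this over the whole corpus; as a minimal illustration, a base-R equivalent for a single sentence looks like this:

```r
# Minimal base-R n-gram builder (sliding window of n words).
ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

ngrams("The food is delicious", 2)
# "the food" "food is" "is delicious"
```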

Thereafter we create a term-document matrix (called "terms" in the code), which is a matrix that lists all occurrences of words in the corpus.

Below we check if the positive/negative words in the dictionary are present in the text document.

Now we extract all the positive/negative words from the text document which matched with the words in our dictionary.

The code lines below compute the positive/negative score, and finally the sentiment score.
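The scoring step can be sketched in base R; the tiny dictionaries below are illustrative stand-ins for the actual Positive/Negative terms files, and the sample sentence is made up:

```r
# Bag-of-words scoring: count dictionary hits, score = positives - negatives.
pos.terms <- c("strong", "growth", "record", "robust")    # illustrative
neg.terms <- c("decline", "shutdown", "weak", "flood")    # illustrative

sentiment.score <- function(text, pos, neg) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  n.pos <- sum(words %in% pos)
  n.neg <- sum(words %in% neg)
  c(positive = n.pos, negative = n.neg, score = n.pos - n.neg)
}

sentiment.score("strong growth despite a brief shutdown", pos.terms, neg.terms)
# positive 2, negative 1, score 1
```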

Final result – Sentiment score

The model found 14 positive words and 4 negative words, and the final sentiment score was 10. This tells us that the quarterly result for Q4 2015 was good from the management’s perspective. The word cloud below shows some of the positive/negative words that were picked from the text document on which we ran the model.

Validating our sentiment analysis model – let us check the quarterly performance numbers to confirm the positive sentiment score generated by our model. As can be seen, Eicher Motors posted a strong quarter. EBIT growth was around 72% y/y on a strong sales volume of 125,690 motorcycles. The strong results came despite a production shutdown for a few days caused by the floods at its production facility during the quarter.

The chart below shows the stock market's reaction to Eicher Motors' strong results on the day of the earnings announcement. The stock opened at around Rs. 17,100, made a big move touching an intraday high of around Rs. 18,500, and finally closed at Rs. 18,175.

### Conclusion

This was a basic introduction to sentiment analysis. The model above can be made more robust and fine-tuned further. In future posts, we will try to cover other sentiment analysis approaches and attempt to build a model around them.

QuantInsti has been actively participating in conferences on sentiment analysis and was one of the lead marketing and education partner at the recently held “Sentiment analysis in Finance” conference in Singapore, 2016. Rajib Ranjan Borah, Co-founder & Director of iRageCapital Advisory Pvt. Ltd, & QuantInsti was one of the esteemed panelists for the session, “New Paradigms for Sentiment Analysis Applied to Finance” at the conference.

To know more about QuantInsti and the Executive Programme in Algorithmic Trading (EPAT) course offered by QuantInsti, check our website and the EPAT course page. Feel free to contact our team at contact@quantinsti.com for queries on EPAT.

• Sentiment analysis in Trading – Files.rar
• Eicher Motors Sentiment Analysis – R Code.txt
• Negative terms.csv
• Positive Terms.csv
• Q4.txt

## Machine Learning and Its Application in Forex Markets – Part 2 [WORKING MODEL]

In our previous post on Machine learning we derived rules for a forex strategy using the SVM algorithm in R. In this post we take a step further, and demonstrate how to backtest our findings.

To recap the last post, we used the Parabolic SAR and the MACD histogram as our indicators for machine learning. The Parabolic SAR indicator trails price as the trend extends over time. SAR is below prices when prices are rising and above prices when prices are falling. SAR stops and reverses when the price trend reverses and breaks above or below it.

The MACD oscillator comprises the MACD line, the Signal line, and the MACD histogram. The MACD line is the 12-day Exponential Moving Average (EMA) less the 26-day EMA. The MACD Signal line is a 9-day EMA of the MACD line. The MACD histogram represents the difference between the MACD line and the MACD Signal line. The histogram is positive when the MACD line is above its Signal line and negative when the MACD line is below its Signal line.

The EUR/USD price series chart below shows Parabolic SAR plotted in blue, and the MACD line, MACD signal line, and the MACD histogram below the EURUSD price series.

Our intention is to take positions around the MACD line crossovers and Parabolic SAR reversal points. When the Parabolic SAR gives a buy signal and the MACD line crosses upwards, we buy. When the Parabolic SAR gives a sell signal and the MACD line crosses downwards, we sell.

After selecting the indicators we ran the SVM algorithm on EUR/USD data, which gave us the plot as shown above. Looking at the SVM predictions, we now frame the rules, and backtest them to see the performance of our strategy.

Short rule = (Price – SAR) < 0.0010 & MACD histogram < 0.0010
Long rule = (Price – SAR) > -0.0050 & MACD histogram > -0.0010
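Applied to indicator series, the two rules are simple vectorized comparisons. A base-R sketch on made-up trend (Price – SAR) and MACD histogram values, with the short rule checked first where the two rules overlap:

```r
# Illustrative indicator values (not real EUR/USD data).
trend <- c(0.0005, -0.0030, 0.0200, -0.0040)   # Price - SAR
macd  <- c(0.0002, -0.0005, 0.0050, -0.0008)   # MACD histogram

# The SVM-derived rules from the post.
short.sig <- trend <  0.0010 & macd <  0.0010
long.sig  <- trend > -0.0050 & macd > -0.0010

# Short rule takes precedence where both fire.
signal <- ifelse(short.sig, "Short", ifelse(long.sig, "Long", "Flat"))
signal
# "Short" "Short" "Long" "Short"
```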

We have used Michael Kapler’s Systematic Investor Toolbox to backtest our model in R. We start by loading the toolbox and the necessary libraries.

Next we create a new environment and load the historical EUR/USD data using the getSymbols function.

We will check the performance of our rule-based model against a simple ‘buy and hold’ model.  To do that, we first create a ‘buy and hold’ model.
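The 'buy and hold' benchmark stays fully invested throughout, so (outside SIT, as a minimal sketch) its equity curve is just the compounded return series; the synthetic returns below are illustrative:

```r
# Buy-and-hold benchmark: fully invested, equity = compounded returns.
set.seed(42)
rets   <- rnorm(252, mean = 0.0002, sd = 0.005)  # one year of synthetic daily returns
equity <- cumprod(1 + rets)                      # growth of $1
tail(equity, 1)                                  # terminal value of the curve
```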

Our next step is to compute the indicators for our rule-based model.

We run two models here, ‘long short’ model, and another ‘long short’ model using stop loss and take profit. First we create a long short model without stop loss and take profit.

Next we set the take profit and stop loss levels, and create a long short model using these levels. We call this the 'stop.loss.take.profit' model.

Let us now run all the three models, and check their relative performance.
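The comparison below reports CAGR and maximum drawdown; both statistics can be computed from any equity curve in base R (the sample curve here is illustrative, not the model's output):

```r
# CAGR: annualized growth rate of the equity curve.
cagr <- function(equity, years) {
  (tail(equity, 1) / equity[1])^(1 / years) - 1
}

# Maximum drawdown: largest peak-to-trough decline, as a fraction.
max.drawdown <- function(equity) {
  max(1 - equity / cummax(equity))
}

eq <- c(100, 110, 121, 108, 130)   # illustrative equity curve over 2 years
round(cagr(eq, 2) * 100, 2)        # 14.02 (percent)
round(max.drawdown(eq) * 100, 2)   # 10.74 (percent)
```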

As can be seen, the rule-based strategy has a smoother equity curve and delivers a better CAGR of 5.97% than the simple 'buy and hold' model's CAGR of 1.18%. The maximum drawdown of our strategy is 13.92% compared to the 'buy and hold' strategy's drawdown of 30.11%. You can play with the indicator settings, or change the short-long rules or the stop loss-take profit levels, to refine the model further.

Once you understand Machine learning algorithms, these can be a great tool for formulating profit-making strategies. To learn more on Machine Learning you can watch our latest webinar, “Machine Learning in Trading”, which was hosted by QuantInsti, and conducted by our guest speaker Tad Slaff, CEO/Co-founder Inovance.

Machine learning is covered in the Executive Programme in Algorithmic Trading (EPAT) course conducted by QuantInsti. To know more about EPAT check the EPAT course page or feel free to contact our team at contact@quantinsti.com for queries on EPAT.

## Machine Learning and Its Application in Forex Markets [WORKING MODEL]

In the last post we covered Machine learning (ML) concept in brief. In this post we explain some more ML terms, and then frame rules for a forex strategy using the SVM algorithm in R.

To use ML in trading, we start with historical data (stock price/forex data) and add indicators to build a model in R/Python/Java. We then select the right Machine learning algorithm to make the predictions.

First, let’s look at some of the terms related to ML.

Machine Learning algorithms – There are many ML algorithms (list of algorithms) designed to learn and make predictions on the data. ML algorithms can be either used to predict a category (tackle classification problem) or to predict the direction and magnitude (tackle regression problem).

Examples:

• Predict the price of a stock in 3 months from now, on the basis of company’s past quarterly results.
• Predict whether Fed will hike its benchmark interest rate.

Indicators/Features – Indicators can include Technical indicators (EMA, BBANDS, MACD, etc.), Fundamental indicators, or/and Macroeconomic indicators.

Example 1 – RSI(14), Price – SMA(50), and CCI(30). We can use these three indicators to build our model and then use an appropriate ML algorithm to predict future values.

Example 2 – RSI(14), RSI(5), RSI(10), Price – SMA(50), Price – SMA(10), CCI(30), CCI(15), CCI(5)

In this example we have selected 8 indicators. Some of these indicators may be irrelevant for our model. In order to select the right subset of indicators we make use of feature selection techniques.

Feature selection – It is the process of selecting a subset of relevant features for use in the model. Feature selection techniques fall into three broad categories: filter methods, wrapper methods, and embedded methods. To select the right subset we typically make use of an ML algorithm in some combination. The selected features are known as predictors in machine learning.
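As a minimal sketch of a filter method, candidate indicators can be ranked by the absolute correlation of each feature with the target; the feature names and data below are made up for illustration:

```r
# Filter-method feature selection: rank features by |correlation| with target.
set.seed(1)
target <- rnorm(100)
features <- data.frame(
  rsi14    = target * 0.6 + rnorm(100),  # informative by construction
  cci30    = rnorm(100),                 # pure noise
  sma.diff = target * 0.3 + rnorm(100)   # weakly informative
)

scores <- sapply(features, function(f) abs(cor(f, target)))
sort(scores, decreasing = TRUE)          # keep the top-ranked features
```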

Support Vector Machine (SVM) – SVM is a well-known supervised Machine Learning algorithm, used to solve both classification and regression problems.

An SVM algorithm works on the given labeled data points and separates them via a boundary or a hyperplane. SVM tries to maximize the margin around the separating hyperplane. Support vectors are the data points that lie closest to the decision surface.

Framing rules for a forex strategy using SVM in R – Given our understanding of features and SVM, let us start with the code in R. We have selected the EUR/USD currency pair with a 1 hour time frame dating back to 2010. Indicators used here are MACD (12, 26, 9), and Parabolic SAR with default settings of (0.02, 0.2).

First, we load the necessary libraries in R, and then read the EUR/USD data. We then compute MACD and Parabolic SAR using their respective functions available in the “TTR” package. To compute the trend, we subtract the closing EUR/USD price from the SAR value for each data point. We lag the indicator values to avoid look-ahead bias. We also create an Up/down class based on the price change.
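TTR's MACD() function handles the indicator computation; purely for illustration, the EMA recursion behind the MACD histogram can be written out in base R (note the simplified first-value EMA seed, which differs from TTR's SMA-based seed):

```r
# Exponential moving average, seeded with the first observation (simplified).
ema <- function(x, n) {
  k <- 2 / (n + 1)
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in 2:length(x)) out[i] <- k * x[i] + (1 - k) * out[i - 1]
  out
}

# MACD histogram = (EMA(12) - EMA(26)) - EMA(9) of the MACD line.
macd.hist <- function(price, fast = 12, slow = 26, sig = 9) {
  macd.line <- ema(price, fast) - ema(price, slow)
  macd.line - ema(macd.line, sig)
}

# Sanity check: a flat price series has zero MACD and zero histogram.
macd.hist(rep(1.10, 50))[50]
```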

Thereafter we merge the indicators and the class into one data frame called model data. The model data is then divided into training, and test data.
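A minimal chronological split in base R — shuffling would leak future information into training for time series data. The synthetic indicator columns stand in for the real model data, and the 80/20 ratio is an assumption (the post does not state the split used):

```r
# Chronological train/test split on a synthetic model-data frame.
set.seed(10)
model.data <- data.frame(trend = rnorm(1000), macd = rnorm(1000),
                         class = sample(c("Up", "Down"), 1000, replace = TRUE))

n.train <- floor(0.8 * nrow(model.data))          # assumed 80/20 split
train <- model.data[1:n.train, ]
test  <- model.data[(n.train + 1):nrow(model.data), ]
c(train = nrow(train), test = nrow(test))
```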

We then use the svm function from the "e1071" package to train the model on the training data. We make predictions using the predict function and also plot the pattern. We get an accuracy of 53% here.
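The fit/predict step can be sketched with e1071 as below (requires the e1071 package; the well-separated synthetic data stands in for the real indicator frame, so the accuracy here is much higher than the 53% reported on EUR/USD):

```r
# Sketch of the svm()/predict() workflow from the e1071 package.
library(e1071)

set.seed(7)
x <- data.frame(trend = c(rnorm(50, -1), rnorm(50, 1)),   # synthetic features
                macd  = c(rnorm(50, -1), rnorm(50, 1)))
y <- factor(rep(c("Down", "Up"), each = 50))              # class labels

fit  <- svm(x, y, kernel = "radial")   # radial kernel (e1071 default)
pred <- predict(fit, x)
mean(pred == y)                        # in-sample accuracy
```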

From the plot we see two distinct areas, an upper larger area in red where the algorithm made short predictions, and the lower smaller area in blue where it went long.

SAR indicator trails price as the trend extends over time. SAR is below prices when prices are rising and above prices when prices are falling. SAR stops and reverses when the price trend reverses and breaks above or below it. We are interested in the crossover of Price and SAR, and hence are taking trend measure as the difference between price and SAR in the code. Similarly, we are using the MACD Histogram values, which is the difference between the MACD Line and Signal Line values.

Looking at the plot we frame our two rules and test these over the test data.
Short rule = (Price–SAR) > -0.0025 & (Price – SAR) < 0.0100 & MACD > -0.0010 & MACD < 0.0010
Long rule = (Price–SAR) > -0.0150 & (Price – SAR) < -0.0050 & MACD > -0.0005

We are getting 54% accuracy for our short trades and an accuracy of 50% for our long trades. The SVM algorithm seems to be doing a good job here. We stop at this point, and in our next post on Machine learning we will see how framed rules like the ones devised above can be coded and backtested to check the viability of a trading strategy.

To learn more on Machine Learning you can watch the latest webinar, “Machine Learning in Trading”, which was hosted by QuantInsti, and conducted by our guest speaker Tad Slaff, CEO/Co-founder Inovance.

Machine learning is covered in the Executive Programme in Algorithmic Trading (EPAT) course conducted by QuantInsti. To know more about EPAT check the EPAT course page or feel free to contact our team at contact@quantinsti.com for queries on EPAT.

## Index Tracking Using an ETF as a Lead Indicator [EXCEL MODEL]

Index tracking trading is a strategy where you observe the price over the previous 'n' candlesticks and place your bets accordingly. The intuition is that MSCI FUTURES follow the ETF. Hence, if the ETF is performing well, we assume the MSCI FUTURES will perform well too, and make buying and selling decisions accordingly.

#### Who can use it?

People interested in algorithmic trading and those who want to learn about ETF as a lead indicator.

#### How does it help?

• Build a strategy with ETF as a lead indicator
• Understand the trading logic of strategy implementation

As the trading logic is coded in the cells of the sheet, you can deepen your understanding by downloading and analyzing the files at your own convenience. Not just that, you can play around with the numbers to obtain better results. You might find suitable parameters that provide higher profits than those specified in the article.

#### Explanation of the model

In this example we consider the MSCI FUTURES data. We track an ETF and assume that MSCI has a strong positive beta with the ETF. We observe the 5-minute interval prices of the ETF and MSCI and buy/sell MSCI based on the ETF returns. If the ETF returns are positive we buy one lot of MSCI FUTURES; if the ETF returns are negative we sell one lot of MSCI FUTURES. The ETF we track is Indian SP Equity. In essence, we go long (buy) on MSCI FUTURES if the ETF is bullish and go short (sell) on MSCI FUTURES if the ETF is bearish.

The data used for the MSCI FUTURES contract is at 5-minute intervals from 2nd Feb 2015 to 4th March 2015.

#### Assumptions

1. Prices are available at 5-minute intervals and we trade at the 5-minute closing price only.
2. Since this is discrete data, squaring off of the position happens at the end of the candle, i.e. at the price available at the end of 5 minutes.
3. Only the regular session (T) is traded.
4. Transaction cost is $1.10 for MSCI FUTURES.
5. Margin for each trade is $1500.
6. Trading quantity is 1 lot (MSCI order size 50) and trading hours are 11:30 a.m. to 5:55 p.m. SGT.

#### Input parameters

Please note that all the values for the input parameters mentioned below are configurable.

1. Price at the end of 5 minute interval is considered.
2. We use ETF as a lead indicator

The market data and trading model are included in the spreadsheet from the 12th row onwards. So when a reference is made to column D, it should be understood that the reference commences from D12 onwards.

#### Explanation of the columns in the Excel Model

Column C represents the price for ETF.

Column D represents the price for MSCI FUTURES.

Column E represents log returns of ETF data.

Column F represents log returns of MSCI data.

Column G represents average returns of ETF data.

Column H represents average returns of MSCI data.

Column I calculates the trading signal. The formula =IF(G13="", "", IF(G13>H13, "Buy", IF(G13<H13, "Sell", I12))) means: if cell G13 is blank then keep I13 blank; otherwise if G13 (the average ETF returns) is greater than H13 (the average MSCI returns), a buy signal for the MSCI FUTURES contract is generated; if G13 is lower than H13, a sell signal for the MSCI FUTURES contract is generated; if the two are equal, the previous signal in I12 is carried forward.

Column J represents trade price. This is the price at which the trading signal is generated. The formula =IF(I13=””, “”, IF(I13=I12, J12, D13)) means if the entry in cell I13 is blank then the entry is blank. Otherwise if I13=I12, then the trade price is given by the entry in J12. If I13 is neither blank nor equal to I12 then the trade price is given by D13.

Column K represents the Mark To Market (MTM). The formula =IF(OR(I13="", I12=""), 0, IF(I13=I12, K12, IF(I13="Buy", K12+J12-J13, IF(I13="Sell", K12+J13-J12, 0)))) means: if cell I12 or I13 is blank, the MTM is zero; otherwise if I13=I12 the MTM is K12; else if I13="Buy" the MTM is given by K12+J12-J13, and if I13="Sell" the MTM is given by K12+J13-J12.

Column L represents the profit/loss status of the trade. The formula =IF(OR(I13="", I12=""), "", IF(K13<K12, "Loss", IF(K13>K12, "Profit", IF(I13<>I12, "NPNL", "")))) means: if either I13 or I12 is blank, the entry is also blank; otherwise if K13 is less than K12 the entry is "Loss", and if K13 is greater than K12 the entry is "Profit". The next part, I13<>I12, means that if I13 is not equal to I12 then it is No Profit No Loss (NPNL). This makes sense since we have already booked the profit from the previously squared-off position.

Column M calculates the trade profit/loss. The formula =IF(L22="", 0, K22-K21) means if L22 is blank then the trade profit/loss is zero; otherwise the profit/loss is calculated as the difference between consecutive MTM values (K22-K21).
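The spreadsheet columns above can be roughly translated into base R for readers who prefer code (a simplified bar-by-bar mark-to-market, not an exact cell-for-cell copy of the sheet; the prices are made up):

```r
# Toy replication of the spreadsheet's signal and MTM columns.
etf  <- c(100, 102, 101, 104, 103)   # column C: ETF prices
msci <- c(200, 201, 203, 202, 206)   # column D: MSCI FUTURES prices

ret.etf  <- c(NA, diff(log(etf)))    # column E: ETF log returns
ret.msci <- c(NA, diff(log(msci)))   # column F: MSCI log returns

# Column I (simplified): Buy when ETF returns exceed MSCI returns, else Sell.
signal <- ifelse(is.na(ret.etf), NA,
                 ifelse(ret.etf > ret.msci, "Buy", "Sell"))

# Column K (simplified): cumulative bar-by-bar mark-to-market.
mtm <- numeric(length(msci))
for (i in 2:length(msci)) {
  if (is.na(signal[i - 1])) {
    mtm[i] <- 0                                          # no prior signal yet
  } else if (signal[i] == "Buy") {
    mtm[i] <- mtm[i - 1] + (msci[i] - msci[i - 1])       # long gains on up-moves
  } else {
    mtm[i] <- mtm[i - 1] - (msci[i] - msci[i - 1])       # short gains on down-moves
  }
}
mtm
```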

#### Outputs

The output table has some performance metrics tabulated.

Number of profitable trades is 118 and the number of loss making trades is 77.

Total trades are 277 and the total profit is $1970. Average profit per trade is $7.24. Net profit per trade is $5.04; this is calculated as the average profit per trade minus twice the transaction cost. The number of trading intervals is 23. Monthly returns are calculated as the product of total trades and net profit per trade, divided by the product of the margin for each trade and the number of trading intervals.