## Sentiment Analysis on News Articles using Python

Learn how to perform sentiment analysis on news articles using the Python programming language.

In our previous post on sentiment analysis we briefly explained sentiment analysis within the context of trading and provided model code in R. The R model was applied to an earnings call transcript of an NSE-listed company, and its output was compared with the quarterly earnings numbers and with the one-month stock price movement after the earnings call date. QuantInsti also conducted a webinar on “Quantitative Trading Using Sentiment Analysis”, in which Rajib Ranjan Borah, Director & Co-founder, iRageCapital and QuantInsti, covered important aspects of the topic in detail; it is a must-watch for all enthusiasts wanting to learn and apply quantitative trading strategies using sentiment analysis.

Taking these initiatives on sentiment analysis forward, in this blog post we attempt to build a Python model to perform sentiment analysis on news articles that are published on a financial markets portal. We will build a basic model to extract the polarity (positive or negative) of the news articles.

In Rajib’s webinar, one of the slides details the sensitivity of different sectors to company and sectoral news. In the slide, the Pharma sector ranks at the top as the most sensitive sector, and in this blog we will apply our sentiment analysis model to specific news articles pertaining to select Indian Pharma companies. We will determine the polarity, and then check how the market reacted to this news. For our sample model, we have taken ten Indian Pharma companies that make up the NIFTY Pharma index.

### Building the Model

Now, let us dive straight in and build our model. We use the following Python libraries to build the model:

• Requests
• Beautiful Soup
• Pattern

#### Step 1: Create a list of the news section URL of the component companies

We identify the component companies of the NIFTY Pharma index, and create a dictionary in Python which contains the company names as the keys, while the dictionary values comprise the respective company abbreviations used by the financial portal site to form the news section URL. Using this dictionary we create a Python list of the news section URLs for all the component companies.

import csv
import time
import requests
from bs4 import BeautifulSoup
from pattern.en import ngrams

Base_url = "http://www.moneycontrol.com"

# Build a dictionary of companies and their abbreviated names
companies = {'glenmarkpharma':'GP08','glaxosmithklinepharmaceuticals':'GSK',
             'sunpharmaceuticalindustries':'SPI','lupinlaboratories':'LL',
             'cipla':'C','aurobindopharma':'AP',
             'drreddyslaboratories':'DRL','divislaboratories':'DL03'}

# Create a list of the news section urls of the respective companies
url_list = ['http://www.moneycontrol.com/company-article/{}/news/{}#{}'.format(k,v,v) for k,v in companies.items()]
print(url_list)


#### Step 2: Extract the relevant news articles web-links from the company’s news section page

Using the Python list of the news section URLs, we run a Python for loop which sends a request to the portal for every URL in our list. We use the requests.get function from the Python requests library (a simple HTTP library). The requests module allows you to send HTTP/1.1 requests. One can add headers, form data, multipart files, and parameters with simple Python dictionaries, and also access the response data in the same way.

The text of the response object is then applied to create a Beautiful Soup object. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a given parser to provide for ways of navigating, searching, and modifying the parse tree.

HTML parsing basically means taking in the HTML code and extracting relevant information like the title of the page, paragraphs in the page, headings, links, bold text etc.
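
As a small illustration of what parsing extracts, the sketch below pulls the link targets and titles out of an HTML fragment. It uses the `html.parser` module from Python's standard library rather than Beautiful Soup, purely so that it runs without extra installs. The HTML fragment and the `LinkCollector` name are made up for the example; only the CSS class name is taken from this post.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a news section page
SAMPLE_HTML = """
<div>
  <a class="arial11_summ" href="/news/a1.html" title="Lupin gets USFDA nod">Lupin</a>
  <a class="other" href="/about.html" title="About us">About</a>
  <a class="arial11_summ" href="/news/a2.html" title="Cipla Q2 results">Cipla</a>
</div>
"""

class LinkCollector(HTMLParser):
    """Collect (href, title) pairs of <a> tags carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.css_class in classes:
            self.links.append((attr.get("href"), attr.get("title")))

collector = LinkCollector("arial11_summ")
collector.feed(SAMPLE_HTML)
for href, title in collector.links:
    print(href, "|", title)
```

Beautiful Soup's `find_all` does the same kind of filtering in a single call, which is why the model uses it.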

The news section webpage on the financial portal site contains 20 news articles per page. We target only the first page of the news section, and our objective is to extract the links for all the news articles that appear on the first page using the parsed HTML. We inspect the HTML, and use the find_all method in the code to search for a tag that has the CSS class name as “arial11_summ”. This enables us to extract all the 20 web-links.

Fortunes of the R&D intensive Indian Pharma sector are driven by sales in the US market and by approvals/rejections of new drugs by the US Food and Drug Administration (USFDA). Hence, we will select only those news articles pertaining to the USFDA and the US market. Using keywords like “US”, “USA”, and “USFDA” in an if statement nested within the Python for loop, we get our final list of all the relevant news articles.
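
One thing to watch with a plain substring test such as `"US" in title` is that it also matches words that merely contain those letters (for example "PLUS" or "MUST"). A minimal sketch of a whole-word check using regular expressions follows. The helper name `is_us_related` and the sample titles are illustrative and not part of the original model.

```python
import re

# Whole-word, case-sensitive match for any of the keywords used in this post
KEYWORD_PATTERN = re.compile(r"\b(?:US|USA|USFDA)\b")

def is_us_related(title):
    """Return True if the title mentions US, USA or USFDA as a separate word."""
    return bool(KEYWORD_PATTERN.search(title))

# Made-up titles for illustration
sample_titles = [
    "Lupin gets USFDA nod for generic drug",
    "Sun Pharma recalls product in US market",
    "Cipla PLUS programme launched in India",
]
selected = [t for t in sample_titles if is_us_related(t)]
print(selected)
```

The third title is dropped because "US" only appears inside "PLUS", whereas a substring test would have kept it.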

# Create an empty list which will contain the selected news articles
List_of_links = []

# Extract the relevant news articles weblinks from the news section of selected companies
for urls in url_list:
    html = requests.get(urls)
    soup = BeautifulSoup(html.text,'html.parser') # Create a BeautifulSoup object

    # Retrieve a list of all the links and the titles for the respective links
    word1,word2,word3 = "US","USA","USFDA"

    sub_links = soup.find_all('a', class_='arial11_summ')
    for links in sub_links:
        sp = BeautifulSoup(str(links),'html.parser')  # first convert into a string
        tag = sp.a
        if word1 in tag['title'] or word2 in tag['title'] or word3 in tag['title']:
            # Assumption: the href is relative, so prefix it with Base_url
            List_of_links.append(Base_url + tag['href'])
            time.sleep(3)

# Print the select list of news articles weblinks
#for p in List_of_links: print(p)

#### Step 3: Remove the duplicate news articles based on news title

It may happen that the financial portal publishes important news articles pertaining to the overall pharma sector on every pharma company’s news section webpage. Hence, it becomes necessary to weed out the duplicate news articles that appear in our Python list before we run our sentiment analysis model. We call the set function on our Python list which we generated in Step 2 to give us a list with no duplicate news articles.
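
A small sketch of the de-duplication step: `set` removes repeated entries but does not keep the original order, so an order-preserving alternative using `dict.fromkeys` is shown alongside it. The sample links are made up.

```python
# Made-up links; the third repeats the first
links = [
    "http://example.com/news/a1.html",
    "http://example.com/news/a2.html",
    "http://example.com/news/a1.html",  # same article listed on a second company's page
]

# set() drops duplicates, but the resulting order is arbitrary
unique_unordered = list(set(links))

# dict.fromkeys() drops duplicates and keeps first-seen order (Python 3.7+)
unique_ordered = list(dict.fromkeys(links))

print(len(unique_unordered))
print(unique_ordered)
```

Either form works for the model, since the order of the articles does not affect the polarity of each one.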

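# Remove the duplicate news articles based on News Title
unique_links = list(set(List_of_links))
for q in unique_links: print(q)

# Create a dictionary of positive/negative words related to the Pharma Sector
# Assumption: the words are stored in a two-column CSV file (word, polarity);
# 'pharma_words.csv' is a placeholder file name
reader = csv.reader(open('pharma_words.csv', 'r'))
pharma_dict = dict((rows[0],rows[1]) for rows in reader)

# Creating an empty list which will be filled later with news article links, and Polarity values (pos/neg)
df = []
print(df)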


#### Step 4: Extract the main text from the selected news articles

In this step we run a Python for loop and, for every news article URL, we call requests.get() on the URL and then convert the text of the response object into a Beautiful Soup object. Finally, we extract the main text using the find and get_text methods from the Beautiful Soup module.

# Open the chosen news articles and extract the main text
for results_url in unique_links:
    #print(results_url)

    results = requests.get(results_url)
    results_text = BeautifulSoup(results.text, 'html.parser')
    extract_text = results_text.find(class_='arti_cont')
    final_text = extract_text.get_text()

#### Step 5: Pre-processing the extracted text

We will use the n-grams function from the Pattern module to pre-process our extracted text. The ngrams() function returns a list of n-grams (i.e., tuples of n successive words) from the given string. Since we are building a simple model, we use a value of one for the n argument in the n-grams function. The Pattern module contains other useful functions for pre-processing like parse, tokenize, tag etc. which can be explored to conduct an in-depth analysis.

 # Pre-processing the extracted text using ngrams function from the pattern package
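
To make the idea concrete, here is a rough standard-library sketch of what unigram extraction and a dictionary lookup could look like. It does not use Pattern: `simple_ngrams` is a simplified stand-in for `ngrams()` that lowercases the text and splits on non-letters. The word list and the scoring rule (count of positive words minus count of negative words) are assumptions made for illustration, not the model's actual dictionary or scoring.

```python
import re

def simple_ngrams(text, n=1):
    """Return tuples of n successive lowercase words from text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical polarity dictionary (word -> 'pos' or 'neg')
pharma_words = {"approval": "pos", "nod": "pos", "recall": "neg", "warning": "neg"}

def polarity(text, word_polarity):
    """Label text 'pos', 'neg' or 'neutral' by counting dictionary hits."""
    score = 0
    for (word,) in simple_ngrams(text, n=1):
        label = word_polarity.get(word)
        if label == "pos":
            score += 1
        elif label == "neg":
            score -= 1
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"

print(simple_ngrams("USFDA issues warning letter", n=2))
print(polarity("Company receives USFDA approval for new drug", pharma_words))
print(polarity("USFDA issues warning letter after plant inspection", pharma_words))
```

With n set to one, each tuple holds a single word, which is what makes the direct dictionary lookup possible.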

#### Now it is your turn!

• Modify the parameters and study the backtesting results
• Run the model for other historical prices
• Modify the formula and strategy to add new parameters and indicators! Play with logic! Explore and study!
• Comment below with your results and suggestions

## Momentum Based Strategies for Low and High Frequency Trading – [EXCEL MODEL]

On 3rd December 2015, QuantInsti held a comprehensive webinar session on Momentum Trading Strategies, where Mr. Nitesh Khandelwal, Co-founder, iRage Capital, discussed momentum trading in low and high frequency trading.

This webinar focused on the various aspects of Momentum Trading Strategies for both Conventional/Low Frequency and High Frequency Trading (HFT). It also took a deeper look at some popular momentum-based strategies and at select niche momentum trading strategies. The webinar aimed to evaluate how HFT momentum strategies differ from conventional momentum strategies from both a logic and a deployment perspective.

• Sample Model

#### Now it is your turn!

• Modify the parameters and study the backtesting results
• Run the model for other historical prices
• Modify the formula and strategy to add new parameters and indicators! Play with logic! Explore and study!
• Comment below with your results and suggestions

## Candlestick Trading – A Momentum Strategy with Example [EXCEL MODEL]

Candlestick trading is a momentum strategy where you observe the price over the previous ‘n’ candlesticks and make your bets accordingly. The intuition is that if the price has been increasing continuously for, say, 3 candlesticks, then it is highly probable that it will rise further.

### How it helps?

• Learn how momentum strategy is implemented
• Understand the trading logic of strategy implementation

In this example we consider the INR FUTURES data on SGX (Singapore Exchange) and implement a momentum strategy on this contract. The strategy is based on the premise that a rising market will be followed by a rising market and a falling market will be followed by a falling market. We hope to ride the tide and make some profit before the momentum vanishes. The data used for the INR FUTURES contract is sampled at 5-minute intervals from 3rd Feb 2015 to 19th March 2015. The trading strategy is intuitive and, unlike mean reversion for pairs trading, there is no hypothesis as such explaining why this strategy would or would not work. We would like to benefit from the market wave and optimize our bet by specifying stop loss and take profit limits. This model is flexible and can be varied to achieve different limits to exit the trade depending upon the trader’s risk appetite.

#### Assumptions

2. Prices are available at 5 minutes interval and we trade at the 5 minute closing price only.
3. Since this is discrete data, squaring off of the position happens at the end of the candle i.e. at the price available at the end of 5 minutes.
4. Only the regular session (T) is traded
5. Transaction cost is $0.35 for INR FUTURE.
6. Margin for each trade is $800.
7. Trading quantity is 1 lot and trading hours are 7:40 a.m. to 2:00 a.m. SGT.

#### Input parameters

Please note that all the values for the input parameters mentioned below are configurable.

• High/Low of 3 candles (one candle=every 5 minute price) is considered.
• A stop loss of 0.08 and profit limit of 0.16 is set.
• The order size for trading INR FUTURE is INR 2000000.

The market data and trading model are included in the spreadsheet from the 12th row onwards. So when a reference is made to column D, the reference commences from D12 onwards.

Column C represents the price for INR FUTURE.

Column D represents 3 candle high meaning the highest price of the previous 3 candles.

Column E represents 3 candle low meaning the lowest price of the previous 3 candles.

Column F calculates the trading signal. The formula =IF(D13="", "", IF(C13>D13, "Buy", IF(C13<E13, "Sell", ""))) means that if the entry in cell D13 is blank then F13 is kept blank; otherwise, if C13 (INR FUTURES price) is greater than D13 (3 candle high) then a buy signal for the INR FUTURES contract is generated, else if C13 is lower than E13 (3 candle low) then a sell signal for the INR FUTURES contract is generated.
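
The logic of columns D, E and F can be sketched in Python as follows. This is an illustrative translation rather than the spreadsheet itself: the function name and the sample prices are made up, and the first three candles get a blank signal because no 3 candle high/low exists for them yet.

```python
def three_candle_signals(prices, lookback=3):
    """Return a list of 'Buy', 'Sell' or '' (blank), one entry per price."""
    signals = []
    for i, price in enumerate(prices):
        if i < lookback:
            signals.append("")  # not enough history yet (blank high/low)
            continue
        window = prices[i - lookback:i]  # previous candles, current one excluded
        high, low = max(window), min(window)
        if price > high:
            signals.append("Buy")
        elif price < low:
            signals.append("Sell")
        else:
            signals.append("")
    return signals

# Made-up 5-minute closing prices
prices = [62.10, 62.15, 62.12, 62.20, 62.18, 62.05]
print(three_candle_signals(prices))
```

The `lookback` argument plays the role of the configurable "3 candles" input parameter.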

Column G represents the entry price. This is the price at which the trading signal is generated. The formula =IF(H13=H12, G12, IF(OR(H13="Buy", H13="Sell"), C13, "")) means that if the entry in cell H13 is the same as H12 then the value in G13 should be the value in G12; otherwise, if H13 is either “Buy” or “Sell” then the entry in G13 is the value in C13 (INR FUTURES price), and if H13 is neither “Buy” nor “Sell” it is left blank.

Column H represents the status of the trade. Given our assumptions and input parameters there are four statuses that can occur: “Buy”, “Sell”, “TP (Take Profit)” and “SL (Stop Loss)”.

The formula =IF(OR(H17="", H17="TP", H17="SL"), F18, IF(H17="Buy", IF(C18<G17+$C$4, "SL", IF(C18>G17+$C$5, "TP", H17)), IF(H17="Sell", IF(C18>G17-$C$4, "SL", IF(C18<G17-$C$5, "TP", H17)), ""))) can be simplified as follows:

If the entry in H17 is either blank or TP or SL then choose the value in F18 (F column has either Buy or Sell or blank values). Otherwise look into the next If condition.

If the entry in H17 is “Buy”, meaning we have a buy position, and if the price of the contract goes below the stop loss limit then we exit the position at stop loss and if the price of the contract goes above the take profit limit then we exit the position at take profit. Similarly, if the position is “Sell” and the contract price rises above the selling price beyond the stop loss limit then exit the position at stop loss and if the contract price falls below the selling price beyond the take profit limit then exit the position by taking the profit.
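
The status logic described above can be written as a small state-transition function. This is a sketch, not the spreadsheet formula: the function name is made up, and both limits are treated as positive distances from the entry price (0.08 and 0.16), which is an assumption about how the sheet's cells are signed.

```python
def next_status(prev_status, prev_entry, price, signal,
                stop_loss=0.08, take_profit=0.16):
    """Return the trade status for the current candle."""
    if prev_status in ("", "TP", "SL"):
        return signal  # no open position: follow the signal column
    if prev_status == "Buy":
        if price < prev_entry - stop_loss:
            return "SL"
        if price > prev_entry + take_profit:
            return "TP"
        return "Buy"  # keep holding the long position
    if prev_status == "Sell":
        if price > prev_entry + stop_loss:
            return "SL"
        if price < prev_entry - take_profit:
            return "TP"
        return "Sell"  # keep holding the short position
    return ""

# Made-up prices around an entry at 62.20
print(next_status("Buy", 62.20, 62.10, ""))   # long position, price fell 0.10
print(next_status("Sell", 62.20, 62.00, ""))  # short position, price fell 0.20
```

Note that while a position is open, new signals are ignored until the trade exits at TP or SL.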

Column I represents the profit/loss status of the trade. P/L is calculated only when we have squared off our position. The formula =IF(OR(H13="SL", H13="TP"), IF(H12="Buy", C13-G12, IF(H12="Sell", G12-C13, 0)), 0) can be summarized as follows:

The first if condition proceeds to the next if condition only if the corresponding status in column H is either “SL” or “TP”; otherwise the entry in the cell is zero.

The next set of if conditions calculates the profit, given that either the stop loss or the take profit has been hit. If the previous status in column H (H12) is “Buy”, then the profit/loss is calculated as C13-G12. Remember that column G has the price at which you traded (in this case “Buy”) and column C has the market data for the INR FUTURES contract. Hence the profit/loss is simply the price at which you sold minus the price at which you bought. If the previous status in column H is “Sell”, then the profit/loss is calculated as G12-C13, meaning the difference between the price at which you sold (shorted) and the price at which you bought later, thus squaring off the position.
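
A short Python sketch of the same P/L rule, with a running total in the spirit of column J. The function name and the sample rows are made up for illustration.

```python
def trade_pnl(status, prev_status, price, prev_entry):
    """P/L booked on the candle where a position is squared off, else 0."""
    if status not in ("SL", "TP"):
        return 0.0
    if prev_status == "Buy":
        return price - prev_entry  # sold at price, bought at prev_entry
    if prev_status == "Sell":
        return prev_entry - price  # shorted at prev_entry, bought back at price
    return 0.0

# (status, previous status, current price, previous entry price); made-up rows
rows = [
    ("Buy", "", 62.20, None),
    ("TP", "Buy", 62.40, 62.20),
    ("Sell", "TP", 62.30, None),
    ("SL", "Sell", 62.45, 62.30),
]
cumulative = 0.0
for status, prev_status, price, prev_entry in rows:
    cumulative += trade_pnl(status, prev_status, price, prev_entry)
print(round(cumulative, 2))
```

Here the long trade gains 0.20 and the short trade loses 0.15, so only the two exit rows contribute to the total.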

Column J calculates the cumulative profit.

#### Outputs

The output table tabulates some performance metrics. The loss from all loss-making trades is $1588 and the profit from trades that hit TP is $1988, so the total P/L is $1988-$1588=$400. Loss trades are the trades that lost money on the trading positions. Profitable trades are the trades that ended with a gain. Average profit is the ratio of total profit to the total number of trades. Net average profit is calculated after subtracting the transaction costs, which amounts to $3.03.

## How to Use Black Scholes Option Pricing Model [EXCEL MODEL]

In this post, we will discuss modelling option pricing using the Black Scholes Option Pricing model and plotting the same for a combination of various options. You can put any number of call and/or put options in the model and use the built-in macro (named ‘BS’) to calculate the BS model based option price for each option. The macro named ‘PayOff’ is used for plotting the Profit/Loss of the overall combination of option positions against the spot price.

Sheet1, named Payoff, has a table where we specify all option parameters. Column B specifies the expiry date of the options. Column C specifies the option type. Column D has the strike price. Column E shows the premium amount in INR at which the option is bought. Column F shows the number of option contracts we have bought. Column G specifies the volatility, and column H specifies the Black Scholes price of the option (calculated by the macro “BS”). Column I is the current spot price of the underlying asset, and column J shows the time to expiry of the option (calculated using a formula). Column K specifies the Expected PnL of the option (calculated using a formula). It is calculated as the difference between the Black Scholes price and the premium paid, multiplied by the number of option contracts. Column L shows the actual premium in the market currently, meaning the current premium should you wish to buy the option.

The 13th row calculates the total investment. Since we have bought two call options and two put options at premiums of 120 and 152 respectively, our total investment is 120*2+152*2=544. The 14th row shows the Expected present value. Since the market has moved after the options were bought, the current expected price of each option multiplied by the number of option contracts gives the expected value. Hence the expected payoff is 170.18*2+124.59*2≈589.5475 (the option prices shown here are rounded).

The present value in row 15 is calculated similarly by taking the product of the actual premium in the market currently and the number of option contracts. Hence the present value is 150*2+120*2=540.

The graph below shows the plot of expected payoff for the option portfolio. This is done by taking the expected payoff values from sheet4. More on this later.

The BS Price sheet shows the pricing of an option using the Black Scholes model. From the Black-Scholes option pricing model, we know the price of a call option on a non-dividend-paying stock can be written as:

$$C_t = S_t N(d_1) - Xe^{-r\tau} N(d_2)$$

and the price of a put option on a non-dividend-paying stock can be written as:

$$P_t = Xe^{-r\tau} N(-d_2) - S_t N(-d_1)$$

where

$$d_1 = \frac{\ln\left(\frac{S_t}{X}\right) + \left(r + \frac{\sigma_s^2}{2}\right)\tau}{\sigma_s\sqrt{\tau}}$$

$$d_2 = \frac{\ln\left(\frac{S_t}{X}\right) + \left(r - \frac{\sigma_s^2}{2}\right)\tau}{\sigma_s\sqrt{\tau}} = d_1 - \sigma_s\sqrt{\tau}$$

$$\tau = T - t$$

and $$N(\cdot)$$ is the cumulative distribution function of the standard normal distribution. The symbols are:

• $$S_t$$: current price of the underlying
• $$X$$: strike price
• $$r$$: risk free interest rate
• $$\sigma_s$$: volatility of the underlying
• $$\tau$$: time to expiry
• $$\ln$$: natural log

The call and put values under the Black Scholes framework are calculated in the 13th and 14th rows for the parameters specified in rows 1 to 5.
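
For readers who prefer code to a spreadsheet, the Black Scholes pricing of a European call and put on a non-dividend-paying stock can be sketched in Python as below. This is not the ‘BS’ macro from the workbook. The function names and the sample inputs (spot 100, strike 100, rate 5%, volatility 20%, one year to expiry) are chosen only for illustration.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes(spot, strike, rate, sigma, tau):
    """Return (call, put) prices; tau is the time to expiry in years."""
    if tau <= 0 or sigma <= 0:
        raise ValueError("tau and sigma must be positive")
    d1 = (log(spot / strike) + (rate + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    discounted_strike = strike * exp(-rate * tau)
    call = spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    put = discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put

# Illustrative inputs, not taken from the workbook
call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05, sigma=0.2, tau=1.0)
print(round(call, 4), round(put, 4))
```

The explicit check on `tau` mirrors the behaviour of the sheet, which cannot return a price once the expiry date has passed.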

The “Back-end BS” sheet has the same set of values as the Payoff sheet in columns A to G. From column H onwards, the 2nd row shows the range of spot prices. You can change the starting point of the spot price range in cell H2. The increment (presently 10 points) can be changed in cell I2, which is then dragged horizontally across the range. The 3rd row shows the Black Scholes call option price for the specified parameters and varying spot price. The 4th row shows the Black Scholes put option price for the specified parameters and varying spot price. Please note that although the post shows the calculation for three options, you can go up to a combination of 10 options by just filling in the appropriate values in the table in Sheet1. For more than 10 options, you can edit the sheet and the macro.

The 13th row calculates the total payoff from the option position. This is calculated as the difference between the profits from options and the total investment.

In this case the profit from overall option position is the sum of H3 and H4. The total investment (calculated in Payoff Sheet 13th row) of 544 has to be subtracted from the sum of H3 and H4 to obtain the final payoff. Similar calculations are done to all other columns henceforth.
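
The payoff calculation across a range of spot prices can be sketched as follows. The premiums (120 and 152) and quantities (2 each) follow the example in this post. Every other number (strike, interest rate, volatility, time to expiry, and the spot price range) is a placeholder chosen for illustration, and the function names are made up.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_price(kind, spot, strike, rate, sigma, tau):
    """Black Scholes price of a European 'call' or 'put'."""
    d1 = (log(spot / strike) + (rate + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    if kind == "call":
        return spot * norm_cdf(d1) - strike * exp(-rate * tau) * norm_cdf(d2)
    return strike * exp(-rate * tau) * norm_cdf(-d2) - spot * norm_cdf(-d1)

# Premiums and quantities as in the post; strikes are placeholders
positions = [
    {"kind": "call", "strike": 8000.0, "premium": 120.0, "qty": 2},
    {"kind": "put", "strike": 8000.0, "premium": 152.0, "qty": 2},
]
rate, sigma, tau = 0.08, 0.15, 30 / 365.0  # placeholder market parameters

total_investment = sum(p["premium"] * p["qty"] for p in positions)

def total_payoff(spot):
    """Black Scholes value of all positions at this spot, minus the investment."""
    value = sum(p["qty"] * bs_price(p["kind"], spot, p["strike"], rate, sigma, tau)
                for p in positions)
    return value - total_investment

# Spot range: a starting point and a fixed increment of 10 points
start, step, points = 7500.0, 10.0, 101
spots = [start + i * step for i in range(points)]
payoffs = [total_payoff(s) for s in spots]
print(total_investment)
print(round(min(payoffs), 2), round(max(payoffs), 2))
```

Plotting `payoffs` against `spots` gives the same kind of curve as the Expected Payoff graph.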

The Expected Payoff graph in Sheet1 is the plot of total payoff calculated in Sheet3 against the underlying spot price.

There are two macros. The one in the BS Price sheet calculates the Black Scholes option price depending upon the values entered in the Payoff sheet. The other one, in the Payoff sheet, plots the Expected Payoff graph. Please make sure that the Expiry Date in the Payoff sheet is set later than the current date; otherwise the time to expiry is negative and the Black Scholes price will not return a numerical value.