Sentiment Analysis on News Articles using Python


Learn how to perform sentiment analysis on news articles using the Python programming language

by Milind Paradkar

In our previous post on sentiment analysis we briefly explained sentiment analysis within the context of trading and provided model code in R. The R model was applied to an earnings call transcript of an NSE-listed company, and its output was compared with the quarterly earnings numbers and with a chart of the one-month stock price movement after the earnings call date. QuantInsti also conducted a webinar on “Quantitative Trading Using Sentiment Analysis”, in which Rajib Ranjan Borah, Director & Co-founder, iRageCapital and QuantInsti, covered important aspects of the topic in detail. It is a must-watch for all enthusiasts wanting to learn and apply quantitative trading strategies using sentiment analysis.

Taking these initiatives on sentiment analysis forward, in this blog post we attempt to build a Python model to perform sentiment analysis on news articles that are published on a financial markets portal. We will build a basic model to extract the polarity (positive or negative) of the news articles.

In Rajib’s webinar, one of the slides details the sensitivity of different sectors to company and sectoral news. In that slide, the Pharma sector ranks at the top as the most sensitive sector, so in this blog we apply our sentiment analysis model to specific news articles pertaining to select Indian Pharma companies. We will determine the polarity and then check how the market reacted to this news. For our sample model, we have taken ten Indian Pharma companies that make up the NIFTY Pharma index.

Building the Model

Now, let us dive straight in and build our model. We use the following Python libraries to build the model:

  • Requests
  • Beautiful Soup
  • Pattern

Step 1: Create a list of the news section URLs of the component companies

We identify the component companies of the NIFTY Pharma index and create a Python dictionary whose keys are the company names and whose values are the abbreviations the financial portal uses to form each company's news section URL. Using this dictionary, we create a Python list of the news section URLs for all the component companies.
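As a minimal sketch of this step: the post does not give the portal's URL pattern or the company abbreviations, so the base URL, company names, and codes below are hypothetical placeholders that only illustrate the dictionary-to-list structure.

```python
# Hypothetical placeholders: the real portal URL pattern and company
# abbreviations are not given in the post.
BASE_URL = "https://portal.example.com/company/{code}/news"

# Company name -> abbreviation the portal uses in its news section URL.
company_codes = {
    "Company A": "CMPA",
    "Company B": "CMPB",
}

# One news section URL per component company.
news_section_urls = [
    BASE_URL.format(code=code) for code in company_codes.values()
]
```

Keeping the abbreviations in a dictionary means a company can be added or removed in one place without touching the URL-building code.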

Step 2: Extract the relevant news article web links from the company’s news section page

Using the Python list of the news section URLs, we run a Python for loop that sends a request to the portal for every URL in the list. We use the requests.get function from the Python requests library (a simple HTTP library). The requests module allows you to send HTTP/1.1 requests. One can add headers, form data, multipart files, and parameters with simple Python dictionaries, and access the response data in the same way.

The text of the response object is then used to create a Beautiful Soup object. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a given parser to provide ways of navigating, searching, and modifying the parse tree.

HTML parsing basically means taking in the HTML code and extracting relevant information like the title of the page, paragraphs in the page, headings, links, bold text etc.

The news section webpage on the financial portal site contains 20 news articles per page. We target only the first page of the news section, and our objective is to extract the links for all the news articles that appear on it using the parsed HTML. We inspect the HTML and use the find_all method to search for tags with the CSS class name “arial11_summ”. This enables us to extract all 20 web links.
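The link extraction can be sketched as follows. Only the class name “arial11_summ” comes from the post. The sample markup, the assumption that the class sits on the anchor tags themselves, and the helper name extract_article_links are illustrative. In a live run the HTML would be the text of the response returned by requests.get.

```python
from bs4 import BeautifulSoup

# Stand-in for the text of a response, e.g. requests.get(url).text.
# The markup is invented; only the class name comes from the post.
sample_html = """
<div>
  <a class="arial11_summ" href="/news/article-1">Drug maker gets USFDA nod</a>
  <a class="arial11_summ" href="/news/article-2">Quarterly results announced</a>
  <a class="other" href="/about">About us</a>
</div>
"""


def extract_article_links(html):
    """Return (title, href) pairs for anchors with class "arial11_summ"."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a", class_="arial11_summ"):
        href = tag.get("href")
        if href:  # skip anchors that carry no link
            links.append((tag.get_text(strip=True), href))
    return links


article_links = extract_article_links(sample_html)
```

If the class is attached to a wrapper element rather than the anchor, the same find_all call can target that element and the link can be read from the anchor inside it.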

The fortunes of the R&D-intensive Indian Pharma sector are driven by sales in the US market and by approvals/rejections of new drugs by the US Food and Drug Administration (USFDA). Hence, we select only those news articles pertaining to the USFDA and the US market. Using keywords like “US”, “USA”, and “USFDA” in an if statement nested within the Python for loop, we get our final list of all the relevant news articles.
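One way to express the keyword check is shown below. The sample titles are invented, and matching whole words case-sensitively is a choice made for this sketch (so that “Focus” does not count as “US”); the original code may simply test for substrings.

```python
import re

KEYWORDS = {"US", "USA", "USFDA"}


def mentions_us_market(title):
    """True if the title contains one of the keywords as a whole word."""
    words = set(re.findall(r"[A-Za-z]+", title))
    return not KEYWORDS.isdisjoint(words)


# Invented titles, used only to exercise the filter.
titles = [
    "Drug maker gets USFDA nod for generic",
    "Focus shifts to domestic sales",
    "US sales lift quarterly revenue",
]

relevant_titles = [title for title in titles if mentions_us_market(title)]
```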

Step 3: Remove the duplicate news articles based on news title

The financial portal may publish important news articles pertaining to the overall pharma sector on every pharma company’s news section webpage. Hence, it becomes necessary to weed out the duplicate news articles that appear in our Python list before we run our sentiment analysis model. We call the set function on the Python list generated in Step 2, which removes the duplicate news articles.
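A small sketch of de-duplicating by title, using invented titles and paths. The set call is the approach the post describes. The dictionary-based variant is an addition for cases where the original order, or the URL of the first occurrence, needs to be kept.

```python
# Invented (title, url) pairs; the same sector-wide story appears twice.
articles = [
    ("Sector-wide USFDA update", "/news/company-a/101"),
    ("Plant gets USFDA approval", "/news/company-a/102"),
    ("Sector-wide USFDA update", "/news/company-b/201"),
]

# A set of titles removes repeats but does not keep the original order.
unique_titles = {title for title, _ in articles}

# Order-preserving alternative: keep the first article seen for each title.
first_by_title = {}
for title, url in articles:
    first_by_title.setdefault(title, url)
unique_articles = list(first_by_title.items())
```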

Step 4: Extract the main text from the selected news articles

In this step we run a Python for loop, and for every news article URL we call requests.get() on the URL and then convert the text of the response object into a Beautiful Soup object. Finally, we extract the main text using the find and get_text methods from the Beautiful Soup module.
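The find and get_text calls might look like the sketch below. The post does not say which tag or class holds the article body, so the div with class “article-body” and the sample page are hypothetical and would need to be replaced after inspecting the portal's HTML.

```python
from bs4 import BeautifulSoup

# Invented article page; in a live run this would be requests.get(url).text.
sample_article = """
<html><body>
  <div class="nav">Home | Markets | News</div>
  <div class="article-body">
    <p>The company received approval for its generic drug.</p>
    <p>Shares rose in early trade.</p>
  </div>
</body></html>
"""


def extract_main_text(html, css_class="article-body"):
    """Return the text inside the first div with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=css_class)
    if container is None:  # page layout differs from what we expected
        return ""
    return container.get_text(separator=" ", strip=True)


main_text = extract_main_text(sample_article)
```

Returning an empty string when the container is missing lets the loop continue past pages with an unexpected layout instead of raising an error.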

Step 5: Pre-processing the extracted text

We use the ngrams() function from the Pattern module to pre-process the extracted text. The ngrams() function returns a list of n-grams (i.e., tuples of n successive words) from the given string. Since we are building a simple model, we use a value of one for the n argument. The Pattern module contains other useful pre-processing functions, such as parse, tokenize, and tag, which can be explored to conduct an in-depth analysis.
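To show what the n-gram output looks like without depending on Pattern, here is a plain-Python stand-in. It is not Pattern's implementation: the function name simple_ngrams, the lower-casing, and the tokenising regular expression are choices made for this sketch.

```python
import re


def simple_ngrams(text, n=1):
    """Return tuples of n successive lower-cased words from text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


unigrams = simple_ngrams("USFDA grants approval.", n=1)
bigrams = simple_ngrams("USFDA grants approval.", n=2)
```

With n set to one, each tuple holds a single word, which is the form the dictionary lookup in the next step needs.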

Step 6: Compute the Sentiment analysis score using a simple dictionary approach

To compute the overall polarity of a news article we use the dictionary method. In this approach, a list of positive and negative words helps determine the polarity of a given text. The dictionary is created using words that are specific to the Pharma sector. The code checks the processed text from the news article for words that match the positive or negative entries in the dictionary.
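A minimal sketch of dictionary-based scoring follows. The words are taken from the examples the author gives in the comments, but which list each word belongs to is an assumption here, as is scoring by simple match counts. The sample sentence is invented.

```python
# Assumed polarity assignment; the post does not state which words are
# positive and which are negative.
POSITIVE_WORDS = {"approval"}
NEGATIVE_WORDS = {
    "recall", "sanctions", "violations", "adulterated", "contaminated",
}


def score_words(words):
    """Count how many words match the positive and negative lists."""
    positive = sum(1 for word in words if word in POSITIVE_WORDS)
    negative = sum(1 for word in words if word in NEGATIVE_WORDS)
    return positive, negative


# Invented text, already lower-cased and split into single words.
words = (
    "the regulator cited violations and ordered a recall "
    "before any approval"
).split()

positive_score, negative_score = score_words(words)
```

Comparing the two counts gives a rough polarity for the article: more negative matches than positive ones suggests negative news, and vice versa.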

Step 7: Create a Python list of model output

The final output from the model is populated in a Python list. The list contains the URL, the positive score, and the negative score for each of the selected news articles on which we conducted sentiment analysis.

Final Output

[Figure: model output listing the URL, positive score, and negative score for each selected news article]

Step 8: Plot NIFTY vs NIFTY Pharma returns

Shown below is a plot of NIFTY vs NIFTY Pharma for October-November 2016. In the NIFTY Pharma plot we have drawn arrows highlighting some of the press releases on which we ran our sentiment analysis model. The impact of the uncertainty regarding the US Presidential election results, and of the negative news for the Indian Pharma sector emanating from the US, is clearly visible on NIFTY Pharma, which fell substantially from the highs made in late October 2016. Thus, for this sample period, the polarity from our Python sentiment analysis model was broadly in line with the direction of the Pharma index.
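For readers who want to reproduce a comparison like this, the sketch below computes cumulative returns from a series of closing prices. Whether the original plot uses cumulative or daily returns is not stated, so this is one possible choice, and the price lists are placeholder numbers rather than actual index values.

```python
def cumulative_returns(closes):
    """Return each close as a fractional change from the first close."""
    first = closes[0]
    return [close / first - 1.0 for close in closes]


# Placeholder numbers, not actual NIFTY or NIFTY Pharma values.
index_closes = [100.0, 102.0, 99.0]
sector_closes = [100.0, 101.0, 95.0]

index_returns = cumulative_returns(index_closes)
sector_returns = cumulative_returns(sector_closes)

# Sector performance relative to the broad index on each date.
relative_returns = [s - i for s, i in zip(sector_returns, index_returns)]
```

The two return series can then be drawn on the same axes with any charting library, with markers on the dates of the press releases.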

[Figure: NIFTY vs NIFTY Pharma, October-November 2016, with arrows marking selected press releases]

 

Next Step:

One can build more robust sentiment models using other approaches and trade profitably. As a next step we would recommend watching QuantInsti’s webinar on “Quantitative Trading Using Sentiment Analysis” by Rajib Ranjan Borah. Watch it by clicking on the video below:

 

Also, catch our other exciting Python trading blogs and if you are interested in knowing more about our EPAT course feel free to contact our QuantInsti team by clicking here.


  • Download.rar
    • Sentiment Analysis of News Article – Python Code
    • dict(1).csv
    • Nifty and Nifty Pharma(1).csv
    • Pharma vs Nifty.py


4 thoughts on “Sentiment Analysis on News Articles using Python”

  1. December 9, 2016

    Yash Shah

    Good read for a beginner like me. Nice one 🙂

  2. December 13, 2016

    Rohit Kelkar

    Good first approach towards sentiment analysis. You are using pharma related key words. Can you give some examples?

    • December 14, 2016

      admin

      Some of the keywords used include:
      1) recall
      2) sanctions
      3) violations
      4) adulterated
      5) contaminated
      6) approval
      7) initiated

      The dictionary needs to be more robust, but this post illustrates just a basic model.

  3. January 27, 2017

    Laurent

    Nice intro to sentiment analysis using Python. Please note that the package pattern is written for Python 2.5+ and has no support for Python 3
