R Weekly Bulletin Vol – V

This week’s R bulletin will cover topics like how to avoid for-loops, add or shorten an existing vector, and play a beep sound in R. We will also cover functions like new.env, readSeries, and the with and within functions. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To stop debugging – Shift+F8
2. To quit an R session (desktop only) – Ctrl+Q
3. To restart an R Session – Ctrl+Shift+0

Problem Solving Ideas

Avoiding for loops by using the “with” function

A for loop can be slow in terms of execution speed when we are dealing with large data sets. For faster execution, one can use the “with” function as an alternative. The syntax of the with function is given below:

with(data, expr)

where, “data” is typically a data frame, and “expr” stands for one or more expressions to be evaluated using the contents of the data frame. If there is more than one expression, then the expressions need to be wrapped in curly braces.

Example: Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both methods and time them using the system.time function. Note the time taken to execute the for loop versus the time taken by the with function in combination with the lagpad function.

library(quantmod)

# Using FOR Loop
system.time({

df = read.csv("NIFTY.csv")
df = df[,c(1,3:6)]

df$GapOpen = double(nrow(df))
for ( i in 2:nrow(df)) {
    df$GapOpen[i] = round(Delt(df$CLOSE[i-1],df$OPEN[i])*100,2)
}

print(head(df))

})

# Using with function + lagpad, instead of FOR Loop
system.time({

dt = read.csv("NIFTY.csv")
dt = dt[,c(1,3:6)]

lagpad = function(x, k) {
c(rep(NA, k), x)[1 : length(x)]
}

dt$PrevClose = lagpad(dt$CLOSE, 1)
dt$GapOpen_ = with(dt, round(Delt(PrevClose, OPEN)*100, 2))
print(head(dt))

})

Adding to an existing vector or shortening it

Adding or shortening an existing vector can be done by assigning a new length to the vector. When we shorten a vector, the values at the end will be removed, and when we extend an existing vector, missing values will be added at the end.

Example:

# Shorten an existing vector
even = c(2,4,6,8,10,12)
length(even)
[1] 6

# The new length equals the number of elements required in the vector to be shortened.
length(even) = 3
print(even)
[1] 2 4 6

# Add to an existing vector
odd = c(1,3,5,7,9,11)
length(odd)
[1] 6

# The new length equals the number of elements required in the extended vector.
length(odd) = 8
odd[c(7,8)] = c(13,15)
print(odd)
[1] 1 3 5 7 9 11 13 15

Make R beep/play a sound

If you want R to play a sound/beep upon executing the code, one can do this using the “beepr” package. The beep function from the package plays a sound when called. One also needs to install the “audio” package along with the “beepr” package, as beepr depends on it.

install.packages("beepr")
install.packages("audio")
library(beepr)
beep()

One can select from the various sounds using the “sound” argument and by assigning one of the specified values to it.

beep(sound = 9)

One can keep repeating the message using beepr as illustrated in the example below (source: http://stackoverflow.com/)

Example:

work_complete <- function() {
  cat("Work complete. Press esc to sound the fanfare!!!\n")
  on.exit(beepr::beep(3))

  while (TRUE) {
    beepr::beep(4)
    Sys.sleep(1)
  }
}

work_complete()

One can also use the beep function to play a sound if an error occurs during the code execution.

options(error = function() {beep(sound = 5)})
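Similarly, a sound can be tied to one specific block of code using tryCatch; the snippet below is a minimal sketch, where the log(-1)/stopifnot lines are placeholders standing in for your own code:

library(beepr)
# Play a sound only if the wrapped block throws an error
tryCatch({
  result = log(-1)            # produces NaN with a warning
  stopifnot(!is.nan(result))  # forces an error, for illustration only
}, error = function(e) {
  beep(sound = 9)
  message("Error caught: ", conditionMessage(e))
})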

Functions Demystified

new.env function

Environments act as a storehouse. When we create variables in R from the command prompt, these get stored in R’s global environment. To list the variables stored in the global environment, one can use the following expression:

head(ls(envir = globalenv()), 15)
[1] "df"  "dt"  "even"  "i"  "lagpad"  "odd"

If we want to store the variables in a specific environment, we can assign the variable to that environment or create a new environment which will store the variable. To create a new environment we use the new.env function.

Example:

my_environment = new.env()

Once we create a new environment, assigning a variable to the environment can be done in multiple ways. Following are some of the methods:

Examples:

# By using double square brackets
my_environment[["AutoCompanies"]] = c("MARUTI", "TVSMOTOR", "TATAMOTORS")

# By using dollar sign operator
my_environment$AutoCompanies = c("MARUTI", "TVSMOTOR", "TATAMOTORS")

# By using the assign function
assign("AutoCompanies", c("MARUTI", "TVSMOTOR", "TATAMOTORS"), my_environment)

The variables existing in an environment can be listed using the ls function, and their values can be retrieved using the get function.

Example:

ls(envir = my_environment)
[1] "AutoCompanies"

get("AutoCompanies", my_environment)
[1] "MARUTI"  "TVSMOTOR"  "TATAMOTORS"
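To remove a variable from a specific environment, one can use the rm function with the envir argument, for example:

# Remove the variable from my_environment and confirm it is gone
rm("AutoCompanies", envir = my_environment)
ls(envir = my_environment)
character(0)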

readSeries function

The readSeries function is part of the timeSeries package, and it reads a file in table format and creates a timeSeries object from it. The main arguments of the function are:

readSeries(file, header = TRUE, sep = ",", format)

where,
file: the filename of a spreadsheet dataset from which to import the data records.
header: a logical value indicating whether the file contains the names of the variables as its first line.
format: a character string specifying the timestamp format in POSIX notation.
sep: the field separator used in the spreadsheet file to separate columns. By default, it is set to ";".

Example:

library(timeSeries)

# Reading the NIFTY data using read.csv
df = read.csv(file = "NIFTY.csv")
print(head(df))

# Reading the NIFTY data and creating a time series object using readSeries
# function
df = readSeries(file = "NIFTY.csv", header = T, sep = ",", format = "%Y%m%d")
print(head(df))

with and within functions

The with and within functions apply an expression to a given data set and allow one to manipulate it. The within function additionally keeps track of the changes made, including adding or deleting elements, and returns a new object with the revised contents. The syntax for the two functions is given as:

with(data, expr)
within(data, expr)

where,
data – typically a list or data frame, although other options exist for with.
expr – one or more expressions to evaluate using the contents of data; the expressions must be wrapped in braces if there is more than one.

# Consider the NIFTY data
df = as.data.frame(read.csv("NIFTY.csv"))
print(head(df, 3))

# Example of with function:
df$Average = with(df, apply(df[3:6], 1, mean))
print(head(df, 3))

# Example of within function:
df = within(df, {
   Average = apply(df[3:6], 1, mean)
})
print(head(df, 3))

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Read more

Mixture Models for Forecasting Asset Returns

Mixture Models for Forecasting Asset Returns

By Brian Christopher

Asset return prediction is difficult. Most traditional time series techniques don’t work well for asset returns. One significant reason is that time series analysis (TSA) models require your data to be stationary. If it isn’t stationary, then you must transform your data until it is stationary.

That presents a problem.

In practice, we observe multiple phenomena that violate the rules of stationarity including non-linear processes, volatility clustering, seasonality, and autocorrelation. This renders traditional models mostly ineffective for our purposes.

What are our options?

There are many algorithms to choose from, but few are flexible enough to address the challenges of predicting asset returns:

  • mean and volatility changes through time
  • sometimes future returns are correlated with past returns, sometimes not
  • sometimes future volatility is correlated with past volatility, sometimes not
  • non-linear behavior

To recap, we need a model framework that is flexible enough to (1) adapt to non-stationary processes and (2) provide a reasonable approximation of the non-linear process that is generating the data.

Can Mixture Models offer a solution?

They have potential. First, they are based on several well-established concepts.

Markov models – These are used to model sequences where the future state depends only on the current state and not on any past states (memoryless processes).

Hidden Markov models – Used to model processes where the true state is unobserved (hidden) but there are observable factors that give us useful information to guess the true state.

Expectation-Maximization (E-M) – This is an algorithm that iterates between computing class parameters and maximizing the likelihood of the data given those parameters.

An easy way to think about applying mixture models to asset return prediction is to consider asset returns as a sequence of states or regimes. Each regime is characterized by its own descriptive statistics including mean and volatility. Example regimes could include low-volatility and high-volatility. We can also assume that asset returns will transition between these regimes based on probability. By framing the problem this way we can use mixture models, which are designed to try to estimate the sequence of regimes, each regime’s mean and variance, and the transition probabilities between regimes.

The most common is the Gaussian mixture model (GMM).

The underlying model assumption is that each regime is generated by a Gaussian process with parameters we can estimate. Under the hood, GMM employs an expectation-maximization algorithm to estimate regime parameters and the most likely sequence of regimes.

GMMs are flexible, generative models that have had success approximating non-linear data. Generative models are special in that they try to mimic the underlying data process such that we can create new data that should look like original data.
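As a small illustration of the idea (not the strategy evaluated in the webinar), a two-regime Gaussian mixture can be fitted to daily returns in R with the mclust package — one possible choice, since the article itself does not prescribe a package. The NIFTY.csv file and its CLOSE column are borrowed from the bulletin above:

library(mclust)

# Fit a 2-regime Gaussian mixture to daily log returns via E-M
df = read.csv("NIFTY.csv")
returns = diff(log(df$CLOSE))

fit = Mclust(returns, G = 2)
print(fit$parameters$mean)                    # per-regime mean return
print(sqrt(fit$parameters$variance$sigmasq))  # per-regime volatility
regimes = fit$classification                  # most likely regime for each day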

In this upcoming webinar on Mixture Models, we design and evaluate an example strategy to find out if Gaussian mixture models are useful for return prediction, and specifically for identifying market bottoms.

Next Step

Join us for the webinar, “Can we use Mixture Models to Predict Market Bottoms?” on Tuesday 25th Apr, 8:30 AM MST | 8:00 PM IST. The webinar will be conducted by Brian Christopher, Quantitative researcher, Python developer, CFA charter holder, and founder of Blackarbs LLC, a quantitative research firm. Register now!


Read more

Sentiment Trading Indicators and Strategy – Part 2

Sentiment Trading Indicators and Strategy

By Jay Maniar

In our last post on the sentiment indicators, we saw how we can use sentiment indicators like Put/Call ratio, Arms Index or Short term trading Index (TRIN) for trading and formulate a strategy around such sentiment indicators. In this post, we will explore more such sentiment indicators and illustrate different strategies that can be devised using these indicators.

Volatility Index

VIX is a trademarked ticker symbol for the Chicago Board Options Exchange (CBOE) Volatility Index. It is a measure of the implied volatility of S&P 500 index options over the next 30 days.

VIX as an Indicator

  • CBOE Volatility Index (VIX) is an up-to-the-minute market estimate of implied volatility of the S&P 500 Index which is calculated by taking the midpoints of the bid/ask quotes (price of options) of real-time S&P 500 index options.
  • At each tick in the VIX volatility index, it provides an instantaneous measure of how much the market would fluctuate in 30 days from the last tick.
  • Hence, the volatility index is forward-looking and predicts the volatility of the market in the future.
  • VIX is quoted as percentage points, i.e. a VIX of 20 represents an expected annualized change of 20% in either direction of the S&P 500 index, at a 68% confidence level or within one standard deviation of the normal probability distribution.
  • The generalized formula for the calculation of VIX is given below.

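(The original formula image is unavailable; reconstructed here from the CBOE white paper:)

$$\sigma^2 = \frac{2}{T}\sum_{i}\frac{\Delta K_i}{K_i^2}\,e^{RT}\,Q(K_i) \;-\; \frac{1}{T}\left(\frac{F}{K_0}-1\right)^2, \qquad \mathrm{VIX} = 100 \times \sigma$$

where T is the time to expiration, F the forward index level derived from option prices, K_0 the first strike below F, K_i the strike of the i-th out-of-the-money option, ΔK_i the interval between strikes, R the risk-free rate, and Q(K_i) the midpoint of the bid-ask spread of the option with strike K_i.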

VIX Interpretation

  • Practically, a high VIX corresponds to falling index levels.
  • Before understanding the reason for this, it is important to understand the basics of Option Pricing.
    Option Price = Intrinsic value + Extrinsic value; where extrinsic value is the summation of time value and volatility. Hence, volatility plays an important role in the pricing of the options.
  • A fall in the market typically results in higher put option premiums due to volatility. Demand for put options among investors is also high, since investors holding the stock want to insure their holdings by buying these puts. This demand stems from the anticipation that the market will fall further after a realized fall, since volatility keeps risk high. Volatility in the market arises from falling prices and the fear among investors of losing invested capital or accumulated gains; as a result, they may book gains or realize losses by selling the underlying. This pushes option premiums up, resulting in a sharp rise in VIX.
  • Generally, a VIX value above 30 is an indication of high uncertainty and fear in the market.
  • A low VIX value indicates an expectation of a calm market as a result of the rally seen in the market.
  • A rally increases greed among investors and they expect the market to continue rising. As a result, option writers price their call options with different strike prices in such a way that it is lucrative enough for an investor to buy the option, but the probability of the option being in the money before expiry is not too high. In a rally, more call options are bought, decreasing the Put/Call Ratio (PCR) – indicating a bullish market. Investors may not want to realize all their gains at once at a particular price level, as they expect the market to rise further; they sell only a fraction of their portfolio systematically to new buyers who want to enter the rally and hold onto the rest. There can be steady rallies and small corrections in overpriced stocks, which reduce the overall volatility.
  • This, in turn, drives the VIX value lower. VIX below 20 is generally an indication of a calm market.

Strategy

We will take contrarian positions based on VIX. Taking a contrarian position refers to ‘buying’ when the market falls drastically and ‘selling’ when the market rises irrationally. A contrarian profits from the theory that when there is certain positive or negative crowd behavior regarding a security, it leads to mispricing of the security due to the prevailing bullish or bearish sentiment.

  • When VIX is high (generally above 30), we buy the underlying index. Since this is an indication that the market is bearish and implied volatility is high, we BUY because we expect corrections in the bear market from this level and expect implied volatility to revert to its mean, indicating a bull market from this point.

Another strategy could be to go ‘short puts’, i.e. delta positive and vega negative. Delta positive means that as the stock price rises, so does the option position’s value, while a negative vega position benefits from falling implied volatility.

  • When VIX is low (generally below 15), it is an indication that the market is bullish and a correction is likely. We go ‘long puts’, i.e. delta negative and vega positive, or we can SELL the index, as sketched below.
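A minimal sketch of this contrarian rule in R, assuming vix is a numeric vector of VIX closing levels (the 30/15 thresholds follow the discussion above; the sample values are hypothetical):

# Contrarian VIX signal: +1 = buy the index, -1 = sell, 0 = no position
vix_signal = function(vix, upper = 30, lower = 15) {
  ifelse(vix > upper, 1, ifelse(vix < lower, -1, 0))
}

vix = c(12, 14, 22, 31, 35, 18)  # hypothetical VIX values
vix_signal(vix)
[1] -1 -1  0  1  1  0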

Image: VIX levels and corresponding S&P 500 Index levels

VIX levels and corresponding S&P 500 Index levels

Margin debt indicator

A regular cash account allows you to buy securities worth the amount of cash available in the account. For example, if you have $5000 in your account and you want to invest in ABC Corporation’s shares trading at $100 on the exchange, then you can buy ($5000/$100) 50 shares of ABC Corporation. But if, according to your analysis, ABC Corp is undervalued and you expect a rise in the stock’s value in the near term, you can capitalize on this opportunity by asking your broker to lend you money to buy securities in your account. To do this, the broker would require you to open a margin account. A margin account is an agreement between you and your broker whereby the broker agrees to lend you a proportional amount of money solely to buy financial securities (stocks, bonds, and other financial instruments). The collateral for this loan is the financial securities purchased (ABC Corp stock in our example). However, there are a few prerequisites before you purchase these securities on loan and sign the margin account agreement.

  • While buying securities on margin, the proportion paid by the investor is called the margin, and the proportion loaned out by the broker to buy these securities is called the margin debt.
  • These debts taken by various investors are aggregated and published by exchanges because the brokers are required to report this data to the exchanges.

Interpretation

  • An increase in the total margin debt outstanding over time will coincide with a rise in the market, suggesting aggressive buying and a bullish sentiment.
  • A rational reason for an investor to buy a stock on margin is that free cash has been exhausted while the investor still sees an opportunity in buying; as a result, the investor buys on margin.
  • But every margin account has its own credit limit, i.e. the proportion that the broker loans out to investors. As these margin investors reach their limits of margin credit, their ability to continue buying decreases; as a result, demand in the market decreases and prices may come to a standstill or even fall because of weaker demand.
  • This weaker demand is a result of investors reaching the limits of their buying capacity, both of their own equity (the investor’s cash) and of margin debt (the ability to buy securities on loan).
  • This may cause a drop in the prices of the shares or the index as a whole, resulting in margin calls.
  • Unavailability of free cash and decreasing prices may force the investors or the brokers to sell securities in these margin accounts, adding further selling pressure and pushing prices to new lows.
  • Hence, increasing margin debts tend to coincide with increasing market prices and decreasing margin debts tend to coincide with decreasing market prices.

Strategy

  • At historically low levels of margin debt, we will BUY the index futures, since there is additional room to buy securities on margin and it might be indicative of an oversold market.
  • At historically high levels of margin debt, we will SELL the index futures, since there is no more room to buy securities on margin and there is a possibility of triggering margin calls. A small coding sketch of this rule follows.
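One hedged way to code this rule is to compare the latest reading with historical percentiles of the series; margin_debt below is an assumed numeric vector of monthly aggregate margin-debt values, and the 10%/90% cutoffs are illustrative choices:

# +1 = buy index futures near historical lows of margin debt,
# -1 = sell near historical highs, 0 otherwise
margin_signal = function(margin_debt, low_q = 0.10, high_q = 0.90) {
  lo = quantile(margin_debt, low_q)
  hi = quantile(margin_debt, high_q)
  latest = tail(margin_debt, 1)
  if (latest <= lo) 1 else if (latest >= hi) -1 else 0
}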

Image: Margin debt chart and corresponding S&P 500 Index levels

Margin debt chart and corresponding S&P 500 Index levels

Mutual Fund cash position indicator

Mutual funds hold a substantial amount of all the investable assets present in the market.

  • Mutual fund cash position is the ratio of mutual fund’s cash to total assets.

Mutual fund cash position = (mutual fund’s cash/total assets of the mutual fund).

  • This cash can be cash in hand or cash invested in highly liquid money market securities which earn a nominal rate of return.
  • Generally, this cash position is up to 5%, which funds are required to keep on hand at all times to handle share redemptions, daily operating expenses, and the like.
  • Cash also comes into a mutual fund on a daily basis from customer (investor) deposits, interest earned, and dividends received.
  • Cash also increases after a fund manager sells a position and holds the funds before reinvesting them.

Interpretation

  • During uptrends in the markets, fund managers want to invest cash quickly, because cash (idle or in money market instruments) only earns near risk-free returns. Keeping money in cash lowers returns, while investing it during uptrends can earn more than the risk-free rate and improves the overall performance of the fund.
  • As a result, generally, when there are medium and long term uptrends in the market, the mutual fund cash position is below 4.5 – 5% as maximum cash is invested in the market with an expectation to make most out of this cash.
  • Similarly, during downtrends, an investment in cash earns a near risk-free rate, which is greater than the possible negative return earned in the market. As a result, the fund’s cash balance increases, with the expectation of improving the fund’s performance or overall return.
  • Generally, during such short and medium term downtrends, the cash position may increase to more than 11% in a mutual fund.
  • Analysts generally interpret this as a contrarian indicator.
  • This is because when mutual funds accumulate cash, the fund managers are bearish, and this cash indicates future buying power in the market by these funds.
  • A high mutual fund cash ratio suggests market prices are likely to rise in near future.
  • On the other hand, when mutual funds’ cash is low, it means they are already invested and market prices reflect their purchases. This leaves less scope for an increase in market prices, even though the fund managers are bullish and anticipate rising prices.

Strategy

  • We would BUY index futures when the mutual fund cash ratio rises substantially more than the previous cash positions in the recent past.

Mutual fund cash position levels

Mutual fund cash position levels and corresponding S&P 500 Index levels; Image source: caps.fool.com

Conclusion

Always remember: when you trade, do not use these sentiment indicators in isolation. Use indications from more than one sentiment indicator, and try to understand the fundamentals and rationale behind such patterns, but be brave enough to take up the contrarian position and capitalize on the fear or greed of other investors.

Next Step

To understand sentiment indicators like Put call Ratio (PCR), Arms Index or TRading INdex (TRIN) and Volatility Index (VIX) in more detail and to learn how to code an algorithmic trading strategy in Python beating the S&P 500 returns and backtesting it on 2 years data, check out the course Trading Using Options Sentiment Indicators.

Read more

R Weekly Bulletin Vol – IV

This week’s R bulletin will cover topics like removing duplicate rows, finding row number, sorting a data frame in the same order, sorting a data frame in different order, and creating two tabs in an excel workbook. We will also cover functions like Sys.time, julian, and the second & minute function from the lubridate package. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To open a file in R – Ctrl+O
2. To open a new R script – Ctrl+Shift+N
3. To save an R script – Ctrl+S

Problem Solving Ideas

Remove duplicate rows in R

Consider the following data frame “df” comprising 3 rows of 1-minute open-high-close (OHC) data for a stock.

Example:

open = c(221, 221.25, 221.25) 
high = c(221.75, 221.45, 221.45) 
close = c(221.35, 221.2, 221.2) 
df = data.frame(open, high, close) 
print(df)

As can be seen from the output, the 2nd and the 3rd rows are duplicates. If we want to retain only the unique rows, we can use the duplicated function in the following manner:

df[!duplicated(df), ]

This will remove the 3rd (duplicate) row and retain the first 2 rows. In case we want only the duplicate row, we can do so in the following way:

df[duplicated(df), ]

Return row number for a particular value in a column

To find the row number for a particular value in a column in a vector/data frame we can use the “which” function.

Example 1:

day = c("Mon", "Tue", "Wed", "Thurs", "Fri", "Sat", "Sun") 
row_number = which(day == "Sat") 
print(row_number)
[1] 6

Example 2: Consider a 2-column data frame “data” comprising stock symbols and their respective closing prices for the day. To find the row number corresponding to the HCL stock, we call the “which” function on the Ticker column with its value selected as HCL.

Ticker = c("INFY", "TCS", "HCL") 
ClosePrice = c(2021, 2294, 910) 
data = data.frame(Ticker, ClosePrice) 

row_number = which(data$Ticker == "HCL") 
print(row_number)
[1] 3

Sorting a data frame by two columns in the same order

To sort a data frame by two columns in the same order, we can use the “order” function together with the “with” function. Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. We first sort the data frame by the Category column, and then by the percentage change in price.

The order function by default sorts in an ascending manner. Hence to sort both the columns in descending order we keep the decreasing argument as TRUE.

Example – Sorting a data frame by two columns

# Create a data frame 
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN") 
Category = c(1, 1, 1, 2) 
Percent_Change = c(2.3, -0.25, 0.5, 0.25) 

df = data.frame(Ticker, Category, Percent_Change) 
print(df)

# Sorting by Category column first and then the Percent_Change column: 
df_sort = df[with(df, order(Category, Percent_Change, decreasing = TRUE)), ] 
print(df_sort)

Sorting a data frame by two columns in different order

To sort a data frame by two columns in different order, we can use the “order” function along with the “with” function.

Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. Assume that we want to sort first in an ascending order by column “Category”, and then by column “Percent_Change” in a descending order.

The order function by default sorts in an ascending manner. Hence, to sort the “Category” column we mention it as the first variable in the order function without prepending it with any sign. To sort the “Percent_Change” column in a descending order we prepend it with a negative sign.

Example – Sorting a data frame by two columns

# Create a data frame 
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN") 
Category = c(1, 1, 1, 2) 
Percent_Change = c(2.3, -0.25, 0.5, 0.25) 

df = data.frame(Ticker, Category, Percent_Change) 
print(df)

# Sort by Category column first and then the Percent_Change column: 
df_sort = df[with(df, order(Category, -Percent_Change)), ] 
print(df_sort)

Creating two tabs in the output excel workbook

At times we want to write and save multiple results generated by an R script to an Excel workbook, on separate worksheets. To do so, we can make use of the “append” argument in the write.xlsx function. To write to an Excel file, one must first install and load the xlsx package in R.

Example: In this example, we create two worksheets in the “Stocks.xlsx” workbook. In the worksheet named “Top Gainers”, we save the table_1 output, while in the “Top Losers” worksheet we save the table_2 output. To create a second worksheet in the same workbook, we set the append argument to TRUE in the second line.

write.xlsx(table_1, "Stocks.xlsx", sheetName = "Top Gainers", append = FALSE) 
write.xlsx(table_2, "Stocks.xlsx", sheetName = "Top Losers", append = TRUE)

Functions Demystified

Sys.time function

The Sys.time function gives the current date and time.

Example:

date_time = Sys.time() 
print(date_time)
[1] "2017-04-15 16:25:38 IST"

The function can be used to find the time required to run a piece of code by calling it at the start and at the end of the code. The difference between the two gives the time taken to execute the code.
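For example:

# Time a block of code using Sys.time
start_time = Sys.time()
x = runif(1e6)  # code to be timed
end_time = Sys.time()
print(end_time - start_time)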

julian function

The julian function extracts the Julian date, i.e. the number of days since an origin, which defaults to January 1, 1970. Given a date, the syntax of the function is:

julian(date)

Example:

date = as.Date("2010-03-15") 
julian(date)
[1] 14683
attr(,"origin")
[1] "1970-01-01"

Alternatively, one can use the as.integer function on a Date object to get the same result.
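For example, Date objects are stored internally as days since 1970-01-01:

as.integer(as.Date("2010-03-15"))
[1] 14683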

second and minute functions

These functions are part of the lubridate package. The second function retrieves/sets the second component of a date-time object, while the minute function retrieves/sets the minute component. They accept date-time objects of classes such as POSIXct, POSIXlt, Date, zoo, xts, and timeSeries.

Example:

library(lubridate) 
# Retrieving the seconds component. We have used the ymd_hms function to
# parse the given object.
x = ymd_hms("2016-06-01 12:23:45") 
second(x)
[1] 45

minute function:

library(lubridate) 
# Retrieving the minute from a date-time object 
x = ymd_hms("2016-06-01 12:23:45") 
minute(x)
[1] 23

# Retrieving the minute from a time object. We have used the hms function to parse the given object.
x = hms("15:29:06") 
minute(x)
[1] 29

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


Read more

[WEBINAR] Can we use mixture models to predict market bottoms?

Tuesday 25th Apr, 8:30 AM MST | 8:00 PM IST

 Mixture Models to Predict Market Bottoms

The webinar will explain Mixture Models and explore their application to predict an asset’s return distribution and identify outlier returns that are likely to mean revert.

The webinar will cover

  • Why bother? Motivating experimentation with Mixture Models
  • How do Mixture Models work? (An intuitive explanation)
  • Designing the Research Experiment (How do we answer the original question?)
  • Define the strategy
  • Evaluate the strategy
  • Conclusions
  • Further Areas to Explore
  • Resources

(more…)

Read more

R Best Practices: R you writing the R way!

By Milind Paradkar

Any programmer inevitably writes tons of code in their daily work. However, not all programmers inculcate the habit of writing clean code that can easily be understood by others. One reason can be a lack of awareness among programmers of the best practices to follow when writing a program. This is especially the case for novice programmers. In this post, we list some R programming best practices that will lead to improved code readability, consistency, and repeatability. Read on!

Best practices of writing in R

1) Describe your code – When you start coding, describe what the R code does in the very first line. For subsequent blocks of code, follow the same practice. This makes it easy for other people to understand and use the code.

Example:

# This code captures the 52-week high effect in stocks
# Code developed by Milind Paradkar

2) Load Packages – After describing your code in the first line, use the library function to list and load all the relevant packages needed to execute your code.

Example:

library(quantmod);  library(zoo); library(xts);
library(PerformanceAnalytics); library(timeSeries); library(lubridate);

3) Use Updated Packages – While writing your code ensure that you are using the latest updated R packages. To check the version of any R package you can use the packageVersion function.

Example:

packageVersion("TTR")
[1] ‘0.23.1’

4) Organize all source files in the same directory – Store all the necessary files that will be used/sourced in your code in the same directory. You can use the respective relative path to access them.

Example:

# Reading file using relative path
df = read.csv(file = "NIFTY.csv", header = TRUE)

# Reading file using full path
df =  read.csv(file = "C:/Users/Documents/NIFTY.csv", header = TRUE)

5) Use a consistent style for data structure types – The R programming language allows different data structures like vectors, factors, data frames, matrices, and lists. Use a consistent naming scheme for each type of data structure. This makes it easy to recognize similar data structures in the code and to spot problems easily.

Example:
You can name all different data frames used in your code by adding .df as the suffix.

aapl.df   = as.data.frame(read.csv(file = "AAPL.csv", header = TRUE))
amzn.df = as.data.frame(read.csv(file = "AMZN.csv", header = TRUE))
csco.df  = as.data.frame(read.csv(file = "CSCO.csv", header = TRUE))

6) Indent your code – Indentation makes your code easier to read, especially if there are multiple nested statements like for loops and if statements.

Example:

# Computing the Profit & Loss (PL) and the Equity
dt$PL = numeric(nrow(dt))
for (i in 1:nrow(dt)){
   if (dt$Signal[i] == 1) {dt$PL[i+1] = dt$Close[i+1] - dt$Close[i]}
   if (dt$Signal[i] == -1){dt$PL[i+1] = dt$Close[i] - dt$Close[i+1]}
}

7) Remove temporary objects – For long codes, running into thousands of lines, it is a good practice to remove temporary objects after they have served their purpose in the code. This helps ensure that R does not run into memory issues.
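For instance (a small illustrative sketch):

# Remove a temporary object once it has served its purpose
temp.df = as.data.frame(matrix(rnorm(1e+06), ncol = 10))
# ... computations using temp.df ...
rm(temp.df)  # delete the object
gc()         # prompt R to return the freed memory to the OS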

8) Time the code – You can time your code using the system.time function. You can also use the same function to find out the time taken by different blocks of code. The function returns the amount of time taken in seconds to evaluate the expression or a block of code. Timing codes will help to figure out any bottlenecks and help speed up your codes by making the necessary changes in the script.

To find the time taken for different blocks we wrapped them in curly braces within the call to the system.time function.

The two important metrics returned by the function include:
i) User time – time charged to the CPU(s) for the code
ii) Elapsed time – the amount of time elapsed to execute the code in entirety

 Example:

# Generating random numbers
system.time({
mean_1 = rnorm(1e+06, mean = 0, sd = 0.8)
})

user    system    elapsed
0.40      0.00       0.45

9) Use vectorization – Vectorization results in faster execution of codes, especially when we are dealing with large data sets. One can use statements like the ifelse statement or the with function for vectorization.

Example:
Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both methods (a for loop versus the with function) and time them using the system.time function. Note the time taken to execute the for loop versus the time taken by the with function in combination with the lagpad function.

library(quantmod)
# Using FOR Loop
system.time({
df = read.csv("NIFTY.csv")
df = df[,c(1,3:6)]
df$GapOpen = double(nrow(df))
for ( i in 2:nrow(df)) {
df$GapOpen[i] = round(Delt(df$CLOSE[i-1],df$OPEN[i])*100,2)
}
print(head(df))
})

# Using with function + lagpad, instead of FOR Loop
system.time({
df = read.csv("NIFTY.csv")
df = df[,c(1,3:6)]

lagpad = function(x, k) {
c(rep(NA, k), x)[1 : length(x)]
}

df$PrevClose = lagpad(df$CLOSE, 1)
df$GapOpen_ = with(df, round(Delt(PrevClose, OPEN)*100, 2))
print(head(df))
})

10) Folding code – Code folding lets the R programmer collapse individual lines or whole sections of code. This allows hiding blocks of code whenever required and makes it easier to navigate through lengthy scripts. Code folding can be done in two ways:
i) Automatic folding of code
ii) User-defined folding of code

Automatic folding of code: RStudio automatically provides code folding. When a coder writes a function or a conditional block, RStudio automatically makes it foldable.

User-defined folding of code:
One can also fold an arbitrary group of lines by using Edit -> Folding -> Collapse, or by selecting the lines and pressing Alt+L.

User-defined folding can also be done via Code Sections:
To insert a new code section you can use the Code -> Insert Section command. Alternatively, any comment line which includes at least four trailing dashes (-), equal signs (=) or pound signs (#) automatically creates a code section.
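For example, a comment line like the one below creates a foldable code section in RStudio:

# Load and clean the price data ----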

11) Review and test your code rigorously – Once your code is ready, ensure that you test it rigorously with different input parameters. Check that the logic used in statements like for loops, if statements, and ifelse statements is correct. It is a good idea to get your code reviewed by a colleague to ensure the work is of high quality.

12) Don’t save your workspace – When you exit R, it asks whether you want to save your workspace. It is advisable not to save it and to start your next R session with a clean workspace. Objects left over from previous R sessions can lead to errors which can be hard to debug.

These were some of the best practices of writing in R that one can follow to make your code easy to read and debug, and to ensure consistency.

 Next Step

 If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

Read more

Algorithmic Trading Vs Discretionary Trading

Algorithmic Trading Vs Discretionary Trading

By Nitin Thapar

Introduction

If you are a discretionary trader, you might have asked these questions before

In order to answer these questions, we first need to know what makes these practices stand apart from each other.

In this post, we will make an attempt to decode all the questions related to algorithmic trading vs discretionary trading.

(more…)

Read more

R Weekly Bulletin Vol – III

This week’s R bulletin will cover topics like how to read select columns, the difference between boolean operators, converting a factor to a numeric and changing memory available to R. We will also cover functions like data, format, tolower, toupper, and strsplit function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To change the working directory – Ctrl+Shift+K
2. To show history – Ctrl+4
3. To display environment – Ctrl+8

Problem Solving Ideas

Read a limited number of columns from a dataset

To read a specific set of columns from a dataset one can use the “fread” function from the data.table package. The syntax for the function to be used for reading/dropping a limited number of columns is given as:

fread(input, select, stringsAsFactors=FALSE)
fread(input, drop, stringsAsFactors=FALSE)

One needs to specify the column names or column numbers with the select parameter, as shown below.
Example:

library(data.table)
data = fread("data.txt")
print(data)


# Reading select columns
data = fread("data.txt", select = c("DATE", "OPEN", "CLOSE"))
print(data)


Alternatively, you can use the drop parameter to indicate which columns should not be read:
Example:

data = fread("data.txt", drop = c(2:3, 6))
print(data)


Difference between the boolean operators & and &&

In R we have the “&” Boolean operator, which is the equivalent of the AND operator in Excel. Similarly, we have | in R, which is the equivalent of OR in Excel. One also finds the Boolean operators && and || in R. Although these operators have the same meaning as & and |, they behave in a slightly different manner.

Difference between & and &&: The shorter form “&” is vectorized, meaning it can be applied on a vector. See the example below.
Example:
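(The original example image is missing; the following is a minimal reconstruction consistent with the description.)

x = c(-1, 0, 1)
x >= 0 & x <= 0
[1] FALSE  TRUE FALSE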


The longer form &&, evaluates left to right, examining only the first element of each vector. Thus in the example below, it first evaluates whether -1 >= 0 (which is FALSE); since the left-hand side is already FALSE, && short-circuits and returns FALSE without evaluating the right-hand side.

Example:
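(Again, a minimal reconstruction of the missing example:)

-1 >= 0 && -1 <= 0
[1] FALSE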


Thus the && operator is to be used when we have vectors of length one. For vectors of length greater than one, we should use the short form.

How to convert a factor to a numeric without a loss of information

Consider the following factor “x” created using the factor, sample, and the runif functions. Let us try to convert this factor to a numeric.

Example:

x = factor(sample(runif(5), 6, replace = TRUE))
print(x)
[1] 0.660900804912671 0.735364762600511 0.479244165122509 0.397552277892828
[5] 0.660900804912671 0.660900804912671
4 Levels: 0.397552277892828 0.479244165122509 … 0.735364762600511

as.numeric(x)
[1] 3 4 2 1 3 3

As can be seen from the output, upon conversion we do not get the original values. R does not have a function to execute such conversion error-free. To overcome this problem, one can use the following expression:

as.numeric(levels(x))[x]
[1] 0.6609008 0.7353648 0.4792442 0.3975523 0.6609008 0.6609008

One can also use the as.numeric(as.character(x)) expression, however the as.numeric(levels(x))[x] is slightly more efficient in converting the factor to an integer/numeric.

Replace a part of URL using R

Suppose you want to scrape data from a site with multiple pages and need to modify the URL from R. We can do this using the parse_url and build_url functions from the “httr” package. The following example illustrates the method to replace a part of the URL.

Example:

library(httr)
# In the following url, the value 2 at the end signifies page 2 of the books
# category under the section "bestsellers".
testURL = "http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#2"

# Entering the url and parsing it using the parse_url function
parseURL = parse_url(testURL)
print(parseURL)

$scheme
[1] "http"

$hostname
[1] "www.amazon.in"

$port
NULL

$path
[1] "gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b"

$query
NULL

$params
NULL

$fragment
[1] "2"

$username
NULL

$password
NULL

attr(,"class")
[1] "url"

# Assigning the next page number (i.e. page no. 3) and creating the new url
parseURL$fragment = 3
newURL <- build_url(parseURL)
print(newURL)
[1] "http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#3"

Increasing (or decreasing) the memory available to R processes

In R there is a memory.limit() function which gives you the amount of available memory in MB. At times, Windows users may get the error that R has run out of memory. In such cases, you may set the amount of available memory.

Example:

memory.limit(size=4000)

The unit is MB. You may increase this value up to 2GB or the maximum amount of physical RAM you have installed. If you have R already installed and subsequently install more RAM, you may have to reinstall R in order to take advantage of the additional capacity.

On 32-bit Windows, R can only use up to 3Gb of RAM, regardless of how much you have installed. There is a 64-bit version of R for Windows available from REvolution Computing, which runs on 64-bit Windows, and can use the entire RAM available.

Functions Demystified

data function

The R distribution includes a number of packages which often come with their own data sets. These data sets can be accessed using the data function. This function loads specified data sets or lists the available data sets. To check the available data sets from all the R packages installed in your application, use the data() expression. This will display all the available data sets under each package.

Example: data()
To view data sets from a particular package, specify the package name as the argument to the data function.

Example:

data(package = "lubridate")

To load a particular dataset, use the name of the dataset as the argument to the data function.

Example:

# loading the USDCHF dataset from the timeSeries package
library(timeSeries)
data(USDCHF)
x = USDCHF
print(head(USDCHF))

# loading the sample_matrix dataset from the xts package
library(xts)
data(sample_matrix)
x = sample_matrix
print(head(sample_matrix))


format function

The format function is used to format numbers so that they appear clean and neat in reports. The function takes a number of arguments to control the formatting of the result. The main arguments of the format function are given below.

format(x, trim, digits, decimal.mark, nsmall, big.mark, big.interval, small.mark,
small.interval, justify)

Where,
1) x: the number to be formatted.
2) trim: takes a logical value. If FALSE, it adds spaces to right-justify the result. If TRUE, it suppresses the leading spaces.
3) digits: how many significant digits of numeric values to display.
4) decimal.mark: the character to display as the decimal mark.
5) nsmall: the minimum number of digits after the decimal point.
6) big.mark: the mark inserted between intervals of digits before the decimal point.
7) small.mark: the mark inserted between intervals of digits after the decimal point.

Example:

format(34562.67435, digits=7, decimal.mark=".",big.mark=" ",
small.mark=",",small.interval=2)
[1] "34 562.67"

In the example above we chose to display 7 significant digits and used a point (.) as the decimal mark. The big.mark argument inserts a space between every group of three digits before the decimal point, which is why a space appears after the first two digits of 34562. The small.interval value of 2 would group the digits after the decimal point in intervals of two, marked with small.mark; it is not visible here because only two decimal digits are displayed.

tolower, toupper, and strsplit

The tolower and toupper functions help convert strings to lowercase and uppercase respectively.
Example 1:

toupper("I coded a strategy in R")
[1] "I CODED A STRATEGY IN R"

tolower("I coded a strategy in R")
[1] "i coded a strategy in r"

Example 2:

df = as.data.frame(read.csv("NIFTY.csv"))
print(head(df, 4))

colnames(df)
[1] "DATE"   "TIME"   "CLOSE"  "HIGH"   "LOW"    "OPEN"   "VOLUME"

tolower(colnames(df))
[1] "date"   "time"   "close"  "high"   "low"    "open"   "volume"

The strsplit function is used to break/split strings at the specified split points. The function returns a list and not a character vector or a matrix. In the example below, we are splitting the expression on the spaces.

Example:

strsplit("I coded a strategy in R", split = " ")
[[1]]
[1] "I"        "coded"    "a"        "strategy" "in"       "R"

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


Read more

Forecasting Markets using eXtreme Gradient Boosting (XGBoost)

Forecasting Markets using Gradient Boosting (XGBoost)

By Milind Paradkar

In recent years, machine learning has been generating a lot of curiosity for its profitable application to trading. Numerous machine learning models like Linear/Logistic regression, Support Vector Machines, Neural Networks, tree-based models, etc. are being tried and applied in an attempt to analyze and forecast the markets. Researchers have found that some models have a higher success rate than other machine learning models. eXtreme Gradient Boosting, also called XGBoost, is one such machine learning model that has received rave reviews from machine learning practitioners.

In this post, we will cover the basics of XGBoost, a winning model in many Kaggle competitions. We then attempt to develop an XGBoost stock forecasting model using the “xgboost” package in R.

Basics of XGBoost and related concepts

Developed by Tianqi Chen, the eXtreme Gradient Boosting (XGBoost) model is an implementation of the gradient boosting framework. Gradient Boosting algorithm is a machine learning technique used for building predictive tree-based models. (Machine Learning: An Introduction to Decision Trees).

Boosting is an ensemble technique in which new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.

The ensemble technique uses the tree ensemble model, which is a set of classification and regression trees (CART). The ensemble approach is used because a single CART usually does not have strong predictive power. By using a set of CARTs (i.e. a tree ensemble model), the sum of the predictions of multiple trees is considered.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction.

The objective of the XGBoost model is given as:

Obj = L + Ω

Where,
L is the loss function, which controls the predictive power, and
Ω is the regularization component, which controls simplicity and overfitting.

The loss function (L) which needs to be optimized can be Root Mean Squared Error for regression, Logloss for binary classification, or mlogloss for multi-class classification.

The regularization component (Ω) is dependent on the number of leaves and the prediction score assigned to the leaves in the tree ensemble model.

It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. The Gradient boosting algorithm supports both regression and classification predictive modeling problems.

Sample XGBoost model:

We will use the “xgboost” R package to create a sample XGBoost model. You can refer to the documentation of the “xgboost” package here.

Install and load the xgboost library –

We install the xgboost library using the install.packages function. To load this package we use the library function. We also load other relevant packages required to run the code.

install.packages("xgboost")

# Load the relevant libraries
library(quantmod); library(TTR); library(xgboost);


Create the input features and target variable – We take the 5-year OHLC and volume data of a stock and compute the technical indicators (input features) using this dataset. The indicators computed include Relative Strength Index (RSI), Average Directional Index (ADX), and Parabolic SAR (SAR). We create a lag in the computed indicators to avoid the look-ahead bias. This gives us our input features for building the XGBoost model. Since this is a sample model, we have included only a few indicators to build our set of input features.

# Read the stock data 
symbol = "ACC"
fileName = paste(getwd(),"/",symbol,".csv",sep="") ; 
df = as.data.frame(read.csv(fileName))
colnames(df) = c("Date","Time","Close","High", "Low", "Open","Volume")

# Define the technical indicators to build the model 
rsi = RSI(df$Close, n=14, maType="WMA")
adx = data.frame(ADX(df[,c("High","Low","Close")]))
sar = SAR(df[,c("High","Low")], accel = c(0.02, 0.2))
trend = df$Close - sar

# create a lag in the technical indicators to avoid look-ahead bias 
rsi = c(NA,head(rsi,-1)) 
adx$ADX = c(NA,head(adx$ADX,-1)) 
trend = c(NA,head(trend,-1))

Our objective is to predict the direction of the daily stock price change (Up/Down) using these input features. This makes it a binary classification problem. We compute the daily price change and assign a value of 1 if the change is positive; if the price change is negative, we assign a value of zero.

# Create the target variable
price = df$Close-df$Open
class = ifelse(price > 0,1,0)


Combine the input features into a matrix – The input features and the target variable created in the above step are combined to form a single matrix. We use the matrix structure in the XGBoost model since the xgboost library allows data in the matrix format.

# Create a Matrix
model_df = data.frame(class,rsi,adx$ADX,trend)
model = matrix(c(class,rsi,adx$ADX,trend), nrow=length(class))
model = na.omit(model)
colnames(model) = c("class","rsi","adx","trend")


Split the dataset into training data and test data – In the next step, we split the dataset into training and test data. Using this training and test dataset we create the respective input features set and the target variable.

# Split data into train and test sets 
train_size = 2/3
breakpoint = floor(nrow(model) * train_size)

training_data = model[1:breakpoint,]
test_data = model[(breakpoint+1):nrow(model),]

# Split data training and test data into X and Y
X_train = training_data[,2:4] ; Y_train = training_data[,1]
class(X_train)[1]; class(Y_train)

X_test = test_data[,2:4] ; Y_test = test_data[,1]
class(X_test)[1]; class(Y_test)


Train the XGBoost model on the training dataset –

We use the xgboost function to train the model. Its main arguments are described below.

The data argument in the xgboost function is for the input features dataset. It accepts a matrix, dgCMatrix, or a local data file. The nrounds argument refers to the maximum number of boosting iterations (i.e. the number of trees added to the model). The obj argument refers to a customized objective function; it returns the gradient and second-order gradient for a given prediction and dtrain.

# Train the xgboost model using the "xgboost" function
dtrain = xgb.DMatrix(data = X_train, label = Y_train)
xgModel = xgboost(data = dtrain, nrounds = 5, objective = "binary:logistic")


Output – The output is the classification error on the training data set.

Cross-validation

We can also use the cross-validation function of xgboost i.e. xgb.cv. In this case, the original sample is randomly partitioned into nfold equal size subsamples. Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining (nfold – 1) subsamples are used as training data. The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data.

# Using cross validation
dtrain = xgb.DMatrix(data = X_train, label = Y_train)
cv = xgb.cv(data = dtrain, nrounds = 10, nfold = 5, objective = "binary:logistic")


Output – The xgb.cv returns a data.table object containing the cross validation results.

Make predictions on the test data

To make predictions on the unseen data set (i.e. the test data), we apply the trained XGBoost model on it which gives a series of numbers.

# Make the predictions on the test data
preds = predict(xgModel, X_test)

# Determine the size of the prediction vector
print(length(preds))

# Limit display of predictions to the first 6
print(head(preds))


Output – a numeric vector of predicted values between 0 and 1.

These numbers do not look like binary classification {0, 1}. We therefore have to perform a simple transformation before we can use these results. In the example code shown below, we compare each predicted number to the threshold of 0.5. The threshold value can be changed depending upon the objective of the modeler and the metrics (e.g. F1 score, Precision, Recall) that the modeler wants to track and optimize.

prediction = as.numeric(preds > 0.5)
print(head(prediction))


Output – a vector of 0s and 1s.

Measuring model performance

Different evaluation metrics can be used to measure the model performance. In our example, we will compute a simple metric, the average error. It compares the predicted score with the threshold of 0.50.

For example: If the predicted score is less than 0.50, then the (preds > 0.5) expression gives a value of 0. If this value is not equal to the actual result from the test data set, then it is taken as a wrong result.

We compare all the preds with the respective data points in the Y_test and compute the average error. The code for measuring the performance is given below. Alternatively, we can use hit rate or create a confusion matrix to measure the model performance.

# Measuring model performance
error_value = mean(as.numeric(preds > 0.5) != Y_test)
print(paste("test-error=", error_value))


Output – the average test error is printed to the console.
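Alternatively, a simple confusion matrix of predicted versus actual classes can be produced with base R's table function, a small sketch using the objects defined above:

# Confusion matrix of predicted vs. actual classes
prediction = as.numeric(preds > 0.5)
print(table(Predicted = prediction, Actual = Y_test))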

Plot the feature importance set – We can find the top important features in the model by using the xgb.importance function.

# View feature importance from the learnt model
importance_matrix = xgb.importance(model = xgModel)
print(importance_matrix)


Plot the XGBoost Trees

Finally, we can plot the XGBoost trees using the xgb.plot.tree function. To limit the plot to a specific number of trees, we can use the n_first_tree argument. If NULL, all trees of the model are plotted.

# View the trees from a model
xgb.plot.tree(model = xgModel)

# View only the first tree in the XGBoost model
xgb.plot.tree(model = xgModel, n_first_tree = 1)


Conclusion

This post covered the popular XGBoost model along with a sample code in R programming to forecast the daily direction of the stock price change. Readers can catch some of our previous machine learning blogs (links given below). We will be covering more machine learning concepts and techniques in our coming posts.

Predictive Modeling in R for Algorithmic Trading
Machine Learning and Its Application in Forex Markets

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

Download Data Files

  • ACC.csv

Download

Read more

R Weekly Bulletin Vol – II

This week’s R bulletin will cover function calls, sorting a data frame, creating a time series object, and functions like is.na, na.omit, paste, help, rep, and the seq function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To show files – Ctrl+5
2. To show plots – Ctrl+6
3. To show packages – Ctrl+7

Problem Solving Ideas

Calling a function in an R script

If you want to call a custom-built function defined in another script, one can use the “exists” function along with the “source” function. See the example below:

Example:

if(!exists("daily_price_data", mode="function")) source("Stock price data.R")

In this case, the expression checks whether a function called “daily_price_data” is already defined in the current session. If it is not, the “Stock price data.R” script (which contains the function) is sourced, loading the function into the current script. We can then use the function any number of times in our script by providing the relevant arguments.

Convert dates from Google finance to a time series object

When we download stock price data from Google Finance, the “DATE” column shows dates in the yyyymmdd format. R does not treat this format as a date. To convert the dates from Google Finance into a time series object, one can use the ymd function from the lubridate package. The ymd function accepts dates in the form year, month, day. For dates in other formats, the lubridate package has functions like ydm, mdy, myd, dmy, and dym.

Example:

library(lubridate)
dt = ymd(20160523)
print(dt)
[1] "2016-05-23"

Sorting a data frame in an ascending or descending order

The arrange function from the dplyr package can be used to sort a data frame. The first argument is the data.frame and the next argument is the variable to sort by, either in an ascending or in a descending order.

In the example below, we create a two-column data frame comprising stock symbols and their respective percentage price changes. We then sort the Percent_Change column, first in an ascending order and then in a descending order.

Example:

library(dplyr)
# Create a dataframe
Ticker = c("UNITECH", "RCOM", "VEDL", "CANBK")
Percent_Change = c(2.3, -0.25, 0.5, 1.24)
df = data.frame(Ticker, Percent_Change)
print(df)

   Ticker Percent_Change
1 UNITECH           2.30
2    RCOM          -0.25
3    VEDL           0.50
4   CANBK           1.24

# Sort in an ascending order
df_ascending = arrange(df, Percent_Change)
print(df_ascending)

   Ticker Percent_Change
1    RCOM          -0.25
2    VEDL           0.50
3   CANBK           1.24
4 UNITECH           2.30

# Sort in a descending order
df_descending = arrange(df, desc(Percent_Change))
print(df_descending)

   Ticker Percent_Change
1 UNITECH           2.30
2   CANBK           1.24
3    VEDL           0.50
4    RCOM          -0.25

Functions Demystified

paste function

The paste function is a very useful function in R, used to concatenate (join) the arguments supplied to it. To control the separator placed between the arguments, use the “sep” argument.

Example 1: Combining a string of words and a function using paste

x = c(20:45)
paste("Mean of x is", mean(x), sep = " ")
[1] "Mean of x is 32.5"

Example 2: Creating a filename using the dirPath, symbol, and the file extension name as the arguments to the paste function.

dirPath = "C:/Users/MyFolder/"
symbol = "INFY"
filename = paste(dirPath, symbol, ".csv", sep = "")
print(filename)
[1] "C:/Users/MyFolder/INFY.csv"

is.na and na.omit function

The is.na function checks whether there are any NA values in the given data set, whereas the na.omit function removes all the rows containing NA values from the given data set.

Example: Consider a data frame comprising open and close prices for a stock, one row per date.

date = c(20160501, 20160502, 20160503, 20160504)
open = c(234, NA, 236.85, 237.45)
close = c(236, 237, NA, 238)
df = data.frame(date, open, close)
print(df)

      date   open close
1 20160501 234.00   236
2 20160502     NA   237
3 20160503 236.85    NA
4 20160504 237.45   238

Let us check whether the data frame has any NA values using the is.na function.

is.na(df)

      date  open close
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE

As you can see from the result, it has two NA values. Let us now use the na.omit function, and view the results.

na.omit(df)

      date   open close
1 20160501 234.00   236
4 20160504 237.45   238

As can be seen from the result, the rows containing NA values were omitted, and the resultant data frame now comprises non-NA rows only.

These functions are useful for checking for NA values in large data sets before applying computations. The presence of NA values can cause computations to give unwanted results, and hence such NA values need to be either removed or replaced with relevant values.
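Where dropping entire rows is too aggressive, the NA values can instead be replaced. A small sketch using the df created above, filling missing prices with the column means:

# Replace NA values with the column mean instead of dropping rows
df$open[is.na(df$open)] = mean(df$open, na.rm = TRUE)
df$close[is.na(df$close)] = mean(df$close, na.rm = TRUE)
print(df)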

rep and seq function

The rep function repeats the given argument for the specified number of times, while the seq function generates a sequence of numbers. Note that in the seq function the arguments are separated by commas, not by a colon.

Example 1:

rep("Strategy", times = 3)
[1] "Strategy" "Strategy" "Strategy"

rep(1:3, 2)
[1] 1 2 3 1 2 3

Example 2:

seq(1, 5)
[1] 1 2 3 4 5

seq(1, 5, 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

help and example function

The help function provides the documentation for a given topic, while the example function runs the associated examples. Here we look up the sum function:

help(sum)
example(sum)

To access the R help files associated with specific functions within a particular package, include the function name as the first argument to the help function along with the package name mentioned in the second argument.

Example:

help(barplot, package="graphics")

Alternatively, one can also type a question mark followed by the function name (e.g. ?barplot) and execute the command to know more about the function.

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Read more