R Weekly Bulletin Vol – V

This week’s R bulletin will cover topics like how to avoid for loops, add to or shorten an existing vector, and play a beep sound in R. We will also cover functions like the new.env function, readSeries, and the with and within functions. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To stop debugging – Shift+F8
2. To quit an R session (desktop only) – Ctrl+Q
3. To restart an R Session – Ctrl+Shift+0

Problem Solving Ideas

Avoiding a for loop by using the “with” function

A for loop can be slow in terms of execution speed when dealing with large data sets. For faster execution, one can use the “with” function as an alternative. The syntax of the with function is given below:

with(data, expr)

where, “data” is typically a data frame, and “expr” stands for one or more expressions to be evaluated using the contents of the data frame. If there is more than one expression, then the expressions need to be wrapped in curly braces.
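A small sketch of the multi-expression form (the data frame “d” below is made up for illustration; with returns the value of the last expression):

d = data.frame(a = 1:3, b = 4:6)
with(d, {
  s = a + b
  s * 2
})
[1] 10 14 18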

Example: Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both methods and time them using the system.time function. Note the time taken to execute the for loop versus the time to execute the with function in combination with the lagpad function.

library(quantmod)

# Using FOR Loop
system.time({

df = read.csv("NIFTY.csv")
df = df[,c(1,3:6)]

df$GapOpen = double(nrow(df))
for ( i in 2:nrow(df)) {
    df$GapOpen[i] = round(Delt(df$CLOSE[i-1],df$OPEN[i])*100,2)
}

print(head(df))

})

# Using with function + lagpad, instead of FOR Loop
system.time({

dt = read.csv("NIFTY.csv")
dt = dt[,c(1,3:6)]

lagpad = function(x, k) {
  c(rep(NA, k), x)[1:length(x)]
}

dt$PrevClose = lagpad(dt$CLOSE, 1)
dt$GapOpen_ = with(dt, round(Delt(PrevClose, OPEN)*100, 2))
print(head(dt))

})

Adding to an existing vector or shortening it

Adding to or shortening an existing vector can be done by assigning a new length to it. When we shorten a vector, the values at the end are removed, and when we extend it, missing values (NA) are added at the end.

Example:

# Shorten an existing vector
even = c(2,4,6,8,10,12)
length(even)
[1] 6

# The new length equals the number of elements required in the vector to be shortened.
length(even) = 3
print(even)
[1] 2 4 6

# Add to an existing vector
odd = c(1,3,5,7,9,11)
length(odd)
[1] 6

# The new length equals the number of elements required in the extended vector.
length(odd) = 8
odd[c(7,8)] = c(13,15)
print(odd)
[1] 1 3 5 7 9 11 13 15

Make R beep/play a sound

If you want R to play a sound/beep upon executing the code, you can do so using the “beepr” package. The beep function from the package plays a sound when the code gets executed. One also needs to install the “audio” package along with the “beepr” package.

install.packages("beepr")
install.packages("audio")
library(beepr)
beep()

One can select from the various sounds using the “sound” argument and by assigning one of the specified values to it.

beep(sound = 9)

One can keep repeating the message using beepr as illustrated in the example below (source: http://stackoverflow.com/).

Example:

work_complete <- function() {
  cat("Work complete. Press esc to sound the fanfare!!!\n")
  on.exit(beepr::beep(3))

  while (TRUE) {
    beepr::beep(4)
    Sys.sleep(1)
  }
}

work_complete()

One can also use the beep function to play a sound if an error occurs during the code execution.

options(error = function() {beep(sound = 5)})

Functions Demystified

new.env function

Environments act as a storehouse. When we create variables in R from the command prompt, they get stored in R’s global environment. To access the variables stored in the global environment, one can use the following expression:

head(ls(envir = globalenv()), 15)
[1] “df”  “dt”  “even”  “i”  “lagpad”  “odd”

If we want to store the variables in a specific environment, we can assign the variable to that environment or create a new environment to store it. To create a new environment we use the new.env function.

Example:

my_environment = new.env()

Once we create a new environment, assigning a variable to the environment can be done in multiple ways. Following are some of the methods:

Examples:

# By using double square brackets
my_environment[["AutoCompanies"]] = c("MARUTI", "TVSMOTOR", "TATAMOTORS")

# By using dollar sign operator
my_environment$AutoCompanies = c("MARUTI", "TVSMOTOR", "TATAMOTORS")

# By using the assign function
assign("AutoCompanies", c("MARUTI", "TVSMOTOR", "TATAMOTORS"), my_environment)

The variables existing in an environment can be listed using the ls function, and their values can be retrieved using the get function.

Example:

ls(envir = my_environment)
[1] “AutoCompanies”

get("AutoCompanies", my_environment)
[1] “MARUTI”  “TVSMOTOR”  “TATAMOTORS”

readSeries function

The readSeries function is part of the timeSeries package, and it reads a file in table format and creates a timeSeries object from it. The main arguments of the function are:

readSeries(file, header = TRUE, sep = ",", format)

where,
file: the filename of a spreadsheet dataset from which to import the data records.
header: a logical value indicating whether the file contains the names of the variables as its first line.
format: a character string with the format in POSIX notation specifying the timestamp format.
sep: the field separator used in the spreadsheet file to separate columns. By default, it is set as “;”.

Example:

library(timeSeries)

# Reading the NIFTY data using read.csv
df = read.csv(file = "NIFTY.csv")
print(head(df))

# Reading the NIFTY data and creating a time series object using readSeries
# function
df = readSeries(file = "NIFTY.csv", header = T, sep = ",", format = "%Y%m%d")
print(head(df))

with and within functions

The with and within functions apply an expression to a given data set and allow one to manipulate it. The within function even keeps track of the changes made, including adding or deleting elements, and returns a new object with the revised contents. The syntax for these two functions is given as:

with(data, expr)
within(data, expr)

where,
data – typically is a list or data frame, although other options exist for with.
expr – one or more expressions to evaluate using the contents of data, the commands must be wrapped in braces if there is more than one expression to evaluate.

# Consider the NIFTY data
df = as.data.frame(read.csv("NIFTY.csv"))
print(head(df, 3))

# Example of with function:
df$Average = with(df, apply(df[3:6], 1, mean))
print(head(df, 3))

# Example of within function:
df = within(df, {
   Average = apply(df[3:6], 1, mean)
})
print(head(df, 3))

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Read more

Mixture Models for Forecasting Asset Returns

By Brian Christopher

Asset return prediction is difficult. Most traditional time series techniques don’t work well for asset returns. One significant reason is that time series analysis (TSA) models require your data to be stationary. If it isn’t stationary, then you must transform your data until it is stationary.

That presents a problem.

In practice, we observe multiple phenomena that violate the rules of stationarity including non-linear processes, volatility clustering, seasonality, and autocorrelation. This renders traditional models mostly ineffective for our purposes.

What are our options?

There are many algorithms to choose from, but few are flexible enough to address the challenges of predicting asset returns:

  • mean and volatility changes through time
  • sometimes future returns are correlated with past returns, sometimes not
  • sometimes future volatility is correlated with past volatility, sometimes not
  • non-linear behavior

To recap, we need a model framework that is flexible enough to (1) adapt to non-stationary processes and (2) provide a reasonable approximation of the non-linear process that is generating the data.

Can Mixture Models offer a solution?

They have potential. First, they are based on several well-established concepts.

Markov models – These are used to model sequences where the future state depends only on the current state and not any past states. (memoryless processes)

Hidden Markov models – Used to model processes where the true state is unobserved (hidden) but there are observable factors that give us useful information to guess the true state.

Expectation-Maximization (E-M) – This is an algorithm that iterates between computing class parameters and maximizing the likelihood of the data given those parameters.

An easy way to think about applying mixture models to asset return prediction is to consider asset returns as a sequence of states or regimes. Each regime is characterized by its own descriptive statistics including mean and volatility. Example regimes could include low-volatility and high-volatility. We can also assume that asset returns will transition between these regimes based on probability. By framing the problem this way we can use mixture models, which are designed to try to estimate the sequence of regimes, each regime’s mean and variance, and the transition probabilities between regimes.

The most common is the Gaussian mixture model (GMM).

The underlying model assumption is that each regime is generated by a Gaussian process with parameters we can estimate. Under the hood, GMM employs an expectation-maximization algorithm to estimate regime parameters and the most likely sequence of regimes.
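As a rough sketch of this idea in R (the mclust package is one common GMM implementation; the “returns” vector is assumed to be a numeric series of daily returns):

library(mclust)

# Fit a 2-component Gaussian mixture as a proxy for
# low-volatility / high-volatility regimes
fit = Mclust(returns, G = 2)
summary(fit)              # component means, variances, mixing proportions
head(fit$classification)  # most likely regime for each observation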

GMMs are flexible, generative models that have had success approximating non-linear data. Generative models are special in that they try to mimic the underlying data process such that we can create new data that should look like original data.

In this upcoming webinar on Mixture Models, we design and evaluate an example strategy to find out if Gaussian mixture models are useful for return prediction, and specifically for identifying market bottoms.

Next Step

Join us for the webinar, “Can we use Mixture Models to Predict Market Bottoms?” on Tuesday 25th Apr, 8:30 AM MST | 8:00 PM IST. The webinar will be conducted by Brian Christopher, Quantitative researcher, Python developer, CFA charter holder, and founder of Blackarbs LLC, a quantitative research firm. Register now!


Read more

R Weekly Bulletin Vol – IV

This week’s R bulletin will cover topics like removing duplicate rows, finding row number, sorting a data frame in the same order, sorting a data frame in different order, and creating two tabs in an Excel workbook. We will also cover functions like Sys.time, julian, and the second & minute functions from the lubridate package. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To open a file in R – Ctrl+O
2. To open a new R script – Ctrl+Shift+N
3. To save an R script – Ctrl+S

Problem Solving Ideas

Remove duplicate rows in R

Consider the following data frame “df” comprising three rows of 1-minute open, high, and close (OHC) prices of a stock.

Example:

open = c(221, 221.25, 221.25) 
high = c(221.75, 221.45, 221.45) 
close = c(221.35, 221.2, 221.2) 
df = data.frame(open, high, close) 
print(df)

As can be seen from the output, the 2nd and the 3rd rows are duplicates. If we want to retain only the unique rows, we can use the duplicated function in the following manner:

df[!duplicated(df), ]

This will remove the 3rd duplicate row and retain the first 2 rows. In case we want only the duplicate rows, we can get them in the following way:

df[duplicated(df), ]

Return row number for a particular value in a column

To find the row number for a particular value in a column of a vector/data frame, we can use the “which” function.

Example 1:

day = c("Mon", "Tue", "Wed", "Thurs", "Fri", "Sat", "Sun") 
row_number = which(day == "Sat") 
print(row_number)
[1] 6

Example 2: Consider a two-column data frame “df” comprising the stock symbols and their respective closing prices for the day. To find the row number corresponding to the HCL stock, we call the “which” function on the Ticker column with its value selected as HCL.

Ticker = c("INFY", "TCS", "HCL") 
ClosePrice = c(2021, 2294, 910) 
data = data.frame(Ticker, ClosePrice) 

row_number = which(data$Ticker == "HCL") 
print(row_number)
[1] 3

Sorting a data frame by two columns in the same order

To sort a data frame by two columns in the same order, we can use the “order” function and the “with” function. Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. We first sort the data frame based on the Category column and then based on the percentage change in price.

The order function by default sorts in an ascending manner. Hence to sort both the columns in descending order we keep the decreasing argument as TRUE.

Example – Sorting a data frame by two columns

# Create a data frame 
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN") 
Category = c(1, 1, 1, 2) 
Percent_Change = c(2.3, -0.25, 0.5, 0.25) 

df = data.frame(Ticker, Category, Percent_Change) 
print(df)

# Sorting by Category column first and then the Percent_Change column: 
df_sort = df[with(df, order(Category, Percent_Change, decreasing = TRUE)), ] 
print(df_sort)

Sorting a data frame by two columns in different order

To sort a data frame by two columns in different order, we can use the “order” function along with the “with” function.

Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. Assume that we want to sort first in ascending order by the “Category” column, and then by the “Percent_Change” column in descending order.

The order function by default sorts in an ascending manner. Hence, to sort the “Category” column we mention it as the first variable in the order function without prepending it with any sign. To sort the “Percent_Change” column in a descending order we prepend it with a negative sign.

Example – Sorting a data frame by two columns

# Create a data frame 
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN") 
Category = c(1, 1, 1, 2) 
Percent_Change = c(2.3, -0.25, 0.5, 0.25) 

df = data.frame(Ticker, Category, Percent_Change) 
print(df)

# Sort by Category column first and then the Percent_Change column: 
df_sort = df[with(df, order(Category, -Percent_Change)), ] 
print(df_sort)

Creating two tabs in the output excel workbook

At times we want to write and save the multiple results generated after running our R script in an Excel workbook on separate worksheets. To do so, we can make use of the “append” argument in the write.xlsx function. To write to an Excel file, one must first install and load the xlsx package in R.

Example: In this example, we are creating two worksheets in the “Stocks.xlsx” workbook. In the worksheet named “Top Gainers”, we save the table_1 output, while in the “Top Losers” worksheet we save the table_2 output. To create a second worksheet in the same workbook, we set the append argument to TRUE in the second line.

write.xlsx(table_1, "Stocks.xlsx", sheetName = "Top Gainers", append = FALSE) 
write.xlsx(table_2, "Stocks.xlsx", sheetName = "Top Losers", append = TRUE)

Functions Demystified

Sys.time function

The Sys.time function gives the current date and time.

Example:

date_time = Sys.time() 
print(date_time)
[1] “2017-04-15 16:25:38 IST”

The function can be used to find the time required to run a code by placing this function at the start and the end of the code. The difference between the start and the end will give the time taken to execute the code.
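A minimal sketch of this timing pattern (the runif call simply stands in for the code block being timed):

start_time = Sys.time()
x = runif(1e6)  # code block to be timed
end_time = Sys.time()
print(end_time - start_time)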

julian function

The julian function is used to extract the Julian date, which is the number of days since January 1, 1970. Given a date, the syntax of the function is given as:

julian(date)

Example:

date = as.Date("2010-03-15") 
julian(date)
[1] 14683
attr(,”origin”)
[1] “1970-01-01”

Alternatively, one can use the as.integer function to get the same result, as shown below.
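as.integer(as.Date("2010-03-15"))
[1] 14683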

second and minute functions

These functions are part of the lubridate package. The second function retrieves/sets the seconds component of a date-time object, while the minute function retrieves/sets the minutes component. These functions accept date-time objects of classes such as POSIXct, POSIXlt, Date, zoo, xts, and timeSeries.

Example:

library(lubridate) 
# Retrieving the seconds component. We use the ymd_hms function to parse the given object.
x = ymd_hms("2016-06-01 12:23:45") 
second(x)
[1] 45

minute function:

library(lubridate) 
# Retrieving the minute from a date-time object 
x = ymd_hms("2016-06-01 12:23:45") 
minute(x)
[1] 23

# Retrieving the minute from a time object. We have used the hms function to parse the given object.
x = hms("15:29:06") 
minute(x)
[1] 29

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


Read more

R Best Practices: R you writing the R way!

By Milind Paradkar

Any programmer inevitably writes tons of code in his daily work. However, not all programmers inculcate the habit of writing clean code which can easily be understood by others. One of the reasons can be the lack of awareness among programmers of the best practices followed in writing a program. This is especially the case for novice programmers. In this post, we list some of the R programming best practices which will lead to improved code readability, consistency, and repeatability. Read on!

Best practices of writing in R

1) Describe your code – When you start coding, describe what the R code does in the very first line. For subsequent blocks of code, follow the same method of describing the blocks. This makes it easy for other people to understand and use the code.

Example:

# This code captures the 52-week high effect in stocks
# Code developed by Milind Paradkar

2) Load Packages – After describing your code in the first line, use the library function to list and load all the relevant packages needed to execute your code.

Example:

library(quantmod);  library(zoo); library(xts);
library(PerformanceAnalytics); library(timeSeries); library(lubridate);

3) Use Updated Packages – While writing your code ensure that you are using the latest updated R packages. To check the version of any R package you can use the packageVersion function.

Example:

packageVersion("TTR")
[1] ‘0.23.1’

4) Organize all source files in the same directory – Store all the necessary files that will be used/sourced in your code in the same directory. You can use the respective relative path to access them.

Example:

# Reading file using relative path
df = read.csv(file = "NIFTY.csv", header = TRUE)

# Reading file using full path
df = read.csv(file = "C:/Users/Documents/NIFTY.csv", header = TRUE)

5) Use a consistent style for data structure types – R programming language allows for different data structures like vectors, factors, data frames, matrices, and lists. Use a similar naming for a particular type of data structure. This will make it easy to recognize the similar data structures used in the code and to spot the problems easily.

Example:
You can name all different data frames used in your code by adding .df as the suffix.

aapl.df   = as.data.frame(read.csv(file = "AAPL.csv", header = TRUE))
amzn.df = as.data.frame(read.csv(file = "AMZN.csv", header = TRUE))
csco.df  = as.data.frame(read.csv(file = "CSCO.csv", header = TRUE))

6) Indent your code – Indentation makes your code easier to read, especially if there are multiple nested statements like for loops and if statements.

Example:

# Computing the Profit & Loss (PL) and the Equity
dt$PL = numeric(nrow(dt))
for (i in 1:(nrow(dt)-1)) {
   if (dt$Signal[i] == 1) {dt$PL[i+1] = dt$Close[i+1] - dt$Close[i]}
   if (dt$Signal[i] == -1){dt$PL[i+1] = dt$Close[i] - dt$Close[i+1]}
}

7) Remove temporary objects – For long codes running into thousands of lines, it is a good practice to remove temporary objects after they have served their purpose in the code. This helps ensure that R does not run into memory issues, as sketched below.
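A minimal sketch (tmp_df is a hypothetical temporary object):

# Remove the temporary object once it has served its purpose,
# then trigger garbage collection to release the memory
rm(tmp_df)
gc()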

8) Time the code – You can time your code using the system.time function. You can also use the same function to find out the time taken by different blocks of code. The function returns the amount of time taken in seconds to evaluate the expression or a block of code. Timing codes will help to figure out any bottlenecks and help speed up your codes by making the necessary changes in the script.

To find the time taken for different blocks we wrapped them in curly braces within the call to the system.time function.

The two important metrics returned by the function include:
i) User time – time charged to the CPU(s) for the code
ii) Elapsed time – the amount of time elapsed to execute the code in entirety

 Example:

# Generating random numbers
system.time({
mean_1 = rnorm(1e+06, mean = 0, sd = 0.8)
})

user    system    elapsed
0.40      0.00       0.45

9) Use vectorization – Vectorization results in faster execution of code, especially when dealing with large data sets. One can use statements like the ifelse statement or the with function for vectorization.

Example:
Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both methods (a for loop and the with function) and time them using the system.time function. Note the time taken to execute the for loop versus the time to execute the with function in combination with the lagpad function.

library(quantmod)
# Using FOR Loop
system.time({
df = read.csv("NIFTY.csv")
df = df[,c(1,3:6)]
df$GapOpen = double(nrow(df))
for ( i in 2:nrow(df)) {
df$GapOpen[i] = round(Delt(df$CLOSE[i-1],df$OPEN[i])*100,2)
}
print(head(df))
})

# Using with function + lagpad, instead of FOR Loop
system.time({
df = read.csv("NIFTY.csv")
df = df[,c(1,3:6)]

lagpad = function(x, k) {
c(rep(NA, k), x)[1 : length(x)]
}

df$PrevClose = lagpad(df$CLOSE, 1)
df$GapOpen_ = with(df, round(Delt(df$PrevClose,df$OPEN)*100,2))
print(head(df))
})

10) Folding code – Code folding is a feature wherein the R programmer can fold lines or sections of code. This allows hiding blocks of code whenever required, and makes it easier to navigate through lengthy scripts. Code folding can be done in two ways:
i) Automatic folding of codes
ii) User-defined folding of codes

Automatic folding of codes: RStudio automatically provides the flexibility to fold code. When a coder writes a function or a conditional block, RStudio automatically makes it foldable.

User-defined folding of codes: 
One can also fold an arbitrary group of lines by using Edit -> Folding -> Collapse, or by simply selecting the group of lines and pressing Alt+L.

User-defined folding can also be done via Code Sections:
To insert a new code section you can use the Code -> Insert Section command. Alternatively, any comment line which includes at least four trailing dashes (-), equal signs (=) or pound signs (#) automatically creates a code section.
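For instance, comment lines ending in four dashes create two foldable sections:

# Load the input data ----
df = read.csv("NIFTY.csv")

# Inspect the data ----
print(summary(df))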

11) Review and test your code rigorously – Once your code is ready, test it rigorously on different input parameters. Ensure that the logic used in statements like for loops, if statements, and ifelse statements is correct. It is a good idea to get your code reviewed by a colleague to ensure that the work is of high quality.

12) Don’t save your workspace – When you exit R, it asks whether you want to save your workspace. It is advisable not to save it and to start with a clean workspace in your next R session. Objects from previous R sessions can lead to errors which can be hard to debug.

These were some of the best practices of writing in R that one can follow to make your code easy to read and debug, and to ensure consistency.

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

Read more

R Weekly Bulletin Vol – III

This week’s R bulletin will cover topics like how to read select columns, the difference between boolean operators, converting a factor to a numeric, and changing the memory available to R. We will also cover functions like data, format, tolower, toupper, and the strsplit function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To change the working directory – Ctrl+Shift+K
2. To show history – Ctrl+4
3. To display environment – Ctrl+8

Problem Solving Ideas

Read a limited number of columns from a dataset

To read a specific set of columns from a dataset one can use the “fread” function from the data.table package. The syntax for the function to be used for reading/dropping a limited number of columns is given as:

fread(input, select, stringsAsFactors=FALSE)
fread(input, drop, stringsAsFactors=FALSE)

One needs to specify the column names or column numbers with the select parameter, as shown below.
Example:

library(data.table)
data = fread("data.txt")
print(data)


# Reading select columns
data = fread("data.txt", select = c("DATE", "OPEN", "CLOSE"))
print(data)


Alternatively, you can use the drop parameter to indicate which columns should not be read:
Example:

data = fread("data.txt", drop = c(2:3, 6))
print(data)


Difference between the boolean operators & and &&

In R we have the “&” Boolean operator, which is the equivalent of the AND operator in Excel. Similarly, we have “|” in R, which is the equivalent of OR in Excel. One also finds the Boolean operators && and || in R. Although these operators have the same meaning as & and |, they behave in a slightly different manner.

Difference between & and &&: The shorter form “&” is vectorized, meaning it can be applied on a vector. See the example below.
Example:

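A minimal reconstruction of the vectorized case (the vector x is assumed for illustration):

x = c(-1, 0, 1)
x >= 0 & x <= 0
[1] FALSE  TRUE FALSE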

The longer form && evaluates left to right, examining only the first element of each vector. Thus, in the example below, it considers only the first elements: -1 >= 0 (which is FALSE) and -1 <= 0 (which is TRUE). Evaluating FALSE && TRUE gives the final answer as FALSE.

Example:

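A minimal reconstruction using the same vector (note that recent versions of R raise an error when && is given operands of length greater than one; the behaviour described here applies to older versions):

x = c(-1, 0, 1)
x >= 0 && x <= 0
[1] FALSE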

Thus the && operator is to be used when we have vectors of length one. For vectors of length greater than one, we should use the short form.

How to convert a factor to a numeric without a loss of information

Consider the following factor “x” created using the factor, sample, and the runif functions. Let us try to convert this factor to a numeric.

Example:

x = factor(sample(runif(5), 6, replace = TRUE))
print(x)
[1] 0.660900804912671 0.735364762600511 0.479244165122509 0.397552277892828
[5] 0.660900804912671 0.660900804912671
4 Levels: 0.397552277892828 0.479244165122509 … 0.735364762600511

as.numeric(x)
[1] 3 4 2 1 3 3

As can be seen from the output, upon conversion we do not get the original values. R does not have a dedicated function to execute such a conversion error-free. To overcome this problem, one can use the following expression:

as.numeric(levels(x))[x]
[1] 0.6609008 0.7353648 0.4792442 0.3975523 0.6609008 0.6609008

One can also use the as.numeric(as.character(x)) expression; however, as.numeric(levels(x))[x] is slightly more efficient in converting the factor to a numeric.

Replace a part of URL using R

Suppose you want to scrape data from a site having different pages, and want to modify the URL using R. We can do this by using the parse_url function and the build_url function from the “httr” package. The following example illustrates the method to replace a part of a URL.

Example:

library(httr)
# In the following URL, the value 2 at the end signifies page 2 of the books
# category under the section "bestsellers".
testURL = "http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#2"

# Entering the URL and parsing it using the parse_url function
parseURL = parse_url(testURL)
print(parseURL)

$scheme
[1] “http”

$hostname
[1] “www.amazon.in”

$port
NULL

$path
[1] “gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b”

$query
NULL

$params
NULL

$fragment
[1] “2”

$username
NULL

$password
NULL

attr(,”class”)
[1] “url”

# Assigning the next page number (i.e. page no. 3) and creating the new URL
parseURL$fragment = 3
newURL <- build_url(parseURL)
print(newURL)
[1] “http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#3”

Increasing (or decreasing) the memory available to R processes

In R there is a memory.limit() function which gives you the amount of available memory in MB. At times, Windows users may get an error that R has run out of memory. In such cases, you may set the amount of available memory.

Example:

memory.limit(size=4000)

The unit is MB. You may increase this value up to 2GB or the maximum amount of physical RAM you have installed. If you have R already installed and subsequently install more RAM, you may have to reinstall R in order to take advantage of the additional capacity.

On 32-bit Windows, R can only use up to 3GB of RAM, regardless of how much you have installed. There is a 64-bit version of R for Windows available from REvolution Computing, which runs on 64-bit Windows and can use the entire RAM available.

Functions Demystified

data function

R packages often come bundled with different data sets, and these data sets can be accessed using the data function. This function loads specified data sets or lists the available data sets. To check the data sets available from all the R packages installed in your application, use the data() expression. This will display all the available data sets under each package.

Example:

data()

To view data sets from a particular package, specify the package name as the argument to the data function.

Example:

data(package = "lubridate")

To load a particular dataset, use the name of the dataset as the argument to the data function.

Example:

# loading the USDCHF dataset from the timeSeries package
library(timeSeries)
data(USDCHF)
x = USDCHF
print(head(USDCHF))

# loading the sample_matrix dataset from the xts package
library(xts)
data(sample_matrix)
x = sample_matrix
print(head(sample_matrix))


format function

The format function is used to format numbers so that they appear clean and neat in reports. The function takes a number of arguments to control the formatting of the result. The main arguments of the format function are given below.

format(x, trim, digits, decimal.mark, nsmall, big.mark, big.interval, small.mark,
small.interval, justify)

Where,
1) x: the number to be formatted.
2) trim: takes a logical value. If FALSE, it adds spaces to right-justify the result. If TRUE, it suppresses the leading spaces.
3) digits: how many significant digits of numeric values to display.
4) decimal.mark: the character to display as the decimal mark.
5) nsmall: the minimum number of digits after the decimal point.
6) big.mark: the mark inserted between intervals before the decimal point.
7) small.mark: the mark inserted between intervals after the decimal point.

Example:

format(34562.67435, digits=7, decimal.mark=".",big.mark=" ",
small.mark=",",small.interval=2)
[1] “34 562.67”

In the example above we chose to display 7 significant digits and used a point (.) as the decimal mark. The small.interval value of 2 groups the digits after the decimal point in blocks of two, separated by small.mark, while big.mark = " " inserts a space between groups of digits before the decimal point.

tolower, toupper, and strsplit

The tolower and toupper functions help convert strings to lowercase and uppercase respectively.
Example 1:

toupper("I coded a strategy in R")
[1] “I CODED A STRATEGY IN R”

tolower("I coded a strategy in R")
[1] “i coded a strategy in r”

Example 2:

df = as.data.frame(read.csv("NIFTY.csv"))
print(head(df, 4))

colnames(df)
[1] “DATE” “TIME” “CLOSE” “HIGH” “LOW” “OPEN” “VOLUME”

tolower(colnames(df))
[1] “date” “time” “close” “high” “low” “open” “volume”

The strsplit function is used to break/split strings at the specified split points. The function returns a list, and not a character vector or a matrix. In the example below, we are splitting the expression on the spaces.

Example:

strsplit("I coded a strategy in R", split = " ")
[[1]]
[1] "I" "coded" "a" "strategy" "in" "R"
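If a plain character vector is needed, the result can be wrapped in unlist:

unlist(strsplit("I coded a strategy in R", split = " "))
[1] "I" "coded" "a" "strategy" "in" "R"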

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


Read more

Forecasting Markets using eXtreme Gradient Boosting (XGBoost)

By Milind Paradkar

In recent years, machine learning has been generating a lot of curiosity for its profitable application to trading. Numerous machine learning models like linear/logistic regression, support vector machines, neural networks, tree-based models, etc. are being tried and applied in an attempt to analyze and forecast the markets. Researchers have found that some models enjoy a higher success rate than other machine learning models. eXtreme Gradient Boosting, also called XGBoost, is one such machine learning model that has received rave reviews from machine learning practitioners.

In this post, we will cover the basics of XGBoost, a winning model for many Kaggle competitions. We then attempt to develop an XGBoost stock forecasting model using the “xgboost” package in R.

Basics of XGBoost and related concepts

Developed by Tianqi Chen, the eXtreme Gradient Boosting (XGBoost) model is an implementation of the gradient boosting framework. Gradient Boosting algorithm is a machine learning technique used for building predictive tree-based models. (Machine Learning: An Introduction to Decision Trees).

Boosting is an ensemble technique in which new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.

The ensemble technique uses the tree ensemble model which is a set of classification and regression trees (CART). The ensemble approach is used because a single CART, usually, does not have a strong predictive power. By using a set of CART (i.e. a tree ensemble model) a sum of the predictions of multiple trees is considered.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction.

The objective of the XGBoost model is given as:

Obj = L + Ω

Where,
L is the loss function which controls the predictive power, and
Ω is the regularization component which controls simplicity and overfitting

The loss function (L) which needs to be optimized can be Root Mean Squared Error for regression, Logloss for binary classification, or mlogloss for multi-class classification.

The regularization component (Ω) is dependent on the number of leaves and the prediction score assigned to the leaves in the tree ensemble model.

It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. The Gradient boosting algorithm supports both regression and classification predictive modeling problems.

Sample XGBoost model:

We will use the “xgboost” R package to create a sample XGBoost model. You can refer to the documentation of the “xgboost” package here.

Install and load the xgboost library –

We install the xgboost library using the install.packages function. To load this package we use the library function. We also load other relevant packages required to run the code.

install.packages("xgboost")

# Load the relevant libraries
library(quantmod); library(TTR); library(xgboost);


Create the input features and target variable – We take the 5-year OHLC and volume data of a stock and compute the technical indicators (input features) using this dataset. The indicators computed include Relative Strength Index (RSI), Average Directional Index (ADX), and Parabolic SAR (SAR). We create a lag in the computed indicators to avoid the look-ahead bias. This gives us our input features for building the XGBoost model. Since this is a sample model, we have included only a few indicators to build our set of input features.

# Read the stock data 
symbol = "ACC"
fileName = paste(getwd(),"/",symbol,".csv",sep="") ; 
df = as.data.frame(read.csv(fileName))
colnames(df) = c("Date","Time","Close","High", "Low", "Open","Volume")

# Define the technical indicators to build the model 
rsi = RSI(df$Close, n=14, maType="WMA")
adx = data.frame(ADX(df[,c("High","Low","Close")]))
sar = SAR(df[,c("High","Low")], accel = c(0.02, 0.2))
trend = df$Close - sar

# create a lag in the technical indicators to avoid look-ahead bias 
rsi = c(NA,head(rsi,-1)) 
adx$ADX = c(NA,head(adx$ADX,-1)) 
trend = c(NA,head(trend,-1))

Our objective is to predict the direction of the daily stock price change (Up/Down) using these input features. This makes it a binary classification problem. We compute the daily price change and assign a value of 1 if the change is positive, and 0 if it is negative.

# Create the target variable
price = df$Close-df$Open
class = ifelse(price > 0,1,0)


Combine the input features into a matrix – The input features and the target variable created in the above step are combined to form a single matrix. We use the matrix structure in the XGBoost model since the xgboost library allows data in the matrix format.

# Create a Matrix
model_df = data.frame(class,rsi,adx$ADX,trend)
model = matrix(c(class,rsi,adx$ADX,trend), nrow=length(class))
model = na.omit(model)
colnames(model) = c("class","rsi","adx","trend")


Split the dataset into training data and test data – In the next step, we split the dataset into training and test data. Using this training and test dataset we create the respective input features set and the target variable.

# Split data into train and test sets 
train_size = 2/3
breakpoint = floor(nrow(model) * train_size)

training_data = model[1:breakpoint,]
test_data = model[(breakpoint+1):nrow(model),]

# Split data training and test data into X and Y
X_train = training_data[,2:4] ; Y_train = training_data[,1]
class(X_train)[1]; class(Y_train)

X_test = test_data[,2:4] ; Y_test = test_data[,1]
class(X_test)[1]; class(Y_test)


Train the XGBoost model on the training dataset –

We use the xgboost function to train the model. Its key arguments are described below.

The data argument in the xgboost function is for the input features dataset. It accepts a matrix, a dgCMatrix, or a local data file. The nrounds argument specifies the maximum number of iterations (i.e. the number of trees added to the model). The obj argument accepts a customized objective function, which returns the gradient and second-order gradient for a given prediction and dtrain.

# Train the xgboost model using the "xgboost" function
dtrain = xgb.DMatrix(data = X_train, label = Y_train)
xgModel = xgboost(data = dtrain, nrounds = 5, objective = "binary:logistic")


Output – The output is the classification error on the training data set.

Cross-validation

We can also use the cross-validation function of xgboost, i.e. xgb.cv. In this case, the original sample is randomly partitioned into nfold equal-sized subsamples. Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining (nfold – 1) subsamples are used as training data. The cross-validation process is then repeated nfold times, with each of the nfold subsamples used exactly once as the validation data.

# Using cross validation
dtrain = xgb.DMatrix(data = X_train, label = Y_train)
cv = xgb.cv(data = dtrain, nrounds = 10, nfold = 5, objective = "binary:logistic")


Output – The xgb.cv returns a data.table object containing the cross validation results.

Make predictions on the test data

To make predictions on the unseen data set (i.e. the test data), we apply the trained XGBoost model on it which gives a series of numbers.

# Make the predictions on the test data
preds = predict(xgModel, X_test)

# Determine the size of the prediction vector
print(length(preds))

# Limit display of predictions to the first 6
print(head(preds))


These numbers do not look like binary classification {0, 1}. We therefore perform a simple transformation before using these results. In the example code shown below, we compare each predicted number to a threshold of 0.5. The threshold value can be changed depending upon the objective of the modeler and the metrics (e.g. F1 score, precision, recall) that the modeler wants to track and optimize.

prediction = as.numeric(preds > 0.5)
print(head(prediction))


Measuring model performance

Different evaluation metrics can be used to measure the model performance. In our example, we will compute a simple metric, the average error. It compares the predicted score with the threshold of 0.50.

For example: If the predicted score is less than 0.50, then the (preds > 0.5) expression gives a value of 0. If this value is not equal to the actual result from the test data set, then it is taken as a wrong result.

We compare all the preds with the respective data points in the Y_test and compute the average error. The code for measuring the performance is given below. Alternatively, we can use hit rate or create a confusion matrix to measure the model performance.

# Measuring model performance
error_value = mean(as.numeric(preds > 0.5) != Y_test)
print(paste("test-error=", error_value))


Plot the feature importance set – We can find the top important features in the model by using the xgb.importance function.

# View feature importance from the learnt model
importance_matrix = xgb.importance(model = xgModel)
print(importance_matrix)


Plot the XGBoost Trees

Finally, we can plot the XGBoost trees using the xgb.plot.tree function. To limit the plot to a specific number of trees, we can use the n_first_tree argument. If NULL, all trees of the model are plotted.

# View the trees from a model
xgb.plot.tree(model = xgModel)

# View only the first tree in the XGBoost model
xgb.plot.tree(model = xgModel, n_first_tree = 1)


Conclusion

This post covered the popular XGBoost model along with a sample code in R programming to forecast the daily direction of the stock price change. Readers can catch some of our previous machine learning blogs (links given below). We will be covering more machine learning concepts and techniques in our coming posts.

Predictive Modeling in R for Algorithmic Trading
Machine Learning and Its Application in Forex Markets

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

Download Data Files

  • ACC.csv


Read more

R Weekly Bulletin Vol – II

This week’s R bulletin will cover function calls, sorting a data frame, creating a time series object, and functions like is.na, na.omit, paste, help, rep, and seq. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To show files – Ctrl+5
2. To show plots – Ctrl+6
3. To show packages – Ctrl+7

Problem Solving Ideas

Calling a function in an R script

If you want to call a custom-built function defined in another script, you can use the “exists” function along with the “source” function. See the example below:

Example:

if(!exists("daily_price_data", mode="function")) source("Stock price data.R")

In this case, the expression checks whether a function called “daily_price_data” exists in the current workspace, and if it does not, it sources the “Stock price data.R” script, which loads the function. We can then use the function any number of times in our script by providing the relevant arguments.

Convert dates from Google finance to a time series object

When we download stock price data from Google Finance, the “DATE” column shows dates in the yyyymmdd format. This format is not treated as a date or time series object in R. To convert the dates from Google Finance into a time series object, one can use the ymd function from the lubridate package. The ymd function accepts dates in the form year, month, day. For dates in other formats, the lubridate package has functions like ydm, mdy, myd, dmy, and dym, which can be used for the conversion.

Example:

library(lubridate)
dt = ymd(20160523)
print(dt)
[1] “2016-05-23”

Sorting a data frame in an ascending or descending order

The arrange function from the dplyr package can be used to sort a data frame. The first argument is the data.frame and the next argument is the variable to sort by, either in an ascending or in a descending order.

In the example below, we create a two-column data frame comprising stock symbols and their respective percentage price changes. We then sort the Percent_Change column, first in ascending order and then in descending order.

Example:

library(dplyr)
# Create a dataframe
Ticker = c("UNITECH", "RCOM", "VEDL", "CANBK")
Percent_Change = c(2.3, -0.25, 0.5, 1.24)
df = data.frame(Ticker, Percent_Change)
print(df)

   Ticker Percent_Change
1 UNITECH           2.30
2    RCOM          -0.25
3    VEDL           0.50
4   CANBK           1.24

# Sort in an ascending order
df_ascending = arrange(df, Percent_Change)
print(df_ascending)

   Ticker Percent_Change
1    RCOM          -0.25
2    VEDL           0.50
3   CANBK           1.24
4 UNITECH           2.30

# Sort in a descending order
df_descending = arrange(df, desc(Percent_Change))
print(df_descending)

   Ticker Percent_Change
1 UNITECH           2.30
2   CANBK           1.24
3    VEDL           0.50
4    RCOM          -0.25

Functions Demystified

paste function

The paste function is a very useful function in R and is used to concatenate (join) the arguments supplied to it. To control the separator placed between the arguments, use the “sep” argument.

Example 1: Combining a string of words and a function using paste

x = c(20:45)
paste("Mean of x is", mean(x), sep = " ")
[1] “Mean of x is 32.5”

Example 2: Creating a filename using the dirPath, symbol, and the file extension name as the arguments to the paste function.

dirPath = "C:/Users/MyFolder/"
symbol = "INFY"
filename = paste(dirPath, symbol, ".csv", sep = "")
print(filename)
[1] “C:/Users/MyFolder/INFY.csv”

is.na and na.omit function

The is.na function checks whether there are any NA values in the given data set, whereas the na.omit function removes the rows containing NA values from the given data set.

Example: Consider a data frame comprising open and close prices for a stock, corresponding to each date.

date = c(20160501, 20160502, 20160503, 20160504)
open = c(234, NA, 236.85, 237.45)
close = c(236, 237, NA, 238)
df = data.frame(date, open, close)
print(df)

      date   open close
1 20160501 234.00   236
2 20160502     NA   237
3 20160503 236.85    NA
4 20160504 237.45   238

Let us check whether the data frame has any NA values using the is.na function.

is.na(df)

      date  open close
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE

As you can see from the result, it has two NA values. Let us now use the na.omit function, and view the results.

na.omit(df)

      date   open close
1 20160501 234.00   236
4 20160504 237.45   238

As can be seen from the result, the rows having NA values got omitted, and the resultant data frame now comprises only non-NA values.

These functions can be used to check for any NA values in large data sets on which we wish to apply some computations. The presence of NA values can cause the computations to give unwanted results, and hence such NA values need to be either removed or replaced by relevant values.

rep and seq function

The rep function repeats the arguments for the specified number of times, while the seq function is used to form a required sequence of numbers. Note that in the seq function we use a comma and not a colon to separate the arguments.

Example 1:

rep("Strategy", times = 3)
[1] “Strategy” “Strategy” “Strategy”

rep(1:3, 2)
[1] 1 2 3 1 2 3

Example 2:

seq(1, 5)
[1] 1 2 3 4 5

seq(1, 5, 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

help and example function

The help function provides information on a given topic or function, while the example function runs the examples associated with it. The calls below show the help and the examples for the sum function.

help(sum)
example(sum)

To access the R help files associated with specific functions within a particular package, include the function name as the first argument to the help function along with the package name mentioned in the second argument.

Example:

help(barplot, package="graphics")

Alternatively, one can also type a question mark followed by the function name (e.g. ?barplot) and execute the command to know more about the function.

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Read more

R Weekly Bulletin Vol – I


We are starting with R weekly bulletins which will contain some interesting ways and methods to write codes in R and solve bugging problems. We will also cover R functions and shortcut keys for beginners. We understand that there can be more than one way of writing a code in R, and the solutions listed in the bulletins may not be the sole reference point for you. Nevertheless, we believe that the solutions listed will be helpful to many of our readers. Hope you like our R weekly bulletins. Enjoy reading them!

Shortcut Keys

  1. To move cursor to R Source Editor – Ctrl+1
  2. To move cursor to R Console – Ctrl+2
  3. To clear the R console – Ctrl+L

Problem Solving Ideas

Creating user input functionality

To create user input functionality in R, we can make use of the readline function. This gives us the flexibility to set the input values for variables of our choice when the code is run.

Example: Suppose that we have coded a backtesting strategy. We want to have the flexibility to choose the backtest period. To do so, we can create a user input “n”, signifying the backtest period in years, and add the line shown below at the start of the code.

When the code is run, it will prompt the user to enter the value for “n”. Upon entering the value, the R code will get executed for the set period and produce the desired output.

n = readline(prompt = "Enter the backtest period in years: ")

Refresh a code every x seconds

To refresh a code every x seconds we can use the while loop and the Sys.sleep function. The while loop keeps executing the enclosed block of commands as long as the condition remains satisfied. We enclose the code in the while statement and keep the condition as TRUE, so it will keep looping. At the end of the code, we add the Sys.sleep function and specify the delay time in seconds. This way the code will get refreshed every “x” seconds.

Example: In this example, we initialize the x value to zero. The code is refreshed every 1 second, and it will keep printing the value of x. One can hit the escape button on the keyboard to terminate the code.

x = 0
while (TRUE) {
  x = x + 1
  print(x)
  Sys.sleep(1)
}

Running multiple R scripts sequentially

To run multiple R scripts, one can have a main script which will contain the names of the scripts to be run. Running the main script will lead to the execution of the other R scripts. Assume the name of the main script is “NSE Stocks.R”. In this script, we will mention the names of the scripts we wish to run within the source function. In this example, we wish to run the “Top gainers.R” and the “Top losers.R” scripts. These will be part of “NSE Stocks.R” as shown below, and we run the main script to execute these 2 scripts.

source("Top gainers.R")
source("Top losers.R")

Enclosing the R script name within the “source” function causes R to accept its input from the named file. Input is read and parsed from that file until the end of the file is reached, then the parsed expressions are evaluated sequentially in the chosen environment. Alternatively, one can also place the R script names in a vector, and use the sapply function.

Example:

filenames = c("Top gainers.R", "Top losers.R") 
sapply(filenames, source)

Converting a date in the American format to Standard date format

The American date format is of the type mm/dd/yyyy, whereas the ISO 8601 standard format is yyyy-mm-dd. To convert a date from American format to the Standard date format we will use the as.Date function along with the format function. The example below illustrates the method.

Example:

# Date in American format
dt = "07/24/2016"

# If we call the as.Date function on this date, it will throw an error,
# as the default format assumed by the as.Date function is yyyy-mm-dd.
as.Date(dt)

Error in charToDate(x): character string is not in a standard unambiguous format

# Correct way of formatting the date 
as.Date(dt, format = "%m/%d/%Y")
[1] “2016-07-24”

How to remove all the existing files from a folder

To remove all the files from a particular folder, one can use the unlink function. Specify the path of the folder as the argument to the function, with a forward slash and an asterisk added at the end of the path. The syntax is given below.

unlink("path/*")

Example:

unlink("C:/Users/Documents/NSE Stocks/*")

This will remove all the files present in the “NSE Stocks” folder.

Functions Demystified

write.csv function

If you want to save a data frame or matrix in a csv file, R provides for the write.csv function. The syntax for the write.csv function is given as:

write.csv(x, file="filename", row.names=FALSE)

If we specify row.names=TRUE, the function prepends each row with a label taken from the row.names attribute of your data. If your data doesn’t have row names, the function just uses the row numbers. The column header line is written by default, and write.csv always writes it; to suppress the headers, use write.table with col.names=FALSE (see the sketch after the example below).

Example:

# Create a data frame 
Ticker = c("PNB","CANBK","KTKBANK","IOB") 
Percent_Change = c(2.30,-0.25,0.50,1.24) 
df = data.frame(Ticker,Percent_Change) 
write.csv(df, file="Banking Stocks.csv", row.names=FALSE)

This will write the data contained in the “df” dataframe to the “Banking Stocks.csv” file. The file gets saved in the R working directory.
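Using the same data frame, a small sketch of writing the file without the header line via write.table:

# write.csv always writes column headers; write.table allows suppressing them
write.table(df, file = "Banking Stocks.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)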

fix function

The fix function opens the underlying code of a function, provided as an argument to it, in an editor where it can be viewed or edited.

Example:

fix(sd)

The underlying code for the standard deviation function is as shown below. This is displayed when we execute the fix function with “sd” as the argument.

function (x, na.rm = FALSE) 
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm))

download.file function

The download.file function helps download a file from a website. This could be a webpage, a csv file, an R file, etc. The syntax for the function is given as:

download.file(url, destfile)

where,
url – the Uniform Resource Locator (URL) of the file to be downloaded
destfile – the location to save the downloaded file, i.e. path with a file name

Example: In this example, the function will download the file from the path given in the “url” argument, and save it on the D drive within the “Skills” folder with the name “wacc.xls”.

url = "http://www.exinfm.com/excel%20files/betawacc.xls"
destfile = "D:/Skills/wacc.xls"
download.file(url, destfile)

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Read more

Architecture Explained of R Package for IB – IBrokers

Architecture Explained of R Package for IB - IBrokers

By Milind Paradkar

In our last post on using the IBrokers package, we introduced our readers to some of the basic functions from the IBrokers package which are used to retrieve market data, view account info, and execute/modify orders via R. This post will cover the structure of the IBrokers package, which will enable R users to build their custom trading strategies and get them executed via the Interactive Brokers Trader Workstation (TWS).

Overview of the Interactive Brokers API Architecture

Before we explain the underlying structure of the IBrokers package, let us take an overview of the Interactive Brokers API architecture. Interactive Brokers provides its API program which can be run on Windows, Linux, and MacOS. The API makes a connection to the IB TWS. The TWS, in turn, is connected to the IB data centers and thus, all the communication is routed via the TWS.

The IBrokers R package enables a user to write his strategy in R and have it executed via the IB TWS. Below is the flow structure diagram.

[Figure: flow structure of the Interactive Brokers API architecture]

Getting data from the TWS

To retrieve data from the IB TWS, the IBrokers R package includes five important functions.

  • reqContractDetails: retrieves detailed product information.
  • reqMktData: retrieves real-time market data.
  • reqMktDepth: retrieves real-time order book data.
  • reqRealTimeBars: retrieves real-time OHLC data.
  • reqHistoricalData: retrieves historical data.

In addition to these functions, there are helper functions that make it easy to construct the Contract objects required by the data functions above. These helper functions include:

  • twsContract: create a general Contract object.
  • twsEquity/twsSTK: wrapper to create equity Contract objects.
  • twsOption/twsOPT: wrapper to create option Contract objects.
  • twsFuture/twsFUT: wrapper to create futures Contract objects.
  • twsFOP: wrapper to create futures option Contract objects.
  • twsCurrency/twsCASH: wrapper to create currency Contract objects.

Example:

tws = twsConnect()              # connect to a running TWS instance
reqMktData(tws, twsSTK("AAPL")) # stream real-time market data for AAPL
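
Along the same lines, reqHistoricalData pulls historical bars; a minimal sketch (by default the package fetches roughly one month of daily bars):

library(IBrokers)
tws = twsConnect()
aapl = reqHistoricalData(tws, twsEquity("AAPL")) # daily OHLC bars
head(aapl)
twsDisconnect(tws)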


Real-time Data Model Structure

When a data function is used to access market data streams, the data streams received by the TWS API follow a certain path that enables the API to bucket them into the relevant message types. Shown below is the list of arguments of the reqMktData function.

[Figure: arguments of the reqMktData function]

Source: Algorithmic Trading in R – Malcolm Sherrington
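
To inspect the full argument list in your own session:

library(IBrokers)
args(reqMktData)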

Real-time Data Model

In the following sections, we will see how this data model works and how the arguments of the real-time data functions (e.g. reqMktData) can be customized to create user-defined automated trading programs in R.

Using the CALLBACK Argument

The data functions like reqMktData, reqMktDepth, and reqRealTimeBars all have a special CALLBACK argument. By default, this argument calls the twsCALLBACK function from the IBrokers package.

The general logic of the twsCALLBACK function is to receive the header of each incoming message from the TWS. This is then passed to the processMsg function, along with the eWrapper object. The eWrapper object can maintain state data (such as prices) and has functions for managing all incoming message types from the TWS. Once the processMsg call returns, another cycle of the infinite loop occurs.

In the example of the incoming messages shown below, we have circled a single message in green (1 6 1 4 140.76 1 0). The first digit (i.e. 1) is the header and the remaining numbers (i.e. 6 1 4 140.76 1 0) constitute the body of the message.

[Figure: incoming messages from the reqMktData function call]

Each message received will invoke the appropriately named eWrapper callback, depending on the message type. By default, when nothing is specified, the code calls the default method, which prints the results to the screen via cat.

Example with the default method:

[Figure: reqMktData call using the default method, and the resulting console output]

Setting CALLBACK = NULL will send raw message-level data to cat, which in turn uses its file argument to either write the data to standard output or redirect it via an open connection, a file, or a pipe.

Example with the CALLBACK argument set to NULL:

[Figure: reqMktData call with CALLBACK = NULL, and the resulting raw message output]
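
A rough sketch of the two variants just described, assuming an open connection tws (the file argument is simply passed through to cat, as noted above):

# Default callback: parsed results are printed to the screen via cat
reqMktData(tws, twsSTK("AAPL"))

# CALLBACK = NULL: raw message-level data; redirect it to a file by
# passing cat's file argument through reqMktData
reqMktData(tws, twsSTK("AAPL"), CALLBACK = NULL, file = "raw_messages.txt")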

Callbacks, via CALLBACK and eventWrapper, are designed to allow for R-level processing of the real-time data stream. Callbacks help customize the output (i.e. the incoming results), which can be used to create automated trading programs in R based on user-defined criteria.

Internal code of the twsCALLBACK function

Inside the CALLBACK (i.e. the twsCALLBACK function) is a loop that fetches the incoming message type and calls the processMsg function for each new message.

[Figure: internal code snippet of the twsCALLBACK function]
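
A schematic of that loop, simplified from the package source (not the verbatim code):

twsCALLBACK_sketch = function(twsCon, eWrapper, timestamp, file, ...) {
  con = twsCon[[1]]                        # the underlying socket connection
  while (isConnected(twsCon)) {            # loop for as long as we are connected
    socketSelect(list(con), FALSE, NULL)   # block until data arrives
    curMsg = readBin(con, "character", 1L) # read the incoming message header
    processMsg(curMsg, con, eWrapper, timestamp, file, ...) # dispatch
  }
}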

The ProcessMsg Function

The processMsg function internally is a series of if-else statements that branch according to a known incoming message type. A snippet of the internal code structure of the processMsg function is shown below.

[Figure: internal code snippet of the processMsg function]
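
Schematically, the branching looks like this (simplified, not the verbatim package code; .twsIncomingMSG is the package's table of message-type codes):

processMsg_sketch = function(curMsg, con, eWrapper, timestamp, file, ...) {
  if (curMsg == .twsIncomingMSG$TICK_PRICE) {
    msg = readBin(con, "character", 6L)   # read the message body
    eWrapper$tickPrice(curMsg, msg, timestamp, file, ...)
  } else if (curMsg == .twsIncomingMSG$TICK_SIZE) {
    msg = readBin(con, "character", 4L)
    eWrapper$tickSize(curMsg, msg, timestamp, file, ...)
  }
  # ... one else-if branch per known incoming message type
}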

The eWrapper Closure

[Figure: creating the eWrapper closure in twsCALLBACK using the eWrapper function]

The eWrapper function creates an eWrapper closure to allow for custom incoming message management. The eWrapper closure contains a list of functions to manage all incoming message types; each message type has a corresponding function in the eWrapper designed to handle its particular details.

[Figure: list of functions contained in the eWrapper closure]

The data environment is .Data, with accessor methods get.Data, assign.Data, and remove.Data. These methods can be called from the closure object, e.g. eWrapper$get.Data, eWrapper$assign.Data, etc. By creating an instance of eWrapper (accomplished by calling eWrapper() as a function), one can then modify any or all of the particular methods embedded in the object.
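
A small sketch of this idea; the tickPrice handler body is illustrative only:

ew = eWrapper()                   # create an instance of the closure
ew$assign.Data("lastBid", 140.76) # store state in the .Data environment
ew$get.Data("lastBid")            # retrieve it

# Swap in a custom handler for incoming price messages
ew$tickPrice = function(curMsg, msg, timestamp, file, ...) {
  # custom processing of price ticks goes here
}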

Summarizing the Internal Structure of the IBrokers Package

We have seen above how the internal structure of the IBrokers package works. To summarize the entire mechanism, it can be depicted as shown below:

Request to TWS for data -> twsCALLBACK -> processMsg -> eWrapper

Real-Time Data Model

We will use the snapShotTest code example published by Jeff Ryan. The code below modifies the twsCALLBACK function; this modified callback is then used as an argument to the reqMktData function. The output produced with the modified callback is more convenient to read than the normal reqMktData output.

[Figure: the snapShotTest code modifying the twsCALLBACK function]

Another change in the snapShotTest code is to record any error messages from the IB API in a separate file (under the default method, the eWrapper prints such error messages to the console). To do this, we create a different wrapper using eWrapper(debug = NULL). Once we construct it, we can assign its errorMessage() function to the eWrapper we actually use.
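
A sketch of that step; myWrapper is a hypothetical stand-in for the eWrapper used by the modified callback:

errWrap = eWrapper(debug = NULL) # wrapper whose handlers emit raw output
myWrapper = eWrapper()           # hypothetical eWrapper used by the snapshot callback
myWrapper$errorMessage = errWrap$errorMessage # errors bypass the console printer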


We then apply a simple trade logic which generates a buy signal if the last bid price is greater than a pre-specified threshold value; a sketch follows below. One can similarly tweak the logic of twsCALLBACK to create custom callbacks based on the requirements of one's trading strategy.
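
A hypothetical sketch of such logic; the threshold, the "lastBid" key, and the order details are all assumptions:

threshold = 141.00
lastBid = as.numeric(myWrapper$get.Data("lastBid")) # assumes the callback stores bids

if (!is.na(lastBid) && lastBid > threshold) {
  # place a market buy order via the IBrokers order functions
  placeOrder(tws, twsSTK("AAPL"),
             twsOrder(reqIds(tws), action = "BUY",
                      totalQuantity = "100", orderType = "MKT"))
}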

[Figure: order getting filled in the IB Trader Workstation (TWS)]

Conclusion

To conclude, the post gave a detailed overview of the architecture of the IBrokers package, which is the R implementation of the Interactive Brokers API. Interactive Brokers, in collaboration with QuantInsti™, hosted a webinar, "Trading using R on Interactive Brokers", which was held on 2nd March 2017 and conducted by Anil Yadav, Director at QuantInsti™. You can click on the link provided above to learn more about the IBrokers package.

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

Read more

Mean Reversion in Time Series

By Devang Singh

Time series data is simply a collection of observations generated over time. For example, the speed of a race car at each second, daily temperature, weekly sales figures, stock returns per minute, etc. In the financial markets, a time series tracks the movement of specific data points, such as a security's price, over a specified period of time, with data points recorded at regular intervals. A time series can be generated for any variable that is changing over time. Time series analysis comprises techniques for analyzing time series data in an attempt to extract useful statistics and identify characteristics of the data. Time series forecasting is the use of a mathematical model to predict future values based on previously observed values in the time series data.

The graph shown below represents the daily closing price of Aluminium futures over a period of 93 trading days, which is a time series.

[Figure: daily closing prices of Aluminium futures over 93 trading days]

Mean Reversion

Mean reversion is the theory which suggests that prices, returns, or various economic indicators tend to move back to their historical average or mean over time. This theory has led to many trading strategies which involve the purchase or sale of a financial instrument whose recent performance has greatly differed from its historical average without any apparent reason. For example, suppose the price of gold increases on average by INR 10 every day, and one day it rises by INR 40 without any significant news or factor behind the move. By the mean reversion principle, we can expect the price of gold to fall in the coming days so that the average change in the price of gold remains the same. In such a case, the mean reversionist would sell gold, speculating that the price will fall in the coming days, and then profit by buying back the same amount of gold at the lower price.

A mean-reverting time series has been plotted below; the horizontal black line represents the mean, and the blue curve is the time series, which tends to revert back to the mean.

[Figure: a mean-reverting time series fluctuating around its mean]

A collection of random variables is defined to be a stochastic or random process. A stochastic process is said to be stationary if its mean and variance are time invariant (constant over time). A stationary time series will be mean reverting in nature, i.e. it will tend to return to its mean, and fluctuations around the mean will have roughly equal amplitudes. A stationary time series will not drift too far away from its mean because of its finite, constant variance. A non-stationary time series, in contrast, will have a time-varying mean or a time-varying variance or both, and will not tend to revert back to its mean.

In the financial industry, traders take advantage of stationary time series by placing orders when the price of a security deviates considerably from its historical mean, speculating that the price will revert back to it. They start by testing for stationarity in a time series. Financial data points, such as prices, are often non-stationary, i.e. they have means and variances that change over time. Non-stationary data tends to be unpredictable and cannot be modeled or forecasted reliably.

A non-stationary time series can be converted into a stationary time series by either differencing or detrending the data. A random walk (the movements of an object or changes in a variable that follow no discernible pattern or trend) can be transformed into a stationary series by differencing (computing the difference between Yt and Yt-1). The disadvantage of this process is that it results in losing one observation each time the difference is computed. A non-stationary time series with a deterministic trend can be converted into a stationary time series by detrending (removing the trend), which does not result in a loss of observations. A linear combination of two non-stationary time series can also result in a stationary, mean-reverting time series; time series (integrated of at least order 1) which can be linearly combined to give a stationary series are said to be cointegrated.

Shown below is a plot of a non-stationary time series with a deterministic trend (Yt = α + βt + εt), represented by the blue curve, and its detrended stationary time series (Yt - βt = α + εt), represented by the red curve.

[Figure: trending series (blue) and its detrended, stationary counterpart (red)]


Trading Strategies based on Mean Reversion

One of the simplest mean reversion related trading strategies is to find the average price over a specified period, followed by determining a high-low range around the average value from where the price tends to revert back to the mean. Trading signals are generated when these ranges are crossed: a sell order when the range is crossed on the upper side, and a buy order when it is crossed on the lower side. The trader takes contrarian positions, i.e. goes against the movement of prices (or the trend), expecting the price to revert back to the mean. This strategy looks too good to be true, and it is; it faces severe obstacles. The lookback period of the moving average might contain a few abnormal prices which are not characteristic of the dataset, causing the moving average to misrepresent the security's trend or the reversal of a trend. Secondly, even if it is evident that the security is overpriced as per the trader's statistical analysis, he cannot be sure that other traders have made the exact same analysis. Because other traders don't see the security as overpriced, they may continue buying it, pushing the price even higher. This strategy would result in losses if such a situation arises.

Pairs Trading is another strategy that relies on the principle of mean reversion. Two co-integrated securities are identified; the spread between their prices will be stationary and hence mean reverting in nature. An extended version of Pairs Trading is called Statistical Arbitrage, where many co-integrated pairs are identified and split into buy and sell baskets based on the spreads of each pair. The first step in a Pairs Trading or Stat Arb model is to identify a pair of co-integrated securities. One of the commonly used tests for checking co-integration between a pair of securities is the Augmented Dickey-Fuller Test (ADF Test). It tests the null hypothesis that a unit root is present in a time series sample. A time series which has a unit root, i.e. 1 is a root of the series' characteristic equation, is non-stationary. The augmented Dickey-Fuller statistic, also known as the t-statistic, is a negative number; the more negative it is, the stronger the rejection of the null hypothesis that there is a unit root at some level of confidence, which would imply that the time series is stationary. The t-statistic is compared with a critical value parameter; if the t-statistic is less than the critical value parameter, the test is positive and the null hypothesis is rejected.

Co-integration check – ADF Test

Consider the Python code below for checking co-integration. The original post shows this code as a screenshot; the listing that follows is a reconstruction based on the description after it, so the Quandl codes and the column name are assumptions:
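
import quandl
import statsmodels.api as sm
import statsmodels.tsa.stattools as ts

# Fetch MCX Aluminium and Lead futures data (hypothetical Quandl codes)
alum = quandl.get("MCX/ALUMINIUM")
lead = quandl.get("MCX/LEAD")
print(alum.head())
print(lead.head())

# OLS regression on the closing prices of the commodity pair
# (the "Close" column name is an assumption)
result = sm.OLS(alum["Close"], sm.add_constant(lead["Close"])).fit()

# ADF test on the residuals of the regression
c_t = ts.adfuller(result.resid)

# Significance level of 0.1 (90% confidence level)
if c_t[0] <= c_t[4]["10%"] and c_t[1] <= 0.1:
    print("Pair of securities is co-integrated")
else:
    print("Pair of securities is not co-integrated")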

We start by importing the relevant libraries, followed by fetching financial data for two securities using the quandl.get() function. Quandl provides financial and economic data directly in Python through the Quandl library. In this example, we have fetched data for Aluminium and Lead futures from MCX. We then print the first five rows of the fetched data using the head() function, in order to view the data being pulled by the code. Using the statsmodels.api library, we compute the Ordinary Least Squares regression on the closing prices of the commodity pair and store the result of the regression in the variable named 'result'. Next, using the statsmodels.tsa.stattools library, we run the adfuller test by passing the residuals of the regression as the input, and store the result of this computation in the array 'c_t'. This array contains values like the t-statistic, the p-value, and the critical value parameters. Here, we consider a significance level of 0.1 (90% confidence level). c_t[0] carries the t-statistic, c_t[1] contains the p-value, and c_t[4] stores a dictionary containing the critical value parameters for different confidence levels. For co-integration we check two conditions: firstly, whether the t-statistic is less than the critical value parameter (c_t[0] <= c_t[4]['10%']), and secondly, whether the p-value is less than the significance level (c_t[1] <= 0.1). If both conditions are true, we print "Pair of securities is co-integrated"; else we print "Pair of securities is not co-integrated".

[Figure: output of the co-integration check]

To know more about Pairs Trading, Statistical Arbitrage and the ADF test, you can check out the self-paced online certification course on "Statistical Arbitrage Trading" offered jointly by QuantInsti and MCX to learn how to trade Statistical Arbitrage strategies using Python and Excel.

Other Links

Statistics behind pairs trading –
https://www.quantinsti.com/blog/statistics-behind-pair-trading-i-understanding-correlation-and-cointegration/

ADF test using excel –
https://www.quantinsti.com/blog/augmented-dickey-fuller-adf-test-for-a-pairs-trading-strategy/

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to build a promising career in algorithmic trading. Enroll now!

Read more