## R Weekly Bulletin Vol – V

This week's R bulletin will cover topics like how to avoid for-loops, add to or shorten an existing vector, and play a beep sound in R. We will also cover functions like the new.env function, readSeries, and the with and within functions. Hope you like this R weekly bulletin. Enjoy reading!

### Shortcut Keys

1. To stop debugging – Shift+F8
2. To quit an R session (desktop only) – Ctrl+Q
3. To restart an R Session – Ctrl+Shift+0

### Problem Solving Ideas

#### Avoiding for-loops by using the "with" function

A for-loop can be slow in terms of execution speed when we are dealing with large data sets. For faster execution, one can use the "with" function as an alternative. The syntax of the with function is given below:

with(data, expr)

where, “data” is typically a data frame, and “expr” stands for one or more expressions to be evaluated using the contents of the data frame. If there is more than one expression, then the expressions need to be wrapped in curly braces.

Example: Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both methods and time them using the system.time function. Note the time taken to execute the for-loop versus the time taken to execute the with function in combination with the user-defined lagpad function.

library(quantmod)

# df holds the NIFTY OHLC data (e.g. read from a CSV file beforehand)

# Using a for-loop
system.time({

df = df[, c(1, 3:6)]
df$GapOpen = double(nrow(df))
for (i in 2:nrow(df)) {
  df$GapOpen[i] = round(Delt(df$CLOSE[i-1], df$OPEN[i]) * 100, 2)
}

})

# Using the with function + lagpad, instead of a for-loop
system.time({

dt = dt[, c(1, 3:6)]

lagpad = function(x, k) {
  c(rep(NA, k), x)[1:length(x)]
}

dt$PrevClose = lagpad(dt$CLOSE, 1)
dt$GapOpen_ = with(dt, round(Delt(PrevClose, OPEN) * 100, 2))
print(head(dt))

})

#### Adding to an existing vector or shortening it

Adding to or shortening an existing vector can be done by assigning a new length to the vector. When we shorten a vector, the values at the end will be removed, and when we extend an existing vector, missing values (NA) will be added at the end.

Example:

# Shorten an existing vector
even = c(2, 4, 6, 8, 10, 12)
length(even)
[1] 6

# The new length equals the number of elements required in the shortened vector.
length(even) = 3
print(even)
[1] 2 4 6

# Add to an existing vector
odd = c(1, 3, 5, 7, 9, 11)
length(odd)
[1] 6

# The new length equals the number of elements required in the extended vector.
length(odd) = 8
odd[c(7, 8)] = c(13, 15)
print(odd)
[1] 1 3 5 7 9 11 13 15

#### Make R beep/play a sound

If you want R to play a sound/beep upon executing the code, we can do this using the "beepr" package. The beep function from the package plays a sound when the code gets executed. One also needs to install the "audio" package along with the "beepr" package.

install.packages("beepr")
install.packages("audio")
library(beepr)
beep()

One can select from the various sounds using the "sound" argument and by assigning one of the specified values to it.

beep(sound = 9)

One can keep repeating the message using beepr as illustrated in the example below (source: http://stackoverflow.com/)

Example:

work_complete <- function() {
  cat("Work complete. Press esc to sound the fanfare!!!\n")
  on.exit(beepr::beep(3))
  while (TRUE) {
    beepr::beep(4)
    Sys.sleep(1)
  }
}
work_complete()

One can also use the beep function to play a sound if an error occurs during the code execution.

options(error = function() {beep(sound = 5)})

### Functions Demystified

#### new.env function

Environments act as a storehouse. When we create variables in R from the command prompt, these get stored in R's global environment.
To access the variables stored in the global environment, one can use the following expression:

head(ls(envir = globalenv()), 15)
[1] "df"  "dt"  "even"  "i"  "lagpad"  "odd"

If we want to store the variables in a specific environment, we can assign the variable to that environment or create a new environment which will store the variable. To create a new environment we use the new.env function.

Example:

my_environment = new.env()

Once we create a new environment, assigning a variable to the environment can be done in multiple ways. Following are some of the methods:

Examples:

# By using double square brackets
my_environment[["AutoCompanies"]] = c("MARUTI", "TVSMOTOR", "TATAMOTORS")

# By using the dollar sign operator
my_environment$AutoCompanies = c("MARUTI", "TVSMOTOR", "TATAMOTORS")

# By using the assign function
assign("AutoCompanies", c("MARUTI", "TVSMOTOR", "TATAMOTORS"), my_environment)

The variables existing in an environment can be listed using the ls function, and their values can be retrieved using the get function.

Example:

ls(envir = my_environment)
[1] “AutoCompanies”

get("AutoCompanies", my_environment)
[1] “MARUTI”  “TVSMOTOR”  “TATAMOTORS”

#### readSeries function

The readSeries function is part of the timeSeries package. It reads a file in table format and creates a timeSeries object from it. The syntax with its main arguments is given as:

readSeries(file, header = TRUE, sep = ";", format)

where,
file: the filename of a spreadsheet dataset from which to import the data records.
header: a logical value indicating whether the file contains the names of the variables as its first line.
format: a character string with the format in POSIX notation specifying the timestamp format.
sep: the field separator used in the spreadsheet file to separate columns. By default, it is set as “;”.

Example:

library(timeSeries)


# Reading the NIFTY data and creating a time series object using readSeries
# function
df = readSeries(file = "NIFTY.csv", header = T, sep = ",", format = "%Y%m%d")
print(head(df))

#### with and within functions

The with and within functions apply an expression to a given data set and allow one to manipulate it. The within function additionally keeps track of the changes made, including adding or deleting elements, and returns a new object with these revised contents. The syntax for these two functions is given as:

with(data, expr)
within(data, expr)

where,
data – typically is a list or data frame, although other options exist for with.
expr – one or more expressions to evaluate using the contents of data; the commands must be wrapped in braces if there is more than one expression to evaluate.

# Consider the NIFTY data
print(head(df, 3))

# Example of with function:
df$Average = with(df, apply(df[3:6], 1, mean))
print(head(df, 3))

# Example of within function:
df = within(df, {
  Average = apply(df[3:6], 1, mean)
})
print(head(df, 3))

### Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

## Mixture Models for Forecasting Asset Returns

Asset return prediction is difficult. Most traditional time series techniques don’t work well for asset returns. One significant reason is that time series analysis (TSA) models require your data to be stationary. If it isn’t stationary, then you must transform your data until it is stationary.

That presents a problem.

In practice, we observe multiple phenomena that violate the rules of stationarity including non-linear processes, volatility clustering, seasonality, and autocorrelation. This renders traditional models mostly ineffective for our purposes.
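As a quick illustration of the stationarity requirement, a common check is the augmented Dickey-Fuller test. A minimal sketch in R, assuming the tseries package is installed (the dataset choice is purely illustrative):

```r
# Test a return series for stationarity with the augmented Dickey-Fuller test
library(tseries)

# Log returns of the DAX index from R's built-in EuStockMarkets dataset
returns = diff(log(EuStockMarkets[, "DAX"]))

# A small p-value rejects the unit-root null, i.e. suggests stationarity
adf.test(returns)
```

Raw prices typically fail such a test, which is why returns (differences of log prices) are modeled instead.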

### What are our options?

There are many algorithms to choose from, but few are flexible enough to address the challenges of predicting asset returns:

• mean and volatility changes through time
• sometimes future returns are correlated with past returns, sometimes not
• sometimes future volatility is correlated with past volatility, sometimes not
• non-linear behavior

To recap, we need a model framework that is flexible enough to (1) adapt to non-stationary processes and (2) provide a reasonable approximation of the non-linear process that is generating the data.

### Can Mixture Models offer a solution?

They have potential. First, they are based on several well-established concepts.

Markov models – These are used to model sequences where the future state depends only on the current state and not any past states. (memoryless processes)

Hidden Markov models – Used to model processes where the true state is unobserved (hidden) but there are observable factors that give us useful information to guess the true state.

Expectation-Maximization (E-M) – This is an algorithm that iterates between computing class parameters and maximizing the likelihood of the data given those parameters.

An easy way to think about applying mixture models to asset return prediction is to consider asset returns as a sequence of states or regimes. Each regime is characterized by its own descriptive statistics including mean and volatility. Example regimes could include low-volatility and high-volatility. We can also assume that asset returns will transition between these regimes based on probability. By framing the problem this way we can use mixture models, which are designed to try to estimate the sequence of regimes, each regime’s mean and variance, and the transition probabilities between regimes.
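A minimal sketch of this regime framing, using the mclust package on simulated returns (the package choice and all parameters are illustrative assumptions, not the webinar's implementation):

```r
library(mclust)  # provides Mclust for Gaussian mixture fitting

set.seed(42)
# Simulated daily returns: a calm regime followed by a volatile regime
returns = c(rnorm(500, mean = 0, sd = 0.005),
            rnorm(250, mean = 0, sd = 0.02))

# Fit a 2-component Gaussian mixture via the E-M algorithm
fit = Mclust(returns, G = 2)

summary(fit)                 # component means, variances, mixing weights
regimes = fit$classification # most likely regime label per observation
```

Each mixture component plays the role of one regime; the estimated mixing weights and classifications approximate how often, and when, each regime was active.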

The most common is the Gaussian mixture model (GMM).

The underlying model assumption is that each regime is generated by a Gaussian process with parameters we can estimate. Under the hood, GMM employs an expectation-maximization algorithm to estimate regime parameters and the most likely sequence of regimes.

GMMs are flexible, generative models that have had success approximating non-linear data. Generative models are special in that they try to mimic the underlying data process such that we can create new data that should look like original data.

In this upcoming webinar on Mixture Models, we design and evaluate an example strategy to find out if Gaussian mixture models are useful for return prediction, and specifically for identifying market bottoms.

### Next Step

Join us for the webinar, “Can we use Mixture Models to Predict Market Bottoms?” on Tuesday 25th Apr, 8:30 AM MST | 8:00 PM IST. The webinar will be conducted by Brian Christopher, Quantitative researcher, Python developer, CFA charter holder, and founder of Blackarbs LLC, a quantitative research firm. Register now!

## R Weekly Bulletin Vol – IV

This week’s R bulletin will cover topics like removing duplicate rows, finding row number, sorting a data frame in the same order, sorting a data frame in different order, and creating two tabs in an excel workbook. We will also cover functions like Sys.time, julian, and the second & minute function from the lubridate package. Hope you like this R weekly bulletin. Enjoy reading!

### Shortcut Keys

1. To open a file in R – Ctrl+O
2. To open a new R script – Ctrl+Shift+N
3. To save an R script – Ctrl+S

### Problem Solving Ideas

#### Remove duplicate rows in R

Consider the following data frame "df" comprising 3 rows of 1-minute open-high-close data of a stock.

Example:

open = c(221, 221.25, 221.25)
high = c(221.75, 221.45, 221.45)
close = c(221.35, 221.2, 221.2)
df = data.frame(open, high, close)
print(df)

As can be seen from the output, the 2nd and the 3rd rows are duplicates. If we want to retain only the unique rows, we can use the duplicated function in the following manner:

df[!duplicated(df), ]

This will remove the 3rd duplicate row, and retain the first 2 rows. In case we want the duplicate row, we can do so in the following way:

df[duplicated(df), ]
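By default, duplicated flags the later occurrences and keeps the first. The fromLast argument flips this, keeping the last occurrence instead; a self-contained sketch using the same illustrative data:

```r
# Rebuild the example data frame with a duplicated row
df = data.frame(open  = c(221, 221.25, 221.25),
                high  = c(221.75, 221.45, 221.45),
                close = c(221.35, 221.2, 221.2))

# Scan from the end: row 2 is now flagged as the duplicate, so
# rows 1 and 3 are retained
df[!duplicated(df, fromLast = TRUE), ]
```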

#### Return row number for a particular value in a column

To find the row number for a particular value in a column in a vector/data frame we can use the “which” function.

Example 1:

day = c("Mon", "Tue", "Wed", "Thurs", "Fri", "Sat", "Sun")
row_number = which(day == "Sat")
print(row_number)
[1] 6

Example 2: Consider a 2 column data frame “df” comprising of the stock symbols and their respective closing price for the day. To find the row number corresponding to the HCL stock we call the “which” function on the Ticker column with its value selected as HCL.

Ticker = c("INFY", "TCS", "HCL")
ClosePrice = c(2021, 2294, 910)
data = data.frame(Ticker, ClosePrice)

row_number = which(data$Ticker == "HCL")
print(row_number)
[1] 3

#### Sorting a data frame by two columns in the same order

To sort a data frame by two columns in the same order, we can use the "order" function together with the "with" function. Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. We first sort the data frame based on category, and then based on the percentage change in price. The order function by default sorts in an ascending manner. Hence, to sort both columns in descending order we set the decreasing argument to TRUE.

Example – Sorting a data frame by two columns in the same order

# Create a data frame
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN")
Category = c(1, 1, 1, 2)
Percent_Change = c(2.3, -0.25, 0.5, 0.25)
df = data.frame(Ticker, Category, Percent_Change)
print(df)

# Sorting by the Category column first and then by the Percent_Change column:
df_sort = df[with(df, order(Category, Percent_Change, decreasing = TRUE)), ]
print(df_sort)

#### Sorting a data frame by two columns in different order

To sort a data frame by two columns in different order, we can again use the "order" function along with the "with" function. Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. Assume that we want to sort first in an ascending order by the "Category" column, and then by the "Percent_Change" column in a descending order. The order function by default sorts in an ascending manner. Hence, to sort the "Category" column we mention it as the first variable in the order function without prepending any sign. To sort the "Percent_Change" column in a descending order, we prepend it with a negative sign.
Example – Sorting a data frame by two columns in different order

# Create a data frame
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN")
Category = c(1, 1, 1, 2)
Percent_Change = c(2.3, -0.25, 0.5, 0.25)
df = data.frame(Ticker, Category, Percent_Change)
print(df)

# Sort by the Category column first and then by the Percent_Change column:
df_sort = df[with(df, order(Category, -Percent_Change)), ]
print(df_sort)

#### Creating two tabs in the output excel workbook

At times we want to write and save the multiple results generated after running our R script in an excel workbook on separate worksheets. To do so, we can make use of the "append" argument in the write.xlsx function. To write to an excel file, one must first install and load the xlsx package in R.

Example: In this example, we are creating two worksheets in the "Stocks.xlsx" workbook. In the worksheet named "Top Gainers", we save the table_1 output, while in the "Top Losers" worksheet we save the table_2 output. To create a second worksheet in the same workbook, we keep the append argument as TRUE in the second line.

write.xlsx(table_1, "Stocks.xlsx", sheetName = "Top Gainers", append = FALSE)
write.xlsx(table_2, "Stocks.xlsx", sheetName = "Top Losers", append = TRUE)

### Functions Demystified

#### Sys.time function

The Sys.time function gives the current date and time.

Example:

date_time = Sys.time()
print(date_time)
[1] "2017-04-15 16:25:38 IST"

The function can be used to find the time required to run a piece of code by calling it at the start and at the end of the code. The difference between the two timestamps gives the time taken to execute the code.

#### julian function

The julian function is used to extract the Julian date, which is the number of days since January 1, 1970.
Given a date, the syntax of the function is given as:

julian(date)

Example:

date = as.Date("2010-03-15")
julian(date)
[1] 14683
attr(,"origin")
[1] "1970-01-01"

Alternatively, one can use the as.integer function to get the same result.

#### second and minute functions

These functions are part of the lubridate package. The second function retrieves/sets the second component of a date-time object, while the minute function retrieves/sets the minute component. They accept date-time objects belonging to the POSIXct, POSIXlt, Date, zoo, xts, and timeSeries classes.

Example:

library(lubridate)

# Retrieving the seconds from a date-time object. We have used the ymd_hms
# function to parse the given object.
x = ymd_hms("2016-06-01 12:23:45")
second(x)
[1] 45

minute function:

library(lubridate)

# Retrieving the minute from a date-time object
x = ymd_hms("2016-06-01 12:23:45")
minute(x)
[1] 23

# Retrieving the minute from a time object. We have used the hms function to
# parse the given object.
x = hms("15:29:06")
minute(x)
[1] 29

### Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

## R Best Practices: R you writing the R way!

Any programmer inevitably writes tons of code in his daily work. However, not all programmers inculcate the habit of writing clean code which can be easily understood by others. One of the reasons can be a lack of awareness among programmers of the best practices followed in writing a program. This is especially the case for novice programmers. In this post, we list some of the R programming best practices which will lead to improved code readability, consistency, and repeatability. Read on!

### Best practices of writing in R

1) Describe your code – When you start coding, describe what the R code does in the very first line. For subsequent blocks of code follow the same method of describing the blocks.
This makes it easy for other people to understand and use the code.

Example:

# This code captures the 52-week high effect in stocks
# Code developed by Milind Paradkar

2) Load packages – After describing your code in the first line, use the library function to list and load all the relevant packages needed to execute your code.

Example:

library(quantmod); library(zoo); library(xts);
library(PerformanceAnalytics); library(timeSeries); library(lubridate);

3) Use updated packages – While writing your code, ensure that you are using the latest updated R packages. To check the version of any R package you can use the packageVersion function.

Example:

packageVersion("TTR")
[1] '0.23.1'

4) Organize all source files in the same directory – Store all the necessary files that will be used/sourced in your code in the same directory. You can use the respective relative path to access them.

Example:

# Reading file using relative path
df = read.csv(file = "NIFTY.csv", header = TRUE)

# Reading file using full path
df = read.csv(file = "C:/Users/Documents/NIFTY.csv", header = TRUE)

5) Use a consistent style for data structure types – R allows for different data structures like vectors, factors, data frames, matrices, and lists. Use similar naming for a particular type of data structure. This will make it easy to recognize the similar data structures used in the code and to spot problems easily.

Example: You can name all the different data frames used in your code by adding .df as the suffix.

aapl.df = as.data.frame(read.csv(file = "AAPL.csv", header = TRUE))
amzn.df = as.data.frame(read.csv(file = "AMZN.csv", header = TRUE))
csco.df = as.data.frame(read.csv(file = "CSCO.csv", header = TRUE))

6) Indent your code – Indentation makes your code easier to read, especially if there are multiple nested statements like for-loops and if statements.

Example:

# Computing the Profit & Loss (PL) and the Equity
dt$PL = numeric(nrow(dt))
for (i in 1:(nrow(dt) - 1)) {
  if (dt$Signal[i] == 1)  { dt$PL[i+1] = dt$Close[i+1] - dt$Close[i] }
  if (dt$Signal[i] == -1) { dt$PL[i+1] = dt$Close[i] - dt$Close[i+1] }
}

7) Remove temporary objects – For long codes running into thousands of lines, it is a good practice to remove temporary objects after they have served their purpose in the code. This can ensure that R does not run into memory issues.
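A short base-R sketch of this practice (the object names are purely illustrative):

```r
# Create a temporary object, use it, then free it
temp.df = data.frame(x = 1:100000, y = rnorm(100000))
total = sum(temp.df$x)   # the temporary object has served its purpose

rm(temp.df)              # remove it from the workspace
gc()                     # optionally prompt R to return freed memory to the OS
```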

8) Time the code – You can time your code using the system.time function. You can also use the same function to find out the time taken by different blocks of code. The function returns the amount of time taken in seconds to evaluate the expression or a block of code. Timing codes will help to figure out any bottlenecks and help speed up your codes by making the necessary changes in the script.

To find the time taken for different blocks we wrapped them in curly braces within the call to the system.time function.

The two important metrics returned by the function include:
i) User time – time charged to the CPU(s) for the code
ii) Elapsed time – the amount of time elapsed to execute the code in entirety

Example:

# Generating random numbers
system.time({
mean_1 = rnorm(1e+06, mean = 0, sd = 0.8)
})

user    system    elapsed
0.40      0.00       0.45

9) Use vectorization – Vectorization results in faster execution of codes, especially when we are dealing with large data sets. One can use statements like the ifelse statement or the with function for vectorization.
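For instance, a vectorized ifelse call classifies every element of a vector in a single statement, with no explicit loop (the data here is illustrative):

```r
# Daily percentage returns (illustrative values)
returns = c(0.5, -1.2, 2.3, -0.7, 1.1)

# Vectorized: the condition is evaluated for all elements at once
signal = ifelse(returns > 0, "UP", "DOWN")
print(signal)
```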

Example:
Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both the methods (using for-loop and with function) and time them using the system.time function. Note the time taken to execute the for-loop versus the time to execute the with function in combination with the lagpad function.

library(quantmod)

# Using a for-loop
system.time({
df = df[, c(1, 3:6)]
df$GapOpen = double(nrow(df))
for (i in 2:nrow(df)) {
  df$GapOpen[i] = round(Delt(df$CLOSE[i-1], df$OPEN[i]) * 100, 2)
}
})

# Using the with function + lagpad, instead of a for-loop
system.time({
df = dt[, c(1, 3:6)]

lagpad = function(x, k) {
  c(rep(NA, k), x)[1:length(x)]
}

df$PrevClose = lagpad(df$CLOSE, 1)
df$GapOpen_ = with(df, round(Delt(PrevClose, OPEN) * 100, 2))
print(head(df))
})

10) Folding codes – Code folding is a way wherein the R programmer can fold a line of code or code sections. This allows hiding blocks of code whenever required, and makes it easier to navigate through lengthy codes. Code folding can be done in two ways:

i) Automatic folding of codes
ii) User-defined folding of codes

Automatic folding of codes: RStudio automatically provides the flexibility to fold the codes. When a coder writes a function or conditional blocks, RStudio automatically creates foldable codes.

User-defined folding of codes: One can also fold a random group of codes by using Edit -> Folding -> Collapse or by simply selecting the group of codes and pressing the Alt+L key. User-defined folding can also be done via code sections: to insert a new code section you can use the Code -> Insert Section command. Alternatively, any comment line which includes at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section.

11) Review and test your code rigorously – Once your code is ready, ensure that you test it rigorously on different input parameters. Ensure that the logic used in statements like for-loops, if statements, and ifelse statements is correct. It is a nice idea to get your code reviewed by a colleague to ensure that the work is of high quality.

12) Don't save your workspace – When you exit R, it asks whether you want to save your workspace. It is advisable to not save the workspace and to start in a clean workspace for your next R session. Objects from previous R sessions can lead to errors which can be hard to debug.

These were some of the best practices of writing in R that one can follow to make your code easy to read and debug, and to ensure consistency.

### Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™).
The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

## R Weekly Bulletin Vol – III

This week's R bulletin will cover topics like how to read select columns, the difference between boolean operators, converting a factor to a numeric, and changing the memory available to R. We will also cover functions like data, format, tolower, toupper, and the strsplit function. Hope you like this R weekly bulletin. Enjoy reading!

### Shortcut Keys

1. To change the working directory – Ctrl+Shift+K
2. To show history – Ctrl+4
3. To display environment – Ctrl+8

### Problem Solving Ideas

#### Read a limited number of columns from a dataset

To read a specific set of columns from a dataset one can use the "fread" function from the data.table package. The syntax for the function used for reading/dropping a limited number of columns is given as:

fread(input, select, stringsAsFactors = FALSE)
fread(input, drop, stringsAsFactors = FALSE)

One needs to specify the column names or column numbers with the select parameter as shown below.

Example:

library(data.table)
data = fread("data.txt")
print(data)

# Reading select columns
data = fread("data.txt", select = c("DATE", "OPEN", "CLOSE"))
print(data)

Alternatively, you can use the drop parameter to indicate which columns should not be read:

Example:

data = fread("data.txt", drop = c(2:3, 6))
print(data)

#### Difference between the boolean operators & and &&

In R we have the "&" boolean operator, which is the equivalent of the AND operator in Excel. Similarly, we have | in R, which is the equivalent of OR in Excel. One also finds the boolean operators && and || in R. Although these operators have the same meaning as & and |, they behave in a slightly different manner.

Difference between & and &&: The shorter form "&" is vectorized, meaning it can be applied element-wise on a vector.
See the example below.

Example:

x = c(-1, 0, 1)
x >= 0 & x <= 0
[1] FALSE  TRUE FALSE

The longer form &&, evaluates left to right, examining only the first element of each vector. Thus, in the example below, it will first evaluate whether -1 >= 0 (which is FALSE); since && short-circuits on the first FALSE, the final answer is FALSE. (Note: recent versions of R signal an error when && is given a vector of length greater than one.)

Example:

x >= 0 && x <= 0
[1] FALSE

Thus the && operator is to be used when we have vectors of length one. For vectors of length greater than one, we should use the short form.

#### How to convert a factor to a numeric without a loss of information

Consider the following factor "x" created using the factor, sample, and runif functions. Let us try to convert this factor to a numeric.

Example:

x = factor(sample(runif(5), 6, replace = TRUE))
print(x)

[1] 0.660900804912671 0.735364762600511 0.479244165122509 0.397552277892828
[5] 0.660900804912671 0.660900804912671
4 Levels: 0.397552277892828 0.479244165122509 ... 0.735364762600511

as.numeric(x)
[1] 3 4 2 1 3 3

As can be seen from the output, upon conversion we do not get the original values. R does not have a function to execute such a conversion error-free. To overcome this problem, one can use the following expression:

as.numeric(levels(x))[x]
[1] 0.6609008 0.7353648 0.4792442 0.3975523 0.6609008 0.6609008

One can also use the as.numeric(as.character(x)) expression; however, as.numeric(levels(x))[x] is slightly more efficient in converting the factor to a numeric.

#### Replace a part of a URL using R

Suppose you want to scrape data from a site having different pages, and want to change the URL from R. We can do this by using the parse_url and build_url functions from the "httr" package. The following example illustrates the method to replace a part of a URL.

Example:

library(httr)

# In the following url, the value 2 at the end signifies page 2 of the books
# category under the section "bestsellers".
testURL = "http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#2"

# Entering the url and parsing it using the parse_url function
parseURL = parse_url(testURL)
print(parseURL)

$scheme
[1] "http"

$hostname
[1] "www.amazon.in"

$port
NULL

$path
[1] "gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b"

$query
NULL

$params
NULL

$fragment
[1] "2"

$username
NULL

$password
NULL

attr(,"class")
[1] "url"

# Assigning the next page number (i.e page no.3) and creating the new url
parseURL$fragment = 3
newURL <- build_url(parseURL)
print(newURL)
[1] "http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#3"

#### Increasing (or decreasing) the memory available to R processes

On Windows, R has a memory.limit() function which gives you the amount of available memory in MB. At times, Windows users may get an error that R has run out of memory. In such cases, you may set the amount of available memory.

Example:

memory.limit(size = 4000)

The unit is MB. You may increase this value up to the maximum amount of physical RAM you have installed. If you install more RAM after installing R, you may have to reinstall R in order to take advantage of the additional capacity. On 32-bit Windows, R can only use up to about 3 GB of RAM, regardless of how much you have installed; the 64-bit version of R for Windows, which runs on 64-bit Windows, can use the entire RAM available.

### Functions Demystified

#### data function

R includes a number of packages which often come with different data sets. These data sets can be accessed using the data function. This function loads specified data sets or lists the available data sets.

To check the available data sets from all the R packages installed in your application, use the data() expression. This will display all the available data sets under each package.

Example:

data()

To view data sets from a particular package, specify the package name as the argument to the data function.

Example:

data(package = "lubridate")

To load a particular dataset, use the name of the dataset as the argument to the data function.
Example:

# loading the USDCHF dataset from the timeSeries package
library(timeSeries)
data(USDCHF)
x = USDCHF
print(head(USDCHF))

# loading the sample_matrix dataset from the xts package
library(xts)
data(sample_matrix)
x = sample_matrix
print(head(sample_matrix))

#### format function

The format function is used to format numbers so that they appear clean and neat on reports. The function takes a number of arguments to control the formatting of the result. The main arguments for the format function are given below.

format(x, trim, digits, decimal.mark, nsmall, big.mark, big.interval, small.mark, small.interval, justify)

Where,
1) x: the number to be formatted.
2) trim: takes a logical value. If FALSE, it adds spaces to right-justify the result. If TRUE, it suppresses the leading spaces.
3) digits: how many significant digits of numeric values to display.
4) decimal.mark: the character to display as the decimal mark.
5) nsmall: the minimum number of digits after the decimal point.
6) big.mark: the mark inserted between digit groups before the decimal point.
7) small.mark: the mark inserted between digit groups after the decimal point.

Example:

format(34562.67435, digits = 7, decimal.mark = ".", big.mark = " ",
       small.mark = ",", small.interval = 2)
[1] "34 562.67"

In the example above we chose to display 7 significant digits and use a point (.) as the decimal mark. The big.mark setting inserts a space between digit groups of three before the decimal point, which is why a space appears after "34". The small.interval value of 2 would group digits after the decimal point in pairs, separated by small.mark.

#### tolower, toupper, and strsplit

The tolower and toupper functions help convert strings to lowercase and uppercase respectively.
Example 1:

toupper("I coded a strategy in R")
[1] "I CODED A STRATEGY IN R"

tolower("I coded a strategy in R")
[1] "i coded a strategy in r"

Example 2:

df = as.data.frame(read.csv("NIFTY.csv"))
print(head(df, 4))

colnames(df)
[1] "DATE" "TIME" "CLOSE" "HIGH" "LOW" "OPEN" "VOLUME"

tolower(colnames(df))
[1] "date" "time" "close" "high" "low" "open" "volume"

The strsplit function is used to break/split strings at the specified split points. The function returns a list, not a character vector or a matrix. In the example below, we are splitting the expression on the spaces.

Example:

strsplit("I coded a strategy in R", split = " ")
[[1]]
[1] "I" "coded" "a" "strategy" "in" "R"

### Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

## Forecasting Markets using eXtreme Gradient Boosting (XGBoost)

In recent years, machine learning has been generating a lot of curiosity for its profitable application to trading. Numerous machine learning models like linear/logistic regression, support vector machines, neural networks, tree-based models, etc. are being tried and applied in an attempt to analyze and forecast the markets. Researchers have found that some models have a higher success rate than others. eXtreme Gradient Boosting, also called XGBoost, is one such machine learning model that has received rave reviews from machine learning practitioners. In this post, we will cover the basics of XGBoost, a winning model for many Kaggle competitions. We then attempt to develop an XGBoost stock forecasting model using the "xgboost" package in R.

### Basics of XGBoost and related concepts

Developed by Tianqi Chen, the eXtreme Gradient Boosting (XGBoost) model is an implementation of the gradient boosting framework. Gradient boosting is a machine learning technique used for building predictive tree-based models.
(See: Machine Learning: An Introduction to Decision Trees.)

Boosting is an ensemble technique in which new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.

The ensemble technique uses the tree ensemble model, which is a set of classification and regression trees (CART). The ensemble approach is used because a single CART usually does not have strong predictive power. By using a set of CARTs (i.e. a tree ensemble model), a sum of the predictions of multiple trees is considered.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models, and are then added together to make the final prediction.

The objective of the XGBoost model is given as:

Obj = L + Ω

where L is the loss function, which controls the predictive power, and Ω is the regularization component, which controls simplicity and overfitting.

The loss function (L) to be optimized can be Root Mean Squared Error for regression, logloss for binary classification, or mlogloss for multi-class classification. The regularization component (Ω) depends on the number of leaves and the prediction scores assigned to the leaves in the tree ensemble model.

It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. The gradient boosting algorithm supports both regression and classification predictive modeling problems.

### Sample XGBoost model:

We will use the "xgboost" R package to create a sample XGBoost model. You can refer to the documentation of the "xgboost" package here.

Install and load the xgboost library – We install the xgboost library using the install.packages function. To load this package we use the library function. We also load other relevant packages required to run the code.
```R
install.packages("xgboost")

# Load the relevant libraries
library(quantmod); library(TTR); library(xgboost);
```

Create the input features and target variable – We take the 5-year OHLC and volume data of a stock and compute the technical indicators (input features) using this dataset. The indicators computed include the Relative Strength Index (RSI), Average Directional Index (ADX), and Parabolic SAR (SAR). We create a lag in the computed indicators to avoid look-ahead bias. This gives us our input features for building the XGBoost model. Since this is a sample model, we have included only a few indicators to build our set of input features.

```R
# Read the stock data
symbol = "ACC"
fileName = paste(getwd(), "/", symbol, ".csv", sep="")
df = as.data.frame(read.csv(fileName))
colnames(df) = c("Date", "Time", "Close", "High", "Low", "Open", "Volume")

# Define the technical indicators to build the model
rsi = RSI(df$Close, n=14, maType="WMA")
adx = data.frame(ADX(df[, c("High", "Low", "Close")]))
sar = SAR(df[, c("High", "Low")], accel = c(0.02, 0.2))
trend = df$Close - sar

# Create a lag in the technical indicators to avoid look-ahead bias
rsi = c(NA, head(rsi, -1))
adx$ADX = c(NA, head(adx$ADX, -1))
trend = c(NA, head(trend, -1))
```

Our objective is to predict the direction of the daily stock price change (Up/Down) using these input features. This makes it a binary classification problem. We compute the daily price change and assign a value of 1 if the price change is positive and 0 if it is negative.

```R
# Create the target variable
price = df$Close - df$Open
class = ifelse(price > 0, 1, 0)
```

Combine the input features into a matrix – The input features and the target variable created in the above steps are combined to form a single matrix. We use the matrix structure because the xgboost library expects the input data in matrix format.

```R
# Create a matrix
model_df = data.frame(class, rsi, adx$ADX, trend)
```

### Summarizing the Internal Structure of the IBrokers Package

We have seen above how the internal structure of the IBrokers package works. To summarize the entire mechanism, it can be depicted as shown below:

Request to TWS for data -> twsCALLBACK -> processMsg -> eWrapper

### Real-Time Data Model

We will use the snapShotTest code example published by Jeff Ryan. The code below modifies the twsCALLBACK function. This modified callback is used as an argument to the reqMktData function. The output using the modified callback is more convenient to read than the normal output when we use the reqMktData function.

Another change in the snapShotTest code is to record any error messages from the IB API in a separate file. (Under the default method, the eWrapper prints such error messages to the console.) To do this, we create a different wrapper using eWrapper(debug=NULL). Once constructed, we assign its errorMessage() function to the eWrapper that we will actually use.

We then apply a simple trade logic which generates a buy signal if the last bid price is greater than a pre-specified threshold value. One can similarly tweak the logic of the twsCALLBACK to create custom callbacks based on the requirements of one's trading strategy.

Order getting filled in the IB Trader Workstation (TWS)

### Conclusion

To conclude, this post gave a detailed overview of the architecture of the IBrokers package, which is the R implementation of the Interactive Brokers API. Interactive Brokers, in collaboration with QuantInsti™, hosted a webinar, "Trading using R on Interactive Brokers", held on 2nd March 2017 and conducted by Anil Yadav, Director at QuantInsti™. You can click on the link provided above to learn more about the IBrokers package.

### Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

## Mean Reversion in Time Series

Time series data is simply a collection of observations generated over time: for example, the speed of a race car at each second, daily temperature, weekly sales figures, stock returns per minute, etc. In the financial markets, a time series tracks the movement of specific data points, such as a security's price, over a specified period of time, with data points recorded at regular intervals. A time series can be generated for any variable that changes over time. Time series analysis comprises techniques for analyzing time series data in an attempt to extract useful statistics and identify characteristics of the data. Time series forecasting is the use of a mathematical model to predict future values based on previously observed values in the time series data.

The graph shown below represents the daily closing price of Aluminium futures over a period of 93 trading days, which is a time series.

### Mean Reversion

Mean reversion is the theory that prices, returns, or various economic indicators tend to move back to their historical average or mean over time. This theory has led to many trading strategies that involve the purchase or sale of a financial instrument whose recent performance has greatly differed from its historical average without any apparent reason. For example, suppose the price of gold increases on average by INR 10 every day, and one day it rises by INR 40 without any significant news or factor behind the rise. By the mean reversion principle, we can then expect the price of gold to fall in the coming days, such that the average change in the price of gold remains the same. In such a case, the mean reversionist would sell gold, expecting the price to fall, and would then profit by buying back the same amount of gold at the lower price.

A mean-reverting time series is plotted below; the horizontal black line represents the mean, and the blue curve is the time series, which tends to revert back to the mean.

A collection of random variables is defined to be a stochastic or random process. A stochastic process is said to be stationary if its mean and variance are time invariant (constant over time). A stationary time series will be mean reverting in nature, i.e. it will tend to return to its mean, and fluctuations around the mean will have roughly equal amplitudes. A stationary time series will not drift too far away from its mean because of its finite, constant variance. A non-stationary time series, on the contrary, will have a time-varying mean or a time-varying variance (or both), and will not tend to revert back to its mean.

In the financial industry, traders take advantage of stationary time series by placing orders when the price of a security deviates considerably from its historical mean, speculating that the price will revert back to its mean. They start by testing for stationarity in a time series. Financial data points, such as prices, are often non-stationary, i.e. they have means and variances that change over time. Non-stationary data tends to be unpredictable and cannot be modeled or forecasted directly.

A non-stationary time series can be converted into a stationary time series by either differencing or detrending the data. A random walk (the movements of an object or changes in a variable that follow no discernible pattern or trend) can be transformed into a stationary series by differencing (computing the difference between Yt and Yt-1). The disadvantage of this process is that it loses one observation each time the difference is computed. A non-stationary time series with a deterministic trend can be converted into a stationary time series by detrending (removing the trend); detrending does not result in a loss of observations.

A linear combination of two non-stationary time series can also result in a stationary, mean-reverting time series. Time series (integrated of at least order 1) that can be linearly combined to produce a stationary time series are said to be cointegrated.
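To make the differencing idea concrete, here is a minimal Python sketch (Python is also the language of the co-integration example later in this post). The series is synthetic and the variable names are illustrative:

```python
import random

random.seed(42)

# Build a random walk: each value is the previous value plus an i.i.d. step.
# The walk is non-stationary (its variance grows over time).
steps = [random.gauss(0, 1) for _ in range(500)]
walk = []
level = 0.0
for s in steps:
    level += s
    walk.append(level)

# First differencing: Yt - Yt-1 recovers the i.i.d. (stationary) steps,
# at the cost of losing one observation.
diffs = [walk[t] - walk[t - 1] for t in range(1, len(walk))]

print(len(walk), len(diffs))  # the differenced series is one observation shorter
```

The differenced series is exactly the sequence of i.i.d. steps, which illustrates why differencing a random walk yields a stationary series.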

Shown below is a plot of a non-stationary time series with a deterministic trend (Yt = α + βt + εt) represented by the blue curve and its detrended stationary time series (Yt – βt = α + εt) represented by the red curve.
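The detrending step described above can be sketched in Python: estimate the deterministic trend by ordinary least squares and subtract it. This is a minimal illustration on a synthetic series; the variable names and parameter values are illustrative:

```python
# Sketch: detrending a series with a deterministic linear trend Yt = a + b*t + et.
# We estimate a and b by closed-form ordinary least squares, then subtract b*t.
import random

random.seed(1)
n = 200
alpha, beta = 5.0, 0.3  # true trend parameters (unknown in practice)
y = [alpha + beta * t + random.gauss(0, 1) for t in range(n)]

t_vals = list(range(n))
t_mean = sum(t_vals) / n
y_mean = sum(y) / n

# OLS slope and intercept for y = a + b*t
b_hat = (sum((t - t_mean) * (yt - y_mean) for t, yt in zip(t_vals, y))
         / sum((t - t_mean) ** 2 for t in t_vals))
a_hat = y_mean - b_hat * t_mean

# Detrended series: Yt - b_hat*t, which no longer drifts over time
detrended = [yt - b_hat * t for t, yt in zip(t_vals, y)]

print(round(b_hat, 2))  # estimated slope, close to the true beta
```

Unlike differencing, no observations are lost: the detrended series has the same length as the original.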

### Trading Strategies based on Mean Reversion

One of the simplest mean reversion trading strategies is to find the average price over a specified period, followed by determining a high-low range around the average value from which the price tends to revert back to the mean. Trading signals are generated when these ranges are crossed: a sell order when the range is crossed on the upper side, and a buy order when it is crossed on the lower side. The trader takes contrarian positions, i.e. goes against the movement of prices (or the trend), expecting the price to revert back to the mean.

This strategy looks too good to be true, and it is: it faces severe obstacles. The lookback period of the moving average might contain a few abnormal prices that are not characteristic of the dataset, causing the moving average to misrepresent the security's trend or the reversal of a trend. Secondly, even if the security appears overpriced as per the trader's statistical analysis, he cannot be sure that other traders have made the exact same analysis. If other traders don't see the security as overpriced, they will continue buying it, pushing prices even higher. The strategy would result in losses if such a situation arises.
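The band-based rule described above can be sketched in a few lines of Python. This is a toy illustration, not a production strategy; the window length, band width, and toy price series are illustrative choices:

```python
# Sketch of the band-based mean-reversion rule: sell when the price crosses
# above the upper band, buy when it crosses below the lower band.
def mean_reversion_signals(prices, window=20, band=2.0):
    signals = []
    for i in range(window, len(prices)):
        lookback = prices[i - window:i]
        mean = sum(lookback) / window
        var = sum((p - mean) ** 2 for p in lookback) / window
        std = var ** 0.5
        if prices[i] > mean + band * std:
            signals.append((i, "SELL"))   # price stretched above the mean
        elif prices[i] < mean - band * std:
            signals.append((i, "BUY"))    # price stretched below the mean
    return signals

# Usage on a toy series that spikes away from a flat mean of 100
prices = [100.0] * 30 + [106.0] + [100.0] * 5 + [94.0]
print(mean_reversion_signals(prices))  # → [(30, 'SELL'), (36, 'BUY')]
```

A real implementation would also have to handle the obstacles noted above, e.g. filtering abnormal prices out of the lookback window.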

Pairs Trading is another strategy that relies on the principle of mean reversion. Two co-integrated securities are identified; the spread between the prices of these securities would be stationary and hence mean reverting in nature. An extended version of Pairs Trading is called Statistical Arbitrage, where many co-integrated pairs are identified and split into buy and sell baskets based on the spreads of each pair. The first step in a Pairs Trading or Stat Arb model is to identify a pair of co-integrated securities.

One of the commonly used tests for checking co-integration between a pair of securities is the Augmented Dickey-Fuller Test (ADF Test). It tests the null hypothesis that a unit root is present in a time series sample. A time series which has a unit root, i.e. 1 is a root of the series' characteristic equation, is non-stationary. The augmented Dickey-Fuller statistic, also known as the t-statistic, is a negative number. The more negative it is, the stronger the rejection of the null hypothesis that there is a unit root, at some level of confidence, which would imply that the time series is stationary. The t-statistic is compared with a critical value parameter; if the t-statistic is less than the critical value parameter, the test is positive and the null hypothesis is rejected.

### Co-integration check – ADF Test

Consider the Python code shown below for checking co-integration:

We start by importing the relevant libraries, followed by fetching financial data for two securities using the quandl.get() function. Quandl provides financial and economic data directly in Python via the Quandl library. In this example, we have fetched data for Aluminium and Lead futures from MCX. We then print the first five rows of the fetched data using the head() function, in order to view the data being pulled by the code.

Using the statsmodels.api library, we compute the Ordinary Least Squares regression on the closing prices of the commodity pair and store the result of the regression in the variable named 'result'. Next, using the statsmodels.tsa.stattools library, we run the adfuller test by passing the residuals of the regression as the input, and store the result of this computation in the array "c_t". This array contains values like the t-statistic, p-value, and critical value parameters. Here, we consider a significance level of 0.1 (90% confidence level). "c_t[0]" carries the t-statistic, "c_t[1]" contains the p-value, and "c_t[4]" stores a dictionary containing critical value parameters for different confidence levels.

For co-integration we check two conditions: first, whether the t-statistic is less than the critical value parameter (c_t[0] <= c_t[4]['10%']), and second, whether the p-value is less than the significance level (c_t[1] <= 0.1). If both conditions are true, we print that the "Pair of securities is co-integrated"; otherwise we print that the "Pair of securities is not co-integrated".
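The decision rule at the end of that description can be isolated as a small function. The data-fetching and regression steps (quandl.get, sm.OLS, adfuller) require network access and the quandl/statsmodels libraries, so they appear only as comments below; the Quandl codes and the sample c_t tuples are illustrative, not real test output:

```python
# The fetch/regress/test steps described above would look roughly like:
#   import quandl
#   import statsmodels.api as sm
#   from statsmodels.tsa.stattools import adfuller
#   a = quandl.get(...)   # Aluminium futures (illustrative; code omitted)
#   b = quandl.get(...)   # Lead futures (illustrative; code omitted)
#   result = sm.OLS(a["Close"], sm.add_constant(b["Close"])).fit()
#   c_t = adfuller(result.resid)

# The decision rule itself, as described in the text:
def is_cointegrated(c_t, significance=0.1):
    t_stat = c_t[0]      # ADF t-statistic
    p_value = c_t[1]     # p-value of the test
    critical = c_t[4]    # dict of critical values per confidence level
    # Reject the unit-root null (residuals stationary => pair co-integrated)
    # when the t-statistic is below the 10% critical value AND p <= 0.1.
    return t_stat <= critical["10%"] and p_value <= significance

# Illustrative c_t tuples in the adfuller output shape (not real results):
crit = {"1%": -3.5, "5%": -2.9, "10%": -2.6}
print(is_cointegrated((-3.5, 0.008, 0, 90, crit)))  # True: co-integrated
print(is_cointegrated((-1.2, 0.670, 0, 90, crit)))  # False: not co-integrated
```

Separating the rule from the data retrieval makes it easy to reuse the same check across many candidate pairs in a Stat Arb basket.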

### Output

To know more about Pairs Trading, Statistical Arbitrage, and the ADF test, you can check out the self-paced online certification course on "Statistical Arbitrage Trading" offered jointly by QuantInsti and MCX to learn how to trade Statistical Arbitrage strategies using Python and Excel.