This week's R bulletin covers how to read select columns, the difference between the boolean operators, converting a factor to a numeric, and changing the memory available to R. We will also cover the data, format, tolower, toupper, and strsplit functions. We hope you like this R weekly bulletin. Enjoy reading!
1. To change the working directory – Ctrl+Shift+K
2. To show history – Ctrl+4
3. To display environment – Ctrl+8
Problem Solving Ideas
Read a limited number of columns from a dataset
To read a specific set of columns from a dataset one can use the “fread” function from the data.table package. The syntax for the function to be used for reading/dropping a limited number of columns is given as:
fread(input, select, stringsAsFactors=FALSE)
fread(input, drop, stringsAsFactors=FALSE)
One needs to specify the column names or column numbers with the select parameter as shown below.
library(data.table)
data = fread("data.txt")
print(data)

# Reading select columns
data = fread("data.txt", select = c("DATE", "OPEN", "CLOSE"))
print(data)
Alternatively, you can use the drop parameter to indicate which columns should not be read:
data = fread("data.txt", drop = c(2:3, 6))
print(data)
Difference between the boolean operators & and &&
In R we have the "&" Boolean operator, which is the equivalent of the AND function in Excel. Similarly, we have | in R, which is the equivalent of OR in Excel. One also finds the Boolean operators && and || in R. Although these operators have the same meaning as & and |, they behave in a slightly different manner.
Difference between & and &&: The shorter form “&” is vectorized, meaning it can be applied on a vector. See the example below.
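A minimal sketch of the vectorized form, assuming an illustrative vector x (the name and values are not from the original bulletin). Each element of x is tested against both conditions, so the result has the same length as x:

```r
# Illustrative vector (assumed for this sketch)
x <- c(-1, 0, 1)

# & compares element by element and returns one logical value per element
in_range <- (x >= 0) & (x <= 0)
print(in_range)
# FALSE  TRUE FALSE
```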
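The longer form && evaluates left to right and is meant for single values. In older versions of R it silently examined only the first element of each vector; since R 4.3.0, passing a vector of length greater than one to && or || is an error. Thus, for a value of -1, it will first evaluate whether -1 >= 0 (which is FALSE). The right-hand side, -1 <= 0, would be TRUE, but && short-circuits: once the left-hand side is FALSE the right-hand side is not evaluated at all. The final answer is FALSE, the same result as FALSE & TRUE.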
Thus the && operator is to be used when we have vectors of length one. For vectors of length greater than one, we should use the short form.
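The rule above can be sketched as follows, again assuming an illustrative vector x that is not part of the original bulletin. The sketch uses && only on single elements, so it should run on both old and recent versions of R:

```r
# Illustrative vector (assumed for this sketch)
x <- c(-1, 0, 1)

# && on length-one values: test only the first element
first_in_range <- (x[1] >= 0) && (x[1] <= 0)
print(first_in_range)
# FALSE

# && short-circuits: the right-hand side is never evaluated here,
# so the call to stop() does not raise an error
safe <- FALSE && stop("never reached")
print(safe)
# FALSE

# For longer vectors, combine the vectorized & with all() or any()
all_in_range <- all((x >= 0) & (x <= 0))
print(all_in_range)
# FALSE
```

Wrapping the vectorized comparison in all() or any() is a common way to get a single TRUE/FALSE for use in an if statement.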
How to convert a factor to a numeric without a loss of information
Consider the following factor “x” created using the factor, sample, and the runif functions. Let us try to convert this factor to a numeric.
x = factor(sample(runif(5), 6, replace = TRUE))
print(x)
0.660900804912671 0.735364762600511 0.479244165122509 0.397552277892828
0.660900804912671 0.660900804912671
4 Levels: 0.397552277892828 0.479244165122509 … 0.735364762600511
as.numeric(x)
3 4 2 1 3 3
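As can be seen from the output, upon conversion we do not get the original values. Instead, as.numeric returns the internal integer codes of the factor, that is, the position of each value among the sorted levels. R does not have a single function that performs this conversion directly. To overcome this problem, one can use the following expression: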
as.numeric(levels(x))[x]
0.6609008 0.7353648 0.4792442 0.3975523 0.6609008 0.6609008
One can also use the as.numeric(as.character(x)) expression. However, as.numeric(levels(x))[x] is slightly more efficient, because it converts only the distinct levels to numbers and then indexes them with the factor.
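Because the factor above is built from random numbers, its output changes on every run. The following sketch repeats the comparison on a small fixed factor (the values are assumed purely for illustration) so that the three conversions can be checked by eye:

```r
# A fixed factor whose labels are numbers stored as text (illustrative values)
f <- factor(c("10.5", "3.2", "10.5"))

# Returns the internal level codes, not the original values
codes <- as.numeric(f)
print(codes)
# 1 2 1

# Both of these recover the original numbers
via_levels <- as.numeric(levels(f))[f]
via_character <- as.numeric(as.character(f))
print(via_levels)
# 10.5  3.2 10.5
```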
Replace a part of URL using R
Suppose you want to scrape data from a site with several pages, and want to change the page in the URL using R. We can do this by using the parse_url and build_url functions from the "httr" package. The following example illustrates how to replace a part of the URL.
library(httr)

# In the following URL, the value 2 at the end signifies page 2 of the books
# category under the section "bestsellers".
testURL = "http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#2"

# Entering the URL and parsing it using the parse_url function
parseURL = parse_url(testURL)
print(parseURL)

# Assigning the next page number (i.e. page no. 3) and creating the new URL
parseURL$fragment = 3
newURL <- build_url(parseURL)
print(newURL)
"http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#3"
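If the httr package is not at hand, a rough base R alternative is plain string replacement. This is a different technique from the parse_url/build_url approach above, and it assumes the page number is always the trailing "#<digits>" part of the URL. The pages vector and the variable names are made up for the sketch:

```r
testURL <- "http://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b#2"

# Pages we want to visit (assumed for illustration)
pages <- 1:3

# Strip the trailing "#<digits>" fragment, then append each page number
baseURL <- sub("#[0-9]+$", "", testURL)
pageURLs <- paste0(baseURL, "#", pages)
print(pageURLs)
```

Parsing the URL with httr is more robust when other parts of the URL (query parameters, path segments) also need to change.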
Increasing (or decreasing) the memory available to R processes
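In R there is a memory.limit() function which gives you the amount of memory available to R in MB. It applies only to R on Windows, and from R 4.2.0 onwards it is no longer supported and simply returns Inf with a warning. At times, Windows users may get the error that R has run out of memory. In such cases, you may set the amount of available memory.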
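A minimal sketch of how this looks in practice. The value 4000 is an assumed example, not a recommendation, and the guard is there because memory.limit() only has an effect on Windows with R versions before 4.2.0:

```r
# memory.limit() is only meaningful on Windows with R < 4.2.0
can_adjust <- .Platform$OS.type == "windows" && getRversion() < "4.2.0"

if (can_adjust) {
  # Current limit in MB
  print(memory.limit())
  # Request a higher limit; 4000 MB is an assumed value for a 64-bit system
  memory.limit(size = 4000)
} else {
  message("memory.limit() has no effect on this platform or R version")
}
```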
The unit is MB. You may increase this value up to 2 GB or the maximum amount of physical RAM you have installed. If you have R already installed and subsequently install more RAM, you may have to reinstall R in order to take advantage of the additional capacity.
On 32-bit Windows, R can only use up to 3 GB of RAM, regardless of how much you have installed. There is a 64-bit version of R for Windows available from REvolution Computing, which runs on 64-bit Windows and can use the entire RAM available.
R includes a number of packages, many of which come with their own data sets. These data sets can be accessed using the data function. This function loads specified data sets or lists the available data sets. To check the available data sets from all the R packages installed on your system, use the data() expression. This will display all the available data sets under each package.
To view data sets from a particular package, specify the package name as the argument to the data function.
data(package = "lubridate")
To load a particular dataset, use the name of the dataset as the argument to the data function.
# loading the USDCHF dataset from the timeSeries package
library(timeSeries)
data(USDCHF)
x = USDCHF
print(head(USDCHF))

# loading the sample_matrix dataset from the xts package
library(xts)
data(sample_matrix)
x = sample_matrix
print(head(sample_matrix))
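The same calls can be tried without installing anything, using the datasets package that ships with R. The sketch below is only an illustration; the choice of the mtcars data set is ours and not part of the original examples:

```r
# List the data sets in the built-in "datasets" package
available <- data(package = "datasets")
print(head(available$results[, c("Item", "Title")]))

# Load one of them by name and look at the first rows
data(mtcars)
print(head(mtcars, 3))
```

The object returned by data(package = ...) holds a character matrix in its results element, which is handy when you want to search the list of data sets programmatically.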
The format function is used to format numbers so that they appear clean and neat in reports. The function takes a number of arguments to control the formatting of the result. The main arguments for the format function are given below.
format(x, trim, digits, decimal.mark, nsmall, big.mark, big.interval, small.mark, small.interval, ...)
1) x: the number to be formatted.
2) trim: takes a logical value. If FALSE, it adds spaces to right-justify the result. If TRUE, it suppresses the leading spaces.
3) digits: how many significant digits of numeric values to display.
4) decimal.mark: the character to use as the decimal mark.
5) nsmall: the minimum number of digits after the decimal point.
6) big.mark: the mark inserted between groups of digits before the decimal point. The group size is set by big.interval (3 by default).
7) small.mark: the mark inserted between groups of digits after the decimal point. The group size is set by small.interval (5 by default).
format(34562.67435, digits=7, decimal.mark=".", big.mark=" ", small.mark=",", small.interval=2)
"34 562.67"
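In the example above, digits=7 asks for 7 significant digits, which is why the number is rounded to 34562.67, and the point (.) is used as the decimal mark. The big.mark argument inserts a space between each group of three digits before the decimal point (the default big.interval is 3), which gives "34 562". The small.mark and small.interval arguments work the same way after the decimal point: a comma would separate groups of 2 decimal digits, but since only two decimals are displayed here, no small mark appears in the output.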
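A few more calls may help to see each argument in isolation. This is a small sketch with made-up input values; the variable names are ours:

```r
# Thousands separator before the decimal point
with_commas <- format(1234567, big.mark = ",")
print(with_commas)
# "1,234,567"

# At least two digits after the decimal point
two_decimals <- format(2, nsmall = 2)
print(two_decimals)
# "2.00"

# A comma as the decimal mark
comma_decimal <- format(0.5, decimal.mark = ",")
print(comma_decimal)
# "0,5"

# trim = FALSE (the default) pads to a common width; trim = TRUE does not
padded <- format(c(1, 10, 100))
trimmed <- format(c(1, 10, 100), trim = TRUE)
print(padded)
# "  1" " 10" "100"
print(trimmed)
# "1"   "10"  "100"
```

Note that format always returns character strings, so the result is meant for display and not for further arithmetic.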
tolower, toupper, and strsplit
The tolower and toupper functions help convert strings to lowercase and uppercase respectively.
toupper("I coded a strategy in R")
"I CODED A STRATEGY IN R"

tolower("I coded a strategy in R")
"i coded a strategy in r"

df = as.data.frame(read.csv("NIFTY.csv"))
print(head(df, 4))

colnames(df)
"DATE" "TIME" "CLOSE" "HIGH" "LOW" "OPEN" "VOLUME"

tolower(colnames(df))
"date" "time" "close" "high" "low" "open" "volume"
The strsplit function is used to break/split strings at the specified split points. The function returns a list and not a character vector or a matrix. In the example below, we are splitting the expression on the spaces.
strsplit("I coded a strategy in R", split = " ")[[1]]
"I" "coded" "a" "strategy" "in" "R"
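Since strsplit returns a list, the pieces usually have to be pulled out before they can be used. The sketch below shows this for a single string and for a vector of strings. The date strings and variable names are made up for illustration:

```r
sentence <- "I coded a strategy in R"

# strsplit returns a list with one element per input string
pieces <- strsplit(sentence, split = " ")
print(class(pieces))
# "list"

# [[1]] extracts the character vector for the first (and only) string
words <- pieces[[1]]
upper_words <- toupper(words)
print(upper_words)
# "I" "CODED" "A" "STRATEGY" "IN" "R"

# With several input strings, pick a component from each list element
dates <- strsplit(c("2016-01-04", "2016-01-05"), split = "-")
years <- sapply(dates, `[`, 1)
print(years)
# "2016" "2016"
```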
We hope you liked this bulletin. In the next weekly bulletin, we will cover more problem-solving ideas and R functions for our readers.