Part I

Outline

• R Basics

• debugging tools and exception handling

• performance and profiling

• graphics in R

• dynamic reporting with R Markdown

• R Shiny applications

Why R?

• open source (free software)
• runs on every major operating system
• lightweight (compared with commerical software such as MATLAB, SAS)
• frequent releases (active development)
• fantastic package ecosystem (> 12,000 packages available on CRAN now)
• easy to get started (with some programming experience of MATLAB, C++, Python, etc.)

Setting up

• editors
• RStudio IDE (probably the most popular choice)
• Emacs + ESS (Emacs Speaks Statistics)
• other alternatives: Notepad++/NppToR, StatET (R + Eclipse), Vim, …
• A list of R packages needed is given below.
## for reproducing the slides
install.packages("revealjs")
## for following examples
install.packages(
c("bookdown", "data.table", "dplyr", "fortunes", "ggplot2",
"microbenchmark", "plotly", "profvis", "Rcpp", "shiny")
)

Getting help

• help() or ? for getting help on a specific topic (usually a function name) or package.

?help                         # details on using the help function
help(package = "stats")       # information about the stats package
• help.search() or ?? for fuzzy (local) search.

??splines                     # search "splines"
help.search("row mean")       # search "row mean"
• RSiteSearch(): http://search.r-project.org

• StackOverflow: http://stackoverflow.com/tags/r

Simple calculations

• use R as a calculator (?Arithmetic)
• assignment by <- or = (or ->?)
• : (colon operator) for generating regular sequences
• indexing and subsetting by [] for vectors
• c() for combining/concatenating vectors
x = (1 + 2) * 3 / 4                   # x = 2.25
x ^ 2                                 # 5.0625
x %% 2                                # 0.25
x %/% 2                               # 1
z <- y <- 4:1 - 1                     # z = y = (3, 2, 1, 0)
log(y)                                # (1.099, 0.693, 0.000, -∞)
y[1]                                  # 3
y[c(1, 2)]                            # (3, 2)
y[- 1]                                # (2, 1, 0)
y[- c(2, 3)]                          # (3, 0)
c(x, y)                               # (2.25, 3, 2, 1, 0)
c(exp(1), pi)                         # (2.718282, 3.141593)

Base data structures

Homogeneous Heterogeneous
1d Atomic vector List
2d Matrix Data frame
nd Array
• R does not have scalar types.

• Almost all other objects are built upon these foundations.

• str() (short for structure) gives a compact, human readable description of any R data structure.

Atomic vectors

• Four common types of atomic vectors are logical, integer, double (often called numeric), and character.
dbl_var <- c(1, 2.5, 4.5)                 # double
int_var <- c(1L, 6L, 10L)                 # integer (with L suffix)
log_var <- c(TRUE, FALSE, T, F)           # logical (avoid using T, F)
chr_var <- c("these are", "some strings") # character
• NA: Not Available/missing values
• NaN: Not a Number
• NULL: The null object
is.na(c(NA, NaN))                         # (TRUE, TRUE)
is.nan(c(NA, NaN))                        # (FALSE, TRUE)
is.null(NA)                               # FALSE
is.null(NaN)                              # FALSE
is.null(NULL)                             # TRUE

Some useful operators and functions for atomic vectors:

c(), vector(), length(), names(), setNames()
rep(), rep.int(), rep_len(),
seq(), seq.int(), seq_len(), seq.along(),
%in%, match(), rev(), head(), tail(), paste(), paste0()
typeof(), is.character(), as.character(), is.numeric(), as.numeric()
is.logical(), as.logical(), ...
## e.g.,
x <- 1:3                         # (1, 2, 3)
length(x)                        # 3
names(x) <- letters[x]           # add names to x: "a", "b", and "c"
y <- setNames(x, LETTERS[x])     # (1, 2, 3) with names "A", "B", and "C"
names(y)                         # "A", "B", and "C"
y["A"]                           # 1 with name "A"
head(y, 2)                       # (1, 2) with names "A", "B"
head(y, - 1)                     # the same with head(y, 2)
y[c("B", "C")] <- c(4, 6)        # y = (1, 4, 6) with names "A", "B", "C"
rev(y)                           # (4, 2, 1) with names "C", "B", and "A"
c("A", "B") %in% names(y)        # (TRUE, TRUE)
z <- rep(letters[x], x)          # z = ("a", "b", "b", "c", "c", "c")
paste0(z, 1:6)                   # "a1", "b2", "b3", "c4", "c5", and "c6"
## Basic math functions
abs(), sign()
sin(), cos(), tan()
acos(), asin(), atan(), atan2()
ceiling(), floor(), round(), trunc(), signif()
exp(), log(), log10(), log2(), sqrt()
choose(), factorial(), combn()
union(), intersect(), setdiff(), setequal()
sum(), prod(), max(), min(), pmax(), pmin()
cummax(), cummin(), cumprod(), cumsum(), diff()
mean(), median(), cor(), sd(), var(), range()
## Ordering and tabulating
duplicated(), unique()
sort(), order(), rank(), quantile()
rle(), table(), xtabs(), split(), cut()
## Random variables
(d, p, q, r) + (beta, binom, cauchy, chisq, exp, f, gamma, geom,
hyper, lnorm, logis, multinom, nbinom, norm, pois, signrank, t,
unif, weibull, wilcox, birthday, tukey)

e.g., rnorm() for generating random numbers following normal distribution.

Matrices and arrays

a <- matrix(LETTERS[1:6], nrow = 2)  # 2 x 3 matrix ("A"-"F" by columns)
dim(a)                               # (2, 3)
a[2]                                 # "B"
a[1, ]                               # the first row
a[, 2]                               # the second column
a[1, 1:2]                            # "A" and "C"
a[rbind(c(1, 3), c(2, 1), c(2, 2))]  # "E", "B", "D"
b <- array(1:12, c(2, 3, 2))         # a 2 x 3 x 2 array
dim(b)                               # (2, 3, 2)

Some useful functions and operators:

dim(), ncol(), NCOL(), nrow(), NROW()
names(), colnames(), rownames(), dimnames()
cbind(), rbind()
t(), aperm(), diag(), sweep()
rowSums(), colSums(), rowMeans(), colMeans()
as.matrix(), data.matrix()
crossprod(), tcrossprod(), %*%, %o%, outer()
eigen(), qr(), svd(), rcond(), solve()

Data frames

dat <- data.frame(x = 1:3, y = c("a", "b", "c"))
dat$z <- as.character(dat$x + 10)
dat$fac_z <- factor(dat$z)
str(dat)
## 'data.frame':    3 obs. of  4 variables:
##  $x : int 1 2 3 ##$ y    : Factor w/ 3 levels "a","b","c": 1 2 3
##  $z : chr "11" "12" "13" ##$ fac_z: Factor w/ 3 levels "11","12","13": 1 2 3
• Factors: integer vectors with attribute class and levels
• related functions: factor(), gl(), levels(), relevel(), as.factor(), is.factor()
• Let stringsAsFactors = FALSE when creating data frame or set options(stringsAsFactors = FALSE) globally for not converting character strings as factors.

Lists

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
names(x) <- c("foo", "bar", "alpha", "beta")
str(x)
## List of 4
##  $foo : int [1:3] 1 2 3 ##$ bar  : chr "a"
##  $alpha: logi [1:3] TRUE FALSE TRUE ##$ beta : num [1:2] 2.3 5.9
y <- list(x, sum)
str(y)
## List of 2
##  $:List of 4 ## ..$ foo  : int [1:3] 1 2 3
##   ..$bar : chr "a" ## ..$ alpha: logi [1:3] TRUE FALSE TRUE
##   ..$beta : num [1:2] 2.3 5.9 ##$ :function (..., na.rm = FALSE)
y[[2]](1:10)
## [1] 55
z <- list(list(list()))
str(z)
## List of 1
##  $:List of 1 ## ..$ : list()

Some commonly used functions for data frames and lists:

list(), unlist()
data.frame(), as.data.frame()
names(), colnames(), rownames()
##  $b: logi TRUE f() ## list() • The ... is also useful when the number of arguments passed to the function cannot be known in advance. str(paste) ## function (..., sep = " ", collapse = NULL) str(cat) ## function (..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE) • All the arguments that appears after the ... in the argument list has to be specified explicitly by the names (cannot be partially matched). paste(letters[1:3], c(2, 4, 5), sep = " = ", collapse = ", ") ## [1] "a = 2, b = 4, c = 5" Lexical scoping • How does R look up the value of a symbol? • The scoping rules determine how a value is associated with a free variable in a function. f <- function(x, y) { x * 3 + y / z # z is called a free variable } z <- 10 f(1, 20) ## [1] 5 • The values of free variables are first searched for in the environment where the function was defined. • What is an environment? • a collection of symbol and value pairs • Every environment has a parent environment except the empty environment. • Things get really interesting when we define a function inside another function. power <- function(pow) { function(x) x ^ pow } square <- power(2) square(3) ## [1] 9 cube <- power(3) cube(2) ## [1] 8 ls(environment(square)) ## [1] "pow" get("pow", environment(square)) ## [1] 2 ls(environment(cube)) ## [1] "pow" get("pow", environment(cube)) ## [1] 3 search() ## [1] ".GlobalEnv" "package:shiny" "package:Rcpp" ## [4] "package:profvis" "package:plotly" "package:microbenchmark" ## [7] "package:ggplot2" "package:fortunes" "package:dplyr" ## [10] "package:data.table" "package:bookdown" "package:stats" ## [13] "package:graphics" "package:grDevices" "package:utils" ## [16] "package:datasets" "package:methods" "Autoloads" ## [19] "package:base" Consider the following example: y <- 10 f <- function(x) { y <- 2 y ^ 2 + g(x) } g <- function(x) x * y # what is the value of y here? f(3) • Inside of the function g, what is the value of y? • lexical scoping: y = 10 • dynamic scoping: y = 2 instead • In R, the calling environment (where the function is called) is known as the parent frame. Control structures ## a simple example of the if-else statement ## for some vector variable x if (length(x) > 10) { ## do something here if length of x > 10 } else if (length(x) > 5) { ## do something else if length of x > 5 but <= 10 } else { ## do something else } ## a simple loop that seems to be okay ## however, it is a not necessary loop in R! x <- 0; y <- c(4, 3, 9) for (i in 1:3) { x <- x + y[i] } x == sum(y) ## [1] TRUE ## see also switch(), ifelse(), while(), break, next, repeat(), return() Loop helper functions • apply and its friends lapply, sapply, tapply, mapply, vapply, and more str(apply) ## function (X, MARGIN, FUN, ...) set.seed(123) mat <- matrix(rnorm(200), nrow = 10) apply(mat, 1, quantile, probs = c(0.25, 0.75)) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] ## 25% -0.5641935 -0.3024751 -0.6244192 -0.7630466 -0.7939067 -0.5482448 -0.314301 ## 75% 0.7232730 0.4018363 0.2620994 0.7028158 0.2268176 1.0520128 0.609674 ## [,8] [,9] [,10] ## 25% -0.7857089 -0.4379614 -0.6305012 ## 75% 0.3416559 0.7605316 0.2541973 apply(mat, 2, function(a) max(a) - min(a)) ## [1] 2.980126 3.753530 2.940508 1.275597 3.434352 3.065223 3.121876 3.334740 1.519468 ## [10] 3.213754 2.586939 2.225847 2.805719 3.962351 3.701645 1.985350 4.501195 3.224448 ## [19] 2.326511 3.308015 • also works for arrays x <- array(rnorm(200), dim = c(2, 2, 10)) apply(x, c(1, 2), mean) ## [,1] [,2] ## [1,] 0.4205324 -0.1052822 ## [2,] -0.1114394 -0.1615257 rowMeans(x, dims = 2) # better performance as shown later ## [,1] [,2] ## [1,] 0.4205324 -0.1052822 ## [2,] -0.1114394 -0.1615257 apply(x, c(1, 2), sum) ## [,1] [,2] ## [1,] 4.205324 -1.052822 ## [2,] -1.114394 -1.615257 rowSums(x, dims = 2) # better performance as shown later ## [,1] [,2] ## [1,] 4.205324 -1.052822 ## [2,] -1.114394 -1.615257 • lapply: loops over a vector/list and apply a function to each element • sapply: same with lapply but try to simplify the result a <- 1:3 (foo <- lapply(a, function(b) rnorm(b))) ## [[1]] ## [1] -0.07355602 ## ## [[2]] ## [1] -1.1686514 -0.6347483 ## ## [[3]] ## [1] -0.02884155 0.67069597 -1.65054654 sapply(foo, mean) ## [1] -0.07355602 -0.90169984 -0.33623071 dat <- subset(iris, subset = Species %in% levels(iris$Species)[1],
select = - Species)
str(dat)
## 'data.frame':    50 obs. of  4 variables:
##  $Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ##$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ##$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
str(lapply(dat, quantile, probs = c(0.25, 0.75)))
## List of 4
##  $Sepal.Length: Named num [1:2] 4.8 5.2 ## ..- attr(*, "names")= chr [1:2] "25%" "75%" ##$ Sepal.Width : Named num [1:2] 3.2 3.68
##   ..- attr(*, "names")= chr [1:2] "25%" "75%"
##  $Petal.Length: Named num [1:2] 1.4 1.58 ## ..- attr(*, "names")= chr [1:2] "25%" "75%" ##$ Petal.Width : Named num [1:2] 0.2 0.3
##   ..- attr(*, "names")= chr [1:2] "25%" "75%"
sapply(dat, quantile, probs = c(0.25, 0.75))
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 25%          4.8       3.200        1.400         0.2
## 75%          5.2       3.675        1.575         0.3

Vectorization

• Many operations and functions in R are vectorized, which makes computation more efficient, code more concise and easier to read.
• It is recommended to take advantage of the vectorized operations and avoid unnecessary (explicit) loops in R.
x <- 1:3; y <- 4:9; x + y               # x is recycled
## [1]  5  7  9  8 10 12
x > 2
## [1] FALSE FALSE  TRUE
pmax(x, 2)
## [1] 2 2 3
x <- matrix(1:4, nrow = 2); y <- matrix(10, 2, 2)
x * y                                   # element-wise multiplication
##      [,1] [,2]
## [1,]   10   30
## [2,]   20   40
x %*% y                                 # true matrix multiplication
##      [,1] [,2]
## [1,]   40   40
## [2,]   60   60
rep(1:3, times = 2:4)
## [1] 1 1 2 2 2 3 3 3 3
rnorm(10, mean = 1:5, sd = 0.1)         # mean is recycled
##  [1] 0.9650246 2.0756406 2.9461191 4.0227292 5.0492229 1.0267835 2.0653258 2.9877291
##  [9] 3.9586323 4.7356851
paste0("No.", 1:3)                      # "No." is recycled
## [1] "No.1" "No.2" "No.3"

• basic functions:
read.table(), read.csv(), ...
write.table(), write.csv(), ...
readLines(), cat()
• able to work with databases: RODBC, RMySQL, …
• JSON files? jsonlite, rjson, …
• YAML files? yaml, …
• BibTeX files? bibtex, …

Working with data.frame

• Base R has a variety of useful functions for working with data.

• two great add-on packages for working with data.frame: data.table and dplyr

Example: Sum aggregation by group

• Consider a StackOverflow question: How to sum a variable by group?
• Suppose we have integer vector x representing the group.
• We want to sum/aggregate y by group x.
set.seed(123)
dat <- data.frame(x = rpois(200, lambda = 5),
y = round(runif(200, max = 10)))
str(dat)
## 'data.frame':    200 obs. of  2 variables:
##  $x: int 4 7 4 8 9 2 5 8 5 5 ... ##$ y: num  2 10 6 5 4 9 4 3 2 2 ...
• The following solutions give equivalent results in slightly different formats.
res_1 <- with(dat, by(y, x, sum))                   # base::by

res_2 <- with(dat, tapply(y, x, sum))               # base::tapply

res_3 <- aggregate(y ~ x, data = dat, FUN = sum)    # stats::aggregate

res_4 <- xtabs(y ~ x, data = dat)                   # stats::xtabs

suppressMessages(library(plyr))                     # plyr::ddply
res_5 <- ddply(dat, "x", summarise, sum = sum(y))

suppressMessages(library(dplyr))                    # dplyr package
res_6 <- dat %>% group_by(x) %>% summarise(sum = sum(y))

suppressMessages(library(data.table))               # data.table package
dat <- as.data.table(dat)
res_7 <- dat[, .(sum = sum(y)), keyby = x]

• The package ecosystem is one of the reasons that makes R so great!

• too many packages? check out the task view: https://cran.r-project.org/web/views/

• quality control? a few thing we may check from CRAN
• package authors and maintainers
• public repository for version control and bug reports
• automated tests
• package vignettes, documentation, NEWS/changeLog
• reverse depends/imports/suggests
• release history

Debugging tools

traceback()
debug(), undebug()
trace(), untrace()
browser(), recover()
foo <- function(x) {
x2 <- x * 2
sum(log(x2))
}
foo(1:3)
## [1] 3.871201
foo("a")                                # leads to error
set.seed(123); foo(rnorm(100))          # leads to warning
foo <- function(x) {
x2 <- x * 2
browser()
sum(log(x2))
}
f <- function(x) {
x - g(x + 1)
}
g <- function(y) {
h(2 * y)
}
h <- function(z) {
a <- log(z)
if (all(a < 10))
a ^ 2
else
a
}
f(rnorm(10))          # Error: missing value where TRUE/FALSE needed
traceback()

Exception handling

• try(): allows execution to continue even after an error has occurred
• tryCatch(): a general tool for handling conditions in addition to errors
• withCallingHandlers(): a variant of tryCatch(), rarely needed
bar <- function(x) {
res <- tryCatch(as.numeric(x), warning = function(w) w)
if ("warning" %in% class(res)) x else res
}
str(bar("123"))
##  num 123
str(bar("abc"))
##  chr "abc"

Performance

• Consider the following example of computing the row sum for a given matrix mat.
mat <- matrix(rnorm(200), nrow = 10)
a <- vector(mode = "numeric", length = nrow(mat))
for (i in 1:nrow(mat)) a[i] <- sum(mat[i, ]) # 1. using a for loop
a
##  [1]  7.4389886 -0.6842283  3.0934989 -1.3245809  3.6883585  3.3878318 -2.9784455
##  [8] -6.0472031  1.5152872  0.3347129
apply(mat, 1, sum)                      # 2. using the apply function
##  [1]  7.4389886 -0.6842283  3.0934989 -1.3245809  3.6883585  3.3878318 -2.9784455
##  [8] -6.0472031  1.5152872  0.3347129
rowSums(mat)                            # 3. using the rowSums function
##  [1]  7.4389886 -0.6842283  3.0934989 -1.3245809  3.6883585  3.3878318 -2.9784455
##  [8] -6.0472031  1.5152872  0.3347129

Benchmarking

• microbenchmark, rbenchmark, benchmark, …
library(microbenchmark)
microbenchmark(
"for loop" = { a <- vector(mode = "numeric", length = nrow(mat));
for (i in 1:nrow(mat)) a[i] <- sum(mat[i, ]) },
"apply" = apply(mat, 1, sum),
"rowSums" = rowSums(mat),
times = 200, unit = "relative"
)
## Unit: relative
##      expr        min         lq       mean     median         uq       max neval cld
##  for loop 557.497754 405.001274 332.020959 302.998435 278.875915 296.68764   200   b
##     apply   7.568284   6.057741   5.280688   4.881616   4.894252   4.14813   200  a
##   rowSums   1.000000   1.000000   1.000000   1.000000   1.000000   1.00000   200  a
• R is not a fast language compared with C/C++/Fortran. However, a lot of R code is slow simply because it is poorly written.

A few quick benchmarkings

The R session information for benchmarking is given below.

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
##
## Matrix products: default
## BLAS: /usr/lib/libopenblasp-r0.3.0.dev.so
## LAPACK: /usr/lib/liblapack.so.3.8.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] plyr_1.8.4           shiny_1.1.0          Rcpp_0.12.18         profvis_0.3.5
##  [5] plotly_4.8.0         microbenchmark_1.4-4 ggplot2_3.0.0        fortunes_1.5-4
##  [9] dplyr_0.7.6          data.table_1.11.4    bookdown_0.7
##
## loaded via a namespace (and not attached):
##  [1] revealjs_0.9      zoo_1.8-3         tidyselect_0.2.4  xfun_0.3
##  [5] purrr_0.2.5       splines_3.5.1     lattice_0.20-35   colorspace_1.3-2
##  [9] htmltools_0.3.6   viridisLite_0.3.0 yaml_2.2.0        survival_2.42-6
## [13] rlang_0.2.1       pillar_1.3.0      later_0.7.3       glue_1.3.0
## [17] withr_2.1.2       bindrcpp_0.2.2    multcomp_1.4-8    bindr_0.1.1
## [21] stringr_1.3.1     munsell_0.5.0     gtable_0.2.0      htmlwidgets_1.2
## [25] mvtnorm_1.0-8     codetools_0.2-15  evaluate_0.11     knitr_1.20
## [29] httpuv_1.4.5      TH.data_1.0-9     xtable_1.8-2      scales_0.5.0
## [33] backports_1.1.2   promises_1.0.1    jsonlite_1.5      mime_0.5
## [37] digest_0.6.15     stringi_1.2.4     grid_3.5.1        rprojroot_1.3-2
## [41] tools_3.5.1       sandwich_2.4-0    magrittr_1.5      lazyeval_0.2.1
## [45] tibble_1.4.2      crayon_1.3.4      tidyr_0.8.1       pkgconfig_2.0.1
## [49] MASS_7.3-50       Matrix_1.2-14     assertthat_0.2.0  rmarkdown_1.10
## [53] httr_1.3.1        R6_2.2.2          compiler_3.5.1
• 1:length(vec) vs. seq_along(vec)
vec <- rnorm(50)
stopifnot(all.equal(1:length(vec), seq_along(vec)))
microbenchmark(1:length(vec), seq_along(vec),
times = 1e3, unit = "relative")
## Unit: relative
##            expr      min       lq     mean   median       uq      max neval cld
##   1:length(vec) 1.567742 1.649103 1.534697 1.585778 1.539155 1.074651  1000   b
##  seq_along(vec) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000  1000  a
• 1:n vs. seq_len(n)
stopifnot(all.equal(1:100, seq_len(100)))
microbenchmark(1:100, seq_len(100), times = 1e3, unit = "relative")
## Unit: relative
##          expr      min       lq     mean   median       uq       max neval cld
##         1:100 1.434783 1.353846 1.211488 1.272997 1.219895 0.8063325  1000   b
##  seq_len(100) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000  1000  a
• t(mat1) %*% mat1 vs. crossprod(mat1)
mat1 <- matrix(rnorm(1e4), 1e2)
stopifnot(all.equal(mat1 %*% t(mat1), tcrossprod(mat1)))
microbenchmark(t(mat1) %*% mat1, crossprod(mat1), unit = "relative")
## Unit: relative
##              expr      min       lq     mean   median       uq      max neval cld
##  t(mat1) %*% mat1 2.068901 2.034417 2.121033 2.041875 2.194208 3.642643   100   b
##   crossprod(mat1) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100  a
• mat1 %*% t(mat1) vs. tcrossprod(mat1)
stopifnot(all.equal(mat1 %*% t(mat1), tcrossprod(mat1)))
microbenchmark(mat1 %*% t(mat1), tcrossprod(mat1), unit = "relative")
## Unit: relative
##              expr      min       lq     mean   median       uq      max neval cld
##  mat1 %*% t(mat1) 1.896281 1.855674 1.891234 1.879163 1.870892 3.561835   100   b
##  tcrossprod(mat1) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100  a
• t(mat1) %*% mat2 vs. crossprod(mat1, mat2)
mat2 <- matrix(rnorm(1e4), 1e2)
stopifnot(all.equal(t(mat1) %*% mat2, crossprod(mat1, mat2)))
microbenchmark(t(mat1) %*% mat2, crossprod(mat1, mat2), unit = "relative")
## Unit: relative
##                   expr      min       lq     mean   median       uq       max neval cld
##       t(mat1) %*% mat2 1.061678 1.085385 1.092117 1.087051 1.131448 0.5650585   100   b
##  crossprod(mat1, mat2) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100  a
• mat1 %*% t(mat2) vs. tcrossprod(mat1, mat2)
stopifnot(all.equal(mat1 %*% t(mat2), tcrossprod(mat1, mat2)))
microbenchmark(mat1 %*% t(mat2), tcrossprod(mat1, mat2), unit = "relative")
## Unit: relative
##                    expr      min       lq    mean   median       uq      max neval cld
##        mat1 %*% t(mat2) 1.034565 1.037275 1.05693 1.028481 1.047323 2.210879   100   b
##  tcrossprod(mat1, mat2) 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000   100  a
• any(is.na(x)) vs. anyNA(x)
x <- 1:1e4; x[5e3] <- NaN               # coerces x to be double
stopifnot(all.equal(any(is.na(x)), anyNA(x)))
microbenchmark(any(is.na(x)), anyNA(x), unit = "relative")
## Unit: relative
##           expr      min       lq   mean   median     uq      max neval cld
##  any(is.na(x)) 5.337361 5.371203 5.6475 5.341983 5.3265 20.68088   100   b
##       anyNA(x) 1.000000 1.000000 1.0000 1.000000 1.0000  1.00000   100  a
• apply vs. rowSums (rowMeans, colSums, colMeans)
stopifnot(all.equal(apply(mat1, 2, mean), colMeans(mat1)))
microbenchmark(apply(mat1, 2, mean), colMeans(mat1), unit = "relative")
## Unit: relative
##                  expr      min      lq     mean   median       uq      max neval cld
##  apply(mat1, 2, mean) 42.90425 41.0403 31.74537 36.29913 25.37704 68.62862   100   b
##        colMeans(mat1)  1.00000  1.0000  1.00000  1.00000  1.00000  1.00000   100  a

Revisit the example of sum aggregation by group

microbenchmark(
"by" = with(dat, by(y, x, sum)),
"tapply" = with(dat, tapply(y, x, sum)),
"aggregate" = aggregate(y ~ x, data = dat, FUN = sum),
"xtabs" = xtabs(y ~ x, data = dat),
"ddply" = ddply(dat, "x", summarise, sum = sum(y)),
"dplyr" = dat %>% group_by(x) %>% summarise(sum = sum(y)),
"data.table" = dat[, .(sum = sum(y)), keyby = x],
times = 1e3, unit = "relative"
)
## Unit: relative
##        expr       min        lq      mean    median        uq       max neval   cld
##          by  3.840108  3.781034  3.462978  3.586863  3.216732  1.478090  1000  b
##      tapply  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1000 a
##   aggregate  7.237326  7.210388  6.763582  7.142968  6.420644  1.914705  1000    d
##       xtabs  5.406453  5.439945  5.182376  5.476047  4.987239  1.564497  1000   c
##       ddply 22.748202 22.080111 20.367626 20.675366 18.096151 58.834953  1000     e
##       dplyr  5.224383  6.000011  5.971183  5.894329  5.285228 28.702014  1000   cd
##  data.table  3.148850  3.535590  3.332793  3.579980  3.160607  1.413875  1000  b

Profiling

• How to find the bottleneck of the code (the slowest part of your code)?
• related packages: utils, profvis, profmem, …
## example of profvis
if (interactive()) {
profvis({
dat <- data.frame(x = rnorm(5e4),
y = rnorm(5e4))
plot(x ~ y, data = dat)
m <- lm(x ~ y, data = dat)
abline(m, col = "red")
})
}

Integrating R and other languages

• C/Fortran: .Call(), .C, .Fortran, dyn.load(), …

• C++: inline, Rcpp, RcppArmadillo, …

• Julia: XRJulia, JuliaCall, …

• Python: XRPython, reticulate, …

• MATLAB: R.matlab, …

Sum aggregation by groups using Rcpp?

• Try a straightforward (but possibly naive) implementation in Rcpp.
library(Rcpp)
sourceCpp("src/aggregateSum.cpp")     # available at the source repository
res_8 <- with(dat, aggregateSum(y, x))
microbenchmark(
"tapply" = with(dat, tapply(y, x, sum)),
"rcpp_wi_names" = with(dat, aggregateSum(y, x)),
"rcpp_wo_names" = with(dat, aggregateSum(y, x, addNames = FALSE)),
times = 1e3, unit = "relative"
)
## Unit: relative
##           expr      min       lq     mean   median       uq      max neval cld
##         tapply 8.957147 8.799673 8.790180 8.521680 8.436642 51.66355  1000   c
##  rcpp_wi_names 3.540774 3.444513 3.487307 3.346194 3.286836 30.28195  1000  b
##  rcpp_wo_names 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000  1000 a

Parallel computing

• One R session/process normally uses only one core.
• If the computing tasks are independent and can be run individually, we may want to run them using different cores at the same time.
• e.g., fitting models over 1,000 simulated datasets
• Some packages: parallel, snow, snowfall, foreach, …
• See more at the CRAN task view on high-performance and parallel computing.
• Run R script with input arguments in terminal with GNU parallel? Or just use the HPC cluster if available?

Graphics in R

• two basic graphics systems in R
• base graphics (the graphics package)
• grid graphics (the grid package)
• lattice: Trellis Graphics for R
• sub-plots conditional on categorical variables
• one of the recommended packages shipped with R
• ggplot2: An R implementation of Leland Wilkinson’s Grammar of Graphics
• interactive plots? plotly, highcharter, Dygraphs, …
• See more at the CRAN task view on graphics.
## a simple example of using base plotting system
par(mfrow = c(1, 2), mgp = c(2, 1, 0), mar = c(3, 3, 2, 0.1))
x <- 1:22; plot(x, pch = x, col = x, cex = 2)
abline(a = 0, b = 1, lty = 3, col = "#64C7C9")
abline(h = 21, col = "#BDC0C3", lty = 2)
hist(rnorm(200), breaks = 30, col = "#85B3CC", freq = FALSE)

## example boxplots of using ggplot2
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
(p1 <- p + geom_boxplot(aes(colour = drv)))

## flip the coordinate and use the classic dark-on-light theme
p1 + coord_flip() + theme_bw()

R Markdown

• code + narratives = report
• some existing tools:
• WEB (Donald Knuth, Literate Programming)
• Noweb (Norman Ramsey)
• Sweave (Friedrich Leisch and R-core)
• knitr (Yihui Xie)
• Org mode (Carsten Dominik, Emacs)
• Jupyter notebook
• knitr:
• .Rnw (R + LaTeX)
• .Rmd (R + Markdown)
• any computing language + any authoring language

Trend of vertical difference?

x <- seq(0.1, 10, 0.1)
plot(x, 1/x, type = "l", lwd = 2)
lines(x, 1/x + 2, col = "red", lwd = 2)

“Use the source, Luke!”

library(fortunes)
fortune(250)
##
## As Obi-Wan Kenobi may have said in Star Wars: "Use the source, Luke!"
##    -- Barry Rowlingson (answering a question on the documentation of some implementation
##       details)
##       R-devel (January 2010)
• from the ESS manual:

The source code is real. The objects are realizations of the source code.

R Shiny applications

• I saw this cool plate at (UConn) Y lot in last Summer!

The “Hello Shiny!” example

library("shiny")
ui <- fluidPage(
titlePanel("Hello Shiny!"),
sidebarLayout(
sidebarPanel(sliderInput("bins", "Number of bins:",
min = 1, max = 50, value = 30)),
mainPanel(plotOutput("distPlot"))
)
)
server <- function(input, output) {
output$distPlot = renderPlot({ x = faithful[, 2] bins = seq(min(x), max(x), length.out = input$bins + 1)
hist(x, breaks = bins, col = 'darkgray', border = 'white')
})
}
shinyApp(ui = ui, server = server)

A simple miniUI example

• source available at here