Introduction to R Programming

Part I

Wenjie Wang
Department of Statistics, UConn

January 19, 2018

Getting Started

Outline

  • R Basics

  • debugging tools and exception handling

  • performance and profiling

  • graphics in R

  • dynamic reporting with R Markdown

  • R Shiny applications

What is R?

Why R?

  • open source (free software)
  • runs on every major operating system
  • lightweight (compared with commerical software such as MATLAB, SAS)
  • frequent releases (active development)
  • fantastic package ecosystem (> 12,000 packages available on CRAN now)
  • easy to get started (with some programming experience of MATLAB, C++, Python, etc.)

Setting up

  • editors
    • RStudio IDE (probably the most popular choice)
    • Emacs + ESS (Emacs Speaks Statistics)
    • other alternatives: Notepad++/NppToR, StatET (R + Eclipse), Vim, …
  • A list of R packages needed is given below.

Getting help

Simple calculations

  • use R as a calculator (?Arithmetic)
  • assignment by <- or = (or ->?)
  • : (colon operator) for generating regular sequences
  • indexing and subsetting by [] for vectors
  • c() for combining/concatenating vectors

Base data structures

Homogeneous Heterogeneous
1d Atomic vector List
2d Matrix Data frame
nd Array
  • R does not have scalar types.

  • Almost all other objects are built upon these foundations.

  • str() (short for structure) gives a compact, human readable description of any R data structure.

Atomic vectors

  • Four common types of atomic vectors are logical, integer, double (often called numeric), and character.
  • NA: Not Available/missing values
  • NaN: Not a Number
  • NULL: The null object

Some useful operators and functions for atomic vectors:

e.g., rnorm() for generating random numbers following normal distribution.

Matrices and arrays

Some useful functions and operators:

Data frames

## 'data.frame':    3 obs. of  4 variables:
##  $ x    : int  1 2 3
##  $ y    : Factor w/ 3 levels "a","b","c": 1 2 3
##  $ z    : chr  "11" "12" "13"
##  $ fac_z: Factor w/ 3 levels "11","12","13": 1 2 3
  • Factors: integer vectors with attribute class and levels
    • related functions: factor(), gl(), levels(), relevel(), as.factor(), is.factor()
  • Let stringsAsFactors = FALSE when creating data frame or set options(stringsAsFactors = FALSE) globally for not converting character strings as factors.

Lists

## List of 4
##  $ foo  : int [1:3] 1 2 3
##  $ bar  : chr "a"
##  $ alpha: logi [1:3] TRUE FALSE TRUE
##  $ beta : num [1:2] 2.3 5.9
## List of 2
##  $ :List of 4
##   ..$ foo  : int [1:3] 1 2 3
##   ..$ bar  : chr "a"
##   ..$ alpha: logi [1:3] TRUE FALSE TRUE
##   ..$ beta : num [1:2] 2.3 5.9
##  $ :function (..., na.rm = FALSE)
## [1] 55
## List of 1
##  $ :List of 1
##   ..$ : list()

Some commonly used functions for data frames and lists:

Functions

## [1] "foobar"

Lazy evaluation

  • Arguments to functions are evaluated lazily (only as needed).
## [1] 3

The ... (ellipsis) argument

  • a “placeholder” for possibly a number of arguments that may passed on to other functions
  • can be left unspecified
## List of 2
##  $ a: chr "alpha"
##  $ b: logi TRUE
## list()
  • The ... is also useful when the number of arguments passed to the function cannot be known in advance.
## function (..., sep = " ", collapse = NULL)
## function (..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
  • All the arguments that appears after the ... in the argument list has to be specified explicitly by the names (cannot be partially matched).
## [1] "a = 2, b = 4, c = 5"

Lexical scoping

  • How does R look up the value of a symbol?
  • The scoping rules determine how a value is associated with a free variable in a function.
## [1] 5
  • The values of free variables are first searched for in the environment where the function was defined.
  • What is an environment?
    • a collection of symbol and value pairs
  • Every environment has a parent environment except the empty environment.

  • Things get really interesting when we define a function inside another function.

## [1] 9
## [1] 8
## [1] "pow"
## [1] 2
## [1] "pow"
## [1] 3
##  [1] ".GlobalEnv"             "package:shiny"          "package:Rcpp"          
##  [4] "package:profvis"        "package:plotly"         "package:microbenchmark"
##  [7] "package:ggplot2"        "package:fortunes"       "package:dplyr"         
## [10] "package:data.table"     "package:bookdown"       "package:stats"         
## [13] "package:graphics"       "package:grDevices"      "package:utils"         
## [16] "package:datasets"       "package:methods"        "Autoloads"             
## [19] "package:base"

Consider the following example:

  • Inside of the function g, what is the value of y?
    • lexical scoping: y = 10
    • dynamic scoping: y = 2 instead
  • In R, the calling environment (where the function is called) is known as the parent frame.

Control structures

## [1] TRUE

Loop helper functions

  • apply and its friends lapply, sapply, tapply, mapply, vapply, and more
## function (X, MARGIN, FUN, ...)
##           [,1]       [,2]       [,3]       [,4]       [,5]       [,6]      [,7]
## 25% -0.5641935 -0.3024751 -0.6244192 -0.7630466 -0.7939067 -0.5482448 -0.314301
## 75%  0.7232730  0.4018363  0.2620994  0.7028158  0.2268176  1.0520128  0.609674
##           [,8]       [,9]      [,10]
## 25% -0.7857089 -0.4379614 -0.6305012
## 75%  0.3416559  0.7605316  0.2541973
##  [1] 2.980126 3.753530 2.940508 1.275597 3.434352 3.065223 3.121876 3.334740 1.519468
## [10] 3.213754 2.586939 2.225847 2.805719 3.962351 3.701645 1.985350 4.501195 3.224448
## [19] 2.326511 3.308015
  • also works for arrays
##            [,1]       [,2]
## [1,]  0.4205324 -0.1052822
## [2,] -0.1114394 -0.1615257
##            [,1]       [,2]
## [1,]  0.4205324 -0.1052822
## [2,] -0.1114394 -0.1615257
##           [,1]      [,2]
## [1,]  4.205324 -1.052822
## [2,] -1.114394 -1.615257
##           [,1]      [,2]
## [1,]  4.205324 -1.052822
## [2,] -1.114394 -1.615257
  • lapply: loops over a vector/list and apply a function to each element
  • sapply: same with lapply but try to simplify the result
## [[1]]
## [1] -0.07355602
## 
## [[2]]
## [1] -1.1686514 -0.6347483
## 
## [[3]]
## [1] -0.02884155  0.67069597 -1.65054654
## [1] -0.07355602 -0.90169984 -0.33623071
## 'data.frame':    50 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## List of 4
##  $ Sepal.Length: Named num [1:2] 4.8 5.2
##   ..- attr(*, "names")= chr [1:2] "25%" "75%"
##  $ Sepal.Width : Named num [1:2] 3.2 3.68
##   ..- attr(*, "names")= chr [1:2] "25%" "75%"
##  $ Petal.Length: Named num [1:2] 1.4 1.58
##   ..- attr(*, "names")= chr [1:2] "25%" "75%"
##  $ Petal.Width : Named num [1:2] 0.2 0.3
##   ..- attr(*, "names")= chr [1:2] "25%" "75%"
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 25%          4.8       3.200        1.400         0.2
## 75%          5.2       3.675        1.575         0.3

Vectorization

  • Many operations and functions in R are vectorized, which makes computation more efficient, code more concise and easier to read.
  • It is recommended to take advantage of the vectorized operations and avoid unnecessary (explicit) loops in R.
## [1]  5  7  9  8 10 12
## [1] FALSE FALSE  TRUE
## [1] 2 2 3
##      [,1] [,2]
## [1,]   10   30
## [2,]   20   40
##      [,1] [,2]
## [1,]   40   40
## [2,]   60   60
## [1] 1 1 2 2 2 3 3 3 3
##  [1] 0.9650246 2.0756406 2.9461191 4.0227292 5.0492229 1.0267835 2.0653258 2.9877291
##  [9] 3.9586323 4.7356851
## [1] "No.1" "No.2" "No.3"

Reading and writing data

  • basic functions:
  • useful packages: foreign, openxlsx, readr, haven, XLConnect, readxl, …
  • able to work with databases: RODBC, RMySQL, …
  • JSON files? jsonlite, rjson, …
  • YAML files? yaml, …
  • BibTeX files? bibtex, …

Working with data.frame

  • Base R has a variety of useful functions for working with data.

  • two great add-on packages for working with data.frame: data.table and dplyr

Example: Sum aggregation by group

  • Consider a StackOverflow question: How to sum a variable by group?
  • Suppose we have integer vector x representing the group.
  • We want to sum/aggregate y by group x.
## 'data.frame':    200 obs. of  2 variables:
##  $ x: int  4 7 4 8 9 2 5 8 5 5 ...
##  $ y: num  2 10 6 5 4 9 4 3 2 2 ...
  • The following solutions give equivalent results in slightly different formats.

Add-on packages

  • The package ecosystem is one of the reasons that makes R so great!

  • too many packages? check out the task view: https://cran.r-project.org/web/views/

  • quality control? a few thing we may check from CRAN
    • package authors and maintainers
    • public repository for version control and bug reports
    • automated tests
    • package vignettes, documentation, NEWS/changeLog
    • reverse depends/imports/suggests
    • release history

Debugging tools

## [1] 3.871201

Exception handling

  • try(): allows execution to continue even after an error has occurred
  • tryCatch(): a general tool for handling conditions in addition to errors
  • withCallingHandlers(): a variant of tryCatch(), rarely needed
##  num 123
##  chr "abc"

Performance

  • Consider the following example of computing the row sum for a given matrix mat.
##  [1]  7.4389886 -0.6842283  3.0934989 -1.3245809  3.6883585  3.3878318 -2.9784455
##  [8] -6.0472031  1.5152872  0.3347129
##  [1]  7.4389886 -0.6842283  3.0934989 -1.3245809  3.6883585  3.3878318 -2.9784455
##  [8] -6.0472031  1.5152872  0.3347129
##  [1]  7.4389886 -0.6842283  3.0934989 -1.3245809  3.6883585  3.3878318 -2.9784455
##  [8] -6.0472031  1.5152872  0.3347129

Benchmarking

  • microbenchmark, rbenchmark, benchmark, …
## Unit: relative
##      expr        min         lq       mean     median         uq       max neval cld
##  for loop 557.497754 405.001274 332.020959 302.998435 278.875915 296.68764   200   b
##     apply   7.568284   6.057741   5.280688   4.881616   4.894252   4.14813   200  a 
##   rowSums   1.000000   1.000000   1.000000   1.000000   1.000000   1.00000   200  a
  • R is not a fast language compared with C/C++/Fortran. However, a lot of R code is slow simply because it is poorly written.

A few quick benchmarkings

The R session information for benchmarking is given below.

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
## 
## Matrix products: default
## BLAS: /usr/lib/libopenblasp-r0.3.0.dev.so
## LAPACK: /usr/lib/liblapack.so.3.8.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plyr_1.8.4           shiny_1.1.0          Rcpp_0.12.18         profvis_0.3.5       
##  [5] plotly_4.8.0         microbenchmark_1.4-4 ggplot2_3.0.0        fortunes_1.5-4      
##  [9] dplyr_0.7.6          data.table_1.11.4    bookdown_0.7        
## 
## loaded via a namespace (and not attached):
##  [1] revealjs_0.9      zoo_1.8-3         tidyselect_0.2.4  xfun_0.3         
##  [5] purrr_0.2.5       splines_3.5.1     lattice_0.20-35   colorspace_1.3-2 
##  [9] htmltools_0.3.6   viridisLite_0.3.0 yaml_2.2.0        survival_2.42-6  
## [13] rlang_0.2.1       pillar_1.3.0      later_0.7.3       glue_1.3.0       
## [17] withr_2.1.2       bindrcpp_0.2.2    multcomp_1.4-8    bindr_0.1.1      
## [21] stringr_1.3.1     munsell_0.5.0     gtable_0.2.0      htmlwidgets_1.2  
## [25] mvtnorm_1.0-8     codetools_0.2-15  evaluate_0.11     knitr_1.20       
## [29] httpuv_1.4.5      TH.data_1.0-9     xtable_1.8-2      scales_0.5.0     
## [33] backports_1.1.2   promises_1.0.1    jsonlite_1.5      mime_0.5         
## [37] digest_0.6.15     stringi_1.2.4     grid_3.5.1        rprojroot_1.3-2  
## [41] tools_3.5.1       sandwich_2.4-0    magrittr_1.5      lazyeval_0.2.1   
## [45] tibble_1.4.2      crayon_1.3.4      tidyr_0.8.1       pkgconfig_2.0.1  
## [49] MASS_7.3-50       Matrix_1.2-14     assertthat_0.2.0  rmarkdown_1.10   
## [53] httr_1.3.1        R6_2.2.2          compiler_3.5.1
  • 1:length(vec) vs. seq_along(vec)
## Unit: relative
##            expr      min       lq     mean   median       uq      max neval cld
##   1:length(vec) 1.567742 1.649103 1.534697 1.585778 1.539155 1.074651  1000   b
##  seq_along(vec) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000  1000  a
  • 1:n vs. seq_len(n)
## Unit: relative
##          expr      min       lq     mean   median       uq       max neval cld
##         1:100 1.434783 1.353846 1.211488 1.272997 1.219895 0.8063325  1000   b
##  seq_len(100) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000  1000  a
  • t(mat1) %*% mat1 vs. crossprod(mat1)
## Unit: relative
##              expr      min       lq     mean   median       uq      max neval cld
##  t(mat1) %*% mat1 2.068901 2.034417 2.121033 2.041875 2.194208 3.642643   100   b
##   crossprod(mat1) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100  a
  • mat1 %*% t(mat1) vs. tcrossprod(mat1)
## Unit: relative
##              expr      min       lq     mean   median       uq      max neval cld
##  mat1 %*% t(mat1) 1.896281 1.855674 1.891234 1.879163 1.870892 3.561835   100   b
##  tcrossprod(mat1) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100  a
  • t(mat1) %*% mat2 vs. crossprod(mat1, mat2)
## Unit: relative
##                   expr      min       lq     mean   median       uq       max neval cld
##       t(mat1) %*% mat2 1.061678 1.085385 1.092117 1.087051 1.131448 0.5650585   100   b
##  crossprod(mat1, mat2) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100  a
  • mat1 %*% t(mat2) vs. tcrossprod(mat1, mat2)
## Unit: relative
##                    expr      min       lq    mean   median       uq      max neval cld
##        mat1 %*% t(mat2) 1.034565 1.037275 1.05693 1.028481 1.047323 2.210879   100   b
##  tcrossprod(mat1, mat2) 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000   100  a
  • any(is.na(x)) vs. anyNA(x)
## Unit: relative
##           expr      min       lq   mean   median     uq      max neval cld
##  any(is.na(x)) 5.337361 5.371203 5.6475 5.341983 5.3265 20.68088   100   b
##       anyNA(x) 1.000000 1.000000 1.0000 1.000000 1.0000  1.00000   100  a
  • apply vs. rowSums (rowMeans, colSums, colMeans)
## Unit: relative
##                  expr      min      lq     mean   median       uq      max neval cld
##  apply(mat1, 2, mean) 42.90425 41.0403 31.74537 36.29913 25.37704 68.62862   100   b
##        colMeans(mat1)  1.00000  1.0000  1.00000  1.00000  1.00000  1.00000   100  a

Revisit the example of sum aggregation by group

## Unit: relative
##        expr       min        lq      mean    median        uq       max neval   cld
##          by  3.840108  3.781034  3.462978  3.586863  3.216732  1.478090  1000  b   
##      tapply  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1000 a    
##   aggregate  7.237326  7.210388  6.763582  7.142968  6.420644  1.914705  1000    d 
##       xtabs  5.406453  5.439945  5.182376  5.476047  4.987239  1.564497  1000   c  
##       ddply 22.748202 22.080111 20.367626 20.675366 18.096151 58.834953  1000     e
##       dplyr  5.224383  6.000011  5.971183  5.894329  5.285228 28.702014  1000   cd 
##  data.table  3.148850  3.535590  3.332793  3.579980  3.160607  1.413875  1000  b

Profiling

  • How to find the bottleneck of the code (the slowest part of your code)?
  • related packages: utils, profvis, profmem, …

Integrating R and other languages

  • C/Fortran: .Call(), .C, .Fortran, dyn.load(), …

  • C++: inline, Rcpp, RcppArmadillo, …

  • Julia: XRJulia, JuliaCall, …

  • Python: XRPython, reticulate, …

  • MATLAB: R.matlab, …

Sum aggregation by groups using Rcpp?

  • Try a straightforward (but possibly naive) implementation in Rcpp.
## Unit: relative
##           expr      min       lq     mean   median       uq      max neval cld
##         tapply 8.957147 8.799673 8.790180 8.521680 8.436642 51.66355  1000   c
##  rcpp_wi_names 3.540774 3.444513 3.487307 3.346194 3.286836 30.28195  1000  b 
##  rcpp_wo_names 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000  1000 a

Parallel computing

  • One R session/process normally uses only one core.
  • If the computing tasks are independent and can be run individually, we may want to run them using different cores at the same time.
    • e.g., fitting models over 1,000 simulated datasets
  • Some packages: parallel, snow, snowfall, foreach, …
  • See more at the CRAN task view on high-performance and parallel computing.
  • Run R script with input arguments in terminal with GNU parallel? Or just use the HPC cluster if available?

Graphics in R

  • two basic graphics systems in R
    • base graphics (the graphics package)
    • grid graphics (the grid package)
  • lattice: Trellis Graphics for R
    • sub-plots conditional on categorical variables
    • one of the recommended packages shipped with R
  • ggplot2: An R implementation of Leland Wilkinson’s Grammar of Graphics
  • interactive plots? plotly, highcharter, Dygraphs, …
  • See more at the CRAN task view on graphics.

R Markdown

  • code + narratives = report
  • some existing tools:
    • WEB (Donald Knuth, Literate Programming)
    • Noweb (Norman Ramsey)
    • Sweave (Friedrich Leisch and R-core)
    • knitr (Yihui Xie)
    • Org mode (Carsten Dominik, Emacs)
    • Jupyter notebook
  • knitr:
    • .Rnw (R + LaTeX)
    • .Rmd (R + Markdown)
    • any computing language + any authoring language

Trend of vertical difference?

“Use the source, Luke!”

## 
## As Obi-Wan Kenobi may have said in Star Wars: "Use the source, Luke!"
##    -- Barry Rowlingson (answering a question on the documentation of some implementation
##       details)
##       R-devel (January 2010)
  • from the ESS manual:

The source code is real. The objects are realizations of the source code.

R Shiny applications

  • I saw this cool plate at (UConn) Y lot in last Summer!

The “Hello Shiny!” example

A simple miniUI example

  • source available at here

Some reference and further reading

Thanks and happy coding!