Introduction to R Programming

Part II

Wenjie Wang
Department of Statistics, UConn

April 6, 2018

Getting Started

Outline

  • reading and writing data

  • data processing and manipulation

  • reproducible reports with R Markdown

  • interactive data visualization with R Shiny

Session information

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
## 
## Matrix products: default
## BLAS: /usr/lib/libblas.so.3.8.0
## LAPACK: /usr/lib/liblapack.so.3.8.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
##  [1] shiny_1.0.5          readr_1.1.1          R6_2.2.2             plotly_4.7.1        
##  [5] ggplot2_2.2.1        microbenchmark_1.4-4 leaflet_1.1.0        htmltools_0.3.6     
##  [9] dygraphs_1.1.1.4     DT_0.4               dplyr_0.7.4          data.table_1.10.4-3 
## [13] bookdown_0.7        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16      plyr_1.8.4        pillar_1.2.1      compiler_3.4.4    bindr_0.1.1      
##  [6] tools_3.4.4       digest_0.6.15     viridisLite_0.3.0 jsonlite_1.5      gtable_0.2.0     
## [11] evaluate_0.10.1   tibble_1.4.2      lattice_0.20-35   pkgconfig_2.0.1   rlang_0.2.0      
## [16] crosstalk_1.0.0   yaml_2.1.18       xfun_0.1          bindrcpp_0.2.2    httr_1.3.1       
## [21] stringr_1.3.0     knitr_1.20        hms_0.4.2         htmlwidgets_1.0   revealjs_0.9     
## [26] rprojroot_1.3-2   grid_3.4.4        glue_1.2.0        rmarkdown_1.9     tidyr_0.8.0      
## [31] purrr_0.2.4       magrittr_1.5      scales_0.5.0      backports_1.1.2   assertthat_0.2.0 
## [36] colorspace_1.3-2  mime_0.5          xtable_1.8-2      httpuv_1.3.6.2    stringi_1.1.7    
## [41] lazyeval_0.2.1    munsell_0.4.3     zoo_1.8-1

Reading and writing data

  • basic functions (e.g., read.table() and write.table()):
    • can be slow when working with large data

  • some useful packages:
    • for plain-text rectangular data (such as csv, tsv, and fwf): utils, readr, data.table, …
    • for data stored in other formats: foreign, haven, openxlsx, …

Some tricks for efficiently reading large text files using read.table()

  • specify nrows: the maximum number of rows to read in
    • e.g., we may determine the number of rows of data.csv by get_nrows("data.csv"), where get_nrows() is a simple helper function that counts the rows of a file
  • specify colClasses: column classes
    • e.g., only read in the first ten rows (nrows = 10) and decide on the appropriate classes
  • see this short post by Toby Hocking for a nice summary
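
The definition of get_nrows() is not preserved in this text, so the following is only one possible sketch of the two tricks. The helper body, the temporary file, and its columns are assumptions made for illustration.

```r
## hypothetical helper: count the data rows of a text file with a header line
get_nrows <- function(file, header = TRUE) {
    con <- file(file, open = "r")
    on.exit(close(con))
    n <- 0L
    ## read in chunks so that the whole file never sits in memory
    while (length(chunk <- readLines(con, n = 10000L)) > 0L) {
        n <- n + length(chunk)
    }
    n - as.integer(header)
}

## a small csv file for illustration only
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(foo = rnorm(100), bar = 1:100,
                     alpha = sample(letters, 100, replace = TRUE)),
          tmp, row.names = FALSE)

## trick 1: tell read.table() how many rows to expect
n <- get_nrows(tmp)

## trick 2: read a few rows first and reuse the detected column classes
top <- read.table(tmp, header = TRUE, sep = ",", nrows = 10,
                  stringsAsFactors = FALSE)
classes <- vapply(top, class, character(1))

dat <- read.table(tmp, header = TRUE, sep = ",", nrows = n,
                  colClasses = classes, comment.char = "")
```

Guessing classes from the first ten rows can fail when later rows contain values of a different type, so the detected classes are worth a quick check before reading the full file.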

Using readr package

readr provides a few features that make it more user-friendly than base R:

  • more consistent argument naming (e.g., col_names vs. header and col_types vs. colClasses)
  • leave strings as is by default, and automatically parse common date/time formats
  • able to read compressed files (e.g., .gz, .bz2, .xz, or .zip) automatically
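
As a rough illustration of the argument naming, the sketch below writes a tiny csv file and reads it back with explicit column types. The file and its columns are made up for this example.

```r
library(readr)

## a tiny csv file for illustration only
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(foo = c(1.5, 2.5),
                     alpha = c("a", "b"),
                     beta = as.Date(c("2018-04-06", "2018-04-07")),
                     stringsAsFactors = FALSE),
          tmp)

## col_names and col_types play the roles of header and colClasses
dat <- read_csv(tmp, col_names = TRUE,
                col_types = cols(foo = col_double(),
                                 alpha = col_character(),
                                 beta = col_date()))
```

When col_types is omitted, read_csv() guesses the types from the data, which is usually enough for interactive work.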

Using data.table package

  • fread() and fwrite() are similar to read.table() and write.table() but much faster and more convenient.
  • All controls such as sep, colClasses and nrows are automatically detected (or guessed) in fread().
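
A minimal sketch of a round trip with fwrite() and fread(), using a made-up three-row table:

```r
library(data.table)

## a tiny file for illustration only
tmp <- tempfile(fileext = ".csv")
fwrite(data.frame(foo = c(0.1, 0.2, 0.3), bar = 1:3,
                  alpha = c("a", "b", "c")),
       tmp)

## separator, header, and column classes are detected automatically
dat <- fread(tmp)

## ask for a plain data frame instead of a data.table
dat_df <- fread(tmp, data.table = FALSE)
```

The result of fread() is a data.table by default, which is also a data frame, so most code written for data frames still works on it.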

Example data

  • randomly generate a “not so small” csv file: data.csv
## 'data.frame':    1000000 obs. of  5 variables:
##  $ foo  : num  2.026 -1.261 -0.454 0.156 -0.905 ...
##  $ bar  : int  5 3 4 8 7 4 7 4 4 5 ...
##  $ alpha: chr  "d" "k" "a" "b" ...
##  $ beta : Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
##  $ gamma: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## [1] "File size: 37 MB"
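
The code that generated data.csv is not shown here. The sketch below produces a data frame with the same column names and types as the output above. The distributions, the date range, and the smaller number of rows are assumptions.

```r
set.seed(123)
n <- 1e3   # the output above has 1e6 rows; a smaller n keeps this sketch quick

dat <- data.frame(
    foo = rnorm(n),                                   # numeric
    bar = rpois(n, lambda = 5),                       # integer
    alpha = sample(letters, n, replace = TRUE),       # character
    beta = as.Date("2018-04-06") + sample(0:29, n, replace = TRUE),
    gamma = factor(sort(sample(LETTERS[1:5], n, replace = TRUE))),
    stringsAsFactors = FALSE
)
str(dat)

## write to a temporary directory rather than the working directory
csv_file <- file.path(tempdir(), "data.csv")
write.csv(dat, csv_file, row.names = FALSE)
sprintf("File size: %.2f MB", file.size(csv_file) / 2^20)
```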

Examples of reading csv file

## Unit: relative
##               expr      min       lq     mean   median       uq      max neval  cld
##           read.csv 3.907657 4.225091 4.292683 4.532457 4.326669 3.476520    30    d
##         read.table 3.844444 4.306483 4.306137 4.531600 4.309966 3.637753    30    d
##  read.table_tricks 2.630846 2.628233 2.761902 2.779265 2.917780 2.370641    30   c 
##           read_csv 2.333504 2.328710 2.377687 2.384162 2.410099 2.029660    30  b  
##    read_csv_tricks 2.172285 2.180941 2.351495 2.427447 2.416107 2.002160    30  b  
##              fread 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    30 a   
##       fread_tricks 1.006732 1.007077 1.016511 1.009074 1.005879 1.109504    30 a

Data processing and manipulation

  • It is often said that data scientists spend about 80% of their time cleaning and preparing data.
  • How can we make data cleaning as efficient and effective as possible?
  • In fact, base R provides a variety of useful functions for data processing and manipulation.

dplyr: A grammar of data manipulation

some of the key “verbs”:

  • select(): picks columns/variables based on names or indices
  • filter(): extracts a subset of rows based on logical conditions
  • arrange(): reorders rows
  • rename(): renames columns/variables
  • mutate(): creates new columns/variables
  • summarise() or summarize(): computes summary statistics
  • group_by(): helps perform operations by group.

Common properties

The dplyr verbs share a few properties. In particular,

  • We may refer to columns directly without using the $ operator.
  • The first argument is a data frame, and the returned result is a new data frame.
  • The input data should be tidy (Wickham, 2014): one observation per row, with each column representing a feature or characteristic of that observation.

select()

## [1] "foo"   "bar"   "alpha" "beta"  "gamma"
## 'data.frame':    1000000 obs. of  2 variables:
##  $ foo  : num  2.026 -1.261 -0.454 0.156 -0.905 ...
##  $ alpha: chr  "d" "k" "a" "b" ...
## 'data.frame':    1000000 obs. of  3 variables:
##  $ bar  : int  5 3 4 8 7 4 7 4 4 5 ...
##  $ alpha: chr  "d" "k" "a" "b" ...
##  $ beta : Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
## 'data.frame':    1000000 obs. of  2 variables:
##  $ foo  : num  2.026 -1.261 -0.454 0.156 -0.905 ...
##  $ gamma: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
  • select() helpers:
## 'data.frame':    1000000 obs. of  2 variables:
##  $ bar : int  5 3 4 8 7 4 7 4 4 5 ...
##  $ beta: Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
## 'data.frame':    1000000 obs. of  3 variables:
##  $ alpha: chr  "d" "k" "a" "b" ...
##  $ beta : Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
##  $ gamma: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## 'data.frame':    1000000 obs. of  1 variable:
##  $ foo: num  2.026 -1.261 -0.454 0.156 -0.905 ...
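
The calls that produced the output above are not preserved. The sketch below shows calls that would select the same sets of columns on a two-row stand-in for the example data; the exact helpers used in the original slides are a guess.

```r
library(dplyr)

## a two-row stand-in for the example data
dat <- data.frame(foo = c(0.5, -1.2), bar = c(5L, 3L),
                  alpha = c("d", "k"),
                  beta = as.Date(c("2018-04-16", "2018-04-13")),
                  gamma = factor(c("A", "B")),
                  stringsAsFactors = FALSE)

s1 <- select(dat, foo, alpha)         # by name
s2 <- select(dat, bar:beta)           # a range of adjacent columns
s3 <- select(dat, -(bar:beta))        # everything except that range

## select() helpers
s4 <- select(dat, starts_with("b"))   # bar, beta
s5 <- select(dat, ends_with("a"))     # alpha, beta, gamma
s6 <- select(dat, contains("o"))      # foo
```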

filter()

## 'data.frame':    7086 obs. of  5 variables:
##  $ foo  : num  2.24 2.36 2.14 2.02 2.14 ...
##  $ bar  : int  3 4 4 4 4 3 3 4 4 4 ...
##  $ alpha: chr  "h" "j" "l" "o" ...
##  $ beta : Date, format: "2018-04-15" "2018-04-21" "2018-04-17" ...
##  $ gamma: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
  • Hmm… why reinvent the wheel?
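
For comparison, the same subset can be taken with filter(), with `[`, or with subset(). The condition below is made up for illustration, since the one used for the output above is not shown.

```r
library(dplyr)

set.seed(1)
dat <- data.frame(foo = rnorm(1000), bar = rpois(1000, lambda = 5))

## conditions separated by commas are combined with &
r1 <- filter(dat, foo > 1, bar < 5)
r2 <- filter(dat, foo > 1 & bar < 5)

## base R equivalents
r3 <- dat[dat$foo > 1 & dat$bar < 5, ]
r4 <- subset(dat, foo > 1 & bar < 5)
```

One visible difference is that filter() resets the row names, whereas the base R versions keep the original row indices.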

A quick benchmark

## Unit: relative
##        expr      min       lq     mean   median       uq      max neval cld
##         [_$ 1.200435 1.188129 1.220260 1.206183 1.189927 1.063432   200  b 
##      [_with 1.183658 1.193324 1.254565 1.206986 1.186019 1.160299   200  b 
##      subset 1.338723 1.315983 1.363429 1.324302 1.313516 1.060694   200   c
##    filter_& 1.056401 1.056870 1.075000 1.051222 1.038148 1.022696   200 a  
##    filter_, 1.060556 1.060126 1.052689 1.055792 1.043553 1.045714   200 a  
##  filter_tbl 1.054250 1.050534 1.072195 1.048356 1.043918 1.028078   200 a  
##  data.table 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   200 a

arrange()

##         beta       foo
## 1 2018-04-06 -2.224411
## 2 2018-04-06 -1.721596
## 3 2018-04-06 -1.653048
##               beta        foo
## 999998  2018-05-04  1.0577441
## 999999  2018-05-05 -0.2527760
## 1000000 2018-05-05  0.3215835
##         beta        foo
## 1 2018-05-05 -0.2527760
## 2 2018-05-05  0.3215835
## 3 2018-05-04 -0.1928452
  • again, why not use base::order?
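
A small sketch comparing arrange() with base::order() on a made-up four-row data frame. The sort keys mirror the columns in the output above, but the original calls are not shown.

```r
library(dplyr)

dat <- data.frame(
    beta = as.Date(c("2018-04-07", "2018-04-06", "2018-04-06", "2018-04-07")),
    foo = c(0.3, 1.2, -0.8, -0.1)
)

## dplyr: ascending, then descending in beta
a1 <- arrange(dat, beta, foo)
a2 <- arrange(dat, desc(beta), foo)

## base R: order() returns the row permutation;
## xtfrm() turns the dates into numbers so that they can be negated
b1 <- dat[order(dat$beta, dat$foo), ]
b2 <- dat[order(-xtfrm(dat$beta), dat$foo), ]
```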

A quick benchmark

## Unit: relative
##        expr      min       lq     mean   median       uq       max neval cld
##     order_$ 1.734019 1.737106 1.737325 1.744590 1.752398 0.7778197   100  b 
##  order_with 1.698671 1.742659 1.793096 1.747756 1.842331 1.8191696   100  b 
##     arrange 5.346866 5.301355 5.129763 5.301986 5.240173 2.2288166   100   c
##  arrange_as 5.329339 5.321722 5.163903 5.308573 5.280767 2.2507773   100   c
##  data.table 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100 a

rename()

## [1] "x"     "y"     "alpha" "beta"  "gamma"
  • how to do it with base R?
  • what about data.table?
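
One possible answer to both questions, sketched on a made-up two-row data frame. The new names x and y follow the output above.

```r
library(dplyr)
library(data.table)

dat <- data.frame(foo = 1:2, bar = 3:4, alpha = c("a", "b"),
                  stringsAsFactors = FALSE)

## dplyr: new_name = old_name
d1 <- rename(dat, x = foo, y = bar)

## base R: replace the matching entries of names()
d2 <- dat
names(d2)[match(c("foo", "bar"), names(d2))] <- c("x", "y")

## data.table: setnames() renames by reference, without copying the data
dt <- as.data.table(dat)
setnames(dt, old = c("foo", "bar"), new = c("x", "y"))
```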

mutate()

## 'data.frame':    1000000 obs. of  6 variables:
##  $ foo         : num  2.026 -1.261 -0.454 0.156 -0.905 ...
##  $ bar         : int  5 3 4 8 7 4 7 4 4 5 ...
##  $ alpha       : chr  "d" "k" "a" "b" ...
##  $ beta        : Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
##  $ gamma       : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bar_centered: num  -0.00317 -2.00317 -1.00317 2.99683 1.99683 ...
  • how to do it with base R?
  • what about data.table?
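
A sketch of the same centering step in dplyr, base R, and data.table. The definition bar_centered = bar - mean(bar) is inferred from the column name in the output above.

```r
library(dplyr)
library(data.table)

dat <- data.frame(foo = c(0.2, -1.1, 0.7), bar = c(5L, 3L, 7L))

## dplyr
d1 <- mutate(dat, bar_centered = bar - mean(bar))

## base R: transform(), or plain assignment with $
d2 <- transform(dat, bar_centered = bar - mean(bar))
d3 <- dat
d3$bar_centered <- d3$bar - mean(d3$bar)

## data.table: := adds the column by reference
dt <- as.data.table(dat)
dt[, bar_centered := bar - mean(bar)]
```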

summarize() and group_by()

## # A tibble: 5 x 3
##   gamma sd_foo mean_bar
##   <fct>  <dbl>    <dbl>
## 1 A      0.999     5.00
## 2 B      0.998     5.01
## 3 C      0.999     5.01
## 4 D      1.00      5.00
## 5 E      1.00      5.00
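
A summary table with the same columns can be sketched as follows. The simulated data are a small stand-in for the example data, and tapply() is shown only as a base R cross-check.

```r
library(dplyr)

set.seed(1)
dat <- data.frame(foo = rnorm(500),
                  bar = rpois(500, lambda = 5),
                  gamma = factor(rep(LETTERS[1:5], each = 100)))

## group first, then compute one row of summaries per group
res <- summarize(group_by(dat, gamma),
                 sd_foo = sd(foo),
                 mean_bar = mean(bar))

## base R cross-check for one of the summaries
base_sd <- tapply(dat$foo, dat$gamma, sd)
```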

Forward-pipe operator: %>%

  • provided by magrittr package
  • similar idea to the “piping” (using |) in Linux and other Unix-like operating systems
  • examples of basic chaining or piping:
    • x %>% f ⇔ f(x)
    • x %>% f(y) ⇔ f(x, y)
    • x %>% f %>% g %>% h ⇔ h(g(f(x)))
  • example of argument placeholder
    • x %>% f(y, .) ⇔ f(y, x)
  • pros: clearly expressing a sequence of multiple operations
  • cons: possibly hard to debug without intermediate steps
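
The equivalences above can be checked directly with a few base R functions standing in for f, g, and h:

```r
library(magrittr)

x <- c(1, 4, 9)

r1 <- x %>% sqrt                    # same as sqrt(x)
r2 <- x %>% log(base = 3)           # same as log(x, base = 3)
r3 <- x %>% sqrt %>% sum %>% round  # same as round(sum(sqrt(x)))

## the placeholder . puts the left-hand side in a chosen position
r4 <- 3 %>% rep("a", .)             # same as rep("a", 3)
```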

Quick benchmarks

  • is any performance compromised?
## Unit: relative
##    expr       min       lq     mean   median        uq      max neval cld
##  nested 1.0000000 1.000000 1.000000 1.000000 1.0000000 1.000000  1000  a 
##   steps 0.9995403 1.000438 1.000223 1.000924 0.9966717 1.001759  1000  a 
##    pipe 1.0627355 1.062922 1.080314 1.063366 1.0580139 1.017031  1000   b
## Unit: relative
##    expr      min       lq     mean   median       uq      max neval cld
##  nested 2.353744 3.039528 2.468968 3.010881 2.108407 1.086845  1000   b
##   steps 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000  1000  a 
##    pipe 1.111918 1.109824 1.086706 1.120418 0.942647 1.012724  1000  a

Using dplyr with piping

## # A tibble: 5 x 2
##   group median_abs_foo
##   <fct>          <dbl>
## 1 A              0.654
## 2 B              0.670
## 3 C              0.722
## 4 D              0.561
## 5 E              0.620
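
The pipeline behind the output above is not shown. A chain that yields a table with the same columns might look like the sketch below; the simulated data and the individual steps are assumptions.

```r
library(dplyr)

set.seed(1)
dat <- data.frame(foo = rnorm(500),
                  gamma = factor(rep(LETTERS[1:5], each = 100)))

res <- dat %>%
    rename(group = gamma) %>%            # rename the grouping column
    mutate(abs_foo = abs(foo)) %>%       # add a derived column
    group_by(group) %>%                  # one group per factor level
    summarize(median_abs_foo = median(abs_foo))

## base R cross-check
check <- tapply(abs(dat$foo), dat$gamma, median)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes the chain possible.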

Reproducible reports with R Markdown

  • code + narratives = report
  • some existing tools:
    • WEB (Donald Knuth, Literate Programming)
    • Noweb (Norman Ramsey)
    • Sweave (Friedrich Leisch and R-core)
    • knitr (Yihui Xie)
    • Org mode + Babel (Carsten Dominik, Eric Schulte, …)
    • Jupyter notebook
  • knitr:
    • .Rnw (R + LaTeX)
    • .Rmd (R + Markdown)
    • any computing language + any authoring language
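
To make "code + narratives = report" concrete, the sketch below writes a tiny .Rmd document to a temporary file and runs it through knitr. The document content is made up, and the chunk delimiters are built with strrep() only to keep this example free of nested code fences.

```r
library(knitr)

fence <- strrep("`", 3)   # three backticks delimit a code chunk

## narrative text and one R chunk
rmd <- c("# A tiny report",
         "",
         "The mean of three numbers is computed below.",
         "",
         paste0(fence, "{r}"),
         "x <- c(1, 2, 3)",
         "mean(x)",
         fence)

input <- tempfile(fileext = ".Rmd")
writeLines(rmd, input)

## knit() runs the chunk and weaves its output into a Markdown file
output <- knit(input, output = tempfile(fileext = ".md"), quiet = TRUE)
md <- readLines(output)
```

Converting the resulting Markdown to HTML or PDF is normally done with rmarkdown::render(), which calls knitr and then pandoc.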

Interactive data visualization with R Shiny

Some examples

Structure of a Shiny app

By execution order:

  1. global.R: an optional script for code needed in ui.R and server.R

  2. ui.R: define user interface (UI) design

  3. server.R: define server-side logic

Alternative structure

  • a single script called app.R
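
A minimal sketch of what such an app.R could contain. The input and output IDs and the histogram are made up for illustration.

```r
library(shiny)

## user interface: one slider and one plot
ui <- fluidPage(
    titlePanel("Hello Shiny"),
    sliderInput("n", "Number of observations",
                min = 10, max = 500, value = 100),
    plotOutput("hist")
)

## server-side logic: redraw the histogram whenever the slider changes
server <- function(input, output) {
    output$hist <- renderPlot({
        hist(rnorm(input$n), main = "Random normal draws")
    })
}

## combine both parts into an app object
app <- shinyApp(ui = ui, server = server)

## uncomment to launch the app in an interactive session
## runApp(app)
```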

HTML builders

Shiny function    HTML5 tag        Creates
p()               <p>              A paragraph of text
h1(), …, h6()     <h1>, …, <h6>    A first- to sixth-level header
a()               <a>              A hyperlink
br()              <br>             A line break
div()             <div>            A division with a uniform style
span()            <span>           An in-line version of a division
strong()          <strong>         Bold text
em()              <em>             Italicized text
HTML()                             Directly passes character strings as HTML
  • shiny imports HTML builder functions from htmltools.
  • names(tags) returns the complete list of valid HTML5 tags.
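
A short sketch of how the builder functions nest and what they render to. The page content is made up.

```r
library(htmltools)

## builders nest like the HTML they produce; named arguments become attributes
page <- div(
    h1("Title"),
    p("A paragraph with ", strong("bold"), " and ", em("italic"), " text."),
    a(href = "https://www.r-project.org", "R project"),
    tags$hr()   # tags without a shortcut are available through the tags list
)

## print the generated HTML
cat(as.character(page), "\n")

simple <- as.character(p("Hello"))   # "<p>Hello</p>"
```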

Basic widgets

  • What is a web widget? A web element that users can interact with.
  • Standard widgets gallery
  • The first two arguments for each widget function are
    • id for widget name: users will not see the name, but you can use it to access the widget’s value. The name should be a character string.
    • label for widget label: this label will appear with the widget in your app. It should be a character string, but it can be an empty string "".

The standard Shiny widgets include

Function              Widget
actionButton          Action Button
checkboxGroupInput    A group of check boxes
checkboxInput         A single check box
dateInput             A calendar to aid date selection
dateRangeInput        A pair of calendars for selecting a date range
fileInput             A file upload control wizard
helpText              Help text that can be added to an input form
numericInput          A field to enter numbers
radioButtons          A set of radio buttons
selectInput           A box with choices to select from
sliderInput           A slider bar
submitButton          A submit button
textInput             A field to enter text

Reactive output

The *Output functions in ui or ui.R turn R objects into output shown in the UI.

Output function       Creates
htmlOutput            raw HTML
imageOutput           image
plotOutput            plot
tableOutput           table
textOutput            text
uiOutput              raw HTML
verbatimTextOutput    text
  • These *Output functions take output name/ID as input.

render* functions in server or server.R

Render function    Creates
renderImage        images (saved as a link to a source file)
renderPlot         plots
renderPrint        any printed output
renderTable        data frame, matrix, other table-like structures
renderText         character strings
renderUI           a Shiny tag object or HTML
  • These render* functions take a single argument: an R expression surrounded by {}.
  • Shiny runs the R expressions inside render* functions once each time a user changes the value of a widget.
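
The sketch below pairs two widgets with one *Output function and the matching render* function. The IDs name, times, and greeting are made up for illustration.

```r
library(shiny)

ui <- fluidPage(
    ## widgets: the first argument is the ID, the second is the label
    textInput("name", label = "Your name", value = "world"),
    numericInput("times", label = "Repeat", value = 1, min = 1),
    ## placeholder for the output named "greeting"
    textOutput("greeting")
)

server <- function(input, output) {
    ## rerun whenever input$name or input$times changes
    output$greeting <- renderText({
        paste(rep(paste0("Hello, ", input$name, "!"), input$times),
              collapse = " ")
    })
}

app <- shinyApp(ui, server)
```

The ID given to textOutput() must match the name assigned on the output object in the server function.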

Reactive expressions

  • Create a reactive expression by the reactive() function, which takes an R expression surrounded by {} similar to render* functions.
  • Reactive expressions cache their values and update them only when necessary, which makes the app faster.
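
A sketch of a reactive expression shared by two outputs. The IDs and the simulated data are made up for illustration.

```r
library(shiny)

ui <- fluidPage(
    sliderInput("n", "Number of observations",
                min = 10, max = 500, value = 100),
    plotOutput("hist"),
    verbatimTextOutput("summary")
)

server <- function(input, output) {
    ## computed once per change of input$n and then cached
    draws <- reactive({
        rnorm(input$n)
    })
    ## both outputs reuse the same cached sample; note the call draws()
    output$hist <- renderPlot({
        hist(draws(), main = "Histogram of the sample")
    })
    output$summary <- renderPrint({
        summary(draws())
    })
}

app <- shinyApp(ui, server)
```

Without the reactive expression, each render* call would draw its own sample, so the plot and the summary would describe different data.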

Share Shiny apps

  • share as R scripts by runApp(), runUrl(), runGitHub() or runGist()

  • share as a web page:

More advanced HTML widgets

  • htmlwidgets package provides a framework that helps to bring the best of JavaScript data visualization to R.
  • example R packages built with htmlwidgets:
    • leaflet for geo-spatial mapping powered by JavaScript library leaflet
    • dygraphs for time series charting powered by JavaScript library dygraphs
    • plotly and highcharter for general interactive graphics powered by the JavaScript libraries plotly.js and Highcharts, respectively.
    • DT for tabular data display powered by JavaScript library DataTables.

leaflet example

  • locations of earthquakes off Fiji
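
The interactive map itself is not reproduced in this text. A map of this kind can be sketched with the quakes data set that ships with R; the marker styling below is an assumption.

```r
library(leaflet)

## quakes: locations of earthquakes off Fiji (datasets package)
m <- leaflet(data = quakes) %>%
    addTiles() %>%                                   # default base map
    addCircleMarkers(lng = ~long, lat = ~lat,
                     radius = ~mag,                  # size by magnitude
                     stroke = FALSE, fillOpacity = 0.4,
                     popup = ~paste("Magnitude:", mag))

## print m in an interactive session to display the map
```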

dygraphs example
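
The original chart is not reproduced in this text. As a stand-in, the sketch below charts the nhtemp time series that ships with R; the choice of data set is an assumption.

```r
library(dygraphs)

## nhtemp: average yearly temperatures in New Haven (datasets package)
g <- dygraph(nhtemp, main = "New Haven temperatures") %>%
    dyRangeSelector()   # add a range selector below the chart

## print g in an interactive session to display the chart
```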

plotly example
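
The original figure is not reproduced in this text. As a stand-in, the sketch below draws an interactive scatter plot of the mtcars data; the data set and the variables are assumptions.

```r
library(plotly)

## formulas (~) refer to columns of the data argument
p <- plot_ly(data = mtcars, x = ~wt, y = ~mpg,
             type = "scatter", mode = "markers")

## print p in an interactive session to display the plot
```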

DT example
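
The original table is not reproduced in this text. As a stand-in, the sketch below displays the iris data as a searchable, paginated table; the data set and the options are assumptions.

```r
library(DT)

## show five rows per page and hide the row names
tbl <- datatable(iris, options = list(pageLength = 5), rownames = FALSE)

## print tbl in an interactive session to display the table
```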

More options on UI and control widgets

Some references and further reading

Thanks and happy coding!