the slides of part I: wenjie-stat.me/2018-01-19-siam/
these slides: wenjie-stat.me/2018-04-06-siam/
source code of these slides and all the examples: https://github.com/wenjie2wang/2018-04-06-siam/
prerequisites: R and possibly the RStudio IDE have been installed.
a list of R packages needed for following examples:
reading and writing data
data processing and manipulation
reproducible reports with R Markdown
interactive data visualization with R Shiny
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
##
## Matrix products: default
## BLAS: /usr/lib/libblas.so.3.8.0
## LAPACK: /usr/lib/liblapack.so.3.8.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] shiny_1.0.5 readr_1.1.1 R6_2.2.2 plotly_4.7.1
## [5] ggplot2_2.2.1 microbenchmark_1.4-4 leaflet_1.1.0 htmltools_0.3.6
## [9] dygraphs_1.1.1.4 DT_0.4 dplyr_0.7.4 data.table_1.10.4-3
## [13] bookdown_0.7
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 plyr_1.8.4 pillar_1.2.1 compiler_3.4.4 bindr_0.1.1
## [6] tools_3.4.4 digest_0.6.15 viridisLite_0.3.0 jsonlite_1.5 gtable_0.2.0
## [11] evaluate_0.10.1 tibble_1.4.2 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.0
## [16] crosstalk_1.0.0 yaml_2.1.18 xfun_0.1 bindrcpp_0.2.2 httr_1.3.1
## [21] stringr_1.3.0 knitr_1.20 hms_0.4.2 htmlwidgets_1.0 revealjs_0.9
## [26] rprojroot_1.3-2 grid_3.4.4 glue_1.2.0 rmarkdown_1.9 tidyr_0.8.0
## [31] purrr_0.2.4 magrittr_1.5 scales_0.5.0 backports_1.1.2 assertthat_0.2.0
## [36] colorspace_1.3-2 mime_0.5 xtable_1.8-2 httpuv_1.3.6.2 stringi_1.1.7
## [41] lazyeval_0.2.1 munsell_0.4.3 zoo_1.8-1
read.table(), read.csv(), readLines(), load(), readRDS(), ...
write.table(), write.csv(), writeLines(), save(), saveRDS(), ...
can be slow when working with large data
read.table()
nrows
: the maximum number of rows to read in
data.csv
by get_nrows("data.csv")
, where get_nrows()
is a simple function as follows:get_nrows <- function(file) {
as.integer(
system2("wc", sprintf("-l %s | awk '{print $1}'", file), TRUE)
)
}
colClasses
: column classes
nrows = 10
) and decide on the appropriate classesread_table(), read_csv(), read_tsv(), read_fwf(), read_lines(), ...
write_table(), write_csv(), write_tsv(), write_fwf(), write_lines(), ...
readr providers a few features that make it more user-friendly than base R:
col_names
vs. header
and col_types
vs. colClasses
).gz
, .bz2
, .xz
, or .zip
) automaticallyfread()
/fwrite()
is similar to read.table()
/write.table()
but much faster and more convenient.sep
, colClasses
and nrows
are automatically detected (or guessed) in fread()
.data.csv
options(stringsAsFactors = FALSE)
set.seed(613); n <- 1e6
dat <- data.frame(foo = rnorm(n),
bar = rpois(n, 5),
alpha = sample(letters, size = n, replace = TRUE),
beta = as.Date(rpois(n, 10), origin = "2018-04-06"),
gamma = gl(5, k = n / 5, labels = LETTERS[1:5]))
str(dat)
## 'data.frame': 1000000 obs. of 5 variables:
## $ foo : num 2.026 -1.261 -0.454 0.156 -0.905 ...
## $ bar : int 5 3 4 8 7 4 7 4 4 5 ...
## $ alpha: chr "d" "k" "a" "b" ...
## $ beta : Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
## $ gamma: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
write.table(dat, file = "data.csv", sep = ",", row.names = FALSE)
sprintf("File size: %.0f MB", file.info("data.csv")$size / 1024 ^ 2)
## [1] "File size: 37 MB"
## simply use `read.csv`
dat1 <- read.csv("data.csv")
## or `read.table`
dat2 <- read.table("data.csv", sep = ",", header = TRUE)
## or specify `nrows` and `colClasses`
dat3 <- read.table("data.csv", sep = ",", header = TRUE,
nrows = get_nrows("data.csv") - 1,
colClasses = c("numeric", "integer",
rep("character", 3)))
## use readr::read_csv
dat4 <- readr::read_csv("data.csv")
dat5 <- readr::read_csv("data.csv", col_types = c("diccc"),
n_max = get_nrows("data.csv") - 1)
## use data.table::fread
dat6 <- data.table::fread("data.csv")
dat7 <- data.table::fread("data.csv", header = TRUE, sep = ",",
nrows = get_nrows("data.csv") - 1)
## exercise:
## check the resulting column types of each `data.frame` object.
## is there any difference? what are the classes of `dat4`,...,`dat7`?
library(microbenchmark)
microbenchmark(
read.csv = read.csv("data.csv"),
read.table = read.table("data.csv", sep = ",", header = TRUE),
read.table_tricks = read.table("data.csv", sep = ",", header = TRUE,
nrows = get_nrows("data.csv") - 1L,
colClasses = c("numeric", "integer",
rep("character", 3))),
read_csv = readr::read_csv("data.csv", col_types = cols()),
read_csv_tricks = readr::read_csv("data.csv", col_types = c("diccc"),
n_max = get_nrows("data.csv") - 1),
fread = data.table::fread("data.csv"),
fread_tricks = data.table::fread("data.csv", header = TRUE, sep = ",",
nrows = get_nrows("data.csv") - 1,
colClasses = c("numeric", "integer",
rep("character", 3))),
times = 30, unit = "relative")
## Unit: relative
## expr min lq mean median uq max neval cld
## read.csv 3.907657 4.225091 4.292683 4.532457 4.326669 3.476520 30 d
## read.table 3.844444 4.306483 4.306137 4.531600 4.309966 3.637753 30 d
## read.table_tricks 2.630846 2.628233 2.761902 2.779265 2.917780 2.370641 30 c
## read_csv 2.333504 2.328710 2.377687 2.384162 2.410099 2.029660 30 b
## read_csv_tricks 2.172285 2.180941 2.351495 2.427447 2.416107 2.002160 30 b
## fread 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 30 a
## fread_tricks 1.006732 1.007077 1.016511 1.009074 1.005879 1.109504 30 a
some of the key “verbs”:
select()
: picks columns/variables based on names or indicesfilter()
: extracts a subset of rows based on logical conditionsarrange()
: reorders rowsrename()
: renames columns/variablesmutate()
: creates new columns/variablessummarise()
or summarize()
: computes summary statisticsgroup_by()
: helps perform operations by group.In particular,
$
operator.select()
## [1] "foo" "bar" "alpha" "beta" "gamma"
## 'data.frame': 1000000 obs. of 2 variables:
## $ foo : num 2.026 -1.261 -0.454 0.156 -0.905 ...
## $ alpha: chr "d" "k" "a" "b" ...
## 'data.frame': 1000000 obs. of 3 variables:
## $ bar : int 5 3 4 8 7 4 7 4 4 5 ...
## $ alpha: chr "d" "k" "a" "b" ...
## $ beta : Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
## 'data.frame': 1000000 obs. of 2 variables:
## $ foo : num 2.026 -1.261 -0.454 0.156 -0.905 ...
## $ gamma: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
select()
helpers:## 'data.frame': 1000000 obs. of 2 variables:
## $ bar : int 5 3 4 8 7 4 7 4 4 5 ...
## $ beta: Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
## 'data.frame': 1000000 obs. of 3 variables:
## $ alpha: chr "d" "k" "a" "b" ...
## $ beta : Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
## $ gamma: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## 'data.frame': 1000000 obs. of 1 variable:
## $ foo: num 2.026 -1.261 -0.454 0.156 -0.905 ...
filter()
## 'data.frame': 7086 obs. of 5 variables:
## $ foo : num 2.24 2.36 2.14 2.02 2.14 ...
## $ bar : int 3 4 4 4 4 3 3 4 4 4 ...
## $ alpha: chr "h" "j" "l" "o" ...
## $ beta : Date, format: "2018-04-15" "2018-04-21" "2018-04-17" ...
## $ gamma: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
dat_tbl <- tbl_df(dat)
dat_dt <- as.data.table(dat)
microbenchmark(
"[_$" = dat[dat$foo > 2 & dat$bar %in% c(3, 4), ],
"[_with" = with(dat, dat[foo > 2 & bar %in% c(3, 4), ]),
"subset" = base::subset(dat, foo > 2 & bar %in% c(3, 4)),
"filter_&" = dplyr::filter(dat, foo > 2 & bar %in% c(3, 4)),
"filter_," = dplyr::filter(dat, foo > 2, bar %in% c(3, 4)),
"filter_tbl" = dplyr::filter(dat_tbl, foo > 2, bar %in% c(3, 4)),
"data.table" = dat_dt[foo > 2 & bar %in% c(3, 4)],
times = 200, unit = "relative"
)
## Unit: relative
## expr min lq mean median uq max neval cld
## [_$ 1.200435 1.188129 1.220260 1.206183 1.189927 1.063432 200 b
## [_with 1.183658 1.193324 1.254565 1.206986 1.186019 1.160299 200 b
## subset 1.338723 1.315983 1.363429 1.324302 1.313516 1.060694 200 c
## filter_& 1.056401 1.056870 1.075000 1.051222 1.038148 1.022696 200 a
## filter_, 1.060556 1.060126 1.052689 1.055792 1.043553 1.045714 200 a
## filter_tbl 1.054250 1.050534 1.072195 1.048356 1.043918 1.028078 200 a
## data.table 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 200 a
arrange()
## beta foo
## 1 2018-04-06 -2.224411
## 2 2018-04-06 -1.721596
## 3 2018-04-06 -1.653048
## beta foo
## 999998 2018-05-04 1.0577441
## 999999 2018-05-05 -0.2527760
## 1000000 2018-05-05 0.3215835
## beta foo
## 1 2018-05-05 -0.2527760
## 2 2018-05-05 0.3215835
## 3 2018-05-04 -0.1928452
base::order
?microbenchmark(
"order_$" = dat[order(as.numeric(dat$beta), dat$foo,
decreasing = c(TRUE, FALSE)), ],
"order_with" = with(dat, dat[order(as.numeric(beta), foo,
decreasing = c(TRUE, FALSE)), ]),
"arrange" = arrange(dat, desc(beta), foo),
"arrange_as" = arrange(dat, desc(as.numeric(beta)), foo),
"data.table" = dat_dt[order(- beta, foo), ],
times = 100, unit = "relative"
)
## Unit: relative
## expr min lq mean median uq max neval cld
## order_$ 1.734019 1.737106 1.737325 1.744590 1.752398 0.7778197 100 b
## order_with 1.698671 1.742659 1.793096 1.747756 1.842331 1.8191696 100 b
## arrange 5.346866 5.301355 5.129763 5.301986 5.240173 2.2288166 100 c
## arrange_as 5.329339 5.321722 5.163903 5.308573 5.280767 2.2507773 100 c
## data.table 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a
rename()
## [1] "x" "y" "alpha" "beta" "gamma"
mutate()
## 'data.frame': 1000000 obs. of 6 variables:
## $ foo : num 2.026 -1.261 -0.454 0.156 -0.905 ...
## $ bar : int 5 3 4 8 7 4 7 4 4 5 ...
## $ alpha : chr "d" "k" "a" "b" ...
## $ beta : Date, format: "2018-04-16" "2018-04-13" "2018-04-16" ...
## $ gamma : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bar_centered: num -0.00317 -2.00317 -1.00317 2.99683 1.99683 ...
dat$bar_centered <- dat$bar - mean(dat$bar)
## or using `with`
dat$bar_centered <- with(dat, bar - mean(bar))
summarize
and group_by()
## # A tibble: 5 x 3
## gamma sd_foo mean_bar
## <fct> <dbl> <dbl>
## 1 A 0.999 5.00
## 2 B 0.998 5.01
## 3 C 0.999 5.01
## 4 D 1.00 5.00
## 5 E 1.00 5.00
%>%
|
) in Linux and other Unix-like operating systemsx %>% f
\(\Leftrightarrow\) f(x)
x %>% f(y)
\(\Leftrightarrow\) f(x, y)
x %>% f %>% g %>% h
\(\Leftrightarrow\) h(g(f(x)))
x %>% f(y, .)
\(\Leftrightarrow\) f(y, x)
microbenchmark(
nested = summary(head(dat, n = 100)),
steps = {
tmpDat <- head(dat, n = 100)
summary(tmpDat)
},
pipe = dat %>% head(n = 100) %>% summary,
times = 1e3, unit = "relative"
)
## Unit: relative
## expr min lq mean median uq max neval cld
## nested 1.0000000 1.000000 1.000000 1.000000 1.0000000 1.000000 1000 a
## steps 0.9995403 1.000438 1.000223 1.000924 0.9966717 1.001759 1000 a
## pipe 1.0627355 1.062922 1.080314 1.063366 1.0580139 1.017031 1000 b
microbenchmark(
nested = prop.table(xtabs(~ bar, head(dat, n = 100))),
steps = {
tmpDat <- head(dat, n = 100)
tmpTab <- xtabs(~ bar, tmpDat)
prop.table(tmpTab)
},
pipe = dat %>% head(n = 100) %>% xtabs(~ bar, .) %>% prop.table,
times = 1e3, unit = "relative"
)
## Unit: relative
## expr min lq mean median uq max neval cld
## nested 2.353744 3.039528 2.468968 3.010881 2.108407 1.086845 1000 b
## steps 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1000 a
## pipe 1.111918 1.109824 1.086706 1.120418 0.942647 1.012724 1000 a
dat %>%
select(- alpha) %>%
rename(group = gamma) %>%
mutate(abs_foo = abs(foo)) %>%
group_by(group) %>%
arrange(beta, desc(bar)) %>%
head(n = 200) %>%
summarize(median_abs_foo = median(abs_foo))
## # A tibble: 5 x 2
## group median_abs_foo
## <fct> <dbl>
## 1 A 0.654
## 2 B 0.670
## 3 C 0.722
## 4 D 0.561
## 5 E 0.620
.Rnw
(R + LaTeX).Rmd
(R + Markdown)What is R Shiny?
By execution order:
global.R
: an optional script for code needed in ui.R
and server.R
ui.R
: define user interface (UI) design
server.R
: define server-side logic
app.R
ui <- fluidPage(
titlePanel("my title panel"),
sidebarLayout(
sidebarPanel("my sidebar panel"),
mainPanel("my main panel")
)
)
Shiny uses Twitter Bootstrap 3, the probably most popular HTML, CSS, and JS framework for developing responsive, mobile first projects on the web.
the essential HTML code defined in ui
:
The more fundamental layout constructor functions are fluidRow()
and column()
.
Shiny | HTML5 | creates |
---|---|---|
p()
|
<p>
|
A paragraph of text |
h1(), …, h6()
|
<h1>, …, <h6>
|
a first, …, sixth level header |
a()
|
<a>
|
A hyper link |
br()
|
<br>
|
A line break |
div()
|
<div>
|
A division with a uniform style |
span()
|
<span>
|
An in-line version of division |
strong()
|
<strong>
|
Bold text |
em()
|
<em>
|
Italicized text |
…
|
…
|
… |
HTML()
|
Directly passes character strings as HTML |
names(tags)
returns a complete valid HTML5 tag list.id
for widget name: users will not see the name, but you can use it to access the widget’s value. The name should be a character string.label
for widget label: this label will appear with the widget in your app. It should be a character string, but it can be an empty string ""
.The standard Shiny widgets include
function | widget |
---|---|
actionButton
|
Action Button |
checkboxGroupInput
|
A group of check boxes |
checkboxInput
|
A single check box |
dateInput
|
A calendar to aid date selection |
dateRangeInput
|
A pair of calendars for selecting a date range |
fileInput
|
A file upload control wizard |
helpText
|
Help text that can be added to an input form |
numericInput
|
A field to enter numbers |
radioButtons
|
A set of radio buttons |
selectInput
|
A box with choices to select from |
sliderInput
|
A slider bar |
submitButton
|
A submit button |
textInput
|
A field to enter text |
*Output
functions in ui
or ui.R
turn R objects into output of UI.
Output function | Creates |
---|---|
htmlOutput
|
raw HTML |
imageOutput
|
image |
plotOutput
|
plot |
tableOutput
|
table |
textOutput
|
text |
uiOutput
|
raw HTML |
verbatimTextOutput
|
text |
*Output
functions take output name/ID as input.render*
functions in server
or server.R
render function | creates |
---|---|
renderImage
|
images (saved as a link to a source file) |
renderPlot
|
plots |
renderPrint
|
any printed output |
renderTable
|
data frame, matrix, other table like structures |
renderText
|
character strings |
renderUI
|
a Shiny tag object or HTML |
render*
functions take a single argument: an R expression surrounded by {}
.render*
functions once each time a user changes the value of a widget.ui <- fluidPage(
titlePanel("a simple example of reactive output"),
sidebarLayout(
sidebarPanel(
selectInput("cyl", "Cylinders",
choices = sort(unique(mtcars$cyl)),
selected = 4)
),
mainPanel(plotOutput("mpg_boxplot"))
)
)
server <- function(input, output) {
## the `renderPlot` runs every time a user changes input$cyl
output$mpg_boxplot <- renderPlot({
## filtering the cars models in the mtcars dataset
## by the input cylinder number
dat <- subset(mtcars, cyl == input$cyl)
## draw a boxplot of mpg for the filtered data
with(dat, boxplot(mpg))
})
}
shinyApp(ui, server)
reactive()
function, which takes an R expression surrounded by {}
similar to render*
functions.server <- function(input, output, session) {
## e.g., a simple reactive expression for filtering the cars models
## in the mtcars dataset by the input cylinder
dataInput <- reactive({
subset(mtcars, cyl == input$cyl)
})
## draw a boxplot of mpg for the filtered data
output$mpg_boxplot <- renderPlot({
dat <- dataInput()
with(dat, boxplot(mpg))
})
}
share as R scripts by runApp()
, runUrl()
, runGitHub()
or runGist()