# The R Statistical Computing Environment

## The Basics and Beyond

## John Fox

### Department of Sociology, McMaster University

### ICPSR Summer Program, Berkeley California

### June 2016

### Short URL: tinyurl.com/Berkeley-R-course

### Please read the installation instructions for R and RStudio.

The R statistical programming language and computing environment has become the de facto standard for writing statistical software among statisticians and has made substantial inroads in the social sciences -- it is now possibly the most widely used statistical software in the world. R is a free, open-source implementation of the S language, and is available for Windows, Mac OS X, and Unix/Linux systems.

The basic R system is developed and maintained by the R Core group, comprising 21 members, many of them eminent in the field of statistical computing. The R Project for Statistical Computing is a project of the R Foundation, whose membership includes the R Core group and several other individuals, and is also associated with the Free Software Foundation.

A statistical software package, such as SPSS, is primarily oriented toward combining instructions, possibly entered via a point-and-click interface, with rectangular case-by-variable datasets to produce (often voluminous) output. Such packages make it easy to perform routine data analysis tasks, but they make it relatively difficult to do things that are innovative or nonstandard – or to extend the built-in capabilities of the package.

In contrast, a good statistical computing environment makes routine data analysis easy and also supports convenient programming. R fulfills both of these requirements, and users can readily write programs that add to its already impressive facilities. Thousands of R add-on packages, freely available on the Internet in the Comprehensive R Archive Network (CRAN), and many others in the Bioconductor package archive, extend the capabilities of R to almost every area of statistical data analysis. R is also particularly capable in the area of statistical graphics.

The first two days of this workshop are meant to provide a basic overview of and introduction to R, including statistical modeling in R – in effect, using R as a statistical package. I will also show you how to use RStudio, a sophisticated front-end or integrated development environment (IDE) for R, which includes support for “literate programming” to create documents that mix R code with explanatory text, encouraging reproducible research.

The material scheduled for day two is flexible, and I encourage participants to contact me with requests for topics to cover. (The topics that I’ve included in the course syllabus for day two – mixed-effects models, survival analysis, structural-equation models – should be read as suggestions.) Some caveats: (1) If you’d like me to cover another specific topic, please contact me sufficiently in advance of the workshop to prepare the necessary materials. (2) I’ll try to select topics that are of broad interest, to more than one participant. (3) I understand that, unlike the remainder of the workshop, not all day-two topics will be of interest to all participants. (4) Of course, I’m limited in what I can competently teach by my knowledge and expertise.

Learning even a bit of R programming will greatly increase your ability to manage and analyze data using R. The final three days pick up where the basic material leaves off, and are intended to provide the background required to use R seriously for data analysis and presentation, including an introduction to R programming and to the design of custom statistical graphs, unlocking the power in the R statistical programming environment. Participants should bring their laptops to the workshop and should install R and RStudio in advance of the workshop.

An outline of the workshop follows (with chapter and on-line appendix references to Fox and Weisberg, *An R Companion to Applied Regression, Second Edition*):

## Topics

## Selected Bibliography

Publishers of statistical texts have been producing a steady stream of books on R. Of particular note are Springer's *Use R!* series and Chapman and Hall/CRC’s *The R Series*.

### Basic Texts

The principal source for this lecture series/workshop is J. Fox and S. Weisberg, *An R Companion to Applied Regression, Second Edition*, Sage (2011). Additional materials are available on the web site for the book, including several appendices (on structural-equation models, mixed models, survival analysis, etc.). The book is associated with the **car** and **effects** packages for R. I am a member of the R Foundation.

Alternatively (or additionally), more advanced students
may wish to use W. N. Venables and B. D. Ripley, *Modern
Applied Statistics with S* as a principal source. Bill Venables is a member of the R Foundation, and Brian Ripley is a member of the R Core group.

### Manuals

R is distributed with a set of manuals, which are also available at the CRAN web site.

A manual for S-PLUS Trellis Graphics (also useful for the **lattice** package in R) is also available on the web.

A great deal of information about using the RStudio integrated development environment is available on the RStudio website.

### Package Vignettes

Many R packages have "vignettes" -- long-form documentation -- in addition to the mandatory help pages. Enter the command help(package="*package-name*") and click the link *User guides, package vignettes and other documentation* if it appears on the main package help page. The command vignette() displays the names of all vignettes in packages residing in your library, while vignette(package="*package-name*") displays the names of vignettes (if any) in a particular package.
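The vignette commands described above can be tried out with a short sketch like the following. The choice of the **grid** package is only illustrative (it is part of the standard R distribution and is assumed here to be installed with its vignettes); substitute any package in your library. The commands that open a browser or document viewer are wrapped in a check for an interactive session.

```r
# Vignettes available in one package; "grid" is only an illustrative choice.
vg <- vignette(package = "grid")

# The result is a "packageIQR" object; its `results` matrix has one row per
# vignette, with columns that include "Item" (the name) and "Title".
vg$results[, c("Item", "Title"), drop = FALSE]

# The remaining commands open a browser or document viewer, so they are
# run here only in an interactive session.
if (interactive()) {
  help(package = "grid")              # main help page for the package
  vignette()                          # vignettes in all installed packages
  vignette("grid", package = "grid")  # open one vignette by name, if present
}
```

If a package has no vignettes, the `results` matrix simply has no rows.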

### Programming in S

R. A. Becker, J. M. Chambers, and A. R. Wilks, *The New S Language: A Programming Environment for Data Analysis and Statistics.*

J. M.
Chambers, *Programming with Data: A Guide to the S Language*.

J. M.
Chambers, *Software for Data
Analysis: Programming with R*. New York: Springer, 2008.
Chambers’s newest
book ranges quite widely, and emphasizes a deep understanding of the R
language, along with object-oriented programming, and links between R
and other
software. Some topics are unusual, such as processing text data in R.

J. M. Chambers and T. J. Hastie, eds., *Statistical Models in S*. Pacific Grove, CA: Wadsworth, 1992. An edited volume describing the statistical modeling capabilities in S, Versions 3 and 4, and R, and the object-oriented programming system used in S Version 3 and R (and available, for “backwards compatibility,” in S Version 4). In addition, the text covers S software for particular kinds of statistical models, including linear models, nonlinear models, generalized linear models, local-polynomial regression models, and generalized additive models. (The “White Book.”)

D. Eddelbuettel, *Seamless R and C++ Integration with Rcpp*. New York: Springer, 2013. Judicious use of compiled code written in C, C++, or Fortran can substantially improve the efficiency of some R programs. The **Rcpp** package and its cousins simplify the process of integrating C++ code in R. I recommend this book to those who already know C++.

R.
Gentleman, *R Programming for Bioinformatics*,
Boca Raton: Chapman and Hall, 2009. A thorough, though at points
relatively
difficult, treatment of programming in R, by one of the original
co-developers
of R and a founder of the related Bioconductor Project (which develops
computing tools for the analysis of genomic data). Don’t let the title
fool
you: Most of the book is of general interest to R programmers.

G. Grolemund, *Hands-On Programming with R*, Sebastopol CA: O'Reilly, 2014. A readable, easy-to-follow, basic introduction to R programming, which also introduces RStudio.

R.
Ihaka and R. Gentleman, “R: A language for data analysis and graphics.” *Journal
of Computational and Graphical Statistics*, 5:299-314, 1996.
The original
published description of the R project, now quite out of date but still
worth
looking at.

W. N. Venables and B. D. Ripley, *S Programming*. New York: Springer, 2000. A companion to *Modern Applied Statistics with S*, and at the time of its publication the definitive treatment of writing software in the various versions of S-PLUS and R; now somewhat dated, particularly with respect to R. Brian Ripley is a member of the R Core group of developers, and Bill Venables is a member of the R Foundation.

H. Wickham, *Advanced R*. Boca Raton FL: Chapman and Hall/CRC, 2015. Hadley Wickham has contributed a number of widely used R packages (such as **ggplot2** for graphics and **plyr** for data manipulation) and is associated with RStudio. As the name implies, you may (and should!) be interested in reading this book after you’ve learned the basics of R programming. A related volume by Wickham, *R Packages*, Sebastopol CA: O'Reilly, 2015, is (as its name implies) about how to write R packages. Wickham's approach to R programming and package-writing is sometimes idiosyncratic but always carefully considered and interesting. The websites for the *Advanced R* and *R Packages* books provide access to the text. Hadley Wickham is a member of the R Foundation.

Y. Xie, *Dynamic Documents with R and knitr*. Boca Raton FL: Chapman and Hall/CRC, 2013. Yihui Xie describes the use of his **knitr** package for creating LaTeX documents with embedded executable R code. This package also provides the basis for R Markdown in RStudio.

**Statistical Computing in R**

The following three books treat traditional topics in statistical computing, such as optimization, simulation, probability calculations, and computational linear algebra, using R (although the coverage of particular topics in the books differs). All offer introductions to R programming. Of these books, Braun and Murdoch is the briefest and most accessible.

W. J. Braun and D. J. Murdoch, *A First Course in Statistical Programming with R.* Cambridge: Cambridge University Press, 2007. Duncan Murdoch is a member of the R Core group of developers.

O.
Jones, R. Maillardet, and A. Robinson, *Introduction
to Scientific Programming and Simulation Using R*. Boca
Raton: Chapman and Hall, 2009.

M. L.
Rizzo. *Statistical Computing
with R*, Boca Raton: Chapman and Hall, 2008.

**Graphics in R**

P. Murrell. *R Graphics, Second Edition*. New York: Chapman and Hall, 2011. A tour de force – the definitive reference on traditional R graphics and on the grid graphics system on which **lattice** graphics (the R implementation of William Cleveland’s Trellis graphics) is built. R code to produce the figures in the book is on Murrell’s web site. Paul Murrell is a member of the R Core group of developers.

P.
Murrell and R. Ihaka, “An approach to providing mathematical annotation
in plots.” *Journal of Computational and
Graphical Statistics*, 9:582-599, 2000. One of the unusual and
very useful
features of R graphics is the ability to include mathematical notation.
This
article explains how. Paul Murrell and Ross Ihaka are both members of the R Core group.

D.
Sarkar, *Lattice: Multivariate Data
Visualization with R*. New York: Springer, 2008. Deepayan
Sarkar is the developer
of the powerful **lattice** package in
R, which implements Trellis graphics. This book provides a fine
introduction to
and overview of **lattice** graphics.
Figures from the book and the R code to produce them are available on the web. Deepayan Sarkar is a member of the R Core group of developers.

H. Wickham, *ggplot2: Elegant Graphics for Data Analysis*. New York: Springer, 2009. A guide to Hadley Wickham's **ggplot2** package, which provides an alternative graphics system for R based on an extension of Wilkinson's *The Grammar of Graphics* (Second Edition, Springer, 2005), which, in turn, provides a systematic basis for constructing statistical graphs.

**Data Management**

P.
Spector, *Data Manipulation with R*.
New York: Springer, 2008. Data management is a dry subject, but the
ability to
carry it out is vital to the effective day-to-day use of R (or of any
statistical software). Spector provides a reasonably broad and clear
introduction to the subject.

### (Highly) Selected Statistical Methods Programmed in R

Also see the package listing on CRAN and the various CRAN "task views."

J. Adler, *R in a Nutshell: A Desktop Quick Reference*, Sebastopol CA: O’Reilly. Basic information about using R, including brief illustrations of many R commands. New users of R may find the information in this book useful.

R. S. Bivand, E. J. Pebesma, and V. Gómez-Rubio, *Applied Spatial Data Analysis with R*, New York: Springer, 2008. There is a strong community of researchers in spatial statistics developing R software, much of which is described in this book, including the basic **sp** package, which provides R classes for spatial data. Roger Bivand is a member of the R Foundation.

A. W. Bowman and A. Azzalini, *Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations*. Oxford: Oxford University Press, 1997. A good introduction to nonparametric density estimation and nonparametric regression, associated with the **sm** package (for both S-PLUS and R).

A. C. Davison and D. V. Hinkley, *Bootstrap Methods and their Application*. Cambridge: Cambridge University Press, 1997. A comprehensive introduction to bootstrap resampling, associated with the **boot** package (written by A. J. Canty). Somewhat more difficult than Efron and Tibshirani (immediately below).

B.
Efron and R. J. Tibshirani, *An Introduction to the Bootstrap*.
London: Chapman and Hall, 1993. Another extensive treatment of
bootstrapping by its originator (Efron), also accompanied by an R
package, bootstrap (but somewhat less usable than boot).

B. S. Everitt and T. Hothorn, *A Handbook of Statistical Analyses Using R, Second Edition*. Boca Raton: Chapman and Hall, 2010. Many worked-out, brief examples, illustrating a variety of statistical methods. New users of R may find this book useful.

M. Friendly and D. Meyer, *Discrete Data Analysis with R: Visualization and Modelling Techniques for Categorical and Count Data*. Boca Raton: Chapman and Hall, 2016. A tour-de-force, wide-ranging treatment of the material clearly described by the title of the book. Visit the web site for the book for chapter summaries, illustrative graphs, and a variety of other information.

A. Gelman and J. Hill, *Data Analysis Using Regression and Multilevel/Hierarchical Models*. Cambridge: Cambridge University Press, 2007. A wide-ranging yet deep treatment of hierarchical models and various related topics, predominantly but not exclusively from a Bayesian perspective, using both R and BUGS software.

F. E.
Harrell, Jr., *Regression Modeling Strategies, With
Applications to Linear Models, Logistic Regression, and Survival
Analysis*. New York: Springer, 2001. Describes an interesting
approach to statistical modeling, with frequent references to Harrell's Hmisc and Design (now **rms**) packages.

T. J.
Hastie and R. J. Tibshirani, *Generalized Additive Models*.
London: Chapman and Hall, 1990. An accessible treatment of generalized
additive models, as implemented in the gam package, and of nonparametric regression analysis in
general. [The gam function in the **mgcv** package in R takes a somewhat different approach; see Wood
(2006), below.]

R. Koenker, *Quantile Regression*. Cambridge: Cambridge University Press, 2005. Describes a variety of methods for quantile regression by the leading figure in the area. The methods are implemented in Koenker's **quantreg** package for R.

C.
Loader, *Local Likelihood and Regression*. New York:
Springer, 1999. Another text on nonparametric regression and density
estimation, using the locfit package. Although the
text is less readable than Bowman and Azzalini, the locfit software is very capable.

T. Lumley, *Complex Surveys: A Guide to Analysis Using R*. Hoboken NJ: Wiley, 2010. A lucid introduction to the analysis of data from complex survey samples and to Lumley's highly capable **survey** package. Thomas Lumley is a member of the R Core group of developers.

G. P. Nason, *Wavelet Methods in Statistics with R*. New York: Springer, 2008. Describes the **wavethresh** package for wavelet smoothing, by one of the key figures in the development of wavelet methods in statistics.

J. C.
Pinheiro and D. M. Bates, *Mixed-Effects Models in S and S-PLUS*.
New York: Springer, 2000. An extensive treatment of linear and
nonlinear mixed-effects models in S, focused on the authors' nlme package. Mixed models are appropriate for various kinds of
non-independent (clustered) data, including hierarchical and
longitudinal data. Does not cover Bates's newer lme4 package. Doug Bates is a member of the R Core group of developers.

T. M.
Therneau and P. M. Grambsch, *Modeling Survival Data:
Extending the Cox Model*. New York: Springer, 2000. An
overview of both basic and advanced methods of survival analysis
(event-history analysis), with reference to S and SAS software, the former implemented in Therneau's state-of-the-art survival package.

S. van Buuren, *Flexible Imputation of Missing Data*, Boca Raton FL: CRC Press, 2012. There are several packages in R for multiple imputation of missing data; this book largely describes the **mice** (multiple imputation by chained equations) package.

W. N.
Venables and B. D. Ripley. *Modern Applied Statistics with S,
Fourth Edition*. New York: Springer, 2002. An influential and
wide-ranging treatment of data analysis using S. Many of the facilities
described in the book are programmed in the associated (and
indispensable) MASS, nnet, and spatial packages, which are included in the standard R distribution. This text is
more advanced and has a broader focus than the *R Companion*. Brian Ripley is a member of the R Core group of developers and Bill Venables is a member of the R Foundation.

S. N.
Wood, *Generalized
Additive Models: An Introduction with R*. New York: Chapman
and Hall, 2006. Describes the mgcv package in R, which contains a gam function
for fitting generalized additive models based on smoothing splines. The
initials "mgcv" stand for multiple generalized cross validation, the
method by which Wood selects GAM smoothing parameters.

### Other Sources (Many Free)

See the publications list on the R web site. The R Journal, the journal of the R Project for Statistical Computing, and its predecessor R News, are also good sources of information, as is the Journal of Statistical Software, an on-line American Statistical Association journal dominated by coverage of R packages.

Information about R packages in a number of application areas is available in various “CRAN task views.” Also see the package listing on CRAN.

The RStudio web site is a good source of information both on using the RStudio IDE and on other topics, such as R Markdown.