Lecture Series: An Introduction to the R Statistical Computing Environment
July 20-30, 2021, 5:30-7:30 PM
John Fox
Department of Sociology, McMaster University
Teaching Assistant: Allison Leanage
ICPSR Summer Program
Short URL: tinyurl.com/ICPSR-R-course
Please read the installation instructions for R and RStudio.
These lectures comprise an introduction to the R statistical computing environment and to the RStudio IDE (integrated development environment), a powerful programming editor and front-end to R.
The first six lectures present a basic overview of R, including statistical modeling and data management; in effect, they use R as if it were a statistical package. Although there is some review, it is assumed that the statistical material is familiar, and the focus is on how statistical methods, data-analysis workflow, and data-management tasks are implemented in R and RStudio. The final three sessions pick up where the basic lectures leave off, and provide the background required to use R more flexibly for data analysis and presentation, including an introduction to R programming and to the design of custom statistical graphs.
The R statistical programming language and computing environment has become the de facto standard among statisticians for writing statistical software, and it is widely used in the social sciences and elsewhere. R is freely available for Windows, macOS, and Unix/Linux systems, and is now possibly the most popular statistical software in the world. R users can readily write programs that add to its already impressive capabilities.
Nearly 18,000 R add-on packages (as of June 2021) are freely available on the internet in the Comprehensive R Archive Network (CRAN), along with many others in the Bioconductor package archive, which is aimed primarily at researchers in bioinformatics, though many of its packages are also of more general interest. These packages extend the capabilities of R to almost every area of statistical data analysis. R packages are also available from other, less curated, sources, including GitHub.
R is a free, open-source implementation (and extension) of the S language, which was originally developed at Bell Labs in the mid-1970s. The basic R system, created by Robert Gentleman and Ross Ihaka at the University of Auckland in New Zealand, is currently developed and maintained by the R Core group, comprising 20 members, many of them eminent in the field of statistical computing. The R Project for Statistical Computing is a project of the R Foundation, whose membership includes most of the R Core group along with a number of other individuals, and is also associated with the Free Software Foundation. See the R web site.
A statistical software package, such as SPSS, is primarily oriented toward combining instructions, possibly entered via a point-and-click interface, with rectangular case-by-variable datasets to produce (often voluminous) printed output. Although they may include limited programming capabilities, such packages make it easy to perform routine data analysis tasks, but they make it relatively difficult to do things that are innovative or nonstandard, or to extend the built-in capabilities of the package. Commercial closed-source statistical software can also be expensive.
The overall objectives of this lecture series are to provide some facility in the use of R, to a level that enables participants to employ the software for assignments and projects in other Summer Program courses as well as in their own work, and to provide a foundation for learning more about R and statistical computing.
The lectures are based mostly on Fox and Weisberg, An R Companion to Applied Regression, Third Edition, and partly on materials prepared specifically for this lecture series. An outline of the lectures follows, with chapter references to Fox and Weisberg (click on the date for each lecture to view a video of the lecture; videos will be posted here after the lecture series ends):
Topics
Day/Topic* | Reading | Materials**
1. Getting Started with R and the R Commander (July 20) | Ch. 1, Secs. 1.2-1.3, 1.5-1.7 | script, exercises (answers: MAD-exercise.R); data file: Duncan.txt; resources: getting help with R
2. Workflow in R, with RStudio and R Markdown (July 21) | Ch. 1, Secs. 1.1, 1.4 | script, exercise (answers: States-exercise.Rmd, States-exercise.html); materials: RMarkdown-examples.Rmd, Duncan.Rmd; resources: Base R, Advanced R, RStudio, and R Markdown "cheat sheets", R Markdown reference
3. Linear models in R, Part I (July 22) | Ch. 4 | script, exercise (do parts 1 and 2)
4. Linear models in R, Part II (July 23) | Ch. 5, Secs. 8.1-8.5 | script, exercise (do parts 3-5) (answers: Burt-exercise.Rmd, Burt-exercise.html)
5. Generalized linear models and mixed-effects models in R (July 26) | Chs. 6, 7, Secs. 8.6, 8.7 | script, exercises (answers: Powers-exercise.Rmd, Powers-exercise.html), data files: Powers.txt, Long.txt, Goldstein.txt (data management: Goldstein.R)
6. Data and data management in R, including an introduction to the “tidyverse” (July 27) | Ch. 2 | script, exercises (answers: merging-exercise.Rmd, merging-exercise.html), data files: Duncan.csv, Duncan.txt, Duncan.xlsx, flights14.csv, Prestige.xlsx
7. R programming, part I (July 28) | Ch. 10 | script, exercise (answers: Fibonacci-exercise.Rmd, Fibonacci-exercise.html)
8. R programming, part II (July 29) | | script, exercises (answers: Least-squares-program-exercise.Rmd, Least-squares-program-exercise.html)
9. R Graphics (July 30) | Ch. 9 | script, exercise (answers: Anscombe-exercise.Rmd, Anscombe-exercise.html), symbols and colors demo
*The division into days is approximate: Occasionally, material may spill over from one day to the next. There was an equipment failure during the second lecture that produced a brief interruption of the video, but no content was lost.
**I intend to do further work on the daily scripts as the lecture series progresses and so it's best to download each script on the day of the corresponding lecture. At the beginning of most lectures after the first, I'll briefly go over the solution to an exercise (marked by an asterisk in the daily exercises) from the preceding lecture. Answers to selected exercises (shown in parentheses) will be posted after exercises are taken up in the lectures.
The lectures are largely modular, so you should be able to attend those that interest you. Lectures 1 and 2 are basic, however, and necessary for much of the material in the other lectures. I'll also use a small number of PDF slides, which you can download as slides or as notes.
Selected Bibliography
Publishers of statistical texts have been producing a steady stream of books on R. Of particular note are Springer's Use R! series and Chapman and Hall/CRC’s The R Series. Books on R have proliferated to the extent that it's no longer possible to compile a comprehensive list of even the better books, so take the following as some of my personal recommendations.
Basic Text
The principal source for this workshop is J. Fox and S. Weisberg, An R Companion to Applied Regression, Third Edition, Sage (2019). Additional materials are available on the web site for the book, including several appendices (on multivariate linear models, survival analysis, Bayesian regression modeling, and more). The book is associated with the car, carData, and effects packages for R. It should not be necessary to read the book to understand the lectures.
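The three packages named above are distributed on CRAN. As a minimal sketch (not taken from the book or the lecture scripts), the following checks which of them are already installed. The object names pkgs, installed, and to_install are illustrative, and the install.packages() call is left commented out because it requires an internet connection.

```r
# Packages associated with the R Companion
pkgs <- c("car", "carData", "effects")

# TRUE for each package that can be loaded from the current library
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
to_install <- pkgs[!installed]
print(to_install)

# Download and install the missing packages from CRAN
# (uncomment to run; needs an internet connection)
# install.packages(to_install)
```

Once the packages are installed, library(car) loads the car package, which also attaches carData because car depends on it.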
Manuals
R is distributed with a set of manuals, which are accessible from RStudio, and which are also available at the CRAN web site.
The manual for S-PLUS Trellis Graphics is also useful for the lattice package in R.
A great deal of information about using the RStudio integrated development environment is available on the RStudio website.
Package Vignettes
Many R packages have "vignettes" -- long-form documentation -- in addition to the mandatory help pages. Enter the command help(package="package-name") and click the link User guides, package vignettes and other documentation if it appears on the main package help page. The command vignette() displays the names of all vignettes in packages residing in your library, while vignette(package="package-name") displays the names of vignettes (if any) in a particular package.
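The commands described above can be tried as follows. This is a minimal sketch: the grid package is used only because it is part of the standard R distribution and normally includes vignettes, and the object name grid_vignettes is illustrative.

```r
# Names and titles of the vignettes (if any) in one package
grid_vignettes <- vignette(package = "grid")
print(grid_vignettes$results[, c("Item", "Title"), drop = FALSE])

# Main help page for the package; in RStudio, look there for the link to
# "User guides, package vignettes and other documentation"
help(package = "grid")

# List the vignettes in all installed packages (the list can be long)
# vignette()

# Open a single vignette by name (starts a viewer, so it is commented out here)
# vignette("grid", package = "grid")
```

Substituting the name of any installed package for "grid" works the same way; a package without vignettes simply produces an empty listing.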
Programming in R
R. A. Becker, J. M. Chambers, and A. R. Wilks, The New S Language: A Programming Environment for Data Analysis and Statistics.
J. M. Chambers, Programming with Data: A Guide to the S Language. New York: Springer, 1998. Describes the then-new features in S Version 4, including the newer formal object-oriented programming system (also incorporated in R), by the principal designer of the S language and a member of the R Core group of developers. Not an easy read. (The “Green Book.”)
J. M. Chambers, Software for Data Analysis: Programming with R. New York: Springer, 2008. This book ranges quite widely, and emphasizes a deep understanding of the R language, along with object-oriented programming, and links between R and other software. Some topics are unusual, such as processing text data in R.
J. M. Chambers, Extending R. Boca Raton: CRC/Chapman & Hall, 2016. Chambers's newest book covers a variety of advanced topics, with an emphasis on object-oriented programming in R, including the relatively new system of "reference classes," and on interfacing R with other programming languages.
J. M. Chambers and T. J. Hastie, eds., Statistical Models in S.
D. Eddelbuettel, Seamless R and C++ Integration with Rcpp. New York: Springer, 2013. Judicious use of compiled code written in C, C++, or Fortran can substantially improve the efficiency of some R programs. The Rcpp package and its cousins simplify the process of integrating C++ code in R. I recommend this book to those who already know C++.
R. Gentleman, R Programming for Bioinformatics, Boca Raton: Chapman and Hall, 2009. A thorough, though at points relatively difficult, treatment of programming in R, by one of the original co-developers of R and a founder of the related Bioconductor Project (which develops computing tools for the analysis of genomic data). Don’t let the title fool you: Most of the book is of general interest to R programmers.
G. Grolemund, Hands-On Programming with R, Sebastopol CA: O'Reilly, 2014. A readable, easy-to-follow, basic introduction to R programming, which also introduces RStudio.
R. Ihaka and R. Gentleman, “R: A language for data analysis and graphics.” Journal of Computational and Graphical Statistics, 5:299-314, 1996. The original published description of the R project, now quite out of date but still worth looking at.
W. N. Venables and B. D. Ripley, S Programming.
H. Wickham, Advanced R, Second Edition. Boca Raton: Chapman and Hall/CRC, 2019. Hadley Wickham has contributed a number of widely used R packages (such as ggplot2 for graphics and the "tidyverse" suite of packages for data manipulation) and is associated with RStudio. As the name implies, you may (and should!) be interested in reading this book after you’ve learned the basics of R programming. A related volume by Wickham, R Packages, Sebastopol CA: O'Reilly, 2015, is (as its name implies) about how to write R packages. Wickham's approach to R programming and package-writing is sometimes idiosyncratic but always carefully considered and interesting. The websites for the Advanced R and R Packages books provide access to the texts (to the first edition, in the case of Advanced R, and to the work-in-progress second edition in the case of R Packages).
Y. Xie, Dynamic Documents with R and knitr, Second Edition. Boca Raton: Chapman and Hall/CRC, 2016. Yihui Xie describes the use of his knitr package for creating LaTeX documents with embedded executable R code. This package also provides the basis for R Markdown in RStudio. The website for knitr also has useful information.
Y. Xie, J. J. Allaire, and G. Grolemund, R Markdown: The Definitive Guide. Boca Raton: CRC/Chapman and Hall, 2019. Provides a more extensive treatment of R Markdown for a variety of purposes, including documents of various sorts and presentations.
Statistical Computing in R
The following three books treat traditional topics in statistical computing, such as optimization, simulation, probability calculations, and computational linear algebra, using R (although the coverage of particular topics in the books differs). All offer introductions to R programming. Of these books, Braun and Murdoch is the briefest and most accessible.
W. J. Braun and D. J. Murdoch, A First Course in Statistical Programming with R, Second Edition. Cambridge: Cambridge University Press, 2016.
O. Jones, R. Maillardet, and A. Robinson, Introduction to Scientific Programming and Simulation Using R, Second Edition. Boca Raton: Chapman and Hall, 2014.
M. L. Rizzo. Statistical Computing with R, Second Edition, Boca Raton: Chapman and Hall, 2019.
Graphics in R
W. Chang. R Graphics Cookbook: Practical Recipes for Visualizing Data. Sebastopol CA: O’Reilly. Includes examples both for base R graphics and for ggplot2 graphics (see below) with a focus on the latter. Recipe books for statistical software are inherently limited, but can be helpful, particularly for beginners, and particularly for complex software like ggplot2. A new edition of the book is projected for summer 2021.
P. Murrell. R Graphics, Third Edition. New York: Chapman and Hall, 2019. A tour-de-force – the definitive reference on traditional R graphics and on the grid graphics system on which lattice graphics (the R implementation of William Cleveland’s Trellis graphics) and ggplot2 graphics are built. R code to produce the figures in the book is on Murrell’s web site.
P. Murrell and R. Ihaka, “An approach to providing mathematical annotation in plots.” Journal of Computational and Graphical Statistics, 9:582-599, 2000. One of the unusual and very useful features of R graphics is the ability to include mathematical notation. This article explains how.
D. Sarkar, Lattice: Multivariate Data Visualization with R. New York: Springer, 2008. Deepayan Sarkar is the developer of the powerful lattice package in R, which implements Trellis graphics. This book provides a fine introduction to and overview of lattice graphics. Figures from the book and the R code to produce them are available on the web.
H. Wickham, ggplot2: Elegant Graphics for Data Analysis, Second Edition. New York: Springer, 2016. A guide to Hadley Wickham's ggplot2 package, which provides an alternative graphics system for R based on an extension of Wilkinson's The Grammar of Graphics (Second Edition, Springer, 2005), which, in turn, provides a systematic basis for constructing statistical graphs. There's a website for the ggplot2 package that includes a link to R Markdown source code from which the book can be compiled.
Data Management
B. C. Boehmke, Data Wrangling with R. Cham, Switzerland: Springer, 2016. A wide-ranging if not comprehensive treatment of data management in R, including information on data structures, data input and output, and some R programming.
P. Spector, Data Manipulation with R. New York: Springer, 2008. Data management is a dry subject, but the ability to carry it out is vital to the effective day-to-day use of R (or of any statistical software). Spector provides a reasonably broad and clear introduction to the subject.
H. Wickham and G. Grolemund. R for Data Science. Sebastopol CA: O'Reilly, 2017. This book (with an associated website) has quite a wide focus, touching on subjects such as statistical modeling, statistical graphics, and reproducible research with R Markdown, but its real strength is in data management using various R packages in the "Tidyverse" created by Hadley Wickham and his colleagues at RStudio. A website for the book includes the text; also see the tidyverse website. For an interesting general critique of the Tidyverse (with which I don’t entirely agree), see an essay by Norm Matloff.
(Highly) Selected Statistical Methods Implemented in R
Also see the package listing on CRAN and the various CRAN "task views."
J. Adler, R in a Nutshell: A Desktop Quick Reference, Second Edition, Sebastopol CA: O’Reilly, 2012. Basic information about using R, including brief illustrations of many R commands. New users of R may find the information in this book useful.
R. S. Bivand, E. J. Pebesma, and V. Gómez-Rubio, Applied Spatial Data Analysis with R, Second Edition, New York: Springer, 2013. There is a strong community of researchers in spatial statistics developing R software, much of which is described in this book, including the basic sp package, which provides R classes for spatial data.
A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford: Oxford University Press, 1997. A good introduction to nonparametric density estimation and nonparametric regression, associated with the sm package (for both S-PLUS and R).
F. Bretz, T. Hothorn, and P. Westfall, Multiple Comparisons Using R. Boca Raton: Chapman and Hall, 2011. Covers methods for simultaneous statistical inference focusing on the flexible and general multcomp package.
C. Brunsdon and L. Comber, An Introduction to R for Spatial Analysis and Mapping. London: Sage, 2015. Overlaps somewhat with Bivand et al., but the presentation is at a more elementary level; particularly strong on drawing maps.
A. C. Davison and D. V. Hinkley, Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997. A comprehensive introduction to bootstrap resampling, associated with the boot package (written by A. J. Canty). Somewhat more difficult than Efron and Tibshirani (immediately below).
B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. London: Chapman and Hall, 1993. Another extensive treatment of bootstrapping by its originator (Efron), also accompanied by an R package, bootstrap (but somewhat less usable than boot).
B. S. Everitt and T. Hothorn, A Handbook of Statistical Analyses Using R, Third Edition. Boca Raton: Chapman and Hall, 2014. Many worked-out, brief examples, illustrating a variety of statistical methods. New users of R may find this book useful.
J. Fox. Using the R Commander: A Point-and-Click Interface to R. Boca Raton: Chapman and Hall, 2017. Provides a detailed guide to installing and using the R Commander graphical user interface for R. There's a website for the book.
M. Friendly and D. Meyer, Discrete Data Analysis with R: Visualization and Modelling Techniques for Categorical and Count Data. Boca Raton: Chapman and Hall, 2016. A tour-de-force, wide-ranging treatment of the material clearly described by the title of the book. Visit the web site for the book for chapter summaries, illustrative graphs, and a variety of other information.
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. Rubin, Bayesian Data Analysis, Third Edition. Boca Raton: CRC/Chapman & Hall, 2013. More demanding than McElreath's text, described below, this is a tour-de-force exposition of Bayesian methods, including for mixed-effects models. An appendix to the text explains how to use R and Stan (via the rstan package) for Bayesian inference. Andrew Gelman and Aki Vehtari are among the developers of Stan.
A. Gelman and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press, 2007. A broad yet deep treatment of hierarchical and other regression models, and various related topics, predominantly but not exclusively from a Bayesian perspective, using both R and BUGS software. A newer book, conceived as a partial second edition limited to non-hierarchical models, A. Gelman, J. Hill, and A. Vehtari, Regression and Other Stories, Cambridge: Cambridge University Press, 2021, uses the rstanarm package (which provides a user-friendly front end to Stan) to fit Bayesian regression models.
F. E. Harrell, Jr., Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis, Second Edition. New York: Springer, 2015. Describes an interesting approach to statistical modeling, with frequent references to Harrell's Hmisc and rms packages.
T. J. Hastie and R. J. Tibshirani, Generalized Additive Models. London: Chapman and Hall, 1990. An accessible treatment of generalized additive models, as implemented in the gam package, and of nonparametric regression analysis in general. [The gam function in the mgcv package in R takes a somewhat different approach; see Wood (2017), below.]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, with Applications in R. New York: Springer, 2013. A good, accessible introduction to so-called "machine-learning" methods for regression and classification that, in my opinion, clearly reveals the strengths and weaknesses of this area. For a more advanced treatment of essentially the same topics, see T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, New York: Springer, 2017. A downloadable copy of Hastie et al. is available on the web.
R. Koenker, Quantile Regression. Cambridge: Cambridge University Press, 2005. Describes a variety of methods for quantile regression by the leading figure in the area. The methods are implemented in Koenker's quantreg package for R.
C. Loader, Local Likelihood and Regression. New York: Springer, 1999. Another text on nonparametric regression and density estimation, using the locfit package. Although the text is less readable than Bowman and Azzalini, the locfit software is very capable.
T. Lumley, Complex Surveys: A Guide to Analysis Using R. Hoboken NJ: Wiley, 2010. A lucid introduction to the analysis of data from complex survey samples and to Lumley's highly capable survey package. Thomas Lumley is a member of the R Core group of developers.
P. Mair, Modern Psychometrics with R. Cham, Switzerland: Springer, 2018. Shows how to use R for a wide range of standard psychometric methods, such as classical test theory, item-response theory, factor analysis, and structural equation modeling. The book provides an overview of the topics covered that's broader than it is deep.
R. McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Second Edition. Boca Raton: CRC/Chapman and Hall, 2020. (Caveat: The second edition of the book is relatively new and I haven't yet read it, so my comments pertain to the first edition.) The title is reasonably descriptive of this very readable introduction to modern Bayesian methods. The use of R and Stan in the book is somewhat idiosyncratic, employing the author's rethinking package, which is freely available but not from CRAN. Stan implements state-of-the-art Hamiltonian Monte Carlo methods for drawing samples from posterior distributions, and may be accessed through R via the rstan package (which is on CRAN – see, e.g., the on-line appendix to Fox and Weisberg's R Companion on Bayesian regression in R).
G. P. Nason, Wavelet Methods in Statistics with R. New York: Springer, 2008. Describes the wavethresh package for wavelet smoothing, by one of the key figures in the development of wavelet methods in statistics.
J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000. An extensive treatment of linear and nonlinear mixed-effects models in S, focused on the authors' nlme package. Mixed models are appropriate for various kinds of non-independent (clustered) data, including hierarchical and longitudinal data. Does not cover Bates's newer lme4 package. Doug Bates is a member of the R Core group of developers. Ben Bolker maintains an extensive website on mixed-effects models in R.
T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000. An overview of both basic and advanced methods of survival analysis (event-history analysis), with reference to S and SAS software, the former implemented in Therneau's state-of-the-art survival package (and part of the standard R distribution), which has evolved since the publication of the book (see the vignettes included in the package).
S. van Buuren, Flexible Imputation of Missing Data, Second Edition, Boca Raton: CRC Press, 2018. There are several packages in R for multiple imputation of missing data; this book largely describes the mice (multiple imputation by chained equations) package, which is a reasonable general choice. An HTML version of the book is available on-line.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S, Fourth Edition. New York: Springer, 2002. An influential and wide-ranging treatment of data analysis using S. Many of the methods described in the book are programmed in the associated (and indispensable) MASS, nnet, and spatial packages, which are included in the standard R distribution. This text is more advanced and has a broader focus than the R Companion. I once considered the MASS book the best moderately advanced reference on statistical data analysis in S and R. The book is still useful, but it is showing its age.
S. N. Wood, Generalized Additive Models: An Introduction with R, Second Edition. New York: Chapman and Hall, 2017. Describes the mgcv package in R, which contains a gam function for fitting generalized additive models based on smoothing splines. The initials "mgcv" stand for multiple generalized cross validation, the method by which Wood selects GAM smoothing parameters.
Other Sources (Many Free)
The home page of the R web site has links to a variety of resources.
The R Journal, the journal of the R Project for Statistical Computing, and its predecessor R News, are also good sources of information, as is the Journal of Statistical Software, a free-access, on-line, and "high-impact" American Statistical Association journal dominated by coverage of R packages.
Information about R packages in a number of application areas is available in various “CRAN task views.” Also see the package listing on CRAN.
The RStudio web site is a good source of information both on using the RStudio IDE and on other topics, such as R Markdown (see the link for "Resources").