Programming in R

John Fox

(Department of Sociology, McMaster University)

ICPSR Summer Program

2009

Office Hours: 12:30-1:30, 325 HN; Email: jfox AT mcmaster.ca
Teaching Assistant: Kurt Pyle; Office Hours: 12:30-2:00, 320 HN; Email: pylekurt AT msu.edu


The S statistical programming language and computing environment has become the de facto standard among statisticians and has made substantial inroads in the social sciences. A statistical software “package,” such as SPSS, makes routine data analysis relatively easy, but makes it relatively difficult to do things that are innovative or non-standard, or to add to the built-in capabilities of the package. In contrast, a good statistical computing environment also makes routine data analysis easy, but additionally supports convenient programming.

This two-week course focuses on programming in R, a free, open-source implementation of S. (A commercial implementation of S called S-PLUS is still available, but has been eclipsed by R.) Some previous exposure to R is assumed, but not necessarily to programming in R, nor to programming in general. The course is project-oriented: In addition to small daily problems, each participant is encouraged to complete a substantial programming project in R, either working individually or as part of a collaborative team.


Exercises

A note of explanation concerning the exercises: I'd like you to complete and submit to the course TA at least one exercise each day. These may be selected from among the exercises given in the table below, or you may make up your own exercises as long as they are generally appropriate to the day's topic. Indeed, I encourage you to complete exercises that are directly relevant to your project (see below). If you're working on a team project, then feel free to complete either an individual or team exercise, as appropriate.


Topics

Topic (days), with optional reading (in the R and S-PLUS Companion) and course materials:

1. Review of data analysis using R
   Optional reading: Ch. 1, Sec. 2.1-2.2, Ch. 3-5
   Course materials: R script: Review.R; data files: Duncan.txt, Powers.txt, Long.txt; exercises; reference cards: Tom Short's, Jonathan Baron's.

2-3. Basic R programming
   Optional reading: Sec. 8.1-8.4, 8.6 (new edition 8.1-8.5, 8.10)
   Course materials: R script: Basic-programming.R; logistic-regression notes; exercises.

4. Data and data structures in R
   Optional reading: Sec. 2.3-2.4
   Course materials: R script: Data-structures.R; data files: Prestige.txt, Duncan.xls, nations.por; exercises.

5. Object-oriented programming and lexical scope in R
   Optional reading: Sec. 8.4-8.5 (new edition 8.7, 8.9)
   Course materials: R script: Object-oriented-programming.R; exercises.

6. Improving and debugging R programs
   Optional reading: (new edition 8.6)
   Course materials: R script: Debugging.R; exercises; bugged-functions.R.

7. Writing statistical-modeling functions and R packages
   Optional reading: (new edition 8.8)
   Course materials: R script: Modeling-functions-and-packages.R; other files: matrixDemos.R, matrixDemos.zip (WARNING: not a Windows binary package -- do not install); package-building notes; exercises; Windows package-building tools.

8. R graphics (time permitting)
   Optional reading: Ch. 7
   Course materials: R script: Graphics.R; example graphs; data file: UnitedNations.txt; exercises; symbols-colours.R.


Acquiring R

Windows Users

You can download the R Windows installer from CRAN; then double-click on the installer to install R as you would any Windows software. You can subsequently download and install only those packages that you want over the Internet from CRAN, via the Packages -> Install packages from CRAN menu in the RGui console. 

Mac Users

A universal binary for Mac OS X 10.4.4 and higher is available from CRAN. Double-click on the icon for R.mpkg in the disk image to install R. You can then download and install packages over the Internet via the Packages & Data -> Packages Installer menu.

Linux/Unix Users

Precompiled binaries for many Linux systems are available from CRAN, or users can compile R from source. See CRAN for details.

Projects

Because programming is primarily a learn-by-doing activity, I strongly encourage all participants to complete a substantial R programming project. As mentioned, projects may be undertaken by individuals or by teams. Good projects should be doable within, at most, the first month of the Summer Program, should be of interest to the programmers, and should be within the programmers’ statistical competence.

A project may entail writing a single function (e.g., to fit a particular type of statistical model); or, more likely, several related functions; or, even better, a package containing related functions, associated documentation, and possibly data. Because there are more than 1700 packages on CRAN (and more than 300 in the Bioconductor package repository), it may be difficult to find entirely novel projects. Although it would be desirable to write a novel program, the principal criterion for projects is that they provide useful learning experiences.
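As a rough illustration of the smallest kind of project, a single model-fitting function, here is a minimal sketch in R. It is not taken from the course materials: the function name ols, the class name "ols", and the simulated data are all hypothetical, and the normal-equations approach is chosen for brevity rather than numerical robustness (R's own lm() uses a QR decomposition).

```r
# Hypothetical sketch: fit a linear model by least squares via the
# normal equations, returning an object of (made-up) S3 class "ols".
ols <- function(X, y) {
  X <- cbind(Intercept = 1, as.matrix(X))  # prepend a column of ones
  XtX <- crossprod(X)                      # X'X
  b <- solve(XtX, crossprod(X, y))         # solve (X'X) b = X'y
  residuals <- y - as.vector(X %*% b)
  df <- nrow(X) - ncol(X)                  # residual degrees of freedom
  s2 <- sum(residuals^2) / df              # estimated error variance
  result <- list(coefficients = drop(b),
                 vcov = s2 * solve(XtX),
                 df.residual = df)
  class(result) <- "ols"
  result
}

# A print method, so that typing the object's name shows a brief summary.
print.ols <- function(x, ...) {
  cat("Least-squares coefficients:\n")
  print(x$coefficients)
  invisible(x)
}

# Simulated data, for illustration only.
set.seed(123)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- ols(x, y)
fit
```

A several-function project might add summary() and other methods for the same class, and a package would bundle such functions together with documentation and data.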

In addition to some class time devoted to projects, the TA and I will have regular office hours to assist participants, and the TA will convene additional help sessions.


Selected Bibliography

There has been an explosion of sources on R, with many titles of the form “X with R” (where "X" is some area of statistics). Moreover, many statistics books not specifically focused on R contain R code and examples. The following bibliography focuses instead on books that deal primarily with R programming, R graphics, and closely related topics.

An exception is my book, J. Fox, An R and S-PLUS Companion to Applied Regression, Sage, 2002. Although the focus is not on programming, and the book is a bit dated, many of the examples and much of the code that I use in this course are derived from it. The course is meant to be self-contained, and you do not have to read this or any other book to take it; you may, however, find relevant sections of the book, indicated in the course outline, to be useful. A web site for the book, with scripts for the examples, several on-line appendices, and some other R-related materials may be found at <http://socserv.mcmaster.ca/jfox/Books/Companion/index.html>. The book is associated with the car package for R. A draft version of the revised and expanded Chapter 8 on programming from the second edition of this book (in preparation with co-author Sanford Weisberg and retitled An R Companion to Applied Regression) will be distributed in class.

Another useful general source that is not focused on programming is W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, Fourth Edition, Springer, 2002. It, too, is slightly dated, but it covers many of the topics in this course, and has a wider range than Fox (2002). Venables and Ripley’s text is associated with the MASS, nnet, class, and spatial packages, which are part of the standard R distribution.

Manuals

R is distributed with a complete set of manuals, which are also available at the CRAN (Comprehensive R Archive Network) web site, <http://cran.r-project.org/manuals.html>.

A manual for Trellis Graphics (also useful for the lattice package in R) is at <http://cm.bell-labs.com/cm/ms/departments/sia/doc/trellis.user.pdf>.

Programming in R/S

R. A. Becker, J. M. Chambers, and A. R. Wilks, The New S Language: A Programming Environment for Data Analysis and Statistics. Pacific Grove, CA: Wadsworth, 1988. Defines S Version 2, which forms the basis of the currently used S Versions 3 and 4, as well as R. (Sometimes called the “Blue Book.”)

J. M. Chambers, Programming with Data: A Guide to the S Language. New York: Springer, 1998. Describes the then-new features in S Version 4, including the newer formal object-oriented programming system (also incorporated in R), by the principal designer of the S language. Not an easy read. (The “Green Book.”)

J. M. Chambers, Software for Data Analysis: Programming with R. New York: Springer, 2008. Chambers’s newest book ranges quite widely, and emphasizes a deep understanding of the R language, along with object-oriented programming, and links between R and other software. Some topics are unusual, such as processing text data in R.

J. M. Chambers and T. J. Hastie, eds., Statistical Models in S. Pacific Grove, CA: Wadsworth, 1992. An edited volume describing the statistical modeling capabilities in S, Versions 3 and 4, and R, and the object-oriented programming system used in S Version 3 and R (and available, for “backwards compatibility,” in S Version 4). In addition, the text covers S software for particular kinds of statistical models, including linear models, nonlinear models, generalized linear models, local-polynomial regression models, and generalized additive models. (The “White Book.”)

R. Gentleman, R Programming for Bioinformatics. Boca Raton: Chapman and Hall, 2009. A thorough, though at points relatively difficult, treatment of programming in R, by one of the original co-developers of R and a founder of the related Bioconductor Project (which develops computing tools for the analysis of genomic data). Don’t let the title fool you: Most of the book is of general interest to R programmers.

R. Ihaka and R. Gentleman, “R: A language for data analysis and graphics.” Journal of Computational and Graphical Statistics, 5:299-314, 1996. The original published description of the R project, now quite out of date but still worth looking at.

W. N. Venables and B. D. Ripley, S Programming. New York: Springer, 2000. A companion volume to Modern Applied Statistics with S, and at the time of its publication the definitive treatment of writing software in the various versions of S-PLUS and R; now somewhat dated, particularly with respect to R.

Statistical Computing in R

The following three books treat traditional topics in statistical computing, such as optimization, simulation, probability calculations, and computational linear algebra, using R (although the coverage of particular topics in the books differs). All offer introductions to R programming. Of these books, Braun and Murdoch is the briefest and most accessible.

W. J. Braun and D. J. Murdoch, A First Course in Statistical Programming with R. Cambridge: Cambridge University Press, 2007.

O. Jones, R. Maillardet, and A. Robinson, Introduction to Scientific Programming and Simulation Using R. Boca Raton: Chapman and Hall, 2009.

M. L. Rizzo, Statistical Computing with R. Boca Raton: Chapman and Hall, 2008.

Graphics in R

P. Murrell, R Graphics. New York: Chapman and Hall, 2006. A tour de force: the definitive reference on traditional R graphics and on the grid graphics system on which “lattice” graphics (the R implementation of William Cleveland’s Trellis graphics) is built. The figures from the book and R code to produce them are on Murrell’s web site at <http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html>.

P. Murrell and R. Ihaka, “An approach to providing mathematical annotation in plots.” Journal of Computational and Graphical Statistics, 9:582-599, 2000. One of the unusual and very useful features of R graphics is the ability to include mathematical notation. This article explains how.

D. Sarkar, Lattice: Multivariate Data Visualization with R. New York: Springer, 2008. Deepayan Sarkar is the developer of the powerful lattice package in R, which implements Trellis graphics. This book provides a fine introduction to and overview of lattice graphics. Figures from the book and the R code to produce them are available at <http://lmdvr.r-forge.r-project.org/figures/figures.html>.

Data Management

P. Spector, Data Manipulation with R. New York: Springer, 2008. Data management is a dry subject, but the ability to carry it out is vital to the effective day-to-day use of R (or of any statistical software). Spector provides a reasonably broad and clear introduction to the subject.

Other Sources (Some Free)

See the R web site, at <http://www.r-project.org/doc/bib/R-publications.html>. R News, the newsletter of the R Project for Statistical Computing, is available at <http://cran.r-project.org/doc/Rnews/>, and has recently been superseded by The R Journal <http://journal.r-project.org/>. There is a variety of free contributed documentation of varying quality at <http://cran.r-project.org/other-docs.html>.


Last Modified: 24 June 2009 by J. Fox <jfox AT mcmaster.ca>