The statistical programming language and computing environment S has become the de facto standard among statisticians. The S language has two major implementations: the commercial product S-PLUS, and the free, open-source R. Both are available for Windows and Unix/Linux systems; R, in addition, runs on Macintoshes. This one-day workshop introduces R.

A statistical package, such as SPSS, is primarily oriented toward combining instructions with rectangular case-by-variable datasets to produce (often voluminous) printouts. Such packages make routine data analysis relatively easy, but they make it relatively difficult to do things that are innovative or nonstandard, or to add to the built-in capabilities of the package. In contrast, a good statistical computing environment also makes routine data analysis easy, but it additionally supports convenient programming; this means that users can extend the already impressive facilities of R. Statisticians have taken advantage of the extensibility of R to contribute literally hundreds of freely available "packages" of R programs (called "library sections" or just "libraries" in S-PLUS). As well, R is especially capable in the area of statistical graphics, reflecting the origin of S at Bell Labs, a centre of graphical innovation.
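
As a minimal sketch of what this extensibility looks like in practice, the following defines a small function and then uses it like any built-in. The function name `std_error` and the sample data are illustrative assumptions and are not taken from the workshop materials.

```r
# Hypothetical helper: standard error of the mean, ignoring missing values.
std_error <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  sd(x) / sqrt(length(x))    # sample SD divided by the root of the sample size
}

# Made-up data for illustration only.
income <- c(62, 72, 75, 55, 64, NA, 21)
std_error(income)
```

Once defined, such a function can be saved in a script file and reused, which is the first step toward collecting functions into a package.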

The purpose of this workshop is to provide a quick introduction to R and to show you how to accomplish a variety of tasks, including (time permitting) the tasks of writing basic programs and constructing nonstandard graphs. The statistical content is largely assumed known.

| Topic | Materials |
|---|---|
| Basics | R script file, Duncan.txt, exercises |
| Data | R script file, Prestige.txt, exercises, Fox-ODBC-functions.R |
| Statistical models | R script file, exercises, Long.txt, Powers.txt |
| Programming | R script file, exercises, bugged functions (solutions), notes |
| Graphics | R script file, exercises (solutions) |

CD/ROM and Acquiring R

I've created a CD/ROM with the installer for the Windows version of R, Windows binaries for all of the contributed packages on the Comprehensive R Archive Network (CRAN) web site, along with a pre-installed "live" version of R, which can be run directly from the CD, and the free Tinn-R programming editor. You can download an ISO image of the CD from this web site, and then burn it onto a CD.

Note that this is a large file (about 270 MB), and that an alternative is to download the much smaller R Windows installer directly; then double-click on the installer to install R as you would any Windows software. You can subsequently download and install only those packages that you want over the Internet from CRAN, via the *Packages > Install packages from CRAN* menu in the *RGui* console. Likewise, the small installer for Tinn-R can also be downloaded directly.
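
Packages can also be installed from the R command line rather than through the menu. The sketch below assumes an Internet connection and a reachable CRAN mirror; the helper name `missing_packages` is hypothetical, and the package names are simply two of those mentioned on this page.

```r
# Hypothetical helper: return the names in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

wanted <- c("car", "MASS")               # packages mentioned on this page
to_install <- missing_packages(wanted)

# Guarded so that nothing is downloaded when the script runs non-interactively.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)           # may prompt you to choose a CRAN mirror
}
```

After installation, a package is loaded into a session with `library`, for example `library(car)`.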

Additional
information about obtaining, installing, and configuring R is available
on the web site for my *R
and S-PLUS Companion to Applied Regression*.

SELECTED BIBLIOGRAPHY

There is, of course, no "text"
for the workshop, but the workshop content is largely drawn from J. Fox, *An
R and S-PLUS Companion to Applied Regression*, Sage, 2002. Additional materials
are available on the
web site for the book, including several appendices (on structural-equation
models, mixed models, survival analysis, etc.); scripts for the examples in
all of the chapters and appendices; information on acquiring and installing
R; and more. The book is associated with the car
package for R (and S-PLUS). Alternatively (or additionally), those with more
advanced backgrounds in statistics may wish to read W. N. Venables and B. D.
Ripley, *Modern Applied Statistics with S* as their principal source.

Manuals

R is distributed with a set of manuals, which are also available at the CRAN web site.

A manual for S-PLUS Trellis Graphics
(also useful for the lattice package in R) is also available
on the web.

Programming in S

R. A. Becker, J. M. Chambers, and
A. R. Wilks, *The New S Language: A Programming Environment for Data Analysis
and Statistics*. Pacific Grove, CA: Wadsworth, 1988. Defines S Version 2,
which forms the basis of the currently used S Versions 3 and 4, as well as R.
(Sometimes called the "Blue Book.")

J. M. Chambers, *Programming with
Data: A Guide to the S Language*. New York: Springer, 1998. Describes the
new features in S Version 4, including the newer formal object-oriented programming
system (also incorporated in R), by the principal designer of the S language.
Not an easy read. (The "Green Book.")

J. M. Chambers and T. J. Hastie,
eds., *Statistical Models in S*. Pacific Grove, CA: Wadsworth, 1992. An
edited volume describing the statistical modeling language in S, Versions 3
and 4, and R, and the object-oriented programming system used in S Version 3
and R (and available, for "backwards compatibility," in S Version
4). In addition, the text covers S software for particular kinds of statistical
models, including linear models, nonlinear models, generalized linear models,
local-polynomial regression models, and generalized additive models. (The "White
Book.")

R. Ihaka and R. Gentleman, R: A
language for data analysis and graphics. *Journal of Computational and Graphical
Statistics*, 5:299-314, 1996. The original published description of the R
project, now dated but still worth looking at.

W. N. Venables and B. D. Ripley,
*S Programming*. New York: Springer, 2000. The definitive treatment of
writing software in the various versions of S-PLUS and R, now slightly dated, particularly
with respect to R.

Selected Statistical Methods Programmed in S

A. W. Bowman and A. Azzalini, *Applied
Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations*.
Oxford: Oxford University Press, 1997. A good introduction to nonparametric
density estimation and nonparametric regression, associated with the sm
package (for both S-PLUS and R).

A. C. Davison and D. V. Hinkley, *Bootstrap
Methods and their Application*. Cambridge: Cambridge University Press, 1997.
A comprehensive introduction to bootstrap resampling, associated with the boot
package (for S-PLUS and R, written by A. J. Canty). Somewhat more difficult
than Efron and Tibshirani.

B. Efron and R. J. Tibshirani, *An
Introduction to the Bootstrap*. London: Chapman and Hall, 1993. Another extensive
treatment of bootstrapping by its originator (Efron), also accompanied by an
S package, bootstrap (for both
S-PLUS and R, but somewhat less usable than boot).

F. E. Harrell, Jr., *Regression
Modeling Strategies, With Applications to Linear Models, Logistic Regression,
and Survival Analysis*. New York: Springer, 2001. Describes an interesting
approach to statistical modeling, with frequent references to Harrell's Hmisc
and Design packages for S-PLUS
and R.

T. J. Hastie and R. J. Tibshirani,
*Generalized Additive Models*. London: Chapman and Hall, 1990. An accessible
treatment of generalized additive models, as implemented in the gam
function in S-PLUS and in the gam
package in R, and of nonparametric regression analysis in general. [The gam
function in the mgcv package
in R takes a somewhat different approach; see Wood (2000), below.]

C. Loader, *Local Likelihood and
Regression*. New York: Springer, 1999. Another text on nonparametric regression
and density estimation, using the locfit package (in S-PLUS and R). Although
the text is less readable than Bowman and Azzalini, the locfit
software is very capable.

J. C. Pinheiro and D. M. Bates,
*Mixed-Effects Models in S and S-PLUS*. New York: Springer, 2000. An extensive
treatment of linear and nonlinear mixed-effects models in S, focused on the
authors' nlme package, which
is available for both S-PLUS and R. Mixed models are appropriate for various
kinds of non-independent (clustered) data, including hierarchical and longitudinal
data.

J. L. Schafer, *Analysis of Incomplete
Multivariate Data*. London: Chapman and Hall, 1997. This text presents a
broadly applicable Bayesian treatment of missing-data problems, including methods
for multiple imputation. The most extensive implementation of the methods in
the book is in the missing package
in S-PLUS version 6. Schafer's norm,
cat, mix,
and pan packages are available
for earlier versions of S-PLUS and for R.

T. M. Therneau and P. M. Grambsch,
*Modeling Survival Data: Extending the Cox Model*. New York: Springer,
2000. An overview of both basic and advanced methods of survival analysis (event-history
analysis), with reference to S and SAS software. There are both S-PLUS and R
versions of Therneau's state-of-the-art survival
package.

W. N. Venables and B. D. Ripley,
*Modern Applied Statistics with S, Fourth Edition*. New York: Springer,
2002. An influential and wide-ranging treatment of data analysis using S. Many
of the facilities described in the book are programmed in the associated (and
indispensable) MASS, nnet,
and spatial packages, available
both for S-PLUS and R. This text is more advanced and has a broader focus than
my *R and S-PLUS Companion*.

S. N. Wood, Modelling and smoothing
parameter estimation with multiple quadratic penalties. *Journal of the Royal
Statistical Society, Series B*, 62: 413-428, 2000. Describes the mgcv
package in R, which contains a gam
function for fitting generalized additive models. The initials "mgcv" stand
for multiple generalized cross validation, the method by which Wood selects
GAM smoothing parameters. The description of the software in the paper is slightly
dated; consult the package documentation for up-to-date information, including
additional references.

Other Sources (Some Free)

See the R web site for a list of publications.

Last Modified: 18 February 2005 by J. Fox <jfox AT mcmaster.ca>