Complete text of a book review that appeared in the Journal of the American Statistical Association, March 1998 (Vol. 93, No. 441, pp. 400--401).  

Applied Regression Analysis, Linear Models, and Related Methods. John Fox. Thousand Oaks, CA: Sage Publications, 1997. ISBN 0-8039-4540-X. xxi + 597 pp. $58.

Intended for social science students and researchers, this text gives a thorough treatment of applied regression and associated methodology, including classical linear models, nonparametric regression, logistic regression, and bootstrapping regression. Even though the book is written with social scientists as the target audience, the depth of material and how it is conveyed give it far broader appeal. Indeed, I recommend it as a useful learning text and resource for researchers and students in any field that applies regression or linear models (that is, most everyone), including courses for undergraduate statistics majors.

The expectations of the reader are, in the author's words, generally "a course in basic applied statistics that covers the elements of statistical data analysis" (p. xv). That is, readers should be familiar with introductory probability and statistical inference, on the level that is generally taught in United States universities in early undergraduate survey courses. Only in certain sections or chapters is knowledge of calculus-level mathematics assumed. The author does not shy away from mathematical formulation or analysis, however. He couples the mathematical development with lucid discussion so that it is within reach of less mathematically advanced readers. (Excellent appendixes fill in most details that readers might need.)

The book is divided sensibly into four broad parts: introductory material, linear models and least squares, diagnostics, and extensions of the linear model. Each part contains several chapters, which are themselves delineated into sections. Each chapter ends with a summary comprising salient points highlighted explicitly in the text and references to other books and important papers for the reader interested in further study. In addition, ample, interesting exercises of both a theoretical and application-oriented nature are provided. More difficult chapters, sections, and exercises are marked as such.

The initial chapters contain an essay on statistical thought and the relationship of statistics to social science research, a synopsis of the regression concept motivated by interesting real-world examples, some fundamentals of exploratory data analysis, and a chapter on data transformations. One of the finest features of the book is its consistent use of data from real applications. All of the datasets used in examples and exercises are available in electronic format via the World Wide Web; the author maintains, at his personal home page and through the publisher, a site (URL provided in the Preface) where all datasets may be retrieved. I retrieved the entire collection without difficulty and used its contents while reproducing examples and working through some exercises.

Part II, on linear models and least squares, is the heart of the book. Simple linear regression, introduced as a descriptive tool rather than as a model for inference, yields almost immediately to the more applicable multiple regression. Interesting examples provide the motivation for the discussion, which includes topics such as least squares fitting; simple, multiple, and partial correlation; and analysis of variance (ANOVA) from a regression viewpoint. Standard inference (point estimation, confidence intervals, and hypothesis tests) for the linear model is then presented for Gaussian errors.

A chapter on dummy variable regression introduces the reader to extensions of standard regression attainable by mathematical manipulations of regression models. The illustrations are particularly helpful in this chapter. Good examples and insightful discussion highlight the chapter on ANOVA and analysis of covariance (ANCOVA), presented in the regression context by extending the dummy variable development of the preceding chapter.

Part II finishes with two chapters containing more advanced material. The matrix formulation of the linear model and least squares estimates are given, and maximum likelihood estimation is introduced. Statistical inference for the matrix version of the model is presented, including discussion of joint confidence regions. Validity of the regression methodology when the regressors are random is addressed, as is model misspecification. The final chapter in Part II provides an excellent discussion on vector geometry for linear models, including multiple regression and ANOVA models.

Diagnostic methods are addressed in Part III, which comprises three chapters. The first of these addresses influential and outlying observations. I particularly liked the presentation of the leverage and influence concepts and the attention to both numerically and graphically relevant methods. A concluding section in this chapter addresses these issues in the matrix formulation.

Methods to detect nonlinearity, heteroscedasticity, and nonnormality comprise the middle chapter of Part III. Highlights include the introduction of weighted least squares, the partial residual plot, and the families of transformations (Box-Cox and Box-Tidwell). Not shying away from a more advanced topic, the author closes the chapter with a section on structural dimension.

Collinearity concludes Part III. This potentially confusing topic is clearly presented using well-constructed graphics and patient language. More advanced topics include principal components and generalized variance inflation. Various approaches to coping with collinearity, including variable selection, biased estimation, and ridge regression, are discussed and compared.

Part IV, the final section, contains extensions of the linear model, beginning with extensions to the "right side" of the model equation. The first extension entertained is to correlated errors, motivated by time series concepts. Polynomial regression is discussed as a nonlinear (in the independent variables) extension of the linear regression model. Nonlinear least squares is then introduced using the logistic population growth model as an example, and this topic is developed through to numerical estimation. These model extensions are followed by extensions in estimation, including two robust estimation methods (M estimation and bounded-influence estimation). Lowess smoothing and general additive regression methods close the chapter.

The second chapter of Part IV extends the "left side" of the linear model equation to include categorical responses via logit and probit models. The presentation here is particularly well done. Estimation and diagnostics for these models are included in more advanced sections. The connection between contingency tables and fully categorical models (in which both independent and dependent variables are categorical) is discussed. Finally, these models are extended to the generalized linear model in an advanced section.

The closing chapter introduces bootstrapping and cross-validation. The bootstrapping section includes a highly readable account of the basic concept of the bootstrap, which is then extended to the now-standard methods of confidence interval calculation (normal-theory, percentile, and BC intervals) and to bootstrap hypothesis testing. Cross-validation is presented with cautious discourse on the relationship of scientific integrity to inference.

The only glaring (and intentional) omission from this book is material on specifically how to carry out analyses on a computer. The author addresses this point in the Preface, where he comments that he has, when teaching the material from the text, introduced students to computing along with the statistical methodology. Omitting such computer implementations accomplishes at least two things. The text is not tied to any one computing language or package, making it automatically more general and timeless, and the flow of the presentation is uninterrupted by computational diversions. Nevertheless, this gives the reader (or in an organized course, the instructor) the nontrivial task of translating the paper-and-pencil analyses into code that software can understand and the analyst can utilize. Were some indication of computer implementation given, even marginally, in the text, working the problems in the text and attacking more real, personal applications might be made easier.

Other criticisms are minor. Although I appreciate the author's writing style and his consistently thorough and lucid explanations, on occasion (particularly early in the book) the prose overshadows the point. The author uses copious footnotes to elaborate points made in the text. (Of 30 randomly selected pages from the 571 total, 18 had at least one footnote.) While in general I find this distracting (I am unable to "skip" a footnote, for fear of missing something!), the author uses them to acknowledge his consideration of alternatives and explain his choices.

The four appendixes provide supplemental material on vector geometry, multivariable and matrix differential calculus, and probability and estimation. These add considerably to the book. To these I would add a section on linear and matrix algebra. This would bring several of the more advanced sections into closer reach for the less mathematically prepared reader and would make this excellent book basically self-contained.

The author is to be commended for giving us this book, which I trust will find a wide and enduring readership.


Centers for Disease Control and Prevention 

Last modified: 20 March 1998 by John Fox.