John Fox, Applied Regression Analysis, Linear Models, and Related Methods (Sage, 1997)

Contents of the preface:

Linear models, their variants, and extensions are among the most useful and widely used statistical tools for social research. This book aims to provide an accessible, in-depth, modern treatment of regression analysis, linear models, and closely related methods.

The book should be of interest to students and researchers in the social sciences. Although the specific choice of methods and examples reflects this readership, I expect that the book will prove useful in other disciplines that employ linear models for data analysis, and in courses on applied regression and linear models where the subject-matter of applications is not of special concern.

This book is a revision of my 1984 text Linear Statistical Models and Related Methods. In revising the book, I have freely incorporated material from my 1991 monograph on Regression Diagnostics.

The new title of the text reflects a change in organization and emphasis: I have thoroughly reworked the book, removing some topics and adding a variety of new material. But even more fundamentally, the book has been extensively rewritten. It is a new and different book.

I have endeavored in particular to make the text as accessible as possible. With the exception of three chapters, several sections, and a few shorter passages, the prerequisite for reading the book is a course in basic applied statistics that covers the elements of statistical data analysis and inference.

Many topics (e.g., logistic regression in Chapter 15) are introduced with an example that motivates the statistics, or (as in the case of bootstrapping, in Chapter 16) by appealing to familiar material. The treatment of regression analysis starts (in Chapter 2) with an elementary discussion of nonparametric regression, developing the notion of regression as a conditional average --- in the absence of restrictive assumptions about the nature of the relationship between the dependent and independent variables. This approach begins closer to the data than the traditional starting point of linear least-squares regression, and should make readers sceptical about glib assumptions of normality, constant variance, and so on.
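The notion of regression as a conditional average can be sketched with a toy calculation: average the response within narrow bins of the predictor, with no assumption about the form of the relationship. The binning scheme, data, and function name below are invented for illustration, not taken from the book:

```python
# Nonparametric regression as a conditional average: estimate E(Y | X)
# by averaging y within fixed-width bins of x. Data and bin width are
# invented for illustration.
def binned_means(x, y, width):
    """Return {bin_start: mean of y for points whose x falls in that bin}."""
    bins = {}
    for xi, yi in zip(x, y):
        b = (xi // width) * width      # left edge of the bin containing xi
        bins.setdefault(b, []).append(yi)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

x = [1, 2, 3, 11, 12, 13, 21, 22, 23]
y = [2, 4, 3, 10, 12, 11, 20, 22, 21]
print(binned_means(x, y, width=10))   # -> {0: 3.0, 10: 11.0, 20: 21.0}
```

Each bin mean is an estimate of the average of Y at that region of X; no linearity, normality, or constant-variance assumption is involved.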

More difficult chapters and sections are marked with asterisks. These parts of the text can be omitted without loss of continuity, but they provide greater understanding and depth, along with coverage of some topics that depend upon more extensive mathematical or statistical background. I do not, however, wish to exaggerate the background that is required for this 'more difficult' material: All that is necessary is some exposure to matrices, elementary linear algebra, and elementary differential calculus. Appendices to the text provide additional background material.

All chapters end with summaries, and most include recommendations for additional reading. Summary points are also presented in boxes interspersed with the text.


Part I

The first part of the book consists of preliminary material:

Part II

The second part, on linear models fit by the method of least-squares, comprises the heart of the book:

Part III

The third part of the book describes 'diagnostic' methods for discovering whether a linear model fit to data adequately represents the data. Methods are also presented for correcting problems that are revealed:

Part IV

The fourth part of the book discusses important extensions of linear least squares. In selecting topics, I was guided by the proximity of the methods to the general linear model and by the promise that these methods hold for data analysis in the social sciences. The methods described in this part of the text are (with the exception of logistic regression in Chapter 15) given introductory --- rather than extensive --- treatments. My aim in introducing these relatively advanced topics is (i) to provide enough information so that readers can begin to use these methods in their research; and (ii) to provide sufficient background to support further work in these areas should readers choose to pursue them:


Several appendices provide background, principally --- but not exclusively --- for the starred portions of the text:


Nearly all of the examples in this text employ real data from the social sciences, many of them previously analyzed and published. The exercises that involve data analysis also almost all use real data drawn from various areas of application. Most of the datasets are relatively small. I encourage readers to analyze their own data as well.

The datasets can be downloaded free of charge via the World Wide Web. If you do not have access to the internet, you can write to me at the Department of Sociology, McMaster University, Hamilton, Ontario, Canada, L8S 4L8, for information about obtaining the datasets on disk. Each dataset is associated with two files: the file with extension .cbk (e.g., duncan.cbk) contains a 'codebook' describing the data; the file with extension .dat (e.g., duncan.dat) contains the data themselves. Smaller datasets are also presented in tabular form in the text.
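Reading such a pair of files is straightforward in most computing environments. Here is a minimal sketch in Python, assuming the .dat file is whitespace-delimited with one observation per row; the file name, column layout, and values below are hypothetical, not the actual contents of duncan.dat:

```python
# Minimal sketch: write, then read, a small whitespace-delimited .dat file.
# The file name, columns, and values here are hypothetical.
sample = """accountant 62 86 82
pilot 72 76 83
architect 75 92 90
"""

with open("example.dat", "w") as f:
    f.write(sample)

rows = []
with open("example.dat") as f:
    for line in f:
        # One observation per line; fields separated by whitespace.
        name, income, education, prestige = line.split()
        rows.append((name, int(income), int(education), int(prestige)))

print(rows[0])  # -> ('accountant', 62, 86, 82)
```

The accompanying .cbk codebook file would document what each column means; consult it before assigning column names.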

I occasionally comment in passing on computational matters, but the book generally ignores the finer points of statistical computing in favor of methods that are computationally simple. I feel that this approach facilitates learning. Once basic techniques are absorbed, an experienced data analyst has recourse to carefully designed programs for statistical computations.

I think that it is a mistake to tie a general discussion of linear and related statistical models too closely to particular software. Although the marvelous proliferation of statistical software has routinized the computations for most of the methods described in this book, the workings of standard computer programs are not sufficiently accessible to promote learning. I consequently find it desirable, where time permits, to teach the use of a statistical computing environment as part of a course on applied regression and linear models.

For nearly 20 years, I used the interactive programming language APL in this role. More recently, I have used Lisp-Stat. I particularly recommend the R-code program, written in Lisp-Stat, which accompanies Cook and Weisberg's (1994) fine book on regression graphics. Other programmable computing environments that are used for statistical data analysis include S, Gauss, Stata, Mathematica, and the interactive matrix language (IML) in SAS. Descriptions of these environments appear in a book edited by Stine and Fox (1996).

To Readers, Students, and Instructors

I have used the material in this book for two types of courses (along with a variety of short courses and lectures):

I cover the unstarred sections of Chapters 1--8, 11--13, 15, and 16 in a one-semester (13-week) course for social-science graduate students (at McMaster University in Hamilton, Ontario) who have had (at least) a one-semester introduction to statistics at the level of Moore (1995). The outline of this course is as follows:

Week: topic; reading.

  1. Introduction to the course and to MS/Windows; Ch. 1.
  2. Introduction to regression, Lisp-Stat, and R-code; Ch. 2, App. A, D.
  3. Examining and transforming data; Ch. 3, 4.
  4. Linear least-squares regression; Ch. 5.
  5. Statistical inference for regression; Ch. 6.
  6. Dummy-variable regression; Ch. 7.
  7. Analysis of variance; Ch. 8.
  8. Diagnostics I: Unusual and influential data; Ch. 11.
  9. Diagnostics II: Nonlinearity and other ills; Ch. 12.
  10. Diagnostics III: Collinearity and variable selection; Ch. 13. Logit and probit models I; Ch. 15.
  11. Logit and probit models II; Ch. 15 (cont.).
  12. Assessing sampling variation and review; Ch. 16.

The readings from this text are supplemented with parts of Cook and Weisberg's (1994) book on regression graphics, a paper by Tierney (1995) on Lisp-Stat, and several handouts on computing. Students complete required weekly homework assignments, which are mostly focused on data analysis. Homework is collected and corrected, but not graded. I distribute answers when the homework is collected, and take it up in class after it is corrected and returned. There are mid-term and final take-home exams, also focused on data analysis.

I used the material in Chapters 1--13 and 15, along with the appendices and basic introductions to matrices, linear algebra, and calculus, for a two-semester course for social-science graduate students (at York University in Toronto) with similar statistical preparation. For this second, more intensive, course, background topics (such as linear algebra) were introduced as required, and constituted about one-fifth of the course. The organization of the course was similar to the first one.

Both courses include some treatment of statistical computing, with more information on programming in the second course. For students with the requisite mathematical and statistical background, it should be possible to cover the whole text in a reasonably paced two-semester course.

In learning statistics, it is important for the reader to participate actively, both by working through the arguments presented in the book, and --- even more importantly --- by applying methods to data. Statistical data analysis is a craft and, like any craft, it requires practice. Reworking examples is a good place to start, and I have presented illustrations in such a manner as to make re-analysis and further analysis possible. Where possible, I have relegated formal 'proofs' and derivations to exercises, which nevertheless typically provide some guidance to the reader. I believe that this type of material is best learned constructively.

As well, including too much algebraic detail in the body of the text invites readers to lose the statistical forest for the mathematical trees. You can decide for yourself (or your students) whether or not to work the theoretical exercises. It is my experience that some people feel that the process of working through derivations cements their understanding of the statistical material, while others find this activity tedious and pointless. Some of the theoretical exercises, marked with asterisks, are comparatively difficult. (Difficulty is assessed relative to the material in the text, so the threshold is higher in starred sections and chapters.)

In preparing the data-analytic exercises, I have tried to find datasets of some intrinsic interest that embody a variety of characteristics. You can safely assume, for example, that datasets for exercises in Chapter 11 include unusual data. In many instances, I try to supply some direction in the data-analytic exercises, but --- like all real data analysis --- these exercises are fundamentally open-ended. It is therefore important for instructors to set aside time to discuss data-analytic exercises in class, both before and after students tackle them. Although students often miss important features of the data in their initial analyses, this experience --- properly approached and integrated --- is an unavoidable part of learning the craft of data analysis.

A few exercises, marked with pound-signs (#), are meant for 'hand' computation. Hand computation (i.e., with a calculator) is tedious, and is practical only for unrealistically small problems, but it sometimes serves to make statistical procedures more concrete. Similarly, despite the emphasis in the text on analyzing real data, a small number of exercises generate simulated data to clarify certain properties of statistical methods.
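As an illustration of the kind of 'hand' computation such exercises call for, the simple-regression slope and intercept can be found from the familiar sum-of-products formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The sketch below traces that calculation step by step; the data are invented:

```python
# 'Hand'-style computation of a least-squares slope and intercept,
# following the usual sum-of-products formulas. Data are invented.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n                                   # mean of x: 3.0
y_bar = sum(y) / n                                   # mean of y: 4.0
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 6.0
sxx = sum((xi - x_bar) ** 2 for xi in x)             # 10.0

b = sxy / sxx                # slope: 0.6
a = y_bar - b * x_bar        # intercept: 2.2
print(b, a)
```

Each intermediate sum is small enough to check on a calculator, which is the point of such exercises.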

Finally, a word about style: I try to use the first person singular --- "I" --- when I express opinions. "We" is reserved for you --- the reader --- and me.


Many individuals have helped me in the preparation of this book.

I am grateful to Georges Monette of York University, to Bob Stine of the University of Pennsylvania, and to two anonymous reviewers, for insightful comments and suggestions.

Mike Friendly, of York University, provided detailed comments, corrections, and suggestions on almost all of the text.

A number of friends and colleagues donated their data for illustrations and exercises --- implicitly subjecting their research to scrutiny and criticism.

Several individuals contributed to this book indirectly by helpful comments on its predecessor (Fox, 1984), both before and after publication: Ken Bollen, Gene Denzel, Shirley Dowdy, Paul Herzberg, and Doug Rivers.

C. Deborah Laughton, my editor at Sage Publications, has been patient and supportive throughout the several years that I have worked on this project.

I am also indebted to the students at York University and McMaster University, and to participants in the Inter-University Consortium for Political and Social Research Summer Program, all of whom were exposed to various versions and portions of this text and who have improved it through their criticism, suggestions, and --- occasionally --- informative incomprehension.

Finally, I am grateful to York University for providing me with a sabbatical-leave research grant during the 1994-95 academic year, when much of the text was drafted.

If, after all of this help, deficiencies remain, then I alone am at fault.

John Fox

Toronto, Canada

Last Modified: 22 January 1997 by John Fox