Preface

Linear models, their variants, and extensions are among the most useful
and widely used statistical tools for social research. This book aims to
provide an accessible, in-depth, modern treatment of regression analysis,
linear models, and closely related methods.

The book should be of interest to students and researchers in the social
sciences. Although the specific choice of methods and examples reflects
this readership, I expect that the book will prove useful in other disciplines
that employ linear models for data analysis, and in courses on applied
regression and linear models where the subject-matter of applications is
not of special concern.

This book is a revision of my 1984 text Linear Statistical Models and
Related Methods. In revising the book, I have freely incorporated material
from my 1991 monograph on Regression Diagnostics.

The new title of the text reflects a change in organization and emphasis:
I have thoroughly reworked the book, removing some topics and adding a
variety of new material. But even more fundamentally, the book has been
extensively rewritten. It is a new and different book.

I have endeavored in particular to make the text as accessible as possible.
With the exception of three chapters, several sections, and a few shorter
passages, the prerequisite for reading the book is a course in basic applied
statistics that covers the elements of statistical data analysis and inference.

Many topics (e.g., logistic regression in Chapter 15) are introduced with an example that motivates the statistics, or (as in the case of bootstrapping, in Chapter 16) by appealing to familiar material. The treatment of regression analysis starts (in Chapter 2) with an elementary discussion of nonparametric regression, developing the notion of regression as a conditional average --- in the absence of restrictive assumptions about the nature of the relationship between the dependent and independent variables. This approach begins closer to the data than the traditional starting point of linear least-squares regression, and should make readers skeptical about glib assumptions of normality, constant variance, and so on.
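The idea of regression as a conditional average can be sketched in a few lines of code. This is a hypothetical illustration with made-up data, not an example from the book: for each distinct value of the independent variable, the 'regression' of y on x is simply the average of y at that value.

```python
# Regression as a conditional average (hypothetical data):
# group the y-values by the value of x, then average within groups.
from collections import defaultdict
from statistics import mean

x = [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
y = [2.0, 2.4, 3.1, 2.9, 3.3, 4.2, 3.8, 5.1, 4.9, 5.0]

groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)

# The conditional mean of y at each observed value of x.
conditional_means = {xi: mean(ys) for xi, ys in sorted(groups.items())}
print(conditional_means)
```

With many values of the independent variable, one would average within bins or a moving window rather than at exact values, which is the step toward the nonparametric regression methods the book develops.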

More difficult chapters and sections are marked with asterisks. These
parts of the text can be omitted without loss of continuity, but they provide
greater understanding and depth, along with coverage of some topics that
depend upon more extensive mathematical or statistical background. I do
not, however, wish to exaggerate the background that is required for this
'more difficult' material: All that is necessary is some exposure to matrices,
elementary linear algebra, and elementary differential calculus. Appendices
to the text provide additional background material.

All chapters end with summaries, and most include recommendations for
additional reading. Summary points are also presented in boxes interspersed
with the text.

The first part of the book consists of preliminary material:

- Chapter 1 discusses the role of statistical data analysis in social science, expressing the point of view that statistical models are essentially descriptive, not direct (if abstract) representations of social processes. This perspective provides the foundation for the data-analytic focus of the text.
- Chapter 2 introduces the notion of regression analysis as tracing the conditional distribution of a 'dependent' variable as a function of one or several 'independent' variables. This idea is initially explored 'nonparametrically,' in the absence of a restrictive statistical model for the data.
- Chapter 3 describes a variety of graphical tools for examining data. These methods are useful both as a preliminary to statistical modeling and to assist in the diagnostic checking of a model that has been fit to data.
- Chapter 4 discusses variable transformation as a solution to several
sorts of problems commonly encountered in data analysis, including skewness,
nonlinearity, and non-constant spread.

The second part, on linear models fit by the method of least squares,
comprises the heart of the book:

- Chapter 5 discusses linear least-squares regression. Linear regression is the prototypical linear model, and its direct extension is the subject of Chapters 7--10.
- Chapter 6, on statistical inference in regression, develops tools for testing hypotheses and constructing confidence intervals that apply generally to linear models. This chapter also introduces the basic methodological distinction between empirical and structural relationships --- a distinction central to understanding causal inference in non-experimental research.
- Chapter 7 shows how 'dummy variables' can be employed to extend the regression model to qualitative independent variables. Interactions among independent variables are introduced in this context.
- Chapter 8, on analysis-of-variance models, deals with linear models in which all of the independent variables are qualitative.
- Chapter 9* develops the statistical theory of linear models, providing the foundation for much of the material in Chapters 5--8 along with some additional results.
- Chapter 10* applies vector geometry to linear models, allowing us literally
to visualize the structure and properties of these models. Many topics
are revisited from the geometric perspective, and central concepts ---
such as 'degrees of freedom' --- are given a natural and compelling interpretation.

The third part of the book describes 'diagnostic' methods for discovering whether a linear model fit to data adequately represents the data. Methods are also presented for correcting problems that are revealed:

- Chapter 11 deals with the detection of unusual and influential data in linear models.
- Chapter 12 describes methods for diagnosing a variety of problems, including non-normally distributed errors, non-constant error variance, and nonlinearity. Some more advanced material in this chapter discusses how the method of maximum likelihood can be employed for selecting transformations.
- Chapter 13 takes up the problem of collinearity --- the difficulties
for estimation that ensue when the independent variables are highly correlated.

The fourth part of the book discusses important extensions of linear
least squares. In selecting topics, I was guided by the proximity of the
methods to the general linear model and by the promise that these methods
hold for data analysis in the social sciences. The methods described in
this part of the text are (with the exception of logistic regression in
Chapter 15) given introductory --- rather than extensive --- treatments.
My aim in introducing these relatively advanced topics is (i) to provide
enough information so that readers can begin to use these methods in their
research; and (ii) to provide sufficient background to support further
work in these areas should readers choose to pursue them:

- Chapter 14* describes several important direct extensions of linear least squares: Time-series regression (and generalized least squares), where the observations are ordered in time, and the errors in the linear model are consequently not assumed to be independent; nonlinear models fit by least squares; robust estimation of linear models (i.e., using methods of estimation more resistant than least squares to unusual data); and nonparametric regression, in which the functional form of the relationship between the dependent and independent variables is not specified in advance.
- Chapter 15 takes up the centrally important topic of linear-like models for qualitative and ordinal dependent variables, most notably, logit models (logistic regression). The chapter concludes with a brief introduction to generalized linear models, a grand synthesis encompassing linear least-squares regression, logistic regression, and a variety of related statistical models.
- Chapter 16 discusses two broadly applicable techniques for assessing
sampling variation: The 'bootstrap' and cross-validation. The bootstrap
is a computationally intensive simulation method for constructing confidence
intervals and hypothesis tests. The bootstrap does not make strong distributional
assumptions about the data, and can be made to reflect the manner in which
the data were collected (e.g., in complex survey-sampling designs). Cross-validation
is a simple method for drawing honest statistical inferences
when --- as is commonly the case --- the data are employed both to select
a statistical model and to estimate its parameters.
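The resampling idea behind the bootstrap can be sketched briefly. This is an illustrative example with hypothetical data, not code from the book: the data are resampled with replacement many times, the statistic of interest is recomputed for each resample, and the percentiles of the resampled statistics serve as a confidence interval.

```python
# Sketch of the nonparametric bootstrap (hypothetical data):
# resample with replacement, recompute the statistic, and use
# percentiles of the bootstrap distribution as interval endpoints.
import random
from statistics import mean

random.seed(0)  # for a reproducible illustration
data = [4.1, 5.3, 2.8, 6.0, 4.7, 5.5, 3.9, 4.4, 5.1, 4.8]

boot_means = []
for _ in range(2000):
    resample = [random.choice(data) for _ in data]
    boot_means.append(mean(resample))

boot_means.sort()
# 2.5th and 97.5th percentiles give a rough 95% interval.
lo, hi = boot_means[49], boot_means[1949]
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```

Note that no normality assumption is made: the interval comes entirely from the empirical distribution of the resampled means.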

Several appendices provide background, principally --- but not exclusively
--- for the starred portions of the text:

- Appendix A describes the notational conventions employed in the text.

- Appendix B* shows how vector geometry can be used to visualize key
concepts of linear algebra. The material in this appendix is required for
Chapter 10 and presupposes a basic knowledge of matrices and linear algebra.

- Appendix C* assumes an acquaintance with elementary differential calculus
of one independent variable and shows how, employing matrices, differential
calculus can be extended to several independent variables. This material
is required for some starred portions of the text, for example, the derivation
of least-squares and maximum-likelihood estimators.

- Appendix D provides an introduction to the elements of probability theory and to basic concepts of statistical estimation and inference. A few, more demanding, parts of the appendix are starred. The background developed in this appendix is required for some of the material on statistical inference in the text.

Nearly all of the examples in this text employ real data from the social
sciences, many of them previously analyzed and published. The exercises
that involve data analysis also almost all use real data drawn from various
areas of application. Most of the datasets are relatively small. I encourage
readers to analyze their own data as well.

The datasets can be downloaded free of charge via the World Wide Web;
point your web browser at http://davinci.socsci.mcmaster.ca/applied-regression.html.
If you do not have access to the internet, then you can write to me at
the Department of Sociology, McMaster University, Hamilton, Ontario, Canada,
L8S 4L8, for information about obtaining the datasets on disk. Each dataset
is associated with two files: The file with extension `.cbk` (e.g.,
`duncan.cbk`) contains a 'codebook' describing the data; the file
with extension `.dat` (e.g., `duncan.dat`) contains the data
themselves. Smaller datasets are also presented in tabular form in the
text.
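Reading one of the `.dat` files is straightforward if, as this sketch assumes, the data are stored as whitespace-separated columns; the `.cbk` codebook documents the actual layout of each dataset, and the sample values below are hypothetical.

```python
# Parse the contents of a .dat file, assuming whitespace-separated
# columns (a simplification; consult the .cbk codebook for the
# actual layout of each dataset).
def parse_dat(text):
    """Split whitespace-delimited lines into a list of rows."""
    return [line.split() for line in text.splitlines() if line.strip()]

# A hypothetical two-line excerpt in the style of duncan.dat:
sample = "accountant 62 86 82\npilot 72 76 83\n"
rows = parse_dat(sample)
print(rows)
```

In practice one would pass the contents of the downloaded file (e.g., `open("duncan.dat").read()`) to the same function.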

I occasionally comment in passing on computational matters, but the
book generally ignores the finer points of statistical computing in favor
of methods that are computationally simple. I feel that this approach facilitates
learning. Once basic techniques are absorbed, an experienced data analyst
has recourse to carefully designed programs for statistical computations.

I think that it is a mistake to tie a general discussion of linear and
related statistical models too closely to particular software. Although
the marvelous proliferation of statistical software has routinized the
computations for most of the methods described in this book, the workings
of standard computer programs are not sufficiently accessible to promote
learning. I consequently find it desirable, where time permits, to teach
the use of a statistical computing environment as part of a course on applied
regression and linear models.

For nearly 20 years, I used the interactive programming language APL in this role. More recently, I use Lisp-Stat. I particularly recommend the R-code program, written in Lisp-Stat, which accompanies Cook and Weisberg's (1994) fine book on regression graphics. Other programmable computing environments that are used for statistical data analysis include S, Gauss, Stata, Mathematica, and the interactive matrix language (IML) in SAS. Descriptions of these environments appear in a book edited by Stine and Fox (1996).

I have used the material in this book for two types of courses (along
with a variety of short courses and lectures):

I cover the unstarred sections of Chapters 1--8, 11--13, 15, and 16
in a one-semester (13-week) course for social-science graduate students
(at McMaster University in Hamilton, Ontario) who have had (at least) a
one-semester introduction to statistics at the level of Moore (1995). The
outline of this course is as follows:

- Introduction to the course and to MS/Windows; Ch. 1.
- Introduction to regression, Lisp-Stat, and R-code; Ch. 2, App. A, D.
- Examining and transforming data; Ch. 3, 4.
- Linear least-squares regression; Ch. 5.
- Statistical inference for regression; Ch. 6.
- Dummy-variable regression; Ch. 7.
- Analysis of variance; Ch. 8.
- Diagnostics I: Unusual and influential data; Ch. 11.
- Diagnostics II: Nonlinearity and other ills; Ch. 12.
- Diagnostics III: Collinearity and variable selection; Ch. 13.
- Logit and probit models I; Ch. 15.
- Logit and probit models II; Ch. 15 (cont.).
- Assessing sampling variation and review; Ch. 16.

The readings from this text are supplemented with parts of Cook and
Weisberg's (1994) book on regression graphics, a paper by Tierney (1995)
on Lisp-Stat, and several handouts on computing. Students complete required
weekly homework assignments, which are mostly focused on data analysis.
Homework is collected and corrected, but not graded. I distribute answers
when the homework is collected, and take it up in class after it is corrected
and returned. There are mid-term and final take-home exams, also focused
on data analysis.

I used the material in Chapters 1--13 and 15, along with the appendices and basic introductions to matrices, linear algebra, and calculus, for a two-semester course for social-science graduate students (at York University in Toronto) with similar statistical preparation. For this second, more intensive, course, background topics (such as linear algebra) were introduced as required, and constituted about one-fifth of the course. The organization of the course was similar to the first one.

Both courses include some treatment of statistical computing, with more information on programming in the second course. For students with the requisite mathematical and statistical background, it should be possible to cover the whole text in a reasonably paced two-semester course.

In learning statistics, it is important for the reader to participate actively, both by working through the arguments presented in the book, and --- even more importantly --- by applying methods to data. Statistical data analysis is a craft and, like any craft, it requires practice. Reworking examples is a good place to start, and I have presented illustrations in such a manner as to make re-analysis and further analysis possible. Where possible, I have relegated formal 'proofs' and derivations to exercises, which nevertheless typically provide some guidance to the reader. I believe that this type of material is best learned constructively.

As well, including too much algebraic detail in the body of the text invites readers to lose the statistical forest for the mathematical trees. You can decide for yourself (or your students) whether or not to work the theoretical exercises. It is my experience that some people feel that the process of working through derivations cements their understanding of the statistical material, while others find this activity tedious and pointless. Some of the theoretical exercises, marked with asterisks, are comparatively difficult. (Difficulty is assessed relative to the material in the text, so the threshold is higher in starred sections and chapters.)

In preparing the data-analytic exercises, I have tried to find datasets
of some intrinsic interest that embody a variety of characteristics. You
can safely assume, for example, that datasets for exercises in Chapter
11 include unusual data. In many instances, I try to supply some direction
in the data-analytic exercises, but --- like all real data analysis ---
these exercises are fundamentally open-ended. It is therefore important
for instructors to set aside time to discuss data-analytic exercises in
class, both before and after students tackle them. Although students often
miss important features of the data in their initial analyses, this experience
--- properly approached and integrated --- is an unavoidable part of learning
the craft of data analysis.

A few exercises, marked with pound signs (#), are meant for 'hand' computation. Hand computation (i.e., with a calculator) is tedious, and is practical only for unrealistically small problems, but it sometimes serves to make statistical procedures more concrete. Similarly, despite the emphasis in the text on analyzing real data, a small number of exercises generate simulated data to clarify certain properties of statistical methods.

Finally, a word about style: I try to use the first person singular
--- 'I' --- when I express opinions. 'We' is reserved for you --- the
reader --- and me.

Many individuals have helped me in the preparation of this book.

I am grateful to Georges Monette of York University, to Bob Stine of
the University of Pennsylvania, and to two anonymous reviewers, for insightful
comments and suggestions.

Mike Friendly, of York University, provided detailed comments, corrections,
and suggestions on almost all of the text.

A number of friends and colleagues donated their data for illustrations and exercises --- implicitly subjecting their research to scrutiny and criticism.

Several individuals contributed to this book indirectly by helpful comments on its predecessor (Fox, 1984), both before and after publication: Ken Bollen, Gene Denzel, Shirley Dowdy, Paul Herzberg, and Doug Rivers.

C. Deborah Laughton, my editor at Sage Publications, has been patient
and supportive throughout the several years that I have worked on this
project.

I am also in debt to the students at York University and McMaster University,
and to participants at the Inter-University Consortium for Political and
Social Research Summer Program, all of whom were exposed to various versions
and portions of this text and who have improved it through their criticism,
suggestions, and --- occasionally --- informative incomprehension.

Finally, I am grateful to York University for providing me with a sabbatical-leave
research grant during the 1994-95 academic year, when much of the text
was drafted.

If, after all of this help, deficiencies remain, then I alone am at
fault.

John Fox

Toronto, Canada

Last Modified: 22 January 1997 by John Fox, jfox@mcmaster.ca.