|
Purpose
This procedure performs simple and multiple regression using least squares.
Montgomery (1982) outlines the following four purposes for running a regression analysis:
Description -
The analyst is seeking to find an equation that describes or summarizes the relationship between two variables. This purpose makes the fewest assumptions.
Coefficient Estimation -
This is a popular reason for doing regression analysis. The analyst may have a theoretical relationship in mind, and the regression analysis is used to confirm or refute it. Most likely, there is specific interest in the magnitudes and signs of the coefficients. Frequently, this purpose for regression overlaps with others.
Prediction -
The prime concern here is to predict the response variable, such as sales, delivery time, efficiency, occupancy rate in a hospital, reaction yield in some chemical process, or strength of some metal. These predictions may be crucial in planning, monitoring, or evaluating some process or system. There are many assumptions and qualifications that must be made in this case. For instance, you must not extrapolate beyond the range of the data. Also, interval estimates require that the normality assumptions hold.
Control -
Regression models may be used for monitoring and controlling a system. For example, you might want to calibrate a measurement system or keep a response variable within certain guidelines. When a regression model is used for control purposes, the independent variable must be related to the dependent variable in a causal way. Furthermore, this functional relationship must continue over time. If it does not, continual modification of the model must occur.
Preparations
Run the Statistics→Regression→Linear Regression... command. Select the dependent variable and the predictors (independent variables).
Assumptions
First of all, as is evident in the name multiple linear regression, it is assumed that the relationship between variables is linear. In practice this assumption can virtually never be confirmed. Fortunately, multiple regression procedures are not greatly affected by minor deviations from this assumption.
It is also assumed in multiple regression that the residuals (observed minus predicted values) are normally distributed.
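The normality assumption can be inspected outside this procedure with a rough numerical check. The following is a minimal sketch, assuming NumPy is available; the function name `residual_diagnostics` is hypothetical and is not part of this procedure. It takes residuals as observed minus fitted values, the convention used in the Results definitions, and reports their sample skewness and excess kurtosis. Values near zero are consistent with normality but do not prove it.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Fit y on X by least squares and summarize the shape of the residuals.

    Hypothetical helper for illustration only.
    Returns (residuals, skewness, excess_kurtosis).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                      # single predictor -> one column
    y = np.asarray(y, dtype=float)

    # Design matrix with an intercept column prepended.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    residuals = y - design @ coef           # observed minus fitted

    # Moment-based shape measures; both are 0 for an exact normal distribution.
    centered = residuals - residuals.mean()
    sd = np.sqrt(np.mean(centered ** 2))    # sd is 0 if the fit is perfect
    skewness = np.mean(centered ** 3) / sd ** 3
    excess_kurtosis = np.mean(centered ** 4) / sd ** 4 - 3.0
    return residuals, skewness, excess_kurtosis

residuals, skew, kurt = residual_diagnostics([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(residuals, skew, kurt)
```

With small samples these measures are noisy, so a plot of the residuals against the fitted values is usually a useful complement.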
Results
|
R² (R-Square) |
Coefficient of determination; indicates how much variation in the response is explained
by the model. The higher the R², the better the model fits your data. |
|
Adjusted R-Square |
Accounts for the number of predictors in your model and is useful for comparing
models with different numbers of predictors. The formula is:
$$R^2_{\text{adj}} = 1 - \frac{\text{MS Error}}{\text{SS Total} / \text{DF Total}}$$
|
|
Sum of squares (SS)
|
The sum of squared distances. SS Total is the total variation in the data. SS Regression
is the portion of the variation explained by the model, while SS Error is the portion
not explained by the model and is attributed to error.
|
| Degrees of freedom (d.f.)
|
Indicates the number of independent pieces of information involving the response
data needed to calculate the sum of squares. The degrees of freedom for each component
of the model are:
DF Regression = p
DF Error = n - p - 1
DF Total = n - 1
where n = number of observations and p = number of predictors.
|
|
MS Regression |
Mean square regression. The formula is:
$$\text{MS Regression} = \frac{\text{SS Regression}}{\text{DF Regression}}$$
|
|
MS Error
|
Mean square error, which is the variance around the fitted regression line. MS Error = s². The formula is:
$$\text{MS Error} = \frac{\text{SS Error}}{\text{DF Error}}$$
|
|
F |
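If the calculated F-value is greater than the critical F-value from the F-distribution (with DF Regression and DF Error degrees of freedom, at the chosen significance level), reject the hypothesis that all coefficients are zero and conclude that at least one coefficient is not equal to zero. The F-value is used to determine the p-value. The formula for the calculated F-value is:
$$F = \frac{\text{MS Regression}}{\text{MS Error}}$$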
|
|
Residuals
|
The difference between the observed values and predicted values.
|
|
Variance inflation factor (VIF)
|
Used to detect multicollinearity (correlated predictors). VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated.
|
|
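The quantities in the table above can be reproduced by hand for a small data set. The following is a minimal sketch, assuming NumPy is available; the function names `regression_summary` and `vif` are hypothetical and are not part of this procedure. It uses the formulas from the table, plus two standard definitions the table does not state explicitly: R² = SS Regression / SS Total, and VIF for predictor j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors.

```python
import numpy as np

def regression_summary(X, y):
    """Least-squares fit with an intercept, returning the table quantities.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                      # single predictor -> one column
    y = np.asarray(y, dtype=float)
    n, p = X.shape                          # n observations, p predictors

    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef           # observed minus predicted

    ss_total = np.sum((y - y.mean()) ** 2)
    ss_error = np.sum(residuals ** 2)
    ss_regression = ss_total - ss_error

    df_regression, df_error, df_total = p, n - p - 1, n - 1
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error

    return {
        "coef": coef,                       # intercept first
        "residuals": residuals,
        "ss": (ss_regression, ss_error, ss_total),
        "df": (df_regression, df_error, df_total),
        "ms_regression": ms_regression,
        "ms_error": ms_error,
        "f": ms_regression / ms_error,
        "r2": ss_regression / ss_total,     # standard definition (assumption)
        "adj_r2": 1.0 - ms_error / (ss_total / df_total),
    }

def vif(X):
    """Variance inflation factor per predictor; needs two or more predictors."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)    # all predictors except column j
        r2_j = regression_summary(others, X[:, j])["r2"]
        factors.append(1.0 / (1.0 - r2_j))
    return factors

summary = regression_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(summary["r2"], summary["adj_r2"], summary["f"])  # approx. 0.6, 0.467, 4.5
```

For this five-point example, SS Total = 6 and SS Error = 2.4, so MS Error = 2.4 / 3 = 0.8 and the adjusted R² is 1 - 0.8 / (6 / 4), roughly 0.467. A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate stronger multicollinearity.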