Purpose
Correlation is a
measure of the relation between two or more variables. The measurement scales
used should be at least interval scales, but other correlation coefficients are
available to handle other types of data. Correlation coefficients can range from
-1.00 to +1.00. The value of -1.00 represents a perfect negative correlation
while a value of +1.00 represents a perfect positive correlation. A value of
0.00 represents a lack of correlation.
The most widely used type of correlation coefficient is
Pearson r, also called linear or product-moment correlation.
Preparations
Run Statistics→Basic Statistics and Tables→Linear
Correlation (Pearson)....
Results
One of the results is the matrix of correlation coefficients (r).
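As an illustration of what such a matrix contains, the following minimal sketch computes Pearson r for every pair of columns in plain Python. It is not the program's own implementation; the function names and the sample data (height_cm, weight_kg, shoe_size) are hypothetical and chosen only for the example.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return (names, matrix) where matrix[i][j] is r between columns i and j."""
    names = list(columns)
    matrix = [[pearson_r(columns[a], columns[b]) for b in names] for a in names]
    return names, matrix

# Hypothetical sample data, for illustration only.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 77, 90],
    "shoe_size": [37, 39, 41, 42, 45],
}

names, r = correlation_matrix(data)
for name, row in zip(names, r):
    print(name, [round(v, 3) for v in row])
```

The matrix is symmetric and has 1.00 on its diagonal, because every variable correlates perfectly with itself.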
The most widely-used type of correlation coefficient is Pearson r (Pearson,
1896), also called linear or product-moment correlation (the term correlation
was first used by Galton, 1888). Using nontechnical language, one can say that
the correlation coefficient determines the extent to which values of two
variables are "proportional" to each other. The value of the correlation (i.e.,
correlation coefficient) does not depend on the specific measurement units used;
for example, the correlation between height and weight will be identical
regardless of whether inches and pounds, or centimeters and kilograms are used
as measurement units. Proportional means linearly related; that is, the
correlation is high if the relationship can be approximated by a straight line
(sloped upwards or downwards). This line is called the regression line or least
squares line, because it is determined such that the sum of the squared
distances of all the data points from the line is the lowest possible. Pearson
correlation assumes that the two variables are measured on at least interval
scales.
H0 shows whether the null hypothesis (the hypothesis that there is no
relationship between the variables in the population) is accepted or rejected.
The null hypothesis is rejected if the test statistic is greater than or equal
to the critical value.
Critical Value - the critical value of the test statistic; it may be obtained
from Student's t distribution with N - 2 degrees of freedom.
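The decision rule can be sketched as follows. The sketch assumes the usual t statistic for a correlation coefficient, t = r·sqrt((N − 2)/(1 − r²)), and a two-sided test that compares the absolute value of t with the critical value; the program may compute or report the test differently. The values of r and N are hypothetical, and the critical value 2.048 is the tabulated two-sided 5% point of Student's t distribution with 28 degrees of freedom.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing H0 (no linear correlation), with n - 2 df."""
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    if abs(r) >= 1.0:
        # A perfect correlation gives an infinite test statistic.
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical example: r = 0.5 computed from N = 30 observations.
r, n = 0.5, 30
t = correlation_t_statistic(r, n)

# Tabulated critical value: two-sided test, alpha = 0.05, N - 2 = 28 df.
critical_value = 2.048

# Assumption: a two-sided test, so the absolute value of t is compared.
reject_h0 = abs(t) >= critical_value
print(f"t = {t:.3f}, critical value = {critical_value}, reject H0: {reject_h0}")
```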
How to Interpret the Value of Correlations. As mentioned before, the
correlation coefficient (r) represents the linear relationship between two
variables. If the correlation coefficient is squared, then the resulting value
(r², the coefficient of determination) will represent the proportion of common
variation in the two variables (i.e., the "strength" or "magnitude" of the
relationship). In order to evaluate the correlation between variables, it is
important to know this "magnitude" or "strength" as well as the significance of
the correlation.
Significance of Correlations. The significance level calculated for each
correlation is a primary source of information about the reliability of the
correlation. As
explained before (see Elementary concepts), the
significance of a correlation coefficient of a particular magnitude will change
depending on the size of the sample from which it was computed. The test of
significance is based on the assumption that the distribution of the residual
values (i.e., the deviations from the regression line) for the dependent
variable y follows the normal distribution, and that the variability of the
residual values is the same for all values of the independent variable x.
However, Monte Carlo studies suggest that meeting those assumptions closely is
not absolutely crucial if your sample size is not very small. It is impossible
to formulate precise recommendations based on those Monte Carlo results, but
many researchers follow a rule of thumb that if your sample size is 50 or more
then serious biases are unlikely, and if your sample size is over 100 then you
should not be concerned at all with the normality assumptions.
Outliers. Outliers are atypical (by definition), infrequent observations. Because of the
way in which the regression line is determined (especially the fact that it is
based on minimizing not the sum of simple distances but the sum of squares of
distances of data points from the line), outliers have a profound influence on
the slope of the regression line and consequently on the value of the
correlation coefficient. A single outlier is capable of considerably changing
the slope of the regression line and, consequently, the value of the
correlation. Note that, as the illustration shows, just one outlier can be
entirely responsible for a high value of the correlation that otherwise (without
the outlier) would be close to zero. Needless to say, one should never base
important conclusions on the value of the correlation coefficient alone (i.e.,
examining the respective scatterplot is always recommended).
Quantitative Approach to Outliers. Some researchers use quantitative
methods to exclude outliers. For example, they exclude observations that are
outside the range of ± 2 standard deviations (or even ± 1.5 sd's) around the
group or design cell mean. In some areas of research, such "cleaning" of the
data is absolutely necessary. For example, in cognitive psychology research on
reaction times, even if almost all scores in an experiment are in the range of
300-700 milliseconds, just a few "distracted reactions" of 10-15 seconds will
completely change the overall picture. Unfortunately, defining an outlier is
subjective (as it should be), and the decisions concerning how to identify them
must be made on an individual basis (taking into account specific experimental
paradigms and/or "accepted practice" and general research experience in the
respective area). It should also be noted that in some rare cases, the relative
frequency of outliers across a number of groups or cells of a design can be
subjected to analysis and provide interpretable results. For example, outliers
could be indicative of the occurrence of a phenomenon that is qualitatively
different than the typical pattern observed or expected in the sample, thus the
relative frequency of outliers could provide evidence of a relative frequency of
departure from the process or phenomenon that is typical for the majority of
cases in a group.
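A ± 2 standard deviation exclusion rule of the kind described above might be sketched like this. The function name, the cutoff parameter k, and the reaction times are hypothetical; whether such a rule is appropriate remains a decision for the individual study.

```python
import statistics

def exclude_outliers(values, k=2.0):
    """Keep only the values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - k * sd, mean + k * sd
    return [v for v in values if lower <= v <= upper]

# Hypothetical reaction times in milliseconds; the last one is a
# "distracted reaction" of 12 seconds.
reaction_times = [320, 410, 450, 480, 500, 520, 550, 610, 680, 12000]

kept = exclude_outliers(reaction_times, k=2.0)
print("mean before:", statistics.mean(reaction_times))
print("mean after: ", round(statistics.mean(kept), 1))
print("excluded:   ", [v for v in reaction_times if v not in kept])
```

Note that the mean and standard deviation used for the cutoff are themselves inflated by the outliers, so in small groups or with several extreme values such a rule can fail to flag them.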
Nonlinear Relations between Variables. Another potential source of
problems with the linear (Pearson r) correlation is the shape of the relation.
As mentioned before, Pearson r measures a relation between two variables only to
the extent to which it is linear; deviations from linearity will increase the
total sum of squared distances from the regression line even if they represent a
"true" and very close relationship between two variables. The possibility of
such non-linear relationships is another reason why examining scatterplots is a
necessary step in evaluating every correlation.