StatPlus 2007 Professional Help
Linear Correlation (Pearson)

Purpose

   Correlation is a measure of the relation between two or more variables. The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.
    The most widely-used type of correlation coefficient is Pearson r, also called linear or product-moment correlation.   
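
As a rough illustration of the range described above, the following Python sketch computes Pearson r directly from its textbook definition (sum of cross-products divided by the square root of the product of the sums of squares). The function name `pearson_r` and the sample data are made up for this example; this is not StatPlus code.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant variable (sxx or syy equal to 0) makes r undefined.
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2 * v + 1 for v in x])   # points on a rising line
r_negative = pearson_r(x, [10 - 3 * v for v in x])  # points on a falling line
print(r_positive, r_negative)
```

Points lying exactly on a rising line give r = +1, and points lying exactly on a falling line give r = -1.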

Preparations

    Run Statistics→Basic Statistics and Tables→Linear Correlation (Pearson)....

Results

    One of the results is the matrix of correlation coefficients (r).
    Pearson r (Pearson, 1896) is also called linear or product-moment correlation (the term correlation was first used by Galton, 1888). In nontechnical language, one can say that the correlation coefficient determines the extent to which values of two variables are "proportional" to each other. The value of the correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds, or centimeters and kilograms, are used as measurement units. Proportional means linearly related; that is, the correlation is high if the relationship between the variables can be approximated by a straight line (sloped upwards or downwards). This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Pearson correlation assumes that the two variables are measured on at least interval scales.
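
The claim that the correlation does not depend on measurement units can be checked with a small Python sketch. The height and weight values below are invented for illustration, and `pearson_r` is a hypothetical helper written for this example, not a StatPlus function.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sample: heights in inches, weights in pounds.
heights_in = [61, 64, 66, 68, 70, 72, 74]
weights_lb = [105, 120, 140, 150, 170, 165, 190]

# The same observations expressed in centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
print(r_imperial, r_metric)  # the two values agree up to rounding error
```

Rescaling a variable by a positive constant multiplies the cross-products and the square root of the sums of squares by the same factor, so the ratio is unchanged.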

H0 shows whether the null hypothesis (the hypothesis that there is no linear relationship between the variables in the population) is accepted or rejected. The null hypothesis is rejected if the test statistic is greater than or equal to the critical value.

Critical Value is the critical value of the test statistic. It may be obtained from Student's t distribution with N - 2 degrees of freedom.
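
The following Python sketch shows one common way to carry out such a test. It assumes the usual statistic t = r·sqrt(N - 2)/sqrt(1 - r²) and a two-tailed test at the 0.05 level; the help text does not state which level or tail StatPlus uses, so both are assumptions. The helper names are hypothetical, and the critical value is found numerically (Simpson's rule plus bisection) only to keep the example free of third-party libraries.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, integrating the density on [0, t] by Simpson's rule."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, alpha=0.05):
    """Two-tailed critical value (assumed alpha = 0.05), found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def correlation_t(r, n):
    """Test statistic for H0: no linear correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r, n = 0.70, 10                      # hypothetical sample result
t_stat = correlation_t(r, n)
critical = t_critical(n - 2)
reject_h0 = abs(t_stat) >= critical  # two-tailed decision rule
print(round(t_stat, 3), round(critical, 3), reject_h0)
```

The use of `math.gamma` limits this sketch to moderate sample sizes (a few hundred observations); a statistics library would normally be used instead.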


How to Interpret the Value of Correlations. As mentioned before, the correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, then the resulting value (r², the coefficient of determination) will represent the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as the significance of the correlation.
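
To make the meaning of r² more concrete, the sketch below (with invented data) fits the least squares line and checks that r² equals the share of the variation in y that the line accounts for, that is, 1 minus the ratio of the residual sum of squares to the total sum of squares. Variable names are illustrative only.

```python
import math

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)

# Least squares regression line y = intercept + slope * x.
slope = sxy / sxx
intercept = my - slope * mx

# Residual variation (around the line) versus total variation (around the mean).
ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
ss_tot = syy
explained = 1 - ss_res / ss_tot

print(r ** 2, explained)  # the two values agree up to rounding error
```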

Significance of Correlations. The significance level calculated for each correlation is a primary source of information about the reliability of the correlation. As explained before (see Elementary concepts), the significance of a correlation coefficient of a particular magnitude will change depending on the size of the sample from which it was computed. The test of significance is based on the assumption that the distribution of the residual values (i.e., the deviations from the regression line) for the dependent variable y follows the normal distribution, and that the variability of the residual values is the same for all values of the independent variable x. However, Monte Carlo studies suggest that meeting those assumptions closely is not absolutely crucial if your sample size is not very small and the departure from normality is not very large. It is impossible to formulate precise recommendations based on those Monte Carlo results, but many researchers follow a rule of thumb that if your sample size is 50 or more then serious biases are unlikely, and if your sample size is over 100 then you should not be concerned at all with the normality assumptions.
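
The dependence of significance on sample size can be illustrated with a short Python sketch. It assumes the usual statistic t = r·sqrt(N - 2)/sqrt(1 - r²) and a two-tailed 0.05 level; the critical values are copied from a standard Student's t table rather than computed. The correlation of 0.30 and the sample sizes are invented for the example.

```python
import math

# Two-tailed 5% critical values of Student's t from a standard table,
# keyed by degrees of freedom (N - 2).
T_CRITICAL_05 = {10: 2.228, 30: 2.042, 100: 1.984}

def correlation_t(r, n):
    """Test statistic for H0: no linear correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r = 0.30  # the same hypothetical correlation observed in samples of different size
decisions = {}
for n in (12, 32, 102):
    t_stat = correlation_t(r, n)
    significant = abs(t_stat) >= T_CRITICAL_05[n - 2]
    decisions[n] = significant
    print(f"N = {n:3d}  t = {t_stat:.3f}  significant: {significant}")
```

With these numbers, r = 0.30 is not significant for N = 12 or N = 32 but is significant for N = 102, even though the strength of the relationship is the same in all three cases.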

Outliers. Outliers are atypical (by definition), infrequent observations. Because of the way in which the regression line is determined (especially the fact that it is based on minimizing not the sum of simple distances but the sum of squares of distances of data points from the line), outliers have a profound influence on the slope of the regression line and consequently on the value of the correlation coefficient. A single outlier is capable of considerably changing the slope of the regression line and, consequently, the value of the correlation. Note that, as shown in the illustration, just one outlier can be entirely responsible for a high value of the correlation that otherwise (without the outlier) would be close to zero. Needless to say, one should never base important conclusions on the value of the correlation coefficient alone (i.e., examining the respective scatterplot is always recommended).
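
The effect of a single outlier can be reproduced with a small invented data set. In the Python sketch below, five points are arranged so that their correlation is zero, and one extreme point is then appended. The helper `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Five hypothetical observations with no linear relationship at all.
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 4, 2]
r_without = pearson_r(x, y)

# The same data plus a single extreme observation at (20, 20).
r_with = pearson_r(x + [20], y + [20])

print(round(r_without, 3), round(r_with, 3))
```

Here the outlier alone raises the coefficient from 0 to roughly 0.97, which is why a scatterplot should be inspected before the coefficient is interpreted.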

Quantitative Approach to Outliers. Some researchers use quantitative methods to exclude outliers. For example, they exclude observations that are outside the range of ± 2 standard deviations (or even ± 1.5 standard deviations) around the group or design cell mean. In some areas of research, such "cleaning" of the data is absolutely necessary. For example, in cognitive psychology research on reaction times, even if almost all scores in an experiment are in the range of 300-700 milliseconds, just a few "distracted reactions" of 10-15 seconds will completely change the overall picture. Unfortunately, defining an outlier is subjective (as it should be), and the decisions concerning how to identify outliers must be made on an individual basis (taking into account specific experimental paradigms and/or "accepted practice" and general research experience in the respective area). It should also be noted that in some rare cases, the relative frequency of outliers across a number of groups or cells of a design can be subjected to analysis and provide interpretable results. For example, outliers could be indicative of the occurrence of a phenomenon that is qualitatively different from the typical pattern observed or expected in the sample; thus the relative frequency of outliers could provide evidence of a relative frequency of departure from the process or phenomenon that is typical for the majority of cases in a group.
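
A minimal Python sketch of the ± 2 standard deviation rule follows, using invented reaction times in milliseconds with one "distracted reaction" of 12 seconds. The function name `split_outliers` and its `k` parameter are hypothetical, and the sample standard deviation is used as one possible choice.

```python
import statistics

def split_outliers(values, k=2.0):
    """Split values into (kept, removed) using mean ± k sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    low, high = mean - k * sd, mean + k * sd
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if v < low or v > high]
    return kept, removed

# Hypothetical reaction times (ms); the last one is a distracted response.
times = [320, 410, 450, 480, 500, 520, 560, 610, 680, 12000]
kept, removed = split_outliers(times)

print(removed)                 # the 12000 ms observation is excluded
print(statistics.mean(times))  # mean with the outlier
print(statistics.mean(kept))   # mean after exclusion
```

Because the outlier itself inflates both the mean and the standard deviation, a rule of this kind can miss outliers in very small samples; the cutoff remains a judgment call, as the text above stresses.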

Nonlinear Relations between Variables. Another potential source of problems with the linear (Pearson r) correlation is the shape of the relation. As mentioned before, Pearson r measures a relation between two variables only to the extent to which it is linear; deviations from linearity will increase the total sum of squared distances from the regression line even if they represent a "true" and very close relationship between two variables. The possibility of such non-linear relationships is another reason why examining scatterplots is a necessary step in evaluating every correlation.
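
As an illustration of this point, the Python sketch below uses an invented, perfectly deterministic but non-linear relationship (y = x² over values of x that are symmetric around zero). The helper `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = list(range(-3, 4))     # -3, -2, ..., 3
y = [v ** 2 for v in x]    # y is completely determined by x

r = pearson_r(x, y)
print(r)
```

Even though y can be predicted exactly from x, Pearson r is zero here, because the relationship has no linear component; a scatterplot would reveal the parabola immediately.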