Tuesday, October 15, 2019

Assumptions of Linear Regression


Regression and correlation are closely related. Both techniques examine the relationship between two variables, and both use the same set of paired scores taken from the same subjects. However, whereas correlation is concerned with the magnitude and direction of the relationship, regression focuses on using the relationship for prediction. If two variables were perfectly correlated, knowing the score on one variable would permit a perfect prediction of the score on the other. More generally, whenever two variables are significantly correlated, the researcher may use the score on one variable to predict the score on the second.
There are many reasons why researchers want to predict one variable from another. For example, knowing a person’s I.Q., what can we say about this person’s prospects of successfully completing a university course? Knowing a person’s prior voting record, can we make any informed guesses concerning his vote in the coming election? Knowing his mathematics aptitude score, can we estimate the quality of his performance in a course in statistics? These questions involve predictions from one variable to another, and psychologists, educators, biologists, sociologists, and economists are constantly being called upon to perform this function.
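As a rough illustration of the difference between the two techniques, the Python sketch below first measures the correlation between two sets of paired scores and then fits a least-squares line to predict one score from the other. The aptitude and grade values are made-up numbers used only for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation: magnitude and direction of the linear relationship."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: mathematics aptitude scores and statistics course grades.
aptitude = [52, 60, 65, 71, 78, 84]
grade = [61, 66, 70, 72, 80, 85]

r = pearson_r(aptitude, grade)
intercept, slope = fit_line(aptitude, grade)
predicted = intercept + slope * 75  # predicted grade for an aptitude score of 75

print(f"r = {r:.3f}")
print(f"predicted grade for aptitude 75 = {predicted:.1f}")
```

Correlation answers "how strongly are the two scores related?", while the fitted line answers "what score should we expect on the second variable?".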
Problems in regression analysis often arise when the assumptions of the analysis are not satisfied. For example, the predictive power of the regression equation depends on the residuals from that regression satisfying certain statistical properties. This section discusses these assumptions and how to test them in SPSS.
Multiple linear regression requires at least two independent variables, which can be nominal, ordinal, or interval/ratio level variables. When there is only one independent variable, the analysis is called “simple” linear regression. A rule of thumb for the sample size is that regression analysis requires at least 20 cases per independent variable in the analysis.


1.     LEVEL OF MEASUREMENT
The dependent (outcome) variable is measured at the scale (interval or ratio) level, while the independent (predictor) variable is measured at the scale or nominal level.

2.     LINEARITY
Correlation and regression are measures of association between variables. Prior to performing regression analysis, it is important to run the correlation test to determine the strength of the linear relationship between the two variables.

The linearity test determines whether the relationship between the independent variable and the dependent variable is linear. Linearity is a requirement in correlation and linear regression analyses.

Decision-making process in Linearity Test:
1.      If the Sig. value for Deviation from Linearity is > 0.05, the relationship between the independent and dependent variables is linear.
2.      If the Sig. value for Deviation from Linearity is < 0.05, the relationship between the independent and dependent variables is not linear.

Steps:
1.      On the SPSS menu, select Analyze, then click Compare Means and then click Means.
2.      A new dialog box will appear with the name Means. Enter the corresponding independent variable(s) in the Independent List and dependent variable(s) in the Dependent List.
3.      Click Options. A dialog box named Means: Options will appear. Under Statistics for First Layer at the bottom of the box, select Test for linearity, and then click Continue.
4.      Finally, click OK to run the command, after which the output will appear.
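For readers who want to see what the Test for Linearity is doing, here is a minimal Python sketch of the Deviation from Linearity F statistic. It assumes the independent variable takes a limited number of distinct values (at least three), so that cases can be grouped by those values, as the Means procedure does. The function name and the data are hypothetical. The sketch stops at the F statistic; turning it into a Sig. value needs an F-distribution function (for example `scipy.stats.f.sf`, if SciPy is installed).

```python
from collections import defaultdict
from statistics import mean


def deviation_from_linearity(x, y):
    """Return (F, df_deviation, df_within) for the deviation-from-linearity test.

    Cases are grouped by the distinct values of x. The between-groups sum of
    squares is split into a linear part and a deviation part.
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    n, k = len(y), len(groups)
    grand_mean = mean(y)

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = mean(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    # Sum of squares explained by a straight line of y on x (1 degree of freedom).
    x_bar = mean(x)
    sxy = sum((xi - x_bar) * (yi - grand_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    ss_linearity = sxy ** 2 / sxx

    ss_deviation = ss_between - ss_linearity
    df_deviation, df_within = k - 2, n - k
    f_stat = (ss_deviation / df_deviation) / (ss_within / df_within)
    return f_stat, df_deviation, df_within


# Hypothetical data: x has three distinct levels, two cases per level.
x = [1, 1, 2, 2, 3, 3]
y = [2.0, 3.1, 3.9, 5.2, 6.1, 6.8]
f_stat, df1, df2 = deviation_from_linearity(x, y)
print(f"F({df1}, {df2}) = {f_stat:.3f}")
```

A small F (and therefore a large Sig. value) means the group means sit close to a straight line, which is the "linear" outcome in the decision rule above.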

Interpretation of Linearity Test Result
Based on the ANOVA output table, the Sig. value for Deviation from Linearity is greater than 0.05, so it can be concluded that there is a linear relationship between the immunization rate and the child mortality rate.


3. NORMALITY OF VALUES
The assumption of normality means that the data should roughly fit a bell-shaped curve before certain statistical tests or regression are run. The normality test examines the distribution of the data in each variable used in the analysis. The Shapiro-Wilk and Kolmogorov-Smirnov tests are commonly used to determine whether variables are normally distributed. Shapiro-Wilk is used for samples of fewer than 50 cases; for samples of more than 50, Kolmogorov-Smirnov is used.

Decision-making process in the normality test:
1.      If the value Asymp. Sig > 0.05, the data is normally distributed.
2.      If the value Asymp. Sig < 0.05, the data is not normally distributed.

Steps:
1.      On the SPSS menu, select Analyze, then click Explore.
2.      A new dialog box will appear. Enter the corresponding independent variable(s) in the Factor List (Independent Variable) and dependent variable(s) in the Dependent List (Dependent Variables).
3.      Click Plots. Uncheck Stem-and-Leaf and click Normality plots with tests. Click Continue, then OK.
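To give a feel for what the Kolmogorov-Smirnov test measures, the Python sketch below computes only its D statistic: the largest gap between the sample's cumulative distribution and a normal curve with the same mean and standard deviation. This is a simplified sketch, not a reproduction of the SPSS output. The function name and data are hypothetical, and no Sig. value is computed, because a correct one needs a correction for estimating the mean and standard deviation from the same data (SPSS reports the Lilliefors correction).

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(data):
    """Largest distance between the empirical CDF and a fitted normal CDF."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # normal curve with sample mean and SD

    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # Compare the normal CDF with the empirical CDF just after and just before x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Hypothetical scores for illustration.
scores = [48, 52, 55, 57, 60, 61, 63, 66, 70, 75]
print(f"D = {ks_statistic_normal(scores):.3f}")
```

The smaller D is, the more closely the sample follows the bell curve; a strongly skewed sample gives a much larger D.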

Interpretation of Normality Test Result
Based on the table, the Sig. values are > .05; therefore, the data are normally distributed.


4.   MULTICOLLINEARITY TEST
A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant.
Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.
Similarities between the independent variables will result in a very strong correlation. Collinearity between independent variables should not happen in good regression models.
The Variance Inflation Factor (VIF) is often used to detect multicollinearity. For each independent variable it is defined as VIF = 1/T, where T is the tolerance (1 − R², with R² obtained by regressing that independent variable on all the other independent variables). A VIF > 5 indicates that multicollinearity may be present; a VIF > 10 is generally taken as a sign of serious multicollinearity among the variables.
If multicollinearity is found in the data, centering the data (that is, subtracting the mean of the variable from each score) might help to solve the problem. However, the simplest way to address the problem is to remove independent variables with high VIF values.

Decision-making process in the collinearity test:
1.      If the VIF value lies between 1 and 10, there is no multicollinearity.
2.      If the VIF is greater than 10, there is multicollinearity. (A VIF cannot be less than 1, because the tolerance never exceeds 1.)
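The VIF = 1/T relationship can be sketched in a few lines of Python. This sketch assumes a model with exactly two independent variables, in which case the R² from regressing one on the other is simply their squared Pearson correlation; with more predictors, each R² has to come from a regression on all the other predictors. The function name and data are hypothetical.

```python
from statistics import mean


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model.

    Tolerance T = 1 - R^2, where R^2 is the squared correlation between
    the two predictors, and VIF = 1 / T.
    """
    m1, m2 = mean(x1), mean(x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)
    tolerance = 1 - r_squared
    return 1 / tolerance


# Hypothetical predictors for illustration.
hours_studied = [2, 4, 5, 7, 8, 10]
classes_attended = [3, 3, 6, 5, 9, 8]
vif = vif_two_predictors(hours_studied, classes_attended)
print(f"VIF = {vif:.2f}")
print("possible multicollinearity" if vif > 5 else "no sign of multicollinearity")
```

Uncorrelated predictors give a VIF of exactly 1, and the value grows without limit as the predictors approach perfect correlation.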

Steps:
1.      On the SPSS menu, select Analyze. Click Regression, then Linear.
2.      A new dialog box will appear. Enter the dependent variable in the Dependent box and the independent variable(s) in the Independent(s) list.
3.      Click Statistics.
4.      A new dialog box will open. Select Collinearity diagnostics.
5.      Click Continue, and then OK.

Interpretation of Multicollinearity Test Result
Based on the Coefficients output, the VIF values lie between 1 and 10; therefore, there is no multicollinearity.


5. EQUALITY OF VARIANCE
Statistical tests such as analysis of variance (ANOVA) assume that, although different samples can come from populations with different means, they have the same variance. Equal variance, also called homoscedasticity, means that the variances are approximately the same across the samples.

In linear regression analysis, when the errors of the model (estimated by the residuals) are not homoscedastic, the coefficients estimated using ordinary least squares (OLS) remain unbiased but no longer have minimum variance, and the usual estimates of their variance (and hence their standard errors and significance tests) are not reliable.

Decision-making process in the homoscedasticity test:
1.      If the points in the plot show no obvious pattern, the data are homoscedastic.
2.      If the points form a very tight distribution on the left of the plot and a very wide distribution on the right, or vice versa, the data are not homoscedastic.
  
Steps:
1.      On the SPSS menu, select Analyze. Click Regression, then Linear.
2.      A new dialog box will appear. Enter the dependent variable in the Dependent box and the independent variable(s) in the Independent(s) list.
3.      Click Plots.
4.      A new dialog box will open. Drag *ZPRED to X and *ZRESID to Y.
5.      Click Continue, and then OK.
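The plot requested in the steps above is judged by eye. As a numerical stand-in, the Python sketch below fits a simple regression, orders the cases by predicted value, and compares the residual spread in the upper half with the spread in the lower half. This is a simplified check in the spirit of the Goldfeld-Quandt test, not what SPSS computes, and it assumes a single independent variable. The function names and data are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_spread_ratio(x, y):
    """Mean squared residual in the upper half of predicted values
    divided by the mean squared residual in the lower half."""
    intercept, slope = fit_line(x, y)
    # Pair each predicted value with its residual, then order by predicted value.
    pairs = sorted(
        (intercept + slope * xi, yi - (intercept + slope * xi))
        for xi, yi in zip(x, y)
    )
    half = len(pairs) // 2
    low = [res ** 2 for _, res in pairs[:half]]
    high = [res ** 2 for _, res in pairs[-half:]]
    return mean(high) / mean(low)


# Hypothetical data whose scatter widens as x increases (a cone shape).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.1, 7.9, 12.0, 10.0, 16.0, 14.0]
print(f"spread ratio = {residual_spread_ratio(x, y):.1f}")
```

A ratio near 1 corresponds to a plot with no obvious pattern, while a ratio far above or below 1 corresponds to the cone shape described in the decision rule.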

Interpretation of Result
Homoscedastic
The data have no obvious pattern.
Heteroscedastic 
Below is an example of heteroscedastic values. There is a tight distribution to the left of the plot and a very wide distribution to the right of the plot. If you were to draw a line around your data, it would look like a cone.



Image: Data SPSS Version 20






