## Wednesday, October 9, 2019

### How to Transform Data for Normality

Many statistical tests rely on the assumption that the residuals of the model are normally distributed. One of the first steps in assessing normality is to examine a histogram of the variable in question: we would like to see a bell curve.

The term bell curve describes the normal distribution, sometimes referred to as the Gaussian distribution. It refers to the bell shape that appears when a line is plotted through the data points of a variable that meets the criteria of normality. When the data are normal, the bell curve is symmetric around its center, so the right side is a mirror image of the left: half of the data fall below the mean and half fall above it.

Often, however, the bell does not appear symmetric, which means the data are not normally distributed. One possible way to fix non-normal data is to apply a transformation. Data transformation changes the distribution by applying a mathematical function to each data value.

Transformations are usually applied either so that the data more closely meet the assumptions of the statistical procedure to be used, or to improve the interpretability or appearance of graphs. In parametric statistics, normality is one of the assumptions checked during diagnostics, and it should be satisfied before the other assumptions are examined. Non-normal residuals can also go hand in hand with heteroscedasticity, which inflates type-I error rates and reduces power.
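As a quick illustration of that symmetry (a Python sketch, not part of the SPSS workflow), roughly half of the values drawn from a normal distribution fall below its mean:

```python
import numpy as np

# Draw a large sample from a normal distribution with mean 100, sd 15
rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=100_000)

# For a symmetric distribution, about half the values lie below the mean
frac_below_mean = np.mean(sample < sample.mean())
print(round(frac_below_mean, 2))  # close to 0.5
```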
In the sample data set below, this is the result of the normality test before the data transformation, using the Kolmogorov-Smirnov and Shapiro-Wilk tests. Use Kolmogorov-Smirnov for samples of more than 50 cases, and Shapiro-Wilk for samples of 50 or fewer. If the Sig. value is > .05, the data are normally distributed; if the Sig. value is < .05, they are not. In the table below, the Sig. value is .000, which is < .05, so the data are not normal.
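Outside SPSS, the same checks can be run in Python with scipy (a sketch; the skewed sample and the .05 cutoff are illustrative stand-ins for the Sig. rule above):

```python
import numpy as np
from scipy import stats

# A right-skewed sample (exponential), which should fail a normality test
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=40)

# Shapiro-Wilk (recommended here for samples of 50 or fewer)
sw_stat, sw_p = stats.shapiro(skewed)

# Kolmogorov-Smirnov against a normal with the sample's mean and sd
ks_stat, ks_p = stats.kstest(skewed, "norm",
                             args=(skewed.mean(), skewed.std(ddof=1)))

# Expect a very small Shapiro-Wilk p-value for data this skewed
print(round(sw_p, 4))
```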

Follow the steps below to normalize the ranks using the inverse distribution:

1. Normalizing / Transforming Ranks
1.1 Open the data set, then click Transform.
1.2 Click Rank Cases.
1.3 A dialog box named Rank Cases will open. Drag the selected variable into the variable list. On the right side, click Rank Types, check Fractional Rank, and click Continue.
1.4 Click OK.
1.5 Two additional columns will now appear in Data View. You will use RFR001 (the fractional rank) in the inverse distribution to normalize the values.

2. Inverse Distribution
2.1 Click Transform.
2.2 Click Compute Variable.
2.3 Type a name for the new variable (the normalized values), for example per_mean_normal, under Target Variable.
2.4 Scroll down the Function Group list and click Inverse DF.
2.5 Scroll down Functions and Special Variables and double-click Idf.Normal.
2.6 IDF.NORMAL(?,?,?) will appear in the Numeric Expression box.
2.7 Fill in the three arguments. For the first ?, delete it and double-click the fractional rank variable found on the left side of the dialog box in the Type & Label section. Replace the second ? with the mean and the third ? with the standard deviation (obtained in step 3 below). Click OK.
2.8 The normalized values now appear in a new column with the name you provided (per_mean_normal). You may run the normality test again on this variable to check whether the values have been normalized.

3. To Obtain the Mean and Standard Deviation
3.1 Click Analyze.
3.2 Click Descriptive Statistics, then Descriptives. A dialog box named Descriptives will open.
3.3 Click Options, check Mean and Std. deviation, and click Continue.
3.4 Click OK.
3.5 Read the Mean and Standard Deviation from the table in the SPSS Output.
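The point-and-click procedure above can be sketched in Python (an illustration, not SPSS itself; note that ranks are divided by n + 1 rather than n so that no probability hits exactly 0 or 1, where the inverse normal is undefined):

```python
import numpy as np
from scipy import stats

def rank_inverse_normal(values, mean, sd):
    """Rank-based inverse normal transform, mirroring SPSS's
    fractional ranks fed into IDF.NORMAL(rank, mean, sd)."""
    values = np.asarray(values, dtype=float)
    # Fractional ranks; dividing by n + 1 keeps them strictly in (0, 1)
    frac_ranks = stats.rankdata(values) / (len(values) + 1)
    # Inverse normal CDF (the equivalent of SPSS's IDF.NORMAL)
    return stats.norm.ppf(frac_ranks, loc=mean, scale=sd)

rng = np.random.default_rng(1)
skewed = rng.exponential(scale=2.0, size=60)
normalized = rank_inverse_normal(skewed, skewed.mean(), skewed.std(ddof=1))

# The transformed values should now pass a normality test
_, p = stats.shapiro(normalized)
print(p > 0.05)  # True
```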

This is the result of the Normality Test after the data transformation.

Although transforming the data helps normalize the values, the trade-off is that interpretation becomes more difficult. For example, if you run a t-test to check for differences between two groups and the data being compared have been transformed, you cannot simply say that there is a difference between the two groups' means on the original scale; there is an added step of interpreting the results on the transformed scale (for a square-root transformation, say, by back-transforming). For this reason, data transformations are not usually recommended unless necessary.
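A small Python sketch of why interpretation changes (using a square-root transformation as the example): the back-transformed mean of square-root data is not the mean of the original data.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=4.0, size=1_000)

raw_mean = data.mean()
# Mean computed on the square-root scale, then squared ("back-transformed")
back_transformed = np.sqrt(data).mean() ** 2

# By Jensen's inequality the back-transformed mean understates the raw mean
print(back_transformed < raw_mean)  # True
```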

Is satisfying the Normality Assumption really necessary?

According to Stevens (2016) in Multivariate Statistics for Social Sciences, for analyses such as dependent and independent samples t-tests, ANOVA, MANOVA, and regression, violations of normality have little effect on validity as long as the sample size exceeds 50. Therefore, non-normal data do not usually have much impact on validity.
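This robustness reflects the central limit theorem: sample means become approximately normal as the sample size grows, even when the raw data are skewed. A quick Python simulation (illustrative only, with arbitrary chosen sizes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Raw data: heavily right-skewed (exponential, population skewness = 2)
raw = rng.exponential(scale=1.0, size=10_000)

# Distribution of means of samples of size 60 from the same population
means = rng.exponential(scale=1.0, size=(2_000, 60)).mean(axis=1)

# The sampling distribution of the mean is far less skewed than the raw data
print(abs(stats.skew(means)) < abs(stats.skew(raw)))  # True
```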

References:
1. Berry, W. (1993). Understanding regression assumptions. Quantitative Applications in the Social Sciences, 92, 81–82.
2. Feingold, E. (2002). Regression-based quantitative-trait-locus mapping in the 21st century. American Journal of Human Genetics, 71, 217–222.
3. Stevens, J. (2016). Multivariate Statistics for Social Sciences (5th ed.). New York, NY: Routledge/Taylor & Francis Group.
4. Statistics Solutions. Transforming Data for Normality.
