Many statistical tests rely on the assumption that the residuals of the model are normally distributed. One of the first steps in assessing normality is simply to look at a histogram of the variable in question: ideally, it shows a bell curve.

The term *bell curve* describes the mathematical concept called the *normal distribution*, sometimes referred to as the *Gaussian distribution*. It refers to the bell shape that appears when a line is plotted through the data points of a variable that meets the criteria of normality. When the data are normal, the bell curve is symmetrical around its center, so the right side of the center is a mirror image of the left side. This means that half of the data fall to the left of the mean and half fall to the right.
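This symmetry is easy to check numerically. A minimal sketch with simulated data (the mean of 100 and standard deviation of 15 here are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=100, scale=15, size=10_000)

# Symmetry: roughly half of the values fall below the mean...
below = np.mean(sample < sample.mean())
print(f"proportion below the mean: {below:.3f}")

# ...and the mean and median nearly coincide.
print(f"mean = {sample.mean():.1f}, median = {np.median(sample):.1f}")
```

For a skewed variable, by contrast, the mean is pulled toward the long tail and these two quantities separate.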
However, whether we like it or not, the bell often does not appear symmetrical, which means the data are not normally distributed. One possible way to fix non-normal data is to apply a transformation. Data transformation is a method of changing the distribution by applying a mathematical function to each data value.
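For instance, a log transform applies the same function to every value and compresses a long right tail. A small sketch with made-up skewed values (this is a generic example, not the inverse-distribution method this guide uses):

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed measurements
x = np.array([1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.7, 15.2])

# Apply the same mathematical function (natural log) to each value;
# this pulls in the long right tail.
x_log = np.log(x)

print(f"skewness before: {skew(x):.2f}, after: {skew(x_log):.2f}")
```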

Data transformation is usually applied so that the data more closely meet the assumptions of the statistical inference procedure to be applied, or to improve the interpretability or appearance of graphs. In parametric statistics, normality is one of the assumptions checked during diagnostics, and it should be verified before the other assumptions are tested. Non-normality of the residuals often goes hand in hand with heteroscedasticity, which can inflate Type I error rates and reduce power.

For the given sample data set, below is the result of the normality test **before** the data transformation, using the Kolmogorov-Smirnov and Shapiro-Wilk tests. Use the *Kolmogorov-Smirnov* test for samples of more than 50 cases and the *Shapiro-Wilk* test for 50 cases or fewer. If the Sig. value is > .05, the data are normally distributed; if the Sig. value is < .05, they are not. In the table below, the Sig. value is .000, which is < .05, meaning the data are not normal.

**Follow the steps below to normalize the ranks using the inverse distribution:**
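The same Shapiro-Wilk check can be run outside SPSS. A sketch using `scipy.stats.shapiro` on a hypothetical skewed sample of fewer than 50 cases:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical skewed (exponential) sample, n = 40 (< 50, so Shapiro-Wilk)
sample = rng.exponential(scale=2.0, size=40)

w, p = stats.shapiro(sample)
print(f"W = {w:.3f}, Sig. = {p:.4f}")

# Sig. < .05 -> the data are not normally distributed
print("normally distributed" if p > 0.05 else "not normally distributed")
```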

1. __Normalizing / Transforming Ranks__

1.1 Open the data set, then click **Transform**.

1.2 Click **Rank Cases**.

1.3 A dialog box named **Rank Cases** will open. Drag the selected variable into it. On the right side, click **Rank Types**, check **Fractional Rank**, and click **Continue**.

1.4 Click **OK**.

1.5 Two additional columns will now appear in your Data View. You will use the RFR001 (Fractional Rank) column in the inverse-distribution step to normalize the values.

2. __Inverse Distribution__

2.1 Click **Transform**.

2.2 Click **Compute Variable**.

2.3 Type a name for the new variable (the normalized values), for example *per_mean_normal*, below **Target Variable**.

2.4 Scroll down the **Function Group** list and click **Inverse DF**.

2.5 Scroll down **Functions and Special Variables** and double-click **Idf.Normal**.

2.6 **IDF.NORMAL(?,?,?)** will appear below the Numeric Expression.

2.7 Fill in the corresponding values for the **(?,?,?)**. For the first (?), delete it and double-click the **Fractional Rank** variable, found on the left side of the dialog box in the **Type & Label** section. For the second (?), delete it and replace it with the mean. For the third (?), replace it with the standard deviation. Click **OK**.

2.8 You now have the normalized values in a new column under the name you provided (*per_mean_normal*). You may run the normality test again on the generated values to check whether they have been normalized.

3. __To obtain the Mean and Standard Deviation__

3.1 Click **Analyze**.

3.2 Click **Descriptive Statistics**. A dialog box named **Descriptives** will open.

3.3 Click **Options**. Check **Mean** and **Standard Deviation**, then click **Continue**.

3.4 Click **OK**.

3.5 Check the SPSS Output and read the Mean and Standard Deviation from the table.
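Outside SPSS, the same rank-then-inverse-normal procedure can be sketched in a few lines. The variable name `per_mean` mirrors the example above but the data here are simulated, and clipping the top rank (which SPSS's IDF.NORMAL would otherwise return as system-missing, since the inverse CDF of 1.0 is infinite) is an assumption of this sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
per_mean = rng.exponential(scale=3.0, size=100)  # hypothetical skewed scores

# 1. Fractional ranks, as Rank Cases -> Fractional Rank does (rank / n).
frac_rank = stats.rankdata(per_mean) / len(per_mean)
# The top case gets rank 1.0, which the inverse CDF maps to infinity;
# nudge it inside (0, 1) so every transformed value is finite.
frac_rank = np.clip(frac_rank, None, 1 - 0.5 / len(per_mean))

# 2. IDF.NORMAL(rank, mean, sd): inverse normal CDF at each fractional
# rank, using the sample mean and standard deviation.
per_mean_normal = stats.norm.ppf(frac_rank,
                                 loc=per_mean.mean(),
                                 scale=per_mean.std(ddof=1))

# 3. Re-run the normality test on the transformed values.
w, p = stats.shapiro(per_mean_normal)
print(f"Shapiro-Wilk Sig. after transformation: {p:.3f}")
```

Because the transformed values are, by construction, evenly spaced normal quantiles, the Shapiro-Wilk Sig. comes out well above .05.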

This is the result of the normality test **after** the data transformation.
Although transforming the data helps normalize the values, the trade-off is that the results become harder to interpret. For example, if you run a t-test to check for differences between two groups and the data being compared have been transformed, you cannot simply say that there is a difference between the two groups' means; there is an added step of interpreting the results on the transformed scale (for instance, on the square-root scale if a square-root transformation was used). For this reason, data transformations are not usually recommended unless necessary.

**Is satisfying the Normality Assumption really necessary?**

According to Stevens (2016), in *Applied Multivariate Statistics for the Social Sciences*, for analyses such as dependent- and independent-samples t-tests, ANOVA, MANOVA, and regression, violations of normality do not threaten validity as long as the sample size exceeds 50. In practice, then, non-normal data usually have little impact on validity.

**References:**

1. Berry, W. (1993). Understanding regression assumptions. Quantitative Applications in the Social Sciences, (92), 81–82.

2. Feingold, E. (2002). Regression-based quantitative-trait-locus mapping in the 21st century. Am J Hum Genet, (71), 217–222.

3. Stevens, J. P. (2016). Applied Multivariate Statistics for the Social Sciences (5th ed.). New York, NY: Routledge/Taylor & Francis Group.

4. Statistics Solutions. Transforming Data for Normality. https://www.statisticssolutions.com/transforming-data-for-normality/. Retrieved October 9, 2019.
