Data transformation
The Aim
By the end of this lecture, students will be aware of the data transformation methods used so that the appropriate statistical tests can be applied, and will be able to transform data using SPSS.
The Goals
• Explain why data transformation sometimes needs to be applied.
• List the effects of data transformation on the data.
• Explain how to transform data.
• Explain the typical data transformations and how they will affect the data:
– Logarithmic transformation
– Square root
– Reciprocal (1/y)
– Square
– Logit
• Be able to perform these typical transformations with SPSS.
• When we turn to data analysis and want to apply significance tests, we face the assumptions of the tests we will use.
• If our data do not meet those assumptions, we cannot apply the relevant statistical analysis; one solution is to satisfy the assumptions of the test by applying a data transformation.
• The reasons for applying a data transformation, and the conditions it is meant to produce, are:
– Our data are not normally distributed, yet a normal distribution is an assumed requirement of many statistical tests.
– The spread of the data may differ extremely between our groups, yet some tests, such as the t-test for independent groups, assume that the variances are equal.
– Two variables are not linearly related, yet, as in regression analysis, linearity is a necessary assumption.
How is data transformation done?
• If we want to transform a variable in our data set, we must apply the same mathematical operation to every value of that variable.
eg:
– We want to work with our variable age but, when we examine our data, we see that it does not fit a normal distribution. Suppose the age values of 100 people are found in our data set. We must apply the same operation to each age value (eg: take the square). Ultimately a new transformed variable (eg: “YaşDön”) will appear, as sketched below.
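For illustration, a minimal SPSS syntax sketch of this step, assuming the variable is named age and reusing the example name YaşDön for the new variable (Unicode variable names require a recent SPSS version):
* Square every age value to create the transformed variable.
COMPUTE YaşDön = age**2.
EXECUTE.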
• Since transformed clinical variables are difficult to interpret (the square of age means little to clinicians), after doing our analysis we need to back-transform the results while reporting them (if we took the square, this time we take the square root).
• We should note that back-transformation can be a problem for some data (when we apply the square transformation to a variable that has negative values, taking the square root of the results will be misleading).
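As a sketch of this back-transformation step in SPSS syntax, assuming the squared variable YaşDön from the earlier example and a hypothetical name AgeBack for the reported values:
* Valid only because age is never negative.
COMPUTE AgeBack = SQRT(YaşDön).
EXECUTE.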
Transformations
-Logarithmic transformation
-Square root transformation
-Reciprocal transformation
-Square transformation
-Logit (logistic) transformation
Typical transformations
1. Logarithmic transformation, z = log y
• The logarithmic transformation can be taken to base 10 or base e:
– log10 (or log)
– loge (or ln)
• Please note that we cannot take the logarithm of zero or of negative numbers. The back-transformation of the logarithm is called the antilog.
eg: If you take the base-10 logarithm of 100:
log10(100) = 2
antilog(2) = 10² = 100
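In SPSS syntax this pair of operations could be sketched as follows, assuming a variable named y (LogY and YBack are hypothetical names):
COMPUTE LogY = LG10(y).
* Antilog: back-transform by raising 10 to the logged value.
COMPUTE YBack = 10**LogY.
EXECUTE.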
• If y is skewed to the right, z = log y usually has an approximately normal distribution. In this case we say that y has a lognormal distribution.
• If there is an exponential relationship between y and x, so that an upward-curving graph appears when x is plotted on the horizontal axis and y on the vertical axis, then there is a linear relationship between z = log y and x.
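A one-line illustration, assuming the exponential relationship has the form y = a·e^(b·x): taking logs gives z = loge y = loge a + b·x, which is linear in x.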
• Suppose we measure a continuous variable y (eg: height) in different groups, and the variance of y is larger in groups where y is larger. In particular, if the coefficient of variation (standard deviation / mean) is equal across the groups, the variances of z = log y will be equal.
Figure: The effects of the logarithmic transformation: (a) normalizing, (b) linearizing, (c) variance stabilizing.
• Because it is easy to interpret and because data in medicine are generally skewed to the right, the log transformation is often used in medicine.
eg:
• As we shall see later, compliance with the normal distribution is an important assumption of many hypothesis tests. When we examine the “Weight” variable in the diyabet.sav data set, we see in the histogram that the tail of the bell curve points to the right (right skewed).
By using the Kolmogorov-Smirnov test or a skewness analysis we may also show that the “Weight” variable is not normally distributed:
Analyze > Descriptive statistics > Frequencies [move the “Weight” variable into the Variables box, click the Statistics button and tick the Skewness box] > Continue > OK. The following output will appear:

N (Valid)                 424
N (Missing)               6
Skewness                  1,329
Std. Error of Skewness    ,119

[Histogram of Weight showing a right-skewed distribution]
Since the skewness value of Weight (1.329) is positive and more than twice its standard error (0.119), we can say that our data are skewed to the right.
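The same output could also be obtained with SPSS syntax, roughly as follows (a sketch, assuming the variable is named weight in diyabet.sav):
FREQUENCIES VARIABLES=weight
  /FORMAT=NOTABLE
  /STATISTICS=SKEWNESS SESKEW.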
• Analyze > Nonparametric Tests > 1-Sample K-S [move the “Weight” variable into the Variables box] > Continue > OK. The following output will appear:
One-Sample Kolmogorov-Smirnov Test

                                          Weight
N                                         424
Normal Parameters(a,b)    Mean            74,266
                          Std. Deviation  15,1381
Most Extreme Differences  Absolute        ,094
                          Positive        ,094
                          Negative        -,045
Kolmogorov-Smirnov Z                      1,926
Asymp. Sig. (2-tailed)                    ,001
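In SPSS syntax the same test could be run roughly as follows (a sketch, again assuming the variable name weight):
NPAR TESTS
  /K-S(NORMAL)=weight
  /MISSING ANALYSIS.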
• Let's make a new variable by taking the logarithm of the “Weight” variable:
– Transform > Compute Variable > [type “LogWeight” into the “Target Variable” field and “LG10(weight)” into the “Numeric Expression” field] > OK
• A new variable named “LogWeight” will appear. Let us look at the histogram of this variable:
– Graphs > Interactive > Histogram [drag the “LogWeight” variable onto the X axis, click the “Histogram” tab at the top and tick the “Normal curve” box] > OK
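Equivalent SPSS syntax might look like this (a sketch; the GRAPH command is used here in place of the interactive chart dialog):
COMPUTE LogWeight = LG10(weight).
EXECUTE.
GRAPH
  /HISTOGRAM(NORMAL)=LogWeight.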
[Histogram of LogWeight with a normal curve overlaid: the distribution is approximately symmetrical]

N (Valid)                 424
N (Missing)               6
Skewness                  ,248
Std. Error of Skewness    ,119

• As can be seen, the bell curve has become symmetrical.
• When we look at the skewness value of our new variable, it is close to twice its standard error.
• When analyzed with the Kolmogorov-Smirnov test:

One-Sample Kolmogorov-Smirnov Test

                                          lgweight
N                                         424
Normal Parameters(a,b)    Mean            1,86
                          Std. Deviation  ,084
Most Extreme Differences  Absolute        ,053
                          Positive        ,053
                          Negative        -,038
Kolmogorov-Smirnov Z                      1,091
Asymp. Sig. (2-tailed)                    ,185
• The p value (0.185) is larger than 0.05, so we see that LogWeight is normally distributed.
• We should check the normal distribution assumption in our statistical calculations and report this in the article, but we should note that most parametric tests (as will be seen later) are tolerant of slight deviations from normality.
2. Square root transformation, z = √y
• The characteristics of this transformation are similar to those of the log transformation, although there may be problems in interpreting the data after back-transformation. Besides its normalizing and linearizing features, it also has a variance-stabilizing effect when the variance increases as y increases (i.e. when variance / arithmetic mean is constant).
• The square root transformation is usually used for Poisson-type (count) variables. Also, it should be noted that the square root of a negative number cannot be taken.
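A minimal SPSS syntax sketch, assuming a variable y and a hypothetical name SqrtY:
COMPUTE SqrtY = SQRT(y).
EXECUTE.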
3. Reciprocal transformation, z = 1/y
• When we do not use special techniques, the reciprocal transformation is used in survival analysis.
• The reciprocal transformation has effects similar to the log transformation. In addition to its normalizing and linearizing abilities, it is more effective at stabilizing variance than the log transformation if the variance increases very markedly with increasing values of y, i.e. if the variance divided by the mean is constant.
• It should be kept in mind that the reciprocal of zero cannot be taken.
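A minimal SPSS syntax sketch, assuming a variable y and a hypothetical name RecipY:
COMPUTE RecipY = 1/y.
EXECUTE.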
4. Square transformation, z = y²
– The square transformation does the opposite of the log transformation.
– If y is skewed to the left, z = y² is usually normally distributed.
– If the relationship between two variables, x and y, is such that a line curving downward is produced when we plot y against x, then the relationship between z = y² and x is approximately linear.
– If the variance of a continuous variable y tends to decrease as the value of y increases, then the square transformation, z = y², stabilizes the variance.
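A minimal SPSS syntax sketch, assuming a variable y and a hypothetical name SqY (** is the exponentiation operator):
COMPUTE SqY = y**2.
EXECUTE.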
• Figure: The effects of the square transformation: (a) normalizing, (b) linearizing, (c) variance stabilizing.
5. Logit (logistic) transformation, z = ln(p / (1 - p))
• This is the transformation we apply most often to each proportion, p, in a set of proportions. We cannot take the logit transformation if either p = 0 or p = 1.
• It linearizes a sigmoid curve.
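A minimal SPSS syntax sketch, assuming p is a proportion strictly between 0 and 1 and LogitP is a hypothetical name:
COMPUTE LogitP = LN(p / (1 - p)).
EXECUTE.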
• Figure: The effect of the logit transformation on a sigmoid curve.
Summary
Transformations
-Logarithmic transformation
-Square root transformation
-Reciprocal transformation
-Square transformation
-Logit (logistic) transformation