Data Analysis Using R:
2. Descriptive Statistics
Tuan V. Nguyen
Garvan Institute of Medical Research,
Sydney, Australia
Overview
• Measurements
• Population vs sample
• Summary of data: mean, variance, standard deviation,
standard error
• Graphical analyses
• Transformation
Scales of Measurement
• In general, most observable
behaviors can be measured on
a ratio-scale
• In general, many unobservable
psychological qualities (e.g.,
extraversion), are measured on
interval scales
• We will mostly concern
ourselves with the simple
categorical (nominal) versus
continuous distinction (ordinal,
interval, ratio)
[Diagram: variables divide into categorical and continuous; continuous divides into ordinal, interval and ratio]
Ordinal Measurement
• Ordinal: Designates an ordering; quasi-ranking
– Does not assume that the intervals between numbers are equal.
– finishing place in a race (first place, second place)
[Figure: finishing places (1st to 4th) marked along a time axis running from 1 to 8 hours]
Interval and Ratio Measurement
• Interval: designates an equal-interval ordering
– The distance between, for example, a 1 and a 2 is the
same as the distance between a 4 and a 5
– Example: Common IQ tests are assumed to use an
interval metric
• Ratio: designates an equal-interval ordering with a
true zero point (i.e., the zero implies an absence of the
thing being measured)
– Example: number of intimate relationships a person has
had
• 0 quite literally means none
• a person who has had 4 relationships has had twice as many
as someone who has had 2
Statistics: Enquiry into the unknown
Population
Sample
Parameter
Estimate
Estimate the population mean
Population height mean = 160 cm
Standard deviation = 5.0 cm
ht <- rnorm(10, mean=160, sd=5)
mean(ht)
ht <- rnorm(10, mean=160, sd=5)
mean(ht)
ht <- rnorm(100, mean=160, sd=5)
mean(ht)
ht <- rnorm(1000, mean=160, sd=5)
mean(ht)
ht <- rnorm(10000, mean=160, sd=5)
mean(ht)
hist(ht)
The larger the sample, the more accurate the estimate is!
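The claim can be checked by repeating the sampling many times at each sample size and measuring how far the sample mean typically lands from 160. This is a minimal sketch: the seed, the number of repetitions (`reps`) and the helper name `mean_abs_error` are illustrative assumptions, not part of the original slides.

```r
set.seed(1)  # assumed seed, used only to make the run reproducible

# Average absolute error of the sample mean over `reps` repeated samples
mean_abs_error <- function(n, reps = 2000, mu = 160, sigma = 5) {
  means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
  mean(abs(means - mu))
}

sizes  <- c(10, 100, 1000)
errors <- sapply(sizes, mean_abs_error)

# Compare the simulated error with the theoretical standard error sigma/sqrt(n)
data.frame(n = sizes, mean_abs_error = errors, theoretical_se = 5 / sqrt(sizes))
```

The error shrinks roughly in proportion to 1/sqrt(n), so a tenfold larger sample reduces the typical error by a factor of about three, not ten.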
Estimate the population proportion
Population proportion of males = 0.50
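Draw n samples of k people each and record the number of males in each sample
rbinom(n, k, prob)
males <- rbinom(10, 10, 0.5)     # 10 samples of 10 people each
males
mean(males) / 10                 # estimated proportion of males
males <- rbinom(20, 100, 0.5)    # 20 samples of 100 people each
males
mean(males) / 100
males <- rbinom(1000, 100, 0.5)  # 1000 samples of 100 people each
males
mean(males) / 100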
The larger the sample, the more accurate the estimate is!
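A single-sample version of the same idea is sketched below: each individual is coded 1 (male) or 0 (female), and the sample proportion is the mean of those codes. The seed and the three sample sizes are illustrative assumptions.

```r
set.seed(2)  # assumed seed, for reproducibility only
p_true <- 0.5
sizes  <- c(10, 100, 10000)

# For each sample size, draw n individuals (1 = male, 0 = female)
# and estimate the proportion of males as the mean of the 0/1 codes
estimates <- sapply(sizes, function(n) mean(rbinom(n, size = 1, prob = p_true)))

data.frame(n = sizes, estimated_proportion = estimates)
```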
Summary of Continuous Data
• Measures of central tendency:
– Mean, median, mode
• Measures of dispersion or variability:
– Variance, standard deviation, standard error
– Interquartile range
R commands
length(x), mean(x), median(x), var(x), sd(x)
summary(x)
R example
height <- rnorm(1000, mean=55, sd=8.2)
mean(height)
[1] 55.30948
median(height)
[1] 55.018
var(height)
[1] 68.02786
sd(height)
[1] 8.2479
summary(height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  28.34   49.97   55.02   55.31   60.78   85.05
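The command list covers the mean, median, variance and standard deviation, but not the standard error or the interquartile range mentioned earlier. A minimal sketch of both follows; the seed is an assumption, so the simulated values will not match the output shown above.

```r
set.seed(3)  # assumed seed; values differ from the slide's unseeded run
height <- rnorm(1000, mean = 55, sd = 8.2)

n  <- length(height)
se <- sd(height) / sqrt(n)                   # standard error of the mean
quartiles <- quantile(height, c(0.25, 0.75)) # 1st and 3rd quartiles
iqr <- IQR(height)                           # interquartile range = Q3 - Q1

c(n = n, se = se, iqr = iqr)
```

The standard deviation describes the spread of individual values, while the standard error describes the uncertainty of the estimated mean.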
Graphical Summary: Box plot
boxplot(height)
[Figure: box plot of height (axis 30 to 80), annotated with the 5%, 25%, 50% (median), 75% and 95% percentiles]
Strip chart
[Figure: strip chart of height, axis from 30 to 80]
Histogram
[Figure: "Histogram of height"; x-axis: height (30 to 90), y-axis: Frequency (0 to 250)]
Implications of the mean and SD
• “In the Vietnamese population aged 30+ years, the
average weight was 55.0 kg, with an SD of 8.2 kg.”
• What does this mean?
• If weight is normally distributed, about 68% of individuals
will have a weight within 55 +/- 8.2*1 = 46.8 to 63.2 kg
• About 95% of individuals will have a weight within
55 +/- 8.2*1.96 = 38.9 to 71.1 kg
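These ranges follow from the normal distribution and can be reproduced in a few lines. The sketch below assumes weight is normally distributed with the stated mean and SD; the variable names are illustrative.

```r
mu    <- 55.0  # mean weight (kg), from the slide
sigma <- 8.2   # standard deviation (kg), from the slide

range_68 <- mu + c(-1, 1) * sigma                 # mean +/- 1 SD
range_95 <- mu + c(-1, 1) * qnorm(0.975) * sigma  # mean +/- 1.96 SD

# Share of a normal distribution that falls inside each band
cover_1sd   <- pnorm(1) - pnorm(-1)
cover_196sd <- pnorm(1.96) - pnorm(-1.96)

round(range_68, 1)
round(range_95, 1)
round(c(cover_1sd, cover_196sd), 3)
```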
Implications of the mean and SD
• The distribution of weight of the entire population can be
shown to be:
[Figure: distribution of weight; x-axis: Weight (kg), 22 to 92; y-axis: Percent (%), 0 to 6; bands marked at 1SD and 1.96SD around the mean]
Summary of Categorical Data
• Categorical data:
– Gender: male, female
– Race: Asian, Caucasian, African
• Semi-quantitative data:
– Severity of disease: mild, moderate, severe
– Stages of cancer: I, II, III, IV
– Preference: dislike very much, dislike, equivocal, like, like
very much
Mean and variance of a proportion
• For an individual consumer i, the probability that he/she
prefers A is p_i. Assuming that all consumers are
independent and share the same preference probability, p_i = p.
• Variance of p_i is var(p_i) = p(1-p)
• For a sample of n consumers, the estimated probability of
preference for A is:
$$\bar{p} = \frac{p_1 + p_2 + p_3 + \dots + p_n}{n}$$
and the variance of p_bar is:
$$\mathrm{var}(\bar{p}) = \frac{p(1-p)}{n}$$
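The variance formula can be checked by simulation: draw many samples of n consumers, compute the proportion preferring A in each, and compare the variance of those proportions with p(1-p)/n. The values p = 0.8 and n = 10, the seed and the number of replicates are assumptions chosen for illustration.

```r
set.seed(4)  # assumed seed, for reproducibility only
p <- 0.8     # assumed true preference probability
n <- 10      # assumed number of consumers per sample

# Each replicate: the proportion preferring A among n independent consumers
p_bar <- rbinom(20000, size = n, prob = p) / n

sim_var    <- var(p_bar)        # variance observed across replicates
theory_var <- p * (1 - p) / n   # variance predicted by the formula

c(simulated = sim_var, theoretical = theory_var)
```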
Normal approximation of a binomial
distribution
• For an individual consumer i, the probability that he/she
prefers A is p_i. Assuming that all consumers are
independent and share the same preference probability, p_i = p.
• Variance of p_i is var(p_i) = p(1-p)
• For a sample of n consumers, the estimated probability of
preference for A is:
$$\bar{p} = \frac{p_1 + p_2 + p_3 + \dots + p_n}{n}$$
and the variance of p_bar is:
$$\mathrm{var}(\bar{p}) = \frac{p(1-p)}{n}$$
and standard deviation:
$$s = \sqrt{\frac{p(1-p)}{n}}$$
Normal approximation of a binomial
distribution - example
• 10 consumers, 8 preferred product A.
• Proportion of preference for A: p = 0.8
• Variance: var(p) = 0.8(0.2)/10 = 0.016
• Standard deviation of p: s = 0.126
• 95% CI of p: 0.8 ± 1.96(0.126) = 0.55 to 1.00 (upper limit truncated at 1)
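The same arithmetic in R, as a minimal sketch. The clipping step is an added assumption that reflects the fact that a proportion cannot exceed 1; the variable names are illustrative.

```r
k <- 8    # consumers preferring A
n <- 10   # consumers asked

p_hat <- k / n
se    <- sqrt(p_hat * (1 - p_hat) / n)      # standard deviation of p_hat
ci    <- p_hat + c(-1, 1) * 1.96 * se       # normal-approximation 95% CI
ci_clipped <- pmin(pmax(ci, 0), 1)          # keep the interval inside [0, 1]

round(c(p_hat = p_hat, se = se), 3)
round(ci_clipped, 2)
```

With only 10 observations the normal approximation is rough, which is why the unclipped upper limit runs past 1; base R's `binom.test(8, 10)$conf.int` gives an exact interval as an alternative.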
Descriptive Analyses
Continuous data
Paired t-test
• Continuous data
• Normally distributed
• Two samples are NOT independent
Paired t-test – an example
• The problem: Viewing certain meats under red light might
enhance judges' preference for the meat. 12 judges were asked
to score the redness of meat under red light and under white light.
Results:
Judge   Red   White
1       20    22
2       18    19
3       19    17
4       22    18
5       17    21
6       20    23
7       19    19
8       16    20
9       21    22
10      17    20
11      23    27
12      18    24
Paired t-test – analysis
Judge   Red light   White light   Difference (white – red)
1       20          22            2
2       18          19            1
3       19          17            -2
4       22          18            -4
5       17          21            4
6       20          23            3
7       19          19            0
8       16          20            4
9       21          22            1
10      17          20            3
11      23          27            4
12      18          24            6
Mean    19.2        21.0          1.83
SD      2.1         2.8           2.82
Mean difference: 1.83, SD of the differences: 2.82
Standard error (SE):
SD/sqrt(n) = 2.82/sqrt(12) = 0.81
T-test = (1.83 – 0)/0.81 = 2.25 (using unrounded values)
P-value = 0.0459
Conclusion: there was a
significant effect of light colour.
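Each quantity in this analysis can be recomputed step by step from the two score vectors. The sketch below takes the difference as white minus red, matching the table; the variable names are illustrative.

```r
red   <- c(20, 18, 19, 22, 17, 20, 19, 16, 21, 17, 23, 18)
white <- c(22, 19, 17, 18, 21, 23, 19, 20, 22, 20, 27, 24)

d      <- white - red          # per-judge difference
n      <- length(d)            # 12 judges
mean_d <- mean(d)              # mean difference
sd_d   <- sd(d)                # SD of the differences
se_d   <- sd_d / sqrt(n)       # standard error of the mean difference
t_stat <- mean_d / se_d        # t statistic against a true difference of 0
p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value, df = n - 1

round(c(mean = mean_d, sd = sd_d, se = se_d, t = t_stat, p = p_value), 4)
```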
Paired t-test – R analysis
red <- c(20,18,19,22,17,20,19,16,21,17,23,18)
white <- c(22,19,17,18,21,23,19,20,22,20,27,24)
t.test(red, white, paired=TRUE)
data: red and white
t = -2.2496, df = 11, p-value = 0.04592
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-3.6270234 -0.0396433
sample estimates:
mean of the differences
-1.833333
Two-sample t-test
Sample        Group 1   Group 2
1             x1        y1
2             x2        y2
3             x3        y3
4             x4        y4
5             x5        y5
…             …         …
n             xn        yn
Sample size   n1        n2
Mean          x̄         ȳ
SD            sx        sy
Mean difference: D = x̄ – ȳ
Variance of D:
T-statistic:
95% Confidence interval:
Two-group comparison: an example
20 consumers rated their preference for
two rice desserts (A and B)
ID   A   B      ID   A   B
1    3   3      11   5   3
2    7   1      12   8   4
3    1   2      13   5   2
4    9   4      14   9   3
5    3   5      15   4   5
6    4   2      16   6   4
7    1   2      17   4   3
8    2   5      18   3   1
9    6   3      19   9   3
10   7   2      20   5   2
Unpaired t-test using R
a<-c(3,7,1,9,3,4,1,2,6,7,5,8,5,9,4,6,4,3,9,5)
b<-c(3,1,2,4,5,2,2,5,3,2,3,4,2,3,5,4,3,1,3,2)
t.test(a, b)
Welch Two Sample t-test
data: a and b
t = 3.3215, df = 27.478, p-value = 0.002539
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8037895 3.3962105
sample estimates:
mean of x mean of y
     5.05      2.95
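The two-sample slide lists the variance of D, the t-statistic and the confidence interval without showing their formulas. One standard choice, and the one `t.test` applies by default, is Welch's unequal-variance form. The sketch below assumes that form and recomputes the output above by hand; the variable names are illustrative.

```r
a <- c(3, 7, 1, 9, 3, 4, 1, 2, 6, 7, 5, 8, 5, 9, 4, 6, 4, 3, 9, 5)
b <- c(3, 1, 2, 4, 5, 2, 2, 5, 3, 2, 3, 4, 2, 3, 5, 4, 3, 1, 3, 2)
n1 <- length(a)
n2 <- length(b)

d     <- mean(a) - mean(b)            # difference in means, D
var_d <- var(a) / n1 + var(b) / n2    # variance of D (Welch: no pooling)
se_d  <- sqrt(var_d)
t_stat <- d / se_d

# Welch-Satterthwaite approximation to the degrees of freedom
df <- var_d^2 / ((var(a) / n1)^2 / (n1 - 1) + (var(b) / n2)^2 / (n2 - 1))

ci <- d + c(-1, 1) * qt(0.975, df) * se_d   # 95% confidence interval for D

round(c(D = d, t = t_stat, df = df), 4)
round(ci, 4)
```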
Transformation of data: multiplicative effects
• The following data represent lysozyme levels in the gastric juice of
29 patients with peptic ulcer and of 30 normal controls. The question
of interest is whether lysozyme levels differ between the
two groups.
Group 1:
0.2 0.3 0.4 1.1 2.0 2.1 3.3 3.8 4.5 4.8 4.9 5.0 5.3 7.5 9.8 10.4
10.9 11.3 12.4 16.2 17.6 18.9 20.7 24.0 25.4 40.0 42.2 50.0 60.0
Group 2:
0.2 0.3 0.4 0.7 1.2 1.5 1.5 1.9 2.0 2.4 2.5 2.8 3.6 4.8 4.8 5.4 5.7
5.8 7.5 8.7 8.8 9.1 10.3 15.6 16.1 16.5 16.7 20.0 20.7 33.0
Unpaired t-test by R
g1 <- c( 0.2, 0.3, 0.4, 1.1, 2.0, 2.1, 3.3, 3.8,
4.5, 4.8, 4.9, 5.0, 5.3, 7.5, 9.8, 10.4,
10.9, 11.3, 12.4, 16.2, 17.6, 18.9, 20.7,
24.0, 25.4, 40.0, 42.2, 50.0, 60)
g2 <- c(0.2, 0.3, 0.4, 0.7, 1.2, 1.5, 1.5, 1.9, 2.0,
2.4, 2.5, 2.8, 3.6, 4.8, 4.8, 5.4, 5.7, 5.8,
7.5, 8.7, 8.8, 9.1, 10.3, 15.6, 16.1, 16.5,
16.7, 20.0, 20.7, 33.0)
t.test(g1, g2)
data: g1 and g2
t = 2.0357, df = 40.804, p-value = 0.04831
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
0.05163216 13.20239083
sample estimates:
mean of x mean of y
14.310345 7.683333
Exploration of data
par(mfrow=c(1,2))
hist(g1)
hist(g2)
Group 1: mean(g1) = 14.3, sd(g1) = 15.7
Group 2: mean(g2) = 7.7, sd(g2) = 7.8
[Figure: side-by-side panels "Histogram of g1" (x-axis: g1, 0 to 60) and "Histogram of g2" (x-axis: g2, 0 to 30); y-axis: Frequency]
Re-analysis of lysozyme data
log.g1 <- log(g1)
log.g2 <- log(g2)
t.test(log.g1, log.g2)
data: log.g1 and log.g2
t = 1.406, df = 55.714, p-value = 0.1653
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-0.2182472 1.2453165
sample estimates:
mean of x mean of y
1.921094 1.407559
exp(1.921-1.407) = 1.67
Group 1’s geometric mean is about 67% higher than group 2’s, but on the log scale the difference is not statistically significant (p = 0.17)
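Because the test was run on logged values, both the difference and its confidence interval can be back-transformed into a ratio. The sketch below uses the numbers printed in the output above rather than recomputing them from the data; the variable names are illustrative.

```r
# Values copied from the t.test output on the log scale
log_mean_g1 <- 1.921094
log_mean_g2 <- 1.407559
ci_log      <- c(-0.2182472, 1.2453165)  # 95% CI for the difference in log means

ratio    <- exp(log_mean_g1 - log_mean_g2)  # ratio of geometric means, g1 / g2
ci_ratio <- exp(ci_log)                     # 95% CI for that ratio

round(ratio, 2)
round(ci_ratio, 2)
```

The interval for the ratio includes 1, which is the multiplicative counterpart of a log-scale interval that includes 0.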
Descriptive analysis
Categorical data
Comparison of two proportions - theory
Group                  1     2
____________________________________________
Sample size            n1    n2
Number of events       e1    e2
Proportion of events   p1    p2
Difference:
D = p1 – p2
SE difference: SE = [p1(1–p1)/n1 + p2(1–p2)/n2]^(1/2)
Z = D / SE
95% CI: D ± 1.96(SE)
With (n1 + n2) > 20, and if |Z| > 2, it is possible to
reject the null hypothesis of equal proportions.
Comparison of two proportions - example
Thirty-day mortality of rats exposed to heroin
or cocaine (100 rats per group).
Group              Heroin   Cocaine
__________________________________________
Sample size        100      100
Number of deaths   90       36
Mortality rate     0.90     0.36
Analysis
Difference: D = 0.90 – 0.36 = 0.54
SE (D) = [0.9(0.1)/100 + 0.36(0.64)/100]^(1/2)
= 0.0566
Z = 0.54 / 0.0566 = 9.54
95% CI:
0.54 ± 1.96(0.0566)
0.43 to 0.65
Conclusion: reject the null
hypothesis.
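The hand calculation translates directly into R. This is a minimal sketch of the unpooled normal-approximation test described in the theory slide; the variable names are illustrative.

```r
deaths <- c(heroin = 90, cocaine = 36)
n      <- c(100, 100)

p  <- deaths / n                      # mortality rate in each group
d  <- unname(p[1] - p[2])             # difference in proportions, D
se <- sqrt(sum(p * (1 - p) / n))      # SE of the difference
z  <- d / se
ci <- d + c(-1, 1) * 1.96 * se        # 95% CI for D

round(c(D = d, SE = se, Z = z), 3)
round(ci, 2)
```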
Comparison of two proportions - R
deaths <- c(90, 36)
total <- c(100, 100)
prop.test(deaths, total)
2-sample test for equality of proportions with continuity correction
data: deaths out of total
X-squared = 60.2531, df = 1, p-value = 8.341e-15
alternative hypothesis: two.sided
95 percent confidence interval:
0.4190584 0.6609416
sample estimates:
prop 1 prop 2
  0.90   0.36
Comparison of >2 proportions –
Chi square analysis
table(sex, ethnicity)
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2
females <- c(4, 43, 22, 0)
total <- c(8, 60, 30, 2)
prop.test(females, total)
Comparison of >2 proportions –
Chi square analysis
4-sample test for equality of proportions without continuity correction
data: females out of total
X-squared = 6.2646, df = 3, p-value = 0.09942
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.5000000 0.7166667 0.7333333 0.0000000
Warning message:
Chi-squared approximation may be incorrect in:
prop.test(females, total)
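The warning appears because some cells of the table have small expected counts. The sketch below rebuilds the table, confirms that a chi-square test on it gives the same statistic, counts the cells with expected values below 5, and runs Fisher's exact test as one possible alternative. Fisher's test is a substitution suggested here, not a method used in the slides.

```r
counts <- matrix(c(4, 43, 22, 0,
                   4, 17,  8, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Female", "Male"),
                                 ethnicity = c("African", "Asian",
                                               "Caucasian", "Others")))

# Chi-square test on the 2 x 4 table (same statistic as prop.test above)
chi <- suppressWarnings(chisq.test(counts))
chi$statistic
round(chi$expected, 2)

# Cells with an expected count below 5 are what trigger the warning
n_small <- sum(chi$expected < 5)

# Fisher's exact test does not rely on the large-sample approximation
exact <- fisher.test(counts)
exact$p.value
```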
Summary
• Examine the distribution of data
– Mean and variance: systematic difference?
– Normally distributed?
• Transformation?
• Present confidence intervals (and p-values)