Download Lecture 3 (Jan 20, 2003)

Lecture 3
(Jan 19, 2004)
2.6 Numerical Summaries of Quantitative Variables
We have seen many ideas in this section before.
Notation for raw data:
n: the number of individuals in a data set.
x1, x2, … , xn: the individual raw data values
Mean vs. Median
1. How to calculate the mean and median; mean: $\bar{x} = \frac{\sum x_i}{n}$.
2. influence of outliers on the mean and median;
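A small sketch of the outlier effect, using made-up heights (the numbers are an assumption, not data from the lecture). Adding one extreme value pulls the mean up noticeably, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical heights in inches; not data from the lecture.
heights = [62, 64, 65, 66, 67, 68, 70]
# Same data plus one implausibly large value acting as an outlier.
with_outlier = heights + [120]

mean_before, median_before = mean(heights), median(heights)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("without outlier: mean =", mean_before, " median =", median_before)
print("with outlier:    mean =", mean_after, " median =", median_after)
```

The mean shifts by several inches while the median changes by half an inch, which is why the median is often preferred for skewed data or data with outliers.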
Spread of data:
1. Q1: lower quartile;
2. Q3: upper quartile;
3. Range: high extreme – low extreme;
4. Interquartile range: Q3 – Q1, notation as: IQR;
Percentiles: in general, the kth percentile is a number that has k% of the data
values at or below it and (100 – k)% of the data values at or above it.
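The quantities above can be computed as in the sketch below. The data set is hypothetical, and the quartile convention is an assumption: Python's `statistics.quantiles` with its default "exclusive" method is only one of several conventions, so other software may give slightly different quartiles on the same data.

```python
from statistics import quantiles

# Hypothetical data set; not from the lecture.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 15]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, med, q3 = quantiles(data, n=4)

data_range = max(data) - min(data)   # high extreme - low extreme
iqr = q3 - q1                        # interquartile range

# Percentile check for Q1 (the 25th percentile):
# share of values at or below it, and at or above it.
share_at_or_below = sum(x <= q1 for x in data) / len(data)
share_at_or_above = sum(x >= q1 for x in data) / len(data)

print("Q1 =", q1, " median =", med, " Q3 =", q3)
print("range =", data_range, " IQR =", iqr)
print("at or below Q1:", share_at_or_below, " at or above Q1:", share_at_or_above)
```

With a small data set the shares cannot hit 25% and 75% exactly, so in practice the definition is read as "at least k% at or below and at least (100 – k)% at or above".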
Boxplot and interpretation:
In Minitab, we use Graph >> Boxplot, and then choose the variable that we
are interested in as Y, and an appropriate X.
Using the ideal height data again, we compare males and females:
[Figure: side-by-side boxplots of IdealHt (vertical axis, 60 to 85) by Gender (female, male).]
1. Label Y-axis with numbers from the minimum to maximum of the
data;
2. The lower end of the box is Q1 and the upper end is Q3;
3. The line in the middle is the median;
4. Draw a line extending from the Q1 end to the smallest data value that
is no more than 1.5*IQR below Q1, and a line extending from the
Q3 end to the largest data value that is no more than 1.5*IQR above Q3;
5. The remaining points are treated as outliers and are marked
with asterisks at their proper positions.
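The boxplot steps above can be sketched numerically as follows. The heights are hypothetical (not the lecture's ideal height data), and the quartile method is Python's default, which may differ slightly from other software.

```python
from statistics import quantiles

# Hypothetical ideal heights in inches; not the lecture's data set.
ideal_ht = [60, 64, 65, 66, 66, 67, 68, 69, 70, 71, 85]

# Box: Q1 (lower end), median (middle line), Q3 (upper end).
q1, med, q3 = quantiles(ideal_ht, n=4)
iqr = q3 - q1

# Limits beyond which a point counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data values still inside the limits.
inside = [x for x in ideal_ht if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Everything outside the limits would be drawn as an asterisk.
outliers = [x for x in ideal_ht if x < lower_fence or x > upper_fence]

print("box:", q1, med, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Note that the whiskers end at actual data values, not at the 1.5*IQR limits themselves.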
2.7 Bell-shaped Distributions of Numbers
Nature seems to follow a predictable pattern for many kinds of
measurements. Most individuals are concentrated around the center, and
the greater the distance a value is from the center, the fewer individuals
have the value. Examples include human height and weight.
Numerical variables that follow this pattern are said to follow a bell-shaped
curve; many of them have a specific distribution called the normal
distribution.
Look at the actual data of height:
[Figure: Histogram of Height, with Normal Curve. Vertical axis: Frequency (0 to 50); horizontal axis: Height (60 to 90).]
To produce this graph in Minitab, use Stat >> Basic Statistics >> Display
Descriptive Statistics, choose the column of interest, click Graphs, and
choose Histogram with normal curve.
Standard Deviation:
Standard deviation: roughly, the average distance that values fall from the
mean.
Sample mean: $\bar{x}$;
Population mean: $\mu$; note: usually we do NOT know this value. We want
to draw inferences about this population mean.
Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$;
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$.
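As a quick check of these formulas, the sketch below computes the sample standard deviation both ways (definition and shortcut form) on a small hypothetical data set and compares the results with Python's built-in `statistics.stdev`. For the population formula it assumes, for illustration only, that the list is the entire population, so that the population mean equals the list's mean.

```python
from math import sqrt, isclose
from statistics import mean, stdev, pstdev

# Hypothetical data; not from the lecture.
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
x_bar = mean(x)

# Definition: sum of squared deviations from the mean, divided by n - 1.
s_def = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))

# Shortcut form: (sum of squares - n * mean^2) / (n - 1).
s_short = sqrt((sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1))

# Population form, ASSUMING the list is the whole population (mu = x_bar):
# divide by n instead of n - 1.
sigma = sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)

print("s (definition):", s_def)
print("s (shortcut):  ", s_short)
print("sigma:         ", sigma)
```

The two sample formulas agree, and the population version is slightly smaller because it divides by n rather than n - 1.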
Empirical Rule: for any bell-shaped curve, approximately:
1. 68% of the values fall within 1 standard deviation of the mean in
either direction;
2. 95% of the values fall within 2 standard deviations of the mean in
either direction;
3. 99.7% of the values fall within 3 standard deviations of the mean in
either direction.
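For an exactly normal distribution, these percentages can be checked from the normal curve itself. The sketch below uses Python's `statistics.NormalDist` for a standard normal (mean 0, standard deviation 1); for real bell-shaped data the percentages hold only approximately.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# Proportion of the curve within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```

Rounded, the three proportions are about 68%, 95%, and 99.7%, matching the rule.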
Standardized z-score:
$z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$.
The z-score shows how far a given point is from the mean, measured in
standard deviations.
Empirical Rule in terms of z-scores: for any bell-shaped curve, approximately:
1. 68% of the values have z-score between -1 and 1;
2. 95% of the values have z-score between -2 and 2;
3. 99.7% of the values have z-score between -3 and 3.
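A minimal sketch of standardizing data, using hypothetical heights and a helper named `z_score` (both are assumptions for illustration). The sample mean and sample standard deviation stand in for the mean and standard deviation in the formula.

```python
from statistics import mean, stdev

def z_score(observed, mean_value, std):
    """Distance of an observation from the mean, in standard deviations."""
    return (observed - mean_value) / std

# Hypothetical heights in inches; not data from the lecture.
heights = [62, 64, 65, 66, 67, 68, 70]
x_bar = mean(heights)
s = stdev(heights)

z = [z_score(h, x_bar, s) for h in heights]

for h, zi in zip(heights, z):
    print(f"height {h}: z = {zi:+.2f}")
```

A value equal to the mean gets z = 0, values below the mean get negative z-scores, and the standardized values themselves have mean 0 and standard deviation 1.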