Methods for Describing Sets of Data
2.1 Describing Qualitative Data
Example
Operations Performed at a Hospital last year
Type of operation             Number
Thoracic                      20
Bones and joints              45
Eye, ear, nose, and throat    58
General                       98
Abdominal                     115
Urologic                      74
Neurosurgery                  23
Other                         65
[Figure: bar graph of the number of operations by type; vertical axis: Sum of Frequency]
Bar Graph: displays the number or percentage of items in each category using bars
• Used for qualitative data
• The height of a bar represents the quantity we wish to compare
• The bars should be of uniform width and uniformly spaced
2.2 Graphical Methods for Describing Quantitative Data
Stem and Leaf – separates data entries into “leading digits” or “stems” and “trailing digits” or “leaves”
• A device that organizes and groups data but allows us to recover the original data if desired
• Good for spotting extreme values and patterns
Example
14 male weights in pounds
139, 153, 179, 201, 163, 168, 157,
170, 172, 165, 145, 155, 161, 151
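The stem-and-leaf display for these weights can be built mechanically. The sketch below assumes two- or three-digit whole-number data, with the last digit as the leaf and the remaining digits as the stem; the function name `stem_and_leaf` is illustrative, not part of the notes.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # Include empty stems so that gaps in the data stay visible.
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}

weights = [139, 153, 179, 201, 163, 168, 157, 170, 172, 165, 145, 155, 161, 151]
display = stem_and_leaf(weights)
for stem, leaf_list in display.items():
    print(f"{stem:>2} | {''.join(str(d) for d in leaf_list)}")
```

The empty stems 18 and 19 make the extreme value 201 easy to spot, and every original weight can be read back from the display.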
Frequency distribution – a summary table in which the data are arranged into conveniently established class groupings
• useful when dealing with very large data sets
• through the grouping process the original data values are lost
• should have between 5 and 15 classes
• each class grouping should be of equal width
• the classes must not overlap
class midpoint – the point halfway between the boundaries of each class
Weight                    Number of males
130 but less than 140     1
140 but less than 150     1
150 but less than 160     4
160 but less than 170     4
170 but less than 180     3
180 but less than 190     0
190 but less than 200     0
200 but less than 210     1
Total                     14
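The table above can be reproduced by counting how many weights fall in each class. This is a minimal sketch assuming equal-width classes of the form "lower but less than upper"; `frequency_distribution` and its parameters are illustrative names.

```python
def frequency_distribution(values, start, width, n_classes):
    """Count the values in each class [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts

weights = [139, 153, 179, 201, 163, 168, 157, 170, 172, 165, 145, 155, 161, 151]
counts = frequency_distribution(weights, start=130, width=10, n_classes=8)
for i, count in enumerate(counts):
    lower = 130 + 10 * i
    # The class midpoint is halfway between the class boundaries.
    print(f"{lower} but less than {lower + 10}: {count} (midpoint {lower + 5})")
```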
Histogram – a picture of a frequency distribution
• differs from a bar chart
• used for quantitative data
[Figure: histogram of the 14 male weights; horizontal axis: class midpoints 135 to 205; vertical axis: frequency 0 to 4]
[Figure: example histograms labeled Symmetric, Uniform, Skewed Right, Skewed Left, and Bimodal]
There are several terms that are commonly used to describe histograms
• Symmetrical – both sides are the same when the graph is folded along a vertical line through its center
• Uniform – every class has equal frequency (bars are the same height)
• Skewed Left or Skewed Right – one tail is stretched longer than the other. The direction of the skewness is the side of the longer tail.
• Bimodal – the two classes with the largest frequencies are separated by at least one class (can indicate that we are sampling from two populations)
2.3 Summation Notation
The sum of values, $x_1 + x_2 + \cdots + x_n$, can be denoted as $\sum_{i=1}^{n} x_i$
Example
Select 4 students and ask “how many brothers and sisters do you have?”
• Data: 2, 3, 1, 3
$\sum_{i=1}^{4} x_i = 2 + 3 + 1 + 3 = 9$
Or we can write $\sum x = 9$
Properties of sums (c is a constant)
$\sum cx = c \sum x$
$\sum c = nc$
$\sum (x + c) = \sum x + \sum c = \sum x + nc$
Example
Solve the following:
$\sum 4x$
$\sum (x + 3)$
$\sum 4x + 3$
$\sum (4x + 3)^2$
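The properties of sums can be checked numerically on the sibling data (2, 3, 1, 3). In this sketch the constant `c = 4` is an arbitrary choice for illustration.

```python
data = [2, 3, 1, 3]
c = 4  # arbitrary constant, chosen only for this illustration
n = len(data)

total = sum(data)                        # sum of x
sum_cx = sum(c * x for x in data)        # sum of c*x
sum_c = sum(c for _ in data)             # sum of the constant c over n terms
sum_x_plus_c = sum(x + c for x in data)  # sum of (x + c)

print(total, sum_cx, sum_c, sum_x_plus_c)
```

Each result agrees with the corresponding property: `sum_cx` equals `c * total`, `sum_c` equals `n * c`, and `sum_x_plus_c` equals `total + n * c`.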
2.4 Numerical Measures of Central Tendency
Description of Average (Typical Value)
Sample Mean: $\bar{x} = \dfrac{\sum x}{n}$
Example
number of siblings – Data: 2, 3, 1, 3
$\bar{x} = 9/4 = 2.25$
Suppose we had selected a 5th person for our sample who has 10 siblings.
New Data: 2, 3, 1, 3, 10
$\bar{x} = 19/5 = 3.8$
• The sample mean is sensitive to extreme values and does not have to be a possible data value.
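A short sketch of the sample mean on the sibling data, showing how a single extreme value pulls it upward. The function name `sample_mean` is illustrative.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

siblings = [2, 3, 1, 3]
with_extreme = siblings + [10]  # add the 5th person with 10 siblings

mean_before = sample_mean(siblings)
mean_after = sample_mean(with_extreme)
print(mean_before, mean_after)
```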
Sample Median, $\tilde{x}$
• rank the data from smallest to largest
• if n is odd, the median is the middle score
• if n is even, the median is the mean of the two middle scores
Example
number of siblings – Data: 2, 3, 1, 3
Ranked: 1, 2, 3, 3, so $\tilde{x} = (2 + 3)/2 = 2.5$
New Data: 2, 3, 1, 3, 10
Ranked: 1, 2, 3, 3, 10, so $\tilde{x} = 3$
• The sample median is not sensitive to extreme scores
• Half the data fall above the sample median and half below it
• The median is a better measure of central tendency if extreme scores exist.
• If extreme scores are unlikely, the mean varies less from sample to sample than the median and is a better measure.
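The ranking procedure for the median can be sketched as follows; `sample_median` is an illustrative name.

```python
def sample_median(values):
    """Middle score of the ranked data, or the mean of the two middle scores."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

median_before = sample_median([2, 3, 1, 3])
median_after = sample_median([2, 3, 1, 3, 10])
print(median_before, median_after)
```

Adding the extreme value 10 moves the median only from 2.5 to 3, far less than it moves the mean.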
Shape of a distribution
• If the distribution is right skewed, $\bar{x} > \tilde{x}$
• If the distribution is symmetric, $\bar{x} = \tilde{x}$
• If the distribution is left skewed, $\bar{x} < \tilde{x}$
sample mode: most frequent score
Example
number of siblings – Data: 2, 3, 1, 3
Mode = 3
New Data: 2, 3, 1, 3, 10
Mode = 3
• The mode does not always exist, and there can be more than one
• It is unstable: it can change considerably from one sample to the next
• Should be used with qualitative data
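sample midrange = $\dfrac{\text{Low} + \text{High}}{2}$
Example
number of siblings – Data: 2, 3, 1, 3
Midrange = $\dfrac{\text{Low} + \text{High}}{2} = \dfrac{1 + 3}{2} = 2$
New Data: 2, 3, 1, 3, 10
Midrange = $\dfrac{\text{Low} + \text{High}}{2} = \dfrac{1 + 10}{2} = 5.5$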
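The mode and midrange of the sibling data can be computed with a few lines. In this sketch `sample_modes` returns every value tied for the highest frequency, which is one way to handle data with more than one mode; both function names are illustrative.

```python
from collections import Counter

def sample_modes(values):
    """Return all values that share the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    # If every value occurs once, all of them are returned (no useful mode).
    return sorted(v for v, count in counts.items() if count == top)

def midrange(values):
    """Halfway point between the lowest and highest scores."""
    return (min(values) + max(values)) / 2

siblings = [2, 3, 1, 3]
with_extreme = siblings + [10]
print(sample_modes(siblings), sample_modes(with_extreme))
print(midrange(siblings), midrange(with_extreme))
```

The mode stays at 3 when the extreme value is added, while the midrange jumps from 2 to 5.5.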
 Midrange is totally dependent on extreme scores.
2.5 Numerical Measures of Variability
Distribution #1
1 |
2 | 5
3 | 5555555
4 | 5
5 |
$\bar{x} = 35$, $\tilde{x} = 35$, mode = 35, midrange = 35
Distribution #2
1 | 5
2 | 55
3 | 555
4 | 55
5 | 5
$\bar{x} = 35$, $\tilde{x} = 35$, mode = 35, midrange = 35
Sample Range = High - Low
Example
Years of experience of faculty - Data: 1, 30, 22, 10, 5
Range = 30 - 1 = 29
• Range is sensitive to extreme scores (it is based entirely on the high and low)
• Range is easy to compute
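sample variance - measures approximately the average squared distance of the data points from $\bar{x}$ (the divisor is n - 1 rather than n)
$S^2 = \dfrac{\text{SSX}}{n - 1} = \dfrac{\sum (x - \bar{x})^2}{n - 1}$, where SSX (Sum of Squared X) is the sum of squared deviations from $\bar{x}$
• Large values suggest large variability
• Difficult to interpret since it is in squared units
• Can never be negative
Equivalent computing formula: $S^2 = \dfrac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}$
Example
Years of experience of faculty - Data: 1, 30, 22, 10, 5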
sample standard deviation – measures the typical distance of the data points from $\bar{x}$
$S = \sqrt{S^2}$
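The faculty-experience data can be used to work through the range, the variance (by the definition and by the computing formula), and the standard deviation. This is a sketch; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Computing formula: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

years = [1, 30, 22, 10, 5]
data_range = max(years) - min(years)
variance = sample_variance(years)
variance_shortcut = sample_variance_shortcut(years)
std_dev = math.sqrt(variance)
print(data_range, round(variance, 1), round(variance_shortcut, 1), round(std_dev, 2))
```

Both variance formulas give about 146.3 (in squared years), and the standard deviation of about 12.1 is back in years.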
 Standard deviation is in the same units as the data
2.6 Interpreting the Standard Deviation
Z-score – gives the number of standard deviations an observation is above or below the mean
$z = \dfrac{x - \bar{x}}{s}$
Example
Test scores
$\bar{x} = 79$, s = 9
If your score is 88%, what is your z-score?
If your score is 63%, what is your z-score?
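The two questions above can be answered directly from the z-score formula. A minimal sketch, with `z_score` as an illustrative function name:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_88 = z_score(88, mean=79, sd=9)
z_63 = z_score(63, mean=79, sd=9)
print(round(z_88, 2), round(z_63, 2))
```

A score of 88 is one standard deviation above the mean, and a score of 63 is about 1.78 standard deviations below it.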
Empirical rule
(For mound shaped distributions)
• Approximately 68% of the data fall within 1 standard deviation of the mean, $(\bar{x} - s, \bar{x} + s)$
• Approximately 95% of the data fall within 2 standard deviations of the mean, $(\bar{x} - 2s, \bar{x} + 2s)$
• Approximately 99.7% of the data fall within 3 standard deviations of the mean, $(\bar{x} - 3s, \bar{x} + 3s)$
Example
Suppose that the amount of liquid in “12 oz.” Pepsi cans is a mound shaped distribution with $\bar{x} = 12$ oz. and s = 0.1 oz.
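For the Pepsi example, the empirical rule intervals follow from the mean and standard deviation. The sketch below assumes the distribution really is mound shaped; `empirical_rule_intervals` is an illustrative name.

```python
def empirical_rule_intervals(mean, sd):
    """Return (approximate percent, lower limit, upper limit) for k = 1, 2, 3."""
    shares = {1: 68, 2: 95, 3: 99.7}
    return [(pct, mean - k * sd, mean + k * sd) for k, pct in shares.items()]

intervals = empirical_rule_intervals(mean=12, sd=0.1)
for pct, low, high in intervals:
    print(f"about {pct}% of cans hold between {low:.1f} and {high:.1f} oz.")
```

So roughly 95% of cans would be expected to hold between 11.8 and 12.2 oz.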
2.7 Numerical Measures of Relative Standing
Percentiles – give the percentage of the data falling below an observation
Quartiles – divide the data into four equally sized parts
• Q1 - First Quartile: 25th percentile
• Q2 - Second Quartile ($\tilde{x}$): 50th percentile
• Q3 - Third Quartile: 75th percentile
Procedure to Compute Quartiles
• Order the data from smallest to largest
• Find $\tilde{x}$. This is Q2
• Q1 is the median of the lower half of the data; that is, it is the median of the data falling below Q2 (not including Q2)
• Q3 is the median of the upper half of the data; that is, it is the median of the data falling above Q2 (not including Q2)
Interquartile range (IQR) = Q3 – Q1
• Range of the middle 50% of the data
5 number summary – the low score, Q1, Q2, Q3, and the high score
Example
Amount of money individuals have at a meeting
Students
0 | 0013555678
1 | 0
2 |
3 |
4 |
5 |
6 |
7 |
Faculty
0 |
1 | 055
2 | 04588
3 | 1
4 | 3
5 |
6 |
7 | 3
5 number summaries
          Students    Faculty
Low       0           10
Q1        1           15
Q2        5           25
Q3        7           31
High      10          73
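The quartile procedure can be applied to both groups to reproduce the 5 number summaries. In this sketch the halves are split by position in the ranked data, leaving out the middle score when n is odd; the function names are illustrative.

```python
def median(ranked):
    """Median of data that is already ranked from smallest to largest."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

def five_number_summary(values):
    """Return (low, Q1, Q2, Q3, high) using the median-of-halves procedure."""
    ranked = sorted(values)
    n = len(ranked)
    lower = ranked[: n // 2]         # scores below the middle position
    upper = ranked[(n + 1) // 2 :]   # scores above the middle position
    return (ranked[0], median(lower), median(ranked), median(upper), ranked[-1])

students = [0, 0, 1, 3, 5, 5, 5, 6, 7, 8, 10]
faculty = [10, 15, 15, 20, 24, 25, 28, 28, 31, 43, 73]
print(five_number_summary(students))
print(five_number_summary(faculty))
```

The interquartile range follows as Q3 - Q1: 6 for the students and 16 for the faculty.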
2.8 Methods for Detecting Outliers: Box Plots and z-Scores
Box Plots
• The box goes from Q1 to Q3 and represents the IQR
• The line through the box is Q2 ($\tilde{x}$)
• Extreme values are identified by *’s
• Lines, called whiskers, run from Q1 to the lowest value and from Q3 to the highest value (if the low or high is extreme, the whisker goes to the next value)
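These notes do not say how far out a value must be to count as extreme. A common convention, assumed here and not taken from the notes, flags values more than 1.5 × IQR below Q1 or above Q3. The sketch applies that rule to the faculty data using the quartiles Q1 = 15 and Q3 = 31; `box_plot_parts` and the parameter `k` are illustrative names.

```python
def box_plot_parts(values, q1, q3, k=1.5):
    """Return (low whisker end, high whisker end, extreme values).

    Assumes the k * IQR fence convention, which is not stated in the notes.
    """
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    extreme = sorted(v for v in values if v < low_fence or v > high_fence)
    # Each whisker stops at the most distant value that is not extreme.
    return min(inside), max(inside), extreme

faculty = [10, 15, 15, 20, 24, 25, 28, 28, 31, 43, 73]
whisker_low, whisker_high, extreme = box_plot_parts(faculty, q1=15, q3=31)
print(whisker_low, whisker_high, extreme)
```

Under this assumption 73 is marked as extreme and the upper whisker stops at 43, the next value.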
[Figure: side-by-side box plots of the amount of money for Students and Faculty; vertical axis from 0 to 80]
Example
How does tread design affect an automobile's stopping distance? (One Qualitative and One Quantitative Variable)
[Figure: stopping distance by tread design (A, B, C)]
2.9 Graphing Bivariate Relationships
Two Quantitative Variables
Plot observed data on a graph
Horizontal (X axis): independent variable
Vertical (Y axis): dependent variable
We call the graph a scatter diagram or scatter plot
Example
X = Dosage of Drug
Y = Reduction in Blood Pressure
X      Y
100    10
200    18
300    32
400    44
500    56
Note the strong relationship between X and Y
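One way to put a number on the strength of this relationship is Pearson's correlation coefficient r. The notes do not define r, so the formula in this sketch is an added assumption: r is the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (an assumption; not defined in the notes)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

dosage = [100, 200, 300, 400, 500]
reduction = [10, 18, 32, 44, 56]
r = pearson_r(dosage, reduction)
print(round(r, 3))
```

A value of r close to +1 is consistent with the strong positive linear pattern in the table.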
[Figures: example scatter plots illustrating perfect positive linear correlation, perfect negative linear correlation, positive linear correlation, negative linear correlation, non-linear correlation, and no correlation]
2.10 Distorting the Truth with Descriptive Statistics
One trick with graphs is to not start the scale at zero.
This makes small differences look much bigger.
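The size of the distortion can be sketched by comparing how tall two bars appear relative to each other when the axis starts at zero and when it starts just below the smaller value. The values 50 and 52 and the axis start of 48 are hypothetical numbers chosen for illustration.

```python
def apparent_ratio(a, b, axis_start=0):
    """Ratio of the drawn bar heights for values a and b on an axis starting at axis_start."""
    return (b - axis_start) / (a - axis_start)

honest = apparent_ratio(50, 52)                    # axis starts at zero
distorted = apparent_ratio(50, 52, axis_start=48)  # axis starts at 48
print(honest, distorted)
```

With the scale starting at zero the second bar is drawn 4% taller; with the scale starting at 48 it is drawn twice as tall, although the data are the same.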