6.1 What is Statistics?
• Definition: Statistics – the science of collecting,
analyzing, and interpreting data in such a way
that the conclusions can be objectively
evaluated.
3 Phases:
1. Collecting data
2. Analyzing data
3. Interpreting data
6.1 What is Statistics?
• Descriptive Statistics – summarize and describe a
characteristic of a group
example: batting average
• Inferential Statistics – used to estimate, infer, or
conclude something about a larger group
example: polls
• Sample – subset of the group of data available for
analysis
6.1 What is Statistics?
• Population – the entire set
• Bias – favoring of certain outcomes over
others
• Census – collects data from all members of
the population
• Parameter – characteristic value of a
population
• Statistic – characteristic value of a sample
6.2 Organizing Data
• Stem and Leaf Diagram:
data – 35, 52, 37, 44, 51, 48, 45, 12
Stem | Leaves
5    | 1 2
4    | 4 5 8
3    | 5 7
2    |
1    | 2
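A stem-and-leaf diagram like the one above can be built mechanically. The following is a minimal sketch, assuming two-digit data where the tens digit is the stem and the ones digit is the leaf; the function name `stem_and_leaf` is illustrative only.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem) and collect the ones digits (leaves)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Walk from the largest stem down to the smallest, keeping empty stems.
    return {stem: leaves.get(stem, [])
            for stem in range(max(leaves), min(leaves) - 1, -1)}

data = [35, 52, 37, 44, 51, 48, 45, 12]
diagram = stem_and_leaf(data)
for stem, row in diagram.items():
    print(stem, "|", " ".join(str(leaf) for leaf in row))
```

Stem 2 is kept with no leaves so that the gap in the data stays visible.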
6.2 Organizing Data
• Frequency Table:
data – 35, 52, 37, 44, 51, 48, 45, 12
Range    Frequency
50-59    2
40-49    3
30-39    2
20-29    0
10-19    1
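The same counts can be produced in code. This is a small sketch, assuming classes of width 10 that start at multiples of 10, as in the table above.

```python
from collections import Counter

data = [35, 52, 37, 44, 51, 48, 45, 12]

# Map each value to the lower bound of its class (for example, 37 -> 30).
counts = Counter((v // 10) * 10 for v in data)

# List every class from highest to lowest, including empty ones.
table = [(low, low + 9, counts.get(low, 0)) for low in range(50, 0, -10)]
for low, high, freq in table:
    print(f"{low}-{high}: {freq}")
```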
6.3 Displaying Data
• Ways to display data:
– Frequency histogram
– Relative frequency histogram
– Multiple bar graph
– Stacked bar graph
– Line graph
– Pie chart
6.3 Displaying Data
Frequency Histogram
[Figure: frequency histogram. Vertical axis from 0 to 30; horizontal axis classes 1 through 8; one data series.]
6.3 Displaying Data
Relative Frequency Histogram
[Figure: relative frequency histogram. Vertical axis "Relative Frequency" from 0 to 0.3; horizontal axis classes 1 through 8; one data series.]
6.3 Displaying Data
Multiple Bar Graph
[Figure: multiple bar graph. Vertical axis from 0 to 5000; series: lower, upper, graduate; categories: Arts, Nurs, Socsci, Natsci, Edu, Comm.]
6.3 Displaying Data
Stacked Bar Graph
[Figure: stacked bar graph. Vertical axis from 0 to 8000; stacked series: lower, upper, graduate; categories: Arts, Nurs, Socsci, Natsci, Edu, Comm.]
6.3 Displaying Data
Line Graph
[Figure: line graph. Vertical axis from 0 to 5000; series: lower, upper, graduate; categories: Arts, Nurs, Socsci, Natsci, Edu, Comm.]
6.3 Displaying Data
Pie Chart
[Figure: pie chart with slices Comm, Edu, Natsci, Socsci, Nurs, Arts.]
6.4 Measures of Central Tendency
• Central Tendency – the propensity of data to
be located or clustered about some point.
• Arithmetic Mean – sum of the values of all the
observations divided by the total number of
observations
• For sample data, the mean is
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
6.4 Measures of Central Tendency
• For population data,
the mean is
$\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median – the median is the middle value
of a set of data when data is arranged in
ascending order
6.4 Measures of Central Tendency
• Finding the median:
1. Arrange the data in increasing order or
decreasing order.
2. Determine if n is even or odd.
a. If n is odd, pick the middle value
b. If n is even, take the average of the two
middle values
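The steps above can be sketched in code as follows. The function name `median` and the sample data are chosen for illustration.

```python
def median(values):
    ordered = sorted(values)      # step 1: arrange in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # step 2a: n is odd, pick the middle value
        return ordered[mid]
    # step 2b: n is even, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [35, 52, 37, 44, 51, 48, 45, 12]
print(median(data))        # eight values: average of 44 and 45
print(median(data[:7]))    # seven values: the middle one
```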
6.4 Measures of Central Tendency
• Mode – is the value or values that occur most
frequently.
Note: If all values occur with the same frequency,
then there is no mode.
• Symmetric Distribution: the mean, median, and mode coincide.
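The definition of the mode, including the note about the no-mode case, can be sketched as below. The helper name `modes` is hypothetical. Treating a data set with only one distinct value as having that value as its mode is an assumption of this sketch.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3, 3]))   # two values tie for most frequent
print(modes([1, 2, 3]))         # all frequencies equal: no mode
```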
6.4 Measures of Central Tendency
• Distribution skewed to the left:
mean < median < mode
• Distribution skewed to the right:
mode < median < mean
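To illustrate how skew orders the three measures, here is a small sketch on made-up data. The values are hypothetical and chosen only so that one large observation stretches the right tail.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: one large value pulls the mean upward.
right_skewed = [1, 2, 2, 3, 4, 5, 18]
# Mirror image of the same data, which is skewed to the left.
left_skewed = [-v for v in right_skewed]

for name, values in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "mode:", mode(values), "median:", median(values), "mean:", mean(values))
```

In the right-skewed set the mode is smallest and the mean is largest; mirroring the data reverses the order.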
6.5 Measures of Variability
• Definition: The range of a set of n
measurements, x_1, x_2, x_3, …, x_n, is the
difference between the largest and the
smallest measurements.
• Variance – the population variance is
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
6.5 Measures of Variability
Problem with the variance: the units are the original
units squared.
• Standard deviation – population standard
deviation is the square root of the population
variance.
• Sample variance –
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• s = square root of the
sample variance
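The population and sample variance definitions differ only in the divisor. A minimal sketch follows; the function names and the data are illustrative.

```python
from math import sqrt

def population_variance(values):
    mu = sum(values) / len(values)
    # Divide the sum of squared deviations by N.
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    x_bar = sum(values) / len(values)
    # Divide the sum of squared deviations by n - 1.
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, squared deviations sum to 32
print(population_variance(data), sqrt(population_variance(data)))
print(sample_variance(data), sqrt(sample_variance(data)))
```

Taking the square root returns the measure to the original units, which addresses the units problem noted above.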
6.5 Measures of Variability
• Shortcut formulas for s² and σ² are given
on page 495 (provided with test).
• Shortcut formula for frequency data is
given on page 499 (provided with test).
• Shortcut formulas are genuinely easier to
calculate.
• Approximating the standard deviation:
s ≈ R/4, where R is the range.
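As a rough check of the range approximation, the sketch below compares R/4 with the sample standard deviation on the data set used earlier in the chapter. It is only an illustration; how close the two come depends on the data.

```python
from statistics import stdev

data = [35, 52, 37, 44, 51, 48, 45, 12]

data_range = max(data) - min(data)   # R
approx_s = data_range / 4            # range-based approximation of s
exact_s = stdev(data)                # sample standard deviation (n - 1 divisor)

print(approx_s, exact_s)
```

Here the outlying value 12 inflates the exact standard deviation, so the approximation comes out lower than s.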
6.6 Measures of Relative Position
• pth percentile – for data arranged in increasing
order, the value such that p% of the data are less
than that value and (100 – p)% of the data are
greater than that value.
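One way to read this definition is to ask what percentage of the data falls below a given value. The helper `percent_below` is a hypothetical name for this sketch, which counts strictly smaller values.

```python
def percent_below(values, x):
    """Percentage of the data strictly less than x."""
    return 100 * sum(1 for v in values if v < x) / len(values)

data = [12, 35, 37, 44, 45, 48, 51, 52]   # already in increasing order

print(percent_below(data, 45))   # share of the data below 45
print(percent_below(data, 52))   # share of the data below the largest value
```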
6.6 Measures of Relative Position
• Z-scores – The sample z-score for a
measure x is:
$z = \frac{x - \bar{x}}{s}$
The population z-score for a measure x is:
$z = \frac{x - \mu}{\sigma}$
A z-score represents the number of standard
deviations a measure lies from the mean.
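The two z-score definitions can be sketched with Python's standard `statistics` module. The function names and the data are illustrative.

```python
from statistics import mean, pstdev, stdev

def sample_z_score(x, values):
    # Uses the sample standard deviation s (n - 1 divisor).
    return (x - mean(values)) / stdev(values)

def population_z_score(x, values):
    # Uses the population standard deviation sigma (N divisor).
    return (x - mean(values)) / pstdev(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population standard deviation 2

print(population_z_score(9, data))   # 9 is two standard deviations above the mean
print(sample_z_score(9, data))
```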
6.7 Normal Distribution
• Definition: Standardizing – converting
data to z-scores.
• Some empirical rules:
1. About 68% of data is within one σ of the
mean.
2. About 95% of data is within two σ of the
mean.
3. About 99.7% of data is within three σ of the
mean.
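The empirical rules can be checked against the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```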
6.7 Normal Distribution
• The normal distribution looks like:
1. Bell-shaped
2. Symmetric
3. Mean = median = mode
6.7 Normal Distribution
• Definition: Standard normal distribution –
normal distribution with σ = 1 and μ = 0.
The standard normal distribution table (page
511 or in appendix page 647) can be used to
determine probabilities for a range of z-values.
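A table lookup can be mimicked in code: standardize the endpoints, then take the difference of two cumulative probabilities. The sketch assumes a hypothetical normal variable with mean 100 and standard deviation 15; those numbers are not from the text.

```python
from statistics import NormalDist

mu, sigma = 100, 15                 # hypothetical population values
z_table = NormalDist(0, 1)          # standard normal distribution

def z(x):
    return (x - mu) / sigma         # standardize x

# P(85 < X < 130) becomes P(-1 < Z < 2) after standardizing.
probability = z_table.cdf(z(130)) - z_table.cdf(z(85))
print(round(probability, 4))
```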
6.8 Confidence Intervals
• Central Limit Theorem: For a large sample
size, the random variable x̄ (the sample mean) is
approximately normally distributed with mean μ and
standard deviation σ/√n, where μ is the
population mean of the x’s and σ is the
population standard deviation of the x’s.
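A quick simulation can make the theorem concrete. This is only a sketch under assumed settings: the population is a fair six-sided die, the sample size is 36, and 4000 samples are drawn. None of these choices come from the text.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

random.seed(0)                     # fixed seed so the run is repeatable

faces = [1, 2, 3, 4, 5, 6]         # population: one roll of a fair die
mu = mean(faces)                   # population mean
sigma = pstdev(faces)              # population standard deviation
n = 36                             # sample size, assumed large enough

# Draw many samples of size n and record each sample mean.
sample_means = [mean(random.choices(faces, k=n)) for _ in range(4000)]

print(mean(sample_means), "vs", mu)
print(stdev(sample_means), "vs", sigma / sqrt(n))
```

The sample means should cluster around μ with a spread close to σ/√n.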
6.8 Confidence Intervals
• $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
- σ may be replaced by s
• Common levels of confidence (n ≥ 30):
Level of Confidence (%)    z_{α/2}
80                         1.28
90                         1.645
95                         1.96
99                         2.575
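The interval can be computed directly from the table values. This is a minimal sketch, assuming a large sample (n ≥ 30) so that s may stand in for σ; the function name and the example numbers are hypothetical.

```python
from math import sqrt

# Critical values z_{alpha/2} from the table above, keyed by confidence level in %.
Z_CRITICAL = {80: 1.28, 90: 1.645, 95: 1.96, 99: 2.575}

def confidence_interval(x_bar, s, n, level=95):
    """Return (lower, upper) for x_bar plus or minus z * s / sqrt(n)."""
    margin = Z_CRITICAL[level] * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical sample: mean 50, standard deviation 10, size 100.
lower, upper = confidence_interval(50, 10, 100, level=95)
print(lower, upper)
```

A higher confidence level uses a larger critical value and so gives a wider interval.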
6.8 Confidence Intervals
• Margin of Error: the margin of error of an
estimate of a sample proportion is given by:
$\frac{z_{\alpha/2}}{2\sqrt{n}}$
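The margin of error formula, z_{α/2} divided by 2√n, is easy to evaluate for a given poll size. In this sketch the function name and the sample sizes are hypothetical.

```python
from math import sqrt

def margin_of_error(n, z=1.96):
    """Margin of error z / (2 * sqrt(n)); z defaults to the 95% critical value."""
    return z / (2 * sqrt(n))

for n in (400, 1000):
    print(n, round(margin_of_error(n), 4))
```

Because n sits under a square root, quadrupling the sample size only halves the margin of error.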
6.9 Regression and Correlation
• Scatter Plot – a plot of data consisting of 2
variables
• Linear Regression – modeling the data with the
line that “best fits” – usually a “least squares” line
or regression line
• Least Squares Line – is the line that minimizes the
sum of the squared errors for a set of data points
(formulas given on page 531 and shortcut
formulas are on page 532 – formulas to be
provided on test)
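The textbook's formulas are not reproduced here, so the sketch below uses the standard closed form for the least squares line: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄. The book's notation may differ, and the data points are made up.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [1, 2, 3, 4, 5]      # hypothetical data
ys = [2, 4, 5, 4, 5]
slope, intercept = least_squares_line(xs, ys)
print(slope, intercept)
```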
6.9 Regression and Correlation
• Correlation Coefficient r – is a measure of
the strength of the linear relationship
between the 2 random variables x and y.
Note: The closer the correlation is to 1 or –1,
the stronger the relationship between the
x and y variables. A correlation of zero
means there is no evidence of a linear
pattern.
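The text does not give a formula for r, so this sketch assumes the usual Pearson form, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The data are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2, 4, 5, 4, 5]))          # moderately strong positive relationship
print(correlation(xs, [2 * x + 1 for x in xs]))  # points exactly on a rising line
print(correlation(xs, [-x for x in xs]))         # points exactly on a falling line
```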