Statistics: Dealing With Uncertainty
ACADs (08-006) Covered
1.1.1.2
1.1.1.4
3.2.3.19
3.2.3.20
Keywords
Sample, normal distribution, central tendency, histogram,
probability, population, data sorting, standard normal
distribution, Z-tables.
Supporting Material
Statistics
Dealing With Uncertainty
Objectives
• Describe the difference between a sample and a
population
• Learn to use descriptive statistics (data sorting,
central tendency, etc.)
• Learn how to prepare and interpret histograms
• State what is meant by normal distribution and
standard normal distribution.
• Use Z-tables to compute probability.
Statistics
• “There are lies, d#$& lies, and
then there’s statistics.”
Popularized by Mark Twain, who attributed it to Benjamin Disraeli
Statistics is...
• a standard method for...
- collecting, organizing, summarizing,
presenting, and analyzing data
- drawing conclusions
- making decisions based upon the
analyses of these data.
• used extensively by engineers (e.g., quality
control)
Populations and Samples
• Population - complete set of all of the
possible instances of a particular object
– e.g., the entire class
• Sample - subset of the population
– e.g., a team
• We use samples to draw conclusions about
the parent population.
Why use samples?
• The population may be large
– all people on earth, all stars in the sky.
• The population may be dangerous to observe
– automobile wrecks, explosions, etc.
• The population may be difficult to measure
– subatomic particles.
• Measurement may destroy sample
– bolt strength
Exercise: Sample Bias
• To three significant figures, estimate the
average age of the class based upon your
team.
• When would a team not be a representative
sample of the class?
Measures of Central Tendency
• If you wish to describe a population (or a sample)
with a single number, what do you use?
– Mean - the arithmetic average
– Mode - most likely (most common) value.
– Median - “middle” of the data set.
What is the Mean?
• The mean is the sum of all data values divided
by the number of values.
Sample Mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Where:
– $\bar{x}$ is the sample mean
– $x_i$ are the data points
– $n$ is the sample size
Population Mean
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Where:
– $\mu$ is the population mean
– $x_i$ are the data points
– $N$ is the total number of observations in the
population
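As a minimal sketch of the mean formula above, the sum-and-divide step can be written in a few lines of Python. The function name `mean` and the team ages are assumptions made up for illustration; they are not data from the class.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical ages (years) for one five-person team; illustrative data only.
team_ages = [18, 19, 19, 20, 22]

print(mean(team_ages))  # 19.6
```

The same function serves for a sample or a population; only the interpretation of the result (an estimate versus the true value) changes.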
What is the Mode?
• mode - the value that occurs the most often
in discrete data (or data that have been
grouped into discrete intervals)
– Example, students in this class are most
likely to get a grade of B.
Mode continued
• Example of a grade distribution with mean C,
mode B
[Figure: bar chart of the number of students (0 to 25) receiving each grade F, D, C, B, A]
What is the Median?
• Median - for sorted data, the median is the
middle value (for an odd number of points)
or the average of the two middle values (for
an even number of points).
– useful to characterize data sets with a few
extreme values that would distort the
mean (e.g., house prices, family incomes).
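The effect of an extreme value on the mean, but not on the median or mode, can be illustrated with Python's standard `statistics` module. The house prices below are hypothetical numbers chosen only to make the point.

```python
import statistics

# Hypothetical house prices in thousands of dollars; the last one is extreme.
prices = [150, 160, 160, 175, 190, 1200]

mean_price = statistics.mean(prices)      # pulled upward by the 1200 value
median_price = statistics.median(prices)  # average of the two middle values
mode_price = statistics.mode(prices)      # most common value

print(f"mean   = {mean_price:.1f}")
print(f"median = {median_price:.1f}")
print(f"mode   = {mode_price}")
```

With an even number of points, `statistics.median` averages the two middle values of the sorted data, matching the definition given above.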
What Is the Range?
• Range - the difference between the lowest
and highest values in the set.
– Example, driving time to Houston is 2 hours +/- 15
minutes. Therefore...
• Minimum = 105 min
• Maximum = 135 minutes
• Range = 30 minutes
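The range calculation in the drive-time example is simple enough to check directly. This is only a sketch; the helper name `data_range` is an assumption introduced here.

```python
def data_range(values):
    """Difference between the highest and lowest values in the set."""
    return max(values) - min(values)


# Drive time of 2 hours +/- 15 minutes, expressed in minutes.
nominal = 2 * 60
drive_times = [nominal - 15, nominal, nominal + 15]

print(min(drive_times))         # 105
print(max(drive_times))         # 135
print(data_range(drive_times))  # 30
```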
Standard Deviation
• Gives a single-number measure of
the scatter of the data about the mean.
Standard Deviation
• Population
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
Variance = $\sigma^2$
• Sample
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Variance = $s^2$
The Subtle Difference Between s and σ
N versus n-1
n-1 is needed to get a better estimate of the
population σ from the sample s.
Note: for large n, the difference is trivial.
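The N versus n-1 distinction can be seen by computing both forms on the same data. The sketch below assumes a small made-up data set and two hypothetical helper names, `population_std` and `sample_std`.

```python
import math


def population_std(values):
    """Standard deviation with N in the denominator (whole population)."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def sample_std(values):
    """Standard deviation with n - 1 in the denominator (sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample needs at least two points")
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


# Illustrative data only: the mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_std(data))  # 2.0
print(sample_std(data))      # slightly larger, since 32/7 > 32/8
```

The standard library functions `statistics.pstdev` and `statistics.stdev` implement the same two definitions, and the gap between them shrinks as n grows.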
A Valuable Tool
• Gauss introduced the concept behind standard deviation
circa 1800 to explain the error observed in measured star
positions.
• Today it is used in everything from quality
control to measuring financial risk.
Team Exercise
• In your team’s bag of M&M candies, count
– the number of candies for each color
– the total number of candies in the bag
• When you are done counting, have a
representative from your team enter your
data in Excel
Team Exercise (cont’d)
For each color, and the total number of candies,
determine the following:
– maximum
– minimum
– range
– mean
– mode
– median
– standard deviation
– variance
Individual Exercise: Histograms
• Flip a coin EXACTLY ten times. Count the
number of heads YOU get.
• Report your result to the instructor who will
post all the results on the board
• Open Excel
• Using the data from the entire class, create
bar graphs showing the number of classmates
who get one head, two heads, three heads,
etc.
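For readers without a class to poll, the coin-flip exercise can be simulated. The sketch below is an assumption-laden stand-in: it uses Python's `random` module in place of real coins, a hypothetical class size of 30, and a text bar graph in place of Excel.

```python
import random
from collections import Counter


def count_heads(rng, flips=10):
    """Flip a fair coin `flips` times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(flips))


def class_histogram(class_size, flips=10, seed=None):
    """Return a list whose k-th entry is how many students got k heads."""
    rng = random.Random(seed)
    counts = Counter(count_heads(rng, flips) for _ in range(class_size))
    return [counts.get(k, 0) for k in range(flips + 1)]


# Hypothetical class of 30 students; the seed only makes the run repeatable.
histogram = class_histogram(class_size=30, seed=1)

for heads, students in enumerate(histogram):
    print(f"{heads:2d} heads | {'#' * students}")
```

Raising `class_size` to a few thousand makes the bars approach the bell shape discussed in the next section.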
Data Distributions
• The “shape” of the data is described by its
frequency histogram.
• Data that behave “normally” exhibit a “bell-shaped” curve, or the “normal” distribution.
• Gauss found that star position errors tended
to follow a “normal” distribution.
The Normal Distribution
• The normal distribution is sometimes called
the “Gauss” curve.
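$$RF = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$$
[Figure: bell-shaped curve of relative frequency (RF) versus x, centered on the mean]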
Standard Normal Distribution
Define:
$$z = (x - \mu) / \sigma$$
Then
$$RF = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} z^2}$$
[Figure: standard normal curve, RF (0.0 to 0.5) versus z (-4.0 to 4.0); total area under the curve = 1.00]
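The change of variable to z can be checked numerically. In this sketch the function names and the example values (a mean of 120 minutes and a standard deviation of 5 minutes) are assumptions chosen for illustration.

```python
import math


def z_score(x, mu, sigma):
    """Convert a value x to standard units: z = (x - mu) / sigma."""
    return (x - mu) / sigma


def standard_normal_rf(z):
    """Relative frequency (density) of the standard normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_rf(x, mu, sigma):
    """Relative frequency of a normal curve with mean mu and std sigma."""
    return standard_normal_rf(z_score(x, mu, sigma)) / sigma


# Hypothetical example: drive times with mean 120 min and sigma 5 min.
z = z_score(130, mu=120, sigma=5)
print(z)                                # 2.0
print(round(standard_normal_rf(0), 4))  # 0.3989, the peak of the curve
```

Dividing by sigma in `normal_rf` keeps the total area equal to 1.00 after the horizontal axis is rescaled.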
Some handy things to know.
• 50% of the area lies on each side of the midpoint for any normal curve.
• A standard normal distribution (SND) has a
total area of 1.00.
• “z-Tables” show the area under the standard
normal distribution, and can be used to find
the area between any two points on the z-axis.
Using Z Tables
(Appendix C, p. 624)
• Question: Find the area between z = -1.0 and
z = 2.0
– From table, for z = 1.0, area (from z = 0) = 0.3413
– By symmetry, for z = -1.0, area = 0.3413
– From table, for z = 2.0, area (from z = 0) = 0.4772
– Total area = 0.3413 + 0.4772 = 0.8185
– “Tails” area = 1.0 - 0.8185 = 0.1815
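The table lookups in this example can be reproduced with the error function in Python's `math` module. This is a sketch that assumes the table lists the area between z = 0 and z, as the worked values suggest; the function names are made up here.

```python
import math


def area_from_zero(z):
    """Area under the standard normal curve between 0 and z (table value)."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))


def area_between(z_low, z_high):
    """Area under the standard normal curve between z_low and z_high."""

    def cumulative(z):
        # Area to the left of z.
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    return cumulative(z_high) - cumulative(z_low)


print(round(area_from_zero(1.0), 4))  # 0.3413
print(round(area_from_zero(2.0), 4))  # 0.4772

total = area_between(-1.0, 2.0)
print(f"between = {total:.4f}, tails = {1 - total:.4f}")
```

The computed total can differ from the hand calculation in the last digit because the table values are rounded before they are added.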
“Quick and Dirty” Estimates of
μ and σ
• μ ≈ (lowest + 4*mode + highest)/6
• For a standard normal curve, 99.7% of the
area is contained within ± 3σ from the mean.
• Define “highest” = μ + 3σ
• Define “lowest” = μ − 3σ
• Therefore, σ ≈ (highest − lowest)/6
Example:
Drive time to Houston
• Lowest = 1 h
• Most likely = 2 h
• Highest = 4 h (including a flat tire, etc.)
– μ = (1 + 4*2 + 4)/6 = 2.17 h (2 h 10 min)
– σ = (4 − 1)/6 = 0.5 h
• This technique (the three-point estimate, as used in PERT)
was used to plan the moon flights.
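The two quick estimates can be wrapped in a small helper to rerun the drive-time example with other guesses. The function name `quick_estimates` is an assumption introduced for this sketch.

```python
def quick_estimates(lowest, most_likely, highest):
    """Rough mean and standard deviation from three guessed values."""
    mu = (lowest + 4 * most_likely + highest) / 6
    sigma = (highest - lowest) / 6
    return mu, sigma


# Drive time to Houston, in hours: lowest 1, most likely 2, highest 4.
mu, sigma = quick_estimates(lowest=1, most_likely=2, highest=4)

print(f"mu    = {mu:.2f} h ({round(mu * 60)} min)")
print(f"sigma = {sigma:.1f} h")
```

Because the highest value sits farther from the most likely value than the lowest does, the estimated mean lands above the most likely value of 2 h.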
Review
• Central tendency
– mean
– mode
– median
• Scatter
– range
– variance
– standard deviation
• Normal Distribution