Measures of Variation
Range = Maximum - Minimum
= Length of an interval containing all of the data
Example: August 2005 tropical storms: 40, 50, 65, 105, 175
Range = 175 - 40 = 135
How far from the mean is a typical data point?
For a population, the standard deviation is defined as
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
and the variance is $\sigma^2$.
Variance is mathematically convenient, std. dev. is more interpretable.
Std. Dev. is a measure of the distance from the mean to a typical data
point, and is the most common measure of spread.
Ex: If we are interested in the wind speed of Aug 2005 storms, what is
our population? (interested in 2005 storms?)
For a population, $\mu = \bar{x}$.
$\bar{x} = 87$
$\sigma^2 = 2426$
$\sigma \approx 49.25$
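The population formulas can be checked on the August 2005 storm speeds. The following is a minimal sketch that treats the five storms as the whole population; the function name `population_variance` is illustrative and not from the notes.

```python
import math

# Wind speeds of the August 2005 tropical storms, as listed in the notes.
speeds = [40, 50, 65, 105, 175]

def population_variance(data):
    """Average squared deviation from the mean, dividing by N."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n

mu = sum(speeds) / len(speeds)        # population mean
variance = population_variance(speeds)
sigma = math.sqrt(variance)           # population standard deviation

print(mu, variance, round(sigma, 2))
```

The rounded standard deviation is the same 49.25 used in the Katrina z score later in the notes.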
In most cases, we only have a sample, not a population.
Then we use $\bar{x}$ to estimate $\mu$, and $s$ to estimate $\sigma$.
The formula for the sample standard deviation is
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
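To see the effect of dividing by n-1 instead of N, here is a short sketch that treats the same five storm speeds as a sample rather than a population. The helper name `sample_std` is an assumption for illustration; the cross-check uses `statistics.stdev` from the Python standard library.

```python
import math
import statistics

# The Aug 2005 storm speeds, now treated as a sample.
speeds = [40, 50, 65, 105, 175]

def sample_std(data):
    """Sample standard deviation: squared deviations divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

s = sample_std(speeds)
print(round(s, 2))

# The standard library's stdev uses the same n - 1 divisor.
print(math.isclose(s, statistics.stdev(speeds)))
```

Because the divisor is smaller, s comes out larger than the population value computed from the same numbers.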
Empirical rule for data with a bell-shaped distribution:
About 68% of the data will be within 1 s.d. of the mean.
About 95% will be within 2 s.d. of the mean.
Can the following be considered bell-shaped?
Is it symmetric? Skewed?
Textbook, Page 48, Figure 2-11 gives a good illustration about Mean,
Median, Mode and Skewness.
Is there anything we can say for non-bell-shaped distributions, or for all distributions?
Chebyshev's Theorem
The proportion of any set of data lying within K s.d. of the mean is
always at least $1 - 1/K^2$, where K > 1.
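Chebyshev's bound can be compared with what actually happens in a small data set. This is a sketch only: the function names are illustrative, and it uses the population s.d. (dividing by N) of the Aug 2005 storm speeds.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k s.d. of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed proportion of data within k population s.d. of the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return sum(1 for x in data if abs(x - mean) <= k * sd) / n

speeds = [40, 50, 65, 105, 175]
for k in (1.5, 2, 3):
    print(k, chebyshev_lower_bound(k), fraction_within(speeds, k))
```

For every K the observed proportion is at least the bound, which is all the theorem promises; it says nothing about how much larger the observed proportion will be.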
Standardized Scores (z Scores)
How extreme is an observation?
How do we compare observations from different datasets?
We standardize the data by subtracting its mean and dividing by its s.d.
(p. 69)
Sample vs. Population
Ex: Hurricane Katrina (175-87)/49.25=1.79
Aug 2004's biggest storm was Karl at 145. The mean & s.d. for Aug 04
were 90.6 and 38.28.
So Karl's z score is (145-90.6)/38.28=1.42
Conclusion: Even after adjusting for the increased variability in 2005,
Katrina stands out as more extreme.
SAT scores (rescaled to 200-800) adjust for the varying difficulty of the
exams.
By the empirical rule, |z| > 2 means the observation is unusual.
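The Katrina and Karl comparison can be reproduced with a few lines. This sketch takes the means and standard deviations exactly as stated in the notes; `z_score` and `is_unusual` are illustrative names.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_unusual(z):
    """Rule of thumb from the empirical rule: |z| > 2 is unusual."""
    return abs(z) > 2

katrina = z_score(175, 87, 49.25)    # Aug 2005 mean and s.d.
karl = z_score(145, 90.6, 38.28)     # Aug 2004 mean and s.d.

print(round(katrina, 2), is_unusual(katrina))
print(round(karl, 2), is_unusual(karl))
```

Katrina's z score is the larger of the two, although neither storm crosses the |z| > 2 threshold.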
Quartiles and Percentiles
The median separates the data into two equally sized groups: half of
the observations are above the median, half are below. Equivalently, 50%
are below, so the median is the 50th percentile.
The 99 percentiles divide the data into 100 groups. 1% of the
observations are less than the 1st percentile, 2% less than the 2nd
percentile, and so on.
To find the kth percentile: (p 73)
1.
2.
3. a.
b.
Storm examples:
Aug04 40,45,65,70,105,129,135,145
Aug05 40,50,65,105,175
The 25th percentile is called the 1st quartile (Q1); the 75th percentile is the
3rd quartile (Q3).
The Median = Q2 = 50th percentile.
Boxplot
Min, Q1, median, Q3, Max
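The five numbers behind a boxplot can be computed for the storm examples. The percentile steps are left blank in these notes, so this sketch ASSUMES one common textbook locator rule: sort the data, compute L = (k/100)·n, average the L-th and (L+1)-th values if L is a whole number, and otherwise round L up. Other textbooks and software use different conventions and may give slightly different quartiles.

```python
def percentile(data, k):
    """k-th percentile (integer k, 1..99) under the assumed locator rule."""
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    values = sorted(data)
    n = len(values)
    # L = k*n/100, kept as whole part and remainder to avoid float error.
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is whole: average the L-th and (L+1)-th values (1-based).
        return (values[whole - 1] + values[whole]) / 2
    # L is not whole: round up, i.e. 1-based position whole + 1.
    return values[whole]

def five_number_summary(data):
    """Min, Q1, median, Q3, Max."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))

aug04 = [40, 45, 65, 70, 105, 129, 135, 145]   # as listed in the notes
aug05 = [40, 50, 65, 105, 175]

print(five_number_summary(aug04))
print(five_number_summary(aug05))
```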
Probability
Event: any collection of results or outcomes of a procedure.
Examples of procedures: flipping a fair coin, rolling 2 dice
Outcomes: Head/Tail; 2~12 (sum of the two dice)
Simple Event: cannot be broken down further
Examples: H/T
Sample Space: collection of all possible simple events
Procedure: Flip a coin once
  (Some possible) event: Head (simple event)
  Sample space: {H, T}
Procedure: Flip a coin 3 times
  (Some possible) event: 2 heads 1 tail (HHT, HTH, THH are all simple events resulting in the 2-heads-1-tail event)
  Sample space: {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
Notations
P denotes a probability
A, B or C denote specific events
P(A) denotes the probability of event A occurring
Rule 1: Relative frequency approximation of probability
Conduct (or observe) a procedure, and count the number of times that
event A actually occurs. Based on these actual results, P(A) is estimated
as
P(A) = (# of times A occurred) / (# of times the trial was repeated)
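Rule 1 amounts to counting. In the sketch below the list of flips is a made-up record used only for illustration (it is not data from the notes), and `relative_frequency` is a hypothetical helper name.

```python
def relative_frequency(trials, event):
    """Estimate P(A) as (# of trials where A occurred) / (# of trials)."""
    occurred = sum(1 for outcome in trials if event(outcome))
    return occurred / len(trials)

# Hypothetical record of 10 coin flips (illustrative only).
flips = ["H", "T", "T", "H", "H", "T", "H", "H", "T", "H"]

p_head = relative_frequency(flips, lambda outcome: outcome == "H")
print(p_head)
```

With only 10 trials the estimate can sit well away from the true probability; it is an approximation, not the exact value.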
Rule 2: Classical
Assume that a given procedure has n different simple events and that
each of those simple events has an equal chance of occurring. If event A
can occur in s of n ways, then
P(A)=s/n
Example: P(2 heads, 1 tail) = 3/8
Examples: rolling one die, one 12-faced die, and two dice
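When all simple events are equally likely, Rule 2 can be applied by listing the sample space and counting. This is a sketch under that equal-likelihood assumption; the dice events chosen here (a sum of 7, the face 12) are illustrative and not taken from the notes.

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, event):
    """P(A) = s / n, with n equally likely simple events and s in A."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

# Three coin flips: 8 equally likely simple events.
three_flips = ["".join(p) for p in product("HT", repeat=3)]
p_2h1t = classical_probability(three_flips, lambda o: o.count("H") == 2)

# Two ordinary dice: 36 equally likely (first, second) pairs.
two_dice = list(product(range(1, 7), repeat=2))
p_sum7 = classical_probability(two_dice, lambda pair: sum(pair) == 7)

# One 12-faced die: 12 equally likely faces.
p_face12 = classical_probability(range(1, 13), lambda face: face == 12)

print(p_2h1t, p_sum7, p_face12)
```

Note that the 11 possible sums of two dice are not equally likely, so the counting has to be done over the 36 ordered pairs.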
Rule 3: Subjective Probability: based on knowledge...
Law of Large Numbers
As a procedure is repeated again and again, the relative frequency
probability of an event tends to approach the actual probability.
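A quick simulation illustrates the tendency described here. It is a sketch using Python's pseudo-random generator to stand in for a fair coin; the seed and the checkpoints are arbitrary choices, and the printed values will differ with another seed.

```python
import random

def running_relative_frequency(n_flips, seed=0):
    """Relative frequency of heads after each of n_flips simulated flips."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for i in range(1, n_flips + 1):
        if rng.random() < 0.5:   # simulated fair coin: heads with prob. 0.5
            heads += 1
        freqs.append(heads / i)
    return freqs

freqs = running_relative_frequency(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, freqs[n - 1])
```

The early values can wander noticeably, while the later ones settle close to 0.5.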
The complement of event A, denoted by $\bar{A}$, consists of all outcomes in which
event A does not occur.
$P(\bar{A}) = 1 - P(A)$
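The complement rule is handy when the complement is easier to count than the event itself. As an illustrative example (not from the notes), take A = "no heads in 3 flips", so that its complement is "at least one head".

```python
from fractions import Fraction
from itertools import product

flips = list(product("HT", repeat=3))   # 8 equally likely outcomes

# A = no heads at all; only TTT qualifies.
p_no_heads = Fraction(sum(1 for f in flips if "H" not in f), len(flips))

# Complement rule: P(at least one head) = 1 - P(no heads).
p_at_least_one_head = 1 - p_no_heads

# Direct count, to confirm the rule gives the same answer.
direct = Fraction(sum(1 for f in flips if "H" in f), len(flips))

print(p_no_heads, p_at_least_one_head, direct)
```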
Review
There are 9 people in a room. I want to know what proportion of
them smoke. I randomly pick 3 of them to ask.
Population:
Sample:
Parameter:
Statistics:
Data type:
Experimental/observational:
(Simple) Random sampling:
To identify people, assign them numbers 1~9, or equivalently,
1 (1,1), 2 (1,2), 3 (1,3),
4 (2,1), 5 (2,2), 6 (2,3),
7 (3,1), 8 (3,2), 9 (3,3).
Examples of sampling methods:
Roll a 3-number die, then choose by row, by column, or by color.
Why random?
Each individual member has the same chance of being selected.
Why not simple random? In a simple random sample, every sample of size n (here 3)
has the same chance of being chosen.
What are all the possible size-3 samples? (Sample space)
{(1,2,3),(1,2,4),(1,2,5),(1,2,6),(1,2,7),...(2,3,4),...(7,8,9)} (84
possible samples in total)
Among them, for example (1,2,4) cannot be chosen by row, by column
or by color.
So the sampling methods "roll a 3-number die, then choose by row, by
column, or by color" are not simple random sampling methods, and
the samples selected in such ways are not simple random samples.
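The count of 84 and the claim about (1,2,4) can be checked by enumeration. This sketch assumes the (row, column) layout implied by the numbering of the nine people; the color grouping is not given in the notes, so it is left out here (any scheme of three color groups would add at most three more reachable samples).

```python
from itertools import combinations

people = range(1, 10)
all_samples = list(combinations(people, 3))   # every possible size-3 sample

# Samples reachable by picking a whole row or a whole column of the 3x3 grid.
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
columns = [(1, 4, 7), (2, 5, 8), (3, 6, 9)]
reachable = set(rows) | set(columns)

print(len(all_samples))                 # size of the sample space
print((1, 2, 4) in reachable)           # can (1,2,4) ever be chosen?
print(len(all_samples) - len(reachable))
```

Since most of the 84 samples have zero chance under the row or column methods, the samples are not equally likely, even though each individual person is.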
How to conduct a simple random sample for this example?
Draw 3 random numbers from 1~9.
Roll twice to pick (a,b). (This is equivalent to the line above, but the
proof is out of scope.)
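Both approaches can be sketched in code. The first uses `random.sample` from the Python standard library; the second simulates two rolls of a 3-number die per person and, as an assumption of this sketch, simply rolls again whenever a person has already been picked. The function names are illustrative.

```python
import random

def simple_random_sample(population, n, rng=None):
    """Draw n distinct members; every size-n subset is equally likely."""
    rng = rng or random.Random()
    return sorted(rng.sample(list(population), n))

def to_person(a, b):
    """Map a (row, column) pair from two rolls to a person number 1~9."""
    return 3 * (a - 1) + b

def sample_by_rolls(n, rng=None):
    """Pick people by rolling a 3-number die twice, re-rolling repeats."""
    rng = rng or random.Random()
    chosen = set()
    while len(chosen) < n:
        a, b = rng.randint(1, 3), rng.randint(1, 3)
        chosen.add(to_person(a, b))
    return sorted(chosen)

print(simple_random_sample(range(1, 10), 3))
print(sample_by_rolls(3))
```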
Another example:
Product line: every 100th from some starting point
Fixed starting number
Random starting number
Simple random?
Using common sense.
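The product-line example can be sketched as follows. The run of 1000 numbered units, the seed, and the helper name `systematic_sample` are all assumptions made for illustration.

```python
import random

def systematic_sample(items, step, start):
    """Take every step-th item, beginning at index start."""
    return items[start::step]

units = list(range(1, 1001))     # hypothetical production run of 1000 units

# Fixed starting number: always the 1st unit, then every 100th after it.
fixed = systematic_sample(units, 100, start=0)

# Random starting number: start somewhere within the first 100 units.
rng = random.Random(2024)
randomized = systematic_sample(units, 100, start=rng.randrange(100))

print(fixed)
print(randomized)
```

Even with a random start, two neighboring units can never appear in the same sample, so not every sample of 10 units is equally likely and the method is not simple random sampling.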
Key concepts:
Stem and leaf
Histogram
Frequency distribution
Mean, Median, Mode
Symmetric/Skew
Range
s.d. / variance
Quartile, percentile
Outlier
Probability/event/simple event/sample space