Last lecture summary
• Which measures of variability do you know?
• What are their advantages and disadvantages?
• Empirical rule
Statistical jargon
population (census) vs. sample
parameter (population) vs. statistic (sample)
Population - parameter
Mean πœ‡
Standard deviation 𝜎
Sample - statistic
Mean π‘₯
Standard deviation s
VýbΔ›r - statistika
VýbΔ›rový prΕ―mΔ›r π‘₯
VýbΔ›rová smΔ›rodatná odchylka s
Statistical inference
• A statistic is a value calculated from our observed data (sample).
• A parameter is a value that describes the population.
• We want to be able to generalize what we observe in our data to the population. In order to do this, the sample needs to be representative.
• How do we select a representative sample? Use randomization.
New stuff
Random sampling
• Simple random sampling (SRS): each possible sample from the population is equally likely to be selected.
• Stratified sampling: a simple random sample is taken from subgroups of the population.
  • subgroups: gender, age groups, …
• Cluster sampling: divide the population into non-overlapping groups (clusters); the sample is a randomly chosen cluster.
  • example: the population is all students in an area; randomly select schools and create the sample from the students of the selected schools.
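The three sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the student list, the two age groups, the three schools and all function names are hypothetical and do not come from the lecture.

```python
import random

def simple_random_sample(population, k, rng):
    """Simple random sample: every subset of size k is equally likely."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Take a simple random sample of k_per_stratum members from each subgroup."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every member of the chosen ones."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population and groupings.
students = [f"student_{i}" for i in range(20)]
by_age_group = {"under_20": students[:8], "20_and_over": students[8:]}
by_school = {"school_A": students[:5],
             "school_B": students[5:12],
             "school_C": students[12:]}

srs = simple_random_sample(students, 5, rng)
stratified = stratified_sample(by_age_group, 2, rng)
clustered = cluster_sample(by_school, 1, rng)

print(srs)
print(stratified)
print(clustered)
```

Note the difference in what is randomized: stratified sampling draws individuals inside every subgroup, while cluster sampling draws whole groups.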
Simple random sampling
• sampling with replacement (WR)
  • Czech: výběr s navrácením
  • Generates independent samples.
  • Two sample values are independent if what we get on the first one doesn't affect what we get on the second.
• sampling without replacement (WOR)
  • Czech: výběr bez navrácení
  • Deliberately avoids choosing any member of the population more than once.
  • This type of sampling is not independent; however, it is more common.
  • The error from treating the draws as independent is small as long as
    1. the sample is large
    2. the sample size is no more than 10% of the population size
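A minimal sketch of the two variants, assuming a hypothetical population of ten numbered items. `random.choices` draws with replacement and `random.sample` draws without replacement.

```python
import random

rng = random.Random(1)           # fixed seed so the sketch is reproducible
population = list(range(1, 11))  # hypothetical population of 10 numbered items

# With replacement: each draw ignores the earlier draws, so repeats are possible.
with_replacement = rng.choices(population, k=5)

# Without replacement: no member of the population can be chosen twice.
without_replacement = rng.sample(population, k=5)

print("WR: ", with_replacement)
print("WOR:", without_replacement)
```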
Bias
• If a sample is not representative, it can introduce bias into our results.
  • bias (Czech: zkreslení, odchylka)
• A sample is biased if it differs from the population in a systematic way.
• The Literary Digest poll, 1936 U.S. presidential election
  • surveyed 10 million people (subscribers)
  • 2.3 million responded, predicting (3:2) that the Republican candidate would win
  • the Democratic candidate won
• What went wrong?
  • only wealthy people were surveyed (selection bias)
  • the survey relied on voluntary response (nonresponse bias): those who respond tend to be angry people or people who want a change
Bessel's correction
$$s = \sqrt{\frac{\sum_{i} (x_i - \bar{x})^2}{n - 1}}$$
www.udacity.com - Statistics
Sample vs. population SD
• We use the sample standard deviation to approximate the population parameter σ:
$$s = \sqrt{\frac{\sum_{i} (x_i - \bar{x})^2}{n - 1}} \approx \sigma = \sqrt{\frac{\sum_{i} (x_i - \mu)^2}{n}}$$
• But don't confuse this with the actual standard deviation of a small dataset.
• For example, take this dataset: 5 2 1 0 7. Do you divide by $n$ or by $n - 1$?
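Both divisors can be tried on the dataset 5 2 1 0 7 with Python's standard `statistics` module. This is only a sketch: which result is appropriate depends on whether the five values are treated as the whole dataset of interest (divide by n) or as a sample used to estimate σ of a larger population (divide by n - 1).

```python
import statistics

data = [5, 2, 1, 0, 7]

# Divide by n: the standard deviation of these five values themselves.
sd_n = statistics.pstdev(data)

# Divide by n - 1 (Bessel's correction): an estimate of sigma for a larger
# population from which the five values are assumed to be a sample.
sd_n_minus_1 = statistics.stdev(data)

print(f"divide by n:     {sd_n:.4f}")
print(f"divide by n - 1: {sd_n_minus_1:.4f}")
```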
Bessel's game
$\mu = 2$, $\sigma^2 = \frac{8}{3}$
Bessel's game
• An important property of a sample statistic that estimates a population parameter is that if you evaluate the sample statistic for every possible sample and average the results, the average should equal the population parameter.
We want: average of all possible sample variances = population variance
• A statistic with this property is called unbiased.
Bessel's game
1. List all possible samples of 2 cards.
2. Calculate sample averages.
[Figure: population of all cards in a bag]
[Table to fill in: Sample | Sample average]
Bessel's game
1. List all possible samples of 2 cards.
2. Calculate sample averages.
3. Now, half of you calculate sample variance using /n, and half of you using /(n-1).
4. And then average all sample variances.
$$s^2 = \frac{\sum_{i} (x_i - \bar{x})^2}{n \ \text{or} \ n - 1}$$
[Figure: population of all cards in a bag]
$\mu = 2$, $\sigma^2 = \frac{8}{3}$
Sample    Sample average   Sample variance
0,2       1
0,4       2
2,0       1
2,4       3
4,0       2
4,2       3
0,0       0
2,2       2
4,4       4
Bessel's game
Sample    Sample average   Sample variance (n-1)   Sample variance (n)
0,2       1                2                       1
0,4       2                8                       4
2,0       1                2                       1
2,4       3                2                       1
4,0       2                8                       4
4,2       3                2                       1
0,0       0                0                       0
2,2       2                0                       0
4,4       4                0                       0
average   18/9 = 2         24/9 = 8/3              12/9 = 4/3
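The game can be replayed in code. The sketch below assumes the bag holds the cards 0, 2 and 4 (inferred from the samples listed in the table) and that samples of 2 cards are drawn with replacement, which gives the nine ordered samples shown. Exact fractions are used so the averages can be compared without rounding; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

cards = [0, 2, 4]  # assumed population, inferred from the listed samples
n = 2              # sample size

def mean(values):
    return Fraction(sum(values), len(values))

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by the given divisor."""
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values) / divisor

# All ordered samples of 2 cards drawn with replacement.
samples = list(product(cards, repeat=n))

avg_of_means = mean([mean(s) for s in samples])
avg_var_n_minus_1 = mean([variance(s, n - 1) for s in samples])
avg_var_n = mean([variance(s, n) for s in samples])
population_variance = variance(cards, len(cards))

print("average of sample averages:     ", avg_of_means)
print("average sample variance, /(n-1):", avg_var_n_minus_1)
print("average sample variance, /n:    ", avg_var_n)
print("population variance:            ", population_variance)
```

Only the /(n-1) version averages to the population variance, which is what "unbiased" means here; the /n version averages to a smaller value.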
Median absolute deviation (MAD)
• standard deviation is not robust
• IQR is robust
• median absolute deviation (MAD): a robust equivalent of the standard deviation
• Take your data, find the median, calculate the absolute deviations from the median, and find the median of the absolute deviations.
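The recipe above translates directly into code. A minimal sketch follows; the two small datasets are hypothetical and chosen only to show how one outlier affects the standard deviation but not the MAD.

```python
import statistics

def median_absolute_deviation(data):
    """Median of the absolute deviations from the median."""
    center = statistics.median(data)
    return statistics.median([abs(x - center) for x in data])

# Hypothetical data: the second list replaces 5 with an outlier.
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

mad_clean = median_absolute_deviation(clean)
mad_outlier = median_absolute_deviation(with_outlier)
sd_clean = statistics.stdev(clean)
sd_outlier = statistics.stdev(with_outlier)

print("MAD:", mad_clean, "->", mad_outlier)
print(f"SD:  {sd_clean:.2f} -> {sd_outlier:.2f}")
```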
Median absolute deviation (MAD)
Table columns: Data | Median deviation | Absolute deviation
Values on the slide: 5, 10, 30, 20, 30, 5, 15, 10, 15
Median:
MAD:
NORMAL DISTRIBUTION
Playing chess
• Pretend I am a chess player.
• Which of the following tells you most about how good I am:
1. My rating is 1800.
2. 8110th place among world competitive chess players.
3. Ranked higher than 88% of competitive chess players.
Distribution
We should use relative frequencies and convert all absolute frequencies to proportions.
[Figure: distribution of scores in one particular year]
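Converting absolute frequencies to proportions is a single division by the total count. The scores in this sketch are made up for illustration and are not the data behind the lecture's figure.

```python
from collections import Counter

# Hypothetical scores; not taken from the lecture.
scores = [1500, 1600, 1600, 1700, 1700, 1700, 1800, 1800, 1900, 2000]

absolute = Counter(scores)       # score -> absolute frequency (count)
total = sum(absolute.values())

# Relative frequency = count / total, so the values sum to 1.
relative = {score: count / total for score, count in absolute.items()}

for score in sorted(relative):
    print(score, absolute[score], relative[score])
```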
Height data - absolute frequencies
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights
Height data - relative frequencies
What proportion of values is between 170 cm and 173.75 cm?
30%
Height data - relative frequencies
What proportion of values is between 170 cm and 175 cm?
We can't tell for certain.
• How should we modify the data/histogram to allow more detail?
1. Adding more values to the dataset
2. Increasing the bin size
3. A smaller bin size
Height data - relative frequencies
What proportion of values is between 170 cm and 175 cm?
36%
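The effect of bin size can be sketched with simulated heights. Everything here is an assumption made for illustration: the heights are random draws rather than the SOCR data, and the bin edges are chosen so that the coarse bins (3.75 cm wide) have an edge at 170 and 173.75 but not at 175, while the fine bins (1.25 cm wide) have edges at both 170 and 175.

```python
import random

def proportion_in_bins(values, start, width, n_bins):
    """Relative frequency of each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return [c / len(values) for c in counts]

rng = random.Random(0)
# Hypothetical heights in cm (not the data set referenced in the lecture).
heights = [rng.gauss(172, 5) for _ in range(1000)]

# Coarse bins: edges at 151.25, 155, ..., 170, 173.75, 177.5, ...
coarse = proportion_in_bins(heights, start=151.25, width=3.75, n_bins=12)
# Fine bins: edges at 151.25, 152.5, ..., 170, 171.25, ..., 175, ...
fine = proportion_in_bins(heights, start=151.25, width=1.25, n_bins=36)

# 170 to 173.75 is exactly one coarse bin, so the coarse histogram answers it.
print("170-173.75 cm (coarse):", coarse[5])
# 175 falls inside a coarse bin, so only the fine histogram can answer this.
print("170-175 cm (fine):     ", sum(fine[15:19]))
```

With the coarse bins, 175 cm falls in the middle of a bar, so the proportion can only be guessed. With the finer bins it is a sum of whole bars.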
Height data - relative frequencies
recall the empirical rule
Normal distribution
68-95-99.7
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
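The density formula and the 68-95-99.7 rule can be checked numerically. This sketch writes the density out by hand and uses `statistics.NormalDist` from the Python standard library for the cumulative probabilities; the helper names `normal_pdf` and `within` are illustrative.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

standard = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {within(k):.4f}")
```

The three proportions round to 68%, 95% and 99.7%, which is where the empirical rule gets its name.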