Last lecture summary
• Mode
• Distribution
• Five-number summary, percentiles, mean
• Box plot, modified box plot
• Robust statistic – mean, median, trimmed mean
• Outlier
MEASURES OF VARIABILITY
Setting the mood
www.udacity.com – Introduction to statistics
QUESTION
[Figure labels: Mean1, Mean2, Mode1, Mode2, Median1, Median2]
www.udacity.com – Statistics
Range
range = max − min
Range
The range changes when we add new data to the dataset:
• Always
• Sometimes
• Never
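A small sketch of why the answer depends on the new value: the range only moves when the added value falls outside the current min-max interval. The helper `value_range` and the numbers are made up for illustration and are not from the lecture.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

data = [2, 5, 7, 11]          # hypothetical data, range = 11 - 2 = 9

inside = data + [6]           # new value lies between the current min and max
outside = data + [40]         # new value lies above the current max

print(value_range(data))      # 9
print(value_range(inside))    # 9, unchanged
print(value_range(outside))   # 38, the range grew
```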
Adding Mark Zuckerberg
Cut off data
IQR, interquartile range
Interquartile range, IQR
Let's take this quiz; answer yes or no.
1. About 50% of the data fall within the IQR.
2. The IQR is affected by every value in the data set.
3. The IQR is not affected by outliers.
4. The mean is always between Q1 and Q3.
mean = 8.62
n = 13
0 1 1 1 2 2 2 2 2 3 3 3 90
Q1=1
Q2
Q3=3
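The 13 values above can be checked with Python's standard library. This is only a sketch: `statistics.quantiles` uses its default "exclusive" method here, which happens to reproduce Q1 = 1 and Q3 = 3 for this data; other quartile conventions can differ on other data sets. The variable `replaced` (90 swapped for 4) is an invented variation used to show how the mean and the quartiles react to the extreme value.

```python
import statistics

data = [0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 90]

mean = statistics.fmean(data)
q1, q2, q3 = statistics.quantiles(data, n=4)   # default "exclusive" method
iqr = q3 - q1

print(round(mean, 2))    # 8.62
print(q1, q2, q3)        # 1.0 2.0 3.0
print(iqr)               # 2.0
print(q1 <= mean <= q3)  # False: the mean lies outside the quartiles

# Hypothetical variation: replace the extreme value 90 by 4.
replaced = data[:-1] + [4]
print(statistics.fmean(replaced))             # 2.0, the mean drops sharply
print(statistics.quantiles(replaced, n=4))    # [1.0, 2.0, 3.0], quartiles unchanged
```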
Define the outlier
Outlier < Q1 − 1.5 × IQR
OR
Outlier > Q3 + 1.5 × IQR
Sample (n=10)
$38,946
$43,420
$49,160
$50,430
$50,557
$52,580
$53,595
$54,160
$60,181
$10,000,000
What values are outliers for this data set?
1. $60,000
2. $80,000
3. $100,000
4. $200,000
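The two fences can be computed for the salary sample as follows. This is a sketch under one assumption: Q1 and Q3 are taken as the medians of the lower and upper half of the sorted data (the helper names `quartiles_by_halves` and `outlier_fences` are invented for this example). Other quartile conventions shift the fences slightly.

```python
import statistics

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper half of the sorted data.

    Assumption: with an odd number of values, the middle value is left
    out of both halves. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

def outlier_fences(values):
    """Lower and upper fence: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

salaries = [38946, 43420, 49160, 50430, 50557,
            52580, 53595, 54160, 60181, 10_000_000]

low, high = outlier_fences(salaries)
print(low, high)                                       # 41660.0 61660.0
print([x for x in salaries if x < low or x > high])    # [38946, 10000000]

# The quiz candidates, checked against the same fences:
for candidate in (60_000, 80_000, 100_000, 200_000):
    print(candidate, candidate < low or candidate > high)
```

Under this convention Q1 = 49,160, Q3 = 54,160 and IQR = 5,000, so anything below 41,660 or above 61,660 is flagged.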
Problem with IQR
[Figure labels: normal, bimodal, uniform]
Options for measuring variability
• Find the average distance between all pairs of data values.
• Find the average distance between each data value and either the max or the min.
• Find the average distance between each data value and the mean.
Average distance from mean
Sample    Deviation from mean ($x_i - \bar{x}$)
10         4
5         -1
3         -3
2         -4
19        13
1         -5
7          1
11         5
1         -5
1         -5
$\sum (x_i - \bar{x}) = 0$
Find the average distance between each data value and the mean.
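The zero sum is not a coincidence of this sample. A short derivation, using only the definition of the mean $\bar{x} = \frac{1}{n}\sum x_i$ (so that $\sum x_i = n\bar{x}$), suggests why the plain average of the deviations cannot measure spread:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})
  &= \sum_{i=1}^{n} x_i - n\bar{x} \\
  &= n\bar{x} - n\bar{x} \\
  &= 0
\end{align*}
```

For the sample here, $\sum x_i = 60$, $n = 10$ and $\bar{x} = 6$, so $60 - 10 \cdot 6 = 0$.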
Preventing cancellation
• How can we prevent the negative and positive deviations from cancelling each other out?
1. Ignore (i.e. delete) the negative sign.
2. Multiply each deviation by two.
3. Square each deviation.
4. Take the absolute value of each deviation.
Average absolute deviation
Sample    Deviation from mean ($x_i - \bar{x}$)    Absolute deviation ($|x_i - \bar{x}|$)
10         4                                        4
5         -1                                        1
3         -3                                        3
2         -4                                        4
19        13                                       13
1         -5                                        5
7          1                                        1
11         5                                        5
1         -5                                        5
1         -5                                        5
avg. absolute deviation = 4.6
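The table can be reproduced in a few lines. This is a minimal sketch using the ten sample values from the slide; the variable names are chosen for this example.

```python
import statistics

sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

mean = statistics.fmean(sample)               # 6.0
deviations = [x - mean for x in sample]       # signed distances from the mean
absolute = [abs(d) for d in deviations]       # drop the sign

print(sum(deviations))                 # 0.0, positives and negatives cancel
print(sum(absolute) / len(sample))     # 4.6
```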
Average absolute deviation
Which formula describes what you just did?
1. 𝑥𝑖 / 𝑛
2. |𝑥𝑖 − 𝑥| / 𝑛
3. |𝑥𝑖 − 𝑥| / 𝑛
4. |𝑥 − 𝑥𝑖| / 𝑛
5. |𝑥𝑖 − 𝑥| / 𝑛
Squared deviations
Sample    Deviation from mean ($x_i - \bar{x}$)    Squared deviation ($(x_i - \bar{x})^2$)
10         4                                        16
5         -1                                         1
3         -3                                         9
2         -4                                        16
19        13                                       169
1         -5                                        25
7          1                                         1
11         5                                        25
1         -5                                        25
1         -5                                        25
SS, sum of squares (sum of squared deviations): $\sum (x_i - \bar{x})^2$
avg. squared deviation = 31.2
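As a check on the arithmetic, the sum of squares and its average can be computed directly. This is a sketch with the slide's sample; `ss` is just a local name for the sum of squares.

```python
sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

mean = sum(sample) / len(sample)                 # 6.0
squared = [(x - mean) ** 2 for x in sample]      # squared deviations
ss = sum(squared)                                # sum of squares

print(ss)                   # 312.0
print(ss / len(sample))     # 31.2
print(max(squared) / ss)    # share of SS contributed by the value 19
```

Squaring weights large deviations heavily: the single value 19 contributes 169 of the 312, more than half of the total.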
Variance
The average squared deviation has a special name: variance.
$\dfrac{\sum (x_i - \bar{x})^2}{n}$
Standard deviation
• standard deviation, $s$
$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}}$
• Which symbol would you use for a variance?
$s^2$
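For the ten-value sample used earlier, both quantities are available in Python's standard library. The sketch below uses `statistics.pvariance` and `statistics.pstdev` because they divide by $n$, matching the formula on the slide.

```python
import math
import statistics

sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

variance = statistics.pvariance(sample)   # sum of squared deviations / n
sd = statistics.pstdev(sample)            # square root of the variance

print(variance)         # 31.2
print(round(sd, 2))     # 5.59
```

The standard deviation is in the same units as the data, which the variance is not. Note that `statistics.variance` and `statistics.stdev` divide by $n - 1$ instead, so they give slightly larger values for the same data.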
Standard deviation
• What is so great about the standard deviation? Why don’t we just find the average absolute deviation?
1. SD is used because of tradition.
2. It is easier to work with squares than with absolute values.
3. SD has a very nice interpretation in the Gaussian distribution.
More on absolute vs. standard deviation: http://www.leeds.ac.uk/educol/documents/00003759.htm
Standard deviation – empirical rule
Empirical rule – well behaved distribution
n = 400
mean = 14.2, s.d. = 14.1
mean ± 1 s.d. covers 273 data values, 66.8%
mean ± 2 s.d. covers 380 data values, 95%
mean ± 3 s.d. covers 397 data values, 99.3%
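The empirical rule says that for roughly normal data about 68%, 95% and 99.7% of values fall within 1, 2 and 3 standard deviations of the mean. A quick way to see this is a simulation. The sketch below uses simulated normal data, not the lecture's 400 values, and the helper `share_within` is invented for this example.

```python
import random
import statistics

random.seed(0)                                          # fixed seed, reproducible run
values = [random.gauss(0, 1) for _ in range(20_000)]    # simulated normal data

mean = statistics.fmean(values)
sd = statistics.pstdev(values)

def share_within(k):
    """Fraction of values inside mean ± k standard deviations."""
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

The printed shares should land close to 0.68, 0.95 and 0.997. Skewed or heavy-tailed data can deviate noticeably from these figures.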
Empirical rule – not-so-well behaved distribution
197 countries
mean = 69.9
s.d. = 9.7
65% within 1 s.d.
94.7% within 2 s.d.
100% within 3 s.d.
Statistical inference
• The goal of statistics: make rational conclusions or decisions based on the incomplete information we have in our data.
• This process is known as statistical inference.
• In inferential statistics we want to answer:
1. Is some relationship in data due to chance? Or is it a real difference?
2. If the effect is real, can it be generalized to a larger group?
Statistical jargon
• Population – the group we are interested in making conclusions about.
• Census – a collection of data on the entire population.
• Sample – if we can’t conduct a census, we collect data from a sample of the population. Goal: make conclusions about that population.
Statistical jargon
population (census) vs. sample
parameter (population) vs. statistic (sample)
Population - parameter
Mean $\mu$
Standard deviation $\sigma$
Sample - statistic
Sample mean $\bar{x}$
Sample standard deviation $s$
$\bar{x} = 19.44$, $s = 2.45$
$\bar{x} = 16.89$, $s = 9.17$
$\bar{x} = 17.22$, $s = 6.24$
Statistical inference
• A statistic is a value calculated from our observed data (sample).
• A parameter is a value that describes the population.
• We want to be able to generalize what we observe in our data to our population. In order to do this, the sample needs to be representative.
• How to select a representative sample? Use randomization.
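One simple form of randomization is a simple random sample, where every unit of the population has the same chance of being selected. The sketch below draws such a sample with `random.sample` and compares the statistics with the parameters. The population is made up for illustration (10,000 uniform random numbers), and the sample size of 50 is an arbitrary choice.

```python
import random
import statistics

random.seed(1)                      # fixed seed, reproducible run

# Hypothetical population: 10,000 made-up values (not from the lecture).
population = [random.uniform(0, 100) for _ in range(10_000)]
mu = statistics.fmean(population)           # parameter: population mean
sigma = statistics.pstdev(population)       # parameter: population s.d.

# Simple random sample without replacement.
sample = random.sample(population, k=50)
x_bar = statistics.fmean(sample)            # statistic: sample mean
s = statistics.pstdev(sample)               # statistic: s.d. with the n-denominator formula

print(round(mu, 2), round(sigma, 2))
print(round(x_bar, 2), round(s, 2))
```

Re-running with a different seed gives a different sample and slightly different statistics, while the parameters of a fixed population do not change.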