Probability distributions

Chapter 2: Basics
Petter Mostad
Empirical statistics versus statistical inference
• Empirical statistics = descriptive statistics
  - Extract important features from the data
  - Illustrate the data graphically
  - Give an overview
  - Examples: average values, (sample) standard deviation, histogram, scatterplot, etc.
• Statistical inference: using statistical models to represent and infer knowledge
Data types
• Categorical
• Numerical data:
  - Discrete (whole numbers)
  - Continuous (real numbers, possibly limited to some interval)
• Univariate, bivariate, or multivariate data
The empirical distribution
• The set of observed values
• Can be summarized and illustrated in many ways:
  - Average and (sample) standard deviation
  - Dot diagrams
  - Boxplots
  - Histograms or frequency distributions
  - Scatterplots
Average, standard deviation, and median
• If the data are $x_1, x_2, x_3, \ldots, x_n$, then:
• The average or mean (gives location) is
  $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
• The standard deviation (gives spread) is
  $$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
  The sample variance is the square of the sample standard deviation: $s^2$.
• The median (of the observations sorted by size):
  - With an odd number of observations, the middle one
  - With an even number of observations, the average of the two middle ones
The coefficient of variation
• The ratio of the standard deviation to the mean is called the coefficient of variation: $s / \bar{x}$
• The inverse is sometimes called the signal-to-noise ratio: $\bar{x} / s$
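The summaries defined so far (mean, sample standard deviation, median, coefficient of variation) can be computed directly from their definitions. The following is a minimal sketch; the function name `summarize` and the data values are made up for illustration and are not from the slides.

```python
import math
import statistics

def summarize(data):
    """Descriptive statistics from their definitions.

    Assumes at least two observations and a nonzero mean
    (otherwise the n - 1 divisor or the ratios are undefined).
    """
    n = len(data)
    mean = sum(data) / n
    # Sample variance: squared deviations from the mean, divided by n - 1.
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    sd = math.sqrt(variance)
    ordered = sorted(data)
    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]                           # odd n: the middle value
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values
    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "median": median,
        "cv": sd / mean,    # coefficient of variation
        "snr": mean / sd,   # signal-to-noise ratio
    }

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]  # illustrative values only
summary = summarize(data)
print(summary)
# The standard library gives the same mean, sample s.d. and median:
print(statistics.mean(data), statistics.stdev(data), statistics.median(data))
```

Note that `statistics.stdev` uses the same n - 1 divisor as the formula above, while `statistics.pstdev` divides by n.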
Dot diagrams and boxplots
• Dot diagrams show the data directly
• Boxplots indicate the different portions of the set of values: the median, the quartiles, and the range of the data.
[Figure: example dot diagram and boxplot]
Visualizing numerical data with histograms
• Bins and bin sizes
• The areas should be proportional to the number of observations within each bin.
• Standard: the area is equal to the frequency of observations.
[Figure: example histogram]
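The rule that area, not height, represents the number of observations matters most when bins have unequal widths. The sketch below computes bar heights so that each bar's area equals the relative frequency of its bin (one common convention, which makes the total area 1). The function name `histogram_heights`, the data, and the bin edges are illustrative assumptions.

```python
from bisect import bisect_right

def histogram_heights(data, edges):
    """Bar heights such that area = relative frequency of each bin.

    `edges` must be sorted. Bin i covers [edges[i], edges[i+1]);
    the last bin also includes its right edge.
    """
    counts = [0] * (len(edges) - 1)
    for x in data:
        if x < edges[0] or x > edges[-1]:
            continue  # values outside all bins are ignored
        i = min(bisect_right(edges, x) - 1, len(counts) - 1)
        counts[i] += 1
    n = sum(counts)
    widths = [b - a for a, b in zip(edges, edges[1:])]
    heights = [c / (n * w) for c, w in zip(counts, widths)]
    return counts, widths, heights

data = [1, 2, 2, 3, 3, 3, 4, 6, 8, 9]  # illustrative values only
edges = [0, 2, 4, 10]                  # unequal bin widths on purpose
counts, widths, heights = histogram_heights(data, edges)
areas = [h * w for h, w in zip(heights, widths)]
print(counts, heights, areas)
```

The widest bin holds four observations but gets the lowest bar, because its count is spread over a width of 6.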
Scatterplots
• Displays bivariate data. Here: two variables, x and y.
• Great tool to see relationships between variables.
[Figure: scatterplot of y against x]
Population and sample
• Often: data are obtained by sampling from a population
• Random sampling
• Often: we want to learn about properties of the population by studying the sample: statistical inference
• The properties of the sample approach the population properties as the sample size increases
Example: a political poll
• Population: All Swedes? In Göteborg? Adults?
• Is the sample a random sample?
• Goal: To find out about the political opinions of all Swedes, based on the answers from the sampled persons.
• Random sampling makes it possible to do so.
Example: Repeated measurements
Data: Repeated measurements of the acidity of a river.
• Population: All possible measurements that could have been made. (Infinite population?)
• Random sampling?
• Goal: To learn about the whole population (the acidity in the whole river, and its variation) based on the sample data.
Example: How can we make inference from sample to population?
Your newspaper claims that 40% of Swedes vote S. Assume you ask a random sample of 100 persons, and 32 say they vote S. Can you claim the newspaper is wrong?
• Assume the newspaper is correct. On your computer, simulate a random sample of 100 persons, each with a 40% chance of voting S. You get that 38 vote S.
• Repeat the above 1000 times. You find that in only 64 of the cases (6.4%), 32 or fewer of the simulated Swedes vote S.
• This gives you a pretty good argument that the newspaper is probably wrong.
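The simulation described on this slide can be sketched in a few lines. The function names and the random seed are assumptions for illustration; the count you get will vary from run to run and will not be exactly the 64 reported above. For comparison, the sketch also computes the exact binomial probability of seeing 32 or fewer S voters among 100 when the true share is 40%.

```python
import random
from math import comb

def simulate_poll(n_people=100, p=0.4, rng=random):
    """Number of simulated respondents (out of n_people) who vote S."""
    return sum(rng.random() < p for _ in range(n_people))

def count_at_most(threshold=32, n_reps=1000, seed=1):
    """How many of n_reps simulated polls give `threshold` or fewer S voters."""
    rng = random.Random(seed)  # fixed seed only to make the run repeatable
    return sum(simulate_poll(rng=rng) <= threshold for _ in range(n_reps))

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

low_count = count_at_most()
exact = binomial_cdf(32, 100, 0.4)
print(f"{low_count} of 1000 simulated polls had 32 or fewer S voters")
print(f"exact probability under the newspaper's claim: {exact:.4f}")
```

The simulated fraction and the exact probability should both be roughly 6%, in line with the 64 out of 1000 reported on the slide.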
When population size increases towards infinity
• Sampling with or without replacement.
• When the population size increases towards infinity, the difference between sampling with and without replacement becomes unimportant.
• Often: the population size can be treated as infinite for practical (sampling) purposes (the population of Sweden, etc.), or imagined infinite.
• Histograms can be made with smaller and smaller bin sizes, and approach continuous functions with area under the curve equal to 1.
Sampling from an (almost) infinite population of continuous values
• The population is represented by a continuous function with integral 1, the probability density, representing the population distribution
• The probability that a sampled value will be in an interval is given by the area under the curve of this function over that interval.
• Our goal: to learn about the population distribution from the sample
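In symbols, the two properties of a probability density stated above can be written as follows. The notation is an assumption for illustration: $p(x)$ for the density, $X$ for a sampled value, and $a \le b$ for the interval endpoints.

```latex
\int_{-\infty}^{\infty} p(x)\,dx = 1,
\qquad
P(a \le X \le b) = \int_{a}^{b} p(x)\,dx .
```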
For discrete values:
• The population is represented by a set of probabilities on the possible values
• Our goal: to learn about these probabilities from the sample.
• NOTE: A probability distribution representing an infinite population is sometimes called a random variable (continuous or discrete).
Population average versus sample average
• Generally: the average of the sample will approach the average of the population as the sample size increases.
• For infinite populations represented by a probability distribution:
  - The population average is called the expectation
  - It is computed with an integral (or a sum, for discrete values)
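Written out, the expectation is the probability-weighted average of the possible values. The notation $E(X)$, the density $p(x)$, and the discrete values $x_i$ are assumptions for illustration.

```latex
% Continuous case: weight each value x by the density p(x)
E(X) = \int_{-\infty}^{\infty} x\,p(x)\,dx
% Discrete case: weight each possible value x_i by its probability
E(X) = \sum_{i} x_i\,P(X = x_i)
```

This is the population counterpart of the sample average $\bar{x}$.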
Example: Average CO2 emissions from cars in Sweden
• Investigated by testing a few cars.
• Compared to the size of the sample, the size of the population is effectively infinite.
• The population is represented by a probability distribution, with some expectation.
• We want to learn about this expectation from our sample.
Sample averages seem to stabilize as sample size increases
[Figure: average of sample (vertical axis, about 43 to 47) against sample size (horizontal axis, 0 to 100)]
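The stabilizing behaviour can be reproduced with a small simulation. This is only a sketch: the population is assumed to be normal with expectation 45 and standard deviation 5, values chosen for illustration and not taken from the slides.

```python
import random

def running_averages(values):
    """Average of the first k values, for k = 1, ..., len(values)."""
    averages, total = [], 0.0
    for k, v in enumerate(values, start=1):
        total += v
        averages.append(total / k)
    return averages

rng = random.Random(0)  # fixed seed only to make the run repeatable
# Hypothetical population: normal, expectation 45, standard deviation 5.
sample = [rng.gauss(45.0, 5.0) for _ in range(10_000)]
averages = running_averages(sample)

for k in (10, 100, 1_000, 10_000):
    print(f"n = {k:>6}: sample average = {averages[k - 1]:.2f}")
```

The printed averages should wander for small n and settle close to 45 as n grows.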
Population standard deviation versus sample standard deviation
• We can define the population standard deviation of a probability distribution representing a population.
• The sample standard deviation will then (generally) approach the population standard deviation as the sample size increases.
• We need mathematical tools so that we know what we can say about the population standard deviation from a sample standard deviation.
A statistic
• A specific function of the data is called a statistic. Examples:
  - The mean
  - The variance
  - The range (max - min)
  - ...
Population properties versus sample properties
• Descriptive statistics based on the sample have counterparts defined on probability distributions:
  - sample average vs. population expectation
  - sample variance vs. population variance
  - sample median vs. population median
  - histogram vs. probability density function
• Generally, the sample versions will approach the population versions as the sample size increases.
• How fast and how reliably they do so is an important part of the rest of the course.
Parametric families of probability distributions
• How can we solve the problem of learning about the population distribution from the sample?
• Usual procedure: assume the population distribution has a particular parametric form, and estimate the parameters from the sample.
Example: The normal distribution
• A family of probability distributions.
• Two parameters: $\mu$ and $\sigma > 0$
• The probability density is
  $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
• For all values of the parameters, the integral is 1, so it is a probability distribution.
• The expectation is $\mu$ and the standard deviation is $\sigma$.
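The density formula and the claim that it integrates to 1 can be checked numerically. The sketch below is one way to do so; the function names, the trapezoidal integration, and the parameter values 4 and 2 are choices made for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density with expectation mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def integrate(f, a, b, steps=100_000):
    """Trapezoidal-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h

# Integrate over mu +/- 10 sigma; the area outside that range is negligible.
mu, sigma = 4.0, 2.0
area = integrate(lambda x: normal_pdf(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma)
print(f"area under the density: {area:.6f}")
print(f"peak height of the standard normal density: {normal_pdf(0.0):.4f}")
```

The peak height of about 0.4 for the standard normal matches the vertical axis of the plots in these slides.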
Plots of the normal distribution
[Figure: normal density curves; horizontal axis from -6 to 6, vertical axis from 0.0 to 0.4]
The standard normal distribution has expectation 0 and standard deviation 1
Quantiles of the standard normal distribution
• 68.3 percent of the area is between -1 and 1.
• 95.4 percent is between -2 and 2.
• 99.7 percent is between -3 and 3.
[Figure: standard normal density; horizontal axis from -6 to 6, vertical axis from 0.0 to 0.4]
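These three percentages can be checked with the error function from Python's standard library, using the identity that the standard normal area between -k and k equals erf(k / sqrt(2)). The function name `central_area` is an assumption for illustration.

```python
import math

def central_area(k):
    """Area under the standard normal density between -k and k."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"between -{k} and {k}: {100 * central_area(k):.1f} percent")
```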
Using tables
• To find how much area there is below the curve to the right of a number x, look up x in table A in the textbook.
• Example: The area to the right of 2.77 is 0.0028. So:
  - If the population has a standard normal distribution, the probability of observing something above 2.77 is 0.0028.
  - By symmetry, the probability of observing something less than -2.77 is also 0.0028.
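When no table is at hand, the same right-tail area can be computed with the complementary error function. This is a sketch of a table substitute, not the textbook's table A; the function name `upper_tail` is an assumption.

```python
import math

def upper_tail(z):
    """Area under the standard normal density to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

right = upper_tail(2.77)       # probability of observing something above 2.77
left = 1.0 - upper_tail(-2.77) # probability of observing something below -2.77
print(f"{right:.4f} {left:.4f}")
```

Both values should agree with the 0.0028 read from the table, since the density is symmetric around 0.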
Linear transformations of normal distributions
• If X has a normal distribution with expectation $\mu$ and standard deviation $\sigma$, then $(X - \mu)/\sigma$ has a standard normal distribution.
• Question: If X has a normal distribution with expectation 4 and standard deviation 2, what is the probability that X is above 9?
• Answer: $X > 9$ if and only if $(X - 4)/2 > (9 - 4)/2 = 2.5$. The probability is the same as the probability that a standard normal variable is above 2.5. Table A gives the probability 0.0062.
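The standardization argument can be checked with `statistics.NormalDist` from Python's standard library (available from Python 3.8). The sketch computes the probability once directly from the distribution of X and once via the standardized value; the variable names are chosen for illustration.

```python
from statistics import NormalDist

# Direct: X is normal with expectation 4 and standard deviation 2.
x_dist = NormalDist(mu=4.0, sigma=2.0)
p_direct = 1.0 - x_dist.cdf(9.0)

# Via standardization: (X - 4) / 2 is standard normal.
z = (9.0 - 4.0) / 2.0
p_standardized = 1.0 - NormalDist().cdf(z)

print(f"z = {z}, direct = {p_direct:.4f}, standardized = {p_standardized:.4f}")
```

Both routes should give the 0.0062 found in the table.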
Example: fitting a normal distribution to data
• We have a sample of size 100: see histogram.
• The histogram shape is similar to a normal distribution: we use the parametric family of normal distributions as a guess for the population distribution.
• Sample average: 22.39. Sample standard deviation: 4.40.
• Our best guess for the population distribution is a normal distribution with expectation 22.39 and standard deviation 4.40.
[Figure: histogram of the sample; horizontal axis from 10 to 35, vertical axis "Frequency" from 0 to 15]
Example continued
• Assuming our best guess for the population distribution is correct, we can answer questions such as:
  - What is the probability that a new sampled value will be larger than 30? (Answer: (30 - 22.39)/4.40 = 1.73, and the table gives 0.0418.)
  - What is the probability that a new sampled value will be less than 35? (Answer: (35 - 22.39)/4.40 = 2.87, and the table gives 1 - 0.0021 = 0.9979.)
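Both answers can be reproduced in code. The sketch below mimics the table lookup by rounding the standardized value to two decimals first, and also computes the probabilities directly from the fitted distribution without rounding. The helper name `table_style_upper_tail` is an assumption; small differences in the last digit between the two routes come only from rounding z.

```python
from statistics import NormalDist

fitted = NormalDist(mu=22.39, sigma=4.40)  # sample average and sample s.d. from the slide
standard = NormalDist()

def table_style_upper_tail(x):
    """Standardize x, round z to two decimals as a printed table would,
    and return (z, area to the right of z)."""
    z = round((x - fitted.mean) / fitted.stdev, 2)
    return z, 1.0 - standard.cdf(z)

z30, tail30 = table_style_upper_tail(30.0)
z35, tail35 = table_style_upper_tail(35.0)
print(f"P(new value > 30): z = {z30}, tail = {tail30:.4f}")
print(f"P(new value < 35): z = {z35}, 1 - tail = {1.0 - tail35:.4f}")

# Without rounding z, straight from the fitted distribution:
p_above_30 = 1.0 - fitted.cdf(30.0)
p_below_35 = fitted.cdf(35.0)
print(f"unrounded: {p_above_30:.4f}, {p_below_35:.4f}")
```

These numbers are only as good as the assumption that the fitted normal distribution describes the population.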