Types of Variables (Center for Astrostatistics)
Random Variables, Probability Distributions, and Expected Values
Random Variables (RVs): A random variable is a numerical value assigned to the outcomes of an experiment. Capital letters X, Y, Z, with or without subscripts, are used to denote RVs.
Examples:
- B − V colors of stars
- Absolute magnitude M_i of quasars in the i band
- Number of electrons emitted from a cathode in a time interval of length t
Two types: discrete and continuous.
a. Probability distribution of a discrete random variable: a table of the values of the variable and the proportion of times (or probability) each occurs, which may be expressible in functional form.
b. Probability distribution of a continuous random variable: an idealized curve (perhaps obtained from a histogram) that represents the probability that a value of the variable occurs as an area under the curve.
The first two RVs above are continuous; the third is discrete.
Example: Discrete Random Variable.
Consider observing some phenomenon with exactly two possible outcomes (say, success and failure) until the first success occurs, where the trials are independent of one another. Then it can be shown that the probability function of the number Y of trials until the first success occurs is given by
p(y|θ) = θ(1 − θ)^(y−1), y = 1, 2, …, and 0 otherwise (the geometric distribution).
The parameter θ is the probability of success. For example, suppose we are looking for some astronomical object at random and count the number of objects examined until the first occurrence of the object is found.
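As a minimal sketch (the function names and the value θ = 0.2 are ours, chosen only for illustration), the geometric probability function above can be checked against a simulation of independent trials:

```python
import random

def geometric_pmf(y, theta):
    """p(y | theta) = theta * (1 - theta)**(y - 1) for y = 1, 2, ..."""
    return theta * (1 - theta) ** (y - 1) if y >= 1 else 0.0

def trials_until_first_success(theta, rng):
    """Count independent success/failure trials until the first success."""
    y = 1
    while rng.random() >= theta:  # failure with probability 1 - theta
        y += 1
    return y

rng = random.Random(42)
theta = 0.2  # assumed success probability, for illustration only
sample = [trials_until_first_success(theta, rng) for _ in range(100_000)]
# The empirical proportion of Y = 1 should be close to p(1 | theta) = theta
print(geometric_pmf(1, theta), sample.count(1) / len(sample))
```

The simulated proportion of first-trial successes settles near θ, as the pmf predicts.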
Expected Value of a Discrete RV.
The mean µ of a probability distribution (the mean of a random variable, or the expected value of X) is defined to be
µ = E(X) = Σ_k k · P(X = k),
and more generally, for a function g,
E[g(X)] = Σ g(x) p(x).
In particular, the expected value of the RV X² is given by
E(X²) = Σ_k k² · P(X = k).
The variance σ² of a RV X is given by σ² = Var(X) = E(X²) − [E(X)]², and the standard deviation σ of X, SD(X), is defined to be σ = √(σ²).
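These definitions can be evaluated directly for any pmf given as a finite table. The sketch below (our illustration; the truncation point is an assumption of the example, not part of the definition) uses the geometric pmf with θ = 0.2, whose theoretical mean and variance are 1/θ = 5 and (1 − θ)/θ² = 20:

```python
def expectation(pmf, g=lambda k: k):
    """E[g(X)] = sum over k of g(k) * p(k), for a pmf given as {k: p(k)}."""
    return sum(g(k) * p for k, p in pmf.items())

theta = 0.2  # illustrative success probability
# Geometric pmf p(y) = theta * (1 - theta)**(y - 1), truncated at a large y
pmf = {y: theta * (1 - theta) ** (y - 1) for y in range(1, 2000)}

mean = expectation(pmf)                    # E(X), theoretically 1/theta = 5
ex2 = expectation(pmf, g=lambda k: k * k)  # E(X^2)
variance = ex2 - mean ** 2                 # Var(X) = E(X^2) - [E(X)]^2 = 20
print(mean, variance)
```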
A special discrete probability distribution we will encounter is the Poisson.
The Poisson Distribution.
Situations in which there are many opportunities for some phenomenon to occur, but the chance that the phenomenon will occur in any given time interval, region of space, or whatever is very small, lead to the number X of occurrences of the phenomenon having a Poisson distribution. The Poisson distribution has a parameter λ measuring the rate at which the phenomenon occurs per unit (time period, interval, area, etc.). Here are some examples:
1. Number X of earthquakes of magnitude greater than 5.0 in a region (for example, California, Indonesia, Iran, Turkey, Mexico) in a specified period (five years?)
2. Number X of times lightning strikes in a 30-minute period in a region (like the state of Colorado)
3. The arrival times of photons from a non-variable astronomical object
4. The spatial distribution of instrumental background photons in an image
5. The number of photons arriving in adjacent bins in a spectrum of a faint continuum source
6. The number of ‘arguments’ married couples have in one year
The probability distribution (frequency function) p(y) of a Poisson random variable with rate parameter λ is given by
p(y|λ) = e^(−λ) λ^y / y!,  y = 0, 1, 2, . . .
Fact: The sum of independent Poisson random variables has a Poisson distribution whose parameter is the sum of the parameters of the individual variables: assume Yᵢ, i = 1, 2, …, n, have Poisson distributions with parameters λᵢ. Then
Y = Σ Yᵢ has a Poisson distribution with parameter λ = Σ λᵢ.
The mean and variance of the Poisson distribution are both equal to λ. For ‘large’ values of λ, say λ > 25 (or even smaller), the Poisson distribution is approximately normal. A probability histogram of the Poisson distribution with λ = 25 is given below.
[Figure: scatterplot of p(y|25) vs. y for y from 10 to 40; the probabilities rise to a peak of about 0.08 near y = 25.]
What does the distribution look like? Yeah, normal! So, if λ is large, one can approximate Poisson probabilities using the normal distribution with mean λ and standard deviation √λ.
If a response variable in a regression context has a Poisson distribution, one can
perform a ‘Poisson regression’ analogously to what one does if Y has a normal
distribution in conventional linear or multiple regression. We will illustrate this
later, as an example of ‘generalized linear models’.
Continuous Random Variables
Definition. A continuous random variable X is one for which the outcome can be
any number in an interval or collection of intervals.
Examples. Height, weight, time, head circumference, rainfall amounts, lifetime of
light bulbs, physical measurements, etc.
Probabilities are obtained as areas under a curve, called the probability density function f(x). Below is a graph of the pdf
f(x|20) = (1/20) e^(−x/20), for x > 0 and 0 elsewhere;
it is called the exponential pdf with mean µ = 20; the standard deviation is also 20. It could represent the lifetimes of batteries until recharging, e.g. The cumulative distribution function (CDF) gives the total area under the curve to the left of x (the cumulative probability):
[Figure: scatterplot of f(x|20) vs. x for x from 0 to 180; the density decays from about 0.05 at x = 0 toward 0.]
CDF = F(x) = ∫ f(y|20) dy (integrated from 0 to x) = 1 − e^(−x/20), for x > 0 and 0 elsewhere.
Areas under the curve between two points give the proportion of a population that have values between the two points. For example,
Prob(10 < X < 30) = ∫ (1/20) e^(−y/20) dy (integrated from 10 to 30) = e^(−10/20) − e^(−30/20) = e^(−0.5) − e^(−1.5) ≈ 0.383.
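A quick numerical check of this area, using the closed-form CDF above (the function name is ours):

```python
import math

def expo_cdf(x, mean=20.0):
    """CDF of the exponential pdf f(x|mean) = (1/mean) * exp(-x/mean), x > 0."""
    return 1.0 - math.exp(-x / mean) if x > 0 else 0.0

# Prob(10 < X < 30) = F(30) - F(10) = e**-0.5 - e**-1.5
prob = expo_cdf(30) - expo_cdf(10)
print(round(prob, 4))  # 0.3834
```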
The Normal Distribution. The most well-known continuous distribution is probably the normal, with probability density function (pdf)
f(x|µ, σ) = (1/√(2πσ²)) e^(−(x − µ)²/(2σ²)),  −∞ < x < ∞,
and CDF Φ(x) = ∫ f(y|µ, σ) dy (integrated from −∞ to x).
The graph of a normal pdf is the familiar unimodal, symmetric, bell-shaped curve. The CDF Φ(x) is an elongated S-shaped curve. The mean and variance of a normal distribution are the parameters µ and σ². Many natural phenomena have normal distributions: physical measurements, astronomical variables, etc.
Descriptive Statistics.
Types of Data: We classify all ‘data’ about a variable into two types:
a. Categorical: data with ‘names’ as values.
Ex.: type of gamma-ray burst (GRB): short-hard, long-soft.
b. Numerical (or quantitative): data whose values are numerical.
Ex.’s: mass of black holes, distance to stars, temperature at launch time of a shuttle, brightness of a star.
Numerical (also called quantitative) variables are divided into two types: discrete and continuous.
Parameters and Statistics.
Samples. When we obtain a sample from a population, we also say we obtained a sample from the probability distribution.
Statistics are quantities calculated from samples.
Parameters are characteristics computed from the population as a whole or from a probability distribution.
The quantities µ, σ, and σ² are parameters. Statistics are used to estimate parameters. For example, the sample mean is used to estimate the mean of the population from which the sample is obtained.
Graphical and Numerical Summaries of Quantitative Variables
Numerical Summaries:
1. Measures of Location:
Three commonly used measures of the center of a set of numerical values are the mean, median, and trimmed mean.
Mean: x̄ = the average of the data values.
Trimmed Mean: Delete a (fixed) proportion of the smallest and largest observations (e.g., 5% or 10% each) and then recalculate the mean (as is done when judging some contests).
Median: Arrange the data in order from smallest to largest, with n observations. If n is odd, the median is the middle number. If n is even, the median is the average of the middle two numbers.
Measures of Position in the Dataset:
The First Quartile Q1 is the median of the numbers below the median, or the 25th percentile.
The Third Quartile Q3 is the median of the numbers above the median, or the 75th percentile.
Quantiles are order statistics expressed as fractions or proportions. For example, the pth quantile Qp (or 100pth percentile) divides the lower proportion p of the data from the upper proportion 1 − p. For example, Q.67 (or the 67th percentile) divides the lower .67 of the data from the upper .33 of the data. Q.25 and Q.75 are the first and third quartiles.
The Interquartile Range (IQR) = Q3 − Q1.
Five-Number Summary: Min, Q1, Median, Q3, Max.
Example 1. The body temperatures of 18 adults were measured, resulting in the
following values:
98.2 97.8 99.0 98.6 98.2 97.8 98.4 99.7 98.2 97.4 97.6
98.4 98.0 99.2 98.6 97.1 97.2 98.5
Data Display (Sorted, from smallest to largest):
97.1 97.2 97.4 97.6 97.8 97.8 98.0 98.2 98.2 98.2 98.4
98.4 98.5 98.6 98.6 99.0 99.2 99.7
Descriptive Statistics: BodyTemp
Variable   N   Mean    SE Mean  StDev  Minimum  Q1      Median  Q3      Maximum
BodyTemp   18  98.217  0.161    0.684  97.100   97.750  98.200  98.600  99.700
Five-Number Summary (the last five quantities in the descriptive statistics above):
Minimum 97.100, Q1 97.750, Median 98.200, Q3 98.600, Maximum 99.700
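These summary values can be reproduced with the Python standard library (a sketch; note that the default ‘exclusive’ method of statistics.quantiles matches the quartile convention used in this output):

```python
import statistics

temps = [98.2, 97.8, 99.0, 98.6, 98.2, 97.8, 98.4, 99.7, 98.2, 97.4, 97.6,
         98.4, 98.0, 99.2, 98.6, 97.1, 97.2, 98.5]

q1, med, q3 = statistics.quantiles(temps, n=4)  # 'exclusive' method by default
print(round(statistics.mean(temps), 3))  # 98.217
print(min(temps), round(q1, 3), round(med, 3), round(q3, 3), max(temps))
```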
A Boxplot (simple, with no unusual observations) is a graphical display of the five-number summary. The ‘box’ is drawn from Q1 to Q3 with the median shown inside the box; lines are drawn from the minimum value to the bottom of the box (at Q1) and from the top of the box (at Q3) to the maximum value.
[Figure: boxplot of BodyTemp, spanning roughly 97.0 to 100.0.]
Outliers:
An observation is a mild outlier if it is more than 1.5 IQR’s below Q1 or 1.5 IQR’s
above Q3. It is an extreme outlier if it is more than 3 IQR’s below Q1 or above
Q3.
Software packages often identify outliers in some fashion; e.g., Minitab puts an ‘*’
for outliers (not necessarily all of them though).
Example. Number of CDs owned by college students in Penn State University Stat classes:
Variable  N    Mean   SE Mean  StDev  Min  Q1  Median  Q3   Max
CDs       236  78.08  5.57     85.59  0    25  50.00   100  500
Mild Outliers: IQR = Q3 − Q1 = 100 − 25 = 75; (1.5)(IQR) = (1.5)(75) = 112.5. Mild outliers are #CDs < 25 − 112.5 = −87.5 (impossible here) or > 100 + 112.5 = 212.5. There are 17 values > 212.5 (with multiples at some values).
Extreme Outliers: 3·IQR = (3)(75) = 225. Extreme outliers are #CDs < 25 − 225 (a negative value, so none) or > 100 + 225 = 325. By this rule, there are several extreme outliers. See the boxplot below.
[Figure: boxplot of CDs, 0 to 500, with many starred outliers above the upper whisker.]
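The 1.5-IQR and 3-IQR fences can be written as a small helper (a sketch; the function name is ours):

```python
def outlier_fences(q1, q3):
    """Return (mild_low, mild_high, extreme_low, extreme_high) outlier fences."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr, q1 - 3 * iqr, q3 + 3 * iqr)

# CDs example: Q1 = 25, Q3 = 100
print(outlier_fences(25, 100))  # (-87.5, 212.5, -200, 325)
```

Values below the low fences or above the high fences are flagged as mild or extreme outliers, respectively.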
Stem-and-Leaf Plots: A stem-and-leaf plot is a graphical display of data consisting of a stem (the most important part of a number) and leaves (the second most important part of a number).
Example. Stem and Leaf diagram of CDs:
Stem-and-leaf of CDs N = 236; Leaf Unit = 10
109 0 00000001111111111111111111111112222222222222222222222222222+
(56) 0 55555555555555555555555555555556666666667777777888899999
71 1 00000000000000000000000001122
42 1 555555555555555
27 2 00000000001
16 2 55555
11 3 000000
5 3 5
4 4 0
3 4 55
1 5 0
Resistant Statistics: A statistic is said to be ‘resistant’ if its value is relatively unaffected by outliers.
Example 1. The salaries of employees in a small company are as follows:
$20K, $20K, $20K, $20K, $20K, $500K, and $800K.
The average salary is $200K. Delete the highest salary and the mean falls to $100K. Delete the two highest salaries and the mean is $20K. The median is $20K in all three situations. The median is a resistant statistic; the average is not.
Example 2. Remove the 5 extreme outliers in the CDs dataset and redo the descriptive statistics.
Descriptive Statistics: CDs, with and without the extreme outliers:
Variable            N    Mean   SE Mean  StDev  Min  Q1  Med  Q3   Max
CDs (outliers out)  231  70.46  4.50     68.39  0    25  50   100  300
CDs (outliers in)   236  78.08  5.57     85.59  0    25  50   100  500
Note that the only statistic in the five-number summary that changed was the Max (which had to change!). Note also that the mean decreased.
Examples. Resistant statistics: median, 1st and 3rd quartiles, and IQR (for moderate samples, roughly n = 10 or more).
Non-resistant statistic: mean (average).
Measures of Spread (Variability):
Interquartile Range (IQR), Standard Deviation, Range = Maximum − Minimum, Mean Absolute Deviation, and Median Absolute Deviation.
The IQR measures the spread of the middle 50% of the data.
The Standard Deviation (SD) is roughly the average distance of the values from the mean. The actual definition of the standard deviation, denoted by s, is the square root of the sample variance s², where
s² = Σ (xᵢ − x̄)² / (n − 1)
is the sum of squared deviations of the values from the mean, divided by n − 1. The SD is not a resistant statistic.
The mean absolute deviation = Σ |xᵢ − x̄| / n.
The median absolute deviation = median[ |xᵢ − median(xᵢ)| ].
Example. Body Temperature.
The interquartile range IQR = Q3 − Q1 = 98.600 − 97.750 = 0.850
The sample range = Max − Min = 99.7 − 97.1 = 2.60
The sample variance s² = 0.467353; the SD = s = √(s²) = 0.6836
The mean absolute deviation (about the mean) = 0.5185
The median absolute deviation = 0.400
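These spread measures for the body-temperature data can be reproduced with the standard library (a sketch; the variable names are ours):

```python
import statistics

temps = [98.2, 97.8, 99.0, 98.6, 98.2, 97.8, 98.4, 99.7, 98.2, 97.4, 97.6,
         98.4, 98.0, 99.2, 98.6, 97.1, 97.2, 98.5]

n = len(temps)
xbar = statistics.mean(temps)
med = statistics.median(temps)

s2 = statistics.variance(temps)                       # sample variance, divisor n - 1
s = statistics.stdev(temps)                           # sample standard deviation
mean_abs_dev = sum(abs(x - xbar) for x in temps) / n  # about the mean
median_abs_dev = statistics.median(abs(x - med) for x in temps)
print(round(s2, 6), round(s, 4), round(mean_abs_dev, 4), round(median_abs_dev, 3))
```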
Astronomy Example. We use data from Mukherjee, Feigelson, Babu, et al., “Three Types of Gamma-Ray Bursts”, The Astrophysical Journal, 508, pp. 314-327, 1998, in which there are 11 variables, including two measures of burst duration, T50 and T90 (times in which 50% and 90% of the flux arrives), and total fluence (flu_tot), the sum of 4 time-integrated fluences. Descriptive statistics for the variables ‘flu_tot’ and ‘ln(flu_tot)’ are given below.
Variable     N    Mean       SE Mean     StDev
flu_tot      802  0.0000125  0.00000164  0.0000465
ln(flu_tot)  802  -12.955    0.0632      1.789
Five-Number Summaries:
Variable     Min           Q1           Median      Q3          Max
flu_tot      0.0000000159  0.000000720  0.00000234  0.00000734  0.000781
ln(flu_tot)  -17.957       -14.144      -12.968     -11.823     -7.155
The Empirical Rule says that if the data are symmetric and bell-shaped (unimodal), indicative of a normal distribution, then:
About 68% of the observations will be within 1 SD of the mean.
About 95% of the observations will be within 2 SDs of the mean.
Almost all (99.7%) of the observations will be within 3 SDs of the mean.
For the variable ‘ln(flu_tot)’, we find that the intervals and percentages are as follows:
Mean ± 1 StDev:  -12.955 ± 1.789 = (-14.744, -11.166); 554/802 = 0.6908, or 69%
Mean ± 2 StDev:  -12.955 ± 3.578 = (-16.533, -9.377); 763/802 = 0.9514, or 95%
Mean ± 3 StDev:  -12.955 ± 5.367 = (-18.322, -7.588); 800/802 = 0.9975, or 99.75%
A similar dataset on gamma-ray bursts included a categorical variable, gmark, with four values. Boxplots graphically displaying this data for the variable flu_tot and its log transform are given below. They dramatically illustrate how transforming the data (here, using a log transformation) reduces or eliminates outliers, gives a visual comparison of the five-number statistics, and enables one to compare the values (medians) for the four types of gamma-ray bursts.
[Figures: side-by-side boxplots of lf_tot vs. gmark (y from -8 to -3) and flutot vs. gmark (y from 0 to 0.0008), for gmark = 1, 2, 3, 4.]
2. Coefficient of Variation = 100·StDev/|Mean| = (100)(1.789)/(12.955) = 13.81. Small values of the coefficient are desirable (indicating small errors compared to the size of the observations).
Standard Errors. The sample mean x̄ is a statistic (a quantity calculated from a sample). As such, it varies from sample to sample and hence is a random variable with a probability distribution. It can be shown that
E(x̄) = µ = the mean of the population from which the sample is taken
and
SD(x̄) = σ/√n,
where σ is the standard deviation of the population from which the sample is taken and n is the size of the sample.
The standard error of the mean, denoted by SE Mean or s.e.(x̄), is the estimated standard deviation of x̄:
s.e.(x̄) = s/√n.
Empirical Cumulative Distribution Function
Definition: The empirical cumulative distribution function (ECDF) is a step function given by
F̂n(x) = #{Xᵢ ≤ x} / n,  −∞ < x < ∞,
where X1, X2, . . . , Xn is a random sample (from some distribution). An example is given below. In other words, the ECDF F̂n(x) increases by 1/n at each value in the sample (or by k/n if there are k identical observations at some value).
Example. Body Temperature. Here are the n = 18 sorted body temperatures and a graph of the ECDF:
97.1 97.2 97.4 97.6 97.8 97.8 98.0 98.2 98.2 98.2 98.4 98.4 98.5 98.6 98.6 99.0 99.2 99.7
The ECDF equals
0 for x < 97.1,
1/18 for 97.1 ≤ x < 97.2,
2/18 for 97.2 ≤ x < 97.4, etc.
[Figure: empirical CDF of BodyTemp with a fitted normal overlay (Mean 98.22, StDev 0.6836, N 18).]
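A minimal ECDF implementation, evaluated at a few points of the body-temperature sample (the function names are ours):

```python
def ecdf(sample):
    """Return the empirical CDF Fhat_n(x) = #{X_i <= x} / n as a function."""
    data = sorted(sample)
    n = len(data)
    def fhat(x):
        return sum(1 for v in data if v <= x) / n
    return fhat

temps = [98.2, 97.8, 99.0, 98.6, 98.2, 97.8, 98.4, 99.7, 98.2, 97.4, 97.6,
         98.4, 98.0, 99.2, 98.6, 97.1, 97.2, 98.5]
F = ecdf(temps)
print(F(97.0), F(97.1), F(98.2), F(99.7))
```

As the definition requires, F jumps by 1/18 at each distinct value (and by 3/18 at 98.2, where three observations tie).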
Nonparametric statistics make fundamental use of ECDF’s (a topic to be discussed
later in this short course).
Q-Q (Quantile-Quantile) and Normal Probability Plots.
These plots are used to determine whether a particular distribution fits data or to
compare different sample distributions. Q-Q Plots graph quantiles of one variable
against quantiles of a second variable. Two uses for quantile-quantile plots are:
1. comparing two empirical distributions
2. comparing an empirical distribution with a theoretical distribution (Normal, for example).
The basic idea of Q-Q plots is to plot the quantiles of the two distributions (either 1 or 2 above) against one another. If the two distributions are roughly the same, their quantiles should be about the same, so a plot of the quantiles against one another should be roughly a straight line.
Example: Normal Q-Q and Probability Plots.
We have a sample of n = 5 observations: 20, 24, 25, 27, and 30. Their mean is x̄ = 25.2 and their standard deviation is s = 3.70135.
x    i   p = pi     Qp = pth quantile   other p's   other q's
20   1   0.129630   21.0243             0.1         20.4565
24   2   0.314815   23.4150             0.2         22.0849
25   3   0.500000   25.2000             0.3         23.2590
27   4   0.685185   26.9850             0.4         24.2623
30   5   0.870370   29.3757             0.5         25.2000
Data: ‘x’; pi = (i − 0.3)/(n + 0.4) (some people use (i − 0.375)/(n + 0.25)); Qp = the pth quantile = x̄ + s·Φ⁻¹(p), where Φ⁻¹ is the inverse of the cumulative standard normal distribution function.
The first graph below is a Q-Q plot of x vs. the pth quantile Qp. The second is a plot of the quantiles vs. p (so a normal probability plot). The third is a normal probability plot on a probability-scaled axis (terminology depends on the scale of the y-axis).
[Figures: (1) Normal Q-Q plot of x vs. pth quantile; (2) probability plot of x (Normal; Mean 25.2, StDev 3.701, N 5, AD 0.161, P-Value 0.884); (3) scatterplot of p vs. pth quantile.]
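The plotting positions and normal quantiles in the table above can be reproduced with the standard library's statistics.NormalDist (a sketch; the loop variable names are ours):

```python
from statistics import NormalDist

x = [20, 24, 25, 27, 30]
n = len(x)
mean, sd = 25.2, 3.70135

dist = NormalDist(mean, sd)
for i, xi in enumerate(x, start=1):
    p = (i - 0.3) / (n + 0.4)  # plotting position p_i
    q = dist.inv_cdf(p)        # pth quantile of a normal with this mean and SD
    print(xi, round(p, 6), round(q, 4))
```

Plotting the pairs (q, xi) gives the Q-Q plot; an approximately straight line indicates a good normal fit.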
Example. Let’s re-examine the astronomy example above: data from Mukherjee, Feigelson, Babu, et al., “Three Types of Gamma-Ray Bursts”. The five-number summaries of the two variables ‘flu_tot’ and ‘ln(flu_tot)’ are given again, along with histograms of the data.
Five-Number Summaries:
Variable     Min           Q1           Median      Q3          Max
flu_tot      0.0000000159  0.000000720  0.00000234  0.00000734  0.000781
ln(flu_tot)  -17.957       -14.144      -12.968     -11.823     -7.155
Graphical displays (Histograms)
[Figures: histogram of flu_tot (frequencies piled up near 0, range 0 to 0.0007) and histogram of ln(flu_tot) (labeled C13; roughly bell-shaped, range -18.0 to -7.5).]
Is the variable ‘ln(flu_tot)’ normally distributed, as suggested by the histogram?
Here are graphs of the empirical distribution function of the data and a graph with
statistics that provide a test for normality--the first graph is the empirical cdf of
ln(flu_tot) and the second is a probability plot of ln(flu_tot):
[Figures: empirical CDF of ln(flu_tot) (labeled C13) with fitted normal (Mean -12.95, StDev 1.789, N 802), and a normal probability plot of ln(flu_tot).]
The information in the box within the right-hand graph is reproduced below:
Anderson-Darling (AD) test for Normality:
Mean    -12.95
StDev   1.789
N       802
AD      0.282
P-Value 0.638 (indicates a good fit for the normal)
Comment. A random variable X is said to have a lognormal distribution if Y = ln X has a normal distribution. Here, we take X = flu_tot, Y = ln(flu_tot). The Anderson-Darling test indicates that it is reasonable to assume that Y has a normal distribution (with mean µ estimated by x̄ = -12.955 and standard deviation σ estimated by s = 1.789).
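As a sketch of this lognormal relationship (the parameter values are taken from the estimates above; the simulation itself is our illustration, not part of the original analysis):

```python
import math
import random
import statistics

rng = random.Random(0)
mu, sigma = -12.955, 1.789  # estimated mean and SD of ln(flu_tot)

# X ~ lognormal(mu, sigma)  <=>  Y = ln X ~ normal(mu, sigma)
xs = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]
ys = [math.log(x) for x in xs]

# The sample mean and SD of the logs should recover mu and sigma
print(round(statistics.mean(ys), 2), round(statistics.stdev(ys), 2))
```

The logs of the simulated fluences have mean and SD close to -12.955 and 1.789, mirroring how the observed ln(flu_tot) passes the normality test.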