PUAF 610 TA - Public Policy PhD

PUAF 610 TA
Session 2
1
Today
• Class Review- summary statistics
• STATA Introduction
• Reminder: HW this week
2
Review: Two types of Statistics
• Descriptive statistics summarize
numerical information.
• Inferential statistics use a sample to
draw conclusions about the population.
3
Summary statistic
• In descriptive statistics, summary statistics
are used to summarize a set of
observations.
• Typically,
– What is the central value?
– How widely are values spread from the
center?
– Are there data that are very atypical?
– ….
4
Summary statistic
• a measure of location, or central tendency
• a measure of statistical dispersion
• a measure of the shape of the distribution
5
Central tendency
• Central tendency relates to the way
in which quantitative data tend to
cluster around some value.
• A measure of central tendency is
any of a number of ways of specifying
the “central value”.
6
Basic measures of central
tendency
• Mean
• Median
• Mode
7
Mean
• the sum of all measurements divided by
the number of observations in the data set
• population mean (μ) vs. sample mean (x̄, “x-bar”)
8
Example
• Assume 4 people take PUAF 610, and
their final exam scores are 95, 87, 93, 83.
What’s the mean for exam score?
9
Example
• Mean= (95+87+93+83)/4=89.5
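The same calculation can be checked in a few lines of Python. This is only a sketch; the variable names `scores` and `mean_score` are chosen here for illustration and are not part of the course materials.

```python
# Final exam scores of the four students in the example.
scores = [95, 87, 93, 83]

# Mean = sum of all measurements / number of observations.
mean_score = sum(scores) / len(scores)

print(mean_score)  # 89.5
```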
10
Median
• the middle observation, when data are
ordered from smallest to largest
• the point of a distribution that divides the
bottom 50% from the top 50% of the data.
The median is the 50th percentile.
11
Median
• If there is an odd number of observations,
the median is the middle observation
• If there is an even number of observations,
the median is the average of the two
middle observations
• If the dataset is arranged in increasing
order, the median is located at position
(n+1)/2
12
Example
• Calculate the sample median for the
following observations: 1, 5, 2, 8, 7.
• Start by sorting the values: 1, 2, 5, 7, 8.
• The median is located at position
(n+1)/2=3, thus it is 5.
• An odd number of values.
13
Example
• Calculate the sample median for the
following observations: 1, 5, 2, 8, 7, 2.
• Start by sorting the values: 1, 2, 2, 5, 7, 8.
• The median is located at position
(n+1)/2=3.5, so it is the average of the
two middle values: (2 + 5)/2 = 3.5.
• An even number of values
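Both median examples follow the same rule: sort, then take the middle value (odd n) or the average of the two middle values (even n). A minimal Python sketch of that rule is given here; the function name `median_by_position` is a hypothetical helper, and the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics


def median_by_position(values):
    """Return the median using the sort-and-pick-the-middle rule."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the single middle observation, at position (n+1)/2.
        return ordered[middle]
    # Even n: average of the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_sample = [1, 5, 2, 8, 7]
even_sample = [1, 5, 2, 8, 7, 2]

odd_median = median_by_position(odd_sample)    # 5
even_median = median_by_position(even_sample)  # 3.5

# Cross-check against the standard library.
print(odd_median, statistics.median(odd_sample))
print(even_median, statistics.median(even_sample))
```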
14
Mode
• the most frequent value in the data set
• It is possible for a distribution to have
more than one mode or not to have a
mode at all.
15
Example
• Find the mode for each of the following data sets:
• (1) 1, 2, 2, 3, 4, 7, 9
• (2) 12, 26, 26, 53, 84, 71, 71, 79
• (3) 32, 46, 53, 94, 37, 29
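The three data sets illustrate one mode, two modes, and no mode. One way to check this is to count how often each value occurs, as in the sketch that follows. The helper `modes` is hypothetical, and it assumes the convention that a data set in which every value occurs exactly once has no mode.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once: treat the data set as having no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


set_1 = [1, 2, 2, 3, 4, 7, 9]
set_2 = [12, 26, 26, 53, 84, 71, 71, 79]
set_3 = [32, 46, 53, 94, 37, 29]

print(modes(set_1))  # [2]
print(modes(set_2))  # [26, 71]
print(modes(set_3))  # []
```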
16
Comparing Mode, Median and
Mean
• Pros and Cons
• For descriptive purposes, we use whichever
measure best suits the data.
• If we would like to infer from samples to
populations, the mean is the measure of
choice because it can be manipulated
mathematically.
17
Summary statistic
• a measure of location, or central tendency
• a measure of statistical dispersion, or
variation
• a measure of the shape of the distribution
18
Measures of Variation
• Variation is variability or spread in a
variable
• Measures of variation are lengths of
intervals on the measurement scale that
indicate the spread of values in a
distribution.
19
Measures of Variation
• Range
• Quartiles
• Interquartile range
• Variance
• Standard Deviation
20
Range
• the length of the smallest interval which
contains all the data
• highest value – lowest value
• for whole-number data, some textbooks use the
inclusive range: (highest value – lowest value) + 1
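As a quick illustration, the range of the four exam scores from the earlier mean example can be computed as follows. The sketch shows both the plain difference and the inclusive form with the added 1; the variable names are assumptions made for this example.

```python
scores = [95, 87, 93, 83]

# Plain range: highest value minus lowest value.
simple_range = max(scores) - min(scores)

# Inclusive range: adds 1 so that both end points are counted
# (a convention sometimes used for whole-number data).
inclusive_range = simple_range + 1

print(simple_range)     # 12
print(inclusive_range)  # 13
```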
21
Quartiles
• any of the three values which divide the
sorted data set into four equal parts, so
that each part represents one fourth of the
sampled population.
22
Quartiles
• first quartile (Q1) = lower quartile = cuts off
lowest 25% of data = 25th percentile
• second quartile (Q2) = median = cuts data set
in half = 50th percentile
• third quartile (Q3) = upper quartile = cuts off
highest 25% of data, or lowest 75% = 75th
percentile
• The difference between the upper and lower
quartiles is called the interquartile range (IQR = Q3 – Q1).
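A hedged sketch of computing quartiles and the interquartile range with Python's standard library (`statistics.quantiles`, available from Python 3.8) is given here. The data are the six values from the earlier median example. Note that software packages use slightly different interpolation rules for Q1 and Q3, so other tools may report somewhat different values for small data sets.

```python
import statistics

data = [1, 2, 2, 5, 7, 8]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: upper quartile minus lower quartile.
iqr = q3 - q1

print("Q1:", q1)
print("Q2 (median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
```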
23
Variance
• Describes how far values lie from the mean.
• Deviations from the mean can be negative, so we either
take absolute values or square the deviation scores
to get rid of the minus signs.
• Averages of absolute values are hard to use in
more advanced analyses.
– By averaging the sum of squared deviations (sum of
squares) we get a measure that allows further
algebraic manipulations that are difficult or
impossible with absolute values.
24
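Variance
• Less intuitive and more difficult to interpret,
because it is measured in squared units
rather than original units
• As a result, the variance itself is not used much
for description
• $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
(in population)
and
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
(in sample)
where μ is the population mean, N is the
population size, x̄ is the sample mean, and
n is the sample size.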
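To make the two variance formulas concrete, here is a small sketch that applies them to the four exam scores from the mean example. It assumes the usual definitions: the population variance divides the sum of squares by N, and the sample variance divides by n - 1. The standard library functions `statistics.pvariance` and `statistics.variance` are used only as a cross-check.

```python
import statistics

scores = [95, 87, 93, 83]
n = len(scores)
mean = sum(scores) / n  # 89.5

# Squared deviations from the mean: 30.25, 6.25, 12.25, 42.25.
squared_devs = [(x - mean) ** 2 for x in scores]
sum_of_squares = sum(squared_devs)  # 91.0

# Treating the four scores as the whole population: divide by N.
pop_var = sum_of_squares / n         # 22.75

# Treating them as a sample: divide by n - 1.
samp_var = sum_of_squares / (n - 1)  # about 30.33

print(pop_var, statistics.pvariance(scores))
print(samp_var, statistics.variance(scores))
```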
25
Standard deviation
• A widely used measure of the variability or
dispersion.
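• It shows how much variation there is from the
"average".
• Standard deviation is obtained by taking the square
root of the variance, i.e.
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ (population)
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ (sample)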
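The square-root relationship can be checked numerically. The sketch here reuses the four exam scores from the mean example and assumes the usual N (population) and n - 1 (sample) divisors; `statistics.pstdev` and `statistics.stdev` are the standard library's population and sample versions.

```python
import math
import statistics

scores = [95, 87, 93, 83]

# Standard deviation = square root of the variance.
pop_sd = math.sqrt(statistics.pvariance(scores))  # divisor N
samp_sd = math.sqrt(statistics.variance(scores))  # divisor n - 1

# The library also provides the standard deviation directly.
print(round(pop_sd, 2), round(statistics.pstdev(scores), 2))
print(round(samp_sd, 2), round(statistics.stdev(scores), 2))
```

Unlike the variance, both results are in the original units (exam points), which is why the standard deviation is easier to interpret.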
26
Standard deviation
• A low standard deviation indicates that
the data points tend to be very close to
the mean.
• A high standard deviation indicates
that the data points are spread out over a large
range of values.
27
Summary statistic
• a measure of location, or central tendency
• a measure of statistical dispersion, or
variation
• a measure of the shape of the distribution
28
Shape of the distribution
• Skewness
• Kurtosis
29
Skewness
• a measure of the asymmetry of the
distribution
• The skewness value can be positive or
negative, or even undefined.
30
Skewness
• negative skew: The left tail is longer; the
mass of the distribution is concentrated on
the right of the figure. It has relatively few
low values.
31
Skewness
• positive skew: The right tail is longer; the
mass of the distribution is concentrated on
the left of the figure. It has relatively few
high values.
32
Skewness
• A zero value indicates that the values are
relatively evenly distributed on both sides
of the mean.
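The sign of the skewness can be illustrated with a small sketch. It assumes the moment-based definition (third central moment divided by the 1.5 power of the second central moment); statistical packages may apply small-sample corrections, so their values can differ slightly. The function name `skewness` and the toy data sets are made up for this illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5.

    Undefined when all values are equal (m2 is zero), in which case
    this sketch raises ZeroDivisionError.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]          # a few high values
left_tailed = [-x for x in right_tailed]    # mirror image

print(skewness(symmetric))     # 0.0
print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
```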
33
Kurtosis
• a measure of the
"peakedness" of the
distribution
• Higher kurtosis means
more of the variance is the
result of infrequent extreme
deviations, as opposed to
frequent modestly sized
deviations
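The point about infrequent extreme deviations can be seen in a toy comparison. The sketch here assumes the moment-based definition of kurtosis (fourth central moment divided by the squared second central moment, without subtracting 3); the function name and both data sets are invented for illustration.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (not the 'excess' version)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2


# Frequent, modestly sized deviations from the mean of 3.
evenly_spread = [1, 2, 3, 4, 5]

# Mostly at the mean of 3, with two rare extreme deviations.
rare_extremes = [3, 3, 3, 3, 3, 3, 3, 3, -7, 13]

print(kurtosis(evenly_spread))  # 1.7
print(kurtosis(rare_extremes))  # 5.0
```

In the second data set all of the variance comes from the two extreme values, and the kurtosis is correspondingly higher.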
34
That’s all for class review. So
far so good?
Let’s go to STATA!
35