Descriptive Statistics - Department of Mathematics and Statistics

Descriptive Statistics
(Gathering and organizing data)
Chapter 6
1
Experimentation
The celebrity news magazine Us Weekly
conducted a poll in which they asked 100 people
whether they were shocked by the breakup of
actress Carmen Electra and musician Dave
Navarro after nearly three years of marriage.
The participants in the survey were individuals
encountered on Fifth Avenue in New York City.
43% of those asked said they were shocked by
the news while the other 57% said they were
not.
2
Experimentation
In chapters 2-5 we looked at probability distributions of
random variables such as
$f(x) = \frac{1}{2} e^{-x/2}, \quad x \ge 0$
where X is the lifetime of a certain type of insect.
More often than not, the true underlying probability
distribution of a random variable is not known. An
investigator finds out everything possible about the
probability density function through conducting studies
and collecting data.
3
Experimentation
Observational Study: Individuals are observed and measured on
variables of interest. Researchers do not implement a ‘treatment’.
A recent study by a web site that gives quotes on insurance rates
found that astrological signs are a significant factor in predicting car
accidents. The study looked at 100,000 North American drivers'
records from the past six years. Reuters
– Libras were found to be the worst offenders for tickets and accidents.
Leos were found to be the best overall.
– In the study results, Leos were described as "generous, and
comfortable in sharing the roadway." Whereas Aries "have a 'me first'
childlike nature that drives [them] into trouble.”
One criticism of observational studies is that the researchers may
overlook important variables (confounding factors) that influence the
observed outcomes.
4
Experimentation
At a recent conference on eating disorders, a
research group presented data showing that
individuals who undergo a “stomach stapling”
procedure to control their weight tend to have
obese children. The researchers concluded that
the parents’ stomach stapling caused the obesity
in their children.
What other factors might explain this relationship?
5
Experimentation
Experiment: Researchers impose some treatment on
subjects in order to observe their responses. An
experiment can be used to determine whether the
treatment caused the response.
Duct tape no magical cure for warts, study finds November 6, 2006
WASHINGTON (Reuters) -- Duct tape does not work any better than
doing nothing to cure warts in schoolchildren, Dutch researchers
reported on Monday in a study that contradicts a popular theory
about an easy way to get rid of the unattractive lumps.
The study of 103 children aged 4 to 12 showed the duct tape worked
only slightly better than using a corn pad, a sticky cushion that does
not actually touch the wart and which was considered to be a
placebo.
6
Experimentation
In order to ensure that the data collected appropriately
address the issue in question, the population must be
appropriately identified.
A population consists of all possible observations available
from a particular probability distribution.
The identification of the population depends on the
question.
7
Experimentation
Example: Consider the following three questions
that might underlie the Us Weekly poll seen
previously.
1. What percentage of U.S. adults were shocked by
Carmen Electra’s and Dave Navarro’s breakup?
2. What percentage of New York residents were
shocked by Carmen Electra’s and Dave Navarro’s
breakup?
3. What percentage of Us Weekly readers were
shocked by Carmen Electra’s and Dave Navarro’s
breakup?
The question indicates the population.
8
Experimentation
A parameter is a quantity θ which is a property of interest in
an unknown probability distribution.
Parameters are usually unknown; one goal of statistics is to
estimate them.
Example: An investigator wishes to know what proportion of voters in
the US will vote ‘Republican’ in the next election.
Population – Voters in the US.
Parameter – proportion who will vote ‘Republican’.
Example: A toothpaste company would like to know what proportion of
dentists recommend the brand it produces. Identify a population and
parameter.
Population –
Parameter –
9
Experimentation
Typically, it is not feasible to collect data from every
individual in a given population.
Why not?
How would you get the data you need?
If it were possible to gather data from the entire population,
the true underlying probability mass function or
probability density function would be known.
10
Experimentation
Data is generally collected for a subset of the population, a sample.
A statistic is a random variable whose value can be calculated from a
sample.
The statistic can then be used to estimate an unknown parameter and
thereby to draw conclusions about the overall population.
The science of deducing properties of an underlying probability
distribution from a sample is known as statistical inference.
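The difference between a parameter and a statistic can be seen in a short simulation. This is only a sketch: the population of 100,000 people, the true proportion of 0.43, the sample size, and the name `sample_proportion` are all invented for illustration and do not come from any survey.

```python
import random

def sample_proportion(population, sample_size, rng):
    """Return the proportion of 1s in a simple random sample."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

# Hypothetical population: 100,000 people, 43% coded 1 ("shocked").
population = [1] * 43_000 + [0] * 57_000
true_proportion = sum(population) / len(population)   # the parameter

rng = random.Random(2006)   # fixed seed so the run is repeatable
estimate = sample_proportion(population, 1_000, rng)  # the statistic

print(f"parameter = {true_proportion:.3f}, statistic = {estimate:.3f}")
```

In practice the parameter is unknown; it is computed here only because the population was constructed by hand. A different seed gives a different value of the statistic, which is why a statistic is treated as a random variable.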
11
Sample Selection
When the sample is representative of the population,
estimating parameters from the sample is justified. How
a sample is chosen is therefore very important.
For a sample to be representative of the population, every
individual in the population must have a chance of being
included in the sample.
When a sample is poorly chosen, a statistic calculated from
that sample is a poor estimate of the population
parameter.
12
Sample Selection
Example: Suppose the population for the “Breakup” survey
is all adults in the U.S.
The parameter of interest is presumably the percentage of
U.S. adults who were shocked by Carmen Electra’s and
Dave Navarro’s breakup.
Is the sample used in the survey representative of the
population? Is the statistic obtained from the Us Weekly
survey a good estimate of the parameter?
13
Sample Selection
‘If you had it to do again, would you still have children?’ In
the 1970s Ann Landers submitted this question to her
readers. She received over 10,000 responses. 70% of the
respondents answered “no.”
What were the population and parameter of interest?
Did Ann Landers’ sample adequately represent the
population?
Are there any issues regarding the way the sample was
chosen that need to be addressed?
14
Sample Selection
Some common sources of bias in a sample are:
Selection Bias: a systematic tendency on the part of the
sampling procedure to exclude one kind of person or
another from the sample.
Non-response bias: bias introduced by important
differences between those who respond and those who
do not.
How would you select a representative sample to carry out
Ann Landers’ survey?
15
Sample Selection
To ensure that the sample is representative of the
population, a random sample should be used, i.e. the
elements in the sample should be chosen randomly from
the population. The following methods could be used to
randomly select individuals for the sample:
– Toss a coin
– Draw names from a hat
– Use computer generated random numbers
When a sample is randomly chosen, human choice is not
part of the selection process.
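The third method, computer-generated random numbers, might look like the following. This is a minimal sketch: the list of 500 labelled individuals and the function name `draw_random_sample` are hypothetical, and a real study would first need a complete list of the population to draw from.

```python
import random

def draw_random_sample(names, k, seed=None):
    """Choose k distinct names so that every name has the same chance."""
    rng = random.Random(seed)
    return rng.sample(names, k)

# Hypothetical population list: 500 labelled individuals.
frame = [f"person_{i:03d}" for i in range(1, 501)]

chosen = draw_random_sample(frame, 10, seed=1)
print(chosen)
```

Fixing the seed makes the selection repeatable for checking; leaving `seed=None` gives a fresh selection on each run.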
16
Sample Selection
Example:
The Statesman Survey
(www.utahstatesman.com)
What is the population?
Is the sample representative of the population?
What sources of bias might influence this survey?
17
Sample Selection
Discuss the following methods of sample selection.
1. 4 students want to determine which Logan grocery store has the
lowest prices. Each student makes a list of the items he buys
most often and the students compare prices on those items at the
local stores.
2. 4 students want to determine which Logan grocery store has the
lowest milk prices. They record the prices of all varieties of milk
at all grocery stores. The students then randomly select 4 types of
milk to compare for all stores.
3. A teacher wishes to know what proportion of the students that
take Stat 3000 are female. She teaches one section of Stat 3000
in the spring, thus she chooses this section for her sample.
18
Data Presentation
Presenting data in a clear
organized way facilitates
understanding the
information.
There are both graphical
and numeric methods of
summarizing data.
“If ‘a picture is worth a
thousand words,’ then it is
worth at least a million
numbers.” (from the text)
19
Data Presentation
Poorly constructed graphics for data presentation can be
worse than useless since they sometimes obscure
information or appear to exaggerate relationships.
20
Pie Charts – Categorical Data
A pie chart presents frequencies of categorical data
visually. It emphasizes the proportion of the total data set
that falls into each category.
A category containing r of n total observations is
represented by a slice of the pie with an angle of
(r/n) × 360°.
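As a quick check of the angle rule, the sketch below applies it to the Us Weekly poll counts (43 of 100 shocked, 57 not). The function name `slice_angles` is made up for this example.

```python
def slice_angles(counts):
    """Map each category to its pie-slice angle in degrees: (r / n) * 360."""
    n = sum(counts.values())
    return {category: r * 360 / n for category, r in counts.items()}

# Counts from the Us Weekly poll of 100 people.
angles = slice_angles({"shocked": 43, "not shocked": 57})
print(angles)
```

The angles always add up to 360°, because the proportions r/n add up to 1.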
21
Bar Charts – Categorical Data
A bar chart presents frequencies of categorical data
visually. Each category has a corresponding bar whose
height is proportional to the frequency associated with that
category. Bar charts are useful for comparing categories.
When might a bar chart be more useful than a pie chart?
22
Histograms – Numerical Data
Histograms look similar to bar charts, but are used to
present numerical data rather than categorical.
The histogram is composed of bins whose area is
proportional to the number of data observations which
take the value(s) within the bin. (Note: if all bins have
equal width then the height of the bin is proportional to
the number of observations with values within the bin.)
A probability mass function (or density function) can be
thought of as a smoothed version of a histogram.
23
Histograms – Numerical Data
The bin width needs to be carefully chosen. If it is too
wide, little of the structure of the underlying
probability density function will be observable; if it is
too narrow, the histogram could have many gaps
and spikes.
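The effect of bin width can be seen by counting the same data with different widths. This is a sketch only: the twelve measurements are invented for illustration, and `bin_counts` is a hypothetical helper that uses equal-width bins starting at a chosen value.

```python
def bin_counts(data, start, width, n_bins):
    """Count observations in n_bins equal-width bins.

    Bin k covers the interval [start + k*width, start + (k+1)*width).
    """
    counts = [0] * n_bins
    for x in data:
        k = int((x - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical measurements, invented purely for illustration.
data = [1.2, 1.9, 2.3, 2.4, 2.8, 3.1, 3.3, 3.6, 4.0, 4.7, 5.5, 7.9]

wide = bin_counts(data, 0, 4.0, 2)       # [8, 4]: shape is hidden
medium = bin_counts(data, 0, 1.0, 8)     # [0, 2, 3, 3, 2, 1, 0, 1]
narrow = bin_counts(data, 0, 0.25, 32)   # mostly empty bins and single spikes

print(wide)
print(medium)
print(narrow)
```

With two wide bins only a rough split is visible, while with 32 narrow bins most bins are empty and almost no bin holds more than one observation.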
24
Outliers
Outliers are data points that
are separated from the rest
of the data.
Outliers can affect
conclusions drawn from the
data and should therefore
be removed if possible.
–If outliers are the result of
misrecorded or otherwise
incorrect data collection, they
can be removed.
–If outliers represent true
variation in the population,
they should not be discarded.
25
Outliers
Data from a Stat 1040 class. Here the outliers
represent true variation in the population.
26
Sample Statistics
Numerical data summaries are called
sample statistics. Some useful sample
statistics are the sample mean, median,
standard deviation, and variance.
27
Sample Statistics
Measures of Center:
For a data set consisting of n observations, x1,...,xn,
• Sample mean : $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Sample median : the “middle” data point. When n is even, the
average of the two middle points.
• Sample trimmed mean : the mean of the observations minus
some of the largest and some of the smallest, say 5% of each.
• Sample mode : The category or data value which contains the
largest number of observations.
28
Sample Statistics
In 1984, the University of Virginia reported
that the mean starting salary of graduates
from its department of Rhetoric and
Communications was $55,000. One of
these graduates was Ralph Sampson who
played basketball in the NBA. His salary
did not reflect the earning potential of
other graduates from his department.
UVA did not publish the median salary.
29
Sample Statistics
Examples: The ages of 5 students are 21, 22, 24,
24, 26.
1. Find the sample mean.
2. Find the sample median.
3. Find the sample trimmed mean (remove 20% top
and bottom).
4. Find the sample mode.
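One way to check hand calculations for this example is with Python's standard `statistics` module. The helper `trimmed_mean` is not part of that module; it is a small function written for this sketch, and it assumes that trimming 20% from each end of 5 observations means dropping one value at each end.

```python
from statistics import mean, median, mode

def trimmed_mean(data, fraction):
    """Mean after dropping `fraction` of the observations from each end."""
    ordered = sorted(data)
    k = int(len(ordered) * fraction)   # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)

ages = [21, 22, 24, 24, 26]

print(mean(ages))                # 23.4
print(median(ages))              # 24
print(trimmed_mean(ages, 0.2))   # mean of 22, 24, 24
print(mode(ages))                # 24
```

The trimmed mean here is 70/3, roughly 23.33, slightly below the ordinary mean because the removed largest value (26) lies further from the center than the removed smallest value (21).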
30
Sample Statistics
Measures of Spread:
For a data set consisting of n observations, x1,...,xn,
• The sample variance : $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• The sample standard deviation : $s = \sqrt{s^2}$
• The pth sample quantile (percentile): a value such that a proportion p
(100p%) of the sample values are smaller than it.
• Upper and lower sample quartiles : the 75th percentile and 25th percentile
respectively. If these quartiles fall between values, they can be found
by taking weighted averages as follows:
• Upper sample quartile = (1/4 * low #) + (3/4 * high #)
• Lower sample quartile = (3/4 * low #) + (1/4 * high #)
• Interquartile range (IQR) : the difference between the upper and lower
sample quartiles.
31
Sample Statistics
Examples: The ages of 5 students are 21, 22, 24,
24, 26.
1. Find the sample variance.
2. Find the sample standard deviation.
3. Find the sample IQR.
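The sketch below checks these three quantities in Python. The variance and standard deviation come from the standard `statistics` module, which divides by n − 1. The quartile rule is an assumption: `sample_quantile` is a hypothetical helper that places the pth quantile at position p(n + 1) in the sorted data and interpolates between neighbors. That convention reproduces the 1/4 and 3/4 weights of the weighted-average rule when n is a multiple of 4, but other textbooks and software use other conventions and can give a different IQR.

```python
from statistics import stdev, variance

def sample_quantile(data, p):
    """pth sample quantile, interpolated at position p * (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    position = p * (n + 1)   # 1-based position, possibly fractional
    low = int(position)      # 1-based index of the lower neighbor
    frac = position - low    # weight given to the higher neighbor
    if low < 1:
        return ordered[0]
    if low >= n:
        return ordered[-1]
    return (1 - frac) * ordered[low - 1] + frac * ordered[low]

ages = [21, 22, 24, 24, 26]

s2 = variance(ages)   # sum of squared deviations (15.2) divided by n - 1 = 4
s = stdev(ages)       # square root of s2
lower = sample_quantile(ages, 0.25)   # halfway between 21 and 22
upper = sample_quantile(ages, 0.75)   # halfway between 24 and 26
iqr = upper - lower

print(s2, s, iqr)
```

With n = 5 the quartile positions are 1.5 and 4.5, so each quartile is the midpoint of two neighboring values rather than a 1/4 and 3/4 mix.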
32
Box Plots
A boxplot provides a good
visual summary of the
distribution of the data set.
It presents the sample
median, upper and lower
sample quartiles, and the
largest and smallest data
observations.
33
Box Plots
Boxplots can be useful
for comparing
distributions. This
example shows the
distribution of height in
my stat 1040 and stat
3000 classes from fall of
2006.
• Why might the
distributions be so
different?
34
Data Presentation
What graphical representation(s) would you
use for each of the data sets?
1. Ethnicity of participants in an asthma study
2. Depths of Massachusetts ponds
3. Data collected for Electra/Navarro survey
4. College GPAs of students who attended
private high schools and the same for
students who attended public high schools.
35