CE902 Lecture 3:
Statistics for Research
Annie Louis
January 30, 2017
CE902 Professional Practice and Research Methodology, Spring 2017
Your 1-minute presentations start this week
Three batches, each approx. 30 students; check the list on Moodle to
identify which batch you are in
- Batch 1: today, Batch 2: week 19, Batch 3: week 20
Your presentation counts for 2% of your proposal marks
- Based on participation, not assessment of the content
Email your image (of a single slide) to this address by noon on the
Friday before your slot comes up
- [email protected]
Check previous lecture for instructions
Recap previous lectures
What common research methods are used in science?
- Hypothesis-based
- Measurements
Good scientific practices
- Science versus pseudoscience
- Inference - drawing a conclusion
- Fallacies - faulty inferences
This lecture - statistics for research
Many experiments involve data
- data which are used to make conclusions
- results/intermediate values
Statistics helps with data description and drawing conclusions from
data
Types of values
What is the difference between these data types?
1. Height of each student in this classroom
2. Values from a survey where students in this class tell me their
most preferred day of the week when they like to eat out
3. Number of stars (between 1 and 5) that each customer gave to
a product on Amazon.com
Types of values
Categorical: values come from one of n categories
- Not numeric
- Eg. Values from a survey where students in this class tell me
  their most preferred day of the week when they like to eat out
Numerical: values make numerical sense
- Eg. Height of each student in this classroom
Ordinal: one of n categories, but the categories have a numerical
order
- Eg. Number of stars (between 1 and 5) that each customer gave to
  a product on Amazon.com
Suppose you have a dataset
you are using in your project. How will you analyze/present the
data in your report?
i.e. What are the first steps?
1. Height of students in this class (in cm)
- 150, 160, 152, 162, 155, 168, 161, 170, 200
Arithmetic mean
Arithmetic average of N numbers. Suppose x_1, x_2, ..., x_N are N
observations
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N}$$
2. What if you have monthly household income in an area as
follows? (in pounds)
- 1000, 1500, 2000, 2200, 2500, 3000, 3300, 3500, 4000, 4200,
  25000
Arithmetic mean is influenced greatly by outliers
Median: what is the middle value?
- Arrange the values in ascending order
- If there is an odd number of values, the number in the middle is the
  median
- If even, take the average of the middle two numbers
Median household income is a better measure of the center of the
data
- Not prone to the effects of a few very poor or very wealthy
  households
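To make the contrast between mean and median concrete, here is a minimal Python sketch applied to the household income figures listed earlier. The function names `mean` and `median` are illustrative helpers written for this example, not part of the lecture.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Monthly household incomes (in pounds) from the earlier slide.
incomes = [1000, 1500, 2000, 2200, 2500, 3000, 3300, 3500, 4000, 4200, 25000]

print("mean:  ", round(mean(incomes), 2))  # pulled upwards by the 25000 outlier
print("median:", median(incomes))          # the 6th of the 11 sorted values
```

The single income of 25000 drags the mean well above what most households earn, while the median stays at the middle household.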
What about categorical data?
Choice for the favourite day to eat out
- 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 7
Mode: most repeated value
- There may be more than one mode
- There may be no mode!
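A small Python sketch of finding the mode for the eat-out survey data. The helper `modes` is hypothetical, and the rule it uses for "no mode" (every distinct value occurs equally often) is one common convention assumed here, not something the slides specify.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values; an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if every distinct value occurs equally often,
    # no value stands out, so report no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


# Favourite day to eat out, coded 1-7 as on the slide.
days = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 7]

print(modes(days))             # a single mode
print(modes([1, 1, 2, 2, 3]))  # two modes
print(modes([1, 2, 3]))        # no mode
```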
These measures helped to describe data
Descriptive statistics: summarizing data and observations
- Eg: computing an average
- But don't go beyond the data we have
Mean, median and mode are known as measures of central
tendency
Measures of variability: How spread out are the values around
the mean?
- Variance, standard deviation
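As a rough illustration of the variability measures, the sketch below computes variance and standard deviation for the class heights from the earlier slide using Python's standard `statistics` module. The slides do not say whether to divide by N or N - 1, so both versions are shown.

```python
import statistics

heights = [150, 160, 152, 162, 155, 168, 161, 170, 200]  # cm, from the earlier slide

mean_height = statistics.mean(heights)

# Population versions divide by N; sample versions divide by N - 1.
pop_variance = statistics.pvariance(heights)
sample_variance = statistics.variance(heights)
pop_sd = statistics.pstdev(heights)
sample_sd = statistics.stdev(heights)

print(f"mean = {mean_height:.2f} cm")
print(f"variance: population = {pop_variance:.2f}, sample = {sample_variance:.2f}")
print(f"std dev:  population = {pop_sd:.2f}, sample = {sample_sd:.2f}")
```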
Statistics also helps to draw conclusions from
measurements or data
Inferential statistics: drawing conclusions based on data and
observations
- Eg: testing statistical hypotheses
Inferential Statistics
How to make conclusions that go beyond the sample data?
We will start with some basics of probability distributions before
moving on
Probability
“Degree of certainty”
Two interpretations
Probability as Relative frequency
How often something happens in a sequence of observations in the
long run
- Eg. probability of getting a head on a coin toss
Based on the idea of a repeatable experiment or a trial
- Eg. tossing a coin, rolling a die
- Let the outcome of the experiment be A
Probability as Relative frequency
If there are n trials and A comes up m times, the relative frequency
of A is m/n
Over a large number of trials, this relative frequency becomes
stable
For a fair coin, this relative frequency of getting a head is close to
1/2
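The stabilising of the relative frequency m/n can be seen in a simple simulation. This is a sketch that assumes a simulated fair coin from Python's `random` module; the function name is made up for the example.

```python
import random


def relative_frequency_of_heads(n_trials, rng):
    """Toss a simulated fair coin n_trials times and return m / n for heads."""
    heads = sum(1 for _ in range(n_trials) if rng.random() < 0.5)
    return heads / n_trials


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n, rng))
```

With few trials the value can be far from 1/2; as the number of trials grows it settles close to 1/2.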
Probability as Degree of Belief
Subjective opinion of some individual regarding how certain an
event is to occur
- Eg. probability that it will snow tomorrow
- Eg. probability of a patient surviving an operation
Does not make sense to think of these as repeatable experiments
Sample space and Events
Sample or outcome space: All possible outcomes of an experiment
- Eg. Rolling a die, S = {1, 2, 3, 4, 5, 6}
- Eg. Flipping three coins, S = {HHH, HTT, THT, TTH, HHT,
  THH, HTH, TTT}
Event: A subset of the outcome space
- Eg. event of getting a value less than 4, E1 = {1, 2, 3}
- Eg. event of getting exactly two heads, F1 = {HHT, THH,
  HTH}
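Sample spaces and events map naturally onto sets. The following sketch enumerates the two sample spaces above and picks out the two example events; the variable names are illustrative only.

```python
from itertools import product

# Sample space for rolling one die.
die_space = set(range(1, 7))
# Sample space for flipping three coins: every H/T sequence of length 3.
coin_space = {"".join(flips) for flips in product("HT", repeat=3)}

# Events are subsets of the sample space.
less_than_four = {outcome for outcome in die_space if outcome < 4}
exactly_two_heads = {outcome for outcome in coin_space if outcome.count("H") == 2}

print(sorted(less_than_four))
print(sorted(exactly_two_heads))
```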
A random variable
A convenient notation for representing the outcome of an
experiment
A random variable assigns a single value to each outcome
Eg. Experiment where 3 coins are tossed
- Y = number of heads
- Range of Y is 0 to 3
- Y = 0 corresponds to the event {TTT}
Two types of random variables
Discrete random variable
- Takes countable values
- Eg. X = number of heads in 10 coin tosses
Continuous random variable
- Takes any real-numbered value
- Eg. Y = lifetime of a bulb in years, such as Y = 1.534
We can now talk about the probability distribution of a
random variable
The probabilities of the individual values of a random variable
taken together form a probability distribution
P(X = x), for all x in the range of X
Discrete probability distribution
Defined by a probability mass function P(x) or P(X = x) giving
probabilities for each value of a discrete random variable X
(E below denotes the set of values X can take)
- $0 \le P(X = x) \le 1$ for all $x \in E$
- $\sum_{x \in E} P(x) = 1$
Discrete probability distribution
Eg 1: X = outcome of a coin flip

x          P(X = x)
0 (head)   0.5
1 (tail)   0.5

Eg 2: X = number of heads in 3 flips of a coin
- S = {HHH, HHT, THH, HTH, HTT, THT, TTH, TTT}

x   P(X = x)
0   1/8
1   3/8
2   3/8
3   1/8
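The probabilities for the number of heads in three flips can be derived by counting outcomes rather than memorised. A minimal sketch, assuming a fair coin so that all eight outcomes are equally likely:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of three fair coin flips.
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]

# X = number of heads; count how many outcomes give each value of X.
counts = Counter(outcome.count("H") for outcome in outcomes)
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)

# A valid probability mass function sums to 1.
print("total:", sum(pmf.values()))
```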
Continuous probability distribution
In some cases, it does not make sense to talk about the probability
at an individual point, eg. P(weight = 3.334)
- i.e. for continuous random variables
- Rather we talk about intervals, P(3 < weight < 4)
Defined by a probability density function f(x) giving the probability
of a random variable X taking values in a range
$$P(a \le X \le b) = \int_a^b f(x)\,dx \qquad (1)$$
$$f(x) \ge 0 \text{ for all } x \in \mathbb{R}$$
$$\int_{-\infty}^{+\infty} f(x)\,dx = 1$$
Normal distribution
A popular continuous distribution
Takes the shape of a bell curve, symmetric about a vertical line
through the middle
The density function of a normal distribution with mean µ,
standard deviation σ
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (2)$$
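The normal density can be coded directly and checked numerically against the properties of a probability density function. This is only a sketch: the values µ = 165 and σ = 10 are hypothetical, and `integrate` is a simple midpoint-rule helper written for this example.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


mu, sigma = 165.0, 10.0  # hypothetical values, e.g. heights in cm
# Area under the curve far into both tails should be very close to 1.
print(integrate(lambda x: normal_pdf(x, mu, sigma), mu - 8 * sigma, mu + 8 * sigma))
# P(a <= X <= b) is the area under the density between a and b.
print(integrate(lambda x: normal_pdf(x, mu, sigma), 160.0, 170.0))
```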
Empirical rules
68% of the data is within one standard deviation of the
mean
95% of the data is within two standard deviations
99.7% is within three standard deviations
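These percentages are rounded values of exact normal probabilities. A short sketch using the error function from Python's `math` module shows where they come from; `prob_within` is a name chosen for this example.

```python
import math


def prob_within(k):
    """P(|X - mu| <= k * sigma) for any normal distribution, via the error function."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```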
Many quantities in nature are well approximated by normal
distributions
Empirical observations of
- Test scores
- Heights of people
- Errors in measurements
- Blood pressure
Standard normal distribution
Z has a standard normal distribution when the mean is 0 and the
standard deviation is 1
- density function $g(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2}$
A normally distributed X can be converted into a Z distribution
- $Z = \frac{X - \mu}{\sigma}$
Cumulative probabilities P(Z ≤ z) can be looked up from the Z table
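Instead of a printed Z table, the cumulative probability can be computed. The sketch below converts a value to a z-score and evaluates P(Z ≤ z); the height figures (mean 165 cm, standard deviation 10 cm) are hypothetical and the function names are illustrative.

```python
import math


def z_score(x, mu, sigma):
    """Convert a value x from Normal(mu, sigma) to the standard normal scale."""
    return (x - mu) / sigma


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal Z, the quantity a Z table lists."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical example: heights with mean 165 cm and standard deviation 10 cm.
z = z_score(180.0, mu=165.0, sigma=10.0)
print("z =", z)
print("P(X <= 180) =", round(standard_normal_cdf(z), 4))
```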
Quick summary
We can have discrete and continuous probability distributions
A normal distribution has a bell-shaped curve and is symmetric when
vertically divided in the middle
- 95% of data is within two standard deviations
A normal distribution can be transformed into a Z distribution
Population versus Sample
Population is the universe of individuals you are interested in
- Eg. for a coin flip, the population has outcomes of an infinite
  number of flips
- Eg. people in Essex
- Eg. salmon fish in the Pacific Ocean
Sample is a subset of the population from which you may want to
draw conclusions about the population
- Eg. 100 flips of the coin
- Eg. 100 people from Essex chosen for a survey
- Eg. salmon fish observed in a 1 sq. mile area of the Pacific
Population parameters versus sample statistics
Let P(X) be the probability distribution of the population
- Eg. distribution of heads and tails
- Eg. distribution of age of all the people of Essex
- Eg. distribution of the lengths of salmon fish in the Pacific
The characteristics of the population such as mean and standard
deviation are known as population parameters, or as the
true mean and true standard deviation
When these measures are computed on the sample, we call them
sample statistics: sample mean and sample s.d.
Remind ourselves of our goal
We are interested in making conclusions about the population
based on our sample
- Eg: You may survey a small set of voters but what you are
  interested in is the actual election results
We compute statistics based on the sample and want to know if
the statistics are representative of the population
- Ideally, we want to get sample statistics that are close to the
  population parameters
Statistics provides tools to check this closeness between a
sample's statistics and the population parameters
There is a primary concern while using sample statistics.
What is it?
Variability!
If we take different samples, the statistics will always show
variability
- The number of heads in 100 flips of a coin will usually be different
  when your friend makes a different 100 flips and computes the
  number
- Different samples of 100 people from Essex: the average age
  and standard deviation will always have variability
- Mean lengths of fish observed in 1 square mile of the Pacific:
  highly unlikely to get the exact same value in a different
  location
How do we know if the sample statistics we have
computed are reliable?
Population distribution for outcome from the roll of a die
Expected value = 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
If you rolled a die an 'infinite' number of times and averaged the
values, you would end up with 3.5
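The expected value calculation above is a probability-weighted sum, which is easy to reproduce exactly with fractions. The sketch also computes the population variance of a fair die, which is not stated on the slide but follows from the same distribution.

```python
from fractions import Fraction

# Each face of a fair die has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Expected value: sum of value * probability over all outcomes.
expected = sum(face * p for face, p in pmf.items())
# Population variance: expected squared distance from the mean.
variance = sum((face - expected) ** 2 * p for face, p in pmf.items())

print("expected value:", float(expected))
print("population variance:", variance, "std dev:", float(variance) ** 0.5)
```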
Now we only have 10 rolls (a sample)
Outcomes = {2, 4, 3, 2, 1, 6, 1, 4, 2, 6}
Sample mean = 3.1
Sample standard deviation = 1.8
N = 10 is known as the sample size
The Sampling Distribution
If we take a very large number of samples of size N and plot
the sample statistic, we get the sampling distribution
For our example, we take a very large number of samples of 10
rolls of a die
- In each case, compute the sample mean
- The distribution of all these sample means is a new
  distribution: the sampling distribution
Here the random variable is the sample statistic (not values from
the population)
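The sampling distribution can be approximated by simulation. The sketch below draws 20,000 samples of 10 die rolls each (the count 20,000 is an arbitrary choice standing in for "a very large number") and prints a crude text histogram of the sample means.

```python
import random
import statistics


def sample_mean_of_rolls(n_rolls, rng):
    """Roll a fair die n_rolls times and return the mean of the outcomes."""
    return statistics.mean([rng.randint(1, 6) for _ in range(n_rolls)])


rng = random.Random(42)  # fixed seed so the run is repeatable
sample_means = [sample_mean_of_rolls(10, rng) for _ in range(20_000)]

# Crude text histogram of the sampling distribution of the mean.
for low in (1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0):
    count = sum(1 for m in sample_means if low <= m < low + 0.5)
    print(f"{low:.1f}-{low + 0.5:.1f} {'#' * (count // 200)}")

print("mean of sample means:", round(statistics.mean(sample_means), 3))
print("std dev of sample means:", round(statistics.pstdev(sample_means), 3))
```

The sample means cluster around 3.5, and their spread is much narrower than the spread of individual rolls.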
Parameters of the sampling distribution
For the sampling distribution of sample means,
Mean = population mean µ
- if you had infinite samples
Standard deviation $\Delta x = \frac{\sigma}{\sqrt{N}}$
- σ ← population s.d.
- N ← sample size
This standard deviation is known as the Standard Error
Standard Error
Standard Error $\Delta x = \frac{\sigma}{\sqrt{N}}$
- Uncertainty in sample means
- If I take different samples, how much do the means vary?
Standard error decreases as the sample size increases
- Less variation in sample statistics as the sample size increases
When population standard deviation σ is unknown, we can
approximate using the sample standard deviation s
$\Delta x = \frac{s}{\sqrt{N}}$
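A brief sketch of the standard error formula, showing how it shrinks with sample size and how the sample standard deviation can stand in for σ. It reuses the fair-die setting and the 10 rolls from the earlier slide; `standard_error` is an illustrative helper name.

```python
import math
import statistics


def standard_error(sd, n):
    """Standard error of the mean: standard deviation divided by sqrt(sample size)."""
    return sd / math.sqrt(n)


# Population s.d. of a fair die roll is sqrt(35/12), about 1.71.
sigma = math.sqrt(35 / 12)
for n in (10, 100, 1000):
    print(n, round(standard_error(sigma, n), 3))

# When sigma is unknown, plug in the sample standard deviation s instead.
rolls = [2, 4, 3, 2, 1, 6, 1, 4, 2, 6]  # the 10 rolls from the earlier slide
s = statistics.stdev(rolls)
print("estimated standard error:", round(standard_error(s, len(rolls)), 3))
```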
Central Limit Theorem
As N (sample size) becomes large, the sampling distribution of the
sample mean can be approximated by a normal distribution
Generally with a sample size of around 30 or more, we start seeing
a normal distribution
The sampling distribution is approximately normal regardless of
whether the population distribution was normal or not
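One way to see the theorem at work is to sample from a clearly non-normal population. The sketch below assumes an exponential population (a choice made for this example, not taken from the lecture) and checks how many sample means of size 30 land within two standard errors of the true mean; for a roughly normal sampling distribution this should be near 95%.

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 30                  # sample size
num_samples = 5_000

# Exponential population with rate 1: strongly skewed, mean 1, std dev 1.
mu, sigma = 1.0, 1.0
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

# If the sample means are roughly normal, about 95% should fall
# within two standard errors of the population mean.
se = sigma / math.sqrt(n)
within_two_se = sum(1 for m in sample_means if abs(m - mu) <= 2 * se) / num_samples
print("fraction within 2 standard errors:", within_two_se)
```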
What are the implications of the Central Limit Theorem?
We get one sample
We compute the sample mean
We can get the probability of this sample mean under the
sampling distribution
- If we did the sampling many many times, how often do we get
  such a mean value?
- We can get this probability *without* actually doing the
  sampling many many times
A particular (sample) mean value can now be mapped to a
Z distribution
Suppose z = 2.4
“If the data is actually sampled from a population with mean µ and
stdev σ, then 95% of the time, z will lie between -2 and +2. By
chance it will lie outside this interval only 5% of the time.”
Z-test
Null hypothesis: The data comes from a distribution with mean µ
and standard deviation σ
- Alternative hypothesis: The data does not come from this
  distribution
Collect a sample of size N, compute sample mean x̄ and standard error ∆x
$$z = \frac{\bar{x} - \mu}{\Delta x} \qquad (3)$$
General rule: reject the null hypothesis if the z value could have
arisen by chance < 5% of the time
- z value less than -1.96 or greater than 1.96. 95% of the curve
  is between these values
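The steps of the Z-test can be sketched as a small function. `z_test` is a hypothetical helper, not a library call, and it assumes a two-sided test with known σ. The data are the 10 die rolls from the earlier slide; N = 10 is below the sample size of about 30 suggested for the normal approximation, so treat the numbers as an illustration of the arithmetic only.

```python
import math


def z_test(sample, mu, sigma):
    """Two-sided one-sample Z-test; returns (z, p_value)."""
    n = len(sample)
    sample_mean = sum(sample) / n
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu) / standard_error
    # P(|Z| >= |z|) under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


rolls = [2, 4, 3, 2, 1, 6, 1, 4, 2, 6]  # the 10 rolls from the earlier slide
# Null hypothesis: a fair die, with mean 3.5 and std dev sqrt(35/12).
z, p = z_test(rolls, mu=3.5, sigma=math.sqrt(35 / 12))
print(f"z = {z:.3f}, p-value = {p:.3f}")
print("reject null" if abs(z) > 1.96 else "cannot reject null")
```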
p-value
This percent or probability value is known as a p-value
General rule: if p-value < 0.05, reject the null hypothesis
Reject the null hypothesis at the 5% level. Chance alone can produce
such a statistic less than 5% of the time
Example: Z-test for a coin experiment
You have a coin. You flip it 100 times, it comes up tails 54
times and heads 46 times. Is the coin fair?
Represent heads by 1 and tails by 0
Null hypothesis: The coin is fair. True mean is 0.5 and standard
deviation is 0.5
Alternative hypothesis: The coin is not fair, biased towards tails
Plot mean and standard error
The error bar pretty much overlaps the expected fraction of heads
Cannot reject the null hypothesis: the result of slightly more tails
is consistent with random chance.
- We have no evidence that the coin is unfair
A statistical test can be used to get a precise value for how likely
it is for the result to have occurred by chance
Z-test for this experiment
Sample mean = (46 * 1 + 54 * 0) / 100 = 0.46
Sample standard deviation = 0.501
Standard error ∆x = 0.05
z = (0.46 − 0.5) / 0.05 = −0.8
p-value = 0.42
“If the coin is fair, then we are likely to see a sample mean at least
this far from 0.5 by chance 42% of the time. Hence we cannot reject
the null hypothesis”
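The numbers for the coin experiment can be reproduced in a few lines. This sketch assumes a two-sided test, which is what matches the p-value of 0.42; note that the sample mean is below 0.5, so z comes out negative.

```python
import math

flips = [1] * 46 + [0] * 54  # heads = 1, tails = 0, as on the slides

n = len(flips)
sample_mean = sum(flips) / n
mu, sigma = 0.5, 0.5         # fair coin under the null hypothesis

standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu) / standard_error
p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided

print(f"sample mean = {sample_mean:.2f}")
print(f"standard error = {standard_error:.2f}")
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
print("reject null at 5% level" if p_value < 0.05 else "cannot reject null at 5% level")
```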
Common use of a Z-test
Check if a sample mean is close to the population's mean
- The population's mean value and its variance are known
Summary
Statistics helps in data analysis for research
Descriptive statistics: Describe data's characteristics. Eg.
measures of central tendency and variability
Inferential statistics: Draw conclusions based on a sample from a
population. Eg. Is this sample statistic very different from the
general population?
References and acknowledgements
1. [Book] Probability, Jim Pitman, 2006
2. [Book] Research Methods for Science, Michael P. Marder, 2011
3. [Book] Probability and Statistics for Engineers and Scientists,
2007