Statistics in Research
June 01, 2013
Institute of Space Technology
By Asma Ali, Ph.D.
Define Statistics
 Statistics is concerned with scientific methods for
 Collecting
 Organizing
 Summarizing
 Presenting
 And Analyzing Data
 So as to draw valid conclusions and make reasonable decisions
Population & Sample
 Population – Entire Group
 Sample – A subset of population
 A population can be finite or infinite.
 For example,
 The number of bolts produced in a day is a finite population
 All possible outcomes (heads, tails) of successive coin tosses form an infinite population
Population & Sample (Contd..)
 Population data is often impossible or impractical to collect
 For example,
 Average height of men in Pakistan
 Average Speed of vehicles on roadways
 Median Household income
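Since the full population usually cannot be measured, a random sample is used to estimate it. A minimal Python sketch, using simulated heights as a stand-in population (all numbers are made up for illustration):

import random

random.seed(1)
# Hypothetical population: heights (cm) of 100,000 men, simulated for illustration
population = [random.gauss(170, 7) for _ in range(100_000)]

# Measuring everyone is impractical, so draw a random sample instead
sample = random.sample(population, 500)

print(sum(population) / len(population))  # population mean (normally unknown)
print(sum(sample) / len(sample))          # sample mean, used as an estimate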
Variables
 A variable is a symbol, x, y, z, ..., that can assume any value.
 If a variable can assume only one value, it is called a "constant".
 If a variable can assume any value between two given values, it is called a "continuous" variable; otherwise it is called a "discrete" variable.
 For example, the number of children in a family, which can assume values such as 0, 1, 2, 3, ..., is a discrete variable.
 The height of an individual, which can be 62", 63.8", 65.7", etc., is a continuous variable.
Organizing Data
 When data are collected in large quantities, they can be examined as individual values or grouped into classes for easier comprehension.
 Data can be arranged in an array, sorted in ascending or descending order, to get a sense of the data.
 Or if the data set is large it can be organized into
groups.
Organizing Data
Frequency Distribution of Heights
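A short Python sketch of how such a frequency distribution might be built; the heights and class limits below are made up for illustration:

# Group raw height data (inches) into classes and count frequencies
heights = [62, 63, 65, 66, 66, 67, 68, 68, 68, 69, 70, 71, 71, 73, 74]
classes = [(60, 64), (65, 69), (70, 74)]      # class limits, inclusive

frequency = {c: 0 for c in classes}
for h in heights:
    for low, high in classes:
        if low <= h <= high:
            frequency[(low, high)] += 1
            break

for (low, high), count in frequency.items():
    print(f"{low}-{high} inches: {count}")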
Recap
• Statistics is about collecting, organizing, visualizing, and
analyzing data.
• Population: Larger dataset
• Sample: Subset of the larger dataset
• Organizing Data: By sorting in order, by organizing in
groups
• There are two types of variables
• Discrete (1, 2, 3, …)
• Continuous (1.0, 1.1, 1.02, …)
Data Visualization
 Visualizing the pattern in the data.
 Using different statistical analysis techniques.
 Plotting Graphs
[Example charts: a scatter plot, line graph, and bar chart of crashes by year; a pie chart; a box plot]
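A minimal matplotlib sketch of the chart types listed above, using made-up crash counts per year:

import matplotlib.pyplot as plt

# Made-up example data: crashes per year
years = [2008, 2009, 2010, 2011, 2012]
crashes = [120, 135, 128, 110, 95]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(years, crashes)   # scatter plot
axes[0, 0].set_title("Scatter Plot")
axes[0, 1].plot(years, crashes)      # line graph
axes[0, 1].set_title("Line Graph")
axes[1, 0].bar(years, crashes)       # bar chart
axes[1, 0].set_title("Bar Chart")
axes[1, 1].boxplot(crashes)          # box plot
axes[1, 1].set_title("Box Plot")
plt.tight_layout()

plt.figure()
plt.pie(crashes, labels=[str(y) for y in years])   # pie chart of crashes by year
plt.show()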
Recap
• Data visualization
• Scatter plot
• Bar Chart
• Pie Chart
• Box Plot
Measures of Central Tendency
 Mean (Arithmetic Mean)
The mean of 8, 3, 5, 12, and 10 is
Mean = (8 + 3 + 5 + 12 + 10)/5 = 7.6
 Mode
 The mode of the data set 8, 8, 8, 4, 6, 7, 7, 2, 2, 2, 2, 5:
 The most frequently appearing number in this series is 2.
 Therefore the mode of this data set is 2.
 Median - The median of a set of numbers arranged in order is either the middle value or the mean of the two middle values.
 Sorted: 2, 2, 2, 2, 4, 5, 6, 7, 7, 8, 8, 8; the median is (5 + 6)/2 = 5.5
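The three measures can be checked with Python's built-in statistics module, using the data set from the mode example:

from statistics import mean, median, mode

data = [8, 8, 8, 4, 6, 7, 7, 2, 2, 2, 2, 5]
print(mean(data))    # arithmetic mean = 61/12 ≈ 5.08
print(mode(data))    # 2, the most frequently appearing value
print(median(data))  # 5.5, the mean of the two middle values after sorting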
Measures of Dispersion
 Variance – Defines the variation in the data, the dispersion
in the data.
 Defines the magnitude of variation around the mean.
S² = Σ(xᵢ - x̄)² / (n - 1)
Where,
S² = variance
n = sample size, number of observations
x̄ = mean
Measures of Dispersion
[Scatter plots illustrating dispersion: residuals vs. fitted value (sec/mile), and running time per mile (sec/mile) vs. traffic flow rate (veh/hr/lane) by time of day (TOD = AM, MD, PM)]
Measures of Dispersion
 Standard Deviation – The square root of variance
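A quick sketch using Python's statistics module and the data set from the mean example (8, 3, 5, 12, 10, mean = 7.6):

from statistics import variance, stdev

data = [8, 3, 5, 12, 10]
print(variance(data))   # sample variance S² = Σ(x - x̄)² / (n - 1) = 53.2/4 = 13.3
print(stdev(data))      # standard deviation = √13.3 ≈ 3.65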
Recap
• Measures of Central Tendency
• Mean
• Mode
• Median
• Dispersion
Frequency Distribution
A frequency distribution, or histogram, of the data is plotted to visualize whether the data follow a normal distribution.
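A small sketch of this check, using simulated (assumed normal) heights and matplotlib:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=7, size=1000)   # simulated heights (cm)

# A roughly symmetric, bell-shaped histogram suggests the data may be normal
plt.hist(heights, bins=20, edgecolor="black")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.show()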
Normal Distribution
 One of the most common statistical distributions.
 The normal distribution is known by its characteristic bell-shaped curve.
 A normal distribution is a very important statistical
data distribution pattern occurring in many natural
phenomena, such as height, blood pressure, lengths of
objects produced by machines, etc.
Normal Distribution
 Normal Distributions are symmetrical with a single central
peak at the mean of the data.
 Fifty percent of the distribution lies to the left of the mean and fifty percent lies to the right of the mean.
http://en.wikipedia.org/wiki/Standard_deviation
Normal Distribution
http://www.regentsprep.org/Regents/math/algtrig/ATS2/NormalLesson.htm
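These proportions (50% on either side of the mean, and roughly 68% within one standard deviation) can be checked numerically with scipy's normal CDF; a quick sketch assuming scipy is available:

from scipy.stats import norm

mu, sigma = 0.0, 1.0   # standard normal, for illustration
print(norm.cdf(mu, mu, sigma))   # 0.5 -> 50% of the area lies below the mean
print(norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma))   # ≈ 0.68 within one std dev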
Standard Error of the Mean
 How far the sample mean is likely to fall from the population mean is described by the standard error of the mean.
[Diagram: a sample of size n drawn from a population; the sample mean estimates the population mean]
σx = σ/√n, where σ = population standard deviation and n = sample size
 In practice we usually do not know the standard deviation of the population (or a very large sample), so we may have to estimate it from the sample.
 For a Confidence Interval of
 90 %, Std Error = Interval/1.645
 95 %, Std Error = Interval/1.96
 99 %, Std Error = Interval/2.58
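A sketch of the calculation in Python, assuming a sample of 250 observations with mean 53 and standard deviation 12 (the same numbers used in the z-test example later):

import math

n, x_bar, s = 250, 53.0, 12.0

std_error = s / math.sqrt(n)   # standard error of the mean ≈ 0.76
print(std_error)

# Approximate 95% confidence interval for the population mean: x̄ ± 1.96 × std error
print(x_bar - 1.96 * std_error, x_bar + 1.96 * std_error)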
Shape of the Distribution
Kurtosis – Refers to peakedness or flatness of
the distribution
Shape of the Distribution
 Analysts are frequently concerned about how data is
distributed between the extremes.
 Symmetrical Distribution (A)
 Skewed Right Distribution (B)
 Skewed Left Distribution (C)
Skewness = 3(mean - median) / standard deviation
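A sketch with a made-up right-skewed data set; the formula above is computed directly and scipy's moment-based skewness and kurtosis are shown as a cross-check:

from statistics import mean, median, stdev
from scipy.stats import skew, kurtosis

data = [2, 3, 3, 4, 4, 4, 5, 5, 9, 14]   # made-up, right-skewed data

# Pearson's second skewness coefficient, as in the formula above
print(3 * (mean(data) - median(data)) / stdev(data))   # positive -> skewed right

print(skew(data), kurtosis(data))   # scipy's moment-based skewness and excess kurtosis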
Recap
• Histogram
• Normal Curve (about 68% of the data lies within one standard deviation of the mean)
• Standard Error of Mean (the typical distance between the sample mean and the population mean)
• Sample Size
• Shape of Distribution
• Skewness
• Kurtosis
Hypothesis Testing
 Null Hypothesis, H0 – the hypothesis that we formulate or want to test, for example:
 The coin is fair, so the probability of a head is p = 0.5
 All vehicles on a certain road travel < 40 km/hr
 There is no difference between medicine 1 and 2
 Alternative Hypothesis, H1 – the hypothesis that differs from, or is the opposite of, the null hypothesis, for example:
 The coin is not fair, so the probability of a head is p ≠ 0.5
 Not all vehicles on a certain road travel < 40 km/hr
 There is a difference between medicine 1 and 2
Type I and Type II Errors
 If we reject a hypothesis when it should be accepted, then we commit a Type I error
 If we accept a hypothesis when it should be rejected, then we commit a Type II error
 In either case a wrong decision or error in judgment has occurred
 For a test of hypothesis to be good, it should be designed to minimize these errors
 An attempt to decrease one type of error generally causes an increase in the other type of error
 The only way to decrease both types of errors is to increase the sample size, which is not always possible
Level of Significance
 In testing a hypothesis, the maximum risk of committing a Type I error that we are willing to accept is called the Level of Significance, or Significance Level
 The significance level is denoted by α
 Commonly, a significance level of 0.05 or 0.01 is used
 If, for example, an α of 0.05 is chosen in designing a decision rule, then there are about 5 chances in 100 that we would reject the hypothesis when it should be accepted; that is, we are 95% confident that we have made the right decision
 In such a case we say that the hypothesis has been rejected at the 0.05 significance level, i.e., there is a 0.05 probability of rejecting the hypothesis when it is actually true
Standard Normal Distribution
 It is a normal distribution with mean “0” and standard deviation
“1”.
 Standard normal distribution is denoted as z: N[0,1].
 Any value of x on any normal distribution, denoted x: N[µ, σ²], can be converted to an equivalent value of z on the standard normal distribution.
z = (x - µ)/σ
Where,
z = equivalent statistic on the standard normal distribution
x = statistic on any arbitrary normal distribution
µ = mean
σ = standard deviation (the standard error, when x is a sample mean)
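A one-line Python sketch of this conversion (the numbers are made up):

def z_score(x, mu, sigma):
    # Convert a value from N[mu, sigma²] to the standard normal N[0, 1]
    return (x - mu) / sigma

print(z_score(62, 50, 8))   # a value of 62 from N[50, 8²] corresponds to z = 1.5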
Hypothesis Testing for Large
Sample
 If the sample size is greater than 30, the sampling
distribution is considered to approximate normal
distribution.
 For a large sample size we conduct a "One-Tailed" or "Two-Tailed" z-test depending upon the hypothesis.
 One-Tailed Test: looks for a definite increase or decrease in
the parameter. In case of a one-tailed test the critical region
is on only one side of the curve.
Hypothesis Testing for Large
Sample
 If our calculated value lies within the critical (shaded) region, we reject the null hypothesis in favor of the alternative hypothesis.
 Example;
 According to an article in a magazine, the work force's average age is decreasing. A finance company wants to find out whether its shareholders' average age is also decreasing. A survey conducted a few years ago showed that the company's shareholders' average age was 55.
The Null hypothesis is H0: µ ≥ 55
The Alternative hypothesis is H1: µ < 55
Hypothesis Testing for Large
Sample
 A sample of 250 people was drawn from the shareholders' list and they were contacted for their age.
 The sample mean was computed and was found to be 53 years.
 The standard deviation of the sample was found to be 12 years.
 The confidence level for the test was 90% (10% significance level).
 σx = σ/√n = 12/√250 = 0.76
 z = (53 - 55)/0.76 = -2.63
 For a 90% confidence level, or 10% significance level, the one-tailed critical value is z = -1.28
 -2.63 < -1.28; therefore, the null hypothesis is rejected in favor of the alternative hypothesis
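The same one-tailed calculation, sketched in Python with the values from the example above:

import math

# Shareholder-age example: H0: µ ≥ 55, H1: µ < 55 (left-tailed test)
n, x_bar, s, mu0 = 250, 53.0, 12.0, 55.0

std_error = s / math.sqrt(n)    # ≈ 0.76
z = (x_bar - mu0) / std_error   # ≈ -2.63

z_critical = -1.28              # left-tail critical value at α = 0.10
print("reject H0" if z < z_critical else "fail to reject H0")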
Hypothesis Testing for Large
Sample
Z-values

Level of Significance, α    0.1             0.05            0.01
One-tailed test             -1.28 / 1.28    -1.645 / 1.645  -2.33 / 2.33
Two-tailed test             -1.645 / 1.645  -1.96 / 1.96    -2.58 / 2.58
Hypothesis Testing for Large
Sample
The two-tailed test is a statistical test used in inference, in which a
given statistical hypothesis, H0 (the null hypothesis), will be rejected when the
value of the test statistic is either sufficiently small or sufficiently large
H0 is not rejected when -z < test statistic < +z
http://en.wikipedia.org/wiki/Two-tailed_test
Hypothesis Testing for Large
Sample
• A telephone mail order company wants to determine if the average time between telephone orders has changed from 3.8 minutes.
• The company collects a sample of 100 calls. The mean
time between these calls is found to be 4.0 minutes with
standard deviation of 0.5 minutes.
 The significance level for this hypothesis test is selected
to be 2%.
• The null hypothesis is H0: µ = 3.8 minutes
• The alternative hypothesis is H1: µ ≠ 3.8 minutes
Hypothesis Testing for Large
Sample
σx = σ/√n = 0.5/√100 = 0.05
For a 98% confidence level (2% significance level, 1% in each tail), the critical values are z = -2.33 and 2.33
If z < -2.33 or z > 2.33, reject the null hypothesis
z = (4.0 - 3.8)/0.05 = 4.0
Since 4.0 > 2.33, the null hypothesis is rejected in favor of the alternative hypothesis
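The same two-tailed test sketched in Python; scipy (assumed available) gives the p-value to compare against α = 0.02:

import math
from scipy.stats import norm

# Mail-order example: H0: µ = 3.8, H1: µ ≠ 3.8 (two-tailed test)
n, x_bar, s, mu0 = 100, 4.0, 0.5, 3.8

std_error = s / math.sqrt(n)            # 0.05
z = (x_bar - mu0) / std_error           # 4.0

p_value = 2 * (1 - norm.cdf(abs(z)))    # two-tailed p-value ≈ 0.00006
print(z, p_value, "reject H0" if p_value < 0.02 else "fail to reject H0")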
Hypothesis Testing for Small
Sample, t-test
• If the sample size is less than 30 and the sample distribution follows the normal distribution, then a t-test is conducted for hypothesis testing.
• For a one-sample t-test, t = (x̄ - μ) / (s / √n), where x̄ is the sample mean and s is the sample standard deviation.
• Degrees of freedom: the degrees of freedom (DF) equal the sample size (n) minus one. Thus, DF = n - 1.
• If the calculated "t" value exceeds the critical "t" value from the table, reject the null hypothesis.
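A sketch of a one-sample t-test using scipy's ttest_1samp; the small sample below is made up for illustration:

from scipy import stats

# Hypothetical small sample (n = 10 < 30): component lifetimes in hours
sample = [48.2, 50.1, 47.5, 49.8, 51.0, 46.9, 50.5, 49.2, 48.8, 50.3]

# H0: µ = 50 vs. H1: µ ≠ 50, tested at the 5% significance level (DF = n - 1 = 9)
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)
print(t_stat, p_value)
print("reject H0" if p_value < 0.05 else "fail to reject H0")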
Hypothesis Testing for Small
Sample
Correlation Analysis
 Determines the strength of relationship between variables.
 The coefficient of correlation, R, indicates the strength of the relationship. The sign shows the direction of the relationship.
 A + sign shows a positive relationship, i.e., as the independent variable increases, the dependent variable increases as well.
 A - sign shows a negative relationship, i.e., as the independent variable increases, the dependent variable decreases.
 The magnitude of R ranges between 0 and 1: "0" means no relationship exists, "1" means a very strong relationship exists.
 Correlation analysis can be conducted between dependent and independent variables and between independent variables.
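A sketch of computing R with scipy's pearsonr; the speed data below are made up, echoing the variables in the table that follows:

import numpy as np
from scipy.stats import pearsonr

# Made-up example: posted speed (mph) vs. observed 85th percentile speed (mph)
posted = np.array([25, 30, 35, 40, 45, 50, 55])
speed_85th = np.array([31, 34, 41, 44, 52, 55, 61])

r, p_value = pearsonr(posted, speed_85th)
print(r, p_value)   # r close to +1 -> strong positive relationship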
Correlation Analysis
Pearson Correlation Analysis
Variables: Speed 85th (mph), Posted Speed (mph), Lane Width (ft), Median Width (ft), Access Density (ft), Segment Length (mile)

                  Speed 85th  Posted Speed  Lane Width  Median Width  Access Density  Segment Length
Speed 85th        1.0         0.97*         0.35*       0.58**        -0.10           0.65**
Posted Speed      0.97*       1.0           0.32        0.55**        -0.068          0.71**
Lane Width        0.35*       0.32          1.0         0.33*         -0.17           0.29
Median Width      0.58**      0.55**        0.34*       1.0           -0.25           0.075
Access Density    -0.10       -0.068        -0.17       -0.25         1.0             0.14
Segment Length    0.66        0.71**        0.29        0.075         0.14            1.0
Regression Analysis
 Regression is a statistical modeling technique that determines the relationship between a dependent variable and one or more independent variables.
 The independent variable explains the variation in the
dependent variable.
 A regression model can be linear or non-linear
depending upon the relationship between the
dependent and the independent variables.
 A regression model can be a simple single variable
model or a multiple variable model (Chaterjee, 2000).
Regression Analysis
 In a single variable model, the variation in dependent
variable is explained by a single independent variable.
 In a multiple variable regression model two or more
explanatory variables are used to predict the response
variable.
 A multiple regression equation takes the following form.
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
where ε is the residual or error that cannot be explained by the model
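A minimal sketch of fitting such a model by ordinary least squares with numpy; the data are simulated, and R² (discussed below) is computed from the residuals:

import numpy as np

rng = np.random.default_rng(0)

# Simulated data: Y depends on two explanatory variables plus random error ε
n = 50
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 5, n)
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(0, 1, n)

# Design matrix with an intercept column; least squares gives β0, β1, β2
X = np.column_stack([np.ones(n), X1, X2])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

residuals = Y - X @ beta
r_squared = 1 - residuals.var() / Y.var()   # coefficient of determination
print(beta)        # estimated β0, β1, β2
print(r_squared)   # closer to 1 -> more of the variation in Y is explained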
Regression Analysis
[Scatter plot of observed values (Y) against fitted values of 85th percentile speed (mph), showing the residual as the distance between an observed value Y and its estimated value Ŷ]
Regression Analysis
 The amount of variation in the dependent variable explained by the independent variable(s) is measured by the coefficient of determination, denoted by R².
 The value of R² may range between 0 and 1.
 The closer the value of R² is to 1.0, the stronger the predictive power of the regression model. A value of 0 for R² means the model has no explanatory power.
Regression Analysis
 Regression model is developed with the following
assumptions (Crawley, 2003):
 Errors (ε) are normally distributed
 Errors have constant variance
 Explanatory variables are measured without error
 All of the unexplained variation is confined to the
response variable