Chapter 9
Statistical Data Analysis
An Introduction to Scientific
Research Methods in Geography
Montello and Sutton
Data Analysis
 Data Analysis
 Helps us achieve the four scientific goals of
description, prediction, explanation, and
control
 Statistical Data Analysis
 Three primary reasons geographers treat data
in a statistical fashion
http://rlv.zcache.com/knowledge_is_power_do_statistics_stats_humor_flyer-p2440846222778564182dwj5_400.jpg
Statistical Description
 Descriptive Statistics
 Parameters
 Central Tendency
 Mode
 Median
 Mean: $\bar{X}$ (sample), $\mu$ (population)
 Arithmetic mean
 When would you use the median or the mode
instead of the mean?
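A minimal sketch of the three measures of central tendency, using Python's standard `statistics` module. The income figures and land-use labels are made up for illustration. It suggests one answer to the question above: a single extreme value pulls the mean but not the median, and only the mode is defined for category labels.

```python
from statistics import mean, median, mode

# Hypothetical household incomes (thousands of dollars) with one extreme value.
incomes = [28, 31, 31, 35, 40, 44, 250]

income_mean = mean(incomes)      # pulled upward by the value 250
income_median = median(incomes)  # middle value, resistant to the outlier
income_mode = mode(incomes)      # most frequent value

print(income_mean, income_median, income_mode)

# Hypothetical nominal data: mean and median are undefined for labels,
# but the mode still works.
land_use = ["forest", "urban", "forest", "farm", "forest"]
land_use_mode = mode(land_use)
print(land_use_mode)
```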
Descriptive Statistics
 Variability
 Range
 = largest value – smallest value
 Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
 Standard Deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
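The variability measures above can be sketched in a few lines of Python. The scores are hypothetical, and the helper name `population_variance` is an illustrative choice. It divides by N, as in the population formulas above, not by N - 1.

```python
from math import sqrt

def population_variance(values):
    """Mean squared deviation from the mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

# Hypothetical scores chosen so the arithmetic is easy to check by hand.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(scores) - min(scores)  # largest value minus smallest value
variance = population_variance(scores)
std_dev = sqrt(variance)                 # same units as the original scores

print(value_range, variance, std_dev)    # 7 4.0 2.0
```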
Descriptive Statistics
 Form
 Modality
 Skewness
 Positive
 Negative
 Symmetry
 Unimodal – Bell-shaped
 Normal Distribution
http://people.eku.edu/falkenbergs/images/skewness.jpg
Descriptive Statistics
 Derived Scores
 Percentile Rank
 Highest – 99th percentile
 Where is the median?
 Z-score
 Standard deviation units above or below the mean
$$z = \frac{x - \mu}{\sigma}$$
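A short sketch of the two derived scores. The test score, mean, and standard deviation are hypothetical, and converting a z-score to a percentile rank this way assumes the scores are normally distributed, which the slide does not state.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical score of 130 on a scale with mean 100 and standard deviation 15.
z = z_score(130, 100, 15)

# Percentile rank under an ASSUMED normal distribution of scores.
percentile_rank = NormalDist().cdf(z) * 100

# The median sits at the 50th percentile (z = 0 for a symmetric distribution).
median_percentile = NormalDist().cdf(0) * 100

print(z, round(percentile_rank, 1), median_percentile)
```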
Descriptive Statistics
 Relationship
 Linear Relationship
 Positive
 Negative
 Relationship Strength
 Weak, strong, no relationship
 Correlation Coefficient
 Between -1 and 1
 0 – no relationship
 Regression Analysis
 Criterion variables (Y)
 Predictor variables (X)
http://hosting.soonet.ca/eliris/remotesensing/LectureImages/correlation.gif
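As a rough illustration of relationship strength and regression, the sketch below computes a Pearson correlation coefficient and a least-squares line by hand. The slide says only "correlation coefficient", so Pearson's r is an assumption here, and the elevation and temperature values are invented. The function names are illustrative, not from the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, always between -1 and 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Slope and intercept of the line predicting Y (criterion) from X (predictor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: temperature falls as elevation rises.
elevation_km = [0.0, 0.5, 1.0, 1.5, 2.0]         # predictor variable (X)
temperature_c = [20.0, 17.0, 13.5, 10.0, 7.0]    # criterion variable (Y)

r = pearson_r(elevation_km, temperature_c)
slope, intercept = least_squares_line(elevation_km, temperature_c)
print(round(r, 3), round(slope, 2), round(intercept, 2))
```

Here r is close to -1, a strong negative linear relationship, and the slope estimates the change in temperature per kilometer of elevation.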
Correlation – Causation?
http://xkcd.com/552/
“Correlation doesn’t imply causation, but it does
waggle its eyebrows suggestively and gesture
furtively while mouthing ‘look over there’.” - XKCD
Statistical Inference
 Inferential Statistics
 Statistics
 Sampling error
 Given our sample statistics, we infer our
parameters
 Assign probabilities to our guesses
 The power and difficulty of inferential statistics
come from deriving probabilities about how
likely it is that sample patterns reflect
population patterns
Inferential Statistics
 Sampling distribution
 Ex: sampling distribution of means – shows the
probability that a single sample would have a
mean within some given RANGE of values
 Central limit theorem – for sufficiently large
samples, the sampling distribution of sample means
will be approximately normal, with a mean
equal to the population mean and a standard
deviation equal to the population standard
deviation divided by the square root of the
sample size
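The central limit theorem can be checked with a small simulation. This is a sketch under stated assumptions: the population is an invented, strongly skewed (exponential) one, the sample size of 50 and the 5,000 repetitions are arbitrary choices, and the random seed is fixed only so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population: exponential with a mean of about 10.
population = [random.expovariate(1 / 10) for _ in range(20_000)]
mu = mean(population)
sigma = pstdev(population)

n = 50  # size of each sample
# Draw many samples (with replacement) and record each sample's mean.
sample_means = [mean(random.choices(population, k=n)) for _ in range(5_000)]

mean_of_means = mean(sample_means)      # should be close to mu
sd_of_means = pstdev(sample_means)      # should be close to sigma / sqrt(n)
predicted_sd = sigma / sqrt(n)

print(mu, mean_of_means)
print(predicted_sd, sd_of_means)
```

Even though the population is far from normal, the two printed pairs should roughly agree.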
Inferential Statistics
 Generation of sampling distributions
 Assumptions
 Distributional assumptions
 Nonparametric
 Parametric
 Normality
 Homogeneity of variance
 Independence of scores
 Correct specification of models
Estimation and Hypothesis Testing
 Estimation
 Point estimation
 Confidence Interval
 Usually 95%
 Hypothesis Testing
 Null hypothesis
 A hypothesis about the exact (point) value of a
parameter or set of parameters
 Use sample statistics to make an inference about
the probable truth of our null hypothesis
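A minimal sketch of interval estimation: the sample mean is the point estimate, and a 95% confidence interval adds a margin around it. The sketch assumes the population standard deviation is known and the sampling distribution is normal. The numbers and the function name `z_confidence_interval` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z_crit * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 3.2, known sigma 0.5, 100 observations.
low, high = z_confidence_interval(3.2, 0.5, 100)
print(round(low, 3), round(high, 3))
```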
Hypothesis Testing
 Alternative Hypothesis
 Hypothesis that the parameter does not equal the exact value hypothesized in the null
 A range rather than an exact value
If A is true, then B is true
B is not true. Therefore, A is not true
B is true. Therefore, ???
 Modus Tollens
 Useful for disconfirming
 Not confirming!
Example
 From a recent nationwide study it is known that the
typical American watches 25 hours of television per
week, with a population standard deviation of 5.6 hours.
Suppose 50 Denver residents are randomly sampled
with an average viewing time of 22 hours per week and a
standard deviation of 4.8. Are Denver television viewing
habits different from nationwide viewing habits?
 Step 1: State your null and alternative hypotheses
$H_0: \mu = 25$
$H_A: \mu \neq 25$
 What is this saying?
Example
 Step 2: Determine your appropriate test statistic and its sampling
distribution assuming the null is true
 We are testing a sample mean where n>30 and so a z distribution can
be used
 Step 3: Calculate the test statistic from your sample data
$\bar{X} = 22$, $\mu = 25$, $s = 4.8$, $\sigma = 5.6$, $n = 50$
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{22 - 25}{5.6 / \sqrt{50}} = -3.79$$
 Step 4: Compare the empirically obtained test statistic to the null
sampling distribution
 P value: p = .0001
 OR Critical value at .05 significance level: z = ±1.96
 Decision: Reject the null hypothesis
 -3.79 is less than -1.96: reject
 The p value is very small, less than .05 and even .01: reject
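The four steps of the Denver example can be reproduced with Python's standard library. This is a sketch: the variable names are illustrative, the test is two-tailed to match the "different from" wording, and the printed p value depends on rounding, so it is only compared against the .05 and .01 levels.

```python
from math import sqrt
from statistics import NormalDist

# Values from the example above.
mu_0 = 25.0    # nationwide mean under the null hypothesis
sigma = 5.6    # population standard deviation
n = 50         # number of Denver residents sampled
x_bar = 22.0   # Denver sample mean

# Step 3: test statistic.
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Step 4: compare with the null sampling distribution (standard normal).
p_two_tailed = 2 * NormalDist().cdf(-abs(z))
z_crit = NormalDist().inv_cdf(0.975)   # about 1.96 at the .05 level
reject_null = abs(z) > z_crit

print(round(z, 2))                     # -3.79
print(p_two_tailed < 0.01, reject_null)
```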
Error
 You have made either a correct inference
or a mistake
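 Type I error: rejecting a null hypothesis that is actually true; its probability is the rejection level, α
 Type II error: failing to reject a null hypothesis that is actually false; its probability is β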
http://www.mirrorservice.org/sites/home.ubalt.edu/ntsbarsh/Businessstat/error.gif
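One way to see what the rejection level means is to simulate many samples from a population in which the null hypothesis really is true and count how often it is rejected anyway. The sketch below reuses the television example's mean and standard deviation and assumes a normal population. The number of trials and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the sketch is repeatable

mu_0, sigma, n, alpha = 25.0, 5.6, 50, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

def rejects_null(sample):
    """Two-tailed z test of H0: mu = mu_0 at the chosen alpha."""
    z = (mean(sample) - mu_0) / (sigma / sqrt(n))
    return abs(z) > z_crit

trials = 4_000
# Every sample is drawn from a population where the null is TRUE,
# so each rejection counted here is a Type I error.
false_rejections = sum(
    rejects_null([random.gauss(mu_0, sigma) for _ in range(n)])
    for _ in range(trials)
)
type_i_rate = false_rejections / trials
print(type_i_rate)  # should land near alpha
```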
Data in Space and Place
 Spatiality is a focus in geography, unlike other disciplines
 Spatial autocorrelation
 First Law of Geography: Everything is related to everything else,
but near things are more related than distant things
 Positive vs. negative spatial autocorrelation
 A violation of the important statistical assumption of
independence
 Ex: If it's raining in my backyard, I can say with a high degree of
confidence it's raining in my neighbor’s backyard, but my level of
confidence that it is raining across town is lower, and 300 miles
away even lower
 Variogram
http://www.innovativegis.com/basis/Papers/Other/ASPRSchapter/Default_files/image023.png
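Positive and negative spatial autocorrelation can be made concrete with a summary index. The sketch below uses Moran's I, a common measure that the slides do not name, so treat it as a substitute illustration, not the chapter's method. It assumes a regular grid in which cells sharing an edge are neighbors with weight 1, and both grids are invented.

```python
def morans_i(grid):
    """Moran's I for a rectangular grid with edge-sharing (rook) neighbors."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)

    numerator = 0.0
    weight_total = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    numerator += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_total += 1
    return (n / weight_total) * (numerator / denominator)

# Hypothetical grids: similar values side by side vs. alternating values.
clustered = [[1, 1, 0, 0] for _ in range(4)]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

i_clustered = morans_i(clustered)        # positive: near things are alike
i_checkerboard = morans_i(checkerboard)  # negative: near things differ
print(round(i_clustered, 3), round(i_checkerboard, 3))
```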
Data in Space and Place
 “Spatial data are special” – a special difficulty
 Which areal units should be used to analyze
geographic data
 Modifiable Areal Unit Problem
 Gerrymandering
 Geographic phenomena are often scale
dependent
 Must identify the scale of a phenomenon and collect
and organize data in units of that size
 Data aggregation issues
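The modifiable areal unit problem can be sketched with a toy gerrymandering case: the same precinct-level votes, aggregated into two different sets of districts, give different results. The 5 x 5 grid of precincts, the party labels, and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical 5 x 5 grid of precincts; each cell is the party that won it.
# Party B holds the two western columns (40%), party R the other three (60%).
grid = [["B", "B", "R", "R", "R"] for _ in range(5)]

def district_winners(districts):
    """Majority party in each district, given each district's precinct results."""
    return [Counter(d).most_common(1)[0][0] for d in districts]

# Zoning 1: each row of precincts is a district.
row_districts = [list(row) for row in grid]
# Zoning 2: each column of precincts is a district.
column_districts = [[row[c] for row in grid] for c in range(5)]

seats_by_rows = Counter(district_winners(row_districts))
seats_by_columns = Counter(district_winners(column_districts))

print(dict(seats_by_rows))     # R wins every district
print(dict(seats_by_columns))  # B wins two districts, R wins three
```

The underlying data never change. Only the areal units do.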
Discussion Questions
 What measure of central tendency is best for nominal
data?
 When pollsters tell you that a candidate is favored by
44% of likely voters, plus or minus 3 percent, what is the
44% and what is the plus/minus 3%?
 A survey of all users of a park in 1980 found the average
number of people per party to be 3.5. In a random
sample of 35 parties in 2000 the average was 2.9. If you
wanted to test if the number of persons per party in 2000
was different from the number in 1980, what would your
null and alternative hypotheses be?
 In the United States, we presume that someone is
innocent. If a guilty person were found to be not guilty,
what type of error would this be?
 A researcher finds that a particular learning software has
an effect on students’ test scores, when actually it does
not. What type of error is this?