Data Analysis
Does the data support the
hypothesis?
• Once the appropriate
inquiry method has been
selected and the data
gathered, the last and
most important step is to
determine what the data
are telling you.
• The key method of data
analysis is statistics.
Data analysis
• Statistics – the use of mathematics to
organize, summarize, and interpret
numerical data. There are two kinds:
• Descriptive statistics – used to organize
and summarize data.
• Inferential statistics – used to interpret
data and draw conclusions.
Descriptive statistics
• Statistics can be used to organize data so
it clearly describes what has occurred.
• There are three main descriptive statistics:
– Central tendency
– Variability
– Coefficient of correlation
Central tendency
• Central tendency – patterns of frequency,
predictability, or typical results in a set of
numerical data. There are three
measures…
• Mode
• Median
• Mean
Mode – the most frequent score
in a distribution.
• The mode is not affected by extreme
scores.
• The mode is a quick way to describe
central tendency; however, it is not
particularly useful or descriptive of the
data.
• What is the mode for the following set of
scores?
• 32 32 35 36 38 38 39 39 39 40 40 42 45
On a frequency graph, the mode would be
the highest point.
If the two most frequently occurring scores
occur the same number of times then the
distribution is bimodal.
• Eg:
• 32 32 32 36 37 38 38 39 39 39 40 40 42 45
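The two mode examples above can be checked with a short Python sketch. It assumes only the standard-library `statistics` module; the variable names are illustrative and not part of the slides.

```python
from statistics import multimode

# Score sets copied from the two examples above.
scores = [32, 32, 35, 36, 38, 38, 39, 39, 39, 40, 40, 42, 45]
bimodal_scores = [32, 32, 32, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]

# multimode returns every value tied for the highest frequency,
# so a bimodal distribution yields two values.
modes = multimode(scores)
bimodal_modes = multimode(bimodal_scores)

print(modes)          # [39]
print(bimodal_modes)  # [32, 39]
```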
Median – the score that falls exactly
in the centre of a distribution of scores.
• It is the score value that cuts the distribution
in half: half of the n scores fall above it and half
fall below it.
• Rank order the data set from lowest to highest.
• If there is an odd number of values (n) then the
median is simply the middle number.
• 32 32 35 36 36 37 38 38 39 39 39 40 40 45 46
If there is an even number of values (n) then
the midpoint can be found by adding the two
middle values together and dividing by 2
• 32 35 36 36 37 38 38 39 39 39 40 40 42 45
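As a quick check of the odd and even cases, the sketch below applies Python's standard-library `median` to the two score sets listed above. The variable names are illustrative.

```python
from statistics import median

# Odd case: 15 scores, so the median is the 8th score in rank order.
odd_scores = [32, 32, 35, 36, 36, 37, 38, 38, 39, 39, 39, 40, 40, 45, 46]
# Even case: 14 scores, so the median is the midpoint of the 7th and 8th scores.
even_scores = [32, 35, 36, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]

odd_median = median(odd_scores)
even_median = median(even_scores)

print(odd_median)   # 38
print(even_median)  # 38.5, i.e. (38 + 39) / 2
```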
Mean – the arithmetic average
of the scores in a distribution.
• The sum of the scores divided by the
number of scores.
• Sensitive to extreme scores
• Most commonly used way of describing a
data set.
• The mean is not the only "average" –
the mode and the median can also be
called averages.
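The sensitivity of the mean to extreme scores can be illustrated with a small sketch. The first list is the score set from the mode example; the second list is a hypothetical variant, invented here, in which the top score of 45 is replaced by 450.

```python
from statistics import mean, median

scores = [32, 32, 35, 36, 38, 38, 39, 39, 39, 40, 40, 42, 45]
# Hypothetical variant: the highest score becomes an extreme value.
with_outlier = scores[:-1] + [450]

mean_scores = mean(scores)            # sum of scores / number of scores
mean_outlier = mean(with_outlier)
median_scores = median(scores)
median_outlier = median(with_outlier)

# The mean shifts noticeably; the median does not move at all.
print(round(mean_scores, 2), round(mean_outlier, 2))
print(median_scores, median_outlier)
```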
• As is commonly known, KIWI-birds are
native to New Zealand. They are born
exactly one foot tall and grow in one foot
intervals. That is, one moment they are
one foot tall and the next they are two feet
tall. They are also very rare. An
investigator goes to New Zealand and
finds four birds. The mean of the four birds
is 4, the median is 3, and the mode is 2.
What are the heights of the four birds?
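One way to check an answer to this puzzle is a brute-force search, sketched below. It assumes heights are whole feet (as the puzzle states) and that no bird can be taller than 16 feet, because a mean of 4 across four birds fixes the total at 16. It prints every combination that satisfies all three clues.

```python
from itertools import combinations_with_replacement
from statistics import mean, median, multimode

MAX_HEIGHT = 16  # four birds with a mean of 4 total 16 feet

# combinations_with_replacement yields heights in rank order,
# so each set of four birds is tried exactly once.
solutions = [
    heights
    for heights in combinations_with_replacement(range(1, MAX_HEIGHT + 1), 4)
    if mean(heights) == 4
    and median(heights) == 3
    and multimode(heights) == [2]  # 2 must be the single most frequent height
]

print(solutions)
```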
Skew
• When the mean, median and mode fall at the
same point in the distribution, then the results
are symmetrical.
• If the results are asymmetrical, then the
distribution is said to be skewed. The mean will
be pulled away from the median and the mode
by the extreme scores
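• Positive skew –
asymmetrical distribution whose
tail points in the positive
direction (toward high scores);
the mean is pulled above the median
• Negative skew –
asymmetrical distribution whose
tail points in the negative
direction (toward low scores);
the mean is pulled below the median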
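Because extreme scores pull the mean away from the median, comparing the two gives a rough indication of skew direction. The sketch below is only that rule of thumb, not a formal skewness coefficient, and all three data sets are hypothetical.

```python
from statistics import mean, median

def skew_direction(scores):
    """Rough check: which side of the median has the mean been pulled to?"""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positive"
    if m < md:
        return "negative"
    return "symmetrical"

# Hypothetical score sets, invented for illustration.
high_tail = [1, 2, 2, 3, 3, 3, 4, 20]        # one extreme high score
low_tail = [1, 17, 18, 18, 19, 19, 19, 20]   # one extreme low score
balanced = [1, 2, 3, 4, 5]

print(skew_direction(high_tail))  # positive
print(skew_direction(low_tail))   # negative
print(skew_direction(balanced))   # symmetrical
```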
Variability
• Is what you have recorded accurate? Does it
support your hypothesis?
• Variability – how much the scores in a data set
vary from each other and from the mean. It is the
dispersion, or spread, of scores.
Standard Deviation
• Standard deviation – an index of the
amount of variability in a data set.
• When the standard deviation is large, the
variability is great; when it is small, the
variability is small as well.
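A minimal sketch of the calculation, assuming the sample standard deviation (dividing by n − 1) is the one intended. The two score sets are hypothetical and were chosen to share the same mean while differing in spread.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(scores):
    """Square root of the summed squared deviations from the mean, over n - 1."""
    m = mean(scores)
    return sqrt(sum((x - m) ** 2 for x in scores) / (len(scores) - 1))

# Hypothetical data: both sets have a mean of 50.
tight = [48, 49, 50, 51, 52]
spread = [10, 30, 50, 70, 90]

sd_tight = sample_sd(tight)
sd_spread = sample_sd(spread)

# The hand-written formula agrees with the standard-library function.
print(round(sd_tight, 2), round(stdev(tight), 2))
print(round(sd_spread, 2), round(stdev(spread), 2))
```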
Correlation
• Is there a relationship between two variables? If there is,
how reliable is it and can I make certain predictions with
this data?
• Correlation – when two variables are related to each
other. The relationship can be positive (the variables change
in the same direction) or negative (they change in opposite
directions), and weak or strong.
• Scatterplot – a graph with one variable (conventionally the
independent variable) on the X axis and the other variable on
the Y axis. Individual results are plotted to observe a
correlation.
• Positive correlation – a high value of
variable X means a high value of Y.
• Negative correlation – a high value of X
means a low value of Y.
• Correlation coefficient – a numerical index
of the degree of relationship between two
variables. Its sign indicates which direction
(positive or negative) the relationship
works, and its size (0 to 1.0) indicates how
strongly the two variables are related.
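The slides do not say which coefficient is meant; the sketch below assumes the Pearson correlation coefficient, the most common choice. The function name and all data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of X and Y scaled by their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(ss_x * ss_y)

# Hypothetical data: variable X against two different Y variables.
x_values = [1, 2, 3, 4, 5]
y_rising = [52, 55, 61, 64, 70]    # high X goes with high Y
y_falling = [70, 64, 61, 55, 52]   # high X goes with low Y

r_positive = pearson_r(x_values, y_rising)
r_negative = pearson_r(x_values, y_falling)

print(round(r_positive, 3))  # close to +1: strong positive correlation
print(round(r_negative, 3))  # close to -1: strong negative correlation
```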
Causal fallacy
• Causation – where one variable causes a
change in another. Even when two
variables are strongly correlated, we do
not know HOW they are related; they may
not have a causal relationship.
Testing the hypothesis
• Null hypothesis (H₀) – the statement of a
zero (or null) difference that is statistically
tested. The negative version of the original
claim of a study. We use this because
statistically, we test the negative
version of the hypothesis to see whether we
reject it or fail to reject it.
• Alternative hypothesis (H₁) – the
statement we must accept if the statistical
test leads us to reject the null hypothesis.
Error
• Type I error – the mistake of rejecting the null hypothesis when it is
true (its probability is denoted α)
• Type II error – the mistake of failing to reject the null hypothesis
when it is false (its probability is denoted β)
• To decrease α or β we should increase the sample size
• We also use this error judgement to determine our acceptable error
levels (.01, .05)
• We should form claims so that the most serious error is the Type I
error
• The condition of equality should become the H₀
                                Null hypothesis is true    Null hypothesis is false
We decide to reject the null    Type I error               Right!
We decide to accept the null    Right!                     Type II error
Confirm the null hypothesis
• Test statistic – a computed sample value
based on the sample data that is used to
make the final determination confirming or
rejecting the null hypothesis
• Critical region – the set of all values of
the test statistic that would cause us to
reject the null hypothesis
• Critical values – the value or values that
separate the critical region from the
values of the test statistic that would not lead
us to reject the null hypothesis; these will
depend on
– the type of hypothesis (one or two
tailed)
– the sampling distribution (normal or
skewed)
– the level of significance (type of
possible error and consequence)
[Figure labels: α; t, p, z; σ – determined error value]
Test statistic
• z – calculation:
• t – calculation:
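• The choice between the two depends on the size of
the sample and on what is known about the population
under investigation: z is used when the population
standard deviation is known (or the sample is large),
t when it must be estimated from a small sample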
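The formulas themselves did not survive in this transcript. As a sketch only, the block below assumes the standard one-sample forms: z divides by the known population standard deviation, and t divides by the standard deviation estimated from the sample. The function names, the sample, and the population values are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def z_statistic(sample, population_mean, population_sd):
    """One-sample z: population standard deviation is known."""
    standard_error = population_sd / sqrt(len(sample))
    return (mean(sample) - population_mean) / standard_error

def t_statistic(sample, population_mean):
    """One-sample t: standard deviation is estimated from the sample itself."""
    standard_error = stdev(sample) / sqrt(len(sample))
    return (mean(sample) - population_mean) / standard_error

# Hypothetical sample and population values, invented for illustration.
sample = [52, 55, 61, 64, 70]
z_value = z_statistic(sample, population_mean=50, population_sd=10)
t_value = t_statistic(sample, population_mean=50)

print(round(z_value, 3))
print(round(t_value, 3))
```

The only difference between the two functions is where the standard deviation comes from, which is why the t distribution also needs the degrees of freedom of the sample.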
Nested vs. Crossed designs
• When the means of two groups of scores
are calculated and compared for critical
difference…
• If there are different subjects in each group, then the
design is nested or between subjects
• If the same subjects are used for different levels of
treatment, the design is crossed or within
subjects
Degrees of freedom
• Based on the size of the sample
• Affects the height of the distribution
• Along with the type of hypothesis, used to calculate the
critical region
• df = N − 1
Is our experiment statistically significant?
• What we need:
• N – number of subjects
• X̄ – mean
• SD – standard deviation (or s² – variance)
• Step 1 – find the standard deviation of the sample set
• Step 2 – calculate the difference in variance
• Step 3 – calculate the t-value
• Step 4 – Using an error level of .01 and a one-tailed t-test, the value
of t found in the t-tables would be:
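The steps above can be sketched in Python. This is one reading of them, not necessarily the formula used on the slide: it assumes a between-subjects (nested) design and the pooled-variance independent-samples t-test; a within-subjects design would call for a paired t-test instead. The function name and both score lists are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t for two independent groups; returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    # Step 1: variability of each sample (variance is SD squared).
    var_a, var_b = variance(group_a), variance(group_b)
    # Step 2: pool the variances and find the standard error of the difference.
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = sqrt(pooled * (1 / n_a + 1 / n_b))
    # Step 3: the t-value is the difference in means over that standard error.
    t = (mean(group_a) - mean(group_b)) / standard_error
    return t, df

# Hypothetical recall scores, invented for illustration.
silence = [12, 14, 15, 16, 18]
music = [8, 9, 10, 11, 13]

t_obs, df = independent_t(silence, music)
print(round(t_obs, 3), df)
```

Step 4 is then a lookup: the observed t is compared with the critical value from a t-table for the chosen error level and these degrees of freedom.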
Apply this formula to our
experiment…
• In groups, you are to compare the mean of
the silence scores to the mean of one
other assigned group of scores.
• 6 trials, 1 silent = 5 comparisons
• ×2 groups for each comparison
(remember inter-rater reliability)
• Report your t-obs data on the white board.
• The t score you calculate
Chance….
• Statistical significance – is said to exist when the
probability that the observed findings are due to chance is
very low.
• For psychological findings, a threshold of 5%, or .05, is
conventionally used: a result is considered statistically significant when
its p-value falls below this level. That is, no more than 5 chance results in 100.
Apply these principles to the
Memory study…
• Is there a statistically significant difference
between the mean of the silence and the
mean of all the other music types?
• What should be done about our results?
Your homework…
• Identify one study you would like to
perform on the population of the school.
You must explain:
• The hypothesis
• The type of study
• The independent and dependent variables
• Your anticipated result
• How that result would change the way the
school is run.