Quantitative analysis

Statistics
Empirical research in TS involves collecting, processing and interpreting data.
To do this you need a basic understanding of some statistical concepts (though these
days, software often does the work for you).
Representativeness
The extent to which your data are typical or representative of a wider population.
If your data are a special case, you cannot really generalise from them.
A population (populacija) is a group of phenomena (people or things) that have
something in common.
A sample (vzorec) is a smaller group of members of the population taken to represent
that population.
In order to use statistics to learn about the population, the sample must be random
(naključen), i.e. every member of the population has an equal chance of being selected.
A parameter (parameter) is a characteristic of a population.
A variable (spremenljivka) is an observable characteristic of a phenomenon that can be
measured or classified.
A statistic (statistika) is a characteristic of a sample.
Inferential statistics (inferenčna statistika) enables you to make an educated guess about
a population parameter based on a statistic computed from a random sample.
Descriptive statistics (opisna statistika) is the analysis of data without such
generalisation.
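The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is invented purely for illustration (imaginary word counts of 10,000 texts); only the standard library is used, and the fixed seed is there just to make the run repeatable.

```python
import random
import statistics

# Hypothetical population: word counts of 10,000 imaginary texts.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.randint(200, 2000) for _ in range(10_000)]

# Parameter: a characteristic of the whole population.
population_mean = statistics.mean(population)

# Random sample: every member has an equal chance of being selected.
sample = rng.sample(population, 100)

# Statistic: the same characteristic, computed on the sample only.
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.1f}")
print(f"sample mean (statistic):     {sample_mean:.1f}")
```

In real research the population mean is usually unknown; inferential statistics uses the sample mean to make an educated guess about it.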
Hypothesis test
Setting up and testing hypotheses (hipoteze, domneve) is an essential part of statistical
inference. In order to formulate such a test, usually some theory has been put forward,
either because it is believed to be true or because it is to be used as a basis for argument,
but has not been proved, for example, claiming that a new drug is better than the current
drug for treatment of the same symptoms.
In each problem considered, the question is simplified into two competing claims or
hypotheses between which we have a choice: the null hypothesis (ničelna hipoteza),
denoted H0, against the alternative hypothesis, denoted H1. These two competing
hypotheses are not however treated on an equal basis: special consideration is given to
the null hypothesis.
There are two common situations:
1. The experiment has been carried out in an attempt to disprove or reject a particular
hypothesis, the null hypothesis. We therefore give that one priority, so that it cannot be rejected
unless the evidence against it is sufficiently strong. For example,
H0: there is no difference in taste between coke and diet coke
against
H1: there is a difference.
2. If one of the two hypotheses is 'simpler', we give it priority so that a more 'complicated'
theory is not adopted unless there is sufficient evidence against the simpler one. For
example, it is 'simpler' to claim that there is no difference in flavour between coke and
diet coke than it is to say that there is a difference.
The hypotheses are often statements about population parameters like expected value and
variance; for example H0 might be that the expected value of the height of ten-year-old
boys in the Scottish population is not different from that of ten-year-old girls. A
hypothesis might also be a statement about the distributional form of a characteristic of
interest, for example that the height of ten-year-old boys is normally distributed within the
Scottish population.
The outcome of a hypothesis test is "Reject H0 in favour of H1" or "Do not reject H0".
[See: http://www.stats.gla.ac.uk/steps/glossary]
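As a rough illustration of how such a decision can be reached, here is a minimal sketch of a permutation test for the coke example. The permutation test is just one possible technique, chosen here because it needs nothing beyond the standard library; the taste ratings, the function name permutation_test and the 0.05 significance level are all assumptions made for the example, not data or methods from the source.

```python
import random
import statistics

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the proportion of random relabellings
    whose absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        new_a = pooled[:len(a)]
        new_b = pooled[len(a):]
        if abs(statistics.mean(new_a) - statistics.mean(new_b)) >= observed:
            count += 1
    return count / n_permutations

# Hypothetical taste ratings (1-10), invented for illustration only.
coke = [7, 8, 6, 7, 9, 8, 7, 6]
diet_coke = [6, 7, 7, 8, 6, 7, 8, 7]

alpha = 0.05  # significance level, fixed before looking at the data
p_value = permutation_test(coke, diet_coke)
decision = "Reject H0 in favour of H1" if p_value < alpha else "Do not reject H0"
print(f"p = {p_value:.3f}: {decision}")
```

A small p-value means that a difference as large as the observed one would be rare if H0 were true, which is what counts as sufficiently strong evidence against H0.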
A confidence interval (interval zaupanja) gives an estimated range of values which is
likely to include an unknown population parameter, the estimated range being calculated
from a given set of sample data. If independent samples are taken repeatedly from the
same population, and a confidence interval calculated for each sample, then a certain
percentage (confidence level) of the intervals will include the unknown population
parameter. Confidence intervals are usually calculated so that this percentage is 95%, but
other levels such as 90% or 99% can also be used.
The width of the confidence interval gives us some idea about how uncertain we are
about the unknown parameter. A very wide interval may indicate that more data should
be collected before anything very definite can be said about the parameter.
Confidence intervals are more informative than the simple results of hypothesis tests
(where we decide "reject H0" or "don't reject H0") since they provide a range of plausible
values for the unknown parameter.
The confidence level is the probability value (1 - α) associated with a confidence
interval. It is often expressed as a percentage. For example, if α = 0.05 = 5%, then
the confidence level is equal to (1 - 0.05) = 0.95, i.e. a 95% confidence level.
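To see how the confidence level affects the width of the interval, the following sketch computes an approximate confidence interval for a mean at three levels. It assumes the normal approximation (reasonable for larger samples; a t-distribution would be more appropriate for small ones), and both the simulated sample and the function name mean_confidence_interval are hypothetical.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the population mean.

    Uses the normal approximation, which is reasonable for larger samples;
    for small samples a t-distribution would give a wider interval.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical sample: 50 simulated measurements, for illustration only.
rng = random.Random(1)
sample = [rng.gauss(100, 15) for _ in range(50)]

for alpha in (0.10, 0.05, 0.01):
    low, high = mean_confidence_interval(sample, alpha)
    print(f"{1 - alpha:.0%} confidence interval: {low:.1f} to {high:.1f}")
```

The higher the confidence level, the wider the interval: being more certain of including the parameter costs precision.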
Example
Suppose an opinion poll predicted that, if the election were held today, the Conservative
party would win 60% of the vote. The pollster might attach a 95% confidence level to the
interval 60% plus or minus 3%. That is, he thinks it very likely that the Conservative
party would get between 57% and 63% of the total vote.
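The "plus or minus 3%" can be checked with the usual normal-approximation formula for a proportion. The example does not say how many people were polled, so the sample size of 1000 below is an assumption, as is the function name proportion_margin_of_error.

```python
import math
from statistics import NormalDist

def proportion_margin_of_error(p, n, confidence=0.95):
    """Margin of error for a sample proportion (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

p = 0.60   # predicted vote share, taken from the example
n = 1000   # ASSUMED sample size; the example does not give one

margin = proportion_margin_of_error(p, n)
print(f"{p:.0%} plus or minus {margin:.1%}")
print(f"interval: {p - margin:.1%} to {p + margin:.1%}")
```

Under these assumptions the margin comes out at roughly three percentage points, which matches the interval quoted in the example.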
A distribution (porazdelitev) describes how a collection of measurements or scores is
spread along a measurement scale.
Normal distribution (normalna porazdelitev): Gauss curve (Gaussova krivulja) or bell-shaped curve.
Measures of central tendency
Numbers that describe the 'middle' around which a set of values tends to cluster, e.g. the mean, the
median and the mode.
The mean (povprečje, aritmetična sredina), is the sum of the measures divided by the
number of measures (the average).
e.g. daily weekday earnings: 350 €, 150 €, 100 €, 350 €, 50 €
mean = 1000 / 5 = 200 €
The median (mediana, srednja vrednost) is the middle value when the measures are
arranged in order. It is less influenced than the mean is by extreme values.
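e.g. 50 €, 100 €, 150 €, 350 €, 350 €: median = 150 €
NB: with an odd number of measures the median is the middle one: 60, 77, 107, 108, 112, 114, 120, 155, 200 (median is 112).
With an even number of measures it is the mean of the two middle values: 60, 77, 107, 108, 112, 114, 120, 155, 200, 219 (median is (112 + 114) / 2 = 113).
The mode (modus) is the value that occurs most often. In the earnings example above, the mode is 350 €.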
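The three measures can be checked with Python's standard statistics module, using the figures given above. The last list, in which one day's earnings are replaced by an extreme value of 3500 €, is a made-up addition to show why the median is less sensitive to extreme values than the mean.

```python
import statistics

# Daily weekday earnings in euros, from the example above.
earnings = [350, 150, 100, 350, 50]

print(statistics.mean(earnings))    # 200
print(statistics.median(earnings))  # 150
print(statistics.mode(earnings))    # 350

# Odd and even numbers of measures.
odd = [60, 77, 107, 108, 112, 114, 120, 155, 200]
even = [60, 77, 107, 108, 112, 114, 120, 155, 200, 219]
print(statistics.median(odd))   # 112
print(statistics.median(even))  # 113.0, the mean of 112 and 114

# Hypothetical extreme value: one day's earnings become 3500 instead of 350.
with_outlier = [3500, 150, 100, 350, 50]
print(statistics.mean(with_outlier))    # 830: the mean shifts a lot
print(statistics.median(with_outlier))  # 150: the median stays the same
```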
Standard deviation (SD) is a measure of dispersion around the mean. In a normal
distribution, about 68% of cases fall within one SD of the mean and about 95% of cases fall within 2
SD.
e.g. if the mean age is 47, with a standard deviation of 12, about 95% of the cases would
be between 23 and 71 (47 ± 2 × 12) in a normal distribution.
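The 68% and 95% figures are rounded values, which can be verified with the NormalDist class from Python's standard library. This is only a sketch, and it assumes that ages really do follow a normal distribution with the mean and SD from the example.

```python
from statistics import NormalDist

# Ages modelled as a normal distribution, using the figures from the example.
ages = NormalDist(mu=47, sigma=12)

# Range covered by two standard deviations either side of the mean.
low = ages.mean - 2 * ages.stdev
high = ages.mean + 2 * ages.stdev
print(f"two-SD range: {low:.0f} to {high:.0f}")

# Share of cases within one and within two SD of the mean.
within_one_sd = ages.cdf(ages.mean + ages.stdev) - ages.cdf(ages.mean - ages.stdev)
within_two_sd = ages.cdf(high) - ages.cdf(low)
print(f"within 1 SD: {within_one_sd:.1%}")
print(f"within 2 SD: {within_two_sd:.1%}")
```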
Univariate hypothesis tests deal with problems involving only one variable.
Bivariate problems involve two variables.
A correlation coefficient (korelacijski koeficient) is a measure of the degree to which
two variables are linearly related. A correlation (korelacija) is the interdependence of
two variables in a population (NB: correlation is not causation).
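The most commonly used coefficient is Pearson's r, which ranges from -1 (perfect negative linear relation) through 0 (no linear relation) to +1 (perfect positive linear relation). The sketch below computes it from its definition; the function name pearson_r and the data (years of experience against words translated per hour) are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
years_of_experience = [1, 2, 3, 5, 8, 10]
words_per_hour = [300, 340, 390, 420, 500, 520]

r = pearson_r(years_of_experience, words_per_hour)
print(f"r = {r:.2f}")
```

A value of r close to +1 here would only show that the two variables rise together; it would not show that experience causes the higher speed.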
Statistical research
1. Descriptive
Define the problem
Survey existing research
Collect data (e.g. survey)
Discussion
Conclusions (regarding the definition of the problem)
2. Inferential
Define the problem
Survey existing research
Put forward a hypothesis
Collect data
Test the hypothesis
Discussion
Conclusions (regarding the hypothesis)
Notes
1. Be clear what you are trying to measure (and why).
2. Be sure you have chosen the best means to measure it.
3. Present findings in an appropriate format: table, bar chart (categories or groups),
histogram (graphic presentation of frequency), frequency curve.
4. Be very cautious when interpreting your results.