Statistics

Empirical research in TS involves collecting, processing and interpreting data. To do this you need a basic understanding of some statistical concepts (though these days, software often does the work for you).

Representativeness

Representativeness is the extent to which your data is typical or representative of a wider population. If your data is a special case, you cannot really generalise from it.

A population (populacija) is a group of phenomena (people or things) that have something in common.

A sample (vzorec) is a smaller group of members of the population taken to represent that population. In order to use statistics to learn about the population, the sample must be random (naključen), i.e. every member of the population has an equal chance of being selected.

A parameter (parameter) is a characteristic of a population.

A variable (spremenljivka) is an observable characteristic of a phenomenon that can be measured or classified.

A statistic (statistika) is a characteristic of a sample.

Inferential statistics (inferenčna statistika) enables you to make an educated guess about a population parameter based on a statistic computed from a random sample.

Descriptive statistics (opisna statistika) is the analysis of data without such generalisation.

Hypothesis test

Setting up and testing hypotheses (hipoteze, domneve) is an essential part of statistical inference. To formulate such a test, some theory has usually been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but it has not been proved. An example is the claim that a new drug is better than the current drug for treating the same symptoms.

In each problem considered, the question is simplified into two competing claims or hypotheses between which we have a choice: the null hypothesis (ničelna hipoteza), denoted H0, against the alternative hypothesis, denoted H1.

These two competing hypotheses are not, however, treated on an equal basis: special consideration is given to the null hypothesis. There are two common situations:

1. The experiment has been carried out in an attempt to disprove or reject a particular hypothesis, the null hypothesis. We therefore give that one priority, so it cannot be rejected unless the evidence against it is sufficiently strong. For example, H0: there is no difference in taste between coke and diet coke, against H1: there is a difference.
2. If one of the two hypotheses is 'simpler', we give it priority so that a more 'complicated' theory is not adopted unless there is sufficient evidence against the simpler one. For example, it is 'simpler' to claim that there is no difference in flavour between coke and diet coke than it is to say that there is a difference.

The hypotheses are often statements about population parameters such as expected value and variance; for example, H0 might be that the expected value of the height of ten-year-old boys in the Scottish population is not different from that of ten-year-old girls. A hypothesis might also be a statement about the distributional form of a characteristic of interest, for example that the height of ten-year-old boys is normally distributed within the Scottish population.

The outcome of a hypothesis test is "Reject H0 in favour of H1" or "Do not reject H0".
[See: http://www.stats.gla.ac.uk/steps/glossary]

A confidence interval (intervali zaupanja) gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. If independent samples are taken repeatedly from the same population, and a confidence interval is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but other levels such as 90% or 99% can also be used.
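
The repeated-sampling reading of a confidence interval described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the population is taken to be normal with a mean of 100 and a standard deviation of 15 (invented values, not from this handout), the interval uses the normal approximation (sample mean plus or minus 1.96 standard errors), and the function names `mean_confidence_interval` and `coverage` are hypothetical.

```python
import random
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the population mean.

    Normal approximation: sample mean +/- z * standard error.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    return mean - z * standard_error, mean + z * standard_error


def coverage(true_mean=100.0, true_sd=15.0, sample_size=50, repeats=2000, seed=1):
    """Share of intervals, over repeated random samples, that contain the true mean.

    true_mean and true_sd describe an assumed, made-up population.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # Draw one random sample from the assumed population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        low, high = mean_confidence_interval(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / repeats


print(f"Share of intervals containing the true mean: {coverage():.3f}")
```

The printed share should come out close to 0.95. It will not be exactly 0.95, because the number of repetitions is finite and the normal approximation is slightly optimistic for small samples.
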

The width of the confidence interval gives us some idea of how uncertain we are about the unknown parameter. A very wide interval may indicate that more data should be collected before anything very definite can be said about the parameter. Confidence intervals are more informative than the simple results of hypothesis tests (where we decide "reject H0" or "don't reject H0"), since they provide a range of plausible values for the unknown parameter.

The confidence level is the probability value (1 - α) associated with a confidence interval. It is often expressed as a percentage. For example, if α = 0.05, then the confidence level is equal to (1 - 0.05) = 0.95, i.e. a 95% confidence level.

Example: Suppose an opinion poll predicted that, if the election were held today, the Conservative party would win 60% of the vote. The pollster might attach a 95% confidence level to the interval 60% plus or minus 3%. That is, the pollster thinks it very likely that the Conservative party would get between 57% and 63% of the total vote.

Distribution (porazdelitev) is a collection of measurements: how scores tend to be dispersed about a measurement scale.

Normal distribution (normalna porazdelitev): the Gauss curve (Gaussova krivulja) or bell-shaped curve.

Measures of central tendency

Numbers that tend to cluster around the 'middle' of a set of values, e.g. the mean, the median and the mode.

The mean (povprečje, aritmetična sredina) is the sum of the measures divided by the number of measures (the average).
e.g. daily weekday earnings: 350 €, 150 €, 100 €, 350 €, 50 €
mean = 1000 / 5 = 200 €

The median (mediana, srednja vrednost) is the middle value when the measures are arranged in order. It is less influenced by extreme values than the mean is.
e.g. 50 €, 100 €, 150 €, 350 €, 350 €: median is 150 €
NB:
odd number of measures: 60, 77, 107, 108, 112, 114, 120, 155, 200: median is 112
even number of measures: 60, 77, 107, 108, 112, 114, 120, 155, 200, 219: median is 113 (the mean of the two middle values, 112 and 114)

The mode (modus) is the value that occurs most often.
In the earnings example above, the mode is 350 €.

Standard deviation (SD) is a measure of dispersion around the mean. In a normal distribution, about 68% of cases fall within one SD of the mean and about 95% of cases fall within two SD. e.g. if the mean age is 47, with a standard deviation of 12, about 95% of the cases would be between 23 and 71 (47 ± 2 × 12) in a normal distribution.

Univariate hypothesis tests deal with problems involving only one variable. Bivariate problems involve two variables.

A correlation (korelacija) is the interdependence of two variables in a population (NB: correlation is not causation). A correlation coefficient (korelacijski koeficient) is a measure of the degree to which two variables are linearly related.

Statistical research

1. Descriptive
   Define the problem
   Survey existing research
   Collect data (e.g. survey)
   Discussion
   Conclusions (regarding the definition of the problem)
2. Inferential
   Define the problem
   Survey existing research
   Put forward a hypothesis
   Collect data
   Test the hypothesis
   Discussion
   Conclusions (regarding the hypothesis)

Notes

1. Be clear what you are trying to measure (and why).
2. Be sure you have chosen the best means to measure it.
3. Present findings in an appropriate format: table, bar chart (categories or groups), histogram (graphic presentation of frequency), frequency curve.
4. Be very cautious when interpreting your results.
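
As a quick check of the worked figures given earlier for the mean, median, mode and the two-SD range, here is a minimal Python sketch using only the standard library `statistics` module. The variable names and the helper `two_sd_range` are illustrative assumptions; the numbers themselves are the ones used in this handout.

```python
import statistics

# Daily weekday earnings in euros, as in the example in the text.
earnings = [350, 150, 100, 350, 50]

mean = statistics.mean(earnings)      # 1000 / 5 = 200
median = statistics.median(earnings)  # middle of 50, 100, 150, 350, 350 = 150
mode = statistics.mode(earnings)      # most frequent value = 350

# Median with an odd and an even number of measures.
odd_measures = [60, 77, 107, 108, 112, 114, 120, 155, 200]
even_measures = odd_measures + [219]


def two_sd_range(mean_value, sd):
    """Range that holds about 95% of cases in a normal distribution."""
    return mean_value - 2 * sd, mean_value + 2 * sd


print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("median (odd count):", statistics.median(odd_measures))    # 112
print("median (even count):", statistics.median(even_measures))  # mean of 112 and 114
print("two-SD range for mean 47, SD 12:", two_sd_range(47, 12))  # (23, 71)
```

Note that the mean (200 €) does not occur in the data at all, while the median (150 €) and the mode (350 €) do. This is one reason to report more than one measure of central tendency.
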