Scientific Research Methodology in Linguistics 1302740
Dr. Ahmad El-Sharif
Basic Statistics

WHAT IS STATISTICS?
Statistics is a group of methods used to collect, analyze, present, and interpret data and to make decisions.
Statistics is an objective way of interpreting a collection of observations.

POPULATION VERSUS SAMPLE
A population consists of all elements (individuals, items, or objects) whose characteristics are being studied. The population that is being studied is also called the target population.
A study that includes every member of the population is called a census.
The technique of collecting information from a portion of the population is called sampling.
The portion of the population selected for study is referred to as a sample.
A sample drawn in such a way that each element of the population has an equal chance of being selected is called a simple random sample.

SAMPLING
1. Random sampling: tables of random numbers
2. Stratified random sampling: strata = subgroups of the population; sample from each stratum
3. Systematic sampling: pick a start and sample every nth item
4. Random assignment
Justifying post hoc explanations
Convenience sample?
How good does the sample have to be? Good enough for our purposes!

TYPES OF STATISTICS
Descriptive statistics consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures.
Inferential statistics consists of methods that use information from samples to make predictions, decisions, or inferences about a population.

RAW FREQUENCY
Frequency itself does NOT tell you much about the validity of a hypothesis.
e.g. There are 25 instances of code-switching in 30 minutes of recorded interview with a subject. So what?
Does this mean that our subjects code-switch frequently, or infrequently, when they speak?
Statistics is informative when it is comparative.
The size of a sample may affect the level of statistical significance.
The common base for normalization must be comparable to the sizes of the collected data.
e.g. normalizing the spoken vs. written registers to a common base of 1000 tokens?

BASIC DEFINITIONS
A variable is a characteristic under study that assumes different values for different elements.
e.g. the temperatures measured in a specific location at different times of the day/month/year
A variable on which every element has exactly the same value is a constant.
e.g. the temperature at which water starts to boil
The value of a variable for an element is called a score, observation, or measurement.
A data set is a collection of scores on one or more variables.
e.g. temp on 12.07.2010 (Int. 2 hrs): 12, 12, 15, 16, 22, 14, 14, 13, 13, 10, 10, 10
A distribution is a collection of scores or measurements on a particular variable.

TYPES OF VARIABLES
A variable whose values are countable is called a discrete variable. In other words, a discrete variable can assume only a limited number of values, with no intermediate values.
e.g. A group of ten students took a test, and their scores out of 10 are as follows: 4, 5, 6, 6, 7, 7, 7, 9, 9, 10
A variable that can assume any numerical value over a certain interval or intervals is called a continuous variable.
e.g. The salaries of a group of ten employees in a company are as follows: 454, 514, 613, 644, 765, 765, 744, 906, 995, 1044, …
A variable that cannot assume a numerical value but can be classified into two or more categories is called a categorical variable.
e.g. the answer to the question "Where is this speaker from?"
Possible answers: Amman, Salt, Irbid, Aqaba, etc.
Categories: north, south, etc.
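
The normalization to a common base of 1000 tokens raised under RAW FREQUENCY can be sketched in Python as follows. This is only an illustrative sketch: the function name `per_thousand` and all counts and corpus sizes are hypothetical and were invented for the example; they do not come from the course material.

```python
def per_thousand(count: int, total_tokens: int, base: int = 1000) -> float:
    """Normalize a raw frequency to a rate per `base` tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return count * base / total_tokens


# Hypothetical figures, invented purely for illustration.
spoken = {"count": 25, "tokens": 5000}     # 25 hits in a 5,000-token spoken sample
written = {"count": 40, "tokens": 20000}   # 40 hits in a 20,000-token written sample

spoken_rate = per_thousand(spoken["count"], spoken["tokens"])
written_rate = per_thousand(written["count"], written["tokens"])

# The raw count is higher in the written sample (40 vs. 25),
# but the normalized rate is higher in the spoken sample.
print(f"spoken:  {spoken_rate:.1f} per 1000 tokens")
print(f"written: {written_rate:.1f} per 1000 tokens")
```

With these made-up figures, the written sample has more raw hits but a lower rate per 1000 tokens, which is why raw frequencies from samples of different sizes should not be compared directly.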
MEASURES OF CENTRAL TENDENCY
Mean: the center of gravity or balance point of the distribution
Median: the score that divides the distribution into two groups of equal size
Mode: the most frequent score in a distribution

Mean
The mean is the arithmetic average.
It is the most common measure of central tendency.
It is calculated by adding all of the scores together and then dividing the sum by the number of scores in the data set.
While the mean is a useful measure, it can be an uncertain guide unless we also know how dispersed (i.e. spread out) the scores in a dataset are.
The mean is the center of gravity.

Median
The median is the middle score of a set of scores ordered from the lowest to the highest.
For an odd number of scores, the median is the central score in an ordered list.
For an even number of scores, the median is the average of the two central scores.
In the example provided, the median is 6 (i.e. (6+6)/2), from the data set: 3, 4, 5, 5, 6, 6, 6, 8, 8, 9
The calculation of the median consists of the following two steps:
1. Rank the data set in increasing order.
2. Find the middle number in the data set such that half of the scores are above and half below. The value of this middle number is the median.
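
The three measures of central tendency can be sketched in Python using the data set from the median example (3, 4, 5, 5, 6, 6, 6, 8, 8, 9). The helper names `arithmetic_mean` and `median_two_step` are hypothetical and chosen only for this illustration; the standard-library `statistics` module is used as a cross-check.

```python
import statistics

scores = [3, 4, 5, 5, 6, 6, 6, 8, 8, 9]  # data set from the median example


def arithmetic_mean(values):
    """Sum of all scores divided by the number of scores."""
    if not values:
        raise ValueError("need at least one score")
    return sum(values) / len(values)


def median_two_step(values):
    """Step 1: rank in increasing order. Step 2: take the middle value."""
    if not values:
        raise ValueError("need at least one score")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: central score
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average of the two central scores


print("mean:  ", arithmetic_mean(scores))
print("median:", median_two_step(scores))
print("mode:  ", statistics.mode(scores))
```

For this data set all three measures coincide at 6, which is not generally the case for skewed data.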
Example: Calculate the median for the following measurements for ambiguity: 8, 9, 8, 6, 6, 6, 5, 5, 3, 4

Mode
The mode is the most common score in a set of scores.
The mode in our testing example is 6, because this score occurs more frequently than any other score: 8, 9, 8, 6, 6, 6, 5, 5, 3, 4
A distribution with a single mode is said to be unimodal.
A distribution with more than one mode is said to be bimodal, trimodal, etc., or in general, multimodal.

MEASURES OF DISPERSION
Range
The range is the highest value in the distribution minus the lowest value in the distribution.
It is a simple way to measure the dispersion of a set of data.
e.g. ambiguity scale scores: 8, 9, 8, 6, 6, 6, 5, 5, 3, 4
In our example the range is 6 (i.e. highest 9 minus lowest 3).
The range is only a poor measure of dispersion: an unusually high or low score in a dataset may make the range unreasonably large, thus giving a distorted picture of the dataset.

Deviation
A deviation score measures the distance of each score in the dataset from the mean.
e.g. ambiguity scale scores: 8, 9, 8, 6, 6, 6, 5, 5, 3, 4 (mean = 6)
In our test results, the deviation of the score 4 is -2 (i.e. 4 - 6), and the deviation of the score 9 is +3 (i.e. 9 - 6).
Scores above the mean have positive deviations, while scores below the mean have negative deviations, so for the whole dataset the sum of these deviations is always zero.
It is therefore meaningless to use the sum of the raw deviations to measure the dispersion of a whole dataset.

Variance
The variance is a measure of how different scores are on average, in squared units.
We can avoid the problem above (deviation scores sum to 0) by squaring each deviation score before summing them and averaging.
The square root of the variance is known as the standard deviation (SD).

Standard Deviation
The standard deviation is a number used to tell how measurements for a group are spread out from the average (mean), or expected value.
A low standard deviation means that most of the numbers are very close to the average.
A high standard deviation means that the numbers are spread out.

REPRESENTING DISTRIBUTION GRAPHICALLY
The graph used to represent a distribution can be: uniform, skewed, bell-shaped (normal), or ogive (S-shaped).
A normal curve is symmetric about the mean.
For a normal distribution, approximately
1. 68% of the observations lie within one standard deviation of the mean
2. 95% of the observations lie within two standard deviations of the mean
3. 99.7% of the observations lie within three standard deviations of the mean
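
The three percentages for the normal distribution can be checked numerically. The sketch below relies on the standard result that, for a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)); the function name `share_within` is hypothetical and used only for this illustration.

```python
import math


def share_within(k: float) -> float:
    """Probability that a normally distributed observation lies
    within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

The printed values are roughly 68.3%, 95.4%, and 99.7%, which the list above rounds to 68%, 95%, and 99.7%.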
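
Returning to the ambiguity scale scores used under MEASURES OF DISPERSION, the range, deviation scores, variance, and standard deviation can be worked through as a short sketch. Two assumptions are made here that the handout does not state: each deviation is taken as score minus mean, and the variance is shown both with the population formula (divide by n) and with the sample formula (divide by n - 1), since the text does not say which one is intended.

```python
import math
import statistics

scores = [8, 9, 8, 6, 6, 6, 5, 5, 3, 4]  # ambiguity scale scores from the text

n = len(scores)
mean = sum(scores) / n                     # 60 / 10
value_range = max(scores) - min(scores)    # highest minus lowest

# Deviation of each score from the mean (assumed convention: score - mean).
deviations = [x - mean for x in scores]
sum_of_deviations = sum(deviations)        # always zero, so useless as a spread measure

# Squaring removes the signs, so the deviations no longer cancel out.
sum_of_squares = sum(d ** 2 for d in deviations)

population_variance = sum_of_squares / n       # assumption: divide by n
sample_variance = sum_of_squares / (n - 1)     # alternative: divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print("mean:", mean)
print("range:", value_range)
print("sum of deviations:", sum_of_deviations)
print("population variance:", population_variance, "SD:", round(population_sd, 3))
print("sample variance:", round(sample_variance, 3), "SD:", round(sample_sd, 3))

# Cross-check against the standard library.
assert math.isclose(population_variance, statistics.pvariance(scores))
assert math.isclose(sample_variance, statistics.variance(scores))
```

Which divisor is appropriate depends on whether the ten scores are treated as the whole population or as a sample from a larger one.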