Scientific Research Methodology
in Linguistics
1302740
Dr. Ahmad El-Sharif
[email protected]
Basic Statistics
WHAT IS STATISTICS?
 Statistics is a group of methods used to collect, analyze, present, and interpret data and to make decisions.
 Statistics is an objective way of interpreting a collection of observations.
POPULATION VERSUS SAMPLE
 A population consists of all elements – individuals, items, or objects – whose characteristics are being
studied. The population that is being studied is also called the target population.
 A study that includes every member of the population is called a census. The technique of collecting
information from a portion of the population is called sampling.
 The portion of the population selected for study is referred to as a sample.
 A sample drawn in such a way that each element of the population has an equal chance of being selected is
called a simple random sample.
SAMPLING
1. Random sampling: tables of random numbers
2. Stratified random sampling
Strata = subgroups of the population. Sample from each stratum.
3. Systematic sampling
Pick a starting point and sample every nth element.
4. Random assignment
Justifying post hoc explanations
Convenience sample?
How good does the sample have to be?
Good enough for our purposes!
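The three sampling schemes listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the speaker IDs, the two strata and the sample sizes are hypothetical, and a pseudo-random generator stands in for a table of random numbers.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Simple random sample: every element has an equal chance of selection."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, seed=0):
    """Stratified random sample: draw k elements at random from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def systematic_sample(population, step, start=0):
    """Systematic sample: pick a start, then take every `step`-th element."""
    return population[start::step]

# Hypothetical population of 20 speakers, split into two hypothetical strata.
speakers = [f"S{i:02d}" for i in range(1, 21)]
strata = {"north": speakers[:10], "south": speakers[10:]}

print(simple_random_sample(speakers, 5))
print(stratified_sample(strata, 2))
print(systematic_sample(speakers, step=4, start=1))
```

In practice the start of a systematic sample is usually itself chosen at random; it is fixed here only to keep the sketch short.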
TYPES OF STATISTICS
 Descriptive Statistics consists of methods for organizing, displaying, and describing data by using tables,
graphs, and summary measures.
 Inferential Statistics consists of methods that use information from samples to make predictions, decisions
or inferences about a population.
RAW FREQUENCY
Frequency itself does NOT tell you much in terms of the validity of a hypothesis
e.g. There are 25 instances of code-switching in 30 minutes of a recorded interview with a subject. So what?
Does this mean that our subjects code-switch frequently – or infrequently – when they speak?
Statistics is informative when it is comparative
The size of a sample may affect the level of statistical significance
The common base for normalization must be comparable to the sizes of the collected data
Normalizing the spoken vs. written registers to a common base of 1000 tokens?
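To see why raw frequencies need a common base before they can be compared, here is a minimal sketch of normalization to occurrences per 1000 tokens. The corpus sizes are hypothetical, and the function name per_thousand is introduced only for this illustration.

```python
def per_thousand(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per 1000 tokens."""
    return raw_count * 1000 / corpus_size

# Hypothetical figures: the same raw frequency in corpora of different sizes.
spoken = per_thousand(25, 5_000)    # 25 instances in 5,000 spoken tokens
written = per_thousand(25, 50_000)  # 25 instances in 50,000 written tokens

print(spoken)   # 5.0 per 1000 tokens
print(written)  # 0.5 per 1000 tokens
```

The raw count is identical in both cases, but the normalized rate in the smaller corpus is ten times higher, which is the comparison the raw figure hides.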
BASIC DEFINITIONS
 A variable is a characteristic under study that assumes different values for different elements.
e.g.: the temperatures measured in a specific location at different times of the day/month/year.
 A variable on which everyone has the same exact value is a constant.
e.g. the temperature at which water starts to boil
 The value of a variable for an element is called a score, observation or measurement.
 A data set is a collection of scores on one or more variables.
e.g. temperature on 12.07.2010 (at 2-hour intervals): 12, 12, 15, 16, 22, 14, 14, 13, 13, 10, 10, 10
 A distribution is a collection of scores or measurements on a particular variable.
TYPES OF VARIABLES
 A variable whose values are countable is called a discrete variable. In other words, a discrete variable can
assume only a limited number of values with no intermediate values.
e.g. A group of ten students took a test and their scores out of 10 are as follows:
4, 5, 6, 6, 7, 7, 7, 9, 9, 10
 A variable that can assume any numerical value over a certain interval or intervals is called a continuous
variable.
e.g. The salaries of a group of ten employees in a company are as follows:
454, 514, 613, 644, 765, 765, 744, 906, 995, 1044,….
 A variable that cannot assume a numerical value but can be classified into two or more categories is called
a categorical variable.
 e.g. the answer of the question “Where is this speaker from?”:
 Possible Answers: Amman, Salt, Irbid, Aqaba,………etc.
 Categories: north, south, …etc.
MEASURES OF CENTRAL TENDENCY
 Mean – The center of gravity or balance point of the distribution
 Median – The score that divides the distribution into two groups of equal size
 Mode – The most frequent score in a distribution
Mean
 The mean is the arithmetic average
 The most common measure of central tendency
 Can be calculated by adding all of the scores together and then dividing the sum by the number of scores
 While the mean is a useful measure, unless we also know how dispersed (i.e. spread out) the scores in a dataset
are, the mean can be an uncertain guide
 The mean is obtained by dividing the sum of all values by the number of values in the data set.
 The Mean is the Center of Gravity
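As a small worked illustration of the calculation, the sketch below computes the mean of the ten test scores listed earlier under TYPES OF VARIABLES. The variable names are chosen only for this example.

```python
# Test scores (out of 10) of the ten students from the discrete-variable example.
scores = [4, 5, 6, 6, 7, 7, 7, 9, 9, 10]

total = sum(scores)           # add all of the scores together: 70
mean = total / len(scores)    # divide the sum by the number of scores

print(mean)  # 7.0
```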
Median
 The median is the middle score of a set of scores ordered from the lowest to the highest
 For an odd number of scores, the median is the central score in an ordered list
 For an even number of scores, the median is the average of the two central scores
 In the example provided the median is 6 (i.e. (6+6)/2) (from the data set: 3, 4, 5, 5, 6, 6, 6, 8, 8, 9)
Median
 The calculation of the median consists of the following two steps: Rank the data set in increasing order. Then,
find the middle number in the data set such that half of the scores are above and half below. The value of this
middle number is the median.
Example: Calculate the median for the following measurements for ambiguity:
8, 9, 8, 6, 6, 6, 5, 5, 3, 4
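The two steps (rank the data, then find the middle) can be sketched as follows. The function is written out by hand for illustration; it covers both the odd and the even case described above.

```python
def median(scores):
    """Rank the scores, then return the middle score (or the average of the two middle scores)."""
    ordered = sorted(scores)          # step 1: rank in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd number of scores: the central score
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the two central scores

ambiguity = [8, 9, 8, 6, 6, 6, 5, 5, 3, 4]
print(sorted(ambiguity))  # [3, 4, 5, 5, 6, 6, 6, 8, 8, 9]
print(median(ambiguity))  # 6.0
```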
Mode
 The mode is the most common score in a set of scores
 The mode in our testing example is 6, because this score occurs more frequently than any other score:
8, 9, 8, 6, 6, 6, 5, 5, 3, 4
 A distribution with a single mode is said to be unimodal
 A distribution with more than one mode is said to be bimodal, trimodal, etc., or in general, multimodal
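A minimal sketch of finding the mode, which also returns every mode when a distribution is multimodal. The second dataset is hypothetical and is included only to show a bimodal case.

```python
from collections import Counter

def modes(scores):
    """Return all scores that occur with the highest frequency, in ascending order."""
    counts = Counter(scores)                 # frequency of each score
    highest = max(counts.values())           # the largest frequency
    return sorted(score for score, freq in counts.items() if freq == highest)

print(modes([8, 9, 8, 6, 6, 6, 5, 5, 3, 4]))  # [6]      -> unimodal
print(modes([1, 1, 2, 2, 3]))                 # [1, 2]   -> bimodal (hypothetical data)
```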
MEASURES OF DISPERSION
Range
 Highest value in the distribution minus the lowest value in the distribution
 The range is a simple way to measure the dispersion of a set of data
 The difference between the highest and lowest frequencies / scores
e.g. ambiguity scale scores: 8, 9, 8, 6, 6, 6, 5, 5, 3, 4
 In our example the range is 6 (i.e. highest 9 – lowest 3)
 However, the range is only a crude measure of dispersion
 An unusually high or low score in a dataset may make the range unreasonably large, thus giving a distorted
picture of the dataset
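The sensitivity of the range to a single extreme score can be illustrated with a short sketch. The outlier value of 30 is hypothetical and is added only to show the effect.

```python
def value_range(scores):
    """Range: highest value minus lowest value."""
    return max(scores) - min(scores)

ambiguity = [8, 9, 8, 6, 6, 6, 5, 5, 3, 4]
with_outlier = ambiguity + [30]   # one hypothetical, unusually high score

print(value_range(ambiguity))     # 6  (9 - 3)
print(value_range(with_outlier))  # 27 (30 - 3)
```

One additional score more than quadruples the range, although the other ten scores are unchanged.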
Variance
 The variance is built from deviation scores: the distance of each score in the dataset from the mean
 e.g. ambiguity scale scores: 8, 9, 8, 6, 6, 6, 5, 5, 3, 4 (mean = 6)
 In our test results, the deviation of the score 4 is -2 (i.e. 4 – 6), and the deviation of the score 9 is +3 (i.e. 9 – 6)
 For the whole dataset, the sum of these deviations is always zero
 Some scores will be above the mean (deviation is +) while some will be below the mean (deviation is -)
 It is therefore meaningless to use the sum of the raw deviations to measure the dispersion of a whole dataset
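The point that the differences from the mean cancel out can be checked directly. In this sketch each difference is taken as score minus mean, so scores above the mean come out positive and scores below it negative.

```python
scores = [8, 9, 8, 6, 6, 6, 5, 5, 3, 4]
mean = sum(scores) / len(scores)            # 6.0

# Difference of each score from the mean (score minus mean).
deviations = [s - mean for s in scores]

print(deviations)       # [2.0, 3.0, 2.0, 0.0, 0.0, 0.0, -1.0, -1.0, -3.0, -2.0]
print(sum(deviations))  # 0.0
```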
Variance and Standard Deviation
 The variance measures how different the scores are from the mean, on average, in squared units
 We can avoid this problem (deviation scores sum to 0) by squaring each deviation score before summing them, and then averaging the squared deviations
 The square root of the variance is known as the standard deviation (SD).
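A worked sketch of the calculation for the ambiguity scale scores follows. One assumption is made loudly here: the sum of squared deviations is divided by the number of scores n (the population formula). The slides do not say whether n or n - 1 is intended; dividing by n - 1 instead gives the sample variance.

```python
import math

scores = [8, 9, 8, 6, 6, 6, 5, 5, 3, 4]
n = len(scores)
mean = sum(scores) / n                             # 6.0

# Square each deviation so that positive and negative values no longer cancel.
squared_deviations = [(s - mean) ** 2 for s in scores]
sum_of_squares = sum(squared_deviations)           # 32.0

variance = sum_of_squares / n                      # assumption: divide by n (population formula)
sd = math.sqrt(variance)                           # standard deviation = square root of variance

sample_variance = sum_of_squares / (n - 1)         # alternative: divide by n - 1 (sample formula)

print(variance)             # 3.2
print(round(sd, 3))         # 1.789
print(round(sample_variance, 3))  # 3.556
```

Because the variance is in squared units, the standard deviation is usually the easier figure to interpret: it is expressed in the same units as the scores themselves.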
Standard Deviation
 The standard deviation is a number used to tell how measurements for a group are spread out from the average
(mean), or expected value.
 A low standard deviation means that most of the numbers are very close to the average.
 A high standard deviation means that the numbers are spread out.
REPRESENTING DISTRIBUTION GRAPHICALLY
 The graph used to represent a distribution can be: Uniform, Skewed, Bell-shaped or Normal, Ogive or S-shaped
 A normal curve is symmetric about the mean
 For a normal distribution, approximately:
1. 68% of the observations lie within one standard deviation of the mean
2. 95% of the observations lie within two standard deviations of the mean
3. 99.7% of the observations lie within three standard deviations of the mean
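These three percentages can be checked numerically. The sketch below uses NormalDist from Python's standard statistics module and assumes a standard normal curve (mean 0, SD 1); the same proportions hold for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of observations lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```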