Introduction to Quantitative Data Analysis (continued)

Reading on Quantitative Data Analysis: Baxter and Babbie, 2004, Chapter 11.
Course website: http://www.sfu.ca/cmns/faculty/marontate_j/260/07-spring/
Audio recordings of Thursday lectures are available online (for students registered in the course) at www.sfu.ca/lectures

Last Day: Beginning of Quantitative Data Analysis
- Introduction to common ways of presenting statistics and their importance for analysis (descriptive statistics)
  - Tables
  - Charts
  - Graphs
- Univariate statistics
  - Measures of central tendency
  - Measures of dispersion

Discrete & Continuous Variables
- Continuous: the variable can take an infinite (or large) number of values within a range
  Ex.: age measured by exact date of birth
- Discrete: the attributes of the variable are distinct but not necessarily continuous
  Ex.: age measured by age groups
- (Note: techniques exist for making assumptions about discrete variables in order to use techniques developed for continuous variables)

The Lexis Diagram
[Figure: Lexis diagram. Vertical axis: age (0 to 80). Horizontal axis: period (1890 to 2010). Life line: cohort born in 1948. Isochron: observation in 1968. Age at year of observation: 20.]

Core Notions in Basic Univariate Statistics
- Ways of describing data about one variable (“uni” = one)
- Measures of central tendency: summarize information about one variable (“averages”)
- Measures of dispersion: variations or “spread”

Mode
- Most common or frequently occurring category or value (for all types of data)
Babbie (1995: 378)

Bimodal
- When there are two “most common” values that are almost the same (or the same)

Median
- Middle point of a rank-ordered list of all values (only for ordinal, interval or ratio data)
Babbie (1995: 378)

Mean (arithmetic mean)
- Arithmetic “average” = sum of values divided by number of cases (only for ratio and interval data)
Babbie (1995: 378)

Two Data Sets with the Same Mean

Another Diagram of Normal Curve (Showing Ideal Random Sampling Distribution, Standard Deviation & Z-scores)

Normal Distribution & Measures of Central Tendency
- Symmetric
- Also called the “Bell Curve”
Neuman (2000: 319)

Skewed Distributions & Measures of Central Tendency
- Skewed to the left
- Skewed to the right
Neuman (2000: 319)

Why Measures of Central Tendency Are Not Enough to Describe Distributions
- 7 people at bus stop in front of bar, aged 25, 26, 27, 30, 33, 34, 35: median = 30, mean = 30
- 7 people in front of ice-cream parlour, aged 5, 10, 20, 30, 40, 50, 55: median = 30, mean = 30
- BUT the issue of “spread” is socially significant

Another Illustration: Normal & Skewed Distributions

Measures of Variation or Dispersion
- Range: distance between largest and smallest scores
- Standard deviation: for comparing distributions
- Percentiles: % up to and including the number (from below)
- Z-scores: for comparing individual scores, taking into account the context of different distributions

Range & Interquartile Range
- Range: distance between largest and smallest scores
  - What does a short distance between the scores tell us about the sample?
  - But problems of “outliers” or extreme values may occur
- Interquartile range (IQR): distance between the 75th percentile and the 25th percentile
  - Range of the middle 50% (approximately) of the data
  - Eliminates the problem of outliers or extreme values

Example from StatCan website (11 in sample)
- Data set: 6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36
- Ordered data set: 6, 7, 15, 36, 39, 41, 41, 43, 43, 47, 49
- Median: 41
- Upper quartile: 43
- Lower quartile: 15
- IQR = 43 - 15 = 28

Standard Deviation and Variance
- The interquartile range eliminates the problem of outliers BUT eliminates half the data
- Solution? Measure variability from the center of the distribution
- Standard deviation & variance measure how far, on average, scores deviate or differ from the mean
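The two age groups and the StatCan data set can be checked with a short Python sketch using only the standard library. The function names `describe` and `iqr_median_of_halves` are illustrative, not from the lecture. Two assumptions are made: the standard deviation uses the population formula (divide by the number of cases), and the quartiles are taken as the medians of the lower and upper halves of the ordered data, leaving out the middle value when the sample size is odd. Other quartile conventions can give slightly different results.

```python
import statistics

# Ages from the two examples in the slides.
bar = [25, 26, 27, 30, 33, 34, 35]
parlour = [5, 10, 20, 30, 40, 50, 55]


def describe(ages):
    """Central tendency and dispersion for one list of ages.

    Assumption: population standard deviation (divide by N).
    """
    return {
        "mean": statistics.mean(ages),
        "median": statistics.median(ages),
        "range": max(ages) - min(ages),
        "sd": round(statistics.pstdev(ages), 1),
    }


def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves convention.

    Assumption: with an odd number of cases, the middle value is
    excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1


statcan = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]

print("bar:    ", describe(bar))
print("parlour:", describe(parlour))
print("StatCan Q1, Q3, IQR:", iqr_median_of_halves(statcan))
```

Both groups share a mean and median of 30, yet the range and standard deviation of the ice-cream parlour group are several times larger, which is the point of the comparison.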
Calculation of Standard Deviation
[Figure: steps in the calculation of the standard deviation. Neuman (2000: 321)]

Calculation of Standard Deviation
Neuman (2000: 321)

Standard Deviation Formula
Neuman (2000: 321)

Details on the Calculation of Standard Deviation
Neuman (2000: 321)

Discussion: The Bell Curve & Standard Deviation

Discussion of Preceding Diagram
- “Many biological, psychological and social phenomena occur in the population in the distribution we call the bell curve (Portney & Watkins, 2000).” (link to source)
- Preceding picture: “a symmetrical bell curve, average score [i.e., the mean] in the middle, where the ‘bell’ shape [is] tallest. Most of the people [i.e., 68% of them, or 34% + 34%] have performance within 1 segment [i.e., a standard deviation] of the average score.”

Interpreting Standard Deviation
- Amount of variation from the mean
- Illustration: high & low standard deviation
- Meaning depends on the exact case

Recall: Central Tendency & Dispersion (description of distributions)
- 7 people at bus stop in front of bar, aged 25, 26, 27, 30, 33, 34, 35: median = 30, mean = 30, range = 10, standard deviation = 3.8
- 7 people in front of ice-cream parlour, aged 5, 10, 20, 30, 40, 50, 55: median = 30, mean = 30, range = 50, standard deviation = 17.9

Other Ways of Characterizing Dispersion or Spread
- Techniques for understanding the position of a case (or group of cases) in the context of all cases
  - Percentiles
  - Standard scores (z-scores)

Percentile
- First calculate the rank, then choose a rank (score) and figure out the percentage equal to or less than that rank (score)
- (Link to more complex definition of percentile)
- % up to and including the number (from below)
- “A percentile rank is typically defined as the proportion of scores in a distribution that a specific score is greater than or equal to. For instance, if you received a score of 95 on a math test and this score was greater than or equal to the scores of 88% of the students taking the test, then your percentile rank would be 88. You would be in the 88th percentile.”
- Also used in other ways (for example, to eliminate cases)

Z-scores
- For understanding how a score is positioned in the data set, to enable comparisons with scores from other data sets (comparing individual scores in different distributions)
  - Example: two students from different schools with different GPAs
- For comparing sample distributions to the population: how representative is the sample of the population under study?
- (Link to more complete discussion of use of z-scores to understand sampling distribution)

Calculating Z-scores
- z-score = (score - sample mean) / standard deviation of set
- (Link to formula)
- (Link to z-score calculator)

Calculating Z-scores (p. 265 textbook)
- Susan with a GPA of 3.62 and Jorge with a GPA of 3.64

Using Z-scores to Compare Two Students from Different Schools: A
- Susan from College A
- Susan’s grade point average = 3.62
- Mean GPA = 2.62
- SD = 0.50
- Susan’s z-score = (3.62 - 2.62) / 0.50 = 1.00 / 0.50 = 2
- Susan’s grade is two standard deviations above the mean at her school

Using Z-scores to Compare Two Students from Different Schools: B
- Jorge from College B
- Jorge’s GPA = 3.64
- Mean GPA = 3.24
- SD = 0.40
- Jorge’s z-score = (3.64 - 3.24) / 0.40 = 0.40 / 0.40 = 1
- Jorge’s grade is one standard deviation above the mean at his school
- Susan’s absolute grade is lower, but her position relative to other students at her school is much higher than Jorge’s position at his school

Another Diagram of Normal Curve with Standard Deviation & Z-scores

Discussion of Previous Case
- Relationship of sampling distribution to population (use the mean of the sample to estimate the mean of the population)

Recall: Results with Two Variables: Bivariate Statistics
- Statistical relationships between two variables
  - Covariation (vary together): a type of association
  - Not necessarily causal
- Independence (null hypothesis): no relationship between the two variables
  - Cases with values in one variable do not have any particular value on the other variable

Sample Mean Notation

Population Mean Notation

Standard Error (recall tutorial task about average ages in family)
- Calculate the mean for all possible samples
- Divide by the number of samples
- Measures variability

Recall: Results with Two Variables: Bivariate Tables (Cross Tabulations)
Singleton, R., Straits, B. & Straits, M. (1993) Approaches to social research. Toronto: Oxford

Interpretation Issues (Bivariate Tables)
- Calculate percentages within categories of attributes of the independent variable
- In the example:
  - Independent variable: gender
  - Dependent variable: fear of walking alone at night
  - Women are more afraid than men

Other Ways of Presenting Same Data
- (Link to other tables)

Calculating Expected Outcomes
- If the variables (gender & fear) are not related, then the distribution within each subgroup of the independent variable (male & female) should be the same as in the group overall (therefore men and women should express fear in the same proportions)
- Used in techniques for studying relationships (Chi-square)
  - Descriptive dimension (strength of relationship)
  - Inferential dimension (probability that the association is due to chance)

Expected Outcomes (Null Hypothesis)
Singleton, R., Straits, B. & Straits, M. (1993) Approaches to social research. Toronto: Oxford

Next Day
- Control variables: trivariate tables

Men/Women Drivers
In Say It with Figures, Hans Zeisel presents the following data:

Automobile Accidents by Sex
------------------------------------------
         Per Cent Accident Free
Women    68% (6,950)
Men      56% (7,080)
------------------------------------------

Automobile Accidents by Sex and Distance Driven
----------------------------------------------------------------------------
         Distance under 10,000 km     Distance over 10,000 km
         Per Cent Accident Free       Per Cent Accident Free
Women    75% (5,035)                  48% (1,915)
Men      75% (2,070)                  48% (5,010)
----------------------------------------------------------------------------

Women have fewer accidents than men because women tend to drive less frequently than do men, and people who drive less frequently tend to have fewer accidents.
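The role of the control variable in Zeisel's data can be verified with a small Python sketch. It recombines the two distance groups into an overall accident-free rate for each sex, weighting each group's percentage by its number of drivers. The dictionary layout and the helper names `overall_rate` and `share_under_10k` are illustrative assumptions, not part of the lecture or of Zeisel's book. Because the published percentages are rounded, the recombined rates are approximate.

```python
# (per cent accident free as a fraction, number of drivers)
# for each sex and distance group, from the second table.
strata = {
    "Women": {"under_10k": (0.75, 5035), "over_10k": (0.48, 1915)},
    "Men": {"under_10k": (0.75, 2070), "over_10k": (0.48, 5010)},
}


def overall_rate(groups):
    """Weighted accident-free rate across the distance groups."""
    total = sum(n for _, n in groups.values())
    accident_free = sum(rate * n for rate, n in groups.values())
    return accident_free / total


def share_under_10k(groups):
    """Fraction of drivers who drove under 10,000 km."""
    total = sum(n for _, n in groups.values())
    return groups["under_10k"][1] / total


for sex, groups in strata.items():
    print(
        f"{sex}: overall accident free = {overall_rate(groups):.0%}, "
        f"drove under 10,000 km = {share_under_10k(groups):.0%}"
    )
```

Within each distance group the rates for women and men are identical. The gap in the first table appears only because most of the women fall in the low-distance group while most of the men fall in the high-distance group.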