What is Statistics?
Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty.

THINK, SHOW, TELL
1. Why? 2. Who? 3. What? 4. When? 5. Where? 6. How?

"Data are used to make a judgment about a situation"
1) What question needs to be answered?
2) How should we collect data, and how much?
3) How can we summarize the data?
4) What decisions or generalizations can be made about the question based on the data collected?

Population Data vs. Sample Data
Population data
• Everyone, everything
• Parameters: summary measurements (p, μ) of the population data.
Sample data
• Representative smaller "subset" of the population
• Statistics: summary measurements, denoted by standard letters (p̂, x̄), of the sample data.

Data: Types of Variables
Categorical: group of category names with no order. Example: eye color (brown, blue, green)
Quantitative: numerical values taken from an individual. Example: weight (117 lbs, 170 lbs, 253 lbs)

Types of Quantitative (Numerical) Data
Discrete. Example: number of siblings, number of pockets in a pair of jeans, number of free throws made in a season, ...
Continuous. Example: time, weight, height, ... Because our measurement accuracy is limited, we often round to the nearest second, ounce, inch, ...

Summarizing Data with a Bar Graph of Categorical Data
[Case table: TX_betweenHoustonDallas data set, cases 386 to 391; columns sex, age, race, hisp, ancestry, marital, eduCode, eduText, income, industry, job]
[Figure: bar chart of count by ancestry, TX_betweenHoustonDallas data set]

Summarizing Data with a Pie Chart for Categorical Data
[Figure: pie chart]

Dotplot for Univariate Quantitative Data
[Figure: dot plot of Temperature, paneldat data set]

Stemplot for Quantitative Data
Ages of Death of U.S. First Ladies
Key: 3 | 4 indicates 34 years old. The stem is on the left; each leaf is a single digit.
3 | 4, 6
4 | 3
5 | 2, 4, 5, 7, 8
6 | 0, 0, 1, 2, 4, 4, 4, 5, 6, 9
7 | 0, 1, 3, 4, 6, 7, 8, 8
8 | 1, 1, 2, 3, 3, 6, 7, 8, 9, 9
9 | 7

Split Stemplot
Age of 27 students randomly selected from Stat 303 at A&M
Each stem is split into five lines, one for every 2 leaf digits: (0, 1), (2, 3), (4, 5), (6, 7), and (8, 9)
1 | 7
1 | 8, 9, 9, 9, 9, 9
2 | 0, 0, 0, 0, 1, 1, 1, 1, 1, 1
2 | 2, 2, 2, 3, 3
2 | 4, 5
2 |
2 | 8
3 | 0, 1

Split Stemplot
Age of 27 students randomly selected from Stat 303 at A&M
Each stem is split into two lines, one for every 5 leaf digits: (0 thru 4) and (5 thru 9)
1 |
1 | 7, 8, 9, 9, 9, 9, 9
2 | 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4
2 | 5, 8
3 | 0, 1
3 |

Back-to-back Stemplot
Number of home runs in a season
          Babe Ruth |   | Roger Maris
                    | 0 | 8
                    | 1 | 3, 4, 6
               5, 2 | 2 | 3, 6, 8
               5, 4 | 3 | 3, 9
9, 7, 6, 6, 6, 1, 1 | 4 |
            9, 4, 4 | 5 |
                  0 | 6 | 1

Histogram: Univariate Quantitative Data
[Figure: histogram of age (the univariate variable) with frequency count on the vertical axis, TX_betweenHoustonDallas data set]

Boxplot and Modified Boxplot
A boxplot "divides data into 4 quarters", with 25% of the data in each section.
[Figures: box plots of Age_Wife and Ht_Husband, HusbandsAndWives data set]

Comparative Parallel Boxplots: Univariate quantitative data by category
[Figure: box plots of age by sex (M, F) with outliers marked, TX_betweenHoustonDallas data set]

Cumulative Frequency Plot

Scatterplot: Bivariate Quantitative Data
[Figure: scatter plot of LongJump_m against year, Olympics - Mens Field Trends data set]

Summary Features of Quantitative Variables
• Center: location
• Spread: variability
• Shape: distribution pattern of the data
• Any unusual features?
Explain in context.

Location: Center
Mean (μ, x̄): add up the data values and divide by the number of data values.
Median: list the data values in order and locate the middle data value (with an even number of values, average the two middle values).
Data set: 19, 20, 20, 21, 22
Mean is $\bar{x} = \frac{19 + 20 + 20 + 21 + 22}{5} = \frac{102}{5} = 20.4$
Median is 20 since it is the middle number of the ranked (ordered) data values.

Robust (Resistant) Statistic
The median is resistant to extreme values (outliers) in a data set.
The mean is NOT robust against extreme values. The mean is pulled away from the center of the distribution toward the extreme value (the "tails of the graph").

Of the 2 segments, where is the mean with respect to the median? Remember, the mean is pulled toward extreme values.

Where is the mean with respect to the median?

Mean or Median?

Location: pth Percentile
The pth percentile of a distribution (set of data) is the value such that p percent of the observations fall at or below it.
Suppose your Math SAT score is at the 80th percentile of all Math SAT scores. This means 80% of all Math SAT scores fall at or below your score.

Describing Location: Quartiles
Spread: Range and Interquartile Range
Range = Maximum – minimum
Q1 (Quartile 1) is the 25th percentile of the ordered data, or the median of the lower half of the ordered data.
Median (Q2) is the 50th percentile of the ordered data.
Q3 (Quartile 3) is the 75th percentile of the ordered data, or the median of the upper half of the ordered data.
IQR (Interquartile Range) = Q3 – Q1
Any point that falls outside the interval from Q1 – 1.5(IQR) to Q3 + 1.5(IQR) is considered a potential outlier.
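
The definitions of mean, median, resistance, and percentile above can be checked with a short Python sketch. It uses the slide's data set (19, 20, 20, 21, 22); the helper name `percent_at_or_below` and the modified data set containing 220 are illustrative assumptions, not part of the original slides.

```python
from statistics import mean, median


def percent_at_or_below(values, score):
    """Percent of observations that fall at or below `score`."""
    return 100 * sum(v <= score for v in values) / len(values)


# Data set from the slide.
ages = [19, 20, 20, 21, 22]
ages_mean = mean(ages)      # 102 / 5
ages_median = median(ages)  # middle value of the ordered data

# Hypothetical variation: replace the largest value with an extreme one.
with_outlier = [19, 20, 20, 21, 220]
outlier_mean = mean(with_outlier)      # pulled toward the extreme value
outlier_median = median(with_outlier)  # unchanged, so the median is resistant

print(ages_mean, ages_median)        # 20.4 20
print(outlier_mean, outlier_median)  # 60 20
print(percent_at_or_below(ages, 21))  # 80.0
```

In this small data set, 21 sits at the 80th percentile because four of the five values fall at or below it.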
Summary Statistics
Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
min = 1, Q1 = 3.5, median = 7, Q3 = 10.5, max = 13
Range = Max – min = 13 – 1 = 12
IQR = Q3 – Q1 = 10.5 – 3.5 = 7

Boxplot: 5-Number Summary
[Figure: box plot of ThouComputers (computers x 1000), ComputerDensity data set]
Data: 250, 1000, 2400, 3500, 5400, 8600
min = 250, Q1 = 1000, median = 2950, Q3 = 5400, max = 8600
IQR = Q3 – Q1 = 5400 – 1000 = 4400

Calculating boundaries for potential outliers
Find Q1 and Q3.                 Q1 = 10, Q3 = 20
Calculate IQR = Q3 – Q1.        IQR = 10
Multiply IQR by 1.5.            1.5·IQR = 15
Subtract this from Q1.          Q1 – 1.5·IQR = 10 – 15 = -5
Add it to Q3.                   Q3 + 1.5·IQR = 20 + 15 = 35
These are the boundaries: (-5, 35)
Any data value that falls outside of this interval is considered a potential outlier.

Describing Spread: Standard Deviation
Roughly speaking, standard deviation is the average distance values fall from the mean (center of graph).

Population and Sample Standard Deviation
Population: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$, where $\sigma^2$ is the population variance
Sample: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$, where $s^2$ is the sample variance

What is Variance?
Variance = (Standard deviation)^2

Calculated Standard Deviation is a measure of Variation in data
Sample Data Set           Mean    Standard Deviation
100, 100, 100, 100, 100   100     0
90, 90, 100, 110, 110     100     10
30, 90, 100, 110, 170     100     50
90, 90, 100, 110, 320     142     99.85

Descriptive Terms
Trend

Descriptive Terms of the Sample Distribution (Histogram) and Model (Red Curve)
Shape: bell-shaped curve, symmetric

Descriptive Terms of Population Models
Skewed Right (or Skewed Left): for a right-skewed model, the "tail" points to the right

Descriptive Terms of the Sample Distribution
Clusters, gaps, potential outliers
[Figure: histogram of Age_Husb_at_Marriage with count on the vertical axis, HusbandsAndWives data set]

Uniform Population Model
Total area under the curve (model) will always equal 1.

Various Population Models
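
The worked examples earlier in this section (the five-number summaries, the outlier boundaries, and the standard deviation table) can be reproduced with the following Python sketch. It assumes quartiles are computed as medians of the lower and upper halves of the ordered data, leaving out the overall median when the count is odd, which matches the slide's results. The function names are illustrative, and the six computer counts are the values as read from the ComputerDensity slide.

```python
from math import sqrt
from statistics import median, stdev


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); assumes at least two values."""
    data = sorted(values)
    half = len(data) // 2      # size of each half; middle value excluded if odd
    q1 = median(data[:half])   # median of the lower half
    q3 = median(data[-half:])  # median of the upper half
    return data[0], q1, median(data), q3, data[-1]


def outlier_boundaries(q1, q3):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


print(five_number_summary(range(1, 14)))  # (1, 3.5, 7, 10.5, 13)
print(five_number_summary([1000, 8600, 250, 2400, 3500, 5400]))
print(outlier_boundaries(10, 20))         # (-5.0, 35.0)

for data_set in ([100, 100, 100, 100, 100],
                 [90, 90, 100, 110, 110],
                 [30, 90, 100, 110, 170],
                 [90, 90, 100, 110, 320]):
    # sample_sd should agree with the standard library's statistics.stdev
    print(data_set, round(sample_sd(data_set), 2), round(stdev(data_set), 2))
```

The last data set shows how a single extreme value (320) raises both the mean and the standard deviation, while the variance in each case is simply the square of the printed standard deviation.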