Exploring Data
Descriptive Data

Content
• Types of Variables
• Describing data using graphical summaries
• Describing the Centre of Quantitative Data
• Describing the Spread of Quantitative Data
• How Measures of Position Describe Spread

Variable
• A variable is any characteristic that is recorded for the subjects in a study
• Examples: marital status, height, weight, IQ
• A variable can be classified as either
  – Categorical (e.g. gender: male / female)
  – Quantitative (e.g. age), which is in turn either
    • Discrete (e.g. number of children in a family), or
    • Continuous (e.g. weight: 70.25 kg)

Categorical Variable
• A variable is categorical if each observation belongs to one of a set of categories.
• Examples:
  1. Gender (Male or Female)
  2. Religion (Catholic, Jewish, …)
  3. Type of residence (Apartment, House, …)
  4. Belief in life after death (Yes or No)

Quantitative Variable
• A variable is called quantitative if observations take numerical values that represent different magnitudes of the variable.
• Examples:
  1. Age
  2. Number of brothers/sisters
  3. Annual income

Categorical vs. Quantitative
• Categorical variables
  – The percentage of observations in each category is important
  – E.g. % Male, % Female
• Quantitative variables
  – The center (a representative value) and the spread (variability) are important
  – E.g. average age, and variation around the average age

Discrete Quantitative Variable
• A quantitative variable is discrete if its possible values form a set of separate numbers: 0, 1, 2, 3, …
• Examples:
  1. Number of pets in a household
  2. Number of children in a family
  3. Number of foreign languages spoken by an individual

Continuous Quantitative Variable
• A quantitative variable is continuous if its possible values form an interval, so that it has an infinite continuum of possible values
• Continuous variables typically arise from measurements
• Examples:
  1. Height/Weight
  2. Age
  3. Blood pressure

Four Types of Scale
• Nominal
• Ordinal
• Interval
• Ratio

Nominal Scale
• The nominal scale is the simplest scale.
• Numbers or letters are assigned to objects and serve only as labels for identification or classification
• E.g. name and gender are categorical variables; gender can be coded as
  – ‘M’ for male and ‘F’ for female,
  – or ‘1’ for male and ‘2’ for female,
  – or ‘1’ for female and ‘2’ for male.
• Other examples include marital status, religion, race, colour and employment status.

Ordinal Scale
• Has the labelling property of the nominal scale, and in addition the categories follow an order
• An ordinal scale creates an ordered (ranked) relationship
• Typical ordinal scales
  – (i) result of an examination: first, second, third and fail;
  – (ii) quality of products: ‘excellent’, ‘good’, ‘fair’ or ‘poor’;
  – (iii) social class: upper, middle, lower class

Interval Scale
• Indicates order and distance in units
• The interval is a measuring tool
• But the zero point is arbitrary
• Example: a price index
  – The index for the base year (say 2010) is usually set to 100
  – Price of bread is 40 kn (= 100) in 2010
  – Price of bread is 50 kn (= 125) in 2015
  – We then know the price of bread is 25% higher in 2015
• Another example of an interval scale is temperature, where the zero point is arbitrary
  – 0 degrees is the freezing point of water in Celsius (used in Europe)
  – 32 degrees is the freezing point of water in Fahrenheit (used in the US)

Ratio Scale
• Ratio scales are absolute rather than relative
• If an interval scale has an absolute zero, then it is really a ratio scale.
• Absolute zero: the point on the scale where the attribute is zero (absent)
• Examples
  – Age, money and weight are ratio scales because they possess an absolute zero and interval properties
  – A person can’t have a negative weight or a negative age

Describing data using graphical summaries

Frequency Table
• A frequency table is a listing of the possible values for a variable, together with the number of observations (or the relative frequency, in %) for each value

Be careful to distinguish Proportions & Percentages (Rel. Freq.)
• Proportions and percentages are also called relative frequencies.
• A proportion is the count in a category divided by the total number of observations (a value between 0 and 1); a percentage is the proportion multiplied by 100.

Graphs for Categorical Variables
• Use pie charts and bar graphs to summarize categorical variables
  1. Pie chart: a circle having a “slice of pie” for each category
  2. Bar graph: a graph that displays a vertical bar for each category

Pie Charts
• Summarize a categorical variable
• Drawn as a circle where each category is a slice
• The size of each slice is proportional to the percentage in that category

Bar Graphs
• Summarize a categorical variable
• Vertical bars for each category
• The height of each bar represents either counts or percentages
• It is easier to compare categories with a bar graph than with a pie chart
• Called Pareto charts when the bars are ordered from tallest to shortest

Histograms
• A graph that uses bars to portray frequencies or relative frequencies for a quantitative variable
• Frequency is always on the vertical axis
• Intervals are always on the horizontal axis

Constructing a Histogram
1. Divide the range of the data into intervals of equal width
2. Count the number of observations in each interval
3. Label the endpoints of the intervals on the horizontal axis
4. Draw a bar over each value or interval with height equal to its frequency (or percentage)
5. Label and title the graph
[Figure: histogram of sodium in cereals]

Interpreting Histograms
• Assess where a distribution is centered by finding the median
• Assess the spread of a distribution
• Describe the shape of a distribution: roughly symmetric (left and right sides are mirror images), skewed to the right, or skewed to the left

Examples of Skewness
[Figure: examples of skewed distributions]

Shape: Type of Mound
• Height of 10-year-olds (typically a single mound)
• Electricity demand, or demand for seats in a restaurant, at different times of day (typically more than one mound)

Outlier
• An outlier falls far from the rest of the data

Time Plots
• Display a time series, i.e. data collected over time
• Plot each observation on the vertical axis against time on the horizontal axis
• Points are usually connected
[Figure: time plot, 1995 to 2001, of the number of people globally who use the Internet]

Describing the Centre of Quantitative Data

Mean
• The mean is the sum of the observations divided by the number of observations
• It is the center of mass (balance point) of the data

Median
• The midpoint of the observations when ordered from least to greatest
1. Order the observations
2. If the number of observations is:
   a) Odd, the median is the middle observation
   b) Even, the median is the average of the two middle observations

Example with 9 observations (odd): median = 99
Order  1   2   3   4   5   6    7    8    9
Data   78  91  94  98  99  101  103  105  114

Example with 10 observations (even): median = (99 + 101)/2 = 100
Order  1   2   3   4   5   6    7    8    9    10
Data   78  91  94  98  99  101  103  105  114  121

Comparing the Mean and Median
• The mean and median of a symmetric distribution are close
  – The mean is often preferred because it uses all the data
• But in a skewed distribution, the mean lies farther out in the skewed tail than the median
  – The median is preferred because it better represents a typical observation

Mode
• The value that occurs most often
• The highest bar in the histogram
• The mode is most often used with categorical data

Resistant Measures
• A measure is resistant if extreme observations (outliers) have little, if any, influence on its value
  – The median is resistant to outliers
  – The mean is not resistant to outliers
• Example: 75 people in a class
  – 72 people absent for 1 day in the year
  – 2 people absent for 50 days each
  – 1 person absent for 100 days
  – Median = 1 day
  – Mean = (72 x 1 + 2 x 50 + 100)/75 = 272/75 ≈ 3.63 days
  – Mode = 1 day

Describing the Spread of Quantitative Data

Range
• Range = max - min
• Two teams with the same average (mean) height = 2.0 m
  – Team 1: 2.5 m, 2.1 m, 2.1 m, 1.8 m, 1.5 m (range = 2.5 - 1.5 = 1.0 m)
  – Team 2: 2.2 m, 2.1 m, 2.0 m, 1.9 m, 1.8 m (range = 2.2 - 1.8 = 0.4 m)
• The range is strongly affected by outliers.

Properties of Sample Standard Deviation
1. Measures the spread of the data
2. Is zero only when all observations are the same; otherwise, s > 0
3. As the spread increases, s gets larger
4. Has the same units as the observations
5. Is not resistant
6. Strong skewness or outliers greatly increase s

How Measures of Position Describe Spread

Percentile
• The pth percentile is a value such that p percent of the observations fall below or at that value
[Figure: 70th percentile]

Finding Quartiles
• Quartiles split the data into four parts with the same number of observations in each part
1. Arrange the data in order
2. The median is the second quartile, Q2
3. Q1 is the median of the lower half of the observations
4. Q3 is the median of the upper half of the observations

Measure of Spread: Quartiles
• Quartiles divide a ranked data set into four equal parts:
1. 25% of the data are at or below Q1 and 75% are above (in the example, Q1 = first quartile = 2.2)
2. 50% of the observations are above the median and 50% are below (M = median = 3.4)
3. 75% of the data are at or below Q3 and 25% are above (Q3 = third quartile = 4.35)

Calculating Interquartile Range
• The interquartile range is the distance between the third and first quartiles, giving the spread of the middle 50% of the data: IQR = Q3 - Q1

Criteria for Identifying an Outlier
• An observation is a potential outlier if:
  – it falls more than 1.5 x IQR below the first quartile, or
  – it falls more than 1.5 x IQR above the third quartile.
• Example: Q1 = 25, median = 50, Q3 = 75, so IQR = 75 - 25 = 50
  – Potential outliers are below 25 - 1.5 x 50 = -50
  – or above 75 + 1.5 x 50 = 150

5 Number Summary
• The five-number summary of a dataset consists of:
1. Minimum value
2. First quartile
3. Median
4. Third quartile
5. Maximum value

Boxplot
1. The box goes from Q1 to Q3 (its length is the IQR)
2. A line is drawn inside the box at the median (the middle value)
3. Lines go
  – from the lower end of the box to the smallest observation that is not a potential outlier
  – from the upper end of the box to the largest observation that is not a potential outlier
4. Potential outliers are shown separately, often with * or +

Comparing Distributions using Boxplots
• Boxplots do not display the shape of the distribution as clearly as histograms, but they are useful for making graphical comparisons of two or more distributions
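
Worked example: centre and spread

The measures of centre and spread described in the slides can be checked on the slides' own numbers. The sketch below is one possible way to do this with Python's standard `statistics` module; it is not part of the original deck. The variable names and the helper `value_range` are illustrative, the split of the ten heights into two teams of five is an assumption (each group averages 2.0 m), and `statistics.stdev` is assumed to match the "sample standard deviation" the slides refer to (it divides by n - 1).

```python
import statistics

# Median slide: 9 observations (odd) and 10 observations (even).
odd_data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
even_data = odd_data + [121]
median_odd = statistics.median(odd_data)    # the middle observation
median_even = statistics.median(even_data)  # average of the two middle observations

# Resistant Measures slide: absences for a class of 75 people.
absences = [1] * 72 + [50] * 2 + [100]
mean_abs = statistics.mean(absences)      # 272 / 75, pulled up by the three large values
median_abs = statistics.median(absences)  # unaffected by the three large values
mode_abs = statistics.mode(absences)      # the most frequent value

# Range slide: two teams with the same mean height (grouping is an assumption).
team_a = [2.5, 2.1, 2.1, 1.8, 1.5]
team_b = [2.2, 2.1, 2.0, 1.9, 1.8]


def value_range(values):
    """Range = max - min."""
    return max(values) - min(values)


# Sample standard deviation (divides by n - 1); larger for the more spread-out team.
sd_a = statistics.stdev(team_a)
sd_b = statistics.stdev(team_b)

print("medians:", median_odd, median_even)
print("absences: mean", round(mean_abs, 2), "median", median_abs, "mode", mode_abs)
print("ranges:", round(value_range(team_a), 2), round(value_range(team_b), 2))
print("standard deviations:", round(sd_a, 3), round(sd_b, 3))
```

Both teams have the same mean, yet the range and the standard deviation are larger for the first team, which is the point of the Range slide: the centre alone does not describe a distribution.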
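
Worked example: quartiles, IQR and potential outliers

The quartile procedure, the 1.5 x IQR criterion and the five-number summary can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own material. The function names are hypothetical, the extra value 160 is added to the slides' median data purely for illustration, and the code assumes that the middle observation is left out of both halves when the number of observations is odd (the slides do not say which convention they use; statistical software often uses other quartile rules and may give slightly different values).

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # lower half of the ordered data
    upper = ordered[half + n % 2:]  # upper half; skips the middle value when n is odd
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


def outlier_fences(q1, q3):
    """Return the cut-offs Q1 - 1.5 x IQR and Q3 + 1.5 x IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def potential_outliers(values):
    """List the observations that fall outside the 1.5 x IQR fences."""
    q1, _, q3 = quartiles(values)
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quartiles(values)
    return min(values), q1, med, q3, max(values)


# The ten observations from the Median slide, plus a hypothetical extreme value.
data = [78, 91, 94, 98, 99, 101, 103, 105, 114, 121]
data_with_extreme = data + [160]

print("quartiles:", quartiles(data))
print("fences for Q1 = 25, Q3 = 75:", outlier_fences(25, 75))
print("five-number summary:", five_number_summary(data_with_extreme))
print("potential outliers:", potential_outliers(data_with_extreme))
```

A boxplot draws exactly these quantities: the box spans Q1 to Q3, the inner line marks the median, and any value returned by `potential_outliers` would be plotted as a separate point.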