Variable
An item of data
Examples:
  – gender
  – test scores
  – weight
Value varies from one observation to another

Types/Classifications of Variables
Qualitative
Quantitative
  – Discrete
  – Continuous

Qualitative Data
Describes the quality
Non-numerical format
Counts
Cannot order or measure
Examples:
  – gender
  – marital status
  – geographical region
  – job title…

Categorical Data
Non-overlapping categories or characteristics
Examples:
  – Completes/Incompletes
  – Professions
  – Gender

Quantitative Data
Frequencies
Measurements

Discrete
Measurements are integers
Examples:
  – number of employees of a company
  – number of incorrect answers on a test
  – number of participants in a program…

Continuous
Measurements can take on any value, usually within some range
Examples:
  – Age
  – Income
Arithmetic operations such as differences and averages make sense.

Qualitative or Quantitative? Discrete or Continuous?
Score on a placement exam
Preferred restaurant
Dollar amount of a loan
Height
Salary
Length of time to complete a task
Number of applicants
Ethnic origin

Treatment as Ranks
Natural order
Not strictly measured
Examples:
  – Age group
  – Likert Scale data
The distance between adjacent points on the scale is not necessarily the same

Analysis: Qualitative Data
Frequency tables
Modes - most frequently occurring
Graphs: Bar Charts and Pie Charts

Analysis: Quantitative Data
Any form
Create groups or categories and generate frequency tables
All descriptive statistics

Effective Graphs: Quantitative Data
Histograms
Stem-and-Leaf plots
Dot Plots
Box plots
XY Scatter Plots (2 variables)
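The frequency table and mode recommended above for qualitative data can be sketched in a few lines of Python. This is only an illustration: the restaurant responses and the helper name frequency_table are made up for the example and do not come from the slides.

```python
from collections import Counter

# Hypothetical survey responses (not from the slides): preferred restaurant.
responses = ["Thai", "Pizza", "Thai", "Sushi", "Pizza", "Thai", "Burger", "Thai"]

def frequency_table(values):
    """Return (category, count, share of all observations) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(category, n, n / total) for category, n in counts.most_common()]

table = frequency_table(responses)
mode = table[0][0]  # the most frequently occurring category

for category, count, share in table:
    print(f"{category:<8}{count:>3}{share:>8.1%}")
print("Mode:", mode)
```

Because the categories have no numeric meaning, counts, shares, and the mode are the only summaries computed here; a mean or median of restaurant names would not make sense.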
Examples of Graphs
[Figure: pie chart, "Performance Appraisals", with segments Much Easier, Easier, Same, Difficult, More Difficult]
[Figure: bar chart of East, West, and North by quarter, 1st Qtr to 4th Qtr]
[Figure: histogram, Frequency by Score]
[Figure: boxplot of C1]
[Figure: stem and leaf plot, "Weight of Meat"]

Analyze Ranked Data
Frequency tables
Mode, Median, Quartiles
Graphs:
  – Bar Charts
  – Dot Plots, Pie Charts
  – Line Charts (2 variables)

Data Example
Suggest some ways you could analyze these items.
Score on a placement exam
Preferred restaurant
Dollar amount of a loan
Height
Salary
Length of time to complete a task
Number of applicants
Ethnic origin

Tables and Graphs
Note: Excel will create any graph that you specify
Consider the type of data before selecting your graph.

Frequency Table/Frequency Distribution
Summarize data:
  – Categorical (nominal) data
  – Continuous data, once the data set has been divided into meaningful groups

Frequency Distribution
Count the number of observations that fall into each category.
Frequency: the number associated with each category

Relative Frequency Distribution
Proportion of observations falling in a given category
Report relative frequencies or percentages

Example Frequency Distribution
No. of Defective Parts
  At least   Less than    n
      0          2        0
      2          4        4
      4          6        5
      6          8       12
      8         10       14
     10         12        9
     12         14       10
     14         16        6

Graphs: Categorical/Qualitative Data

Pie Charts
Circle - divided proportionately
Segment - percentage of the whole that falls into each category
[Figure: pie chart, "Native Language": English 55%, Spanish 25%, Vietnamese 15%, Swedish 5%]

Bar Charts
Bar charts - % in various categories
Vertical scale: frequencies, relative frequencies
Horizontal scale: categories
Allows comparisons
[Figure: bar chart, "Average Units Sold (per person) by Product"; vertical axis Average Sold/Person; horizontal axis Product (B41, BA 42, B41F, C21, Other); series Before Training and After Training]

Constructing Bar Charts
All boxes should have the same width
Gaps between the boxes - no connection between the categories
Categories can appear in any order.
Use to represent two categorical variables simultaneously

Graphs: Measured Continuous Quantitative Data
Histograms
Stem and Leaf
Box plots
Line Graphs
XY Scatter Charts (2 variables)

Histograms
Drawn without gaps between the bars
Frequency distributions of continuous variables
[Figure: histogram, "Grade Distribution", Frequency by Grade]

Constructing Histograms
Non-overlapping intervals
Intervals - generally the same length
Number of values in each interval - class frequency
Relative frequencies
[Figure: histogram, "Grade Distribution", Frequency by Grade]

XY Scatter Chart
Two variables
Variables: quantitative and continuous.
Plot pairs - rectangular coordinate system
Examine the relationship between two variables
[Figure: XY scatter chart, "Absent by Age"; vertical axis Days Absent; horizontal axis Age]

Line Chart
Similar to the scatter chart
Values of the independent variable (shown on the horizontal axis) can be ranked values (i.e., they do not have to be continuous variables).
[Figure: line chart, "1997 Monthly Sales"; vertical axis Sales (x$10,000); horizontal axis Month, Jan to June]

Basic Principles for Constructing All Plots
Data should stand out clearly from background
The information should be clearly labeled
  – title
  – axes, bars, pie segments, etc.
    - include units that are needed to interpret the data
  – scale including starting points.

Principles cont.
Source
No clutter
Minimize information or data on one graph.
Try several approaches

Describing Data
Shape of the Distribution
  – Symmetry
  – Skewness
  – Modality: most frequently occurring value
  – Unimodal or bimodal or uniform
[Figure: three histograms of Frequency by Grade, labeled Right Skewed, Left Skewed, and Symmetrical]

Describing Data
Centrality
Spread
Extreme values

Measures of Centrality
Mean
Median
Mode

Mean
Most common measure
Extremely large values in a data set will increase the value of the mean
Extremely low values will decrease it.

Calculating the Mean
          T1     T2     T3
          85     85     85
          90     90     90
          75     35     75
          90     90    110
  Sum    340    300    360
  Mean    85     75     90

Median
Central point
Half of the data has a lower value than the median
Half of the data has a higher value than the median
Not affected by extremely large or small values

Find the Median
Data: 85 90 75 92 95
Sorted Data: 75 85 90 92 95
Median is 90.

Find the Median
Data: 95 90 92 85
Sorted Data: 85 90 92 95
Median: (90 + 92)/2 = 91

Measures of Spread

Range
Subtract the smallest value from the largest
Report the smallest and largest values.
Scores: 85 90 75 92 95
Range: 20, or 75 to 95

Variance/Standard Deviation
Average variation of the data values from the mean of the values
Variance: the average of the squared deviations from the mean
Standard deviation: the square root of the variance

The Empirical Rule
Symmetrical, bell-shaped data
Approximately:
68% of the data values are within one standard deviation of the mean
95% of the data values are within two standard deviations of the mean
99.7% of the data values are within three standard deviations of the mean

Chebyshev's Inequality
Any data, including skewed data
At least:
75% of the data values are within two standard deviations of the mean.
89% of the data values are within three standard deviations of the mean.
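As a rough sketch of the measures above, the following Python computes the mean, median, range, and standard deviation for the three score sets from the "Calculating the Mean" table, using the standard library's statistics module. The helper names describe and chebyshev_bound are introduced only for this example, and the population form of the standard deviation is an assumption, since the slides do not say which form they use.

```python
import statistics

# Score sets taken from the "Calculating the Mean" table in the slides.
tests = {
    "T1": [85, 90, 75, 90],
    "T2": [85, 90, 35, 90],
    "T3": [85, 90, 75, 110],
}

def describe(values):
    """Return basic measures of centrality and spread for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        # Population standard deviation (an assumption for this sketch);
        # statistics.stdev would give the sample version instead.
        "std_dev": statistics.pstdev(values),
    }

def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k**2

summary = {name: describe(scores) for name, scores in tests.items()}
for name, stats in summary.items():
    print(name, stats)
print("Chebyshev bound for k=2:", chebyshev_bound(2))
```

In these three sets only one score changes (75, 35, or 110), so the mean moves from 85 to 75 to 90 while the median stays at 87.5, which illustrates why the median is described as unaffected by extreme values.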
Measures of Relative Standing
Percentiles
Quartiles

Quartiles
The lower quartile is the same as the 25th percentile.
  – 25% of the scores are lower and
  – 75% of the scores are higher than the lower quartile.
The upper quartile is the same as the 75th percentile.
  – 75% of the scores are lower and
  – 25% of the scores are higher than the upper quartile.

Correlation
Describes the strength of the relationship between two (or more) variables
Pearson Product-moment Correlation Coefficient - assumes continuous quantitative data

Relationship between Variables
Positive
Negative
No relationship.

Interpreting Correlation Coefficients
0.20 to 0.35 - a slight relationship (little value in practical prediction situations)
0.50 - crude group prediction (correlations this low do not suggest a good relationship)
0.65 to 0.85 - group predictions that are good
Over 0.85 - a close relationship between the two variables.
Even a high correlation coefficient does not establish a cause and effect relationship!

Coefficient of Determination
Square of the correlation coefficient
Gives the percent of variation in the dependent variable that is 'explained' by the independent variable.
Look at an XY scatter plot

Least Squares Line
Describe the relationship between the two variables
Make predictions of the dependent variable from the independent variable

Positive Relationship
r will be a positive number.
[Figure: XY scatter plot of Y against X showing a positive relationship]

Negative Relationship
r will be a negative number.
[Figure: XY scatter plot of Y against X showing a negative relationship]
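The correlation coefficient, the coefficient of determination, and the least squares line can be worked through on a small data set. The sketch below uses made-up (x, y) pairs, not data from the slides, and the function names pearson_r and least_squares_line are chosen only for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical paired observations (independent variable x, dependent variable y).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_squared = r ** 2  # coefficient of determination
slope, intercept = least_squares_line(xs, ys)

# Reversing y turns the positive relationship into a negative one.
r_negative = pearson_r(xs, list(reversed(ys)))

print(f"r = {r:.3f}, r^2 = {r_squared:.2f}")
print(f"prediction at x = 6: {slope * 6 + intercept:.2f}")
print(f"r for reversed y = {r_negative:.3f}")
```

For these pairs r is about 0.77, so r squared is 0.60: roughly 60% of the variation in y is 'explained' by x. As the slides stress, that still says nothing about cause and effect.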