DATA DESCRIPTION

Units
A unit is the entity we are studying; a human unit is called a subject.
Each unit/subject has certain parameters. For example, a student (subject) has an age, weight, height, home address, number of course units taken, and so on.

Variables
These parameters are called variables.
In statistics, variables are stored in columns, each variable occupying one column.

Cross-sectional and time-series analyses
In a cross-sectional analysis, the unit/subject is the entity you are studying. For example, if you study the housing market in San Diego, a unit is a house, and the variables are the price, size, age, etc., of a house.
In a time-series analysis, the unit is a unit of time, such as an hour, day, or month.

Data types
Nominal data: male/female, colors
Ordinal data: excellent/good/bad
Interval data: temperature, GMAT scores
Ratio data: distance to school, price

Two forms
Graphical form
Numerical summary form

Graphical forms
Sequence plots
Histograms (frequency distributions)
Scatter plots

Sequence plots
A sequence plot describes a time series.
The horizontal axis is always related to the sequence in which the data were collected.
The vertical axis is the value of the variable.

Example: sequence plot
[Figure: sequence plot of the S&P 500; vertical axis from 430 to 470, horizontal axis "Index" from 10 to 40]

Histograms I
A histogram (frequency distribution) shows how many values are in a certain range. It is used for cross-sectional analysis.
The potential observation values are divided into groups (called classes).
The number of observations falling into each class is called the frequency.
When we say an observation falls into a class, we mean its value is greater than or equal to the lower bound but less than the upper bound of the class.

Example: histogram
A commercial bank is studying the time a customer spends in line.
They recorded the waiting times (in minutes) of 28 customers:

5.9  4.0  1.1  1.1  7.6  1.6  8.6
6.7  5.3  7.3  4.3  5.0  9.7  8.2
1.2  4.5  1.6  8.4  3.3  9.4  3.5
6.5  2.1  6.3  7.4  8.9  8.4  6.4

[Figure: histogram of the waiting times]

Histogram II
The relative frequency distribution depicts the ratio of each frequency to the total number of observations.
The cumulative distribution depicts the fraction (or percentage) of observations that are less than a specific value.

Example: relative frequency distribution
A "relative frequency" distribution plots the fraction (or percentage) of observations in each class instead of the actual number. For this problem, the relative frequency of the first class is 6/28 = 0.214. The remaining relative frequencies are 0.179, 0.250, 0.286, and 0.071. A graph similar to the histogram above can then be plotted.

Example: cumulative distribution
In the previous example, the fraction of observations that are less than 3 minutes is 0.214; less than 5 minutes is 0.214 + 0.179 = 0.393; less than 7 minutes is 0.214 + 0.179 + 0.250 = 0.643; less than 9 minutes is 0.214 + 0.179 + 0.250 + 0.286 = 0.929; and less than 11 minutes is 1.0.

[Figure: cumulative distribution of the waiting times]

Histogram III
The sum of all the relative frequencies is always 1.
The cumulative distribution is nondecreasing.
The last value of the cumulative distribution is always 1.
A cumulative distribution can be derived from the corresponding relative frequency distribution, and vice versa.

Probability
A random variable is a variable whose values cannot be predetermined but are governed by some random mechanism.
Although we cannot predict the value of a random variable precisely, we may be able to state the probability that it falls in a certain interval.
The relative frequency can also be read as the probability, estimated from the data, that the random variable falls in the corresponding class.
In the same sense, the relative frequency distribution is also the probability distribution.

Scatter plots
A scatter plot shows the relationship between two variables.
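Returning to the bank waiting-time example, the frequency, relative frequency, and cumulative distributions can be reproduced with a short sketch. It is an illustration rather than part of the original slides: the class boundaries 1, 3, 5, 7, 9, 11 are an assumption inferred from the cumulative example ("less than 3", "less than 5", and so on), and the name `frequency_table` is hypothetical.

```python
# Waiting times (minutes) of the 28 customers, copied from the example.
waiting_times = [
    5.9, 4.0, 1.1, 1.1, 7.6, 1.6, 8.6,
    6.7, 5.3, 7.3, 4.3, 5.0, 9.7, 8.2,
    1.2, 4.5, 1.6, 8.4, 3.3, 9.4, 3.5,
    6.5, 2.1, 6.3, 7.4, 8.9, 8.4, 6.4,
]

# ASSUMPTION: class boundaries inferred from the cumulative example;
# the slides do not list them explicitly.
class_edges = [1, 3, 5, 7, 9, 11]


def frequency_table(values, edges):
    """Return (frequencies, relative frequencies, cumulative fractions)."""
    classes = list(zip(edges[:-1], edges[1:]))
    # An observation falls into a class when lower <= value < upper.
    freqs = [sum(1 for v in values if lo <= v < hi) for lo, hi in classes]
    n = len(values)
    rel = [f / n for f in freqs]
    # Accumulate integer counts first so the last cumulative value is exactly 1.
    cum = []
    running = 0
    for f in freqs:
        running += f
        cum.append(running / n)
    return freqs, rel, cum


freqs, rel, cum = frequency_table(waiting_times, class_edges)
for lo, hi, f, r, c in zip(class_edges[:-1], class_edges[1:], freqs, rel, cum):
    print(f"[{lo:>2}, {hi:>2})  freq={f}  rel={r:.3f}  cum={c:.3f}")
```

The cumulative column is built from the frequencies, which illustrates how a cumulative distribution can be derived from the relative frequency distribution; differencing consecutive cumulative values goes the other way.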
The following are the height and foot size measurements of 8 men arbitrarily selected from students in the cafeteria. Heights and foot sizes are in centimeters.

Man           1      2      3      4      5      6      7      8
Height      155    160    149    175    182    145    177    164
Foot size  23.3   21.8   22.1   26.3   28.0   20.7   25.3   24.9

Example: scatter plot
[Figure: scatter plot of height (cm, vertical axis from 130 to 190) against foot size (cm, horizontal axis from 20 to 28)]

Numerical summary forms
Central location: mean, median, and mode.
Dispersion: standard deviation and variance.
Correlation.

Mean
The mean (average) is the sum of the observations divided by the number of observations.
27 22 26 24 27 20 23 24 18 32
Sum = 27 + 22 + 26 + 24 + 27 + 20 + 23 + 24 + 18 + 32 = 243
Mean = 243/10 = 24.3

Median
The median is the value of the central observation (the one in the middle) when the observations are listed in ascending or descending order.
When there is an even number of values, the median is the average of the middle two values.
When there is an odd number of values, the median is the middle value.

Example: median
18 20 22 23 24 24 26 27 27 32
There are ten values, so the median is the average of the fifth and sixth values: (24 + 24)/2 = 24.

Compare mean and median
The median is less sensitive to outliers than the mean. Check the mean and median for the following two data sets:
Data set 1: 18 20 22 23 24 24 26 27 27 32
Data set 2: 18 20 22 23 24 24 26 27 27 320

Mode
The mode is the most frequently occurring value (or values).

Symmetry and skew
A frequency distribution in which the area to the left of the mean is a mirror image of the area to the right is called a symmetrical distribution.
A distribution that has a longer tail on the right-hand side than on the left is called positively skewed, or skewed to the right.
A distribution that has a longer tail on the left is called negatively skewed, or skewed to the left.
If a distribution is positively skewed, the mean typically exceeds the median. For a negatively skewed distribution, the mean is typically less than the median.

Range
The range is the difference between the maximum and minimum values of the observations.
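The measures of central location and the range described above can be checked with a short sketch using Python's standard `statistics` module. It is an illustration, not part of the original slides: the helper name `summarize` is hypothetical, and the two lists are the data sets from the mean and median examples.

```python
import statistics

# Data from the "Mean" example and the outlier data set from
# "Compare mean and median".
sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
with_outlier = [18, 20, 22, 23, 24, 24, 26, 27, 27, 320]


def summarize(values):
    """Return the mean, median, mode(s), and range of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        # For an even count, median() averages the two middle values.
        "median": statistics.median(values),
        # multimode() returns every value tied for the highest count.
        "modes": statistics.multimode(values),
        "range": max(values) - min(values),
    }


for name, data in [("sample", sample), ("with outlier", with_outlier)]:
    print(name, summarize(data))
```

Replacing 32 with 320 leaves the median unchanged while the mean more than doubles, which is the sensitivity to outliers the comparison is meant to show.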
Standard deviation and variance
The standard deviation is used to describe the dispersion of the data.
The variance is the square of the standard deviation.

Calculation of S.D.
Calculate the mean.
Calculate the deviations from the mean.
Calculate the squares of the deviations and sum them.
Divide the sum by n - 1 and take the square root.

Example: S.D.
Sample        27     22     26     24     27     20     23     24     18     32
Deviation     2.7   -2.3    1.7   -0.3    2.7   -4.3   -1.3   -0.3   -6.3    7.7
Sq. of dev.   7.29   5.29   2.89   0.09   7.29  18.49   1.69   0.09  39.69  59.29

Sum of squared deviations = 7.29 + 5.29 + ... + 59.29 = 142.1

$$\text{Std. dev.} = \sqrt{\frac{142.1}{9}} = \sqrt{15.79} \approx 3.97$$

In general, for observations $x_1, x_2, \ldots, x_n$ with mean $\bar{x}$:

$$\text{std. dev.} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$

Empirical rules
If the distribution is symmetrical and bell-shaped:
Approximately 68% of the observations will be within one standard deviation of the mean.
Approximately 95% of the observations will be within two standard deviations of the mean.
Approximately 99.7% of the observations will be within three standard deviations of the mean.

Percentiles
The 75th percentile is the value such that 75% of the numbers are less than or equal to this value and the remaining 25% are larger than this value.
The k-th percentile is the value such that k% of the numbers are less than or equal to this value and the remaining (100 - k)% are larger than this value.

Correlation coefficient
The correlation coefficient measures how closely two variables are (linearly) related to each other.
It has a value between -1 and +1.
A positive value indicates a positive linear relationship, and a negative value indicates a negative linear relationship.
If two variables are not linearly related, the correlation coefficient will be close to zero; if they are closely related, the correlation coefficient will be close to 1 or -1.
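The standard deviation steps and the correlation coefficient can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the original slides: the function names are hypothetical, and because the slides give no formula for the correlation coefficient, the sketch assumes the usual Pearson definition. The data are the sample from the S.D. example and the height and foot size measurements from the scatter plot example.

```python
import math


def sample_std_dev(values):
    """Sample standard deviation, following the four steps in the text."""
    n = len(values)
    mean = sum(values) / n                      # step 1: the mean
    deviations = [x - mean for x in values]     # step 2: the deviations
    sum_sq = sum(d ** 2 for d in deviations)    # step 3: sum of squares
    return math.sqrt(sum_sq / (n - 1))          # step 4: divide by n - 1, root


def correlation(xs, ys):
    """Pearson correlation coefficient (ASSUMED definition, see lead-in)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
heights = [155, 160, 149, 175, 182, 145, 177, 164]
foot_sizes = [23.3, 21.8, 22.1, 26.3, 28.0, 20.7, 25.3, 24.9]

sd = sample_std_dev(sample)
r = correlation(heights, foot_sizes)
print(f"standard deviation = {sd:.2f}, variance = {sd ** 2:.2f}")
print(f"correlation(height, foot size) = {r:.2f}")
```

The correlation for the height and foot size data comes out strongly positive, which is consistent with the upward trend a scatter plot of these measurements would show.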