Download Data statistics and some comments on cstrings (arrays of char)

Measures of Central Tendency: Ungrouped Data Mode • Measures of central tendency yield information about “particular places or locations in a group of numbers.” • Common Measures of Location • The most frequently occurring value in a data set • Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio) – Mode – Median – Mean – Percentiles – Quartiles • Bimodal -- Data sets that have two modes • Multimodal -- Data sets that contain more than two modes Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 1 Mode -- Example • The mode is 44. • There are more 44s than any other value. Median 35 41 44 45 37 41 44 46 37 43 44 46 39 43 44 46 40 43 44 46 40 43 45 48 Stat 130D, UCLA, Ivo Dinov. • Middle value in an ordered array of numbers. • Applicable for ordinal, interval, and ratio data • Not applicable for nominal data • Unaffected by extremely large and extremely small values. Stat 130D, UCLA, Ivo Dinov. 3 4 Median: Example with an Odd Number of Terms Median: Computational Procedure Ordered Array 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22 • First Procedure – Arrange the observations in an ordered array. – If there is an odd number of terms, the median is the middle term of the ordered array. – If there is an even number of terms, the median is the average of the middle two terms. • • • • There are 17 terms in the ordered array. Position of median = (n+1)/2 = (17+1)/2 = 9 The median is the 9th term, 15. If the 22 is replaced by 100, the median is 15. • If the 3 is replaced by -103, the median is 15. • Second Procedure – The median’s position in an ordered array is given by (n+1)/2. Stat 130D, UCLA, Ivo Dinov. 2 Stat 130D, UCLA, Ivo Dinov. 5 Page 1 6 Median: Example with an Even Number of Terms Arithmetic Mean Ordered Array 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 • • • • • Commonly called ‘the mean’ is the average of a group of numbers Applicable for interval and ratio data Not applicable for nominal or ordinal data Affected by each value in the data set, including extreme values • Computed by summing all values in the data set and dividing the sum by the number of values in the data set • There are 16 terms in the ordered array. • Position of median = (n+1)/2 = (16+1)/2 = 8.5 • The median is between the 8th and 9th terms, 14.5. • If the 21 is replaced by 100, the median is 14.5. • If the 3 is replaced by -88, the median is 14.5. Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 7 8 Population Mean µ= ∑ X = X + X + X +... + X 1 2 3 Sample Mean X= N 2 3 n Stat 130D, UCLA, Ivo Dinov. 9 10 Percentiles: Computational Procedure Percentiles • Organize the data into an ascending ordered array. • Calculate the P percentile location: i= (n) • Measures of central tendency that divide a group of data into 100 parts • At least n% of the data lie below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile • Example: 90th percentile indicates that at least 90% of the data lie below it, and at most 10% of the data lie above it • The median and the 50th percentile are the same. • Applicable for ordinal, interval, and ratio data • Not applicable for nominal data Stat 130D, UCLA, Ivo Dinov. 1 n n 57 + 86 + 42 + 38 + 90 + 66 = 6 379 = 6 = 63.167 N N 24 + 13 + 19 + 26 + 11 = 5 93 = 5 = 18. 6 Stat 130D, UCLA, Ivo Dinov. ∑ X = X + X + X +... + X 100 • Determine the percentile’s location and its value. • If i is a whole number, the percentile is the average of the values at the i and (i+1) positions. • If i is not a whole number, the percentile is at the (i+1) position in the ordered array. Stat 130D, UCLA, Ivo Dinov. 11 Page 2 12 Percentiles: Example Quartiles • Raw Data: 14, 12, 19, 23, 5, 13, 28, 17 • Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28 • Location of 30 30th percentile: • Measures of central tendency that divide a group of data into four subgroups • Q1: 25% of the data set is below the first quartile • Q2: 50% of the data set is below the second quartile • Q3: 75% of the data set is below the third quartile • Q1 is equal to the 25th percentile • Q2 is located at 50th percentile and equals the median • Q3 is equal to the 75th percentile • Quartile values are not necessarily members of the data set i= (8) = 2.4 100 • The location index, i, is not a whole number; i+1 = 2.4+1=3.4; the whole number portion is 3; the 30th percentile is at the 3rd location of the array; the 30th percentile is 13. Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 13 Quartiles 25% 25% Quartiles: Example • Ordered array: 106, 109, 114, 116, 121, 122, 125, 129 109+114 25 • Q1 Q2 Q1 14 i= (8) = 2 100 Q1 = • Q2: i= 50 (8) = 4 100 Q2 = 116+121 = 1185 . 2 • Q3: i= 75 (8) = 6 100 Q3 = 122+125 = 1235 . 2 Q3 25% 25% Stat 130D, UCLA, Ivo Dinov. = 1115 . Stat 130D, UCLA, Ivo Dinov. 15 16 Variability Variability No Variability in Cash Flow 2 Mean Mean Variability Variability in Cash Flow Mean Mean No Variability Time Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 17 Page 3 18 Measures of Variability: Ungrouped Data Range • The difference between the largest and the smallest values in a set of data • Simple to compute 35 41 44 • Ignores all data points except 37 41 the44 two extremes 37 43 44 • Example: Range 39 43 = 44 Largest - Smallest = 40 43 44 48 - 35 = 13 • Measures of variability describe the spread or the dispersion of a set of data. • Common Measures of Variability – Range – Interquartile Range – Mean Absolute Deviation – Variance – Standard Deviation – Z scores – Coefficient of Variation 40 Stat 130D, UCLA, Ivo Dinov. 43 45 45 46 46 46 46 48 Stat 130D, UCLA, Ivo Dinov. 19 Interquartile Range 20 Deviation from the Mean • Data set: 5, 9, 16, 17, 18 • Mean: X 65 • Range of values between the first and third quartiles • Range of the “middle half” • Less influenced by extremes µ= ∑ = N 5 = 13 • Deviations from the mean: -8, -4, 3, 4, 5 Interquartile Range = Q 3 − Q1 -4 -8 0 +3 5 10 +4 15 +5 20 µ Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 21 Mean Absolute Deviation Population Variance • Average of the absolute deviations from the mean X X − µ 5 9 16 17 18 Stat 130D, UCLA, Ivo Dinov. -8 -4 +3 +4 +5 0 X − µ +8 +4 +3 +4 +5 24 M . A. D. = ∑ 22 • Average of the squared deviations from the arithmetic mean X X −µ N 5 9 16 17 18 24 = 5 = 4.8 Stat 130D, UCLA, Ivo Dinov. 23 Page 4 X − µ (X -8 -4 +3 +4 +5 0 −µ ) 64 16 9 16 25 130 2 ∑ (X − µ ) 2 σ 2 = N 130 = 5 = 2 6 .0 24 Population Standard Deviation • Square root of the variance 5 9 16 17 18 • Average of the squared deviations from the arithmetic mean ∑ (X − µ) 2 X − µ (X X Sample Variance ) σ −µ -8 -4 +3 +4 +5 0 2 2 = 130 = 5 = 2 6 .0 64 16 9 16 25 130 σ = = σ X − X (X X N 2,398 1,844 1,539 1,311 7,092 2 625 71 -234 -462 0 − X ) 2 S 390,625 5,041 54,756 213,444 663,866 2 = ∑ (X − X n−1 6 6 3 ,8 6 6 = 3 = 2 2 1 , 2 8 8 .6 7 ) 2 2 6 .0 = 5 .1 Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 25 Sample Standard Deviation • Square root of the sample variance X X − X (X 2,398 1,844 1,539 1,311 7,092 625 71 -234 -462 0 −X ) 2 390,625 5,041 54,756 213,444 663,866 S 2 = Uses of Standard Deviation ∑ (X − X n−1 6 6 3 ,8 6 6 = 3 = 2 2 1, 2 8 8 .6 7 S = = S 26 ) • Indicator of financial risk • Quality Control 2 – construction of quality control charts – process capability studies • Comparing populations – household incomes in two cities – employee absenteeism at two companies 2 2 2 1 , 2 8 8 .6 7 = 4 7 0 .4 1 Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 27 Standard Deviation as an Indicator of Financial Risk Empirical Rule • Data are normally distributed (or approximately normal) Annualized Rate of Return Financial Security Stat 130D, UCLA, Ivo Dinov. µ 28 σ Distance from the Mean A 15% 3% B 15% 7% µ ± 1σ µ ± 2σ µ ± 3σ Stat 130D, UCLA, Ivo Dinov. 29 Page 5 Percentage of Values Falling Within Distance 68 95 99.7 30 Chebyshev’s Theorem Chebyshev’s Theorem • Applies to all distributions P(µ − kσ < X < µ + kσ ) ≥ 1 − • Applies to all distributions Number of Standard Deviations 1 2 k Distance from the Mean µ ± 2σ K=2 for k > 1 µ ± 3σ K=3 µ ± 4σ K=4 Stat 130D, UCLA, Ivo Dinov. Coefficient of Variation 1-1/32 = 0.89 1-1/42 = 0.94 32 Coefficient of Variation µ = 29 • Ratio of the standard deviation to the mean, expressed as a percentage • Measurement of relative dispersion µ = 84 1 σ 1 = 4.6 σ σ C.V .= (100) µ CV . .=µ 1 1 σ (100) 2 2 = 10 σ CV . .=µ 2 2 1 4.6 (100) 29 = 1586 . (100) 2 10 (100) 84 . = 1190 = = Stat 130D, UCLA, Ivo Dinov. 33 34 Mean of Grouped Data Measures of Central Tendency and Variability: Grouped Data • Weighted average of class midpoints • Class frequencies are the weights • Measures of Central Tendency – Mean – Median – Mode µ = • Measures of Variability – Variance – Standard Deviation – Mean Absolute Deviation Stat 130D, UCLA, Ivo Dinov. 1-1/22 = 0.75 Stat 130D, UCLA, Ivo Dinov. 31 Stat 130D, UCLA, Ivo Dinov. Minimum Proportion of Values Falling Within Distance = = ∑ fM ∑ f ∑ fM N f 1M Stat 130D, UCLA, Ivo Dinov. 35 Page 6 1 + f 2M f1+ f 2 2 + f 3 M 3 + ⋅ ⋅ ⋅ + f iM + f 3 + ⋅ ⋅ ⋅ + fi i 36 Calculation of Grouped Mean Class Interval Frequency Class Midpoint 20-under 30 6 25 30-under 40 18 35 40-under 50 11 45 50-under 60 11 55 60-under 70 3 65 75 70-under 80 1 50 µ= ∑ fM ∑f = Median of Grouped Data fM 150 630 495 605 195 75 2150 N − cfp (W ) Median = L + 2 fmed Where: L = the lower limit of the median class cfp = cumulative frequency of class preceding the median class fmed = frequency of the median class W = width of the median class 2150 = 43. 0 50 N = total of frequencies Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 37 Median of Grouped Data -- Example Mode of Grouped Data • Midpoint of the modal class • Modal class has the greatest frequency N Cumulative − cfp Class Interval Frequency Frequency Md = L + 2 (W ) fmed 20-under 30 6 6 30-under 40 18 24 50 − 24 40-under 50 11 35 (10) = 40 + 2 50-under 60 11 46 11 60-under 70 3 49 = 40.909 70-under 80 1 50 N = 50 Stat 130D, UCLA, Ivo Dinov. Class Interval Frequency 20-under 30 6 30-under 40 18 40-under 50 11 50-under 60 11 60-under 70 3 70-under 80 1 Variance and Standard Deviation of Grouped Data 2 2 σ = σ 2 2 = S = ∑ (M − X ) 2 f n−1 S 40 2 f M fM 20-under 30 30-under 40 40-under 50 50-under 60 60-under 70 70-under 80 6 18 11 11 3 1 50 25 35 45 55 65 75 150 630 495 605 195 75 2150 2 = ∑ (M − µ) Page 7 (M -18 -8 2 12 22 32 2 f Stat 130D, UCLA, Ivo Dinov. 41 M−µ Class Interval σ Stat 130D, UCLA, Ivo Dinov. 30 + 40 = 35 2 Population Variance and Standard Deviation of Grouped Data Sample ∑ f ( M − µ) S σ = N Mode = Stat 130D, UCLA, Ivo Dinov. 39 Population 38 N = 7200 = 144 50 −µ) 2 f −µ) 2 1944 1152 44 1584 1452 1024 7200 324 64 4 144 484 1024 σ= (M σ 2 = 144 = 12 42 Measures of Shape Skewness • Skewness – Absence of symmetry – Extreme values in one side of a distribution • Kurtosis – – – – Peakedness of a distribution Leptokurtic: high and thin Mesokurtic: normal shape Platykurtic: flat and spread out Negatively (Left) Skewed • Box and Whisker Plots – Graphic display of a distribution – Reveals skewness Stat 130D, UCLA, Ivo Dinov. Positively (Right) Skewed Symmetric (Not Skewed) Stat 130D, UCLA, Ivo Dinov. 43 Skewness 44 Coefficient of Skewness • Summary measure for skewness Mean Median Mean Median Mode Negatively Skewed Symmetric (Not Skewed) Mode • If S < 0, the distribution is negatively skewed (skewed to the left). • If S = 0, the distribution is symmetric (not skewed). • If S > 0, the distribution is positively skewed (skewed to the right). Mean Mode Median Positively Skewed Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 45 Skewness & Kurtosis • What do we mean by symmetry and positive and negative skewness? Kurtosis? N Properties?!? N ( )3 ( Skewness = ∑Y k =1 k −Y ( N − 1) SD 3 ; Kurtosis = ∑Y k =1 k Kurtosis • Peakedness of a distribution −Y ) ( N − 1) SD 4 – Leptokurtic: high and thin – Mesokurtic: normal in shape – Platykurtic: flat and spread out 4 • Skewness in linearly invariant Sk(aX+b)=Sk(X) • Skewness is a measure of unsymmetry • Kurtosis is a measure of flatness • Both are use to quantify departures from StdNormal • Skewness(StdNorm)=0; Kurtosis(StdNorm)=3 Stat 130D, UCLA, Ivo Dinov. 46 Leptokurtic Mesokurtic Platykurtic Stat 130D, UCLA, Ivo Dinov. 47 Page 8 48 Box and Whisker Plot Box and Whisker Plot • Five secific values are used: – – – – – Median, Q2 First quartile, Q1 Third quartile, Q3 Minimum value in the data set Maximum value in the data set • Inner Fences – IQR = Q3 - Q1 – Lower inner fence = Q1 - 1.5 IQR – Upper inner fence = Q3 + 1.5 IQR Q1 Minimum Q3 Q2 Maximum • Outer Fences – Lower outer fence = Q1 - 3.0 IQR – Upper outer fence = Q3 + 3.0 IQR Stat 130D, UCLA, Ivo Dinov. Stat 130D, UCLA, Ivo Dinov. 49 Skewness: Box and Whisker Plots, and Coefficient of Skewness S<0 S=0 Pearson Product-Moment Correlation Coefficient S>0 SSXY r= ( SSX )( SSY ) ∑ ( X − X )(Y − Y ) ∑ ( X − X ) ∑ (Y − Y ) (∑ X )(∑ Y ) ∑ XY − n = 2 = Negatively Skewed Symmetric (Not Skewed) 50  ∑   Positively Skewed Stat 130D, UCLA, Ivo Dinov. 2 − n   (∑ Y )  2 ∑Y 2 − n   −1≤ r ≤ 1 52 Computation of r for the Economics Example (Part 1) Three Degrees of Correlation Day 1 2 3 4 5 6 7 8 9 10 11 12 Summations r>0 r=0 Stat 130D, UCLA, Ivo Dinov. (∑ X )   2 X Stat 130D, UCLA, Ivo Dinov. 51 r<0 2 Stat 130D, UCLA, Ivo Dinov. 53 Page 9 Interest X 7.43 7.48 8.00 7.75 7.60 7.63 7.68 7.67 7.59 8.07 8.03 8.00 92.93 Futures Index Y 221 222 226 225 224 223 223 226 226 235 233 241 2,725 X2 55.205 55.950 64.000 60.063 57.760 58.217 58.982 58.829 57.608 65.125 64.481 64.000 720.220 Y2 48,841 49,284 51,076 50,625 50,176 49,729 49,729 51,076 51,076 55,225 54,289 58,081 619,207 XY 1,642.03 1,660.56 1,808.00 1,743.75 1,702.40 1,701.49 1,712.64 1,733.42 1,715.34 1,896.45 1,870.99 1,928.00 21,115.07 54 Scatter Plot and Correlation Matrix for the Economics Example Computation of r for the Economics Example (Part 2) = ∑ XY −  ∑   (∑ X )(∑ Y ) (∑ X )   n 2 X 2 −   n 240 (∑ Y )  2 ∑Y (21,115.07 ) −  ( 720.22 ) −   245 Futures Index r= 2 − n   ( 92.93)( 2725) (92.93)  (619,207 ) − (2725)  12 230 225 220 7.40 12 2 235 7.60 7.80 2   12   Interest =.815 Interest Futures Index Stat 130D, UCLA, Ivo Dinov. 8.20 1 Futures Index 0.815254 0.815254 1 Stat 130D, UCLA, Ivo Dinov. 55 int main() { string str; cout << "Enter a candidate for palindrome test " << "\nfollowed by pressing return.\n"; getline(cin, str); if (isPal(str)) cout << "\"" << str + "\" is a palindrome "; else cout << "\"" << str + "\" is not a palindrome "; cout << endl; return 0; } void swap(char& lhs, char& rhs); //swaps char args corresponding to parameters lhs and rhs string reverse(const string& str); //returns a copy of arg corresponding to parameter //str with characters in reverse order. string removePunct(const string& src, const string& punct); //returns copy of string src with characters //in string punct removed string makeLower (const string& s); //returns a copy of parameter s that has all upper case //characters forced to lower case, other characters unchanged. //Uses <string>, which provides tolower void swap(char& lhs, char& rhs) { 57 char tmp = lhs; lhs = rhs; rhs = tmp; } // Palindrome Testing Program (3 of 5) // Palindrome Testing Program (4 of 5) string reverse(const string& str) { int start = 0; int end = str.length(); string tmp(str); //returns a copy of src with characters in punct removed string removePunct(const string& src, const string& punct) { string no_punct; int src_len = src.length(); int punct_len = punct.length(); for(int i = 0; i < src_len; i++) { string aChar = src.substr(i,1); int location = punct.find(aChar, 0); //find location of successive characters //of src in punct if (location < 0 || location >= punct_len) no_punct = no_punct + aChar; //aChar not in punct -- keep it } return no_punct; } while (start < end) { end--; swap(tmp[start], tmp[end]); start++; } return tmp; } //Returns arg that has all upper case characters forced to lower case, //other characters unchanged. makeLower uses <string>, which provides //tolower string makeLower(const string& s) //uses <cctype> { string temp(s); //This creates a working copy of s for (int i = 0; i < s.length(); i++) temp[i] = tolower(s[i]); return temp; } 56 // Palindrome Testing Program (2 of 5) // Palindrome Testing Program (1 of 5) // test for palindrome property – Cstrings vs Strings! #include <iostream> #include <string> #include <cctype> using namespace std; bool isPal(const string& this_String); //uses makeLower, removePunct. //if this_String is a palindrome, // return true; //else // return false; 8.00 Interest 59 58 60 Page 10 // Palindrome Testing Program (5 of 5) //uses functions makeLower, removePunct. //Returned value: //if this_String is a palindrome, // return true; //else // return false; bool isPal(const string& this_String) { string punctuation(",;:.?!'\" "); //includes a blank string str(this_String); str = makeLower(str); string lowerStr = removePunct(str, punctuation); return lowerStr == reverse(lowerStr); } 61 Page 11

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data statistics and some comments on cstrings (arrays of char)