Download Data statistics and some comments on cstrings (arrays of char)

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Measures of Central Tendency:
Ungrouped Data
Mode
• Measures of central tendency yield
information about “particular places or
locations in a group of numbers.”
• Common Measures of Location
• The most frequently occurring value in a
data set
• Applicable to all levels of data
measurement (nominal, ordinal, interval,
and ratio)
– Mode
– Median
– Mean
– Percentiles
– Quartiles
• Bimodal -- Data sets that have two modes
• Multimodal -- Data sets that contain more
than two modes
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
1
Mode -- Example
• The mode is 44.
• There are more 44s
than any other value.
Median
35
41
44
45
37
41
44
46
37
43
44
46
39
43
44
46
40
43
44
46
40
43
45
48
Stat 130D, UCLA, Ivo Dinov.
• Middle value in an ordered array of
numbers.
• Applicable for ordinal, interval, and ratio
data
• Not applicable for nominal data
• Unaffected by extremely large and
extremely small values.
Stat 130D, UCLA, Ivo Dinov.
3
4
Median: Example
with an Odd Number of Terms
Median: Computational Procedure
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• First Procedure
– Arrange the observations in an ordered array.
– If there is an odd number of terms, the median
is the middle term of the ordered array.
– If there is an even number of terms, the median
is the average of the middle two terms.
•
•
•
•
There are 17 terms in the ordered array.
Position of median = (n+1)/2 = (17+1)/2 = 9
The median is the 9th term, 15.
If the 22 is replaced by 100, the median is
15.
• If the 3 is replaced by -103, the median is
15.
• Second Procedure
– The median’s position in an ordered array is
given by (n+1)/2.
Stat 130D, UCLA, Ivo Dinov.
2
Stat 130D, UCLA, Ivo Dinov.
5
Page 1
6
Median: Example
with an Even Number of Terms
Arithmetic Mean
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21
•
•
•
•
•
Commonly called ‘the mean’
is the average of a group of numbers
Applicable for interval and ratio data
Not applicable for nominal or ordinal data
Affected by each value in the data set,
including extreme values
• Computed by summing all values in the
data set and dividing the sum by the number
of values in the data set
• There are 16 terms in the ordered array.
• Position of median = (n+1)/2 = (16+1)/2 = 8.5
• The median is between the 8th and 9th terms,
14.5.
• If the 21 is replaced by 100, the median is 14.5.
• If the 3 is replaced by -88, the median is 14.5.
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
7
8
Population Mean
µ=
∑ X = X + X + X +... + X
1
2
3
Sample Mean
X=
N
2
3
n
Stat 130D, UCLA, Ivo Dinov.
9
10
Percentiles: Computational Procedure
Percentiles
• Organize the data into an ascending ordered
array.
• Calculate the
P
percentile location:
i=
(n)
• Measures of central tendency that divide a group of
data into 100 parts
• At least n% of the data lie below the nth percentile,
and at most (100 - n)% of the data lie above the nth
percentile
• Example: 90th percentile indicates that at least
90% of the data lie below it, and at most 10% of
the data lie above it
• The median and the 50th percentile are the same.
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
Stat 130D, UCLA, Ivo Dinov.
1
n
n
57 + 86 + 42 + 38 + 90 + 66
=
6
379
=
6
= 63.167
N
N
24 + 13 + 19 + 26 + 11
=
5
93
=
5
= 18. 6
Stat 130D, UCLA, Ivo Dinov.
∑ X = X + X + X +... + X
100
• Determine the percentile’s location and its
value.
• If i is a whole number, the percentile is the
average of the values at the i and (i+1)
positions.
• If i is not a whole number, the percentile is at
the (i+1) position in the ordered array.
Stat 130D, UCLA, Ivo Dinov.
11
Page 2
12
Percentiles: Example
Quartiles
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of
30
30th percentile:
• Measures of central tendency that divide a group
of data into four subgroups
• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second
quartile
• Q3: 75% of the data set is below the third quartile
• Q1 is equal to the 25th percentile
• Q2 is located at 50th percentile and equals the
median
• Q3 is equal to the 75th percentile
• Quartile values are not necessarily members of the
data set
i=
(8) = 2.4
100
• The location index, i, is not a whole
number; i+1 = 2.4+1=3.4; the whole
number portion is 3; the 30th percentile is at
the 3rd location of the array; the 30th
percentile is 13.
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
13
Quartiles
25%
25%
Quartiles: Example
• Ordered array: 106, 109, 114, 116, 121, 122,
125, 129
109+114
25
• Q1
Q2
Q1
14
i=
(8) = 2
100
Q1 =
• Q2:
i=
50
(8) = 4
100
Q2 =
116+121
= 1185
.
2
• Q3:
i=
75
(8) = 6
100
Q3 =
122+125
= 1235
.
2
Q3
25%
25%
Stat 130D, UCLA, Ivo Dinov.
= 1115
.
Stat 130D, UCLA, Ivo Dinov.
15
16
Variability
Variability
No Variability in Cash Flow
2
Mean
Mean
Variability
Variability in Cash Flow
Mean
Mean
No Variability
Time
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
17
Page 3
18
Measures of Variability:
Ungrouped Data
Range
• The difference between the largest and the
smallest values in a set of data
• Simple to compute
35
41
44
• Ignores all data points except 37 41 the44
two extremes
37
43
44
• Example:
Range
39
43 = 44
Largest - Smallest
=
40
43
44
48 - 35 = 13
• Measures of variability describe the spread
or the dispersion of a set of data.
• Common Measures of Variability
– Range
– Interquartile Range
– Mean Absolute Deviation
– Variance
– Standard Deviation
– Z scores
– Coefficient of Variation
40
Stat 130D, UCLA, Ivo Dinov.
43
45
45
46
46
46
46
48
Stat 130D, UCLA, Ivo Dinov.
19
Interquartile Range
20
Deviation from the Mean
• Data set: 5, 9, 16, 17, 18
• Mean:
X 65
• Range of values between the first and third
quartiles
• Range of the “middle half”
• Less influenced by extremes
µ=
∑
=
N
5
= 13
• Deviations from the mean: -8, -4, 3, 4, 5
Interquartile Range = Q 3 − Q1
-4
-8
0
+3
5
10
+4
15
+5
20
µ
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
21
Mean Absolute Deviation
Population Variance
• Average of the absolute deviations from the
mean
X
X − µ
5
9
16
17
18
Stat 130D, UCLA, Ivo Dinov.
-8
-4
+3
+4
+5
0
X − µ
+8
+4
+3
+4
+5
24
M . A. D. =
∑
22
• Average of the squared deviations from the
arithmetic mean
X
X −µ
N
5
9
16
17
18
24
=
5
= 4.8
Stat 130D, UCLA, Ivo Dinov.
23
Page 4
X − µ (X
-8
-4
+3
+4
+5
0
−µ
)
64
16
9
16
25
130
2
∑ (X − µ )
2
σ
2
=
N
130
=
5
= 2 6 .0
24
Population Standard Deviation
• Square root of the
variance
5
9
16
17
18
• Average of the squared deviations from the
arithmetic mean
∑ (X − µ)
2
X − µ (X
X
Sample Variance
) σ
−µ
-8
-4
+3
+4
+5
0
2
2
=
130
=
5
= 2 6 .0
64
16
9
16
25
130
σ =
=
σ
X − X (X
X
N
2,398
1,844
1,539
1,311
7,092
2
625
71
-234
-462
0
− X
)
2
S
390,625
5,041
54,756
213,444
663,866
2
=
∑ (X − X
n−1
6 6 3 ,8 6 6
=
3
= 2 2 1 , 2 8 8 .6 7
)
2
2 6 .0
= 5 .1
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
25
Sample Standard Deviation
• Square root of the
sample variance
X
X − X (X
2,398
1,844
1,539
1,311
7,092
625
71
-234
-462
0
−X
)
2
390,625
5,041
54,756
213,444
663,866
S
2
=
Uses of Standard Deviation
∑ (X − X
n−1
6 6 3 ,8 6 6
=
3
= 2 2 1, 2 8 8 .6 7
S =
=
S
26
)
• Indicator of financial risk
• Quality Control
2
– construction of quality control charts
– process capability studies
• Comparing populations
– household incomes in two cities
– employee absenteeism at two companies
2
2 2 1 , 2 8 8 .6 7
= 4 7 0 .4 1
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
27
Standard Deviation as an
Indicator of Financial Risk
Empirical Rule
• Data are normally distributed (or approximately
normal)
Annualized Rate of Return
Financial
Security
Stat 130D, UCLA, Ivo Dinov.
µ
28
σ
Distance from
the Mean
A
15%
3%
B
15%
7%
µ ± 1σ
µ ± 2σ
µ ± 3σ
Stat 130D, UCLA, Ivo Dinov.
29
Page 5
Percentage of Values
Falling Within Distance
68
95
99.7
30
Chebyshev’s Theorem
Chebyshev’s Theorem
• Applies to all distributions
P(µ − kσ < X < µ + kσ ) ≥ 1 −
• Applies to all distributions
Number of
Standard
Deviations
1
2
k
Distance from
the Mean
µ ± 2σ
K=2
for k > 1
µ ± 3σ
K=3
µ ± 4σ
K=4
Stat 130D, UCLA, Ivo Dinov.
Coefficient of Variation
1-1/32 = 0.89
1-1/42 = 0.94
32
Coefficient of Variation
µ = 29
• Ratio of the standard deviation to the mean,
expressed as a percentage
• Measurement of relative dispersion
µ = 84
1
σ 1 = 4.6
σ
σ
C.V .= (100)
µ
CV
. .=µ
1
1
σ
(100)
2
2
= 10
σ
CV
. .=µ
2
2
1
4.6
(100)
29
= 1586
.
(100)
2
10
(100)
84
.
= 1190
=
=
Stat 130D, UCLA, Ivo Dinov.
33
34
Mean of Grouped Data
Measures of Central Tendency
and Variability: Grouped Data
• Weighted average of class midpoints
• Class frequencies are the weights
• Measures of Central Tendency
– Mean
– Median
– Mode
µ =
• Measures of Variability
– Variance
– Standard Deviation
– Mean Absolute Deviation
Stat 130D, UCLA, Ivo Dinov.
1-1/22 = 0.75
Stat 130D, UCLA, Ivo Dinov.
31
Stat 130D, UCLA, Ivo Dinov.
Minimum Proportion
of Values Falling
Within Distance
=
=
∑ fM
∑ f
∑ fM
N
f 1M
Stat 130D, UCLA, Ivo Dinov.
35
Page 6
1
+ f 2M
f1+ f
2
2
+ f 3 M 3 + ⋅ ⋅ ⋅ + f iM
+ f 3 + ⋅ ⋅ ⋅ + fi
i
36
Calculation of Grouped Mean
Class Interval Frequency Class Midpoint
20-under 30
6
25
30-under 40
18
35
40-under 50
11
45
50-under 60
11
55
60-under 70
3
65
75
70-under 80
1
50
µ=
∑ fM
∑f
=
Median of Grouped Data
fM
150
630
495
605
195
75
2150
N
− cfp
(W )
Median = L + 2
fmed
Where:
L = the lower limit of the median class
cfp = cumulative frequency of class preceding the median class
fmed = frequency of the median class
W = width of the median class
2150
= 43. 0
50
N = total of frequencies
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
37
Median of Grouped Data -- Example
Mode of Grouped Data
• Midpoint of the modal class
• Modal class has the greatest frequency
N
Cumulative
− cfp
Class Interval Frequency Frequency Md = L + 2
(W )
fmed
20-under 30
6
6
30-under 40
18
24
50
− 24
40-under 50
11
35
(10)
= 40 + 2
50-under 60
11
46
11
60-under 70
3
49
= 40.909
70-under 80
1
50
N = 50
Stat 130D, UCLA, Ivo Dinov.
Class Interval Frequency
20-under 30
6
30-under 40
18
40-under 50
11
50-under 60
11
60-under 70
3
70-under 80
1
Variance and Standard Deviation
of Grouped Data
2
2
σ =
σ
2
2
=
S =
∑
(M − X )
2
f
n−1
S
40
2
f
M
fM
20-under 30
30-under 40
40-under 50
50-under 60
60-under 70
70-under 80
6
18
11
11
3
1
50
25
35
45
55
65
75
150
630
495
605
195
75
2150
2
=
∑
(M − µ)
Page 7
(M
-18
-8
2
12
22
32
2
f
Stat 130D, UCLA, Ivo Dinov.
41
M−µ
Class Interval
σ
Stat 130D, UCLA, Ivo Dinov.
30 + 40
= 35
2
Population Variance and Standard
Deviation of Grouped Data
Sample
∑ f ( M − µ) S
σ =
N
Mode =
Stat 130D, UCLA, Ivo Dinov.
39
Population
38
N
=
7200
= 144
50
−µ)
2
f
−µ)
2
1944
1152
44
1584
1452
1024
7200
324
64
4
144
484
1024
σ=
(M
σ
2
= 144 = 12
42
Measures of Shape
Skewness
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
–
–
–
–
Peakedness of a distribution
Leptokurtic: high and thin
Mesokurtic: normal shape
Platykurtic: flat and spread out
Negatively
(Left)
Skewed
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness
Stat 130D, UCLA, Ivo Dinov.
Positively
(Right)
Skewed
Symmetric
(Not Skewed)
Stat 130D, UCLA, Ivo Dinov.
43
Skewness
44
Coefficient of Skewness
• Summary measure for skewness
Mean
Median
Mean
Median
Mode
Negatively
Skewed
Symmetric
(Not Skewed)
Mode
• If S < 0, the distribution is negatively skewed
(skewed to the left).
• If S = 0, the distribution is symmetric (not
skewed).
• If S > 0, the distribution is positively skewed
(skewed to the right).
Mean
Mode
Median
Positively
Skewed
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
45
Skewness & Kurtosis
• What do we mean by symmetry and positive
and negative skewness? Kurtosis?
N
Properties?!? N (
)3
(
Skewness =
∑Y
k =1
k
−Y
( N − 1) SD
3
;
Kurtosis =
∑Y
k =1
k
Kurtosis
• Peakedness of a distribution
−Y )
( N − 1) SD
4
– Leptokurtic: high and thin
– Mesokurtic: normal in shape
– Platykurtic: flat and spread out
4
• Skewness in linearly invariant
Sk(aX+b)=Sk(X)
• Skewness is a measure of unsymmetry
• Kurtosis is a measure of flatness
• Both are use to quantify departures from
StdNormal
• Skewness(StdNorm)=0; Kurtosis(StdNorm)=3
Stat 130D, UCLA, Ivo Dinov.
46
Leptokurtic
Mesokurtic
Platykurtic
Stat 130D, UCLA, Ivo Dinov.
47
Page 8
48
Box and Whisker Plot
Box and Whisker Plot
• Five secific values are used:
–
–
–
–
–
Median, Q2
First quartile, Q1
Third quartile, Q3
Minimum value in the data set
Maximum value in the data set
• Inner Fences
– IQR = Q3 - Q1
– Lower inner fence = Q1 - 1.5 IQR
– Upper inner fence = Q3 + 1.5 IQR
Q1
Minimum
Q3
Q2
Maximum
• Outer Fences
– Lower outer fence = Q1 - 3.0 IQR
– Upper outer fence = Q3 + 3.0 IQR
Stat 130D, UCLA, Ivo Dinov.
Stat 130D, UCLA, Ivo Dinov.
49
Skewness: Box and Whisker Plots,
and Coefficient of Skewness
S<0
S=0
Pearson Product-Moment
Correlation Coefficient
S>0
SSXY
r=
( SSX )( SSY )
∑ ( X − X )(Y − Y )
∑ ( X − X ) ∑ (Y − Y )
(∑ X )(∑ Y )
∑ XY −
n
=
2
=
Negatively
Skewed
Symmetric
(Not Skewed)
50

∑


Positively
Skewed
Stat 130D, UCLA, Ivo Dinov.
2
−
n


(∑ Y ) 
2
∑Y
2
−
n


−1≤ r ≤ 1
52
Computation of r for
the Economics Example (Part 1)
Three Degrees of Correlation
Day
1
2
3
4
5
6
7
8
9
10
11
12
Summations
r>0
r=0
Stat 130D, UCLA, Ivo Dinov.
(∑ X )  
2
X
Stat 130D, UCLA, Ivo Dinov.
51
r<0
2
Stat 130D, UCLA, Ivo Dinov.
53
Page 9
Interest
X
7.43
7.48
8.00
7.75
7.60
7.63
7.68
7.67
7.59
8.07
8.03
8.00
92.93
Futures
Index
Y
221
222
226
225
224
223
223
226
226
235
233
241
2,725
X2
55.205
55.950
64.000
60.063
57.760
58.217
58.982
58.829
57.608
65.125
64.481
64.000
720.220
Y2
48,841
49,284
51,076
50,625
50,176
49,729
49,729
51,076
51,076
55,225
54,289
58,081
619,207
XY
1,642.03
1,660.56
1,808.00
1,743.75
1,702.40
1,701.49
1,712.64
1,733.42
1,715.34
1,896.45
1,870.99
1,928.00
21,115.07
54
Scatter Plot and Correlation Matrix
for the Economics Example
Computation of r
for the Economics Example (Part 2)
=
∑ XY −

∑


(∑ X )(∑ Y )
(∑ X )  
n
2
X
2
−


n
240
(∑ Y ) 
2
∑Y
(21,115.07 ) −

( 720.22 ) −


245
Futures Index
r=
2
−
n


( 92.93)( 2725)
(92.93)  (619,207 ) − (2725) 
12
230
225
220
7.40
12
2
235
7.60
7.80
2


12


Interest
=.815
Interest
Futures Index
Stat 130D, UCLA, Ivo Dinov.
8.20
1
Futures Index
0.815254
0.815254
1
Stat 130D, UCLA, Ivo Dinov.
55
int main()
{
string str;
cout << "Enter a candidate for palindrome test "
<< "\nfollowed by pressing return.\n";
getline(cin, str);
if (isPal(str))
cout << "\"" << str + "\" is a palindrome ";
else
cout << "\"" << str + "\" is not a palindrome ";
cout << endl;
return 0;
}
void swap(char& lhs, char& rhs);
//swaps char args corresponding to parameters lhs and rhs
string reverse(const string& str);
//returns a copy of arg corresponding to parameter
//str with characters in reverse order.
string removePunct(const string& src,
const string& punct);
//returns copy of string src with characters
//in string punct removed
string makeLower (const string& s);
//returns a copy of parameter s that has all upper case
//characters forced to lower case, other characters unchanged.
//Uses <string>, which provides tolower
void swap(char& lhs, char& rhs)
{
57
char tmp = lhs;
lhs = rhs;
rhs = tmp;
}
// Palindrome Testing Program (3 of 5)
// Palindrome Testing Program (4 of 5)
string reverse(const string& str)
{
int start = 0;
int end = str.length();
string tmp(str);
//returns a copy of src with characters in punct removed
string removePunct(const string& src, const string& punct)
{
string no_punct;
int src_len = src.length();
int punct_len = punct.length();
for(int i = 0; i < src_len; i++)
{
string aChar = src.substr(i,1);
int location = punct.find(aChar, 0);
//find location of successive characters
//of src in punct
if (location < 0 || location >= punct_len)
no_punct = no_punct + aChar; //aChar not in punct -- keep it
}
return no_punct;
}
while (start < end)
{
end--;
swap(tmp[start], tmp[end]);
start++;
}
return tmp;
}
//Returns arg that has all upper case characters forced to lower case,
//other characters unchanged. makeLower uses <string>, which provides
//tolower
string makeLower(const string& s) //uses <cctype>
{
string temp(s); //This creates a working copy of s
for (int i = 0; i < s.length(); i++)
temp[i] = tolower(s[i]);
return temp;
}
56
// Palindrome Testing Program (2 of 5)
// Palindrome Testing Program (1 of 5)
// test for palindrome property – Cstrings vs Strings!
#include <iostream>
#include <string>
#include <cctype>
using namespace std;
bool isPal(const string& this_String);
//uses makeLower, removePunct.
//if this_String is a palindrome,
// return true;
//else
// return false;
8.00
Interest
59
58
60
Page 10
// Palindrome Testing Program (5 of 5)
//uses functions makeLower, removePunct.
//Returned value:
//if this_String is a palindrome,
// return true;
//else
// return false;
bool isPal(const string& this_String)
{
string punctuation(",;:.?!'\" "); //includes a blank
string str(this_String);
str = makeLower(str);
string lowerStr = removePunct(str, punctuation);
return lowerStr == reverse(lowerStr);
}
61
Page 11
Related documents