Action Research
Measurement Scales and Descriptive Statistics
INFO 515
Glenn Booker
INFO 515
Lecture #2
1

Measurement Needs
• Need a long set of measurements for one project, and/or many projects, to examine statistical trends
• Could use measurements to test specific hypotheses
• Other realistic uses of measurement are to help make decisions and track progress
• Need scales to make measurements!

Measurement Scales
• There are four types of measurement scales:
    - Nominal
    - Ordinal
    - Interval
    - Ratio
• Completely optional mnemonic: to remember the sequence, I think of ‘NOIR’, as in the expression ‘film noir’ (‘noir’ is French for ‘black’)

Nominal Scale
• A nominal (“name”) scale groups or classifies things into categories, which:
    - Must be jointly exhaustive (cover everything)
    - Must be mutually exclusive (one thing can’t be in two categories at once)
    - Can be listed in any sequence (none is better or worse)
• So a nominal variable puts things into buckets which have no inherent order to them

Nominal Scale
• Examples include:
    - Gender (though some would dispute the limitation to only male/female categories)
    - The Dewey decimal system
    - The Library of Congress system
    - Academic majors
    - Makes of stuff (cars, computers, etc.)
    - Parts of a system

Ordinal Scale
• This scale ranks things in order
• Sequence is important, but the intervals between ranks are not defined numerically
• Rank is relative, such as “greater than” or “less than”
• E.g. letter grades, urgency of problems, class rank, inspection ratings
• So now the buckets we’re using have some sense of order or direction

Interval Scale
• An interval scale measures quantitative differences, not just relative rank
• Addition and subtraction are allowed
• E.g. common temperature scales (°F or °C), a single date (Feb 15, 1999), maybe IQ scores
    - Let me know if you find any more examples
• A zero point, if any, is arbitrary (90 °F is *not* six times hotter than 15 °F!)

Ratio Scale
• A ratio scale is an interval scale with a non-arbitrary zero point
• Allows division and multiplication
• The “best” type of scale to use, if possible
• E.g. defect rates for software, test scores, absolute temperature (Kelvin or Rankine), the number or count of almost anything, size, speed, length, …

Summary of Scales
• Nominal
    - Names different categories, not ordered, not ranked: Male, Female, Republican, Catholic, ...
• Ordinal
    - Categories are ordered: Low, High; Sometimes, Never
• Interval
    - Fixed intervals, no absolute zero: IQ, Temperature
• Ratio
    - Fixed intervals with an absolute zero point: Age, Income, Years of Schooling, Hours/Week, Weight
• Age could be measured as ratio (years), ordinal (young, middle, old), or nominal (baby boomer, gen X)
• Scale of measurement affects (may determine) the type of statistics that you can use to analyze the data

Scale Hierarchy
• Measurement scales are hierarchical: ratio (best) / interval / ordinal / nominal
• Lower level scales can always be derived from data which uses a higher scale
• E.g. defect rates (a ratio scale) could be converted to {High, Medium, Low} or {Acceptable, Not Acceptable} (ordinal scales)

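Deriving a lower-level scale from a higher one can be sketched in a few lines of Python. This is only an illustration: the function name and the cutoff values (1.0 and 5.0 defects per KSLOC) are made up, not taken from any standard.

```python
def to_ordinal(defects_per_ksloc, low_cutoff=1.0, high_cutoff=5.0):
    """Map a ratio-scale defect rate onto an ordinal {Low, Medium, High} scale.

    The cutoffs are hypothetical; pick ones that suit your own project.
    """
    if defects_per_ksloc < low_cutoff:
        return "Low"
    if defects_per_ksloc < high_cutoff:
        return "Medium"
    return "High"


rates = [0.4, 2.5, 7.1]                  # made-up defect rates (ratio scale)
labels = [to_ordinal(r) for r in rates]  # same data on an ordinal scale
print(labels)                            # ['Low', 'Medium', 'High']
```

Note that the conversion only goes one way: you cannot recover 2.5 from "Medium".
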
Reexamine Central Tendencies
• If data are nominal, only the mode is meaningful
• If data are ordinal, both median and mode may be used
• If data are ratio or interval (called “scale” in SPSS), you may use mean, median, and mode

Reexamine Variables
• Discrete variables use counting units or specific categories
    - Example: makes of cars, grades, …
    - Use Nominal or Ordinal scales
• Continuous variables use integer or real measurements
    - Example: IQ test scores, length of a table, your weight, etc.
    - Use Ratio or Interval scales

Refine Research Types
• Qualitative Research tends to use Nominal and/or Ordinal scale variables
• Quantitative Research tends to use Interval and/or Ratio scale variables

Frequency Distributions
• Frequency distributions describe how many times each value occurs in a data set
• They are useful for understanding the characteristics of a data set
• Frequencies are the count of how many times each possible value appears for a variable (e.g. gender = male, or operating system = Windows 2000)

Frequency Distributions
• They are most useful when there is a fixed and relatively small number of options for that variable
• They’re harder to use for variables which are numbers (either real or integer) unless there are only a few specific options allowed (e.g. test responses 1 to 5 for a multiple choice question)

Generating Frequency Distributions
• Select the command Analyze / Descriptive Statistics / Frequencies…
• Select one or more “Variable(s):”
• Note that the Frequency (count) and percent are included by default; other outputs may be selected under the “Statistics...” button
    - A bar chart can be generated as well using the “Charts…” button; another way is shown later

Sample Frequency Output
[SPSS frequency table with columns Frequency, Percent, Valid Percent, and Cumulative Percent; the cell values did not survive in this transcript]

Analysis of Frequency Output
• The first, unlabeled column has the values of the data; here, it first lists all Valid values (there are no Invalid ones, or it would show those too)
• The Frequency column is how many times that value appears in the data set
• The Percent column is the percent of cases with that value; in the fourth row, the value 15 appears 116 times, which is 24.5% of the 474 total cases (116/474*100 = 24.5%)

Analysis of Frequency Output
• The Valid Percent column divides each Frequency by the total number of Valid cases (= Percent column if all cases are valid)
• The Cumulative Percent adds up the Valid Percent values going down the rows; so the first entry is the Valid Percent for the first row, the second entry is 11.2 + 40.1 = 51.3%, the next is 51.3 + 1.3 = 52.5% (rather than 52.6%, because of round-off error), and so on

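The four columns described on the last two slides can be reproduced outside SPSS. Here is a minimal Python sketch, assuming missing values are marked with None; the function name and the survey responses are made up for illustration.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, frequency, percent, valid percent, cumulative percent).

    None marks a missing value: it counts toward Percent but not Valid Percent.
    """
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows, cumulative = [], 0.0
    for value in sorted(counts):
        freq = counts[value]
        percent = 100 * freq / len(values)       # divides by all cases
        valid_percent = 100 * freq / len(valid)  # divides by valid cases only
        cumulative += valid_percent              # running total down the rows
        rows.append((value, freq, percent, valid_percent, cumulative))
    return rows


responses = [1, 2, 2, 3, 3, 3, None, 5, 5, 5]   # made-up answers, one missing
table = frequency_table(responses)
for row in table:
    print("%s  %d  %.1f  %.1f  %.1f" % row)
```

Because the running total is kept at full precision and only rounded when printed, it avoids the kind of round-off discrepancy noted above.
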
Generating Frequency Graphs
• Frequency is often shown using a bar graph
• Bar graphs help make small amounts of data more visible
• To generate a frequency graph alone:
    - Click on the Charts menu and select “Bar…”
    - Leave the “Simple” graph selected, and leave “Summaries are for groups of cases” selected; click the “Define” button

Generating Frequency Graphs
    - Let the Bars Represent remain “N of cases”
    - Click on variable “Educational Level (years)” and move it into the Category Axis field
    - Click “OK”
    - You should get the graph on the next slide. Notice that the text below the X axis is the Label for the Category Axis.

Sample Frequency Output
Notice that the exact same graph can be generated from Frequencies, or just as a bar graph

Frequency Distributions
• A frequency distribution is a tabulation that indicates the number of times a score or group of scores occurs
• Bar charts are best used to graph the frequency of nominal & ordinal data
• Histograms are best used to display the shape of interval & ratio data

Frequency Distribution Example
[Bar chart: Frequency (0 to 400) by Employment Category (Clerical, Custodial, Manager)]

Employment Category
(Valid)      Frequency   Percent   Valid Percent   Cumulative Percent
Clerical           363      76.6            76.6                 76.6
Custodial           27       5.7             5.7                 82.3
Manager             84      17.7            17.7                100.0
Total              474     100.0           100.0
SPSS for Windows, Student Version

Basic Measures - Ratio
• Used for two exclusive populations (every case fits into one OR the other)
• Ratio = (# of testers) / (# of developers)
• E.g. tester to developer ratio is 1:4

Proportions and Fractions
• Used for multiple (> 2) populations
• Proportion = (Number of this population) / (Total number of all populations)
• Sum of all proportions equals unity (one)
    - E.g. survey results
• Proportions are based on integer units
• Fractions are based on real numbered units

Percentage
• A proportion or fraction multiplied by 100 becomes a percentage
• Only report percentages when N (total population measured) is above ~30 to 50; and always provide N for completeness
    - Why? Otherwise a percentage will imply more accuracy than the data supports
    - If 2 out of 3 people like something, it’s misleading to report that 66.667% favor it

Percents
• Percent = the percentage of cases having a particular value.
• Raw percent = divide the frequency of the value by the total number of cases (including missing values)
• Valid percent = calculated as above but excluding missing values

Percent Change
• The percent increase in a measurement is the new value, minus the old one, divided by the old value, times 100; negative means decrease:
    % increase = (new - old) / old * 100
• The percent change is the absolute value of the percent increase or decrease:
    % change = | % increase |

Percent Increase
• Percent increase = (Later Value - Earlier Value) / Earlier Value
• So if a collection goes from 50,000 volumes in 1965 to 150,000 in 1975, the percent increase is:
    (150,000 - 50,000) / 50,000 = 2 = 200%
• Always divide by where you started
Carpenter and Vasu, (1978)

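The library collection example can be checked with a short Python sketch. The function name is my own; the figures are the ones from the slide.

```python
def percent_increase(old, new):
    """Signed percent increase; a negative result means a decrease."""
    return (new - old) / old * 100


growth = percent_increase(50_000, 150_000)
print(growth)   # 200.0

# Percent change is the absolute value of the signed figure.
shrink = percent_increase(150_000, 50_000)
print(abs(shrink))
```

Note that going from 150,000 back down to 50,000 is not a 200% decrease, because the starting value you divide by is different.
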
Percentiles
• A percentile is the point in a distribution at or below which a given percentage of scores fall.
• The median is the 50th percentile
• Think of the SAT scores: what percentile you were for verbal, math, etc. means what percent of people scored at or below your score

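A percentile rank under the "at or below" definition can be sketched in Python as follows. The scores are made up, and statistics packages use several slightly different percentile conventions, so treat this as one possible reading.

```python
def percentile_rank(scores, score):
    """Percent of scores that are at or below the given score."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)


# Made-up test scores, not real SAT data
scores = [420, 480, 510, 550, 600, 640, 700, 720, 760, 800]
rank = percentile_rank(scores, 600)
print(rank)   # 50.0
```
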
Rate
• Rate conveys the change in a measurement, such as over time, dx/dt.
    Rate = (# observed events) / (# of opportunities) * constant
• Rate requires exposure to the risk being measured
• E.g. defects per KSLOC (1000 lines of code) = (# defects) / (# of lines of code) * 1000

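Here is the defects-per-KSLOC calculation as a small Python sketch; the function name and the project figures are made up. The constant 1000 scales a per-line rate up to a per-thousand-lines rate.

```python
def defects_per_ksloc(defects, sloc):
    """Defect rate per 1000 source lines of code (SLOC)."""
    return defects * 1000 / sloc


# Hypothetical project: 30 defects found in 12,000 lines of code
rate = defects_per_ksloc(defects=30, sloc=12_000)
print(rate)   # 2.5
```
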
Exponential Notation
• You might see output of the form +2.78E-12
    - The ‘E’ means ‘times ten to the power of’
    - This is +2.78 × 10⁻¹² (+2.78*10**-12)
• A negative exponent, e.g. -12, makes it a very small number
    - 10⁻¹² = 0.000000000001
    - 10⁺¹² = 1,000,000,000,000
• The leading number, here +2.78, controls whether it is a positive or negative number

Exponential Notation
[Number line running from positive, through 0, to negative]
• +5*10**+12 (a positive number >> 1)
• +5*10**-12 (a positive number << 1)
• 0
• -5*10**-12 (a negative number with magnitude << 1)
• -5*10**+12 (a negative number with magnitude >> 1)

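Most programming languages read and write the same E notation. A quick Python sketch, using arbitrary example values:

```python
# Reading E notation: the string form SPSS might print
small = float("+2.78E-12")
print(0 < small < 1)        # True: positive, but very close to zero

# Writing E notation: two decimals in the leading number
formatted = f"{123456.0:.2E}"
print(formatted)            # 1.23E+05

# The sign of the leading number is independent of the exponent's sign
tiny_negative = f"{-0.000005:.1E}"
print(tiny_negative)        # -5.0E-06
```
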
Precision
• Keep your final output to a consistent level of precision (significant digits)
    - Don’t report one value as “12” and another as “11.86257523454574123”
• Pick a level of precision to match the accuracy of your inputs (or one digit more), and make sure everything is reported that way consistently (e.g. 12.0 and 11.9)

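In code, the simplest way to stay consistent is to apply one format to every reported value. A minimal Python sketch using the two numbers from the slide; one decimal place is chosen only to match the slide's example.

```python
values = [12, 11.86257523454574123]

# Apply the same precision to every value before reporting it
report = [f"{v:.1f}" for v in values]
print(report)   # ['12.0', '11.9']
```
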
Data Analysis
• Raw data is collected, such as the dates a particular problem was reported and closed
• Refined data is extracted from raw data, e.g. the time it took a problem to be resolved
• Derived data is produced by analyzing refined data, such as the average time to resolve problems

Descriptive Statistics
• Descriptive statistics describe the key characteristics of one set of data (univariate)
    - Mean, median, mode, range (see also last week)
    - Standard deviation, variance
    - Skewness
    - Kurtosis
    - Coefficient of variation

Mean
• A.k.a.: Average Score
• The mean is the arithmetic average of the scores in a distribution
    - Add all of the scores
    - Divide by the total number of scores
• The mean is greatly influenced by extreme scores; they pull it off center

Mean Calculation
HOLDINGS IN 7 DIFFERENT LIBRARIES
X
7400
6500
6200
5900
5100
4300
3800
ΣX = 39200
Mean = ΣX / N = 39200 / 7 = 5600
Here, sum every data value

Mean with a Frequency Distribution
X (IQ)    F = Freq    FX = F*X
140       2           280
135       1           135
132       2           264
130       1           130
128       1           128
126       1           126
125       4           500
123       1           123
120       4           480
110       3           330
101       1           101
Total     21          2597
N = ΣF = 21
Mean = ΣFX / N = 2597 / 21 = 123.67 = 124 (round off)

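The IQ table can be verified with a short Python sketch. The dictionary maps each score X to its frequency F, using the values from the slide; the variable names are my own.

```python
# Score (X) -> frequency (F), from the IQ table
iq_freq = {140: 2, 135: 1, 132: 2, 130: 1, 128: 1, 126: 1,
           125: 4, 123: 1, 120: 4, 110: 3, 101: 1}

n = sum(iq_freq.values())                        # N = sum of F
sum_fx = sum(f * x for x, f in iq_freq.items())  # sum of F*X
mean_iq = sum_fx / n

print(n, sum_fx, round(mean_iq, 2))   # 21 2597 123.67
```
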
Central Tendency Example
Staff Salaries
$4100
6000
6000
6000
8000
9000
10000
11000
20000
Mode = $6000
Median = (9 + 1) / 2 = 5th value = $8000
Mean = ΣX / N = 80100 / 9 = $8900
Carpenter and Vasu, (1978)

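Python's standard statistics module gives the same three answers for the salary data; this is just a cross-check of the slide, not how the original figures were produced.

```python
import statistics

salaries = [4100, 6000, 6000, 6000, 8000, 9000, 10000, 11000, 20000]

mode_salary = statistics.mode(salaries)      # most frequent value
median_salary = statistics.median(salaries)  # middle (5th) value of 9
mean_salary = statistics.mean(salaries)      # sum / N

print(mode_salary, median_salary, mean_salary)
```

The single $20,000 salary pulls the mean above the median, which is the effect the next slide discusses.
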
Handling Extreme Values
• In cases where you have an extreme value (high or low) in a distribution, it is helpful to report both the median and the mean
• Reporting both values gives some indication (through comparison) of a skewed distribution

Measures of Variation
• Measures which indicate the variation, or spread of scores in a distribution
    - Range (see last week)
    - Variance
    - Standard Deviation

Standard Deviation, Variance
• Standard deviation measures how much the data typically differ from the mean (average)
    SD = sqrt( Σ (Xi - Mean)**2 / (N-1) )
    SD = sqrt( Variance )
• Variance is the standard deviation squared
    Variance = Σ (Xi - Mean)**2 / (N-1)
[per ISO 3534-1, para 2.33 and 2.34]

Standard Deviation
• The standard deviation is the square root of the variance. It is expressed in the same units as the original data.
• Since the variance is expressed in “squared units” it doesn’t make much practical sense. For example, what are “squared books” or “squared man-hours?”

Computing the Variance
    S² = Σ (X - Mean)² / N
1. Subtract the mean from each score
2. Square the result
3. Sum the squares for all data points
4. Divide by the N of cases

Divide by N or N-1???
• You’ll see different formulas for variance and standard deviation: some divide by N, some by N-1 (e.g. the “Standard Deviation, Variance” and “Computing the Variance” slides); why?
    - If your data covers the entire population (you have all of the possible data to analyze), then divide by N
    - If your data covers a sample from the population, divide by N-1

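Python's statistics module offers both versions, which makes the difference easy to see. A small sketch using the library holdings from the “Mean Calculation” slide; whether those seven libraries are a population or a sample is up to the analyst, so both results are shown.

```python
import statistics

holdings = [7400, 6500, 6200, 5900, 5100, 4300, 3800]

pop_var = statistics.pvariance(holdings)   # population: divide by N
samp_var = statistics.variance(holdings)   # sample: divide by N-1

pop_sd = statistics.pstdev(holdings)       # square root of pop_var
samp_sd = statistics.stdev(holdings)       # square root of samp_var

print(round(pop_sd, 1), round(samp_sd, 1))
```

Dividing by N-1 always gives the larger figure, and the gap shrinks as N grows.
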
Standard Deviation for Freq Dist.
Standard Deviation of Bookmobile Distribution
X       F       FX      X²      FX²
17      2       34      289     578
16      4       64      256     1024
14      5       70      196     980
10      2       20      100     200
9       3       27      81      243
6       1       6       36      36
Total           221             3061
σ = √( (ΣFX² - (ΣFX)²/N) / N )
  = √( (3061 - (221)²/17) / 17 )
  = √( (3061 - 2873) / 17 )
  = 3.3
Notice that FX² is F*(X²), not (F*X)²

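The same shortcut formula can be run in Python on the bookmobile table. The dictionary maps stops X to frequency F; the variable names are mine. The N-1 version is included only for comparison.

```python
import math

# Number of stops (X) -> number of bookmobiles (F)
stops = {17: 2, 16: 4, 14: 5, 10: 2, 9: 3, 6: 1}

n = sum(stops.values())                              # 17
sum_fx = sum(f * x for x, f in stops.items())        # 221
sum_fx2 = sum(f * x ** 2 for x, f in stops.items())  # 3061, F*(X**2)

sum_sq_dev = sum_fx2 - sum_fx ** 2 / n               # 3061 - 2873 = 188
sigma = math.sqrt(sum_sq_dev / n)                    # population form (divide by N)
sample_sd = math.sqrt(sum_sq_dev / (n - 1))          # sample form (divide by N-1)

print(round(sigma, 2), round(sample_sd, 2))          # 3.33 3.43
```
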
Std Dev Reflects Consistency
Distance from Target     Frequency
(in meters)              Battery A     Battery B
 200                     2             0
 150                     4             1
 100                     5             5
  50                     7             10
   0                     9             13
 -50                     7             10
-100                     5             5
-150                     4             1
-200                     2             0
Battery A: Mean = 0, Standard D. = 102.74
Battery B: Mean = 0, Standard D. = 65.83
Runyon and Haber (1984)

Standard Deviation vs. Std. Error
• To be precise, the standard error is the standard deviation of a statistic used to estimate a population parameter [per ISO 3534-1, para 2.56 and 2.50]
• So standard error pertains to sample data, while standard deviation should describe the entire population
• We often use them interchangeably

Skewness
• Skewness is a measure of the asymmetry of a distribution.
    - The normal distribution is symmetric, and has a skewness value of zero.
• A distribution with a significant positive skewness has a long right tail
    - Positive skewness means the mean and median are more positive than the mode (the peak of the distribution)
• Negative skewness has a long left tail.

Skewness
• As a rough guide, a skewness magnitude of more than two (> 2 or < -2) is taken to indicate a significant departure from symmetry
[Figure: two curves, one with positive skewness and one with negative skewness. Both curves have the same mean and standard deviation.]
From www.riskglossary.com

Kurtosis
• Kurtosis is a measure of the extent to which data clusters around a central point
    - For a normal distribution, the value of the kurtosis is 3
• The kurtosis excess (= kurtosis - 3) is zero for a normal distribution
    - Positive kurtosis excess indicates that the data have longer tails than “normal”
    - Negative kurtosis excess indicates the data have shorter tails

Kurtosis
[Figure: a platykurtic curve (left) and a leptokurtic curve (right), with the tail labeled]
The curve on the right has higher kurtosis than the curve on the left. It is more peaked at the center, and it has fatter tails. If a distribution’s kurtosis is greater than 3, it is said to be leptokurtic (sharp peak). If its kurtosis is less than 3, it is said to be platykurtic (flat peak). They might have equal standard deviation. Mesokurtic is the “normal” curve, which has kurtosis = 3.
From www.riskglossary.com

Skewness & Kurtosis Example
• From the Employee data set, use Analyze / Descriptive Statistics / Descriptives, and select the ‘salary’ variable
    - Under Options…, select Skewness and Kurtosis
• Skewness is 2.125, so there is significant positive skewness to the data
• Kurtosis is 5.378, so the data is leptokurtic

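To see where such numbers come from, here is a minimal Python sketch of skewness and kurtosis computed from the central moments of the data. The function name and sample data are made up. This plain moment-based form follows the "normal = 3" convention from the Kurtosis slide; SPSS applies sample-size corrections (and, as far as I know, reports kurtosis with 3 already subtracted), so its output will not match these values exactly.

```python
def skewness_and_kurtosis(data):
    """Moment-based skewness and kurtosis (a normal curve gives 0 and 3)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # 2nd central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # 4th central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Made-up data with one large value, giving a long right tail
skew, kurt = skewness_and_kurtosis([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])
print(skew > 0)   # True: positive skewness

# Symmetric data has zero skewness
sym_skew, sym_kurt = skewness_and_kurtosis([1, 2, 3, 4, 5])
```
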
Coefficient of Variation
• The coefficient of variation (CV) is the ratio of the standard deviation to the mean:
    CV = σ/μ
  [per ISO 3534-1, para 2.35]
• A smaller CV means the mean is more representative of the total distribution
• Can compare means and standard deviations of two different populations
    - Higher CV means more variability

Coefficient of Variation
• Divide the standard deviation by the mean to get CV. CV = σ/μ
• The smaller the decimal fraction this produces, the more representative is the mean for the total distribution
• The larger the decimal fraction, the worse job the mean does of giving us a true picture of the distribution

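A short Python sketch of the CV, using the library holdings from the “Mean Calculation” slide. The function name is my own, and I have assumed the sample (N-1) standard deviation; use statistics.pstdev instead if the data is the whole population.

```python
import statistics


def coefficient_of_variation(data):
    """Standard deviation divided by the mean (sample SD assumed here)."""
    return statistics.stdev(data) / statistics.mean(data)


holdings = [7400, 6500, 6200, 5900, 5100, 4300, 3800]
cv = coefficient_of_variation(holdings)
print(round(cv, 3))   # 0.227
```

Because the units cancel, the CV lets you compare the spread of data sets measured on very different scales.
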
Generating a Histogram
• Frequency graphs can be generated for variables which have many integer or real values (e.g. salary), by using a histogram
• A histogram shows how many data points fall into various ranges of values
• The closest “normal” curve can be shown for comparison

Generating a Histogram
• The “¾ rule” is helpful for histograms
    - The tallest bar should be ¾ of the height of the Y axis
    - See How to Lie with Statistics, by Darrell Huff
• Be sure to label X and Y axes appropriately
• Each bar shows how many data points fall within a range of X axis values

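What a histogram does behind the scenes is count the data points in each range. A minimal Python sketch, using the bookmobile stop counts that appear later in these slides; the function name, the starting value of 5, and the bar width of 2.5 are my own choices, not necessarily the ones SPSS would pick.

```python
def histogram_counts(data, start, width, bins):
    """Count how many data points fall in each range [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < bins:       # ignore values outside the plotted range
            counts[index] += 1
    return counts


stops = [6, 9, 10, 14, 16, 17, 14, 16, 14, 10, 9, 14, 14, 16, 9, 17, 16]
counts = histogram_counts(stops, start=5, width=2.5, bins=6)
print(counts)   # [1, 3, 2, 5, 6, 0]
```

Changing the bar width changes the picture, which is one reason to be deliberate about it.
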
Histogram of Salary
[Histogram: Frequency by CURRENT SALARY. Std. Dev = 6830.26, Mean = 13767.8, N = 474.00]

Another Note on Histograms
• SPSS will define its own bar widths for a histogram, e.g. how wide the range of salary values is for each bar
• Later in the course, we’ll look at how you can define your own variables to make predefined histogram bars

Pie Chart Histogram
• A histogram can also be made in the shape of a pie
• This should be limited to variables with a small number of possible values

A *bad* pie chart histogram
[Pie chart of CURRENT SALARY with a separate slice for each individual salary value (9180, 9240, 9300, ..., 15660)]
(I had to include this one just because it’s colorful)

This is a better example:
[Pie chart of EDUCATIONAL LEVEL with slices for the values 8, 12, 14, 15, 16, 17, 18, 19, 20, and 21]
This visually implies the percentages of data in each value.

Bookmobile Data
Case/Bookmobile    Value of Var. (No. of Stops)
A                  6
B                  9
C                  10
D                  14
E                  16
F                  17
G                  14
H                  16
I                  14
J                  10
K                  9
L                  14
M                  14
N                  16
O                  9
P                  17
Q                  16

X (No. of Stops)   F (No. of Bookmobiles)
17                 2
16                 4
14                 5
10                 2
9                  3
6                  1
N = 17
Bookmobile examples taken from Carpenter and Vasu, (1978)
Same data as used on the “Standard Deviation for Freq Dist.” and “Bookmobile Distributions” slides.

Bookmobile Distributions
Stops    f    %       CF (adding up)    CF (adding down)    C% (counting down)
17       2    11.8    17                2                   100
16       4    23.5    15                6                   88
14       5    29.4    11                11                  64
10       2    11.8    6                 13                  35
9        3    17.6    4                 16                  23
6        1    5.8     1                 17                  6
CF (adding up) = cumulative frequency, adding from the bottom row up
CF (adding down) = cumulative frequency, adding from the top row down
C% = percent cumulative frequency, counting down from 100

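The cumulative columns can be generated with Python's itertools.accumulate. This sketch uses the frequencies from the table; the variable names are mine. It rounds to the nearest value, so a few percentages may differ by one in the last digit from those printed on the slide.

```python
from itertools import accumulate

stops = [17, 16, 14, 10, 9, 6]
freq = [2, 4, 5, 2, 3, 1]
n = sum(freq)                                      # 17

cf_down = list(accumulate(freq))                   # adding from the top row down
cf_up = list(accumulate(reversed(freq)))[::-1]     # adding from the bottom row up

percent = [round(100 * f / n, 1) for f in freq]    # percent of cases per row
cum_percent = [round(100 * c / n) for c in cf_up]  # cumulative percent, counting down

print(cf_down)   # [2, 6, 11, 13, 16, 17]
print(cf_up)     # [17, 15, 11, 6, 4, 1]
```
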
HISTOGRAM OF BOOKMOBILE STOPS
[Histogram: frequency F by Number of Bookmobile Stops (X axis from 5.0 to 17.5). Std. Dev = 3.43, Mean = 13.0, N = 17.00]

Normalizing Data
• Some data sets are not very close to a normal distribution
• Sometimes it helps to transform the independent variable by applying a math function to it, such as looking at log(x) (the logarithm of each x value) instead of just x

Normalizing Data
• In SPSS this can be done by defining a new variable, such as “log_x”
• Then use Transform / Compute to calculate
    log_x = LG10(x)
  assuming that ‘x’ is the original variable
• Then generate a histogram showing the normal curve, to see if log_x is closer to a normal distribution

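Outside SPSS, the same transformation is one line of Python. The data here is made up to be strongly right-skewed, and math.log10 plays the role of LG10.

```python
import math

x = [1, 10, 100, 1000, 10000]           # made-up values with a long right tail
log_x = [math.log10(v) for v in x]      # same idea as log_x = LG10(x)

# On the log scale the values are evenly spaced instead of bunched near zero
gaps = [b - a for a, b in zip(log_x, log_x[1:])]
print(log_x)
```

The logarithm only works for positive values, so check for zeros or negatives before transforming.
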
Normalizing Data
• Who cares if we have a normal distribution?
• Many tests in statistics assume that the variable has a normal distribution, so it’s worth our while to transform the variable