Lecture 2
Comparing Groups
Using SPSS
One of the goals of the ATV study was to determine whether severity of injury was related to whether or not the
rider was wearing a helmet.
So the variable, helmet, is an independent variable in the study.
SPSS FREQUENCIES output for helmet
Missingness and Missing Values
In most research we encounter situations in which we are supposed to have all values of a variable, but some are
missing.
Age and Income variables are good examples of variables for which values are missing because of respondent
failure to provide them.
Other variables go missing because of communication failure, data entry error, and a thousand other reasons.
We typically do not throw out the baby with the bathwater. That is, we try to keep the data we have, even
though some of the values are missing.
Copyright © 2005 by Michael Biderman
Measures of CT and Variability - 1
4/30/2017
How missingness is represented in data editors
Empty cells in a data editor, often the case in SPSS
Special Values in cells in a data editor.
Missing value: A value entered in the absence of a valid data value.
SPSS has to be told that a specific value represents missingness.
Rcmdr automatically assumes that the character sequence “NA” stands for missingness.
The SPSS EXPLORE procedure – a procedure for group comparisons
A procedure in SPSS designed to allow comparison of groups using a variety of descriptive techniques.
We’ll compare ISS scores of helmet users vs. non-helmet users.
The EXPLORE main dialog window
Analysis specifics
I told the program to give me histograms.
I clicked on Options and told the program to include reports for missing values of the factor variable.
The EXPLORE Output
Whew!!
The Histograms
Note that only the No helmet group had patients with very high ISS values.
Note that the Helmet group had no patients with very high ISS values.
It appears that the “Info Unavailable” group also had no very high ISS values.
Note:
1. I stacked the histograms vertically – following the rule for comparing groups using histograms.
2. The histograms have the same x-axis scale and equal column widths.
To manipulate x-axis labels in SPSS
1. Double-click on the histogram to open the Chart Editor window.
2. Double-click on one of the x-axis numbers.
3. Click on Scale and choose the appropriate scale values – in this case I chose 0, 80, and 10 for Minimum, Maximum, and Major Increment.
4. Click on Apply.
To manipulate column width in SPSS
1. Double-click the figure to open the Chart Editor window.
2. Double-click on a column.
3. Click on Binning, then click on Custom, and enter the desired width. I entered 5.
4. Click on Apply.
Comparing Groups in rcmdr . . .
R -> Load Packages -> Rcmdr
Data -> Import Data -> from SPSS dataset -> ATVDataForClass050906.sav
Statistics -> Summaries -> Numerical summaries . . .
> numSummary(ATVData[,"iss"], groups=ATVData$helmet, statistics=c("mean",
+   "sd", "IQR", "quantiles"), quantiles=c(0,.25,.5,.75,1))
        mean       sd IQR 0% 25% 50% 75% 100% data:n
no  11.39244 8.647624  11  1   5   9  16   75    344
yes  7.84127 4.749215   6  1   4   8  10   25     63
Graphs -> Histogram . . .
As was the case with the SPSS histogram, it’s clear to see that the No helmet group had larger ISS values.
Dot plots in rcmdr by helmet group
Same conclusion from the dot plot – more of the larger ISS values are in the No helmet group.
WEAR YOUR HELMET!!!
The Goals of Descriptive Statistics
What kinds of characteristics can a collection of numbers have?
People can be kind, aloof, gregarious, tall, friendly, mean, spacy, etc. Cities can be forward-looking, violent,
progressive, etc. Cars can be fast, economical, stylish, ugly, heavy, etc.
Just as there are certain characteristics which seem to "belong" to people or cities or cars, there are a few
characteristics which "belong" to collections of numbers and which statisticians feel should be mentioned
whenever an attempt is made to describe a collection.
The Big Three Characteristics of data
1: Central Tendency
The first characteristic is called the central tendency. (It's also called "average" value, location, and expected
value.) It reflects the sizes of the numbers in the collection.
Consider the following weights: 230, 260, 305, 195.
Compare them with the following: 115, 120, 105, 94, 110, 115, 100, 90, 85.
Even though the second collection has more scores in it, the central tendency of the first is larger. The scores
in the first collection are larger than those in the second.
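As a rough check on the comparison above, the mean of each collection can be computed directly. This is a minimal sketch in Python, which is not one of the packages used in these notes; the variable names are illustrative only.

```python
from statistics import mean

# Weights from the two collections in the example above.
first = [230, 260, 305, 195]
second = [115, 120, 105, 94, 110, 115, 100, 90, 85]

mean_first = mean(first)
mean_second = mean(second)

print(mean_first)             # 247.5
print(round(mean_second, 1))  # 103.8
```

The second collection has more scores, but its typical value is less than half that of the first.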
2: Variability
The second important characteristic of collections of numbers is the variability of the values. It is also called
the dispersion, heterogeneity or width of the values. This characteristic reflects the differences between the
values. If all the values are close to each other we say that variability is small. If the values in the collection are
quite different from each other, we say that variability is large.
Consider the following collection: 150, 155, 158, 160, 153, 156, 152.
Compare it with: 85, 175, 305, 95, 130.
Note that the scores in the second collection are quite different from each other. Thus, the second collection is
more variable than the first.
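The difference in variability between these two collections can be seen with even the crudest measure, the distance between the largest and smallest value. The sketch below is in Python (not a package used in these notes) and the names are illustrative; it also computes the standard deviation, a measure defined later in these notes, using the "dividing by N-1" version as an assumption.

```python
from statistics import stdev

# The two collections from the example above.
first = [150, 155, 158, 160, 153, 156, 152]
second = [85, 175, 305, 95, 130]

range_first = max(first) - min(first)
range_second = max(second) - min(second)

print(range_first, range_second)  # 10 220

# Standard deviation (N-1 version) tells the same story.
sd_first = stdev(first)
sd_second = stdev(second)
print(sd_first < sd_second)       # True
```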
3: Shape
Shape refers to the way score values are positioned or placed on the number line.
In some distributions, the scores are all piled up on one side or the other.
In others, the scores are piled up in the middle.
Shape will be considered in detail after graphical methods of description have been introduced.
Other Characteristics
4. Relationship between paired values.
We will consider the relationship or correlation between paired data later in the course.
Numeric Measures of Central Tendency and Variability
Howell Chapters 4 & 5
Pros and Cons of Tables and Graphs
Pros
1. Easy for laypeople to understand.
2. Many are fairly easy to construct.
3. Show the complexities of distributions and comparisons of distributions – central tendency, variability,
shape, outliers all in one presentation.
4. Particularly good for identifying problem distributions and outliers.
5. Don’t require or assume specific distribution shape, such as normality.
Cons (relative to numeric summaries)
1. Take up space.
2. Are not amenable to further computations – no analog to a mean of means, for example.
3. Richness of information may make you crazy.
4. Not useful for generalizing from samples to populations.
Numeric Summaries
Single values chosen to represent a characteristic of data.
Measures of central tendency – Single values chosen to represent central tendency of a collection.
Measures of variability – Single values chosen to represent variability of a collection.
Measures of skewness – Single values chosen to represent skewness of a distribution
Measures of kurtosis – Single values chosen to represent how similar the distribution is to the normal
distribution
Looking ahead
Measures of correlation – The extent to which values of one variable covary with paired values of another
variable.
Fewer than 20 measures that you’ll have to be able to interpret as a data analyst.
Missing Data
Why consider missing data here?
Because the presence of missing data complicates the computation and representation of data using the numeric
summaries we’re about to cover.
Reasons for missing data include
1) respondents failing to answer questions in a survey.
2) values incorrectly entered into the computer.
3) values that represent “Don’t Know” or “Don’t Care” or “Won’t tell you” responses.
In SPSS parlance, a missing value is an actual value that was put into the data not as a valid data value
but in order to represent the fact that a score is, in fact, missing.
In SPSS, an empty cell in the data editor stands for a missing value. But in many situations, an actual value
must be recorded when there is a missing response. Such values are the “missing values” we’re dealing with
here.
For example, if you’re saving data as a text file for use in another program, it is often easiest for every
cell in the data editor to have something in it prior to saving.
Missing values are not a terribly important issue when frequency distributions and graphs are used to
summarize data because they’re just part of the summary. But when a statistic is to be computed, values that
“don’t count” should not be included in the computation. The statistical package has to be told that such
values are special and are not to be included in computation of statistics.
Missing data are represented in SPSS in two ways.
1) Empty cells in the Data Editor window. These are called SYSTEM MISSING.
2) Actual values entered into the Data Editor window but given “Missing Value” status by you.
In Excel, only empty cells are recognized as missing values.
In rcmdr, the NA symbol is used to represent missingness.
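To see why a package has to be told which values "don't count", here is a minimal sketch in Python (not a package used in these notes). The ages, the use of None for an empty cell, and the code 999 are all made up for illustration.

```python
from statistics import mean

# Hypothetical ages: None mimics an empty cell, and 999 is a made-up
# code that the analyst has declared to mean "missing".
ages = [23, 31, None, 45, 999, 28]
MISSING_CODES = {999}

valid = [a for a in ages if a is not None and a not in MISSING_CODES]
n_missing = len(ages) - len(valid)
print(len(valid), n_missing)  # 4 2

mean_valid = mean(valid)
print(mean_valid)             # 31.75

# If the 999 code is not declared missing, it is treated as a real age.
undeclared = [a for a in ages if a is not None]
mean_undeclared = mean(undeclared)
print(mean_undeclared)        # 225.2
```

The second mean is nonsense as an age, which is the kind of error that declaring missing values prevents.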
To tell SPSS that one of the values of a variable is to be treated as a “Missing Value”,
1) Click on the “Variable View” tab at the lower left of the Data Editor window.
2) Click under “Missing” in the same row as the variable for which missing values are to be declared.
3) Enter the values to be treated as missing in the dialog box shown below.
Measures of central tendency
From worst to best
The Mode:
Definition: Value that occurred most frequently in the collection.
Example data: 5 6 7 7 7 7 8 9 10 11 13. Mode is 7.
Problems
1) Often not computable, especially with small samples.
E.g., What’s the mode of 3,4,5,5,6,7,8,8,9?
2) Very unstable (unreliable) from sample to sample.
Should only be reported . . .
1) When it dominates the data, e.g., 70% of scores are one value.
2) When data are nominal, e.g., gender, ethnic group, in which case other quantitative
measures are not appropriate
Don’t report it (on penalty of lost points) in other situations
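The "often not computable" problem shows up as soon as two values tie for most frequent. A minimal sketch in Python (not a package used in these notes), using the standard library function multimode, which returns every value tied for the highest frequency:

```python
from statistics import multimode

modes_a = multimode([5, 6, 7, 7, 7, 7, 8, 9, 10, 11, 13])
modes_b = multimode([3, 4, 5, 5, 6, 7, 8, 8, 9])

print(modes_a)  # [7]
print(modes_b)  # [5, 8]
```

The second collection has two candidates, so no single mode can be reported for it.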
The median
Conceptual definition: Value above which and below which 50% of scores fall.
Example data: How about: 2 4 6 8 Hmm. We need to be more precise.
Operational definition:
1) Order the scores.
2) For odd N, median is middle score in the ordered list.
For even N, median is the average of the two middle scores in the ordered list.
Example 1 – N is odd
X’s: 81, 69, 77, 93, 96, 99, 83, 85, 75, 89, 94
Ordered: 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 99. Median is 85.
Example 2 – N is even
X’s: 81, 69, 77, 93, 96, 99, 83, 85, 75, 89, 94, 57
Ordered: 57, 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 99. Median is (83+85)/2 = 84.
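The operational definition translates almost line for line into code. This is a sketch in Python (not a package used in these notes); the function name is illustrative.

```python
def median(scores):
    """Median by the operational definition: order, then take the middle."""
    ordered = sorted(scores)           # 1) order the scores
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # odd N: the middle score
        return ordered[mid]
    # even N: average of the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

example_1 = [81, 69, 77, 93, 96, 99, 83, 85, 75, 89, 94]
example_2 = example_1 + [57]

print(median(example_1))  # 85
print(median(example_2))  # 84.0
```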
Pros
1. Gives an indication of the center of the distribution.
2. Usually not affected by outliers. E.g., Median of 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 999 is 85. So
the 999 didn’t affect it. Robust with respect to outliers.
3. All in all, a very useful measure.
Cons
1. For normally distributed data for which there are absolutely no outliers, median is slightly less stable
from sample to sample than the mean.
2. Not a part of the normal distribution. Not descended from royalty.
The mean
Definition: Arithmetic average of the scores.
Weighted sum of the scores with weighting equal to 1/N.
Ranking of the three measures, from best to worst: Mean, Median, Mode.
Symbols
Group        Symbol
Sample       X̄ (X-bar) or M_X
Population   µ (Pronounced “myou”. If you mated a cat that says “meow” and a cow that says “moo”, the offspring would say “mu”.)
Pros
1. Good heritage – comes from royalty. It’s a part of the normal distribution formula.
2. For normally distributed data with no outliers, most stable from sample to sample.
3. Computation is straightforward, doesn’t involve sorting.
Cons
1. Can be dramatically affected by outliers.
For example, the mean of 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 99 from above is 85.5.
But the mean of 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 999 is 167.4, a value not close to ANY of the
original scores. Compare this with the median of the above data, which is 85 in both cases. You should
always compute both and compare them.
2. Related to the above, many analysts feel that the mean is unrepresentative of skewed data.
So compute the median AND the mean. If they’re approximately equal, then use the mean.
If they’re different, then probably the median is more appropriate.
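The advice to compute both measures and compare them can be tried on the two collections used in the outlier example. A minimal sketch in Python (not a package used in these notes); the variable names are illustrative.

```python
from statistics import mean, median

clean = [69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 99]
with_outlier = clean[:-1] + [999]   # replace the 99 with 999

mean_clean = mean(clean)
mean_outlier = mean(with_outlier)

print(round(mean_clean, 1), median(clean))           # 85.5 85
print(round(mean_outlier, 1), median(with_outlier))  # 167.4 85
```

Without the outlier the two measures nearly agree; with it, only the median stays put.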
Trimmed mean
Definition: Mean of the scores remaining after the largest K% and smallest K% have been removed. Typically,
K is 5.
Having your cake and eating it too: the benefits of the mean without the sensitivity to extreme values.
Olympic tradition.
Pros
1. Less affected by outliers.
Cons
1. Still not representative of skewed data in my view.
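A sketch of the trimmed mean in Python (not a package used in these notes). The function name is illustrative, and the rule for turning K% into a whole number of scores (rounding down) is an assumption; statistical packages differ on this detail.

```python
def trimmed_mean(scores, k_percent):
    """Mean after dropping the largest and smallest k_percent of scores."""
    ordered = sorted(scores)
    # Number of scores dropped from EACH end, rounded down (an assumption).
    cut = int(len(ordered) * k_percent / 100)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

scores = [69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 999]

print(round(trimmed_mean(scores, 10), 1))  # 85.9 (the 69 and the 999 are dropped)
print(round(trimmed_mean(scores, 5), 1))   # 167.4 (5% of 11 rounds down to 0 scores)
```

With only 11 scores, K = 5 removes nothing under this rounding rule, so small samples may need a larger K for the trimming to have any effect.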
When to use the various measures of Central Tendency
Memorize this table. Make a locket out of it.
I. Numeric Variables
                           Distribution Shape
                           Unimodal and Symmetric (US)   Skewed
No Outliers                Mean                          Median
Outliers may be present    Median, Trimmed Mean          Median
II. Nominal Data.
The mode is the only measure that makes sense when you're attempting to summarize nominal data.
Measures of Variability
The Range
Definition: Difference between largest score and smallest.
2 problems.
1. Range is restricted whenever score values are restricted.
Use of 5-point scales on questionnaires is a good example.
2. Range is unstable from sample to sample.
Don’t use as the primary measure.
The Interquartile Range
Quartiles: Points identifying "quarters" of a distribution.
Conceptual Definitions
Q4   Fourth Quartile     The value below which 4/4ths of the distribution falls.
Q3   Third Quartile      The value below which 3/4ths of the distribution falls.
Q2   Second Quartile     The value below which 2/4ths of the distribution falls.
Q1   First Quartile      The value below which 1/4th of the distribution falls.
Q0   "Zeroth" Quartile   The value below which 0/4ths of the distribution falls.
Operational Definitions
Q4   The largest score in the distribution.
Q3   The median of the upper half of the distribution. (If N is odd, include the overall median in the upper half.)
Q2   The overall median of the collection. Compute using the median formula.
Q1   The median of the lower half of the distribution. (If N is odd, include the overall median in the lower half.)
Q0   The smallest score in the distribution.
Interquartile Range: The distance (on the number line) between the Q1 and Q3 - between the first
quartile and the third quartile.
IQR = Q3 - Q1
Interpretation
The distance or interval size required to contain the middle 50% of the scores.
If the middle 50% is contained in a small area, the distribution is quite "crowded" - the scores are close
to each other; the distribution has little variability.
If the middle 50% is contained in a wide area, the distribution is sparse - the scores are far from each
other; the distribution has much variability.
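Example - A distribution with an even number of scores.
Scores: 75, 65, 50, 45, 40, 40, 35, 35, 30, 30, 30, 25, 25, 10
Upper half of distribution: 75, 65, 50, 45, 40, 40, 35 (median = Q3 = 45)
Lower half of distribution: 35, 30, 30, 30, 25, 25, 10 (median = Q1 = 30)
IQR = 45 – 30 = 15.
Example - A distribution with an odd number of scores.
Scores: 65, 50, 45, 40, 35, 35, 30, 25, 25, 20, 15
Note that 35, the overall median, is included in both the lower and upper halves.
Upper half of distribution: 65, 50, 45, 40, 35, 35 (median = Q3 = 42.5)
Lower half of distribution: 35, 30, 25, 25, 20, 15 (median = Q1 = 25)
IQR = 42.5 – 25 = 17.5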
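The median-of-halves rule used in the two examples above can be sketched in Python (not a package used in these notes). The function name is illustrative. Note that statistical packages often use other quartile rules, so their output may differ slightly from this hand method.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, Q3) using the median-of-halves rule from these notes."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half + n % 2]   # includes the overall median when N is odd
    upper = ordered[half:]           # includes the overall median when N is odd
    return median(lower), median(upper)

even_scores = [75, 65, 50, 45, 40, 40, 35, 35, 30, 30, 30, 25, 25, 10]
odd_scores = [65, 50, 45, 40, 35, 35, 30, 25, 25, 20, 15]

q1_even, q3_even = quartiles(even_scores)
q1_odd, q3_odd = quartiles(odd_scores)

print(q3_even - q1_even)  # 15
print(q3_odd - q1_odd)    # 17.5
```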
Data Examples Start here on 9/6/16
Conscientiousness scale scores from the Bias Study Questionnaire Packet administered at the beginning of
semester in 2008. Each person’s score was the mean of either 10 items (IPIP) or 12 items (NEO-FFI). For
each, the response scale was a 5-point scale, numbered from 1 to 5.
Distribution of Conscientiousness scores from the IPIP Personality Questionnaire.
Statistics: icon
N Valid            189
N Missing          0
Mean               3.59418
Median             3.60000
Std. Deviation     .614729
Range              3.000
Percentiles   25   3.25000
              50   3.60000
              75   4.00000
Interquartile range = 4.00 – 3.25 = 0.75
Distribution of Conscientiousness scores from the NEO-FFI Personality Questionnaire
Statistics: ncon
N Valid            189
N Missing          0
Mean               3.70767
Median             3.83333
Std. Deviation     .574311
Range              2.750
Percentiles   25   3.33333
              50   3.83333
              75   4.12500
Interquartile range = 4.12 – 3.33 = 0.79
Both the IPIP questionnaire at the top and the NEO questionnaire at the bottom were scored on the same 5-point
scale.
The two distributions are pretty nearly identical. (I believe a previous version of these notes had the wrong
distribution for ncon. )
Variance
Definition 1: The sum of the squared differences of the scores from the mean divided by N.
This is the “dividing by N” definition. Use this formula for populations.
Definition 2: The sum of the squared differences of the scores from the mean divided by N-1.
This is called the, you guessed it, “dividing by N-1” definition. Use this formula for samples.
The variance is a useful theoretical measure of variability, but it’s not useful as a descriptive measure
because it’s in squared units.
Variance is part of the normal distribution formula, so it has good roots.
Variance is a part of many formulas (e.g., t, F) in inferential statistics.
Standard Deviation
Definition 1: Square root of the sum of the squared differences of the scores from the mean divided by N
That is, the standard deviation is the square root of the variance. This definition is for populations.
Definition 2: Square root of the sum of the squared differences of the scores from the mean divided by N-1.
This definition is for samples.
Wait! Is this déjà vu all over again? Do these seem familiar?
They should, because the standard deviation is simply the square root of the variance.
Symbols
Group        Measure              Symbol   Formula
Sample       Variance             S²       $S^2 = \frac{\sum (X - \text{Mean})^2}{N - 1}$
Sample       Standard Deviation   S        $S = \sqrt{\frac{\sum (X - \text{Mean})^2}{N - 1}}$
Population   Variance             σ²       $\sigma^2 = \frac{\sum (X - \text{Mean})^2}{N}$
Population   Standard Deviation   σ        $\sigma = \sqrt{\frac{\sum (X - \text{Mean})^2}{N}}$
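The two definitions differ only in the divisor, which is easy to see on a tiny collection. A minimal sketch in Python (not a package used in these notes), using the four scores 2, 4, 6, 8; the standard library's pvariance and pstdev divide by N, while variance and stdev divide by N-1.

```python
from statistics import pvariance, variance, pstdev, stdev

scores = [2, 4, 6, 8]   # mean is 5; squared deviations are 9, 1, 1, 9 (sum 20)

pop_var = pvariance(scores)    # 20 / 4, the "dividing by N" definition
samp_var = variance(scores)    # 20 / 3, the "dividing by N-1" definition

print(pop_var, round(samp_var, 2))                        # 5 6.67
print(round(pstdev(scores), 2), round(stdev(scores), 2))  # 2.24 2.58
```

The N-1 versions are always a little larger, and the gap shrinks as N grows.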
Pros of the standard deviation
1. Good roots – is in the normal distribution formula.
2. Generally regarded as best for normal distributions (with no outliers).
Cons of the standard deviation
1. Inflated by the presence of outliers. Can be dramatically inflated by them.
2. What’s it mean??
Facts about the Standard deviation
Assume you have a large (e.g., N >= 30) collection of scores that are unimodal and symmetric.
1. About 2/3 of the scores will be within 1 SD of the mean, that is, between Mean - SD and Mean + SD.
2. About 95% of the scores will be within 2 SDs of the mean, that is, between Mean - 2 SD and Mean + 2 SD.
So, if you scored 2 standard deviations above the mean in Conscientiousness, what would be your approximate
score? 2 SDs above the mean would be 3.6 + .61 + .61 = 4.82. Two SDs below is 3.6 – 1.22 = 2.38, or about 2.4.
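The same arithmetic can be done with the unrounded IPIP Conscientiousness statistics reported earlier in these notes (mean 3.59418, standard deviation .614729). A minimal sketch in Python (not a package used in these notes); the variable names are illustrative.

```python
# Mean and SD of the icon variable from the SPSS Statistics table.
mean_icon = 3.59418
sd_icon = 0.614729

one_sd = (mean_icon - sd_icon, mean_icon + sd_icon)            # about 2/3 of scores
two_sd = (mean_icon - 2 * sd_icon, mean_icon + 2 * sd_icon)    # about 95% of scores

print([round(v, 2) for v in one_sd])  # [2.98, 4.21]
print([round(v, 2) for v in two_sd])  # [2.36, 4.82]
```

These intervals are rules of thumb for large, unimodal, symmetric collections, not exact percentages.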
Wrap up – when to use each measure of variability. ...
                    US distribution      Skewed Distribution
No outliers         Standard deviation   IQR
Outliers possible   IQR                  IQR
Making use of both scale level and scale variability
We typically think only of the level of a psychological variable, how big the responses to all items were.
But what about how different an individual’s responses were from item to item – the variability of responses.
Data: IPIP Conscientiousness Scale.
Excerpt from Data Editor
gencon is the typical Conscientiousness scale score
sgencon is the standard deviation of responses to the 10 conscientiousness items.
Compare lines 1 and 8 – both have the same scale level (4.00) but 8 is much more variable than 1.
Compare lines 17 and 20 – both have the same variability (1.07) but 20 has a higher scale value than 17.
These examples suggest that both levels and variabilities are exhibited by the responses to questionnaires.
Are these differences of any use to us???
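To make the distinction between level and variability concrete, here is a sketch in Python (not a package used in these notes). The two sets of item responses are made up for illustration and are not taken from the Data Editor excerpt; using the N-1 standard deviation is an assumption about how a variable like sgencon would be computed.

```python
from statistics import mean, stdev

# Hypothetical responses of two people to 10 items on a 1-5 scale.
person_a = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
person_b = [5, 3, 5, 3, 5, 3, 5, 3, 4, 4]

level_a, level_b = mean(person_a), mean(person_b)   # scale level
var_a, var_b = stdev(person_a), stdev(person_b)     # item-to-item variability

print(level_a, level_b)                  # 4 4
print(round(var_a, 2), round(var_b, 2))  # 0.0 0.94
```

Both people get the same scale score, yet the second person's answers swing from item to item.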
Distributions of level and variability . . .
We looked at both the level of responses – the typical score computed from a questionnaire – and also the
variability of responses to Conscientiousness items.
The largest levels of Conscientiousness predicted high GPAs. People who are high in level of
conscientiousness have higher GPAs.
But the smallest variabilities of Conscientiousness predicted high GPAs. People who are less variable in
their report of conscientiousness have higher GPAs.
Note that both distributions are approximately unimodal and symmetric, although the distribution of standard
deviations is slightly positively skewed.
We’ve found, as have probably more than 100 other researchers, that level of conscientiousness (gencon in the
above graph) is a valid predictor of GPA. It’s not a perfect predictor, but it has been found to be statistically
significant in a vast majority of studies. People who score high on conscientiousness scales generally get better
grades than people with the same intelligence who score lower on conscientiousness.
Now here’s something that is almost new to our research here at UTC: We have found that variability in
self-reported conscientiousness (sgencon in the above) is ALSO a valid predictor of GPA. Only about 5 studies
have found that – all of them conducted here at UTC. The relationship is inverse. People who are more
inconsistent in their self-reports (who have higher sgencon values) have slightly LOWER GPAs than people
who are less inconsistent.
So both level and variability may be of use to us.
Measures of distribution shape
Measures of skewness
A popular measure of skewness is the following, given by
Kirk, R. (1999). Statistics: An introduction. 4th Ed. New York: Harcourt Brace.
$\text{Skewness} = \frac{\sum (X - \text{Mean})^3 / N}{S^3}$
In English: The sum of the cubed deviations of scores from the mean divided by N, then divided by the cube
of the standard deviation.
Or, the average of the cubed deviations of scores from the mean then divided by the cube of the standard
deviation.
Interpretation of values
Value of Skewness measure   Interpretation
Larger than 0               Positively skewed distribution
0                           Symmetric distribution
Less than 0                 Negatively skewed distribution
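The skewness measure described above can be sketched in Python (not a package used in these notes). The function name and the small data sets are illustrative. One assumption is labeled in the code: S is computed here by dividing by N; using the N-1 version would change the value slightly but not its sign.

```python
from statistics import mean, pstdev

def skewness(scores):
    """Average cubed deviation from the mean, divided by the cube of S."""
    m = mean(scores)
    s = pstdev(scores)   # assumption: S uses the "dividing by N" definition
    avg_cubed = sum((x - m) ** 3 for x in scores) / len(scores)
    return avg_cubed / s ** 3

symmetric = [1, 2, 3, 4, 5]
tail_right = [1, 1, 1, 2, 10]    # long tail toward large values
tail_left = [1, 9, 10, 10, 10]   # long tail toward small values

print(skewness(symmetric) == 0)   # True
print(skewness(tail_right) > 0)   # True
print(skewness(tail_left) < 0)    # True
```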
Example of the skewness statistic
1. Salaries from the Employee Data file.
Statistics: salary Current Salary
N Valid                  474
N Missing                0
Skewness                 2.125
Std. Error of Skewness   .112
[Histogram of Current Salary: Mean = $34,419.57, Std. Dev. = $17,075.661, N = 474]
2. Extroversion scores of 109 UTC students
Statistics: hext
N Valid                  109
N Missing                1
Skewness                 -.220
Std. Error of Skewness   .231
[Histogram of hext: Mean = 4.4582, Std. Dev. = 0.95104, N = 109]
Kurtosis
Kurtosis refers to the relationship of the shape of a distribution to the shape of the Normal Distribution.
Kirk gives the following measure of kurtosis:
$\text{Kurtosis} = \frac{\sum (X - \text{Mean})^4 / N}{S^4} - 3$
In English: The sum of the deviations of scores from the mean raised to the fourth power divided by N, then
divided by the standard deviation raised to the fourth power, minus 3.
Or, the average of the 4th-powered deviations from the mean divided by the standard deviation to the 4th power,
then minus 3.
Interpretation
Value of Kurtosis measure   Interpretation
Larger than 0               More peaked than the Normal distribution.
0                           Same peakedness as the Normal distribution.
Less than 0                 Less peaked (flatter) than the Normal distribution.
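The kurtosis measure described above can be sketched in Python (not a package used in these notes). The function name and data are illustrative, and as an assumption labeled in the code, S is computed by dividing by N.

```python
from statistics import mean, pstdev

def kurtosis(scores):
    """Average 4th-power deviation divided by S to the 4th power, minus 3."""
    m = mean(scores)
    s = pstdev(scores)   # assumption: S uses the "dividing by N" definition
    avg_fourth = sum((x - m) ** 4 for x in scores) / len(scores)
    return avg_fourth / s ** 4 - 3

flat = [1, 2, 3, 4, 5]                        # evenly spread values
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]      # most values at the center

print(round(kurtosis(flat), 2))    # -1.3
print(round(kurtosis(peaked), 2))  # 2.0
```

The evenly spread collection comes out below 0 and the collection piled up at its center comes out above 0, matching the interpretation table.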
Example
1. Extroversion scores of 109 UTC students
Statistics: hext
N Valid                  109
N Missing                1
Kurtosis                 -.371
Std. Error of Kurtosis   .459
[Histogram of hext: Mean = 4.4582, Std. Dev. = 0.95104, N = 109]
Although it’s not immediately apparent from the histogram, according to the Kurtosis measure the distribution
is slightly less peaked than the Normal Distribution.
Importing Data from Excel to SPSS
1. Importing Data from Excel using SPSS’s built-in Importing capabilities.
Demo with ‘G:\MDBT\InClassDatasets\TennesseeHospitalSurvey for class pres.xls’
A. From SPSS: File -> Open -> Data (Choose Excel (*.xls) under “Files of type:”.)
Check all data very carefully. Sometimes the data won’t be put into SPSS in the way you believe they should.
Problem areas . .
i. Date and Time variables.
ii. Columns of numbers which happen to have a blank cell or a string character in the first cell of the column.
Make the appropriate choice in the following dialog box.
If the Excel file has names in the first row, leave the “Read variable names from the first row of data” box checked.
If there are no variable names in the Excel file, uncheck that box.
The Excel file . . .
The SPSS file . . .
Note – 3 alphabetic columns
2. Importing data from Excel by copying and pasting.
A. Open a blank SPSS data editor window.
B. Open the file within Excel.
C. Highlight a column and choose “Copy”.
D. Click on the top cell of the column in which data are to be pasted in SPSS.
E. Choose “Paste”.
Check all data very carefully. Problem areas . . .
i. If pasting a string (character) variable, you must set the column type in SPSS to string before pasting.
ii. Columns which have mixtures of strings and numbers will paste in as only strings or only numbers in SPSS.
SPSS doesn’t allow mixtures of data types within a column.