Descriptive Statistics
In the name of Allah Kareem,
Most Beneficent, Most Gracious,
the Most Merciful !
Introduction to SPSS
Before processing the data further, we should first get to
know the SPSS software.
SPSS
SPSS stands for “Statistical Package for the Social Sciences”. It
is used mainly for the analysis of quantitative data.
How to open SPSS
Start Menu → Programs → SPSS Inc. → SPSS 16.0
Opening the SPSS
SPSS Interface
[Screenshot labels: Title Bar, Menu Bar, Tool Bar, Variable definition criteria, Serial Number / Cases, Work sheet, SPSS Views]
7.3 Data Entry
After defining the variables, enter the data in Data View for each
case (row-wise) against each variable (column-wise).
Data Preparation (Processing)
After completing your survey (sample attached as Annexure 1) and
getting to know the SPSS interface, the next step in the
quantitative research process is to prepare the data for analysis.
This process involves three steps.
1. Coding the questionnaire.
2. Defining the variables in SPSS Variable View.
3. Entering the data in SPSS Data View.
Data Processing
After the data are collected, data processing begins.
It involves
1. Data coding
2. Defining the variables
3. Data entry in the software
Coding
Coding is the process of assigning numbers to the values or
levels of each variable.
Rules of Coding
1. All data should be numeric.
2. Each variable for each case or participant must occupy the same
   column.
3. All values (codes) for a variable must be mutually exclusive.
4. Each variable should be coded to give maximum information.
5. For each participant, there must be a code or value for each
   variable.
6. Apply any coding rule consistently for all participants.
7. Use high numbers (values or codes) for the “agree,” “good,” or
   “positive” end of a variable that is ordered.
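As a rough illustration of these rules, the sketch below codes two items from the sample questionnaire (college and gender) in Python. The numeric codes, the names `CODEBOOK`, `MISSING_CODE` and `code_response`, and the use of 9 as a missing-value code are all assumptions made for this example; the slides do not prescribe them.

```python
# Hypothetical codebook: the numeric codes below are illustrative assumptions.
CODEBOOK = {
    "college": {"Arts & Sciences": 1, "Business": 2, "Engineering": 3},
    "gender": {"M": 0, "F": 1},
}
MISSING_CODE = 9  # assumed code for blank or unusable answers


def code_response(variable, answer):
    """Return the numeric code for an answer, or MISSING_CODE if it has none."""
    return CODEBOOK[variable].get(answer, MISSING_CODE)


# One participant's raw answers; the gender item was left blank.
raw = {"college": "Business", "gender": ""}
coded = {variable: code_response(variable, answer) for variable, answer in raw.items()}
print(coded)  # {'college': 2, 'gender': 9}
```

Every participant ends up with a numeric code for every variable, and each answer category maps to exactly one code, which is what rules 1, 3 and 5 ask for.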
SAMPLE QUESTIONNAIRE
Please circle or supply your answer
ID _________
                                                     SD           SA
1. I would recommend this course to other students   1  2  3  4  5
2. I worked very hard in this course                 1  2  3  4  5
3. My college is: Arts & Sciences ____  Business ____  Engineering ____
4. My gender is: M  F
5. My GPA is: _____________
6. For this class, I did (check all that apply):
   The reading ____  The homework ____  Extra credit ____
7.2 Defining the variables
In SPSS, the variables are first defined in Variable View.
This includes
1. Name (short name of the variable, without spaces)
2. Type (Numeric, String)
3. Width (8, 10, etc.)
4. Decimals (2, 3, 5, etc.)
5. Label (full name of the variable)
6. Values (answer categories with codes)
7. Missing (blank, multiple, wrong answers)
8. Columns (6, 8, 10, etc.)
9. Align (left, right, centre)
10. Measure (nominal, ordinal, scale)
Data Analysis
1. Data analysis is the process of organizing, summarizing,
presenting, and interpreting data and drawing conclusions
from them, with the goal of highlighting useful information
and supporting decision making. In quantitative research,
data analysis is performed objectively using statistical
techniques.
2. Statistics is a branch of applied mathematics concerned
with the collection and interpretation of quantitative data
to draw conclusions and test (accept or reject)
hypotheses. There are two levels/types of statistics:
   1. Descriptive statistics
   2. Inferential statistics
Descriptive Statistics
Descriptive statistics are the statistics used to understand
and describe the data. They are used to answer descriptive
research questions. They involve
1. Summarizing the data
2. Measures of central tendency
3. Measures of dispersion
4. Checking data normality
5. Data file management
6. Recoding and transforming variables
1. Summarizing the data
A data matrix contains too much information to be taken in at a glance, which
makes it difficult to understand and get a feel for the data. A set of data
can be understood only if it is summarized in some appropriate way.
i. Summarizing categorical data
A categorical variable is usually summarized by its frequencies and their
percentages. This summary is called a frequency distribution. It can be
presented in two ways:
• Tables of frequencies and percentages, or
• Graphs
a. Frequency Distribution
A frequency distribution is a tally (IIII) or count of the
number of times each score (category) on a single
variable is marked by respondents. Frequencies can
be further summarized by expressing them as
percentages of the total, using the following formula:
Percentage = (frequency / total) × 100
To get a frequency distribution table:
Analyze → Descriptive Statistics → Frequencies → move religion to the variable box → OK (make sure that the Display frequency tables box is checked)
Frequency Distribution Table
                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid     protestant              30      40.0            44.8                 44.8
          catholic                23      30.7            34.3                 79.1
          no religion             14      18.7            20.9                100.0
          Total                   67      89.3           100.0
Missing   other religion           4       5.3
          blank                    4       5.3
          Total                    8      10.7
Total                             75     100.0
Interpretation:
In this example, the Frequency column shows the
number of students who marked each type of religion (e.g.,
30 said protestant and 4 left it blank). Notice that there is a
total of 67 for the three responses considered Valid, a
total of 8 for the two types of responses considered
Missing, and an overall total of 75. The Percent column
indicates that 40.0% are protestant, 30.7% are catholic,
18.7% are not religious, 5.3% had one of several other
religions, and 5.3% left the question blank. The Valid
Percent column excludes the eight missing cases and is
often the column that you would use. Given this data set, it
would be accurate to say that, of those not coded as missing,
44.8% were protestant, 34.3% were catholic, and 20.9% were
not religious.
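The Percent, Valid Percent and Cumulative Percent columns can be reproduced from the frequencies alone. The sketch below does this in plain Python with the counts from the table; the variable and function names are chosen for this example and are not SPSS terms.

```python
# Counts taken from the frequency table for the religion variable.
valid = {"protestant": 30, "catholic": 23, "no religion": 14}
missing = {"other religion": 4, "blank": 4}

total = sum(valid.values()) + sum(missing.values())  # 75 respondents
valid_total = sum(valid.values())                    # 67 valid responses


def percentage(frequency, base):
    """Percentage = (frequency / base) x 100, rounded to one decimal place."""
    return round(frequency / base * 100, 1)


rows = []
running = 0
for category, frequency in valid.items():
    running += frequency
    rows.append((
        category,
        frequency,
        percentage(frequency, total),        # Percent
        percentage(frequency, valid_total),  # Valid Percent
        percentage(running, valid_total),    # Cumulative Percent
    ))

for row in rows:
    print(row)
```

The only difference between Percent and Valid Percent is the base: all 75 respondents in the first case, the 67 valid responses in the second.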
b) Frequency distribution graphs
With nominal data, you should not use a graphic that connects
adjacent categories, because nominal data have no
necessary ordering of the categories or levels. Thus, it is better to
make a bar graph or chart of the frequency distribution of variables
like religion, ethnic group, or other nominal variables; the points that
happen to be adjacent in your frequency distribution are not
necessarily adjacent.
c) Bar Charts
Bar graphs are usually used to display categorical
(qualitative) data. The bars are usually
separated, and the height of each bar shows the frequency
of that category.
To get a bar chart select
Graphs → Legacy Dialogs → Interactive → Bar chart → move variable to the box → OK
Fig.2. Bar chart for the nominal variable religion
ii. Summarizing numerical data
Simple numerical summaries of a numerical variable can be obtained
through
a) Five Figure Summary
The data can be summarized by quoting five figures once the data are
sorted into ascending numerical order and ranked from 1 (the smallest
value) to n (the largest value; n denotes the total number of
observations). These five figures are
Minimum value (Min): the smallest value, with rank 1
Maximum value (Max): the largest value, with rank n
Median (M/Q2): the middle value, with rank (n+1)/2
The median divides the data into two halves, each with the same
number of observations. Each of these halves may, in turn, be
divided into two by quartiles, so that the data are split into 4
quarters. These are known as:
Lower quartile (Q1): the middle value of the first half
Upper quartile (Q3): the middle value of the second half
Example 1: Department A absenteeism data. Consider the absenteeism
data for a department in an organization.
Department A: 20 employees
0 0 2 2 0 0 1 1 3 1 2 3 3 5 95 5 5 8 10 15
Step 1: Ascending order
0 0 0 0 1 1 1 2 2 2 3 3 3 5 5 5 8 10 15 95
Step 2: Ranking
Rank    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
Value   0   0   0   0   1   1   1   2   2   2   3   3   3   5   5   5   8  10  15  95
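The two steps above, followed by reading off the five figures, can be sketched in Python as follows. The function name `five_figure_summary` is hypothetical, and the handling of an odd number of observations (leaving the median out of both halves) is an assumption, since the slides do not say which convention to use. SPSS may interpolate quartiles differently, so its output need not match this rule exactly.

```python
from statistics import median


def five_figure_summary(values):
    """Min, Q1, median, Q3, max using the 'middle value of each half' rule."""
    data = sorted(values)           # Step 1: ascending order
    n = len(data)                   # Step 2: ranks run from 1 to n
    half = n // 2
    lower_half = data[:half]
    # Assumption: for an odd n, the median itself is left out of both halves.
    upper_half = data[half + n % 2:]
    return {
        "min": data[0],
        "q1": median(lower_half),
        "median": median(data),
        "q3": median(upper_half),
        "max": data[-1],
    }


absences = [0, 0, 2, 2, 0, 0, 1, 1, 3, 1, 2, 3, 3, 5, 95, 5, 5, 8, 10, 15]
summary = five_figure_summary(absences)
print(summary)
```

For the Department A data this gives Min = 0, Q1 = 1, Median = 2.5 (the average of the values ranked 10 and 11), Q3 = 5 and Max = 95.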
Boxplot
A boxplot is a quick method of summarizing and graphically representing
ordinal and scale data for examining one or more sets of data. It is also
called a box and whisker plot. It is useful to
• Summarize the data by getting the five figure summary
• Check the data for errors
• Examine and compare frequency distributions
• Check assumptions for inferential statistics (check normality of data)
Boxplot for one set of data
Graphs → Boxplot → in the Boxplot window select Simple and Summaries of separate variables → click Define → select the variable and move it into the Boxes Represent box → click OK
Case Processing Summary
                                          Cases
                          Valid           Missing          Total
                          N    Percent    N    Percent     N    Percent
math achievement test     75   100.0%     0    .0%         75   100.0%
[Boxplot labels, top to bottom: Upper whisker = Max; Upper Quartile = Q3; Median = Q2; Lower Quartile = Q1; Lower whisker = Min]
Interpretation
The case processing summary table shows a valid N of 75,
with no missing values out of a total sample of 75, for the variable
math achievement. The plot shows a boxplot for math
achievement. The box represents the middle 50% of the
cases, with the median (M = 13) inside it; the lower end of the box
shows the lower quartile (Q1 = 7.67), and the upper end of the box
shows the upper quartile (Q3 = 17.00). The whiskers indicate the
expected range (25.33) of scores, from the minimum (Min = -1.67) to
the maximum (Max = 23.67). Scores outside this range are considered
unusually high or low; such scores are called outliers. There
are no outliers in this case.
Histograms
A histogram is a form of bar graph used with a numerical (scale) variable, preferably of
a continuous nature. The intervals are shown on the X-axis, and the number of scores
in each interval is represented by the height of a rectangle located above the
interval. Unlike the bar graph, a histogram has no space between the
bars. The data are continuous, so the lower limit of any one interval is also the upper
limit of the previous interval. It is useful to
• Summarize the data
• Examine and compare frequency distributions
• Check normality of data
To draw a histogram select:
Graphs → Legacy Dialogs → Interactive → Histogram → move variable to the box → OK
Fig.3. Histogram of SAT math score
Interpretation
In Fig. 3 the frequencies (numbers of students) shown by the bars are for
ranges of points (in this case SPSS selected a range of 50: 250-299, 300-349,
350-399, etc.). Notice that the largest numbers of students (about 20)
had scores in the middle two bars of the range (450-499 and 500-549).
Similarly small numbers of students have very low and very high scores.
The bars in the histogram form a distribution (pattern or curve) that is
similar to the normal, bell-shaped curve. Thus, the frequency distribution
of the SAT math scores is said to be approximately normal.
Fig. 4 shows the frequency distribution for the competence scale.
Notice that the bars form a pattern very different from the normal curve
line. This distribution can be said to be not normally distributed. As we
see later in the chapter, the distribution is negatively skewed; that is, the
extreme scores, or the tail of the curve, are on the low end or left side.
Note how much this differs from the SAT math score frequency
distribution. As you will see in the Levels of Measurement section, we call
the competence scale variable ordinal.
Scatter plot
A scatter plot is a plot or graph of two variables that shows how an individual's
score on one variable is associated with his or her score on the other variable. Each
dot or circle on the plot represents a particular individual’s score on the
two variables, with one variable represented on the X axis and the other
on the Y axis. The measurement for both variables is continuous
(measurement data). It is useful to
• Gain insight into the relationship between two scale variables
• Check the assumption of linearity for correlation and regression statistics
• Locate outliers that are far from the regression line
Graphs → Legacy Dialogs → Interactive → Scatter plot → move “Scholastic aptitude” to the x-axis and competence scale to the y-axis → OK
Interpretation
The output shows a scatter plot for two scale variables, i.e. scholastic
aptitude test and competence scale. The overall pattern of the dots follows a
straight regression line running diagonally upward, showing a positive association
between the two variables. The points fit the line pretty well (r² = 0.08) and very few
values are dispersed far from the regression line, so it seems that there is a strong
relationship between scholastic aptitude test and competence scale.
Measures of Central Tendency (Average/ Location)
The central tendency of a data set refers to a measure of the
"middle", "central", or "average" value of the data set, a single
value that can represent the whole data set. It is also called a
measure of location. It includes the mean, the median, and the mode.
Mean is the arithmetic average of numerical data. It is an
appropriate measure of central tendency when there is little
fluctuation in the data and the values are consistent, with no
outliers. It is the most common measure of central tendency. It
is calculated by dividing the sum of the values (∑X) by the
number of values (n). Its formula is
$\bar{X} = \frac{\sum X}{n}$
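To see why the mean suits data without outliers, the formula can be applied to the Department A absenteeism data from Example 1, with and without the extreme value of 95. This is a minimal Python sketch; the helper name `mean` and the decision to drop the 95 for comparison are choices made for this illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by the number of values."""
    return sum(values) / len(values)


absences = [0, 0, 2, 2, 0, 0, 1, 1, 3, 1, 2, 3, 3, 5, 95, 5, 5, 8, 10, 15]
without_outlier = [x for x in absences if x != 95]

print(mean(absences))                   # 8.05
print(round(mean(without_outlier), 2))  # 3.47
```

A single outlier more than doubles the mean, so the mean on its own would misrepresent a typical employee in this department.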
Median is the middle value of the numerical data. It is an appropriate
measure of central tendency for ordinal data and for raw data that are less
consistent, with more fluctuations and outliers. It is the midpoint of a
distribution, so that the same number of scores lie above the median as below it.
It can be calculated by
• Arranging the data in ascending order
• Ranking them
• Locating the value at the middle rank using the formula
Median = the value at rank (n+1)/2
Mode is the most common category in the data. It is a measure of central
tendency for any kind of data, but it is most appropriate for categorical data,
preferably of a qualitative nature. For ordinal or scale data it generally
provides the least precise information about central tendency. Remember
that data can sometimes have more than one mode. The mode is defined as
Mode = the most frequent value
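Both measures can be checked with Python's standard `statistics` module, using the Department A absenteeism data from Example 1. The short list used to show two modes is made-up illustrative data, not taken from the slides.

```python
from statistics import median, multimode

absences = [0, 0, 2, 2, 0, 0, 1, 1, 3, 1, 2, 3, 3, 5, 95, 5, 5, 8, 10, 15]

# n = 20, so the middle rank is (20 + 1) / 2 = 10.5: the median is the
# average of the values ranked 10 and 11 (2 and 3).
print(median(absences))            # 2.5

# multimode returns every value tied for the highest frequency.
print(multimode(absences))         # [0]
print(multimode([1, 1, 2, 2, 3]))  # [1, 2] (illustrative data with two modes)
```

Unlike the mean, the median of 2.5 is not pulled upward by the single value of 95.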
To find the mean, median, and mode, click on
Analyze → Descriptive Statistics → Frequencies → move “Scholastic aptitude” to the Variables box → click on the Statistics tab → check Mean, Median, Mode → click OK
Fig.6. Mean, Median, Mode of SAT math score
N   Valid      75
    Missing     0
Mean       490.53
Median     490.00
Mode          500
Measures of Variability:
A measure of variability is a quantitative measure of the degree of variation
or dispersion of the values in a data set, such as the scores on one variable.
Standard deviation
Standard deviation is the most commonly used measure of variability. It
is, roughly, the average distance of the values from the mean of the data, and thus
shows how much variation there is in the data from the "average" (mean).
The formula for the standard deviation is as follows
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Example
Suppose we wish to find the standard deviation of the data set consisting
of the values 3, 7, 7, and 19.
Step 1: find the arithmetic mean (average) of 3, 7, 7, and 19:
$\bar{x} = (3 + 7 + 7 + 19)/4 = 36/4 = 9$
Step 2: find the deviation of each value from the mean by subtracting
the mean from the value, $(x - \bar{x})$:
$-6, -2, -2, 10$
Step 3: square each of the deviations to obtain $(x - \bar{x})^2$, which amplifies large
deviations and makes negative values positive:
$36, 4, 4, 100$
Step 4: find the average of those squared deviations by adding them up
and dividing by $n - 1$ to get the variance $s^2$:
$s^2 = (36 + 4 + 4 + 100)/(4 - 1) = 144/3 = 48$
Step 5: take the non-negative square root of the quotient (converting squared units
back to regular units):
$s = \sqrt{48} \approx 6.93$
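The five steps translate directly into a few lines of Python. The sketch below follows them one by one and then cross-checks the result against `statistics.stdev`, which also divides by n - 1; the variable names are chosen for this example.

```python
from math import sqrt
from statistics import stdev

values = [3, 7, 7, 19]
n = len(values)

mean = sum(values) / n                   # Step 1: 9.0
deviations = [x - mean for x in values]  # Step 2: [-6.0, -2.0, -2.0, 10.0]
squared = [d ** 2 for d in deviations]   # Step 3: [36.0, 4.0, 4.0, 100.0]
variance = sum(squared) / (n - 1)        # Step 4: 48.0
s = sqrt(variance)                       # Step 5: about 6.93

print(mean, variance, round(s, 2))
# Cross-check against the standard library's sample standard deviation.
print(round(stdev(values), 2))
```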
Interpretation of standard deviation
The standard deviation is calculated to measure the dispersion of the data from its mean ($\bar{x} = 9$).
The standard deviation (s = 6.93) shows that the average distance of the
values from the mean is 6.93, which suggests that most of the values fall in the range
9 ± 6.93 ($\bar{x} \pm s$), that is, from 2.07 to 15.93.
A standard deviation of zero means that the data values are all clustered at one point, i.e. the mean. A
low standard deviation indicates that the data points tend to be very close to the mean,
whereas a high standard deviation indicates that the data are spread out over a large range of
values.
For data with a symmetric and approximately normal distribution it can be shown that
Other measures of Variability
Besides the standard deviation, there are some other
measures of variability, as follows.
Range - The range is the difference between the highest and
lowest scores in a distribution. It is the simplest measure of
variability to compute and understand, but it is not
often used as the sole measure of variability because of its
instability. Because it is based solely on the most extreme
scores in the distribution, it does not fully reflect the pattern
of variation within the distribution; hence the range is a very
limited measure of variability.
Range = Max - Min
Interquartile Range (IQR) - The interquartile range is the
range of the middle 50% of a distribution. Because any outliers
in a distribution must lie at its ends, the range as a measure
of dispersion can be strongly influenced by outliers.
One solution to this problem is to eliminate the ends of the
distribution and measure the range of scores in the middle.
Thus, with the interquartile range we eliminate the bottom
25% and top 25% of the distribution, and then measure the
distance between the extremes of the middle 50% of the
distribution that remains.
IQR = Q3 - Q1
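The difference in how the range and the IQR react to an outlier can be seen on the Department A absenteeism data from Example 1. The sketch below is in Python; the helper names are hypothetical, and the quartiles use the "middle value of each half" rule with the median left out of both halves when n is odd, which is an assumption rather than something the slides specify.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as the middle values of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    # Assumption: for an odd n, the median is left out of both halves.
    return median(data[:half]), median(data[half + n % 2:])


def value_range(values):
    """Range = Max - Min."""
    return max(values) - min(values)


def iqr(values):
    """IQR = Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1


absences = [0, 0, 2, 2, 0, 0, 1, 1, 3, 1, 2, 3, 3, 5, 95, 5, 5, 8, 10, 15]
without_outlier = [x for x in absences if x != 95]

print(value_range(absences), iqr(absences))                # 95 4.0
print(value_range(without_outlier), iqr(without_outlier))  # 15 4
```

Removing the single value of 95 shrinks the range from 95 to 15, while the IQR stays at 4.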
Variance - The variance is a measure based on the deviations
of individual scores from the mean. As noted in the definition
of the mean, however, simply summing the deviations will
result in a value of 0. To get around this problem the variance
is based on squared deviations of scores about the mean.
Table 4 Selection of Appropriate Descriptive Statistics and Plots
The Normal Curve
The frequency distributions of many of the variables used in
the behavioral sciences are distributed approximately as a
normal curve when N is large.
Properties of Normal Curve
• The mean, median and mode are equal.
• It has one “hump” and this hump is in the middle of the distribution.
• The curve is symmetric. If you fold the normal curve in half, the right side
  would fit perfectly with the left side; that is, it is not skewed.
• The range is infinite.
• The curve is neither too peaked nor too flat and its tails are neither too
  short nor too long.
Data File Management
Data file management involves different methods for transforming the data
into the form needed to answer the research questions. Data file
management can be quite time consuming, especially if you have a lot of
questions/items that you combine to compute the summated or composite
variables that you want to use in later analysis.
1. Count Math Courses Taken
2. Recode and Re-label
   • revise
   • reverse
3. Compute Variable
SUPERIOR GROUP OF COLLEGES