Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
In the name of Allah Kareem, Most Beneficent, Most Gracious, the Most Merciful ! Introduction to SPSS Before further processing of the data we should get to know about SPSS software first . SPSS SPSS stands for “statistical package for social sciences”. It is basically used for the analysis of quantitative data . How to open SPSS Start Menu Programs SPSS Inc. SPSS 16.0 3 Opening the SPSS 4 SPSS Interface Title Bar Menu Bar Tool Bar Variable definition criteria Serial Number / Cases Work sheet SPSS Views 5 6 7.3 Data Entry After defining the variables enter the data in data view for each case (row wise) against each variable (column wise) 7 Data Preparation (Processing) After getting your survey completed (Sample attached as annexure 1) and knowing the interface of the SPSS the next step in quantitative research process is to prepare the data for analysis. This process involves four steps. 1.Coding the questionnaire. 2.Defining the variables in SPSS variable view. 3.Entering the data in SPSS data view. 8 Data Processing After collecting the data, data processing is started that involves 1. Data coding 2. Defining the variables 3. Data entry in the software 9 Coding Coding is the process of assigning numbers to the values or levels of each variable. Rules of Coding 1. 2. 3. 4. 5. 6. 7. All data should be numeric. Each variable for each case or participant must occupy the same column. All values (codes) for a variable must be mutually exclusive. Each variable should be coded to give maximum information. For each participant, there must be a code or value for each variable. Apply any coding rule consistently for all participants. Use high numbers (values or codes) for the “agree,” “good,” or “positive” end of a variable that is ordered 10 SAMPLE QUESTIONNAIRE Please circle or supply your answer ID_________ SD SA 1. I would recommend this course to other students 1 2 3 4 5 2. I worked very hard in this course 1 2 3 4 5 3. My college is : Arts & sciences___ _ Business____ Engineering____ 4. My gender is M F 5. My GPA is _____________ 6. For this class, I did: (Check all that apply The reading The homework Extra credit 11 7.2 Defining the variables In SPSS first of all the variables are defined in variable view This includes 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Name of the variable (Short without space) Type (Numeric, String) Width (8, 10, etc) Decimals (2, 3, 5 etc) Label (Full name of the variable) Values (answer categories with codes) Missing (blank, multiple, wrong answers) Columns (6, 8, 10 etc) Align (Left, right, centre) Measure (Nominal, ordinal, scale) 12 Data Analysis 1. Data analysis is a process of organizing, summarizing, presenting, interpreting, and drawing conclusions based on data with the goal of highlighting useful information, and supporting decision making. In quantitative research data analysis is performed objectively using statistical techniques. 2. Statistics is a branch of applied mathematics concerned with the collection and interpretation of quantitative data to draw conclusions and test (accept or reject) hypothesis. There are two levels/types of statistics 1. Descriptive statistics 2. Inferential statistics 13 Descriptive Statistics Descriptive statistics are the statistics that are used to understand and describe the data. They are used to answer the descriptive type of research questions. It involves 1. 2. 3. 4. 5. 6. Summarizing the data Measure of central tendency Measure of dispersion Checking data normality Data file management Recode and transform variables 14 Descriptive Statistics 1. Summarizing the data A data matrix contains too much information to be taken in at a glance due to which it becomes difficult to understand and get feel of the data. A set of data can be understood only if it is summarized in some appropriate way. i. Summarizing categorical data A categorical variable is usually summarized in frequencies and there percentages. This process is called Frequency distribution. It can be presented in two ways that are in the form of • Tables of frequency and percentages or • Graphs 15 Descriptive Statistics a. Frequency Distribution A frequency distribution is a tally (IIII) or count of the number of times each score (category) on a single variable is marked by respondents. A frequency can be further summarized by expressing them as percentages of the total using following formula Percentage = (frequency/total) X100 To get a frequency distribution Table: Analyze Descriptive Statistics Frequencies move religion to the variable box OK (make sure that the Display frequency tables box is checked) 16 Frequency Distribution Table Valid protestant catholic no religion Total Missing Frequency Percent Valid Percent Cumulative Percent 30 40.0 44.8 44.8 23 30.7 34.3 79.1 14 18.7 20.9 100.0 67 89.3 100.0 4 5.3 4 5.3 8 10.7 75 100.0 other religion blank Total Total 17 Interpretation: In this example, there is a Frequency column that shows the numbers of students who marked each type of religion (e.g., 30 said protestant and 4 left it blank). Notice that there are a total of (67) for the three responses considered Valid and a total (8) for the two types of responses considered to be Missing as well as an overall total (75). The Percent column indicates that 40.0% are protestant, 30.7% are catholic, 18.7% are not religious, 5.3% had one of several other religions, and 5.3% left the question blank. The Valid Percentage column excludes the eight missing cases and is often the column that you would use. Given this data set, it would be accurate to say that of those not coded as missing, 44.8% were protestant and 34.3% catholic and 20.9% were not religious. 18 Descriptive Statistics b) Frequency distribution graphs With Nominal data, you should not use a graphic that connects adjacent categories because with nominal data, there is no necessary ordering of the categories or levels. Thus, it is better to make a bar graph or chart of the frequency distribution of variables like religion, ethnic group, or other nominal variables; the points that happen to be adjacent in your frequency distribution are not by necessarily adjacent. c) Bar Charts bar graphs are usually used to display "categorical qualitative data", the bars in bar graphs are usually separated and the height of the bars shows the frequency of that category. 19 Descriptive Statistics To get a bar chart select Graphs legacy dialogues interactive bar chart move variable to the box ok Fig.2. Bar chart for the nominal variable religion 20 Descriptive Statistics 1. Summarizing Numerical Data Simple numerical summaries of a numerical variable can be obtained through a)Five Figure Summary the data can be summarized by quoting five figures if the data is first sorted into (ascending) numerical order These five figures are Minimum value (Min) —the smallest value, with rank 1 Maximum value (Max) — the largest value, with rank n, and Median (M/Q2) —The middle value, with rank (n+1)/2 The median divides the data into two halves, each with the same number of observations. Each of these halves may, in turn, be divided into two by quartiles, so that the data is split into 4 quarters. These are known as: Lower quartile (Q1) —The middle value of first half. Upper quartile (Q3) —The middle value of second half Rank the values from 1 (the smallest value) to n (the largest value; n denotes the total number of observation). 21 Descriptive Statistics 1. Summarizing Numerical Data Simple numerical summaries of a numerical variable can be obtained through a)Five Figure Summary the data can be summarized by quoting five figures if the data is first sorted into (ascending) numerical order These five figures are Minimum value (Min) —the smallest value, with rank 1 Maximum value (Max) — the largest value, with rank n, and Median (M/Q2) —The middle value, with rank (n+1)/2 The median divides the data into two halves, each with the same number of observations. Each of these halves may, in turn, be divided into two by quartiles, so that the data is split into 4 quarters. These are known as: Lower quartile (Q1) —The middle value of first half. Upper quartile (Q3) —The middle value of second half Rank the values from 1 (the smallest value) to n (the largest value; n denotes the total number of observation). 22 Descriptive Statistics Example 1: Department An absenteeism data. Consider the absenteeism data for a department in an organization Department A: 20 employees 0 0 2 2 0 0 1 1 3 1 2 3 3 5 95 5 5 8 10 15 Step 1-Ascending order 0 0 0 0 1 1 1 2 2 2 3 3 3 5 5 5 8 10 15 Step 2 Ranking 95 Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 95 Value 0 0 0 0 1 1 1 2 2 2 3 3 3 5 5 5 8 10 15 95 23 Descriptive Statistics Boxplot A boxplot is a quick method of summarizing and graphically representing ordinal and scale data for examining one or more sets of data. It is also called box and whisker plot. It is useful to • • • • Summarize the data by getting five figure summary Check the data for errors Examine and compare frequency distributions Check assumption for inferential statistics (Check normality of data) Boxplot for one set of data Graphs Boxplot in boxplot window select simple and summaries of separate variables click define select the variable and move it into the boxes represent box click ok 24 Case Processing Summary Cases Valid N math achievement test Missing Percent 75 100.0% N Total Percent 0 .0% N Percent 75 100.0% Upper whisker =Max Upper Quartile = Q3 Median = Q2 Lower Quartile = Q1 Lower whisker =Min 25 Interpretation The case processing summary table shows the valid N=75, with no missing values for total sample of 75 for the variable math achievement. The plot shows a box plot for math achievement. The box represents the middle 50% of the cases (M=13), lower end of the box shows lower quartile (Q1=7.67), and upper end of the quartile shows upper quartile (17.00). The whiskers indicate the expected range (25.33) of scores from minimum (Min=-1.67) to Maximum (Max=23.67). Scores outside of this range are considered unusually high or low, such scores are called outliers. There are no outliers for in this case. 26 Descriptive Statistics Histograms Histogram is a form of a bar graph used with numerical (scale) variable preferably of continuous nature. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a rectangle located above the interval. Unlike the bar graph, in a histogram there is no space between the bars. The data is continuous so the lower limit of any one interval is also the upper limit of the previous interval. It is useful to • Summarize the data Examining and comparing frequency distributions • Check normality of data To draw a histogram select: Graphs legacy dialogues variable to the box OK interactive histogram move 27 Fig.3. Histogram of SAT- math score 28 Interpretation In fig. 3 the frequencies (number of students), shown by the bars are for a range of points (in this case SPSS selected a range of 50: 250-299, 300349, 350-399, etc). Notice that the largest number of students (about 20) had scores in the middle two bars of the range (450-499 and 500-549). Similar small numbers of students have very low and very high scores. The bars in the histogram form a distribution (pattern or curve) that is similar to the normal, bell shaped curve. Thus, the frequency distribution of the SAT math scores is said to be approximately normal. In fig. 4 shows the frequency distribution for the competence scale. Notice that the bars form a pattern very different from the normal curve line. This distribution can be said to be not normally distributed. As we see later in the chapter, the distribution is negatively skewed. That is, extreme scores or the tail of the curve are on the low end or left side. Note how much this differs from the SAT math score frequency distribution. As you will see in the Levels of Measurement section, we call the competence scale variable ordinal. 29 Descriptive Statistics Scatter plot Scatter plot is a plot or graph of two variables that shows how the score on one variable associates with his or her score on the other variable. Each dot or circle on the plot represents a particular individual’s score on the two variables with one variable being represented on the X axis and the other on the Y axis. The measurement for both variables is continuous (measurement data). It is useful to • Gain insight into the relationship between two scale variables. • To check the assumptions of linearity for correlation and regression statistics • To locate the outliers that are far from the regression line. Graphs legacy dialogues interactive scatter plot move “Scholastic aptitude” to the x-axis and move competence scale to y-axis OK 30 Interpretation The output shows a scatter plot for two scale variables i.e. scholastic aptitude test and competence scale The overall pattern of the dots show that it is from diagonal upward straight regression line showing positive association between the two variables and the points fit the line pretty well (r2= 0.08) and there are very few values dispersed far from the regression line so it seems that there is strong relationship between scholastic aptitude test and competence scale 31 Descriptive Statistics Measures of Central Tendency (Average/ Location) Central tendency of a data set refers to a measure of the "middle, central or average" value of the data set in order to find out the only one value that can represent the whole data set. It is also called measure of the location. It includes Mean is the arithmetic average of numerical data. It is an appropriate measure of central tendency when there is less fluctuation in data and values are more consistent with no outliers. It is the most common measure of central tendency. It can be calculated by dividing sum of the values (∑ X) with the number of values (n). Its formula is X = ∑ X /n 32 Descriptive Statistics Median is the middle value of the numerical data. It is an appropriate measure of central tendency for ordinal raw data is less consistent with more fluctuations and outliers. It is the midpoint of a distribution that the same numbers of scores are above the median as below it. It can be calculated by Arranging the data in ascending order Ranking them And locating the value at middle rank using the formula as under X = (n+1)/2th value Mode is the most common category in the data. It is a measure of central tendency for any kind of data but it is most appropriate for categorical data preferably of qualitative nature. It generally provides the least precise information about central tendency in case of categories of ordinal or scale data. Remember that some time data can have more than one mode. Mode is denoted by X = the most frequent value 33 Descriptive Statistics To find out mean, median, and mode click on Analyze Descriptive statistics frequencies move the “Scholastic aptitude” to the variables box click on statistics tab check mean median mode click OK Fig.6. Mean, Median Mode of SAT math score N Valid Missing Mean Median Mode 75 0 490.53 490.00 500 34 Descriptive Statistics Measures of Variability: Measure of variability is the quantitative measure of the degree of variation or dispersion of values in a data set including score of one variable. Standard deviation Standard deviation is the most commonly used measure of the variability. It is the average distance of the values from the mean of data and thus shows how much variation is there in the data from the "average" (mean). The formula for standard deviation is as follows 35 Descriptive Statistics Example Suppose we wished to find the standard deviation of the data set consisting of the values 3, 7, 7, and 19. Step 1: find the arithmetic mean (average) of 3, 7, 7, and 19, Step 2: find the deviation of each number from the mean, by subtracting the mean from values (x-x) Step 3: square each of the deviations to obtain (x-x)2 , which amplifies large deviations and makes negative values positive, =48 Step 4: find the average of those squared deviations by adding them up and dividing by n-1 to get the variance s2) 4–1 Step 5: take the non-negative square root of the quotient (converting squared units back to regular units), S= 36 Descriptive Statistics Interpretation of standard deviation In order to measure the dispersion of the data from its mean (x = 9) standard deviation is calculated. The standard deviation (s=6.93) shows that the average distance of the values from the means is 6.93 which relates that the most of the values falls in the range of 9 ± 6.93 (x±s) that is from 2.07 to 15.93. Zero Standard deviation means that the data values are clustered at one point i.e. mean. A low standard deviation indicates that the data points tend to be very close to the mean, whereas high standard deviation indicates that the data are spread out over a large range of values. For data with a symmetric and approximately normal distribution it can be shown that 37 Descriptive Statistics Other measures of Variability Besides standard deviation there are also some other measures of variability that are as follows Range - The range is the difference between the highest and lowest score in a distribution. It is the simplest measure to compute and understand variability of the data but it is not often used as the sole measure of variability due to its instability. Because it is based solely on the most extreme scores in the distribution and does not fully reflect the pattern of variation within a distribution, hence the range is a very limited measure of variability. Range = Max - Min 38 Descriptive Statistics Interquartile Range (IQR) - The interquartile range is the range of the middle 50% of a distribution. Because any outliers in our distribution must be on the ends of the distribution, the range as dispersion can be strongly influenced by outliers. One solution to this problem is to eliminate the ends of the distribution and measure the range of scores in the middle. Thus, with the interquartile range we will eliminate the bottom 25% and top 25% of the distribution, and then measure the distance between the extremes of the middle 50% of the distribution that remains. IQR = Q3 - Q1 Variance - The variance is a measure based on the deviations of individual scores from the mean. As noted in the definition of the mean, however, simply summing the deviations will result in a value of 0. To get around this problem the variance is based on squared deviations of scores about the mean. 39 Table 4 Selection of Appropriate Descriptive Statistics and Plots 40 Descriptive Statistics The Normal Curve The frequency distributions of many of the variables used in the behavioral sciences are distributed approximately as a normal curve when N is large. Properties of Normal Curve The mean, median and mode are equal. It has one “hump” and this hump is in the middle of the distribution. The curve is symmetric. If you fold the normal curve in half, the right side would fit perfectly with the left side; that is, it is not skewed. The range is infinite. The curve is neither too peaked nor too flat and its tails are neither too short nor too long. 41 Descriptive Statistics Data File Management Data file management involves different methods for data transformations into the form needed to answer the research questions. Data file management can be quite time consuming especially if you have a lot of questions/items that you combine to compute that summated or composite variables that you want to use in later analysis. 1. Count Math Courses Taken 2. Recode and Re-label revise reverse 3. Compute Variable 42 SUPERIOR GROUP OF COLLEGES 43 43