Navigator IBM SPSS Statistics (A Vade Mecum for SPSS 19 on Windows 7)

Preface

This is an elementary guide intended for navigating IBM SPSS v19 on the Windows platform. Little or no previous knowledge of SPSS is assumed. The guide is to be treated as an introduction to the interface and the tools available in IBM SPSS before embarking on statistical applications and their interpretations. Brief comments on data properties and the requirements for applying specific statistical techniques are outlined in the appendix.

September 2011
Royal Holloway, University of London
P Pal

Further Reading: Frequently Applied Statistical Techniques; Multivariate Statistical Analysis (P Pal)

Contents

Working with the Input Editor
    Data View
    Variable View
    Data Manipulations
    Pre-processing of Raw Data
    Compute
    Re-code
    Case Selection: Simple Condition; Conjugate Condition
    Split File
Working with the Output Editor
    Controlling Output
    Exporting Output to another Application
Appendix
    Basic Statistics
    Uncertainties
    Frequently Applied Statistical Techniques
    Mathematical Models

NAVIGATOR IBM SPSS 19

§ Working with the Input Editor

When SPSS is opened, the first window that appears on the screen is the Input Editor (Data View). At the top of this Editor two bars appear: the standard menu bar and the standard tool bar. The main body consists of a spreadsheet with rows and columns. Like most application packages under the Windows OS, there are arrows on the right-hand and bottom margins of the SPSS spreadsheet for scrolling up, down or sideways as required.
At the bottom left-hand corner, two tabbed buttons are located: Data View and Variable View.

The standard menu bar shows the following menu items:

Item        Used for                                                  Usage
File        opening, saving, printing files                           extensive
Edit        copy, paste, editing files                                extensive
View        controlling the appearance of the software                less extensive
Data        organizing the data file, code labelling of data values   extensive
Transform   recode, compute, defining new variables                   extensive
Analyze     statistical applications                                  extensive
Graphs      drawing different types of graphs                         extensive
Add-ons     plug-in additional applications                           less frequent
Utilities   accessing command languages                               less frequent
Window      activating a particular window                            seldom
Help        accessing SPSS help facilities on applications            extensive

Exhibit 1 The Input Editor (Data View)

The tool bar below the standard menu bar contains short-cut tools for some of the standard menu functions. The function of an individual icon may be displayed by anchoring the mouse pointer on it.

Exhibit 1 shows 5 sets of data entered in 3 columns. By default, each value is given 2 decimal places as it is entered. The three active columns containing values are automatically labelled VAR00001, VAR00002 and VAR00003 as values are typed in. The remaining columns, faintly labelled var, remain passive with blank cells.

Clicking the Variable View tab opens the Input Editor in its Variable View.

Exhibit 2 Input Editor (Variable View)

This window lists the variables entered in the Input Editor (Data View), together with information about their types and formats, including names and other properties such as character width and decimal places. In the above Exhibit, the notable elements are: 3 variables with their names, their type (numeric or string; here numeric), width (8 characters), decimals (2 decimal places), and the alignment of data values (right-aligned).

§ Manipulation of Input Data

Different techniques are available for the manipulation of input data. In the following sections, a selection of frequently used manipulation methods is outlined. These include: changing variable names, defining data properties, labelling code values, and increasing or decreasing decimal places.

Changing variable names

Steps to change variable names (header labels):
Anchor the mouse pointer on the required cell in the Name column (column 1 of the Variable View Editor).
Type the desired name.
Repeat the process for all the variable cells as required.
Variable names must begin with a letter, contain no spaces, and have up to 8 characters.

Data manipulation: Usually the width of data values should not require any change. However, the default of 2 decimal places may need changing.

Steps to change decimal places in the data values:
Anchor the mouse pointer on a cell in the Decimals column (an up/down arrow scroll button appears).
Click the Up arrow to increase, or the Down arrow to decrease, the decimal places in the data values.

Steps to change value labels:
Values in some variables may consist of discrete numeric codes representing class names. These codes may be given appropriate value labels.
Click on Data in the menu bar and choose Define Variable Properties (the variable scanning window appears).

Exhibit 3 Variable scanning window

Steps (contd):
Click on the desired variable in the list and transfer it to the right-hand box.
Click Continue; the next window opens (see the Exhibit below).
Click on the variable appearing on the left.
Anchor the mouse pointer in the Label box (top right-hand corner) and type a fuller label as desired.
Type the labels in the boxes next to each code value.
Finally, OK.

Exhibit 4 Value edit window
§ Pre-processing of raw data

Pre-processing of raw data may consist of (i) direct arithmetic calculation, (ii) application of an in-built function, e.g. the LG10, LN, exponential or trigonometric functions, or (iii) application of a statistical probability function, e.g. a CDF (cumulative distribution function).

Direct calculation may be based on a formula or an equation. The formula expression may contain variables (entered as input data) and constants. A variety of in-built functions are offered for transformation of raw data values. Any specific function may be selected from a list of functions. Data values contained in selected variable(s) from the input list are used as the argument(s) of the selected function.

(a) Pre-processing by direct calculation using an expression

Here constant numeric values, data variables, and operator symbols from the numeric key pad of the Compute dialog box are used.

Example 1: Calculate AGE from a raw data set containing the Month and Year of birth, with reference to the year 2011. The suggested formula is

AGE = (2011 - Year) + (12 - Month)/12

Here AGE is the target variable into which the computed result values go, and Year and Month are provided as raw data input variables (see Exhibit 5).

Exhibit 5 Compute dialog box

The operators (+, -, *, /), parentheses and numeric constants are available from the key pad displayed in the Compute box; the input variables list appears in the left-hand pane.

Steps to compute a new variable:
Transform -- Compute (the Compute dialog box appears).
Enter the desired name in the Target box.
Enter the desired formula in the Numeric Expression box.
Finally, OK.

Note: In the Transform -- Compute process (SPSS 19), when the entries are made from the keyboard, they are recorded as a syntax document. This syntax document appears after the final OK is clicked. The sequence is illustrated with the following simple computational example.
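As an aside, the arithmetic that the Compute step in Example 1 performs case by case can be sketched in ordinary Python. The function name and the sample birth dates below are illustrative assumptions, not part of SPSS:

```python
# A sketch of the AGE computation from Example 1:
# AGE = (2011 - Year) + (12 - Month) / 12, applied to each case.

def compute_age(year, month, reference_year=2011):
    """Approximate age at the end of reference_year from birth month and year."""
    return (reference_year - year) + (12 - month) / 12

# A few illustrative cases (hypothetical input rows of Year, Month):
rows = [(1980, 6), (1975, 12), (1990, 1)]
ages = [compute_age(y, m) for y, m in rows]
print([round(a, 2) for a in ages])  # [31.5, 36.0, 21.92]
```

Each result is what the target variable AGE would receive for that case.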
Example: Conversion of Centigrade values to the Fahrenheit scale.

Formula: Fahrenheit = (9/5)*Centigrade + 32

In the Compute process, we choose Fahrenheit as the target variable and, in the Numeric Expression box, type the formula by picking the appropriate symbols and operators; finally, OK. The syntax representation then appears on the screen. This is shown in the Exhibit.

Exhibit 6 Syntax file in the foreground and data file in the background

The syntax representation may be closed, and the results (in this case the Fahrenheit-scale values) appear as a new data variable, as follows.

Exhibit 7 The results after pre-processing are shown as the target variable.

(b) Computation by using a function

For computation with any in-built function on the data values, select the required function from the function group list (top half of the box) and select the particular function from the functions and special variables group (bottom half).

Example 2: Compute (1) Log10 of AGE and (2) the natural log of AGE.

The formulas for these computations are as follows:

LOG_AGE = LG10(AGE)
LNAGE = LN(AGE)

Here LOG_AGE and LNAGE are the target variables for the Log10 and natural log transformations respectively, and LG10 and LN are the corresponding functions to be applied to the variable AGE.

Steps:
Transform -- Compute (the Compute dialog box appears).
Enter the desired name in the Target box.
Enter the desired function in the Numeric Expression box.
Enter the required argument from the variable list.
Finally, OK.

(To enter a function, click on the up-arrow button at the bottom right-hand corner of the key pad. The function is displayed and requires appropriate arguments. From the input variable list, highlight the appropriate variable and enter it as the required argument.)

Exhibit 8 Using functions on variables

These pre-processed results (after applying the chosen functions) appear in the Input Editor as new variables with the names given as the target variables in the Compute process.
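For comparison, the LG10 and LN transformations of Example 2 can be mimicked in Python with the standard math module. The AGE values below are illustrative, not taken from any SPSS exhibit:

```python
import math

# Python equivalents (a sketch) of the SPSS LG10 and LN functions,
# applied element-wise to an AGE variable.
ages = [23.0, 19.0, 33.0]                  # illustrative AGE values

log_age = [math.log10(a) for a in ages]    # LG10(AGE)
ln_age = [math.log(a) for a in ages]       # LN(AGE)

print([round(v, 4) for v in log_age])
print([round(v, 4) for v in ln_age])
```

As in SPSS, each transformed list would populate a new target variable alongside the original AGE column.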
Exhibit 9 Input Editor displaying computed values

§ Re-code

The Re-code tool is used to hold new (re-coded) values derived from an existing variable. The re-code mechanism allows a categorical variable to be formed from an existing (old) variable containing continuous data.

Steps to re-code:
Transform -- Re-code (select the Into Different Variables option); the Re-code dialog box appears.

Exhibit 8 Re-code dialog box

Steps (contd):
Select the variable to be re-coded from the input variable list.
Type an output variable name in the Output Variable box and click Change.
Click the Old and New Values button (the Old and New Values box appears).

Exhibit 10 Old and New Values dialog box

In this box three range selection options are available:
(i) Range: ---- through ----
(ii) Range: Lowest through ----
(iii) Range: ---- through Highest

Steps (contd):
Enter the limits of each range in the appropriate blank boxes.
Type a desired value in the New Value box and click Add (this new value is displayed in the Old --> New box).
Continue (return to the previous dialog box).
Finally, OK.

The re-coded values appear in a new column in the Input Editor.

Exhibit 11

§ Selection of a sub-set of data

It is often necessary to analyse the values contained in a variable by sub-groups, with the sub-sets specified in terms of some categorical variable; two variables are involved in this process. The Select Cases tool selects only the desired sub-set from the full data file. The selection may be based on simple conditions or compound conditions. (Analysis of the full file by sub-groups is handled by the Split File tool, described later, which organises the output by the specified sub-groupings.)

(a) Case selection with a simple condition

Example 3: Given a sample of candidates with their gender and marital status, select (a) the female candidates and (b) the female candidates who are married.
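The effect of a Re-code (into a different variable) can be sketched in Python. The AGE cut-points and new codes below are illustrative assumptions, mirroring the "Lowest through", "through", and "through Highest" range options:

```python
# A sketch of re-coding continuous AGE values into discrete category codes.
# Cut-points and codes are illustrative, not from the SPSS exhibits.

def recode_age(age):
    if age <= 25:         # Lowest through 25   -> 1
        return 1
    elif age <= 35:       # 26 through 35       -> 2
        return 2
    else:                 # 36 through Highest  -> 3
        return 3

ages = [23.0, 19.0, 33.0, 42.0, 26.0]
print([recode_age(a) for a in ages])  # [1, 1, 2, 3, 2]
```

The resulting codes play the role of the new output variable that appears as an extra column in the Input Editor.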
Gender  Maritals  Age
  1        2      23.00
  2        2      19.00
  2        2      33.00
  2        2      42.00
  1        1      26.00
  1        1      28.00
  1        2      39.00
  2        2      27.00
  2        2      18.00
  2        1      31.00
  1        2      23.00
  2        2      19.00
  2        2      33.00
  2        2      42.00
  1        1      26.00
  1        1      28.00
  1        2      39.00
  2        2      27.00
  2        2      18.00
  2        1      31.00

Codes: Gender: male = 1, female = 2. Maritals: unmarried = 1, married = 2.

Select the female cases only (i.e. Gender = 2).

Steps:
Data -- Select Cases (the Select Cases window appears).
This window shows the list of all the variables entered (left-hand pane), with All Cases selected as the default.
Choose the "If condition is satisfied" option by clicking the adjacent circle to open the next window (see the Exhibit).
Finally, OK.

Exhibit 12 Select Cases conditions dialog box. The exhibit shows all cases selected.

Exhibit 13 Condition applied

Steps (contd):
Choose the appropriate variable (for the given example, this is Gender).
Transfer it to the right-hand box.
Paste the = symbol, followed by 2 (i.e. Gender = 2).
Continue (return to the previous window). This window now shows the instruction next to the "If condition is satisfied" option.
Finally, OK.

The result of applying the condition is shown in the Input Editor: the selected cases are retained and the remaining cases are crossed out (see the left-hand margin and the filter_$ column).

Exhibit 14 Input Editor showing the selection with a simple condition

(b) Case selection with a compound condition

With the same example, select the cases of female and married candidates. This is a compound condition. In order to apply it, two variables are needed: (i) Gender, from which the condition Gender = 2 is employed, and (ii) Maritals, from which Maritals = 2 is employed. The two separate conditions are combined by the & conjunction. See the Exhibit below.

Exhibit 15 Selection with conjugate conditions

As a result, the Input Editor displays all cases which satisfy the compound condition and crosses out the remaining cases. See the margin and the filter_$ column of the Editor.
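The simple and compound conditions of Example 3 can be sketched in Python as list filters. The ten cases below follow the first ten rows of the example data, with the same codes (male = 1, female = 2; unmarried = 1, married = 2):

```python
# Select Cases sketched in Python: the simple condition Gender = 2 and the
# compound condition Gender = 2 & Maritals = 2.
# Each case is a tuple (gender, maritals, age).

cases = [
    (1, 2, 23.0), (2, 2, 19.0), (2, 2, 33.0), (2, 2, 42.0), (1, 1, 26.0),
    (1, 1, 28.0), (1, 2, 39.0), (2, 2, 27.0), (2, 2, 18.0), (2, 1, 31.0),
]

females = [c for c in cases if c[0] == 2]                         # simple
married_females = [c for c in cases if c[0] == 2 and c[1] == 2]   # compound

print(len(females), len(married_females))  # 6 5
```

The unfiltered cases correspond to the rows that SPSS crosses out in the left-hand margin.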
Exhibit 16 Input Editor showing the sub-set selected with the conjugate condition

§ Split Files

It is often necessary to analyse data from a data file (the full set) by different categories (sub-sets). This is performed by specifying the sub-sets with an appropriate defining variable which contains the category codes.

Example: Given a sample of 10 male and female candidates with their ages, calculate (a) the average age of all the candidates, (b) the average age of the male candidates, and (c) the average age of the female candidates. Note: (b) and (c) are split-file cases.

Gender  Age
  1     42.00
  2     34.00
  2     32.00
  2     23.00
  1     28.00
  1     31.00
  1     43.00
  2     36.00
  1     27.00
  1     42.00

Steps to split a file:
From the Data menu, select Split File (the Split File box appears).
Choose "Organize output by groups", then OK.
Note: the data file is sorted by the grouping variable.

Exhibit 17

Results:

(a) Average age of all candidates

Exhibit 18 Average age of all the candidates; the average age is 33.80.

(b) Average age of the male and female candidates

Exhibit 20 (the results in the top window)

Appendix

§ Measurement scales

The majority of statistical analyses involve simple arithmetic operations on the data values. However, the types of admissible operations depend on the measurement scale of the data. Measurement scales are of the following types:

Scale                  Data type     Analysis type
Nominal                Discrete      Non-parametric
Ordinal/Rank/Likert    Discrete      Non-parametric
Interval/Ratio         Continuous    Parametric

Relative strength of the 3 measurement scales:

Weak <-----------------------------------------> Strong
Nominal        Ordinal/Rank/Likert        Interval/Ratio

Transformation from Interval/Ratio to Ordinal scale is possible by grouping Interval-scale values into discrete groups within ranges defined by the researcher (cf. the Re-code application). Transformation from Ordinal scale towards Interval scale is done by applying various transformations to the Ordinal-scale data, e.g. a log, square-root or inverse transform.
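The split-file averages for the 10-candidate example can be sketched in Python, grouping the ages by the gender codes (male = 1, female = 2):

```python
# Split File sketched in Python: average age overall and by gender
# for the 10-candidate sample.
from collections import defaultdict

gender = [1, 2, 2, 2, 1, 1, 1, 2, 1, 1]
age = [42.0, 34.0, 32.0, 23.0, 28.0, 31.0, 43.0, 36.0, 27.0, 42.0]

overall = sum(age) / len(age)

groups = defaultdict(list)
for g, a in zip(gender, age):
    groups[g].append(a)
means = {g: sum(v) / len(v) for g, v in groups.items()}

print(round(overall, 2))   # 33.8  (average of all candidates)
print(round(means[1], 2))  # 35.5  (male average)
print(round(means[2], 2))  # 31.25 (female average)
```

The overall mean of 33.80 agrees with the result quoted for Exhibit 18; the per-group means are what "Organize output by groups" produces.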
Admissible Statistical Analyses

In general, the non-parametric variables (Nominal or Ordinal/Rank scale) define categories. Hence these variables are called categorical, classification, grouping or factor variables, containing so-called non-metric or discrete data. Parametric variables contain actual measurements or observations, which represent metric or continuous data. The statistical techniques that may be applied to the data depend on the measurement scale of the data values. A list of admissible statistical techniques is given below.

Non-parametric, Nominal scale:
    Mode
    Frequency
    Non-parametric tests
    Contingency coefficient

Non-parametric, Ordinal scale:
    Median
    Frequency
    Non-parametric tests
    Spearman's correlation

Parametric, Ratio/Interval scale:
    Summary parameters: mean, variance, skewness, kurtosis
    Inferential statistics: ANOVA, MANOVA
    Pearson's correlations
    Linear and non-linear regressions

§ Summary statistics

Statistical parameters are computed with a view to describing (or summarizing) the properties of sampled data. Each parameter is a single-valued quantity. In most cases these parameters are computed for continuous-scale data, with a few for discrete data. A selection of these parameters is given below.

Definitions and formulas of summary parameters for a sample

Mode: The mode is the most frequently occurring value in a collection of observations. There may be more than one mode in a set of observations. (Nominal)

Median: The median is the middle value in a range of observations: there are exactly the same number of observations greater than and less than the median value. With N observed values arranged in order, if N is odd the median is the value in position (N + 1)/2, and if N is even the median is the average of the values in positions N/2 and N/2 + 1.
(Ordinal/Rank)

Arithmetic mean $\bar{X}$: The (sample) mean of a sample of N observations X1, X2, ..., XN is given by

$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$    (Interval/Ratio)

Range: The range is the difference between the maximum and minimum values of the N observations, i.e. Range = Xmax - Xmin.

Variance v: The variance of the N observations represents the spread of the observed data set about the mean and is expressed as

$v = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$    (Interval/Ratio)

Standard deviation s: The standard deviation is given by the square root of the variance,

$s = \sqrt{v}$    (Interval/Ratio)

Skewness Sk: Skewness represents the presence or absence of symmetry of the distribution of the data set about the mean and is given by

$Sk = \frac{\sum_{i} (X_i - \bar{X})^3}{N s^3}$    (Interval/Ratio)

Sk zero: symmetrical distribution; Sk negative: left-skewed; Sk positive: right-skewed.

Kurtosis Kr: Kurtosis represents the peakedness of a distribution:

$Kr = \frac{\sum_{i} (X_i - \bar{X})^4}{N s^4}$    (Interval/Ratio)

§ Summary statistics (contd)

Quartiles: Quartiles divide a data set into quarters (4 equal parts).
The first quartile, Q1, is the median of the portion of the data set that lies at or below the median of the entire data set.
The second quartile, Q2, is the median of the entire data set.
The third quartile, Q3, is the median of the portion of the data set that lies at or above the median of the entire data set.
Note: the entire data set is arranged in ascending order for quartile evaluation.

Five-number summary: The five-number summary is the list of five summary measures: the minimum, Q1, Q2 (the median), Q3 and the maximum of the data set.
Note: the entire data set is arranged in ascending order for five-number evaluation.

A Box and Whisker plot gives a graphical representation of a data set based on the five-number summary: a box spanning Q1 to Q3 with the median marked inside. The plot also gives a visual indication of the skewness of the distribution: equal areas of the box on either side of the median indicate a symmetrical (skew = 0) distribution of the data.
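The summary formulas above translate directly into Python. Note the divisors follow the text's definitions exactly (N - 1 for the variance; N·s³ and N·s⁴ for skewness and kurtosis, rather than the bias-adjusted versions some packages report):

```python
import math

# Sample mean, variance, standard deviation, skewness and kurtosis,
# computed exactly as the formulas in the text define them.

def summary(x):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / (n - 1)
    s = math.sqrt(var)
    sk = sum((v - mean) ** 3 for v in x) / (n * s ** 3)
    kr = sum((v - mean) ** 4 for v in x) / (n * s ** 4)
    return mean, var, s, sk, kr

# Illustrative data (the ages from the split-file example):
mean, var, s, sk, kr = summary(
    [42.0, 34.0, 32.0, 23.0, 28.0, 31.0, 43.0, 36.0, 27.0, 42.0]
)
print(round(mean, 2), round(var, 2))  # 33.8 47.96
```

A hand check: the squared deviations from 33.8 sum to 431.6, and 431.6/9 ≈ 47.96.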
Percentiles: The Kth percentile is the value of the data at or below which K percent of the observations lie.

§ Uncertainties associated with sample data

Confidence interval (CI)

A confidence interval is a range of values, wa to wb, which is expected to include a statistical parameter θ (i.e. wa ≤ θ ≤ wb).

A sampling distribution is associated with a statistical parameter (e.g. a mean, correlation coefficient, regression coefficient and so on) calculated from a sample data set, and the Central Limit Theorem states that this distribution is normal, with the well-known bell shape and two tails (areas of error) on either side. By specifying a confidence level α (traditionally α is set at 95%) for the sample parameter, the regions of error are excluded. Taking both sides (two tails) of the distribution, the total region of error is 1 - α; in other words, the area of error on each side is (1 - α)/2. The critical value for the desired error level is usually looked up from a table (a t-distribution table).

a) Calculation of the CI for a mean m

Find the critical value for the error region (look up t0.025 in a t-distribution table); this value is 1.96 at the 95% confidence level, i.e. at the 5% error level. Multiply the standard error (SE) of the mean by this value, which gives the required interval.

Note: the SE of a mean m is s/√N, where s is the sample standard deviation and N the sample size. Thus the CI of a sample mean m is

CI of m = m ± t0.025 × SE of m = m ± 1.96 × (s/√N)

b) Calculation of the CI for a proportion from a dichotomous set of values

Consider a sample of size N with dichotomous values. The proportion p of values in one dichotomy sub-set is f1/N, and that of the other sub-set is 1 - p. Calculate the standard error SE of the proportion p.
This is given by

SE of a proportion = √(p(1 - p)/N)

The CI of a sample proportion p at the 95% level is

CI of p = p ± t0.025 × SE of p = p ± t0.025 × √(p(1 - p)/N)

§ Inferential Statistics

The key objective of statistical analysis of a set of data is to draw some valid inference from it. Two computational steps are involved in deriving such an inference: (a) calculation of an appropriate test measure for the relevant hypothesis, and (b) estimation of the confidence interval.

Inferential statistics comprises a wide range of tests which enable researchers to draw inferences from their sample data sets. If an inference about a full population is to be drawn from a sample, certain conditions are to be satisfied. Associated with every statistical test there are two basic components, which specify the conditions for an appropriate inference:
i) a measurement requirement, and
ii) a model.

Parametric and non-parametric tests: A parametric test is a test applied to a set of data measured on an Interval or Ratio scale (continuous) or on counts (discrete). It is generally assumed that the data conform to some distribution, e.g. Normal or Poisson. A non-parametric test is applied to a data set on an Ordinal scale (e.g. ranked data) or on a Nominal scale. In non-parametric tests, conformity of the data to any distribution is not assumed.

Power efficiency of parametric and non-parametric tests

In general, inferences of a more general type are drawn from statistical tests which demand weaker assumptions about the data. However, the test of the null hypothesis through such applications is less powerful for any particular sample of size N under consideration. This is true of any non-parametric test compared with a parametric test. Thus when two different tests, Test A and Test B, are applied to two samples of sizes Na and Nb where Nb > Na, test B may be made as powerful as test A by the larger sample. This provides scope for using test B in place of test A.
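The two confidence-interval calculations above can be sketched in Python, using the 1.96 multiplier quoted in the text for the 95% level. The sample values (mean, standard deviation, proportion, sizes) are illustrative:

```python
import math

# CI for a mean and CI for a proportion, per the formulas in the text.

def ci_mean(m, s, n, t=1.96):
    se = s / math.sqrt(n)            # SE of the mean = s / sqrt(N)
    return m - t * se, m + t * se

def ci_proportion(p, n, t=1.96):
    se = math.sqrt(p * (1 - p) / n)  # SE of a proportion
    return p - t * se, p + t * se

lo, hi = ci_mean(33.8, 6.93, 10)     # illustrative mean, s, N
print(round(lo, 2), round(hi, 2))

plo, phi = ci_proportion(0.6, 100)   # illustrative p, N
print(round(plo, 3), round(phi, 3))
```

In each case the interval is symmetric about the sample estimate, with half-width t × SE.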
Efficiency and sample size

The extent of the increase in sample size needed to make one test (test B, say) as powerful as another (test A, say) is expressed by the power efficiency, defined in terms of the sample sizes used by the respective tests:

Power efficiency of test B = Na/Nb

Using this formula, it is easy to calculate the sample size of test B required to raise its power to the level of test A.

Considering the parametric and non-parametric tests for comparing the means of two groups (i.e. Student's t-test and the Mann-Whitney test), the power efficiency of the Mann-Whitney test is 95% relative to the t-test. Similarly, among the parametric and non-parametric versions of the analysis of variance (ANOVA), the power efficiency of the Kruskal-Wallis (non-parametric) test is again 95% relative to the F-test (parametric).

The same level of efficiency may be achieved for the two types of test (parametric and non-parametric) by drawing a sample of appropriate size for each. For example, a researcher needs a sample of 20 cases for a Mann-Whitney test to be able to reject Ho with the same level of confidence as with 19 cases for a t-test. Similarly, a sample of 20 cases is needed in a Kruskal-Wallis test to reject Ho as with 19 cases in an application of the F-test.

§ Hypothesis testing steps

A no-difference statement is set up as the starting point (the null hypothesis, Ho). The implicit aim of a null hypothesis is to negate what you wish to demonstrate with your sample data set.
Apply the appropriate statistical technique, which produces the test statistic value, the degrees of freedom (df), and the probability p (sig. value).
Compare the computed p (sig.) value with the chosen significance level.
If the p value is low (conventionally < 0.05), reject Ho, with residual uncertainty proportional to p; otherwise accept Ho.
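The power-efficiency relation and the p-value decision rule above can be sketched in a few lines of Python (the p values passed in at the end are illustrative):

```python
# Power efficiency = Na / Nb, and the conventional decision rule on p.

def power_efficiency(n_a, n_b):
    """Efficiency of test B (sample size n_b) relative to test A (n_a)."""
    return n_a / n_b

def decide(p, alpha=0.05):
    """Reject the null hypothesis when p falls below the significance level."""
    return "reject H0" if p < alpha else "accept H0"

# Mann-Whitney vs t-test example from the text:
# 19 t-test cases per 20 Mann-Whitney cases gives 95% efficiency.
print(power_efficiency(19, 20))    # 0.95
print(decide(0.03), decide(0.20))  # reject H0 accept H0
```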
Significance level (p): the dichotomy

p < 0.05?    Decision
yes          Reject Ho
no           Accept Ho

§ Common Errors

The common errors in arriving at a decision about Ho are of two types:

Type I error: rejecting Ho when in fact it is true. The probability associated with a Type I error is alpha.
Type II error: accepting Ho when in fact it is false. The probability associated with a Type II error is beta.

The two probabilities alpha and beta are inversely related: if alpha is increased, beta is decreased. These errors are tied to the sample size N of the data set, and are reduced by increasing N.

§ Specific Parametric Tests

Simple ANOVA applications

ANOVA (one-way or univariate)

Data specification: 2 variables - one test variable which contains all the test scores (Interval/Ratio scale), and one categorical variable (Ordinal or Nominal scale) which defines the groups (sub-sets) of data on the test variable.

The key quantity computed in ANOVA is the F-statistic, together with the degrees of freedom and the sig. value. The F-statistic (a single-valued result) is the ratio of the between-group (b-g) variation to the within-group (w-g) variation present in the test variable:

F = SS(b-g) / SS(w-g)

Both the numerator and the denominator are single-valued scalar quantities in ANOVA applications, and hence the F-statistic is a single-valued quantity.

Simple MANOVA applications

GLM Multivariate Analysis of Variance (MANOVA)

The MANOVA technique is applied to analyse the variances of 2 or more test variables (in contrast to ANOVA). These test variables, containing continuous data, are treated as the dependent variables, and the categorical variable(s) as the independent variable(s). MANOVA computes the effects of the categorical (factor) variable on the dependent variables through the F-statistic. The normality assumptions on the test variables are not stringent.
In MANOVA with multiple test variables, the between-group (b-g) and within-group (w-g) variations are not single-valued scalar quantities but matrices, known as the SSCP(b-g) and SSCP(w-g) matrices respectively.

The latent roots of the ratio of these matrices are the so-called eigenvalues; from the eigenvalues, Wilks' lambda is computed, and from that the F-statistic.

The following equations relate the eigenvalues λi to Wilks' lambda and the F-statistic:

Wilks' Λ = Π_i [1/(1 + λi)]

F = [(1 - Λ)/Λ] × [(n1 + n2 - p - 1)/p]

where n1 and n2 represent the numbers of cases in the groups and p the number of test variables.

Other commonly used MANOVA test indices are also computed from the eigenvalues. These are given below.

Pillai's trace = Σ_i [λi/(1 + λi)]

Hotelling's trace = Σ_i λi

Roy's largest root = λmax/(1 + λmax)

Specific MANOVA Applications

Variability of 2 or more test variables with 1 or more factor variables

Data specification: (i) 2 or more test variables (continuous) and (ii) 1 or more categorical (factor) variables (discrete). A discrete variable may include 2 or more levels.

Repeated-measures MANOVA (between groups)

Data specification: sub-sets of the test variables (Interval/Ratio scale) are to be entered in a 2-dimensional matrix format. The first element of the matrix represents the main factor and the second element denotes the factor levels (the levels indicate repetition in time).
Repeated-measures MANOVA (mixed within-between groups)

Sub-sets of the test variable (Interval/Ratio scale) are entered in a 1-dimensional matrix form with factor levels (the levels indicate repetition in time).

§ BI-VARIATE RELATIONSHIPS

Pearson's product-moment correlation (parametric)

Data specification: 2 or more variables. Measurement scale: Ratio/Interval.

Example: a correlation matrix for 3 variables A, B, C. The correlation coefficients appear as a matrix:

      A    B    C
A    AA   AB   AC
B    BA   BB   BC
C    CA   CB   CC

The matrix is symmetric: AB = BA, AC = CA, BC = CB.

Range of the correlation coefficient r: -1 ≤ r ≤ +1

Qualitative description of the strength of r:

Range                Relationship
±0.1 to ±0.3         weak
±0.31 to ±0.5        medium
±0.51 to ±0.7        moderate
±0.71 to ±1.0        strong

§ MATHEMATICAL MODELS

"An aspect of rationality is the possibility of going beyond purely intuitive judgements to the use of structural models that provide methods of aggregating intuitive judgements."
P Suppes, Lucie Stern Professor of Philosophy, Stanford University, California

§ General Linear Model family

Regression Analysis
Discriminant Analysis
Factor Analysis
Principal Component Analysis

§ GLM prediction model: Simple Linear Regression

A simple linear regression is formed with one dependent variable y and one independent variable x1, and the equation reduces to the form:

y = constant + b1*x1

The coefficient b1 represents the slope of the regression line, and the constant is the intercept of the regression line with the vertical axis. The coefficient may have a positive (+ve) or a negative (-ve) value: a +ve coefficient means that as the independent variable x increases, the dependent variable y also increases, and a -ve coefficient means that as x increases, y decreases. The magnitude of the coefficient gives a measure of the steepness of the regression line.
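The least-squares fit behind y = constant + b1*x1 can be sketched in Python. The x and y values below are illustrative data chosen to lie close to the line y = 2x:

```python
# Least-squares estimates for simple linear regression:
# b1 = sum((x - mx)(y - my)) / sum((x - mx)^2), constant = my - b1 * mx.

def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    constant = my - b1 * mx
    return constant, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.1, 8.0, 9.9]   # roughly y = 2x (illustrative)
constant, b1 = fit_line(x, y)
print(round(constant, 3), round(b1, 3))
```

A positive b1, as here, reflects y increasing with x; a negative b1 would reflect the opposite.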
Multiple Linear Regression

Regression is an example of a GLM structure with the objective of prediction. In multiple regression, a linear equation is constructed by least-squares fitting using multiple independent (explanatory) variables X1, X2, X3, ... and one dependent variable Y. The equation is expressed as follows:

Y = constant + b1*X1 + b2*X2 + b3*X3 + ...

In the above equation, the coefficients (b1, b2, etc.) associated with the independent variables are partial regression coefficients. Each coefficient should be regarded as an estimator of the measure of influence of a particular independent variable on the dependent variable. These coefficients may be +ve or -ve.

Data specification: Usually all the variables (independent and dependent) contain continuous (Interval/Ratio scale) data. In many regression applications, some of the independent variables may contain Ordinal/Rank-scale data. In special applications (e.g. regression used as a classification model), the dependent variable may contain categorical data (Nominal or Ordinal/Rank scale).

Terms and definitions

Partial regression coefficient - The coefficient associated with each independent variable (e.g. b1, b2, etc.) is called the partial regression coefficient. These are the measures of influence of the independent variables on the dependent variable. They are unstandardised coefficients.

Standardised coefficients - Owing to the presence of different scales of values in the independent variables, it is necessary to transform the unstandardised coefficients into standardised coefficients, which give the measure of the relative influence of each regression coefficient in comparable units. This is given by

β = b × (Sx/Sy)

where β is the standardised coefficient, b is the unstandardised coefficient for a particular independent variable X, Y is the dependent variable, and Sx and Sy are the respective standard deviations.
Coefficient of multiple determination R² - This is the measure of the proportion of the variation in the dependent variable explained by the independent variables.

§ The Ballantine for Multiple Regression

The Ballantine model provides a diagrammatic representation of the influence of the explanatory variables on the dependent variable. The variables involved in the multivariate regression analysis (dependent and independent) are depicted as mutually overlapping circles. The overlapping sections of these circles represent the extent of the partial contributions of the independent variables to the dependent variable and among themselves.

[Diagram: three overlapping circles representing Y, X1 and X2, with overlap regions labelled A, B, C and D.]

The three circles represent the 3 variables Y, X1 and X2. There are 4 overlap regions, labelled A, B, C and D:

'A' represents the overlap between X1 and Y (independent of X2), equivalent to b1.
'B' represents the overlap between X2 and Y (independent of X1), equivalent to b2.
'C' represents the overlap between X1, X2 and Y.
The ratio of the sum A + B + C to the total area of the Y circle is equivalent to R², the coefficient of multiple determination.
'D' represents the overlap between X1 and X2 (independent of Y) and is equivalent to the partial correlation of X1 and X2.
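The R² interpreted by the Ballantine (the explained share of the Y circle) can be computed directly as 1 minus the residual sum of squares over the total sum of squares. The observed values and fitted predictions below are illustrative, standing in for the output of a hypothetical regression:

```python
# R^2 = 1 - SS(residual) / SS(total): the proportion of the variation
# in the dependent variable explained by the fitted regression.

def r_squared(y, y_hat):
    my = sum(y) / len(y)
    ss_total = sum((v - my) ** 2 for v in y)
    ss_residual = sum((v - h) ** 2 for v, h in zip(y, y_hat))
    return 1 - ss_residual / ss_total

y = [2.0, 4.0, 6.0, 8.0]        # observed values (illustrative)
y_hat = [2.5, 3.5, 6.5, 7.5]    # predictions from a hypothetical fit
print(round(r_squared(y, y_hat), 2))  # 0.95
```

Here the residual squares sum to 1.0 against a total of 20.0, so 95% of the variation in Y is explained, the counterpart of the A + B + C share of the Y circle.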