NEAS Fall 2012
Regression Analysis Project
Brain size and Intelligence
If someone were to call you a pea-brain, you would automatically think they were
calling you stupid. After all, to have a small or “pea-sized” brain would imply a lack of
intelligence. But is there any truth to this assumption; is brain size correlated with
intelligence? I came across a study measuring brain size and intelligence that will help
answer this question. The study was done in 1991 at a large southwestern university.
Intelligence was determined by the Wechsler Adult Intelligence Scale (WAIS). A sample
of 20 males and 20 females participated, and the following 7 components were recorded for
these individuals:
1. Gender: Male or Female [M,F]
2. FSIQ: Full Scale IQ scores based on Wechsler (1981) test [IQ]
3. VIQ: Verbal IQ scores based on Wechsler (1981) test
4. PIQ: Performance IQ scores based on Wechsler (1981) tests
5. Weight: body weight in pounds [W]
6. Height: height in inches [H]
7. MRI_Count: total pixel count from the 18 MRI scans used to
determine the brain size of subjects [M]
Note that for the WAIS, the Performance IQ [PIQ] and Verbal IQ [VIQ] are combined to
determine the Full Scale IQ [FSIQ].
My goal was to determine which, if any, of these variables can be considered in
predicting IQ.
My first step was to examine the basic characteristics of the data in order to
determine a reasonable distribution. Below is a table summarizing my results.
Total
        HL (Lower Hinge)   Median   HU (Upper Hinge)   (HU – Median) / (Median – HL)   Mean      Range   Variance
FSIQ    90                 117      136                0.72                            113.45    67      22,618
VIQ     95                 114      131                0.89                            112.35    79      21,751
PIQ     86                 115      129                0.48                            111.025   78      19,693
In this table, the upper and lower hinges are set at the 75th and 25th percentiles
respectively. The ratio (HU – Median) / (Median – HL) summarizes the skewness of the
data: a value greater than 1 indicates positive skew, whereas a value less than 1
indicates negative skew. The FSIQ data has a skewness ratio of .72, meaning the data
set is slightly negatively skewed. The mean is also relatively close to the median.
Because of this approximate symmetry, I assumed the data set followed a normal
distribution and normalized all the IQ scores by subtracting the mean and dividing by
the standard deviation. This transformation helped to minimize the unexplained
variance and increase the coefficient of determination [R2].
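As a rough sketch of how these summary statistics and the normalization could be reproduced outside of Excel, assuming the data are loaded into a pandas DataFrame with columns named FSIQ, VIQ and PIQ (the file name, column names and quantile method are assumptions, not the author's actual steps):

    # Summary statistics and normalization sketch; "brainsize.csv" is a hypothetical file name.
    import pandas as pd

    df = pd.read_csv("brainsize.csv")

    for col in ["FSIQ", "VIQ", "PIQ"]:
        hl = df[col].quantile(0.25)           # lower hinge (25th percentile)
        med = df[col].median()
        hu = df[col].quantile(0.75)           # upper hinge (75th percentile)
        ratio = (hu - med) / (med - hl)       # >1 positive skew, <1 negative skew
        print(col, hl, med, hu, round(ratio, 2), df[col].mean(),
              df[col].max() - df[col].min(), df[col].var())

    # Normalize the IQ scores: subtract the mean and divide by the standard deviation.
    df["FSIQ_z"] = (df["FSIQ"] - df["FSIQ"].mean()) / df["FSIQ"].std()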
Correlation
My next step was to test the relationship between the explanatory variables.
When two variables are independent, they will have a covariance of zero and
consequently a correlation of zero. On the other hand, two mutually exclusive variables
will always have a negative correlation. Below is a table summarizing my results.
Covariance
        H         W        M         M/H
H       607
W       2,427     20,396
M       6,692     32,357   203,763
M/H     (18.49)   14.02    1,689     28

Correlation
        H         W        M         M/H
H       1
W       0.69      1
M       0.60      0.50     1
M/H     (0.14)    0.02     0.70      1
As you can see, height and weight had a strong positive correlation of .69. More
surprising was the correlation of .60 between height and brain size; I was expecting
these variables to be independent and to have a covariance close to zero. Similarly,
weight and brain size had a correlation of about .50. Because of this relationship I
decided to create an additional variable equal to brain size divided by height [M/H] to
remove some of the bias between M and W. Since H and M/H will be negatively correlated,
and H has a strong positive correlation with W, M/H and W should have almost no
correlation.
I also tested the relationship between the two IQ sub-tests and discovered that
PIQ and VIQ had a strong positive correlation of .78. So a higher verbal intelligence is
linked to a higher performance intelligence and vice versa.
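The covariance and correlation tables above came from Excel; a minimal sketch of the same calculations in pandas, including the derived M/H variable, might look as follows (the column names Height, Weight, MRI_Count, VIQ and PIQ are assumptions):

    # Covariance/correlation sketch; column names and file name are assumed.
    import pandas as pd

    df = pd.read_csv("brainsize.csv")
    df["M_per_H"] = df["MRI_Count"] / df["Height"]   # the M/H variable

    X = df[["Height", "Weight", "MRI_Count", "M_per_H"]]
    print(X.cov())     # covariance matrix for H, W, M, M/H
    print(X.corr())    # correlation matrix; e.g. corr(H, W) is about 0.69 here

    # Correlation of the two IQ sub-tests (reported above as about 0.78)
    print(df["VIQ"].corr(df["PIQ"]))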
It is important to consider covariance and correlation when creating models with
multiple variables. Excluding explanatory variables can cause bias when there is strong
correlation between variables. For example, since height and weight have a strong
positive correlation, and the correlation of height and IQ is not zero, a regression
equation using only weight as an explanatory variable would overestimate the
relationship of weight and IQ since a part of that covariance can be explained by height.
Likewise, since height and brain size have a strong positive correlation, and IQ is
positively correlated with both explanatory variables, a regression equation using only
one of these variables could misrepresent the relationship between IQ and the single
input variable being used. Thus an unbiased regression equation is one using height,
weight, and brain size.
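To make the omitted-variable argument concrete, here is a minimal sketch (not part of the original Excel analysis) comparing the Weight coefficient when Weight is the only regressor against its coefficient in the full H, W, M model; statsmodels and the column names are assumptions:

    # Omitted-variable comparison sketch; column names and statsmodels usage are assumed.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("brainsize.csv")

    weight_only = smf.ols("FSIQ ~ Weight", data=df).fit()
    full_model = smf.ols("FSIQ ~ Height + Weight + MRI_Count", data=df).fit()

    # If Height and MRI_Count matter and are correlated with Weight, the two
    # Weight coefficients will differ, illustrating the bias described above.
    print(weight_only.params["Weight"], full_model.params["Weight"])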
Regression Models
In my attempt to find a model that predicted IQ, I began testing the different
combinations of possible explanatory variables; in each case the normalized FSIQ was the
output, or dependent, variable. All numbers were calculated using the Data Analysis
regression tool in Excel. Below is a table summarizing the different models and their
outputs.
Model   Terms        df   R2     Adjusted R2   F      Significance F
1       H,W,M,M/H    4    0.31   0.23          3.96   0.94%
2       H,W,M        3    0.28   0.22          4.76   0.68%
3       H,W,M/H      3    0.29   0.23          4.89   0.60%
4       H,W          2    0.18   0.14          4.16   2.34%
5       H,M          2    0.23   0.19          5.62   0.74%
6       H            1    0.15   0.13          6.77   1.31%
7       M            1    0.00   (0.03)        0.00   97.2%
8       M/H          1    0.12   0.09          5.09   2.99%
9       W            1    0.16   0.14          7.19   1.08%
The R2 value expresses the overall accuracy of the regression. That is, R Square
tells how well the regression line approximates the real data. This number tells you how
much of the output variable’s variance is explained by the input variables’ variance.
With numbers below .35, it is clear that the given inputs are not necessarily good
predictors of IQ; ideally we would like to see this value be at least 0.6 (60%) or 0.7 (70%).
The R Square always goes up when a new variable is added, whether or not the new
input variable improves the regression equation's accuracy, so I also listed the adjusted
R Square, which is more conservative than the R Square because it takes into account
the degrees of freedom. When new input variables are added to the regression
analysis, the adjusted R Square increases only when the new input variable makes the
regression equation better able to predict the output.
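For reference, the adjustment can be written out explicitly: with n observations and k explanatory variables (n = 40 here), the standard adjusted R Square is

    Adjusted R2 = 1 – (1 – R2) × (n – 1) / (n – k – 1)

As a check, Model 2 (k = 3, R2 = 0.28) gives 1 – 0.72 × 39/36 ≈ 0.22, matching the table above.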
The Significance of F indicates the probability that the Regression output could
have been obtained by chance. A small Significance of F confirms the validity of the
regression output. Most models produced a reasonable Significance of F value;
however, Model 7 had an alarmingly high Significance of F value that would not be
accepted. In this case, a Significance of F of 0.972 means there is a 97.2% chance that
the regression output was merely a chance occurrence. This confirms that brain size by
itself does not possess a linear relationship to IQ.
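As a rough check on these values, the Significance F can be recovered from the F statistic and its degrees of freedom (k and n – k – 1). The sketch below uses scipy instead of Excel, with the Model 6 figures taken from the table above:

    # Significance F check for Model 6 (height only); scipy usage is an assumption.
    from scipy import stats

    n = 40          # 20 males + 20 females
    k = 1           # one explanatory variable (H)
    F = 6.77        # F statistic reported for Model 6
    sig_F = stats.f.sf(F, k, n - k - 1)   # upper-tail probability of F(k, n-k-1)
    print(sig_F)    # roughly 0.013, i.e. about 1.3%, in line with the 1.31% in the table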
The P-values of each coefficient and the Y-intercept indicate how likely it is that they
are real results that did not occur by chance. If the true value of a coefficient were zero,
the p-value is the probability of obtaining an estimated coefficient whose absolute value
is at least as great as the one in this regression equation. For a coefficient on Xi, the
P-value tells us the probability of obtaining these results if Bi = 0 and there is no
correlation between Xi and the output Y. A P-value near 0 means there is little
probability that the relationship the model established between the independent
variable(s) and the dependent variable doesn't actually exist. So the lower the P-value,
the higher the likelihood that that coefficient or Y-intercept is valid. The table
below gives the P-values for each model.
Model   Terms        Intercept   H        W        M        M/H
1       H,W,M,M/H    28.83%      32.12%   7.46%    29.68%   24.55%
2       H,W,M        13.06%      6.10%    11.84%   3.10%
3       H,W,M/H      79.81%      55.21%   10.96%            5.43%
4       H,W          6.17%       29.69%   23.27%
5       H,M          2.33%       0.19%             2.65%
6       H            1.33%       1.31%
7       M            97.25%                        97.24%
8       M/H          3.02%                                  2.99%
9       W            1.16%                1.08%
The lowest P-values occurred for models 5, 6 and 9. For models 6 and 9, P-values
below .0133 on all regression coefficients and intercepts indicate that there is
less than a 1.33% probability that the outcome occurred only as a result of chance when H
or W is the only explanatory variable. Model 7 again produces the greatest probability of
unpredictability: with a P-value of .972442, there is less than a 3% chance that the result is
not a product of chance.
Gender
I was also curious to see how the women compared to the men in this study. I ran the same
statistics as above, only this time I separated the data by gender; the results are summarized
in the following tables.
Female
        HL   Median   HU    (HU – Median) / (Median – HL)
FSIQ    87   116      134   0.64
VIQ     88   116      131   0.52
PIQ     87   115      132   0.59

Male
        HL   Median   HU    (HU – Median) / (Median – HL)
FSIQ    89   118      140   0.76
VIQ     92   110.5    145   1.86
PIQ     86   117      130   0.40
I was disappointed to find that the average (mean) IQ score for males is higher
than it is for the females in this study. However, females have a lower range and
variance for each IQ subtest, indicating greater consistency in scores.
What I found more intriguing was the skewness in the sub-components of IQ
between genders. When comparing percentiles, you will see that males have a lower
median for VIQ, indicating that a greater share of the data points lies below the
mean. A skewness ratio of 1.86 confirms that the distribution is positively skewed. This
is offset by a negatively skewed PIQ. In layman's terms, this means females have a
higher probability of outperforming males on verbal intelligence, whereas males have a
greater probability of scoring higher on PIQ.
I next wanted to incorporate gender in some of the regression models to see if
gender was a significant factor. When testing the role of gender in predicting IQ it is
necessary to introduce dummy variables. That is, height, weight, and brain size are all
quantitative measurements, while gender is qualitative. In this case, I assigned males as
the base gender, with the dummy variable equal to 0. I decided to replicate Model 6 and use
height as my single explanatory variable. The regression lines for males and females will
be in the following format.
Male:   YM = α + β·xi
Female: YW = α + γ1 + (β + δ1)·xi
where Y = IQ and x = Height
In the regression output, the intercept (α) for both equations is the intercept for
males. γ1 is equal to the difference between the male and female intercepts; this
represents the constant vertical distance between the female and male regression
lines when height is not considered. Females have a lower intercept, so γ1 is negative.
δ1 is the coefficient on the height-female interaction and is equal to the difference between
the height coefficients of the two regressions. A δ1 not equal to zero signifies that the
equations are not parallel. In this case, the low δ1 value means there is minimal
difference between the slopes of the two lines; that is, the rate of change of IQ with
height is very close for both genders. The final values for each equation are given below.
α        β      γ1       δ1
(0.36)   0.01   (0.64)   0.01
The low δ1 coefficient combined with the relatively high γ1 coefficient tells me that
gender differentiation does have significance in determining IQ. In comparing these two
new equations to the original Model 6, it is important to note the reduction in sample
size. With only 20 observations for each gender, the adjusted R2 value drops
considerably and the significance F values become very high, which would indicate that
separating the data by gender produces a less predictive model of IQ. Because of the
limited size of this dataset, I did not isolate gender for the remaining models.
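For completeness, a minimal sketch of how this dummy-variable regression could be set up outside of Excel; the Gender column name, the Female/Male labels, and the use of statsmodels are assumptions rather than the author's actual steps:

    # Gender dummy and height interaction sketch; males are the base level (dummy = 0).
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("brainsize.csv")
    df["Female"] = (df["Gender"] == "Female").astype(int)

    # FSIQ ~ alpha + gamma1*Female + beta*Height + delta1*(Female x Height)
    model = smf.ols("FSIQ ~ Female + Height + Female:Height", data=df).fit()
    print(model.params)   # Intercept -> alpha, Female -> gamma1,
                          # Height -> beta, Female:Height -> delta1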
Conclusion
Ideally, I would be able to configure a model with an R2 close to 1, with minimal
unexplained variance and low significance of F values. This regression analysis is a
perfect example of a non-textbook situation where the results don't always turn out
perfectly. The limited size of the data set creates problems of validity and accuracy,
especially when splitting the data by gender. That is, a greater number of data points
will increase the validity of the regression models by reducing the probability that
observations are a byproduct of randomness.
Obviously, with such low R2 values no single model would be a reasonable estimator of
IQ, but there is still information to be had from this analysis: it tells us a great deal
about how the variables interact and which models should not be used.
With an R2 value near 0, Model 7 would not be appropriate for predicting IQ. Its
high p-values also express the uncertainty of this model. It is safe to say that brain
size alone does not have a reliable linear relationship to IQ.
One should also be cautious of models with bias; that is, models with correlated
explanatory variables should be questioned. As previously discussed, any model not
using all three explanatory variables [H, W, M] would produce bias. Thus models 4-9
would not provide the best explanation for changes in IQ. This leaves models 1, 2, and 3.
All three have similar R2 and significance of F values; however, Model 2 produces
the lowest p-values of the three, so its coefficients are the least likely to be the
result of chance, which leads me to have more faith in this model over the others.
Even though this model may be the most reasonable of all options, with an R2 of
.28 it is still not a great predictor of IQ. That is, no linear combination of height, weight,
and brain size will provide a good estimate of intelligence. So being a pea brain may not
be such an insult after all.
Source of data
http://lib.stat.cmu.edu/DASL/Datafiles/Brainsize.html
Datafile Name: Brain size