December 13, 2004
Violent Crime in America
Introduction
Violent crime in the United States is an important subject, particularly in New
York City where people perceive the risk of being victimized by crime to be relatively
high. As residents of New York City, the risk of violent crimes affects the way we live
our lives, whether or not we actually become a victim of a crime. We have to think twice
about traveling alone on the subway late at night, or jogging in Central Park after dark.
Therefore, in thinking about the quality of our lives here, we wonder what societal factors
must be in place in order to live in a more peaceful world, where the risks of being a
violent crime victim would be lower (or maybe we should just move out of the city).
Our data analysis project analyzes violent crime in America. We will determine
the most important statistical drivers of violent crime over the period 1970-2002. We are
interested in other environmental/societal factors that fluctuate year to year that may be
correlated with the rate of violent crime. Are there factors that we assume are correlated
but are really not? Are there factors that we assume to have no association with violent
crime but that really do? We aim to draw conclusions about what factors must be in place
in order for violent crime to be reduced over the next 30 years.
The Data
To analyze the violent crime rate, and its drivers, we have collected the following
data:
Data                                   Source                         Frequency   Timeframe
Violent crime rate (target variable)   Bureau of Justice Statistics   Annual      1960-2002
Unemployment rate                      Census Bureau                  Monthly     1960-2002
Federal Prison population              Federal Bureau of Prisons      Annual      1970-2002
Poverty rate                           Census Bureau                  Annual      1960-2002
Economic growth – GDP                  Bureau of Economic Analysis    Annual      1960-2002
In collecting the data, we have already faced several issues. First, we had
expected to analyze data for 1964-2003; however, several of the data series are not
available as far back as 1964, so we will limit the analysis to 1970-2002. Specifically, we
were unable to find data on the total prison population going back to the 1960s; therefore,
we have chosen to use the Federal Prison Population instead, since this data series extends
back to 1970. Although less ideal than the total U.S. prison population, we believe the
Federal data series may add to our understanding. The second issue we faced is that we have
chosen to analyze annual observations; however, the unemployment data appear to be available
only on a monthly basis, so we had to transform them to annual data. Since we don't have
weightings, we have annualized the series by calculating an unweighted mean of the monthly
data. This transformation could potentially have a negative effect on the validity of our
conclusions. We also expected an issue with data for 2001 if victims of 9/11 were
counted as victims of violent crime, but upon further analysis, they were not. Had they
been, our other variables would have been less relevant in 2001 than in
other years. Finally, our data are in different units: some are rates (crime rate,
unemployment rate) that fluctuate over time, while some are absolute numbers (prison
population) that tend to grow over time. We may need to transform our data to make
regression analysis more meaningful.
Expected Outcome
Through an analysis of the data we expect to find that the unemployment rate is
correlated to the violent crime rate and that higher unemployment produces higher violent
crime. This is because unemployment produces lower income which may drive crime
related to robbery. We expect that a higher poverty rate will be associated with higher
crime for the same reason. We expect that when GDP is lower or falling, violent crime
will rise. We expect higher prison population to be associated with lower violent crime
because those most likely to commit violent crime are incarcerated.
We believe that by statistically analyzing the violent crime rate and its potential
drivers, we can increase our understanding of crime and what factors are associated with
a lower incidence of it.
General Observation of the Variables
We will begin our analysis with an examination of the descriptive statistics as
well as a histogram for each of our variables. This will enable us to determine whether or
not the data are normally distributed and to see if there are any variables that may cause
problems when we go deeper into our statistical analysis of the data. The descriptive
statistics are as follows:
Descriptive Statistics

Variable             Mean  SE Mean   StDev  Minimum      Q1  Median      Q3
Violent Crime ra    565.9     18.8   107.9    363.5   491.2   556.6   636.9
Total Prison Pop    48598     5937   34105    19023   21654   30104   75453
Avg Annual Unemp    6.285    0.244   1.404    4.000   5.350   6.000   7.200
Poverty for Fami   10.361    0.190   1.091    8.700   9.300  10.300  11.300
GDP in billions      4883      511    2937     1039    2163    4463    7235

Variable          Maximum
Violent Crime ra    758.1
Total Prison Pop   128090
Avg Annual Unemp    9.700
Poverty for Fami   12.300
As is apparent in the data above, some of the variables seem to be fairly normally
distributed, as the mean and median for those variables are similar to each other. This is
supported by each of the histograms we looked at as well. The exceptions are the
variables Total Prison Population and GDP, which both have a higher mean relative to
the median. This lack of normality is apparent in the histograms of each of these
variables, as seen below.
[Figures: Histogram of Violent Crime rate; Histogram of Total Prison Pop; Histogram of Avg
Annual Unemployment Rate; Histogram of Poverty for Families; Histogram of GDP in billions
of current dollars. Vertical axis in each: Frequency.]
Because Total Prison Population has a long right tail, we decided to perform a
transformation by taking the log base 10 of the data in order to see if that would help
create a more normal distribution. We also logged the GDP data, since it is a monetary
series. As is apparent from the histograms of the logged data, this transformation did not
make the distributions noticeably more normal.
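As a rough illustration of why the log transformation did not remove the skew, the sketch below applies a base-10 log to the quartiles of Total Prison Pop reported in the descriptive statistics. The helper name `quartile_gaps` is our own and is not part of the original analysis; comparing quartile gaps is only an informal symmetry check, not a formal test of normality.

```python
import math

# Quartiles of Total Prison Pop, taken from the descriptive statistics table.
Q1, MEDIAN, Q3 = 21654, 30104, 75453


def quartile_gaps(q1, median, q3):
    """Return (median - Q1, Q3 - median); roughly equal gaps suggest symmetry."""
    return median - q1, q3 - median


# Gaps on the original scale.
raw_lower, raw_upper = quartile_gaps(Q1, MEDIAN, Q3)

# Gaps after the base-10 log transformation.
log_lower, log_upper = quartile_gaps(*(math.log10(v) for v in (Q1, MEDIAN, Q3)))

print(f"raw gaps: {raw_lower} vs {raw_upper}")
print(f"log gaps: {log_lower:.3f} vs {log_upper:.3f}")
```

On both scales the upper gap is larger than the lower gap, although the imbalance is smaller after logging. This is in line with the observation that the logged series is still not symmetric.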
[Figures: Histogram of LogT Prison Pop; Histogram of LogT GDP. Vertical axis in each:
Frequency.]
This may have to do with the fact that these are time series data; correcting for that is
beyond the scope of this project. While taking the logs for Prison Population and GDP
did not make them normally distributed, we decided to continue using the logged data in
the rest of our analysis.
We also examined correlations among our variables, substituting our two
transformed variables for their original variables. The best regressions arise when the
predictor variables are highly correlated with the target variable but not with each other.
In our data, the poverty rate and log of GDP are the predictors most highly correlated with
the violent crime rate; however, several pairs of predictor variables are also highly
correlated with one another.
Correlations

               Violent Crim  Avg Annual U  Poverty for  LogT Prison
Avg Annual U          0.129
Poverty for           0.656         0.596
LogT Prison           0.395        -0.516        0.017
LogT GDP              0.647        -0.277        0.284        0.888
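To give a sense of how much a correlation between two predictors can matter, the sketch below computes a variance inflation factor (VIF) for the most strongly correlated pair, LogT Prison Pop and LogT GDP. The original analysis does not compute VIFs; this is an added illustration under the simplifying assumption that only these two predictors are in the model, in which case the VIF for each is 1 / (1 - r^2). The function name is our own.

```python
def vif_two_predictors(r):
    """VIF of either predictor in a model with exactly two predictors
    whose pairwise correlation is r."""
    return 1.0 / (1.0 - r ** 2)


# Correlation between LogT Prison Pop and LogT GDP, from the table above.
r_prison_gdp = 0.888

print(round(vif_two_predictors(r_prison_gdp), 2))
```

Under that assumption, the variance of each coefficient estimate is several times larger than it would be if the two predictors were uncorrelated.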
Single Variable Regressions
While we are ultimately concerned with how all the variables together predict
Violent Crime, we are first going to examine how each one, on its own, relates to our
target. To do this, we created a scatter plot with a fitted regression line for each of the
predictor variables against the target of violent crime rate, as displayed below.
[Figures: four Fitted Line Plots of Violent Crime rate against each predictor. Fitted
equations and summary statistics:]

Violent Crime rate = 503.5 + 9.94 Avg Annual Unemployment Rate
    S = 108.676   R-Sq = 1.7%    R-Sq(adj) = 0.0%
Violent Crime rate = - 106.2 + 64.87 Poverty for Families
    S = 82.6857   R-Sq = 43.1%   R-Sq(adj) = 41.2%
Violent Crime rate = - 130.9 + 151.7 LogT Prison Pop
    S = 100.662   R-Sq = 15.6%   R-Sq(adj) = 12.9%
Violent Crime rate = - 249.2 + 226.7 LogT GDP
    S = 83.5545   R-Sq = 41.9%   R-Sq(adj) = 40.0%
In looking at the slope of the fitted line, all of the variables appear to have a
positive relationship with the target, indicating that as each variable increases, the violent
crime rate increases as well. That being said, no one variable alone
has a very strong correlation with the violent crime rate; the highest single-variable R-Sq
is 43.1%. In addition, the spread of the violent crime rate around the fitted line widens as
the log of GDP rises over time. We can
therefore conclude at this point that each variable on its own is not a good predictor of
violent crime. It is our hope that when these variables are acting together, the
relationship will be stronger and that as a group they will be better predictors of
violent crime. In order to determine this, we will move on to our next step in analyzing
the data, that of a multiple regression model.
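As a cross-check on the single-variable fits, the slope and intercept of a simple regression line can be recovered from summary statistics alone: slope = r * (StDev of y / StDev of x), and intercept = mean of y - slope * mean of x. The sketch below applies this to the rounded figures in the descriptive statistics and correlation tables; the function name is our own, and small differences from the reported equations are expected because the inputs are rounded.

```python
def fitted_line(r, mean_x, sd_x, mean_y, sd_y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Target variable: Violent Crime rate (mean and StDev from the descriptive statistics).
MEAN_CRIME, SD_CRIME = 565.9, 107.9

# Predictor summaries: (correlation with crime rate, mean, StDev).
predictors = {
    "Avg Annual Unemployment Rate": (0.129, 6.285, 1.404),
    "Poverty for Families": (0.656, 10.361, 1.091),
}

for name, (r, mean_x, sd_x) in predictors.items():
    intercept, slope = fitted_line(r, mean_x, sd_x, MEAN_CRIME, SD_CRIME)
    print(f"{name}: intercept = {intercept:.1f}, slope = {slope:.2f}")
```

The results land close to the reported lines (503.5 + 9.94 for unemployment and -106.2 + 64.87 for poverty).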
Initial Multiple Regression
Next we ran a multiple regression of the violent crime rate and our four predictor
variables (Avg Annual Unemployment Rate, Poverty for Families, log of GDP Current
Dollars, and log of Federal Prison Population). The regression equation is given below.
Regression Analysis

The regression equation is
Violent Crime rate = - 96 - 16.0 Avg Annual Unemployment Rate
                     + 52.8 Poverty for Families - 200 LogT Prison Pop
                     + 316 LogT GDP

Predictor                        Coef  SE Coef      T      P
Constant                        -96.2    302.4  -0.32  0.753
Avg Annual Unemployment Rate   -16.05    13.16  -1.22  0.233
Poverty for Families            52.77    15.87   3.32  0.002
LogT Prison Pop                -199.9    110.7  -1.81  0.082
LogT GDP                       315.51    97.71   3.23  0.003

S = 63.1435   R-Sq = 70.0%   R-Sq(adj) = 65.7%
In looking at the coefficients of this regression equation, we learn, for example,
that holding all else fixed, a one point increase in the poverty rate is associated with a
52.77 point increase in the violent crime rate. Similarly, the coefficient of the log of the
prison population tells us that every one point increase in the log of the prison population
(a tenfold increase in the prison population itself, since the log is base 10)
is associated with a 199.9 point decrease in the violent crime rate. Interestingly,
an increase in the unemployment rate is associated with a decrease in the violent crime
rate, and an increase in the logged GDP is associated with an increase in the violent
crime rate. Next, the regression model succeeded in reducing the noise in the violent
crime rate from a standard deviation of 107.9 before the regression to a standard error of
regression of 63.1. This means that roughly 95% of the time our regression model can predict
the crime rate to within ±2*63.1, or about ±126 points. This is an indication that a
prediction of violent crime using this regression equation would be more accurate than an
estimate based solely on its historical mean and variance. In addition to looking at the
standard error, it is also important to examine the degree to which these four variables
explain the variance in the violent crime rate. To do this we looked at the adjusted R-Sq.
The adjusted R-Sq indicates that the four predictor variables account for 65.7% of the
variance in the violent crime rate. It is difficult for us to tell at this time whether this
R-Sq is better or worse than that of other models that attempt to explain crime.
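The adjusted R-Sq and the rough 95% prediction band can be reproduced from the reported figures. The sketch below uses the standard adjustment, 1 - (1 - R-Sq)(n - 1)/(n - k - 1), with n = 33 annual observations (1970-2002) and k = 4 predictors; the function name is our own.

```python
def adjusted_r_sq(r_sq, n, k):
    """Adjusted R-squared for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r_sq) * (n - 1) / (n - k - 1)


N_OBS = 33          # annual observations, 1970-2002
R_SQ = 0.700        # reported R-Sq of the four-variable model
S = 63.1435         # reported standard error of regression

adj = adjusted_r_sq(R_SQ, N_OBS, 4)
half_width = 2 * S  # rough 95% band: prediction plus or minus two standard errors

print(f"adjusted R-Sq = {100 * adj:.1f}%")
print(f"approximate 95% band = +/- {half_width:.1f} points")
```

The same function can be applied to any of the other models in this report by changing `r_sq`, `n`, and `k`.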
Finally, we considered the T and P values of the predictor variables to determine if
each is significant to the regression equation. There are two variables for which the
P-value is above 0.05 (the log of the prison population and the unemployment rate);
therefore, these variables appear statistically insignificant in the model. This indicates
that perhaps these variables could be removed without much reduction in model power.
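Each T value in the output is simply the coefficient divided by its standard error, and a P-value below 0.05 corresponds to an absolute T value above the two-sided 5% critical value. The sketch below recomputes the T values and applies that cutoff. The critical value of 2.048 is the standard table value for 28 degrees of freedom (33 observations minus 4 predictors minus the constant); the dictionary layout and names are our own.

```python
# Two-sided 5% critical value of the t distribution with 28 degrees of freedom.
T_CRIT = 2.048

# (coefficient, standard error) pairs from the regression output.
coefficients = {
    "Avg Annual Unemployment Rate": (-16.05, 13.16),
    "Poverty for Families": (52.77, 15.87),
    "LogT Prison Pop": (-199.9, 110.7),
    "LogT GDP": (315.51, 97.71),
}


def t_statistic(coef, se_coef):
    """T value for the null hypothesis that the true coefficient is zero."""
    return coef / se_coef


significant = [
    name
    for name, (coef, se_coef) in coefficients.items()
    if abs(t_statistic(coef, se_coef)) > T_CRIT
]

for name, (coef, se_coef) in coefficients.items():
    print(f"{name}: T = {t_statistic(coef, se_coef):.2f}")
print("significant at the 5% level:", significant)
```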
Assumptions
Linear regression involves four major assumptions, and this regression violates
two of the four. The first assumption is that the expected value of the error terms for all
observations is equal to zero. Judging by the Residuals Versus the Fitted Values plot
below, the expected value of the error terms appears approximately equal to zero. Also,
there are no known subgroups whose fitted values are systematically above or below the
regression line. We believe this first assumption holds. The second assumption is
homoscedasticity: the regression relationship is equally strong throughout the
population, so the variance of the error terms is constant. That assumption does not hold
in this regression. The Residuals Versus the
Fitted Values plot shows that the variance is not constant; it is larger for
larger fitted values. The third assumption is that the residual of one term tells us nothing
about the residual of another term. This assumption is violated in this regression, as it is
in many regressions of time series data. The Residuals Versus the Order of the Data plot
shows that each residual is related to the residual of the prior observation. The fourth
assumption of linear regression is that the residuals are normally distributed. The
Normal Probability Plot of the Residuals and the Histogram of the Residuals show that the
residuals are approximately normal; therefore, this assumption holds for this regression.
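The third assumption was judged here by eye from the Residuals Versus the Order of the Data plot. One common numerical companion to that plot, not used in the original analysis, is the Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation, while values well below 2 suggest that each residual resembles the one before it. The sketch below is an illustration only; the residual values are hypothetical, since the report's actual residuals are not listed.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals in time order (NOT the report's data): long runs of
# same-signed values, the pattern typical of positively autocorrelated errors.
hypothetical_residuals = [30, 45, 50, 20, -10, -40, -55, -30, 5, 35]

dw = durbin_watson(hypothetical_residuals)
print(f"Durbin-Watson = {dw:.2f}")
```

For these made-up residuals the statistic falls well below 2, which is the kind of result one would expect from a series where each residual is related to the prior one.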
[Figure: Residual Plots for Violent Crime rate. Panels: Normal Probability Plot of the
Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus
the Order of the Data.]
[Figures: Residuals Versus LogT GDP; Residuals Versus LogT Prison Pop; Residuals Versus
Poverty for Families; Residuals Versus Avg Annual Unemployment Rate (response is Violent
Crime rate).]
In addition to considering the four assumptions, we also looked for any outliers in
the data by more closely examining the Normal Probability Plot of the Residuals. We
noticed a couple of outliers toward the very top of the graph. Upon analysis of these
outliers, we believe they occurred due to the relative increase in the crime rate during the
early 1990s and do not feel it necessary to remove the data points from our model at this
time.
Improving the Model
Several factors indicate that our initial model may not be the optimal model
possible with our predictor variables. First, two variables, the unemployment rate and the
log of prison population, have p-values above 0.05. Second, our model violates two of
the four assumptions of linear regression. To improve the model, we ran a "best subsets"
regression, the output of which follows.
Best Subsets Regression

Response is Violent Crime rate

A=Avg Annual Unemployment Rate
B=Poverty for Families
C=LogT Prison Population
D=LogT GDP

                          Mallows
Vars  R-Sq  R-Sq(adj)      C-p         S   A B C D
   1  43.1       41.2     24.2    82.686     X
   2  66.2       63.9      4.6    64.792     X   X
   3  68.4       65.2      4.5    63.672     X X X
   4  70.0       65.7      5.0    63.143   X X X X
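The Mallows C-p column can be reproduced from the S column. A minimal sketch, assuming the usual definition C-p = SSE_p / MSE_full - (n - 2p), where p counts the parameters including the constant, SSE_p = S^2 * (n - p) for the subset model, and MSE_full is S^2 of the four-variable model. The names below are our own.

```python
N_OBS = 33        # annual observations, 1970-2002
S_FULL = 63.143   # S of the full four-variable model

# (number of predictors, S) for each row of the best subsets output.
subsets = [(1, 82.686), (2, 64.792), (3, 63.672), (4, 63.143)]


def mallows_cp(s_subset, n_predictors, n=N_OBS, s_full=S_FULL):
    """Mallows C-p of a subset model, computed from standard errors of regression."""
    p = n_predictors + 1                      # parameters, including the constant
    sse_subset = s_subset ** 2 * (n - p)      # residual sum of squares of the subset
    return sse_subset / s_full ** 2 - (n - 2 * p)


for k, s in subsets:
    print(f"{k} variable(s): C-p = {mallows_cp(s, k):.1f}")
```

A C-p close to the number of parameters suggests little bias from the omitted variables, which is why the two-variable row (C-p of 4.6 against 3 parameters) already looks reasonable next to the one-variable row.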
The best subsets analysis indicates that only two variables are necessary to have
an adjusted R-Sq of 63.9%, whereas our four-variable equation had an adjusted R-Sq of
65.7%, a very small difference. The two variables that add so little power to the model
are the unemployment rate and the log of the prison population; these are the same two
variables with p-values above 0.05 in our initial regression. We believe that by eliminating
these two variables, the model will strike the best balance between model power and
complexity. Our optimal model then is as follows.
Regression Analysis

The regression equation is
Violent Crime rate = - 592 + 50.8 Poverty for Families + 176 LogT GDP

Predictor                Coef  SE Coef      T      P
Constant               -591.8    153.2  -3.86  0.001
Poverty for Families    50.81    10.94   4.64  0.000
LogT GDP               175.59    38.79   4.53  0.000

S = 64.7915   R-Sq = 66.2%   R-Sq(adj) = 63.9%
This new model explains 63.9% of the variance in the violent crime rate (as
indicated by the adjusted R-Sq). The original noise in our target variable was 107.9; our
model reduces noise in the target variable to 64.8 (the standard error of regression). Both
predictor variables are significant to the model (as indicated by p-values less than 0.05).
The equation tells us that, all else held constant, a one point increase in the poverty rate is
associated with a 50.81 point increase in the violent crime rate. Similarly, a one point
increase in the log of GDP (a tenfold increase in GDP itself, since the log is base 10) is
associated with a 175.59 point increase in the violent crime rate.
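Because GDP enters the model as a base-10 log, the coefficient is easier to read in terms of proportional changes in GDP: multiplying GDP by some ratio changes the predicted crime rate by the coefficient times log10(ratio). The sketch below works this out for a tenfold increase and for a doubling; the function name is our own.

```python
import math

LOG_GDP_COEF = 175.59  # coefficient on LogT GDP in the two-variable model


def crime_change_for_gdp_ratio(ratio, coef=LOG_GDP_COEF):
    """Predicted change in the violent crime rate when GDP is multiplied by
    `ratio`, holding the poverty rate fixed."""
    return coef * math.log10(ratio)


print(f"GDP x10: {crime_change_for_gdp_ratio(10):.2f} points")
print(f"GDP x2:  {crime_change_for_gdp_ratio(2):.2f} points")
```

A doubling of GDP therefore corresponds to a bit under a third of the full coefficient, since log10(2) is about 0.301.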
This new model conforms to the four assumptions of linear regression somewhat better than
our initial model did. It does not violate the first assumption (expected value of error
terms equal to zero), as seen in the plots below. This regression does violate the second
assumption (homoscedasticity), since the variance of the residuals is higher for larger fitted
values, but the variance is more constant than in our initial model. This regression also
violates the third assumption (residuals tell us nothing about one another), since it is a
time series. The fourth assumption (normality of residuals) is not violated by this
regression equation. While not exactly normal, the residuals are approximately normal
and more nearly normal than the residuals of our initial regression equation. In sum,
our improved model violates the same two assumptions as our initial model
(homoscedasticity and independence of the residuals), but the violation of
homoscedasticity is less severe.
[Figure: Residual Plots for Violent Crime rate. Panels: Normal Probability Plot of the
Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus
the Order of the Data.]
[Figures: Residuals Versus Poverty for Families; Residuals Versus LogT GDP (response is
Violent Crime rate).]
Initial Conclusion and Original Expectations
First, let us take a look at the nature of the relationship of the national violent
crime rate with each of the predictor variables, based on the multiple regression model
we ran. In half of the cases the direction of the relationship matched our expectations, and
in the other half the relationship was the opposite of what we had expected. As stated
earlier, we had assumed that an increase in GDP would be associated with a decrease in
the crime rate. This does not seem to be the case, based on the positive coefficient for the
logged GDP. It seems that there is actually a positive rather than negative relationship
between the two: an increase in GDP is associated with an increase in the violent crime
rate. Additionally, we had expected that an increase in the unemployment rate would be
associated with an increase in the violent crime rate. However, based on the negative
coefficient for unemployment, it seems that an increase in unemployment, in our model,
is actually associated with a decrease in violent crime. The other two variables do in fact
have the relationships we assumed they would have. An increase in the poverty rate
is associated with an increase in the violent crime rate, as indicated by the positive
coefficient for the poverty rate. In addition, as we had assumed, an increase in the prison
population is associated with a decrease in the crime rate. These associations, of course,
assume all other variables are held constant.
More importantly perhaps, we chose these four variables under the assumption,
prior to statistically analyzing the data, that all four variables together would serve as a
fairly good predictor of the national violent crime rate. After looking at the multiple
regression model for the data, the results do not fully support our original expectations.
To begin with, in order to strengthen our analysis we had to make the choice to
completely remove two of the four variables, the unemployment rate and the prison
population. We now believe that the national rate of violent crime for the period
1970-2002 is best explained by the poverty rate and the level of GDP. That said, violent crime
is quite difficult to predict using the data we have analyzed thus far. Therefore, we
decided to try one last thing in our effort to predict the national violent crime rate.
Incorporating a Lagged Variable
We considered the fact that the best predictor of the violent crime rate may be the
violent crime rate of the prior year. To examine this we first ran a correlation between the
violent crime rate and the lag (by one period) of the violent crime rate.
Correlations: Violent Crime rate, Lag of Violent Crime Rate
Pearson correlation of Violent Crime rate and Lag of Violent Crime Rate = 0.957
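The lag variable pairs each year's violent crime rate with the rate from the year before, so the first year has no lagged value and drops out (hence "32 cases used" in the later output). The sketch below shows one way to build such pairs and compute a Pearson correlation. The series is hypothetical and chosen only to illustrate the mechanics, and the function names are our own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lag_pairs(series):
    """Return (current, lagged): each value from the second onward, and the
    value one period earlier. The first observation has no lag and is dropped."""
    return series[1:], series[:-1]


# Hypothetical annual rates (NOT the report's data), for illustration only.
hypothetical_rates = [400.0, 420.0, 460.0, 480.0, 470.0, 500.0, 540.0, 530.0]

current, lagged = lag_pairs(hypothetical_rates)
print(len(current), "cases used")
print(f"correlation with lag = {pearson(current, lagged):.3f}")
```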
This very high correlation of 0.957 tells us that the violent crime rate in one period is
likely to be useful in predicting the violent crime rate of the next period. We
next constructed a second best subsets regression, this time including the lag variable.
Best Subsets Regression

Response is Violent Crime rate
32 cases used, 1 cases contain missing values

A=Avg Annual Unemployment Rate
B=Poverty for Families
C=LogT Prison Population
D=LogT GDP
E=Lag of Violent Crime Rate

                          Mallows
Vars  R-Sq  R-Sq(adj)      C-p         S   Predictors
   1  91.5       91.2     12.4    30.531   E
   2  93.3       92.9      5.8    27.557   E + 1 of A-D
   3  94.1       93.5      3.9    26.284   E + 2 of A-D
   4  94.4       93.6      4.5    26.105   E + 3 of A-D
   5  94.5       93.5      6.0    26.326   A B C D E
The result was surprising: a regression with only the lag variable had an adjusted
R-Sq of 91.2%, far higher than the 63.9% adjusted R-Sq of our previous best
subsets model. Once the lag variable was included, the other variables added little
additional power. As a result, our new best model has only the lag of the violent crime
rate as predictor. The regression equation for this model is below.
Regression Analysis: Violent Crime rate versus Lag of Violent Crime Rate

The regression equation is
Violent Crime rate = 56.8 + 0.907 Lag of Violent Crime Rate

32 cases used, 1 cases contain missing values

Predictor                     Coef  SE Coef      T      P
Constant                     56.83    29.13   1.95  0.061
Lag of Violent Crime Rate  0.90719  0.05039  18.00  0.000

S = 30.5314   R-Sq = 91.5%   R-Sq(adj) = 91.2%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  302126  302126  324.11  0.000
Residual Error  30   27965     932
Total           31  330091
[Figure: Residual Plots for Violent Crime rate. Panels: Normal Probability Plot of the
Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus
the Order of the Data.]
The coefficient tells us that each one point increase in the violent crime rate is
associated with a 0.907 point increase in the violent crime rate for the following year. This
regression reduces the noise of the response variable to a standard error of 30.5 from an
original standard deviation of 107.9. The adjusted R-Sq tells us that the regression
explains 91.2% of the variance in the violent crime rate. The p-value for the predictor
variable tells us that, if the true coefficient were zero, the probability of observing an
estimate this far from zero would be less than 0.0005. Since this is now a one variable
regression, the F statistic and associated p-value tell us the same information as the
p-value of the coefficient (F = 324.11 is the square of T = 18.00, up to rounding).
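The summary figures for this regression all follow from the Analysis of Variance table. The sketch below recomputes R-Sq, the F statistic, and S from the sums of squares and degrees of freedom, and checks that F is close to the square of the slope's T value; the variable names are our own.

```python
import math

# Sums of squares and degrees of freedom from the Analysis of Variance table.
SS_REGRESSION, SS_ERROR = 302126, 27965
DF_REGRESSION, DF_ERROR = 1, 30

ss_total = SS_REGRESSION + SS_ERROR
r_sq = SS_REGRESSION / ss_total                  # share of variance explained
ms_error = SS_ERROR / DF_ERROR                   # mean squared error
f_stat = (SS_REGRESSION / DF_REGRESSION) / ms_error
s = math.sqrt(ms_error)                          # standard error of regression

# T value of the slope, from the coefficient and its standard error.
t_slope = 0.90719 / 0.05039

print(f"R-Sq = {100 * r_sq:.1f}%")
print(f"F = {f_stat:.2f}, T squared = {t_slope ** 2:.2f}")
print(f"S = {s:.4f}")
```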
Our new regression violates two of the four assumptions of linear regression. It
does not violate the first assumption, since the expected value of the residuals appears
close to zero. The second assumption is violated since the residuals exhibit non-constant
variance; the variance increases for larger fitted values. The regression violates the third
assumption since each residual value is related to the residual of the prior year. Our
regression does not violate the fourth assumption since the residuals are approximately
normally distributed.
Implications
Our analysis has taught us three lessons. First, we learned that the poverty rate
and the size of the economy (as measured by GDP) are each more highly correlated with the
violent crime rate than the unemployment rate and the federal prison population are. Second,
we confirmed that a higher poverty rate is associated with a higher rate of violent crime and
learned that a larger economy is associated with a higher rate of violent crime. Third, we
learned that the most effective data for predicting the violent crime rate is in fact the rate
itself, from the previous year.
In conclusion, it seems reasonable to suppose that, since the violent crime rate
has fallen every year for the past ten years, it may do so next year as well. As for
predicting the national violent crime rate based on our original four variables, we found
that it is quite difficult to do.