Milestone2 Logistic Regression and Naive Bayes Classifier
Kushal Thakkar, Snighdha Petluru, Viral Shah
April 4, 2016
Logistic Regression with different Models
##logistic model
##helps find the intercept, co-efficients of each feature and log-odds
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.2.4
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.2.4
## Loaded glmnet 2.0-5
##DV with individual IV's
m1 = glm(eliteStatus~review_count,family=binomial(),data=train_vardata)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
m2 = glm(eliteStatus~nmonths,family=binomial(),data=train_vardata)
m3 = glm(eliteStatus~fans,family=binomial(),data=train_vardata)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
m4 = glm(eliteStatus~total_compliments,family=binomial(),data=train_vardata)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
m5 = glm(eliteStatus~total_votes,family=binomial(),data=train_vardata)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
m6 = glm(eliteStatus~nfriends,family=binomial(),data=train_vardata)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
m7 = glm(eliteStatus~AverageLeniencyScore,family=binomial(),data=train_vardata)
m8 = glm(eliteStatus~nmonths+review_count+nfriends+fans+total_compliments+total_votes,
         family=binomial(),data=train_vardata,maxit=100)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
m9 = glm(eliteStatus~nmonths+review_count+fans+total_compliments+total_votes+AverageLeniencyScore,
         family=binomial(),data=train_vardata,maxit=100)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
logreg.model = glm(eliteStatus~nmonths+review_count+fans+total_compliments+total_votes+nfriends+AverageLeniencyScore,
                   data=train_vardata,family=binomial(),maxit=100)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
library(aod)
## Warning: package 'aod' was built under R version 3.2.4
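The repeated glm.fit warning means that some fitted probabilities are numerically indistinguishable from 0 or 1, which tends to happen when a count predictor has a long right tail. The following is a minimal sketch of how one might measure how common such fitted values are. It runs on simulated toy data rather than train_vardata, and the cutoff 1e-8 and the names toy, fit, p_hat and extreme_share are assumptions made for illustration.

```r
# Toy data (hypothetical, not the Yelp data): one heavy-tailed predictor.
set.seed(1)
toy <- data.frame(x = rexp(500, rate = 0.05))
toy$y <- rbinom(500, size = 1, prob = plogis(-4 + 0.15 * toy$x))

fit <- glm(y ~ x, family = binomial(), data = toy)

# Share of fitted probabilities within 1e-8 of 0 or 1
# (1e-8 is an illustrative cutoff, not the one glm.fit uses).
p_hat <- fitted(fit)
extreme_share <- mean(p_hat < 1e-8 | p_hat > 1 - 1e-8)
extreme_share
```

A large share would suggest that a few extreme predictor values dominate the fit, in which case transforming the predictor (for example with log1p) may be worth trying.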
Coefficients, Intercepts and Odds Ratios of Each Model
##coefficients and intercepts of each logistic regression model:
##one model per individual independent feature, then the combined models
summary(m1)
##
## Call:
## glm(formula = eliteStatus ~ review_count, family = binomial(),
##     data = train_vardata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.2009  -0.1753  -0.1660   2.9205
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -4.3328801  0.0149974  -288.9   <2e-16 ***
## review_count  0.0274719  0.0001509   182.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance:  75625  on 306400  degrees of freedom
## AIC: 75629
##
## Number of Fisher Scoring iterations: 9
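To make the m1 coefficients concrete, the log-odds can be converted into a predicted probability of elite status with the inverse logit. The sketch below copies the two estimates from the summary(m1) output; the review counts 0, 100 and 300 are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the summary(m1) output.
b0 <- -4.3328801   # intercept
b1 <- 0.0274719    # review_count

# Hypothetical review counts chosen for illustration.
review_count <- c(0, 100, 300)

log_odds <- b0 + b1 * review_count
prob <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))
round(prob, 3)
```

With the fitted model object available, predict(m1, newdata, type="response") should give the same probabilities without copying coefficients by hand.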
summary(m2)
##
## Call:
## glm(formula = eliteStatus ~ nmonths, family = binomial(), data = train_vardata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.3949  -0.3948  -0.2751  -0.1870   3.0569
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.8192167  0.0209684  -229.8   <2e-16 ***
## nmonths      0.0390971  0.0002914   134.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance: 135274  on 306400  degrees of freedom
## AIC: 135278
##
## Number of Fisher Scoring iterations: 6
summary(m3)
##
## Call:
## glm(formula = eliteStatus ~ fans, family = binomial(), data = train_vardata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.2247  -0.2247  -0.2247   2.7173
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.666728   0.011538  -317.8   <2e-16 ***
## fans         0.441793   0.002785   158.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance:  92066  on 306400  degrees of freedom
## AIC: 92070
##
## Number of Fisher Scoring iterations: 14
summary(m4)
##
## Call:
## glm(formula = eliteStatus ~ total_compliments, family = binomial(),
##     data = train_vardata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.2768  -0.2713  -0.2713   2.5770
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -3.2836673  0.0097478  -336.9   <2e-16 ***
## total_compliments  0.0407717  0.0003189   127.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance:  94972  on 306400  degrees of freedom
## AIC: 94976
##
## Number of Fisher Scoring iterations: 12
summary(m5)
##
## Call:
## glm(formula = eliteStatus ~ total_votes, family = binomial(),
##     data = train_vardata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.2259  -0.2112  -0.2070   2.7649
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.841e+00  1.240e-02  -309.7   <2e-16 ***
## total_votes  8.056e-03  5.031e-05   160.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance:  80606  on 306400  degrees of freedom
## AIC: 80610
##
## Number of Fisher Scoring iterations: 14
summary(m6)
##
## Call:
## glm(formula = eliteStatus ~ nfriends, family = binomial(), data = train_vardata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.3001  -0.2864  -0.2864   2.5353
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1729499  0.0092741  -342.1   <2e-16 ***
## nfriends     0.0476524  0.0003641   130.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance: 122759  on 306400  degrees of freedom
## AIC: 122763
##
## Number of Fisher Scoring iterations: 7
summary(m7)
##
## Call:
## glm(formula = eliteStatus ~ AverageLeniencyScore, family = binomial(),
##     data = train_vardata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.4412  -0.3924  -0.3822  -0.3626   2.4118
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -2.607211   0.007352 -354.64   <2e-16 ***
## AverageLeniencyScore  0.081692   0.006208   13.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance: 155444  on 306400  degrees of freedom
## AIC: 155448
##
## Number of Fisher Scoring iterations: 5
summary(m8)
##
## Call:
## glm(formula = eliteStatus ~ nmonths + review_count + nfriends +
##     fans + total_compliments + total_votes, family = binomial(),
##     data = train_vardata, maxit = 100)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.1996  -0.1682  -0.1495   2.8998
##
## Coefficients:
##                     Estimate Std. Error  z value Pr(>|z|)
## (Intercept)       -4.746e+00  2.672e-02 -177.611  < 2e-16 ***
## nmonths            7.711e-03  4.329e-04   17.811  < 2e-16 ***
## review_count       2.017e-02  2.097e-04   96.211  < 2e-16 ***
## nfriends           6.947e-03  4.083e-04   17.013  < 2e-16 ***
## fans               7.724e-02  2.753e-03   28.055  < 2e-16 ***
## total_compliments -5.015e-04  7.232e-05   -6.934 4.08e-12 ***
## total_votes        8.786e-04  6.511e-05   13.495  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance:  71628  on 306395  degrees of freedom
## AIC: 71642
##
## Number of Fisher Scoring iterations: 31
summary(m9)
##
## Call:
## glm(formula = eliteStatus ~ nmonths + review_count + fans + total_compliments +
##     total_votes + AverageLeniencyScore, family = binomial(),
##     data = train_vardata, maxit = 100)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.2010  -0.1701  -0.1510   2.9246
##
## Coefficients:
##                        Estimate Std. Error  z value Pr(>|z|)
## (Intercept)          -4.732e+00  2.674e-02 -176.966  < 2e-16 ***
## nmonths               7.500e-03  4.316e-04   17.379  < 2e-16 ***
## review_count          1.992e-02  2.108e-04   94.499  < 2e-16 ***
## fans                  8.763e-02  2.970e-03   29.506  < 2e-16 ***
## total_compliments    -4.816e-04  6.375e-05   -7.554 4.22e-14 ***
## total_votes           1.152e-03  6.576e-05   17.521  < 2e-16 ***
## AverageLeniencyScore  8.946e-02  1.097e-02    8.154 3.51e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance:  71842  on 306395  degrees of freedom
## AIC: 71856
##
## Number of Fisher Scoring iterations: 36
summary(logreg.model)
##
## Call:
## glm(formula = eliteStatus ~ nmonths + review_count + fans + total_compliments +
##     total_votes + nfriends + AverageLeniencyScore, family = binomial(),
##     data = train_vardata, maxit = 100)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.2006  -0.1685  -0.1489   2.9305
##
## Coefficients:
##                        Estimate Std. Error  z value Pr(>|z|)
## (Intercept)          -4.771e+00  2.700e-02 -176.666  < 2e-16 ***
## nmonths               7.718e-03  4.329e-04   17.826  < 2e-16 ***
## review_count          2.015e-02  2.097e-04   96.083  < 2e-16 ***
## fans                  7.717e-02  2.750e-03   28.058  < 2e-16 ***
## total_compliments    -5.041e-04  7.443e-05   -6.773 1.26e-11 ***
## total_votes           8.877e-04  6.524e-05   13.607  < 2e-16 ***
## nfriends              6.831e-03  4.073e-04   16.770  < 2e-16 ***
## AverageLeniencyScore  8.414e-02  1.102e-02    7.638 2.20e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 155622  on 306401  degrees of freedom
## Residual deviance:  71567  on 306394  degrees of freedom
## AIC: 71583
##
## Number of Fisher Scoring iterations: 76
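The ten summaries are easier to compare side by side. As a rough sketch, the AIC values reported above can be collected and sorted, and the one nested pair that differs by a single term (m8 against logreg.model, which adds AverageLeniencyScore) can be compared with a likelihood-ratio test computed from the reported residual deviances. All numbers are copied from the outputs above; the variable names are assumptions for illustration.

```r
# AIC values copied from the summary() outputs.
aic <- c(m1 = 75629, m2 = 135278, m3 = 92070, m4 = 94976, m5 = 80610,
         m6 = 122763, m7 = 155448, m8 = 71642, m9 = 71856,
         logreg.model = 71583)
sort(aic)   # lowest AIC first

# Likelihood-ratio test of m8 against logreg.model, which adds
# AverageLeniencyScore (residual deviances and df from the summaries).
lr_stat <- 71628 - 71567
lr_df   <- 306395 - 306394
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value
```

With the fitted objects in the session, anova(m8, logreg.model, test="Chisq") should report the same comparison directly.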
##odds ratios of the outcome
exp(coef(m1))
##  (Intercept) review_count
##   0.01312968   1.02785270
exp(coef(m2))
## (Intercept)     nmonths
## 0.008073108 1.039871482
exp(coef(m3))
## (Intercept)        fans
##  0.02555996  1.55549346
exp(coef(m4))
##       (Intercept) total_compliments
##        0.03749052        1.04161431
exp(coef(m5))
## (Intercept) total_votes
##  0.02148023  1.00808852
exp(coef(m6))
## (Intercept)    nfriends
##  0.04187988  1.04880599
exp(coef(m7))
##          (Intercept) AverageLeniencyScore
##            0.0737399            1.0851213
exp(coef(m8))
##       (Intercept)           nmonths      review_count          nfriends
##       0.008683711       1.007741165       1.020378145       1.006971213
##              fans total_compliments       total_votes
##       1.080306144       0.999498617       1.000879009
exp(coef(m9))
##          (Intercept)              nmonths         review_count
##          0.008806231          1.007528580          1.020124506
##                 fans    total_compliments          total_votes
##          1.091584449          0.999518547          1.001152794
## AverageLeniencyScore
##          1.093580534
exp(coef(logreg.model))
##          (Intercept)              nmonths         review_count
##          0.008476039          1.007747561          1.020356199
##                 fans    total_compliments          total_votes
##          1.080222128          0.999495988          1.000888083
##             nfriends AverageLeniencyScore
##          1.006854137          1.087781343
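An odds ratio is the factor by which the odds of elite status are multiplied for a one-unit increase in the feature, and odds ratios compound multiplicatively over several units. The sketch below works this through for the fans coefficient from summary(m3); the five-fan increase is a hypothetical value chosen for illustration.

```r
# Coefficient for fans, copied from the summary(m3) output.
b_fans <- 0.441793

or_per_fan <- exp(b_fans)               # odds ratio for one additional fan
or_per_5   <- exp(5 * b_fans)           # odds ratio for five additional fans
pct_change <- 100 * (or_per_fan - 1)    # percent change in odds per fan

c(or_per_fan = or_per_fan, or_per_5 = or_per_5, pct_change = pct_change)
```

The first value reproduces the 1.555 reported by exp(coef(m3)). These are changes in odds, not in probability; the change in probability depends on the starting point.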
Correlations of predicted and actual values of different models
##predicting eliteStatus using test data
pred1 <- predict(m1,test_vardata,type="response")
pred2 <- predict(m2,test_vardata,type="response")
pred3 <- predict(m3,test_vardata,type="response")
pred4 <- predict(m4,test_vardata,type="response")
pred5 <- predict(m5,test_vardata,type="response")
pred6 <- predict(m6,test_vardata,type="response")
pred7 <- predict(m7,test_vardata,type="response")
pred8 <- predict(m8,test_vardata,type="response")
pred9 <- predict(m9,test_vardata,type="response")
test_vardata$eliteStatusP <- predict(logreg.model,test_vardata,type="response")
##finding correlation between the predicted and actual value for eliteStatus
cor(pred1,test_vardata$eliteStatus)
## [1] 0.7275434
cor(pred2,test_vardata$eliteStatus)
## [1] 0.2990372
cor(pred3,test_vardata$eliteStatus)
## [1] 0.6869403
cor(pred4,test_vardata$eliteStatus)
## [1] 0.6657621
cor(pred5,test_vardata$eliteStatus)
## [1] 0.7265785
cor(pred6,test_vardata$eliteStatus)
## [1] 0.4689936
cor(pred7,test_vardata$eliteStatus)
## [1] 0.01568872
cor(pred8,test_vardata$eliteStatus)
## [1] 0.7452816
cor(pred9,test_vardata$eliteStatus)
## [1] 0.7448852
cor(test_vardata$eliteStatusP,test_vardata$eliteStatus)
## [1] 0.74534
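The correlations above compare predicted probabilities with the 0/1 outcome. A complementary check is to turn the probabilities into class labels at a cutoff and tabulate them against the actual labels. The sketch below shows the mechanics on a small hypothetical vector, not on test_vardata; the cutoff of 0.5 and all variable names are assumptions.

```r
# Hypothetical toy values, not taken from test_vardata.
actual <- c(0, 0, 0, 1, 1, 0, 1, 0)
prob   <- c(0.10, 0.40, 0.60, 0.80, 0.30, 0.20, 0.90, 0.05)

# Classify as elite when the predicted probability reaches the cutoff.
cutoff    <- 0.5
predicted <- as.integer(prob >= cutoff)

# Rows are actual classes, columns are predicted classes.
conf_mat <- table(actual = actual, predicted = predicted)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
conf_mat
accuracy
```

The same pattern could be applied to test_vardata$eliteStatusP and test_vardata$eliteStatus; with only about 7% elite users, a cutoff other than 0.5 may be more appropriate.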
Graph of Predicted and Actual Elite and Non-elite Users
##graphical representation of prediction of eliteStatus against nfriends and average leniency score
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
predictplot <- ggplot(test_vardata,aes(x=nfriends,y=AverageLeniencyScore,color=eliteStatusP))+geom_point()
predictplot + facet_grid(eliteStatus~.)+labs(x="Number of Friends",y="Average Leniency Score",color="Predicted Elite Status",title="Actual vs Predicted Elite Status of User")
The graph plots each test user by number of friends and average leniency score, colored by the predicted probability of elite status, with one panel for actual non-elite users and one for actual elite users.
## Bayes Classifier
## train data has 75% of data and rest 25% is in test data
setwd("C:/Users/ESHAN/Desktop/Viral/Dropbox/YelpAnalysis/Datasets")
continous_data = read.csv("continuous random.csv",stringsAsFactors = F)
row_divider = nrow(continous_data)*0.75
train_data_bayes=continous_data[1:row_divider,]
## note: both ranges include row_divider, so the boundary row lands in both train and test
test_data_bayes=continous_data[row_divider:nrow(continous_data),]
## converting all the variables to factors
train_data_bayes <- lapply(train_data_bayes, factor)
test_data_bayes <- lapply(test_data_bayes, factor)
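In the split above, the training range ends at row_divider and the test range starts at row_divider, so one row is shared. A minimal sketch of a non-overlapping 75/25 split follows. It uses a hypothetical toy data frame in place of continous_data, and the names n_train, train_idx and test_idx are assumptions for illustration.

```r
# Hypothetical toy data frame standing in for continous_data.
toy <- data.frame(id = 1:10, eliteStatus = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0))

n_train   <- floor(nrow(toy) * 0.75)                  # whole number of training rows
train_idx <- seq_len(n_train)                         # rows 1..n_train
test_idx  <- setdiff(seq_len(nrow(toy)), train_idx)   # the remaining rows

toy_train <- toy[train_idx, ]
toy_test  <- toy[test_idx, ]
c(train = nrow(toy_train), test = nrow(toy_test))
```

Using floor() also avoids indexing with a fractional row number when the row count is not a multiple of four.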
##checking the proportion of elitestatus user
prop.table(table(train_data_bayes$eliteStatus))
##
##          0          1
## 0.92961534 0.07038466
prop.table(table(test_data_bayes$eliteStatus))
##
##          0          1
## 0.92993587 0.07006413
##applying bayes classifier
library(e1071)
## Warning: package 'e1071' was built under R version 3.2.4
elite_classifier <- naiveBayes(train_data_bayes[9],train_data_bayes$eliteStatus)
##model performance
elite_classifier_pred <- predict(elite_classifier,test_data_bayes)
summary(elite_classifier_pred)
##     0     1
## 82110 20025
##cross table
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.2.4
CrossTable(test_data_bayes$eliteStatus,elite_classifier_pred,prop.chisq = F,prop.t = F,prop.c = F,dnn = c("actual","predicted"))
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
##
## Total Observations in Table:  102135
##
##              | predicted
##       actual |         0 |         1 | Row Total |
## -------------|-----------|-----------|-----------|
##            0 |     81934 |     13045 |     94979 |
##              |     0.863 |     0.137 |     0.930 |
## -------------|-----------|-----------|-----------|
##            1 |       176 |      6980 |      7156 |
##              |     0.025 |     0.975 |     0.070 |
## -------------|-----------|-----------|-----------|
## Column Total |     82110 |     20025 |    102135 |
## -------------|-----------|-----------|-----------|
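The cross table gives row proportions only, so overall accuracy and precision have to be derived from the counts. The sketch below does that arithmetic with the four cell counts copied from the table above; the variable names are assumptions for illustration.

```r
# Counts copied from the CrossTable output (rows = actual, columns = predicted).
tn <- 81934; fp <- 13045
fn <- 176;   tp <- 6980

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)   # share of predicted elite users who are elite
recall    <- tp / (tp + fn)   # share of actual elite users that are found

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

For context, the row totals show that 93.0% of test users are non-elite, so a rule that always predicts non-elite would already reach an accuracy of 0.930.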
##naive bayes with laplace smoother
elite_classifier_laplace <- naiveBayes(train_data_bayes[9],train_data_bayes$eliteStatus,laplace = 1)
##prediction with laplace smoother
elite_classifier_pred_laplace <- predict(elite_classifier_laplace,test_data_bayes)
##cross table
CrossTable(test_data_bayes$eliteStatus,elite_classifier_pred_laplace,prop.chisq = F,prop.t = F,prop.c = F,dnn = c("actual","predicted"))
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
##
## Total Observations in Table:  102135
##
##              | predicted
##       actual |         0 |         1 | Row Total |
## -------------|-----------|-----------|-----------|
##            0 |     87504 |      7475 |     94979 |
##              |     0.921 |     0.079 |     0.930 |
## -------------|-----------|-----------|-----------|
##            1 |       365 |      6791 |      7156 |
##              |     0.051 |     0.949 |     0.070 |
## -------------|-----------|-----------|-----------|
## Column Total |     87869 |     14266 |    102135 |
## -------------|-----------|-----------|-----------|
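Setting laplace = 1 changes the predictions because add-one smoothing removes zero conditional probabilities: a factor level never seen with a class in training would otherwise get probability 0 for that class. The sketch below illustrates the standard add-one formula on hypothetical level counts. The counts and names are invented for illustration and are not taken from train_data_bayes.

```r
# Hypothetical counts of one factor's levels among elite users in training data.
counts <- c(low = 0, medium = 3, high = 7)

# Unsmoothed estimate of P(level | elite): a zero count gives probability 0.
p_raw <- counts / sum(counts)

# Add-one (Laplace) smoothing: add 1 to every count and the number of
# levels to the denominator, so no level gets probability exactly 0.
laplace  <- 1
p_smooth <- (counts + laplace) / (sum(counts) + laplace * length(counts))

rbind(p_raw, p_smooth)
```

The effect is largest for factors with many rarely observed levels, where a single zero count can otherwise rule out a class entirely.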