Lecture
Data Mining in R
732A44 Programming in R
Logistic regression: two classes
• Consider a logistic model with one predictor X = price of the car and binary response Y = equipment
• Logistic model
$$\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \log \frac{P(Y=1 \mid X=x)}{1 - P(Y=1 \mid X=x)} = \beta_0 + \beta_1 x$$
$$P(Y=1 \mid X=x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$$
• Use function glm(formula, family, data)
– Formula: Response ~ Model
• The model may contain terms a+b (addition), a:b (interaction) and a*b (addition and interaction); "." stands for all predictors
– Family: specify binomial
Logistic regression: two classes
reg<-glm(X3...Equipment~Price.in.SEK.,
family=binomial, data=mydata);
Logistic regression: several predictors
$$\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \beta_0 + \beta_1^T x$$
$$P(Y=1 \mid X=x) = \frac{\exp(\beta_0 + \beta_1^T x)}{1 + \exp(\beta_0 + \beta_1^T x)}$$
• Data about contraceptive use
– Response: a matrix of successes/failures
– Several analysis plots can be obtained by plot(lrfit)
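A minimal fitting sketch for this setup (the data frame cuse and its column names are assumed here for illustration; they are not given on the slide):
lrfit<-glm(cbind(using, notUsing)~age+education+wantsMore,
family=binomial, data=cuse);
summary(lrfit);
plot(lrfit);   # several diagnostic plots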
Logistic regression
Further comments
• Nominal logistic regression (library mlogit, function mlogit)
• Stepwise model selection: step() function
• Prediction: predict() function
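For example, continuing the reg model fitted earlier (newcars is a hypothetical data frame with a Price.in.SEK. column):
reduced<-step(reg);                                   # stepwise selection by AIC
predict(reduced, newdata=newcars, type="response");   # predicted probabilities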
Smoothing splines
Minimize a penalized sum of squared residuals
$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 + \lambda \int \left( f''(t) \right)^2 \, dt$$
where λ is the smoothing parameter:
• λ = 0: any function interpolating the data
• λ = +∞: the least-squares line fit
Smoothing splines
• smooth.spline(x, y, df, spar, cv, …)
– Df: degrees of freedom
– Spar: penalty (smoothing) parameter
– CV: TRUE = ordinary (leave-one-out) CV, FALSE = GCV, NA = no cross-validation
plot(m2$Kilometer, m2$Price, main="df=40");
res<-smooth.spline(m2$Kilometer, m2$Price, df=40);
lines(res, col="blue");
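The fitted spline can also be evaluated at new x values with predict() (a sketch; the grid below is arbitrary):
newx<-seq(min(m2$Kilometer), max(m2$Kilometer), length.out=100);
pred<-predict(res, x=newx);   # list with components x and y
lines(pred, col="red");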
Generalized additive models
A function g(E(Y | X_1, …, X_p)) of the expected response is additive in the set of inputs, i.e.,
$$g\left( E(Y \mid X_1, \ldots, X_p) \right) = \alpha + s_1(X_1) + \ldots + s_p(X_p)$$
Example: nonlinear logistic regression of a binary response
$$\log \frac{E(Y \mid X=x)}{1 - E(Y \mid X=x)} = \log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \beta_0 + s(x)$$
GAM
• gam(formula, family=gaussian, data, method="GCV.Cp", select=FALSE, sp)
– Method: method for selection of smoothing parameters
– Select: TRUE – variable selection is performed
– Sp: smoothing parameters (maximal df)
– Formula: usual terms and spline terms s(…)
Library: mgcv
• Car properties
bp<-gam(MPG~s(WT, sp=2)+s(SP, sp=1), data=m3)
vis.gam(bp, theta=10, phi=30);
• predict.gam() can be used for predictions
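A minimal prediction sketch (the WT/SP values below are made up for illustration):
newcars<-data.frame(WT=c(2.5, 3.0), SP=c(110, 120));
predict(bp, newdata=newcars);   # use type="response" for the response scale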
GAM
Smoothing components
plot(bp, pages=1)
Principal components analysis
[Figure: data points in the (X1, X2) plane with the new axes PC1 and PC2 overlaid]
Idea: Introduce a new coordinate system (PC1, PC2, …) where
• The first principal component (PC1) is the direction that maximizes the variance of the projected data
• The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed
• …
In the new coordinate system, the coefficients corresponding to the last principal components are very small → those columns can be removed
Principal components analysis
• princomp(x, ...)
m4<-m3;
m4$MODEL<-c();   # drop the MODEL column so only numeric variables remain
res<-princomp(m4);
loadings(res);
plot(res);
biplot(res);
summary(res);
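To decide how many components to keep, one can inspect the cumulative proportion of variance explained (a sketch based on the res object above):
vars<-res$sdev^2;
cumsum(vars)/sum(vars);   # cumulative proportion of variance explained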
Decision trees
[Figure: example decision tree with splits on X1 (thresholds 9 and 15) and X2 (thresholds 16 and 7) and 0/1 leaves, shown together with the corresponding partition of the (X1, X2) plane. Regression tree example.]
Training-validation-test
• Training-validation (60/40)
sub <- sample(nrow(m2), floor(nrow(m2) * 0.6))
training <- m2[sub, ]
validation <- m2[-sub, ]
• If a training-validation-test split is required, use a similar strategy
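For instance, a 60/20/20 split can be done along the same lines (a sketch, not from the slides):
n<-nrow(m2);
idx<-sample(n);
training<-m2[idx[1:floor(0.6*n)], ];
validation<-m2[idx[(floor(0.6*n)+1):floor(0.8*n)], ];
test<-m2[idx[(floor(0.8*n)+1):n], ];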
Decision trees by CART
Growing a full tree
Library: "tree"
• Create tree: tree(formula, data, subset, split = c("deviance", "gini"), …)
– Subset: if only a subset of cases is to be used for training
– Split: splitting criterion
– More parameters via the control argument
• Prune tree with the help of a validation set: prune.tree(tree, newdata, method = c("deviance", "misclass"), …)
• Prune tree with cross-validation: cv.tree(object, FUN = prune.tree, K = 10, ...)
– K is the number of folds in cross-validation
Classification trees: CART
Example: Olive oils in Italy
sub <- sample(nrow(m5), floor(nrow(m5) * 0.6))
training <- m5[sub, ]
validation <- m5[-sub, ]
mytree<-tree(Area~.-Region-X,data=training);
summary(mytree)
plot(mytree,type="uniform");
text(mytree,cex=0.5);
Classification trees: CART
• Dependence of the misclassification rate on the size of the tree:
treeseq1<-prune.tree(mytree, newdata=validation,method="misclass")
plot(treeseq1);
title("Validation");
treeseq2<-cv.tree(mytree, method="misclass")
plot(treeseq2);
title("CV");
Regression trees: CART
mytree2<-tree(eicosenoic~linoleic+linolenic+palmitic+palmitoleic,data=training);
mytree3<-prune.tree(mytree2, best=4) # prune to 4 leaves in total
print(mytree3)
summary(mytree3)
plot.tree(mytree3)
text(mytree3)
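Predictions on new data can then be obtained with predict() (a sketch):
pred<-predict(mytree3, newdata=validation);   # predicted eicosenoic values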
Decision trees: other techniques
• Conditional inference trees
Library: party
training$X<-c();     # remove X and Area so that they are not
training$Area<-c();  # used as predictors for Region
mytree4<-ctree(Region~.,data=training);
print(mytree4)
plot(mytree4, type="simple"); # gives nice plots
• CART, another library: "rpart"
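A minimal rpart sketch (an analogous formula is assumed; Region is coerced to a factor for classification):
library(rpart);
fit<-rpart(as.factor(Region)~., data=training, method="class");
plot(fit); text(fit);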
Neural network
• Input nodes, input layer
• [Hidden nodes, hidden layer(s)]
• Output nodes, output layer
• Weights
• Activation functions
• Combination functions
[Figure: feed-forward network with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
Neural networks
• Feed-forward NNs
Library: neuralnet
• neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …)
– Hidden: vector giving the number of hidden neurons in each layer
– Rep: number of training repetitions of the network
– Startweights: starting weights
– Algorithm: "backprop", "rprop+", "sag", "slr"
– Err.fct: any function, or "sse", "ce" (cross-entropy)
– Act.fct: any function, or "logistic", "tanh"
– Linear.output: TRUE if no activation at the output
• confidence.interval(x, alpha = 0.05): confidence intervals for the weights
• compute(x, covariate): prediction
• plot(x, …): plot the given neural network
Neural networks
• Example
mynet<-neuralnet(
Region~eicosenoic+linoleic+linolenic+palmitic, data=training,
rep=5, hidden=c(2,2),act.fct="tanh")
plot(mynet);
mynet$result.matrix
Neural networks
• Prediction with compute()
• Finding the misclassification rate: table(true_values, predicted_values) – not only for neural networks
• Another package, ready for a qualitative response (classical nnet):
mynet1<-nnet(Region~eicosenoic+linoleic,
data=training, size=3);
coef(mynet1)
predict(mynet1, newdata=validation);
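A sketch of how these pieces fit together for the neuralnet model mynet (this assumes Region is coded numerically, so round() is only a crude way to obtain class labels):
pred<-compute(mynet, validation[, c("eicosenoic","linoleic","linolenic","palmitic")]);
predicted<-round(pred$net.result);
table(validation$Region, predicted);   # confusion table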
Clustering
• The purpose is to identify (separated) groups of observations in the input space
– K-means
– Hierarchical
– Density-based
K-means
• The number of seeds (clusters) K should be given
• Starting seed positions are needed
• kmeans(x, centers, iter.max = 10, nstart = 1)
– X: data frame
– Centers: either the value of K or a set of initial cluster centers
– Iter.max: maximum number of iterations
res<-kmeans(data.frame(m5$linoleic,
m5$eicosenoic), 2);
K-means
• One way to visualize
plot(m5$linoleic, m5$eicosenoic, col=res$cluster);
points(res$centers[,1], res$centers[,2], col = 1:2, pch = 8, cex=2)
Hierarchical clustering
• Agglomerative
– Place each point into its own cluster
– Merge the nearest clusters until only one cluster remains
• What does "two objects are close" mean?
– A measure of proximity (e.g., Euclidean distance for quantitative variables)
• Similarity measure s_rs (=1 if same object, <1 otherwise)
– Ex: correlation
• Dissimilarity measure δ_rs (=0 if same object, >0 otherwise)
– Ex: Euclidean distance
Hierarchical clustering
• hclust(d, method = "complete", members=NULL)
– D: dissimilarity structure (e.g., as produced by dist())
– Method: "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"
Returned: a tree showing the merging sequence
• cutree(tree, k = NULL, h = NULL)
– K: number of clusters to make
– H: the height at which to cut
Returned: cluster indices
Hierarchical clustering
• Example
x<-data.frame(m5$linolenic, m5$eicosenoic);
m5_dist<-dist(x);
m5_dend<-hclust(m5_dist,
method="complete")
plot(m5_dend);
Hierarchical clustering
• Example
clust=cutree(m5_dend, k=2);
plot(m5$linoleic, m5$eicosenoic, col=clust);
→ DO NOT forget to standardize!
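For example (a sketch):
x_std<-scale(x);               # standardize each variable to mean 0, sd 1
m5_dist<-dist(x_std);
m5_dend<-hclust(m5_dist, method="complete");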
Density-based clustering
• Kernel-based density estimation
Library: pdfCluster
• pdfCluster(x, h = h.norm(x), hmult = 0.75, …)
– X: data to be partitioned
– h: a vector of smoothing parameters
– Hmult: shrinkage factor
x<-data.frame(m5$linolenic, m5$eicosenoic);
res<-pdfCluster(x);
plot(res)
Reference
http://cran.r-project.org/doc/contrib/YanchangZhaorefcard-data-mining.pdf