Download R - unece

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Machine Learning in R
and its use in the statistical offices
stat.unido.org
[email protected]
1
Outline
1. Machine learning and R
2. R packages
3. Machine learning in official statistics
4. Top 10 algorithms
5. References
2
What I talk about when I talk about
Machine Learning
Machine Learning (ML)
Artificial
Intelegence
(AI)
Data Mining
(DM)
Statistics
3
R and R packages
•
What makes R so useful?
–
•
•
The users can extend and improve the software or
write variations for specific tasks.
The R package mechanism allows packages
written for R to add advanced algorithms,
graphs, machine learning and and mining
techniques
Each R package provides a structured standard
documentation including code application
examples
R and R packages
## Naive Bayes example
> install.packages('e1071', dependencies =
TRUE)
> library(class)
> library(e1071)
> data(iris)
> pairs(iris[1:4], main = "Iris Data
(red=setosa,green=versicolor,blue=virgini
ca)", pch = 21, bg = c("red", "green3",
"blue")[unclass(iris$Species)])
R and R packages
> classifier <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])
setosa
versicolor
virginica
setosa
50
0
0
versicolor
0
47
3
virginica
0
3
47
Machine Learning for Official Statistics
I.
II.
III.
IV.
Automatic Coding
Editing and Imputation
Record Linkage
Other Methods
Automatic coding
A. Automatic coding via Bayesian classifier:
caret, klaR
B. Automatic occupation coding via
CASCOT: algorithm not described
C. Automatic coding via open-source
indexing utility: ?
D. Automatic coding of census variables via
SVM: e1071 (interface to libsvm)
Editing and Imputation
A.
Categorical data imputation via neural networks and
Bayesian networks: neuralnet, gRain, bnlearn, deal
B. Identification of error-containing records via
classification trees: rpart, tree, caret
C. Imputation donor pool screening via cluster analysis:
class, klaR, cluster, kmeans(), hclust()
D. Imputation via Classification and Regression Trees
(CART): rpart, caret, RWeka
E. Determination of imputation matching variables via
Random Forests: randomForest
F. Creation of homogeneous imputation classes via CART:
rpart
G. Derivation of edit rules via association analysis: arules
Record Linkage
•
Weighting vector classification:
– The last major step in record linkage or
record de-duplication
– could be understood as a classification
problem
•
In R: rpart, bagging() in package ipred,
ada, functions svm() and nnet() in
package e1071
Other Methods
A.
B.
C.
D.
E.
F.
G.
Questionnaire consolidation via cluster analysis: class,
klaR, cluster...
Forming non-response weighting groups via classification
trees: rpart, tree, caret
Non-respondent prediction via classification trees: rpart,
tree, caret
Analysis of reporting errors via classification trees: rpart,
tree, caret
Substitutes for surveys via internet scraping: scrapeR,
rvest
Tax evader detection via k-nearest neighbours: class,
kknn
Crop yield estimation via image processing on satellite
imaging data: is this ML?
Do we Need Hundreds of Classiers to Solve Real World
Classication Problems?
•
•
Fernandez-Delgado, Cernadas, Barro (2014)
Evaluate 179 classifiers arising from 17 families
on 121 data sets
By far best are random forests and SVM with
Gaussian kernel
Most of the best classiffiers are implemented in
R and tuned using caret
•
•
•
seems the best alternative to select a classier
implementation
Top 10 ML/DM Algorithms
Xindong Wu and Vipin Kumar (2009)
C4.5 – generates classifiers expressed as decision trees or
ruleset form
2. K-Means – simple iterative method to partition a given
dataset into a userspecified number of clusters, k
3. SVM – support vector machines
4. Apriori - derive association rules
5. EM - Expectation–Maximization algorithm
6. PageRank - produces a static ranking of Web pages
7. AdaBoost – Ensemble learning
8. kNN - k-nearest neighbor classification
9. Naive Bayes – simple classifier, applying the Bayes‘ theorem
with independence assumptions between the features
10. CART - Classification and Regression Trees
1.
The best R packages for ML
1.
2.
3.
4.
5.
6.
7.
e1071: Naive Bayes, SVM, latent class analysis
rpart: regression trees
RandomForest: RF
gbm: generalized boosting models
kernlab: SVM
caret: Classification and Regression Training
neuralnet: neural networks
CRAN Task View: Machine Learning &
Statistical Learning
Machine learning books
17
Related documents