Algorithms of the Intelligent Web
HARALAMBOS MARMANIS
DMITRY BABENKO
MANNING
Greenwich
(74° w. long.)
contents
preface xiii
acknowledgments xvi
about this book xviii
1 What is the intelligent web? 1
1.1 Examples of intelligent web applications 3
1.2 Basic elements of intelligent applications 4
1.3 What applications can benefit from intelligence? 6
Social networking sites 6 • Mashups 7 • Portals 8 • Wikis 9 • Media-sharing sites 9 • Online gaming 10
1.4 How can I build intelligence in my own application? 11
Examine your functionality and your data 11 • Get more data from the web 12
1.5 Machine learning, data mining, and all that 15
1.6 Eight fallacies of intelligent applications 16
Fallacy #1: Your data is reliable 17 • Fallacy #2: Inference happens instantaneously 18 • Fallacy #3: The size of data doesn't matter 18 • Fallacy #4: Scalability of the solution isn't an issue 18 • Fallacy #5: Apply the same good library everywhere 18 • Fallacy #6: The computation time is known 19 • Fallacy #7: Complicated models are better 19 • Fallacy #8: There are models without bias 19
1.7 Summary 19
1.8 References 20
2 Searching 21
2.1 Searching with Lucene 22
Understanding the Lucene code 24 • Understanding the basic stages of search 29
2.2 Why search beyond indexing? 32
2.3 Improving search results based on link analysis 33
An introduction to PageRank 34 • Calculating the PageRank vector 35 • alpha: The effect of teleportation between web pages 38 • Understanding the power method 38 • Combining the index scores and the PageRank scores 43
2.4 Improving search results based on user clicks 45
A first look at user clicks 46 • Using the NaiveBayes classifier 48 • Combining Lucene indexing, PageRank, and user clicks 51
2.5 Ranking Word, PDF, and other documents without links 55
An introduction to DocRank 55 • The inner workings of DocRank 57
2.6 Large-scale implementation issues 61
2.7 Is what you got what you want? Precision and recall 64
2.8 Summary 65
2.9 To do 66
2.10 References 68
3 Creating suggestions and recommendations 69
3.1 An online music store: the basic concepts 70
The concepts of distance and similarity 71 • A closer look at the calculation of similarity 76 • Which is the best similarity formula? 79
3.2 How do recommendation engines work? 80
Recommendations based on similar users 80 • Recommendations based on similar items 89 • Recommendations based on content 92
3.3 Recommending friends, articles, and news stories 99
Introducing MyDiggSpace.com 99 • Finding friends 100 • The inner workings of DiggDelphi 102
3.4 Recommending movies on a site such as Netflix.com 107
An introduction of movie datasets and recommenders 107 • Data normalization and correlation coefficients 110
3.5 Large-scale implementation and evaluation issues 115
3.6 Summary 117
3.7 To Do 117
3.8 References 119
4 Clustering: grouping things together 121
4.1 The need for clustering 122
User groups on a website: a case study 123 • Finding groups with a SQL order by clause 124 • Finding groups with array sorting 125
4.2 An overview of clustering algorithms 128
Clustering algorithms based on cluster structure 129 • Clustering algorithms based on data type and structure 130 • Clustering algorithms based on data size 131
4.3 Link-based algorithms 132
The dendrogram: a basic clustering data structure 132 • A first look at link-based algorithms 134 • The single-link algorithm 135 • The average-link algorithm 137 • The minimum-spanning-tree algorithm 139
4.4 The k-means algorithm 142
A first look at the k-means algorithm 142 • The inner workings of k-means 143
4.5 Robust Clustering Using Links (ROCK) 146
Introducing ROCK 146 • Why does ROCK rock? 147
4.6 DBSCAN 151
A first look at density-based algorithms 151 • The inner workings of DBSCAN 153
4.7 Clustering issues in very large datasets 157
Computational complexity 157 • High dimensionality 158
4.8 Summary 160
4.9 To Do 161
4.10 References 162
5 Classification: placing things where they belong 164
5.1 The need for classification 165
5.2 An overview of classifiers 169
Structural classification algorithms 170 • Statistical classification algorithms 172 • The lifecycle of a classifier 173
5.3 Automatic categorization of emails and spam filtering 174
NaiveBayes classification 175 • Rule-based classification 188
5.4 Fraud detection with neural networks 199
A use case of fraud detection in transactional data 199 • Neural networks overview 201 • A neural network fraud detector at work 203 • The anatomy of the fraud detector neural network 208 • A base class for building general neural networks 214
5.5 Are your results credible? 219
5.6 Classification with very large datasets 223
5.7 Summary 225
5.8 To do 226
5.9 References 230
Classification schemes 230 • Books and articles 230
6 Combining classifiers 232
6.1 Credit worthiness: a case study for combining classifiers 234
A brief description of the data 235 • Generating artificial data for real problems 239
6.2 Credit evaluation with a single classifier 243
The naive Bayes baseline 243 • The decision tree baseline 245 • The neural network baseline 247
6.3 Comparing multiple classifiers on the same data 250
McNemar's test 251 • The difference of proportions test 253 • Cochran's Q test and the F test 255
6.4 Bagging: bootstrap aggregating 257
The bagging classifier at work 258 • A look under the hood of the bagging classifier 260 • Classifier ensembles 263
6.5 Boosting: an iterative improvement approach 265
The boosting classifier at work 266 • A look under the hood of the boosting classifier 268
6.6 Summary 272
6.7 To Do 273
6.8 References 277
7 Putting it all together: an intelligent news portal 278
7.1 An overview of the functionality 280
7.2 Getting and cleansing content 281
Get set. Get ready. Crawl the Web! 281 • Review of the search prerequisites 282 • A default set of retrieved and processed news stories 284
7.3 Searching for news stories 286
7.4 Assigning news categories 288
Order matters! 289 • Classifying with the NewsProcessor class 294 • Meet the classifier 295 • Classification strategy: going beyond low-level assignments 297
7.5 Building news groups with the NewsProcessor class 300
Clustering general news stories 301 • Clustering news stories within a news category 305
7.6 Dynamic content based on the user's ratings 308
7.7 Summary 311
7.8 To do 312
7.9 References 316
appendix A Introduction to BeanShell 317
appendix B Web crawling 319
appendix C Mathematical refresher 323
appendix D Natural language processing 327
appendix E Neural networks 330
index 333