Course Introduction

Web Information Retrieval and Extraction
Chia-Hui Chang, Associate Professor
National Central University, Taiwan
[email protected]
Course Content
    Web Information Retrieval
        Browsing via categories
        Searching via search engines
        Query answering
    Web Information Integration
        Web page collection
        Data extraction from semi-structured Web pages
        Data integration
Sep. 21, 2004
Web Categories
    Yahoo http://www.yahoo.com
        Fourteen categories and ninety subcategories
        Categorization by humans
    Technology
        Document classification
    Pros and Cons
        Overview of the content in the database
        Browsing without specific targets
Search Engines
    Google http://www.google.com
        Search by keyword matching
        Business model
    Technology
        Web Crawling
        Indexing for fast search
        Ranking for good results
    Pros and Cons
        Search engines locate the documents, not the answers
Question Answering
    Ask Jeeves http://www.ask.com
        Input a question or keywords
        Relevance feedback from users to clarify the targets
    ExtrAns (Molla et al., 2003)
    Technology
        Text information extraction
        Natural Language Processing
Web Page Collection
    Metacrawler http://www.metacrawler.com/
        Google · Yahoo · Ask Jeeves · About · LookSmart · Overture · FindWhat
    Ebay http://www.ebay.com/
        Information asymmetry between buyers and sellers
    Technology
        Program generators
            WNDL, W4F, XWrap, Robomaker
Data Extraction from Semistructured Documents
    Example
    Technology
        Information Extraction Systems
            WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc.
        Data Annotation
        Wrapper induction is an excellent exercise of machine learning technologies
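As a rough illustration of what a wrapper for a semi-structured page does, the sketch below applies a hand-written left/right-delimiter wrapper to a toy page, in the spirit of the LR wrappers associated with WIEN. The function name apply_lr_wrapper, the sample page, and the delimiters are all assumptions made for this example; a wrapper induction system would learn the delimiters from labeled pages rather than take them as input.

```python
def apply_lr_wrapper(page, delimiters):
    """Extract tuples from `page` using one (left, right) delimiter pair per attribute.

    Hypothetical helper for illustration: it scans the page left to right and
    stops as soon as any delimiter can no longer be found.
    """
    if not delimiters:
        return []
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Toy page (made up for this example): each row holds a country and a dialing code.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
delimiters = [("<b>", "</b>"), ("<i>", "</i>")]

rows = apply_lr_wrapper(page, delimiters)
print(rows)  # [('Congo', '242'), ('Egypt', '20')]
```

The delimiters are the whole "program" here, which is why learning them from examples can be framed as a machine learning task.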
Data Integration
    Technology
        Template based interface design
            Microsoft
            Visual Programming tools
Available Techniques
    Artificial Intelligence
        Search and Logic programming
    Machine Learning
        Supervised learning (classification)
        Unsupervised learning (clustering)
    Database and Warehousing
        OLAP and Iceberg queries
    Data Mining
        Pattern mining from large data sets
    Other Disciplines
        Statistics, neural network, genetic algorithms, etc.
Classical Tasks
    Classification
        Artificial Intelligence, Machine Learning
    Clustering
        Pattern recognition, neural network
    Pattern Mining
        Association rules, sequential patterns, episodes mining, periodic patterns, frequent continuities, etc.
Classification Methods
    Supervised Learning (Concept Learning)
        General-to-specific ordering
        Decision tree learning
        Bayesian learning
        Instance-based learning
        Sequential covering algorithms
        Artificial neural networks
        Genetic algorithms
    Reference: Mitchell, 1997
Clustering Algorithms
    Unsupervised learning (comparative analysis)
        Partition Methods
        Hierarchical Methods
        Model-based Clustering Methods
        Density-based Methods
        Grid-based Methods
    Reference: Han and Kamber (Chapter 8)
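To make the first family in the list above concrete, here is a minimal sketch of k-means, one widely used partition method. The slide does not name a specific algorithm, so choosing k-means is an assumption, as are the function name kmeans, the toy points, and the hand-picked starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd-style k-means on 2-D points, starting from the given centroids.

    Returns the final centroids and the cluster index assigned to each point.
    """
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, assignment


# Toy data (made up): two visibly separate groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, labels = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Unlike the classification methods, no labels are supplied: the grouping comes only from distances between points.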
Pattern Mining
    Various kinds of patterns
        Association Rules
            Closed itemsets, maximal itemsets, non-redundant rules, etc.
        Sequential patterns
        Episodes mining
        Periodic patterns
        Frequent continuities
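For the association rules named above, the two standard measures are support (the fraction of transactions containing an itemset) and confidence (the support of the whole rule divided by the support of its antecedent). The sketch below computes both by brute force over a made-up set of transactions; the function names and the data are assumptions for illustration, and practical miners avoid enumerating every candidate this way.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force search for all itemsets whose support is >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
                found = True
        if not found:
            # No frequent itemset of this size, so none of a larger size either.
            break
    return result


# Toy market-basket data (made up for this example).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

frequent = frequent_itemsets(transactions, min_support=0.5)
print(len(frequent))                                  # 5
print(confidence({"butter"}, {"bread"}, transactions))  # 1.0
```

Here every transaction with butter also has bread, so the rule butter -> bread has confidence 1.0 even though its support is only 0.5.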
Applications
    Relational Data
        E.g. Northern Group Retail (Business Intelligence)
        Banking, Insurance, Health, others
    Web Information Retrieval and Extraction
    Bioinformatics
    Multimedia Mining
    Spatial Data Mining
    Time-series Data Mining
Techniques from Information Retrieval (IR)
    Text Operations
        Lexical analysis of the text
        Elimination of stop words
        Index term selection
    Indexing and Searching
        Inverted files
        Suffix trees and suffix arrays
        Signature files
    Ranking Models
    Query Operations
        Relevance feedback
        Query expansion
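The text operations and the inverted file listed above fit together in a few lines. The sketch below performs a simple lexical analysis, drops stop words, builds an inverted file, and answers a conjunctive keyword query. The stop-word list, the sample documents, and the function names are assumptions chosen for illustration; ranking, relevance feedback, and query expansion are left out.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption; real systems use longer lists).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "from", "in", "to", "via"}


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_file(docs):
    """Map each index term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token not in STOP_WORDS:  # elimination of stop words
                index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every non-stop-word query term."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Sample documents (made up from phrases used in this course outline).
docs = {
    1: "Browsing via categories",
    2: "Searching via search engines",
    3: "Data extraction from semi-structured Web pages",
}
index = build_inverted_file(docs)
print(search(index, "web data"))  # [3]
```

The lookup touches only the posting lists of the query terms, which is why the inverted file supports fast search over large collections.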
Course Schedule
    Techniques from Information Retrieval
        Text Operations
        Indexing and Searching
        Ranking Models
        Query Operations
    Text Information Extraction for Query answering
        AutoSlog, SRV, Rapier, etc.
    Data extraction from semi-structured Web pages
        WIEN, Softmealy, Stalker, IEPAD, DeLA, Roadrunner, EXALG, OLERA, etc.
    Web page collection
        XWrap, W4F, Robomaker, etc.
Grading
    Two projects (by groups): 50%
        Chosen from the topics covered in the course
        Presentation and reports
    Paper reading (by yourself): 20%
        Presentation
    Information Integration Projects: 30%
        Chosen freely
        Presentation and reports
References
    Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison Wesley.
    Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
    Mitchell, T. M. 1997. Machine Learning. McGraw-Hill.
    Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.