Web Information Retrieval and Extraction
Chia-Hui Chang, Associate Professor
National Central University, Taiwan
Sep. 16, 2005
Course Content
  Web Information Integration
  Web Information Retrieval
  Traditional IR systems
  Web Mining
Sep. 21, 2004
Topic I: Web Information Integration
  Search Interface Integration
  Web page collection
  Web data extraction
  Search result integration
  Web Service
Web Page Collection
  Metacrawler http://www.metacrawler.com/
    Google · Yahoo · Ask Jeeves · About · LookSmart · Overture · FindWhat
  eBay http://www.ebay.com/
    Information asymmetry between buyers and sellers
  Technology
    Program generators
    WNDL, W4F, XWrap, Robomaker
Web Data Extraction
  Example
  Technology
    Information Extraction Systems
      WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc.
    Data Annotation
    Wrapper induction is an excellent exercise in applying machine learning techniques
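Illustration (not from the original slides): a minimal sketch of wrapper induction for one field, in the spirit of the left/right-delimiter (LR) wrappers associated with WIEN. The function names and the toy HTML are assumptions made for this example. Taking the longest common suffix and prefix of the surrounding context is a simplification, not the published learning algorithm.

```python
import os

def common_prefix(strings):
    """Longest common prefix of a list of strings (character-wise)."""
    return os.path.commonprefix(strings)

def common_suffix(strings):
    """Longest common suffix, computed on the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_lr(examples):
    """Learn (left, right) delimiters from labeled (page, value) pairs.

    Each pair marks the first occurrence of `value` in `page` as a target.
    """
    before, after = [], []
    for page, value in examples:
        start = page.index(value)
        before.append(page[:start])
        after.append(page[start + len(value):])
    return common_suffix(before), common_prefix(after)

def extract(page, left, right):
    """Return every substring enclosed by the left and right delimiters."""
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(right)
    return results

# Hypothetical training page with two labeled country names.
train_page = ("<ul><li><b>Congo</b> <i>242</i></li>"
              "<li><b>Spain</b> <i>34</i></li></ul>")
left, right = learn_lr([(train_page, "Congo"), (train_page, "Spain")])

# Apply the learned wrapper to an unseen page with the same layout.
test_page = ("<ul><li><b>Egypt</b> <i>20</i></li>"
             "<li><b>Belize</b> <i>501</i></li></ul>")
countries = extract(test_page, left, right)
```

With too few examples the learned delimiters can overfit to incidental characters in the training pages, which is one reason practical systems check candidate delimiters against all labeled pages.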
Topic II: Web Information Retrieval
  From User Perspective
    Browsing via categories
    Searching via search engines
    Query answering
  From System Perspective
    Web crawling
    Indexing and querying
    Link-based ranking
    Query answering
    Semantic Web, XML retrieval, etc.
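Illustration (not from the original slides): link-based ranking can be sketched with a small PageRank computation by power iteration. The slides do not name a specific ranking algorithm, so the choice of PageRank, the damping factor of 0.85, the function name, and the toy link graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: most links point at C.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
```

Pages with many in-links from well-ranked pages end up with higher scores, independent of any query.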
Web Categories
  Yahoo http://www.yahoo.com
    Fourteen categories and ninety subcategories
    Categorization by humans
  Technology
    Document classification
  Pros and Cons
    Overview of the content in the database
    Browsing without specific targets
Search Engines
  Google http://www.google.com
    Search by keyword matching
    Business model
  Technology
    Web crawling
    Indexing for fast search
    Ranking for good results
  Pros and Cons
    Search engines locate documents, not answers
Question Answering
  Ask Jeeves http://www.ask.com
    Input a question or keywords
    Relevance feedback from users to clarify the targets
  ExtrAns (Molla et al., 2003)
  Technology
    Text information extraction
    Natural Language Processing
Topic III: Techniques from Traditional IR
  Text Operations
    Lexical analysis of the text
    Elimination of stop words
    Index term selection
  Indexing and Searching
    Inverted files
    Suffix trees and suffix arrays
    Signature files
  IR Model and Ranking Technique
  Query Operations
    Relevance feedback
    Query expansion
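Illustration (not from the original slides): the text operations and the inverted file listed above can be sketched together. The stop-word list, the tokenizer pattern, the function names, and the three toy documents are assumptions made for this example; real systems use larger stop lists and usually add stemming.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lexical analysis: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(text):
    """Eliminate stop words; the remaining tokens serve as index terms."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

def build_inverted_file(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index

def search_and(index, query):
    """Boolean AND query: documents containing every non-stop query term."""
    terms = index_terms(query)
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)

docs = {
    1: "The crawler collects Web pages",
    2: "Indexing of Web pages for fast search",
    3: "Ranking the search results",
}
index = build_inverted_file(docs)
hits_web_pages = search_and(index, "web pages")
hits_search = search_and(index, "the search")
```

Because each term maps directly to its posting set, a query touches only the postings of its own terms rather than scanning every document.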
Topic IV: Web Mining
  Usage Analysis
  Focused Crawling
  Clustering of Web search results
  Text classification
Available Techniques
  Artificial Intelligence
    Search and logic programming
  Machine Learning
    Supervised learning (classification)
    Unsupervised learning (clustering)
  Database and Warehousing
    OLAP and iceberg queries
  Data Mining
    Pattern mining from large data sets
  Other Disciplines
    Statistics, neural networks, genetic algorithms, etc.
Classical Tasks
  Classification
    Artificial Intelligence, Machine Learning
  Clustering
    Pattern recognition, neural networks
  Pattern Mining
    Association rules, sequential patterns, episode mining, periodic patterns, frequent continuities, etc.
Classification Methods
  Supervised Learning (Concept Learning)
    General-to-specific ordering
    Decision tree learning
    Bayesian learning
    Instance-based learning
    Sequential covering algorithms
    Artificial neural networks
    Genetic algorithms
  Reference: Mitchell, 1997
Clustering Algorithms
  Unsupervised learning (comparative analysis)
    Partition Methods
    Hierarchical Methods
    Model-based Clustering Methods
    Density-based Methods
    Grid-based Methods
  Reference: Han and Kamber (Chapter 8)
Pattern Mining
  Various kinds of patterns
    Association Rules
      Closed itemsets, maximal itemsets, non-redundant rules, etc.
    Sequential patterns
    Episode mining
    Periodic patterns
    Frequent continuities
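Illustration (not from the original slides): the two standard measures behind association rules, support and confidence, sketched on a handful of Web sessions. The session data and function names are hypothetical and chosen only for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical sessions: the set of pages each visitor viewed.
sessions = [
    {"home", "news", "sports"},
    {"home", "news"},
    {"home", "sports"},
    {"news", "weather"},
]
sup_home_news = support(sessions, {"home", "news"})
conf_home_to_news = confidence(sessions, {"home"}, {"news"})
```

Here the rule home -> news holds in 2 of the 4 sessions (support 0.5) and in 2 of the 3 sessions that contain home (confidence 2/3).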
Applications
  Relational Data
    E.g. Northern Group Retail (Business Intelligence)
    Banking, Insurance, Health, others
  Web Information Retrieval and Extraction
  Bioinformatics
  Multimedia Mining
  Spatial Data Mining
  Time-series Data Mining
Course Schedule
  Web Data Extraction (3 weeks)
  Web Interface Integration (1 week)
  Web Page Collection (1 week)
  Techniques from Traditional IR (2 weeks)
  Query Answering (1 week)
  Link Based Analysis (1 week)
  Focused Crawling (1 week)
  Web Usage Mining (1 week)
  Clustering Search Result (1 week)
  Text Classification (1 week)
Grading
  Project I: 30%
    Implementation of the chosen paper (W10)
  Project II: 30%
    Topic can be chosen freely (W16)
  Paper reading: 20%
    Presentation
  Homework: 10%
  Involvement in the Class: 10%
References
  Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison Wesley.
  Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
  Mitchell, T. M. 1997. Machine Learning. McGraw-Hill.
  Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.