Web Mining by means of
Concept Lattice Theory
Dr. Joyee Yi Zhao
FernUniversität in Hagen
1
Outline
* Data mining
• Web mining
• Concept Lattice Theory
• Concept Lattices based web mining
• Conclusion
2
Why Is Data Mining Hot?
• Data mining (knowledge discovery in databases)
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information (knowledge)
or patterns from data in large databases or other
information repositories
• Necessity is the mother of invention
– Data is everywhere—data mining should be everywhere,
too!
– Understand and use data—an imminent task!
3
Data Mining: Confluence of
Multiple Disciplines
[Figure: Data Mining at the centre, drawing on Database Technology, Machine Learning (AI), Information Science, Statistics, Visualization, and Other Disciplines]
4
5
Principle of KDD
[Figure: the KDD cycle with the steps Define Problem, Mine Data, and Apply to Problem; the speech bubbles read "How can the data help solve my problem?", "What's hiding in there?", and "Wow! I did not know that."]
6
Data Mining Techniques
• Association Rules
• Sequential Patterns
• Classification
• Clustering
• Similar Images
• Outlier Discovery
• Text/Web Mining
7
Recent Progress of R&D
in Data Mining
• Multi-dimensional data analysis: Data warehouse and OLAP
(on-line analytical processing)
• Association, correlation, and causality analysis
• Sequential patterns and time-series analysis
• Classification: scalability, associative classification, etc.
• Clustering and outlier analysis
• Similarity analysis: curves, trends, images, texts, etc.
• Text mining, Web mining and Weblog analysis
• Spatial, multimedia, scientific data mining
• Data preprocessing and database compression
• Visual data mining, invisible data mining, etc.
8
Association Rules
• Given:
– A database of customer transactions
– Each transaction is a set of items
• Find all rules X => Y that correlate the presence of one set of
items X with another set of items Y
– Example: 98% of people who purchase diapers and baby
food also buy beer.
– Any number of items in the consequent/antecedent of a rule
– Possible to specify constraints on rules (e.g., find only rules
involving expensive imported products)
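A rule X => Y is usually judged by two numbers: its support (how often X and Y occur together) and its confidence (how often Y occurs when X does). The "98%" in the example above is a confidence value. The sketch below computes both on a tiny invented transaction database; the items, the data, and the function names `support` and `confidence` are illustrative assumptions, not taken from the talk.

```python
# Toy transaction database: each transaction is a set of items.
# The data is invented for illustration only.
transactions = [
    {"diapers", "baby food", "beer"},
    {"diapers", "baby food", "beer", "milk"},
    {"diapers", "baby food"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)


def confidence(x, y, db):
    """Of the transactions that contain X, the fraction that also contain Y."""
    return support(set(x) | set(y), db) / support(x, db)


x, y = {"diapers", "baby food"}, {"beer"}
print(f"support of X and Y together: {support(x | y, transactions):.2f}")
print(f"confidence of X => Y:        {confidence(x, y, transactions):.2f}")
```

On this toy data the rule holds in 2 of 5 transactions (support 0.40) and in 2 of the 3 transactions that contain X (confidence about 0.67).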
9
Association Rules (cont.)
• Sample Applications
– Market basket analysis
– Attached mailing in direct marketing
– Fraud detection for medical insurance
– Department store floor/shelf planning
10
Sequential Patterns
• Given:
– A sequence of customer transactions
– Each transaction is a set of items
• Find all maximal sequential patterns supported by more
than a user-specified percentage of customers
• Example: 10% of customers who bought a PC did a
memory upgrade in a subsequent transaction
– 10% is the support of the pattern
• An Apriori-style algorithm can be used to compute frequent
sequences
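The support of a sequential pattern can be sketched as follows: a customer supports the pattern if its itemsets appear, in order, in that customer's transaction sequence. The customer data and the helper names `contains_in_order` and `pattern_support` are assumptions made for this illustration.

```python
def contains_in_order(sequence, pattern):
    """True if each itemset in `pattern` is contained in some transaction of
    `sequence`, with the matching transactions in strictly increasing order."""
    pos = 0
    for wanted in pattern:
        # Skip transactions that do not contain the wanted itemset.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later transaction
    return True


def pattern_support(customers, pattern):
    """Fraction of customers whose sequence contains the pattern."""
    hits = sum(contains_in_order(seq, pattern) for seq in customers.values())
    return hits / len(customers)


# Invented customer sequences: one list of transactions per customer.
customers = {
    "c1": [{"PC"}, {"memory upgrade"}],
    "c2": [{"PC", "mouse"}, {"printer"}, {"memory upgrade", "cable"}],
    "c3": [{"memory upgrade"}, {"PC"}],  # wrong order, does not count
    "c4": [{"printer"}],
}
pattern = [{"PC"}, {"memory upgrade"}]
print(pattern_support(customers, pattern))
```

Here two of the four customers bought a PC and upgraded memory later, so the support is 0.5.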
11
Classification
• Given:
– Database of tuples, each assigned a class label
• Develop a model/profile for each class
– Example profile (good credit):
– (25 <= age <= 40 and income > 40k) or (married = YES)
• Sample applications:
– Credit card approval (good, bad)
– Bank locations (good, fair, poor)
– Treatment effectiveness (good, fair, poor)
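The example profile above can be written as a predicate and applied to new tuples. This is only a sketch: "40k" is read as 40,000, and the applicant records and the name `good_credit` are invented for illustration.

```python
def good_credit(age, income, married):
    """Example profile from the slide: (25 <= age <= 40 and income > 40k)
    or (married = YES). The income threshold 40_000 is an assumed reading."""
    return (25 <= age <= 40 and income > 40_000) or married


# Invented applicants to classify with the profile.
applicants = [
    {"age": 30, "income": 55_000, "married": False},
    {"age": 52, "income": 30_000, "married": True},
    {"age": 45, "income": 90_000, "married": False},
]
labels = ["good" if good_credit(**a) else "other" for a in applicants]
print(labels)
```

The first applicant matches the age and income condition, the second matches the marriage condition, and the third matches neither.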
12
Clustering
• Given:
– Data points and number of desired clusters K
• Group the data points into K clusters
– Data points within clusters are more similar than across
clusters
• Sample applications:
– Customer segmentation
– Market basket customer analysis
– Attached mailing in direct marketing
– Clustering companies with similar growth
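The slide does not name an algorithm. k-means is one common way to group points into K clusters, and the sketch below uses it under simplifying assumptions: a fixed number of iterations, and the first K points as initial centroids so that the run is deterministic. The data points are invented.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, iterations=10):
    centroids = list(points[:k])  # naive initialisation (an assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for c in clusters:
    print(c)
```

With this data the three points near the origin end up in one cluster and the three points near (8.5, 9) in the other.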
13
Similar Images
• Given:
– A set of images
• Find:
– All images similar to a given image
– All pairs of similar images
• Sample applications:
– Medical diagnosis
– Weather prediction
– Web search engine for images
– E-commerce
14
Outlier Discovery
• Given:
– Data points and number of outliers (= n) to find
• Find top n outlier points
– Outliers are considerably dissimilar from the remainder
of the data
• Sample applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
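"Dissimilar from the remainder of the data" needs a concrete score. One possible choice, assumed here and not stated in the talk, is the distance from a point to its k-th nearest neighbour: the larger that distance, the more isolated the point. The function name `top_n_outliers` and the data are illustrative.

```python
import math


def top_n_outliers(points, n, k=2):
    """Return the n points with the largest distance to their k-th nearest neighbour."""
    def score(i):
        distances = sorted(
            math.dist(points[i], q) for j, q in enumerate(points) if j != i
        )
        return distances[k - 1]

    ranked = sorted(range(len(points)), key=score, reverse=True)
    return [points[i] for i in ranked[:n]]


# Four points close together plus two isolated ones (invented data).
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (10.0, 10.0), (1.0, -6.0)]
outliers = top_n_outliers(points, n=2)
print(outliers)
```

The two isolated points are returned, the most distant one first.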
15
Outline
• Data mining
* Web mining
• Concept Lattice Theory
• Concept Lattices based web mining
• Conclusion
16
Web Mining: Challenges
• Today’s search engines are plagued by problems:
– the abundance problem (99% of data of no interest to
99% of people)
– limited coverage of the Web (Internet sources hidden
behind search interfaces)
– limited query interface based on keyword-oriented search
– limited customization to individual users
17
The Web is ...
• The web is a huge collection of documents
– Semistructured (ambiguous structure, HTML, XML)
– Hyperlink information
– Access and usage information
– Dynamic (i.e., new pages are constantly being generated)
18
Web Mining
19
Web Mining
• Web Content Mining
– Extract concept hierarchies/relations from the web
– Automatic categorization
– Automatic search of information resources available online
• Web Usage Mining
– Trend analysis (i.e., web dynamics information)
– Web access association/sequential pattern analysis
– Data from server access logs, user registration or profiles, user sessions or
transactions etc.
• Web Structure Mining
– Mine the web document’s structures and links
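As a small illustration of web structure mining, the sketch below extracts the hyperlinks from a few pages and counts incoming links per page. It uses Python's standard `html.parser`; the page contents, file names, and the class name `LinkExtractor` are invented for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Invented miniature web site.
pages = {
    "index.html": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "a.html": '<p>See <a href="b.html">B</a></p>',
    "b.html": "<p>no links</p>",
}
link_graph = {url: extract_links(html) for url, html in pages.items()}
in_links = Counter(t for targets in link_graph.values() for t in targets)
print(link_graph)
print(in_links)
```

The resulting link graph is the raw material for structure mining; here `b.html` has two incoming links and `a.html` has one.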
20
Semantic Web
• Tim Berners-Lee, inventor of WWW, URI,
HTTP and HTML
• Next generation of the current web
• Enrich the web by machine processable
information which is organized on different
levels
• XML
• RDF
• Ontologies
• Topic Maps
21
XML (eXtensible Markup Language)
• XML & HTML
– XML supports the electronic exchange of machine readable
documents
– HTML is designed primarily for human-readable documents
• XML data shares many features of semistructured
data
– its structure is irregular, and is not always known ahead of time,
and can change frequently and without notice
– easy to convert data from any source into XML
22
RDF (Resource Description Framework)
• An XML-based language for describing
information contained in a web resource
• Triple
– Subject A
– Property C
– Object B
• RDF schema -- a simple datatyping model
for RDF
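The triple model can be sketched without any RDF library: each statement is a (subject, property, object) tuple, and a query is a pattern in which some positions are left open. The `ex:` identifiers and the helper `match` are hypothetical and serve only as an illustration.

```python
# Each statement is a (subject, property, object) triple.
# All identifiers below are made up for the example.
triples = [
    ("ex:page1", "ex:creator", "ex:alice"),
    ("ex:page1", "ex:topic", "ex:web-mining"),
    ("ex:page2", "ex:creator", "ex:alice"),
]


def match(subject=None, prop=None, obj=None):
    """Return all triples that agree with the given positions; None matches anything."""
    return [
        t for t in triples
        if subject in (None, t[0]) and prop in (None, t[1]) and obj in (None, t[2])
    ]


print(match(subject="ex:page1"))                    # everything said about page1
print(match(prop="ex:creator", obj="ex:alice"))     # all resources created by alice
```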
23
Ontologies
• Meta-data schemas
• Providing a controlled vocabulary of
concepts, each with an explicitly defined
and machine processable semantics
• A successful approach for structuring
informal, semi-formal and formal knowledge
24
Topic Maps
• Designed to solve the problem of large
quantities of unorganized information
• Online equivalent of printed indexes
• Allow users to create a large quantity of
metadata and tightly interconnected data
25
Semantic Web Mining =
Semantic Web + Web Mining
• Improve the results of web mining by
exploiting the new semantic structure in the
web
• Exploit web mining for building the
semantic web
26
Outline
• Data mining
• Web mining
* Concept Lattice Theory
• Concept Lattices based web mining
• Conclusion
27
Concept Lattice Theory
• Group objects into classes that materialise
concepts of the domain under study
– a set of objects E
– the related properties E'
– a binary relation R ⊆ E × E'
– partial order on concepts
• let C1 = (X1, X1'), C2 = (X2, X2') with Xi ⊆ E, Xi' ⊆ E'
• C1 < C2 ⇔ X1' ⊆ X2' ⇔ X2 ⊆ X1
– Hasse diagram: generalization/specialization
relationship
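The definitions above can be made concrete with a brute-force sketch that enumerates all concepts of a small context and derives the edges of the Hasse diagram. It follows the order on this slide (C1 < C2 when the property set of C1 is contained in that of C2). The context, which relates invented pages to invented keywords, and all function names are assumptions for illustration; enumerating every subset of E is only feasible for tiny contexts.

```python
from itertools import combinations

# Binary relation R, stored as object -> set of properties (invented data).
R = {
    "p1": {"web", "mining"},
    "p2": {"web", "lattice"},
    "p3": {"web", "mining", "lattice"},
    "p4": {"mining"},
}
E = set(R)                          # objects
E_prime = set().union(*R.values())  # properties


def common_properties(objects):
    """Properties shared by all given objects (all of E' for the empty set)."""
    return set.intersection(*(R[o] for o in objects)) if objects else set(E_prime)


def common_objects(properties):
    """Objects that have all given properties."""
    return {o for o in E if properties <= R[o]}


def concepts():
    """All pairs (X, X') where X' is shared by exactly the objects in X."""
    found = set()
    for r in range(len(E) + 1):
        for subset in combinations(sorted(E), r):
            intent = common_properties(subset)
            extent = common_objects(intent)
            found.add((frozenset(extent), frozenset(intent)))
    return found


def less(c1, c2):
    """Order used on the slide: C1 < C2 iff X1' is a proper subset of X2'."""
    return c1[1] < c2[1]


all_concepts = concepts()
# Hasse diagram: keep only pairs with no concept strictly in between.
hasse_edges = [
    (c1, c2)
    for c1 in all_concepts
    for c2 in all_concepts
    if less(c1, c2) and not any(less(c1, c3) and less(c3, c2) for c3 in all_concepts)
]

for extent, intent in sorted(all_concepts, key=lambda c: len(c[1])):
    print(sorted(extent), sorted(intent))
print(len(hasse_edges), "edges in the Hasse diagram")
```

For this context the sketch finds six concepts, from the most general one (all pages, no shared keyword) down to the most specific one (only p3, all three keywords), connected by seven Hasse edges.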
28
Illustration of a Concept Lattice
[Figure: binary matrix of the relation R, with rows 1 to 4 and columns a, b, c, next to its Hasse diagram]
A matrix data mining context and its Hasse
diagram (concept lattice)
29
Why can Concept Lattices support
knowledge discovery in databases?
• Knowledge discovery
– information discovery combined with knowledge creation
– representation of information to make the inherent logical structure
of the information transparent
• Concept
– the logical structure of information is based on concepts and
concept systems
• Concept Lattice
– as a mathematical abstraction of concept systems, it can help humans
discover information and then create knowledge
30
Outline
• Data mining
• Web mining
• Concept Lattice Theory
* Concept Lattices based web mining
• Conclusion
31
Concept Lattice Theory based
Web Mining Research
• Structure of Concept Lattices
• Web mining by means of Concept Lattices
– normal web mining
– semantic web mining
32
Simplify structure
of concept lattices
– Pruned concept lattice
– Hierarchical concept lattice
– Tree
33
Web Mining Research based
on Concept Lattices
• Agent design to improve the performance of search engine
• User browsing behavior extraction and prediction
• Analyze the structural content of web pages through
exploiting the latent information given by HTML tags
• Extraction of semantic information from unstructured and
semi-structured text
• Combination of normal data mining techniques and
Concept Lattice techniques for web mining
34
Semantic Web Mining Research
based on Concept Lattices
• A Concept Lattice itself is a kind of semantics
• Interesting research:
– Using CL based (clustering) techniques to generate
class hierarchies expressible as RDF schema,
ontologies
– Bringing semantic structure, by means of CL, into web
content, structure and usage mining
35
Outline
• Data mining
• Web mining
• Concept Lattice Theory
• Concept Lattices based web mining
* Conclusion
36
Thank you !!!
37
Useful links
• www.kdnuggets.com
• www.almaden.ibm.com
• www.acm.org/sigkdd/
• www.dmg.org
• www.math.tu-dresden.de/~ganter/fba.html
• www.w3c.org
• www.semanticweb.org
38