Web Mining by means of Concept Lattice Theory
Dr. Joyee Yi Zhao
FernUniversität in Hagen

Outline
* Data mining
• Web mining
• Concept Lattice Theory
• Concept Lattices based web mining
• Conclusion

Why Is Data Mining Hot?
• Data mining (knowledge discovery in databases)
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories
• Necessity is the mother of invention
  – Data is everywhere, so data mining should be everywhere, too!
  – Understand and use data: an imminent task!

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Machine Learning (AI), Information Science, Statistics, Visualization, and Other Disciplines]

Principle of KDD
[Figure: the KDD process with the steps Define Problem, Mine Data, and Apply to Problem, annotated with "How can the data help solve my problem?", "What's hiding in there?", and "Wow! I did not know that."]

Data Mining Techniques
• Association Rules
• Sequential Patterns
• Classification
• Clustering
• Similar Images
• Outlier Discovery
• Text/Web Mining

Recent Progress of R&D in Data Mining
• Multi-dimensional data analysis: data warehouse and OLAP (on-line analytical processing)
• Association, correlation, and causality analysis
• Sequential patterns and time-series analysis
• Classification: scalability, associative classification, etc.
• Clustering and outlier analysis
• Similarity analysis: curves, trends, images, texts, etc.
• Text mining, Web mining and Weblog analysis
• Spatial, multimedia, scientific data mining
• Data preprocessing and database compression
• Visual data mining, invisible data mining, etc.

Association Rules
• Given:
  – A database of customer transactions
  – Each transaction is a set of items
• Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
  – Example: 98% of people who purchase diapers and baby food also buy beer.
  – Any number of items in the consequent/antecedent of a rule
  – Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)

Association Rules (cont.)
• Sample applications
  – Market basket analysis
  – Attached mailing in direct marketing
  – Fraud detection for medical insurance
  – Department store floor/shelf planning

Sequential Patterns
• Given:
  – A sequence of customer transactions
  – Each transaction is a set of items
• Find all maximal sequential patterns supported by more than a user-specified percentage of customers
• Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction
  – 10% is the support of the pattern
• An Apriori-style algorithm can be used to compute frequent sequences

Classification
• Given:
  – A database of tuples, each assigned a class label
• Develop a model/profile for each class
  – Example profile (good credit):
  – (25 <= age <= 40 and income > 40k) or (married = YES)
• Sample applications:
  – Credit card approval (good, bad)
  – Bank locations (good, fair, poor)
  – Treatment effectiveness (good, fair, poor)

Clustering
• Given:
  – Data points and the number of desired clusters K
• Group the data points into K clusters
  – Data points within a cluster are more similar to each other than to points in other clusters
• Sample applications:
  – Customer segmentation
  – Market basket customer analysis
  – Attached mailing in direct marketing
  – Clustering companies with similar growth

Similar Images
• Given:
  – A set of images
• Find:
  – All images similar to a given image
  – All pairs of similar images
• Sample applications:
  – Medical diagnosis
  – Weather prediction
  – Web search engine for images
  – E-commerce

Outlier Discovery
• Given:
  – Data points and the number of outliers (= n) to find
• Find the top n outlier points
  – Outliers are considerably dissimilar from the remainder of the data
• Sample applications:
  – Credit card fraud detection
  – Telecom fraud detection
  – Customer segmentation
  – Medical analysis

Outline
• Data mining
* Web mining
• Concept Lattice Theory
• Concept Lattices based web mining
• Conclusion

Web Mining: Challenges
• Today's search engines are plagued by problems:
  – the abundance problem (99% of data of no interest to 99% of people)
  – limited coverage of the Web (Internet sources hidden behind search interfaces)
  – limited query interface based on keyword-oriented search
  – limited customization to individual users

Web is ...
• The web is a huge collection of documents
  – Semistructured (ambiguous structure, HTML, XML)
  – Hyperlink information
  – Access and usage information
  – Dynamic (i.e., new pages are constantly being generated)

Web Mining
• Web Content Mining
  – Extract concept hierarchies/relations from the web
  – Automatic categorization
  – Describes the automatic search of information resources available online
• Web Usage Mining
  – Trend analysis (i.e., web dynamics information)
  – Web access association/sequential pattern analysis
  – Data from server access logs, user registration or profiles, user sessions or transactions, etc.
• Web Structure Mining
  – Mine the structure and links of web documents

Semantic Web
• Tim Berners-Lee, inventor of WWW, URI, HTTP and HTML
• Next generation of the current web
• Enrich the web with machine-processable information that is organized on different levels
  – XML
  – RDF
  – Ontologies
  – Topic Maps

XML (eXtensible Markup Language)
• XML & HTML
  – XML supports the electronic exchange of machine-readable documents
  – HTML is designed primarily for human-readable documents
• XML data shares many features of semistructured data
  – its structure is irregular, is not always known ahead of time, and can change frequently and without notice
  – it is easy to convert data from any source into XML

RDF (Resource Description Framework)
• An XML-based language for describing information contained in a web resource
• Triple
  – Subject A
  – Property C
  – Object B
• RDF Schema: a simple datatyping model for RDF

Ontologies
• Meta-data schemas
• Provide a controlled vocabulary of concepts, each with explicitly defined and machine-processable semantics
• A successful approach for structuring informal, semi-formal and formal knowledge

Topic Maps
• Designed to solve the problem of large quantities of unorganized information
• Online equivalent of printed indexes
• Allow users to create a large quantity of metadata and tightly interconnected data

Semantic Web Mining = Semantic Web + Web Mining
• Improve the results of web mining by exploiting the new semantic structure in the web
• Exploit web mining for building the semantic web

Outline
• Data mining
• Web mining
* Concept Lattice Theory
• Concept Lattices based web mining
• Conclusion

Concept Lattice Theory
• Group objects into classes that materialise concepts of the domain under study
  – a set of objects $E$
  – the related properties $E'$
  – a binary relation $R \subseteq E \times E'$
  – a partial order on concepts
    • let $C_1 = (X_1, X'_1)$, $C_2 = (X_2, X'_2)$
    • $C_1 < C_2 \Leftrightarrow X'_1 \subseteq X'_2 \Leftrightarrow X_2 \subseteq X_1$
  – Hasse diagram: generalization/specialization relationship

Illustration of a Concept Lattice
[Figure: a binary relation R given as a matrix with objects 1 to 4 as rows and properties a, b, c as columns, next to its Hasse diagram; the individual cell values are not recoverable]
A matrix data mining context and its Hasse diagram (concept lattice)

Why can Concept Lattices support knowledge discovery in databases?
• Knowledge discovery
  – information discovery combined with knowledge creation
  – representation of information that makes its inherent logical structure transparent
• Concept
  – the logical structure of information is based on concepts and concept systems
• Concept Lattice
  – as a mathematical abstraction of concept systems, it can help humans discover information and then create knowledge

Outline
• Data mining
• Web mining
• Concept Lattice Theory
* Concept Lattices based web mining
• Conclusion

Concept Lattice Theory based Web Mining Research
• Structure of Concept Lattices
• Web mining by means of Concept Lattices
  – normal web mining
  – semantic web mining

Simplify the structure of concept lattices
– Pruned concept lattice
– Hierarchical concept lattice
– Tree

Web Mining Research based on Concept Lattices
• Agent design to improve the performance of search engines
• Extraction and prediction of user browsing behavior
• Analysis of the structural content of web pages by exploiting the latent information given by HTML tags
• Extraction of semantic information from unstructured and semi-structured text
• Combination of normal data mining techniques and Concept Lattice techniques for web mining

Semantic Web Mining Research based on Concept Lattices
• A Concept Lattice itself is a kind of semantics
• Interesting research:
  – Using CL-based (clustering) techniques to generate class hierarchies expressible as RDF Schema or ontologies
  – Bringing semantic structure, by means of CL, into web content, structure and usage mining

Outline
• Data mining
• Web mining
• Concept Lattice Theory
• Concept Lattices based web mining
* Conclusion

Thank you!
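
Backup: concept lattice sketch
The definitions on the Concept Lattice Theory slide (objects E, properties E', relation R, concepts (X, X') and their order) can be illustrated with a small brute-force sketch. The context below is hypothetical: the cell values of the matrix on the illustration slide were lost, so objects 1 to 4 and properties a, b, c are filled in with invented values. The function names are also assumptions made for this sketch, and closing every subset of objects is only practical for toy contexts.

```python
from itertools import combinations

# Hypothetical context (NOT the matrix from the slide, whose cells were lost):
# each object in E maps to the properties in E' related to it by R.
CONTEXT = {
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"b", "c"},
    4: {"a", "b", "c"},
}
ALL_PROPERTIES = set().union(*CONTEXT.values())


def common_properties(objects):
    """Properties shared by every object in `objects` (all properties if empty)."""
    shared = set(ALL_PROPERTIES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return frozenset(shared)


def objects_having(properties):
    """Objects that have every property in `properties`."""
    return frozenset(o for o, props in CONTEXT.items() if properties <= props)


def concepts():
    """Enumerate all concepts (X, X') by closing every subset of objects.

    Brute force: exponential in the number of objects, fine for a toy context.
    """
    found = set()
    objs = list(CONTEXT)
    for size in range(len(objs) + 1):
        for subset in combinations(objs, size):
            intent = common_properties(subset)   # X'
            extent = objects_having(intent)      # X
            found.add((extent, intent))
    return found


def is_below(c1, c2):
    """Order as written on the slide: C1 < C2 iff X'1 is a proper subset of X'2."""
    return c1[1] < c2[1]


LATTICE = concepts()
for extent, intent in sorted(LATTICE, key=lambda c: (len(c[1]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))
```

Drawing an edge between every pair of concepts that are neighbours under is_below would give the Hasse diagram of this toy context.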

Useful links
• www.kdnuggets.com
• www.almaden.ibm.com
• www.acm.org/sigkdd/
• www.dmg.org
• www.math.tu-dresden.de/~ganter/fba.html
• www.w3c.org
• www.semanticweb.org
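
Backup: association rule sketch
The rule X => Y from the Association Rules slide is usually scored by support (the fraction of transactions containing all items of the rule) and confidence (of the transactions containing X, the fraction that also contain Y). A figure such as "98% of people who purchase diapers and baby food also buy beer" reads as a confidence value. The transactions below are invented for illustration, and the function names are assumptions of this sketch rather than part of any library.

```python
# Hypothetical transactions, invented for illustration only.
TRANSACTIONS = [
    {"diapers", "baby food", "beer"},
    {"diapers", "baby food", "beer"},
    {"diapers", "baby food"},
    {"beer", "chips"},
    {"diapers", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X and Y together) / support(X)."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(x | y, transactions) / support_x


X = {"diapers", "baby food"}
Y = {"beer"}
print("support(X => Y):", support(X | Y, TRANSACTIONS))
print("confidence(X => Y):", confidence(X, Y, TRANSACTIONS))
```

An Apriori-style algorithm, as mentioned on the Sequential Patterns slide, would use the same support count but avoid testing itemsets that have an infrequent subset.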