Web Mining/Web Usage Mining
MMIS 2 VU SS 2011 - 707.025
Denis Helic
KMI, TU Graz
Mar 08, 2012

Denis Helic (KMI, TU Graz) Web Mining/Web Usage Mining Mar 08, 2012

Introduction
- The Web is the largest data source in the world
- Web Mining aims to extract and mine knowledge from the data on the Web
- Data → Information (Data in context) → Knowledge (Information in context)
- Typically, knowledge resides inside the human mind
- Automatic extraction prepares it for humans

Example: Navigational behavior on the Web
- Study by Huberman in 1998: "Strong Regularities in World Wide Web Surfing"
- Observing the number of links users follow on a website
- Theoretical model confirmed with the log analysis of several large websites
- Figure: Number of links followed vs. number of users

Introduction
- Web Mining is a multidisciplinary field
- Data mining, machine learning, network science
- Statistics, information retrieval, multimedia, etc.
- Databases, in particular NoSQL databases
- Map/Reduce, GraphDB, etc.
- Lack of structure, heterogeneity → very challenging task

Opportunities and Challenges
- The amount of information is huge and easily accessible
- The coverage of information is huge (information on anything)
- All types of information exist (structured databases, text, multimedia, ...)
- Much of the Web information is semi-structured (HTML)
- Much of the Web information is linked

Opportunities and Challenges
- A lot of redundancy (copy&paste instead of linking)
- A lot of noise (advertisement, copyright notices, navigation panels, ...)
- A lot of Web services that provide different responses for different request parameters
- The Web is dynamic (information changes → snapshots)
- It is a virtual society → not only about data but also about people

Web Mining
Web Mining classification
- Web Usage Mining
  - User access and interaction patterns
  - Search access and search interaction → search query logs
  - Navigation and browsing → access logs
  - We will deal mostly with this topic

Web Mining
- Web Structure Mining
  - Discover knowledge from the link structure
  - E.g. PageRank
  - But also the HITS algorithm
  - Discussed in e.g. Web Science or MMIS1

Web Mining
- Web Content Mining
  - Mining, integration and extraction of knowledge from the Web content
  - E.g. clustering search results according to the content similarity
  - Sentiment analysis (positive, negative opinions, ...)
  - Discussed in e.g. Application areas of Knowledge Management

Web Mining
- A subcategory that belongs to all other categories
- Web Metadata Mining
  - Extraction of knowledge from the user metadata, e.g. tags
  - Tags are also content, tags are typically represented as links, tags are a specific product of interaction with the system
  - But other types of metadata are possible: e.g. Wikipedia categories
  - We will deal with extraction of hierarchies from Web metadata

Data Sources
- Web Metadata Mining
  - Datasets of diverse social Web sites
  - E.g. Wikipedia dumps
  - Crawls from tagging systems, e.g.
    delicious or flickr
  - Typically crawled via APIs offered by those systems
  - Very large files

Delicious crawl
000001b9e9e5be0c86cac873e42c2c4d basil3whitehouse http://en.wikipedia.org/wiki/Roomano 1176073200 food cheese
00000c9d3fee7592680fa80646c36fa7 NicoC http://en.wikipedia.org/wiki/Green_Anaconda 1170720000 anacondas animalinfo

Data Sources
- Web Usage Mining
  - Server level collection
  - Client level collection
  - Proxy level collection
  - Very large files and multiple files

Server level collection
2012-03-07 00:14:20,469 |INFO| /af/AEIOU/Conrad_von_H%C3%B6tzendorf,_Franz_Freiherr| -|66.249.66.206| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2012-03-07 00:14:21,026 |INFO| /af/Wissenssammlungen/Fossilien/Escharella| -|62.47.22.30|Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)

Server level collection
- Data sent with POST requests is typically not logged
- Cache hits are not logged
- Tracking of user sessions is difficult
- Cookies, query data stored in separate files → integration needed
- Single site but multiple users

Client level collection
- JavaScript or plugin, extension code
- E.g. Google Analytics sending client data from JavaScript to Google for a specific site
- Search toolbars for collecting search and navigation(!)
  paths
- No problems with caching or sessions
- Single or multiple sites but single user

Proxy level collection
- Proxy servers in organizations
- Multiple users and multiple sites
- Users are anonymous
- Still possible to track sessions with heuristics

Web Usage Mining Framework
Figure 1: High Level Web Usage Mining Process (Site Files and Raw Logs → Preprocessing → Preprocessed Clickstream Data → Rules, Patterns, and Statistics → "Interesting" Rules, Patterns, and Statistics)

Web Metadata Mining
- Process slightly different
- E.g. instead of log data we have a raw dataset
- Depending on the task there might be some additional steps
- E.g. extracting a hierarchy
- After the analysis apply an optimal algorithm for hierarchy extraction

Web Metadata Preprocessing
- Preprocessing typically involves removing irrelevant data
- Stemming
- Grouping and integration of data
- Sorting, etc.

Web Metadata Preprocessing
- Depending on the file size (up to e.g. 40-50G), Unix shell commands are very useful
- E.g. awk, sed, sort, uniq, grep, wc, ...
- Also perl
- E.g. distribution of items: sort -n -r data.txt | uniq -c
- Filter lines: grep -v "null"

Web Metadata Preprocessing
- Wikipedia dumps, e.g.
  link dump 30G
(12,0,'Alain_Badiou'),(12,0,'Albert_Camus')
perl -p -i.bac -e "s/\((.+?),'(.+?)',.+?'\)(,|;)/\1,\2\n/g" test.txt

Web Usage Preprocessing
- Difficult because on the server we have only IP address, agent, server click stream
- We need to identify users and sessions
- Single IP but multiple sessions because of ISP proxies
- Multiple IPs but single user using different machines
- Multiple agents but single user even from the same machine

Web Usage Preprocessing
- Assuming that each user has been identified (e.g. through cookies/IP-agent/path analysis)
- We need to extract sessions
- Difficult to know when a user left the site for another site
- Session time-out, typically 30 minutes
- Problems with client side caching
- If session state is managed elsewhere, it is difficult to know what content is served

Web Usage Preprocessing
- Session heuristics
- Time heuristics
  - Total time must not exceed 30 minutes
  - Total time at a single page must not exceed 10 minutes
- Path heuristics (href)
  - A page must be reached from a previous page in the same session

Pattern Discovery
- Based on insights and algorithms from statistics, data mining, machine learning and pattern recognition
- Statistical analysis, association rules, clustering, classification, sequential patterns, ...
- Statistical analysis: descriptive statistics
  - Frequency, mean, median, mode, standard deviation, ...
  - E.g. access statistics, average time spent on page, etc.
  - Outlier detection, e.g.
    non-valid URLs

Pattern Discovery
- Association rules: correlation statistics
  - Which pages are often visited in the same session
  - Correlation of visits to two non-linked pages
  - Improving the site navigation structure
- Clustering: grouping similar items together
  - E.g. usage clusters and page clusters
  - Improving search results by showing similar pages

Pattern Discovery
- Classification: labeling pages with labels from a predefined set
  - User profiling
  - Classifying users to product groups such as e.g. music, movies, etc.
- Sequential patterns: identify time-ordered sequences of visits
  - Can be used to predict future visit patterns
  - Identify points of changing directions, etc.

Pattern Analysis
- Visual analytics
- Filter out what is not needed
- Concentrate on patterns important for the task at hand
- E.g. to improve navigation structure
  - Identify navigation sequences and navigational hubs
  - E.g. problems in continuing from the hubs
  - Potential improvements, e.g. more links, hierarchy, more hints what is behind links, ...

Applications
- Personalization
- Dynamic recommendations of links, pages, products, ...
- E.g. Facebook
  - You click a couple of times on liberal blogs posted by your liberal friends
  - The conservative blogs posted by your conservative friends are not shown anymore

Applications
- System improvement
- Depending on patterns in accessing you might design new caching strategies
- Also load balancing, or data distribution
- Security: you might recognize malicious access, ...
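
One way access patterns could feed into such decisions is to count which pages occur together in the same session, the association-rule idea from the Pattern Discovery slides. The following is only a minimal sketch: the function name pair_support and the example page paths are hypothetical, and it computes plain pair support rather than a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_support(sessions):
    """Return, for each unordered page pair, the fraction of sessions containing both pages."""
    if not sessions:
        return {}
    counts = Counter()
    for pages in sessions:
        # Count each pair at most once per session, in a canonical (sorted) order.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    total = len(sessions)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical sessions, each a list of visited page paths.
sessions = [
    ["/home", "/music", "/movies"],
    ["/home", "/music"],
    ["/music", "/movies"],
    ["/home", "/books"],
]
support = pair_support(sessions)

# Print the pairs, most frequently co-visited first.
for pair, value in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, value)
```

Pairs with high support that are not linked on the site are candidates for better linking; pairs with high support in general are candidates for being cached or prefetched together.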
Applications
- Site modification
- Redesigning content and structure
- Better linking
- More usable navigation structures
- Removing distractions, etc.
- Evaluation of improvements

Further Info
- Web usage mining: discovery and applications of usage patterns from Web data
  http://dl.acm.org/citation.cfm?id=846183.846188
- Book: Web Data Mining
  http://www.cs.uic.edu/~liub/WebMiningBook.html
- Tutorial: Web Content Mining
  http://www.cs.uic.edu/~liub/WebContentMining.html
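
The time heuristics from the Web Usage Preprocessing slides (total session time at most 30 minutes, at most 10 minutes at a single page) can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the function split_sessions, the helper at, and the example paths are hypothetical, the clicks are assumed to belong to one already identified user and to be sorted by time, and "time at a single page" is approximated by the gap between two consecutive clicks.

```python
from datetime import datetime, timedelta

SESSION_LIMIT = timedelta(minutes=30)  # maximum total session time (from the slides)
PAGE_LIMIT = timedelta(minutes=10)     # maximum time at a single page (from the slides)


def split_sessions(clicks):
    """Split one user's time-ordered (timestamp, url) clicks into sessions."""
    sessions = []
    current = []
    for timestamp, url in clicks:
        if current:
            total_too_long = timestamp - current[0][0] > SESSION_LIMIT
            stay_too_long = timestamp - current[-1][0] > PAGE_LIMIT
            if total_too_long or stay_too_long:
                # Either heuristic is violated: close the session and start a new one.
                sessions.append(current)
                current = []
        current.append((timestamp, url))
    if current:
        sessions.append(current)
    return sessions


def at(hhmm):
    """Hypothetical helper: build a timestamp on an arbitrary example day."""
    return datetime.strptime("2012-03-07 " + hhmm, "%Y-%m-%d %H:%M")


clicks = [
    (at("00:14"), "/a"),
    (at("00:20"), "/b"),
    (at("00:35"), "/c"),  # 15 minutes after /b: page limit exceeded
    (at("00:40"), "/d"),
    (at("00:49"), "/e"),
    (at("00:58"), "/f"),
    (at("01:06"), "/g"),  # 31 minutes after /c: session limit exceeded
]
sessions = split_sessions(clicks)
for number, session in enumerate(sessions, start=1):
    print(number, [url for _, url in session])
```

The path heuristic (a page must be reachable from a previous page in the same session) would additionally need the site's link structure and is left out of this sketch.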