Web Analytics - UT School of Information

WIRED - Web Analytics Week
• Web Logs overview
• Web Analytics
  - Understanding Queries
  - Tracking Users
• Web Log Reliability
• Web Log Data Mining & KDD

Web Analytics
• Evaluation of Web Information Retrieval (& Web Information Seeking)
• What can we learn?
  - IR systems use
  - Web server administration
• Who are the users?
  - Types of users
  - User situations
• How does it affect or help IR?

Web Server Overview
• Any application that can serve files using the HTTP protocol
  - Text, HTML, XHTML, XML…
  - Graphics
  - CGI, applets, servlets
  - Other media & MIME types
• Apache or MS IIS that serve primarily Web pages
• Servers create ASCII text log files showing:
  - Date, time, bytes transferred, (cache status)
  - Status/error codes, user IP address, (domain name)
  - Server method, URI, misc comments

Web Log Overview
• Access Log
  - Logs information such as page served or time served
• Referer Log
  - Logs name of the server and page that links to the currently served page
  - Not always
  - Can be from any Web site
• Agent Log
  - Logs browser type and operating system
    • Mozilla
    • Windows

What can we learn from Web logs?
• Every time a Web browser requests a file, it gets logged
  - Where the user came from
  - What kind of browser was used to access the server
  - Referring URL
• Every time a page gets served, it gets logged
  - Request time, serve time, bytes transferred, URI, status code

Web Log Analysis in Action
• UT Web log reports (Figures in parentheses refer to the 7 days to 28-Mar-2004 03:00).
  - Successful requests: 39,826,634 (39,596,364)
  - Average successful requests per day: 5,690,083 (5,656,623)
  - Successful requests for pages: 4,189,081 (4,154,717)
  - Average successful requests for pages per day: 598,499 (593,530)
  - Failed requests: 442,129 (439,467)
  - Redirected requests: 1,101,849 (1,093,606)
  - Distinct files requested: 479,022 (473,341)
  - Corrupt logfile lines: 427
  - Data transferred: 278.504 Gbytes (276.650 Gbytes)
  - Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)

Problems with Web Servers
• Actual user or intent not known
• Paths difficult to determine
• Infrequent access challenging to uncover
• No State Information
• Server Hits not Representative
  - Counters inaccurate
• DOS, Floods, Bandwidth can Stop “intended” usage
• Robots, etc.
• ISP Proxy servers
• “5.3 Unsound inferences from data that is logged” Haigh & Megarity, 1998.

Web Server Configuration
• Unique file & directory names = “at a glance analysis”
• Hierarchical directory structure
• Redirect CGI to find referrer
• Use a database
  - store web content
  - record usage data with context of content logged
• Create state information with programming
  - Servlets, ActiveX, Javascript
  - Custom server or log format
• Log rollover, report frequency, special case testing

Log File Format
• Extended Log File Format
  - W3C Working Draft WD-logfile-960323
    192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] "GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 "http://www.amicus.nlcbnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
• Every server generates slightly different logs
  - Versions & operating system issues
  - Admin tweaks to log formats
• Extended Log Format most common
  - WWW Consortium Standards (= apache)

Let’s Look at some logs
• http://www.ischool.utexas.edu/analogmonthly.html
• http://www.ischool.utexas.edu/analogweekly.html

Log Analysis Tools
• Analog
• Webalizer
• Sawmill
• WebTrends
• AWStats
• WWWStat
• GetStats
• Perl Scripts
• Data Mining & Business Intelligence tools

WebTrends
• A whole industry of analytics
• Most popular commercial application

Measuring Web Site Usage
• Now that the Web is a primary source, understanding its use is critical
• Few external cues that the Web site is being used
• What - pages and their content/subject
• How - browsers
• Who - userid or IP
• When - trends, daily, weekly, yearly
• Where - the user is and what page they came from

What can’t you measure?
• Who the user is
  - Always
  - If the user’s needs have changed
• If they’re using the information
  - Browsing vs. Reading vs. Acting on the information
• Changes to site and how they affect each user
• Pages not used at all - and why

Analysis of a Very Large Search Log
• What kinds of patterns can we find?
• Request = query and results page
• 280 GB – Six Weeks of Web Queries
  - Almost 1 Billion Search Requests, 850K valid, 575K queries
  - 285 Million User Sessions (cookie issues)
  - Large volume, less trendy
  - Why are unique queries important?
• Web Users:
  - Use Short Queries in short sessions - 63.7% one request
  - Mostly Look at the First Ten Results only
  - Seldom Modify Queries
• Traditional IR Isn’t Accurately Describing Web Search
• Phrase Searching Could Be Augmented
• Silverstein, Henzinger, Marais, Moricz (1998)

Analysis of a Very Large Search Log
• 2.35 Average Terms Per Query
  - 0 = 20.6% (?)
  - 1 = 25.8%
  - 2 = 26.0%
  - 0 to 2 combined = 72.4%
• Operators Per Query
  - 0 = 79.6%
• Terms Predictable
• First Set of Results Viewed Only = 85%
• Some (Single Term Phrase) Query Correlation
  - Augmentation
  - Taxonomy Input
  - Robots vs. Humans

Web Analytics and IR?
• Knowing access patterns of users
• Lists of search terms
  - Numbers of words
  - Words, concepts to add (synonyms)
  - Types of queries
• Success of searching a site
  - Was a result link clicked on?
  - How many pp/user after a search?
• Is a new or better search interface needed?
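
The query-length figures quoted above (average terms per query, share of one- and two-term queries) are the kind of summary a site's own search log can yield. The following is a minimal sketch in Python, assuming the queries have already been pulled out of the log into a list of strings and that terms are separated by whitespace (so operators and quoted phrases are not treated specially). The function name `term_count_distribution` and the sample queries are illustrative only and do not come from the studies cited.

```python
from collections import Counter


def term_count_distribution(queries):
    """Return (average terms per query, {term_count: share of queries})."""
    # Assumption: a "term" is any whitespace-separated token.
    counts = [len(query.split()) for query in queries]
    if not counts:
        return 0.0, {}
    histogram = Counter(counts)
    total = len(counts)
    shares = {n: histogram[n] / total for n in sorted(histogram)}
    return sum(counts) / total, shares


# Hypothetical sample queries, made up for illustration.
sample_queries = ["web analytics", "apache", "log file format", "", "kdd process"]
average, shares = term_count_distribution(sample_queries)

print(f"Average terms per query: {average:.2f}")
for n, share in shares.items():
    print(f"{n} terms: {share:.1%}")
```

Empty queries are kept and counted as zero terms, which mirrors the "0 terms" row in the slide; dropping them first would raise the average.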
Real Life Information Retrieval
• 51K Queries from Excite (1997)
• Search Terms = 2.21
• Number of Terms
  - 1 = 31%
  - 2 = 31%
  - 3 = 18%
  - (80% Combined)
• Logic & Modifiers (by User)
  - Infrequent
  - AND, “+”, “-”
• Logic & Modifiers (by Query)
  - 6% of Users
  - Less Than 10% of Queries
  - Lots of Mistakes
• Uniqueness of Queries
  - 35% successive
  - 22% modified
  - 43% identical

Real Life Information Retrieval
• Queries per user 2.8
• Sessions
  - Flawed Analysis (User ID)
  - Some Revisits to Query (Result Page Revisits)
• Page Views
  - Accurate, but not by User
• Use of Relevance Feedback (more like this)
  - Not Used Much (~11%)
• Terms Used Typical & frequent
• Mistakes
  - Typos
  - Misspellings
  - Bad (Advanced) Query Formulation
• Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)

KDD for Extracting Knowledge
• Knowledge extraction, information discovery, information extraction, data archeology, data pattern processing, OLAP, HV statistical analysis
• Sounds as if “knowledge” is there to be found.
• User and usage context help find the knowledge
• Hypothesis before analysis
• Why KDD, why now?
  - Data storage, analysis costs
  - Visualization

KDD Process
• Database for structured data and queries
  - How structured, algorithms for queries
  - How results can be understood and visualized
  - Iterative & Interactive, hypothesis driven & hypothesis generating

KDD Efforts
• Data Cleaning
• Formulating the Questions
• “Finding useful features to represent the data” p30
• Models:
  - Classification to fit data into pre-defined classes
  - Regressions to fit predictions & values
  - Clustering to class sets found in data
  - Summarization to briefly describe data
  - Dependency discovery of variable relationships
  - Sequence analysis for time or interaction patterns

Data Prep for Mining the WWW
• Processing the data before mining
• WEBMINER system
  - Site topology
  - Cleaning
  - User identification
  - Session identification (episodes)
  - Path completion

Web Usage Mining
• VL Verification
• Data Mining to Discover Patterns of Use
  - Pre-Processing
  - Pattern Discovery
  - Pattern Analysis
• Site Analysis, Not User Analysis
• Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.N. (2000)

Web Usage Discovery
• Content
  - Text
  - Graphics
  - Features
• Structure
  - Content Organization
  - Templates and Tags
• Usage
  - Patterns
  - Page References
  - Dates and Times
• User Profile
  - Demographics
  - Customer Information

Web Usage Collection
• Types of Data
  - Web Servers
  - Proxies
  - Web Clients
• Data Abstractions
  - Sessions
  - Episodes
  - Clickstreams
  - Page Views
• The Tools for Web Use Verification

Web Usage Preprocessing
• Usage Preprocessing
  - Understanding the Web Use Activities of the Site
  - Extract from Logs
• Content Preprocessing
  - Converting Content Into Formats for Processing
  - Understanding Content (Working with Dev Team)
• Structure Preprocessing
  - Mining Links and Navigation from Site
  - Understanding Page Content and Link Structures

Web Usage Pattern Discovery
• Clustering for Similarities
  - Pages
  - Users
  - Links
• Classification
  - Mapping Data to Pre-defined Classes
  - Rule Discovery
  - Rule Rules
  - Computation Intensive
  - Many Paths to Similar Answers
• Pattern Detection
  - Ordering By Time
  - Predicting Use With Time

Web Usage Mining as Evaluation?
• Mining Goals
  - Improved Design
  - Improved Delivery
  - Improved Content
• Personalization (XMod Data)
• System Improvement (Tech Data)
• Site Modification (IA Data)
• Business Intelligence (Market Data)
• Usage Characterization (User Behavior Data)

Web Analytics Wrap-up
• What can we learn about users?
• What can we learn about services?
• How can we help users improve their use?
• How can IR models benefit from this analysis?
• What kinds of improvements in Web IR systems and their interfaces can be taken from this?
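
The preprocessing steps listed under Data Prep for Mining the WWW (cleaning, user identification, session identification) can be illustrated on log lines in the layout of the Log File Format example. The following Python sketch is one possible reading of those steps, not the WEBMINER implementation: the regular expression, the helper names, the rule that a user is an (IP address, user agent) pair, the list of file extensions dropped during cleaning, and the 30-minute session timeout are all assumptions made for illustration. Only the first sample line comes from the slides; the other three are hypothetical. Path completion is not attempted.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed pattern for lines shaped like the slide's example:
# host ident user [time] "method uri protocol" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Assumption: requests for these file types are not page views.
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cutoff, not from the slides


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


def is_page_request(record):
    """Cleaning step: keep only requests that look like page views."""
    return not record["uri"].lower().endswith(NON_PAGE_SUFFIXES)


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group records by (host, agent), then split on gaps longer than timeout."""
    by_user = defaultdict(list)
    for record in records:
        by_user[(record["host"], record["agent"])].append(record)
    sessions = []
    for user_records in by_user.values():
        user_records.sort(key=lambda r: r["time"])
        current = [user_records[0]]
        for record in user_records[1:]:
            if record["time"] - current[-1]["time"] > timeout:
                sessions.append(current)
                current = [record]
            else:
                current.append(record)
        sessions.append(current)
    return sessions


AGENT = '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"'
sample_lines = [
    # Example line from the slides.
    '192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] "GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 '
    '"http://www.amicus.nlcbnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503" ' + AGENT,
    # Hypothetical lines, made up for illustration.
    '192.117.240.3 - - [24/Jul/1998:00:00:05 -0400] "GET /images/logo.gif HTTP/1.0" 200 512 "-" ' + AGENT,
    '192.117.240.3 - - [24/Jul/1998:00:10:00 -0400] "GET /10/3/index.html HTTP/1.0" 200 1024 "-" ' + AGENT,
    '192.117.240.3 - - [24/Jul/1998:01:30:00 -0400] "GET /10/4/index.html HTTP/1.0" 200 2048 "-" ' + AGENT,
]

records = [r for r in map(parse_line, sample_lines) if r is not None]
pages = [r for r in records if is_page_request(r)]
sessions = sessionize(pages)

for number, session in enumerate(sessions, start=1):
    print(number, [r["uri"] for r in session])
```

Identifying users by IP address and agent inherits the limits noted under Problems with Web Servers: visitors behind an ISP proxy collapse into one "user", and robots are not filtered out here.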
