Download ppt - UT School of Information

Web Analytics
Xuejiao Liu
INF 385F: WIRED
Fall 2004

Outline
- Introduction
  - What is Web Analytics
  - Why Web Analytics matters
- Secondary readings
  - Log file analysis
  - Web usage mining
  - Data preparation
  - KDD process
  - Document access in repositories

Log File Lowdown (Michael Calore, 2001)
- Log file
- What is in a log file
  - Traffic
  - Audience
  - Browsers/Platforms
  - Errors
  - Referers
- Sample log file entry:
  adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741 "http://www.e-angelica.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"
- Log file analyzers
  - WebTrends, Sawmill, Analog, Webalizer, HTTP-analyze

WebTrends
- A log file analyzer
- Advantages
  - Fast and effective
  - User-friendly interface
  - Feature-rich
  - Supports different operating systems
- Disadvantages
  - Not free

The KDD Process for Extracting Useful Knowledge from Volumes of Data (Fayyad, U., G. Piatetsky-Shapiro, et al. 1996)
- KDD: Knowledge Discovery in Databases
- The value of data
- Definitions
  - KDD
  - Data mining

The KDD Process
1. Creating a target dataset
2. Preprocessing and data cleaning
3. Data reduction and projection
4. Data mining
   - Choosing the data mining function
   - Choosing the data mining algorithm
5. Interpretation and evaluation

The KDD Process: Data Mining
- Data mining involves fitting models to, or determining patterns from, observed data
- Components of a data mining algorithm
  - The model
  - The preference criterion
  - The search algorithm
- Model functions
  - Classification
  - Regression
  - Clustering
  - Dependency modeling
  - Link analysis
- Goals of data mining
  - Predictive and descriptive

Data Preparation for Mining World Wide Web Browsing Patterns (Cooley, R. W., B. Mobasher, et al. 1999)
- Web usage mining vs. data mining
- The WEBMINER process
  - Preprocessing
  - Mining algorithms
  - Pattern analysis

Data Preparation
- Preprocessing
  - Data cleaning
  - User identification
  - Session identification
  - Path completion
  - Formatting

R2 Tracking the Growth of a Site (Nielsen, Jakob, 1998)
- Exponential growth of the web and the internet
- Statistical method
  - Logarithmic transformation of the data, so that linear regression can be applied
- Statistical analysis
  - Hypothesis: the site is growing (number of pageviews and date are correlated)
  - R² and significance
- Result: R² = 0.96, p < 0.001
- Uses
  - Predict growth rate
  - Clean noise
  - Confidence interval

Predicting Document Access in Large, Multimedia Repositories (Recker, M. R. and J. E. Pitkow, 1996)
- Patterns of document requests in network-accessible multimedia databases
- Main idea
  - Two related domains: human memory and libraries
  - Borrow models and research results from them

Predicting Document Access: The Model
- The model: human memory (Anderson and Schooler)
  - The relationship between recency and performance is a power function
  - The relationship between frequency and performance is a power function
- Two parameters for performance
  - Need probability p and need odds p/(1-p)
- The linear function:
  - Log(Need odds) = a Log(Frequency) + b

Predicting Document Access: Applying the Human Memory Analysis to Document Requests
- Dataset: log file of the Georgia Tech WWW repository
- A dynamic information ecology
- Frequency analysis
  - Regression equation: Log(Need odds) = .99 Log(Frequency) - 1.30
- Recency analysis
  - Regression equation: Log(Need odds) = -1.15 Log(days) + .41
- Combining recency and frequency

Predicting Document Access: Conclusion
- Recency and frequency of past document access are strong predictors of future document access
- Recency proved to be a stronger predictor than frequency
- Applications for the design of information systems
  - Determine optimal ordering of retrieved items
  - Inform design decisions
  - Design of caching algorithms
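
The two regression equations reported for the Georgia Tech log can be inverted to give a rough need probability for a document. The following Python sketch is only an illustration, not the authors' code: the slides do not state the base of the logarithm, so the natural log is assumed here, and the function names need_odds and need_probability are made up for this example.

```python
import math

# Coefficients copied from the regression equations on the slides.
FREQ_SLOPE, FREQ_INTERCEPT = 0.99, -1.30        # Log(Need odds) vs. Log(Frequency)
RECENCY_SLOPE, RECENCY_INTERCEPT = -1.15, 0.41  # Log(Need odds) vs. Log(days)


def need_odds(x, slope, intercept):
    """Invert Log(Need odds) = slope * Log(x) + intercept.

    ASSUMPTION: natural logarithm; the slides do not give the base.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    return math.exp(slope * math.log(x) + intercept)


def need_probability(odds):
    """Convert need odds p/(1-p) back into the need probability p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the direction of each effect.
    for freq in (1, 5, 20):
        p = need_probability(need_odds(freq, FREQ_SLOPE, FREQ_INTERCEPT))
        print(f"accessed {freq:>2} times   -> need probability {p:.3f}")
    for days in (1, 5, 20):
        p = need_probability(need_odds(days, RECENCY_SLOPE, RECENCY_INTERCEPT))
        print(f"last access {days:>2} days ago -> need probability {p:.3f}")
```

Under these assumptions the estimated need probability rises with access frequency and falls as the number of days since the last access grows, which is the pattern the conclusion slide describes.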
