2: Web Server Log

152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"

An extract from KDnuggets web log
© 2006 KDnuggets

Web Server Log – An Example

KDnuggets.com server
Page contents: http://www.kdnuggets.com/jobs/
Web server log:
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET … HTTP/1.1" 200
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /gps.html HTTP/1.1" 200
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200
…

Web (Server) Log – In Depth

A sample web log line:
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Web log field: IP

152.152.98.11
IP address; can be converted to a host name, such as xyz.example.com

Web log fields: Name, Login

The name of the remote user (usually omitted and replaced by a dash "-")
The login of the remote user (also usually omitted and replaced by a dash "-")

Web log field: Date/Time/TZ

[16/Nov/2005:16:32:50 -0500]
Date: DD/Mon/YYYY
Time: HH:MM:SS
Time zone: (+|-)HHMM relative to GMT, most often (+|-)HH00; -0500 is US EST

Web log field: Request

"GET /jobs/ HTTP/1.1"
URL: relative to the domain
HTTP protocol: e.g. HTTP/1.0 or HTTP/1.1
Method: GET, HEAD, POST, OPTIONS, …
Note: the request is recorded as sent, so it may contain errors, hacks, and any strange thing you can imagine

Web log field: Status code

200
Status (response) code. The most important ones are:
200 – OK (most frequent, hopefully)
206 – partial content
301 – permanently redirected (e.g. access to /courses is redirected to /courses/)
302 – temporarily redirected
304 – not modified
404 – not found
…

Web log field: Object size

15140
Size of the object returned to the client, in bytes
Can also be "-" if the status code is 304 (not modified)

Web log field: Referrer

http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N
URL the visitor came from (here it was a Google query for "salary for data mining", 2nd page of results, starting from 10)
The referrer can also be a static page, internal (same domain) or external (different domain), or "-" in the case of a direct request (e.g. type-in, bookmark)
Referrer analysis is very valuable

Web log field: User agent

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
User agent (browser): http://en.wikipedia.org/wiki/User_agent
Almost all browser user-agent strings start with "Mozilla", for historical reasons
In many cases there is additional information:
Browser type and version: MSIE 6.0 (Internet Explorer 6.0)
OS: Windows NT 5.1 (XP SP2) with .NET Framework 1.1 installed

Web Usage Mining

Basic
  Totals
  Simple request-level breakdowns
Advanced
  Visit-level analysis
  Target pages; conversion analysis

Web Log Analysis Programs

Free: Analog, awstats, webalizer, Google Analytics
Commercial: WebTrends, WebSideStory, …
www.kdnuggets.com/software/web-mining.html

Web Usage Mining - Basic

Totals for each component:
Hits – total number of requests
Files – number of GETs that returned a file (status code 200)
Pages – number of HTML pages
Sites – unique IP addresses
Response codes
Kbytes – total Kbytes transferred
User agents

Example: KDnuggets.com Nov 2005 totals

Monthly statistics (from webalizer)

Total                 Value
Hits                  1,121,643
Files                 930,468
Pages                 312,889
Kbytes                10,578,535
Unique Sites (IP)     35,942
Unique URLs           6,769
Unique Referrers      7,213
Unique User Agents    2,724

More details

Q: What is the meaning of the difference between Hits and Files?

Example: KDnuggets.com Nov 2005 totals, 2

Answer: the difference between Hits and Files is the number of requests with a status code other than 200.

Monthly stats for hits by status code

Code                            Hits
Code 200 - OK                   930,468
Code 206 - Partial Content      9,303
Code 301 - Moved Permanently    4,217
Code 302 - Found                457
Code 304 - Not Modified         170,874
Code 404 - Not Found            6,297
Other                           27

The rows sum to 1,121,643, the Hits total; the Code 200 row equals the Files total.

Difference between Files and Pages

Q: What is the meaning of the difference between Files and Pages?

A: the difference between Files and Pages is the number of non-HTML files (e.g. images, JavaScript, etc.)
In the November 2005 KDnuggets log, HTML pages were about 1/3 of all Files (312,889 of 930,468)
However, this data does not separate out bot requests (which are heavily weighted towards HTML pages)

Notes: web log formats

We used a web log in the standard Apache format (the "combined" log format)
Some old logs use a different format without the last 2 fields (referrer and user agent), but these are now rare.
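
The field breakdown and the basic totals described above can be sketched in a few lines of Python. This is only an illustration, not how webalizer or any other listed tool works internally: the names LOG_PATTERN, parse_line, is_page, and basic_totals are hypothetical, the regular expression assumes well-formed lines with all nine fields and no escaped quotes, and the rule for what counts as a "page" (a path ending in "/", ".html", or ".htm") is an assumption. Files are counted as requests with status code 200, following the Hits/Files answer above.

```python
import re
from collections import Counter

# One log line with the nine fields discussed above: IP, name, login,
# date/time/TZ, request, status code, object size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<name>\S+) (?P<login>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Split one log line into a dict of fields; return None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # The size can be "-" (e.g. for status 304); count that as 0 bytes.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    # The request is recorded as sent, so it may not have the usual 3 parts.
    parts = rec["request"].split()
    rec["method"], rec["url"] = (parts[0], parts[1]) if len(parts) == 3 else (None, None)
    return rec

def is_page(url):
    """Assumed page rule: the path ends in '/', '.html', or '.htm'."""
    path = url.split("?", 1)[0]
    return path.endswith(("/", ".html", ".htm"))

def basic_totals(lines):
    """Compute the basic totals (Hits, Files, Pages, Sites, ...) for the given lines."""
    records = [r for r in map(parse_line, lines) if r is not None]
    files = [r for r in records if r["status"] == 200]
    return {
        "hits": len(records),
        "files": len(files),
        "pages": sum(1 for r in files if r["url"] and is_page(r["url"])),
        "sites": len({r["ip"] for r in records}),
        "kbytes": sum(r["size"] for r in records) / 1024,
        "status_codes": Counter(r["status"] for r in records),
        "user_agents": len({r["agent"] for r in records}),
    }

# Three lines taken from the KDnuggets extract shown earlier.
SAMPLE_LINES = [
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 '
    '"http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"',
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 '
    '"http://www.kdnuggets.com/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"',
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 '
    '"http://www.kdnuggets.com/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"',
]

totals = basic_totals(SAMPLE_LINES)
for key, value in totals.items():
    print(key, value)
```

On these three lines, all requests are hits and files, but only /jobs/ counts as a page; the stylesheet and the logo image account for the difference between Files and Pages. Old-format logs without the referrer and user agent fields would not match this pattern and would need a shorter one.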