Processing and Analyzing Large Logs from a Search Engine
Meng Dou
13/9/2012

Big data
Web-browsing data, social network communications, sensor data -> behavior data
Google and Facebook, for example, are Big Data companies.
Challenges
• Big data processing
• Extracting useful information that reflects user behavior from massive logs
• Instance data management
• Data analysis

Opportunities
Behavior data (such as web logs) can be used for improving and supporting business processes.
Data mining, process mining and so on.
• Analytic applications: BI/Reporting, Data Mining, Machine Learning, Process Mining
• Big Data processing: Cloud computing (Map/Reduce framework), Hive
• Big Data access: NoSQL key-value database (HBase, Cassandra, MongoDB) (instance data)
• Cloud storage: Distributed file system (HDFS) (raw data)
• Unstructured data: Web data

Case study: Search Engine Company
• News, Page, Image, Maps, Music, Navigation
Dataset: 66 million clicks in one month, 2.2 million clicks per day -> generate behavior in 10 minutes
User behavior:
• Visiting path (Referer)
• Searching result effectiveness
• Abs clicking behavior
• Source and destination of user visiting
• Robot behavior reorganization and analysis
• Visiting page layout
• Behavior comparison and product improvement
• User grouping and recommendation

[Figure: page-transition diagram. Pages: external page, home page, web results page, image home page, image results page, image transition page, news home page, news results page, news topic page, news transition page, commentary home page, commentary results page. Actions: web search, web result click, image search, image click, news search, news click, commentary search, click full text, page switch.]

Data features
• It contains massive information in a well-recorded format
• Large scale with big growth potential
• Real-time analysis

Existing tools
Data extraction: XESame, ProM Import
Cloud storage / non-relational DB -> extracting data from cloud -> instance data (XES)
Process mining: ProM
1) Because the data set is large, analysis is slow and in most situations crashes
2) Offline analysis -> real-time analysis

System Structure
Understandable model
Extracting useful information that reflects user behavior from massive logs
Log processing

Convert raw log to instance data (event log) with Map/Reduce

Map
A E B D C F D A E F G
CaseID1+T1+A  CaseID1+T3+E  CaseID2+T3+B  CaseID3+T2+D
CaseID2+T1+C  CaseID3+T3+F  CaseID1+T2+D  CaseID2+T4+A
CaseID3+T1+E  CaseID4+T1+F  CaseID2+T2+G

Sort and Partition
CaseID1+T1+A  CaseID1+T2+D  CaseID1+T3+E                -> ADE
CaseID2+T1+C  CaseID2+T2+G  CaseID2+T3+B  CaseID2+T4+A  -> CGBA
CaseID3+T1+E  CaseID3+T2+D  CaseID3+T3+F                -> EDF
CaseID4+T1+F                                            -> F

Reduce
ADE, CGBA, EDF, F, UKOC -> XESName_0.xes, XESName_1.xes, XESName_2.xes
If the number of events in an Xlog exceeds 5000, output it as one Xlog, to avoid exceeding the heap size of the computer.

CPU: Intel Xeon 2.40 GHz, RAM: 2 GB, 14 nodes

fileSize          logNum       OnePCTime               MapReduceTime   MapNum   ReduceNum
8.84 MB           36,422       5 s 921 ms              7 s             3        15
65.8 MB           218,177      30 s 846 ms             25 s            3        15
112 MB            772,241      48 s 559 ms             30 s            3        15
One day (371 MB)  2,200,000    2.5 minutes             1.3 minutes     40       15
One week          15,000,000   20 minutes (expected)   2.5 minutes     280      15
One month         66,000,000   2 hours (expected)      6 minutes       1200     15

Process Discovery
One instance/case is defined as one visit by one visitor, identified by:
• IP+UA
• CookieID
Activity varies based on different requirements.
Alpha miner, Heuristic miner, Fuzzy miner, Sequence model

Behavior analysis

User behavior pattern          Range   Activity                     Data selection
Interaction between channels   all     ContentType
Web map visiting path          all     Referer/URL
Webpage layout                 news    ContentType+PageType+Block   (Channel=news) AND (PageType=195)
                               image   ContentType+PageType+Block   (Channel=image) AND (PageType=435)
Searching result               all
Behavior grouping
Registration                   all

Active visitor's visiting path

Main page

Sequence model

XES statistics

Conclusion
This is a good project for getting into the data analysis field, combining web data analysis, process mining and cloud computing technology.
Future work:
1 More algorithms and technologies should be applied to this data set.
2 Behavior comparison and user recommendation still need to be accomplished.
3 Can process mining analyze behavior that does not have a definite pattern?

1 Log sampling
2 Detect errors in the logs before applying analysis technologies to them.
3 Extend the function of "converting data from key-value database or cloud storage to event log" in ProM or XESame.

Feedback
1 What are the real questions?
2 Why process mining?

Thank you!
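
Supplementary sketch: the Map, Sort and Partition, and Reduce steps from the log-processing slides can be illustrated in a few lines of plain Python. This is only a single-machine illustration under assumptions, not the Hadoop job used in the project: the function names (map_phase, shuffle, reduce_phase, split_logs) are hypothetical, the T1..T4 labels are treated as integer timestamps, and writing real XES files is left out. The events are the ones shown on the slides.

```python
from collections import defaultdict

# Example events from the slides as (case_id, timestamp, activity).
# Assumption: T1..T4 are modelled as the integers 1..4.
RAW_EVENTS = [
    ("CaseID1", 1, "A"), ("CaseID1", 3, "E"), ("CaseID2", 3, "B"),
    ("CaseID3", 2, "D"), ("CaseID2", 1, "C"), ("CaseID3", 3, "F"),
    ("CaseID1", 2, "D"), ("CaseID2", 4, "A"), ("CaseID3", 1, "E"),
    ("CaseID4", 1, "F"), ("CaseID2", 2, "G"),
]


def map_phase(events):
    """Emit (key, value) pairs keyed by case ID."""
    for case_id, timestamp, activity in events:
        yield case_id, (timestamp, activity)


def shuffle(pairs):
    """Group values by key; a stand-in for the sort-and-partition step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Order each case's events by timestamp to form its trace."""
    return {case: [activity for _, activity in sorted(values)]
            for case, values in groups.items()}


def split_logs(traces, max_events=5000):
    """Pack traces into logs, closing a log once it exceeds max_events."""
    logs, current, count = [], {}, 0
    for case, trace in traces.items():
        current[case] = trace
        count += len(trace)
        if count > max_events:
            logs.append(current)
            current, count = {}, 0
    if current:
        logs.append(current)
    return logs


traces = reduce_phase(shuffle(map_phase(RAW_EVENTS)))
for case, trace in traces.items():
    print(case, "".join(trace))
```

Each dictionary returned by split_logs would correspond to one output file such as XESName_0.xes; the 5000-event threshold is the one stated on the Reduce slide, and it is a parameter here so the splitting can be tried on small inputs.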