Processing and Analyzing Large Logs from a Search Engine
Meng Dou
13/9/2012
Big data
Web-browsing data, social network communications, sensor data -> behavior data
Google and Facebook, for example, are Big Data companies.
Challenges
•Big data processing
•Extracting useful information that reflects user behavior from massive logs
•Instance data management
•Data analysis
Opportunities
Behavior data (such as web logs) can be used to improve and support business processes, using data mining, process mining and so on.
[Figure: architecture stack. Labels: Analytic applications (BI/Reporting, Data Mining, Machine Learning, Process Mining); Cassandra; Big Data processing; Big Data access; Cloud storage; Unstructured data]
[Figure: data platform. Labels: Cloud computing (Map/Reduce framework); Hive; Distributed file system (HDFS); Web data; Instance data; NoSQL key-value database (HBase, Cassandra, MongoDB); Raw data]
Case study: Search Engine Company
•News, Page, Image, Maps, Music, Navigation
Dataset:
66 million clicks in one month, 2.2 million clicks per day
-> generate behavior in 10 minutes
User behavior:
•Visiting path (Referer)
•Search result effectiveness
•Abs clicking behavior
•Source and destination of user visits
•Robot behavior recognition and analysis
•Layout of visited pages
•Behavior comparison and product improvement
•User grouping and recommendation
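[Figure: page-transition diagram of the search engine site, labels translated from Chinese. Pages: external page, home page, web results page, image home page, image results page, image transition page, news home page, news results page, news topic page, news transition page, commentary home page, commentary results page. Actions: web search, web result click, image search, image click, click full text, news search, news click, commentary search, page switch]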
Data features
• Contains massive information in a well-recorded format
• Large scale with high growth potential
• Real-time analysis
Existing tools
Data extraction: XESame, ProM Import
Cloud storage / non-relational DB -> extracting data from cloud -> instance data (XES)
Process mining: ProM
1) Because the data set is large, analysis is slow and in most situations the tool crashes
2) Offline analysis -> real-time analysis
System Structure
[Figure: system structure. Labels: Log processing; Extracting useful information that reflects user behavior from massive logs; Understandable model]
Convert raw log to instance data (event log) with Map/Reduce
Map
Input record -> emitted key (CaseID+Timestamp+Activity)
A -> CaseID1+T1+A
E -> CaseID1+T3+E
B -> CaseID2+T3+B
D -> CaseID3+T2+D
C -> CaseID2+T1+C
F -> CaseID3+T3+F
D -> CaseID1+T2+D
A -> CaseID2+T4+A
E -> CaseID3+T1+E
F -> CaseID4+T1+F
G -> CaseID2+T2+G
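The map step shown on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's Hadoop job: the tab-separated line layout and the names `LogRecord`, `parse_line` and `map_record` are hypothetical, since the slides do not show the raw log format.

```python
from typing import NamedTuple, Tuple


class LogRecord(NamedTuple):
    case_id: str    # e.g. "CaseID1"
    timestamp: int  # ordering within the case, e.g. 1 for T1
    activity: str   # e.g. "A"


def parse_line(line: str) -> LogRecord:
    """Parse one raw log line; the tab-separated layout is an assumption."""
    case_id, timestamp, activity = line.rstrip("\n").split("\t")
    return LogRecord(case_id, int(timestamp), activity)


def map_record(record: LogRecord) -> Tuple[str, int, str]:
    """Emit the composite key CaseID+Timestamp+Activity as a tuple."""
    return (record.case_id, record.timestamp, record.activity)


# The first three records of the slide's example, in the assumed layout.
raw_lines = ["CaseID1\t1\tA", "CaseID1\t3\tE", "CaseID2\t3\tB"]
emitted = [map_record(parse_line(line)) for line in raw_lines]
```

Keeping the case ID first and the timestamp second in the key means that an ordinary sort places all events of one case next to each other, already in time order.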
Sort and Partition
CaseID1+T1+A, CaseID1+T2+D, CaseID1+T3+E -> ADE
CaseID2+T1+C, CaseID2+T2+G, CaseID2+T3+B, CaseID2+T4+A -> CGBA
CaseID3+T1+E, CaseID3+T2+D, CaseID3+T3+F -> EDF
CaseID4+T1+F -> F
Reduce
ADE, CGBA, EDF, F -> XESName_0.xes, XESName_1.xes, XESName_2.xes
If the number of events in an XLog exceeds 5000, that XLog is written out as its own file, to avoid exceeding the heap size of the machine.
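The sort, partition and reduce steps above can be sketched in plain Python. This is only an in-memory illustration, assuming the map output is a list of (case ID, timestamp, activity) tuples; the function names `build_traces` and `split_into_logs` are hypothetical and the real job would run on the Map/Reduce framework and write XES files.

```python
from itertools import groupby
from operator import itemgetter

MAX_EVENTS_PER_LOG = 5000  # threshold stated on the slide


def build_traces(keys):
    """Sort by (case ID, timestamp), then collect each case's activities."""
    ordered = sorted(keys, key=itemgetter(0, 1))
    return {
        case_id: [activity for _, _, activity in group]
        for case_id, group in groupby(ordered, key=itemgetter(0))
    }


def split_into_logs(traces, max_events=MAX_EVENTS_PER_LOG):
    """Start a new log once the current one holds more than max_events events."""
    logs, current, count = [], {}, 0
    for case_id, trace in traces.items():
        current[case_id] = trace
        count += len(trace)
        if count > max_events:
            logs.append(current)
            current, count = {}, 0
    if current:
        logs.append(current)
    return logs


# The eleven keys from the slide's example.
keys = [
    ("CaseID1", 1, "A"), ("CaseID1", 3, "E"), ("CaseID2", 3, "B"),
    ("CaseID3", 2, "D"), ("CaseID2", 1, "C"), ("CaseID3", 3, "F"),
    ("CaseID1", 2, "D"), ("CaseID2", 4, "A"), ("CaseID3", 1, "E"),
    ("CaseID4", 1, "F"), ("CaseID2", 2, "G"),
]
traces = build_traces(keys)
# A tiny threshold is used here only to make the split visible.
logs = split_into_logs(traces, max_events=5)
file_names = [f"XESName_{i}.xes" for i in range(len(logs))]
```

Splitting only at case boundaries keeps every trace whole inside a single output file, at the cost of a file occasionally holding slightly more events than the threshold.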
Test cluster: CPU Intel Xeon 2.40 GHz, RAM 2 GB, 14 nodes

fileSize         | logNum     | OnePCTime                                 | MapReduceTime | MapNum | ReduceNum
8.84 MB          | 36422      | 5 s 921 ms                                | 7 s           | 3      | 15
65.8 MB          | 218177     | 30 s 846 ms                               | 25 s          | 3      | 15
112 MB           | 772241     | 48 s 559 ms                               | 30 s          | 3      | 15
One day (371 MB) | 2,200,000  | 2.5 minutes                               | 1.3 minutes   | 40     | 15
One week         | 15,000,000 |                                           | 2.5 minutes   | 280    | 15
One month        | 66,000,000 | 20 minutes (expected), 2 hours (expected) | 6 minutes     | 1200   | 15
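To make the comparison easier to read, the speedup of Map/Reduce over a single PC can be computed from the measurements above. The sketch below uses only the four data sets for which both times are stated unambiguously, with the times converted to seconds; the names `measurements` and `speedup` are illustrative.

```python
# (data set, single-PC time in seconds, Map/Reduce time in seconds)
measurements = [
    ("8.84 MB", 5.921, 7.0),
    ("65.8 MB", 30.846, 25.0),
    ("112 MB", 48.559, 30.0),
    ("One day (371 MB)", 2.5 * 60, 1.3 * 60),
]


def speedup(one_pc_seconds: float, map_reduce_seconds: float) -> float:
    """Ratio of single-PC time to Map/Reduce time (above 1 means faster)."""
    return one_pc_seconds / map_reduce_seconds


speedups = {
    name: round(speedup(one_pc, map_reduce), 2)
    for name, one_pc, map_reduce in measurements
}
```

The ratios come out at about 0.85, 1.23, 1.62 and 1.92, so for the smallest file the job overhead outweighs the parallelism, and the benefit grows with the size of the log.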
Process Discovery
One instance (case) is defined as a single visit by one visitor, identified by:
•IP+UA
•CookieID
The definition of an activity varies with the analysis requirements.
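A visitor key built from these identifiers might look like the sketch below. The function name `visitor_key` and the rule of preferring the cookie ID over IP+UA when both are available are assumptions made for illustration; the slides do not say how the two identifiers are combined, nor how separate visits by the same visitor are told apart.

```python
from typing import Optional, Tuple


def visitor_key(ip: str, user_agent: str,
                cookie_id: Optional[str] = None) -> Tuple[str, ...]:
    """Identify a visitor by CookieID if present, otherwise by IP+UA."""
    if cookie_id:
        return ("cookie", cookie_id)
    return ("ip+ua", ip, user_agent)


# Hypothetical values, for illustration only.
with_cookie = visitor_key("10.0.0.1", "ExampleBrowser/1.0", cookie_id="abc123")
without_cookie = visitor_key("10.0.0.1", "ExampleBrowser/1.0")
```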
Alpha miner
Heuristic miner
Fuzzy miner
Sequence model
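The Alpha miner and the Heuristic miner both start from the directly-follows relation between activities. As a minimal sketch of that first step only (not ProM's implementation, and not a full miner), the relation can be counted from the example traces ADE, CGBA, EDF and F; the function name `directly_follows` is illustrative.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def directly_follows(traces: Iterable[Sequence[str]]) -> Counter:
    """Count how often activity a is immediately followed by activity b."""
    counts: Counter = Counter()
    for trace in traces:
        # Pair each activity with its successor in the same trace.
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


example_traces = ["ADE", "CGBA", "EDF", "F"]
relation = directly_follows(example_traces)
```

A trace with a single event, such as F, contributes no pairs, which is why short visits add little to the discovered model.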
Behavior analysis
User behavior pattern        | range | activity                   | data selection
Interaction between channels | all   | ContentType                |
Web Map visiting path        | all   | Referer/URL                |
webpage layout               | news  | ContentType+PageType+Block | (Channel=news)AND(PageType=195)
webpage layout               | image | ContentType+PageType+Block | (Channel=image)AND(PageType=435)
Searching result             | all   |                            |
Behavior grouping            |       |                            |
Registration                 | all   |                            |
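The selection (Channel=news)AND(PageType=195) and the activity name ContentType+PageType+Block can be sketched as a filter followed by a naming function. The field names come from the slide; representing events as dictionaries, the function names `select_events` and `activity_name`, and every field value other than the channel and page type are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

Event = Dict[str, str]


def select_events(events: Iterable[Event], channel: str,
                  page_type: str) -> List[Event]:
    """Keep events matching (Channel=channel)AND(PageType=page_type)."""
    return [
        event for event in events
        if event["Channel"] == channel and event["PageType"] == page_type
    ]


def activity_name(event: Event) -> str:
    """Build the activity label ContentType+PageType+Block."""
    return "+".join([event["ContentType"], event["PageType"], event["Block"]])


# Hypothetical events; only Channel and PageType values appear on the slide.
events = [
    {"Channel": "news", "PageType": "195", "ContentType": "news", "Block": "top"},
    {"Channel": "image", "PageType": "435", "ContentType": "image", "Block": "grid"},
    {"Channel": "news", "PageType": "200", "ContentType": "news", "Block": "side"},
]
news_layout = select_events(events, channel="news", page_type="195")
labels = [activity_name(event) for event in news_layout]
```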
Data selection
Active visitor’s visiting path
Data selection
Main page
Sequence model
XES statistics
Conclusion
This project is a good way into the data analysis field, combining web data analysis, process mining and cloud computing technology.
Future work:
1 More algorithms and technologies should be applied to this data set.
2 Behavior comparison and user recommendation still need to be accomplished.
3 Can process mining analyze behavior that does not follow a fixed pattern?
1 Log sampling
2 Detect errors in the logs before applying analysis technologies to them.
3 Extend ProM or XESame with a function for converting data from a key-value database or cloud storage to an event log.
Feedback
1 What are the real questions?
2 Why process mining?
Thank you!
Meng Dou
13/9/2012