Data Mining on the Web via Cloud Computing
COMS E6125
Web Enhanced Information Management
Presented By
Hemanth Murthy
Data Mining on the Web via Cloud Computing

Introduction to:
- Web Mining
- Cloud computing infrastructure
- Apache’s Hadoop
- Web Usage Mining using Hadoop HDFS and Map/Reduce technologies
What is Web Mining…

Web mining: data mining techniques applied to the Web in order to
- discover what users are looking for on the Internet,
- deduce the type of information users are looking for,
- structure the data available on the Web.

Why Web Mining:
- The amount of information available on the Web is enormous.
- It is difficult for users to find and use information.
- It is not easy for content providers to classify and catalog documents.
Types of Web Mining

Web mining types:
- Web usage mining.
- Web content mining.
- Web structure mining.

Web usage mining - applying data mining techniques to
discover usage patterns from Web data, to understand
and serve the needs of Web-based applications better.

Web content mining describes the automatic search of
information available online, and involves mining web
data content.

Web structure mining is concerned with the description/
organization of the content.
More on Web Usage Mining…

Preprocessing.
- Converts the usage, content, and structure information in the available data sources into a form suitable for pattern discovery.
- Regarded as the most difficult task in Web Usage Mining.

Pattern Discovery.
- Uses algorithms and techniques from data mining, machine learning, statistics and pattern recognition.

Pattern Analysis.
- Many of the rules and patterns found during the discovery phase are redundant.
- The main objective is to filter these out so that the remaining patterns are easier to analyze.
- Typical tools are SQL queries and visualization techniques such as graphing patterns.
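The three phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample log lines, the Common Log Format layout, the rule for dropping image requests, and the `min_support` threshold are all hypothetical and do not come from the presentation.

```python
import re
from collections import Counter

# Hypothetical sample in Common Log Format; the layout is an assumption.
LOG_LINES = [
    '10.0.0.1 - - [01/Mar/2009:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [01/Mar/2009:10:00:05 -0500] "GET /logo.png HTTP/1.0" 200 512',
    '10.0.0.1 - - [01/Mar/2009:10:01:00 -0500] "GET /products.html HTTP/1.0" 200 2048',
    '10.0.0.2 - - [01/Mar/2009:10:02:00 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [01/Mar/2009:10:03:00 -0500] "GET /products.html HTTP/1.0" 200 2048',
    '10.0.0.3 - - [01/Mar/2009:10:04:00 -0500] "GET /missing.html HTTP/1.0" 404 0',
]

LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+) [^"]*" (\d{3}) \S+$')


def preprocess(lines):
    """Preprocessing: parse raw lines, drop failed requests and image hits."""
    visits = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        host, path, status = match.groups()
        if status != "200" or path.endswith((".png", ".gif", ".jpg")):
            continue
        visits.setdefault(host, []).append(path)
    return visits


def discover(visits):
    """Pattern discovery: count consecutive page pairs across all visitors."""
    pairs = Counter()
    for pages in visits.values():
        pairs.update(zip(pages, pages[1:]))
    return pairs


def analyze(pairs, min_support=2):
    """Pattern analysis: keep only pairs seen at least min_support times."""
    return {pair: count for pair, count in pairs.items() if count >= min_support}


patterns = analyze(discover(preprocess(LOG_LINES)))
print(patterns)
```

Real usage mining would also split each visitor's requests into sessions and use richer discovery algorithms; the pair count here only stands in for that step.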
Cloud Computing

Use of existing commodity components.
- Reduces the cost of services.
- Helps in concentrating on deploying services faster.

Virtualization used as the standard deployment object.
- More flexibility.
- Provides abstraction between hardware and software.
- Enables loose coupling of resources.

Services are delivered over the network.
HDFS - Hadoop Distributed File System

Data is processed in parallel, but each process works through its data sequentially.

Data processing is batch-oriented.

Data is communicated via the distributed file system, so latency is an issue. HDFS is designed for high throughput rather than low latency.

At Facebook, jobs that took more than a day were cut down to less than a day by using Hadoop.
Important characteristics of HDFS…

Hardware Failure.

Streaming Data Access.

Large Data Sets.

Moving Computation is Cheaper than Moving Data
Web Mining, HDFS and Map/Reduce

HDFS can be the storage backbone for Web Mining
applications.

HDFS replicates data across several nodes in the cluster to ensure robustness and to allow data recovery in case of failure.

Map/Reduce is a framework for distributed computing and can serve as the Compute Cloud.
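To make the Map/Reduce idea concrete for web usage data, here is a minimal in-process sketch in Python that counts page hits. It is not the Hadoop API: the function names, the "host page" record layout, and the two splits (standing in for blocks of a file stored in the distributed file system) are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (page, 1) pair for one 'host page' log record."""
    _host, page = record.split()
    yield page, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one page."""
    return key, sum(values)


def run_job(splits):
    """Run map over every record of every split, then shuffle and reduce."""
    emitted = [
        pair
        for split in splits
        for record in split
        for pair in map_phase(record)
    ]
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())


# Hypothetical input: each inner list stands in for one block of a log file.
SPLITS = [
    ["10.0.0.1 /index.html", "10.0.0.1 /products.html"],
    ["10.0.0.2 /index.html", "10.0.0.3 /index.html"],
]

hits = run_job(SPLITS)
print(hits)
```

In a real cluster the map calls for different splits would run on different nodes, ideally the nodes that already hold the data; the sketch runs them in one process only to show the data flow.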
Web Mining & HIVE

Hive was developed by the Facebook Data Infrastructure Team to exploit the features of Hadoop HDFS and Map/Reduce.

It is a next-generation data processing infrastructure designed with the following goals:
- enable easy data summarization,
- support ad-hoc querying and analysis of large volumes of data,
- allow users to embed custom map/reduce functions.
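As a rough illustration of the last point: Hive's TRANSFORM clause can stream rows to an external script as tab-separated lines and read tab-separated lines back. The Python sketch below shows what such a script might look like. The two-column (host, url) input, the "section" output column, and the rule that the first path segment names the section are assumptions for this example, not something stated in the presentation.

```python
import io


def transform(rows_in, rows_out):
    """Read tab-separated (host, url) rows and write (host, section) rows."""
    for line in rows_in:
        host, url = line.rstrip("\n").split("\t")
        # Hypothetical rule: the first path segment names the site section.
        section = url.strip("/").split("/")[0] or "home"
        rows_out.write(f"{host}\t{section}\n")


# Demo with in-memory streams. Run as a streaming script, the call would
# instead be transform(sys.stdin, sys.stdout) after importing sys.
sample = io.StringIO("10.0.0.1\t/products/42\n10.0.0.2\t/\n")
result = io.StringIO()
transform(sample, result)
print(result.getvalue(), end="")
```

A query would then reference the script along the lines of `SELECT TRANSFORM(host, url) USING 'python section.py' AS (host, section)`, where the script name and column names are again hypothetical.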
Web Usage Mining Architecture using HDFS, Map/Reduce and HIVE

How Apache Hadoop can be used in Web Usage Mining.

The system consists of HDFS as the Storage Cloud.

The Map/Reduce framework can be used as the Compute Cloud.

Hive can be used to format the data.
Web Usage Mining Architecture
References

HDFS: http://hadoop.apache.org/hdfs

Map/Reduce: http://hadoop.apache.org/mapreduce

Web Mining: Information and Pattern Discovery on the World Wide Web: http://maya.cs.depaul.edu/~mobasher/webminer/survey/survey.html

Ashish Thusoo: Hive - A Petabyte Scale Data Warehouse using Hadoop: http://www.facebook.com/note.php?note_id=89508453919

Dhruba Borthakur: Hadoop Introduction: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_design.html#Introduction

Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data
Thank You!