Data Mining on the Web via Cloud Computing
COMS E6125 Web Enhanced Information Management
Presented by Hemanth Murthy

Introduction to:
- Web mining
- Cloud computing infrastructure
- Apache's Hadoop
- Web usage mining using Hadoop HDFS and Map/Reduce technologies

What is Web Mining?
- Web mining: data mining techniques applied to the Web to discover user patterns, such as:
  - what users are looking for on the Internet
  - deducing the type of information users are looking for
  - structuring the data available on the Web
- Why Web mining:
  - The amount of information available on the Web is enormous.
  - It is difficult for users to find and use information.
  - It is not easy for content providers to classify and catalog documents.

Types of Web Mining
- Web mining types:
  - Web usage mining
  - Web content mining
  - Web structure mining
- Web usage mining: applying data mining techniques to discover usage patterns from Web data, in order to better understand and serve the needs of Web-based applications.
- Web content mining: the automatic search of information available online; it involves mining the content of Web data.
- Web structure mining: concerned with the description and organization of the content.

More on Web Usage Mining
- Preprocessing
  - Converts the usage, content, and structure information in the available data sources into the data abstractions needed for pattern discovery.
  - Regarded as the most difficult task in Web usage mining.
- Pattern discovery
  - Uses algorithms and techniques from data mining, machine learning, statistics, and pattern recognition.
- Pattern analysis
  - Many redundant rules or patterns are found during the discovery phase.
  - The main objective is to filter out such data to aid the analysis.
  - Uses SQL queries and visualization techniques such as graphing patterns.

Cloud Computing
- Use of existing commodities
  - Reduces the cost of the services.
  - Helps in concentrating on deploying the services faster.
  - More flexibility.
- Virtualization technique used as a standard deployment object
  - Provides abstraction between hardware and computing software.
  - Enables loose coupling of the resources.
- Services are delivered over the network.

HDFS - Hadoop Distributed File System
- Data parallel, but process sequential.
- Data is processed in a batch-oriented fashion.
- Data is communicated via the distributed file system, so latency is an issue; HDFS is designed for high throughput rather than low latency.
- At Facebook, jobs that took more than a day were cut down to less than a day by using Hadoop.

Important Characteristics of HDFS
- Hardware failure
- Streaming data access
- Large data sets
- Moving computation is cheaper than moving data

Web Mining, HDFS and Map/Reduce
- HDFS can be the storage backbone for Web mining applications.
- HDFS replicates data across several nodes in the cluster to ensure robustness and data recovery in case of failure.
- Map/Reduce: a framework for realizing distributed computing (the compute cloud).

Web Mining & HIVE
- Developed by the Facebook Data Infrastructure Team to exploit the features of Hadoop HDFS and Map/Reduce.
- Next-generation infrastructure designed with the goal of providing data processing systems that:
  - enable easy data summarization
  - support ad-hoc querying and analysis of large volumes of data
  - allow users to embed custom map/reduce functions

Web Usage Mining Architecture using HDFS, Map/Reduce and HIVE
- How Apache Hadoop can be used in Web usage mining.
- The system uses HDFS as the storage cloud.
- The Map/Reduce framework can be used as the compute cloud.
- Hive can be used to format the data.
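
To make the Web usage mining pipeline concrete, here is a minimal single-process sketch in Python of the three phases (preprocessing, pattern discovery via a map/reduce-style page-hit count, and pattern analysis). It does not use the Hadoop, HDFS, or Hive APIs; it only imitates the map, shuffle, and reduce steps in memory. The Common Log Format input, the sample log lines, the `min_hits` threshold, and all function names are assumptions made for illustration and are not taken from the slides.

```python
import re
from collections import defaultdict

# Assumption: logs use the Common Log Format; the slides name no format.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+$'
)


def preprocess(lines):
    """Preprocessing: parse raw lines, dropping malformed and non-2xx requests."""
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match and match.group(3).startswith("2"):
            yield match.group(1), match.group(2)  # (client host, requested URL)


def map_phase(records):
    """Map: emit one (url, 1) pair per successful request."""
    for _host, url in records:
        yield url, 1


def shuffle(pairs):
    """Shuffle: group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts for each URL."""
    return {url: sum(counts) for url, counts in groups.items()}


def analyze(counts, min_hits=2):
    """Pattern analysis: keep URLs with at least min_hits, most visited first."""
    frequent = [(url, n) for url, n in counts.items() if n >= min_hits]
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


def mine_usage(lines, min_hits=2):
    """Run the whole pipeline on an iterable of raw log lines."""
    return analyze(reduce_phase(shuffle(map_phase(preprocess(lines)))), min_hits)


# Hypothetical sample data, invented for this sketch.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2009:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2009:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2009:13:56:40 -0700] "GET /products.html HTTP/1.0" 200 5120',
    '10.0.0.3 - - [10/Oct/2009:13:57:02 -0700] "GET /missing.html HTTP/1.0" 404 209',
    'this line is malformed and is dropped during preprocessing',
    '10.0.0.2 - - [10/Oct/2009:13:58:15 -0700] "GET /products.html HTTP/1.0" 200 5120',
    '10.0.0.3 - - [10/Oct/2009:13:59:47 -0700] "GET /index.html HTTP/1.0" 200 2326',
]

if __name__ == "__main__":
    for url, hits in mine_usage(SAMPLE_LOG):
        print(url, hits)
```

In a real deployment the log files would live in HDFS and the map and reduce functions would run on different nodes close to the data; the in-memory `shuffle` here only stands in for the grouping that the framework performs between the two phases.
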

[Figure: Web Usage Mining Architecture]

References
- HDFS: http://hadoop.apache.org/hdfs
- Map/Reduce: http://hadoop.apache.org/mapreduce
- Web Mining: Information and Pattern Discovery on the World Wide Web: http://maya.cs.depaul.edu/~mobasher/webminer/survey/survey.html
- Ashish Thusoo, Hive - A Petabyte Scale Data Warehouse using Hadoop: http://www.facebook.com/note.php?note_id=89508453919
- Dhruba Borthakur, Hadoop Introduction: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_design.html#Introduction
- Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data

Thank You!
