Big Data Analytics
An Introduction
Oliver Fuchsberger – University of Paderborn © 2014

Table of Contents
I. Introduction & Motivation
  • What is “Big Data Analytics”?
  • Why is it so important?
II. Techniques & Solutions
  • Business Strategies
  • Data Storage
  • Data Diversity
  • Information Filtering
  • Real-Time Data
  • Analysis Techniques
III. Conclusion

Oliver Fuchsberger, University of Paderborn, 2014

PART I: Introduction & Motivation

Big Data Analytics in a Cloud.

What is “Big Data Analytics”?
• Buzzword for a combination of:
  o Big Data
  o Advanced Analytics
• Not just one data type and not just one technique
• But we will see this in a minute!

Big Data – The Three V's (I)
• Most definitions focus on the data size
  o Not sufficient!
• Big Data can be defined using the “three V's”:
  o Volume
  o Velocity
  o Variety
• The measurements for each “V” are very diverse

Big Data – The Three V's (II)
• Volume:
  o Gigabytes, terabytes or petabytes
  o Number of files or records
• Velocity:
  o Real-time (as a stream)
  o Batches
• Variety:
  o Structure of data (un-, semi- or structured)
  o Web data
  o Real-time data

Advanced Analytics (I)
• “Advanced Analytics”, like “Big Data Analytics”, is a buzzword!
• It stands for a collection of different analysis techniques
  o All techniques are suited to dealing with unknown data sets
• A.k.a. “Discovery Analytics”

Advanced Analytics (II)
• Some techniques:
  o Predictive analytics
  o Data mining
  o Statistical analysis
  o Natural language processing
  o Database capabilities
    • MapReduce
    • In-database analytics
    • In-memory databases

Importance of “Big Data Analytics” (I)
• Big Data Analytics is seen as “one of the most profound trends in Business Intelligence” according to TDWI
• Today more and more data is collected by enterprises
  o See “Big Data”
• To gain new insights, this data has to be analysed
  o Not possible with standard analytic platforms

Importance of “Big Data Analytics” (II)
• The 5 main benefits are:
  1. Better targeted social influencer marketing (61%)
  2. More numerous and accurate business insights (45%)
  3. Segmentation of customer base (41%)
  4. Recognition of sales and market opportunities (38%)
  5. Automated decisions for real-time processes (37%)

Importance of “Big Data Analytics” (III)
• The 5 main barriers are:
  1. Inadequate staffing or skills for big data analytics (46%)
  2. Cost, overall (42%)
  3. Lack of business sponsorship (38%)
  4. Difficulty of architecting a big data analytics system (33%)
  5. Current database software lacks in-database analytics (32%)

PART II: Techniques & Solutions

Business Strategies: Problems
• A strategy or architecture for dealing with Big Data Analytics is needed
• Problems:
  o Different programming abstractions (compared to a desktop environment)
  o Every choice has direct dollar costs, regardless of the field:
    • Computation
    • Upload / download
    • Data storage

Business Strategies: Cloud Computing
• Every choice directly affects the computation time!
• Supports many virtual machines
• Paying more buys more computation power
  o But the scaling is not linear: doubling memory or speed does not halve the computation time!
• There are many vendor-based solutions for uploading data into cloud databases

Data Storage: The HDFS Goals
• HDFS (Hadoop Distributed File System) is often grouped with the so-called “NoSQL” technologies, although it is a distributed file system rather than a database
• Goals of the HDFS:
  o Fault detection & fast automatic recovery
  o Streaming data access
  o Handling large data sets
  o Simple coherency model
  o “Moving computation is cheaper than moving data”
  o Portability

Data Storage: The HDFS Architecture

Data Diversity: Filtering Information (I)
• Data mining describes:
  o The application of methods and algorithms
  o Supporting or enabling the extraction of empirical relationships between data objects in data sets
• Goals of data mining:
  o Find new correlations, patterns and trends inside large amounts of data

Data Diversity: Filtering Information (II)
• Most of the arriving data is “unlabeled” => classification is not possible
• A cluster is:
  o A group of the same or similar elements gathered or occurring closely together
• Task:
  o Organize a collection of n objects into a partitioning or a hierarchy of partitions
  o Label the data

Data Diversity: Filtering Information (III)
• Problems:
  o Measuring similarity
  o The unknown number of clusters needed
  o Cluster validity
  o Outliers

Data Diversity: Real-Time Data (I)
• CEP: Complex Event Processing
• Events are complex in the sense of the relations between arriving pieces of data
• CEP systems do not only consider arriving events in isolation from each other
  o Timestamp + content + optional constraints
• The goal is to identify interesting situations by processing event notifications (not generic data)

Data Diversity: Real-Time Data (II)
• CEP is an extension of the traditional publish-subscribe interaction concept:
  o Observer: RSS feed (example)
  o Consumer: other systems
• Examples of CEP engines:
  o Next CEP (rule-based pattern detection)
  o PB-CEP (plan-based pattern detection)

Data Diversity: Analysis Techniques (I)
• Analytical computations are moved into the database system (in-database analytics):
  o Model scoring
  o Predictive analytics
  o And others
• Calculations are executed in a single, centralized location
  o Data access right where it is stored
  o No data extraction
  o Memory capabilities
  o Load balancing
  o Parallel processing

Data Diversity: Analysis Techniques (II)
• Using historical data to predict the future (long or short term)
  o Data mining techniques (clustering, regression, classification)
  o Statistical analysis techniques
• Build a predictive model
  o Exploit patterns in historical data to identify risks and opportunities
• Combination with CEP makes sense:
  o CEP can ensure the calculation of the predictors (main problem!)
  o Short-term realization of complex events

PART III: Conclusion

Summary: What we've seen!
• Big Data is not all about size
• Big Data Analytics is important because of its positive influence on many enterprise departments. But it is expensive!
• One needs the right computation platform, storage system and analysis techniques, depending on the data one is working with
  o Cloud Computing
  o HDFS
  o CEP / In-database Analytics
  o …

Final Words
• All presented techniques are just examples
  o Numerous other systems and software products are available in this field
• People from many different fields have to work together to enable the analysis of big data:
  o Business analysts
  o Database specialists
  o System engineers
  o …

Thank You for Your Attention!
ANY QUESTIONS?
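
Backup: Clustering Sketch
• Supplement to the slides “Filtering Information (II)” and “(III)”: a minimal sketch of how unlabeled data can be organized into a partitioning and labeled.
• Assumptions (not from the talk): k-means is swapped in as one common clustering method, Euclidean distance serves as the similarity measure, and the names kmeans, points and k are purely illustrative.
• The caller has to supply k, which is exactly the “unknown number of clusters” problem listed above.

```python
import math


def kmeans(points, k, iterations=10):
    """Tiny k-means sketch. Returns (labels, centroids).

    Assumption: Euclidean distance is the similarity measure, and the
    initial centroids are picked farthest-first so the run is deterministic.
    """
    # Initialization: start with the first point, then repeatedly add the
    # point that is farthest from all centroids chosen so far.
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(tuple(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Hypothetical toy data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

• The returned labels are the “label the data” step; a single far-away outlier would pull a centroid toward itself, which illustrates the outlier problem from the same slide.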