Big Data Analytics
An Introduction
Oliver Fuchsberger – University of Paderborn
© 2014
Table of Contents
I. Introduction & Motivation
• What is “Big Data Analytics”?
• Why is it so important?
II. Techniques & Solutions
• Business Strategies
• Data Storage
• Data Diversity
• Information Filtering
• Real-Time Data
• Analysis Techniques
III. Conclusion
Oliver Fuchsberger, University of
Paderborn, 2014
Introduction & Motivation
PART I
Big Data Analytics in a Cloud.
What is “Big Data Analytics”?
• Buzzword for a combination of:
o Big Data
o Advanced Analytics
• Not just one data type and not just one technique
• But we will see this in a minute!
Big Data –
The three V’s (I)
• Most definitions focus on data size
o NOT SUFFICIENT!
• Big Data can be defined using the “three V’s”:
o Volume
o Velocity
o Variety
• The measures for each “V” are very diverse
Big Data –
The three V’s (II)
• Volume:
o Gigabytes, terabytes or petabytes
o Number of files or records
• Velocity:
o Real-time (as a stream)
o Batches
• Variety:
o Structure of data (unstructured, semi-structured or structured)
o Web data
o Real-time data
Advanced Analytics (I)
• “Advanced Analytics”, like “Big Data Analytics”, is a buzzword!
• It stands for a collection of different analysis techniques
o All techniques are suited to dealing with unknown data sets
• A.k.a. “Discovery Analytics”
Advanced Analytics (II)
• Some techniques:
o Predictive Analytics
o Data Mining
o Statistical Analysis
o Natural Language Processing
o Database capabilities
• MapReduce
• In-database analytics
• In-memory databases
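MapReduce, named in the list above, can be illustrated with the classic word-count job. The following is only a minimal single-process sketch in Python; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and do not come from the slides or from any MapReduce framework. A real system would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse the values of each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # In a cluster each document could be mapped on a different node;
    # this sketch simply processes them one after another.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data analytics", "big data is not only big"]
    print(word_count(docs))
```

The split into three small functions mirrors the usual description of the model: the map and reduce steps are independent per document and per key, which is what makes them easy to distribute.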
Importance of “Big Data Analytics” (I)
• Big Data Analytics is seen as “one of the most
profound trends in Business Intelligence” according
to TDWI
• Today more and more data is collected by
enterprises
o See “Big Data”
• To gain new insights this data has to be analysed
o Not possible with standard analytic platforms
Importance of “Big Data Analytics” (II)
• The 5 main benefits are:
1. Better targeted social influencer marketing (61%)
2. More numerous and accurate business insights (45%)
3. Segmentation of customer base (41%)
4. Recognition of sales and market opportunities (38%)
5. Automated decisions for real-time processes (37%)
Importance of “Big Data Analytics” (III)
• The 5 main barriers are:
1. Inadequate staffing or skills for big data analytics (46%)
2. Cost, overall (42%)
3. Lack of business sponsorship (38%)
4. Difficulty of architecting a big data analytics system (33%)
5. Current database software lacks in-database analytics (32%)
Techniques & Solutions
PART II
Business Strategies
Problems
• A strategy or architecture for dealing with Big Data Analytics is needed
• Problems:
o Different programming abstractions (compared to a desktop environment)
o Every choice has direct dollar costs, regardless of the field:
• Computation
• Upload / Download
• Data storage
Business Strategies
Cloud Computing
• Every choice directly affects the computation time!
• Supports many virtual machines
• Paying more buys more computation power
o But doubling memory or speed does not simply halve the run time!
• There are many vendor-based solutions for uploading data into cloud databases
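One common way to see why doubling resources does not halve the run time is Amdahl's law, which the slides do not name; it is used here only as an assumed explanation. The sketch below supposes that just a fraction of a job benefits from the extra resources. All function names and numbers are hypothetical.

```python
def speedup(parallel_fraction, resource_factor):
    """Amdahl's law: overall speedup when only part of a job gets faster."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / resource_factor)


def run_time(base_hours, parallel_fraction, resource_factor):
    """Run time after scaling the resources by resource_factor."""
    return base_hours / speedup(parallel_fraction, resource_factor)


def total_cost(hourly_rate, hours):
    """Direct dollar cost of renting a machine for the given time."""
    return hourly_rate * hours


if __name__ == "__main__":
    # Assumed job: 10 hours on the small machine, 80% of it scales.
    small = run_time(10.0, 0.8, 1)
    large = run_time(10.0, 0.8, 2)
    print("small machine:", small, "h, cost", total_cost(1.0, small))
    print("large machine:", large, "h, cost", total_cost(2.0, large))
```

Under these assumed numbers the machine with twice the resources finishes in about 6 hours instead of 5, so at twice the hourly rate the job costs more overall even though it completes sooner.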
Data Storage
The HDFS Goals
• HDFS: the Hadoop Distributed File System, a distributed file system often grouped with the so-called “NoSQL databases”
• Goals of the HDFS:
o Fault detection & fast automatic recovery
o Streaming data access
o Handling large data sets
o Simple coherency model
o “Moving computation is cheaper than moving data”
o Portability
Data Storage
The HDFS Architecture
Data Diversity
Filtering Information (I)
• Data mining describes:
o The application of methods and algorithms
o That support or enable the extraction of empirical relationships between data objects in data sets
• Goals of data mining:
o Find new correlations, patterns and trends in large amounts of data
Data Diversity
Filtering Information (II)
• Most of the arriving data is “unlabeled” => supervised classification is not directly possible
• A cluster is:
o A group of the same or similar elements gathered or occurring closely together
• Task:
o Organize a collection of n objects into a partitioning or a hierarchy of partitions
o Label the data
Data Diversity
Filtering Information (III)
• Problems:
o Measure similarity
o The unknown number of clusters needed
o Cluster validity
o Outliers
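The clustering task and two of the problems listed above (measuring similarity, choosing the number of clusters) can be made concrete with k-means. The slides do not name a specific algorithm, so k-means is an assumption chosen for illustration. In this minimal sketch, similarity is Euclidean distance, the number of clusters k must be supplied by the caller, and the first k points serve as naive starting centroids.

```python
import math


def euclidean(a, b):
    """Similarity measure: smaller distance means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=10):
    """Partition points into k clusters; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    data = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
    labels, centroids = k_means(data, k=2)
    print(labels)
    print(centroids)
```

The returned labels are exactly the "label the data" step of the task. The sketch does nothing about outliers or cluster validity, which remain open problems as listed.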
Data Diversity
Real-Time Data (I)
• CEP: Complex Event Processing
• Events are complex in the sense of the relations between arriving data parts
• CEP systems do not only consider arriving events in isolation from each other
o Timestamp + content + optional constraints
• The goal is to identify interesting situations by processing event notifications (not generic data)
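The idea of relating events to each other through timestamp, content and constraints can be sketched as a small sequence detector. This is an illustration only, not how any particular CEP engine works; the Event type, the event kinds ("high_temp", "smoke") and the function detect_sequences are all hypothetical.

```python
from collections import namedtuple

# Each event carries a timestamp (seconds), a kind (content) and a source.
Event = namedtuple("Event", ["timestamp", "kind", "source"])


def detect_sequences(events, first_kind, second_kind, window):
    """Find pairs where second_kind follows first_kind from the same
    source within `window` seconds (the constraint on the pattern)."""
    matches = []
    pending = []  # first_kind events that can still be matched
    for event in sorted(events, key=lambda e: e.timestamp):
        # Drop candidates that have fallen out of the time window.
        pending = [p for p in pending if event.timestamp - p.timestamp <= window]
        if event.kind == second_kind:
            for candidate in pending:
                if candidate.source == event.source:
                    matches.append((candidate, event))
        if event.kind == first_kind:
            pending.append(event)
    return matches


if __name__ == "__main__":
    stream = [
        Event(0, "high_temp", "room1"),
        Event(3, "smoke", "room2"),
        Event(5, "smoke", "room1"),
        Event(30, "smoke", "room1"),
    ]
    for first, second in detect_sequences(stream, "high_temp", "smoke", window=10):
        print("possible fire in", first.source, "at t =", second.timestamp)
```

No single event here is interesting on its own; the "situation" only emerges from the relation between two events, which is the sense of "complex" used on this slide.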
Data Diversity
Real-Time Data (II)
• CEP is an extension of the traditional publish-subscribe interaction concept:
o Observer: RSS feed (example)
o Consumer: other systems
• Examples of CEP engines:
o Next CEP (rule-based pattern detection)
o PB-CEP (plan-based pattern detection)
Data Diversity
Analysis Techniques (I)
• Analytical computations are moved into the database system – in-database analytics:
o Model scoring
o Predictive analytics
o And others
• Calculations are executed in a single, centralized location
o Data access right where it is stored
o No data extraction
o Memory capabilities
o Load balancing
o Parallel processing
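The principle of computing where the data is stored, rather than extracting it first, can be shown on a very small scale with SQL aggregation. SQLite from the Python standard library is used here only as a stand-in; the slides name no product, and the table and column names (sales, segment, amount) are hypothetical. Real in-database analytics platforms go much further, for example with model scoring and parallel execution.

```python
import sqlite3


def average_by_segment(rows):
    """Load (segment, amount) rows and aggregate them inside the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (segment TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # The aggregation runs in the database engine; only the small
    # result set is transferred to the application.
    cursor = conn.execute(
        "SELECT segment, AVG(amount), COUNT(*) "
        "FROM sales GROUP BY segment ORDER BY segment"
    )
    result = cursor.fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    data = [("a", 10.0), ("a", 20.0), ("b", 5.0)]
    for segment, mean, count in average_by_segment(data):
        print(segment, mean, count)
```

The alternative would be to fetch every row with SELECT * and compute the averages in application code, which is the data extraction step that in-database analytics avoids.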
Data Diversity
Analysis Techniques (II)
• Using historical data to predict the future (long or
short term)
o Data mining techniques (clustering, regression, classification)
o Statistical analysis techniques
• Build a predictive model
o Exploit patterns in historical data to identify risks and opportunities
• Combination with CEP makes sense:
o CEP can ensure the calculation of the predictors (main problem!)
o Short term realization of complex events
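Building a predictive model from historical data can be sketched with the simplest regression technique, an ordinary least-squares line through past observations. This is only one of the techniques the slide lists, chosen here for brevity; the function names and the sample figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted model to a new input (model scoring)."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Assumed history: sales observed in periods 1 to 4.
    periods = [1, 2, 3, 4]
    sales = [3.0, 5.0, 7.0, 9.0]
    model = fit_line(periods, sales)
    print("forecast for period 5:", predict(model, 5))
```

In a combination with CEP as described above, the inputs to predict would be predictor values computed from the incoming event stream rather than a fixed list.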
Conclusion
PART III
Summary
What we’ve seen!
• Big Data is not all about size
• Big Data Analytics is important due to the positive
influence on many enterprise departments. But it is
expensive!
• One needs the right computation platform, storage
system and analysis techniques depending on the
data one is working with
o Cloud Computing
o HDFS
o CEP / In-database Analytics …
FINAL WORDS
• All presented techniques are just examples
o Many more systems and software products are available in this field
• People from many different fields have to work together to enable the analysis of big data.
o Business analysts
o Database specialists
o System engineers
o …
Thank You for Your
Attention!
ANY QUESTIONS?