Data Mining and Big Data
Ahmed K. Ezzat
Big Data Overview
1
Outline
1. What is Big Data?
2. Why Big Data Matters?
3. Big Data and Open Source
4. Big Data Strategies (Howie):
   - Shared-nothing parallel RDBMS
   - NoSQL (key/value stores)
   - MapReduce
5. Big Data and Visualization
6. Big Data Integration Challenge (data virtualization/federation)
7. Summary
2
What is Big Data?
3
1. What is Big Data?

- EMC/IDC definition: Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large Volumes of a wide Variety of data by enabling high-Velocity capture, discovery, and/or analysis.

- IBM/Microsoft definition: three characteristics define Big Data:
  - Volume (terabytes: 1,000 GB ≈ 2^40 bytes → zettabytes: a billion TB ≈ 2^70 bytes)
  - Variety (structured + semi-structured + unstructured)
  - Velocity (batch + streaming data)
4
1. What is Big Data?

- Volume + Velocity → No Consistency

- CAP Theorem (Eric Brewer, 2000): for a highly scalable distributed system, you can have at most two of the following:
  - Consistency,
  - high Availability,
  - network Partition tolerance (tolerance of network failures)
  http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
- Implication: Big Data solutions must stop worrying about strong consistency if they want high availability
5
1. What is Big Data?

- Big Data terminology:
  - CRUD: Create, Read, Update, Destroy
  - CRAP: Create, Read, Analytical Processing
- RDBMSs provide ACID
- NoSQL provides BASE (Basically Available, Soft state, Eventual consistency):
  - Basically Available: allow parts of the system to fail
  - Soft state: an object may have multiple simultaneous values
  - Eventual consistency: consistency happens over time (a minimal sketch follows below)
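To make BASE concrete, here is a minimal, hypothetical Python sketch (the Replica class and the last-write-wins merge are illustrative, not any particular product's design): each replica accepts writes locally (basically available), replicas may briefly disagree (soft state), and an anti-entropy merge makes them converge (eventual consistency).

    import time

    class Replica:
        """Toy eventually consistent key/value replica (last-write-wins)."""
        def __init__(self):
            self.store = {}  # key -> (timestamp, value)

        def put(self, key, value):
            # Accept the write locally even if peers are unreachable
            # ("basically available"); no coordination happens here.
            # Timestamps assume roughly synchronized clocks.
            self.store[key] = (time.time_ns(), value)

        def get(self, key):
            # May return a stale or divergent value ("soft state").
            entry = self.store.get(key)
            return entry[1] if entry else None

        def merge(self, other):
            # Anti-entropy exchange: the newest timestamp wins, so all
            # replicas converge over time ("eventual consistency").
            for key, (ts, val) in other.store.items():
                if key not in self.store or self.store[key][0] < ts:
                    self.store[key] = (ts, val)

    a, b = Replica(), Replica()
    a.put("x", 1)                      # write arrives at replica a
    b.put("x", 2)                      # concurrent write arrives at replica b
    print(a.get("x"), b.get("x"))      # 1 2 -> temporarily inconsistent
    a.merge(b); b.merge(a)
    print(a.get("x"), b.get("x"))      # 2 2 -> converged

Real systems such as Dynamo refine this idea with vector clocks and quorums, but the same availability-over-consistency trade is at the core.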
6
1. What is Big Data?

CAP Theorem with ACID and BASE Visualized:
7
1. What is Big Data?

- Big Data: the 3 Vs characterizing Big Data limit the ability to perform analysis using traditional RDBMSs.

- Big Data Science: the study of techniques covering acquisition, conditioning, and evaluation of Big Data, including information technology and mathematical science.
- Big Data Frameworks: software libraries and algorithms that enable distributed processing and analysis of Big Data across clusters of CPUs, GPUs, etc.
- Big Data Infrastructure: instances of one or more Big Data Frameworks, including management APIs and servers (physical or virtual), deployed to solve a specific Big Data problem or to serve as a general-purpose analysis and processing engine.
8
1. What is Big Data?

- Technologies typically used with Big Data:
  - Machine Learning (ML): the study of systems that improve their performance with experience (typically by learning from data)
  - Artificial Intelligence (AI): the science of automating complex behavior such as learning, problem solving, and decision making
  - Data Mining (DM): the process of extracting useful information from massive quantities of complex data
- Many of the techniques used are statistical in nature, but are very different from classical statistics.
9
1. What is Big Data?

- What is Data Science?
  - Data Science: automatically extracting knowledge from data, combining:
    - Mathematics & Statistics
    - Machine Learning
    - Domain Expertise
  - Applications in business: many…
  - Scientific applications: astronomy, high-energy physics, biology/genomics, social science, medicine, government
10
1. What is Big Data?

- What is Data Mining?
  - The overlap of machine learning, statistics, AI, and databases, but with more stress on:
    - Scalability
    - Algorithms and architectures
    - Automation for handling large data
11
1. What is Big Data?

- Programming platforms:
  - SQL vs. NoSQL debate
  - Popular programming abstractions:
    - MapReduce
    - Dryad
    - Pipelines: Hadoop is not an island; it requires orchestration among many diverse technologies
    - Graphs: Pregel
    - Iterations: many analysis tasks involve iterations (Vectorwise)
    - DISC: Data-Intensive Scalable Computing
- How do you get the Big Data into your platform?
  - Bandwidth and application topology bottlenecks
12
1. What is Big Data?

- What is Big Data processing?
  - The data is large enough that it cannot be processed on a single node
  - What do we mean by "processed"?
    - CRUD (Create, Read, Update, Destroy) plus a potentially huge amount of actual processing
  - Big Data examples come from machine-generated domains, e.g., web crawling or tracking, real-time sensor data, log files, etc.
  - So for Big Data, CRUD gives way to CRAP: Create, Read, Analytical Processing
13
1. What is Big Data?


- Where is Big Data processed?
  - In the cloud
  - On specific Big Data scientific platforms
  - On smaller-scale virtualized platforms
  - …
- How is Big Data stored?
  - File systems: GFS, HDFS, etc.
  - Tabular: BigTable, etc.
  - Key/value stores: Dynamo, etc.
  - In-memory (Spark, from the AMPLab at UC Berkeley)
  - …
14
1. What is Big Data?

- Data is infinite/never-ending (data streams):
  - In SQL, the input is under the control of the programmer
  - Stream management is important when the input rate is controlled externally
  - The system cannot store the entire stream in time
  - How do you make critical calculations using a limited amount of memory?
15
1. What is Big Data?

- Applications of stream processing:
  - Mining query streams. Example: which queries are more frequent today than yesterday?
  - Mining click streams. Example: Yahoo! wants to know which of its pages got an unusual number of hits in the past hour.
  - Sensors need monitoring, especially when there are many sensors of the same type feeding into a central controller.
  - Telephone call records are summarized into customer bills.
  - IP packets can be monitored at a switch and used for optimal routing and to detect denial-of-service attacks.
  - Sliding windows: queries are about a window of length N, the most recently received elements (see the sketch below).
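As an illustration of the sliding-window idea, the following Python sketch (a toy example; the bit-stream count query is an assumed workload) maintains the number of 1s among the N most recent elements with O(1) work per arrival:

    from collections import deque

    class SlidingWindowCounter:
        """Count of 1s among the N most recently received bits."""
        def __init__(self, n):
            self.window = deque(maxlen=n)  # oldest element drops automatically
            self.ones = 0

        def arrive(self, bit):
            if len(self.window) == self.window.maxlen:
                self.ones -= self.window[0]   # the element about to fall out
            self.window.append(bit)
            self.ones += bit

        def query(self):
            return self.ones

    w = SlidingWindowCounter(n=4)
    for bit in [1, 0, 1, 1, 0, 1]:
        w.arrive(bit)
    print(w.query())  # 3: the last four elements are 1, 1, 0, 1

Note that storing the window exactly still costs O(N) memory; when N is too large to store, approximation schemes such as DGIM keep the count in O(log^2 N) bits.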
16
1. What is Big Data?


- Different types of data:
  - Data is high-dimensional
  - Data is a graph
  - Data is infinite/never-ending
  - Data is labeled
- Different models of computation:
  - MapReduce
  - Streams and online algorithms
  - In-memory processing
17
1. What is Big Data?

- Processes typically used in data analysis (a minimal sketch follows this list):
  - Prediction (explaining a specific attribute of the data in terms of other attributes):
    - Classification (predict a discrete value) and regression (estimate a numeric value)
    - Rule-based, case-based, and model-based learning
  - Modeling (describing the relationships among many attributes and many entities):
    - Representation and heuristic search
    - Clustering (modeling the group structure of data)
    - Bayesian networks (modeling probabilistic relationships)
  - Detection (identifying relevant patterns in massive/complex datasets):
    - Anomaly detection (detecting outliers, novelties, etc.)
    - Pattern detection (e.g., event surveillance, anomalous patterns)
    - Applications to biosurveillance, crime prevention, etc.
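As a minimal sketch of the prediction and modeling tasks above, assuming scikit-learn and NumPy are available (the two-blob dataset and every parameter here are purely illustrative): a logistic-regression classifier predicts a discrete label, while k-means models the group structure of the same data without using the labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression  # prediction: classification
    from sklearn.cluster import KMeans                   # modeling: clustering

    rng = np.random.default_rng(0)
    # Two illustrative 2-D "blobs": class 0 near (0, 0), class 1 near (3, 3).
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # Prediction: learn a rule mapping attributes to a discrete label.
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[2.5, 2.5]]))  # most likely class 1

    # Modeling: describe the group structure without using the labels.
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.cluster_centers_)        # centers roughly at (0, 0) and (3, 3)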
18
1. What is Big Data?

- Models vs. analytic processing (a quick numerical check follows this list):
  - To a DBMS person, data mining is an extreme form of analytic processing: queries that examine large amounts of data. The result is the answer to the query.
    - Given a billion numbers, a DBMS person would compute their average and standard deviation.
  - To a statistician, data mining is the inference of models. The result is the parameters of the model.
    - Given a billion numbers, a statistician might fit them to the best Gaussian distribution and report the mean and standard deviation of that distribution.
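The two views coincide numerically for a Gaussian: the maximum-likelihood fit of a normal distribution has exactly the sample mean and (population) standard deviation that the DBMS person's aggregate query returns. A quick check in Python, with illustrative data:

    import numpy as np

    data = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=1_000_000)

    # DBMS view: aggregate queries over the data.
    print(data.mean(), data.std())

    # Statistician's view: the Gaussian MLE yields the same two numbers,
    # mu_hat = mean(x) and sigma_hat = sqrt(mean((x - mu_hat)^2)).
    mu_hat = data.mean()
    sigma_hat = np.sqrt(((data - mu_hat) ** 2).mean())
    print(mu_hat, sigma_hat)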
19
1. What is Big Data?

- Google's computing infrastructure (figures are per query):
  - 200+ processors
  - 200+ terabyte database
  - 10^10 total clock cycles
  - 0.1 second response time
  - 5¢ average advertising revenue
20
1. What is Big Data?

- Google's computing infrastructure:
  - ~3 million processors, in clusters of ~2,000 processors each
  - x86 processors, IDE disks, Ethernet communications; reliability comes from redundancy and software management
- Partitioned workload:
  - Data: web pages and indices distributed across processors
  - Function: crawling, index generation, index search, document retrieval, ad placement
- Data-Intensive Scalable Computing (DISC):
  - A large-scale computer centered around data: collecting, maintaining, indexing, computing
  - Similar systems at Yahoo! and Microsoft
21
1. What is Big Data?

- DISC environment:
  - Architecture: cloud computing
  - Operating system: Hadoop; Apsara by Aliyun (http://www.aliyun.com)
  - Programming model: MapReduce
  - Data analysis: data mining
22
1. What is Big Data?

- DISC challenge: a computation accesses 1 TB in 5 minutes:
  - Data distributed over 100+ disks
  - Compute using 100+ processors
  - Connected by 1 Gbps Ethernet
  - (Sanity check: 1 TB in 300 s is about 3.3 GB/s aggregate, i.e., roughly 33 MB/s per disk across 100 disks, a rate commodity hardware can sustain)
- System requirements:
  - Lots of disks
  - Lots of processors
  - Located in close proximity
  - Interactive access
  - Robust fault tolerance
23
1. What is Big Data?

- Data mining's role in Big Data: the non-trivial discovery of implicit, previously unknown, and useful knowledge from massive data
24
1. What is Big Data?

- Data mining tasks in Big Data:
  - Association rule discovery (a toy support/confidence computation follows this list)
  - Classification
  - Clustering
  - Recommendation systems: collaborative filtering
  - Link analysis and graph mining
  - Managing web advertisements
- Descriptive methods: find human-interpretable patterns that describe the data
- Predictive methods: use some variables to predict unknown or future values of other variables
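To make association rules concrete, here is a toy Python computation (the market-basket transactions are invented for illustration): the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X → Y is the fraction of X-containing transactions that also contain Y.

    from itertools import combinations
    from collections import Counter

    transactions = [
        {"milk", "bread"}, {"milk", "eggs"},
        {"milk", "bread", "eggs"}, {"bread"},
    ]

    # Support of every 2-itemset: fraction of transactions containing it.
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    for pair, c in counts.items():
        print(pair, "support =", c / len(transactions))

    # Confidence of the rule milk -> bread: P(bread | milk).
    milk = sum(1 for t in transactions if "milk" in t)
    both = sum(1 for t in transactions if {"milk", "bread"} <= t)
    print("conf(milk -> bread) =", both / milk)  # 2/3

Real frequent-itemset miners such as Apriori prune this enumeration using the fact that any subset of a frequent itemset must itself be frequent.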
25
1. What is Big Data?


- Real-world problems to be addressed:
  - Recommendation systems
  - Association rules
  - Link analysis
  - Duplicate detection (dedup)
- Various tools (a Bloom filter sketch follows this list):
  - Linear algebra (Singular Value Decomposition (SVD), etc.)
  - Optimization (stochastic gradient descent)
  - Dynamic programming (frequent itemsets)
  - Hashing (Locality-Sensitive Hashing (LSH), Bloom filters)
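As one example from the hashing toolbox, here is a minimal Bloom filter sketch in Python (the array size m, hash count k, and the SHA-256-based hashing are illustrative choices): lookups can return false positives but never false negatives, which is often an acceptable trade for large-scale duplicate detection.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: k hash functions over an m-bit array."""
        def __init__(self, m=1024, k=3):
            self.m, self.k = m, k
            self.bits = 0  # a big integer serves as the bit array

        def _positions(self, item):
            # Derive k hash positions by salting one hash function.
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= (1 << pos)

        def __contains__(self, item):
            # True may be a false positive; False is always correct.
            return all(self.bits & (1 << p) for p in self._positions(item))

    bf = BloomFilter()
    bf.add("http://example.com/page1")
    print("http://example.com/page1" in bf)  # True
    print("http://example.com/other" in bf)  # almost certainly False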
26
1. What is Big Data?

How It All Fits Together:
27
Why Big Data Matters?
28
2. Why Big Data Matters?

- Big Data enables new things!
  - Google: the first big success of Big Data
  - Social networks (Facebook, Twitter, LinkedIn, etc.)
  - Location analytics
  - Healthcare
  - Personalized medicine
  - Semantics and AI!
  - …
29
2. Why Big Data Matters?

- The Big Data bubble: Big Data on the Gartner Hype Cycle
30
2. Why Big Data Matters?

- Big Data provides groundbreaking opportunities for enterprise information management and decision making
- The rate of information growth appears to be exceeding Moore's law
- The amount of data is exploding; companies are capturing and digitizing more information
- 35 zettabytes (billion TB) of data will be generated and consumed by the end of the decade
- Hadoop (HDFS, HBase, MapReduce) and Memcached are gaining a lot of momentum
31
2. Why Big Data Matters?
- Survey results: 69% of respondents rate Big Data as extremely important for competitive advantage; 70% say it has helped manage costs or improve operations
32
2. Why Big Data Matters?

- Where data mining and analytics are applied (survey; % of respondents, chart axis 0-30%): CRM/consumer analytics, banking, health care/HR, fraud detection, direct marketing/fundraising, finance, telecom/cable, science, insurance, advertising, education, web usage mining, credit scoring, retail, medical/pharma, manufacturing, e-commerce, social networks, search/web content mining, government/military, biotech/genomics, investment/stocks, entertainment/music, security/anti-terrorism, travel/hospitality, social policy/survey analysis, junk email/anti-spam, other
33
2. Why Big Data Matters?

Data Types Analyzed/Mined:
34
2. Why Big Data Matters?

- Largest dataset analyzed/mined: the 2011 median dataset size was ~10-20 GB, vs. 8-10 GB in 2010, with growth concentrated in the 10 GB to 1 PB range
35
2. Why Big Data Matters?

- Algorithms used for data analysis/mining in 2011 (% of analysts who used each; chart axis 0-70%): decision trees, regression, clustering, statistics, visualization, time series/sequence analysis, support vector machines (SVM), association rules, ensemble methods, text mining, neural nets, boosting, Bayesian methods, bagging, factor analysis, anomaly/deviation detection, social network analysis, survival analysis, genetic algorithms, uplift modeling
36
Big Data and Open Source
37
3. Big Data and Open Source

- STORM and KAFKA:
  - Storm (http://storm-project.net/): a distributed real-time stream-processing "computation" system.
  - Kafka (http://kafka.apache.org/documentation.html#design): a high-throughput distributed messaging system.
- SPARK (http://spark.incubator.apache.org/): a very fast in-memory cluster processing system.
- DREMEL and DRILL:
  - Dremel (http://research.google.com/pubs/pub36632.html): Google's platform for fast and interactive Hadoop queries.
  - Drill (http://wiki.apache.org/incubator/DrillProposal): the Apache open-source version of Dremel.
38
3. Big Data and Open Source

- GREMLIN, GIRAPH, Neo4j:
  - Gremlin (https://github.com/tinkerpop/gremlin/wiki): a graph traversal language.
  - Giraph (http://giraph.apache.org/): an iterative graph processing system; the open-source version of Google's Pregel.
  - Neo4j (http://www.neo4j.org/): a fully ACID, robust property-graph database (single node).
- R (http://www.r-project.org/): an open-source statistical programming language. R works well with Hadoop.
- D3 (Data-Driven Documents, http://d3js.org/): a JavaScript library for manipulating documents based on data.
39
3. Big Data and Open Source

- Software tools for data analysis:
  - WEKA (http://www.cs.waikato.ac.nz/ml/weka): a popular open-source data mining toolkit.
  - ASL (Accelerated Statistical Learning, http://www.autonlab.org): another statistical data mining toolkit, from CMU.
  - To implement your own ML methods, you can use Matlab, R, C++, etc.
40
Big Data Strategies
41
4. Big Data Strategies
- Shared-nothing ("except the network") parallel RDBMS. The network needs to be fast, at a minimum for throughput but preferably for latency as well.
  - Partitioning/sharding (a routing sketch follows this list):
    - Data is distributed among shards
    - Ideally with little or no inter-shard/inter-node communication
  - Columnar storage:
    - Useful for aggregation
    - Useful for compression
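A minimal sketch of hash-based shard routing (the node list and route function below are hypothetical, not from any particular product): hashing the partition key sends every row that shares the key to the same node, so most queries stay within a single shard.

    import hashlib

    NODES = ["shard0", "shard1", "shard2", "shard3"]  # hypothetical shard names

    def route(partition_key: str) -> str:
        """Deterministically map a partition key to one shard."""
        digest = hashlib.md5(partition_key.encode()).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    # All rows for one customer hash to the same shard, so a per-customer
    # query needs no inter-node communication.
    print(route("customer:42"))                          # e.g., "shard2"
    print(route("customer:42") == route("customer:42"))  # True: deterministic

Production systems usually prefer consistent hashing over simple modulo routing, so that adding or removing a shard relocates only a fraction of the keys.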
42
4. Big Data Strategies
- NoSQL (key/value stores): relatively low latency, not ACID, no joins
  - K/V or document stores (complex K/V stores): MongoDB, CouchDB, etc.
  - Column-oriented (tabular K/V stores): BigTable, HBase, Cassandra, etc.
43
4. Big Data Strategies
- MapReduce platform: a general-purpose, implicitly parallel programming paradigm
  - Hadoop addresses the "AP" (Analytical Processing) part of CRAP
  - Hadoop is composed of HDFS and MapReduce itself
  - Not limited to Java → streaming interface (Python, Ruby, etc.)
- Example (the standard Hadoop streaming invocation, with /bin/cat as the mapper and /bin/wc as the reducer; Python equivalents follow below):

  % $HADOOP_HOME/bin/hadoop jar \
        $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /bin/wc
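Because the streaming interface just pipes records through stdin/stdout, the mapper and reducer can be any executables. Here is a word-count pair sketched in Python (mapper.py and reducer.py are illustrative file names to pass via -mapper and -reducer in place of /bin/cat and /bin/wc above):

    #!/usr/bin/env python3
    # mapper.py -- emit one tab-separated (word, 1) pair per word
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- Hadoop sorts map output by key, so counts can be
    # accumulated in a single streaming pass over stdin
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")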
44
Big Data and Visualization
45
5. Big Data and Visualization

- Introduction:
  - Collecting information is no longer a problem, but extracting value from information collections has become progressively more difficult.
  - Visualization links the human eye and the computer, helping to identify patterns and to extract insights from large amounts of information.
  - Visualization technology shows considerable promise for increasing the value of large-scale collections of information.
  - Visualization has been used to communicate ideas, to monitor trends implicit in data, and to explore large volumes of data for hypothesis generation.
  - Visualization can be classified as scientific visualization, software visualization, and information visualization.
46
5. Big Data and Visualization

- Visualization classification: Scientific Visualization
  - Scientific visualization helps in understanding physical phenomena in the data (Nielson, 1991)
  - A mathematical model plays an essential role
  - Isosurfaces, volume rendering, and glyphs are commonly used techniques:
    - Isosurfaces depict the distribution of certain attributes
    - Volume rendering allows viewers to see the entire volume of 3-D data in a single image (Nielson, 1991)
    - Glyphs provide a way to display multiple attributes through combinations of various visual cues (Chernoff, 1973)
47
5. Big Data and Visualization

- Software visualization and information visualization:
  - Software visualization helps people understand and use computer software effectively (Stasko et al., 1998)
    - Program visualization helps programmers manage complex software (Baecker & Price, 1998), e.g., visualizing the source code (Baecker & Marcus, 1990), data structures, and the changes made to the software (Eick et al., 1992)
    - Algorithm animation is used to motivate and support the learning of computational algorithms
  - Information visualization helps users identify patterns, correlations, or clusters
    - Structured information: graphical representation to reveal patterns, e.g., Spotfire, SAS/GRAPH, SPSS; integration with various data mining techniques (Thearling et al., 2002; Johnston, 2002)
    - Unstructured information: need to identify variables and construct visualizable structures, e.g., VantagePoint, SemioMap, and Knowledgist
48
5. Big Data and Visualization

- Framework for information visualization:
  - Research on taxonomies of visualization:
    - Chuah and Roth (1996) listed the tasks of information visualization
    - Bertin (1967) and Mackinlay (1986) described the characteristics of basic visual variables and their applications
    - Card and Mackinlay (1997) constructed a data-type-based taxonomy
    - Chi (2000) proposed a taxonomy based on technologies, with four stages: value, analytic abstraction, visual abstraction, and view
    - Shneiderman (1996) identified two aspects of visualization: representation and user-interface interaction
    - C. Chen (1999) indicated that information analysis also helps support a visualization system
  - Three research dimensions support the development of an information visualization system:
    - Information representation
    - User-interface interaction
    - Information analysis
49
5. Big Data and Visualization

- Information representation:
  - Shneiderman (1996) proposed seven types of representation methods:
    - 1-D
    - 2-D
    - 3-D
    - Multidimensional
    - Tree
    - Network
    - Temporal approaches
50
5. Big Data and Visualization

- 1-D example: TileBars (Hearst, 1995)
51
5. Big Data and Visualization

- 2-D: represent information as two-dimensional visual objects
- 3-D: represent information as three-dimensional visual objects
52
5. Big Data and Visualization

- Multidimensional: represent information as multidimensional objects and project them into a 3-D or 2-D space
  - A dimensionality reduction algorithm needs to be used (a PCA sketch follows this list):
    - Multidimensional scaling (MDS)
    - Hierarchical clustering
    - K-means algorithms
    - Principal components analysis (PCA)
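Principal components analysis, for example, amounts to an SVD of the centered data matrix: projecting onto the top two right singular vectors yields the 2-D view that preserves the most variance. A minimal NumPy sketch, with random data standing in for real observations:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))   # 200 points in a 10-dimensional space

    Xc = X - X.mean(axis=0)          # center each attribute at zero
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    X2d = Xc @ Vt[:2].T              # project onto the top two components
    print(X2d.shape)                 # (200, 2): ready for a 2-D scatter plot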
53
5. Big Data and Visualization

- Tree: represent hierarchical relationships
  - Challenge: the number of nodes grows exponentially
54
5. Big Data and Visualization

- Network: represent complex relationships for which a simple tree is insufficient
55
5. Big Data and Visualization

- Temporal: represent data based on temporal order
  - Location and animation are commonly used
56
Big Data Integration Challenge
57
6. Big Data Integration Challenges

- Analytics pushes the limits of traditional data management
- Data virtualization solves the Big Data integration problem:
  - The answer to too many silos is not to integrate them into a data warehouse: scalability limitations and the cost of replication, processing, and storage would be too high
  - The answer is data federation, or virtualization
58
Summary
59
7. Summary
- Big Data is best described in terms of the 3 Vs (Volume, Variety, Velocity)
- Big Data matters because it can give an enterprise an advantage in a competitive market
- Hadoop and many analytics tools are available in the open-source community
- We covered different strategies for processing Big Data
- Visualization is critical for Big Data, enabling the extraction of value from information
- We also covered how to integrate Big Data silos so they can be queried and analyzed
60
END
61