Fall 2016
Big Data Processing & Analytics
V. CHRISTOPHIDES
University of Crete &
INRIA Paris
The Data Avalanche: From
Science to Business
Shifting Paradigm in Sciences




Thousand years ago: science was empirical
describing natural phenomena
Last few hundred years: theoretical branch
using models, generalizations
$$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$$
Last few decades: a computational branch
simulating complex phenomena
The fourth paradigm today (eScience): data
exploration unifies theory, experiment, and simulation
Data captured by instruments or generated by
simulators
Processed by software
Information/knowledge stored in computers
Scientists analyze data using data management
services and statistics
© Jim Gray
Large Synoptic Survey Telescope (LSST)
–100-200 Petabyte image archive
–20-40 Petabyte database catalog


LSST will take more than 800 panoramic images each night,
recording the entire visible sky twice each week
Ten-year time series imaging of the night sky: mapping the Universe!
http://www.lsst.org
8.4-meter diameter primary mirror, 10 square degrees field of view
3.2 billion-pixel camera
Large Hadron Collider (LHC)


Protons collide some 1 billion times
per second where each collision
produces about a megabyte of data
Even after filtering out about 99% of
it, scientists are left with around 30
petabytes each year to analyze for a
wide range of physics experiments,
including studies on the Higgs boson
9km diameter, ≈100m below ground
27-kilometre ring of superconducting magnets
http://home.web.cern.ch/topics/large-hadron-collider
Human Brain Project (HBP)


Generate and interpret
strategically selected data
needed to build multilevel atlases
and unifying models of the brain
Use anatomical frameworks to
organize and convey spatially
and temporally distributed
functional information about the
brain at all organizational levels,
from genes to cognition, and at
all the relevant spatial and
temporal scales
http://blogs.scientificamerican.com/sa-visual/2014/04/02/how-do-you-visualize-the-brain/
Data-driven Discovery
Innovation is no longer hindered by the ability to collect data, but by the
ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion
© JOHN R. JOHNSON
Disruption Time!

Until recently, you were a folder
Now, you are Your Data!
Blurring the Boundaries of Real & Virtual Worlds

Ubiquitous sensing & reasoning in physical and cyber worlds
Digital Disruption Already Happening !








The largest telco company owns no telco infrastructure (Skype)
The world’s largest movie house owns no cinemas (Netflix)
The largest software companies don’t write the apps (Apple, Google)
The world’s most valuable retailer has no inventory (Alibaba)
The most popular media owner creates no content (Facebook)
The world’s largest taxi company owns no vehicles (Uber)
The largest accommodation provider owns no real estate (Airbnb)
The fastest-growing banks actually have no money (SocietyOne)
http://www.independent.co.uk/news/business/comment/hamishmcrae/facebook-airbnb-uber-and-the-unstoppable-rise-of-thecontent-non-generators-10227207.html
The Data Tsunami
Mobile devices
(tracking all objects all the time)
Scientific instruments
(collecting all sorts of data)
Social media and networks
(all of us are generating data)

Sensor networks
(measuring all kinds of data)
Three main reasons:
Processes are increasingly automated
Systems are increasingly interconnected
People are social and increasingly generate data exhausts by
interacting online
The New Moore’s Law

The Economist: digital information grows
10 times every 5 years!

Data volume is increasing exponentially
44x increase from 2009 to 2020
From 0.8 ZB to 35ZB
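The two growth figures above can be compared by converting each into a yearly multiplier. The following is a minimal sketch, assuming steady compound growth over each period (an assumption that is not stated on the slide); the function name `annual_growth_factor` is ours, for illustration only.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Return the yearly multiplier g such that g ** years == total_factor."""
    return total_factor ** (1.0 / years)


# The Economist figure: 10 times in 5 years.
economist = annual_growth_factor(10, 5)

# 0.8 ZB in 2009 to 35 ZB in 2020: roughly a 44x increase over 11 years.
total_2009_2020 = 35 / 0.8
forecast = annual_growth_factor(total_2009_2020, 2020 - 2009)

print(f"10x in 5 years -> about {economist:.2f}x per year")
print(f"{total_2009_2020:.1f}x in 11 years -> about {forecast:.2f}x per year")
```

Under that assumption, the two figures correspond to roughly 58% and 41% growth per year respectively, so they describe the same order of magnitude but are not identical rates.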
Big Data = Transactions+Interactions+Observations
IoT sensors are reporting even more personal data than humans are!
Petabytes
Terabytes
Gigabytes
Megabytes
Increasing Data Variety, Velocity, & Veracity
hortonworks.com/blog/7-key-drivers-for-the-big-data-market
The Emerging Data Economy
Source: CISCO ISBG 2012
Where Do Analytics Generate Value?
http://fr.slideshare.net/wclquang/the-analytics-valuechain-key-to-delivering-business-value-in-iot
What Makes Data, “Big” Data?
Definitions

No single standard definition…
“Big Data” is data whose scale,
diversity, and complexity require new
architecture, techniques, algorithms,
and analytics to manage it and
extract value and hidden knowledge
from it… (McKinsey Global Inst.)
“Big Data” is high-volume, high-velocity and high-variety information
assets that demand cost-effective,
innovative forms of information
processing for enhanced insight and
decision making (Gartner)
The Four V’s of Big Data
Characteristics of Big Data: 1-Scale (Volume)

Too big: petabyte-scale collections or lots of (not necessarily big) data sets
Characteristics of Big Data: 2-Speed (Velocity)

Too fast: needs to be processed quickly
Characteristics of Big Data: 3-Complexity (Variety)

Too diverse: does not fit neatly in an existing tool
Characteristics of Big Data: 4-Quality (Veracity)

Too crappy: its quality needs to be assessed
Other Big Data Qualities

variability (data whose meaning can be constantly shifting in relation to
the context in which they are generated) (McNulty, 2014)

exhaustivity (an entire system is captured, n = all, rather than being
sampled) (Mayer-Schonberger and Cukier, 2013)

fine-grained (in resolution) and uniquely indexical (in identification)
(Dodge and Kitchin, 2005)

relationality (containing common fields that enable the conjoining of
different datasets) (Boyd and Crawford, 2012)

extensionality (can add/change new fields easily) and scalability (can
expand in size rapidly) (Marz and Warren, 2012)
Summary of 4 Big Data Characteristics

Volume
  Description: The amount of data generated or data intensity that must be ingested, processed & analyzed to make data-driven decisions
  Properties: Batch processing
  Drivers: High number of data sources; high-resolution sensors

Velocity
  Description: How fast data is being produced and ingested and the speed at which data is transformed into insight
  Properties: Streaming, online processing; (near) real-time analytics
  Drivers: Real-time, high-rate data acquisition; low hardware cost

Variety
  Description: The degree of diversity of data from sources both inside and outside an organization
  Properties: Structuring degree; multi-modality; complex interrelations; sequences; implicit semantics
  Drivers: Mobile; social media; video; scientific data; M2M / IoT

Veracity
  Description: The quality and traceability of data
  Properties: Consistency; completeness; integrity; ambiguity; IID assumption
  Drivers: Crowd data production; human sensing
We’ve Moved into a New Era of Data Analytics
Volume: 12+ terabytes of Tweets created daily
Velocity: 5+ million trade events per second
Variety: 100’s of different types of data
Veracity: only 1 in 3 decision makers trust their information
Declining % of Data an Organization
Can Analyse
http://www.youtube.com/watch?v=B27SpLOOhWw
Big Data Processing
Big Data: Old Wine in a New Bottle?



No, it is a different type of data wave:
one needs to put together many sources of information, coming through
many different channels, throw away what is not important, work
under resource constraints (time), and serve real users’ needs
Yes, most of these problems have been in the focus of data management
research for years
Big Data movement: exponential growth of data enthusiasts!
Proliferation of data producers and consumers, e.g., on the Web,
scientific, social, government, urban, home and personal spaces
The main issue is to put all this together to satisfy concrete data analysis
needs via innovative technology
The WRONG Picture!
Beyond Big Data Size!
Volume = Length × Width × Depth
Length: Collect & Compare
Width: Curate & Integrate
Depth: Analyze & Understand
Massive Data Analysis J. Freire & J. Simeon New York University
Course 2013
Big Data vs Deep Insights
data
knowledge
Data exploration is hard regardless of whether data are big or small !
The TRUE Picture!
The time for developing an
analysis (with small data)
The time for developing an
analysis (with big data)
Big Data Infrastructures: Exploiting the Power of Big Data
T. Sellis School of CS & IT, 2015 Athens
Exploring Big Data: What is Hard?





Scalability for computations?
NOT REALLY!
Lots of work on distributed
systems, parallel databases, …
Cloud elasticity: Add more
nodes!
But there is no one-size-fits-all solution:
often, you have to build your own…
Rapidly-evolving technology
Many different tools
Different computation model:
need new algorithms!
Advanced Analytics Requires a Robust,
Comprehensive Information Platform
©2011 IBM Corporation
Big Data Research Agenda

Acquisition, Storage, and Management of “Big Data”
Data representation, storage, and retrieval
New parallel data architectures, including clouds
Data management policies, including privacy and secure access
Communication and storage devices with extreme capacities
Sustainable economic models for access and preservation

Data Analytics
Computational, mathematical, statistical, and algorithmic techniques for modelling high dimensional data
Learning, inference, prediction, and knowledge discovery for large volumes of dynamic data sets
Data mining to enable automated hypothesis generation, event correlation, and anomaly detection
Information infusion of multiple data sources

Data Sharing and Collaboration
Tools for distant data sharing, real time visualization, and software reuse of complex data sets
Cross disciplinary model, information and knowledge sharing
Remote operation and real time access to distant data sources and instruments

Source: Big Data R&D Initiative, Howard Wactlar, NIST Big Data Meeting, June 2012
Big Data Mining
What to Do with Big Data?

Data contains value and knowledge

But to extract the knowledge, data
needs to be
Stored
Managed
And ANALYZED

Data analysis includes:
Mine/summarize large datasets
Extract knowledge from past data
Predict trends in future data
Data Mining ≈ Big Data ≈ Data
Analytics ≈ Data Science
A Bit of Terminology


Data mining is the old big data: an overused term covering anything such as
collecting, storing, curating and visualizing data
machine learning / AI (which predates the term data mining)
non-ML data mining (as in "knowledge discovery", where the focus is on
new knowledge, not on learning of existing knowledge)
"Business intelligence" and "business analytics" are marketing terms
stressing that more data leads to better business decisions (periodic
reporting as well as ad hoc queries, importance of tools and dashboards)
Most "Big Data" today isn't ML: it's Extract, Transform, Load (ETL), so it is
replacing data warehousing (except for computational advertising)
Business Intelligence applies descriptive statistics to data with high
information density to measure things, detect trends, etc.
Big Data applies inductive statistics to data with low information density,
whose huge volume allows laws (regressions…) to be inferred, thus giving
Big Data (within the limits of inductive reasoning) some predictive
capabilities (called Deep Analytics)
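The contrast between descriptive and inductive statistics can be made concrete on a toy series. This is a minimal sketch, not a method from the course: the monthly sales figures are invented for illustration, and `fit_line` is a hypothetical helper implementing ordinary least squares for a single variable.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Invented data: sales per month (illustration only).
months = [1, 2, 3, 4, 5, 6]
sales = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

# Descriptive statistics: measure what the data already shows.
average_sales = mean(sales)

# Inductive statistics: infer a law (a regression) and use it to predict.
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept

print("average so far:", average_sales)
print("fitted law: sales =", slope, "* month +", intercept)
print("forecast for month 7:", forecast_month_7)
```

The average summarizes the past, while the fitted line is a model that can be applied to months that have not been observed, with the usual caveat that the prediction is only as good as the inferred law.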
Data Analysis: ERP & CRM Examples
Who are our
lowest/highest margin
customers ?
What is the most
effective distribution
channel?
What product promotions have the biggest
impact on revenue?
Who are my customers
and what products
are they buying?
Which customers
are most likely to go
to the competition ?
What impact will
new products/services
have on revenue
and margins?
Agrawal et al., VLDB 2010 Tutorial
Data Analysis in Urban Computing
Where are our
lowest/highest margin
passengers?
What is the distribution
of trip lengths?
What would the
impacts be of a
fare change?
What is the quickest
route from midtown
to downtown at 4pm
on Monday?
Where should drivers
go to get
passengers?
What impact will
the introduction of
additional medallions
have?
Agrawal et al., VLDB 2010 Tutorial
The Data Analysis Spectrum
(ordered by increasing difficulty and value)
What is happening? Monitoring (Dashboards, Scorecards)
What happened? Descriptive Analytics
Why did it happen? Diagnostic Analytics
What might happen? Predictive Analytics
How can we make it happen? Prescriptive Analytics
Source: Gartner
Data Mining Methods
Use some variables to predict unknown or future values of other variables
Find human-interpretable patterns that describe the data
Large-Scale, Real-World Analytics
Question: How can I identify fraudulent activity?
  Method: Classification, Logistic Regression
Question: How do I segment my customers?
  Method: K-means Clustering
Question: Does this product appeal to some segments more than others?
  Method: Log-likelihood
Question: Which campaign is working better?
  Method: Mann-Whitney U Test
Question: How is product ownership distributed across customer segments?
  Method: SQL, Cumulative Distribution Functions
Question: How do I target my marketing efforts towards customers most likely to churn?
  Method: Logistic Regression
Question: What new products should I offer my customers?
  Method: Cosine similarity, k-Nearest Neighbors, Matrix multiplication
Question: What are my customers saying about the new product launch?
  Method: NLP, sparse vectors
Source: Tools and Technologies for Big Data, Steven Hillion, V.P. Analytics, EMC Data Computing Division, 2011
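To illustrate one of the question/method pairs above ("What new products should I offer my customers?" with cosine similarity and k-nearest neighbors), here is a minimal sketch over sparse purchase vectors. The customer names, products, and the helper names `cosine_similarity`, `nearest_neighbors`, and `recommend` are all invented for illustration; a production recommender would differ in many ways.

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(target: str, customers: dict, k: int) -> list:
    """Names of the k customers most similar to the target customer."""
    scored = [
        (cosine_similarity(customers[target], vector), name)
        for name, vector in customers.items()
        if name != target
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def recommend(target: str, customers: dict, k: int = 1) -> list:
    """Products bought by the nearest neighbors but not yet by the target."""
    owned = set(customers[target])
    scores = {}
    for name in nearest_neighbors(target, customers, k):
        for product, count in customers[name].items():
            if product not in owned:
                scores[product] = scores.get(product, 0.0) + count
    return sorted(scores, key=scores.get, reverse=True)


# Invented purchase counts per customer (illustration only).
purchases = {
    "ann": {"laptop": 1, "mouse": 2},
    "bob": {"laptop": 1, "mouse": 1, "dock": 1},
    "cat": {"novel": 3, "lamp": 1},
}

print(nearest_neighbors("ann", purchases, k=1))
print(recommend("ann", purchases, k=1))
```

Storing each customer as a dictionary keeps the vectors sparse: only products that were actually bought take up space, which matters when the product catalog is large.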
Data Mining: Different Cultures



Data mining overlaps with:
Databases (DB): Large-scale data, simple queries
Machine Learning (ML): Small data, Complex models
Computer Science Theory: (Randomized) Algorithms
Different cultures:
To a DB person, data mining is an extreme form of analytic processing
– queries that examine large amounts of data
Result is the query answer
To a ML person, data-mining is the inference of models
Result is the parameters of the model
Big Data calls for a cross-culture curriculum stressing
Scalability
Algorithms
Computing architectures
Automation for handling large data
(Figure: data mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems)
Big Data: Small vs. Big Analysis


If you want to analyze the whole set by accessing data several times, it
can be much harder
Many trial-and-error steps, easy to get lost…
Most existing data mining/ML methods were designed without
considering data access and communication of intermediate results
They iteratively use data by assuming they are readily available
How to integrate these two worlds together?
Data mining/ML methods: efficient in analyzing/mining data, but do not scale
Big data management systems: efficient in managing big data, but do not analyze or mine the data
Big Data: Small vs. Big Analysis

Lots of intensive computations for complex math operations need new
support in parallel/distributed settings
Matrix multiplication
QR decomposition (QR factorization)
Singular Value Decomposition (SVD)
Linear regression
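As an illustration of why such operations need rethinking in a distributed setting, matrix multiplication can be phrased as map and reduce steps over individual matrix entries. The sketch below simulates a common two-phase formulation in a single Python process; it is an assumption-laden toy (sparse matrices as dictionaries, no real cluster or Hadoop API), and the function name `matmul_mapreduce` is ours.

```python
from collections import defaultdict


def matmul_mapreduce(A: dict, B: dict) -> dict:
    """Multiply sparse matrices given as {(row, col): value} dictionaries."""
    # Map: key every entry by the index j shared by A[i][j] and B[j][k],
    # so that entries that must be multiplied meet at the same reducer.
    by_j = defaultdict(lambda: ([], []))
    for (i, j), a in A.items():
        by_j[j][0].append((i, a))
    for (j, k), b in B.items():
        by_j[j][1].append((k, b))

    # Reduce (phase 1): for each j, emit partial products keyed by (i, k).
    partial_products = []
    for a_entries, b_entries in by_j.values():
        for i, a in a_entries:
            for k, b in b_entries:
                partial_products.append(((i, k), a * b))

    # Reduce (phase 2): sum the partial products of each output cell.
    C = defaultdict(float)
    for cell, value in partial_products:
        C[cell] += value
    return dict(C)


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] in sparse form.
A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
C = matmul_mapreduce(A, B)
print(C)
```

Each phase only needs entries that share a key, which is what lets the work be spread across machines; the cost moves from arithmetic to shuffling the intermediate partial products between phases.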

So we are facing many challenges
methods not ready
tools are not convenient
platforms rapidly change,
…
Big Data Mining



This is an on-going research topic
Roughly there are two types of approaches
Parallelize existing (single-machine) algorithms
Design new algorithms particularly for distributed settings
Of course there are things in between
To have technical breakthroughs for big-data analytics, we should know both
algorithms and systems well, and consider them together
(Figure layers: Focused Services (Deep Insights), Deep Analytics, Big Data Platform)
Big Data Mining Challenges






Analytics Architecture
combine historical with fresh data at the same time
Statistical Significance: Bigger Data are not always Better Data!
achieve robust statistical results, and not be fooled by randomness
Distributed Mining: Just because it is accessible doesn’t make it ethical!
parallelize data mining techniques with practical & theoretical guarantees
Stream Mining: “Big” will evolve/change!
learn from evolving data streams
Sampling and Compression: Not all Data are equivalent!
subsampling is easy on one machine, but may not be in a distributed
setting
Hidden Big Data
currently only 3% of the potentially useful data is tagged, and even less
is analyzed
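The sampling and stream-mining challenges above can be illustrated with reservoir sampling, a classic one-pass technique for keeping a uniform sample of a stream whose length is not known in advance. This is a minimal single-machine sketch (the "easy" case mentioned above); combining samples drawn on several machines needs extra care and is not shown. The function name and parameters are ours.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k / n,
            # replacing a uniformly chosen current member.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 10 readings from a (simulated) stream of 100,000 sensor ids.
sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
print(sample)
```

The memory used is proportional to `k`, not to the length of the stream, which is what makes the approach usable on data that never ends.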
Big Data Analytics Platforms
Online Machine
Learning
Big Machine Learning
(Mahout, MillWheel, R/Hadoop)
(SAMOA, Rapid Miner,
OIIDM)
IoT Data
Analysis
(Parstream, Vitria,
Splunk, virdata)
(Striim, Storm, Spark,
Google RT, Apache S4,
MS , Azure, AWS
Kinesis)
Big Time
Series
Analytics
(InfluxDB, AT&T M2X, IBM
Informix TS, OpenTSDB)
Source: Vmware
What Is This Course About?
The Need: Making Sense of
the Big Data Universe
Frameworks:




New computing paradigms:
Cloud, Hadoop – Map/Reduce
New storage solutions: NoSQL,
column stores, Big Table
New languages: JAQL, Pig Latin
We will study these and how they
relate to previous technologies
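To give a first feel for the Map/Reduce paradigm listed above, here is the customary word-count example simulated in a single Python process. It is only a sketch of the programming model: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are ours and do not correspond to the Hadoop API, where the shuffle is performed by the framework across machines.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: list):
    """Reduce: combine all values that share a key."""
    return word, sum(counts)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["Big data is big", "data is everywhere"])
print(counts)
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, both steps can run in parallel on different machines without coordination beyond the shuffle.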
Analysis:


New frameworks demand new
approaches to explore data
We will study algorithms to
process and mine data in Big Data environments
Tackling Big Data M. Cooper &
P. Mell NIST Information
Technology Laboratory Computer
Security Division
What You Will Learn




Understand different models of computation:
MapReduce
Streams and online algorithms
Mine different types of data:
Data is high dimensional
Data is infinite/never-ending
Use different mathematical ‘tools’:
Hashing (LSH, Bloom filters)
Dynamic programming (frequent itemsets)
Solve real-world problems:
Duplicate document detection
Market Basket Analysis
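As a preview of the hashing tools listed above, the following is a minimal Bloom filter sketch: a compact set representation that may report false positives but never false negatives. The sizes, the use of SHA-256 with a seed prefix to derive several hash functions, and the class and method names are our own illustrative choices, not the course's reference implementation.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, for readability rather than compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Derive several hash functions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[position] for position in self._positions(item))


seen_urls = BloomFilter()
for url in ["http://www.mmds.org", "http://www.lsst.org"]:
    seen_urls.add(url)

print(seen_urls.might_contain("http://www.mmds.org"))
```

A `False` answer from `might_contain` is always correct, whereas a `True` answer should be read as "probably seen", with an error rate that depends on the number of bits, hash functions, and inserted items.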
Tentative Course Schedule
















Lecture 1-2 (20-22/09): Course Overview
Lecture 3-4 (27-29/09): The Map-Reduce Software Ecosystem
Lecture 5-6 (04-06/10): Massive Data Processing
Lecture 7-8 (11-13/10): Finding Similar Items
Lecture 9-10 (18-20/10): Extracting Association Rules
Lecture 11-14 (25-27/10&01-03/11): Analysing Data Streams
Lecture 15-17 (08-10-15/11): Web Entity Resolution
Lab 1 Introduction to Hadoop Map-Reduce (28-09)
Lab 2 Advanced Map-Reduce Programming (05-10)
Lab 3 MapReduce Design Patterns (12-10)
Lab 4 (19-10) Programming Assignment 1
Lab 5 (26-10) Programming Assignment 2
Lab 6 (02-11) Hive
Lab 7 (09-11) Pig
Lab 8 (16-11) Spark
Student presentations (22-23-24-29-30)
© NY Times
Course Text Books

Jure Leskovec, Anand Rajaraman, Jeff Ullman. “Mining of Massive Datasets” Cambridge University Press, 2014
http://www.cambridge.org/gr/academic/subjects/computer-science/knowledge-management-databases-and-data-mining/mining-massive-datasets-2nd-edition
Free download http://www.mmds.org

Donald Miner, Adam Shook “MapReduce Design Patterns” O'Reilly Media 2013
http://shop.oreilly.com/product/0636920025122.do
Free download of chapter samples
http://cdn.oreillystatic.com/oreilly/booksamplers/9781449327170_sampler.pdf
Course Organization



Three Programming Exercises (40%):
Map/Reduce & Hadoop
A Research Project (30%):
Presentation of 2 research papers and written report
Final Examination (30%):
Words of Caution






We can only cover a small part of the big data universe
Do not expect all possible architectures, programming models,
theoretical results, or vendors to be covered
This really is an algorithms course, not a basic programming course
But you will need to do a lot of non-trivial programming
There are few certain answers, as people in research and leading tech
companies are trying to understand how to deal with big data
We are working with cutting edge technology
Bugs, lack of documentation, new Hadoop API
In short: you have to be able to deal with inevitable frustrations and plan
your work accordingly…
…but if you can do that and are willing to invest the time, it will be a
rewarding experience
References







CS246: Mining Massive Datasets Jure Leskovec, Stanford University, 2014
CS9223 – Massive Data Analysis J. Freire & J. Simeon New York University
Course 2013
CS 6240: Parallel Data Processing in MapReduce Mirek Riedewald
Northeastern University 2014
Big Data Infrastructures: Exploiting the Power of Big Data T. Sellis School of
CS & IT, 2015 Athens
CS525: Special Topics in DBs Large-Scale Data Management Advanced
Analytics on Hadoop Mohamed Eltabakh Spring 2013
Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department
of Computer Science National Taiwan University August 30, 2014
Knowledge Discovery and Data Mining Evgueni Smirnov Maastricht School
on Data Mining Department of Knowledge Engineering, Maastricht
University, Maastricht, The Netherlands August 27 - August 30, 2013
The Big Data Analysis Pipeline
Major Processing Steps
Major Challenges
http://cacm.acm.org/magazines/2014/7/176204-big-data-and-its-technical-challenges/abstract
Big Data Processing and Analysis Framework
http://www.pinterest.com/pin/272327108689099496/