Fayé A. Briggs, PhD
Adjunct Professor and Intel Fellow (Retired)
Rice University
Material mostly derived from “Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Stanford University
http://www.mmds.org
One petabyte is:
 ~50x the content in the Library of Congress
 ~13 years to view as continuous HD video
 ~11 seconds to generate in 2012
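The ~13-year viewing figure checks out under an assumed HD bitrate of roughly 20 Mbit/s (the bitrate is an assumption for illustration, not from the slide):

```python
# Rough check of the "one petabyte ~ 13 years of HD video" comparison.
# The ~20 Mbit/s HD bitrate is an assumed figure, not from the slide.
PETABYTE_BITS = 1e15 * 8   # one petabyte, in bits
hd_bitrate = 20e6          # assumed HD stream, bits per second

seconds = PETABYTE_BITS / hd_bitrate
years = seconds / (3600 * 24 * 365)
print(f"~{years:.1f} years of continuous HD viewing")  # ~12.7 years
```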
A transatlantic flight of a Boeing 777 produces about 30 terabytes of telemetry data.
A new generation of technologies and architectures designed to economically extract value from very large VOLUMES of a wide VARIETY of data by enabling high-VELOCITY capture, discovery, and/or analysis, while ensuring VERACITY
Source: IDC
Source: IDC's Digital Universe Study, sponsored by EMC, December 2012
http://blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data/
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
• By year 2020, more than 200 billion devices will be connected to the cloud and to each other; this is commonly called the Internet of Things (IoT)
• IDC predicts that one-third of these devices will be intelligent devices
• A large amount of legacy equipment is not connected, managed, or secured
• Need to address the interoperability of legacy systems, to avoid the incredibly large cost of replacing interfaces (I/F) so they can be securely connected to the cloud (public/private)
 “Economically” – Productivity, efficiency & better decisions
 “Extract Value & Analysis” – Through parallel computations & algorithms for better decisions
 “Very Large Volumes Capture” – Memory & storage
 “Wide Variety” – Sensors, software & security
 “High-Velocity” – Networking & I/O
 “Veracity” – Correlation to reality (quality and correctness)

Telecom
 Calling patterns, signal processing, forecasting
 Analyze switch/router data for call quality, call frequency, regional loads, etc.
 Act before problems happen; act before customer calls arrive.

Financial
 Trading behavior
 Analyze real-time data to understand market behavior and the role of individual institutions/investors
 Detect fraud, the impact of an event, and the players involved

Search Engines
 Process the data collected by Web crawlers along multiple dimensions
 Enhance relevance of search
Big Data impacts e-connected businesses through the efficient capture, processing, and storage of huge amounts of data

Click Stream Analysis
 Analysis of online user behavior
 Develop comprehensive insight (Business Intelligence) to run effective strategies in real
time

Graph analysis
 Term for discovering the online influencers in various walks of life
 Enables a business to understand key players and devise effective strategies

Lifecycle Marketing
 Strategies to move away from spam/mass mail
 Enables a business to spend money only on high-probability customers

Revenue Attribution
 Term for analyzing the data to accurately attribute revenue back to various marketing
investments
 Business can identify effectiveness of campaign to control expenses
The Big Data phenomenon allows businesses to know, predict, and influence customer behaviors!
[Figure: world map of the % of population over age 60 in 2050, binned 0-9%, 10-19%, 20-24%, 25-29%, and 30+%. Worldwide average aged 60+: 21%]
Source: United Nations, “Population Aging 2002”
Healthcare costs are RISING: a significant % of GDP
 Global AGING: the population share aged 60+ is growing from 10% to 21% by 2050
U.S. healthcare BIG DATA value: $300 billion in value/year, ~0.7% annual productivity growth
Healthcare effectiveness analysis draws on:
 medical histories, clinical information, imaging results, laboratory test results, physician interactions, preferred prescriptions, and patient accountability for taking those medications
 providers using graph analytics to assess many similar medical histories managed within a graph model that links patients to physicians, medications, presumed diagnoses, and providers
 providers rapidly scanning the graph to discover therapies used with other patients with similar characteristics (such as age, diagnostics, clinical history, associated risk factors, etc.) that have the most positive outcomes
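The graph-scan idea above can be sketched with a toy in-memory stand-in for the patient graph (all records, field names, and therapies here are invented for illustration):

```python
# Toy stand-in for the graph model described above: patients linked to
# diagnoses, therapies, and outcomes. All records are invented.
patients = {
    "p1": {"diagnosis": "type2_diabetes", "therapy": "metformin",
           "outcome": "improved"},
    "p2": {"diagnosis": "type2_diabetes", "therapy": "insulin",
           "outcome": "stable"},
    "p3": {"diagnosis": "hypertension", "therapy": "ace_inhibitor",
           "outcome": "improved"},
}

def therapies_for(diagnosis, outcome="improved"):
    """Scan for therapies that led to the given outcome in similar patients."""
    return sorted({r["therapy"] for r in patients.values()
                   if r["diagnosis"] == diagnosis and r["outcome"] == outcome})

print(therapies_for("type2_diabetes"))  # ['metformin']
```

A production system would traverse an actual graph store and match on many more similarity attributes, but the query shape is the same: find similar patients, then rank their therapies by outcome.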
Source: McKinsey Global Institute Analysis
ESG Research Report 2011 – North American Health Care Provider Market Size and Forecast
Where is the data coming from?
1. Pharma/Life Sciences
2. Clinical Decision Support & Trends
(includes Diagnostic Imaging)
3. Claims, Utilization and Fraud
4. Patient Behavior/Social Networking
How do we create value? (examples)

1. Personalized Medicine
2. Clinical Decision Support
3. Enhanced Fraud Detection
4. Analytics for Lifestyle- and Behavior-induced Diseases
Penn State planned:
 Broke ground on a $54M new data center dedicated to making use of big data to enhance medical research and patient care.
 It will allow the university to better gather and analyze large volumes of rich health data for effective prediction and modeling of diseases and disease behaviors.
Source: McKinsey Global Institute Analysis
Digital rendering of the 46,000 sq ft new data center, to open April 2016
[Figure: healthcare big data stack – new healthcare applications (health info services, primary care, personal health management for an aging society, clinical decision support, personalized medicine, cancer genomics) built on analytics and visualization (SQL-like query, machine learning, medical imaging analytics), over data (medical records, genome data, medical images) on a distributed data processing/management platform with storage optimization, security and privacy, and imaging acceleration]

Goals
 Advance the state-of-the-art core technologies required to collect, store, preserve, manage, and analyze massive amounts of data, and to visualize the results for business intelligence.

Challenge
 The large data sets from, for example, the proliferation of sensors, patient records, and experiential medical data are overwhelming data analysts, who lack the tools to efficiently store, process, analyze, and visualize the metadata derived from the vast amounts of data.
 Lack of IoT interface and data-delivery standards; fractured providers are causing data format wars (gerrymandering) and data loss from consolidations.
 Architecting systems to store this big data, extract metadata on the fly, and provide the computing capability to analyze the data in real time poses major challenges.
Data contains value and knowledge

But to extract the knowledge, the data needs to be:
 Stored
 Managed
 And ANALYZED to
 Predict actionable insight from the data
 Create data products that have business impacts
 Communicate relevant visuals to influence business
 Build confidence in data value to drive business decisions
Carlos Somohano, Data Science, London: What does a Data Scientist Do?
[Figure: big data pipeline – Generate, Transport, Store, Compute, and Protect, with Context & Location and a SW framework, feeding analytics leading to insight]
By requiring a variety of well-optimized technologies working together

Given lots of data, discover patterns and models that are:
 Valid: hold on new data with some certainty
 Useful: it should be possible to act on the item
 Unexpected: non-obvious to the system
 Understandable: humans should be able to interpret the pattern

Descriptive methods
 Find human-interpretable patterns that describe the data
 Example: Clustering

Predictive methods
 Use some variables to predict unknown or future values of other variables
 Example: Recommender systems
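A minimal sketch of a descriptive method: a hand-rolled 1-D k-means on toy data (the data, k, and iteration count are invented for illustration, not from the slides):

```python
import random
random.seed(0)  # reproducible toy run

def kmeans_1d(points, k, iters=20):
    """Minimal 1-D k-means: alternate nearest-center assignment
    and center update (mean of each cluster)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if its cluster emptied.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
print(kmeans_1d(data, 2))  # two centers, near 1.0 and 10.0
```

The "pattern" the method finds – two groups of points around 1 and 10 – is exactly the kind of human-interpretable description the slide refers to.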

A risk with data mining is that an analyst can “discover” patterns that are meaningless
Statisticians call this Bonferroni’s principle:
 Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap
Example:
 We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day
 10^9 people being tracked
 1,000 days
 Each person stays in a hotel 1% of the time (1 day out of 100)
 Hotels hold 100 people (so 10^5 hotels)
 If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?
Expected number of “suspicious” pairs of people:
 250,000
 … too many combinations to check – we need additional evidence to find “suspicious” pairs of people in some more efficient way
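The 250,000 figure can be reproduced with a back-of-the-envelope calculation that mirrors the example's independence assumptions:

```python
from math import comb

people = 10**9
days = 1_000
p_hotel = 0.01     # a given person is in some hotel on a given day
hotels = 10**5

# P(two given people are in the SAME hotel on a given day):
# both in a hotel, and the second picks the same one of 10^5.
p_same = p_hotel * p_hotel / hotels            # 1e-9

# P(a given pair matches on (at least) two days), approximated as
# (number of day pairs) * p_same^2
p_pair = comb(days, 2) * p_same**2             # ~5e-13

# Multiply by the number of pairs of people.
expected = comb(people, 2) * p_pair
print(f"{expected:,.0f}")  # ~250,000 "suspicious" pairs
```

Even though no pair behaves suspiciously, a quarter-million pairs look suspicious purely by chance – Bonferroni's principle in action.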
[Figure: key aspects of dealing with data – Usage, Quality, Context, Streaming, Scalability]

Data mining overlaps with:
 Databases: large-scale data, simple queries
 Machine learning: small data, complex models
 CS theory: (randomized) algorithms

Different cultures:
 To a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data
 The result is the query answer
 To an ML person, data mining is the inference of models
 The result is the parameters of the model

In this class we will review both!
[Figure: Venn diagram placing Data Mining at the intersection of CS Theory, Machine Learning, and Database Systems]

This class overlaps with machine learning, statistics, artificial intelligence, databases, and systems architecture, but stresses more:
 Scalability (big data)
 Algorithms
 Computing architectures
 Review of handling large data
 Visualization
[Figure: Venn diagram placing Data Mining at the intersection of Statistics, Machine Learning, Computer Systems, and Database Systems]

Learn to mine different types of data:
 Data is high dimensional
 Data is a graph
 Data is infinite/never-ending
 Data is labeled

Learn to use different models of computation:
 MapReduce
 Streams and online algorithms
 Single machine in-memory
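The MapReduce model can be sketched in-memory in a few lines – map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group; word count is the standard toy example (this sketch is illustrative, not the course's Hadoop setup):

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc):
    """Map: emit (word, 1) for every word in the document."""
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: sum the counts for one key."""
    return key, sum(values)

docs = ["big data big models", "data mining"]
pairs = chain.from_iterable(map_fn(d) for d in docs)
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'models': 1, 'mining': 1}
```

In a real cluster the map and reduce calls run in parallel across machines and the shuffle moves data over the network; the program structure stays exactly this.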

Review real-world problems:
 Recommender systems
 Market basket analysis
 Spam detection
 Duplicate document detection

Review various “tools”:
 Linear algebra (SVD, Rec. Sys., Communities)
 Optimization (stochastic gradient descent)
 Dynamic programming (frequent itemsets)
 Hashing (LSH, Bloom filters)
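As a taste of the hashing tools, a minimal Bloom filter sketch (the size m and hash count k here are illustrative defaults; real deployments derive them from the expected item count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k derived hash functions."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # Python int used as a bit array

    def _indexes(self, item):
        # Derive k hash values by salting SHA-256 with the index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits |= 1 << idx

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits >> idx & 1 for idx in self._indexes(item))

bf = BloomFilter()
bf.add("spam@example.com")
print("spam@example.com" in bf)  # True (no false negatives)
print("ham@example.com" in bf)   # False with high probability
```

The filter answers "definitely not seen" or "probably seen" in O(k) time and constant space – the trade that makes it useful for filtering massive streams.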
Course topics by data type:
 High-dim. data: Locality-sensitive hashing, Clustering, Dimensionality reduction
 Graph data: PageRank, SimRank, Community detection, Spam detection
 Infinite data: Filtering data streams, Web advertising, Queries on streams
 Machine learning: SVM, Decision trees, Perceptron, kNN
 Apps: Recommender systems, Association rules, Duplicate document detection
How do you want that data?

Instructor:
 Wish I had TAs!

Office hours:
 TBD

Course website: TBD
 Lecture slides (to be posted to a Rice U website – TBD)
 Readings

Readings: the book Mining of Massive Datasets, by J. Leskovec, A. Rajaraman, and J. Ullman
Free online: http://www.mmds.org

(1+)4 longer homeworks: 40%
 Theoretical and programming questions
 HW0 (Hadoop tutorial) has just been posted
 Assignments take lots of time. Start early!!

How to submit?
 Homework write-up:
 Stanford students: in class or in the Gates submission box
 SCPD students: submit write-ups via SCPD
 Attach the HW cover sheet (and SCPD routing form)
 Upload code:
 Put the code for 1 question into 1 file and submit at: http://snap.stanford.edu/submit/