Massive Data Analysis: Course Overview
Juliana Freire!
Content obtained from many sources, including: Agrawal et al., VLDB 2010 tutorial; Shim, VLDB 2012 tutorial; Jeff Ullman’s lecture notes; and slides by G. Weikum
Course Staff and Information
•  Instructors:!
o  Juliana Freire!
o  Jerome Simeon!
•  Reach us at [email protected]!
•  More info on http://vgc.poly.edu/~juliana/courses/cs9223 !
•  In our wiki you will find:!
Tentative schedule!
News and announcements!
Reading list!
Assignments!
Check it often!!!
http://www.vistrails.org/index.php/Course:_Big_Data_Analysis!
What we will cover
•  Infrastructure: Architecture, computing models (e.g.,
MapReduce), storage solutions (e.g., Big Table, MongoDB),
query/processing languages!
•  Algorithms and analysis: statistics, data mining techniques!
•  Tentative schedule in:!
http://www.vistrails.org/index.php/Course:_Big_Data_Analysis!
•  Readings from:!
o  Scientific papers!
o  Textbooks (they are free to download!)!
Mining of Massive Data Sets (version 1.1), by Anand Rajaraman, Jure
Leskovec and Jeff Ullman.
http://infolab.stanford.edu/~ullman/mmds.html !
Data-Intensive Text Processing with MapReduce, by Jimmy Lin and
Chris Dyer. http://lintool.github.com/MapReduceAlgorithms/index.html!
Pre-Requisites
•  A course in database systems, covering application
programming in SQL and other database-related
languages such as XQuery!
•  A course on algorithms and data structures!
•  Good programming skills!
What you will do
•  Programming assignments (50%) done individually!
o  You will need a computer!
o  We will provide you access to Amazon AWS (more details
later)!
•  Quizzes (15%): you will use Gradiance!
o  Register at http://www.newgradiance.com/services!
o  Use token 00B06796!
•  Final exam (35%)!
Motivation
Big Data: What is the Big deal?
http://www.google.com/trends/explore#q=%22big%20data%22!
Big Data: What is the Big deal?
•  Many success stories!
o  Google: many billions of pages indexed, products, structured data!
o  Facebook: 1.1 billion users using the site each month!
o  Twitter: 517 million accounts, 250 million tweets/day!
•  This is changing society!!

“Google grew from processing 100 TB of data a day with MapReduce in 2004 [45] to processing 20 PB a day with MapReduce in 2008 [46]. In April 2009, a blog post was written about eBay’s two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day.” (Lin and Dyer, 2010)
The McKinsey Report on Big Data
Data have swept into every industry and business function and
are now an important factor of production, alongside labor and
capital. We estimate that, by 2009, nearly all sectors in the US
economy had at least an average of 200 terabytes of stored data
(twice the size of US retailer Wal-Mart's data warehouse in 1999)
per company with more than 1,000 employees…!
The use of big data will become a key basis of competition and
growth for individual firms.!
There will be a shortage of talent necessary for organizations to
take advantage of big data. By 2018, the United States alone
could face a shortage of 140,000 to 190,000 people with deep
analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective
decisions.!
http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation!
Big Data: New Opportunities
•  Enable scientific breakthroughs!
•  Petabytes of data generated each day, e.g., Australian
radio telescopes, Large Hadron Collider, Sloan Digital
Sky Survey, genomes, climate data, …!
•  Social data, e.g., Facebook, Twitter !
•  3,180,000 and 3,410,000 results in Google Scholar!!
Big Data: New Opportunities
•  Smart Cities: 50% of the world population lives in cities !
o  Census, crime, emergency visits, taxis, public
transportation, real estate, noise, energy, …!
•  Cities are making their data available!!
http://www.data.gov/united-states-datasites!
https://nycopendata.socrata.com/!
•  Make cities more efficient and sustainable, and improve
the lives of their citizens
!
NYC Inspections
New York City gets 25,000
illegal-conversion complaints a
year, but it has only 200
inspectors to handle them. !
Flowers’ group integrated
information from 19 different
agencies that provided
indications of problems in buildings!
!
Result: hit rate for inspections
went from 13% to 70%!
Big Data: New Opportunities
NYU CUSP aims to “use
New York City as its laboratory and classroom to help cities
around the world become more productive, livable,
equitable, and resilient. CUSP observes, analyzes, and
models cities to optimize outcomes, prototype new
solutions, formalize new tools and processes, and develop
new expertise/experts.”
http://cusp.nyu.edu/!
Big Data: New Opportunities
•  Data is currency: companies are profiting from
knowledge extracted from Big Data!
o  Better understand customers, targeted advertising, …!
Big Data: New Opportunities
http://blogs.wsj.com/venturecapital/tag/big-data/
What is Massive/Big Data?
The three V’s of big data: Volume, Variety, and Velocity!
•  Too big: petabyte-scale collections or lots of (not
necessarily big) data sets !
•  Too hard: does not fit neatly in an existing tool!
o  Data sets that need to be cleaned, processed and integrated!
o  E.g., Twitter, news, customer transactions!
•  Too fast: needs to be processed quickly!
Big Data: What is the Big deal?
•  Big data is not new: financial transactions, call detail
records, astronomy, …!
•  What is new:!
-  Many more data enthusiasts!
-  More data are widely available, e.g., Web, data.gov,
scientific data, social and urban data!
- Computing is cheap and easy to access!
o  Server with 64 cores, 512GB RAM ~$11k!
o  Cluster with 1000 cores ~$150k!
o  Pay as you go: Amazon EC2!
Big Data: More than Volume
Volume = Length × Width × Depth
Big Data Length: Collect & Compare
Big Data Width: Discover & Integrate
Big Data Depth: Analyze & Understand
Slide by Gerhard Weikum
Big Urban Data: NYC Taxis
Collect, Clean, and Compare
[Figure: taxi-trip visualizations for Beijing and NYC]
Compare different cities
Collect, Clean, and Compare
[Figure: NYC taxi activity at 7-8am, 8-9am, 9-10am, and 10-11am]
Compare effects over time
Bigger picture of city life!
Discover and Integrate
Compare with other data sources, e.g., NYC Citi bikes
Was there a traffic problem? An important event?
Discover information in news, blogs, etc.
Analyze and Understand
The Sandy Effect
Analyze and Understand
Studying traffic patterns to and from the airports
Taxis in NYC: Rides per Hour
Big Data: What is hard?
•  Scalability for computations? NOT!!
o  Lots of work on distributed systems, parallel databases, …!
o  Elasticity: Add more nodes!!
•  But there is no one-size-fits-all solution: often, you have to
build your own…!
•  Rapidly-evolving technology!
•  Many different tools!
•  Different computation model: need new algorithms!
Big Data: What is hard?
•  Scalability for people: data exploration is hard,
regardless of whether data are big or small!
[Word cloud: algorithms, provenance, machine learning, data integration, visual encodings, interaction modes, statistics, data curation, data management, math, knowledge]
(Big) Data Analysis Pipeline
http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf!
Big Data: Challenges
Apple: Fruit or company?
Big Data: Opportunities and Challenges
Big Data: Challenges
•  Taxi data: > 0.5 billion trips!
•  Can’t load it in Excel, and even commercial databases are too slow!
•  Our solution to support interactive queries:!
o  New spatio-temporal index!
o  New index that leverages the GPU (work in progress)!
Big Data: New Technologies
Infrastructure:!
•  New computing paradigms: Cloud, Hadoop MapReduce!
•  New storage solutions: NoSQL, column stores, Big Table!
•  New languages: JAQL, Pig Latin!
•  We will study these and how they relate to previous
technologies !
Analysis and Mining:!
•  New infrastructure demands new approaches to explore
data!
•  We will study algorithms to process and analyze data in
Big-Data environments!
Infrastructure
What is Cloud Computing?
•  Old idea: Software as a Service (SaaS)!
o  Delivering applications over the Internet!
!
•  Recently: “[Hardware, Infrastructure, Platform] as a
service”!
!
•  Utility Computing: pay-as-you-go computing!
o  Illusion of infinite resources!
o  No up-front cost!
o  Fine-grained billing (e.g., hourly) !
Agrawal et al., VLDB 2010 Tutorial!
Cloud Computing: Why Now?
•  Experience with very large data centers!
o  Unprecedented economies of scale!
o  Transfer of risk!
•  Technology factors!
o  Pervasive broadband Internet!
o  Maturity in virtualization technology!
•  Business factors!
o  Minimal capital expenditure!
o  Pay-as-you-go billing model!
Agrawal et al., VLDB 2010 Tutorial!
Warehouse Scale Computing
Google’s data center in Oregon
16 Million Nodes per building
Agrawal et al., VLDB 2010 Tutorial!
Economics of Cloud Users
•  Pay by use instead of provisioning for peak!
[Figure: resources vs. time. A static data center provisions capacity above peak demand, leaving unused resources; a data center in the cloud scales capacity to track demand.]
Agrawal et al., VLDB 2010 Tutorial!
Slide Credits: Berkeley RAD Lab
Economics of Cloud Users
•  Risk of over-provisioning: underutilization!
[Figure: resources vs. time for a static data center; capacity is fixed above demand, and the gap is unused resources.]
Agrawal et al., VLDB 2010 Tutorial!
Slide Credits: Berkeley RAD Lab
Economics of Cloud Users
•  Heavy penalty for under-provisioning!
[Figure: three resources-vs-time panels over several days. When fixed capacity sits below peak demand, the demand above capacity is lost revenue; over time, turned-away users leave, so demand itself shrinks (lost users).]
Agrawal et al., VLDB 2010 Tutorial!
Slide Credits: Berkeley RAD Lab
Just hype?
…Cloud Computing? What are you talking about? Cloud Computing is nothing but a computer attached to a network.!
-- Larry Ellison, excerpts from an interview
Agrawal et al., VLDB 2010 Tutorial!
Cloud Computing: Hype or Reality
•  Unlike the earlier attempts:!
o  Distributed Computing!
o  Distributed Databases!
o  Grid Computing!
•  Cloud Computing is REAL:!
o  Organic growth: Google, Yahoo, Microsoft, and Amazon!
o  Poised to be an integral aspect of National Infrastructure in
US and elsewhere!
Agrawal et al., VLDB 2010 Tutorial!
Cloud Computing Modalities
“Can we outsource our IT software and
hardware infrastructure?”
•  Hosted Applications and
services!
•  Pay-as-you-go model!
•  Scalability, fault-tolerance,
elasticity, and self-manageability!
Agrawal et al., VLDB 2010 Tutorial!
“We have terabytes of click-stream data –
what can we do with it?”
•  Very large data repositories!
•  Complex analysis!
•  Distributed and parallel data
processing!
Why Data Analysis?
•  Who are our lowest/highest margin customers?!
•  What is the most effective distribution channel?!
•  What product promotions have the biggest impact on revenue?!
•  What impact will new products/services have on revenue and margins?!
•  Who are my customers and what products are they buying?!
•  Which customers are most likely to go to the competition?!
VLDB 2010 Tutorial!
Why Data Analysis?
•  Where are our lowest/highest margin passengers?!
•  What is the distribution of trip lengths?!
•  What would the impact be of a fare change?!
•  What is the quickest route from midtown to downtown at 4pm on Monday?!
•  What impact will the introduction of additional medallions have?!
•  Where should drivers go to get passengers?!
Decision Support
•  Used to manage and control business!
•  Data is historical or point-in-time!
•  Optimized for inquiry rather than update!
•  Use of the system is loosely defined and can be ad hoc!
•  Used by managers and end-users to understand the business and make judgements!
Agrawal et al., VLDB 2010 Tutorial!
Decision Support
•  Data-analysis in the enterprise context emerged:!
o  As a tool to build decision support systems!
o  Data-centric decision making instead of using intuition!
o  New term: Business Intelligence!
•  Traditional approach:!
o  Decision makers wait for reports from disparate OLTP
systems!
o  Put it all together in a spreadsheet!
o  Manual process!
Agrawal et al., VLDB 2010 Tutorial!
Data Analytics in the Web Context
•  Data capture at the user interaction level:!
o  in contrast to the client transaction level in the Enterprise
context!
•  As a consequence, the amount of data increases
significantly!
•  Need to analyze such data to understand user behaviors!
Agrawal et al., VLDB 2010 Tutorial!
Data Analytics outside Big Corporations
•  Even data capture at client transaction level leads to a
lot of data!!
!
•  Need to analyze such data to understand behavior!
•  Cannot afford expensive warehouse solutions!
Data Analytics in the Cloud
•  Scalability to large data volumes:!
o  Scan 100 TB on 1 node @ 50 MB/sec = 23 days!
o  Scan 100 TB on 1000-node cluster = 33 minutes (arithmetic checked after this slide)!
→ Divide-and-conquer (i.e., data partitioning)!
•  Cost-efficiency:!
o  Commodity nodes (cheap, but unreliable)!
o  Commodity network!
o  Automatic fault-tolerance (fewer admins)!
o  Easy to use (fewer programmers)!
!
Agrawal et al., VLDB 2010 Tutorial!
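A quick back-of-envelope check of the scan estimates above (our own sketch in Python; assumes 1 TB = 10^12 bytes and a perfectly parallel, overhead-free scan):

# Back-of-envelope check of the scan estimates above.
# Assumptions: 1 TB = 10**12 bytes, 1 MB = 10**6 bytes, perfect parallelism.
data_bytes = 100 * 10**12            # 100 TB
node_rate = 50 * 10**6               # 50 MB/sec per node

one_node_days = data_bytes / node_rate / 86400
cluster_minutes = data_bytes / (1000 * node_rate) / 60
print(one_node_days)     # ~23.1 days on a single node
print(cluster_minutes)   # ~33.3 minutes on a 1000-node cluster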
Platforms for Large-scale Data Analysis
•  Parallel DBMS technologies!
o  Proposed in the late eighties!
o  Matured over the last two decades!
o  Multi-billion dollar industry: Proprietary DBMS Engines
intended as Data Warehousing solutions for very large
enterprises!
•  Map Reduce !
o  pioneered by Google!
o  popularized by Yahoo! (Hadoop)!
Agrawal et al., VLDB 2010 Tutorial!
Parallel DBMS technologies
•  Popularly used for more than two decades!
o  Research Projects: Gamma, Grace, …!
o  Commercial: Multi-billion dollar industry but access to only
a privileged few!
•  Relational data model!
•  Indexing!
•  Familiar SQL interface!
•  Advanced query optimization!
•  Well understood and studied!
•  Very reliable!!
Agrawal et al., VLDB 2010 Tutorial!
MapReduce
•  Overview:!
o  Data-parallel programming model !
o  An associated parallel and distributed implementation for
commodity clusters!
•  Pioneered by Google!
o  Processes 20 PB of data per day (circa 2008)!
•  Popularized by open-source Hadoop project!
o  Used by Yahoo!, Facebook, Amazon, and the list is
growing …!
[Dean et al., OSDI 2004, CACM Jan 2008, CACM Jan 2010]
Agrawal et al., VLDB 2010 Tutorial!
Hadoop!
•  Open-source implementation of the MapReduce framework from the Apache project!
•  Hadoop Distributed File System (HDFS) !
o  Store big files across machines !
o  Store each file as a sequence of blocks!
o  Blocks of a file are replicated for fault tolerance!
•  Distribute processing of large data across thousands of
commodity machines!
•  Key components!
o  MapReduce - distributes applications!
o  Hadoop Distributed File System (HDFS) - distributes data!
•  A single Namenode (master) and multiple Datanodes (slaves)!
o  Namenode: manages the file system namespace and access to files by clients!
o  Datanode: manages the storage attached to the node it runs on!
© Kyuseok Shim (VLDB 2012 TUTORIAL)!
MapReduce Programming Model!
•  Borrows from functional programming!
•  Users should implement two primary methods (a minimal
sketch follows this slide):!
o  Map: (key1, val1) → [(key2, val2)]!
o  Reduce: (key2, [val2]) → [(key3, val3)]!
!
© Kyuseok Shim (VLDB 2012 TUTORIAL)!
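A minimal in-memory sketch of this model (our illustration, not any real framework’s API): map turns each input pair into intermediate pairs, the framework groups values by key, and reduce folds each group into output pairs.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: map_fn(key1, val1) yields (key2, val2) pairs.
    groups = defaultdict(list)
    for key1, val1 in records:
        for key2, val2 in map_fn(key1, val1):
            groups[key2].append(val2)
    # "Shuffle": grouping values by key happened as we appended above.
    # Reduce phase: reduce_fn(key2, [val2]) yields (key3, val3) pairs.
    output = []
    for key2, vals in groups.items():
        output.extend(reduce_fn(key2, vals))
    return output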
Word Counting with MapReduce!
[Figure: two mappers, M1 and M2. M1 reads Doc1 “Financial, IMF, Economics, Crisis” and Doc2 “Financial, IMF, Crisis”; M2 reads Doc3 “Economics, Harry”, Doc4 “Financial, Harry, Potter, Film”, and Doc5 “Crisis, Harry, Potter”. Each map call emits a (word, 1) key-value pair per word.]
© Kyuseok Shim (VLDB 2012 TUTORIAL)!
Word Counting with MapReduce!
[Figure: shuffle and reduce. Before the reduce functions are called, for each distinct key the list of its values is generated, e.g., (Crisis, [1, 1, 1]), (Harry, [1, 1, 1]), (Potter, [1, 1]). Each reducer sums its list, producing (Financial, 3), (IMF, 2), (Economics, 2), (Crisis, 3), (Harry, 3), (Potter, 2), (Film, 1). A runnable sketch follows this slide.]
© Kyuseok Shim (VLDB 2012 TUTORIAL)!
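A runnable sketch of the word count in the figures, with the shuffle simulated by a dictionary (our illustration; document contents are taken from the figure):

from collections import defaultdict

def wc_map(doc_id, text):
    # Map: emit (word, 1) for every word in the document.
    for word in text.replace(",", " ").split():
        yield (word, 1)

def wc_reduce(word, ones):
    # Reduce: sum the list of 1s generated for each distinct key.
    return (word, sum(ones))

docs = [("Doc1", "Financial, IMF, Economics, Crisis"),
        ("Doc2", "Financial, IMF, Crisis"),
        ("Doc3", "Economics, Harry"),
        ("Doc4", "Financial, Harry, Potter, Film"),
        ("Doc5", "Crisis, Harry, Potter")]

groups = defaultdict(list)              # shuffle: group values by key
for doc_id, text in docs:
    for word, one in wc_map(doc_id, text):
        groups[word].append(one)

print(sorted(wc_reduce(w, ones) for w, ones in groups.items()))
# [('Crisis', 3), ('Economics', 2), ('Film', 1), ('Financial', 3),
#  ('Harry', 3), ('IMF', 2), ('Potter', 2)]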
MapReduce Advantages
•  Automatic Parallelization:!
o  Depending on the size of RAW INPUT DATA → instantiate
multiple MAP tasks!
o  Similarly, depending upon the number of intermediate
<key, value> partitions → instantiate multiple REDUCE
tasks!
•  Run-time:!
o  Data partitioning!
o  Task scheduling!
o  Handling machine failures!
o  Managing inter-machine communication!
•  Completely transparent to the programmer/analyst/user!
Agrawal et al., VLDB 2010 Tutorial!
MapReduce Experience
•  Runs on large commodity clusters:!
o  1000s to 10,000s of machines!
•  Processes many terabytes of data!
•  Easy to use since run-time complexity hidden from the
users!
•  1000s of MR jobs/day at Google (circa 2004)!
•  100s of MR programs implemented (circa 2004)!
Agrawal et al., VLDB 2010 Tutorial!
The Need
•  Special-purpose programs to process large amounts of
data: crawled documents, Web Query Logs, etc.!
•  At Google and others (Yahoo!, Facebook):!
o  Inverted index!
o  Graph structure of Web documents!
o  Summaries of #pages/host, set of frequent queries, etc.!
o  Ad Optimization!
o  Spam filtering!
Agrawal et al., VLDB 2010 Tutorial!
Takeaway
•  MapReduce’s data-parallel programming model
hides complexity of distribution and fault tolerance!
•  Principal philosophies:!
o  Make it scale, so you can throw hardware at problems!
o  Make it cheap, saving hardware, programmer and
administration costs (but requiring fault tolerance)!
•  Hive and Pig further simplify programming!
•  MapReduce is not suitable for all problems, but
when it works, it may save you a lot of time!
Agrawal et al., VLDB 2010 Tutorial!
MapReduce vs. Parallel DBMS!

                             Parallel DBMS              MapReduce
Schema support               Yes                        Not out of the box
Indexing                     Yes                        Not out of the box
Programming model            Declarative (SQL)          Imperative (C/C++, Java, …);
                                                        extensions through Pig and Hive
Optimizations (compression,  Yes                        Not out of the box
query optimization)
Flexibility                  Not out of the box         Yes
Fault tolerance              Coarse-grained techniques  Yes
[Pavlo et al., SIGMOD 2009, Stonebraker et al., CACM 2010, …]
Agrawal et al., VLDB 2010 Tutorial!
MapReduce: A step backwards?
•  Don’t need 1000 nodes to process petabytes:!
o  Parallel DBs do it in fewer than 100 nodes!
•  No support for schema:!
o  Sharing across multiple MR programs is difficult!
•  No indexing:!
o  Wasteful access to unnecessary data!
•  Non-declarative programming model:!
o  Requires highly-skilled programmers!
•  No support for JOINs:!
o  Requires multiple MR phases for the analysis (see the
sketch after this slide)!
We will study this in more detail!
Agrawal et al., VLDB 2010 Tutorial!
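To see why joins need extra machinery, here is a sketch of the standard reduce-side join idea: map tags each record with its source table, the shuffle groups both tables’ records by join key, and reduce pairs them up. The tables and fields below are hypothetical.

from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]             # hypothetical (user_id, name) table
orders = [(1, "book"), (1, "pen"), (2, "mug")]  # hypothetical (user_id, item) table

# Map phase: tag each record with its source table, keyed by user_id.
tagged = [(uid, ("U", name)) for uid, name in users] + \
         [(uid, ("O", item)) for uid, item in orders]

# Shuffle: group tagged records by join key.
groups = defaultdict(list)
for uid, rec in tagged:
    groups[uid].append(rec)

# Reduce phase: per key, cross the records from the two sources.
joined = []
for uid, recs in groups.items():
    names = [v for tag, v in recs if tag == "U"]
    items = [v for tag, v in recs if tag == "O"]
    joined.extend((uid, n, i) for n in names for i in items)

print(sorted(joined))  # [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]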
Analysis and Mining
Data Analysis and Mining
•  Many challenges, even when data is not big…!
•  Data cleaning and curation:!
o  Detection and correction of errors in data, e.g., age = 150!
o  Entity resolution and disambiguation, e.g., apple the fruit
vs. Apple the company!
•  Visualization: pictures help us to think!
o  Substitute perception for cognition!
o  External memory: free up limited cognitive/memory resources for
higher-level problems!
•  Mining: Discovery of useful, possibly unexpected,
patterns in data!
(Big) Data Analysis Pipeline
http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf!
Data Analysis and Mining
•  In exploratory tasks, change is the norm!!
o  Data analysis and mining are iterative processes!
o  Many trial-and-error steps!
[Figure: the exploration loop: a data process yields a data product; the user’s perception & cognition turn it into knowledge, which drives further exploration via specification and data manipulation. Modified from J. van Wijk, IEEE Vis 2005]
Data Analysis and Mining
•  In exploratory tasks, change is the norm!!
o  Data analysis and mining are iterative processes!
o  Many trial-and-error steps, easy to get lost…!
•  Need to manage the data exploration process:!
o  Guide users!
o  Need provenance for reproducibility [Freire et al., CISE 2008]!
[Figure repeated from the previous slide: the van Wijk exploration loop]
Analyzing and Mining Big Data: Issues
•  Besides scalability for algorithms and computations…!
•  A big data-mining risk is that you will “discover”
patterns that are meaningless.!
•  Statisticians call it Bonferroni’s principle: (roughly) if you
look in more places for interesting patterns than your
amount of data will support, you are bound to find junk.!
Jeff Ullman’s lecture notes!
Examples of Bonferroni’s Principle
1.  A big objection to Total Information Awareness (TIA)
was that it was looking for so many vague connections
that it was sure to find things that were bogus and thus
violate innocents’ privacy.!
2.  The Rhine Paradox: a great example of how not to
conduct scientific research.!
Jeff Ullman’s lecture notes!
Stanford Professor Proves Tracking Terrorists Is Impossible!
•  Reporter from the LA Times picked an example in
Professor Ullman’s class!
•  Despite attempts by Professor Ullman, the reporter was
unable to grasp the point that the story was made up to
illustrate Bonferroni’s Principle, and was not real…!
Modified from Jeff Ullman’s lecture notes!
The “TIA” Example
•  Suppose we believe that certain groups of evil-doers are
meeting occasionally in hotels to plot doing evil.!
•  We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day.!
Jeff Ullman’s lecture notes!
TIA Example: Details
•  10^9 people being tracked.!
•  1000 days.!
•  Each person stays in a hotel 1% of the time (10 days out
of 1000).!
•  Hotels hold 100 people (so 10^5 hotels to hold 1% of the
people being tracked).!
•  If everyone behaves randomly (i.e., no evil-doers) will the
data mining detect anything suspicious?!
Jeff Ullman’s lecture notes!
TIA Example: Calculations – (1)
[Figure: diagram of the three events: p at some hotel, q at some hotel, same hotel]
•  Probability that given persons p and q will be at the
same hotel on a given day d:!
o  1/100 × 1/100 × 1/10^5 = 10^-9!
•  Probability that p and q will be at the same hotel on
given days d1 and d2:!
o  10^-9 × 10^-9 = 10^-18!
•  Pairs of days: C(1000, 2) = 1000!/((1000-2)! × 2!) ≈ 5×10^5!
Jeff Ullman’s lecture notes!
TIA Example: Calculations – (2)
•  Probability that p and q will be at the same hotel on
some two days:!
o  5×10^5 × 10^-18 = 5×10^-13!
•  Pairs of people: C(10^9, 2) ≈ 5×10^17!
•  Expected number of “suspicious” pairs of people:!
o  5×10^17 × 5×10^-13 = 250,000 (checked after this slide)!
Jeff Ullman’s lecture notes!
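The arithmetic on the two calculation slides, checked in a few lines (the numbers are the slides’ assumptions):

from math import comb

p_day  = (1/100) * (1/100) * (1/10**5)   # p and q at the same hotel on a given day: 1e-9
p_two  = p_day ** 2                      # same hotel on two given days: 1e-18
days   = comb(1000, 2)                   # pairs of days: 499,500 ~ 5e5
people = comb(10**9, 2)                  # pairs of people: ~5e17

print(people * days * p_two)             # ~250,000 expected "suspicious" pairs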
Conclusion
•  Suppose there are (say) 10 pairs of evil-doers who
definitely stayed at the same hotel twice.!
•  Analysts have to sift through 250,000 candidates to find
the 10 real cases.!
o  Not gonna happen.!
o  But how can we improve the scheme?!
Jeff Ullman’s lecture notes!
Moral
•  When looking for a property (e.g., “two people stayed at
the same hotel twice”), make sure that the property
does not allow so many possibilities that random data
will surely produce facts “of interest.”!
Jeff Ullman’s lecture notes!
Rhine Paradox – (1)
•  Joseph Rhine was a parapsychologist in the 1950s
who hypothesized that some people had Extra-Sensory
Perception.!
•  He devised (something like) an experiment where
subjects were asked to guess 10 hidden cards – red or
blue.!
•  He discovered that almost 1 in 1000 had ESP – they
were able to get all 10 right (the arithmetic follows this slide)!!
Jeff Ullman’s lecture notes!
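The “almost 1 in 1000” is exactly what chance predicts: guessing 10 red/blue cards succeeds with probability (1/2)^10 = 1/1024, so about one subject per thousand scores perfectly with no ESP at all:

p = 0.5 ** 10
print(p, 1 / p)          # 0.0009765625 -> 1 chance in 1024
print(1000 * p)          # ~1 "psychic" expected per 1000 subjects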
Rhine Paradox – (2)
•  He told these people they had ESP and called them in
for another test of the same type.!
•  Alas, he discovered that almost all of them had lost their
ESP.!
•  What did he conclude?!
You shouldn’t tell people they have ESP; it causes them to
lose it.!
Jeff Ullman’s lecture notes!
Moral
•  Understanding Bonferroni’s Principle will help you look
a little less stupid than a parapsychologist.!
Jeff Ullman’s lecture notes!
Next Class
Introduction to Map-Reduce and high-level data processing
languages !