Download Big Data Tech - Fordham University

Department of Computer and Information Science
Big Data and Its Technologies
CISC 6930 Data Mining
CISC 4631 Data Mining
What We Are Going to Learn
o What is Big Data?
o Characteristics of Big Data
o What To Do With The Data?
o What Technology Do We Have For Big Data?
o A Simple Big Data Mining Example
o Hadoop in the Wild
o Big Data in the Cloud
Imagine:
You are working at a company. Tomorrow morning you go to your office and there’s a mail from your CEO regarding a new task:
Dear <Your Name>,
As you know, we are building a blogging platform, blogger2.com, and I need some statistics. I need to find out, across all blogs ever written on blogger.com, how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is'), and so on, up to how many times ten-character words occur.
I know it's a really big job, so I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it's really important that I have this when I return. Good luck.
Regards,
The CEO
P.S.: One more thing. Everything has to be done manually, except going to the blog and copy-pasting it into Notepad. I read somewhere that if you write programs, Google can find out about it.
Picture yourself in that position for a moment, like a CEO.
• You have 50,000 people working for you for a week, and you need to find the number of one-character words, the number of two-character words, etc., covering the maximum number of blogs in BlogSpot.
• Finally, you need to give your CEO a report with something like this:
Occurrence of one-character words – Around 937688399933
Occurrence of two-character words – Around 23388383830753434
.. and so forth up to ten
• If homicide, suicide, or resigning from the job is not an option, how would you solve it?
• How would you avoid the chaos of so many people working?
• How will you coordinate so many people, since the output of one has to be merged with another's?
The Big Questions
o What is Big Data?
o What makes Data “Big”?
o How to manage very large amounts of data and
extract value and knowledge from them?
What is Big Data?
o No single standard definition…
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
What is Big Data?
Here is the definition from Wikipedia:
o Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
o The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
o The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
Big Data Everywhere!
o Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/Credit Card transactions
 Social Network
How Much Data?
o Man on the moon with 32KB (1969); my laptop had 8GB RAM (2013)
“640K ought to be enough for anybody.”
o Google collects 270PB data in a month (2007), 20PB a day (2008)
o Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
o eBay has 6.5 PB of user data + 50 TB/day (5/2009)
How Much Data?
 2.7 zettabytes of data exist in the digital universe today.
 235 terabytes of data had been collected by the U.S. Library of Congress by April 2011.
 The Obama administration is investing $200 million in big data research projects.
 According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.
 140,000 to 190,000: the projected shortage of people with deep analytical skills needed to fill the demand for Big Data jobs in the U.S. by 2018.
We Are in a Knowledge Economy
o Data is an important asset to any organization
 Discovery of knowledge
 Enabling discovery
 Annotation of data
o We are looking at newer
 Programming models, and
 Supporting algorithms and data structures.
o NSF refers to it as “data-intensive computing” and
industry calls it “big-data” and “cloud computing”
Characteristics of Big Data:
1-Scale (Volume/Scale)
o Data Volume
 44x increase from 2009 to 2020
 From 0.8 zettabytes (ZB) to 35 ZB
o Data volume is increasing exponentially
Exponential increase in
collected/generated data
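A quick arithmetic check of the volume figures above, as a minimal Python sketch. It assumes steady compound growth between 2009 and 2020, which the slide does not state, and the variable names are illustrative only.

```python
# Figures quoted on the slide: 0.8 ZB in 2009 growing to 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

# Overall multiple between the two endpoints (the slide rounds this to 44x).
growth_factor = end_zb / start_zb

# Assumption: the same percentage growth every year (compound growth).
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.0%}")
```

Under that assumption the data volume grows by roughly two fifths every year, which is what "exponential" means in practice here.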
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
The Earthscope
• The Earthscope is the world's largest science project.
• Designed to track North America's geological evolution
• This observatory records data over 3.8 million square miles, amassing 67 terabytes of data.
• It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
[Figure: sources of data volume]
o 12+ TBs of tweet data every day
o 25+ TBs of log data every day
o ? TBs of data every day
o 30 billion RFID tags today (1.3B in 2005)
o 4.6 billion camera phones world wide
o 100s of millions of GPS enabled devices sold annually
o 2+ billion people on the Web by end 2011
o 76 million smart meters in 2009… 200M by 2014
Characteristics of Big Data:
2-Complexity (Variety- Complexity)
o Various formats, types, and structures
o Text, numerical, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
o Static data vs. streaming data
o A single application can be
generating/collecting many types of data
To extract knowledge, all these types of data need to be linked together
Characteristics of Big Data:
2-Complexity (Variety- Complexity)
o Types of Data
 Relational Data (Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data: Social Network, Semantic Web (RDF), …
 Streaming Data: you can only scan the data once
 A single application can be generating/collecting
many types of data
 Big Public Data (online, weather, finance, etc.)
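Because streaming data can be scanned only once, statistics over it have to be updated incrementally rather than recomputed. A minimal sketch of that idea, using Welford's one-pass update for mean and variance; the function name running_stats and the sample readings are made up for illustration and do not come from the slides.

```python
from typing import Iterable, Tuple


def running_stats(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) after a single pass."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count          # update the mean with the new value
        m2 += delta * (x - mean)       # accumulate squared deviations
    variance = m2 / count if count else 0.0
    return count, mean, variance


# A generator can be consumed only once, like a real stream (hypothetical values).
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = running_stats(readings)
print(count, mean, variance)
```

The function never stores the stream, so its memory use stays constant no matter how many values arrive.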
A Single View to the Customer
[Figure: the customer at the center, surrounded by data sources: Banking, Finance, Social Media, Purchase, Entertain, Gaming, and Our Known History]
Real-Time Analytics/Decision Requirement
[Figure: the customer at the center, labeled "Influence Behavior", surrounded by:]
o Product Recommendations that are Relevant & Compelling
o Improving the Marketing Effectiveness of a Promotion while it is still in Play
o Learning why Customers Switch to competitors and their offers, in time to Counter
o Preventing Fraud as it is Occurring & preventing more proactively
o Friend Invitations to join a Game or Activity that expands business
Characteristics of Big Data:
3-Speed (Velocity)
o Data is being generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missed opportunities
o Examples
 E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
 Healthcare monitoring: sensors monitor your activities and body → any abnormal measurements require immediate reaction
Characteristics of Big Data:
3-Speed (Velocity)
o Real-time/Fast Data
Mobile devices
(tracking all objects all the time)
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Sensor technology and networks
(measuring all kinds of data)
o Progress and innovation are no longer hindered by the ability to collect data
o But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
3 Vs of Big Data
o The “BIG” in big data isn’t just about volume
Some Make it 4V’s
Harnessing Big Data
o OLTP: Online Transaction Processing (DBMSs)
o OLAP: Online Analytical Processing (Data Warehousing)
o RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
What To Do With These Data?
o Aggregation and Statistics
 Data warehouse and OLAP
o Indexing, Searching, and Querying
 Keyword based search
 Pattern matching (XML/RDF)
o Knowledge discovery
 Data Mining
 Statistical Modeling
Who’s Generating Big Data
Mobile devices
(tracking all objects all the time)
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Sensor technology and networks
(measuring all kinds of data)
o Progress and innovation are no longer hindered by the ability to collect data, but
o By the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
The Model Has Changed…
o The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
The Evolution of Business Intelligence
[Figure: stages of business intelligence on a timeline from the 1990’s through the 2000’s to the 2010’s, with arrows labeled Speed and Scale]
o BI Reporting, OLAP & Data Warehouse: Business Objects, SAS, Informatica, Cognos, other SQL Reporting Tools
o Interactive Business Intelligence & In-memory RDBMS: QlikView, Tableau, HANA
o Big Data: Batch Processing & Distributed Data Store: Hadoop/Spark; HBase/Cassandra
o Big Data: Real Time & Single View: Graph Databases
Value of Big Data Analytics
o Big data is more real-time in
nature than traditional DW
applications
o Traditional DW architectures
(e.g. Exadata, Teradata) are not
well-suited for big data apps
o Shared nothing, massively
parallel processing, scale out
architectures are well-suited for
big data apps
Challenges in Handling Big Data
o The Bottleneck is in technology
 New architecture, algorithms, techniques are needed
o Also in technical skills
 Experts in using the new technology and dealing with big data
Big Data Landscape
[Figure: vendor landscape grouped into Apps, Data as a service, Infrastructure, and Technology]
Big Data Technology
Hadoop/MapReduce Technology
o What is Hadoop and why does it matter?
 Hadoop is the core platform for structuring Big Data.
 Hadoop is an open-source software framework for structuring and storing data and running applications on clusters of commodity hardware
 Hadoop uses a distributed computing architecture consisting of many servers
 It also solves the problem of formatting the data for analytic purposes.
 A storage part, known as Hadoop Distributed File System (HDFS)
 A processing part called MapReduce.
Hadoop/MapReduce Technology
o Hadoop was created by Doug Cutting
and Mike Cafarella in 2005. Cutting, who
was working at Yahoo! at the time,
named it after his son's toy elephant.
 It was originally developed to support
distribution for the Nutch search engine
project.
 The design objective was to answer one question: “How to process big data with reasonable cost and time?”
Hadoop/MapReduce Technology
o Why is Hadoop important?
It is a flexible, scalable, and highly-available architecture for distributed
computation and data processing on a network of commodity hardware.
Let’s Have A Simple Big Data Mining Example
o Tomorrow morning you go to your office and there’s a mail from your CEO regarding a new task:
Dear <Your Name>,
As you know, we are building the blogging platform blogger2.com, and I need some statistics. I need to find out, across all blogs ever written on blogger.com, how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is'), and so on, up to how many times ten-character words occur.
I know it's a really big job, so I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it's really important that I have this when I return. Good luck.
Regards,
The CEO
P.S.: One more thing. Everything has to be done manually, except going to the blog and copy-pasting it into Notepad. I read somewhere that if you write programs, Google can find out about it.
Let’s Have A Simple Big Data Mining Example
o Chapter 1: Picture yourself in that position for a moment.
• You have 50,000 people working for you for a week, and you need to find the number of one-character words, the number of two-character words, etc., covering the maximum number of blogs in BlogSpot.
• Finally, you need to give your CEO a report with something like this:
Occurrence of one-character words – Around 937688399933
Occurrence of two-character words – Around 23388383830753434
.. and so forth up to ten
• If homicide, suicide, or resigning from the job is not an option, how would you solve it?
• How would you avoid the chaos of so many people working?
• How will you coordinate so many people, since the output of one has to be merged with another's?
How to Mine the Data?
Or
How to Solve it
Let’s Have A Simple Big Data Mining Example
o Chapter 2: Proclamation: Let there be caste
The next day, you stand on the dais with a mike before the 50,000 and proclaim:
For a week, you will all be divided into many groups:
• The Mappers (tens of thousands of people will be in this group)
• The Grouper (assume just one guy for now)
• The Reducers (around 10 employees) and
• The Master (that’s you).
Then you talk to each one of the groups.
Let’s Have A Simple Big Data Mining Example
o Chapter 3: Your talk with the Mappers
• Each mapper will get a set of 50 blog URLs and a really big sheet of paper.
• Each one of you needs to go to each of those URLs, and for each word in those blogs, write one line on the paper.
• The format of that line should be the number of characters in the word, then a comma, and then the actual word.
For example, if you find the word “a”, you write “1,a” on a new line on your paper, since the word “a” has only 1 character. If you find the word “hello”, you write “5,hello” on a new line.
Each of you takes 4 days. So, after 4 days, your sheet might look like this:
• “1,a”
• “5,hello”
• “2,if”
• .. and a million more lines
At the end of the 4th day, each one of you will give your completely filled sheet to the Grouper.
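The mapper's instructions translate almost directly into code. A minimal sketch, assuming a blog is available as a plain string and that a "word" is a run of ASCII letters (a simplification the story does not specify); the function name mapper is illustrative.

```python
import re
from typing import Iterator, Tuple


def mapper(blog_text: str) -> Iterator[Tuple[int, str]]:
    """Emit one (length, word) pair per word, like one line on the mapper's sheet."""
    # Assumption: words are runs of ASCII letters; punctuation and digits are skipped.
    for word in re.findall(r"[A-Za-z]+", blog_text):
        yield len(word), word


sheet = list(mapper("hello, if a"))
for length, word in sheet:
    print(f"{length},{word}")
```

Each mapper looks only at its own blogs, which is why tens of thousands of them can work at the same time without talking to each other.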
Let’s Have A Simple Big Data Mining Example
o Chapter 4: Your talk with the Grouper
Someone gives you 10 papers. The first paper will be marked one, the second paper will be marked two, and so on, till ten.
You collect the output from the mappers, and for each line on a mapper’s sheet, if it starts with “1,”, you write the word on sheet one; if it starts with “2,”, you write it on sheet two; and so on.
For example, if the first line of a mapper’s sheet says “1,a”, you write “a” on sheet 1.
If it says “2,if”, you write “if” on sheet 2.
If it says “5,hello”, you write “hello” on sheet 5.
So at the end of your work, the 10 sheets you have might look like this:
• Sheet 1: a, a, a, I, I, i, a, i, i, i … millions more
• Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of … millions more
• Sheet 3: the, the, and, for, met, bet, the, the, and, … millions more
• ..
• Sheet 10: ……
Once you are done, you distribute each sheet to one reducer. For example, sheet 1 goes to reducer 1, sheet 2 goes to reducer 2, and so on.
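The Grouper's job is a group-by-key step. A minimal sketch, assuming each mapper sheet is a list of (length, word) pairs; the function name grouper and the sample sheets are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def grouper(mapper_sheets: Iterable[Iterable[Tuple[int, str]]]) -> Dict[int, List[str]]:
    """Copy every word onto the sheet numbered with its length (the key)."""
    sheets: Dict[int, List[str]] = defaultdict(list)
    for sheet in mapper_sheets:
        for length, word in sheet:
            if 1 <= length <= 10:      # the CEO asked only about 1 to 10 characters
                sheets[length].append(word)
    return dict(sheets)


grouped = grouper([
    [(1, "a"), (5, "hello")],          # sheet from one mapper
    [(2, "if"), (1, "I")],             # sheet from another mapper
])
print(grouped)
```

The grouper never reads the words themselves; it only looks at the key, which is why this step stays the same for any problem.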
Let’s Have A Simple Big Data Mining Example
o Chapter 5: Your talk with the Reducers:
Each one of you gets one sheet from the grouper. For each sheet, you count the number of words written on it and write the count in big bold letters on the back side of the paper.
For example, if you are reducer 2, you get sheet 2 from the grouper, which looks like this: “Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of …”
You count the number of words on that sheet, say the number of words is 28838380044, write it on the back side of the paper in big bold letters, and give it to me (the master).
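The reducer's task is a single count per sheet. A minimal sketch, assuming the grouper's output is a dictionary from word length to the list of words on that sheet; the name reducer and the tiny sample sheets are illustrative.

```python
from typing import Dict, List, Tuple


def reducer(length: int, words: List[str]) -> Tuple[int, int]:
    """Return (sheet number, number of words written on that sheet)."""
    return length, len(words)


# Hypothetical sheets as they would arrive from the grouper.
sheets: Dict[int, List[str]] = {1: ["a", "a", "I"], 2: ["if", "of"]}

# Each reducer handles one sheet; the master collects the results.
report = dict(reducer(length, words) for length, words in sheets.items())
print(report)
```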
Let’s Have A Simple Big Data Mining Example
o Chapter 6: The controlled Chaos and the climax:
At the end of this process you have 10 sheets:
Sheet 1 has the count of words with 1 character on the back side.
Sheet 2 has the count of words with 2 characters on the back side, and so on.
It is done. Genius!
Let’s Have A Simple Big Data Mining Example
o Comments
You essentially did MapReduce. The greatest advantages of your approach were:
• the Mappers can work independently
• the Reducers can work independently
• the Grouper can work really fast
The process can be easily applied to other kinds of problems. In such a case:
• The work of the Master (dividing the work) and the Grouper (grouping the values by key [the value before the comma]) remains the same. This is what any MapReduce library provides.
• The work of the Mappers and Reducers differs according to the problem. This is what you should write.
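The split described above, where the library supplies the Master and the Grouper while you supply the Mappers and Reducers, can be sketched as one small function. This is a single-process illustration under assumed names (map_reduce, length_map, word_map and so on), not the API of Hadoop or any real library; a real framework would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

MapFn = Callable[[str], Iterator[Tuple[Hashable, Any]]]
ReduceFn = Callable[[Hashable, List[Any]], Any]


def map_reduce(inputs: Iterable[str], map_fn: MapFn, reduce_fn: ReduceFn) -> Dict[Hashable, Any]:
    """The part a library provides: divide the work, group by key, call the reducers."""
    groups: Dict[Hashable, List[Any]] = defaultdict(list)
    for item in inputs:                       # Master: hand each input to a mapper
        for key, value in map_fn(item):
            groups[key].append(value)         # Grouper: collect values by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Problem 1 (the CEO's task): how many words of each length?
def length_map(text: str) -> Iterator[Tuple[int, str]]:
    for word in text.split():
        yield len(word), word


def count_reduce(key: Hashable, values: List[Any]) -> int:
    return len(values)


# Problem 2 (a different task, same framework): how often does each word occur?
def word_map(text: str) -> Iterator[Tuple[str, int]]:
    for word in text.lower().split():
        yield word, 1


def sum_reduce(key: Hashable, values: List[Any]) -> int:
    return sum(values)


blogs = ["a big data job", "a big job"]       # hypothetical blog contents
length_counts = map_reduce(blogs, length_map, count_reduce)
word_counts = map_reduce(blogs, word_map, sum_reduce)
print(length_counts)
print(word_counts)
```

Only the two small functions passed in change between the problems; map_reduce itself is untouched.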
Hadoop/MapReduce Technology
o MapReduce
[Figure: Hadoop/MapReduce timeline with milestones marked 2003, 2004, and 2006]
Hadoop in the Wild
• Hadoop is in use at most organizations that handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o etc…
• Some examples of scale:
o Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search
Hadoop in the Wild
• System requirements
o High write throughput
o Cheap, elastic storage
o Low latency
o High consistency (within a single data center good enough)
o Disk-efficient sequential and random read performance
Hadoop in the Wild
• Classic alternatives
o These requirements were typically met using a large MySQL cluster
o Content on HDFS could be loaded into MySQL
• Problems with previous solutions
o MySQL has low random write throughput… BIG problem
for messaging!
o Difficult to scale MySQL clusters rapidly while
maintaining performance
o MySQL clusters have high management overhead, require
more expensive hardware
Hadoop in the Wild
Typical Hadoop Cluster
[Figure: racks of nodes, each behind a rack switch, joined by an aggregation switch]
o 40 nodes/rack, 1000-4000 nodes in cluster
o 1 Gbps bandwidth within rack, 8 Gbps out of rack
o Node specs (Yahoo terasort):
8 x 2GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
HBase
o HBase is an open-source, distributed, column-oriented database built on top of HDFS, based on BigTable!
 Designed to operate on top of the Hadoop Distributed File System (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.
 No real indexes
 Automatic partitioning
 Scale linearly and automatically with new nodes
 Commodity hardware
 Fault tolerance
 Batch processing
MongoDB
o MongoDB is the leading NoSQL solution
 Free and open-source cross-platform document-oriented database program
 Founded in 2007 by
  Dwight Merriman, Eliot Horowitz
  Doubleclick, Oracle, Marklogic, HP
MongoDB
It is:
o General Purpose
 Rich data model
 Full featured indexes
 Sophisticated query language
o Easy to Use
 Easy mapping to object oriented code
 Native language drivers in all popular languages
 Simple to setup and manage
o Fast & Scalable
 Operates at in-memory speed wherever possible
 Auto-sharding built in
 Dynamically add / remove capacity with no downtime
Big Data in the Cloud
o Why? WEB is Replacing the Desktop
Big Data in the Cloud
o Paradigm Shift in Computing
Big Data in the Cloud
o What is Cloud Computing?
 Storing, processing, and accessing data and programs over the Internet instead of your computer's hard drive
 IT resources provided as a service
  Compute, storage, databases, queues
 Clouds leverage economies of scale of commodity hardware
  Cheap storage, high bandwidth networks & multicore processors
  Geographically distributed data centers
 Offerings from Microsoft, Amazon, Google, …
Big Data in the Cloud
o Economics of Cloud Users
 Pay by use instead of provisioning for peak
[Figure: two plots of Resources against Time, each showing Capacity and Demand. Static data center: fixed capacity above demand, leaving unused resources. Data center in the cloud: capacity follows demand.]
Slide Credits: Berkeley RAD Lab
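The difference between provisioning for peak and paying by use can be made concrete with a small calculation. The demand figures below are invented purely for illustration; they do not come from the slides or from any provider's pricing.

```python
# Hypothetical number of servers needed in each of six consecutive hours.
hourly_demand = [20, 25, 40, 100, 60, 30]

# Static data center: capacity must cover the peak hour, all the time.
peak = max(hourly_demand)
provisioned_server_hours = peak * len(hourly_demand)

# Cloud: pay only for the server-hours actually used.
used_server_hours = sum(hourly_demand)

unused_server_hours = provisioned_server_hours - used_server_hours
utilization = used_server_hours / provisioned_server_hours

print(f"provisioned for peak: {provisioned_server_hours} server-hours")
print(f"actually used:        {used_server_hours} server-hours")
print(f"unused:               {unused_server_hours} server-hours")
print(f"utilization:          {utilization:.0%}")
```

With these made-up numbers, more than half of the statically provisioned capacity sits idle; the spikier the demand, the larger that gap becomes.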
Big Data in the Cloud
o 2.7 ZB: global digital data
o 0.5 petabytes: two years of tweets
o 43% think that data analytics could be improved in their organization if cloud services were part of data analytics
o 66% will use or plan to use Big Data in the cloud
Big Data in the Cloud
o Data Mining in the Cloud: 3 Reasons (Holger Kisker)
o Skills
 Do you really need/want this all in-house?
o Huge amounts of external data
 Does it make sense to move and manage all this data behind your firewall?
o Focus on the value of your data
 Instead of big data management.
Big Data in the Cloud
o Data Mining in the Cloud: Another Reason
 Data Warehousing, Data Analytics & Decision Support Systems
  Used to manage and control business
  Transactional Data: historical or point-in-time
  Optimized for inquiry rather than update
  Use of the system is loosely defined and can be ad-hoc
  Used by managers and analysts to understand the business and make judgments
Big Data in the Cloud
o Data Analytics in the Cloud
 Scalability to large data volumes:
 Scan 100 TB on 1 node @ 50 MB/sec = 23 days
 Scan on 1000-node cluster = 33 minutes
 Divide-And-Conquer (i.e., data partitioning)
 Cost-efficiency:
 Commodity nodes (cheap, but unreliable)
 Commodity network
 Automatic fault-tolerance (fewer administrators)
 Easy to use (fewer programmers)
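The scan-time figures above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even partition of the data across nodes with no coordination overhead, which is an idealization.

```python
data_bytes = 100 * 10**12        # 100 TB
scan_rate = 50 * 10**6           # 50 MB/sec per node
nodes = 1000

single_node_seconds = data_bytes / scan_rate
single_node_days = single_node_seconds / (24 * 60 * 60)

# Divide and conquer: each node scans only its own 1/1000 share of the data.
cluster_seconds = single_node_seconds / nodes
cluster_minutes = cluster_seconds / 60

print(f"1 node:     {single_node_days:.1f} days")
print(f"{nodes} nodes: {cluster_minutes:.1f} minutes")
```

The results round to the 23 days and 33 minutes quoted on the slide.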