Data Lake & Data Hub
Zhangxi Lin, The Rawls College, 2016-03-31
Data Warehouse
• Popular for business intelligence tasks, but being replaced by less-structured Data Lakes, which allow more flexibility.
• The limitation of data warehouses is that they store data from
various sources in specific static structures and categories that
dictate, at the very point of entry, the kind of analysis that is
possible on that data. This was sufficient during the early stages of
the evolution of business intelligence, when analysis was primarily
done on proprietary databases and the scope was restricted to
canned reports and dashboards with limited, pre-defined interaction
paths.
• This approach has started to fall apart in the world of big data
discovery, where it is very difficult to ascertain upfront all the
intelligence and insights one could derive from the variety of
sources (proprietary databases, files, third-party tools, social
media, and the web) that keep cropping up on a regular basis.
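The contrast above, fixing the structure at the point of entry versus deciding it at analysis time, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the sample event, and the helper functions are hypothetical and do not come from any particular warehouse or lake product.

```python
import json

# Hypothetical warehouse schema, fixed before any data arrives.
WAREHOUSE_COLUMNS = ("customer_id", "amount")


def load_into_warehouse(record: dict) -> tuple:
    """Schema-on-write: keep only the predefined columns; other fields are dropped."""
    return tuple(record.get(col) for col in WAREHOUSE_COLUMNS)


def load_into_lake(record: dict) -> str:
    """Schema-on-read: store the raw record unchanged (here as one JSON line)."""
    return json.dumps(record, sort_keys=True)


def read_from_lake(raw_line: str, fields: tuple) -> tuple:
    """Apply a schema only at query time, so a new question needs no reload."""
    record = json.loads(raw_line)
    return tuple(record.get(field) for field in fields)


# Hypothetical event carrying fields the warehouse schema did not anticipate.
event = {"customer_id": 7, "amount": 19.5, "channel": "social", "device": "mobile"}

warehouse_row = load_into_warehouse(event)   # "channel" and "device" are lost here
lake_line = load_into_lake(event)            # every field is retained
later_view = read_from_lake(lake_line, ("customer_id", "channel"))
```

In this sketch, a question about `channel` can still be answered from the lake copy, while the warehouse row no longer carries that field.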
Data Lake
• A large-scale storage repository and processing engine.
• Provides "massive storage for any kind of data, enormous
processing power and the ability to handle virtually limitless
concurrent tasks or jobs".
• The term was coined by James Dixon, Pentaho chief
technology officer. Dixon used the term initially to contrast
with "data mart", which is a smaller repository of interesting
attributes extracted from the raw data.
• One example of a data lake is Apache Hadoop, which is built on
the Hadoop Distributed File System (HDFS).
Top Five Differences between Data Lakes and Data Warehouses
• Retain all data
• Support all data types
• Support all users
• Adapt easily to changes
• Provide faster insights
Data Hub
• A collection of data from multiple sources organized for
distribution, sharing, and often subsetting.
Generally this data distribution is in the form of a hub and
spoke architecture.
• A data hub differs from a data warehouse in that it is generally
unintegrated and often at different grains. It differs from an
operational data store because a data hub does not need to
be limited to operational data.
• A data hub differs from a data lake by homogenizing data and
possibly serving data in multiple desired formats, rather than
simply storing it in one place, and by adding other value to the
data such as de-duplication, quality, security, and a
standardized set of query services. A Data Lake tends to store
data in one place for availability, and allow/require the
consumer to process or add value to the data.
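The value a hub adds over a lake (homogenizing, de-duplicating, and serving data in several formats) can be illustrated with a small Python sketch. It assumes two hypothetical sources, `crm` and `web`, with differently named fields, and it treats a normalized email address as the de-duplication key. None of these names come from the slides, and a real hub would also handle quality and security.

```python
import csv
import io
import json


def normalize(record: dict) -> dict:
    """Homogenize one source record into a common shape (hypothetical fields)."""
    return {
        "email": record.get("email", record.get("Email", "")).strip().lower(),
        "name": record.get("name", record.get("full_name", "")).strip(),
    }


def build_hub(*sources) -> list:
    """Merge sources, normalizing and de-duplicating on email (first record wins)."""
    hub = {}
    for source in sources:
        for record in source:
            row = normalize(record)
            hub.setdefault(row["email"], row)
    return list(hub.values())


def serve(rows: list, fmt: str = "json") -> str:
    """Serve the same hub data in whichever format the consumer asks for."""
    if fmt == "json":
        return json.dumps(rows)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["email", "name"], lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


# Two hypothetical sources describing an overlapping set of people.
crm = [{"Email": " Ann@Example.com ", "full_name": "Ann Lee"}]
web = [
    {"email": "ann@example.com", "name": "Ann L."},
    {"email": "bo@example.com", "name": "Bo Chen"},
]

hub_rows = build_hub(crm, web)
```

A data lake would simply keep `crm` and `web` as they arrived and leave the merging to each consumer. In this sketch the hub does that work once and exposes the result through `serve`.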
Turn ‘Data Lake’ into an Enterprise Data Hub
• Hadoop can be used as a “data lake”: a scalable data repository
built on the cheap-and-deep storage economics of HDFS (Hadoop
Distributed File System) that captures data from anywhere, in any
format, for future analysis.
• As Hadoop deployments shift from proof-of-concept sandbox
experiments to enterprise-grade, mission-critical production
solutions, they take on new workloads, and those workloads need
all the power and flexibility of the Hadoop ecosystem components.
Customers with existing investments in non-HDFS data lakes are
just as excited about attacking new analytic and processing
workloads as everyone else.
• Setting up an alternative HDFS-based Hadoop cluster using Direct
Attached Storage (DAS) would mean copying data from the existing
NAS-based data lake into a separate Hadoop installation. Copying is
expensive; copying terabytes or petabytes is prohibitively so.
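A back-of-the-envelope calculation shows why copying becomes prohibitive at scale. The sketch below assumes a single link sustaining 10 Gbit/s end to end. That figure is an illustrative assumption, not a number from the slides, and real transfers are usually slower because of protocol overhead and disk limits.

```python
def copy_time_hours(data_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time to move data_bytes at a sustained throughput, in hours."""
    return data_bytes / throughput_bytes_per_s / 3600


TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes

# Assumption: a 10 Gbit/s link fully used, i.e. 1.25e9 bytes per second.
LINK_BYTES_PER_S = 10e9 / 8

one_tb_hours = copy_time_hours(TB, LINK_BYTES_PER_S)
one_pb_days = copy_time_hours(PB, LINK_BYTES_PER_S) / 24

print(f"1 TB: {one_tb_hours:.2f} hours")
print(f"1 PB: {one_pb_days:.1f} days")
```

Under this assumption a terabyte moves in well under an hour, while a petabyte ties up the link for more than nine days, before counting the duplicate storage the second cluster needs.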
Enterprise Data Hub