Oracle Data Warehouse: Opening New Horizons for the Data Warehouse with Big Data
Alfred Schlaucher, Detlef Schroeder

Topics
• Big Data: buzzword, or a new dimension and new possibilities?
• Oracle's technology for storing unstructured and semi-structured mass data
• The Cloudera framework
• "Connectors" into the new world: Oracle Loader for Hadoop and HDFS
• Big Data Appliance
• Discovering new analysis horizons with Oracle R Enterprise
• Big Data analysis with Endeca

Hive
• Hive is an abstraction on top of MapReduce
• Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
• Uses the HiveQL language, which is very similar to SQL
• The Hive interpreter runs on a client machine
  • Turns HiveQL queries into MapReduce jobs
  • Submits those jobs to the cluster
• Note: this does not turn the cluster into a relational database server!
  • It is still simply running MapReduce jobs
  • Those jobs are created by the Hive interpreter

Hive (cont'd)
• Sample Hive query:

  SELECT stock.product, SUM(orders.purchases)
  FROM stock
  INNER JOIN orders ON (stock.id = orders.stock_id)
  WHERE orders.quarter = 'Q1'
  GROUP BY stock.product;

Pig
• Pig is an alternative abstraction on top of MapReduce
• Uses a dataflow scripting language called Pig Latin
• The Pig interpreter runs on the client machine
  • Takes the Pig Latin script and turns it into a series of MapReduce jobs
  • Submits those jobs to the cluster
• As with Hive, nothing 'magical' happens on the cluster
  • It is still simply running MapReduce jobs

Pig (cont'd)
• Sample Pig script:

  stock  = LOAD '/user/fred/stock' AS (id, item);
  orders = LOAD '/user/fred/orders' AS (id, cost);
  grpd   = GROUP orders BY id;
  totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
  result = JOIN stock BY id, totals BY group;
  DUMP result;

Flume and Sqoop
• Flume provides a method to import data into HDFS as it is generated
  • Rather than batch-processing the data later
  • For example, log files from a web server
• Sqoop provides a method to import data from tables in a relational database into HDFS or Hive
  • Does this very efficiently via a map-only MapReduce job
  • Can also 'go the other way': populate database tables from files in HDFS
  • (A sketch of a programmatic Sqoop import appears at the end of this transcript)

Oozie
• Oozie allows developers to create a workflow of MapReduce jobs
  • Including dependencies between jobs
• The Oozie server submits the jobs to the cluster in the correct sequence
• (A sketch of submitting an Oozie workflow appears at the end of this transcript)

HBase
• HBase is 'the Hadoop database', a 'NoSQL' datastore
• Can store massive amounts of data
  • Gigabytes, terabytes, and even petabytes of data in a table
• Scales to provide very high write throughput
  • Hundreds of thousands of inserts per second
• Copes well with sparse data
  • Tables can have many thousands of columns
  • Even if most columns are empty for any given row
• Has a very constrained access model
  • Insert a row, retrieve a row, do a full or partial table scan
  • Only one column (the 'row key') is indexed
• (A sketch of this access model appears at the end of this transcript)

HBase vs. Traditional RDBMSs

                                  RDBMS                           HBase
  Data layout                     Row-oriented                    Column-oriented
  Transactions                    Yes                             Single row only
  Query language                  SQL                             get/put/scan
  Security                        Authentication/Authorization    TBD
  Indexes                         On arbitrary columns            Row key only
  Max data size                   TBs                             PB+
  Read/write throughput limits    1000s of queries/second         Millions of queries/second

Contact and More Information
• Oracle Data Warehouse Community: become a member
• Many free seminars and events
• Download server: www.ORACLEdwh.de
• Next German-language Oracle DWH conference: 19 + 20 March 2013, Kassel
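Sqoop (example sketch)
A minimal Java sketch of launching a Sqoop 1 import programmatically via Sqoop.runTool, which accepts the same arguments as the sqoop command line. The JDBC URL, username, table, and HDFS path below are hypothetical placeholders, not part of the original slides.

  import org.apache.sqoop.Sqoop;

  public class SqoopImportSketch {
      public static void main(String[] args) {
          // Arguments mirror the sqoop command line; all values are placeholders
          String[] importArgs = {
              "import",
              "--connect", "jdbc:mysql://dbhost/sales",  // source database (hypothetical)
              "--username", "fred",
              "--table", "orders",                       // table to import
              "--target-dir", "/user/fred/orders",       // HDFS destination directory
              "--num-mappers", "4"                       // parallelism of the map-only job
          };
          // runTool parses the arguments and runs the map-only import job
          int exitCode = Sqoop.runTool(importArgs);
          System.exit(exitCode);
      }
  }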
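Oozie (example sketch)
A minimal Java sketch of submitting a workflow through the Oozie client API (org.apache.oozie.client.OozieClient). It assumes a workflow definition (workflow.xml) describing the MapReduce jobs and their dependencies is already deployed in HDFS; the server URL, paths, and property names are hypothetical placeholders.

  import java.util.Properties;
  import org.apache.oozie.client.OozieClient;
  import org.apache.oozie.client.WorkflowJob;

  public class OozieSubmitSketch {
      public static void main(String[] args) throws Exception {
          // Connect to the Oozie server (URL is a placeholder)
          OozieClient oc = new OozieClient("http://oozie-host:11000/oozie");

          // Point the job at the workflow app deployed in HDFS
          Properties conf = oc.createConfiguration();
          conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/fred/my-wf-app");
          conf.setProperty("inputDir", "/user/fred/input");    // parameters the workflow references
          conf.setProperty("outputDir", "/user/fred/output");

          // Submit and start the workflow; the Oozie server runs the jobs in order
          String jobId = oc.run(conf);

          // Poll until the workflow leaves the RUNNING state
          while (oc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
              Thread.sleep(10 * 1000);
          }
          System.out.println("Workflow " + jobId + " finished: " + oc.getJobInfo(jobId).getStatus());
      }
  }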
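HBase (example sketch)
A minimal Java sketch of HBase's constrained access model: put a row, get it back by row key, and scan a row-key range. Method names follow the newer Connection/Table client API (HBase 1.x/2.x era, later than this talk); the table, column family, and row keys are hypothetical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseAccessSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("orders"))) {  // placeholder table

              // Insert a row: the row key is the only indexed value
              Put put = new Put(Bytes.toBytes("order-0001"));
              put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("product"), Bytes.toBytes("widget"));
              table.put(put);

              // Retrieve a single row by its row key
              Get get = new Get(Bytes.toBytes("order-0001"));
              Result row = table.get(get);
              System.out.println(Bytes.toString(
                  row.getValue(Bytes.toBytes("d"), Bytes.toBytes("product"))));

              // Partial table scan over a row-key range
              Scan scan = new Scan().withStartRow(Bytes.toBytes("order-0001"))
                                    .withStopRow(Bytes.toBytes("order-9999"));
              try (ResultScanner scanner = table.getScanner(scan)) {
                  for (Result r : scanner) {
                      System.out.println(Bytes.toString(r.getRow()));
                  }
              }
          }
      }
  }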