Impala and BigQuery
By David Gruzman, BigDataCraft.com

Impala and BigQuery
► BigQuery is Google's database service based on Dremel. BigQuery is hosted by Google.
► Impala is an open source database inspired by the Dremel paper. Impala is part of the Cloudera Hadoop distribution.

Today's agenda
► Overview of Dremel as a technology
► Overview of BigQuery
► A few words about Impala
► DG Mediamind use case
► Deeper insights into Impala
► Conclusions
► Q&A

Why Dremel?
► Google was the first to have MapReduce.
► Google was the first to face MapReduce's main problem: latency. The problem also propagated to the engines built on top of MapReduce.
► It is logical that Google was the first to address it, by developing a real-time query capability for big data.

How Dremel is used at Google
► Dremel is not a replacement for MapReduce or Tenzing but complements them. (Tenzing is Google's Hive.)
► An analyst can run many fast queries using Dremel.
► After getting a good idea of what is needed, run a slow MapReduce job (or SQL based on MapReduce) to get precise results.

Why Dremel is unique
► Dremel, with BigQuery built on top of it, is probably the only interactive big data query engine today.
► I mean that it is the only engine capable of producing results over terabytes of data in seconds!
► The main idea (my guess) is to harness a huge cluster of machines for a single query.

Dremel as a technology
► Novel hierarchical columnar format.
► LLVM-based code generation.
► Distributed aggregation tree.
► In-situ data processing (inside the storage).

Dremel : Aggregation tree

Dremel : Nested columnar format

BigQuery
► A service built by Google on top of the Dremel engine.
► The only query engine as a service (known to me) that works with big data.
► Query time does not depend on data size.

BigQuery main capabilities
► Aggregations
► Join of a big table to a small table
► Join of two big tables (recently added)
► Hierarchical data format. It makes pre-aggregations cheaper.
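The "Dremel : Aggregation tree" slide names the idea without spelling it out. A minimal Python sketch of such a tree follows, under stated assumptions: the row shape, the function names, and the `fanout` parameter are hypothetical and chosen only for illustration, and the only aggregate shown is a sum per key. Leaf servers aggregate their local shard, and each level above merges the partial results of its children until one result remains at the root.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # hypothetical row shape: (group key, value)


def leaf_aggregate(shard: Iterable[Row]) -> Counter:
    """Leaf server: scan one local shard, keep a partial sum per key."""
    partial: Counter = Counter()
    for key, value in shard:
        partial[key] += value
    return partial


def merge(partials: Iterable[Counter]) -> Counter:
    """Intermediate or root server: add up the partial sums of its children."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return total


def aggregation_tree(shards: List[List[Row]], fanout: int = 2) -> Dict[str, int]:
    """Evaluate SUM(value) GROUP BY key bottom-up through a tree of mergers."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_aggregate(shard) for shard in shards]
    while len(level) > 1:
        # Each parent merges the partial results of up to `fanout` children.
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return dict(level[0]) if level else {}


shards = [[("a", 1), ("b", 2)], [("a", 3)], [("b", 4), ("c", 5)]]
print(aggregation_tree(shards))  # {'a': 4, 'b': 6, 'c': 5}
```

The point of the tree shape is that no single node has to receive the raw rows; each node only sees partial aggregates, which are small when data reduction is big.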
Main limitations
► Small result size.
► Intermediate results must not exceed memory size.
► No "external tables".

So, why is BigQuery not popular?
► Data is not created in the Google cloud. It is hard and impractical to move big data. It is heavy, after all.
► Google tends to change its APIs. BigQuery has also changed over the last years. It is hard to build a business on it.
► Many companies in Internet-related businesses are wary of sharing data with Google.
► It is expensive. At $35 per TB, bills can reach thousands of dollars per day.

At the same time, it is technically good
► I got references from a company that did serious testing.
► Martin Fowler's company also tested it and gave very good feedback.

Question to all of you
► Why did your organization decide not to use Google's BigQuery?

Where can we find Impala?

Impala

What is Impala
► Massively parallel processing (MPP) database engine, developed by Cloudera.
► Integrated into the Hadoop stack on the same level as MapReduce, not above it (as Hive and Pig are).
(Diagram labels: Pig, Hive, Impala, MapReduce, HDFS)

Why Impala
► Data has gravity.
► Today a lot of data lives in HDFS.
► It is not practical to move big data.
► It is practical to bring the engine to the data.
► At the same time, MapReduce is not a must.
► Impala processes data in the Hadoop cluster without using MapReduce.

MapReduce bypass
► Several other modern database engines have also realized the opportunity to bypass MapReduce and work directly with HDFS.
► They take various approaches.
► Existing MPP databases, like Greenplum, store their external tables in HDFS.
► Jethrodata stores data in its own format on HDFS and also works with it without the MR layer.
► Its proprietary format enables full indexing of the data together with columnar efficiency.
► For highly selective queries this approach has serious advantages.
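Why indexing helps highly selective queries can be shown with a toy example. The sketch below is not Jethrodata's format, which is proprietary and not described in these slides; it is a plain in-memory value-to-row-ids index, with hypothetical function names, compared against a full column scan.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(column: List[str]) -> Dict[str, List[int]]:
    """Map each distinct value to the ids of the rows where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


def scan_lookup(column: List[str], wanted: str) -> List[int]:
    """Full scan: every row is touched, however few rows match."""
    return [row_id for row_id, value in enumerate(column) if value == wanted]


def index_lookup(index: Dict[str, List[int]], wanted: str) -> List[int]:
    """Index lookup: only the matching row ids are touched."""
    return index.get(wanted, [])


# 1000 rows, only two of which match the predicate (high selectivity).
column = ["common"] * 1000
column[17] = "rare"
column[640] = "rare"

index = build_index(column)
print(scan_lookup(column, "rare"))   # [17, 640]
print(index_lookup(index, "rare"))   # [17, 640]
```

Both lookups return the same rows, but the scan reads all 1000 values while the index lookup reads two entries. The index costs extra storage and build time, which is the trade-off behind the "high selectivity" qualifier on the slide.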
Use case from DG
► I think it will be a typical case in the future.
► DG is using Hadoop and Hive.
► They are evaluating Impala to do part of the work more efficiently.
► After their case presentation we will come back to discuss Impala insights.

Again: Impala has a different place than Pig and Hive
(Diagram labels: Hive and Pig, Impala, MapReduce, HDFS)

Impala architecture

Impala : Dremel traces
► LLVM code generation
► It is really fast
► C++ as the implementation language (not Java...)
► Simple query engine. It actually does the things that can be done in memory.
► Broadcast join algorithm is implemented.

LLVM code generation
► Assume you want to write custom code for a specific query. It will be super efficient.
► Code generation automates this process for each query.
► We actually need to super-optimize the inner loop doing the filtering (WHERE) and the GROUP BY.
► LLVM enables us to compile into native code in a fraction of a second.
► LLVM enables us to use new CPU capabilities like SSE in a portable way.

Why is code generation interesting?
► If you develop your own engine, or some piece of code responsible for processing serious data volumes, code generation may give you an order-of-magnitude boost.
► I have had cases where the use of such technology was game changing.

Impala : Hive traces
► While Dremel converts data into its own format, Impala supports multiple formats. It is a kind of schema on read.
► Impala shares the metastore with Hive, which makes adoption very simple.
► Internally, Impala has a well-defined way to add new formats.

Impala : unique things
► Impala's "format adapters", called scanners, have predicate pushdown capability.
► Probably the only open source MPP engine.
► Today we have no other means to run hundreds of CPU cores in one query efficiently without an expensive license.
► Hive gives us the same, but not efficiently.

Impala vs MPP
► It usually takes many years to create an MPP database.
► There are serious simplifications:
  ► The data is read-only.
  ► There is actually no DBMS, only a query engine.
  ► No serious resource management, only measurement (all over the code).

Impala : Hive killer?
► Not so quickly.
► Hive does things Impala cannot do yet, like joins between several big tables.
► Hive has convenient Java UDFs, while Impala does not.
► Impala does not have intra-query (mid-query) fault tolerance.
► At the same time, MapReduce is not a good framework for a database engine.

Impala : data formats
► There are scanners for the following formats:
  ► RCFile
  ► Parquet (columnar format based on the Dremel paper)
  ► CSV
  ► Avro
  ► SequenceFile

Impala : future
► Will get closer to other MPP engines
► Support more formats
► More advanced scheduling and resource management

Basic benchmark
► TPC-H, Q1, SF=10
► 4 EC2 large instances
► 4 seconds, while Hive takes about 1 minute.
► This number means a GROUP BY speed of about 235 MB/sec per core.

Impala price per TB
► 1 large instance costs $0.24 per hour.
► The cluster costs $0.96 per hour.
► Cost of 1 second: $0.96 / 3600
► Such a cluster processes 1.75 GB per second.
► So the cost of processing 1 TB is about $0.15.
► That is more than 200 times cheaper than BigQuery at $35 per TB.

Performance : summary
► It is fast when data reduction is big.
► It is fast when data is hot.
► It should benefit from fast storage / SSD.
► My measurements show about 200 MB/sec per core for GROUP BY processing.
► Always at least 10 times faster than Hive.

What about clouds?

Impala in the cloud is not elastic
► To be elastic we need to create a cluster when we need it.
► Even if we accept hourly resolution, storage will be a problem.
► S3 will not give us hundreds of MB per second per instance.
► Data stored in the local file system is transient.

Impala : conclusions
► It is the first time I remember that we can get our hands on a free MPP database.
► There is no risk in trying it side by side with Hive.
► It is possible to offload part of the work to Impala and do the rest with Hive.
► It is part of the Cloudera Hadoop distribution and is easily installed with Cloudera Manager.

Materials used
► Benchmarks
  http://www.slideshare.net/sudabon/performanceevaluation-of-cloudera-impala-2012120815536323
  https://amplab.cs.berkeley.edu/benchmark/
► Architecture
  http://www.slideshare.net/scottleber/impala19176906
  https://cloud.google.com/files/BigQueryTechnical WP.pdf

Materials used : comparisons
► To Hive: http://www.quora.com/Cloudera/DoesCloudera-Impala-have-any-drawbacks-whencompared-with-Hive
► To Vertica: http://www.quora.com/ClouderaImpala/How-does-Cloudera-Impala-compare-toVertica
► To Dremel: http://www.quora.com/ClouderaImpala/How-does-Clouderas-Impala-compareto-Googles-Dremel

Thank you!!!
► Special thanks to Faina Kamenetsky, who helped set up clusters on Amazon.

BigDataCraft.com
► We are a boutique consulting company.
► Our services are:
  ► On-paper POC
  ► On-hardware POC
  ► Architecture / design reviews
  ► Custom integrations and bug fixing

Impala : Flow
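The arithmetic from the Impala price slide can be reproduced in a few lines of Python. The inputs are the figures quoted in the slides ($0.24 per instance hour, 4 instances, 1.75 GB per second, $35 per TB for BigQuery). Two things are assumptions of this sketch and not stated in the talk: a terabyte is taken as 1000 GB, and the cluster is treated as if it were billed per second, which ignores hourly billing and idle time.

```python
INSTANCE_PRICE_PER_HOUR = 0.24   # USD, from the slide
INSTANCES = 4                    # from the benchmark slide
THROUGHPUT_GB_PER_SEC = 1.75     # measured cluster throughput, from the slide
BIGQUERY_PRICE_PER_TB = 35.0     # USD, from the BigQuery slide
GB_PER_TB = 1000                 # assumption: decimal terabyte


def impala_cost_per_tb(
    instance_price_per_hour: float = INSTANCE_PRICE_PER_HOUR,
    instances: int = INSTANCES,
    throughput_gb_per_sec: float = THROUGHPUT_GB_PER_SEC,
) -> float:
    """Cluster cost (USD) of scanning one terabyte at the given throughput."""
    cluster_price_per_sec = instance_price_per_hour * instances / 3600
    seconds_per_tb = GB_PER_TB / throughput_gb_per_sec
    return cluster_price_per_sec * seconds_per_tb


cost = impala_cost_per_tb()
ratio = BIGQUERY_PRICE_PER_TB / cost
print(round(cost, 2))   # 0.15
print(round(ratio))     # 230
```

The result is only as good as the throughput figure: it holds when the data is hot and the cluster is kept busy, which is the condition named on the performance summary slide.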