Random Query Generator for Hive
November 2015 Hive Contributor Meetup
Szehon Ho

Overview
• Collaboration with the Impala team; work to run the generator against Hive
• Automates generation of test cases, which solves:
  • Humans can only generate so many test queries
  • Humans focus on positive queries (what about machine-generated queries?)
• Idea is to have two databases: a test database (Hive, Impala) and a reference database (Postgres, Mysql, Oracle)
• Generate random data, issue random queries against both

© 2014 Cloudera, Inc. All rights reserved.

Data Generator
• Table-count (max, min)
• Column-count (max, min)
• Row-count (max, min)

Column Data Types
• Boolean
• TinyInt
• SmallInt
• BigInt
• Double
• Float
• Decimal(r_precision, r_scale)
• Char(r_length)
• Varchar(r_length)
• Timestamp

Query Generator
1. Generate a QueryModel based on a QueryProfile
2. Use a ModelTranslator to translate the model into the database's SQL dialect
3. Execute the SQL via DbConnectors
4. Compare results (sort if unsorted)

[Diagram: HiveProfile and ImpalaProfile feed the QueryModel. For the “reference databases”, PostgresTranslator produces SQL (Postgres dialect) and MysqlTranslator produces SQL (Mysql dialect); for the “test databases”, HiveTranslator produces HiveQL.]

Query Model, High Level
• Represents a valid SQL query
• A query consists of one or more clauses (from, select, group-by, union)
• A clause has one or more expressions (constants, columns, functions of columns, tables); the allowed expressions differ by clause type
• The model is recursive in nature:
  • Funcs can be run on the output of other funcs
  • A union clause can contain another query
  • Some boolean funcs can contain a subquery

[Diagram: Query, Clause, Constant/Col, Funcs, TableExpr]
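The four Query Generator steps (generate a model, translate it per dialect, execute, compare) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the `QueryModel` fields, the two `translate_*` functions and `run_once` are hypothetical names, and two in-memory SQLite databases stand in for the test and reference databases so the sketch runs without Hive or Postgres.

```python
import random
import sqlite3
from dataclasses import dataclass


@dataclass
class QueryModel:
    """Dialect-neutral description of a tiny SELECT query (hypothetical)."""
    columns: list    # column names to select
    table: str       # table to read from
    min_value: int   # constant used in the WHERE predicate on column a


def generate_model(rng):
    # Step 1: build a random model (a real profile would weight the choices).
    cols = rng.sample(["a", "b", "c"], k=rng.randint(1, 3))
    return QueryModel(columns=cols, table="t", min_value=rng.randint(0, 50))


def translate_reference(model):
    # Step 2a: reference dialect, double-quoted identifiers (Postgres style).
    cols = ", ".join('"{}"'.format(c) for c in model.columns)
    return 'SELECT {} FROM "{}" WHERE "a" >= {}'.format(
        cols, model.table, model.min_value)


def translate_test(model):
    # Step 2b: test dialect, backtick-quoted identifiers (HiveQL style).
    cols = ", ".join("`{}`".format(c) for c in model.columns)
    return "SELECT {} FROM `{}` WHERE `a` >= {}".format(
        cols, model.table, model.min_value)


def make_db(rows):
    # Stand-in for a DbConnector: an in-memory SQLite database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    return conn


def results_match(test_rows, ref_rows):
    # Step 4: without ORDER BY the row order is arbitrary, so sort first.
    return sorted(test_rows) == sorted(ref_rows)


def run_once(seed):
    rng = random.Random(seed)
    rows = [tuple(rng.randint(0, 100) for _ in range(3)) for _ in range(20)]
    # Load the same data in a different physical order into each database.
    test_db, ref_db = make_db(rows), make_db(list(reversed(rows)))
    model = generate_model(rng)
    # Step 3: execute each dialect's SQL against its own database.
    test_rows = test_db.execute(translate_test(model)).fetchall()
    ref_rows = ref_db.execute(translate_reference(model)).fetchall()
    return results_match(test_rows, ref_rows)
```

Calling `run_once(seed)` for a range of seeds plays the role of one test run; a `False` return would flag a query whose results differ between the two databases.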
Query Model, Funcs
• Func types:
  • Boolean funcs (isnull, and, or, in, =, !=, >, <)
  • Subquery funcs (exists, not exists, in, not in): may contain another Query
  • Val funcs (Trim, Length, Concat, Add, Abs, Floor, Ceil, Greatest, Least, etc.)
  • Agg funcs (e.g., Max, Min, Sum, Avg, Count)
  • Analytic funcs (Rank, DenseRank, RowNumber, Lead, Lag, FirstValue, LastValue, Max, Min, etc.)
    • Window specification (“Rows between x and y”, “rows unbounded preceding”, etc.)
    • PartitionByClause (“over (partition by x)”)
    • OrderByClause
• Rules determine where a func can be used, based on func type and return type

QueryModel: Clauses
• QueryModel
  • WithClause: adds a table expression, e.g. “With bar as (select * from foo) select * from bar;”
  • SelectClause
  • FromClause: Table Expression
  • WhereClause:
    • Predicate (Boolean expr)
  • GroupByClause: if Select (Basic or AggFunc)
  • HavingClause: if Select (AggFunc)
    • Predicate (Boolean expr)
  • UnionClause (Query)
  • OrderByClause
  • LimitClause
• SelectClause, list of Exprs:
  • Constant
  • Col
  • Val Funcs
  • AggFunc
  • AnalyticFunc
    • Window
    • PartitionByClause
    • OrderByClause
• GroupByClause, list of:
  • Constant
  • Col
• OrderByClause, list of:
  • Constant
  • Col
  • Func

QueryModel: Joins
• QueryModel
  • WithClause
  • SelectClause
  • FromClause:
    • Multiple table expressions
    • JoinClause (defines the table relationship)
  • WhereClause:
    • Predicate (Boolean function, using exprs from tables in the JoinClause)
  • GroupByClause
  • HavingClause
• JoinClause types:
  • Inner
  • Left
  • Right
  • Left semi
  • Right semi
  • Right anti
  • Full outer
  • Cross

Demo
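The recursive nature of the model, and the idea of rules that place a func according to its return type, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `SIGNATURES` table, the `Col` and `Func` classes, the column names and the 30% leaf probability are hypothetical and are not taken from the actual generator.

```python
import random
from dataclasses import dataclass

# Hypothetical signature table: func name -> (return type, argument types).
SIGNATURES = {
    "Trim": ("string", ["string"]),
    "Concat": ("string", ["string", "string"]),
    "Length": ("int", ["string"]),
    "Abs": ("int", ["int"]),
    "Greatest": ("int", ["int", "int"]),
}

# Hypothetical columns available for each type.
COLUMNS = {"int": ["int_col"], "string": ["string_col"]}


@dataclass
class Col:
    name: str
    type: str


@dataclass
class Func:
    name: str
    args: list

    @property
    def type(self):
        # A func's type is its declared return type.
        return SIGNATURES[self.name][0]


def generate_expr(rng, wanted_type, depth):
    """Build a random expression whose type is wanted_type."""
    candidates = [n for n, (ret, _) in SIGNATURES.items() if ret == wanted_type]
    # Stop recursing at depth 0, or at random, and fall back to a column.
    if depth == 0 or not candidates or rng.random() < 0.3:
        return Col(rng.choice(COLUMNS[wanted_type]), wanted_type)
    name = rng.choice(candidates)
    # Each argument is itself an expression of the type the func expects.
    args = [generate_expr(rng, t, depth - 1) for t in SIGNATURES[name][1]]
    return Func(name, args)


def to_sql(expr):
    """Render the expression tree as SQL-like text."""
    if isinstance(expr, Col):
        return expr.name
    return "{}({})".format(expr.name, ", ".join(to_sql(a) for a in expr.args))


def is_well_typed(expr):
    """Check that every func receives arguments of its declared types."""
    if isinstance(expr, Col):
        return expr.name in COLUMNS[expr.type]
    arg_types = SIGNATURES[expr.name][1]
    return len(expr.args) == len(arg_types) and all(
        a.type == t and is_well_typed(a) for a, t in zip(expr.args, arg_types))
```

For example, `to_sql(Func("Length", [Func("Trim", [Col("string_col", "string")])]))` renders as `Length(Trim(string_col))`, a func applied to the output of another func. The depth limit keeps the recursion finite.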
Results 1: HiveQL Discrepancies
• Language deficiencies (as of Hive 1.1)
  • Lacks support for “Interval” in date arithmetic operations: date + INTERVAL expr unit
  • With {…} cannot be used in a subquery
  • Having must have a group by
  • Cannot sort by two expressions in a window function, unless a window is specified
  • Negative lag or lead amount not allowed
  • Only “Union all” and not “Union” (since fixed)
• Null ordering
  • Hive lacks syntax for specifying null order (its default is the opposite of Postgres)

Results 2: JIRAs so far
• Many valid issues found, fixed since 1.1
  • HIVE-12082: Null comparison for greatest and least operator
  • HIVE-12070: Relax type restrictions on ‘Greatest’ and ‘Least’
  • HIVE-11737: IndexOutOfBounds compiling query with duplicated groupby keys
  • HIVE-11712: Duplicate groupby keys cause ClassCastException
  • HIVE-11835: Type decimal(1,1) reads 0.0, 0.00, etc. from text file as NULL
  • HIVE-12296: ClassCastException when selecting constant in inner select (pending)

Going Forward
• Tackle non-SQL-92 query support
  • Nested types
  • Partitioned tables
  • Multi-insert

Thank you.
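The null-ordering discrepancy matters for the result comparison step: if the two databases place NULLs at opposite ends, a row-by-row comparison reports a false mismatch. One possible workaround, sketched below, is to normalize both result sets with the same NULL-aware sort key before comparing. This assumes that comparing the results as unordered collections is acceptable for the query under test; `null_safe_key`, `normalize` and `same_result_set` are hypothetical helper names, not part of the actual tool.

```python
def null_safe_key(row):
    """Sort key that places NULL (None) before any value in each column.

    Each cell becomes (is_not_null, value). For NULL cells the value slot is
    a fixed 0, so None is never compared with a real value.
    """
    return tuple(
        (value is not None, value if value is not None else 0)
        for value in row
    )


def normalize(rows):
    # Apply the same ordering to both result sets, whatever order they came in.
    return sorted(rows, key=null_safe_key)


def same_result_set(test_rows, ref_rows):
    # Compare as unordered collections of rows (duplicates are preserved).
    return normalize(test_rows) == normalize(ref_rows)


# Illustrative rows only: the same data returned with NULLs first or last.
nulls_first = [(None, "x"), (1, "a"), (2, None)]
nulls_last = [(1, "a"), (2, None), (None, "x")]
print(same_result_set(nulls_first, nulls_last))
```

A plain `sorted(rows)` would raise a `TypeError` here in Python 3, because `None` cannot be ordered against an integer or a string; the key function avoids that comparison.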