Random Query Generator for Hive
November 2015 Hive Contributor Meetup
Szehon Ho
Overview
• Collaboration with the Impala team; work to run the generator against Hive
• Automates generation of test cases, which addresses:
  • Humans can only write so many test queries
  • Humans focus on positive queries; machine-generated queries are not biased that way
• Idea is to have two databases: a test database (Hive, Impala) and a reference database (Postgres, MySQL, Oracle)
• Generate random data, then issue random queries against both
© 2014 Cloudera, Inc. All rights reserved.
Data Generator
• Table-count (max, min)
• Column-count (max, min)
• Row-count (max, min)
Column Data Types
Boolean
Float
TinyInt
Decimal(r_precision, r_scale)
SmallInt
Char(r_length)
BigInt
Varchar(r_length)
Double
Timestamp
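The bounds and column types above can be sketched as a small random schema generator. This is only an illustration, not the tool's actual code: `random_type`, `random_schema`, `create_table_sql`, the table and column naming, and the precision and length ranges are all assumptions made for the example.

```python
import random

# Type names taken from the slide; parameterized types get random arguments.
SIMPLE_TYPES = ["BOOLEAN", "TINYINT", "SMALLINT", "BIGINT",
                "FLOAT", "DOUBLE", "TIMESTAMP"]


def random_type(rng):
    """Pick a column type, filling in random precision/scale/length."""
    kind = rng.choice(SIMPLE_TYPES + ["DECIMAL", "CHAR", "VARCHAR"])
    if kind == "DECIMAL":
        precision = rng.randint(1, 38)       # assumed upper bound
        scale = rng.randint(0, precision)    # scale cannot exceed precision
        return f"DECIMAL({precision},{scale})"
    if kind in ("CHAR", "VARCHAR"):
        return f"{kind}({rng.randint(1, 255)})"  # assumed length range
    return kind


def random_schema(rng, min_tables, max_tables, min_cols, max_cols):
    """Return {table_name: [(column_name, type), ...]} within the bounds."""
    schema = {}
    for t in range(rng.randint(min_tables, max_tables)):
        n_cols = rng.randint(min_cols, max_cols)
        schema[f"table_{t}"] = [(f"col_{c}", random_type(rng))
                                for c in range(n_cols)]
    return schema


def create_table_sql(name, columns):
    """Render one generated table as a CREATE TABLE statement."""
    body = ", ".join(f"{col} {typ}" for col, typ in columns)
    return f"CREATE TABLE {name} ({body})"
```

Passing a seeded `random.Random` instance keeps a run reproducible, which helps when a generated schema later exposes a discrepancy.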
Query Generator
1. Generate a QueryModel based on a QueryProfile
2. ModelTranslator translates the model into each database’s SQL dialect
3. Execute the SQL via DbConnectors
4. Compare results (sort first if the query output is unsorted)
[Diagram: HiveProfile / ImpalaProfile -> QueryModel. “Reference databases”: PostgresTranslator -> SQL (Postgres dialect), MysqlTranslator -> SQL (MySQL dialect). “Test databases”: HiveTranslator -> HiveQL.]
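The result comparison in step 4 can be sketched as follows. This is a minimal illustration under assumptions: `results_match` and `sort_key` are hypothetical names, rows are taken to be plain tuples already fetched from both databases, and floating-point tolerance and type coercion between databases are left out.

```python
def sort_key(row):
    """Order rows deterministically even when they contain NULLs (None)."""
    return tuple((value is None, str(value)) for value in row)


def results_match(reference_rows, test_rows, query_is_ordered):
    """Compare two result sets, sorting both first if the query had no ORDER BY."""
    reference_rows = [tuple(r) for r in reference_rows]
    test_rows = [tuple(r) for r in test_rows]
    if not query_is_ordered:
        # Without ORDER BY, row order is arbitrary, so normalize it.
        reference_rows.sort(key=sort_key)
        test_rows.sort(key=sort_key)
    return reference_rows == test_rows
```

Sorting only when the query is unordered matters: for a query with ORDER BY, the row order itself is part of what is being tested.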
Query Model, High Level
• Represents a valid SQL query
• A query consists of one or more clauses (from, select, group-by, union)
• A clause has one or more expressions (constants, columns, functions of columns, tables), which differ by clause type
• The model is recursive in nature:
  • Funcs can be run on the output of other funcs
  • A union clause can contain another query
  • Some boolean funcs can contain a subquery
[Diagram: Query -> Clause -> Constant/Col, Funcs, TableExpr]
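The recursive shape of the model can be sketched with a few small classes. This is an illustration only: the class names `Column`, `Constant`, `Func`, and `Query` and the `to_sql` method are assumptions for the example and are not taken from the generator's code.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Column:
    name: str

    def to_sql(self):
        return self.name


@dataclass
class Constant:
    value: int

    def to_sql(self):
        return str(self.value)


@dataclass
class Func:
    name: str
    args: List["Expr"]  # arguments may themselves be Funcs

    def to_sql(self):
        return f"{self.name}({', '.join(a.to_sql() for a in self.args)})"


Expr = Union[Column, Constant, Func]


@dataclass
class Query:
    select: List[Expr]
    from_table: str
    union: Optional["Query"] = None  # a union clause holds another Query

    def to_sql(self):
        cols = ", ".join(e.to_sql() for e in self.select)
        sql = f"SELECT {cols} FROM {self.from_table}"
        if self.union is not None:
            sql += " UNION ALL " + self.union.to_sql()
        return sql


nested = Query([Func("abs", [Func("floor", [Column("x")])])], "foo",
               union=Query([Constant(1)], "bar"))
print(nested.to_sql())
```

The example nests one func inside another and one query inside a union clause, the two kinds of recursion listed on the slide.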
Query Model, Funcs
• Func types:
  • Boolean funcs (isnull, and, or, in, =, !=, >, <)
  • Subquery funcs (exists, not exists, in, not in): may contain another Query
  • Val funcs (Trim, Length, Concat, Add, Abs, Floor, Ceil, Greatest, Least, etc.)
  • Agg funcs (e.g. Max, Min, Sum, Avg, Count)
  • Analytic funcs (Rank, DenseRank, RowNumber, Lead, Lag, FirstValue, LastValue, Max, Min, etc.)
    • Window specification (“rows between x and y”, “rows unbounded preceding”, etc.)
    • PartitionByClause (“over (partition by x)”)
    • OrderByClause
• Rules determine where a func can be used, based on func type and return type
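One way to picture the return-type rule is a signature table that the generator consults when it needs an expression of a given type. The sketch below is hypothetical: the `SIGNATURES` table, the `COLUMNS` map, and `random_expr` are invented for illustration and cover only a handful of the funcs named above.

```python
import random

# (name, argument types, return type); a tiny hypothetical subset.
SIGNATURES = [
    ("length", ("string",), "int"),
    ("abs", ("int",), "int"),
    ("trim", ("string",), "string"),
    ("concat", ("string", "string"), "string"),
    ("isnull", ("int",), "boolean"),
]

# Hypothetical columns available for each type.
COLUMNS = {"int": ["int_col"], "string": ["str_col"], "boolean": ["bool_col"]}


def random_expr(rng, return_type, depth):
    """Build an expression of the requested type, nesting funcs up to depth."""
    candidates = [s for s in SIGNATURES if s[2] == return_type]
    if depth == 0 or not candidates:
        return rng.choice(COLUMNS[return_type])
    name, arg_types, _ = rng.choice(candidates)
    # Each argument is generated recursively with the type the func expects.
    args = [random_expr(rng, t, depth - 1) for t in arg_types]
    return f"{name}({', '.join(args)})"
```

Calling `random_expr(random.Random(0), "boolean", 2)` yields an `isnull(...)` call whose argument is itself a func returning an int, so type correctness is preserved at every nesting level.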
QueryModel: Clauses
• QueryModel
  • WithClause: adds a table expression, e.g. “With bar as (select * from foo) select * from bar;”
  • SelectClause
  • FromClause: Table Expression
  • WhereClause:
    • Predicate (Boolean expr)
  • GroupByClause: if Select (Basic or AggFunc)
  • HavingClause: if Select (AggFunc)
    • Predicate (Boolean expr)
  • UnionClause (Query)
  • OrderByClause
  • LimitClause
• SelectClause, list of Exprs:
  • Constant
  • Col
  • Val Funcs
  • AggFunc
  • AnalyticFunc
    • Window
    • PartitionByClause
    • OrderByClause
• GroupByClause, list of:
  • Constant
  • Col
• OrderByClause, list of:
  • Constant
  • Col
  • Func
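The dependency between the select list and the group-by clause can be sketched as below. The slide's wording is terse, so this reads it as the standard SQL rule that non-aggregated select items must be grouped once an aggregate is selected; that reading, the `build_select` name, and the `(sql_text, is_aggregate)` pair representation are assumptions.

```python
def build_select(select_items, table):
    """select_items: list of (sql_text, is_aggregate) pairs."""
    has_aggregate = any(is_agg for _, is_agg in select_items)
    plain = [sql for sql, is_agg in select_items if not is_agg]
    sql = f"SELECT {', '.join(s for s, _ in select_items)} FROM {table}"
    # Non-aggregated items must be grouped once any aggregate is selected.
    if has_aggregate and plain:
        sql += " GROUP BY " + ", ".join(plain)
    return sql


print(build_select([("a", False), ("max(b)", True)], "foo"))
```

A select list with no aggregates produces no group-by clause at all in this sketch.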
QueryModel: Joins
• QueryModel
  • WithClause
  • SelectClause
  • FromClause:
    • Multiple table expressions
    • JoinClause (define table relationship)
  • WhereClause:
    • Predicate (Boolean function, using expr from tables in JoinClause)
  • GroupByClause
  • HavingClause
• JoinClause Types:
  • Inner
  • Left
  • Right
  • Left semi
  • Right semi
  • Right anti
  • Full outer
  • Cross
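Some join types have no shared syntax across databases, which is where a per-dialect translator earns its keep. Hive accepts `LEFT SEMI JOIN`, while Postgres has no such keyword and needs an equivalent `EXISTS` predicate. The sketch below shows that idea only; `left_semi_join_sql` and its parameters are hypothetical, and whether the actual translators rewrite semi joins this way is an assumption.

```python
def left_semi_join_sql(dialect, left, right, key):
    """Render the same left semi join for two different SQL dialects."""
    condition = f"{left}.{key} = {right}.{key}"
    if dialect == "hive":
        return (f"SELECT {left}.* FROM {left} "
                f"LEFT SEMI JOIN {right} ON {condition}")
    if dialect == "postgres":
        # Postgres has no SEMI JOIN syntax; EXISTS gives the same rows.
        return (f"SELECT {left}.* FROM {left} "
                f"WHERE EXISTS (SELECT 1 FROM {right} WHERE {condition})")
    raise ValueError(f"unsupported dialect: {dialect}")
```

Both strings describe the same logical query, so their results can be compared directly.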
Demo
Results 1: HiveQL Discrepancies
• Language Deficiencies (as of Hive 1.1)
  • No support for “Interval” in date arithmetic operations: date + INTERVAL expr unit
  • With {…} cannot be used in a subquery
  • Having must have a group by
  • Cannot sort by two expressions in a window function, unless a window is specified
  • Negative lag or lead amount not allowed
  • Only “Union all” and not “Union” (since fixed)
• Null Ordering
  • Hive lacks syntax for specifying null order, and its default is the opposite of Postgres
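The null-ordering mismatch can be neutralized on the reference side. By default Postgres sorts NULLs as larger than any non-null value (last for ASC, first for DESC), while Hive treats them as the smallest. A minimal sketch, assuming the Postgres query is the one that gets adjusted; `postgres_order_by` is a hypothetical helper and not part of the tool:

```python
def postgres_order_by(items):
    """items: list of (expr, direction) with direction 'ASC' or 'DESC'."""
    parts = []
    for expr, direction in items:
        # Hive treats NULL as the smallest value; make Postgres do the same.
        nulls = "NULLS FIRST" if direction == "ASC" else "NULLS LAST"
        parts.append(f"{expr} {direction} {nulls}")
    return "ORDER BY " + ", ".join(parts)


print(postgres_order_by([("a", "ASC"), ("b", "DESC")]))
```

With the explicit `NULLS FIRST` / `NULLS LAST` in place, ordered results from both databases can be compared row by row.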
Results 2: JIRAs so far
• Many valid issues found, fixed since 1.1
  • HIVE-12082: Null comparison for greatest and least operator
  • HIVE-12070: Relax type restrictions on ‘Greatest’ and ‘Least’
  • HIVE-11737: IndexOutOfBounds compiling query with duplicated groupby keys
  • HIVE-11712: Duplicate groupby keys cause ClassCastException
  • HIVE-11835: Type decimal(1,1) reads 0.0, 0.00, etc. from text file as NULL
  • HIVE-12296: ClassCastException when selecting constant in inner select (pending)
Going Forward
• Tackle non-SQL-92 query-support
• Nested Types
• Partitioned tables
• Multi-insert
Thank you.