SQL Concepts vs. SQL Code: Improving Programs using the 8 SQL Concepts
ABSTRACT
SQL is a well-known standard for writing programs manipulating databases. Consequently, SQL is a part
of several programming languages – e.g. Oracle, SAS – useful for database manipulation. Surprisingly
however, the SQL language uses only eight programming concepts, and only five of these are frequently
used. Therefore, an understanding of SQL as programming concepts vs. programming code can
significantly improve database programming. In fact, this approach is also useful to improve code in
languages without SQL proper such as Microsoft Excel. In this paper, we explore several illustrative
examples of improving code through knowledge of the five major SQL concepts. Upon reading this
paper, a person will feel comfortable in modifying or designing database programs with improvements
in readability and efficiency.
INTRODUCTION
This paper is most useful to programmers who are modestly familiar with database programming.
However, for purposes of completeness, we lightly review database theory and the eight SQL concepts.
Upon reading this paper, a person will feel comfortable in designing or modifying existing database
programs with improvements in readability and efficiency.
In the second half of the 20th century, many private companies and government agencies in a variety of nations were
each developing programming methods to deal with their specific databases. A critical turning point
came when mathematicians proved that the basic programming constructs in any particular language
were implementable in any other language. In other words, the mathematicians proved that these
diverse database languages shared commonality. This led to the definition of universal concepts –
database system, database tables and database views – as well as SQL, structured query language,
which captured the commonality of programming procedures. Let us briefly review these ideas.
A database may be thought of as a collection of database tables. Each database table may be thought of
as rectangularly arranged data. The rows of each table heuristically correspond to entities – people,
Medicare claims, airline trips, corporate transactions etc. The columns of each table heuristically
correspond to attributes of these entities. The database tables may share common columns, common
rows or have relations between columns and rows of several tables. Hence the name relational
database.
The programmer has one goal, which manifests itself in a diversity of ways. The programmer is
interested in views – that is, the programmer wishes to construct new tables using the rows and
columns of the existing tables. Some simple examples might be: Finding which items with more than
average purchase frequency are in stock; finding which customers have outstanding payments for more
than thirty days; finding which airlines connect between two cities. Here, we are using the term view
broadly to include either a view (with no query or table creation) or a created query or table containing
the desired result.
A remarkable discovery emerged from the review of the diverse database attempts of the second half of
the 20th century: Only eight concepts, five of which were frequently used, are needed to construct any
view. To appreciate the remarkableness of this discovery simply visit the online help of familiar
programming languages – Oracle, SAS, Excel, and Access. Each such language has many procedures,
diverse keywords, and many functions. Nevertheless, every SQL program to extract a view may be
programmed using only the five keywords and concepts enumerated in Table I.
SELECTION – Select only the rows meeting certain criteria
PROJECTION – Select only certain columns
UNION, INTERSECTION, (DIFFERENCE) – Take all rows belonging to any of (all of) several tables
JOIN, (PRODUCT) – Join columns of tables by linking on rows describing common entities
AGGREGATION, (DIVISION) – Numerically summarize (e.g. average) rows with common elements
Table I: The eight SQL concepts grouped into five categories. Note that projection and products are
implemented in SQL using the comma operator and without using keywords. There are five main
aggregate functions, implemented using specific keywords – sum, average, count, max, min. Although
product is a distinct SQL concept, in practice, one only sees products when implementing joins, which are
products followed by a selection (of rows with common entities). We have grouped the three
set-theoretic operations – union, intersection and difference. For example, although union and intersection
are distinct, they are both set-theoretic operations on rows whose implementation differs in one
keyword.
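The remark that a join is a product followed by a selection can be checked on a toy example. The following is a minimal sketch, not part of the SAS code of this paper: it assumes Python with the standard sqlite3 module, and the two small tables and their contents are made up for illustration.

```python
import sqlite3

# Toy in-memory tables; names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (L TEXT, N INTEGER, X TEXT)")
conn.execute("CREATE TABLE TableB (L TEXT, N INTEGER, Y TEXT)")
conn.executemany("INSERT INTO TableA VALUES (?, ?, ?)",
                 [("A", 1, "x"), ("B", 2, "x")])
conn.executemany("INSERT INTO TableB VALUES (?, ?, ?)",
                 [("B", 2, "y"), ("C", 3, "y")])

# PRODUCT: the comma pairs every row of TableA with every row of TableB.
product = conn.execute("SELECT * FROM TableA, TableB").fetchall()

# PRODUCT followed by SELECTION (the WHERE clause) and PROJECTION (the column list).
product_then_select = conn.execute("""
    SELECT a.L, a.N, a.X, b.Y
    FROM TableA AS a, TableB AS b
    WHERE a.L = b.L AND a.N = b.N
""").fetchall()

# The same view written with the JOIN keyword.
inner_join = conn.execute("""
    SELECT a.L, a.N, a.X, b.Y
    FROM TableA AS a INNER JOIN TableB AS b
      ON a.L = b.L AND a.N = b.N
""").fetchall()

print(len(product))
print(product_then_select)
print(inner_join)
```

Both queries return the single row shared by the two tables, which is why the product rarely needs to be written on its own.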
In the remainder of the paper, we explore illustrative examples, improving code through application of
concepts. SQL code is presented in SAS SQL, version 9.1. An alternative would be to use MySQL server
code. However, there are variations within MySQL server code. Furthermore, not all SAS users have
access to MySQL. Finally, our goal is to illustrate how knowledge of SQL concepts improves programming
readability and efficiency, and for this purpose SAS SQL is sufficient.
EXAMPLE – PROJECTION
We start with a simple example. Suppose your database table has ten columns which for convenience
we will call C1, C2, C3, …., C10. Suppose in your particular view you don’t need columns 4 and 7. We
present the SAS SQL and BASE SAS code to achieve this in table II.
All three approaches in Table II implement the same SQL concept, projection. However, BASE SAS,
APPROACH #1 is more readable, using only two column names instead of eight to achieve its goals.
SAS SQL APPROACH
PROC SQL;
   /* List the eight columns to keep */
   CREATE TABLE D.EightColumn AS
   SELECT C1, C2, C3, C5, C6, C8, C9, C10
   FROM D.Table;
QUIT;
BASE SAS APPROACH #1
DATA D.EightColumn;
   SET D.Table;
   DROP C4 C7;   /* name only the two unwanted columns */
RUN;
BASE SAS APPROACH #2
DATA D.EightColumn;
   SET D.Table;
   KEEP C1 C2 C3 C5 C6 C8 C9 C10;   /* name the eight wanted columns */
RUN;
Table II: Three approaches implementing projection, the selection of particular columns. We assume
that D is the library name for the directory storing the tables. We further assume that our initial table
has filename Table and the result of the view inquiry is stored in a table, EightColumn. Notice how BASE
SAS APPROACH #1 is more readable than BASE SAS APPROACH #2 and SAS SQL APPROACH.
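The same "name only what you drop" idea can be imitated in an SQL dialect that has no DROP statement for columns in a query. The following is a minimal sketch under stated assumptions: it uses Python with the standard sqlite3 module, and the table name SourceTable is hypothetical (chosen because TABLE is a reserved word in SQLite). The column list is read from the catalog and the two unwanted columns are filtered out before the SELECT is built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Build a ten-column toy table C1..C10 with one row (illustrative data).
all_columns = ["C" + str(i) for i in range(1, 11)]
column_defs = ", ".join(name + " INTEGER" for name in all_columns)
conn.execute("CREATE TABLE SourceTable (" + column_defs + ")")
placeholders = ", ".join(["?"] * 10)
conn.execute("INSERT INTO SourceTable VALUES (" + placeholders + ")",
             list(range(1, 11)))

# PROJECTION by exclusion: name only the columns to drop.
dropped = {"C4", "C7"}

# PRAGMA table_info returns one row per column; index 1 holds the column name.
names = [row[1] for row in conn.execute("PRAGMA table_info(SourceTable)")]
kept = [name for name in names if name not in dropped]

conn.execute("CREATE TABLE EightColumn AS SELECT "
             + ", ".join(kept) + " FROM SourceTable")

result_columns = [row[1] for row in conn.execute("PRAGMA table_info(EightColumn)")]
print(result_columns)
```

The concept is still projection; only the way the column list is produced has changed.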
EXAMPLE – UNION
A simple example of union is presented in Table IV, which performs a union on the two tables presented
in Table III. For convenience, the desired goal, the union view, is also presented in Table III. The BASE SAS
concatenation is a trifle more compact, though it needs an extra step to remove the duplicate row that
UNION drops automatically; time-wise both codes are about equal.
So far so good. However, if we modify Table-A and Table-B by adding to each just one column not present in the
other table, the situation becomes complicated. The modified tables are presented in Table V. The code
to produce a union of the two tables is presented in Table VI. Notice that SAS SQL code is not present in
Table VI. In fact, the simple SAS SQL UNION of Table IV does not produce the desired table, because UNION
matches columns by position and would place X and Y in the same column! The Oracle code
remedies this by supplying each missing column explicitly with the to_char(NULL) expression. BASE SAS remedies it
with the proc append procedure using the force option. It is important to emphasize, in the
context of this paper, how proc append is perceived. It is not perceived as a SAS procedure per se; rather
it is perceived as a SAS implementation of a SQL concept.
TABLE-A
L   N
A   1
A   2
B   1
B   2
TABLE-B
L   N
B   2
B   3
C   2
C   3
TABLE-A UNION TABLE-B
L   N
A   1
A   2
B   1
B   2
B   3
C   2
C   3
Table III: Table-A, Table-B and their union. The code for creating the union is presented in Table IV. “L”
and “N” stand for Letter and Number and could in a more realistic example represent a category (L) and
its components (N).
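SAS SQL CODE FOR UNION
PROC SQL;
   /* UNION stacks the rows and removes duplicates such as <B,2> */
   CREATE TABLE D.Union_A_B AS
   SELECT * FROM D.TableA
   UNION
   SELECT * FROM D.TableB;
QUIT;
SAS CODE FOR UNION WITHOUT SQL
DATA D.Union_A_B;
   SET D.TableA D.TableB;   /* concatenates; duplicate rows are kept */
RUN;
/* Remove the duplicate row <B,2> so the result matches UNION */
PROC SORT DATA=D.Union_A_B NODUPKEY;
   BY _ALL_;
RUN;
Table IV: Code for creating the union of Table-A and Table-B. We assume all tables lie in a directory
with SAS libname, D. TableA and TableB are the SAS filenames for Table-A and Table-B. Union_A_B is the
SAS filename for the union table. The DATA step alone is a little more compact, but it only concatenates
the tables; the PROC SORT step removes the duplicate row <B,2>, which UNION removes automatically.
Time-wise the two codes are about equal.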
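The difference between a true union and a plain concatenation of rows can be seen by counting rows. The following is a minimal sketch assuming Python with the standard sqlite3 module, loaded with the data of Table III; UNION ALL plays the role of a concatenation that keeps duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (L TEXT, N INTEGER)")
conn.execute("CREATE TABLE TableB (L TEXT, N INTEGER)")
conn.executemany("INSERT INTO TableA VALUES (?, ?)",
                 [("A", 1), ("A", 2), ("B", 1), ("B", 2)])
conn.executemany("INSERT INTO TableB VALUES (?, ?)",
                 [("B", 2), ("B", 3), ("C", 2), ("C", 3)])

# UNION removes the row <B,2> that appears in both tables.
union_rows = conn.execute("""
    SELECT L, N FROM TableA
    UNION
    SELECT L, N FROM TableB
    ORDER BY L, N
""").fetchall()

# UNION ALL only concatenates, so <B,2> appears twice.
concatenated_rows = conn.execute("""
    SELECT L, N FROM TableA
    UNION ALL
    SELECT L, N FROM TableB
    ORDER BY L, N
""").fetchall()

print(len(union_rows), len(concatenated_rows))
```

The seven-row result matches the union shown in Table III; the eight-row result is what a concatenation produces before duplicates are removed.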
TABLE-A
L   N   X
A   1   x
A   2   x
B   1   x
B   2   x
TABLE-B
L   N   Y
B   2   y
B   3   y
C   2   y
C   3   y
TABLE-A UNION TABLE-B
L   N   X   Y
A   1   x
A   2   x
B   1   x
B   2   x   y
B   3       y
C   2       y
C   3       y
Table V: Table-A, Table-B and their union. The simple tables illustrated in Table III have each been
augmented with one column (X, Y) not present in the other table. The code for creating the union table is
presented in Table VI. Notice that the simple SAS SQL UNION of Table IV cannot produce such a union,
because it matches columns by position rather than by name.
BASE SAS CODE FOR UNION
/* Empty table (0 rows) that declares all four columns */
DATA D.Dummy;
   INPUT L $ N $ X $ Y $;
   CARDS;
;
RUN;
/* Append each table; columns missing from the appended table are left empty */
PROC APPEND BASE=D.Dummy DATA=D.TableA FORCE; RUN;
PROC APPEND BASE=D.Dummy DATA=D.TableB FORCE; RUN;
ORACLE SQL CODE FOR UNION
CREATE TABLE Union_A_B AS
SELECT L, N, X, to_char(NULL) AS Y FROM D.TableA
UNION
SELECT L, N, to_char(NULL) AS X, Y FROM D.TableB;
Table VI: Code for creating the union of Table-A and Table-B. We assume all tables lie in a directory
with libname, D. TableA and TableB are the filenames for Table-A and Table-B. Union_A_B is the
filename for the union table. The simple SAS SQL UNION of Table IV does not create this union, because it
matches columns by position. To create the union in BASE SAS we use the proc append procedure with the
force option, after first creating an initial dummy table (with 0 rows) that has all four columns; the
appended rows accumulate in that table, D.Dummy. Oracle SQL creates the union by supplying each missing
column as to_char(NULL).
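The placeholder technique of the Oracle code carries over to other SQL dialects. The following is a minimal sketch assuming Python with the standard sqlite3 module and the data of Table V; a bare NULL stands in for to_char(NULL), which is an assumption about what SQLite accepts rather than a statement about Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (L TEXT, N INTEGER, X TEXT)")
conn.execute("CREATE TABLE TableB (L TEXT, N INTEGER, Y TEXT)")
conn.executemany("INSERT INTO TableA VALUES (?, ?, ?)",
                 [("A", 1, "x"), ("A", 2, "x"), ("B", 1, "x"), ("B", 2, "x")])
conn.executemany("INSERT INTO TableB VALUES (?, ?, ?)",
                 [("B", 2, "y"), ("B", 3, "y"), ("C", 2, "y"), ("C", 3, "y")])

# Each SELECT supplies the column it lacks as NULL, so both operands
# have the same four columns in the same order: L, N, X, Y.
union_rows = conn.execute("""
    SELECT L, N, X, NULL AS Y FROM TableA
    UNION
    SELECT L, N, NULL AS X, Y FROM TableB
    ORDER BY L, N
""").fetchall()

for row in union_rows:
    print(row)
```

Note that in this sketch the key <B,2> yields two rows, one carrying x and one carrying y, since the two rows differ and UNION only removes exact duplicates.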
EXAMPLE – AGGREGATE FUNCTIONS
I have never seen the division operation used in SQL. Most programmers don’t know the precise
definition. In practice however, one uses the division concept when using the aggregate functions.
It is important to understand the difference between a function and an aggregate function. Functions
like sum, concatenate and substring operate on data elements in a particular record. By contrast,
aggregate functions operate – that is, use as arguments – all records (typically all values in a particular
column). In using these functions, we group by certain common elements (and it is in this sense that an
aggregate function is a particular implementation of a division).
Originally, SQL had five aggregate functions – average, sum, max, min, count. However, both SAS SQL
and BASE SAS have adapted to broader needs. A complete list of aggregate functions available in SAS
SQL may be found in Table 2.6 of the SAS 9.1 SQL Procedure User’s Guide. (While almost all SAS functions can be
used in SQL, they cannot necessarily be used as aggregate functions.)
BASE SAS also implements the aggregate functions with the proc means procedure. The aggregate
functions available in proc means may be found in chapter 33 of the BASE SAS 9.2 procedure guide.
AVERAGES: BASE SAS – PROC MEANS
/* BY-group processing requires sorted input */
PROC SORT DATA=D.Table; BY gender agegroup; RUN;
PROC MEANS DATA=D.Table;
   VAR salary;
   BY gender agegroup;
   OUTPUT OUT=D.Average MEAN=;   /* the statistic keyword is MEAN= */
RUN;
AVERAGES: SQL SAS
PROC SQL;
   CREATE TABLE D.Average AS
   SELECT gender, agegroup, AVG(salary) AS avg_salary
   FROM D.Table
   GROUP BY gender, agegroup;
QUIT;
Table VII: SQL SAS and BASE SAS approaches to taking an average. We assume data lies in a table
named Table lying in a directory with SAS library name, D. The query seeks average salary by gender and
agegroup. The correspondences between the two approaches are transparent. For example, BASE SAS
uses the BY statement to indicate the categories averaged over while SQL SAS uses the GROUP BY
clause.
A typical proc means and the corresponding SQL code using aggregate functions are presented in Table
VII. Note some important differences:
• BASE SAS requires an initial sorting.
• In this example both BASE and SQL SAS have the aggregate function, average, available. Certain
functions however are not simultaneously available in both BASE and SQL SAS.
• We have not focused in this paper on the extra capacities of BASE SAS over SQL SAS. For
example, the aggregate summaries can be neatly displayed using a variety of tables not available
in SQL SAS. BASE SAS also provides many other statistics indicated by the _TYPE_ variable.
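The GROUP BY form of the average needs no prior sort, which can be checked on unsorted toy data. The following is a minimal sketch assuming Python with the standard sqlite3 module; the table name Staff and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (gender TEXT, agegroup TEXT, salary REAL)")

# Rows are deliberately inserted out of order (illustrative data).
conn.executemany("INSERT INTO Staff VALUES (?, ?, ?)", [
    ("M", "young", 150.0),
    ("F", "young", 100.0),
    ("F", "old", 300.0),
    ("F", "young", 200.0),
])

# AGGREGATION: one output row per (gender, agegroup) combination.
averages = conn.execute("""
    SELECT gender, agegroup, AVG(salary) AS avg_salary
    FROM Staff
    GROUP BY gender, agegroup
    ORDER BY gender, agegroup
""").fetchall()

for row in averages:
    print(row)
```

The ORDER BY clause here only fixes the order of the printed result; the grouping itself does not depend on it.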
EXAMPLE – JOIN
The join is the heart of database theory. SQL SAS implements the join directly; BASE SAS has no join
statement and must emulate one, for example with a DATA step MERGE.
Furthermore, it is the join operation that facilitates breaking up a giant database into small manageable
tables which can be joined to produce a myriad of views. Illustrations of the four types of SQL joins are
presented in Table VIII, using Table-A and Table-B of Table V. The table caption contains definitions.
Table-A RIGHT JOIN Table-B
L   N   X   Y
B   2   x   y
B   3       y
C   2       y
C   3       y
Table-A INNER JOIN Table-B
L   N   X   Y
B   2   x   y
Table-A LEFT JOIN Table-B
L   N   X   Y
A   1   x
A   2   x
B   1   x
B   2   x   y
Table-A FULL JOIN Table-B
L   N   X   Y
A   1   x
A   2   x
B   1   x
B   2   x   y
B   3       y
C   2       y
C   3       y
Table VIII: Illustration of the four possible SQL joins. Table-A and Table-B are presented in Table V
above. The FULL JOIN is sometimes called the OUTER JOIN. The word JOIN by itself, in SAS SQL, is
interpreted as INNER JOIN. The definitions of these terms should be clear from the examples: a) Inner
join takes all records with shared L and N values in the two tables (and adds extra columns); b) Right Join
takes all rows from Table-B and adds Table-A columns where appropriate; c) Left join takes all rows from
Table-A and adds Table-B columns where appropriate; d) Full join takes all records from both tables.
Even though Table-A and Table-B are relatively “simple”, the code for these four types of joins is not entirely
straightforward. The code is presented in Table IX. Notice that the SQL coalesce function is needed for
the full join. Failing to use it produces the full join table of Table VIII with missing L and N values in the rows
<B,3>, <C,2>, <C,3> (the “y” values are retained, but those rows have no visible key).
SQL SAS CODE FOR INNER JOIN
PROC SQL;
   /* Keep only rows whose <L,N> key appears in both tables */
   CREATE TABLE D.Inner AS
   SELECT t1.*, t2.Y
   FROM D.TableA AS t1 INNER JOIN D.TableB AS t2
   ON t1.L=t2.L AND t1.N=t2.N;
QUIT;
SQL SAS CODE FOR LEFT JOIN
PROC SQL;
   /* Keep every row of TableA; Y is missing where TableB has no match */
   CREATE TABLE D.Left AS
   SELECT t1.*, t2.Y
   FROM D.TableA AS t1 LEFT JOIN D.TableB AS t2
   ON t1.L=t2.L AND t1.N=t2.N;
QUIT;
SQL SAS CODE FOR FULL JOIN
PROC SQL;
   /* Keep every row of both tables; coalesce takes the key from whichever table has it */
   CREATE TABLE D.Full AS
   SELECT coalesce(t1.L,t2.L) AS L,
          coalesce(t1.N,t2.N) AS N, t1.X, t2.Y
   FROM D.TableA AS t1 FULL JOIN D.TableB AS t2
   ON t1.L=t2.L AND t1.N=t2.N;
QUIT;
Table IX: SAS SQL code for the inner, left and full join. The code for the right join is symmetrical to the
code for the left join and is omitted for reasons of space. The SAS SQL code for the inner join also works
if the word inner is deleted. Notice that the code for the full join requires use of the SQL coalesce
function. Without use of this function one obtains the full join table of Table VIII without the values
<B,3>,<C,2>,<C,3> in columns L,N (though the y values are retained). The example shows the necessity
of using caution in writing code for joins even if the table structure is relatively simple.
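The effect of omitting coalesce can be reproduced on the unmatched Table-B rows alone. The following is a minimal sketch assuming Python with the standard sqlite3 module and the data of Table V. It uses a LEFT JOIN that starts from TableB (equivalent to Table-A RIGHT JOIN Table-B), because older SQLite versions lack FULL JOIN; the rows in question behave the same way in a full join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (L TEXT, N INTEGER, X TEXT)")
conn.execute("CREATE TABLE TableB (L TEXT, N INTEGER, Y TEXT)")
conn.executemany("INSERT INTO TableA VALUES (?, ?, ?)",
                 [("A", 1, "x"), ("A", 2, "x"), ("B", 1, "x"), ("B", 2, "x")])
conn.executemany("INSERT INTO TableB VALUES (?, ?, ?)",
                 [("B", 2, "y"), ("B", 3, "y"), ("C", 2, "y"), ("C", 3, "y")])

# Keys taken only from TableA: rows of TableB without a match lose L and N.
without_coalesce = conn.execute("""
    SELECT a.L, a.N, a.X, b.Y
    FROM TableB AS b LEFT JOIN TableA AS a
      ON a.L = b.L AND a.N = b.N
    ORDER BY b.L, b.N
""").fetchall()

# COALESCE takes the key from whichever table supplies it.
with_coalesce = conn.execute("""
    SELECT COALESCE(a.L, b.L) AS L, COALESCE(a.N, b.N) AS N, a.X, b.Y
    FROM TableB AS b LEFT JOIN TableA AS a
      ON a.L = b.L AND a.N = b.N
    ORDER BY b.L, b.N
""").fetchall()

print(without_coalesce)
print(with_coalesce)
```

In the first result the three unmatched rows keep their y value but have empty keys, which is the behavior described in the caption of Table IX.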
EXAMPLE – EXCEL
At this point it might be worthwhile to study the applicability of the ideas of this paper in a Microsoft
Excel setting. Although Excel does allow connection with a local SQL server, typically the Excel user,
whether using spreadsheet functions or Visual Basic, does not have access to SQL. Nevertheless,
the SQL concepts are frequently implementable in Excel. Let us briefly review how the five major
concepts are implemented.
PROJECTION: If you don’t want some columns in Excel you click, drag and delete. True, this is manual, but
it is still an implementation of an SQL concept. If your database is small (under 10,000 records) this may
be an ideal implementation. If you periodically process the database the same way you can record the
manual deletions with a Visual Basic macro.
SELECTION: Selections can be implemented in Excel using a visual basic macro that selects rows
according to some Boolean criteria. It is interesting that the WHERE clause, or its equivalent – Boolean
criteria – is almost the same in BASE SAS, SQL SAS and EXCEL.
UNION: Consider the example presented in Tables V and VI. To implement this in Excel, one would first
add columns – X and Y – to both tables (by click and insert). Then one would click, drag and paste to
union the two tables. Again, with some simple modifications this can be recorded with Visual Basic.
AGGREGATE FUNCTIONS: Excel has certain built-in aggregate functions. For example the SUMIF and
COUNTIF functions can be used to obtain averages. Unlike BASE SAS and SQL SAS you can make these
averages “part” of the table (that is, an additional column). Alternatively, you can place a summary table
near the original table for a quick comparison. Excel also has an outline feature as part of its interface.
You can (after sorting first) use the subtotal and group features to perform aggregate functions and then
click on outlines to drill down and see individual items. Furthermore, because Excel has a RANGE object,
any function that excel uses can be made into an aggregate function. Consequently, for small databases
(say under 10,000 records), Excel may be superior for database manipulation.
JOIN: I recently was performing a project where I had to implement a full join on four quarters of data in
order to obtain all users from any quarter and the data available on them. Although joins can be
implemented in SQL SAS – and joins on multiple tables can be implemented with nested joins – as
indicated in the previous section, join code can be tricky and improper code can lead to unexpected
errors. I wanted to be able to debug quickly any missing data elements.
I used the algorithm presented in Table X. Notice that the algorithm possesses the desired
characteristics of readability, time efficiency and ease of debugging (the names sheet records the names from each
quarter sheet, allowing easy debugging if something is missing).
The interesting feature of this example is that although join is a characteristic SQL function, and Excel is
not a database language, it was quite easy to implement a full join on multiple tables. Corresponding
code in SQL SAS (or BASE SAS) would be less readable. This example shows the power of utilizing SQL as
a set of constructs.
IMPLEMENTATION OF AN OUTER JOIN ON MULTIPLE TABLES IN EXCEL
For each quarterly sheet,
    Cut and paste all names of that quarter to a sheet, names
After all four quarters are processed, then
    Sort the names on sheet names
    Remove duplicate names
Data from the 4 quarter sheets are joined to the name sheet
    Data is joined using the Excel VLOOKUP function
    NOTE: Since VLOOKUP works on single columns, first and last names are combined.
    For example, “Russell Hendel” becomes “Russell!Hendel”
Table X: Implementation of a full join on multiple tables in Microsoft Excel. The above procedure,
when done the first time manually, can be recorded as a Visual Basic macro and with minor
modifications used in future quarters. For further discussion see the main paper.
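The steps of Table X translate almost line by line into a general-purpose language. The following is a minimal sketch in Python, not the Excel procedure itself: the quarterly data, the helper make_key and the dictionary lookup standing in for VLOOKUP are all illustrative assumptions.

```python
# Hypothetical quarterly sheets: (first name, last name) -> value for that quarter.
quarters = {
    "Q1": {("Russell", "Hendel"): 10, ("Ann", "Lee"): 7},
    "Q2": {("Ann", "Lee"): 8},
    "Q3": {("Bo", "Chan"): 5},
    "Q4": {("Russell", "Hendel"): 12, ("Bo", "Chan"): 6},
}


def make_key(first, last):
    """Combine first and last name into one lookup column, as in Table X."""
    return first + "!" + last


# Steps 1-2: collect the names of every quarter, remove duplicates, sort.
names = sorted({make_key(first, last)
                for sheet in quarters.values()
                for (first, last) in sheet})

# Step 3: one lookup table per quarter, keyed by the combined name.
lookup = {
    quarter: {make_key(first, last): value
              for (first, last), value in sheet.items()}
    for quarter, sheet in quarters.items()
}

# Full join: every name gets one row; quarters without data yield None.
joined = [[name] + [lookup[quarter].get(name) for quarter in quarters]
          for name in names]

for row in joined:
    print(row)
```

A None in a row shows at a glance which quarter lacks data for that person, which mirrors the debugging benefit described for the names sheet.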
OPTIMIZATION
Optimization refers to the speed of the program run. An optimized program runs faster than a non-optimized program. Every implementation of SQL has internal optimization techniques. Additionally, a
programmer should be aware of rules of thumb which will optimize programs. These rules of thumb
should be used whether the programmer implements the code in SQL or some other language.
Optimization is a large topic. We briefly mention two very useful rules. The first rule is that projections
and selections should be done prior to joins and aggregates.
The second rule is that aggregates should typically be done prior to joins. For example, suppose you
were a University and had one table storing personal information on students such as addresses, emails,
phone numbers, billing status etc. Suppose you have tables with grades and wish to compute some
averages and notify students who are failing.
You could join all tables and then compute the average. But this would be clumsy, as the aggregate
functions would be working on a bigger table. It is simplest to compute the aggregate average first for each
student and then join the per-student averages with the personal information.
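The second rule can be written out for the university example. The following is a minimal sketch assuming Python with the standard sqlite3 module; the table names, columns, sample rows and the failing threshold of 60 are hypothetical. Both queries return the same students; the first aggregates the narrow grades table before touching the wide personal-information table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (student_id INTEGER, name TEXT, email TEXT)")
conn.execute("CREATE TABLE Grades (student_id INTEGER, course TEXT, grade REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", [
    (1, "Ann", "ann@example.edu"),
    (2, "Bo", "bo@example.edu"),
])
conn.executemany("INSERT INTO Grades VALUES (?, ?, ?)", [
    (1, "Math", 90.0), (1, "Art", 80.0),
    (2, "Math", 50.0), (2, "Art", 40.0),
])

# Rule 2: aggregate first (one row per student), then join.
aggregate_then_join = conn.execute("""
    SELECT s.name, s.email, g.avg_grade
    FROM (SELECT student_id, AVG(grade) AS avg_grade
          FROM Grades
          GROUP BY student_id) AS g
    JOIN Students AS s ON s.student_id = g.student_id
    WHERE g.avg_grade < 60
    ORDER BY s.name
""").fetchall()

# The clumsier alternative: join everything, then aggregate the wide rows.
join_then_aggregate = conn.execute("""
    SELECT s.name, s.email, AVG(g.grade) AS avg_grade
    FROM Students AS s
    JOIN Grades AS g ON g.student_id = s.student_id
    GROUP BY s.student_id, s.name, s.email
    HAVING AVG(g.grade) < 60
    ORDER BY s.name
""").fetchall()

print(aggregate_then_join)
```

Whether the first form is actually faster depends on the optimizer of the SQL implementation at hand, so the rule is best treated as a starting point to be checked on real data.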
The importance of optimization should be emphasized. If your database is small, say under 10,000
records, almost any code you write will work efficiently. But if your database is moderate (say
1,000,000+ records) optimization techniques may make a real difference in time efficiency.
CONCLUDING REMARKS
In this paper, we have reviewed several instances of implementations of SQL concepts in a variety of
languages. Although SAS SQL suffices for most database operations, we have seen several instances in
which BASE SAS, ORACLE and even EXCEL code is more readable or more efficient. We have also seen
an instance where the simple SAS SQL code did not implement a concept directly but BASE SAS did. We believe
the approach of this paper is useful to database programmers who wish to optimize readability and
efficiency.
REFERENCES
Christopher J. Date, An Introduction To Database Systems, 8th Edition, Addison Wesley, 2004
SAS 9.1 SQL Procedure User’s Guide,
http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/base_sqlproc_6992.pdf
BASE SAS 9.2 Procedure Guide
http://support.sas.com/documentation/cdl/en/proc/61895/PDF/default/proc.pdf
CONTACT
Russell Jay Hendel
7500 Security Boulevard
Baltimore, MD 21244
Phone: 410 786 0329
Email: [email protected]
Russell Jay Hendel
Dept of Mathematics
Room 316
7800 York Road
Towson, MD 21252
Phone: 410 704 3091
Email: [email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.