SQL Concepts vs. SQL Code: Improving Programs using the 8 SQL Concepts

ABSTRACT

SQL is a well-known standard for writing programs manipulating databases. Consequently, SQL is a part of several programming languages – e.g. Oracle, SAS – useful for database manipulation. Surprisingly however, the SQL language uses only eight programming concepts, and only five of these are frequently used. Therefore, an understanding of SQL as programming concepts vs. programming code can significantly improve database programming. In fact, this approach is also useful to improve code in languages without SQL proper, such as Microsoft Excel. In this paper, we explore several illustrative examples of improving code through knowledge of the five major SQL concepts. Upon reading this paper, a person will feel comfortable in modifying or designing database programs with improvements in readability and efficiency.

INTRODUCTION

This paper is most useful to programmers who are modestly familiar with database programming. However, for purposes of completeness, we lightly review database theory and the eight SQL concepts. Upon reading this paper, a person will feel comfortable in designing or modifying existing database programs with improvements in readability and efficiency.

In the first half of the 20th century, many private and government companies in a variety of nations were each developing programming methods to deal with their specific databases. A critical turning point came when mathematicians proved that the basic programming constructs in any particular language were implementable in any other language. In other words, the mathematicians proved that these diverse database languages shared commonality. This led to the definition of universal concepts – database system, database tables and database views – as well as SQL, structured query language, which captured the commonality of programming procedures. Let us briefly review these ideas.
A database may be thought of as a collection of database tables. Each database table may be thought of as rectangularly arranged data. The rows of each table heuristically correspond to entities – people, Medicare claims, airline trips, corporate transactions etc. The columns of each table heuristically correspond to attributes of these entities. The database tables may share common columns, common rows or have relations between columns and rows of several tables. Hence the name relational database.

The programmer has one goal, which manifests itself in a diversity of ways. The programmer is interested in views – that is, the programmer wishes to construct new tables using the rows and columns of the existing tables. Some simple examples might be: finding which items with more than average purchase frequency are in stock; finding which customers have outstanding payments for more than thirty days; finding which airlines connect between two cities. Here, we are using the term view broadly to include either a view (with no query or table creation) or a created query or table containing the desired result.

A remarkable discovery emerged from the review of the diverse database attempts of the first half of the 20th century: only eight concepts, five of which are frequently used, are needed to construct any view. To appreciate how remarkable this discovery is, simply visit the online help of familiar programming languages – Oracle, SAS, Excel, and Access. Each such language has many procedures, diverse keywords, and many functions. Nevertheless, every SQL program to extract a view may be programmed using only the five keywords and concepts enumerated in Table I.
    SELECTION                           Select only certain rows meeting certain criteria
    PROJECTION                          Select only certain columns
    UNION, INTERSECTION (DIFFERENCE)    Take all rows belonging to any of (all of) several tables
    JOIN (PRODUCT)                      Join columns of tables by linking on rows describing common entities
    AGGREGATION (DIVISION)              Numerically summarize (e.g. average) rows with common elements

Table I: The eight SQL concepts grouped into five categories. Note that projection and products are implemented in SQL using the comma operator and without using keywords. There are five main aggregate functions, implemented using specific keywords – sum, average, count, max, min. Although product is a distinct SQL concept, in practice one only sees products when implementing joins, which are products followed by a selection (of rows with common entities). We have grouped the three set-theoretic operations – union, intersection and difference. For example, although union and intersection are distinct, they are both set-theoretic operations on rows whose implementation differs in one keyword.

In the remainder of the paper, we explore illustrative examples, improving code through application of concepts. SQL code is presented in SAS SQL, version 9.1. An alternative would be to use MySQL server code. However, there are variations within MySQL server code. Furthermore, not all SAS users have access to MySQL. Finally, our goal is to illustrate how knowledge of SQL concepts improves programming readability and efficiency, and for this purpose SAS SQL is sufficient.

EXAMPLE – PROJECTION

We start with a simple example. Suppose your database table has ten columns which for convenience we will call C1, C2, C3, …, C10. Suppose in your particular view you don’t need columns 4 and 7. We present the SAS SQL and BASE SAS code to achieve this in Table II. All three approaches in Table II implement the same SQL concept, projection.
However, BASE SAS APPROACH #1 is more readable, using only two column names instead of eight to achieve its goals.

SAS SQL APPROACH

    PROC SQL;
        CREATE TABLE D.EightColumn AS
        SELECT C1, C2, C3, C5, C6, C8, C9, C10
        FROM D.Table;
    QUIT; RUN;

BASE SAS APPROACH #1

    DATA D.EightColumn;
        SET D.Table;
        DROP C4 C7;
    RUN;

BASE SAS APPROACH #2

    DATA D.EightColumn;
        SET D.Table;
        KEEP C1 C2 C3 C5 C6 C8 C9 C10;
    RUN;

Table II: Three approaches implementing projection, the selection of particular columns. We assume that D is the library name for the directory storing the tables. We further assume that our initial table has filename Table and the result of the view inquiry is stored in a table, EightColumn. Notice how BASE SAS APPROACH #1 is more readable than BASE SAS APPROACH #2 and the SAS SQL APPROACH.

EXAMPLE – UNION

A simple example of union is presented in Table IV, which performs a union on the two tables presented in Table III. For convenience, the desired goal, the union view, is also presented in Table III. The SAS code is a trifle more compact; timewise both codes are about equal. So far so good.

However, if we modify Table-A and Table-B by adding to each just one column not present in the other table, the situation becomes complicated. The modified tables are presented in Table V. The code to produce a union of the two tables is presented in Table VI. Notice that SAS SQL code is not present in Table VI. In fact, there is no easy way in SAS SQL to perform a simple union of the two tables! Oracle remedies this SQL deficiency by introducing the to_char function. BASE SAS remedies this SQL deficiency with the proc append procedure using the force option. It is important to emphasize, in the context of this paper, how proc append is perceived. It is not perceived as a SAS procedure per se; rather it is perceived as a SAS implementation of a SQL concept.
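The union idea discussed above can also be tried outside SAS. The following is a minimal sketch, not the paper's code: it assumes Python's built-in sqlite3 module as a stand-in for SAS SQL, and small in-memory tables TableA and TableB holding the L, N, X and Y values used in the paper's union examples. It shows a plain UNION on the shared columns, followed by a union of the mismatched tables in which each SELECT pads the column it lacks with NULL (the same idea as the Oracle to_char(NULL) remedy).

```python
import sqlite3

# Assumption: SQLite (Python standard library) stands in for SAS SQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (L TEXT, N INTEGER, X TEXT);
    CREATE TABLE TableB (L TEXT, N INTEGER, Y TEXT);
    INSERT INTO TableA VALUES ('A', 1, 'x'), ('A', 2, 'x'), ('B', 1, 'x'), ('B', 2, 'x');
    INSERT INTO TableB VALUES ('B', 2, 'y'), ('B', 3, 'y'), ('C', 2, 'y'), ('C', 3, 'y');
""")

# Union on the shared columns only: the duplicate row (B, 2) is kept once.
simple_union = conn.execute("""
    SELECT L, N FROM TableA
    UNION
    SELECT L, N FROM TableB
    ORDER BY L, N
""").fetchall()

# Union of the mismatched tables: each SELECT supplies NULL
# for the column that its own table does not have.
padded_union = conn.execute("""
    SELECT L, N, X, NULL AS Y FROM TableA
    UNION
    SELECT L, N, NULL AS X, Y FROM TableB
    ORDER BY L, N
""").fetchall()

for row in simple_union:
    print(row)
for row in padded_union:
    print(row)
```

In this sketch the padded union stacks rows rather than merging them, so the key (B, 2) contributes one row from each table; merging the X and Y values of a shared key into a single row is the job of a join.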
    TABLE-A
    L  N
    A  1
    A  2
    B  1
    B  2

    TABLE-B
    L  N
    B  2
    B  3
    C  2
    C  3

    TABLE-A UNION TABLE-B
    L  N
    A  1
    A  2
    B  1
    B  2
    B  3
    C  2
    C  3

Table III: Table-A, Table-B and their union. The code for creating the union is presented in Table IV. “L” and “N” stand for Letter and Number and could in a more realistic example represent a category (L) and its components (N).

SAS SQL CODE FOR UNION

    PROC SQL;
        CREATE TABLE D.Union_A_B AS
        SELECT * FROM D.TableA
        UNION
        SELECT * FROM D.TableB;
    QUIT; RUN;

SAS CODE FOR UNION WITHOUT SQL

    DATA D.Union_A_B;
        SET D.TableA D.TableB;
    RUN;

Table IV: Code for creating the union of Table-A and Table-B. We assume all tables lie in a directory with SAS libname, D. TableA and TableB are the SAS filenames for Table-A and Table-B. Union_A_B is the SAS filename for the union table. As can be seen, the SAS code is a little more compact. Timewise the two codes are equal.

    TABLE-A
    L  N  X
    A  1  x
    A  2  x
    B  1  x
    B  2  x

    TABLE-B
    L  N  Y
    B  2  y
    B  3  y
    C  2  y
    C  3  y

    TABLE-A UNION TABLE-B
    L  N  X  Y
    A  1  x
    A  2  x
    B  1  x
    B  2  x  y
    B  3     y
    C  2     y
    C  3     y

Table V: Table-A, Table-B and their union. The simple tables illustrated in Table III have each been augmented with one column (X, Y) not present in the other table. The code for creating the union table is presented in Table VI. Notice that no SAS SQL code can implement such a union.

BASE SAS CODE FOR UNION

    DATA D.Dummy;
        INPUT L $ N $ X $ Y $;
        CARDS;
    ;
    RUN;
    PROC APPEND BASE=D.Dummy DATA=D.TableA FORCE; RUN;
    PROC APPEND BASE=D.Dummy DATA=D.TableB FORCE; RUN;

ORACLE SQL CODE FOR UNION

    CREATE TABLE Union_A_B AS
    SELECT L, N, X, to_char(NULL) AS Y FROM D.TableA
    UNION
    SELECT L, N, to_char(NULL) AS X, Y FROM D.TableB;

Table VI: Code for creating the union of Table-A and Table-B. We assume all tables lie in a directory with libname, D. TableA and TableB are the filenames for Table-A and Table-B. Union_A_B is the filename for the union table. As can be seen there is no easy SAS SQL code that will create a union.
To create a union in BASE SAS requires using the proc append procedure with the force option. We also must create an initial dummy table (with 0 rows) that has all four columns. Oracle SQL allows the creation of the union using the special Oracle to_char(NULL) function.

EXAMPLE – AGGREGATE FUNCTIONS

I have never seen the division operation used in SQL. Most programmers don’t know the precise definition. In practice however, one uses the division concept when using the aggregate functions. It is important to understand the difference between a function and an aggregate function. Functions like sum, concatenate and substring operate on data elements in a particular record. By contrast, aggregate functions operate on – that is, use as arguments – all records (typically all values in a particular column). In using these functions, we group by certain common elements (and it is in this sense that an aggregate function is a particular implementation of a division).

Originally, SQL had five aggregate functions – average, sum, max, min, count. However, both SAS SQL and BASE SAS have adapted to broader needs. A complete list of aggregate functions available in SAS SQL may be found in table 2.6 of the SAS 9.1 SQL procedure guide (while almost all SAS functions can be used in SQL, they cannot necessarily be used as aggregate functions). BASE SAS also implements the aggregate functions with the proc means procedure. The aggregate functions available in proc means may be found in chapter 33 of the BASE SAS 9.2 procedure guide.

AVERAGES: BASE SAS – PROC MEANS

    PROC SORT DATA=D.Table;
        BY gender agegroup;
    RUN;
    PROC MEANS DATA=D.Table;
        VAR salary;
        BY gender agegroup;
        OUTPUT OUT=D.Average MEAN=;
    RUN;

AVERAGES: SQL SAS

    PROC SQL;
        CREATE TABLE D.Average AS
        SELECT gender, agegroup, AVG(salary)
        FROM D.Table
        GROUP BY gender, agegroup;
    QUIT; RUN;

Table VII: SQL SAS and BASE SAS approaches to taking an average.
We assume data lies in a table named Table lying in a directory with SAS library name, D. The query seeks average salary by gender and agegroup. The correspondences between the two approaches are transparent. For example, BASE SAS uses the BY statement to indicate the categories averaged over while SQL SAS uses the GROUP BY clause.

A typical proc means and the corresponding SQL code using aggregate functions are presented in Table VII. Note some important differences:
• BASE SAS requires an initial sorting.
• In this example both BASE and SQL SAS have the aggregate function, average, available. Certain functions however are not simultaneously available in both BASE and SQL SAS.
• We have not focused in this paper on the extra capacities of BASE SAS over SQL SAS. For example, the aggregate summaries can be neatly displayed in a variety of tables not available in SQL SAS. BASE SAS also provides many other statistics indicated by the _TYPE_ variable.

EXAMPLE – JOIN

The join is the heart of database theory. Only SQL SAS can implement a join; BASE SAS cannot. Furthermore, it is the join operation that facilitates breaking up a giant database into small manageable tables which can be joined to produce a myriad of views. Illustrations of the four types of SQL joins are presented in Table VIII, using Table-A and Table-B of Table V. The table caption contains definitions.

    Table-A RIGHT JOIN Table-B
    L  N  X  Y
    B  2  x  y
    B  3     y
    C  2     y
    C  3     y

    Table-A INNER JOIN Table-B
    L  N  X  Y
    B  2  x  y

    Table-A LEFT JOIN Table-B
    L  N  X  Y
    A  1  x
    A  2  x
    B  1  x
    B  2  x  y

    Table-A FULL JOIN Table-B
    L  N  X  Y
    A  1  x
    A  2  x
    B  1  x
    B  2  x  y
    B  3     y
    C  2     y
    C  3     y

Table VIII: Illustration of the four possible SQL joins. Table-A and Table-B are presented in Table V above. The FULL JOIN is sometimes called the OUTER JOIN. The word JOIN by itself, in SAS SQL, is interpreted as INNER JOIN.
The definitions of these terms should be clear from the examples: a) Inner join takes all records with shared L and N values in the two tables (and adds extra columns); b) Right join takes all rows from Table-B and adds Table-A columns where appropriate; c) Left join takes all rows from Table-A and adds Table-B columns where appropriate; d) Full join takes all records from both tables.

Even though Table-A and Table-B are relatively “simple”, the code for these four types of joins is not all straightforward. The code is presented in Table IX. Notice that the SQL coalesce function is needed for the full join. Failing to use it produces the full join table of Table VIII without the <L,N> values <B,3>, <C,2>, <C,3> (but the “y” values are retained with invisible row headers).

SQL SAS CODE FOR INNER JOIN

    PROC SQL;
        CREATE TABLE D.Inner AS
        SELECT t1.*, t2.Y
        FROM d.TableA AS t1 INNER JOIN d.TableB AS t2
        ON t1.L=t2.L AND t1.N=t2.N;
    QUIT; RUN;

SQL SAS CODE FOR LEFT JOIN

    PROC SQL;
        CREATE TABLE D.Left AS
        SELECT t1.*, t2.Y
        FROM d.TableA AS t1 LEFT JOIN d.TableB AS t2
        ON t1.L=t2.L AND t1.N=t2.N;
    QUIT; RUN;

SQL SAS CODE FOR FULL JOIN

    PROC SQL;
        CREATE TABLE D.Full AS
        SELECT coalesce(t1.L,t2.L) AS L, coalesce(t1.N,t2.N) AS N, t1.X, t2.Y
        FROM d.TableA AS t1 FULL JOIN d.TableB AS t2
        ON t1.L=t2.L AND t1.N=t2.N;
    QUIT; RUN;

Table IX: SAS SQL code for the inner, left and full join. The code for the right join is symmetrical to the code for the left join and is omitted for reasons of space. The SAS SQL code for the inner join also works if the word inner is deleted. Notice that the code for the full join requires use of the SQL coalesce function. Without use of this function one obtains the full join table of Table VIII without the values <B,3>, <C,2>, <C,3> in columns L, N (though the y values are retained). The example shows the necessity of using caution in writing code for joins even if the table structure is relatively simple.
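The effect of the coalesce function on a full join can be reproduced in a small experiment. The sketch below is an illustration under stated assumptions, not the paper's code: it uses Python's built-in sqlite3 module in place of SAS SQL, rebuilds Table-A and Table-B of Table V in memory, and emulates the full join as a left join plus the unmatched Table-B rows, because older SQLite releases do not support FULL JOIN. The template name FULL_JOIN and the result names are hypothetical.

```python
import sqlite3

# Assumption: SQLite (Python standard library) stands in for SAS SQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (L TEXT, N INTEGER, X TEXT);
    CREATE TABLE TableB (L TEXT, N INTEGER, Y TEXT);
    INSERT INTO TableA VALUES ('A', 1, 'x'), ('A', 2, 'x'), ('B', 1, 'x'), ('B', 2, 'x');
    INSERT INTO TableB VALUES ('B', 2, 'y'), ('B', 3, 'y'), ('C', 2, 'y'), ('C', 3, 'y');
""")

# Full join emulated as: (TableA LEFT JOIN TableB)
# plus the TableB rows that have no match in TableA.
# {L} and {N} are placeholders for the key expressions being compared.
FULL_JOIN = """
    SELECT {L} AS L, {N} AS N, t1.X, t2.Y
    FROM TableA AS t1 LEFT JOIN TableB AS t2
      ON t1.L = t2.L AND t1.N = t2.N
    UNION ALL
    SELECT {L} AS L, {N} AS N, t1.X, t2.Y
    FROM TableB AS t2 LEFT JOIN TableA AS t1
      ON t1.L = t2.L AND t1.N = t2.N
    WHERE t1.L IS NULL
"""

# Keys taken from whichever table has them.
with_coalesce = sorted(conn.execute(
    FULL_JOIN.format(L="COALESCE(t1.L, t2.L)", N="COALESCE(t1.N, t2.N)")
).fetchall())

# Keys taken from TableA only: rows found only in TableB lose their L and N.
without_coalesce = conn.execute(
    FULL_JOIN.format(L="t1.L", N="t1.N")
).fetchall()

for row in with_coalesce:
    print(row)
print([row for row in without_coalesce if row[0] is None])
```

With coalesce, all seven key pairs of the full join in Table VIII appear; without it, the three rows that exist only in Table-B keep their y values but have empty L and N, which is the behavior the caption of Table IX warns about.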
EXAMPLE – EXCEL

At this point it might be worthwhile to study the applicability of the ideas of this paper in a Microsoft Excel setting. Although Excel does allow connection with a local SQL server, typically the Excel user, whether using spreadsheet functions or Visual Basic, does not have access to SQL. Nevertheless, the SQL concepts are frequently implementable in Excel. Let us briefly review how the major five concepts are implemented.

PROJECTION: If you don’t want some columns in Excel you click, drag and delete. True, this is manual, but it is still an implementation of an SQL concept. If your database is small (under 10,000 records) this may be an ideal implementation. If you periodically process the database the same way, you can record the manual deletions with a Visual Basic macro.

SELECTION: Selections can be implemented in Excel using a Visual Basic macro that selects rows according to some Boolean criteria. It is interesting that the WHERE clause, or its equivalent (Boolean criteria), is almost the same in BASE SAS, SQL SAS and Excel.

UNION: Consider the example presented in Tables V and VI. To implement this in Excel, one would first add columns X and Y to both tables (by click and insert). Then one would click, drag and paste to union the two tables. Again, with some simple modifications this can be recorded with Visual Basic.

AGGREGATE FUNCTIONS: Excel has certain built-in aggregate functions. For example, the SUMIF and COUNTIF functions can be used to obtain averages. Unlike BASE SAS and SQL SAS, you can make these averages “part” of the table (that is, an additional column). Alternatively, you can summarize a table near the original table for a quick comparison. Excel also has an outline feature as part of its interface. You can (after sorting first) use the subtotal and group features to perform aggregate functions and then click on outlines to drill down and see individual items.
Furthermore, because Excel has a RANGE object, any function that Excel uses can be made into an aggregate function. Consequently, for small databases (say under 10,000 records), Excel may be superior for database manipulation.

JOIN: I was recently performing a project where I had to implement a full join on four quarters of data in order to obtain all users from any quarter and the data available on them. Although joins can be implemented in SQL SAS, and joins on multiple tables can be implemented with nested joins, as indicated in the previous section join code can be tricky and improper code can lead to unexpected errors. I wanted to be able to debug quickly any missing data elements. I used the algorithm presented in Table X. Notice that the algorithm possesses the desired characteristics of readability, time efficiency and ease of debugging (the names sheet records names from each quarter sheet, allowing easy debugging if something is missing). The interesting feature of this example is that although join is a characteristic SQL function, and Excel is not a database language, it was quite easy to implement a full join on multiple tables. Corresponding code in SQL SAS (or BASE SAS) would be less readable. This example shows the power of utilizing SQL as a set of constructs.

IMPLEMENTATION OF AN OUTER JOIN ON MULTIPLE TABLES IN EXCEL

    For each quarterly sheet,
        Cut and paste all names of that quarter to a sheet, names
    After all four quarters are processed, then
        Sort the names on sheet names
        Remove duplicate names
    Data from the 4 quarter sheets are joined to the names sheet
        Data is joined using the Excel VLOOKUP function
    NOTE: Since VLOOKUP works on single columns, first and last names are combined.
          For example, “Russell Hendel” becomes “Russell!Hendel”

Table X: Implementation of a full join on multiple tables in Microsoft Excel.
The above procedure, when done the first time manually, can be recorded as a Visual Basic macro and with minor modifications used in future quarters. For further discussion see the main paper.

OPTIMIZATION

Optimization refers to the speed of the program run. An optimized program runs faster than a non-optimized program. Every implementation of SQL has internal optimization techniques. Additionally, a programmer should be aware of rules of thumb which will optimize programs. These rules of thumb should be used whether the programmer implements the code in SQL or some other language.

Optimization is a large topic. We briefly mention two very useful rules. The first rule is that projections and selections should be done prior to joins and aggregates. The second rule is that aggregates should typically be done prior to joins. For example, suppose you were a university and had one table storing personal information on students such as addresses, emails, phone numbers, billing status etc. Suppose you have tables with grades and wish to compute some averages and notify students who are failing. You could join all tables and then compute the average. But this would be clumsy, as the aggregate functions would be working on a bigger table. It is simplest to compute aggregate averages first for each student and then join the student-aggregate average with the personal information.

The importance of optimization should be emphasized. If your database is small, say under 10,000 records, almost any code you write will work efficiently. But if your database is moderately large (say 1,000,000+ records), optimization techniques may make a real difference in time efficiency.

CONCLUDING REMARKS

In this paper, we have reviewed several instances of implementations of SQL concepts in a variety of languages. Although SAS SQL suffices for all database operations, we have seen several instances in which BASE SAS, ORACLE and even EXCEL code are more readable or more efficient.
We have also seen several rare instances where SAS SQL could not implement a concept but BASE SAS could. We believe the approach of this paper is useful to database programmers who wish to optimize readability and efficiency.

REFERENCES

Christopher J. Date, An Introduction To Database Systems, 8th Edition, Addison Wesley, 2004
SAS 9.1 SQL Procedure User’s Guide, http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/base_sqlproc_6992.pdf
BASE SAS 9.2 Procedure Guide, http://support.sas.com/documentation/cdl/en/proc/61895/PDF/default/proc.pdf

CONTACT

Russell Jay Hendel
7500 Security Boulevard
Baltimore, MD 21244
Phone: 410 786 0329
Email: [email protected]

Russell Jay Hendel
Dept of Mathematics, Room 316
7800 York Road
Towson, MD 21252
Phone: 410 704 3091
Email: [email protected] [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
