Slide 1
© 2003 By Default!
Data Management Using R – Interfacing with the Structured Query Language
STAT 7550 – Statistical Computing
Utah State University
November 21, 2008
Bill Welbourn
A Free sample background from www.powerpointbackgrounds.com
Slide 2
Objectives of the Project
• Introduce the notion of the database.
• Database applications.
• General overview of the SQLite Relational Database Management System (RDBMS).
• Explain how R¹ (and other programming languages) interfaces with the SQLite RDBMS, and highlight the R commands for this interface, which are provided by the RSQLite library.
• A working example demonstrating the procedure for storing and retrieving an R data frame within a SQLite database.
• Further motivation for the use of the R-SQLite interface: working with “massive” databases.
¹ R Development Core Team (2008), Version 2.8.0.
Slide 3
What is a Database?
• Essentially a series of structured files on a computer that are organized in a highly efficient manner.
• The organization is hierarchical, from the “top down,” as shown in Figure 1 below.
Figure 1: The anatomy of a database (tree diagram: the database at the top contains tables; each table is made up of rows and columns; the columns are made up of fields)
Slide 4
Components of a Table
• As Figure 1 suggests, at the highest level, a database is composed of a series of tables.
• Each table is made up of a series of columns. Think of the columns as the characteristics (variables) collected for a study.
• Data are stored in the rows of the table, where each row is called a record. The records of a table are essentially synonymous with the observations of a study.
• The location where a row intersects a column is known as a field.
• Each table contains specific, common data. A table of a database is analogous to a worksheet within an Excel workbook.
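To make this vocabulary concrete, here is a minimal sketch that stores a small data frame as a table and then picks out its columns, one record, and one field. The table name `study` and its contents are made up for illustration, and an in-memory database is assumed so that nothing is written to disk.

```r
library(DBI)

# Hypothetical study data: 3 records (rows) and 3 columns (variables).
study <- data.frame(id = 1:3,
                    age = c(34, 51, 27),
                    grp = c("a", "b", "a"),
                    stringsAsFactors = FALSE)

# Assumption: an in-memory SQLite database, so no file is created.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "study", study, row.names = FALSE)

tables  <- dbListTables(con)             # the tables in the database
columns <- dbListFields(con, "study")    # the columns of one table

# One record: the row belonging to observation 2.
record <- dbGetQuery(con, "SELECT * FROM study WHERE id = 2")

# One field: where that row intersects the column "age".
field <- dbGetQuery(con, "SELECT age FROM study WHERE id = 2")$age

dbDisconnect(con)
```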
Slide 5
What is a Relational Database?
• It is a database composed of tables which relate to one another. The table relationships are based on “key fields.”
• To illustrate, consider two related tables within a database. One column in each table is given the same name, say “ID.” Each field within these columns holds a unique (key) label, so that a one-to-one mapping (relationship) between the tables is obtained.
Figure 2: Example of two relational tables

Table i
         ID       Column 2       …   Column k
Row 1    Key 1    Fieldi (1,2)   …   Fieldi (1,k)
Row 2    Key 2    Fieldi (2,2)   …   Fieldi (2,k)
…        …        …              …   …
Row n    Key n    Fieldi (n,2)   …   Fieldi (n,k)

Table j
         Column 1       ID       …   Column m
Row 1    Fieldj (1,1)   Key 1    …   Fieldj (1,m)
Row 2    Fieldj (2,1)   Key 2    …   Fieldj (2,m)
…        …              …        …   …
Row n    Fieldj (n,1)   Key n    …   Fieldj (n,m)
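As a rough illustration of how a shared key column ties two tables together, the sketch below builds two small tables with a common `ID` column and matches their rows with a SQL join. The table names, column names, and values are hypothetical, and an in-memory database is assumed.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Two hypothetical tables; the key column "ID" appears in both,
# in a different position and a different row order.
table_i <- data.frame(ID = c("k1", "k2", "k3"),
                      height = c(170, 182, 165),
                      stringsAsFactors = FALSE)
table_j <- data.frame(weight = c(82, 59, 70),
                      ID = c("k2", "k3", "k1"),
                      stringsAsFactors = FALSE)
dbWriteTable(con, "table_i", table_i, row.names = FALSE)
dbWriteTable(con, "table_j", table_j, row.names = FALSE)

# Rows are paired by key value, not by their position in each table.
joined <- dbGetQuery(con, "
  SELECT table_i.ID, table_i.height, table_j.weight
  FROM table_i
  JOIN table_j ON table_i.ID = table_j.ID
  ORDER BY table_i.ID")

dbDisconnect(con)
```

Because each key occurs exactly once in each table, every row of `table_i` is paired with exactly one row of `table_j`.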
Slide 6
Three Types of Relationships
One-to-One
• Each record in table one corresponds to exactly one record in table two, and vice versa.
• Example: More variables (columns) were collected in a study than can be stored in a single database table. Study participants (observations/records) are labeled with unique IDs, which allow the table-to-table relationship to be established.
One-to-Many
• A record in table one may have many corresponding records in table two, while each record in table two corresponds to a single record in table one.
• Example: Study participant identifiers (unique IDs) are stored in table one, while repeated measurements are stored in table two.
Many-to-Many
• Like the one-to-many relationship, a record in table one may have many corresponding records in table two. However, unlike the one-to-many relationship, a record in table two may also have many corresponding records in table one.
• Example: Customer product orders. Each order can contain multiple products, and one product can be in many orders.
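The one-to-many case can be sketched along the lines of the repeated-measurements example: one row per participant in the first table, several rows per participant in the second. All names and values below are invented for illustration.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Table one: a single record per participant (hypothetical data).
participants <- data.frame(id = 1:2,
                           sex = c("F", "M"),
                           stringsAsFactors = FALSE)
# Table two: repeated measurements, several records per participant.
measurements <- data.frame(id    = c(1, 1, 1, 2, 2),
                           visit = c(1, 2, 3, 1, 2),
                           sbp   = c(120, 118, 121, 135, 131))
dbWriteTable(con, "participants", participants, row.names = FALSE)
dbWriteTable(con, "measurements", measurements, row.names = FALSE)

# Each participant row matches many measurement rows.
counts <- dbGetQuery(con, "
  SELECT p.id, p.sex, COUNT(m.visit) AS n_visits
  FROM participants AS p
  JOIN measurements AS m ON p.id = m.id
  GROUP BY p.id, p.sex
  ORDER BY p.id")

dbDisconnect(con)
```

A many-to-many relationship is usually handled the same way, with a third table that holds one row per (order, product) pair.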
Slide 7
Database Applications
• World Wide Web.
• Medical data.
• Data analysis situations which warrant considering a database:
  • You possess one or more flat (text) files with a very large number of observations.
  • You have collected a very large number of characteristics for your observations (e.g., genetic data).
  • You are ready to execute a large simulation analysis.
  • You need to prepare a portable file, so that another statistician has easy access to your data.
  • Any time there is data in your possession. It is fairly straightforward to maintain a SQL database in R.
Slide 8
The SQLite RDBMS
• Created by D. Richard Hipp. Version 1.0 released August 17, 2000. Most recent version, 3.6.4, released October 15, 2008.
• An ACID (Atomicity, Consistency, Isolation, Durability) compliant RDBMS. In computer science, ACID is a set of properties which guarantee that database transactions (logical operations) are processed reliably.
• Contained in a relatively small (~500 kB) C programming library.
• It is not a database, but rather a system which manages databases. Microsoft Access, in contrast, is simply a program used to create a database.
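The "A" in ACID can be illustrated with a short sketch: a group of updates either takes effect as a whole or not at all. The `accounts` table and the transfer scenario are hypothetical; `dbBegin()`, `dbCommit()`, and `dbRollback()` are the transaction functions of the DBI package.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "accounts",
             data.frame(name = c("a", "b"), balance = c(100, 0),
                        stringsAsFactors = FALSE),
             row.names = FALSE)

# A transfer that is abandoned half way: the rollback undoes the first update.
dbBegin(con)
dbExecute(con, "UPDATE accounts SET balance = balance - 40 WHERE name = 'a'")
dbRollback(con)
after_rollback <- dbGetQuery(con,
  "SELECT balance FROM accounts ORDER BY name")$balance

# The same transfer carried out in full and committed as one unit.
dbBegin(con)
dbExecute(con, "UPDATE accounts SET balance = balance - 40 WHERE name = 'a'")
dbExecute(con, "UPDATE accounts SET balance = balance + 40 WHERE name = 'b'")
dbCommit(con)
after_commit <- dbGetQuery(con,
  "SELECT balance FROM accounts ORDER BY name")$balance

dbDisconnect(con)
```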
Slide 9
The SQLite RDBMS (cont.)
• The columns of a table in a SQL database are typically assigned a “type” (e.g., string, integer, float, double). This is analogous to declaring variables in the C programming language. SQLite, however, (automatically) assigns types to individual values.
• Allows multiple threads to read a database at the same time. Writing to a database can occur only when no other access to the database is in progress.
• Interfaces with programming languages (e.g., BASIC, C, C++, Perl, Ruby, and R).
• Most widely deployed SQL RDBMS.
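The per-value typing can be seen with SQLite's built-in `typeof()` function. This is a small sketch with a made-up table `t` whose single column is declared without any type; each stored value then reports its own type.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table: the column x has no declared type.
dbExecute(con, "CREATE TABLE t (x)")
dbExecute(con, "INSERT INTO t VALUES (1), ('one'), (1.5)")

# typeof() reports the storage type of each individual value.
types <- dbGetQuery(con,
  "SELECT typeof(x) AS type FROM t ORDER BY rowid")$type

dbDisconnect(con)
```

Here `types` holds one entry per row, which differ from row to row even though all three values live in the same column.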
Slide 10
Interfacing the SQLite RDBMS
• Consists of three components: the application (such as R) which requires access to the database; an interface; and the RDBMS.
• An interface acts as an interpreter, translating commands from the application, so that the database is accessible to the user. In R, the interface lies within the DBI library.
• The interface communicates with the database via the applicable database driver. The database driver knows how to “talk” to the database. In R, the SQLite database driver and the source (C library) for the SQLite engine are included within the RSQLite library.
Slide 11
The R-SQLite Interface
Figure 3: The process flow between the application and the database
Application (e.g., R) → Interface (e.g., DBI library in R) → Driver (e.g., SQLite in R) → Database
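The layers can be inspected from within R. The following is only a sketch, assuming current versions of the DBI and RSQLite packages: the driver and connection objects created by RSQLite are checked against the generic classes defined by DBI, and the bundled SQLite engine is asked for its version.

```r
library(DBI)                        # the interface: generics such as dbConnect()

drv <- RSQLite::SQLite()            # the driver, which bundles the SQLite engine
con <- dbConnect(drv, ":memory:")   # the database (held in memory here)

# RSQLite's objects extend the generic classes that DBI defines.
driver_is_dbi     <- methods::is(drv, "DBIDriver")
connection_is_dbi <- methods::is(con, "DBIConnection")

# The application's request travels through DBI and the driver to the engine.
engine_version <- dbGetQuery(con, "SELECT sqlite_version() AS v")$v

dbDisconnect(con)
```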
Slide 12
Accessing a SQLite DB
A five step cycle:
1) Connect to the database. In R, to establish the connection to a database, issue the commands dbDriver() and dbConnect().
   setwd("c:/SQL")
   library(DBI); library(RSQLite)
   dbfile <- "DATA.dbsql"
   drv <- dbDriver("SQLite")
   con <- dbConnect(drv, dbname = dbfile)
2) Issue a query or command to the database. To issue a query in R, the command dbSendQuery() is (typically) used. Queries consist of SQL commands.
   # either selected columns and rows ...
   rs <- dbSendQuery(con, "select v1,v2,v3 from Table1 where v1 = 1")
   # ... or the entire table
   rs <- dbSendQuery(con, "select * from Table1")
Brief Summary of SQL Commands
SQL Command    Tabular Parameters Required of SQL Command
Select         Column Label(s)
From           Table Name(s)
Where          Specific Values for Column Label(s)
Order by       Column Label(s)
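The four commands in the summary can be combined into a single query. This is a minimal sketch on a made-up `Table1` held in an in-memory database; the column names follow the v1, v2, v3 pattern used above, and the values are invented.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical contents for Table1.
Table1 <- data.frame(v1 = c(1, 2, 1, 1),
                     v2 = c(10, 20, 30, 40),
                     v3 = c("d", "c", "b", "a"),
                     stringsAsFactors = FALSE)
dbWriteTable(con, "Table1", Table1, row.names = FALSE)

# SELECT names the columns, FROM the table, WHERE filters the rows,
# and ORDER BY sorts the result.
res <- dbGetQuery(con,
  "SELECT v2, v3 FROM Table1 WHERE v1 = 1 ORDER BY v3")

dbDisconnect(con)
```

`dbGetQuery()` is a DBI convenience that sends the query, fetches all rows, and clears the result in one call.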
Slide 13
Accessing a SQLite DB (cont.)
3) If a query was issued, we need to retrieve the applicable recordset. To do this in R, we use the command fetch(). The recordset will exist as a data frame in R.
   d1 <- fetch(rs, n = -1)   # n = -1 retrieves all pending records
4) Clear the query result, manipulate the recordset, and update the database. To clear a query in R, use the command dbClearResult(). To update a database, issue the R command dbWriteTable().
   dbClearResult(rs)
   # write the data frame d1 back to the database, replacing the existing table
   dbWriteTable(con, "Table1", d1, append = FALSE, row.names = FALSE, overwrite = TRUE)
5) Close the connection to the database. To do this in R, use the command dbDisconnect().
   dbDisconnect(con)
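Put together, the five steps might look like the following sketch. It assumes an in-memory database seeded with a small made-up `Table1`, so that it runs without the c:/SQL directory, and it uses `dbFetch()`, the current DBI name for `fetch()`.

```r
library(DBI)

# 1) Connect (assumption: an in-memory database instead of a file).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "Table1",
             data.frame(v1 = 1:3, v2 = c(2.5, 3.5, 4.5)),
             row.names = FALSE)          # seed data, for illustration only

# 2) Issue a query.
rs <- dbSendQuery(con, "SELECT * FROM Table1")

# 3) Retrieve the recordset as a data frame.
d1 <- dbFetch(rs, n = -1)

# 4) Clear the result, manipulate the recordset, and update the database.
dbClearResult(rs)
d1$v3 <- d1$v1 * d1$v2
dbWriteTable(con, "Table1", d1, overwrite = TRUE, row.names = FALSE)
updated <- dbReadTable(con, "Table1")

# 5) Close the connection.
dbDisconnect(con)
```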
Slide 14
Example 1
• You have recorded n (unique) observations (records) for a study, and collected m-1 characteristics (excluding the unique observation identifier) for each observation. Having the option to store your data as a flat file or as a (single table) SQL database, which should you choose?
• To address this issue, you decide to conduct a (small) simulation analysis, investigating data retrieval times for the two types of data repositories. Figure 4, shown on the subsequent slide, displays the results from a simulation where m=50 and each of the 49 characteristics is of type “double.” A total of 100 distinct values of n, n_1, …, n_100, were chosen in accordance with the rule (formula not preserved in this transcript)
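A comparison of this kind could be set up roughly as follows. This is not the simulation behind Figure 4: the single value n = 1000, the use of `read.csv()` and `dbReadTable()`, and the temporary file names are all assumptions made for the sketch.

```r
library(DBI)

set.seed(1)
n <- 1000   # number of observations (one illustrative value)
m <- 50     # identifier plus 49 characteristics of type double

dat <- data.frame(id = seq_len(n),
                  matrix(rnorm(n * (m - 1)), nrow = n))

# Store the same data twice: as a flat file and as a one-table database.
flat_file <- tempfile(fileext = ".csv")
db_file   <- tempfile(fileext = ".sqlite")
write.csv(dat, flat_file, row.names = FALSE)
con <- dbConnect(RSQLite::SQLite(), db_file)
dbWriteTable(con, "dat", dat, row.names = FALSE)

# Time the retrieval of the full data set from each repository.
t_flat <- system.time(from_flat <- read.csv(flat_file))[["elapsed"]]
t_db   <- system.time(from_db <- dbReadTable(con, "dat"))[["elapsed"]]

dbDisconnect(con)
unlink(c(flat_file, db_file))
```

Repeating this over a grid of values of n, and averaging over several runs, would give curves comparable in spirit to a flat file versus database plot; the timings themselves depend on the machine.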
Slide 15
Example 1 (cont.)
Figure 4: Flat file – DB comparison
Slide 16
Example 2
• You have recorded data for n (unique) participants of a study, and collected m-1 characteristics (excluding the unique observation identifier) for each participant. Further, for the ith participant, you have recorded a total of i record(s). Having the option to store your data as a flat file or as a (single table) SQL database, which should you choose? Given the unique identifier for a participant, suppose it is desirable to have quick access to the records for each participant.
• Figures 5 and 6, shown on the subsequent slides, display the data retrieval times for n=200 and n=500, respectively, with m=50 and each variable of type “double.” The value displayed on the vertical axis is the time required to read in the i records for the ith participant.
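Retrieving the records of a single participant from the database can be sketched as below. The table layout, the small value n = 20, and the index on `id` are assumptions for illustration; the slides do not say whether the timed databases were indexed.

```r
library(DBI)

n <- 20                                       # participants (kept small here)
ids <- rep(seq_len(n), times = seq_len(n))    # participant i has i records
records <- data.frame(id = ids, x = as.numeric(seq_along(ids)))

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "records", records, row.names = FALSE)

# Assumption: an index on the identifier, so that the lookup does not
# have to scan the whole table.
dbExecute(con, "CREATE INDEX idx_records_id ON records (id)")

# Read in only the records of participant 7, using a bound parameter.
one_participant <- dbGetQuery(con,
  "SELECT * FROM records WHERE id = ?", params = list(7L))

dbDisconnect(con)
```

With a flat file, by contrast, the whole file typically has to be read before the rows of one participant can be picked out.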
Slide 17
Example 2 (cont.)
Figure 5: Flat file – DB comparison, marginal read I
Slide 18
Example 2 (cont.)
Figure 6: Flat file – DB comparison, marginal read II
Slide 19
Example 3
• You have recorded data for n (unique) participants of a study, and collected allele types at more than two million (2×10^6) single nucleotide polymorphism (SNP) sites in the human nuclear genome. How could we utilize a relational database to represent the repository for these data?
  • It turns out that a SQLite database table written from R is limited to 999 columns. (SQLite's own default limit is 2,000 columns per table; 999 was its default limit on bound parameters in a single statement, which is the likely source of this restriction.) So, we simply create a sufficient number of tables (each with, say, m columns), and populate the table columns with the SNP data, making sure to create a “Key” column for each table.
• Figures 7 and 8 display the time required (by database table) to retrieve the first two columns (the “Key” along with a column of data) for SQL databases of size n=2,500 and n=25,000, respectively. For each database, a total of 2,106 tables were created, with m=951 columns in each table. That is, these two databases comprise slightly more than five billion and 50 billion fields, respectively.
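The splitting strategy can be sketched at toy scale. Everything here is illustrative: 10 participants, 25 SNP columns, and 10 data columns per table stand in for the real dimensions, and the table and column names are invented.

```r
library(DBI)

set.seed(1)
n     <- 10   # participants
n_snp <- 25   # SNP sites (toy value)
m     <- 10   # data columns per table (toy value)

# Hypothetical genotype matrix with one column per SNP site.
snp <- matrix(sample(c("AA", "AB", "BB"), n * n_snp, replace = TRUE),
              nrow = n,
              dimnames = list(NULL, paste0("snp", seq_len(n_snp))))
key <- seq_len(n)

# Assign the SNP columns to consecutive blocks of at most m columns.
chunks <- split(seq_len(n_snp), ceiling(seq_len(n_snp) / m))

con <- dbConnect(RSQLite::SQLite(), ":memory:")
for (k in seq_along(chunks)) {
  # Every table carries the same "Key" column next to its block of data.
  block <- data.frame(Key = key, snp[, chunks[[k]], drop = FALSE],
                      stringsAsFactors = FALSE)
  dbWriteTable(con, paste0("snp_table_", k), block, row.names = FALSE)
}

table_names <- sort(dbListTables(con))

# Retrieve the key together with one column of data from one table.
first_two <- dbGetQuery(con, 'SELECT "Key", snp21 FROM snp_table_3')

dbDisconnect(con)
```

Since each table repeats the key, any two blocks can be joined back together on `Key` when SNPs from different tables are needed at once.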
Slide 20
Example 3 (cont.)
Figure 7: Retrieval Time for a Massive SQL Database I
Slide 21
Example 3 (cont.)
Figure 8: Retrieval Time for a Massive SQL Database II
Slide 22
Example 3 (cont.)
Figure 9: Retrieval Time for a Massive SQL Database III
Slide 23
Conclusion
Database advantages
• Ability to store an extraordinary quantity of data. On a Windows NT platform, a SQL table can be as large as 2TB; on 64-bit operating systems, there is virtually no limit to the size of a SQL table.
• A single file could be utilized as a central data warehouse.
• Ability to create tabular relationships.
• Database indexing makes for very fast data retrieval.
• Portability.
• Interfacing with programming languages.
• Multithreading.
Database disadvantages
• It can be a bit of a challenge to recall which data live in which table.
• There is no “safety net” when it comes to overwriting data in a database.
• Essentially having to learn a new programming (querying) language.
• Database administration in industry requires continuous, careful maintenance.
Slide 24
References
Hogan, R. (2002). A practical guide to database design. Prentice Hall, Englewood Cliffs, NJ.
• This is a good resource for obtaining a working knowledge of what a database is all about. It covers the issues that, say, an employer would consider prior to implementing a database in practice (e.g., who will use the database, and what should the database contain). It does not, however, discuss how to create and maintain the database from a programming point of view (i.e., the SQL commands).
Maslakowski, M., Butcher, T. (2000). SAMS teach yourself MySQL in 21 days. Macmillan USA, Indianapolis, IN.
• Although the Windows version of R does not have the RMySQL binary package, this book is an excellent resource for “getting your feet wet” with the SQL programming language. The book lacks the theme of Hogan (2002); namely, it does not discuss strategies for database design.
R Special Interest Group on Databases (R-SIG-DB). A common database interface (DBI). Software version 0.2-4 retrieved from http://cran.r-project.org/, September 27, 2008.
• The DBI package is necessary (but not sufficient) to interface any database with R (see slide 11). The DBI.pdf (available at the web link provided above) is a good document to review prior to creating your first database in R. It provides an overview of the DBI package. Like most programming in R, learning (effective) database management in R will require practice through working with data frames.