Download The SQL Plus Buffer - Text Summarization

Document related concepts

Microsoft SQL Server wikipedia , lookup

Clusterpoint wikipedia , lookup

Relational model wikipedia , lookup

SQL wikipedia , lookup

Database model wikipedia , lookup

Transcript
SI 654
Database Application Design
Winter 2003
Dragomir R. Radev
1
© 2002 by Prentice Hall
Database Processing
Eighth Edition
Managing
Databases
with Oracle
2
Chapter 12
David M. Kroenke
© 2002 by Prentice Hall
What is Oracle?
• Oracle is the world’s most popular
DBMS that…
– Is extremely powerful and robust
– Runs on many different operating
systems
– Can be configured and tailored
– Operates with most, if not all, addon products
3
© 2002 by Prentice Hall
Oracle Complexity
• The power and flexibility of Oracle
makes it very complex:
– Installations are difficult
– The configuration options are
numerous
– System requirements are high
– System maintenance is complex
4
© 2002 by Prentice Hall
The Language of Oracle…
SQL Plus
• SQL Plus is used in Oracle to:
– Define the structure of a database
and the definition of the data
– Insert, delete, and modify data
– Define the behavior of the system
through stored procedures and
triggers
– Retrieve data and generate reports
5
© 2002 by Prentice Hall
Gaining Access to SQL Plus
• To gain access to SQL Plus, you will need a
username and password and possibly a host
string (depending on your system configuration)
• When Oracle is first installed, it
establishes several default accounts,
namely...
– internal/oracle (a privileged account)
– sys/change_on_install (a privileged
account)
– system/manager (a privileged account)
– scott/tiger (a non-privileged account)
6
© 2002 by Prentice Hall
Creating the Database
• Ways to create an Oracle database:
– Using SQL Plus
• Start button –> Programs –> Oracle –
OraHome81 –> Applications
Development –> SQL Plus
– Using Oracle’s Database Configuration
Assistant
• Start button –> Programs –> Oracle –
OraHome81 –> Database
Administration –> Database
Configuration Assistant
7
© 2002 by Prentice Hall
Entering SQL Plus
Commands
• The SQL Plus Buffer
– As a user types commands, the
commands are saved into the SQL
Plus buffer.
• The SQL Plus Editor
– Users may edit or alter SQL Plus
commands using a text editor.
8
© 2002 by Prentice Hall
SQL Plus Buffer Commands
• SQL Plus is not case sensitive (except
within quotation marks).
• List – displays the content of the SQL Plus
buffer
• List n – display line number n and changes
the current line number to n
• Change – performs a search/replace
operation for the current line number
• Semi-colon (;) or slash (/) executes
9
© 2002 by Prentice Hall
SQL Plus Editor
• The SQL Plus Edit command will
launch the SQL Plus text editor
• After the SQL statement is complete
and correct, exit the editor
• To execute the statement, type the
slash key (/) at the SQL prompt
• To retrieve an existing SQL file:
– SQL> Edit file1.sql
10
© 2002 by Prentice Hall
SQL Plus Commands
•
•
•
•
•
•
•
•
Desc – lists the fields in the specified table
Select – retrieve data
Create – create objects
Drop – delete objects
Alter – change objects
Insert – input data
Delete – delete data
Update – change data
11
© 2002 by Prentice Hall
Select Syntax
Select field1, field2
From table_a, table_b;
12
© 2002 by Prentice Hall
Create Syntax
Create Table tablename (
field1
data_type(size) NOT NULL,
field2 data_type (size) NULL);
Create Sequence tableID Increment
by 1 start with 1000;
(this command creates a counter that
automatically increments for each new record –
does not ensure uniqueness)
13
© 2002 by Prentice Hall
Alter Table Syntax
Alter Table tablename1
Add Constraint FieldPK Primary Key (Field1,
Field2);
Alter Table tablename2
Add Constraint FieldFK Foreign Key (Field1,
Field2) references tablename1 On Delete Cascade;
14
© 2002 by Prentice Hall
Insert Syntax
Insert into tablename
(fieldID, field2)
Values
(fieldID.NextVal, ‘data content’);
15
© 2002 by Prentice Hall
Drop Syntax
• Drop Table tablename;
• Drop Sequence fieldID;
16
© 2002 by Prentice Hall
Indexes
• Indexes are used to enforce
uniqueness and to enable fast
retrieval of data.
• Create Unique Index fieldIndex on
Table(field1, field2);
17
© 2002 by Prentice Hall
Changing the Table
Structures
• Alter Table tablename add field4 datatype
(size);
• Alter Table tablename Drop Column field2;
–you will permanently lose the data in field2
• Alter Table tablename Modify field3 not
null;
18
© 2002 by Prentice Hall
Changing the Data…
Update Syntax
Update tablename
Set field1 = ‘value_a’
Where field3 = value;
19
© 2002 by Prentice Hall
Check Constraint
Provide a list of valid values or a valid
range…
Create Table tablename (
Field1 datatype (size) Not Null,
Field2 datatype (size) Null
Check (field2 in (‘value_a’,
‘value_b’)));
20
© 2002 by Prentice Hall
Check Constraint
• Alter Table tablename Add Constraint
DateChk Check (DateField1 <= DateField2);
• Alter Table tablename Add Constraint
NumRange Check (Field1 Between 180 and
400);
• Alter Table tablename Drop Constraint
constraintname;
21
© 2002 by Prentice Hall
Views
• Displaying the data from the
database just the way a user wants
it…
Create View View1 As
Select * From Tablename
With Read Only;
22
© 2002 by Prentice Hall
PL/SQL
• Allowing SQL to act more like a
programming language.
• Row-at-a-time versus set-at-a-time.
• PL/SQL permits Cursors
• A stored procedure is a PL/SQL (or
other program) stored in the
database. A stored procedure may
have parameters.
23
© 2002 by Prentice Hall
PL/SQL Parameter Types
• IN – specifies the input parameters
• OUT – specifies the output
parameters
• IN OUT – a parameter that may be
an input or an output
24
© 2002 by Prentice Hall
PL/SQL Code
• Variables are declared following the
AS keyword
• The assignment operator is := as
follows
variable1 := ‘value’
• Comments in PL/SQL are enclosed
between /* and */ as follows…
/* This is a comment */
25
© 2002 by Prentice Hall
PL/SQL Control Structures
FOR variable IN list_of_values
LOOP
Instructions
END LOOP;
IF condition THEN
BEGIN
Instructions
END;
26
© 2002 by Prentice Hall
Saving, Compiling, and
Executing PL/SQL Code
• The last line in the PL/SQL procedure
should be a slash (/).
• The procedure must be saved to a file
• To compile the procedure, type the
keyword Start, followed by the procedure
filename
– START MyProg.SQL
• To see any reported errors, type SHOW
ERRORS;
• To execute the procedure type EXEC
MyProg (‘parameter1’, ‘parameter2’);
27
© 2002 by Prentice Hall
Triggers
• A trigger is a stored procedure that is
automatic invoked by Oracle when a
specified activity occurs
• A trigger is defined relative to the activity
which invoked the trigger
– BEFORE – execute the stored procedure
prior to the activity
– AFTER – execute the stored procedure
after the activity
– INSTEAD OF – execute the stored
procedure in lue of the activity
28
© 2002 by Prentice Hall
Trigger Example
Create or Replace Trigger triggername
Before Insert or Update of
fieldname on tablename
For Each Row
Begin
/* instructions */
End;
29
© 2002 by Prentice Hall
A Trigger Knows the Old and
New Values for Fields
• The variable :new.fieldname1 stores
the new information for fieldname1 as
entered by the user.
• The variable :old.fieldname1 stores
the information in fieldname1 prior to
the user’s request.
30
© 2002 by Prentice Hall
Activating a Trigger
• The trigger must be saved to a file
• To compile the trigger, type the keyword
Start, followed by the trigger filename
– START MyTrigger.SQL
• To see any reported errors, type SHOW
ERRORS;
• If no errors were encountered, the trigger
is automatically activated
31
© 2002 by Prentice Hall
Data Dictionary
• The data dictionary contains information
that Oracle knows about itself… the
metadata.
• It includes information regarding just
about everything in the database including
the structure and definition of tables,
sequences, triggers, indexes, views, stored
procedures, etc.
• The data dictionary table names are stored
in the DICT table.
32
© 2002 by Prentice Hall
Concurrency Control
• Since Oracle only reads committed
changes, dirty reads and lost updates
are avoided
• Transaction isolation levels:
– Read Committed
– Serializable
– Read-only
– Explicit Locks
33
© 2002 by Prentice Hall
Read Committed Transaction
Isolation
• Reads may not be repeatable (2 reads may
result in 2 data values, based on timing of
updates and reads)
• Phantoms are possible (data from a read
may be deleted after the read occurred)
• Uses exclusive locks
• Deadlocks are possible and are resolved by
rolling-back one of the transactions
34
© 2002 by Prentice Hall
Serializable Transaction
Isolation
• Reads are always repeatable
• Phantoms are avoided
• Must issue the following command:
Set Transaction Isolation Level Serializable; or
Alter Sessions Set Isolation Level Serializable;
• Coordinates activities in submission order.
When this coordination detects
difficulties, the application program(s)
must intervene.
35
© 2002 by Prentice Hall
Read-only Transaction
Isolation
• An Oracle-only isolation level
• No inserting, updating, or deleting is
permitted
36
© 2002 by Prentice Hall
Explicit Locking
• Not recommended
• Oracle does not promote locks. As a
result, a table may have many, many
locks within it. Oracle manages these
locks transparently. Issuing explicit
locks may interfere with these
transparent locks.
37
© 2002 by Prentice Hall
Oracle Security
• Username and Password is used to manage
DBMS access
• Users may be assigned to one or more
profiles
• Oracle provides extensive resource
limitations and access rights. These
restrictions may be applied to users or
profiles.
• The SQL Grant operator provides
additional access rights
• The SQL Revoke operator remove access
rights
38
© 2002 by Prentice Hall
Backup/Recovery
• Committed changes are saved to
destination Tablespaces.
• Uncommitted changes are saved in the
Rollback Tablespace.
• Redo Log files save all changes made in the
Tablespaces.
• To start and/or recover from a system
failure, the Control Files are read.
39
© 2002 by Prentice Hall
Archivelog
• If Oracle is running in ARCHIVELOG
mode, backup copies are made of the
redo log files.
• Otherwise, the redo log files are
periodically overwritten with new
information.
40
© 2002 by Prentice Hall
Types of Failures
• Application Failure
– When a program bug is encountered or
when a program does not correctly
respond to current system conditions.
• Instance Failure
– When Oracle is unable to do what it
needs to do.
• Media Failure
– When a disk becomes inaccessible to
Oracle.
41
© 2002 by Prentice Hall
Recovery of an Application
Failure
• Oracle rolls back uncommitted
changes.
42
© 2002 by Prentice Hall
Recovery of an Instance
Failure
• Oracle would be restarted using the
following sequence…
– Read the Control File
– Restore system to last known valid
state
– Roll forward changes not in system
(replay the Redo Log Files)
43
© 2002 by Prentice Hall
Recovery from a Media
Failure
• Restore system from Backup
• Read the Control File
• Roll forward changes using the
Archive Log Files (from the
ARCHIVELOG)
• Roll forward changes from the on-line
Log Files (the most recent versions of
the Logs)
44
© 2002 by Prentice Hall
Types of Recoveries
• Consistent Backup
– After the restoration, delete all
uncommitted activities
– This ensures consistency, may lose
recent changes
• Inconsistent Backup
– After the restoration, all
uncommitted activities remain
45
© 2002 by Prentice Hall
Database Processing
Eighth Edition
Networks,
Multi-Tier
Architectures,
and XML
46
Chapter 14
David M. Kroenke
© 2002 by Prentice Hall
Networks
• A network is a collection of
computers that communicate with one
another using standard sets of rules,
called protocols
• Common Network Environments:
– Internet
– Intranet
– Wireless Network Access
47
© 2002 by Prentice Hall
The Internet
• Internet - a publicly accessible
network of networks spanning the
globe
• Uses communications protocol called
Transmission Control
Program/Internet Protocol (TCP/IP)
48
© 2002 by Prentice Hall
Key Dates for the Internet
• The Internet was born in the 1960’s by the
US armed services and was called
ARPANET
• HTTP: HyperText Transfer Protocol (used
to create Web Pages) was created in 1989
by CERN
– Key HTTP characteristics:
• Request-based (waits for user action)
• Stateless (does not sequence or remember
activities)
49
© 2002 by Prentice Hall
HTTP: Stateless Property
• In applications development, you may often
wish to save the application state.
• Several Internet tools exist to help
accomplish this:
– Microsoft Internet Information Server (IIS)
– Microsoft Active Server Pages (ASP)
– Java Servlets with Java ServerPages (JSP)
50
© 2002 by Prentice Hall
The Intranet
• Some organizations use Internet
technologies to create their own
privately accessible network called an
intranet.
• If a connection to the Internet does
exists, it does so through a firewall
• An intranet is almost always faster
than the Internet
51
© 2002 by Prentice Hall
Firewall
• Firewall - a security gateway that
protects an organization from
unauthorized access via the Internet
• Consists of software and sometimes
hardware components
52
© 2002 by Prentice Hall
Wireless Network Access
• Due to less reliability, inferior screen
displays, and slower transfer rates, the
traditional wired protocols are not
appropriate for wireless environments.
• A few protocols have been developed which
allow wireless devices to communicate via
the Internet:
– Wireless Application Protocol (WAP)
– Wireless Markup Language (WML)
53
© 2002 by Prentice Hall
Multi-tier Architectures
• Multi-tier Architectures
– Tiers are the number of computers
(serving a like function) that a user
must use to satisfy his/her
request.
– Common tiers include Web server
and database server.
54
© 2002 by Prentice Hall
A Three-Tier Architecture
55
© 2002 by Prentice Hall
Functions of Tiers
56
© 2002 by Prentice Hall
Processing at the
Different Tiers
• Since each tier serves a different
function, each tier may have a
different operating system and
different application software
offerings.
57
© 2002 by Prentice Hall
Processing
• Client Processing
– Using the browser (e.g., Netscape
Navigator)
• Server Processing
– Using Server Software (e.g., ASP)
58
© 2002 by Prentice Hall
Windows 2000 Web Server
Languages
•
•
•
•
•
59
JavaScript
VBScript
Perl
ActiveX Control
Java
© 2002 by Prentice Hall
Standards & Languages
Common With MS Web Server
60
© 2002 by Prentice Hall
Unix/Linux Web Server
Environment
•
•
•
•
•
•
•
61
JavaScript
Java Applets
Java Servlets
Java Server Pages
Perl
Java
CGI
© 2002 by Prentice Hall
N-Tier Processing
• The 3-Tier architecture may be
extended to include additional tiers.
• This produces a distributed
processing model using various
servers on the Internet
62
© 2002 by Prentice Hall
Markup Languages
• Markup Languages are used to
specify the appearance and behavior
of Web Pages
• Markup Language flavors:
– HTML – a subset of the SGML
– DHTML
– RDS/ADO
– XML
63
© 2002 by Prentice Hall
HTML
• HyperText Markup Language
– PROS
• Simple
• Standardized
– CONS
• Static content
• Limited connectivity
• Mixed structure/content
64
© 2002 by Prentice Hall
DHTML
• Dynamic HyperText Markup Language
• Encapsulates the entire HTML
command set
• Provides access to objects on the
page using the Document Object
Model (DOM)
• Allows for Cascading Style Sheets
(CSS)
65
© 2002 by Prentice Hall
Data Services
• Data services allow Web pages to exchange
data with databases
• RDS is a set of ObjectX controls
– The data exchanges must be relatively
simple
• ADO is a set of ActiveX Data Controls
– These data exchanges may be more
complex
66
© 2002 by Prentice Hall
Extensible Markup Language
–XML
• XML clearly separates content from
structure and allows developers to easily
define their own elements.
• Rather than hard-coding Web pages, you
create rules that govern how the document
should look. Then merge the structure and
the content files. So, the very nature of
XML is dynamic.
67
© 2002 by Prentice Hall
Document Type Declaration –
DTD
• A DTD defines the data content and
may provide the data values
• While a DTD is desirable, it is not
mandatory
– XML documents using DTDs are
termed type-valid documents
– XML documents not using DTDs are
termed not-type-valid documents
68
© 2002 by Prentice Hall
XML & CSS
• Similar to DHTML, Cascading Style
Sheets (CSS) may be used with XML
documents to present a consistent,
standardized Web site.
69
© 2002 by Prentice Hall
Extensible Style Language
Transformation –XMLT
• XMLT is used to transform one
document into another document
70
© 2002 by Prentice Hall
XML Schema
• XML Schema is the next generation
of DTD
• The schema itself is an XML
document
• A W3 standard is currently being
developed
• A document that conforms to an XML
Schema is termed schema-valid.
71
© 2002 by Prentice Hall
XML Schema Concepts
• Simple Elements
– Consist of a single content value
• Complex Elements
– Consist of multiple content values
72
© 2002 by Prentice Hall
XML Namespaces
• Namespaces define where to look for files
• An XML document may have:
– Up to one default namespace
– Many labeled namespaces
• Naming conventions:
– Must be unique within all schemas
– Typically resembles a URL, but is not a
URL
73
© 2002 by Prentice Hall
Wireless Application
Protocol (WAP)
• WAP has been developed to facilitate
Web development for wireless
devices such as Personal Data
Assistants (PDA) or cellular phones
74
© 2002 by Prentice Hall
WAP Server
• A WAP Server transforms XML
documents into Wireless Markup
Language (WML) – WML is a subset
of XML
• A WML Scripting Language also
exists
75
© 2002 by Prentice Hall
XML and Database
Applications
• Any document that can process a
DTD or XML Schema document can
correctly interpret any arbitrary
database view
• XML can easily process multiple
multi-valued paths (several SQL
statements would be required)
76
© 2002 by Prentice Hall
XML and Database
Applications
• The separation of structure and content
allows for:
– The same data to be displayed in many
different ways
– The same structure (report) may be
regenerated many times with
different/updated data.
– Permits document validation checking
77
© 2002 by Prentice Hall
OASIS
– Document structures may be published
and made publicly available
– Organization for the Advancement of
Structured Information Standards
(OASIS):
• A clearinghouse for XML publications
and schema standards
78
© 2002 by Prentice Hall
DBMS Integration of XML
– Oracle
• XML DOM parser
• Xpath
• XSQL
– SQL Server
• XML DOM parser
• Xpath
• ADO
• ASP
79
© 2002 by Prentice Hall
Web-based databases
80
© 2002 by Prentice Hall
Types of databases
• Textual databases
• Semi-structured databases
81
© 2002 by Prentice Hall
Indexing textual data
• Inverted files
• Boolean queries
• Signature files
• Signature S1 matches signature S2 if
S2&S1=S2
82
© 2002 by Prentice Hall
XML-QL
83
© 2002 by Prentice Hall
XML-QL
Two slides from Johannes Gehrke, Cornell University
<IMG SRC=“xysq.gif” ALT=“(x+y)^2”>
<apply> <power/>
<apply> <plus/> <ci>x</ci> <ci>y</ci> </apply>
<cn>2</cn>
</apply>
WHERE
<BOOK>
<NAME><LAST>$1</LAST></NAME>
</BOOK> in “www.booklist.com/books.xml
CONSTRUCT <RESULT> $1 </RESULT>
84
© 2002 by Prentice Hall
XML-QL (continued)
WHERE <BOOK> $b <BOOK> IN “www.booklist.com/books.xml”,
<AUTHOR> $n </AUTHOR>
<PUBLISHED> $p </PUBLISHED> in $e
CONSTRUCT
<RESULT>
<PUBLISHED> $p </PUBLISHED>
WHERE <LAST> $l </LAST> IN $n
CONSTRUCT <LAST> $l </LAST>
</RESULT>
85
© 2002 by Prentice Hall
XML-QL (continued)
<!ELEMENT book (author+, title, publisher)>
<!ATTLIST book year CDATA>
<!ELEMENT article (author+, title, year?,
(shortversion|longversion))>
<!ATTLIST article type CDATA>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>
86
© 2002 by Prentice Hall
XML-QL (continued)
WHERE <book>
<publisher><name>AddisonWesley</name></publisher>
<title> $t</title>
<author> $a</author>
</book> IN "www.a.b.c/bib.xml"
CONSTRUCT $a
87
© 2002 by Prentice Hall
XML-QL (continued)
WHERE <book>
<publisher><name>AddisonWesley</></>
<title> $t</>
<author> $a</>
</> IN "www.a.b.c/bib.xml"
CONSTRUCT $a
88
© 2002 by Prentice Hall
XML-QL (continued)
WHERE <book>
<publisher><name>Addison-Wesley</></>
<title> $t</>
<author> $a</>
</> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
<author> $a</>
<title> $t</>
</>
89
© 2002 by Prentice Hall
XML-QL (continued)
<bib>
<book year="1995">
<!-- A good introductory text -->
<title> An Introduction to Database Systems </title>
<author> <lastname> Date </lastname> </author>
<publisher> <name> Addison-Wesley </name > </publisher>
</book>
<book year="1998">
<title> Foundation for Object/Relational Databases: The Third
Manifesto </title>
<author> <lastname> Date </lastname> </author>
<author> <lastname> Darwen </lastname> </author>
<publisher> <name> Addison-Wesley </name > </publisher>
</book>
</bib>
90
© 2002 by Prentice Hall
XML-QL (continued)
<result>
<author> <lastname> Date </lastname> </author>
<title> An Introduction to Database Systems </title>
</result>
<result>
<author> <lastname> Date </lastname> </author>
<title> Foundation for Object/Relational Databases: The Third Manifesto
</title>
</result>
<result>
<author> <lastname> Darwen </lastname> </author>
<title> Foundation for Object/Relational Databases: The Third Manifesto
</title>
</result>
91
© 2002 by Prentice Hall
XML-QL (continued)
WHERE <book > $p</> IN "www.a.b.c/bib.xml",
<title > $t</>,
<publisher><name>Addison-Wesley</>> IN $p
CONSTRUCT <result>
<title> $t </>
WHERE <author> $a </> IN $p
CONSTRUCT <author> $a</>
</>
92
© 2002 by Prentice Hall
XML-QL (continued)
<result>
<title> An Introduction to Database Systems </title>
<author> <lastname> Date </lastname> </author>
</result>
<result>
<title> Foundation for Object/Relational Databases: The
Third Manifesto </title>
<author> <lastname> Date </lastname> </author>
<author> <lastname> Darwen </lastname> </author>
</result>
93
© 2002 by Prentice Hall
XML-QL (continued)
WHERE <article>
<author>
<firstname> $f </> // firstname $f
<lastname> $l </> // lastname $l
</>
</> CONTENT_AS $a IN
"www.a.b.c/bib.xml"
<book year=$y>
<author>
<firstname> $f </> // join on same
firstname $f
<lastname> $l </> // join on same
lastname $l
</>
</> IN "www.a.b.c/bib.xml",
y > 1995
CONSTRUCT <article> $a </>
94
© 2002 by Prentice Hall
XML-QL (continued)
95
© 2002 by Prentice Hall
XML-QL (continued)
<!ATTLIST person ID ID #REQUIRED>
<!ATTLIST article author IDREFS
#IMPLIED>
96
© 2002 by Prentice Hall
XML-QL (continued)
<person ID="o123">
<firstname>John</firstname>
<lastname>Smith<lastname>
</person>
<person ID="o234">
...
</person>
<article author="o123 o234">
<title> ... </title>
<year> 1995 </year>
</article>
97
© 2002 by Prentice Hall
XML-QL (continued)
98
© 2002 by Prentice Hall
XML-QL (continued)
WHERE <article><author><lastname> $n</></></> IN "abc.xml”
WHERE <article author=$i>
<title> </> ELEMENT_AS $t
</>,
<person ID=$i>
<lastname> </> ELEMENT_AS $l
</>
CONSTRUCT <result> $t $l</>
99
© 2002 by Prentice Hall
Scalar values
NOT!
<title>A Trip to <titlepart> the Moon </titlepart></title>
<title><CDATA> A Trip to </CDATA><titlepart><CDATA> the
Moon</CDATA></titlepart></title>
100
YES
© 2002 by Prentice Hall
Tag variables
WHERE <$p>
<title> $t </title>
<year>1995</>
<$e> Smith </>
</> IN
"www.a.b.c/bib.xml",
$e IN {author, editor}
CONSTRUCT <$p>
<title> $t </title>
<$e> Smith </>
</>
101
© 2002 by Prentice Hall
Transforming data
<!ELEMENT book (author+, title, publisher)>
<!ATTLIST book year CDATA>
<!ELEMENT article (author+, title, year?,
(shortversion|longversion))>
<!ATTLIST article type CDATA>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>
<!ELEMENT person (lastname, firstname, address?, phone?, publicationtitle*)>
102
© 2002 by Prentice Hall
Transforming data (cont’d)
WHERE <$> <author> <firstname> $fn </>
<lastname> $ln </>
</>
<title> $t </>
</> IN "www.a.b.c/bib.xml",
CONSTRUCT <person ID=PersonID($fn,
$ln)>
<firstname> $fn </>
<lastname> $ln </>
<publicationtitle> $t </>
</>
103
© 2002 by Prentice Hall
Integrating data from
different sources
WHERE <person>
<name></> ELEMENT_AS $n
<ssn> $ssn</>
</> IN "www.a.b.c/data.xml",
<taxpayer>
<ssn> $ssn</>
<income></> ELEMENT_AS $i
</> IN
"www.irs.gov/taxpayers.xml"
CONSTRUCT <result> $n $i </>
104
© 2002 by Prentice Hall
Query blocks
WHERE <$e> <title> $t </>
<year> 1995 </> </> CONTENT_A $p
IN "www.a.b.c/bib.xml"
CONSTRUCT <result ID=ResultID($p)> <title> $t </> </>
{ WHERE $e = "journal-paper",
<month> $m </> IN $p
CONSTRUCT <result ID=ResultID($p)> <month> $m </>
</>
}
{ WHERE $e = "book",
<publisher>$q </> IN $p
CONSTRUCT <result ID=ResultID($p)> <publisher>$q
</> </>
}
105
© 2002 by Prentice Hall
WSQ
106
© 2002 by Prentice Hall
Web-supported queries
SIGMOD2000 (Goldman and Widom)
WebPages (SearchExp,T1,T2,…,Tn,URL,Rank,
Date)
SELECT NAME, COUNT
FROM STATES, WEBCOUNT
WHERE NAME = T1
ORDER BY COUNT DESC
107
© 2002 by Prentice Hall
KDD: Data Mining
108
© 2002 by Prentice Hall
The big problem
• Billions of records
• A small number of interesting
patterns
• “Data rich but information poor”
109
© 2002 by Prentice Hall
Data mining
• Knowledge discovery
• Knowledge extraction
• Data/pattern analysis
110
© 2002 by Prentice Hall
Types of source data
•
•
•
•
111
Relational databases
Transactional databases
Web logs
Textual databases
© 2002 by Prentice Hall
Association rules
• 65% of all customers who buy beer
and tomato sauce also buy pasta and
chicken wings
• Association rules: X Y
112
© 2002 by Prentice Hall
Association analysis
• IF
20 < age < 30
AND
20K < INCOME < 30K
• THEN
– Buys (“CD player”)
• SUPPORT = 2%, CONFIDENCE = 60%
113
© 2002 by Prentice Hall
Basic concepts
• Minimum support threshold
• Minimum confidence threshold
• Itemsets
• Occurrence frequency of an itemset
114
© 2002 by Prentice Hall
Association rule mining
• Find all frequent itemsets
• Generate strong association rules
from the frequent itemsets
115
© 2002 by Prentice Hall
Support and confidence
• Support (X)
• Confidence (X  Y) = Support(X+Y) /
Support (X)
116
© 2002 by Prentice Hall
Example
117
TID
T100
T200
T300
T400
T500
T600
T700
T800
T900
List of items IDs
I1, I2, I5
I2, I4
I2, I3
I1, I2, I4
I1, I3
I2, I3
I1, I3
I1, I2, I3, I5
I1, I2, I3
© 2002 by Prentice Hall
Example (cont’d)
•
•
•
•
•
•
•
118
Frequent itemset l = {I1, I2, I5}
I1 AND I2  I5
C = 2/4 = 50%
I1 AND I5  I2
I2 AND I5  I1
I1  I2 AND I5
I2  I1 AND I5
I3  I1 AND I2
© 2002 by Prentice Hall
Example 2
119
TID
date
items
T100
10/15/99
{K, A, D, B}
T200
10/15/99
{D, A, C, E, B}
T300
10/19/99
{C, A, B, E}
T400
10/22/99 {B, A, D}
min_sup = 60%, min_conf = 80%
© 2002 by Prentice Hall
Correlations
• Corr (A,B) = P (A OR B) / P(A) P (B)
• If Corr < 1: A discourages B (negative
correlation)
• (lift of the association rule A  B)
120
© 2002 by Prentice Hall
Contingency table
121
Game
^Game
Sum
Video
4,000
3,500
7,500
^Video
2,000
500
2,500
Sum
6,000
4,000
10,000
© 2002 by Prentice Hall
Example
•
•
•
•
122
P({game}) = 0.60
P({video}) = 0.75
P({game,video}) = 0.40
P({game,video})/(P({game})x(P({video})
) = 0.40/(0.60 x 0.75) = 0.89
© 2002 by Prentice Hall
Example 2
hotdog ^hotdo Sum
s
gs
123
hamburgers 2000
500
2500
^hamburger 1000
s
1500
2500
Sum
2000
5000
3000
© 2002 by Prentice Hall
Classification using decision
trees
• Expected information need
S
• I (s1, s2, …, sm) = -
pi log (pi)
• s = data samples
• m = number of classes
124
© 2002 by Prentice Hall
RID
Age
Income
student
credit
buys?
1
<= 30
High
No
Fair
2
<= 30
High
No
Excellent No
3
31 .. 40
High
No
Fair
Yes
4
> 40
Medium
No
Fair
Yes
5
> 40
Low
Yes
Fair
Yes
6
> 40
Low
Yes
Excellent No
7
31 .. 40
Low
Yes
Excellent Yes
8
<= 30
Medium
No
Fair
No
9
<= 30
Low
Yes
Fair
Yes
10
> 40
Medium
Yes
Fair
Yes
11
<= 30
Medium
Yes
Excellent Yes
12
31 .. 40
Medium
No
Excellent Yes
13
31 .. 40
High
Yes
Fair
14
> 40
Medium
no
excellent no
125
No
Yes
© 2002 by Prentice Hall
Decision tree induction
• I(s1,s2)
= I(9,5) =
= - 9/14 log 9/14 – 5/14 log 5/14 =
= 0.940
126
© 2002 by Prentice Hall
Entropy and information gain
•E(A) =
S
S1j + … + smj
s
I (s1j,…,smj)
Entropy = expected information based on the partitioning into
subsets by A
Gain (A) = I (s1,s2,…,sm) – E(A)
127
© 2002 by Prentice Hall
Entropy
• Age <= 30
s11 = 2, s21 = 3, I(s11, s21) = 0.971
• Age in 31 .. 40
s12 = 4, s22 = 0, I (s12,s22) = 0
• Age > 40
s13 = 3, s23 = 2, I (s13,s23) = 0.971
128
© 2002 by Prentice Hall
Entropy (cont’d)
• E (age) =
5/14 I (s11,s21) + 4/14 I (s12,s22) + 5/14 I
(S13,s23) = 0.694
• Gain (age) = I (s1,s2) – E(age) = 0.246
• Gain (income) = 0.029, Gain (student) =
0.151, Gain (credit) = 0.048
129
© 2002 by Prentice Hall
Final decision tree
age
> 40
31 .. 40
student
credit
yes
no
yes
no
130
yes
excellent
no
fair
yes
© 2002 by Prentice Hall
Other techniques
• Bayesian classifiers
• X: age <=30, income = medium,
student = yes, credit = fair
• P(yes) = 9/14 = 0.643
• P(no) = 5/14 = 0.357
131
© 2002 by Prentice Hall
Example
• P
P
P
P
P
P
P
P
132
(age < 30 | yes) = 2/9 = 0.222
(age < 30 | no) = 3/5 = 0.600
(income = medium | yes) = 4/9 = 0.444
(income = medium | no) = 2/5 = 0.400
(student = yes | yes) = 6/9 = 0.667
(student = yes | no) = 1/5 = 0.200
(credit = fair | yes) = 6/9 = 0.667
(credit = fair | no) = 2/5 = 0.400
© 2002 by Prentice Hall
Example (cont’d)
• P (X | yes) = 0.222 x 0.444 x 0.667 x 0.667 =
0.044
• P (X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
• P (X | yes) P (yes) = 0.044 x 0.643 = 0.028
• P (X | no) P (no) = 0.019 x 0.357 = 0.007
• Answer: yes/no?
133
© 2002 by Prentice Hall
More types of data mining
•
•
•
•
134
Classification and prediction
Cluster analysis
Outlier analysis
Evolution analysis
© 2002 by Prentice Hall