Database Management System
(Systém riadenia bázy dát)
Ján GENČI
PDT
2009
Contents
• RAID
• Two-phase multiway sort-merge
• Physical data organization
• Indexing
• System catalog
• Relational algebra operations (briefly)
• Implementation of relational algebra operations
2
Contents (not covered this time)
• Transaction processing
• Parallel processing
• Recovery from failures
3
Literature [1]
• Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: Database System Implementation. Prentice Hall, 1999. ISBN-10: 0130402648, pp. 653
• Database Systems: The Complete Book, 2001
4
Literature [2]
• Elmasri R., Navathe S. B.: Fundamentals of Database Systems. 4th ed., Pearson Education, 2001; 5th ed., 2006, pp. 1030 (ch. 13-15 resp. 13-19; 120 resp. 220 pages)
5
Literature [3]
• Ramakrishnan R., Gehrke J.: Database Management Systems. McGraw-Hill Science/Engineering/Math, 3rd ed., 2002, pp. 906 (ch. 7-14; 220 pages)
6
Literature [4]
• Abraham Silberschatz, Henry Korth, S. Sudarshan: Database System Concepts. McGraw-Hill Science/Engineering/Math, 5th ed., 2005, pp. ~920 (ch. 11-14 resp. 11-17; 170 resp. 290 pages)
7
RAID
Figures (most of them) from [2]
RAID
• Originally - Redundant Arrays of Inexpensive Disks
• Currently - Redundant Array of Independent Disks
• Chen, Lee, Gibson, Katz, and Patterson (1994), ACM Computing Surveys, Vol. 26, No. 2 (June 1994)
• http://sk.wikipedia.org/wiki/RAID (nicely illustrated)
9
RAID 0
10
RAID 1, 2
11
RAID 3, 4, 5, 6
12
RAID - further combinations
• 10, 01 - combinations of the basic RAID levels
• Performance:
– Block-interleaved distributed-parity disk arrays (RAID 5) have the best small-read, large-read, and large-write performance of any redundant disk array.
– Small write requests are somewhat inefficient compared with redundancy schemes such as mirroring.
13
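To make the parity-based redundancy concrete, here is a minimal sketch (not from the original slides; block contents and sizes are illustrative) of how a RAID 4/5-style parity block is computed with XOR and how a lost data block is reconstructed from the surviving blocks:

    # Minimal sketch of parity-based redundancy (RAID 4/5 style).
    def xor_blocks(blocks):
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    data = [b"AAAA", b"BBBB", b"CCCC"]     # data blocks of one stripe
    parity = xor_blocks(data)              # parity block, stored on another disk

    # If one data block is lost, XOR of the surviving blocks and the parity
    # reconstructs it:
    lost = 1
    recovered = xor_blocks([d for i, d in enumerate(data) if i != lost] + [parity])
    assert recovered == data[lost]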
Two-phase, multiway sort-merge
Partially based on a presentation by Simonas Šaltenis - Advanced Algorithm Design and Analysis
Purpose of the algorithm
• Sorting of very large collections of data (data > memory)
• Classic algorithm - Wirth's sort-merge algorithm (Wirth N.: Algoritmy a dátové štruktúry)
15
Principle - Phase 1
1. Create the largest possible "runs" (sorted sequences of elements) - ideally by reading as much data as fits into the available memory and sorting it, e.g. with quicksort
2. Merge the runs
16
Principle - Phase 2
[Figure: the k sorted runs of file Y (Run 1, Run 2, ..., Run k = n/m) are read through input buffers Bf1 ... Bfk with cursors p1 ... pk; the smallest of the current elements, min(Bf1[p1], Bf2[p2], ..., Bfk[pk]), is moved to the output buffer Bfo (cursor po) and written to file X when Bfo is full; an input buffer is refilled from its run when its cursor reaches the end of the page (pi = B).]
17
Assessment
• Phase 1: O(n) I/Os, Phase 2: O(n) I/Os
• Total: O(n) I/Os!
• Only files of "limited" size can be sorted
– Phase 2 can merge at most m-1 runs (m - number of buffer pages)
– Which means: N/M (the number of runs) must be at most m-1
18
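A minimal in-memory sketch of the two-phase idea (illustrative only; a real implementation reads and writes disk blocks through the buffers shown above):

    import heapq

    def two_phase_sort(records, memory_capacity):
        # Phase 1: read memory-sized chunks, sort each (e.g. quicksort) and
        # keep it as a run.
        runs = []
        for i in range(0, len(records), memory_capacity):
            runs.append(sorted(records[i:i + memory_capacity]))

        # Phase 2: k-way merge of all runs, always taking the smallest of the
        # current front elements (heapq.merge replaces the explicit min() scan).
        return list(heapq.merge(*runs))

    # With memory_capacity = M records this works in one merge pass as long as
    # the number of runs does not exceed the number of input buffers (m - 1).
    print(two_phase_sort([5, 1, 9, 3, 7, 2, 8], memory_capacity=3))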
Sorting very large files
[Figure: if a single merge pass is not enough, runs are merged over several levels - Phase 1 produces runs of size M, one merge pass produces runs of size (m-1)M, the next level (m-1)^2 * M, and after three levels (m-1)^3 * M = N, i.e. the whole file.]
19
Questions
20
DBMS - structures and algorithms
22
Primary (physical) organizations
What we will talk about
• Supported data types
• Record formats
• Organization (ordering) of records
– physical
– logical
• "Placement" of the DBMS within the OS
24
Supported data types
• So-called built-in data types
• For data storage purposes, what matters to us is the size of the data type (sizeof(type))
• The "semantics" of a type is supported by the implementation (HW or SW) of the relevant operations (out of scope)
25
Storage Record Formats
• A fixed-length record
• A record with variable-length fields
• A variable-field record with separator
characters.
26
Storage Record Formats [2]
27
Fixed-length record
• The sizes of the items are recorded in the system catalog
28
Variable-length records
• The result of one or more items of variable length
[Figure: two layouts of a variable-length record - (a) fields F1 $ F2 $ F3 $ F4, where Fi = item i and the individual fields are separated by delimiter characters ($); (b) an array of pointers to the items of the record, F1 ... F4.]
29
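A minimal sketch of the two encodings above - delimiter-separated fields versus an offset array at the start of the record (all names and sizes are illustrative):

    # Delimiter-based encoding: fields are joined with a separator character.
    def encode_delimited(fields, sep=b"$"):
        return sep.join(fields)

    def decode_delimited(record, sep=b"$"):
        return record.split(sep)

    # Offset-array encoding: the record starts with the offsets of its fields,
    # so any field can be located without scanning the whole record.
    def encode_offsets(fields):
        offsets, pos = [], 2 * (len(fields) + 1)   # 2-byte offset per field + end
        for f in fields:
            offsets.append(pos)
            pos += len(f)
        offsets.append(pos)                        # end offset of the last field
        head = b"".join(o.to_bytes(2, "little") for o in offsets)
        return head + b"".join(fields)

    def get_field(record, i):
        start = int.from_bytes(record[2*i:2*i+2], "little")
        end = int.from_bytes(record[2*i+2:2*i+4], "little")
        return record[start:end]

    r = encode_offsets([b"John", b"Smith", b"Kosice"])
    assert get_field(r, 2) == b"Kosice"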
NULL value representation
• Practically all sources are "silent" about how it is implemented
• With variable-length records, a null pointer to the record item can be used
• ORACLE, in its ORA7 documentation, presented storing NULL values via a bitmap prefix of the record
30
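A minimal sketch of the bitmap-prefix idea (purely illustrative, not Oracle's actual on-disk format): one bit per field says whether the field is NULL, and only non-NULL fields are stored.

    def encode_with_null_bitmap(fields):
        # fields: list of byte strings or None; one bitmap byte covers up to 8 fields
        bitmap, body = 0, b""
        for i, f in enumerate(fields):
            if f is None:
                bitmap |= (1 << i)            # bit set = field is NULL, nothing stored
            else:
                body += len(f).to_bytes(1, "little") + f
        return bytes([bitmap]) + body

    rec = encode_with_null_bitmap([b"John", None, b"Kosice"])
    # bitmap = 0b010 -> the second field is NULL and occupies no space in the record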
Physical organization of records
[Figure: "packed" organization - records stored in slots 1 ... N with the free space at the end and the page header holding N, the number of records; "unpacked" (bitmap) organization - M slots, the page header holds the number of slots M and a bitmap (e.g. 1 0 1 ...) marking which slots are occupied.]
31
Physical organization of records 2
[Figure: page i with a slot directory - the data area grows from one end, free space is in the middle, and the slot directory at the other end of the page holds, for every record, its length and a pointer to the start of the record, plus N, the number of entries in the slot directory; a record is addressed as rid = (page, slot), e.g. rid = (i, 1), rid = (i, 2), ..., rid = (i, N).]
32
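A minimal in-memory sketch of a slotted page (names are illustrative): records are appended to the data area, the slot directory stores (offset, length) pairs, and a record id is (page, slot).

    class SlottedPage:
        # Simplified in-memory slotted page; a real page is one fixed-size disk block.
        def __init__(self, page_no, size=4096):
            self.page_no = page_no
            self.data = bytearray(size)
            self.free = 0                    # next free byte in the data area
            self.slots = []                  # slot directory: (offset, length)

        def insert(self, record):
            offset = self.free
            self.data[offset:offset + len(record)] = record
            self.free += len(record)
            self.slots.append((offset, len(record)))
            return (self.page_no, len(self.slots) - 1)   # rid = (page, slot)

        def read(self, rid):
            _, slot = rid
            offset, length = self.slots[slot]
            return bytes(self.data[offset:offset + length])

    page = SlottedPage(page_no=7)
    rid = page.insert(b"record-1")
    assert page.read(rid) == b"record-1"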
Placing records into physical blocks
• Spanned
• Unspanned
33
Logical organizations of records
• Sequential
• Hashed
• Heap
• Assessment with respect to the insert, find and delete operations
34
Sequential organization
35
Assessment - sequential organization
• Insert - expensive operation (on average N/2 records must be shifted) - overflow areas
• Find - binary search is possible on the ordering attribute, O(log2 N); otherwise O(N) = N/2 or N
• Delete - expensive operation (on average N/2 records must be shifted) - records can instead be marked as deleted and the file packed later
36
Internal hashing
37
Assessment - hashing
• Insert - O(1) if we ignore collisions; with collisions the worst case is O(N)
• Find - O(1) on the hash attribute, O(N) on other attributes
• Delete - O(1)
• The structure must be dimensioned for the maximum number of records
38
External hashing
39
Assessment - external hashing
• Same as internal hashing
• Collisions are resolved with overflow blocks (see the next slide and the sketch below)
40
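A minimal sketch of an externally hashed file with a fixed number of buckets and overflow chaining (illustrative; real buckets are disk blocks of fixed capacity):

    class HashedFile:
        # Fixed number of buckets; each primary "block" holds at most `capacity`
        # records, further records go to the bucket's overflow block.
        def __init__(self, n_buckets=8, capacity=4):
            self.capacity = capacity
            self.buckets = [[] for _ in range(n_buckets)]    # primary blocks
            self.overflow = [[] for _ in range(n_buckets)]   # overflow blocks

        def insert(self, key, record):
            b = hash(key) % len(self.buckets)
            target = self.buckets[b] if len(self.buckets[b]) < self.capacity else self.overflow[b]
            target.append((key, record))

        def find(self, key):
            b = hash(key) % len(self.buckets)
            for k, r in self.buckets[b] + self.overflow[b]:
                if k == key:
                    return r
            return None

    f = HashedFile()
    f.insert(123456789, "employee record")
    assert f.find(123456789) == "employee record"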
External hashing - overflow blocks
41
Extendible hashing
42
Assessment - extendible hashing
• Same as external hashing
• The plus is the possibility of dynamically growing the "size of the hash array"
43
Heap
• Records are unordered - there is no ordering attribute
• We lose the possibility of binary search and of a primary index (which works only for an ordering attribute)
• Very efficient INSERT operation
44
Place of the DBMS within the OS
[Figure: two configurations - "cooked files": DBMS -> OS services -> filesystem (e.g. NTFS) -> driver; "raw devices": DBMS -> OS services -> driver, bypassing the filesystem.]
45
Questions
46
Indexing
Largely based on [2]
All figures from [2]
Index
• An alternative way of accessing data
• Locating a record by its content
48
Categorization of indexes
• By the number of levels:
– Single-level
– Multi-level
• By the indexed attribute:
– Primary
– Clustering
– Secondary
• By the number of indexed records:
– Dense - all records are in the index
– Sparse - only some records are in the index
49
Primary index
• Indexes the "ordering" attribute
• Sparse index
• "Anchor" record
• INSERT problem
50
Clustering index
• Also over a "non-ordering" attribute
• The primary organization is ordered by the given attribute - when the index is built
51
Clustering index
• During normal operation the primary organization is not modified; overflow blocks are used instead
52
Secondary index
• An index over a non-ordering (but key) attribute
• Dense index
53
Secondary index
• Over a non-key attribute (repeating values)
54
Interim assessment
• Only single-level indexes so far
• Benefit (N - number of records, r - records per block)
– Search on an "ordered" key - log2 N
– Search on a "non-ordered" key - N/2
– Search on a non-key attribute - N
– Primary index - log2(N/r)
– Secondary index - log2 N (the number of blocks read is substantially smaller thanks to the higher blocking factor)
55
Example - sequential file (ordering attribute)
• Ordered file with r = 30,000 records
• Block size B = 1024 bytes
• Records are of fixed size and are unspanned
• Record length R = 100 bytes
• The blocking factor bfr = floor(B/R) = floor(1024/100) = 10 records per block
• The number of blocks b = ceiling(r/bfr) = ceiling(30,000/10) = 3000 blocks
• A binary search would need approximately ceiling(log2 b) = ceiling(log2 3000) = 12 block accesses
56
Primary index
• To refresh your memory
57
Example - primary index
• The key field of the file is V = 9 bytes long, a block pointer is P = 6 bytes
• The size of an index entry Ri = (9 + 6) = 15 bytes => blocking factor bfri = floor(B/Ri) = floor(1024/15) = 68 entries per block
• The total number of index entries ri is equal to the number of blocks in the data file - 3000
• The number of index blocks is hence bi = ceiling(ri/bfri) = ceiling(3000/68) = 45 blocks
• A binary search on the index file would need ceiling(log2 bi) = ceiling(log2 45) = 6 block accesses
• To search for a record using the index, we need one additional block access to the data file - a total of 6 + 1 = 7 block accesses
58
Example - secondary index
• As in example 1: r = 30,000, R = 100 bytes, B = 1024 bytes
• To do a linear search, we would require b/2 = 3000/2 = 1500 block accesses on average (3000 in the worst case)
• Suppose V = 9 and P = 6 => bfri = 68
– The secondary index is dense => the total number of index entries ri is equal to the number of records = 30,000
– The number of blocks needed for the index is bi = ceiling(ri/bfri) = ceiling(30,000/68) = 442 blocks
– A binary search on this secondary index needs ceiling(log2 bi) = ceiling(log2 442) = 9 block accesses
59
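The arithmetic of the three examples, repeated as a small script so the numbers can be checked or recomputed for other parameters (the values are the ones from the slides):

    from math import ceil, floor, log2

    B, R, r = 1024, 100, 30000            # block size, record length, number of records
    bfr = floor(B / R)                    # 10 records per block
    b = ceil(r / bfr)                     # 3000 data blocks
    print("binary search on the data file:", ceil(log2(b)), "block accesses")   # 12

    V, P = 9, 6                           # key length, block pointer length
    bfri = floor(B / (V + P))             # 68 index entries per block

    bi_primary = ceil(b / bfri)           # sparse primary index: one entry per data block
    print("primary index search:", ceil(log2(bi_primary)) + 1, "block accesses")   # 6 + 1 = 7

    bi_secondary = ceil(r / bfri)         # dense secondary index: one entry per record
    print("secondary index search:", ceil(log2(bi_secondary)) + 1, "block accesses")  # 9 + 1 = 10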
Comparison of (single-level) indexes
60
Multi-Level Indexes
• Because a single-level index is an ordered file, we can create a primary index to the index itself; in this case, the original index file is called the first-level index and the index to the index is called the second-level index.
• We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block.
• A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block.
61
Multilevel indexes
• First level - dense or sparse
• Higher levels - sparse only
• Top level - only one block
• A search requires approximately log_bfri(bi) block accesses (see the sketch below)
• INSERT problem !!!
62
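Continuing the secondary-index example, a tiny sketch of how many levels (and hence block accesses) a multilevel index needs when each index block holds bfri = 68 entries:

    from math import ceil

    bfri = 68          # index blocking factor (fan-out)
    blocks = 442       # first-level index blocks (dense secondary index from above)

    levels = 1
    while blocks > 1:                 # build levels until the top level fits in one block
        blocks = ceil(blocks / bfri)  # each higher level indexes the blocks below it
        levels += 1

    # 442 -> 7 -> 1  =>  3 index levels; a lookup reads one block per level
    print(levels, "index levels,", levels + 1, "block accesses incl. the data block")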
Dynamic Multilevel Indexes Using B-Trees
and B+-Trees
• Because of the insertion and deletion problem, most
multi-level indexes use B-tree or B+-tree data structures,
which leave space in each tree node (disk block) to allow
for new index entries
• These data structures are variations of search trees that
allow efficient insertion and deletion of new search
values.
• In B-Tree and B+-Tree data structures, each node
corresponds to a disk block
• Each node is kept between half-full and completely full
63
Dynamic Multilevel Indexes Using B-Trees
and B+-Trees (contd.)
• An insertion into a node that is not full is quite efficient; if
a node is full the insertion causes a split into two nodes
• Splitting may propagate to other tree levels
• A deletion is quite efficient if a node does not become
less than half full
• If a deletion causes a node to become less than half full,
it must be merged with neighboring nodes
64
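A minimal sketch of the split step only (not a full B+-tree implementation): inserting into a full leaf splits it into two nodes that are at least half full, and the first key of the new right node is copied up to the parent.

    def insert_into_leaf(keys, new_key, order):
        # keys: sorted list of keys in one leaf node; order: max keys per node
        keys = sorted(keys + [new_key])
        if len(keys) <= order:
            return keys, None, None                 # fits, no split needed
        mid = len(keys) // 2
        left, right = keys[:mid], keys[mid:]        # both halves stay at least half full
        return left, right, right[0]                # right[0] is copied up to the parent

    left, right, up = insert_into_leaf([5, 12, 19, 23], 8, order=4)
    # left = [5, 8], right = [12, 19, 23], key 12 propagates to the parent node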
Difference between a B-tree and a B+-tree
• In a B-tree, pointers to data records exist at all levels of the tree
• In a B+-tree, all pointers to data records exist at the leaf-level nodes
• A B+-tree can have fewer levels (or a higher capacity of search values) than the corresponding B-tree
65
B-tree structure
66
B+-tree structure
67
B+-tree
example
68
B-tree example - numbers
69
B+-tree example - numbers
70
B-tree – duplicate keys
71
Questions
72
System catalog
Based on a presentation by
Ľubomír Miškovič
What the system catalog is
• The system catalog stores data that describe each database (metadata)
• It contains a description of:
– Items, records, files and the relationships between them
– The conceptual schema, the external schemas and the internal schema, including the mappings between the schemas at the different levels
74
Simplified model of a database system environment
75
Contents of the system catalog
• Catalogs for relational DBMSs contain
– Relation names
– Attribute names
– Attribute domains
– Primary keys
– Secondary key attributes
– Foreign keys
– Constraints
76
Contents of the system catalog
• They further contain descriptions of
– External views
– Storage structures and indexes for the internal level
– Security and authorization information that defines users' access to database views
– The login names of the creators or owners of each relation
77
Contents of the system catalog
• They store information such as
– Record size
– Current number of records
– Number of indexes
– The name of the creator of each relation
78
Ways of implementing the system catalog
• The system catalog may be created separately for each database in the system, or it may be shared by all databases
• The system catalog may consist of tables whose structure is identical to ordinary database tables, or of a special structure
79
Example of system catalogs in Informix
• Systables - describes every table in the database. It contains one row for each table, view or synonym defined in the database, including the system catalog tables themselves
• Syscolumns - defines every column in the database. There is one row for each column defined in a table or view
• Sysindex - describes the indexes in the database. It contains one row for each index defined in the database
80
Systables
81
syscolumns
82
Relationship between the tables
83
Oracle
84
Postgres
85
Questions
86
Relational algebra (RA) and the implementation of RA operations
Based on [2]
Relational algebra
• A relation is a subset of a Cartesian product
R ⊆ D1 × ... × Dn
• Relational algebra:
– A formal language for the relational model
– A basic set of operations for retrieval queries
88
Relational algebra operations
• Selection σ
• Projection π
• Cartesian product ×
• Join ⋈ (theta-, equi-, natural-)
• Set operations (union-compatible):
– Intersection ∩
– Union ∪
– Difference \
89
Elementary condition EC and condition C
• Definition:
An elementary (simple) condition EC is a clause of the form:
<Attribute> <Operator> <Value>
where the operator is from the set of relational operators {=, <, >, <=, >=, ≠}.
• Definition:
A condition C is a clause of the form:
[NOT] EC1 [{OR | AND} [NOT] EC2] ...
90
Examples
• (O1): σ_{SSN='123456789'}(EMPLOYEE)
• (O2): σ_{DNUMBER>5}(DEPARTMENT)
• (O3): σ_{DNO=5}(EMPLOYEE)
• (O4): σ_{DNO=5 AND SALARY>30000 AND SEX='F'}(EMPLOYEE)
• (O5): σ_{ESSN='123456789' AND PNO=10}(WORKS_ON)
91
SELECT operation
• Definition:
σ_c(R) = { ti ∈ R | c(ti) }
(3-valued logic)
• Implementation:
– Linear search
– Binary search
– Using a primary index (or hash key)
– Using a primary index to retrieve multiple records
– Using a clustering index to retrieve multiple records
– Using a secondary (B+-tree) index on an equality comparison
– ...
92
S1: Linear search (brute force)
Retrieve every record in the file, and test whether its attribute values satisfy the selection condition.
for every ti:
    if (c(ti) == TRUE)
        output(ti)
93
S2: Binary search
If the selection condition involves an equality comparison on a key attribute on which the file is ordered.
σ_{SSN='123456789'}(EMPLOYEE)
94
S3: Using a primary index (or hash
key)
If the selection condition involves an equality
comparison on a key attribute with a primary
index (or hash key), use the primary index
(or hash key) to retrieve the record. Note
that this condition retrieves a single record
(at most).
σ_{SSN='123456789'}(EMPLOYEE)
95
S4: Using a primary index to retrieve multiple records
If the comparison condition is >, >=, <, or <= on a key field with a primary index, use the index to find the record satisfying the corresponding condition.
σ_{DNUMBER>5}(DEPARTMENT) (selectivity, distribution)
σ_{DNO=5 AND SALARY>30000 AND SEX='F'}(EMPLOYEE)
96
S5: Using a clustering index to retrieve multiple records
If the selection condition involves an equality comparison on a non-key attribute with a clustering index (for example, DNO = 5 in O3), use the index to retrieve all the records satisfying the condition.
σ_{DNO=5}(EMPLOYEE)
(if clustered on DNO)
97
S6: Using a secondary (B+-tree)
index on an equality comparison
This search method can be used to retrieve
a single record if the indexing field is a key
(has unique values) or to retrieve multiple
records if the indexing field is not a key. This
can also be used for comparisons involving
>, >=, <, or <=.
98
S7: Conjunctive selection using an individual index
If an attribute involved in any single simple condition in the conjunctive condition has an access path that permits the use of one of the methods S2 (binary search) to S6 (B+-tree), use that condition to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive condition (see the sketch below).
99
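A minimal sketch of S7 (illustrative; `index` stands for whatever access path exists on one of the attributes, here modeled as a dict from attribute value to matching records):

    def conjunctive_select(index, index_value, remaining_conditions):
        # 1. Use the one condition that has an access path (an equality lookup here).
        candidates = index.get(index_value, [])
        # 2. Check the remaining simple conditions on every retrieved record.
        return [rec for rec in candidates
                if all(cond(rec) for cond in remaining_conditions)]

    # Example: DNO=5 AND SALARY>30000 AND SEX='F' with an index on DNO
    dno_index = {5: [{"DNO": 5, "SALARY": 40000, "SEX": "F"},
                     {"DNO": 5, "SALARY": 20000, "SEX": "F"}]}
    result = conjunctive_select(dno_index, 5,
                                [lambda r: r["SALARY"] > 30000,
                                 lambda r: r["SEX"] == "F"])
    # result contains only the first record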
S8: Conjunctive selection using a composite index
If two or more attributes are involved in equality conditions in the conjunctive condition and a composite index (or hash structure) exists on the combined fields - for example, if an index has been created on the composite key (ESSN, PNO) of the WORKS_ON file for O5 - we can use the index directly.
100
JOIN operation
• R ⋈_c S = { ti · tj | ti ∈ R, tj ∈ S, c(ti, tj) == TRUE }
• Implementation
– Nested-loop join (brute force)
– Single-loop join (using an access structure to retrieve the matching records)
– Sort-merge join
– Hash-join
101
J1. Nested-loop join (brute force)
For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition c (incl. theta-join).
for each ti in R:
    for each sj in S:
        if (c(ti, sj) == TRUE)
            output(ti . sj)
Improvement - nested-block join (see the sketch below)
102
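A minimal sketch of the nested-block improvement (illustrative): the outer relation is read block by block, so each block of S is scanned once per block of R rather than once per record of R.

    def nested_block_join(r_blocks, s_blocks, condition):
        # r_blocks / s_blocks: lists of blocks, each block is a list of records
        result = []
        for r_block in r_blocks:            # one pass over R, block by block
            for s_block in s_blocks:        # S is re-read once per R block, not per record
                for t in r_block:
                    for s in s_block:
                        if condition(t, s):
                            result.append((t, s))
        return result

    emp = [[("John", 5)], [("Anna", 4)]]               # (name, dno), one record per block
    dept = [[(5, "Research")], [(4, "Admin")]]         # (dnumber, dname)
    print(nested_block_join(emp, dept, lambda t, s: t[1] == s[0]))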
J2. Single-loop join (using an access structure to retrieve the matching records)
If an index (or hash key) exists for one of the two join attributes - say, B of S - retrieve each record t in R, one at a time (single loop), and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A] (equi-join).
103
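A minimal sketch of J2 (illustrative): the access structure on S's join attribute B is modeled as a dict from join-attribute value to the matching S records.

    def single_loop_join(r_records, s_index, a):
        # s_index: access structure on S.B (value -> list of S records)
        # a: name of R's join attribute A
        result = []
        for t in r_records:                       # single loop over R
            for s in s_index.get(t[a], []):       # index lookup instead of scanning S
                result.append((t, s))
        return result

    employees = [{"name": "John", "dno": 5}, {"name": "Anna", "dno": 4}]
    dept_index = {5: [{"dnumber": 5, "dname": "Research"}],
                  4: [{"dnumber": 4, "dname": "Admin"}]}
    print(single_loop_join(employees, dept_index, "dno"))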
J3. Sort-merge join
If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.
Both files are scanned concurrently in order of the
join attributes, matching the records that have the
same values for A and B. If the files are not sorted,
they may be sorted first by using external sorting.
104
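A minimal sketch of a sort-merge equi-join over two lists already sorted on their join attributes (illustrative; duplicate join values are handled by collecting the matching group from S):

    def sort_merge_join(r_sorted, s_sorted, a, b):
        # r_sorted / s_sorted: lists of dicts sorted on join attributes a and b
        result, i, j = [], 0, 0
        while i < len(r_sorted) and j < len(s_sorted):
            if r_sorted[i][a] < s_sorted[j][b]:
                i += 1
            elif r_sorted[i][a] > s_sorted[j][b]:
                j += 1
            else:
                # collect the whole group of S records with this join value
                value, group_start = r_sorted[i][a], j
                while j < len(s_sorted) and s_sorted[j][b] == value:
                    j += 1
                while i < len(r_sorted) and r_sorted[i][a] == value:
                    for k in range(group_start, j):
                        result.append((r_sorted[i], s_sorted[k]))
                    i += 1
        return result

    emp = sorted([{"name": "Anna", "dno": 4}, {"name": "John", "dno": 5}], key=lambda x: x["dno"])
    dept = sorted([{"dnumber": 4, "dname": "Admin"}, {"dnumber": 5, "dname": "Research"}], key=lambda x: x["dnumber"])
    print(sort_merge_join(emp, dept, "dno", "dnumber"))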
J4. Hash-join
• The records of files R and S are both hashed to the
same hash file, using the same hashing function on the
join attributes A of R and B of S as hash keys.
• First, a single pass through the file with fewer records
(say, R) hashes its records to the hash file buckets
(partitioning phase - records of R are partitioned into the
hash buckets).
• In the second phase (probing phase), a single pass
through the other file (S) then hashes each of its records
to probe the appropriate bucket, and that record is
combined with all matching records from R in that
bucket.
105
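A minimal sketch of the two phases of a hash join (illustrative): partition the smaller relation R into hash buckets on A, then probe with each record of S on B.

    def hash_join(r_records, s_records, a, b, n_buckets=8):
        # Partitioning phase: hash the smaller file R into buckets on attribute A.
        buckets = [[] for _ in range(n_buckets)]
        for t in r_records:
            buckets[hash(t[a]) % n_buckets].append(t)

        # Probing phase: hash each record of S on attribute B to find its bucket
        # and combine it with all matching R records in that bucket.
        result = []
        for s in s_records:
            for t in buckets[hash(s[b]) % n_buckets]:
                if t[a] == s[b]:
                    result.append((t, s))
        return result

    emp = [{"name": "John", "dno": 5}, {"name": "Anna", "dno": 4}]
    dept = [{"dnumber": 5, "dname": "Research"}, {"dnumber": 4, "dname": "Admin"}]
    print(hash_join(emp, dept, "dno", "dnumber"))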
PROJECT operation
• π_{<attribute list>}(R)
• Implementation:
– Straightforward to implement if <attribute list> includes a key of relation R - the result has the same number of records.
– If <attribute list> does not include a key of R, duplicate tuples must be eliminated (sorting, hashing).
– An index can be used in some cases.
106
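A minimal sketch of projection with hash-based duplicate elimination (illustrative):

    def project(records, attribute_list):
        seen, result = set(), []
        for rec in records:
            projected = tuple(rec[a] for a in attribute_list)
            if projected not in seen:          # duplicate elimination via hashing
                seen.add(projected)
                result.append(projected)
        return result

    emp = [{"name": "John", "dno": 5}, {"name": "Anna", "dno": 5}]
    print(project(emp, ["dno"]))               # [(5,)] - duplicates removed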
SET operations
• The CARTESIAN PRODUCT operation R × S is quite expensive, because its result includes a record for each combination of records from R and S.
• It can be improved by processing at the block level.
• UNION, INTERSECTION, and SET DIFFERENCE apply only to union-compatible relations (that have the same number of attributes and the same attribute domains).
• Implementation - sort-merge technique and hashing
107
Sort-merge technique (for the SET operations)
• The two relations are sorted on the same attributes.
• After sorting, a single scan through each relation is sufficient to produce the result.
• For example, we can implement the UNION operation, R ∪ S, by scanning and merging both sorted files concurrently; whenever the same tuple exists in both relations, only one is kept in the merged result.
• For the INTERSECTION operation, R ∩ S, we keep in the merged result only those tuples that appear in both relations.
108
Hashing (for the SET operations)
• One table is partitioned and the other is used to probe the appropriate partition.
• For example, to implement R ∪ S, first hash (partition) the records of R; then, hash (probe) the records of S, but do not insert duplicate records into the buckets.
• To implement R ∩ S, first partition the records of R into the hash file. Then, while hashing each record of S, probe to check whether an identical record from R is found in the bucket, and if so, add the record to the result file.
• To implement R - S, first hash the records of R into the hash file buckets. While hashing (probing) each record of S, if an identical record is found in the bucket, remove that record from the bucket.
109
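A minimal sketch of the three hash-based set operations (illustrative; tuples are assumed hashable, e.g. plain Python tuples):

    def hash_partition(records, n_buckets=8):
        buckets = [[] for _ in range(n_buckets)]
        for rec in records:
            buckets[hash(rec) % n_buckets].append(rec)
        return buckets

    def union(r, s, n_buckets=8):
        buckets = hash_partition(r, n_buckets)
        for rec in s:                                    # probe; skip duplicates
            b = buckets[hash(rec) % n_buckets]
            if rec not in b:
                b.append(rec)
        return [rec for b in buckets for rec in b]

    def intersection(r, s, n_buckets=8):
        buckets = hash_partition(r, n_buckets)
        return [rec for rec in s if rec in buckets[hash(rec) % n_buckets]]

    def difference(r, s, n_buckets=8):                   # R - S
        buckets = hash_partition(r, n_buckets)
        for rec in s:
            b = buckets[hash(rec) % n_buckets]
            if rec in b:
                b.remove(rec)
        return [rec for b in buckets for rec in b]

    R, S = [(1, "a"), (2, "b")], [(2, "b"), (3, "c")]
    print(union(R, S), intersection(R, S), difference(R, S))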
Implementing Aggregate
Operations
• The aggregate operators (MIN, MAX, COUNT,
AVERAGE, SUM), when applied to an entire table, can
be computed by a table scan or by using an appropriate
index, if available.
• For example, consider the following SQL query:
SELECT MAX(SALARY)
FROM EMPLOYEE;
• If an (ascending) index on SALARY exists for the
EMPLOYEE relation, then the optimizer can decide on
using the index to search for the largest value by
following the rightmost pointer in each index node from
the root to the rightmost leaf.
110
Implementing Aggregate
Operations
• The dense index can be used for the
COUNT, AVERAGE, and SUM
aggregates.
• The associated computation would be
applied to the values in the index.
111
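A minimal sketch (illustrative) of computing these aggregates from a dense index alone, modeled as an ascending list of (SALARY, record pointer) entries, without touching the data file:

    salary_index = [(28000, "rid1"), (31000, "rid7"), (40000, "rid3")]  # ascending, dense

    values = [key for key, rid in salary_index]
    maximum = salary_index[-1][0]        # rightmost entry = MAX(SALARY)
    count = len(values)                  # dense index: one entry per record
    total = sum(values)                  # SUM over the index keys
    average = total / count              # AVERAGE

    print(maximum, count, total, average)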
GROUP BY clause
• When a GROUP BY clause is used in a query, the
aggregate operator must be applied separately to each
group of tuples.
• In this case, the computation is more complex - the table
must first be partitioned into subsets of tuples, where
each partition (group) has the same value for the
grouping attributes.
• Sorting or hashing are used to partition the file into the
appropriate groups
• If a clustering index exists on the grouping attributes,
then the records are already partitioned (grouped) into
the appropriate subsets.
112
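A minimal sketch of hash-based grouping for a GROUP BY, computing one aggregate per group (illustrative):

    def group_by_sum(records, group_attr, sum_attr):
        groups = {}                                   # hash-partition by the grouping attribute
        for rec in records:
            key = rec[group_attr]
            groups[key] = groups.get(key, 0) + rec[sum_attr]
        return groups

    emp = [{"dno": 5, "salary": 40000}, {"dno": 5, "salary": 30000}, {"dno": 4, "salary": 25000}]
    print(group_by_sum(emp, "dno", "salary"))         # {5: 70000, 4: 25000}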
• Questions
113