Download data4 lecture - NDSU Computer Science

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Big data wikipedia , lookup

Database wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Clusterpoint wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Relational model wikipedia , lookup

Functional Database Model wikipedia , lookup

Database model wikipedia , lookup

Transcript
2. Horizontal Data
2. Horizontal Data
Let's take a look at some issues and problems regarding data in general
CENTRALITY OF DATA
Data are central to every computer program.
If a program has no data, there is no input, no output, no constants, no variables...
It is hard to imagine a program in which there is no data? Therefore, virtually all
programs are data management programs and therefore, virtually all computing
involves data management.
However, not the all data in computer programs is RESIDUALIZED.
RESIDUALIZED data is data stored and managed after the termination of the
program that generated it (for reuse later).
Database Management Systems (DBMSs) store and manage residualized data.
ISSUES AND PROBLEMS WITH DATA CONTINUED
HUGE VOLUME
(EVERYONE HAS LOTS OF DATA AVAILABLE TO THEM TODAY!)
Data are collected much faster than data are process or managed.
NASA's Earth Observation System (EOS), alone, has collected over 15 petabytes of
data already (15,000,000,000,000,000 bytes).
Most of it will never be use!
Most of it will never be seen! Why not?
There's so much volume, so usefulness of much of it will never be discovered.
SOLUTION: Reduce the volume and raise the information density through structuring,
querying, filtering, mining, summarizing, aggregating...
That's the main task of Data and Database workers today!
Claude Shannon's information theory principle comes into play here:
More volume means less information.
Shannon's Law of Information
The more volume you have, the less information you have.
(AKA: Shannon’s Canon)
A simple illustration: Which phone book has more useful information?
(both have the same 4 data granules; Smith, Jones, 234-9814, 231-7237)
BOOK-1
Name
Number
Number
Smith
234-9816
Jones
231-7237
BOOK-2
Name
Smith
234-9816
Smith
231-7237
Jones
234-9816
Jones
231-7237
The Red Book has no useful phone number information!
Data analysis, querying and mining reduce volume and raises info level
STRUCTURING and RESIDUALIZING DATA
Proper Structuring of data may be the second most important task in data and
database system work today!
At the highest level, is the decision as to whether a data set should be structured as
horizontal or vertical data (or some combination).
Another important task to be addressed in data systems work today is
RESIDUALIZATION OF DATA
MUCH WELL-STRUCTURED DATA IS DISCARDED PREMATURELY
Databases are about storing data persistently, for later use.
RESIDUALIZING DATA may be the third most important task in data and database
system work today!
WHAT IS A DATABASE?
There are many definitions in the literature. Here is the one we will use:
An integrated shared repository of operational data of interest to an enterprise.
INTEGRATED: it must be the unification of several distinct files
SHARED: same data can be used by more than 1 user (concurrently)
REPOSITORY: implies "persistence".
OPERATIONAL DATA: data on accounts, parts, patients, students, employees,
genes, stock, pixels,...
By contrast, nonoperational incl. I/O data, transient data in buffers, queues...
ENTERPRISE: bank, warehouse, hospital, school, corp, gov agency, person..
WHAT IS A DATABASE MANAGEMENT SYSTEM
(DBMS)
A program which organizes and manages access to residual data
Databases also contains METADATA also (data on the data).
Metadata is non-user data which contains the descriptive information about the
data and database organization (i.e., Catalog data).
WHY USE A DATABASE?
COMPACTNESS (saves space - no paper files necessary).
EASE OF USE (less drudgery, more of the organizational and search work done by the
system; user specifies what, not how).
CENTRALIZED CONTROL (by DB Administrator (DBA) and by the CEO).
REDUCES REDUNDANCY (1 copy is enough, but concurrent use must be controlled
NO INCONSISTENCIES (again, since there is only 1 copy necessary).
ENFORCE STANDARDS (corporate, dept, industry, national, international).
INTEGRITY CONSTRAINTS (automatically maintained) (e.g., GENDER=male =>
MAIDEN_NAME=null).
BALANCE REQUIREMENTS (even conflicting requirements? DataBase
Administrator (DBA) can optimize for the whole company).
DATA INDEPENDENCE (occurs because applications are immune to storage
structure and access strategy changes. Can change the storage structure without
changing the access programs and vice versa).
HORIZONTAL DATA
Almost all commerical databases today are HORIZONTAL. That is, the contain
horizontally structure data. Horizontal data is data is formed into files of
horizontal records of a common type.
HORIZONTAL DATA TERMINOLOGY
stored (physical, on disk)
FIELDS, RECORDS, FILES
logical (as viewed by user)
type (e.g., datatype)
FIELDS, RECORDS, FILES
occurrences (instances)
TYPE: defines structure and expected contents (time-independent - changes only upon
DB reorganization)
OCCURRENCE: actual data instances at a given time (time-dependent - changes
with every insert/delete/update)
Stored? (versus logical)
STORED FIELD is the smallest unit of stored data in a database.
e.g., Jones is a Lname stored field occurrence.
Char 25 might be the metadata type of that occurrence.
STORED RECORD is a named horizontal concatenation of related stored fields.
e.g., | Jones | John | 412 Elm St | Fargo | ND | 58102 |
an instance
City
St
Zip
Lname
Fname
Address
field names
Lname(char25), Fname(char15), Address(char20), City(char15), St(char2), Zip(char5)
field types
STORED FILE is a named collection of all occurrences of 1 type of stored record.
Employee | Lname | Fname | Address | City | St | Zip |
| Jones
| Smith
| Thom
| Trath
| John
| James
| Bob
| Phil
| 412 Elm |Fargo| ND|58102|
| 415 Oak | Mhd |MN|56560|
| 12 Main | Mhd |MN|56560|
| 234 12St |Fargo|ND |58105|
. . . EoF
record and field names
record instance
record instance
record instance
record instance
End of File marker
Stored continued
The employee file type IS the common employee record type (+ possibly, some other
type characteristics, e.g., max-#-records)
In todays storage device world, there is only linear storage space, so the 2-D picture of
a stored file, strictly speaking, not possible in physical storage media today.
Some day there may be truly 2-D storage (e.g., holographic storage) and even 3-D.
A more accurately depiction of the store Employee file (as stored on linear storage):
| Jones | John | 412 Elm |Fargo| ND|58102|| Smith | James |
| 415 Oak | Mhd |MN|56560|| Thom | Bob
| Phil
| 12 Main | Mhd |MN|56560|| Trath |
| 234 12St |Fargo|ND |58105| EoF |
How these entities are stored and how they are viewed or known to users may differ. They may
be known to the users in various logical variations. A logical record based on the 1st
occuring employee record above might be: | Jones | John |Fargo| ND|
So we also have
LOGICAL FIELD = smallest unit of logical data
LOGICAL RECORD= named collection of related logical fields.
LOGICAL FILE
= named collection of occurrences of 1 type
logical record which may or may not correspond to the physical entities.
of
Terminology
Unfortunately there is a lot of variation in terminology. It will suffice to "equate" terms
as follows in this course:
COMMON USAGE
File
Record
Field
RELATIONAL MODEL
Relation
Tuple
Attribute
When we need to be more careful we will use:
relation is a "set" of tuples whereas a
table is a "sequence" of rows or records (has order)
tuple is a "set" of fields whereas a
row or record is a "sequence" of fields (has order)
TABULAR USAGE
Table
Row
Column
DATA MODELS
For conceptualizing (logically) and storing (physically) data in a database we have
horizontal and vertical models.
Here are some of the HORIZONTAL MODELS for files of horizontal records:
(in which processing is typically done through vertical scans, e.g.,
Get and process 1st record.
Get and process next record ...)
RELATIONAL
.
TABULAR
(simple flat unordered files or relations of records of tuples
of unordered field values)
(ordered files of ordered fields)
INVERTED LIST
(Tabular with an access paths (index?) on every field)
HIERARCHICAL
(files with hierarchical links)
NETWORK
(files with record chains)
OBJECT-RELATIONAL (Relational with "Large OBject" (LOBs) fields) (attributes .
which point to or contain complex objects).
(Each is described in detail later in these notes).
DATA MODELS cont.
Here are some of the VERTICAL MODELS (for vertical vectors or trees of
attribute values, processing is typically through logical horizontal AND/OR
programs).
BINARY STORAGE MODEL (Copeland ~1986) (This model used vertical value .
and bit vectors. It has virtually dissappeared!)
BIT TRANSPOSE FILES
.
(Wang ~1988) (This model used vertical bit files.
It has also virtually dissappeared!)
VIPER STRUCTURES
(~1998) (Used vertical bit vectors for data mining.)
PREDICATE-Trees or Ptrees (This model and technology is patented by NDSU
.
and uses vertical bit trees) (~1997).
(The last one only is described in detail later in these notes). The first two have not
developed and are non-existent today. Those interested in Viper can Google it.)
REVIEW OF HORIZONTAL DATA MODELS
RELATIONAL DATA MODEL The only construct allowed is a [simple, flat] relation
for both entity description and relationship definition, e.g.,
STUDENT
S# SNAME LCODE
|25|CLAY |NJ5101|
|32|THAISZ|NJ5102|
|38|GOOD |FL6321|
|17|BAID |NY2091|
|57|BROWN |NY2092|
COURSE
C# CNAME SITE
|8 |DSDE |ND |
|7 |CUS |ND |
|6 |3UA |NJ |
|5 |3UA |ND |
ENROLL
S# C# GRADE
|32|8 | 89 |
|32|7 | 91 |
|25|7 | 68 |
|25|6 | 76 |
|32|6 | 62 |
|38|6 | 98 |
|17|5 | 96 |
LOCATION
The LOCATION relation represents a
LCODE STATUS
relationship between the LCODE
and STATUS attributes
|NJ5101| 1
|
(1-to-many).
|NJ5102| 1
|
|FL6321| 4
|
|NY2091| 3
|
|NY2092| 3
|
The STUDENT and COURSE
relations represent entities
The ENROLL relations represents a relationship between
Student and Course entities (a many-many relationship)
REVIEW OF HORIZONTAL DATA MODELS
HIERARCHICAL DATA MODEL entities=records relationships=links of records forming trees
EX:
root type is STUDENT
dependent type is COURSE
(with attributes S#, NAME, LOCATION),
(with attributes C#, CNAME),
2nd-level dependent type ENROLLMENT (with attributes, GRADE, LOC)
25|CLAY|OTBK
7|CUS
6|3UA
ND|68
NJ|76
32|THAISZ|KNB
8|DSDE
7|CUS
6|3UA
ND|62
ND|89
ND|91
38|GOOD|GTR
6|3UA
NJ|98
STUDENTS
COURSES
ENROLLMENTS
If the typical workload involves producing class lists for students, this organization is very good. Why?
If the typical workload is producing course enrollment lists for professors, this is very poor. Why?
The problem with the Hierarchical Data Model is that it almost always favors a
particular workload category (at the expense of the others).
REVIEW OF HORIZONTAL DATA MODELS
NETWORK DATA MODEL
entities = records
relationships = owner-member chains (sets)
many-to-many relationships easily accomodated
EX: 3 entities (STUDENT ENROLLMENT COURSE)
2 owner-member chains:
25| CLAY | MJ511
68
8|DSDE|ND
76
STUDENT-ENROLLMENT
32 | THAISZ | NJ512
89
7|CUS |ND
91
62
6|3UA |NJ
COURSE-ENROLLMENT
STUDENT records
ENROLLMENT records
COURSE records
Easy to insert (create new record and reset pointers), delete (reset pointers),
update (always just 1 copy to worry about, ZERO REDUNDANCY!)
network approach: fast processing, complicated structure (usually requires data
processing shop)
Again, it favors one workload type over others.
REVIEW OF HORIZONTAL DATA MODELS
INVERTED LIST MODEL (TABULAR):
Flat Ordered Files (like relational except there's intrinsic order visible to user
programs on both tuples and attributes). Order is usually "arrival order",
meaning each record is given a unique "Relative Record Number" or RRN
when it is inserted. - RRNs never change (unless there is a reorganization).
Programs can access records by RRN. Physical placement of records on pages is
in RRN order ("clustered on RRN" so that application programs can
efficiently retrieve in RRN order.
Indexes, etc can be provided for other access paths (and orderings).
page1
RRN
| 0 |
| 1 |
| 2 |
| 3 |
S#
25
32
38
47
ST
|NJ|
|NJ|
|FL|
|NY|
page2
| 0 | 57 |NY|
|
|
| |
|
|
| |
|
|
| |
STATE-INDEX
RID STATE
|1,2| FL |
|1,0| NJ |
|1,1| NJ |
|1,3| NY |
|2,0| NY |
REVIEW OF HORIZONTAL DATA MODELS
OBJECT RELATIONAL MODEL
Object Relational Model (OR model) is like relational model except repeating groups
are allowed (many levels of repeating groups - even nested repeating groups) and
Pointers to very complex structures are allowed.
(LOBs for Large OBjects, BLOBs for Binary Large OBjects, etc. for storing, e.g.,
pictures, movies, and other binary large objects.
3. Vertical Data
Thank
you.