Topic for Thursday?
Miscellaneous Topics in Databases
PARALLEL DBMS
WHY PARALLEL ACCESS TO DATA?
 At 10 MB/s, scanning 1 terabyte takes 1.2 days.
 With 1,000-way parallelism (1,000 disks at 10 MB/s each), scanning 1 terabyte takes about 1.5 minutes.
 Parallelism: divide a big problem into many smaller ones to be solved in parallel.
PARALLEL DBMS: INTRO

 Parallelism is natural to DBMS processing
 Pipeline parallelism: many machines, each doing one step in a multi-step process.
 Partition parallelism: many machines doing the same thing to different pieces of data.
 Both are natural in DBMS!

[Figure: pipeline parallelism chains "Any Sequential Program" boxes one after another, while partition parallelism replicates a sequential program over pieces of the data, with outputs split N ways and inputs merged.]
SOME || TERMINOLOGY
 Speed-Up: more resources means proportionally less time for a given amount of data.
 Scale-Up: if resources are increased in proportion to the increase in data size, time is constant.
[Figure: throughput (Xact/sec.) and response time (sec./Xact) plotted against degree of ||-ism, each with an Ideal curve and a Realistic curve.]
 Why is Realistic <> Ideal?
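For reference, the standard textbook definitions (not spelled out on the slide) are:
  speedup = T_small / T_large, where T_small is the time on the original system and T_large the time on the larger system for the same problem
  scaleup = T_small / T_large, where T_small is the time for the original problem on the original system and T_large the time for an N-times-larger problem on an N-times-larger system (scaleup = 1 is ideal)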
INTRODUCTION

 Parallel machines are becoming quite common and affordable
 prices of microprocessors, memory and disks have dropped sharply
 recent desktop computers feature multiple processors, and this trend is projected to accelerate
 Databases are growing increasingly large
 large volumes of transaction data are collected and stored for later analysis
 multimedia objects like images are increasingly stored in databases
 Large-scale parallel database systems are increasingly used for:
 storing large volumes of data
 processing time-consuming decision-support queries
 providing high throughput for transaction processing
[Figure: Google data centers around the world, as of 2008]
PARALLELISM IN DATABASES
 Data can be partitioned across multiple disks for parallel I/O.
 Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel
 data can be partitioned and each processor can work independently on its own partition
 results are merged when done
 Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
 Thus, databases naturally lend themselves to parallelism.
PARTITIONING
 Horizontal partitioning (sharding)
 involves putting different rows into different tables
 Ex: customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest (see the SQL sketch after this list)
 Vertical partitioning
 involves creating tables with fewer columns and using additional tables to store the remaining columns
 partitions columns even when the table is already normalized
 also called "row splitting" (the row is split by its columns)
 Ex: split (slow to find) dynamic data from (fast to find) static data in a table where the dynamic data is not used as often as the static data
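As a concrete sketch of the horizontal-partitioning example above, using PostgreSQL-style declarative range partitioning (syntax differs in other DBMSs; the Customers table and its columns are illustrative, not from the slides):

-- Parent table, range-partitioned on ZIP code.
CREATE TABLE Customers (
    CustId integer,
    Name   varchar(50),
    Zip    integer
) PARTITION BY RANGE (Zip);

-- Rows with Zip < 50000 go to the "east" partition ...
CREATE TABLE CustomersEast PARTITION OF Customers
    FOR VALUES FROM (MINVALUE) TO (50000);

-- ... and rows with Zip >= 50000 go to the "west" partition.
CREATE TABLE CustomersWest PARTITION OF Customers
    FOR VALUES FROM (50000) TO (MAXVALUE);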
COMPARISON OF PARTITIONING TECHNIQUES
 Evaluate how well partitioning techniques support the following types of data access:
 1. Scanning the entire relation.
 2. Locating a tuple associatively – point queries. E.g., r.A = 25.
 3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.
HANDLING SKEW USING HISTOGRAMS
 A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion (see the sketch below)
 Assume a uniform distribution within each range of the histogram
 The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation
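One way to compute a balanced (equi-depth) partitioning vector directly in SQL, sketched here with PostgreSQL's ordered-set aggregate over the generic relation r and attribute A from the slides (4 partitions assumed):

-- The three returned values are the range boundaries of a 4-way
-- balanced partitioning vector on r.A.
SELECT percentile_cont(ARRAY[0.25, 0.5, 0.75])
       WITHIN GROUP (ORDER BY A) AS partitioning_vector
FROM r;
-- To avoid a full scan, the same query can be run over a sample, e.g.
-- ... FROM r TABLESAMPLE SYSTEM (1);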
INTERQUERY PARALLELISM
 Queries/transactions execute in parallel with one another (concurrent processing)
 Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second
 Easiest form of parallelism to support
INTRAQUERY PARALLELISM
 Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries
 Two complementary forms of intraquery parallelism:
 Intraoperation parallelism – parallelize the execution of each individual operation in the query (each CPU works on a subset of the tuples); see the sketch below
 Interoperation parallelism – execute the different operations in a query expression in parallel (each CPU runs a subset of the operations on the data)
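A small illustration of intraoperation parallelism (the sales table and its columns are hypothetical, and whether the optimizer actually parallelizes the plan depends on the DBMS):

-- Each processor can scan and pre-aggregate its own partition of sales;
-- the per-partition sums are then merged by region for the final result.
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;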
PARALLEL JOIN
 The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
 Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
 In a final step, the results from each processor are collected together to produce the final result.
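For intuition: with range partitioning on the join attribute, each of n processors can evaluate the same join restricted to its own key range, and the union of the per-processor outputs is the complete join. A sketch of the work assigned to one processor (relations r and s and the boundaries 100/200 are hypothetical):

-- This processor joins only tuples whose join key k falls in [100, 200);
-- the extra predicate on s.k mirrors how both relations are partitioned
-- on the join attribute before the local joins run.
SELECT r.*, s.*
FROM r JOIN s ON r.k = s.k
WHERE r.k >= 100 AND r.k < 200
  AND s.k >= 100 AND s.k < 200;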
QUERY OPTIMIZATION
 Query optimization in parallel databases is more complex than in sequential databases
 Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention
 When scheduling an execution tree in a parallel system, we must decide:
 how to parallelize each operation and how many processors to use for it
 which operations to pipeline, which to execute independently in parallel, and which to execute sequentially
 Determining the amount of resources to allocate for each operation is a problem
 E.g., allocating more processors than optimal can result in high communication overhead
DEDUCTIVE DATABASES
OVERVIEW OF DEDUCTIVE DATABASES
 Declarative language
 a language to specify rules
 Inference engine (deduction machine)
 can deduce new facts by interpreting the rules
 related to logic programming
 the Prolog language (Prolog => Programming in logic) uses backward chaining to evaluate, i.e., top-down application of the rules
 Consists of:
 Facts – similar to a relation specification, without the necessity of including attribute names
 Rules – similar to relational views (virtual relations that are not stored)
PROLOG/DATALOG NOTATION
 Facts are provided as predicates
 A predicate has a name and a fixed number of arguments
 Convention: constants are numeric or character strings; variables start with upper-case letters
 E.g., SUPERVISE(Supervisor, Supervisee) states that Supervisor SUPERVISE(s) Supervisee
PROLOG/DATALOG NOTATION
 Rule
 Is of the form head :- body, where :- is read as "if and only if"
 E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
 E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
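To connect rules to SQL: the full SUPERIOR relation is normally defined with a second, recursive rule, SUPERIOR(X,Y) :- SUPERVISE(X,Z), SUPERIOR(Z,Y) (a standard textbook rule, not shown above). Under that assumption, the pair of rules corresponds to a recursive common table expression; only the SUPERVISE predicate comes from the slides, the rest is illustrative:

-- SUPERIOR as the transitive closure of SUPERVISE.
WITH RECURSIVE superior(sup, sub) AS (
    SELECT supervisor, supervisee FROM supervise   -- SUPERIOR(X,Y) :- SUPERVISE(X,Y)
  UNION
    SELECT s.supervisor, r.sub                     -- SUPERIOR(X,Y) :- SUPERVISE(X,Z), SUPERIOR(Z,Y)
    FROM supervise s
    JOIN superior r ON r.sup = s.supervisee
)
SELECT * FROM superior;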
PROLOG/DATALOG NOTATION
 Query
 Involves a predicate symbol followed by some variable arguments, to answer the question
 E.g., SUPERIOR(james,Y)?
 E.g., SUBORDINATE(james,X)?
[Figure: supervisory tree and the corresponding facts in Prolog notation]
PROVING A NEW FACT
DATA MINING
DEFINITION
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern (Census Bureau data):
If (relationship = husband), then (gender = male). 99.6%
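The quoted 99.6% is the rule's confidence; over a hypothetical census(relationship, gender) table it could be computed as:

-- Fraction of 'husband' rows whose gender is 'male'.
SELECT 100.0 * SUM(CASE WHEN gender = 'male' THEN 1 ELSE 0 END) / COUNT(*)
       AS confidence_pct
FROM census
WHERE relationship = 'husband';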
DEFINITION (CONT.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Valid: the patterns hold in general.
Novel: we did not know the pattern beforehand.
Useful: we can devise actions from the patterns.
Understandable: we can interpret and comprehend the patterns.
WHY USE DATA MINING TODAY?
Human analysis skills are inadequate:
 volume and dimensionality of the data
 high data growth rate
Availability of:
 data
 storage
 computational power
 off-the-shelf software
 expertise
THE KNOWLEDGE DISCOVERY PROCESS
Steps:
 Identify business problem
 Data mining
 Action
 Evaluation and measurement
 Deployment and integration into business processes
PREPROCESSING AND MINING
[Figure: the mining pipeline – Original Data → (Data Integration and Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → (Interpretation) → Knowledge]
DATA MINING TECHNIQUES
 Supervised learning
 classification and regression
 Unsupervised learning
 clustering
 Dependency modeling
 associations, summarization, causality
 Outlier and deviation detection
 Trend analysis and change detection
EXAMPLE APPLICATION: SKY SURVEY
 Input data: 3 TB of image data with 2 billion sky objects; the survey took more than six years to complete
 Goal: generate a catalog with all objects and their type
 Method: use decision trees as the data mining model
 Results:
 94% accuracy in predicting sky object classes
 increased the number of faint objects classified by 300%
 helped a team of astronomers discover 16 new high-redshift quasars in one order of magnitude less observation time
CLASSIFICATION EXAMPLE
 Example training database (below)
 Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
 Age is ordered; Car-type is a categorical attribute
 Class label indicates whether the person bought the product
 Dependent attribute is categorical

Age | Car | Class
 20 | M   | Yes
 30 | M   | Yes
 25 | T   | No
 30 | S   | Yes
 40 | S   | Yes
 20 | T   | No
 30 | M   | Yes
 25 | M   | Yes
 40 | M   | Yes
 20 | S   | No
GOALS AND REQUIREMENTS
 Goals:
 to produce an accurate classifier/regression function
 to understand the structure of the problem
 Requirements on the model:
 high accuracy
 understandable by humans, interpretable
 fast construction for very large training databases
WHAT ARE DECISION TREES?
[Figure: decision tree for the example data. The root splits on Age: if Age >= 30, predict YES; if Age < 30, split on Car Type: Minivan => YES, Sports or Truck => NO. A companion plot shows the resulting partitioning of the Age axis (0–60, split at 30).]
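A sketch of how the tree in the figure could be applied in SQL to a hypothetical people(age, car_type) table (names are illustrative):

-- Age >= 30 -> YES; Age < 30 and Minivan -> YES; otherwise NO.
SELECT age,
       car_type,
       CASE
           WHEN age >= 30            THEN 'YES'
           WHEN car_type = 'Minivan' THEN 'YES'
           ELSE 'NO'
       END AS predicted_class
FROM people;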
DENSITY-BASED CLUSTERING
 A cluster is defined as a connected dense component.
 Density is defined in terms of the number of neighbors of a point.
 We can find clusters of arbitrary shape.
MARKET BASKET ANALYSIS
 Consider a shopping cart filled with several items
 Market basket analysis tries to answer the following questions:
 Who makes purchases?
 What do customers buy together?
 In what order do customers purchase items?
MARKET BASKET ANALYSIS (CONTD.)
 Co-occurrences
 80% of all customers purchase items X, Y and Z together.
 Association rules (see the SQL sketch below)
 60% of all customers who purchase X and Y also buy Z.
 Sequential patterns
 60% of customers who first buy X also purchase Y within three weeks.
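A sketch of how the association-rule confidence above could be computed over a hypothetical purchases(cust_id, item) table; 'X', 'Y', 'Z' stand for concrete item names:

-- Among customers whose baskets contain both X and Y, the percentage
-- whose baskets also contain Z.
WITH xy AS (
    SELECT cust_id
    FROM purchases
    WHERE item IN ('X', 'Y')
    GROUP BY cust_id
    HAVING COUNT(DISTINCT item) = 2
)
SELECT 100.0 * COUNT(DISTINCT p.cust_id) / (SELECT COUNT(*) FROM xy)
       AS confidence_pct
FROM purchases p
JOIN xy ON xy.cust_id = p.cust_id
WHERE p.item = 'Z';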
SPATIAL DATA
WHAT IS A SPATIAL DATABASE?
 A database that:
 stores spatial objects
 manipulates spatial objects just like other objects in the database
WHAT IS SPATIAL DATA?
 Data which describes either location or shape
 e.g., house or fire-hydrant locations
 roads, rivers, pipelines, power lines
 forests, parks, municipalities, lakes
 In the abstract, reductionist view of the computer, these entities are represented as Points, Lines, and Polygons.
[Figure: roads represented as Lines; mail boxes represented as Points]
[Figure: land use classifications represented as Polygons]
[Figure: combination of all the previous data]
SPATIAL RELATIONSHIPS
 Not just interested in location; also interested in "relationships" between objects that are very hard to model outside the spatial domain.
 The most common relationships are:
 Proximity: distance
 Adjacency: "touching" and "connectivity"
 Containment: inside/overlapping
SPATIAL RELATIONSHIPS
[Figure: proximity – distance between a toxic waste dump and a piece of property you were considering buying]
[Figure: proximity – distance to various pubs]
[Figure: adjacency – all the lots which share an edge]
[Figure: connectivity – tributary relationships in river networks]
MOST ORGANIZATIONS HAVE SPATIAL DATA
 Geocodable addresses
 Customer locations
 Store locations
 Transportation tracking
 Statistical/demographic data
 Cartography
 Epidemiology
 Crime patterns
 Weather information
 Land holdings
 Natural resources
 City planning
 Environmental planning
 Information visualization
 Hazard detection
ADVANTAGES OF SPATIAL DATABASES
 Able to treat your spatial data like anything else in the DB:
 transactions
 backups
 integrity checks
 less data redundancy
 fundamental organization and operations handled by the DB
 multi-user support
 security/access control
 locking
ADVANTAGES OF SPATIAL DATABASES
 Offload complicated tasks to the DB server
 organization and indexing done for you
 do not have to re-implement operators
 do not have to re-implement functions
 Significantly lowers the development time of client applications
ADVANTAGES OF SPATIAL DATABASES
 Spatial querying using SQL
 use simple SQL expressions to determine spatial relationships:
 distance
 adjacency
 containment
 use simple SQL expressions to perform spatial operations:
 area
 length
 intersection
 union
 buffer
[Figure: original polygons, their union, and their intersection]
[Figure: original river network and buffered rivers]
ADVANTAGES OF SPATIAL DATABASES
Example query fragments (in the slide's informal notation):
… WHERE distance(<me>, pub_loc) < 1000
SELECT distance(<me>, pub_loc) * $0.01 + beer_cost …
… WHERE touches(pub_loc, street)
… WHERE inside(pub_loc, city_area) AND city_name = ...
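Putting the fragments together as one query, still in the slide's informal notation (pubs and its columns are hypothetical, <me> stands for the user's location, and distance() is the slide's informal operator rather than any particular spatial extension's function):

-- Pubs within 1000 m, ranked by distance-weighted cost plus beer price.
SELECT name,
       distance(<me>, pub_loc) * 0.01 + beer_cost AS effective_cost
FROM pubs
WHERE distance(<me>, pub_loc) < 1000
ORDER BY effective_cost;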
ADVANTAGES OF SPATIAL DATABASES
Simple value of the proposed lot:
  area(<my lot>) * <price per acre>
  + area(intersect(<my lot>, <forested area>)) * <wood value per acre>
  - distance(<my lot>, <power lines>) * <cost of power line laying>
New Electoral Districts
• District areas changed between the 1996 and 2001 elections.
• Want to predict voting in 2001 by looking at voting in 1996.
• Intersect the 2001 district polygon with the 1996 voting-area polygons:
• voting areas outside the district have zero overlap area
• voting areas inside have 100% overlap area
• voting areas on the border have a partial overlap area
• Multiply the % area by the 1996 actual voting and sum (see the SQL sketch below).
• The result is a simple prediction of 2001 voting.
• More advanced: also use demographic data.
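A sketch of the area-weighting step in SQL, using the slides' informal area() and intersection() operations; the districts2001 and voting1996 tables and their columns are hypothetical:

-- Weight each 1996 voting area's votes by the fraction of that area that
-- falls inside the 2001 district, then sum per district.
SELECT d.name,
       SUM(v.votes * area(intersection(d.shape, v.shape)) / area(v.shape))
           AS predicted_votes
FROM districts2001 d
JOIN voting1996  v
  ON area(intersection(d.shape, v.shape)) > 0   -- keep overlapping pairs only
GROUP BY d.name;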
DISADVANTAGES OF SPATIAL DATABASES
 Cost to implement can be high
 Some inflexibility
 Incompatibilities with some GIS software
 Slower than local, specialized data structures
 User/managerial inexperience and caution
PICTOGRAMS - SHAPES
 Types: basic shapes, multi-shapes, derived shapes, alternate shapes, any possible shape, user-defined shapes
[Figure: pictogram notation for each shape type, with annotations such as N, 0..N, *, and !]
SPATIAL DATA ENTITY CREATION
 Form an entity to hold county names, states, populations, and geographies:
CREATE TABLE County(
    Name  varchar(30),
    State varchar(30),
    Pop   Integer,
    Shape Polygon);
 Form an entity to hold river names, sources, lengths, and geographies:
CREATE TABLE River(
    Name     varchar(30),
    Source   varchar(30),
    Distance Integer,
    Shape    LineString);
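As a usage sketch against these two tables, assuming a crosses() predicate in the style of the earlier touches()/inside() examples (the function name and the river name are illustrative):

-- Names of all counties that a given river passes through.
SELECT c.Name
FROM County c, River r
WHERE r.Name = 'Mississippi'
  AND crosses(r.Shape, c.Shape);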
EXTENDING THE ER DIAGRAM
[Figure: a standard ER diagram and the corresponding spatial ER diagram]