Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
 Topic for Thursday? 1 Miscellaneous Topics in Databases PARALLEL DBMS WHY PARALLEL ACCESS TO DATA? At 10 MB/s 1.2 days to scan 1 Terabyte 10 MB/s 1,000 x parallel 1.5 minute to scan. 1 Terabyte Parallelism: divide a big problem into many smaller ones to be solved in parallel. 4 PARALLEL DBMS: INTRO  Parallelism is natural to DBMS processing Pipeline parallelism: many machines each doing one step in a multi-step process.  Partition parallelism: many machines doing the same thing to different pieces of data.  Both are natural in DBMS!  Pipeline Partition Any Sequential Program Sequential Any Sequential Sequential Program Any Sequential Program Any Sequential Program outputs split N ways, inputs merge 5  Speed-Up  More resources means proportionally less time for given amount of data. Xact/sec. (throughput) SOME || TERMINOLOGY Ideal Realistic degree of ||-ism Scale-Up   If resources increased in proportion to increase in data size, time is constant. Why Realistic <> Ideal? sec./Xact (response time)  Realistic Ideal degree of ||-ism 6 INTRODUCTION  Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply  Recent desktop computers feature multiple processors and this trend is projected to accelerate   Databases are growing increasingly large large volumes of transaction data are collected and stored for later analysis.  multimedia objects like images are increasingly stored in databases   Large-scale parallel database systems increasingly used for: storing large volumes of data  processing time-consuming decision-support queries  providing high throughput for transaction processing  7  Google data centers around the world, as of 2008 8 PARALLELISM IN DATABASES Data can be partitioned across multiple disks for parallel I/O.  Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel  data can be partitioned and each processor can work independently on its own partition  Results merged when done   Different queries can be run in parallel with each other.   Concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism. 9 PARTITIONING  Horizontal partitioning (shard) involves putting different rows into different tables  Ex: customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest   Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns  partitions columns even when already normalized  called "row splitting" (the row is split by its columns)  Ex: split (slow to find) dynamic data from (fast to find) static data in a table where the dynamic data is not used as often as the static  10 COMPARISON OF PARTITIONING TECHNIQUES  Evaluate how well partitioning techniques support the following types of data access: 1.Scanning the entire relation. 2.Locating a tuple associatively – point queries.  E.g., r.A = 25. 3.Locating all tuples such that the value of a given attribute lies within a specified range – range queries.  E.g., 10  r.A < 25. 11 HANDLING SKEW USING HISTOGRAMS  Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion   Assume uniform distribution within each range of the histogram Histogram can be constructed by scanning relation, or sampling (blocks containing) tuples of the relation 12 INTERQUERY PARALLELISM  Queries/transactions execute in parallel with one another  concurrent processing Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.  Easiest form of parallelism to support  13 INTRAQUERY PARALLELISM Execution of a single query in parallel on multiple processors/disks; important for speeding up longrunning queries  Two complementary forms of intraquery parallelism :  Intraoperation Parallelism – parallelize the execution of each individual operation in the query (each CPU runs on a subset of tuples)  Interoperation Parallelism – execute the different operations in a query expression in parallel. (each CPU runs a subset of operations on the data)  14 PARALLEL JOIN The join operation requires pairs of tuples to be tested to see if they satisfy the join condition, and if they do, the pair is added to the join output.  Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.  In a final step, the results from each processor can be collected together to produce the final result.  15 QUERY OPTIMIZATION     Query optimization in parallel databases is more complex than in sequential databases Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention When scheduling execution tree in parallel system, must decide:  How to parallelize each operation  how many processors to use for it  What operations to pipeline  what operations to execute independently in parallel  what operations to execute sequentially Determining the amount of resources to allocate for each operation is a problem  E.g., allocating more processors than optimal can result in high communication overhead 16 DEDUCTIVE DATABASES OVERVIEW OF DEDUCTIVE DATABASES  Declarative Language   Language to specify rules Inference Engine (Deduction Machine) Can deduce new facts by interpreting the rules  Related to logic programming  Prolog language (Prolog => Programming in logic)  Uses backward chaining to evaluate  Top-down application of the rules   Consists of:  Facts   Similar to relation specification without the necessity of including attribute names Rules  Similar to relational views (virtual relations that are not stored) 18 PROLOG/DATALOG NOTATION Facts are provided as predicates  Predicate has  a name  a fixed number of arguments  Convention:  Constants are numeric or character strings  Variables start with upper case letters   E.g., SUPERVISE(Supervisor, Supervisee)  States that Supervisor SUPERVISE(s) Supervisee 19 PROLOG/DATALOG NOTATION  Rule  Is of the form head :- body  where :- is read as if and only iff E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)  E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)  20 PROLOG/DATALOG NOTATION  Query  Involves a predicate symbol followed by some variable arguments to answer the question    where :- is read as if and only iff E.g., SUPERIOR(james,Y)? E.g., SUBORDINATE(james,X)? 21 Prolog notation Supervisory tree 22 PROVING A NEW FACT 23 24 DATA MINING DEFINITION Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6% 26 DEFINITION (CONT.) Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns. 27 WHY USE DATA MINING TODAY? Human analysis skills are inadequate:   Volume and dimensionality of the data High data growth rate Availability of:      Data Storage Computational power Off-the-shelf software Expertise 28 THE KNOWLEDGE DISCOVERY PROCESS Steps:  Identify business problem  Data mining  Action  Evaluation and measurement  Deployment and integration into businesses processes 29 PREPROCESSING AND MINING Knowledge Patterns Target Data Preprocessed Data Interpretation Model Construction Original Data Preprocessing Data Integration and Selection 30 DATA MINING TECHNIQUES  Supervised learning   Unsupervised learning   Classification and regression Clustering Dependency modeling  Associations, summarization, causality Outlier and deviation detection  Trend analysis and change detection  31 EXAMPLE APPLICATION: SKY SURVEY  Input data: 3 TB of image data with 2 billion sky objects, took more than six years to complete  Goal: Generate a catalog with all objects and their type  Method: Use decision trees as data mining model  Results:    94% accuracy in predicting sky object classes Increased number of faint objects classified by 300% Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time 32 CLASSIFICATION EXAMPLE  Example training database     Two predictor attributes: Age and Car-type (Sport, Minivan and Truck) Age is ordered, Car-type is categorical attribute Class label indicates whether person bought product Dependent attribute is categorical Age Car 20 M 30 M 25 T 30 S 40 S 20 T 30 M 25 M 40 M 20 S Class Yes Yes No Yes Yes No Yes Yes Yes No 33 GOALS AND REQUIREMENTS  Goals: To produce an accurate classifier/regression function  To understand the structure of the problem   Requirements on the model: High accuracy  Understandable by humans, interpretable  Fast construction for very large training databases  34 WHAT ARE DECISION TREES? Age <30 >=30 YES Car Type Minivan YES Sports, Truck Minivan YES Sports, Truck NO YES NO 0 30 60 Age 35 DENSITY-BASED CLUSTERING A cluster is defined as a connected dense component.  Density is defined in terms of number of neighbors of a point.  We can find clusters of arbitrary shape  36 MARKET BASKET ANALYSIS Consider shopping cart filled with several items  Market basket analysis tries to answer the following questions:  Who makes purchases?  What do customers buy together?  In what order do customers purchase items?  37 MARKET BASKET ANALYSIS (CONTD.)  Coocurrences   Association rules   80% of all customers purchase items X, Y and Z together. 60% of all customers who purchase X and Y also buy Z. Sequential patterns  60% of customers who first buy X also purchase Y within three weeks. 38 SPATIAL DATA 40 WHAT IS A SPATIAL DATABASE?  Database that: Stores spatial objects  Manipulates spatial objects just like other objects in the database  41 WHAT IS SPATIAL DATA?  Data which describes either location or shape e.g.House or Fire Hydrant location Roads, Rivers, Pipelines, Power lines Forests, Parks, Municipalities, Lakes  In the abstract, reductionist view of the computer, these entities are represented as Points, Lines, and Polygons. 42 Roads are represented as Lines Mail Boxes are represented as Points 43 TOPIC THREE Land Use Classifications are represented as Polygons 44 TOPIC THREE Combination of all the previous data 45 SPATIAL RELATIONSHIPS Not just interested in location, also interested in “Relationships” between objects that are very hard to model outside the spatial domain.  The most common relationships are  Proximity : distance  Adjacency : “touching” and “connectivity”  Containment : inside/overlapping  46 SPATIAL RELATIONSHIPS Distance between a toxic waste dump and a piece of property you were considering buying. 47 SPATIAL RELATIONSHIPS Distance to various pubs 48 SPATIAL RELATIONSHIPS Adjacency: All the lots which share an edge 49 Connectivity: Tributary relationships in river networks 50 MOST ORGANIZATIONS HAVE SPATIAL DATA Geocodable addresses  Customer location  Store locations  Transportation tracking  Statistical/Demograph ic  Cartography  Epidemiology  Crime patterns  Weather Information  Land holdings  Natural resources  City Planning  Environmental planning  Information Visualization  Hazard detection  51 ADVANTAGES OF SPATIAL DATABASES  Able to treat your spatial data like anything else in the DB         transactions backups integrity checks less data redundancy fundamental organization and operations handled by the DB multi-user support security/access control locking 52 ADVANTAGES OF SPATIAL DATABASES  Offset complicated tasks to the DB server organization and indexing done for you  do not have to re-implement operators  do not have to re-implement functions   Significantly lowers the development time of client applications 53 ADVANTAGES OF SPATIAL DATABASES  Spatial querying using SQL  use simple SQL expressions to determine spatial relationships distance  adjacency  containment   use simple SQL expressions to perform spatial operations area  length  intersection  union  buffer  54 Original Polygons Union Intersection 55 Buffered rivers Original river network 56 ADVANTAGES OF SPATIAL DATABASES … WHERE distance(<me>,pub_loc) < 1000 SELECT distance(<me>,pub_loc)*$0.01 + beer_cost … ... WHERE touches(pub_loc, street) … WHERE inside(pub_loc,city_area) and city_name = ... 57 ADVANTAGES OF SPATIAL DATABASES Simple value of the proposed lot Area(<my lot>) * <price per acre> + area(intersect(<my log>,<forested area>) ) * <wood value per acre> - distance(<my lot>, <power lines>) * <cost of power line laying> 58 New Electoral Districts • Changes in areas between 1996 and 2001 election. • Want to predict voting in 2001 by looking at voting in 1996. • Intersect the 2001 district polygon with the voting areas polygons. • Outside will have zero area • Inside will have 100% area • On the border will have partial area • Multiply the % area by 1996 actual voting and sum • Result is a simple prediction of 2001 voting More advanced: also use demographic data. 59 DISADVANTAGES OF SPATIAL DATABASES Cost to implement can be high  Some inflexibility  Incompatibilities with some GIS software  Slower than local, specialized data structures  User/managerial inexperience and caution  60 PICTOGRAMS - SHAPES  Types: Basic Shapes, Multi-Shapes, Derived Shapes, Alternate Shapes, Any possible Shape, User-Defined Shapes Basic Shapes Alternate Shapes Multi-Shapes Any Possible Shape N Derived Shapes 0, N * User Defined Shape ! 61 SPATIAL DATA ENTITY CREATION  Form an entity to hold county names, states, populations, and geographies CREATE TABLE County( Name varchar(30), State varchar(30), Pop Integer, Shape Polygon);  Form an entity to hold river names, sources, lengths, and geographies CREATE TABLE River( Name varchar(30), Source varchar(30), Distance Integer, Shape LineString); 62 EXTENDING THE ER DIAGRAM Standard ER Diagram Spatial ER Diagram 63 64