Topic for Thursday?

Miscellaneous Topics in Databases

PARALLEL DBMS

WHY PARALLEL ACCESS TO DATA?
• At 10 MB/s, scanning 1 terabyte takes about 1.2 days.
• With 1,000-way parallelism, the same scan takes about 1.7 minutes (100 seconds).
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.

PARALLEL DBMS: INTRO
• Parallelism is natural to DBMS processing
• Pipeline parallelism: many machines each doing one step in a multi-step process.
• Partition parallelism: many machines doing the same thing to different pieces of data.
• Both are natural in DBMS!
[Figure: pipeline and partition parallelism, each built from ordinary sequential programs; outputs split N ways, inputs merge]

SOME || TERMINOLOGY
• Speed-Up: more resources means proportionally less time for a given amount of data.
  [Figure: throughput (Xact/sec.) versus degree of ||-ism, ideal and realistic curves]
• Scale-Up: if resources are increased in proportion to the increase in data size, time is constant.
  [Figure: response time (sec./Xact) versus degree of ||-ism, ideal and realistic curves]
• Why Realistic <> Ideal?

INTRODUCTION
• Parallel machines are becoming quite common and affordable
  • Prices of microprocessors, memory and disks have dropped sharply
  • Recent desktop computers feature multiple processors, and this trend is projected to accelerate
• Databases are growing increasingly large
  • Large volumes of transaction data are collected and stored for later analysis
  • Multimedia objects like images are increasingly stored in databases
• Large-scale parallel database systems are increasingly used for:
  • storing large volumes of data
  • processing time-consuming decision-support queries
  • providing high throughput for transaction processing

[Figure: Google data centers around the world, as of 2008]

PARALLELISM IN DATABASES
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel
  • Data can be partitioned and each processor can work independently on its own partition
  • Results are merged when done
• Different queries can be run in parallel with each other.
  • Concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.

PARTITIONING
• Horizontal partitioning (sharding) puts different rows into different tables
  • Ex: customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest
• Vertical partitioning creates tables with fewer columns and uses additional tables to store the remaining columns
  • Partitions columns even when the table is already normalized
  • Also called "row splitting" (the row is split by its columns)
  • Ex: split (slow to find) dynamic data from (fast to find) static data in a table where the dynamic data is not used as often as the static

COMPARISON OF PARTITIONING TECHNIQUES
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries). E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries). E.g., 10 ≤ r.A < 25.

HANDLING SKEW USING HISTOGRAMS
• A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion
  • Assume a uniform distribution within each range of the histogram
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation

INTERQUERY PARALLELISM
• Queries/transactions execute in parallel with one another (concurrent processing)
• Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
• Easiest form of parallelism to support

INTRAQUERY PARALLELISM
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries
• Two complementary forms of intraquery parallelism:
  • Intraoperation parallelism: parallelize the execution of each individual operation in the query (each CPU runs on a subset of tuples)
  • Interoperation parallelism: execute the different operations in a query expression in parallel (each CPU runs a subset of operations on the data)

PARALLEL JOIN
• The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor are collected together to produce the final result.

QUERY OPTIMIZATION
• Query optimization in parallel databases is more complex than in sequential databases
• Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention
• When scheduling an execution tree in a parallel system, we must decide:
  • How to parallelize each operation and how many processors to use for it
  • Which operations to pipeline, which to execute independently in parallel, and which to execute sequentially
• Determining the amount of resources to allocate for each operation is a problem
  • E.g., allocating more processors than optimal can result in high communication overhead

DEDUCTIVE DATABASES

OVERVIEW OF DEDUCTIVE DATABASES
• Declarative language
  • Language to specify rules
• Inference engine (deduction machine)
  • Can deduce new facts by interpreting the rules
  • Related to logic programming
• Prolog language (Prolog => Programming in logic)
  • Uses backward chaining to evaluate
  • Top-down application of the rules
• Consists of:
  • Facts: similar to a relation specification, without the
    necessity of including attribute names
  • Rules: similar to relational views (virtual relations that are not stored)

PROLOG/DATALOG NOTATION
• Facts are provided as predicates
• A predicate has a name and a fixed number of arguments
  • Convention:
    • Constants are numeric or character strings
    • Variables start with upper case letters
  • E.g., SUPERVISE(Supervisor, Supervisee)
    • States that Supervisor SUPERVISE(s) Supervisee

PROLOG/DATALOG NOTATION
• Rule
  • Is of the form head :- body, where :- is read as if and only if
  • E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
  • E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)

PROLOG/DATALOG NOTATION
• Query
  • Involves a predicate symbol followed by some variable arguments to answer the question
  • E.g., SUPERIOR(james,Y)?
  • E.g., SUBORDINATE(james,X)?

[Figure: Prolog notation and supervisory tree]

PROVING A NEW FACT

DATA MINING

DEFINITION
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). (99.6%)

DEFINITION (CONT.)
• Valid: The patterns hold in general.
• Novel: We did not know the pattern beforehand.
• Useful: We can devise actions from the patterns.
• Understandable: We can interpret and comprehend the patterns.

WHY USE DATA MINING TODAY?
• Human analysis skills are inadequate:
  • Volume and dimensionality of the data
  • High data growth rate
• Availability of:
  • Data
  • Storage
  • Computational power
  • Off-the-shelf software
  • Expertise

THE KNOWLEDGE DISCOVERY PROCESS
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes

PREPROCESSING AND MINING
[Figure: Original Data -> (data integration and selection) -> Target Data -> (preprocessing) -> Preprocessed Data -> (model construction) -> Patterns -> (interpretation) -> Knowledge]

DATA MINING TECHNIQUES
• Supervised learning
  • Classification and regression
• Unsupervised learning
  • Clustering
• Dependency modeling
  • Associations, summarization, causality
• Outlier and deviation detection
• Trend analysis and change detection

EXAMPLE APPLICATION: SKY SURVEY
• Input data: 3 TB of image data with 2 billion sky objects; took more than six years to complete
• Goal: Generate a catalog with all objects and their type
• Method: Use decision trees as data mining model
• Results:
  • 94% accuracy in predicting sky object classes
  • Increased number of faint objects classified by 300%
  • Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time

CLASSIFICATION EXAMPLE
• Example training database
  • Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
  • Age is ordered, Car-type is a categorical attribute
  • Class label indicates whether the person bought the product
  • Dependent attribute is categorical

  Age  Car  Class
  20   M    Yes
  30   M    Yes
  25   T    No
  30   S    Yes
  40   S    Yes
  20   T    No
  30   M    Yes
  25   M    Yes
  40   M    Yes
  20   S    No
  (M = Minivan, S = Sport, T = Truck)

GOALS AND REQUIREMENTS
• Goals:
  • To produce an accurate classifier/regression function
  • To understand the structure of the problem
• Requirements on the model:
  • High accuracy
  • Understandable by humans, interpretable
  • Fast construction for very large training databases

WHAT ARE DECISION TREES?
[Figure: decision tree. Root test on Age: if Age >= 30, predict YES; if Age < 30, test Car Type: Minivan predicts YES; Sports or Truck predicts NO. A second panel shows the same split as regions along the Age axis (0, 30, 60) for each car type.]

DENSITY-BASED CLUSTERING
• A cluster is defined as a connected dense component.
• Density is defined in terms of the number of neighbors of a point.
• We can find clusters of arbitrary shape

MARKET BASKET ANALYSIS
• Consider a shopping cart filled with several items
• Market basket analysis tries to answer the following questions:
  • Who makes purchases?
  • What do customers buy together?
  • In what order do customers purchase items?

MARKET BASKET ANALYSIS (CONTD.)
• Co-occurrences
  • 80% of all customers purchase items X, Y and Z together.
• Association rules
  • 60% of all customers who purchase X and Y also buy Z.
• Sequential patterns
  • 60% of customers who first buy X also purchase Y within three weeks.

SPATIAL DATA

WHAT IS A SPATIAL DATABASE?
Database that:
• Stores spatial objects
• Manipulates spatial objects just like other objects in the database

WHAT IS SPATIAL DATA?
• Data which describes either location or shape, e.g.
  • House or fire hydrant location
  • Roads, rivers, pipelines, power lines
  • Forests, parks, municipalities, lakes
• In the abstract, reductionist view of the computer, these entities are represented as Points, Lines, and Polygons.

[Figure: Roads are represented as Lines; Mail Boxes are represented as Points]
[Figure: Land Use Classifications are represented as Polygons]
[Figure: Combination of all the previous data]

SPATIAL RELATIONSHIPS
• Not just interested in location, also interested in “Relationships” between objects that are very hard to model outside the spatial domain.
• The most common relationships are
  • Proximity: distance
  • Adjacency: “touching” and “connectivity”
  • Containment: inside/overlapping

SPATIAL RELATIONSHIPS
• Distance between a toxic waste dump and a piece of property you were considering buying.
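
The proximity and containment relationships above can be illustrated with a small sketch. This is not how any particular spatial database implements them; it assumes planar coordinates, points as (x, y) tuples, and polygons as vertex lists. The names `distance`, `contains`, `waste_dump`, and `lot` are hypothetical, chosen only for this example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def distance(a: Point, b: Point) -> float:
    """Proximity: straight-line (Euclidean) distance between two points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def contains(polygon: List[Point], p: Point) -> bool:
    """Containment: ray-casting test for a point inside a simple polygon.

    Points lying exactly on the boundary are not handled specially.
    """
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through p?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical coordinates (assumed to be in metres)
waste_dump = (0.0, 0.0)
lot = [(250.0, 350.0), (350.0, 350.0), (350.0, 450.0), (250.0, 450.0)]

print(distance(waste_dump, (300.0, 400.0)))  # 500.0
print(contains(lot, (300.0, 400.0)))         # True
print(contains(lot, waste_dump))             # False
```

A spatial database exposes operations like these as built-in SQL functions and backs them with spatial indexes, so client code does not need to re-implement them.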

SPATIAL RELATIONSHIPS
• Distance to various pubs

SPATIAL RELATIONSHIPS
• Adjacency: All the lots which share an edge

• Connectivity: Tributary relationships in river networks

MOST ORGANIZATIONS HAVE SPATIAL DATA
• Geocodable addresses
• Customer location
• Store locations
• Transportation tracking
• Statistical/Demographic
• Cartography
• Epidemiology
• Crime patterns
• Weather information
• Land holdings
• Natural resources
• City planning
• Environmental planning
• Information visualization
• Hazard detection

ADVANTAGES OF SPATIAL DATABASES
• Able to treat your spatial data like anything else in the DB
  • transactions
  • backups
  • integrity checks
  • less data redundancy
  • fundamental organization and operations handled by the DB
  • multi-user support
  • security/access control
  • locking

ADVANTAGES OF SPATIAL DATABASES
• Offload complicated tasks to the DB server
  • organization and indexing done for you
  • do not have to re-implement operators
  • do not have to re-implement functions
• Significantly lowers the development time of client applications

ADVANTAGES OF SPATIAL DATABASES
• Spatial querying using SQL
  • use simple SQL expressions to determine spatial relationships
    • distance
    • adjacency
    • containment
  • use simple SQL expressions to perform spatial operations
    • area
    • length
    • intersection
    • union
    • buffer

[Figure: Original Polygons; Union; Intersection]
[Figure: Original river network; Buffered rivers]

ADVANTAGES OF SPATIAL DATABASES
  … WHERE distance(<me>,pub_loc) < 1000
  SELECT distance(<me>,pub_loc)*$0.01 + beer_cost …
  ... WHERE touches(pub_loc, street)
  … WHERE inside(pub_loc,city_area) and city_name = ...

ADVANTAGES OF SPATIAL DATABASES
Simple value of the proposed lot:
  area(<my lot>) * <price per acre>
  + area(intersect(<my lot>, <forested area>)) * <wood value per acre>
  - distance(<my lot>, <power lines>) * <cost of power line laying>

New Electoral Districts
• Changes in areas between the 1996 and 2001 elections.
• Want to predict voting in 2001 by looking at voting in 1996.
• Intersect the 2001 district polygon with the voting area polygons.
  • Outside will have zero area
  • Inside will have 100% area
  • On the border will have partial area
• Multiply the % area by the 1996 actual voting and sum
  • Result is a simple prediction of 2001 voting
• More advanced: also use demographic data.

DISADVANTAGES OF SPATIAL DATABASES
• Cost to implement can be high
• Some inflexibility
• Incompatibilities with some GIS software
• Slower than local, specialized data structures
• User/managerial inexperience and caution

PICTOGRAMS - SHAPES
Types: Basic Shapes, Multi-Shapes, Derived Shapes, Alternate Shapes, Any Possible Shape, User-Defined Shapes
[Figure: example pictograms for each shape type, including the markers N, 0,N, * and !]

SPATIAL DATA ENTITY CREATION
Form an entity to hold county names, states, populations, and geographies:
  CREATE TABLE County(
    Name varchar(30),
    State varchar(30),
    Pop Integer,
    Shape Polygon);
Form an entity to hold river names, sources, lengths, and geographies:
  CREATE TABLE River(
    Name varchar(30),
    Source varchar(30),
    Distance Integer,
    Shape LineString);

EXTENDING THE ER DIAGRAM
[Figure: Standard ER Diagram compared with Spatial ER Diagram]
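
As a rough illustration of what the County and River entities make possible, the sketch below mirrors the two tables as Python classes and computes the area and length operations listed earlier among the spatial operations. It assumes planar coordinates and simple (non-self-intersecting) polygons. The methods `area` and `length` and the sample values are hypothetical, and a real spatial database would provide these operations itself.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class County:
    name: str
    state: str
    pop: int
    shape: List[Point]  # polygon vertices in order; first vertex not repeated

    def area(self) -> float:
        """Polygon area via the shoelace formula."""
        pts = self.shape
        total = 0.0
        for i in range(len(pts)):
            x1, y1 = pts[i]
            x2, y2 = pts[(i + 1) % len(pts)]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0


@dataclass
class River:
    name: str
    source: str
    distance: int
    shape: List[Point]  # line string vertices in order

    def length(self) -> float:
        """Sum of the straight segment lengths along the line string."""
        return sum(math.dist(a, b) for a, b in zip(self.shape, self.shape[1:]))


# Hypothetical sample rows
county = County("Example", "XX", 1000, [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)])
river = River("Sample River", "Sample Lake", 11, [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)])

print(county.area())   # 12.0
print(river.length())  # 11.0
```

Storing the geometry in a typed column, as the CREATE TABLE statements do, lets the database answer such questions in SQL instead of in client code.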