Download CS186: Introduction to Database Systems

EECS 647: Introduction to Database Systems Instructor: Luke Huan Spring 2007 Review  Classification     Training data set Testing data set Classification Models Model evaluation 5/22/2017 Luke Huan Univ. of Kansas 2 Metrics for Performance Evaluation… PREDICTED CLASS Class=Yes ACTUAL CLASS  Class=No Class=Yes a (TP) b (FN) Class=No c (FP) d (TN) Most widely-used metric: ad TP  TN Accuracy   a  b  c  d TP  TN  FP  FN 5/22/2017 Luke Huan Univ. of Kansas 3 Cost-Sensitive Measures PREDICTED CLASS Class=Y Class=N es o ACTUA Class=Y L es CLASS Class=N o 5/22/2017 a (TP) b (FN) c (FP) d (TN) a Precision (p)  ac a Recall (r)  ab 2rp 2a F - measure (F)   r  p 2a  b  c Luke Huan Univ. of Kansas 4 Today’s Topic   Clustering XML 5/22/2017 Luke Huan Univ. of Kansas 5 What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized 5/22/2017 Luke Huan Univ. of Kansas 6 Applications of Cluster Analysis  Understanding  Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations  Summarization  Reduce the size of large data sets 5/22/2017 Luke Huan Univ. of Kansas 7 Multidisciplinary Efforts of Clustering  Pattern Recognition  Spatial Data Analysis  Create thematic maps in GIS by clustering feature spaces  Detect spatial clusters or for other spatial mining tasks  Image Processing  Economic Science (especially market research)  WWW   Document classification  Cluster Weblog data to discover groups of similar access patterns Bioinfo:  Phylogenetic tree  Microarray analysis 5/22/2017 Luke Huan Univ. of Kansas 8 What is not Cluster Analysis?  Supervised classification   Simple segmentation   Dividing students into different registration groups alphabetically, by last name Results of a query   Have class label information Groupings are a result of an external specification Graph partitioning  5/22/2017 Some mutual relevance and synergy, but areas are not identical Luke Huan Univ. of Kansas 9 Terms in Cluster Analysis   Cluster: a collection of data objects  Similar to one another within the same cluster  Dissimilar to the objects in other clusters Cluster analysis  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters  Unsupervised learning: no predefined classes  So what?  Clustering can be used as a stand-alone tool to get insight into data distribution  Clustering can be used as a preprocessing step for other algorithms such as discretization 5/22/2017 Luke Huan Univ. of Kansas 10 Types of Clusters    Well-separated clusters Center-based clusters Contiguous clusters 5/22/2017 Luke Huan Univ. of Kansas 11 Types of Clusters: Well-Separated  Well-Separated Clusters:  A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters 5/22/2017 Luke Huan Univ. of Kansas 12 Types of Clusters: Center-Based  Center-based   A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster 4 center-based clusters 5/22/2017 Luke Huan Univ. of Kansas 13 Major Clustering Approaches (I)  Partitioning approach:  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors   Typical methods: k-means, k-medoids, CLARANS Hierarchical approach:  Create a hierarchical decomposition of the set of data (or objects) using some criterion  5/22/2017 Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON Luke Huan Univ. of Kansas 14 Partitional Clustering Original Points 5/22/2017 A Partitional Clustering Luke Huan Univ. of Kansas 15 Hierarchical Clustering p1 p3 p4 p2 p1 p2 Traditional Hierarchical Clustering p3 p4 Traditional Dendrogram p1 p3 p4 p2 p1 p2 Non-traditional Hierarchical Clustering 5/22/2017 p3 p4 Non-traditional Dendrogram Luke Huan Univ. of Kansas 16 Partitioning Algorithms: Basic Concept  Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t., min sum of squared distance km1tmiKm (Cm  tmi )2  Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion  Global optimal: exhaustively enumerate all partitions  Heuristic methods: k-means and k-medoids algorithms  k-means (MacQueen’67): Each cluster is represented by the center of the cluster  k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 5/22/2017 Luke Huan Univ. of Kansas 17 The K-Means Clustering Method  Given k, the k-means algorithm is implemented in four steps:  Partition objects into k nonempty subsets  Compute seed points as the centroids of the clusters of the current partition (the centroid is the mean of the cluster)  Assign each object to the cluster with the nearest seed point  Go back to Step 2, stop when no more new assignment 5/22/2017 Luke Huan Univ. of Kansas 18 The K-Means Clustering Method 10 9 8 7 6 5 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 2 1 0 0 1 2 3 4 5 6 7 8 K=2 Arbitrarily choose K object as initial cluster center 9 10 Assign each objects to most similar center 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 4 3 2 1 0 0 10 10 9 9 8 8 7 7 6 6 5 4 3 2 1 0 0 5/22/2017 Update the cluster means 1 2 3 4 5 6 7 8 9 10 Luke Huan Univ. of Kansas Update the cluster means 1 2 3 4 5 6 7 8 9 10 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 19 Comments on the K-Means Method  Strength   Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. Comment: Often terminates at a local optimum.  5/22/2017 The global optimum may be found using techniques such as genetic algorithms Luke Huan Univ. of Kansas 20 Comments on the K-Means Method  Weakness  Applicable only when mean is defined, then what about categorical data?  Need to specify k, the number of clusters, in advance  Unable to handle noisy data and outliers  Not suitable to discover clusters with non-convex shapes 5/22/2017 Luke Huan Univ. of Kansas 21 A Problem K-means: Differing Density K-means (3 Clusters) Original Points 5/22/2017 Luke Huan Univ. of Kansas 22 A Problem of K-means: Non-globular Shapes Original Points 5/22/2017 K-means (2 Clusters) Luke Huan Univ. of Kansas 23 From HTML to XML (eXtensible Markup Language)  HTML describes presentation of content <h1>Bibliography</h1> <p><i>Foundations of Databases</i> Abiteboul, Hull, and Vianu <br>Addison Wesley, 1995 <p>…  XML describes only the content <bibliography> <book> <title>Foundations of Databases</title> <author>Abiteboul</author> <author>Hull</author> <author>Vianu</author> <publisher>Addison Wesley</publisher> <year>1995</year> </book> <book>…</book> </bibliography>  Separation of content from presentation simplifies content extraction and allows the same content to be presented easily in different looks 5/22/2017 Luke Huan Univ. of Kansas 24 Other nice features of XML  Portability: Just like HTML, you can ship XML data across platforms   Flexibility: You can represent any information (structured, semi-structured, documents, …)   Relational data requires heavy-weight protocols, e.g., JDBC Relational data is best suited for structured data Extensibility: Since data describes itself, you can change the schema easily  Relational schema is rigid and difficult to change 5/22/2017 Luke Huan Univ. of Kansas 25 XML terminology     <bibliography> <book ISBN=”ISBN-10” price=”80.00”> <title>Foundations of Databases</t <is_textbook/> <author>Abiteboul</author> <author>Hull</author> <author>Vianu</author> <publisher>Addison Wesley</publish <year>1995</year> </book>… </bibliography> Tag names: book, title, … Start tags: <book>, <title>, … End tags: </book>, </title>, … An element is enclosed by a pair of start and end tags: <book>…</book>   Elements can be nested: <book>…<title>…</title>…</book> Empty elements: <is_textbook></is_textbook>   Can be abbreviated: <is_textbook/> Elements can also have attributes: <book ISBN=”…” price=”80.00”> 5/22/2017 Luke Huan Univ. of Kansas 26 Well-formed XML documents A well-formed XML document  Follows XML lexical conventions   Wrong: <section>We show that x < 0…</section> Right: <section>We show that x < 0…</section>    Other special entities: > becomes > and & becomes & Contains a single root element Has tags that are properly matched and elements that are properly nested  Right: <section>…<subsection>…</subsection>…</sectio n>  Wrong: <section>…<subsection>…</section>…</subsectio n> 5/22/2017 Luke Huan Univ. of Kansas 27 More XML features  Comments:  CDATA: <![CDATA[Tags: <book>,…]]>  ID’s and references  <person id=”o12”><name>Homer</name>…</person> <person id=”o34”><name>Marge</name>…</person> <person id=”o56” father=”o12” mother=”o34”><name>Bart</name>…</person>…  Namespaces allow external schemas and qualified names <book xmlns:myCitationStyle=”http://…/mySchema”> <myCitationStyle:title>…</myCitationStyle:title> <myCitationStyle:author>…</myCitationStyle:author>… </book>  Processing instructions for apps: <? …java applet… ?>  And more… 5/22/2017 Luke Huan Univ. of Kansas 28 Valid XML documents  A valid XML document conforms to a Document Type Definition (DTD)   A DTD specifies    A DTD is optional A grammar for the document Constraints on structures and values of elements, attributes, etc. Example <!DOCTYPE bibliography [ <!ELEMENT bibliography (book+)> <!ELEMENT book (title, author*, publisher?, year?, section*)> <!ATTLIST book ISBN CDATA #REQUIRED> <!ATTLIST book price CDATA #IMPLIED> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT publisher (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT section (title, (#PCDATA)?, section*)> ]> 5/22/2017 Luke Huan Univ. of Kansas 29 DTD explained <!DOCTYPE bibliography [ bibliography is the root element of the document <!ELEMENT bibliography (book+)> One or more bibliography consists of a sequence of one or more book elements <!ELEMENT book (title, author*, publisher?, year?, Zero or one section*)> Zero or more book consists of a title, zero or more authors, an optional publisher, and zero or more sections, in sequence <!ATTLIST book ISBNISBN ID #REQUIRED> book has a required attribute which is a unique identifier <bibliography> <book ISBN=”ISBN-10” price=”80.00”> <title>Foundations of Databases</ <author>Abiteboul</author> <author>Hull</author> <author>Vianu</author> <publisher>Addison Wesley</publis <year>1995</year> </book>… </bibliography> <!ATTLIST book price(#IMPLIED) CDATA #IMPLIED> book has an optional price attribute which contains character data 5/22/2017 Luke Huan Univ. of Kansas 30 DTD explained (cont’d) <!ELEMENT <!ELEMENT <!ELEMENT <!ELEMENT title (#PCDATA)> PCDATA is text that will be parsed author (#PCDATA)> publisher (#PCDATA)> (<…> will be treated as a markup tag and < etc. will be treated as entities year (#PCDATA)> CDATA is unparsed character data title, author, publisher, and year all contain parsed character data (#PCDATA) <!ELEMENT section (title, (#PCDATA)?, section*)> ]> 5/22/2017 Each section starts with a title, followed by some optional text and then zero or more subsections <section><title>Introduction</title> Luke Huan Univ. of Kansas In this section we introduce XML and <section><title>XML</title> XML stands for… </section> <section><title>DTD</title> <section><title>Definition</title> DTD stands for… </section> <section><title>Usage</title> You can use DTD to… </section> </section> </section> 31 “Deterministic” content declaration  Catch: the following declaration does not work:    <!ELEMENT pub-venue ( (name, address, month, year) | (name, volume, number, year) )> Because when looking at name, the XML processor would not know which way to go without looking further ahead Requirement: content declaration must be “deterministic” (i.e., no look-ahead required) 5/22/2017 Luke Huan Univ. of Kansas 32 XML versus relational data Relational data  Schema is always fixed in advance and difficult to change  Simple, flat table structures  Ordering of rows and columns is unimportant  Data exchange is problematic “Native” support in all serious commercial DBMS  5/22/2017 XML data  Well-formed XML does not require predefined, fixed schema  Nested structure; ID/IDREF(S) permit arbitrary graphs  Ordering forced by document format; may or may not be important  Designed for easy exchange  Often implemented as an “add-on” on top of relations Luke Huan Univ. of Kansas 33 Query languages for XML  XPath    XQuery   Path expressions with conditions Building block of other standards (XQuery, XSLT, XLink, XPointer, etc.) XPath + full-fledged SQL-like query language XSLT  XPath + transformation templates 5/22/2017 Luke Huan Univ. of Kansas 34

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download CS186: Introduction to Database Systems