EECS 647: Introduction to Database Systems
Instructor: Luke Huan
Spring 2007
5/22/2017 Luke Huan Univ. of Kansas

Review
- Classification
- Training data set
- Testing data set
- Classification models
- Model evaluation

Metrics for Performance Evaluation…

                          PREDICTED CLASS
                          Class=Yes    Class=No
    ACTUAL   Class=Yes    a (TP)       b (FN)
    CLASS    Class=No     c (FP)       d (TN)

Most widely-used metric:

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

Cost-Sensitive Measures

                          PREDICTED CLASS
                          Class=Yes    Class=No
    ACTUAL   Class=Yes    a (TP)       b (FN)
    CLASS    Class=No     c (FP)       d (TN)

$$\text{Precision } (p) = \frac{a}{a + c}$$

$$\text{Recall } (r) = \frac{a}{a + b}$$

$$\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$

Today’s Topic
- Clustering
- XML

What is Cluster Analysis?
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
- Inter-cluster distances are maximized
- Intra-cluster distances are minimized

Applications of Cluster Analysis
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization: reduce the size of large data sets

Multidisciplinary Efforts of Clustering
- Pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or use them for other spatial mining tasks
- Image processing
- Economic science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns
- Bioinformatics
  - Phylogenetic trees
  - Microarray analysis

What is not Cluster Analysis?
- Supervised classification: have class label information
- Simple segmentation: dividing students into different registration groups alphabetically, by last name
- Results of a query: groupings are a result of an external specification
- Graph partitioning: some mutual relevance and synergy, but the areas are not identical

Terms in Cluster Analysis
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- So what?
  - Clustering can be used as a stand-alone tool to get insight into data distribution
  - Clustering can be used as a preprocessing step for other algorithms, such as discretization

Types of Clusters
- Well-separated clusters
- Center-based clusters
- Contiguous clusters

Types of Clusters: Well-Separated
- A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

[Figure: 3 well-separated clusters]

Types of Clusters: Center-Based
- A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster
- The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

[Figure: 4 center-based clusters]
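The difference between the two kinds of center can be seen in a small sketch. It assumes Euclidean distance and reads "most representative" as "smallest total distance to the other points", which is one common interpretation rather than something the slides fix; the function names and the sample cluster are made up for illustration.

```python
import math


def centroid(points):
    """Component-wise mean of the points; it need not be one of the points."""
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))


def medoid(points):
    """The actual data point with the smallest total distance to all points."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Hypothetical cluster: four nearby points plus one far-away point.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (10.0, 10.0)]
print(centroid(cluster))  # (3.2, 3.2), pulled toward the far-away point
print(medoid(cluster))    # (2.0, 2.0), always a member of the cluster
```

The centroid is dragged away from the dense group by the single distant point, while the medoid stays on an actual object.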
Major Clustering Approaches (I)
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

Partitional Clustering

[Figure: original points, and a partitional clustering of them]

Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1, p2, p3, p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:

$$\min \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$

- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen’67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the mean of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no assignment changes
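The four steps can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: it assumes Euclidean distance, points given as equal-length tuples, and at least k points. The initial partition (shuffle, then deal round-robin), the handling of an emptied cluster, and the `max_iter` cap are choices made here rather than taken from the slides.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)

    # Step 1: partition the objects into k nonempty subsets.
    # Assumption: shuffle the objects, then deal them round-robin.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    for _ in range(max_iter):
        # Step 2: the seed point of each cluster is the mean of its members.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                # Assumption: re-seed an emptied cluster with a random object.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))

        # Step 3: assign each object to the cluster with the nearest seed point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]

        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels

    return labels, centroids


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initial partition, so only the grouping itself is meaningful, not the label values.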
The K-Means Clustering Method

[Figure: k-means with K=2 on a 0 to 10 by 0 to 10 scatter plot. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means]

Comments on the K-Means Method
- Strength: relatively efficient, O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as genetic algorithms.

Comments on the K-Means Method
- Weakness
  - Applicable only when a mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

A Problem of K-means: Differing Density

[Figure: original points, and the k-means result with 3 clusters]

A Problem of K-means: Non-globular Shapes

[Figure: original points, and the k-means result with 2 clusters]

From HTML to XML (eXtensible Markup Language)
- HTML describes presentation of content

    <h1>Bibliography</h1>
    <p><i>Foundations of Databases</i>
    Abiteboul, Hull, and Vianu
    <br>Addison Wesley, 1995
    <p>…

- XML describes only the content

    <bibliography>
      <book>
        <title>Foundations of Databases</title>
        <author>Abiteboul</author>
        <author>Hull</author>
        <author>Vianu</author>
        <publisher>Addison Wesley</publisher>
        <year>1995</year>
      </book>
      <book>…</book>
    </bibliography>

- Separation of content from presentation simplifies content extraction and allows the same content to be presented easily in different looks
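As a rough illustration of that last point, the sketch below parses the bibliography with Python's standard `xml.etree.ElementTree` module, extracts the content once, and renders it in two different looks. The helper names `as_html` and `as_plain` are invented for this example, and the elided second book from the slide is left out.

```python
import xml.etree.ElementTree as ET

doc = """<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>"""

# Content extraction: the tags name the content, so no layout parsing is needed.
root = ET.fromstring(doc)
books = [
    {
        "title": book.findtext("title"),
        "authors": [a.text for a in book.findall("author")],
        "publisher": book.findtext("publisher"),
        "year": int(book.findtext("year")),
    }
    for book in root.findall("book")
]


def as_html(book):
    """One possible look: an HTML paragraph similar to the slide's example."""
    authors = ", ".join(book["authors"])
    return f"<p><i>{book['title']}</i> {authors}<br>{book['publisher']}, {book['year']}</p>"


def as_plain(book):
    """Another look: a one-line plain-text citation."""
    authors = ", ".join(book["authors"])
    return f"{authors}. {book['title']}. {book['publisher']}, {book['year']}."


print(as_html(books[0]))
print(as_plain(books[0]))
```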
Other nice features of XML
- Portability: just like HTML, you can ship XML data across platforms
  - Relational data requires heavy-weight protocols, e.g., JDBC
- Flexibility: you can represent any information (structured, semi-structured, documents, …)
  - Relational data is best suited for structured data
- Extensibility: since data describes itself, you can change the schema easily
  - Relational schema is rigid and difficult to change

XML terminology

    <bibliography>
      <book ISBN="ISBN-10" price="80.00">
        <title>Foundations of Databases</title>
        <is_textbook/>
        <author>Abiteboul</author>
        <author>Hull</author>
        <author>Vianu</author>
        <publisher>Addison Wesley</publisher>
        <year>1995</year>
      </book>…
    </bibliography>

- Tag names: book, title, …
- Start tags: <book>, <title>, …
- End tags: </book>, </title>, …
- An element is enclosed by a pair of start and end tags: <book>…</book>
  - Elements can be nested: <book>…<title>…</title>…</book>
  - Empty elements: <is_textbook></is_textbook>
    - Can be abbreviated: <is_textbook/>
- Elements can also have attributes: <book ISBN="…" price="80.00">

Well-formed XML documents
A well-formed XML document
- Follows XML lexical conventions
  - Wrong: <section>We show that x < 0…</section>
  - Right: <section>We show that x &lt; 0…</section>
  - Other special entities: > becomes &gt; and & becomes &amp;
- Contains a single root element
- Has tags that are properly matched and elements that are properly nested
  - Right: <section>…<subsection>…</subsection>…</section>
  - Wrong: <section>…<subsection>…</section>…</subsection>
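These rules can be tried out with any XML parser, since a conforming parser must reject input that is not well-formed. The sketch below uses Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils.escape`; the helper name `is_well_formed` is made up here, and the elided "…" parts of the slide examples are dropped so the strings are complete documents.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if the parser accepts `text` as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Lexical conventions: a raw "<" in text is rejected, "&lt;" is accepted.
print(is_well_formed("<section>We show that x < 0</section>"))      # False
print(is_well_formed("<section>We show that x &lt; 0</section>"))   # True

# A single root element is required.
print(is_well_formed("<section/><section/>"))                       # False

# Tags must be properly matched and elements properly nested.
print(is_well_formed("<section><subsection></subsection></section>"))  # True
print(is_well_formed("<section><subsection></section></subsection>"))  # False

# escape() replaces &, < and > with the corresponding entities.
print(escape("x < 0 & y > 1"))  # x &lt; 0 &amp; y &gt; 1
```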
More XML features
- Comments: <!-- Comments here -->
- CDATA: <![CDATA[Tags: <book>,…]]>
- IDs and references

    <person id="o12"><name>Homer</name>…</person>
    <person id="o34"><name>Marge</name>…</person>
    <person id="o56" father="o12" mother="o34"><name>Bart</name>…</person>…

- Namespaces allow external schemas and qualified names

    <book xmlns:myCitationStyle="http://…/mySchema">
      <myCitationStyle:title>…</myCitationStyle:title>
      <myCitationStyle:author>…</myCitationStyle:author>…
    </book>

- Processing instructions for apps: <? …java applet… ?>
- And more…

Valid XML documents
- A valid XML document conforms to a Document Type Definition (DTD)
  - A DTD is optional
  - A DTD specifies
    - A grammar for the document
    - Constraints on structures and values of elements, attributes, etc.
- Example

    <!DOCTYPE bibliography [
      <!ELEMENT bibliography (book+)>
      <!ELEMENT book (title, author*, publisher?, year?, section*)>
      <!ATTLIST book ISBN CDATA #REQUIRED>
      <!ATTLIST book price CDATA #IMPLIED>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT publisher (#PCDATA)>
      <!ELEMENT year (#PCDATA)>
      <!ELEMENT section (title, (#PCDATA)?, section*)>
    ]>
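One way to read a content model such as `(title, author*, publisher?, year?, section*)` is as a regular expression over the sequence of child element names. The sketch below applies that idea to the `book` declaration only, using Python's standard `re` and `xml.etree.ElementTree` modules. It is not a DTD validator: it ignores the other declarations and the `#PCDATA` content, and the names `BOOK_MODEL` and `book_is_valid` are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# <!ELEMENT book (title, author*, publisher?, year?, section*)>
# Each child name is followed by "," so that names cannot run together.
BOOK_MODEL = re.compile(r"title,(author,)*(publisher,)?(year,)?(section,)*")


def book_is_valid(book):
    """Check one <book> element against the content model and ISBN #REQUIRED."""
    child_names = "".join(child.tag + "," for child in book)
    has_isbn = "ISBN" in book.attrib  # <!ATTLIST book ISBN CDATA #REQUIRED>
    return has_isbn and BOOK_MODEL.fullmatch(child_names) is not None


good = ET.fromstring(
    '<book ISBN="ISBN-10" price="80.00"><title>Foundations of Databases</title>'
    "<author>Abiteboul</author><author>Hull</author><author>Vianu</author>"
    "<publisher>Addison Wesley</publisher><year>1995</year></book>"
)
no_isbn = ET.fromstring("<book><title>Foundations of Databases</title></book>")
wrong_order = ET.fromstring(
    '<book ISBN="ISBN-10"><title>Foundations of Databases</title>'
    "<year>1995</year><publisher>Addison Wesley</publisher></book>"
)

print(book_is_valid(good))         # True
print(book_is_valid(no_isbn))      # False: required attribute missing
print(book_is_valid(wrong_order))  # False: year must not precede publisher
```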
DTD explained

    <bibliography>
      <book ISBN="ISBN-10" price="80.00">
        <title>Foundations of Databases</title>
        <author>Abiteboul</author>
        <author>Hull</author>
        <author>Vianu</author>
        <publisher>Addison Wesley</publisher>
        <year>1995</year>
      </book>…
    </bibliography>

- <!DOCTYPE bibliography [
  - bibliography is the root element of the document
- <!ELEMENT bibliography (book+)>
  - + means one or more
  - bibliography consists of a sequence of one or more book elements
- <!ELEMENT book (title, author*, publisher?, year?, section*)>
  - ? means zero or one; * means zero or more
  - book consists of a title, zero or more authors, an optional publisher, an optional year, and zero or more sections, in sequence
- <!ATTLIST book ISBN ID #REQUIRED>
  - book has a required ISBN attribute which is a unique identifier
- <!ATTLIST book price CDATA #IMPLIED>
  - book has an optional (#IMPLIED) price attribute which contains character data

DTD explained (cont’d)

    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT publisher (#PCDATA)>
    <!ELEMENT year (#PCDATA)>

- title, author, publisher, and year all contain parsed character data (#PCDATA)
  - PCDATA is text that will be parsed (<…> will be treated as a markup tag, and &lt; etc. will be treated as entities)
  - CDATA is unparsed character data

    <!ELEMENT section (title, (#PCDATA)?, section*)>
    ]>

- Each section starts with a title, followed by some optional text and then zero or more subsections

    <section><title>Introduction</title>
      In this section we introduce XML and
      <section><title>XML</title>
        XML stands for…
      </section>
      <section><title>DTD</title>
        <section><title>Definition</title>
          DTD stands for…
        </section>
        <section><title>Usage</title>
          You can use DTD to…
        </section>
      </section>
    </section>

“Deterministic” content declaration
- Catch: the following declaration does not work:

    <!ELEMENT pub-venue
      ( (name, address, month, year) |
        (name, volume, number, year) )>

  - Because when looking at name, the XML processor would not know which way to go without looking further ahead
- Requirement: content declaration must be “deterministic” (i.e., no look-ahead required)

XML versus relational data

Relational data
- Schema is always fixed in advance and difficult to change
- Simple, flat table structures
- Ordering of rows and columns is unimportant
- Data exchange is problematic
- “Native” support in all serious commercial DBMS

XML data
- Well-formed XML does not require a predefined, fixed schema
- Nested structure; ID/IDREF(S) permit arbitrary graphs
- Ordering forced by document format; may or may not be important
- Designed for easy exchange
- Often implemented as an “add-on” on top of relations

Query languages for XML
- XPath
  - Path expressions with conditions
  - Building block of other standards (XQuery, XSLT, XLink, XPointer, etc.)
- XQuery
  - XPath + full-fledged SQL-like query language
- XSLT
  - XPath + transformation templates
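To give a feel for "path expressions with conditions", the sketch below runs a few such expressions with Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The first book comes from the earlier slides; the second book is hypothetical and was added only so that the conditions have something to filter out.

```python
import xml.etree.ElementTree as ET

doc = """<bibliography>
  <book ISBN="ISBN-10" price="80.00">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  <book ISBN="ISBN-20">
    <title>A Hypothetical Second Book</title>
    <author>Nobody</author>
    <year>2001</year>
  </book>
</bibliography>"""

root = ET.fromstring(doc)

# Plain path expression: the title of every book.
all_titles = [t.text for t in root.findall("book/title")]

# Path with a condition on a child element's text.
titles_1995 = [t.text for t in root.findall("book[year='1995']/title")]

# Path with a condition on the presence of an attribute.
priced_titles = [t.text for t in root.findall("book[@price]/title")]

# Descendant search: every author anywhere below the root.
authors = [a.text for a in root.findall(".//author")]

print(all_titles)
print(titles_1995)
print(priced_titles)
print(authors)
```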