Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Extensible Storage Engine wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Concurrency control wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Relational model wikipedia , lookup
Database model wikipedia , lookup
Oracle Database wikipedia , lookup
Chapter 10 Managing Databases with Oracle 9i SII 654 Fall 2005 Fundamentals, Design, and Implementation, 9/e Introduction Oracle is the world’s most popular DBMS It is a powerful and robust DBMS that runs on many different operating systems Oracle DBMS engine: Personal Oracle and Enterprise Oracle Example of Oracle products – SQL*Plus: a utility for processing SQL and creating components like stored procedures and triggers • PL/SQL is a programming language that adds programming constructs to the SQL language – Oracle Developer (Forms & Reports Builder) – Oracle Designer Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/2 Creating an Oracle Database Installing Oracle – Install Oracle 9i Client to use an already created database – Install Oracle 9i Personal Edition to create your own databases Three ways to create an Oracle database – Via the Oracle Database Configuration Assistant – Via the Oracle-supplied database creation procedures – Via the SQL CREATE DATABASE command Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/3 SQL*Plus Oracle SQL*Plus or the Oracle Enterprise Manager Console may be used to manage an Oracle database SQL*Plus is a text editor available in all Oracle Except inside quotation marks of strings, Oracle commands are case-insensitive The semicolon (;) terminates a SQL statement The right-leaning slash (/) executes SQL statement stored in Oracle buffer SQL*Plus can be used to – Enter SQL statements – Submit SQL files created by text editors, e.g., notepad, to Oracle Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/4 Example: SQL*Plus Prompt Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/5 SQL*Plus Buffer SQL*Plus keeps the current statements in a multi-line buffer without executing it LIST is used to see the contents of the buffer – LIST [line_number] is used to change the current line CHANGE/astring/bstring/ is used to change the contents of the current line – astring = the string you want to change – bstring = what you want to change it to Example: change/Table_Name/*/ – ‘Table_Name’ is replaced with ‘*’ Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/6 Example: SQL*Plus Buffer Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/7 Creating Tables Some of the SQL-92 CREATE TABLE statements need to be modified for Oracle – Oracle does not support a CASCADE UPDATE constraint – Int data type is interpreted by Oracle as Number(38) – Varchar data type is interpreted as VarChar2 – Money or currency is defined in Oracle using the Numeric data type Oracle sequences must be used for surrogate keys DESCRIBE or DESC command is used to view table status Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/8 Oracle Data Types Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/9 Oracle Sequences A sequence is an object that generates a sequential series of unique numbers It is the best way to work with surrogate keys in Oracle Two sequence methods – NextVal provides the next value in a sequence – CurrVal provides the current value in a sequence Using sequences does not guarantee valid surrogate key values because it is possible to have missing, duplicate, or wrong sequence value in the table Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/10 Example: Sequences Creating sequence CREATE SEQUENCE CustID INCREMENT BY 1 START WITH 1000; Entering data using sequence INSERT INTO CUSTOMER (CustomerID, Name, AreaCode, PhoneNumber) VALUES (CustID.NextVal, ‘Mary Jones’, ‘350’, ‘555–1234); Retrieving the row just created SELECT * FROM CUSTOMER WHERE CustomerID = CustID.CurrVal Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/11 DROP and ALTER Statements Drop statements may be used to remove structures from the database – DROP TABLE MYTABLE; • Any data in the MYTABLE table will be lost – DROP SEQUENCE MySequence; ALTER statement may be used to drop (add) a column – ALTER TABLE MYTABLE DROP COLUMN MyColumn; – ALTER TABLE MYTABLE ADD C1 NUMBER(4); Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/12 TO_DATE Function Oracle requires dates in a particular format TO_DATE function may be used to identify the format – TO_DATE(‘11/12/2002’,’MM/DD/YYYY’) • 11/12/2002 is the date value • MM/DD/YYYY is the pattern to be used when interpreting the date TO_DATE function can be used with the INSERT and UPDATE statement to enter data – INSERT INTO T1 VALUES (100, TO_DATE (‘01/05/02’, ‘DD/MM/YY’); Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/13 Creating Indexes Indexes are created to – Enforce uniqueness on columns – Facilitate sorting – Enable fast retrieval by column values Good candidates for indexes are columns that are frequently used with equal conditions in WHERE clause or in a join Example: – CREATE INDEX CustNameIdx ON CUSTOMER(Name); – CREATE UNIQUE INDEX WorkUniqueIndex ON WORK(Title, Copy, ArtistID); Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/14 Restrictions On Column Modifications A column may be dropped at any time and all data will be lost A column may be added at any time as long as it is a NULL column To add a NOT NULL column – Add a NULL column – Fill the new column in every row with data – Change its structure to NOT NULL • ALTER TABLE T1 MODIFY C1 NOT NULL; Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/15 Creating Views SQL-92 CREATE VIEW command can be used to create views in SQL*Plus Oracle allows the ORDER BY clause in view definitions Only Oracle 9i supports the JOIN…ON syntax Example: CREATE VIEW CustomerInterests AS SELECT C.Name as Customer, A.Name as Artist FROM CUSTOMER C JOIN CUSTOMER_ARTIST_INT I ON C.CustomerID = I.CustomerID JOIN ARTIST A ON I.ArtistID = A.ArtistID; Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/16 Enterprise Manager Console The Oracle Enterprise Manager Console provides graphical facilities for managing an Oracle database The utility can be used to manage – Database structures such as tables and views – User accounts, passwords, roles, and privileges The Manager Console includes a SQL scratchpad for executing SQL statements Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/17 Application Logic Oracle database application can be processed using – Programming language to invoke Oracle DBMS commands – Stored procedures – Start command to invoke database commands stored in .sql files – Triggers Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/18 Stored Procedures A stored procedure is a PL/SQL or Java program stored within the database Stored procedures are programs that can – – – – Have parameters Invoke other procedures and functions Return values Raise exceptions A stored procedure must be compiled and stored in the database Execute or Exec command is used to invoke a stored procedure – Exec Customer_Insert (‘Michael Bench’, ‘203’, ‘5552014’, ‘US’); Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/19 Example: Stored Procedure Insert Figure 10-20 IN signifies input parameters OUT signifies an output parameter IN OUT signifies a parameter used for both input and output Variables are declared after the keyword AS Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/20 Triggers Oracle triggers are PL/SQL or Java procedures that are invoked when specified database activity occurs Triggers can be used to – – – – – Enforce a business rule Set complex default values Update a view Perform a referential integrity action Handle exceptions Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/21 Triggers (cont.) Trigger types – A command trigger will be fired once per SQL command – A row trigger will be fired once for every row involved in the processing of a SQL command • Three types of row triggers: BEFORE, AFTER, and INSTEAD OF • BEFORE and AFTER triggers are placed on tables while INSTEAD OF triggers are placed on views • Each trigger can be fired on insert, update, or delete commands Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/22 Data Dictionary Oracle maintains a data dictionary of metadata The metadata of the dictionary itself are stored in the table DICT SELECT Table_Name, Comments FROM DICT WHERE Table_Name LIKE (‘%TABLES%’); USER_TABLES contains information about user or system tables DESC USER_TABLES; Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/23 Example Oracle Metadata Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/24 Concurrency Control Oracle processes database changes by maintaining a System Change Number (SCN) – SCN is a database-wide value that is incremented by Oracle when database changes are made With SCN, SQL statements always read a consistent set of values; those that were committed at or before the time the statement was started Oracle only reads committed changes; it will never reads dirty data Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/25 Oracle Transaction Isolation Oracle supports the following transaction isolation levels – Read Committed: Oracle’s default transaction isolation level since it never reads uncommitted data changes – Serializable: Dirty reads are not possible, repeated reads yield the same results, and phantoms are not possible – Read Only: All statements read consistent data. No inserts, updates, or deletions are possible – Explicit locks: Not recommended Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/26 Oracle Security Oracle security components: – An ACCOUNT is a user account – A PROFILE is a set of system resource maximums that are assigned to an account – A PRIVILEGE is the right to perform a task – A ROLE consists of groups of PRIVILEGEs and other ROLEs Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/27 Account System Privileges Each ACCOUNT can be allocated many SYSTEM PRIVILEGEs and many ROLEs An ACCOUNT has all the PRIVILEGEs – That have been assigned directly – Of all of its ROLEs – Of all of its ROLEs that are inherited through ROLE connections A ROLE can have many SYSTEM PRIVILEGEs and it may also have a relationship to other ROLEs ROLEs simplify the administration of the database – A set of privileges can be assigned to or removed from a ROLE just once Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/28 Account Authentication Accounts can be authenticated by – Password – The host operating system Password management can be specified via PROFILEs Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/29 Oracle Recovery Facilities Three file types for Oracle recovery: – Datafiles contain user and system data – ReDo log files contain logs of database changes • OnLine ReDo files are maintained on disk and contain the rollback segments from recent database changes • Offline or Archive ReDo files are backups of the OnLine ReDo files – Control files describe the name, contents, and locations of various files used by Oracle Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/30 Oracle Recovery Facilities (cont.) Oracle can operate in either ARCHIVELOG or NOARCHIVELOG mode – If running in ARCHIVELOG mode, Oracle logs all changes to the database – When the OnLine ReDo files fill up, they are copied to the Archive ReDo files The Oracle Recovery Manager (RMAN) is a utility program used to create backups and to perform recovery Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/31 Types of Failure Oracle recovery techniques depend on the type of failure – An application failure due to application logic errors – An instance failure occurs when Oracle itself fails due to an operating system or computer hardware failure • Oracle can recover from application and instance failure without using the archived log file – A media failure occurs when Oracle is unable to write to a physical file because of a disk failure or corrupted files • The database is restored from a backup Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/32 Oracle Backup Facilities Two kinds of backups A consistent backup: Database activity must be stopped and all uncommitted changes have been removed from the datafiles – Cannot be done if the database supports 24/7 operations An inconsistent backup: Backup is made while Oracle is processing the database – An inconsistent backup can be made consistent by processing an archive log file Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/33 Chapter 10 Managing Databases with Oracle 9i Instructor: Dragomir R. Radev Winter 2005 Fundamentals, Design, and Implementation, 9/e KDD and Data Mining Instructor: Dragomir R. Radev Winter 2005 Fundamentals, Design, and Implementation, 9/e The big problem Billions of records A small number of interesting patterns “Data rich but information poor” Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/36 Data mining Knowledge discovery Knowledge extraction Data/pattern analysis Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/37 Types of source data Relational databases Transactional databases Web logs Textual databases Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/38 Association rules 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings Association rules: X Y Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/39 Association analysis IF 20 < age < 30 AND 20K < INCOME < 30K THEN – Buys (“CD player”) SUPPORT = 2%, CONFIDENCE = 60% Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/40 Basic concepts Minimum support threshold Minimum confidence threshold Itemsets Occurrence frequency of an itemset Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/41 Association rule mining Find all frequent itemsets Generate strong association rules from the frequent itemsets Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/42 Support and confidence Support (X) Confidence (X Y) = Support(X+Y) / Support (X) Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/43 Example TID T100 T200 T300 T400 T500 T600 T700 T800 T900 List of item IDs I1, I2, I5 I2, I4 I2, I3 I1, I2, I4 I1, I3 I2, I3 I1, I3 I1, I2, I3, I5 I1, I2, I3 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/44 Example (cont’d) Frequent itemset l = {I1, I2, I5} I1 AND I2 I5 C = 2/4 = 50% I1 AND I5 I2 I2 AND I5 I1 I1 I2 AND I5 I2 I1 AND I5 I3 I1 AND I2 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/45 Example 2 TID date items T100 10/15/99 {K, A, D, B} T200 10/15/99 {D, A, C, E, B} T300 10/19/99 {C, A, B, E} T400 10/22/99 {B, A, D} min_sup = 60%, min_conf = 80% Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/46 Correlations Corr (A,B) = P (A OR B) / P(A) P (B) If Corr < 1: A discourages B (negative correlation) (lift of the association rule A B) Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/47 Contingency table Game ^Game Sum Video 4,000 3,500 7,500 ^Video 2,000 500 2,500 Sum 6,000 4,000 10,000 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/48 Example P({game}) = 0.60 P({video}) = 0.75 P({game,video}) = 0.40 P({game,video})/(P({game})x(P({video })) = 0.40/(0.60 x 0.75) = 0.89 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/49 Example 2 hotdogs ^hotdogs Sum hamburgers 2000 500 2500 ^hamburgers 1000 1500 2500 Sum 3000 2000 5000 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/50 Classification using decision trees Expected information need I (s1, s2, …, sm) = - S pi log (pi) s = data samples m = number of classes Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/51 RID Age Income student credit buys? 1 <= 30 High No Fair No 2 <= 30 High No Excellent No 3 31 .. 40 High No Fair Yes 4 > 40 Medium No Fair Yes 5 > 40 Low Yes Fair Yes 6 > 40 Low Yes Excellent No 7 31 .. 40 Low Yes Excellent Yes 8 <= 30 Medium No Fair No 9 <= 30 Low Yes Fair Yes 10 > 40 Medium Yes Fair Yes 11 <= 30 Medium Yes Excellent Yes 12 31 .. 40 Medium No Excellent Yes 13 31 .. 40 High Yes Fair Yes 14 > 40 Medium no excellent no Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/52 Decision tree induction I(s1,s2) = I(9,5) = = - 9/14 log 9/14 – 5/14 log 5/14 = = 0.940 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/53 Entropy and information gain •E(A) = S S1j + … + smj s I (s1j,…,smj) Entropy = expected information based on the partitioning into subsets by A Gain (A) = I (s1,s2,…,sm) – E(A) Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/54 Entropy Age <= 30 s11 = 2, s21 = 3, I(s11, s21) = 0.971 Age in 31 .. 40 s12 = 4, s22 = 0, I (s12,s22) = 0 Age > 40 s13 = 3, s23 = 2, I (s13,s23) = 0.971 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/55 Entropy (cont’d) E (age) = 5/14 I (s11,s21) + 4/14 I (s12,s22) + 5/14 I (S13,s23) = 0.694 Gain (age) = I (s1,s2) – E(age) = 0.246 Gain (income) = 0.029, Gain (student) = 0.151, Gain (credit) = 0.048 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/56 Final decision tree age > 40 31 .. 40 student credit yes no yes no yes excellent no Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke fair yes Chapter 10/57 Other techniques Bayesian classifiers X: age <=30, income = medium, student = yes, credit = fair P(yes) = 9/14 = 0.643 P(no) = 5/14 = 0.357 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/58 Example P (age < 30 | yes) = 2/9 = 0.222 P (age < 30 | no) = 3/5 = 0.600 P (income = medium | yes) = 4/9 = 0.444 P (income = medium | no) = 2/5 = 0.400 P (student = yes | yes) = 6/9 = 0.667 P (student = yes | no) = 1/5 = 0.200 P (credit = fair | yes) = 6/9 = 0.667 P (credit = fair | no) = 2/5 = 0.400 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/59 Example (cont’d) P (X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P (X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019 P (X | yes) P (yes) = 0.044 x 0.643 = 0.028 P (X | no) P (no) = 0.019 x 0.357 = 0.007 Answer: yes/no? Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/60 Predictive models Inputs (e.g., medical history, age) Output (e.g., will patient experience any side effects) Some models are better than others Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/61 Principles of data mining Training/test sets Error analysis and overfitting error test training Cross-validation input size Supervised vs. unsupervised methods Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/62 Representing data Vector space credit pay off default salary Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/63 Decision surfaces credit pay off default salary Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/64 Decision trees credit pay off default salary Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/65 Linear boundary credit pay off default salary Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/66 kNN models Assign each element to the closest cluster Demos: – http://www2.cs.cmu.edu/~zhuxj/courseproject/knnd emo/KNN.html Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/67 Other methods Decision trees Neural networks Support vector machines Demos – http://www.cs.technion.ac.il/~rani/LocBo ost/ Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/68 arff files @data sunny,85,85,FALSE,no @relation weather sunny,80,90,TRUE,no overcast,83,86,FALSE,yes @attribute outlook {sunny, overcast, rainy} @attribute temperature real @attribute humidity real @attribute windy {TRUE, FALSE} @attribute play {yes, no} rainy,70,96,FALSE,yes rainy,68,80,FALSE,yes rainy,65,70,TRUE,no overcast,64,65,TRUE,yes sunny,72,95,FALSE,no sunny,69,70,FALSE,yes rainy,75,80,FALSE,yes sunny,75,70,TRUE,yes overcast,72,90,TRUE,yes overcast,81,75,FALSE,yes rainy,71,91,TRUE,no Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/69 Weka http://www.cs.waikato.ac.nz/ml/weka Methods: rules.ZeroR bayes.NaiveBayes trees.j48.J48 lazy.IBk trees.DecisionStump Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/70 kMeans clustering http://www.cc.gatech.edu/~dellaert/html/sof tware.html java weka.clusterers.SimpleKMeans -t data/weather.arff Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/71 More useful pointers http://www.kdnuggets.com/ http://www.twocrows.com/booklet.htm Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/72 More types of data mining Classification and prediction Cluster analysis (clustering) Outlier analysis Evolution analysis Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/73 Clustering Exclusive/overlapping clusters Hierarchical/flat clusters Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/74 Methods Single-linkage – One common pair is sufficient – disadvantages: long chains Complete-linkage – All pairs have to match – Disadvantages: too conservative Average-linkage Centroid-based (online) – Look at distances to centroids Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/75 k-means Needed: small number k of desired clusters hard vs. soft decisions Example: Weka Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/76 k-means 1 initialize cluster centroids to arbitrary vectors 2 while further improvement is possible do 3 for each document d do 4 find the cluster c whose centroid is closest to d 5 assign d to cluster c 6 end for 7 for each cluster c do 8 recompute the centroid of cluster c based on its documents 9 end for 10 end while Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/77 Example Cluster the following vectors into two groups: – A = <1,6> – B = <2,2> – C = <4,0> – D = <3,3> – E = <2,5> – F = <2,1> Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/78 Demos http://vivisimo.com/ http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_h tml/AppletKM.html http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_kmeans http://www.cs.washington.edu/research/imagedatabase/dem o/kmcluster http://www.cc.gatech.edu/~dellaert/html/software.html http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf http://www.ece.neu.edu/groups/rpl/projects/kmeans/ % cd /data2/tools/weka-3-3-4 % export CLASSPATH=/data2/tools/weka-3-3-4/weka.jar % java weka.clusterers.SimpleKMeans -t data/weather.arff Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke Chapter 10/79