Data Mining and Knowledge Discovery
By Matt Goliber and Jim Hougas

What is Data Mining?
• Not like gold or diamond mining
• Mining of knowledge from data
• Important to many different fields
• A part of Knowledge Discovery in Databases (KDD)

The Process of Knowledge Discovery
Raw data
  -> Data cleaning and integration -> Data Warehouse
  -> Data transformation, selection, and mining -> Patterns
  -> Pattern evaluation and knowledge presentation -> KNOWLEDGE!

Why is Data Mining useful?
• We are data rich but information poor
  - Internet
  - Intelligence
• Humans often lack the ability to comprehend and manage the immense amount of available and sometimes seemingly unrelated data

How long has this idea been around?
• Late 1960s and early 1970s
• Stanford’s Meta-DENDRAL (1970-76)
  - Extension of DENDRAL
• Doug Lenat with AM (1976)

Meta-DENDRAL
• Extension of the DENDRAL (1965) program
  - One of the first expert systems
  - Interpreted mass spectra
• Meta-DENDRAL took the mass spectra of compounds of known 3D structure and formulated rules about the interpretation of the spectra
• Came up with known rules and some new ones!

Sample Mass Spec
[Figure: mass spectrum of ethyl 3-oxo-3-phenylpropanoate (ethyl benzoylacetate)]

AM
• Doug Lenat, 1976
• The name means nothing; it stands alone
• AM was given sets, bags, ordered sets, and lists
• AM was also given operations to perform on these data sets
  - Union, intersection, etc.
• Came up with ideas about counting, addition, multiplication, prime numbers, and Goldbach’s conjecture
• AM thought that these were all uninteresting
• Liked maximally divisible numbers though…

What next?
• Not a whole lot…
• Databases were not prevalent enough, so there was no great demand
• Did benefit from machine learning research
• Beginning of the 1990s, “The next area…”
  - Ranked as one of the most promising research areas (NSF)
  - Information explosion
• Early commercial systems
  - Farm Journal
  - GM

Next Generation Techniques
• Decision Trees
  - Each branch is a classification question
  - Allows businesses to segment customers, products, and sales regions
  - Questions organize the data
• Rule Induction
  - All patterns are pulled from the data
  - Accuracy and significance are then attached to each pattern
  - These help the user know how strong a pattern is and how likely it is to occur again
  - Ex: If bagels are purchased, then cream cheese is purchased 90% of the time, and this pattern occurs in 3% of all shopping baskets

Decision Trees vs. Rule Induction
• Decision Trees
  - Always one and only one rule covers each instance
• Rule Induction
  - Many rules may cover the same instance, or
  - no rule may cover an instance
• Example
  - Given two correlated predictors, height and shoe size, a decision tree settles on one of them to determine the size of a person
  - Rule induction can produce rules from both

Examples of Significant Developments
• Stock Market Advances (1991)
  - Astrophysicists Doyne Farmer and Norman Packard
  - Their Prediction Company could predict stock market trends
• Bell Atlantic (1996)
  - Consumer phone buying trends
  - Rule Induction
• Advanced Scout (1997)
  - Inderpal Bhandari assists NBA coaches
  - Rule Induction
• Persuade 400,000 undecided voters (2004)
  - MoveOn attempts to influence the election
  - Decision Tree

Challenges
• Large Data Sets with High Complexity
  - One or the other is currently possible, but not both
• Expensive
  - Costs of the Bell Atlantic project (experts are needed)
  - Cost for a two-day course in Las Vegas ($1,300)
  - Software ($100,000)

Research
• DARPA
  - Defense Advanced Research Projects Agency
  - ACLU claims this is an invasion of privacy
  - Decision Tree
• Uncovering Terrorists in public chat rooms
  - Tracks the times that messages are sent
• Advanced Scout
  - Bhandari is working on Advanced Scout for the NHL
  - Rule Induction

Current State
• Out of the Lab
  - Into Fortune 500 companies
• Automate Model Scoring
  - Fingers are currently crossed in hopes that scoring by IT personnel is done correctly

Future States
• Utilizing Company Warehouses
  - Data miners must take advantage of a million-dollar warehouse that a company builds
• Effort Knob
  - Low for a quick model, high for a quality model
• Computed Target Columns
  - User could create a new target variable
  - Ex: finance information that a business has

Sources
http://web.media.mit.edu/~haase/thesis/node54.html#SECTION00711000000000000000
http://smi-web.stanford.edu/projects/history.html#METADENDRAL
http://www.cs.cf.ac.uk/Dave/AI2/node151.html
http://64.233.161.104/search?q=cache:Q6eMD9tEKwIJ:www.cosc.brocku.ca/Offerings/4P79/Week12.ppt+meta-dendral&hl=en
http://laurel.actlab.utexas.edu/~cynbe/muq/muf3_21.html
http://64.233.161.104/search?q=cache:yft0cQ5tZJQJ:www.cs.uwaterloo.ca/~shallit/Talks/cct.ps+%22fundamental+theorem+of+arithmetic%22+computer+data+mining+prove&hl=en
http://mathworld.wolfram.com/GoldbachConjecture.html
http://www.quantlet.com/mdstat/scripts/csa/html/node202.html
http://www.thearling.com
http://www.wired.com
http://www.dmreview.com
http://www.ebscohost.com
http://www.thearling.com/text/dmtechniques/dmtechniques.htm
http://www.aaai.org/Library/Magazine/Vol13/13-03/vol13-03.html
Data Mining: Concepts and Techniques. Han J. and Kamber M.
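
Worked example: accuracy and significance of a rule
The Rule Induction slide attaches two numbers to the bagels and cream cheese rule: how often the rule holds when bagels are bought (accuracy) and how often the whole pattern shows up across all baskets (significance). The sketch below shows one way those two numbers could be counted. It is only an illustration under stated assumptions: the function name rule_stats and the five sample baskets are hypothetical and made up for this example, so the output does not reproduce the 90% and 3% figures from the slide.

```python
from typing import Iterable, Set, Tuple


def rule_stats(
    baskets: Iterable[Set[str]], antecedent: str, consequent: str
) -> Tuple[float, float]:
    """Return (accuracy, significance) for the rule 'if antecedent then consequent'.

    accuracy     = baskets with both items / baskets with the antecedent
    significance = baskets with both items / all baskets
    """
    baskets = list(baskets)
    if not baskets:
        raise ValueError("need at least one basket")

    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]

    # If the antecedent never occurs, the rule has no evidence; report 0.0.
    accuracy = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    significance = len(with_both) / len(baskets)
    return accuracy, significance


# Hypothetical toy data, invented for illustration only.
baskets = [
    {"bagels", "cream cheese"},
    {"bagels", "cream cheese", "coffee"},
    {"bagels", "juice"},
    {"milk", "eggs"},
    {"coffee"},
]

accuracy, significance = rule_stats(baskets, "bagels", "cream cheese")
print(f"accuracy={accuracy:.2f}, significance={significance:.2f}")
```

Here three baskets contain bagels and two of those also contain cream cheese, so the rule holds in two of three bagel baskets and the pattern appears in two of five baskets overall. A rule with high accuracy but very low significance may be strong yet too rare to act on, which is why the slide reports both.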