Data Mining
A Brief Overview
Copyright © Curt Hill 2003-2016

The Problem
• Huge volumes of data overwhelm traditional methods of data analysis such as:
  – Spreadsheets
  – Ad hoc queries
  – Multidimensional analysis tools
  – Statistical analysis packages

What is Data Mining?
• Exploratory data analysis based on a data warehouse
  – Knowledge Discovery in Databases (KDD)
• Data Mining extracts previously unknown and potentially useful information
  – Rules, constraints, correlations, patterns, signatures and irregularities
• The goal is to automate the methods for finding these in the data

Data Warehouse
• A database usually separated from the operational database(s)
• Used as a base for decision support systems
  – Upper and middle management
  – Not used for day to day management but for spotting trends and making path decisions
• Typically very large and composed of recent copies from the operational database(s)
• Data Mining is one of the applications that could use it

Goals of Data Mining
• Prediction of future behaviors
  – Seasonal or non-seasonal trends
  – How will consumers respond to discounts?
  – Allows the enterprise to be ready
• Identification of item, event or activity
  – Intruders may be identified by the files they access or programs they use

Goals Again
• Classification of categories of users or products
  – Shoppers may be categorized as:
    • Discount seeking
    • Rush
    • Regular
    • Attached to certain brand names
  – The store may be made more friendly to such shoppers
• Optimize the use of time, space, materials and money

Knowledge Discovery
• There are several types of discoverable knowledge
  – Association Rules
  – Classification hierarchies
  – Sequential patterns
  – Time series patterns
  – Clustering
• Each of these needs more information

Association Rules
• What we are looking for is knowledge of associations that are not obvious
• This has gained traction in market basket research
  – Very profitable information
• If an MRI has characteristics a and b then it often has c
  – This is an association rule

Market Basket Model
• Premise: the items in a checkout transaction are not random
• Thus we analyze customer transactions for patterns or association rules
• These patterns may guide decisions on
  – Sale items
  – Shelf arrangement or product placement

Retail Example
• A young father goes to the store to buy disposable diapers
• On his way through the store he sees a Sports Illustrated and buys it
• In general, people do not impulse buy disposable diapers, but while buying these, they may buy something else on impulse
• Can we examine retail transaction records and perceive the connection?
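As a rough illustration of the market basket premise, the sketch below counts how often pairs of items appear in the same checkout transaction. The transactions, the item names, and the helper count_pairs are all invented for illustration; real ticket data would be far larger and would need a more scalable approach.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout transactions (invented data, not from any real store).
transactions = [
    {"diapers", "magazine", "milk"},
    {"diapers", "magazine"},
    {"milk", "bread"},
    {"diapers", "bread", "magazine"},
    {"milk", "magazine"},
]

def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair one canonical order,
        # e.g. ("diapers", "magazine") and never the reverse.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

pair_counts = count_pairs(transactions)

# Show the pairs that co-occur most often.
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Pairs that co-occur far more often than chance would suggest are the candidates for association rules. Counting every pair is only practical for small catalogs, which is one reason the number of items and tickets matters so much.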
Association Rule
• Is of the form: X => Y
  – Where both X and Y could be sets of items
• The support of this rule is the percent of total transactions that have both
• The confidence of this rule is the number of transactions that have both divided by the number of transactions that have the first one (X)
• High support and high confidence indicate a rule that business decisions may be based upon
  – Put magazine rack on the route to the diapers

Agriculture Example
• Landsat satellites are in polar orbits
• They record data on all land every 18 days
• A pixel is approximately 31 yards on a side
• Seven bands from near infrared to ultraviolet are recorded for each pixel
• Each band produces a 1 byte value
• Can you get this data in a spreadsheet?

Agricultural Rule
• In middle summer a near infrared value in the range 48 to 255 and red in the range 0 to 31 suggests that the yield will be 128 to 255 bushels per acre
• If the support and confidence are high this suggests that the farmer should apply nitrogen to the areas where near infrared was less than 47 and red was greater than 32

Computational Difficulties
• Consider how many tickets a supermarket or department store might generate
• In general, most of these tickets have more than two or three items
• The store carries thousands of items
• Discovering these association rules becomes computationally taxing
• One good reason to keep this off of the operational database

Algorithm Properties
• There are a number of algorithms for finding these rules
• These typically exploit two properties:
  – Downward closure
    • The subset of a large itemset should also have large support
    • Removing a few items does not hurt
  – Antimonotonicity
    • The superset of a small itemset should have small support

Classification
• Classifying data into predetermined groups
• Then we can deal with the groups in different ways
• AKA supervised learning
  – Developed by Artificial Intelligence
• The process of clustering is attempting to classify data in groups that are not predetermined

Models
• The two typical models are decision trees and a set of rules
• We look at the data to build the model and then use the model for new data
• Consider in the next slide a decision tree for granting a credit card to an applicant

Example: Decision Tree
• Married?
  – Yes: Salary
    • <25K: Poor
    • 25K to 75K: Fair
    • >75K: Good
  – No: Balance
    • <5K: Poor
    • >5K: Age
      – <25: Fair
      – >25: Good

Clustering
• AKA unsupervised learning
• Classify the data into groups that you are not aware of to begin with
• A distance function must be supplied that describes the distance between two points
  – The points are often not purely numeric
  – They are often not in 2 dimensions or even 3 which makes things interesting

Applications
• Marketing
  – Determine advertising, store placement, segmentation of customers
• Finance
  – Analysis of performance of securities
• Manufacturing
  – Optimizing resources, designing the manufacturing process
• Health Care
  – Discovery of items in X-Ray and MRI images
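The support and confidence of an association rule X => Y can be sketched in a few lines. This sketch assumes the standard definitions: support is the fraction of all transactions that contain both X and Y, and confidence is the number of transactions containing both divided by the number containing X. The transactions and the function names support and confidence are invented for illustration.

```python
# Hypothetical transactions (invented for illustration).
transactions = [
    {"diapers", "magazine", "milk"},
    {"diapers", "magazine"},
    {"milk", "bread"},
    {"diapers", "bread"},
    {"milk", "magazine"},
]

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(x, y, baskets):
    """Of the baskets that contain X, the fraction that also contain Y."""
    with_x = sum(1 for basket in baskets if x <= basket)
    with_both = sum(1 for basket in baskets if (x | y) <= basket)
    return with_both / with_x if with_x else 0.0

# Rule under test: {diapers} => {magazine}
x, y = {"diapers"}, {"magazine"}
rule_support = support(x | y, transactions)
rule_confidence = confidence(x, y, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

The same support function also illustrates downward closure: support({"diapers"}) can never be smaller than support({"diapers", "magazine"}), because every basket holding the larger itemset also holds the smaller one. Rule-finding algorithms use this to discard large itemsets early once a subset is known to be rare.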
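One reading of the credit card decision tree can be written directly as nested conditions. This is only a sketch: the function name credit_risk is hypothetical, the middle salary branch (25K to 75K giving Fair) is inferred from the slide layout, and the handling of values that fall exactly on a threshold is an assumption, since the slide does not say.

```python
def credit_risk(married, salary=0, balance=0, age=0):
    """Walk the example tree and return 'Poor', 'Fair' or 'Good'.

    Boundary handling (e.g. a salary of exactly 25,000) is an assumption.
    """
    if married:
        # Married applicants are judged on salary alone.
        if salary < 25_000:
            return "Poor"
        if salary <= 75_000:
            return "Fair"
        return "Good"
    # Unmarried applicants are judged on account balance, then age.
    if balance < 5_000:
        return "Poor"
    return "Fair" if age < 25 else "Good"

# Classify two made-up applicants.
print(credit_risk(married=True, salary=50_000))
print(credit_risk(married=False, balance=10_000, age=30))
```

In practice the tree is not written by hand like this. A learning algorithm builds it from records whose risk class is already known, and the finished tree is then applied to new applicants.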
Example
• Certain diseases switch on genes characteristic to that disease
• Drugs often switch off a gene
• In 2011 a database of genes and what affected them was mined
• The result was that mice infected with small cell lung cancer were treated with an antidepressant, imipramine
  – The tumors were reduced

Telco Example
• A local telephone company mines its connection data for possible marketing opportunities
• A phone very busy in the 3PM to 6PM range suggests a teenager
  – Pitch a teen phone
• Busy in the 9AM to 5PM range suggests a home business
  – Pitch a business line

Social Media
• Publicly viewable social media presents a very large quantity of data
• However it is:
  – Noisy
  – Unstructured
  – Dynamic
• It is of great interest in political campaigns, marketing, health care
  – This is where people express things first

Data Scientists
• Has a nicer ring than knowledge workers but is a similar position
• A 2016 survey considered how they spend their time:
  – Cleaning and organizing data 60%
  – Collecting data sets 19%
  – Mining data for patterns 9%
  – Refining algorithms 4%
  – Building training sets 3%
  – Other 5%
• Data janitors

Skills
• According to the same survey, the skills in the most demand are:
  – SQL – Structured Query Language
  – Hadoop – framework for distributed storage and processing of big data
  – Python – programming language
  – Java – programming language
  – R – programming language
  – Hive – SQL-like data warehouse layer on Hadoop
  – MapReduce – programming model for processing data in parallel across many machines
  – NoSQL – class of non-relational databases
  – Pig – system to analyze big data
  – SAS – Statistical Analysis System

Finally
• Much of the analysis done in data mining has been done for centuries
  – What is different now is the amount and types of captured data
• There are a number of commercial tools for mining
• Many large companies have substantial investment and return on their mining activities