Data Mining
Knowledge Discovery in Databases
Data 3

Data Mining
• Data mining is a capability to support the recognition of previously unknown but potentially useful relationships within large databases/data warehouses.
• Aim: find useful patterns in the data.
• Uses statistical, mathematical, artificial intelligence, and machine-learning techniques.

Data Mining Tools
• Data mining tools use statistical or rules-based methods to identify patterns and create predictive models.
• Tools look for patterns using a variety of models:
  – Statistical methods e.g. correlation
  – Decision trees
  – Case-based reasoning
  – Neural computing
  – Intelligent agents
  – Genetic algorithms

Text Mining
• Text Mining
  – Analyse text documents.
  – Find hidden content.
  – Group by themes.
  – Determine relationships between documents.

Process of Data Mining/Knowledge Discovery
[Figure: the data mining/knowledge discovery process. Stages shown: Pattern Evaluation, Data Mining, Task-relevant Data, Data Warehouse, Selection, Data Cleaning, Data Integration, Databases]

What does it let you do?
• Data mining automates the process of sifting through historical data in order to discover new information.
• Data mining techniques enable users to identify patterns and correlations within a set of data.
• These can then be used as predictive models that anticipate behaviour or events based on trends in the data.

Correlation versus Causation
• Correlation
  – A statistical relation between two or more variables such that changes in the value of one variable are accompanied by changes in the value of the other.
• Causation
  – Changes in one variable cause changes in another.

What do you need for Data Mining?
• Massive data collection
• Powerful computers
• Data mining algorithms

Five Basic Operations
• Clustering – identifies groups of items that share a particular characteristic
• Classification – infers the defining characteristics of a certain group
• Association – identifies relationships between events that occur at the one time
• Sequencing – identifies relationships over time
• Forecasting – estimates future values based on patterns within large sets of data

Clustering
• The process of identifying relationships between similar records without any preconceived notion of what that similarity might involve.
• Examples:
  – Disease clusters
  – Similarities in customers' telephone usage
• Often used as an exploratory exercise before further data mining using a classification technique.

Classification
• The DM system learns from examples of the data how to partition or classify the data, i.e. it formulates classification rules which can be used for prediction.
  – Example: a bank classifies customers and may offer them differing levels of service, different offers, different charges. Can build loan approval models.

Association
• Looks for links between records in a data set
  – e.g. items purchased at the one time.
• Patterns can be identified to indicate probabilities, e.g.
  – 500,000 transactions
  – 20,000 nappies
  – 30,000 beer
  – 10,000 nappies + beer
• From these counts:
  – Beer and nappies occur together in 2% of transactions.
  – “when people buy beer they buy nappies 1/3 of the time”
  – “when people buy nappies they buy beer 50% of the time”

Sequential Analysis
• A form of association used to track relationships over time.
  – E.g. health insurance claims.
  – E.g. 10% of customers who bought a tent bought a backpack within one month.
  – Weather patterns e.g. tidal wave in Hawaii follows hurricane in N. Atlantic x% of the time.

Forecasting
• Concerns the prediction of continuous variables e.g. sales, share values, stock market levels, oil prices etc.
• Often done with regression functions: statistical methods for examining the relationship between variables in order to predict a future value.
• Two types:
  – Forecasting a single continuous value based on unordered examples, e.g. predict income based on personal details.
  – Predicting one or more values based on a sequential pattern (time series forecasting).

Data Mining Tools in more detail
• Case-based reasoning
  – Use historical cases to identify patterns.
• Neural computing
  – Examine historical data for pattern recognition e.g. identify potential customers for a new product.
• Intelligent agents
  – Retrieve information from large databases.
• Other tools e.g. decision trees, rule induction, data visualisation.

Some Key Application Areas
• Data mining is used in many different areas.
• Two big areas are:
  – Market analysis and management
    • Initial data gathered from credit card transactions, loyalty cards, discount coupons, customer complaint calls, lifestyle studies, focus groups
  – Fraud detection and management

Examples: Market analysis and management
• Target marketing
  – Find clusters of “model” customers who share the same characteristics, e.g. interests, income.
• Determine customer purchasing patterns over time.
• Cross-market analysis uses associations/correlations between product sales and predicts based on the association information.
• Customer profiling
  – What types of customers buy what products.
• Identifying customer requirements
  – Identify the best products for different customers; use prediction to find what factors will attract new customers.

Fraud detection and management
• Used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Use historical data to build models of fraudulent behaviour and use data mining to help identify similar instances.
• Examples
  – Auto insurance: detect a group of people who stage accidents to collect on insurance.
  – Money laundering: detect suspicious money transactions.
  – Medical insurance: detect professional patients, rings of doctors and rings of references.

Text Mining
• Application of data mining to unstructured or less structured files.
• Text mining operates with less structured information and helps organisations to:
  – Find hidden content of documents, including useful relationships.
  – Relate documents across unnoticed divisions, e.g. customers in two product divisions have the same characteristics.
  – Group documents by themes, e.g. all customers who have similar complaints.

Some more example applications by area
• Marketing: predicting which customers will respond to internet banners or buy a product; segmenting customer demographics.
• Banking: forecasting bad loans and fraudulent credit card usage, credit card spending by new customers, and which customers will respond best to new loan offers.
• Retailing and Sales: predicting sales, correct stock levels, distribution schedules.
• Manufacturing and Production: predicting when to expect machinery failures, finding key factors that control the optimisation of manufacturing capacity.
• Brokerage and Securities Trading: predicting when bond prices will change, forecasting range of stock fluctuation for particular issues, determining when to trade stock.
• Insurance: forecasting claim amounts and medical coverage costs, classifying the most important elements that affect medical coverage, predicting which customers will buy new policies.
• Computer Hardware and Software: predicting drive failure, forecasting creation time for new chips, predicting potential security violations.
• Government and Defence: forecasting cost of moving military equipment, testing strategies for potential military engagements, predicting resource consumption.
• Airlines: capturing data on which customers are flying and the destinations of those who change carriers mid-flight.
• Healthcare: correlating demographics of patients with critical illnesses.
• Broadcasting: determining which programs are best shown in prime time and how to maximize returns by inserting advertisements.
• Police: tracking crime patterns, locations, criminal behaviour and attributes to help crack criminal cases.

Problems with data mining
• Need clear business objectives and access to the appropriate data.
• Need the right data.
  – Bad data quality can lead to spurious results.
• Models are not fail-safe.
• Privacy, property and other legal and ethical issues.
• Companies must change mode of operation and maintain the effort (e.g. loyalty programs such as air miles).

Conclusion
• Data mining is an attractive-sounding technology which is still evolving.
• The key is that the algorithms discover useful relationships.
  – Unlike standard research, where researchers hypothesise correlations and then search for them.
• There are ethical issues:
  – E.g. criminal profiling.
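
Worked example: the Association figures
The percentages quoted on the Association slide (2%, 1/3 and 50%) follow directly from the four counts given there. The short Python sketch below reproduces that arithmetic. The names "support" and "confidence" are the usual association-rule terms for these ratios; they are not used in the slides, and the function and variable names are illustrative assumptions only.

```python
# Counts taken from the Association slide.
total_transactions = 500_000
nappies = 20_000          # transactions containing nappies
beer = 30_000             # transactions containing beer
both = 10_000             # transactions containing nappies and beer


def support(count_together: int, total: int) -> float:
    """Share of all transactions in which the items occur together."""
    return count_together / total


def confidence(count_together: int, count_antecedent: int) -> float:
    """Share of transactions with the 'if' item that also contain the 'then' item."""
    return count_together / count_antecedent


support_both = support(both, total_transactions)
beer_then_nappies = confidence(both, beer)
nappies_then_beer = confidence(both, nappies)

print(f"Beer and nappies together: {support_both:.1%} of transactions")
print(f"Beer buyers who also buy nappies: {beer_then_nappies:.1%}")
print(f"Nappy buyers who also buy beer: {nappies_then_beer:.1%}")
```

The two confidence values differ because each is divided by a different base (30,000 beer transactions versus 20,000 nappy transactions), which is why the same 10,000 joint purchases give 1/3 in one direction and 50% in the other.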