Understanding Data Mining
Data mining has become one of the latest trends in using data. Rod Newing
explains that it is a complex process which has been around for a long time.
Organisations world-wide are accumulating vast quantities of
electronic data as databases
become ever more pervasive. The recent
trend to implement a data warehouse
architecture is increasing the quality and
accessibility of data. This is all being
done at great cost, but the information
is only valuable if used effectively.
Users have been using query tools,
OLAP servers, Business Intelligence
tools, Enterprise Information Systems
and a wide range of other packaged
software to examine their data.
However, these tools either work
with summarised data or answer users' specific questions. The more numerate
analysts have recognised that there are
hidden patterns, relationships and rules
in their data which cannot be found by
using these traditional methods.
The answer is to use specialist software which harnesses advanced
mathematics to examine large volumes
of detailed data. This specialist group of
software has become known as "data
mining" or "knowledge discovery". Data
mining is defined as the process of extracting valid, previously unknown and
ultimately comprehensible information from large databases and using it
to make critical business decisions.
The name is derived from the process of sifting large amounts of ore to
discover nuggets of gold, just as the
software is able to sift large volumes of
data to find nuggets of information
which yield gold in the form of competitive advantage. The extracted
information can be used to do one or
more of the following:
● Provide an understanding of data relationships to end users.
● Form a prediction or classification model.
● Allow prediction of future trends based on past experience.
● Identify relationships between database records.
● Provide a summary of the database being mined.
With a query, the user knows what is in the database and knows what information to ask for, so they must already know what patterns exist. With data mining,
the software establishes the patterns
and relationships. It is possible to carry
out data mining operations using a
query tool, but the process is extremely
complex and would be prohibitively labour-intensive. Data mining software uses algorithms which have
automated most of the work involved.
Data mining differs from statistical
analysis in that the latter is used to verify
existing knowledge in order to prove a
known relationship. Most data mining
involves carrying out several different
operations using more than one technology, so it should be thought of as an
operation, rather than a product.
Data mining can be carried out on
any data file, from a spreadsheet to a
data warehouse. Transaction processing systems can be mined, and the
exercise can be used to generate
benefits which can help to justify the
considerable investment required to implement a data warehouse architecture.
Figure 1 outlines the major milestones in the evolution of Data Mining.

1960s - Data collection. Business question: "What was my total revenue in each of the last five years?" Enabling technologies: computers, tapes, disks. Characteristics: retrospective, static data delivery.
1980s - Data access. Business question: "What were unit sales in New England in March?" Enabling technologies: relational databases, SQL, ODBC. Characteristics: retrospective, dynamic data delivery at record level.
1990s - Data warehousing and decision support. Business question: "What were unit sales in New England in March? Drill down to Boston." Enabling technologies: On-Line Analytical Processing, data warehouses. Characteristics: retrospective, dynamic data delivery at multiple levels.
Now - Data mining. Business question: "What is likely to happen to Boston unit sales next month? Why?" Enabling technologies: advanced algorithms, multi-processor computers, massive databases. Characteristics: prospective, proactive information delivery.

Figure 1 - Milestones in the evolution of Data Mining.
Objectives
Data mining can achieve a number
of different objectives, using one or
more different technologies.
Prediction And Classification
This approach uses the historical
data in the database to predict future
behaviour. It creates a generalised description which characterises the
contents of the database by generating
an understandable model. It enables
the model to be applied to new data
sets in order to predict the behaviour
hidden in that data. For example, a
predictive model of existing customers
can be applied to potential customers
in order to identify those most likely to
purchase a particular product or service. This approach has traditionally used statistical techniques, but many automatic model-development techniques are emerging, often based on supervised induction.
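To make this concrete, here is a minimal sketch of such a predictive model using the open-source scikit-learn library (the library, the attribute names and all of the figures are my own invention, not something described in this article): a decision tree is induced from historical customers and then applied to prospects.

```python
# A minimal predictive/classification sketch using scikit-learn.
# The customer records and attribute names are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Historical customers: [age, income (thousands), years_as_customer]
existing_customers = [
    [25, 18, 1], [47, 55, 9], [35, 42, 4], [52, 61, 12],
    [23, 21, 2], [44, 48, 7], [31, 30, 3], [58, 70, 15],
]
purchased = [0, 1, 0, 1, 0, 1, 0, 1]   # 1 = bought the product

# Induce a generalised model (here a decision tree) from the training set.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(existing_customers, purchased)

# The induced rules are comprehensible to the end user.
print(export_text(model, feature_names=["age", "income", "years"]))

# Apply the model to potential customers to predict likely purchasers.
prospects = [[29, 25, 0], [50, 58, 0]]
print(model.predict(prospects))        # e.g. [0 1]
```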
Analysing Links
Data mining can be used to establish
relationships between the records in
the database which would otherwise
be impossible to find because they cannot be predicted and so cannot be
found other than by accident. It is a
relatively recent technique which has become well known through shopping basket analysis, indicating the popular combinations purchased by retail
customers.
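As a rough sketch of the idea, the standard-library Python fragment below counts how often pairs of items appear together in a handful of invented baskets and reports the combinations with a high confidence factor; a real exercise would run over millions of EPOS records.

```python
# Toy shopping-basket analysis: support and confidence for item pairs.
# The baskets are invented; real analysis would read EPOS transaction data.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"beer", "crisps", "bread"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n                  # how often the pair occurs at all
    confidence = count / item_counts[a]  # "confidence factor": P(b given a)
    if confidence >= 0.6 and count >= 2:
        print(f"{a} -> {b}: support {support:.0%}, confidence {confidence:.0%}")
```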
Segmenting Databases
This is a form of sophisticated query
to identify common groups of records
within a database. It may be a technique in its own right or may be used
to prepare data for further processing.
The Process
There are four basic steps which need to be carried out in order to complete a data mining exercise.

Data Selection
The objective determines the type of information and the way it is organised. Only part of the data available from the source data file will be needed, so the relevant data must be identified. Noise and missing values may need to be addressed. It may also be preferable to sample the data required and mine the sample.

Data Transformation
Once it has been selected, the data may need to be transformed. For instance, neural networks require nominal values to be converted to numeric ones. Alternatively, derived attributes may need to be created by applying mathematical or logical operators, such as a ratio or logarithmic value.
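The fragment below sketches both kinds of transformation using the pandas and NumPy libraries (my choice of tools, with invented column names and values): a nominal attribute is converted into numeric indicator columns, and a ratio and a logarithmic attribute are derived.

```python
# Sketch of pre-mining data transformation with pandas.
# Column names and values are invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "South", "South", "West"],   # nominal attribute
    "revenue": [1200.0, 450.0, 980.0, 15000.0],
    "visits":  [10, 3, 8, 60],
})

# 1. Convert the nominal value to numeric indicator columns
#    (e.g. for a neural network that needs numeric inputs).
df = pd.get_dummies(df, columns=["region"])

# 2. Derive new attributes with mathematical operators.
df["revenue_per_visit"] = df["revenue"] / df["visits"]   # ratio
df["log_revenue"] = np.log(df["revenue"])                # logarithmic value

print(df)
```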
Applying Algorithms
One or more data mining techniques are carried out to try to extract the required information or meet the required objective. Some of the algorithms used are described in Figure 2.

Results Interpretation
The result of applying data mining algorithms will be tables of values or relationships. The user will have to look for interesting groupings of data and establish if there is any business value in them. They need to be analysed using a data visualisation (see Figure 3) or decision support tool. Visualisation helps the user to understand the data and identify patterns. If the objective is to produce a model, it must be validated and tested. It may be necessary to refine the data, repeating the sequence again. This process is often referred to as "data refining".

Techniques
There are a number of techniques for carrying out the data mining exercise.

Supervised Induction
Supervised induction automatically
creates a classification model from a set
of records, known as a "training set",
which may be the whole database or a
sample of data from it. The induced
model consists of generalised patterns
which can be used to classify new records. It can use neural networks or
decision trees, but the latter do not
work well with noisy data.
It produces high quality models,
even when data in the training set is
poor or incomplete. The result is more
accurate than that obtained using statistical methods, because it checks for
local patterns, whereas the latter work
across the entire database. The models
are easy for the user to understand. An
example would be a credit card analysis to discover the attributes of a good
credit risk in order to predict credit worthiness of applicants.

Detecting Deviations
This identifies unusual values which do not conform to the expected pattern. It is often a source of new knowledge since the results defy known logic. It is also used in fraud detection, where unusual values may represent an unauthorised transaction.
Neural Networks
Software which learns from training to identify patterns and construct a model. This model is then applied to larger data sets to predict its structures. It can also identify changes, which then become a notifiable event.

Decision Trees
Decision trees are tree-shaped structures which represent sets of decisions. They generate rules for classifying the data set, using algorithms such as ID3, Classification and Regression Trees ("CART") and Chi Square Automatic Interaction Detection ("CHAID").

Clustering Methods
In this method, artificial intelligence search techniques are used to identify subsets in a cluster. It uses software such as AQ11, UNIMEM and COBWEB.

Rule Induction
Rule induction involves the extraction of "if ... then ..." rules from data based on statistical significance. Examples are IBM's RMINI, and FOIL, which are in the public domain.

Genetic Algorithms
This is an optimisation technique which uses processes such as genetic combination, mutation and natural selection in a design based on the concepts of evolution.

Figure 2 - Data Mining Technologies.
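To give a feel for the last entry in Figure 2, here is a small self-contained genetic algorithm sketch (an invented toy problem, not taken from any of the products discussed here) that evolves a bit string towards a target using selection, crossover and mutation.

```python
# Minimal genetic algorithm: evolve a bit string to match a target pattern.
# Purely illustrative; real data mining GAs optimise model parameters or rules.
import random

random.seed(1)
TARGET = [1] * 20                                   # fitness peak: all ones
POP, GENERATIONS, MUTATION = 30, 60, 0.02

def fitness(ind):
    return sum(a == b for a, b in zip(ind, TARGET))

def crossover(a, b):                                # genetic combination
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(ind):
    return [bit ^ 1 if random.random() < MUTATION else bit for bit in ind]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(POP)]
for gen in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET):
        break
    parents = population[: POP // 2]                # natural selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print(gen, fitness(population[0]), population[0])
```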
Association Discovery
This is a technique which identifies the affinities which exist among records. The output might find that 67% of records containing A, B and C also contain Y and Z. The percentage is known as the "confidence factor". An example of association discovery is market basket analysis.

Sequence Discovery
This is similar to association discovery, but works over time. It is frequently directed towards individual customers as a means of identifying their preferences. It detects buying patterns which occur in a sequence of related transactions. It is used for targeting direct mail.

Clustering
This technique is used to segment a
database into subsets of mutually exclusive groups. The members of each
group should be as close to each other
as possible and as far apart from other
groups as possible. The members of
each cluster should possess properties
which are interesting to the user. Data
visualisation techniques are then used
to examine each cluster to establish
which are useful or interesting.
It is less precise than other techniques because of redundant or
irrelevant data. The solution is for the
user to direct the software to ignore
subsets of attributes, assign weightings
to them or apply filters to the information. The importance of the attributes
themselves can be established using
statistical methods.
Clustering can also be used to provide data for other techniques, such as
supervised induction. Clusters can be
created using statistics, neural networks or unsupervised induction.
However, using statistical methods
makes it difficult to assign new records
to existing clusters, because of the difficulty of measuring and handling their deviation from those clusters.
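As an illustrative sketch, the fragment below segments a few invented customer records into mutually exclusive groups with the k-means algorithm from scikit-learn (one clustering method among many), then assigns a new record to its nearest cluster.

```python
# Sketch of clustering-based segmentation with scikit-learn's k-means.
# The customer figures (annual spend, visits per year) are invented.
from sklearn.cluster import KMeans

customers = [
    [200, 2], [250, 3], [220, 2],        # low spend, infrequent
    [1200, 25], [1500, 30], [1350, 28],  # high spend, frequent
    [700, 10], [650, 12],                # middle ground
]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)                       # cluster membership for each record

# The cluster centres summarise each segment for the user to interpret.
print(kmeans.cluster_centers_)

# New records can be assigned to the nearest existing cluster.
print(kmeans.predict([[1280, 26]]))
```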
Data visualisation provides the user with visual summaries of the results of
the data mining algorithms. This helps them to understand the results of the
data mining algorithms by communicating relationships in a way that rows
and columns cannot. It is interactive, allowing the user to filter or change the
information displayed. The user can also change the presentation method
used, such as from a histogram to a scatter chart.
Visualisation allows users to browse the data looking for unusual features.
It is good at identifying small meaningful sub-sets of data which defy
conventional wisdom. These "outliers" are anomalies which may be errors,
or genuine and valuable exceptions to established wisdom.
A wide range of advanced chart types can be used:
● Geographical maps, combined with histograms, colour coding, pie charts etc.
● Tree maps showing the hierarchy of a classified database.
● Rule visualisation.
● Trends.
● Scatter graphs.
● Heat maps.
These chart types are very advanced when compared with traditional
graphing tools and need powerful workstations. For instance, a five dimensional chart can be created by representing clusters on a three dimensional
scatter chart as a sphere. The size and colour of the sphere represent the
fourth and fifth dimensions.
The time dimension can be incorporated by "playing" the chart like a video.
The user can watch the movements in a multi-dimensional chart as it changes
with the elapsed time.
Figure 3 - Data visualisation.
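The script below is a rough illustration of the five-dimensional chart described in Figure 3, written with matplotlib (my choice of tool, not one of the products in this article): randomly generated points are plotted on a three-dimensional scatter chart, with marker size and colour carrying the fourth and fifth dimensions.

```python
# Sketch of a "five-dimensional" scatter chart: x, y, z position plus
# marker size and colour. All data is randomly generated for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y, z = rng.random(50), rng.random(50), rng.random(50)
size = rng.random(50) * 300        # fourth dimension -> sphere size
value = rng.random(50)             # fifth dimension -> colour

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
points = ax.scatter(x, y, z, s=size, c=value, cmap="viridis", alpha=0.7)
fig.colorbar(points, label="fifth dimension")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
plt.show()
```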
Supplier - Product - Contact Details
Angoss - Knowledge Seeker - http://www.angoss.com
Attar - XpertRule - http://www.attar.com
Brann Software - Viper - http://www.brannsoftware.co.uk
DataMind Corporation - Mine Your Own Business - http://www.datamindcorp.com
EDS - Dbintellect - http://www.dbintellect.com
IBM - Intelligent Miner, Intelligent Decision Server - http://www.software.ibm.com
Integral Solutions - Clementine - http://www.isl.co.uk
Right Information Systems - 4Thought - http://www.4thought.com
The SAS Institute - Neural Network Application, Insight, Spectraview, GIS - http://www.sas.com
Silicon Graphics - MineSet - http://www.sgi.com
SPSS - SPSS CHAID, Neural Connection, Professional Statistics etc - http://www.spss.com
Figure 4 - The Main Data Mining Products.
Supplier - Product - Tool - Contact Details
Cognos - PowerPlay - 4Thought, Knowledge Seeker - http://www.cognos.com
Comshare - Commander Decision - Own - http://www.comshare.com
NCR - Knowledge Discovery Workbench - Clementine - http://www.ncr.com
Holistic Systems - Holos - Own - http://www.holossys.com
Oracle - Express - Partners' - http://www.oracle.com
Pilot Software - Pilot Discovery Server - Own, based on the Thinking Machine - http://www.pilotsw.com
Planning Sciences - Gentia - Own, plus Intelligent Miner - http://www.gentium.com
Red Brick Systems - Red Brick Data Mine - Mine Your Own Business - http://www.redbrick.com
Figure 5 - Products incorporating data mining.
Applications
The importance of data mining has been recognised by information intensive industries which have large databases of customer transactions, such as banking, health care, insurance, marketing, retail and telecommunications.
One of the most well-known data
mining applications is market/shopping basket analysis. This involves
running an association discovery operation over Electronic Point Of Sale
(EPOS) data. It analyses the combinations of products purchased by
individual buyers to find dependencies. Until the recent arrival of
loyalty cards, it was the only way that supermarkets and high street stores had to understand who their customers are and how they behave.
Other common applications are for
promotion effectiveness, customer vulnerability analysis, cross-selling,
portfolio creation and fraud detection.
It is also used in healthcare, where it
can find relationships between patient
histories, illnesses and surgical operations. It is also used in manufacturing
processes to monitor quality and spot
machine wear.
In marketing, if an organisation
wants to cross-sell one product to another, it cannot target all customers,
because the volume may be too large.
Therefore it is necessary to mine the
database of existing customers to
identify patterns which describe the
characteristics of purchasers of the product. These patterns can then be
applied to the database of customers
who have not purchased the product to
segment and predict those who are
more likely to purchase the product.
These are then targeted in a very specific marketing campaign.
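A hedged sketch of that scoring step, again with scikit-learn on invented data: a model trained on existing purchasers assigns each non-purchaser a probability of buying, and only the highest-scoring prospects go into the campaign.

```python
# Sketch: score non-purchasers and target only the most likely buyers.
# Features, labels and the cut-off are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_existing = np.array([[25, 1], [47, 9], [35, 4], [52, 12], [23, 2], [44, 7]])
bought = np.array([0, 1, 0, 1, 0, 1])             # did they buy the product?

model = LogisticRegression().fit(X_existing, bought)

# Customers who have not bought the product yet.
X_prospects = np.array([[29, 2], [50, 10], [40, 6], [22, 1]])
scores = model.predict_proba(X_prospects)[:, 1]   # probability of purchase

top = np.argsort(scores)[::-1][:2]                # target, say, the top two
print("Target customers:", top, "scores:", scores[top].round(2))
```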
Data mining is often used to predict
and identify people most likely to respond to direct mail. This reduces the
cost of mailing without affecting the
response rate. Organisations have found up to a twenty-fold decrease in costs compared with conventional approaches.
The data mining operation can also be
taken a step further by identifying clusters of the most profitable likely
customers, which may be different to
those most likely to respond.
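The tiny sketch below illustrates that distinction with invented figures: ranking prospects by expected profit (response probability multiplied by likely margin) can produce a different mailing list from ranking by response probability alone.

```python
# Rank prospects by response probability versus expected profit.
# All probabilities and margins are invented for illustration.
prospects = {
    "A": (0.30, 10.0),   # (probability of responding, expected margin)
    "B": (0.05, 400.0),
    "C": (0.25, 15.0),
    "D": (0.02, 900.0),
}

by_response = sorted(prospects, key=lambda k: prospects[k][0], reverse=True)
by_profit = sorted(prospects, key=lambda k: prospects[k][0] * prospects[k][1],
                   reverse=True)

print("Most likely to respond:", by_response)   # ['A', 'C', 'B', 'D']
print("Most profitable:       ", by_profit)     # ['B', 'D', 'C', 'A']
```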
Identifying exceptions can be just as
important as finding hidden patterns.
In fraud detection, credit card transactions are often analysed by a neural
network to identify unusual transactions which may indicate that the card
is not being used by its holder, even
before the loss is reported.
It is important to understand that a
particular data mining exercise may
use more than one stage and use several algorithms by passing the results
from one analysis to another. For instance, the user might produce
associations using a decision tree and
then pass the result to a neural network
to identify changes over time. Mining
elements can be combined in an infinite
variety of ways.
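As one hedged sketch of such a combination (invented data, and a different pairing of techniques from the article's example): a clustering pass segments the records, and the resulting cluster label is then fed into a supervised model as an extra attribute.

```python
# Combining techniques: pass the output of clustering into a classifier.
# Data and attribute names are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 2))                      # two invented customer attributes
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # invented outcome to predict

# Stage 1: unsupervised clustering segments the records.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Stage 2: the cluster label becomes an extra input to supervised induction.
X_plus = np.column_stack([X, clusters])
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_plus, y)
print("Training accuracy:", round(model.score(X_plus, y), 2))
```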
Software
For most organisations, the software needs to be scalable from a stand-alone PC to a parallel-processing server. This allows data mining operations to be carried out on desktop databases, relational or multi-dimensional data marts, transaction processing systems or enterprise data warehouses.

Because of the different techniques and technologies, the software needs to integrate various different algorithms into one product. Most vendors use several different ones and are writing further modules to expand the scope of their products. The software must present the data in an easy to understand manner so that users can assess its significance to the business. It may incorporate its own visualisation tools or work with third-party packages.

The software must incorporate filters to remove "noise", which is incorrect information or spurious relationships. For instance, the software shouldn't waste the user's time by reporting that 99.9% of married people have a spouse of the opposite gender!
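One simple way to implement such a filter, sketched below on invented rule statistics, is to discard rules whose lift is close to one (the items occur together no more often than chance would suggest) or whose confidence is so close to 100% that the rule merely states the obvious.

```python
# Filter out "noise": rules that are statistically trivial or state the obvious.
# The rule statistics below are invented for illustration.
rules = [
    # (rule text, confidence, lift)
    ("married -> has_spouse", 0.999, 1.0),
    ("nappies -> beer", 0.35, 2.4),
    ("bread -> milk", 0.52, 1.05),
]

interesting = [
    (text, conf, lift)
    for text, conf, lift in rules
    if lift > 1.2 and conf < 0.95     # drop chance-level and self-evident rules
]
print(interesting)                    # only "nappies -> beer" survives
```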
Software for data mining is available either direct from the authors or
through decision support vendors who
have embedded it into their own applications. IBM and the other vendors
have open Application Programming
Interfaces so that application builders
can add value to their decision support
software by driving a data mining engine from their own tools.
The Author
Rod Newing MBA FCA FInstD is
a specialist writer on Executive
Computing. He can be contacted
via email as [email protected].