Republic of Moldova Ministry of Education
Balti State University "Alecu Russo"
Faculty of Real, Economic and Environmental Sciences
Department of Mathematics and Informatics

Master Thesis on the topic:
DATA MINING AND OLAP ADVANCED DATA ANALYSIS METHODS FOR THE VACANCIES' MARKET

Realized by: Tatiana Mihailov, student of the MITT21M group, speciality "Innovational Management and Technological Transfer"
Scientific coordinator: dr., sup. lect., Corina Negara

Bălţi - 2016

Content

I. ADVANCED DATA ANALYSIS AND PROCESSING
1.1. Data Mining main notions
1.1.1. Data mining stages
1.1.2. Data mining process
1.1.2.1. Data Preprocessing
1.2. Data Mining Tools
1.3. Data Mining Techniques and Their Application
1.3.1. Classification trees
1.3.2. Text Mining
1.3.3. Other data mining techniques
1.4. OLAP main notions
1.5. OLAP and Data Mining Comparison
1.6. Integration of OLAP and Data Mining
II. VACANCIES' MARKET ANALYSIS WITH DATA MINING AND OLAP
2.1. Data Mining Query Language
2.2. Database structure
2.3. The interests of companies that search for workers and of individuals who search for work
2.4. Realising data mining in WEKA
2.4.1. Classification trees. Specifying the Criteria for Predictive Accuracy
2.4.2. Classification trees. Selecting Splits
2.4.3. Classification trees. Determining When to Stop Splitting
2.4.4. Classification trees. Selecting the "Right-Sized" Tree
2.5. Realising OLAP functions in FastCube
III. PRACTICAL APPLICATIONS DESCRIPTION
3.1. Data preprocessing
3.2. Data classification
3.3. Multidimensional Data Analysis OLAP
CONCLUSIONS
BIBLIOGRAPHY

Introduction

The research's actuality. More and more information is being produced, largely due to the internet. But is all that data truly information? Not always: only data that is understood, or that brings some new value to the receiver, can be called information. Much data is collected, and the problem that arises is extracting information back out of this huge amount of data. Search engines and spiders help to find almost anything on the internet, but the databases collected by companies keep growing. Agencies and institutions need to process databases from several companies in order to find the answers to their questions. Enormous amounts of data need to be mined in order to analyze their content and make decisions that will influence the actions and strategies of entrepreneurs.

The cornerstone of all business activities (and of any other intentional activities, for that matter) is information processing. This includes data collection, storage, transportation, manipulation, and retrieval (with or without the aid of computers). Good information about world events helps financial traders make better trading decisions, directly resulting in better profits for the trading firm. This is very valuable: major trading firms invest heavily in information technologies, and good traders are handsomely rewarded [1].

If hundreds of unemployed people come to an employment agency and many of them leave disappointed because, after a week or a month, they have not found anything that fits their profile, then it is time for the agency to rethink the way it delivers its services and how to improve the system. In such a situation Data Mining or OLAP techniques may be very useful. A better service leads to a higher rate of employment, and a higher rate of employment improves the agency's standing: the more results it shows, the more institutions, enterprises and individuals approach the agency when workers are needed.
At a higher level of cooperation between agencies and people who cannot subordinate themselves to someone else - potential entrepreneurs - managers and analysts may ask higher-level analytical questions, such as which products or services have been most popular in the town this year, which are needed but not yet delivered, or whether it is the same group of products that was most profitable last year. The answers to these types of questions represent information that is both analysis based and decision oriented. Decision-oriented software activities are more complex, but the good news is that data mining and OLAP are the solution [1].

The problem of this research is that the world has known about data mining for almost ten years, but in Moldova, a small country with good potential in programming, this area is little developed. This can be concluded from the number of bibliographical sources on this topic by authors from Moldova. Even if some students study it, it is still unknown whether it is applied anywhere.

The goal of this thesis is to analyze the data mining and OLAP technologies and to suggest solutions based on data mining, on OLAP, and on data mining combined with OLAP. The objectives of the thesis are the following:
critical analysis of the literature on the topic;
analysis of the way Data Mining may be used;
analysis of the way OLAP may be used;
making a set of recommendations on when each approach is the best choice.

The short description of the thesis by chapters: the introduction formulates the actuality of the topic, the problem of this research, the goal and objectives of the research, and what each chapter presents. Chapter I, entitled "Advanced Data Analysis and Processing", presents an objective study of the literature on the topic, namely the relevant information about the Data Mining and OLAP terms and the tools that use each of these methods. Chapter II, entitled "Vacancies' Market Analysis with Data Mining and OLAP", describes how the studied concepts can be applied in order to compare the two tools. Chapter III presents the work that was done in order to see how both techniques can be applied to make the work of labour agencies more effective. The thesis is written on 56 pages and contains 22 bibliographical sources, 60 figures and 2 tables.

I. ADVANCED DATA ANALYSIS AND PROCESSING

1.1. Data Mining main notions

Data mining is the process of discovering interesting knowledge from large amounts of data. It is an interdisciplinary field with contributions from many areas, such as statistics, artificial intelligence, large databases, machine learning, information retrieval, pattern recognition and bioinformatics. Artificial intelligence is based on heuristics and tries to use methods similar to human thinking in order to solve statistical problems [2]. Data mining is widely used in many domains, such as retail, finance, telecommunication and social media. The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one with the most direct business applications [3].

The term Data Mining takes its name from two notions: (1) the search for valuable information in big databases (data) and (2) digging in mines (mining). Both processes require either sifting through a huge amount of raw material, or a rational search for and examination of the sought-after values.
The term Data Mining is often taken to mean extracting information, excavation, intellectual analysis of data, a means of finding patterns, knowledge extraction, analysis of patterns, "extracting the seeds of knowledge from mountains of data", knowledge excavation from databases, informational sifting of data, data "cleansing". The expression Knowledge Discovery in Databases (KDD) can be considered a synonym of Data Mining.

The definition of Data Mining appeared in 1978 and has gained high popularity in its modern interpretation since approximately the first half of the 1990s. Until then data processing and analysis were carried out within the framework of applied statistics, which mostly dealt with processing small-scale databases. The term data mining has developed together with the progress of database system technology (fig. 1.1). Data mining as a process means processing based on performant patterns of data selection and aggregation from data warehouses [4].

In its simplest form, data mining automates the detection of relevant patterns in a database, using defined approaches and algorithms to look into current and historical data that can then be analyzed to predict future trends. Because data mining tools predict future trends and behaviors by reading through databases for hidden patterns, they allow organizations to make proactive, knowledge-driven decisions and answer questions that were previously too time-consuming to resolve [5].

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. [6]

Figure 1.1. The evolution of database system technology [7].

Table 1.1 gives a short description of some of the disciplines at whose junction the Data Mining technology has appeared. Each of the fields that formed Data Mining has its own characteristics; let us compare some of them.

Table 1.1. Comparison of statistics, machine learning and Data Mining
Statistics: more than Data Mining, it is based on theory; it is more concentrated on the process of checking hypotheses.
Machine Learning: is more heuristic; focused on improving the performance of learning agents.
Data Mining: integrates theory and heuristics; focused on a unified process of data analysis that includes data cleaning, learning, integration and visualisation of the results.

In real-world applications, a data mining process can be broken into six major phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment, as defined by CRISP-DM (the Cross Industry Standard Process for Data Mining). In order to do data mining, a data warehouse or a big database is necessary. The most efficient data warehousing architecture will be capable of incorporating, or at least referencing, all data available in the relevant enterprise-wide information management systems, using designated technology suitable for corporate database management (e.g., Oracle, Sybase, MS SQL Server).
Also, a flexible, high-performance (see the IDP technology), open-architecture approach to data warehousing - one that flexibly integrates with the existing corporate systems and allows users to organize and efficiently reference, for analytic purposes, enterprise repositories of data of practically any complexity - is offered in StatSoft enterprise systems such as STATISTICA Enterprise and STATISTICA Enterprise/QC, which can also work in conjunction with STATISTICA Data Miner and STATISTICA Enterprise Server [8].

1.1.1. Data mining stages

The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning data, data transformations, selecting subsets of records and - in the case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods being considered). Then, depending on the nature of the analytic problem, this first stage of the data mining process may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.

Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact it sometimes involves a very elaborate process. There is a variety of techniques developed to achieve that goal, many of which are based on so-called "competitive evaluation of models", that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques, which are often considered the core of predictive data mining, include Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

Stage 3: Deployment. The final stage involves taking the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome [9].

1.1.2. Data mining process

Knowledge discovery as a process is depicted in Figure 1.2 and consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data);
2. Data integration (where multiple data sources may be combined);
3. Data selection (where data relevant to the analysis task are retrieved from the database);
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, by performing summary or aggregation operations, for instance);
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns);
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures);
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).

Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one, because it uncovers hidden patterns for evaluation. We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data. [7]

Figure 1.2. Data mining as a step in the knowledge discovery process. [7]

1.1.2.1. Data Preprocessing

Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. "How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?"

There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may be applied; for example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or reduce the time required for the actual mining.

A foundation for data preprocessing is descriptive data summarization. Descriptive data summarization helps us study the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning and data integration. The methods for data preprocessing are organized into the following categories: data cleaning, data integration and transformation, and data reduction. Concept hierarchies can be used in an alternative form of data reduction where we replace low-level data (such as raw values for age) with higher-level concepts (such as youth, middle-aged, or senior). The automatic generation of concept hierarchies from categorical data is also described.

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history of modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

There are many possible reasons for noisy data (data having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors at data entry. Errors in data transmission can also occur. There may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in the naming conventions or data codes used, or from inconsistent formats for input fields, such as dates. Duplicate tuples also require data cleaning. [7]

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied to them. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust; instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run the data through some data cleaning routines.

Often one would like to include data from multiple sources in the analysis. This involves integrating multiple databases, data cubes, or files, that is, data integration. Yet some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be referred to as customer id in one data store and cust id in another. Naming inconsistencies may also occur for attribute values: the same first name could be registered as "Bill" in one database, "William" in another, and "B." in a third. Furthermore, some attributes may be inferable from others (e.g., annual revenue). Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies during data integration. Typically, data cleaning and data integration are performed as a preprocessing step when preparing the data for a data warehouse. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.

Suppose, further, that a distance-based mining algorithm has been chosen for the analysis, such as neural networks, nearest-neighbor classifiers, or clustering. Such methods provide better results if the data to be analyzed have been normalized, that is, scaled to a specific range such as [0.0, 1.0]. Customer data, for example, may contain the attributes age and annual salary. The annual salary attribute usually takes much larger values than age. Therefore, if the attributes are left unnormalized, the distance measurements taken on annual salary will generally outweigh distance measurements taken on age.
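To make this kind of scaling concrete in the tool used later in this thesis, the following is a minimal Java sketch of min-max normalization with Weka. It is only an illustration: it assumes Weka is on the classpath and uses a hypothetical data file named vacancies.arff; Weka's unsupervised Normalize filter rescales every numeric attribute to [0, 1] by default, so age and annual salary would then contribute comparably to distance computations.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeExample {
    public static void main(String[] args) throws Exception {
        // Load the dataset (vacancies.arff is a hypothetical file name)
        Instances data = new DataSource("vacancies.arff").getDataSet();

        // Rescale every numeric attribute (e.g. age, annual salary) to [0, 1]
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        Instances scaled = Filter.useFilter(data, normalize);

        System.out.println(scaled.numInstances() + " instances normalized");
    }
}

After this step, distance-based methods such as nearest-neighbor classifiers or clustering treat all numeric attributes on a comparable footing.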
Furthermore, it may be useful for the analysis to obtain aggregate information, such as the sales per customer region - something that is not part of any precomputed data cube in the data warehouse. Data transformation operations, such as normalization and aggregation, are thus additional data preprocessing procedures that contribute toward the success of the mining process.

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. There are a number of strategies for data reduction. These include data aggregation (e.g., building a data cube), attribute subset selection (e.g., removing irrelevant attributes through correlation analysis), dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or wavelets), and numerosity reduction (e.g., "replacing" the data by alternative, smaller representations such as clusters or parametric models). Data can also be "reduced" by generalization with the use of concept hierarchies, where low-level concepts, such as city for customer location, are replaced with higher-level concepts, such as region or province or state. A concept hierarchy organizes the concepts into varying levels of abstraction. Data discretization is a form of data reduction that is very useful for the automatic generation of concept hierarchies from numerical data; concept hierarchies for categorical data can also be generated automatically.

Note that the above categorization is not mutually exclusive. For example, the removal of redundant data may be seen as a form of data cleaning as well as of data reduction. In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.

Descriptive Data Summarization. For data preprocessing to be successful, it is essential to have an overall picture of the data. Descriptive data summarization techniques can be used to identify the typical properties of the data and highlight which data values should be treated as noise or outliers. Thus, we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of data preprocessing techniques. For many data preprocessing tasks, users would like to learn about data characteristics regarding both the central tendency and the dispersion of the data. Measures of central tendency include the mean, median, mode, and midrange, while measures of data dispersion include quartiles, the interquartile range (IQR), and the variance. These descriptive statistics are of great help in understanding the distribution of the data. Such measures have been studied extensively in the statistical literature. From the data mining point of view, we need to examine how they can be computed efficiently in large databases. In particular, it is necessary to introduce the notions of distributive measure, algebraic measure, and holistic measure. Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it.

Measuring the Central Tendency.
There are various ways to measure the central tendency of data. The most common and most effective numerical measure of the "center" of a set of data is the (arithmetic) mean. Let x1, x2, ..., xN be a set of N values or observations of some attribute, such as salary. The mean of this set of values is

mean = (x1 + x2 + ... + xN) / N.

This corresponds to the built-in aggregate function average (avg() in SQL) provided in relational database systems.

A distributive measure is a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure's value for the original (entire) data set. Both sum() and count() are distributive measures because they can be computed in this manner. Other examples include max() and min(). An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. Hence, the average (or mean()) is an algebraic measure, because it can be computed as sum()/count(). When computing data cubes, sum() and count() are typically saved in precomputation; thus, the derivation of the average for data cubes is straightforward.

Sometimes, each value xi in a set may be associated with a weight wi, for i = 1, ..., N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case we can compute

weighted mean = (w1·x1 + w2·x2 + ... + wN·xN) / (w1 + w2 + ... + wN).

This is called the weighted arithmetic mean or the weighted average. Note that the weighted average is another example of an algebraic measure.

Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data. A major problem with the mean is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values can corrupt the mean. For example, the mean salary at a company may be substantially pushed up by that of a few highly paid managers. Similarly, the average score of a class in an exam could be pulled down quite a bit by a few very low scores. To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean, which is the mean obtained after chopping off values at the high and low extremes. For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.

For skewed (asymmetric) data, a better measure of the center of the data is the median. Suppose that a given data set of N distinct values is sorted in numerical order. If N is odd, then the median is the middle value of the ordered set; otherwise (i.e., if N is even), the median is the average of the middle two values.
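As a small numeric illustration of these definitions (the salary values below are invented for the example and are not taken from the thesis data), the following Java sketch computes the mean, the weighted mean and the median, and shows how a single extreme value pulls the mean but barely affects the median:

import java.util.Arrays;

public class CentralTendency {

    // Arithmetic mean: (x1 + x2 + ... + xN) / N
    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Weighted arithmetic mean: sum(wi * xi) / sum(wi)
    static double weightedMean(double[] x, double[] w) {
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += w[i] * x[i];
            den += w[i];
        }
        return num / den;
    }

    // Median: middle value of the sorted data, or the average of the two middle values
    static double median(double[] x) {
        double[] s = x.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] salaries = {300, 320, 350, 400, 2500};   // one extreme value
        double[] weights  = {1, 1, 1, 1, 1};              // equal weights

        System.out.println("mean          = " + mean(salaries));                  // pulled up by the outlier
        System.out.println("weighted mean = " + weightedMean(salaries, weights)); // equals the mean here
        System.out.println("median        = " + median(salaries));                // robust to the outlier
    }
}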
A holistic measure is a measure that must be computed on the entire data set as a whole. It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset. The median is an example of a holistic measure. Holistic measures are much more expensive to compute than distributive measures such as those listed above. We can, however, easily approximate the median value of a data set. Assume that data are grouped in intervals according to their xi values and that the frequency (i.e., the number of data values) of each interval is known. For example, people may be grouped according to their annual salary in intervals such as 10-20K, 20-30K, and so on. Let the interval that contains the median frequency be the median interval. We can approximate the median of the entire data set (e.g., the median salary) by interpolation, using the formula

median ≈ L1 + ((N/2 − (Σ freq)l) / freq_median) × width,

where L1 is the lower boundary of the median interval, N is the number of values in the entire data set, (Σ freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.

Another measure of central tendency is the mode. The mode for a set of data is the value that occurs most frequently in the set. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is multimodal. At the other extreme, if each data value occurs only once, then there is no mode. For unimodal frequency curves that are moderately skewed (asymmetrical), we have the following empirical relation:

mean − mode = 3 × (mean − median).

This implies that the mode for unimodal frequency curves that are moderately skewed can easily be computed if the mean and median values are known. In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all at the same center value, as shown in Figure 1.3(a). However, data in most real applications are not symmetric. They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median (Figure 1.3(b)), or negatively skewed, where the mode occurs at a value greater than the median (Figure 1.3(c)).

Figure 1.3. Mean, median and mode of symmetric versus positively and negatively skewed data [7].

The midrange can also be used to assess the central tendency of a data set. It is the average of the largest and smallest values in the set. This algebraic measure is easy to compute using the SQL aggregate functions max() and min() [7].

1.2. Data Mining Tools

There are different tools that allow one to do data mining. Here are some of them:
DMQL (Data Mining Query Language);
ERP (enterprise resource planning) systems, as defined in the APICS dictionary (American Production and Inventory Control Society) [10]; examples are ERP 2.5, HELIUM V, SAP R/3, Oracle e-Business Suite, PeopleSoft, JD Edwards, Lawson Financials, etc.;
the Weka application;
statistical applications, one of them being StatSoft.
A description of each tool follows.

HELIUM V is designed for use both by purely commercial enterprises and by individual, series and made-to-order producers. The frequently encountered hybrid corporate organisations are also supported. HELIUM V is currently being deployed in the following industries and sectors: metal processing; mechanical engineering; electronics; electrical engineering; plastics technology; food and cosmetics; retail; service providers; agencies; local authorities / town councils.

ERP systems support the entrepreneurial task of utilising the resources available in a company (capital assets, operating resources and work force) for its workflows as efficiently as possible, and thus of optimising the control of its business processes. HELIUM V is designed to be used by well over 100 users.
Mutually linked or intermeshing modules ensure that data are recorded only once in the system and are then made available for subsequent processing. The intermeshed modules mean that the knowledge of customers, suppliers and, above all, products (manufacturing processes as well as commodities) constantly grows and is transparently visualised. From this one can identify positive and negative deviations and intervene to control and correct the processes in the company [11].

1.3. Data Mining Techniques and Their Application

In addition to particular data mining tools, there is a variety of data mining techniques. The main techniques for data mining include [13]: artificial neural networks; decision trees; the nearest-neighbor method; classification; prediction; clustering; induction; statistical methods; outlier detection; tendency detection; association rules; sequence analysis; dependency analysis; time series analysis; text mining; data visualization; and new techniques such as social network analysis and sentiment analysis.

1.3.1. Classification trees

Classification trees are used to predict the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The goal of classification trees is to predict or explain responses on a categorical dependent variable. Classification trees are widely used in applied fields as diverse as medicine (diagnosis), computer science (data structures), botany (classification), and psychology (decision theory). Classification trees can be, and sometimes are, quite complex. However, graphical procedures can be developed to help simplify interpretation even for complex trees. Amenability to graphical display and ease of interpretation are perhaps partly responsible for the popularity of classification trees in applied fields, but the two features that characterize classification trees more generally are their hierarchical nature and their flexibility.

1.3.2. Text Mining

Text databases consist of huge collections of documents. They collect this information from several sources such as news articles, books, digital libraries, e-mail messages, web pages, etc. Due to the increase in the amount of information, text databases are growing rapidly. In many text databases the data are semi-structured. For example, a document may contain a few structured fields, such as title, author, publishing_date, etc., but along with the structured data the document also contains unstructured text components, such as the abstract and the contents. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users require tools to compare the documents and rank their importance and relevance. Therefore, text mining has become a popular and essential theme in data mining.

Information Retrieval. Information retrieval deals with the retrieval of information from a large number of text-based documents. Some database system techniques are not usually present in information retrieval systems, because the two handle different kinds of data. Examples of information retrieval systems include: online library catalogue systems; online document management systems; web search systems, etc. Note that the main problem in an information retrieval system is to locate relevant documents in a document collection based on a user's query. Such a user query consists of some keywords describing an information need.
In such search problems, the user takes the initiative to pull relevant information out of a collection. This is appropriate when the user has an ad-hoc information need, i.e., a short-term need. But if the user has a long-term information need, then the retrieval system can also take the initiative to push any newly arrived information item to the user. This kind of access to information is called information filtering, and the corresponding systems are known as filtering systems or recommender systems.

Basic Measures for Text Retrieval. We need to check the accuracy of a system when it retrieves a number of documents on the basis of a user's input. Let the set of documents relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant and retrieved can be denoted as {Relevant} ∩ {Retrieved}. This can be shown in the form of a Venn diagram, as in figure 1.24.

Figure 1.24. Venn diagram for the sets of documents [7].

There are three fundamental measures for assessing the quality of text retrieval: precision, recall, and F-score.

Precision. Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as

Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall. Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

F-score. The F-score is the commonly used trade-off measure: an information retrieval system often needs to trade precision for recall, or vice versa. The F-score is defined as the harmonic mean of recall and precision:

F-score = 2 × recall × precision / (recall + precision) [7].
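As a small illustration of these three measures (using made-up document identifiers d1...d5 rather than any real collection), the following Java sketch computes precision, recall and F-score for a hypothetical query:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RetrievalMeasures {
    public static void main(String[] args) {
        // Hypothetical sets of document identifiers
        Set<String> relevant  = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> retrieved = new HashSet<>(Arrays.asList("d2", "d3", "d5"));

        // {Relevant} ∩ {Retrieved}
        Set<String> relevantAndRetrieved = new HashSet<>(relevant);
        relevantAndRetrieved.retainAll(retrieved);

        double precision = (double) relevantAndRetrieved.size() / retrieved.size();
        double recall    = (double) relevantAndRetrieved.size() / relevant.size();
        double fScore    = 2 * precision * recall / (precision + recall);

        // Here: precision = 2/3, recall = 2/4, F-score ≈ 0.57
        System.out.printf("precision = %.2f, recall = %.2f, F-score = %.2f%n",
                precision, recall, fScore);
    }
}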
1.3.3. Other data mining techniques

Association - this method is used to extract structures that repeat over time and is particularly used for discovering rules according to which the presence of one data set correlates with elements of another set. It is frequently used to find a specific regularity across many transactions.

Sequence-based analysis allows highlighting regularities in transactions. For example, we can answer the question of which purchases precede the purchase of a certain type of product. This method is used in marketing, price flexibility management, etc.

Dependency analysis - algorithms that extract dependencies between elements or objects in data banks which cannot be recognized in advance. In this way the value of a data object can be predicted on the basis of others.

Clusterisation - combines sets of records that have similar features. This method can be used in market and provider segmentation, being combined with statistical models or neural networks. Clustering is often considered the first step in data analysis.

Classification - this family of algorithms groups data into classes, describing the characteristics of the records that belong to the same class. This method can be applied, for example, in credit risk evaluation.

Decision trees - use a set of rules for data classification. The method is fast and easier to understand than neural networks, but becomes complicated if there is a long list of rules. Decision trees are tree-shaped structures that represent decision sets. These decisions generate rules, which are then used to classify data. Decision trees are the favored technique for building understandable models. Auditors can use them to assess, for example, whether the organization is using an appropriate cost-effective marketing strategy that is based on the assigned value of the customer, such as profit.

Induction - the process of searching data sets and generating standard rules.

Statistical methods - may be applied to describe the curve that is closest to a set of data points.

Tendency discovery - these methods extract data tendencies or data abnormalities using different statistical methods, for example row sorting.

Text mining - while Data Mining is typically concerned with the detection of patterns in numeric data, very often important (e.g., business-critical) information is stored in the form of text. Unlike numeric data, text is often amorphous and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc., and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).

Data visualization - building graphs using colours and other visual cues. This helps general data analysis to reveal abnormalities, structures or tendencies [4].

Artificial neural networks are non-linear, predictive models that learn through training. Although they are powerful predictive modeling techniques, some of that power comes at the expense of ease of use and deployment. One area where auditors can easily use them is when reviewing records to identify fraud and fraud-like actions. Because of their complexity, they are better employed in situations where they can be used and reused, such as reviewing credit card transactions every month to check for anomalies.

The nearest-neighbor method classifies data set records based on similar data in a historical data set. Auditors can use this approach to define a document that is interesting to them and ask the system to search for similar items.

Each of these approaches brings different advantages and disadvantages that need to be considered prior to their use. Neural networks, which are difficult to implement, require all input and resultant output to be expressed numerically, thus needing some sort of interpretation depending on the nature of the data mining exercise. The decision tree technique is the most commonly used methodology, because it is simple and straightforward to implement. Finally, the nearest-neighbor method relies more on linking similar items and, therefore, works better for extrapolation than for predictive enquiries.

A good way to apply advanced data mining techniques is to have a flexible and interactive data mining tool that is fully integrated with a database or data warehouse. Using a tool that operates outside of the database or data warehouse is not as efficient: it involves extra steps to extract, import, and analyze the data. When a data mining tool is integrated with the data warehouse, it simplifies the application and implementation of mining results. Furthermore, as the warehouse grows with new decisions and results, the organization can mine best practices continually and apply them to future decisions.

Regardless of the technique used, the real value behind data mining is modeling - the process of building a model based on user-specified criteria from already captured data. Once a model is built, it can be used in similar situations where an answer is not known.
For example, an organization looking to acquire new customers can create a model of its ideal customer based on existing data captured from people who previously purchased the product. The model is then used to query data on prospective customers to see if they match the profile. Modeling can also be used in audit departments to predict the number of auditors required to undertake an audit plan, based on previous attempts and similar work. [5]

1.4. OLAP main notions

OLAP is an acronym for Online Analytical Processing; in other sources such systems are also called multi-dimensional information systems. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. It is the foundation for many kinds of business applications for Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, thereby providing the insight and understanding they need for better decision making [16].

OLAP concepts include the notion of multiple hierarchical dimensions and can be used by anyone to think more clearly about the world, whether it be the material world from the atomic scale to the galactic scale, the economic world from micro agents to macro economies, or the social world from interpersonal to international relationships. In other words, even without any kind of formal language, just being able to think in terms of a multi-dimensional, multi-level world is useful regardless of one's position in life. Good information needs to be existent, accurate, timely, and understandable [7].

Software products devoted to the operations of a business, built principally on top of large-scale database systems, have come to be known as On-Line Transaction Processing (OLTP) systems. The development path for OLTP software has followed a fairly straight line for the past 35 years: the goal has been to make systems handle larger amounts of data, process more transactions per unit time, and support larger numbers of concurrent users with ever-greater robustness. But this handling of data covers not only simple operational activities; it also includes more complex analysis. The difference between these two kinds of activity is shown in table 1.2.

Table 1.2. Difference between operational activities and analysis-based decision-oriented activities [1]

Operational activities                          Analysis-based decision-oriented activities
More frequent                                   Less frequent
More predictable                                Less predictable
Smaller amounts of data accessed per query      Larger amounts of data accessed per query
Query mostly raw data                           Query mostly derived data
Require mostly current data                     Require past, present and projected data
Few, if any, complex derivations                Many complex derivations

OLAP is a powerful analysis tool in [17]: forecasting; statistical computations, aggregations, etc.

1.5. OLAP and Data Mining Comparison

OLAP and data mining are used to solve different kinds of analytic problems. OLAP provides summary data and generates rich calculations. For example, OLAP answers questions like "How do sales of mutual funds in North America for this quarter compare with sales a year ago? What can we predict for sales next quarter? What is the trend as measured by percent change?" Data mining discovers hidden patterns in data. Data mining operates at a detail level instead of a summary level.
Data mining answers questions like "Who is likely to buy a mutual fund in the next six months, and what are the characteristics of these likely buyers?" [18].

Note that despite its name, analyses referred to as OLAP do not need to be performed truly "on-line" (or in real time); the term applies to analyses of multidimensional databases (which may, obviously, contain dynamically updated information) through efficient "multidimensional" queries that reference various types of data. OLAP facilities can be integrated into corporate (enterprise-wide) database systems, and they allow analysts and managers to monitor the performance of the business (e.g., various aspects of the manufacturing process, or the numbers and types of completed transactions at different locations) or of the market. The final result of OLAP techniques can be very simple (e.g., frequency tables, descriptive statistics, simple cross-tabulations) or more complex (e.g., they may involve seasonal adjustments, removal of outliers, and other forms of cleaning the data). Although Data Mining techniques can operate on any kind of unprocessed or even unstructured information, they can also be applied to the data views and summaries generated by OLAP to provide more in-depth and often more multidimensional knowledge. In this sense, Data Mining techniques can be considered to represent either a different analytic approach (serving different purposes than OLAP) or an analytic extension of OLAP [19].

The functions or algorithms typically found in OLAP tools (such as aggregation in its many forms, allocations, ratios, products, etc.) are descriptive modeling functions, whereas the functions found in any so-called data-mining package (such as regressions, neural nets, decision trees, and clustering) are pattern discovery or explanatory modeling functions. In addition to the fact that OLAP provides descriptive modeling functions while data mining provides explanatory modeling functions, OLAP also provides a sophisticated structuring, consisting of dimensions with hierarchies and cross-dimensional referencing, that is nowhere provided in a data-mining environment. A typical data-mining or statistics tool looks at the world in terms of variables and cases. The fact that many data miners do their work without using OLAP tools does not mean they are not using OLAP functions. On the contrary, all data miners do some OLAP work as part of their data exploration and preparation prior to running particular pattern detection algorithms. Simply, many data miners rely on the basic calculation capabilities provided either in the data-mining tool or in the back-end database [1].

1.6. Integration of OLAP and Data Mining

OLAP and data mining can complement each other. For example, OLAP might pinpoint problems with sales of mutual funds in a certain region. Data mining could then be used to gain insight about the behavior of individual customers in the region. Finally, after data mining predicts something like a 5% increase in sales, OLAP can be used to track the net income. Or, Data Mining might be used to identify the most important attributes concerning sales of mutual funds, and those attributes could be used to design the data model in OLAP [18].

II. VACANCIES' MARKET ANALYSIS WITH DATA MINING AND OLAP

2.1. Data Mining Query Language

The Data Mining Query Language (DMQL) is based on the Structured Query Language (SQL). Data mining query languages can be designed to support ad hoc and interactive data mining.
DMQL provides commands for specifying primitives. It can work with databases and data warehouses as well, and can be used to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.

Syntax for Task-Relevant Data Specification. Figure 2.1 shows the DMQL syntax for specifying task-relevant data.

use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list

Figure 2.1. Syntax for specifying task-relevant data.

Syntax for Specifying the Kind of Knowledge. The syntax for characterization, discrimination, association, classification, and prediction follows.

Characterization. The syntax for characterization is presented in figure 2.2.

mine characteristics [as pattern_name]
analyze {measure(s)}

Figure 2.2. Syntax for characterization.

The analyze clause specifies aggregate measures, such as count, sum, or count%. An example describing customer purchasing habits is shown in figure 2.3.

mine characteristics as customerPurchasing
analyze count%

Figure 2.3. Characterization with the aggregate measure count%.

Discrimination. The syntax for discrimination is presented in figure 2.4.

mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze {measure(s)}

Figure 2.4. Syntax for discrimination.

For example, a user may define big spenders as customers who purchase items that cost $100 or more on average, and budget spenders as customers who purchase items at less than $100 on average. The mining of discriminant descriptions for customers from each of these categories can be specified in DMQL as in figure 2.5.

mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count

Figure 2.5. Mining of discriminant descriptions for the two categories of customers.

Association. The syntax for association is written in figure 2.6.

mine associations [as pattern_name]
{matching metapattern}

Figure 2.6. Syntax for association.

An example is given in figure 2.7.

mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)

Figure 2.7. An association mining example. X is the key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.

Classification. The syntax for classification is written in figure 2.8.

mine classification [as pattern_name]
analyze classifying_attribute_or_dimension

Figure 2.8. Syntax for classification.

For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating, the task can be specified as in figure 2.9.

mine classification as classifyCustomerCreditRating
analyze credit_rating

Figure 2.9. Classification analyzed on the credit_rating attribute.

Prediction. The syntax for prediction is written in figure 2.10.

mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i = value_i}}

Figure 2.10. Syntax for prediction.

Syntax for Concept Hierarchy Specification. To specify concept hierarchies, use the syntax from figure 2.11.

use hierarchy <hierarchy> for <attribute_or_dimension>

Figure 2.11. Syntax for specifying concept hierarchies.

We use different syntaxes to define different types of hierarchies, as in figure 2.12.
- schema hierarchies:
define hierarchy time_hierarchy on date as [date, month, quarter, year]

- set-grouping hierarchies:
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior

- operation-derived hierarchies:
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)

- rule-based hierarchies:
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
    if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
    if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
    if (price - cost) > $250

Figure 2.12. Syntaxes for defining different types of hierarchies.

Syntax for Interestingness Measures Specification. Interestingness measures and thresholds can be specified by the user with the statement presented in figure 2.13.

with <interest_measure_name> threshold = threshold_value

Figure 2.13. Syntax for specifying interestingness measures and thresholds.

For example, as in figure 2.14:

with support threshold = 0.05
with confidence threshold = 0.7

Figure 2.14. Example of specifying interestingness measures and thresholds.

Syntax for Pattern Presentation and Visualization Specification. There is also a syntax which allows users to specify the display of discovered patterns in one or more forms (figure 2.14).

display as <result_form>

Figure 2.14. Syntax which allows users to specify the display of discovered patterns.

For example, as in figure 2.15:

display as table

Figure 2.15. Example of displaying the discovered patterns as a table. [12]

2.2. Database structure

The data warehouse can be either a relational database or a set of connected databases [20]. Connecting several, and especially different, databases is a more complex procedure. Since the goal of the thesis is to compare two techniques, we can use a simpler data warehouse - a single big relational database. Its structure is shown in figure 2.16.

Figure 2.16. The structure of the database for the market of work vacancies.

2.3. The interests of companies that search for workers and of individuals who search for work

A company may: introduce data about itself (its title, contact address, phone, email, locality, what vacant positions it has, what the requirements and responsibilities are, the work schedule), and also view registered people who have a diploma in the area it needs and live in the locality where it searches for a worker, or who have no certification, for simple operational functions.

An individual may: introduce data about himself/herself (name, surname, professional training, date of birth, locality, phone, email); find out how many free jobs there are in a certain locality, in a certain field and at a certain date; search for a job according to his/her preferences; and find out what was the most popular profession in a certain locality/district for the previous year.

2.4. Realising data mining in WEKA

The application chosen to deliver data mining is WEKA, because of its advantages:
Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling;
- it provides the most important functions for data mining: preprocessing, classification, clustering, association, attribute selection and visualization (figure 2.17);
- graphical user interfaces give easy access to these functions;
- it is freely available under the GNU General Public License;
- it is portable, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform;
- it offers a comprehensive collection of data preprocessing and modeling techniques;
- there are tutorials on how to mine data with Weka on YouTube.

Figure 2.17. The Weka Explorer interface.

Users with different levels of skill can use the application (figure 2.18). The graphical user interface allows even a non-programmer to work with the application and to mine data.

A newer data mining function in Weka is attribute selection. This process is separated into two parts:
Attribute Evaluator: the method by which attribute subsets are assessed;
Search Method: the method by which the space of possible subsets is searched.

Figure 2.18. Weka GUI Chooser.

Jason Brownlee suggests three clever ways of using attribute selection in Weka.

1. Explore attribute selection. When just starting out with attribute selection, he recommends experimenting with a few of the methods in the Weka Explorer: load the dataset, click the "Select attributes" tab (figure 2.19), try out different Attribute Evaluators and Search Methods, and review the results in the output window.

Figure 2.19. Feature selection methods in the Weka Explorer.

The idea is to build up an intuition for 1) how many and 2) which attributes are selected for the problem. This information can be used in either or both of the next steps.

2. Prepare data with attribute selection. The next step is to use attribute selection as part of data preparation. There is a filter (figure 2.20) that can be applied when preprocessing the dataset: it runs an attribute selection scheme and then trims the dataset to only the selected attributes. The filter is called "AttributeSelection" and is found under the unsupervised attribute filters.

Figure 2.20. Creating transforms of a dataset using feature selection methods in Weka.

The transformed dataset can then be saved for use in experiments when spot-checking algorithms.

3. Run algorithms with attribute selection. Finally, attribute selection can be incorporated into the algorithm directly. There is a meta-algorithm (figure 2.21) that can be run and included in experiments; it selects attributes before running the base algorithm. The algorithm is called "AttributeSelectedClassifier" and is found under the "meta" group of algorithms. It can be configured with the classifier of choice as well as the desired Attribute Evaluator and Search Method.

Figure 2.21. Coupling a classifier and attribute selection in a meta-algorithm in Weka.

Multiple versions of this meta-algorithm, configured with different variations of the attribute selection scheme, can be included in an experiment and compared to each other, as in the sketch following figure 2.21.
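For illustration, a minimal sketch of driving the AttributeSelectedClassifier programmatically through the standard Weka Java API is given below; the file name vacancies.arff and the choice of CfsSubsetEval, BestFirst and J48 are assumptions made only for this example, not settings taken from the thesis.

```java
import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSelectionDemo {
    public static void main(String[] args) throws Exception {
        // Load a (hypothetical) vacancies data set; the last attribute is the class.
        Instances data = DataSource.read("vacancies.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Couple attribute selection with a classifier, as in the "meta" group of Weka.
        AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
        classifier.setEvaluator(new CfsSubsetEval());  // attribute evaluator
        classifier.setSearch(new BestFirst());         // search method
        classifier.setClassifier(new J48());           // base learner (a decision tree)

        // 10-fold cross-validation to see how the combination performs.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

The same combination can be configured entirely in the Explorer or Experimenter GUI; the code form is convenient when the experiment has to be repeated on several datasets or configurations.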
2.4.1. Classification trees. Specifying the criteria for predictive accuracy

An operational definition of accurate prediction is hard to come by. To solve the problem of defining predictive accuracy, the problem is "stood on its head," and the most accurate prediction is operationally defined as the prediction with the minimum costs. The term costs need not seem mystifying: in many typical applications, costs simply correspond to the proportion of misclassified cases. The notion of costs was developed as a way to generalize, to a broader range of prediction situations, the idea that the best prediction has the lowest misclassification rate.

The need to minimize costs, rather than just the proportion of misclassified cases, arises when some failed predictions are more catastrophic than others, or when some failed predictions occur more frequently than others. The costs to a gambler of losing a single bet (or prediction) on which the gambler's whole fortune is at stake are greater than the costs of losing many bets (or predictions) on which only a tiny part of the fortune is at stake. Conversely, the costs of losing many small bets can be larger than the costs of losing just a few bigger bets. Proportionately more effort should therefore be spent on minimizing losses on the bets where losing (making a prediction error) costs more.

Minimizing costs does, however, correspond to minimizing the proportion of misclassified cases when priors are taken to be proportional to the class sizes and when misclassification costs are taken to be equal for every class. Priors are addressed first.

Priors, or a priori probabilities, specify how likely it is, without using any knowledge of the values of the predictor variables in the model, that a case or object will fall into one of the classes. For example, in an educational study of high-school drop-outs, there may be, overall, fewer drop-outs than students who stay in school (i.e., the base rates differ); thus, the a priori probability that a student drops out is lower than the probability that a student remains in school. The a priori probabilities used in minimizing costs can greatly affect the classification of cases or objects. If the differential base rates are not of interest for the study, or if it is known that there are about equal numbers of cases in each class, equal priors should be used. If the differential base rates are reflected in the class sizes (as they would be if the sample is a probability sample), then priors estimated from the class proportions of the sample should be used. Finally, if there is specific knowledge about the base rates (for example, from previous research), priors can be specified in accordance with that knowledge. For example, the a priori probabilities for carriers of a recessive gene could be specified as twice as high as for individuals who display a disorder caused by the recessive gene. The general point is that the relative sizes of the priors assigned to each class can be used to "adjust" the importance of misclassifications for each class. Minimizing costs corresponds to minimizing the overall proportion of misclassified cases when priors are taken to be proportional to the class sizes (and misclassification costs are taken to be equal for every class), because prediction should then be better in the larger classes in order to produce an overall lower misclassification rate.
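To make the claim above explicit, the expected misclassification cost can be written in the standard CART notation of Breiman et al. (1984); the symbols below are a summary added here for illustration and are not used elsewhere in the thesis.

```latex
% \pi_j   : prior probability of class j
% C(i|j)  : cost of classifying a class-j case into class i (with C(j|j) = 0)
% Q(i|j)  : probability that a class-j case is classified into class i
R = \sum_{j} \pi_j \sum_{i \ne j} C(i \mid j)\, Q(i \mid j)
% With unit costs, C(i|j) = 1 for i \ne j, and priors proportional to the class sizes,
% R reduces to the overall proportion of misclassified cases.
```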
Misclassification costs. Sometimes more accurate classification is desired for some classes than for others, for reasons unrelated to the relative class sizes. Regardless of their relative frequency, carriers of a disease who are contagious might need to be predicted more accurately than carriers who are not contagious. If it is expected that little is lost by avoiding a non-contagious person but much is lost by not avoiding a contagious person, higher misclassification costs can be specified for misclassifying a contagious carrier as non-contagious than for misclassifying a non-contagious person as contagious. To reiterate, minimizing costs corresponds to minimizing the proportion of misclassified cases only when priors are taken to be proportional to the class sizes and misclassification costs are taken to be equal for every class.

Case weights. A little less conceptually, the use of case weights on a weighting variable, as case multipliers for aggregated data sets, is also related to the issue of minimizing costs. Interestingly, as an alternative to using case weights for aggregated data sets, one can specify appropriate priors and/or misclassification costs and obtain the same results, while avoiding the additional processing required to analyze multiple cases with the same values for all variables. Suppose that in an aggregated data set with two classes of equal size, there are case weights of 2 for all cases in the first class and case weights of 3 for all cases in the second class. If priors of .4 and .6 respectively are specified, misclassification costs are set equal, and the data are analyzed without case weights, the same misclassification rates are obtained as when priors are estimated from the class sizes, misclassification costs are equal, and the aggregated data set is analyzed with the case weights. The same misclassification rates are also obtained if the priors are set equal and the cost of misclassifying class 1 cases as class 2 cases is specified to be 2/3 of the cost of misclassifying class 2 cases as class 1 cases, again analyzing the data without case weights (the arithmetic behind this equivalence is shown at the end of this subsection).

The relationships between priors, misclassification costs, and case weights become quite complex in all but the simplest situations (for discussions, see Breiman et al., 1984; Ripley, 1996). In analyses where minimizing costs corresponds to minimizing the misclassification rate, however, these issues need not cause any concern. Priors, misclassification costs, and case weights are brought up here to illustrate the wide variety of prediction situations that can be handled by the concept of minimizing costs, as compared to the rather limited (but probably typical) prediction situations that can be handled by the narrower (but simpler) idea of minimizing misclassification rates. Furthermore, minimizing costs is an underlying goal of classification tree analysis, and it is explicitly addressed in the fourth and final basic step of the analysis, where, in trying to select the "right-sized" tree, the tree with the minimum estimated costs is chosen. Depending on the type of prediction problem, understanding the idea of reducing estimated costs may be important for understanding the results of the analysis.
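For completeness, here is the simple arithmetic behind the case-weight example above, written out under the assumption of equal class counts n in the aggregated data set.

```latex
% Equal class counts n, case weights w_1 = 2 (class 1) and w_2 = 3 (class 2):
\pi_1 = \frac{w_1 n}{w_1 n + w_2 n} = \frac{2}{5} = 0.4 , \qquad
\pi_2 = \frac{w_2 n}{w_1 n + w_2 n} = \frac{3}{5} = 0.6
% Equivalently, with equal priors the same weighting is absorbed into the cost ratio:
\frac{C(2 \mid 1)}{C(1 \mid 2)} = \frac{w_1}{w_2} = \frac{2}{3}
```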
2.4.2. Classification trees. Selecting splits

The second basic step in classification tree analysis is to select the splits on the predictor variables that are used to predict membership in the classes of the dependent variable for the cases or objects in the analysis. Not surprisingly, given the hierarchical nature of classification trees, these splits are selected one at a time, starting with the split at the root node and continuing with splits of the resulting child nodes until splitting stops; the child nodes that have not been split become terminal nodes. Three split selection methods are discussed here.

Discriminant-based univariate splits. When the discriminant-based univariate splits option is chosen, the first step in split selection is to determine the best terminal node to split in the current tree and the predictor variable to use for the split. For each terminal node, p-values are computed for tests of the significance of the relationship of class membership with the levels of each predictor variable. For categorical predictors, the p-values come from chi-square tests of independence between the classes and the levels of the categorical predictor present at the node. For ordered predictors, the p-values come from ANOVAs of the relationship of the classes to the values of the ordered predictor present at the node. If the smallest computed p-value is smaller than the default Bonferroni-adjusted p-value for multiple comparisons of .05 (a different threshold value can be used), the predictor variable producing that smallest p-value is chosen to split the corresponding node. If no p-value smaller than the threshold is found, p-values are computed for statistical tests that are robust to distributional violations, such as Levene's F. Details concerning node and predictor variable selection when no p-value is smaller than the specified threshold are described in Loh and Shih (1997).

The next step is to determine the split. For ordered predictors, the 2-means clustering algorithm of Hartigan and Wong (1979) is applied to create two "superclasses" for the node. The two roots of a quadratic equation describing the difference in the means of the "superclasses" on the ordered predictor are found, and the values for a split corresponding to each root are computed; the split closest to a "superclass" mean is selected. For categorical predictors, dummy-coded variables representing the levels of the categorical predictor are constructed, and singular value decomposition methods are applied to transform the dummy-coded variables into a set of non-redundant ordered predictors. The procedures for ordered predictors are then applied, and the obtained split is "mapped back" onto the original levels of the categorical variable and represented as a contrast between two sets of levels. Again, further details about these procedures are described in Loh and Shih (1997). Although complicated, these procedures reduce a bias in split selection that occurs when the C&RT-style exhaustive search method is used: the bias toward selecting variables with more levels, which can skew the interpretation of the relative importance of the predictors in explaining responses on the dependent variable (Breiman et al., 1984).
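As a reminder added here (not part of the thesis text), the Bonferroni adjustment in its standard form simply divides the significance level by the number m of candidate predictor variables examined at the node; the exact adjustment used by QUEST (Loh and Shih, 1997) may differ in detail.

```latex
% Split only if the smallest p-value survives the Bonferroni correction for m predictors:
p_{\min} < \frac{\alpha}{m} , \qquad \alpha = 0.05 \text{ by default}
% equivalently, the adjusted p-value m \cdot p_{\min} must stay below \alpha.
```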
Discriminant-based linear combination splits. The second split selection method is the discriminant-based linear combination split option for ordered predictor variables (the predictors are assumed to be measured on at least an interval scale). Surprisingly, this method works by treating the continuous predictors from which linear combinations are formed in a manner similar to the way categorical predictors are treated in the previous method. Singular value decomposition methods are used to transform the continuous predictors into a new set of non-redundant predictors. The procedures for creating "superclasses" and finding the split closest to a "superclass" mean are then applied, and the results are "mapped back" onto the original continuous predictors and represented as a univariate split on a linear combination of predictor variables.

C&RT-style exhaustive search for univariate splits. The third split-selection method is the C&RT-style exhaustive search for univariate splits, for categorical or ordered predictor variables. With this method, all possible splits for each predictor variable at each node are examined to find the split producing the largest improvement in goodness of fit (or, equivalently, the largest reduction in lack of fit). What determines the domain of possible splits at a node? For a categorical predictor with k levels present at a node, there are 2^(k-1) - 1 possible contrasts between two sets of levels of the predictor. For an ordered predictor with k distinct values present at a node, there are k - 1 midpoints between distinct values. For example, a categorical predictor with k = 4 levels yields 2^3 - 1 = 7 candidate contrasts, while an ordered predictor with four distinct values yields only 3 candidate midpoints. It can thus be seen that the number of possible splits to be examined can become very large when there are many predictors with many levels that must be examined at many nodes.

2.4.3. Classification trees. Determining when to stop splitting

The third step in classification tree analysis is to determine when to stop splitting. One characteristic of classification trees is that, if no limit is placed on the number of splits, eventually "pure" classification will be achieved, with each terminal node containing only one class of cases or objects. However, "pure" classification is usually unrealistic. Even a simple classification tree such as a coin sorter can produce impure classifications for coins whose sizes are distorted or if wear changes the lengths of the slots cut in the track. This could potentially be remedied by further sorting of the coins that fall into each slot, but to be practical, at some point the sorting would have to stop and the coins would have to be accepted as reasonably well sorted. Likewise, if the observed classifications on the dependent variable or the levels of the predictor variables in a classification tree analysis are measured with error or contain "noise," it is unrealistic to continue sorting until every terminal node is "pure." Two options for controlling when splitting stops are discussed here; they are linked to the choice of the stopping rule specified for the analysis.

Minimum n. One option for controlling when splitting stops is to allow splitting to continue until all terminal nodes are pure or contain no more than a specified minimum number of cases or objects. The desired minimum number of cases is specified as the Minimum n, and splitting stops when all terminal nodes containing more than one class have no more than the specified number of cases or objects.
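For illustration only, the same idea is available in Weka's J48 decision tree (a C4.5-style learner, not the QUEST or C&RT trees described in this section) through its minimum-instances-per-leaf option; the file name and the parameter value below are assumptions made for the example.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MinimumNDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical vacancies data set; the last attribute is the class.
        Instances data = DataSource.read("vacancies.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setMinNumObj(20);   // stop splitting nodes with fewer than 20 instances
        tree.setUnpruned(true);  // rely on the stopping rule alone, no post-pruning
        tree.buildClassifier(data);
        System.out.println(tree); // prints the grown tree
    }
}
```

Raising the minimum tends to produce smaller, more stable trees, at the price of possibly missing finer splits.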
Fraction of objects. Another option for controlling when splitting stops is to allow splitting to continue until all terminal nodes are pure or contain no more cases than a specified minimum fraction of the sizes of one or more classes. The desired minimum fraction is specified as the Fraction of objects; if the priors used in the analysis are equal and the class sizes are equal, splitting stops when all terminal nodes containing more than one class have no more cases than the specified fraction of the class sizes for one or more classes. If the priors used in the analysis are not equal, splitting stops when all terminal nodes containing more than one class have no more cases than the specified fraction for one or more classes.

2.4.4. Classification trees. Selecting the "right-sized" tree

Example. After a night at the horse track, a studious gambler computes a huge classification tree with numerous splits that perfectly accounts for the win, place, show, and no-show results of every horse in every race. Expecting to become rich, the gambler takes a copy of the tree graph to the races the next night, sorts the horses racing that night using the classification tree, makes his or her predictions, places his or her bets, and leaves the race track much less rich than expected. The poor gambler has foolishly assumed that a classification tree computed from a learning sample in which the outcomes are already known will perform equally well in predicting outcomes in a second, independent test sample. The gambler's classification tree performed poorly during cross-validation. The gambler's payoff might have been larger with a smaller classification tree that did not classify perfectly in the learning sample but could be expected to predict equally well in the test sample.

Some generalizations can be offered about what constitutes the "right-sized" classification tree. It should be sufficiently complex to account for the known facts, but at the same time as simple as possible. It should exploit information that increases predictive accuracy and ignore information that does not. It should, if possible, lead to greater understanding of the phenomena it describes. Of course, these same characteristics apply to any scientific theory, so we must try to be more specific about what constitutes the "right-sized" classification tree. One strategy is to grow the tree to just the right size, where the right size is determined by the user from knowledge from previous research, diagnostic information from previous analyses, or even intuition. The other strategy is to use the set of well-documented, structured procedures developed by Breiman et al. (1984) for selecting the "right-sized" tree. These procedures are not foolproof, as Breiman et al. (1984) readily acknowledge, but at least they take the subjective judgment out of the process of selecting the "right-sized" tree.

FACT-style direct stopping. We begin by describing the first strategy, in which the researcher specifies the size to which the classification tree is grown. This strategy is followed by using FACT-style direct stopping as the stopping rule for the analysis and by specifying the Fraction of objects, which allows the tree to grow to the desired size. There are several options for obtaining diagnostic information to determine the reasonableness of the chosen tree size. Three options for performing cross-validation of the selected classification tree are discussed below.

Test sample cross-validation. The first, and most preferred, type of cross-validation is test sample cross-validation.
In this type of cross-validation, the classification tree is computed from the learning sample, and its predictive accuracy is tested by applying it to predict class membership in the test sample. If the costs for the test sample exceed the costs for the learning sample (remember, costs equal the proportion of misclassified cases when priors are estimated and misclassification costs are equal), this indicates poor cross-validation, and a tree of a different size might cross-validate better. The test and learning samples can be formed by collecting two independent data sets or, if a large learning sample is available, by reserving a randomly selected proportion of the cases (say, a third or a half) for use as the test sample.

V-fold cross-validation. This type of cross-validation is useful when no test sample is available and the learning sample is too small to have a test sample taken from it. The specified value V for V-fold cross-validation determines the number of random subsamples, as equal in size as possible, that are formed from the learning sample. A classification tree of the specified size is computed V times, each time leaving out one of the subsamples and using it as a test sample for cross-validation, so that each subsample is used V - 1 times in the learning sample and just once as the test sample. The CV costs computed for each of the V test samples are then averaged to give the V-fold estimate of the CV costs.

Global cross-validation. In global cross-validation, the entire analysis is replicated a specified number of times, holding out a fraction of the learning sample equal to 1 over the specified number of times, and using each hold-out sample in turn as a test sample to cross-validate the selected classification tree. This type of cross-validation is probably no more useful than V-fold cross-validation when FACT-style direct stopping is used, but it can be quite useful as a method-validation procedure when automatic tree selection techniques are used (for discussion, see Breiman et al., 1984).
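As a small, self-contained illustration (not code from the thesis), test-sample and V-fold cross-validation can both be run through the standard Weka Java API; the file name vacancies.arff, the two-thirds hold-out split, and the use of J48 are assumptions made only for this sketch.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vacancies.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Test-sample cross-validation: hold out one third of the cases.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);
        Evaluation holdout = new Evaluation(train);
        holdout.evaluateModel(tree, test);
        System.out.println("Hold-out error rate: " + holdout.errorRate());

        // V-fold cross-validation with V = 10.
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("10-fold CV error rate: " + cv.errorRate());
    }
}
```

Comparing the hold-out error with the cross-validated error gives a first indication of whether the tree is over-fitted to the learning sample.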
This brings us to the second of the two strategies for selecting the "right-sized" tree: an automatic tree selection method based on a technique developed by Breiman et al. (1984) called minimal cost-complexity cross-validation pruning.

Minimal cost-complexity cross-validation pruning. Two methods of pruning can be used, depending on the chosen stopping rule. Minimal cost-complexity cross-validation pruning is performed when we decide to prune on misclassification error (as the stopping rule), and minimal deviance-complexity cross-validation pruning is performed when we choose to prune on deviance. The only difference between the two options is the measure of prediction error that is used. Pruning on misclassification error uses the costs discussed repeatedly above (which equal the misclassification rate when priors are estimated and misclassification costs are equal). Pruning on deviance uses a measure, based on maximum-likelihood principles, called the deviance (see Ripley, 1996). We focus here on cost-complexity cross-validation pruning (as originated by Breiman et al., 1984), since deviance-complexity pruning merely involves a different measure of prediction error.

The costs needed to perform cost-complexity pruning are computed as the tree is being grown, starting with the split at the root node, up to its maximum size as determined by the specified Minimum n. The learning-sample costs are computed as each split is added to the tree, so that a sequence of generally decreasing costs (reflecting better classification) is obtained, corresponding to the number of splits in the tree. The learning-sample costs are called resubstitution costs, to distinguish them from the CV costs, because V-fold cross-validation is also performed as each split is added to the tree; the estimated CV costs from V-fold cross-validation are used as the CV costs for the root node. Note that tree size can be taken to be the number of terminal nodes, because for binary trees the tree size starts at one (the root node) and increases by one with each added split.

Now a parameter called the complexity parameter is defined, with an initial value of zero, and for every tree (including the first, containing only the root node) the value of a function defined as the costs for the tree plus the complexity parameter times the tree size is computed. The complexity parameter is increased continuously until the value of the function for the largest tree exceeds the value of the function for a smaller-sized tree. The smaller-sized tree is then taken to be the new largest tree, the complexity parameter is again increased continuously until the value of the function for the largest tree exceeds the value of the function for a smaller-sized tree, and the process continues until the root node is the largest tree. (Those familiar with numerical analysis will recognize the use of a penalty function in this algorithm: the function is a linear combination of the costs, which generally decrease with tree size, and the tree size, which increases linearly. As the complexity parameter increases, larger trees are penalized for their complexity more and more, until a discrete threshold is reached at which the smaller-sized tree's higher costs are outweighed by the largest tree's higher complexity.)

The sequence of largest trees obtained by this algorithm has a number of interesting properties. The trees are nested, because successively pruned trees contain all the nodes of the next smaller tree in the sequence. Initially, many nodes are often pruned going from one tree to the next smaller tree in the sequence, but fewer nodes tend to be pruned as the root node is approached. The sequence of largest trees is also optimally pruned, because for every tree size in the sequence there is no other tree of the same size with lower costs. Proofs and/or explanations of these properties can be found in Breiman et al. (1984).

Tree selection after pruning. We now select the "right-sized" tree from the sequence of optimally pruned trees. A natural criterion is the CV costs. While there is nothing wrong with choosing the tree with the minimum CV costs as the "right-sized" tree, there will often be several trees with CV costs close to the minimum. Breiman et al. (1984) make the reasonable suggestion to choose as the "right-sized" tree the smallest (least complex) tree whose CV costs do not differ appreciably from the minimum CV costs. They proposed a "1 SE rule" for making this selection: choose as the "right-sized" tree the smallest tree whose CV costs do not exceed the minimum CV costs plus one standard error of the CV costs for the minimum-CV-costs tree. One distinct advantage of this "automatic" tree selection procedure is that it helps to avoid "overfitting" and "underfitting" of the data.
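The pruning criterion and the selection rule just described can be summarized compactly in the notation of Breiman et al. (1984); the symbols below are introduced here only as a summary of the verbal description above.

```latex
% Cost-complexity of a tree T: resubstitution cost R(T) plus the complexity
% parameter \alpha times the number of terminal nodes |\tilde{T}|.
R_{\alpha}(T) = R(T) + \alpha \, |\tilde{T}|
% 1 SE rule: among the optimally pruned trees T_1, T_2, \ldots choose the smallest T_k with
R^{CV}(T_k) \le \min_{j} R^{CV}(T_j) + \mathrm{SE}\big( R^{CV}(T_{j^{*}}) \big)
% where T_{j^{*}} is the tree attaining the minimum cross-validated cost.
```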
The graph in figure 2.22 shows a typical plot of the resubstitution costs and CV costs for the sequence of successively pruned trees.

Figure 2.22. Cost sequence for PRICE.

As shown in this graph, the resubstitution costs (i.e., the misclassification rate in the learning sample) decrease fairly consistently as tree size increases. The CV costs, on the other hand, approach their minimum quickly as tree size initially increases, but actually begin to rise as tree size becomes very large. Note that the selected "right-sized" tree is close to the inflection point of the curve, that is, close to the point where the initial sharp drop in CV costs with increasing tree size starts to level out. The "automatic" tree selection procedure is designed to select the simplest (smallest) tree with close to minimum CV costs, and thereby to avoid the loss in predictive accuracy produced by "underfitting" or "overfitting" the data (note the similarity to the logic underlying the use of a "scree plot" to determine the number of factors to retain in factor analysis; see also the discussion of reviewing the results of a principal components analysis).

As has been seen, minimal cost-complexity cross-validation pruning and the subsequent "right-sized" tree selection form a truly "automatic" process. The algorithms make all the decisions leading to the selection of the "right-sized" tree, except, perhaps, for the specification of a value for the SE rule. One issue that arises with such "automatic" procedures is how well the results replicate, where replication might involve the selection of trees of quite different sizes across replications. This is where global cross-validation can be very useful. As explained previously, in global cross-validation the entire analysis is replicated a specified number of times (3 is the default), holding out a fraction of the cases to use as a test sample for cross-validating the selected classification tree. If the average of the costs for the test samples, called the global CV costs, exceeds the CV costs for the selected tree, or if the standard error of the global CV costs exceeds the standard error of the CV costs for the selected tree, this indicates that the "automatic" tree selection procedure is allowing too much variability in tree selection rather than consistently selecting a tree with minimum estimated costs.

Classification trees and traditional methods. As can be seen from the methods used to compute classification trees, classification trees differ in a number of respects from traditional statistical methods for predicting class membership on a categorical dependent variable. They employ a hierarchy of predictions, with many predictions sometimes being applied to particular cases, to sort the cases into predicted classes, whereas traditional methods use simultaneous techniques to make one and only one class membership prediction for each case. In other respects, such as having accurate prediction as the goal, classification tree analysis is indistinguishable from traditional methods. Time will tell whether classification tree analysis has enough to commend itself to become as accepted as the traditional methods.

The distinction between the discriminant analysis and classification tree decision processes can perhaps be made most clear by considering how each analysis would be performed in regression.
Because risk in the example of Breiman et al. (1984) is a dichotomous dependent variable, the discriminant analysis predictions could be reproduced by a simultaneous multiple regression of risk on the three predictor variables for all patients. The classification tree predictions could be reproduced only by three separate simple regression analyses: risk is first regressed on P for all patients, then risk is regressed on A for the patients not classified as low risk in the first regression, and finally risk is regressed on T for the patients not classified as low risk in the second regression. This clearly illustrates the simultaneous nature of discriminant analysis decisions as compared to the recursive, hierarchical nature of classification tree decisions, a characteristic of classification trees with far-reaching implications.

Another distinctive characteristic of classification trees is their flexibility. The ability of classification trees to examine the effects of the predictor variables one at a time, rather than all at once, has already been described, but there are a number of other ways in which classification trees are more flexible than traditional analyses. The ability to perform univariate splits, examining the effects of predictors one at a time, also has implications for the variety of types of predictors that can be analyzed [14].

2.5. Realising OLAP functions in FastCube

FastCube enables us to analyze data and to build summary tables (data slices), as well as to create a variety of reports and graphs, easily and instantly. It is a handy tool for the efficient analysis of data arrays (figure 2.23). The advantages of this application are the following:
- FastCube components can be built into the interface of host applications;
- FastCube end users do not require high programming skills to build reports;
- FastCube is a set of OLAP desktop components for Delphi/C++Builder/Lazarus;
- connection to databases is possible not only through the standard ADO or BDE components but also through any component based on TDataSet;
- instant loading and handling of data arrays;
- ready-made templates can be built for summary tables, and users can be prohibited from modifying the schema;
- all of FastCube's settings can be accessed both programmatically and by the end user;
- its data can be saved in a compact format for data exchange and data storage.

Figure 2.23. The FastCube 2 application.

III. PRACTICAL APPLICATIONS DESCRIPTION

3.1. Data preprocessing

To perform any procedure in Weka, the data file must first be converted into the *.arff format, as this is the data format "understood" by the application. One way to obtain this format is to export the *.xls file as "CSV (MS-DOS) *.csv", open it in WordPad, and simply change the extension; a programmatic alternative is sketched below. The result must look as in figure 3.1.

Figure 3.1. Converting the data file into an ARFF file.
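As an alternative to the manual conversion described above, Weka's own converters can perform the same step; this is only an illustrative sketch, and the file names vacancies.csv and vacancies.arff are assumptions, not the files used in the thesis.

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read the exported CSV file (hypothetical name).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("vacancies.csv"));
        Instances data = loader.getDataSet();

        // Write it back out in ARFF, the native Weka format.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("vacancies.arff"));
        saver.writeBatch();
    }
}
```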
It was found that, at this stage, 20 people had been hired, while another 50 vacancies were still free. Of these 20, 10 are from the domain of health, 2 from construction, 2 from education, 2 from transport, and one each from engineering and management (figure 3.2). The same counts could be obtained in Excel, but Weka computes them automatically. Other values describing the data set can also be calculated, such as those in figure 3.3, which shows the maximum and minimum salary for this area of work.

Figure 3.2. Number of vacancies filled, by area.

Figure 3.3. Numerical description of the salaries of the people hired.

Since most of those hired were doctors, let us look at this area and at why and how these people were hired. One of the tasks in analysing the data was to consider the profile of the people hired. One factor is the salary, which ranges lower than the general level in this locality: from 161 to 273 euro per month. Another factor is that there were people with the required education in the locality where these vacancies were located.

3.2. Data classification

By sex, 4 men and 5 women were hired for jobs in the health area (figure 3.4). Women are the majority, but they are also the majority of the population, so this does not show that women are preferred as doctors in the different areas.

Figure 3.4. People who were hired as doctors.

In order to answer the question "How long will it take to find a specific specialist?", classification is used as well. The query is written as code rather than in the visual application (not everything can be done visually), and the result is a list of the numbers of days in which a person might find a job, with the probability of the prediction indicated in the same row (see figure 3.5). The steps to predict these values are to build a decision tree and to choose the functions shown in figure 3.6. Note that no error is indicated, because this is a prediction and the error can be calculated only after the event has happened.

Figure 3.5. The resulting estimates of how many days are needed to find a job.

It can be seen that several cases are analysed and that the result shows several possible outcomes, each with its corresponding probability. It took considerable time to write the sequence of instructions by hand and to find which instructions, in which order, are necessary to predict these values; the knowledge and skills of a programmer are required from whoever works with this data mining function. A minimal sketch of how such a prediction can be obtained through the Weka API is given after figure 3.6.

Figure 3.6. Instructions needed to view the results of the tree.
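The sketch below is not the code used in the thesis; it only illustrates, under the assumption of a hypothetical vacancies.arff file whose last attribute is the class "days to find a job" discretized into intervals, how a J48 tree can output a class prediction together with the probability of each outcome, as in figure 3.5.

```java
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictDaysToHire {
    public static void main(String[] args) throws Exception {
        // Hypothetical training data: last attribute = discretized "days to find a job".
        Instances data = DataSource.read("vacancies.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Predict the class of one instance and report the probability of each outcome.
        Instance candidate = data.instance(0);
        double predicted = tree.classifyInstance(candidate);
        double[] distribution = tree.distributionForInstance(candidate);

        System.out.println("Predicted interval: "
                + data.classAttribute().value((int) predicted));
        for (int i = 0; i < distribution.length; i++) {
            System.out.printf("%s: %.2f%n",
                    data.classAttribute().value(i), distribution[i]);
        }
    }
}
```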
3.3. Multidimensional data analysis with OLAP

In order to analyze a data cube, the cube must first be built: the FastCube 2.0 application needs a ready cube to work with. Building a data cube is possible through SQL Server Analysis Services; when installing it, the author found that it works inside Microsoft Visual Studio. With SQL Server installed in advance, it is necessary to indicate the source server to which the application connects in order to take the source database for the cube. The imported database is shown in figure 3.7. It was then necessary to create a data source and a data source view in the right-hand menu of the application (figure 3.8), from which the work of creating the data cube continues.

Figure 3.7. Database imported into Microsoft Visual Studio to build a data cube.

Figure 3.8. The menu in which to work.

When working with Microsoft Visual Studio and creating a cube for the first time, it is important not to confuse what is shown in the middle of the application with a view. A view is a collection of correlated tables that together form multidimensional correlated data, so there can be several views in one database. Following the steps in the tutorials from Microsoft Developer Videos, the first data cube is shown in figure 3.9.

Figure 3.9. The first data cube designed in this thesis.

The second data cube was built in the same way, in wizard mode (figures 3.10 and 3.11).

Figure 3.10. Using wizard mode to build a cube.

Figure 3.11. Vacancies' characteristics cube.

When the cube is built, the next step is to deploy it, using the command Deploy <title of multidimensional project> in the BUILD menu. After that, the data can be viewed in diagrams such as figures 3.12 and 3.13.

Figure 3.12. The medicine job demand tendency.

Figure 3.13. The IT salary tendency.

It can be seen that OLAP has the advantage of showing data by several criteria at the same time. Figure 3.13 shows three dimensions: time, salary, and the percentile range of the salaries that companies offer.

CONCLUSIONS

OLAP and data mining tools are used to solve somewhat different analytic problems. Data mining discovers hidden patterns in data and operates at a detail level instead of a summary level; it answers questions such as "How long will it take to find a specific specialist?". The application used for data preprocessing and data classification was Weka version 3.6, which is free. By data preprocessing it was determined, for the job vacancies market, how many people out of the total number were hired, in which areas most of them were hired, and what the range of their salaries was. By classification it was found what the gender distribution of these people was and how many days it might take a person to find a job in the future.

The best solution for enterprises that need to collect and analyze a large amount of data is to buy a computer with at least 3 GB of RAM and to hire one or two programmers with a background in data mining, because part of the functionality must be written as code, and skills and knowledge in the area are absolutely necessary.

An OLAP tool provides summary data and generates rich calculations. OLAP was applied in Microsoft Visual Studio, which uses SQL Server. For example, OLAP answers questions like "How many men and women were hired for a certain job (in a certain locality, for a certain period)?". Multidimensional data analysis has the advantage of showing data by several criteria at the same time, that is, of representing several dimensions, and also of allowing the criteria to be changed instantly as needed.

OLAP and data mining can complement each other. For example, OLAP might pinpoint problems with job vacancies in a certain region; data mining could then be used to gain insight into the behavior of individual potential employees in that region. After data mining predicts something like a 5% increase in job vacancies, OLAP can be used to track the hirings. Or data mining might be used to identify the most important attributes concerning job vacancies, and those attributes could be used to design the data model in OLAP. While searching for applications to experiment with, the author of the thesis found many applications that include both techniques. The more ways there are to analyze data, the better it is processed.

Our country has a big potential to develop data mining because, although it is small, there are many individual enterprises and many people searching for work or for workers. If more agencies used a centralized data system with all the potential functions for data mining and OLAP, more citizens would benefit. This is not the only reason why people go abroad for work, but it is one of the causes that might be eliminated. Courses at universities and seminars for teachers are necessary to train specialists in this area and to contribute to the development of this field in our country.
For a future pool of research in Moldova, one task is to find out how data mining works with different types of data, such as combined structured and unstructured databases, hypertext and text mining, and how to apply the notion of granularity when analyzing data from different sources.

BIBLIOGRAPHY

1. THOMSEN, E. OLAP Solutions: Building Multidimensional Information Systems. 2nd ed. New York: John Wiley & Sons, Inc., 2002. 661 p.
2. ILEANĂ, I., ROTAR, C., MUNTEAN, M. Inteligenţa artificială. Alba Iulia: Aeternitas, 2009. 298 p.
3. Data Mining [online]. Available on the Internet: <http://documents.software.dell.com/statistics/textbook/data-mining-techniques#mining> (visited: 9.12.2015)
4. MARINOVA, N. Instrumentele data mining – parte componenta a procesului de descoperire a cunostintelor. In: Economica. 2005, nr. 2(50). ISSN 1810-9136.
5. Data Mining 101: Tools and Techniques [online]. Available on the Internet: https://iaonline.theiia.org/data-mining-101-tools-and-techniques (visited: 8.12.2015)
6. FRAND, J. Data Mining: What is Data Mining? [online]. Available on the Internet: http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm (visited: 10.12.2015)
7. HAN, J., KAMBER, M. Data Mining: Concepts and Techniques. San Francisco: Elsevier, 2006. 743 p.
8. Data warehousing [online]. Available on the Internet: http://documents.software.dell.com/statistics/textbook/data-mining-techniques#warehousing (visited: 9.12.2015)
9. What is Data Mining (Predictive Analytics, Big Data) [online]. Available on the Internet: http://www.statsoft.co.za/textbook/data-mining-techniques/ (visited: 10.12.2015)
10. КОЧ, К. Что такое ERP. In: CIO. 2001, 15 November, перевод Даулета Тынбаева [online]. Available on the Internet: http://www.erp-online.ru/erp/ (visited: 10.12.2015)
11. Open Source ERP Software [online]. Available on the Internet: http://www.heliumv.org/en/opensource-Industry_Solution-3.html (visited: 10.12.2015)
12. Data Mining – Query Language [online]. Available on the Internet: http://www.tutorialspoint.com/data_mining/dm_query_language.htm (visited: 9.12.2015)
13. ZHAO, Y. R and Data Mining: Examples and Case Studies. Amsterdam: Elsevier, 2013. 156 p.
14. Computational Methods [online]. Available on the Internet: http://www.statsoft.com/Textbook/Classification-Trees#computation (visited: 10.12.2015)
15. Neural Networks [online]. Available on the Internet: http://documents.software.dell.com/statistics/textbook/data-mining-techniques#neural (visited: 10.12.2015)
16. How is OLAP Technology Used? [online]. Available on the Internet: http://olap.com/olap-definition/ (visited: 10.12.2015)
17. BELLAACHIA, A. Data Warehousing and OLAP Technology [online]. Available on the Internet: http://www.seas.gwu.edu/~bell/csci243/lectures/data_warehousing.pdf (visited: 9.12.2015)
18. OLAP and Data Mining [online]. Available on the Internet: http://docs.oracle.com/cd/B28359_01/server.111/b28313/bi.htm (visited: 8.12.2015)
19. On-Line Analytic Processing (OLAP) [online]. Available on the Internet: http://documents.software.dell.com/statistics/textbook/data-mining-techniques#olap (visited: 10.12.2015)
20. INMON, W. H. Building the Data Warehouse. 3rd ed.
21. A Conceptual Model for Combining Enhanced OLAP and Data Mining Systems [online]. Available on the Internet: https://www.researchgate.net/publication/221522065 (visited: 10.12.2015)
22. Attribute-Relation File Format (ARFF) [online]. Available on the Internet: http://www.cs.waikato.ac.nz/ml/weka/arff.html (visited: 5.02.2016)