Ministerul Educaţiei al Republicii Moldova
Universitatea de Stat “Alecu Russo” din Bălţi
Facultatea de Ştiinţe Reale, Economice şi ale Mediului
Catedra de Matematică şi Informatică
Teză de masterat la tema:
"METODE AVANSATE DE ANALIZĂ A DATELOR DATA MINING ŞI OLAP PENTRU
PIAŢA LOCURILOR DE MUNCĂ"
A efectuat: Tatiana Mihailov,
Studentă în grupa MITT11M
Specialitatea „Management Innovaţional şi
Transfer Technologic”
Conducător ştiinţific:
dr., sup. lect., Corina Negara
Bălţi – 2016
Republic of Moldova Ministry of Education
Balti State University “Alecu Russo”
Faculty of Real, Economic and Environmental Sciences
Department of Mathematics and Informatics
Master Thesis on the topic:
DATA MINING AND OLAP ADVANCED DATA ANALYSIS METHODS FOR VACANCIES'
MARKET
Realized by: Tatiana Mihailov,
Student of MITT21M group
Speciality „Innovational Management and
Technological Transfer”
Scientific coordinator:
dr., sup. lect., Corina Negara
Bălţi – 2016
Contents
I. ADVANCED DATA ANALYSIS AND PROCESSING
1.1. Data Mining main notions
1.1.1. Data mining stages
1.1.2. Data mining process
1.1.2.1. Data Preprocessing
1.2. Data Mining Tools
1.3. Data Mining Techniques and Their Application
1.3.1. Classification trees
1.3.2. Text Mining
1.3.3. Other data mining techniques
1.4. OLAP main notions
1.5. OLAP and Data Mining Comparison
1.6. Integration of OLAP and Data Mining
II. VACANCIES’ MARKET ANALYSIS WITH DATA MINING AND OLAP
2.1. Data Mining Query Language
2.2. Database structure
2.3. The interests of companies that search for workers and of individuals who search for work
2.4. Realising data mining in WEKA
2.4.1. Classification trees. Specifying the Criteria for Predictive Accuracy
2.4.2. Classification trees. Selecting Splits
2.4.3. Classification trees. Determining When to Stop Splitting
2.4.4. Classification trees. Selecting the "Right-Sized" Tree
2.5. Realising OLAP functions in FastCube
III. PRACTICAL APPLICATIONS DESCRIPTION
3.1. Data preprocessing
3.2. Data classification
3.3. Multidimensional Data Analysis OLAP
CONCLUSIONS
BIBLIOGRAPHY
Introduction
The relevance of the research. More and more information is being produced, largely due to the internet. But is all that data truly information? Not always, because only data that is understood, or that brings some new value to the receiver, can be called information. Much data is collected, and the problem that arises is extracting information back out of this huge amount of data. Search engines and web crawlers help to find almost anything on the internet, but the databases collected by companies also grow continuously. Agencies and institutions need to process databases from several companies in order to find the answers to certain questions. Enormous amounts of data need to be mined in order to analyze their content and make decisions that will influence the actions and strategies of entrepreneurs.
The cornerstone of all business activities (and any other intentional activities for that matter)
is information processing. This includes data collection, storage, transportation, manipulation, and
retrieval (with or without the aid of computers). Good information about world events helps
financial traders make better trading decisions, directly resulting in better profits for the trading
firm. This is very valuable. Major trading firms invest heavily in information technologies. Good
traders are handsomely rewarded[1].
If hundreds of unemployed people come to an employment agency and many of them leave disappointed because, for a week or a month, they did not find anything that fits their profile, then it is time for the agency to start rethinking the way it delivers its services and how to improve the system. In such a situation Data Mining and OLAP techniques can be very useful. A better service leads to a higher rate of employment, and the higher the rate of employment an agency achieves, the more institutions, enterprises and individuals approach that agency when workers are needed.
At a higher level of cooperation between agencies and people who cannot work under someone else (potential entrepreneurs), managers and analysts may ask higher-level analytical questions, such as which products or services have been the most popular in the town this year, what is needed but not yet offered, or whether the same group of products was the most profitable last year. The answers to these types of questions represent information that is both analysis-based and decision-oriented. Decision-oriented software activities are more complex, but the good news is that data mining and OLAP are the solution [1].
The problem of this research is that data mining has been known in the world for many years, but in Moldova, a small country with good potential in programming, this area is still poorly developed. This can be concluded from the small number of bibliographical sources by Moldovan authors on this topic. Even if some students study it, it is still unknown whether it is applied anywhere.
The goal of this thesis is to analyze the data mining and OLAP technologies and to suggest solutions based on data mining, on OLAP, and on data mining combined with OLAP.
The objectives of the thesis are the following:
 Critical analysis of the literature on the topic;
 Analysis of the way Data Mining may be used;
 Analysis of the way OLAP may be used;
 Making a set of recommendations about which technique is the best in which situation.
The short description of the thesis by chapters:
The introduction formulates the relevance of the topic, the problem of this research, the goal and objectives of the research, and what each chapter presents.
Chapter I, entitled „Advanced Data Analysis and Processing”, presents an objective study of the literature on the topic, namely the relevant information about the Data Mining and OLAP terms and the tools used by each of these methods.
Chapter II, entitled "Vacancies’ Market Analysis with Data Mining and OLAP", describes how the studied concepts can be applied in order to compare the two tools.
Chapter III presents the work that was done in order to see how both techniques can be applied to make the work of labour agencies more effective.
The thesis is written on 56 pages and contains 22 bibliographical sources, 60 figures and 2 tables.
I. ADVANCED DATA ANALYSIS AND PROCESSING
1.1. Data Mining main notions
Data mining is the process of discovering interesting knowledge in large amounts of data. It is an interdisciplinary field with contributions from many areas, such as statistics, artificial intelligence, large databases, machine learning, information retrieval, pattern recognition and bioinformatics. Artificial intelligence is based on heuristics and tries to use methods similar to human thinking in order to solve statistical problems [2]. Data mining is widely used in many domains, such as retail, finance, telecommunications and social media. The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one that has the most direct business applications [3].
The term Data Mining takes its name from two notions: (1) the search for valuable information in large databases (data) and (2) digging in mines (mining). Both processes require either sifting through a huge amount of raw material or a rational search and examination to find the required values. The term Data Mining is often taken to mean extraction of information, excavation, intellectual analysis of data, a means of finding patterns, knowledge extraction, analysis of patterns, "extracting the seeds of knowledge from mountains of data", knowledge excavation from databases, informational sifting of data, data "cleansing". The expression Knowledge Discovery in Databases (KDD) can be considered a synonym of Data Mining. The definition of Data Mining appeared in 1978 and has gained high popularity in its modern interpretation since approximately the first half of the 1990s. Until then, data processing and analysis were carried out within the framework of applied statistics, which mostly dealt with the processing of small-scale databases.
The term data mining has developed together with the progress of database system technology (fig. 1.1.).
Data mining as a process means processing based on efficient patterns of data selection and aggregation from data warehouses [4].
In its simplest form, data mining automates the detection of relevant patterns in a database,
using defined approaches and algorithms to look into current and historical data that can then be
analyzed to predict future trends. Because data mining tools predict future trends and behaviors by
reading through databases for hidden patterns, they allow organizations to make proactive,
knowledge-driven decisions and answer questions that were previously too time-consuming to
resolve [5].
Generally, data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful information - information
that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of
analytical tools for analyzing data. It allows users to analyze data from many different dimensions
or angles, categorize it, and summarize the relationships identified. Technically, data mining is the
process of finding correlations or patterns among dozens of fields in large relational databases. [6]
Figure 1.1. The evolution of database system technology [7].
Table 1.1 gives a short description of some of the disciplines at whose intersection Data Mining technology appeared. Each of the fields that formed Data Mining has its own characteristics. Let us compare some of them.
Table 1.1. Comparison of statistics, machine learning and Data Mining
Statistics: is based on theory more than Data Mining is; concentrates mainly on testing hypotheses.
Machine Learning: is more heuristic; focuses on improving the performance of learning agents.
Data Mining: integrates theory and heuristics; focuses on a unified process of data analysis, which includes data cleaning, learning, integration and visualisation of the results.
In real world applications, a data mining process can be broken into six major phases: business
understanding, data understanding, data preparation, modeling, evaluation and deployment, as
defined by the CRISP-DM (Cross Industry Standard Process for Data Mining).
In order to perform data mining, a data warehouse or a large database is necessary. The most efficient data warehousing architecture will be capable of incorporating, or at least referencing, all data available in the relevant enterprise-wide information management systems, using designated technology suitable for corporate database management (e.g., Oracle, Sybase, MS SQL Server). Also, a flexible, high-performance (see the IDP technology), open-architecture approach to data warehousing - one that flexibly integrates with the existing corporate systems and allows the users to organize and efficiently reference, for analytic purposes, enterprise repositories of data of practically any complexity - is offered in StatSoft enterprise systems such as STATISTICA Enterprise and STATISTICA Enterprise/QC, which can also work in conjunction with STATISTICA Data Miner and STATISTICA Enterprise Server [8].
1.1.1. Data mining stages
The process of data mining consists of three stages: (1) the initial exploration, (2) model
building or pattern identification with validation/verification, and (3) deployment (i.e., the
application of the model to new data in order to generate predictions).
Stage 1: Exploration. This stage usually starts with data preparation which may involve
cleaning data, data transformations, selecting subsets of records and - in case of data sets with large
numbers of variables ("fields") - performing some preliminary feature selection operations to bring
the number of variables to a manageable range (depending on the statistical methods which are
being considered). Then, depending on the nature of the analytic problem, this first stage of the
process of data mining may involve anywhere between a simple choice of straightforward
predictors for a regression model, to elaborate exploratory analyses using a wide variety of
graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most
relevant variables and determine the complexity and/or the general nature of models that can be
taken into account in the next stage.
Stage 2: Model building and validation. This stage involves considering various models and
choosing the best one based on their predictive performance (i.e., explaining the variability in
question and producing stable results across samples). This may sound like a simple operation, but
in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed
to achieve that goal - many of which are based on so-called "competitive evaluation of models,"
that is, applying different models to the same data set and then comparing their performance to
choose the best. These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
Stage 3: Deployment. That final stage involves using the model selected as best in the previous
stage and applying it to new data in order to generate predictions or estimates of the expected
outcome [9].
1.1.2. Data mining process
Knowledge discovery as a process is depicted in Figure 1.2 and consists of an iterative sequence
of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are
used to present the mined knowledge to the user)
Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base. Note that
according to this view, data mining is only one step in the entire process, albeit an essential one
because it uncovers hidden patterns for evaluation.
We agree that data mining is a step in the knowledge discovery process. However, in industry, in the media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data [7].
Figure 1.2. Data mining as a step in the knowledge discovery process [7].
1.1.2.1. Data Preprocessing
Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data
due to their typically huge size (often several gigabytes or more) and their likely origin from
multiple, heterogenous sources. Low-quality data will lead to low-quality mining results. “How can
the data be preprocessed in order to help improve the quality of the data and, consequently, of the
mining results? How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?” There are a number of data preprocessing techniques. Data cleaning can be
applied to remove noise and correct inconsistencies in the data. Data integration merges data from
multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as
normalization, may be applied. For example, normalization may improve the accuracy and
efficiency of mining algorithms involving distance measurements. Data reduction can reduce the
data size by aggregating, eliminating redundant features, or clustering, for instance. These
techniques are not mutually exclusive; they may work together. For example, data cleaning can
involve transformations to correct wrong data, such as by transforming all entries for a date field to
a common format. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining. A foundation for data preprocessing is descriptive data summarization, which helps us study the general characteristics of the data and identify the presence of noise or outliers; this is useful for successful data cleaning and data integration. The methods
for data preprocessing are organized into the following categories: data cleaning, data integration
and transformation, and data reduction. Concept hierarchies can be used in an alternative form
of data reduction where we replace low-level data (such as raw values for age) with higher-level
concepts (such as youth, middle-aged, or senior). The automatic generation of concept hierarchies
from categorical data is also described.
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world
databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of
interest may not always be available, such as customer information for sales transaction data. Other
data may not be included simply because it was not considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding, or because of equipment
malfunctions. Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the history or modifications to the data may have been overlooked.
Missing data, particularly for tuples with missing values for some attributes, may need to be
inferred. There are many possible reasons for noisy data (having incorrect attribute values). The
data collection instruments used may be faulty. There may have been human or computer errors
occurring at data entry. Errors in data transmission can also occur. There may be technology
limitations, such as limited buffer size for coordinating synchronized data transfer and
consumption. Incorrect data may also result from inconsistencies in naming conventions or data
codes used, or inconsistent formats for input fields, such as date. Duplicate tuples also require data
cleaning. [7].
Data cleaning routines work to “clean” the data by filling in missing values, smoothing
noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data
are dirty, they are unlikely to trust the results of any data mining that has been applied to it.
Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable
output. Although most mining routines have some procedures for dealing with incomplete or noisy
data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to
the function being modeled. Therefore, a useful preprocessing step is to run your data through some
data cleaning routines. You would like to include data from multiple sources in your analysis. This
would involve integrating multiple databases, data cubes, or files, that is, data integration. Yet some
attributes representing a given concept may have different names in different databases, causing
inconsistencies and redundancies. For example, the attribute for customer identification may be
referred to as customer id in one data store and cust id in another. Naming inconsistencies may also
occur for attribute values. For example, the same first name could be registered as “Bill” in one
database, but “William” in another, and “B.” in the third. Furthermore, you suspect that some
attributes may be inferred from others (e.g., annual revenue). Having a large amount of redundant
data may slow down or confuse the knowledge discovery process. Clearly, in addition to data
cleaning, steps must be taken to help avoid redundancies during data integration. Typically, data
cleaning and data integration are performed as a preprocessing step when preparing the data for a
data warehouse. Additional data cleaning can be performed to detect and remove redundancies that
may have resulted from data integration. Getting back to your data, you have decided, say, that you
would like to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering. Such methods provide better results if the data to be
analyzed have been normalized, that is, scaled to a specific range such as [0.0, 1.0]. Your customer
data, for example, contain the attributes age and annual salary. The annual salary attribute usually
takes much larger values than age. Therefore, if the attributes are left unnormalized, the distance
measurements taken on annual salary will generally outweigh distance measurements taken on age.
Furthermore, it would be useful for your analysis to obtain aggregate information as to the sales per
customer region—something that is not part of any precomputed data cube in your data warehouse.
You soon realize that data transformation operations, such as normalization and aggregation, are
additional data preprocessing procedures that would contribute toward the success of the mining
process.
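As a small illustration of the normalization step described above (a sketch only, not code from the thesis; the attribute names and values are assumptions), the following Java snippet rescales the age and annual salary columns of a tiny customer table to the range [0.0, 1.0] using min-max normalization:

// Minimal min-max normalization sketch: rescales each column of a small
// numeric data set to [0.0, 1.0], as discussed above. The attribute names
// and values are illustrative, not taken from the thesis data.
public class MinMaxNormalization {

    // Returns a new matrix in which every column is scaled to [0.0, 1.0].
    static double[][] normalize(double[][] data) {
        int rows = data.length, cols = data[0].length;
        double[][] scaled = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (int r = 0; r < rows; r++) {          // find the column range
                min = Math.min(min, data[r][c]);
                max = Math.max(max, data[r][c]);
            }
            for (int r = 0; r < rows; r++) {          // rescale the column
                scaled[r][c] = (max == min) ? 0.0 : (data[r][c] - min) / (max - min);
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Columns: age, annual salary (illustrative values only).
        double[][] customers = { {23, 18000}, {45, 52000}, {61, 31000} };
        for (double[] row : normalize(customers)) {
            System.out.printf("age=%.2f salary=%.2f%n", row[0], row[1]);
        }
    }
}

After such rescaling, both attributes contribute comparably to any distance measurement.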
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results. There are a number of
strategies for data reduction. These include data aggregation (e.g., building a datacube), attribute
subset selection (e.g., removing irrelevant attributes through correlation analysis), dimensionality
reduction (e.g., using encoding schemes such as minimum length encoding or wavelets), and
numerosity reduction (e.g., “replacing” the data by alternative, smaller representations such as
clusters or parametric models). Data can also be “reduced” by generalization with the use of
concept hierarchies, where low-level concepts, such as city for customer location, are replaced with
higher-level concepts, such as region or province or state. A concept hierarchy organizes the
concepts into varying levels of abstraction. Data discretization is a form of data reduction that is
very useful for the automatic generation of concept hierarchies from numerical data. This is
described in Section 2.6, along with the automatic generation of concept hierarchies for categorical
data. Note that the above categorization is not mutually exclusive. For example, the removal of
redundant data may be seen as a form of data cleaning, as well as data reduction. In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can
improve the quality of the data, thereby helping to improve the accuracy and efficiency of the
subsequent mining process. Data preprocessing is an important step in the knowledge discovery
process, because quality decisions must be based on quality data. Detecting data anomalies,
rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision
making.
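To make the idea of generalization with a concept hierarchy concrete, the short Java sketch below (an illustration only; the age cut-off points 30 and 60 are assumptions chosen for the example, not values defined in the thesis) replaces raw age values with the higher-level concepts youth, middle_aged and senior:

// Illustrative concept-hierarchy generalization: raw ages are replaced by
// higher-level concepts. The cut-off points (30 and 60) are assumptions.
import java.util.Arrays;

public class AgeHierarchy {

    static String generalize(int age) {
        if (age < 30) return "youth";
        if (age < 60) return "middle_aged";
        return "senior";
    }

    public static void main(String[] args) {
        int[] rawAges = {22, 37, 45, 63, 29};
        String[] concepts = Arrays.stream(rawAges)
                                  .mapToObj(AgeHierarchy::generalize)
                                  .toArray(String[]::new);
        System.out.println(Arrays.toString(concepts));
        // Prints: [youth, middle_aged, middle_aged, senior, youth]
    }
}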
Descriptive Data Summarization. For data preprocessing to be successful, it is essential to
have an overall picture of your data. Descriptive data summarization techniques can be used to
identify the typical properties of your data and highlight which data values should be treated as
noise or outliers. Thus, we first introduce the basic concepts of descriptive data summarization be
foregetting into the concrete workings of data preprocessing techniques. For many data
preprocessing tasks, users would like to learn about data characteristics regarding both central
tendency and dispersion of the data. Measures of central tendency include mean, median, mode, and
midrange, while measures of data dispersion include quartiles, interquartile range (IQR), and
variance. These descriptive statistics are of great help in understanding the distribution of the data.
Such measures have been studied extensively in the statistical literature. From the data mining point
of view, we need to examine how they can be computed efficiently in large databases. In particular,
it is necessary to introduce the notions of distributive measure, algebraic measure, and holistic
measure. Knowing what kind of measure we are dealing with can help us choose an efficient
implementation for it.
Measuring the Central Tendency. Various ways to measure the central tendency of data.
The most common and most effective numerical measure of the “center” of a set of data is the
(arithmetic) mean. Let x_1, x_2, ..., x_N be a set of N values or observations, such as for some attribute, like salary. The mean of this set of values is
mean = (x_1 + x_2 + ... + x_N) / N
This corresponds to the built-in aggregate function, average (avg() in SQL), provided in
relational database systems. A distributive measure is a measure (i.e., function) that can be
computed for a given data set by partitioning the data into smaller subsets, computing the measure
for each subset, and then merging the results in order to arrive at the measure’s value for the
original (entire) data set. Both sum() and count() are distributive measures because they can be
computed in this manner. Other examples include max() and min(). An algebraic measure is a
measure that can be computed by applying an algebraic function to one or more distributive
measures. Hence, average (or mean()) is an algebraic measure because it can be computed by
sum()/count(). When computing data cubes, sum() and count() are typically saved in precomputation. Thus, the derivation of average for data cubes is straightforward. Sometimes, each value x_i in a set may be associated with a weight w_i, for i = 1, ..., N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute
weighted mean = (w_1·x_1 + w_2·x_2 + ... + w_N·x_N) / (w_1 + w_2 + ... + w_N)
This is called the weighted arithmetic mean or the weighted average. Note that the weighted
average is another example of an algebraic measure. Although the mean is the single most useful
quantity for describing a dataset, it is not always the best way of measuring the center of the data. A
major problem with the mean is its sensitivity to extreme (e.g., outlier) values. Even a small number
of extreme values can corrupt the mean. For example, the mean salary at a company may be
substantially pushed up by that of a few highly paid managers. Similarly, the average score of a
class in an exam could be pulled down quite a bit by a few very low scores. To offset the effect
caused by a small number of extreme values, we can instead use the trimmed mean, which is the
mean obtained after chopping off values at the high and low extremes. For example, we can sort the
values observed for salary and remove the top and bottom 2% before computing the mean. We
should avoid trimming too large a portion (such as 20%) at both ends as this can result in the loss of
valuable information. For skewed (asymmetric) data, a better measure of the center of data is the
median. Suppose that a given data set of N distinct values is sorted in numerical order. If N is odd,
then the median is the middle value of the ordered set; otherwise (i.e., if N is even), the median is
the average of the middle two values.
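The central tendency measures discussed above can be sketched in a few lines of Java (an illustration only; the salary and weight values are invented for the example):

// Illustrative computation of the mean, weighted mean, trimmed mean and median.
import java.util.Arrays;

public class CentralTendency {

    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
    static double weightedMean(double[] x, double[] w) {
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) { num += w[i] * x[i]; den += w[i]; }
        return num / den;
    }

    // Mean after chopping off a fraction p of the values at each extreme,
    // e.g. p = 0.02 removes the top and bottom 2%.
    static double trimmedMean(double[] x, double p) {
        double[] s = x.clone();
        Arrays.sort(s);
        int cut = (int) (s.length * p);
        return mean(Arrays.copyOfRange(s, cut, s.length - cut));
    }

    // Middle value for odd N, average of the two middle values for even N.
    static double median(double[] x) {
        double[] s = x.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        // Hypothetical salaries; the single large value shows how an outlier
        // pulls the mean up while the median stays stable.
        double[] salary = {300, 320, 350, 360, 400, 410, 5000};
        double[] weight = {1, 1, 2, 2, 1, 1, 1};
        System.out.println("mean          = " + mean(salary));
        System.out.println("weighted mean = " + weightedMean(salary, weight));
        System.out.println("trimmed mean  = " + trimmedMean(salary, 0.15));
        System.out.println("median        = " + median(salary));
    }
}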
A holistic measure is a measure that must be computed on the entire data set as a whole. It
cannot be computed by partitioning the given data into subsets and merging the values obtained for
the measure in each subset. The median is an example of a holistic measure. Holistic measures are
much more expensive to compute than distributive measures such as those listed above. We can,
however, easily approximate the median value of a dataset. Assume that data are grouped in
intervals according to their x i data values and that the frequency (i.e. number of data values) of
each interval is known. For example, people may be grouped according to their annual salary in
intervals such as 10–20K, 20–30K, and so on. Let the interval that contains the median frequency be the median interval. We can approximate the median of the entire data set (e.g., the median salary) by interpolation using the formula:
median ≈ L_1 + ((N/2 − (Σ freq)_l) / freq_median) × width
where L_1 is the lower boundary of the median interval, N is the number of values in the entire dataset, (Σ freq)_l is the sum of the frequencies of all of the intervals that are lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.
Figure 1.3. Mean, median and mode of symmetric versus positively and negatively skewed data [7].
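A minimal Java sketch of this interpolation (the salary intervals and frequencies are assumptions made up for the example, not data from the thesis) could look as follows:

// Sketch of the grouped-data median approximation described above.
public class GroupedMedian {

    // lower[i] is the lower boundary of interval i, width is the common
    // interval width, freq[i] is the number of values falling in interval i.
    static double approximateMedian(double[] lower, double width, int[] freq) {
        int n = 0;
        for (int f : freq) n += f;                 // N: total number of values
        int cumBelow = 0;                          // (sum freq)_l
        for (int i = 0; i < freq.length; i++) {
            if (cumBelow + freq[i] >= n / 2.0) {   // interval i holds the median
                return lower[i] + ((n / 2.0 - cumBelow) / freq[i]) * width;
            }
            cumBelow += freq[i];
        }
        return Double.NaN;                         // unreachable for non-empty data
    }

    public static void main(String[] args) {
        // Salary intervals 10-20K, 20-30K, 30-40K, 40-50K with assumed frequencies.
        double[] lower = {10_000, 20_000, 30_000, 40_000};
        int[] freq = {200, 450, 300, 50};
        System.out.println(approximateMedian(lower, 10_000, freq));
    }
}

With these assumed frequencies the sketch prints approximately 26666.7, i.e. an estimated median salary of roughly 26.7K.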
Another measure of central tendency is the mode. The mode for a set of data is the value that
occurs most frequently in the set. It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode. Data sets with one, two, or three modes are
respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes
is multimodal. At the other extreme, if each data value occurs only once, then there is no mode. For
unimodal frequency curves that are moderately skewed (asymmetrical), we have the following
empirical relation:
mean−mode = 3× (mean−median).
This implies that the mode for unimodal frequency curves that are moderately skewed can
easily be computed if the mean and median values are known. In a unimodal frequency curve with
perfect symmetric data distribution, the mean, median, and mode are all at the same center value, as
shown in Figure 1.3(a). However, data in most real applications are not symmetric. They may
instead be either positively skewed, where the mode occurs at a value that is smaller than the
median (Figure 1.3(b)), or negatively skewed, where the mode occurs at a value greater than the
median (Figure 1.3(c)). The midrange can also be used to assess the central tendency of a data set. It
is the average of the largest and smallest values in the set. This algebraic measure is easy to
compute using the SQL aggregate functions, max() and min() [7].
1.2. Data Mining Tools
There are different tools that make data mining possible. Here are some of them:
 DMQL (data mining query language);
 ERP (enterprise resource planning) systems, according to the APICS dictionary (American Production and Inventory Control Society) [10]. There are different ones, such as ERP 2.5 or HELIUM V, SAP R/3, Oracle e-Business Suite, PeopleSoft, JD Edwards, Lawson Financials, etc.;
 the Weka application;
 statistical applications, such as StatSoft STATISTICA.
A description of some of these tools follows.
HELIUM V is designed for use by both purely commercial enterprises as well as by individual,
series and made-to-order producers. The frequently encountered hybrid corporate organisations are
also supported. HELIUM V is currently being deployed in the following industries and sectors:
 Metal processing;
 Mechanical engineering;
 Electronics;
 Electrical engineering;
 Plastics technology;
 Food and cosmetics;
 Retail;
 Service providers;
 Agencies;
 Local authorities / town councils.
ERP systems support the entrepreneurial task of utilising the resources available in a company
(capital assets, operating resources and work force) for its workflows as efficiently as possible and
thus to optimise the controlling of its business processes.
HELIUM V is designed to be used for well over 100 users. Mutually linked or intermeshing
modules ensure that the data is only recorded once in the system and is then made available for
subsequent processing. The intermeshed modules mean that your knowledge of your customers,
suppliers and above all your products (manufacturing processes as well as commodities) constantly
grows and is transparently visualised. From this you can identify positive and negative deviations
and intervene to control and correct the processes in your company [11].
1.3. Data Mining Techniques and Their Application
In addition to particular data mining tools, there is a variety of data mining techniques. The
main techniques for data mining include [13]:
 artificial neural networks;
 decision trees;
 the nearest-neighbor method;
 classification;
 prediction;
 clustering;
 induction;
 statistical methods;
 outlier detection;
 tendency detection;
 association rules;
 sequence analysis;
 dependency analysis;
 time series analysis;
 text mining;
 data visualization;
 new techniques such as social network analysis and sentiment analysis.
1.3.1. Classification trees
Classification trees are used to predict membership of cases or objects in the classes of a
categorical dependent variable from their measurements on one or more predictor variables. The
goal of classification trees is to predict or explain responses on a categorical dependent variable.
Classification trees are widely used in applied fields as diverse as medicine (diagnosis), computer
science (data structures), botany (classification), and psychology (decision theory). Classification
trees can be and sometimes are quite complex. However, graphical procedures can be developed to
help simplify interpretation even for complex trees. Amenability to graphical display and ease of
interpretation are perhaps partly responsible for the popularity of classification trees in applied
fields, but two features that characterize classification trees more generally are their hierarchical
nature and their flexibility.
1.3.2. Text Mining
Text databases consist of huge collections of documents. They collect this information from several sources such as news articles, books, digital libraries, e-mail messages, web pages, etc. Due to the increase in the amount of information, text databases are growing rapidly. In many of the text
databases, the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author,
publishing_date, etc. But along with the structured data, the document also contains unstructured text
components, such as abstract and contents. Without knowing what could be in the documents, it is
difficult to formulate effective queries for analyzing and extracting useful information from the
data. Users require tools to compare the documents and rank their importance and relevance.
Therefore, text mining has become popular and an essential theme in data mining.
Information Retrieval. Information retrieval deals with the retrieval of information from a large
number of text-based documents. Some of the features of database systems are not usually present in information retrieval systems, because the two handle different kinds of data. Examples of information retrieval systems include:
 Online Library catalogue system;
 Online Document Management Systems;
 Web Search Systems etc.
Note − The main problem in an information retrieval system is to locate relevant documents
in a document collection based on a user's query. This kind of user's query consists of some
keywords describing an information need.
In such search problems, the user takes an initiative to pull relevant information out from a
collection. This is appropriate when the user has ad-hoc information need, i.e., a short-term need.
But if the user has a long-term information need, then the retrieval system can also take an initiative
to push any newly arrived information item to the user.
This kind of access to information is called Information Filtering. And the corresponding
systems are known as Filtering Systems or Recommender Systems.
Basic Measures for Text Retrieval. We need to check the accuracy of a system when it retrieves
a number of documents on the basis of user's input. Let the set of documents relevant to a query be
denoted as {Relevant} and the set of retrieved document as {Retrieved}. The set of documents that
are relevant and retrieved can be denoted as {Relevant} ∩ {Retrieved}. This can be shown in the
form of a Venn diagram from figure 1.24.
Figure 1.24. Venn diagram for set of documents [7].
There are three fundamental measures for assessing the quality of text retrieval −
 Precision;
 Recall;
 F-score.
Precision. Precision is the percentage of retrieved documents that are in fact relevant to the
query. Precision can be defined as Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall. Recall is the percentage of documents that are relevant to the query and were in fact
retrieved. Recall is defined as − Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score. The F-score is the commonly used trade-off between the two: an information retrieval system often needs to trade recall for precision or vice versa. The F-score is defined as the harmonic mean of recall and precision:
F-score = 2 × recall × precision / (recall + precision) [7].
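A tiny Java sketch (the document identifiers are invented for illustration) that computes these three measures for a retrieved set against a relevant set:

// Illustrative computation of the three text-retrieval measures defined above.
import java.util.HashSet;
import java.util.Set;

public class RetrievalMeasures {
    public static void main(String[] args) {
        Set<String> relevant  = new HashSet<>(Set.of("d1", "d2", "d3", "d5"));
        Set<String> retrieved = new HashSet<>(Set.of("d2", "d3", "d4", "d6"));

        // {Relevant} ∩ {Retrieved}
        Set<String> hit = new HashSet<>(relevant);
        hit.retainAll(retrieved);

        double precision = (double) hit.size() / retrieved.size();
        double recall    = (double) hit.size() / relevant.size();
        double fScore    = 2 * precision * recall / (precision + recall);

        System.out.printf("precision=%.2f recall=%.2f F-score=%.2f%n",
                          precision, recall, fScore);
    }
}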
1.3.3. Other data mining techniques
Association – this method is used to identify repetitive structures in time and is particularly used for discovering rules according to which the presence of one set of data correlates with other set elements. This method is frequently used to find a specific regularity across many transactions.
Sequence-based analysis allows highlighting the regularities in transactions. For example, we can answer the question of which purchases precede the purchase of a certain type of product. This method is used in marketing, price flexibility management, etc.
Dependency analysis – algorithms that extract dependencies between elements or objects in databanks which cannot be recognized in advance. In this way the value of a data object can be predicted based on others.
Clustering – combines sets of records that have similar features. This method can be used in market and supplier segmentation, combined with statistical models or neural networks. Clustering is often considered the first step in data analysis.
Classification – this algorithm groups data into classes, describing the characteristics of the records that belong to the same class. This method can be applied, for example, in credit risk evaluation.
Decision trees – use a set of commands for data classification. The method is fast and easier to understand than neural networks, but becomes complicated if there is a long list of commands.
Decision trees are tree-shaped structures that represent decision sets. These decisions generate rules,
which then are used to classify data. Decision trees are the favored technique for building
understandable models. Auditors can use them to assess, for example, whether the organization is
using an appropriate cost-effective marketing strategy that is based on the assigned value of the
customer, such as profit.
Induction – the process of searching data sets and generating standard rules.
Statistical methods – may be applied to describe the curve that fits a set of data points most closely.
Tendency discovery – these methods extract data tendencies or data abnormalities using different statistical methods, for example row sorting.
Text mining – while Data Mining is typically concerned with the detection of patterns in
numeric data, very often important (e.g., critical to business) information is stored in the form of
text. Unlike numeric data, text is often amorphous, and difficult to deal with. Text mining generally
consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc. and
the preparation of the text processed in that manner for further analyses with numeric data mining
techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product
names, etc.).
Data visualization – building graphs using colours and other visual means. This helps general data analysis to find abnormalities, structures or tendencies [4].
Artificial neural networks are non-linear, predictive models that learn through training.
Although they are powerful predictive modeling techniques, some of the power comes at the
expense of ease of use and deployment. One area where auditors can easily use them is when
reviewing records to identify fraud and fraud-like actions. Because of their complexity, they are
better employed in situations where they can be used and reused, such as reviewing credit card
transactions every month to check for anomalies.
The nearest-neighbor method classifies dataset records based on similar data in a
historical dataset. Auditors can use this approach to define a document that is interesting to them
and ask the system to search for similar items.
Each of these approaches brings different advantages and disadvantages that need to be
considered prior to their use. Neural networks, which are difficult to implement, require all input
and resultant output to be expressed numerically, thus needing some sort of interpretation
depending on the nature of the data-mining exercise. The decision tree technique is the most
commonly used methodology, because it is simple and straightforward to implement. Finally, the
nearest-neighbor method relies more on linking similar items and, therefore, works better for
extrapolation rather than predictive enquiries.
A good way to apply advanced data mining techniques is to have a flexible and interactive
data mining tool that is fully integrated with a database or data warehouse. Using a tool that
operates outside of the database or data warehouse is not as efficient. Using such a tool will involve
extra steps to extract, import, and analyze the data. When a data mining tool is integrated with the
data warehouse, it simplifies the application and implementation of mining results. Furthermore, as
the warehouse grows with new decisions and results, the organization can mine best practices
continually and apply them to future decisions.
Regardless of the technique used, the real value behind data mining is modeling — the
process of building a model based on user-specified criteria from already captured data. Once a
model is built, it can be used in similar situations where an answer is not known. For example, an
organization looking to acquire new customers can create a model of its ideal customer that is based
on existing data captured from people who previously purchased the product. The model then is
used to query data on prospective customers to see if they match the profile. Modeling also can be
used in audit departments to predict the number of auditors required to undertake an audit plan
based on previous attempts and similar work. [5].
1.4. OLAP main notions
OLAP is an acronym for Online Analytical Processing; in other sources it is also called multi-dimensional information systems. OLAP performs multidimensional analysis of business data
and provides the capability for complex calculations, trend analysis, and sophisticated data
modeling. It is the foundation for many kinds of business applications for Business Performance
Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models,
Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end-users to perform ad hoc
analysis of data in multiple dimensions, thereby providing the insight and understanding they need
for better decision making [16].
OLAP concepts include the concept of multiple hierarchical dimensions and can be used by
anyone to think more clearly about the world, whether it be the material world from the atomic
scale to the galactic scale, the economics world from micro agents to macro economies, or the
social world from interpersonal to international relationships. In other words, even without any kind
of formal language, just being able to think in terms of a multi-dimensional, multi-level world is
useful regardless of your position in life.
Good information needs to be existent, accurate, timely, and understandable [7]. Software
products devoted to the operations of a business, built principally on top of large-scale database
systems, have come to be known as On-Line Transaction Processing systems or OLTP. The
development path for OLTP software has followed a pretty straight line for the past 35 years. The
goal has been to make systems handle larger amounts of data, process more transactions per unit
time, and support larger numbers of concurrent users with ever-greater robustness. But this handling of data does not consist only of simple operational activities; it also includes more complex analysis. The difference between these two ways of working with data is shown in table 1.2.
Table 1.2. Difference between operational activities and analysis-based decision-oriented activities [1]
Operational activities: more frequent; more predictable; smaller amounts of data accessed per query; query mostly raw data; require mostly current data; few, if any, complex derivations.
Analysis-based decision-oriented activities: less frequent; less predictable; larger amounts of data accessed per query; query mostly derived data; require past, present and projected data; many complex derivations.
OLAP is a powerful analysis tool in [17]:
 forecasting;
 statistical computations;
 aggregations;
 etc.
1.5. OLAP and Data Mining Comparison
OLAP and data mining are used to solve different kinds of analytic problems such as:
OLAP provides summary data and generates rich calculations. For example, OLAP answers
questions like "How do sales of mutual funds in North America for this quarter compare with sales
a year ago? What can we predict for sales next quarter? What is the trend as measured by percent
change?"
Data mining discovers hidden patterns in data. Data mining operates at a detail level instead
of a summary level. Data mining answers questions like "Who is likely to buy a mutual fund in the
next six months, and what are the characteristics of these likely buyers?" [18].
Note that despite its name, analyses referred to as OLAP do not need to be performed truly
"on-line" (or in real-time); the term applies to analyses of multidimensional databases (that may,
obviously, contain dynamically updated information) through efficient "multidimensional" queries
that reference various types of data. OLAP facilities can be integrated into corporate (enterprise-wide) database systems and they allow analysts and managers to monitor the performance of the
business (e.g., such as various aspects of the manufacturing process or numbers and types of
completed transactions at different locations) or the market. The final result of OLAP techniques
can be very simple (e.g., frequency tables, descriptive statistics, simple cross-tabulations) or more
complex (e.g., they may involve seasonal adjustments, removal of outliers, and other forms of
cleaning the data). Although Data Mining techniques can operate on any kind of unprocessed or
even unstructured information, they can also be applied to the data views and summaries generated
by OLAP to provide more in-depth and often more multidimensional knowledge. In this sense, Data
Mining techniques could be considered to represent either a different analytic approach (serving
different purposes than OLAP) or as an analytic extension of OLAP [19].
The functions or algorithms typically found in OLAP tools (such as aggregation [in its many
forms], allocations, ratios, products, etc.) are descriptive modeling functions whereas the functions
found in any so-called data-mining package (such as regressions, neural nets, decision trees, and
clustering) are pattern discovery or explanatory modeling functions. In addition to the fact that
OLAP provides descriptive modeling functions while data mining provides explanatory modeling
functions, OLAP also provides a sophisticated structuring consisting of dimensions with hierarchies
and cross-dimensional referencing that is nowhere provided in a data-mining environment. A
typical data-mining or statistics tool looks at the world in terms of variables and cases. The fact that
many data miners do their work without using OLAP tools doesn’t mean they aren’t using OLAP
functions. On the contrary, all data miners do some OLAP work as part of their data exploration and
preparation prior to running particular pattern detection algorithms. Simply, many data miners rely
on basic calculation capabilities provided for either in the data-mining tool or the backend database
[1].
1.6. Integration of OLAP and Data Mining
OLAP and data mining can complement each other. For example, OLAP might pinpoint
problems with sales of mutual funds in a certain region. Data mining could then be used to gain
insight about the behavior of individual customers in the region. Finally, after data mining predicts
something like a 5% increase in sales, OLAP can be used to track the net income. Or, Data Mining
might be used to identify the most important attributes concerning sales of mutual funds, and those
attributes could be used to design the data model in OLAP [18].
II. VACANCIES’ MARKET ANALYSIS WITH DATA MINING AND OLAP
2.1. Data Mining Query Language
The Data Mining Query Language (DMQL) is actually based on the Structured Query Language
(SQL). Data Mining Query Languages can be designed to support ad hoc and interactive data
mining. This DMQL provides commands for specifying primitives. The DMQL can work with
databases and data warehouses as well. DMQL can be used to define data mining tasks; in particular, we examine how task-relevant data, the kind of knowledge to be mined, concept hierarchies and interestingness measures can be specified in DMQL.
Syntax for Task-Relevant Data Specification
In figure 2.1. is the syntax of DMQL for specifying task-relevant data.
use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
Figure 2.1. Syntax for specifying task-relevant data.
The syntax for Characterization, Discrimination, Association, Classification, and Prediction follows.
Characterization. The syntax for characterization is presented in figure 2.2.
mine characteristics [as pattern_name]
analyze {measure(s) }
Figure 2.2. Syntax for characterization
The analyze clause specifies aggregate measures, such as count, sum, or count%. An
example is shown in figure 2.3.
A description of customer purchasing habits:
mine characteristics as customerPurchasing
analyze count%
Figure 2.3. Example using the aggregate measure count%.
Discrimination. The syntax for Discrimination is presented in figure 2.4.
mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}
Figure 2.4. Syntax for Discrimination
For example, a user may define big spenders as customers who purchase items that cost $100 or more on average, and budget spenders as customers who purchase items at less than $100 on average. The mining of discriminant descriptions for customers from each of these
categories can be specified in the DMQL as in figure 2.5.
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥$100
versus budgetSpenders where avg(I.price)< $100
analyze count
Figure 2.5. Mining of discriminant descriptions for customers from each of these categories.
Association. The syntax for Association is written in figure 2.6.
mine associations [ as {pattern_name} ]
{matching {metapattern} }
Figure 2.6. Syntax for Association
For example, as written in figure 2.7:
mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
Figure 2.7. Example of an association mining query.
X is key of customer relation; P and Q are predicate variables; and W, Y, and Z are object
variables.
Classification. The syntax for Classification is written in figure 2.8.
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
Figure 2.8. Syntax for Classification.
For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating, the mining task can be named classifyCustomerCreditRating (figure 2.9.).
mine classification as classifyCustomerCreditRating
analyze credit_rating
Figure 2.9. Example of a classification mining query.
Prediction. The syntax for prediction is written in figure 2.10.
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
Figure 2.10. Syntax for prediction.
Syntax for Concept Hierarchy Specification. To specify concept hierarchies, use the syntax
from figure 2.11.
use hierarchy <hierarchy> for <attribute_or_dimension>
Figure 2.11. Syntax to specify concept hierarchies.
We use different syntaxes to define different types of hierarchies such as in figure 2.12.
-schema hierarchies
define hierarchy time_hierarchy on date as [date,month,quarter,year]
-set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
-operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
-rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost)< $50
level_1: medium-profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
Figure 2.12. Syntaxes to define different types of hierarchies.
Syntax for Interestingness Measures Specification. Interestingness measures and thresholds
can be specified by the user with the statement presented in figure 2.13.
with <interest_measure_name> threshold = threshold_value
Figure 2.13. Syntax to specify interestingness measures and thresholds.
For example, as in figure 2.14.
with support threshold = 0.05
with confidence threshold = 0.7
Figure 2.14. Example of specifying interestingness measures and thresholds.
Syntax for Pattern Presentation and Visualization Specification. DMQL provides a syntax that allows users to specify the display of discovered patterns in one or more forms (figure 2.14).
display as <result_form>
Figure 2.14. Syntax for specifying the display form of discovered patterns.
For example, as in figure 2.15.
display as table
Figure 2.15. Displaying discovered patterns as a table. [12]
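DMQL itself is not directly executable in the tools used later in this thesis. As an illustration only, roughly the same association task, with the support and confidence thresholds shown above, could be expressed with Weka's Apriori associator. The sketch below is not part of DMQL, and the file name vacancies.arff is a hypothetical placeholder.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriThresholds {
    public static void main(String[] args) throws Exception {
        // load a dataset with nominal attributes (hypothetical file name)
        Instances data = DataSource.read("vacancies.arff");

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.05); // "with support threshold = 0.05"
        apriori.setMinMetric(0.7);             // "with confidence threshold = 0.7"
        apriori.buildAssociations(data);

        System.out.println(apriori);           // prints the discovered association rules
    }
}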
2.2. Database structure
The data warehouse can be either a relational database or a set of connected databases [20]. Connecting several, and especially heterogeneous, databases is a more complex procedure. Since the goal of the thesis is to compare two techniques, a simpler data warehouse can be used – one large relational database. Its structure is indicated in figure 2.16.
Figure 2.16. The structure of database for market of work vacancies.
2.3. The interests of companies that search for workers and of individuals who search for work
The companies may:
 introduce data about themselves (title, contact address, phone, e-mail, locality, which vacancies they have, what the requirements and responsibilities are, the work schedule), and
 view registered people who hold a diploma in the area they need and live in the locality where they search for a worker, or who have no certification, for simple operational functions.
The individuals may:
 introduce data about themselves (name, surname, professional training, date of birth, locality, phone, e-mail);
 find out how many vacancies there are in a certain locality, in a certain field and at a certain date, and search for a job according to their preferences;
 find out which was the most popular profession in a certain locality/district during the previous year.
2.4. Realising data mining in WEKA
The application chosen to deliver data mining is WEKA, because of its advantages:
 Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling. It supports the most important functions for data mining: preprocessing, classification, clustering, association, attribute selection and visualization (figure 2.17.);
 graphical user interfaces for easy access to and use of these functions;
 free availability under the GNU General Public License;
 portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform;
 a comprehensive collection of data preprocessing and modeling techniques;
 tutorials on how to data mine with Weka are available on YouTube.
Figure 2.17. The interface of Weka for an explorer.
Users with different levels of skills can access the application (figure 2.18.). The graphical user interface allows even a non-programmer to work with this application and to data mine.
A further function for data mining in Weka is attribute selection. This process is separated into two parts:
 Attribute Evaluator: the method by which attribute subsets are assessed;
 Search Method: the method by which the space of possible subsets is searched.
In what follows, three clever ways of using attribute selection in Weka, described by Jason Brownlee, are presented.
Figure 2.18. Weka GUI Chooser.
1. Explore Attribute Selection. When just starting out with attribute selection, he recommends playing with a few of the methods in the Weka Explorer.
Load the dataset and click the "Select attributes" tab (figure 2.19.). Try out different Attribute Evaluators and Search Methods on the dataset and review the results in the output window.
Figure 2.19. Feature Selection Methods in the Weka Explorer
The idea is to get a feeling and build up an intuition for 1) how many and 2) which attributes are selected for the problem. This information can be used going forward into either or both of the next steps.
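The same exploration can also be done programmatically through Weka's Java API. The sketch below is only an illustration of this step, not code taken from the thesis; the file name vacancies.arff is a placeholder, and CfsSubsetEval with BestFirst is just one possible evaluator/search combination.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ExploreAttributeSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vacancies.arff");   // hypothetical data file
        data.setClassIndex(data.numAttributes() - 1);         // last attribute is the class

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());            // Attribute Evaluator
        selector.setSearch(new BestFirst());                   // Search Method
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());        // which attributes were chosen
    }
}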
2. Prepare Data with Attribute Selection. The next step would be to use attribute selection as part of the data preparation step.
There is a filter (figure 2.20.) that can be used when preprocessing the dataset: it runs an attribute selection scheme and then trims the dataset to only the selected attributes. The filter is called "AttributeSelection" and is found under the supervised attribute filters.
Figure 2.20. Creating Transforms of a Dataset using Feature Selection methods in Weka
We can then save the dataset for use in experiments when spot checking algorithms.
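A possible programmatic equivalent of this preprocessing step, again only as an illustrative sketch (the file name and the choice of CfsSubsetEval/BestFirst are assumptions, not prescriptions):

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class PrepareDataWithAttributeSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vacancies.arff");   // hypothetical data file
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection filter = new AttributeSelection();  // the filter version
        filter.setEvaluator(new CfsSubsetEval());
        filter.setSearch(new BestFirst());
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);    // dataset trimmed to the selected attributes
        System.out.println("Attributes kept: " + reduced.numAttributes());
    }
}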
3. Run Algorithms with Attribute Selection. Finally, there is one more clever way to incorporate attribute selection, and that is to couple it with the algorithm directly.
There is a meta algorithm (figure 2.21.) that can be run and included in experiments and that selects attributes before running the base algorithm. The algorithm is called "AttributeSelectedClassifier" and is found under the "meta" group of algorithms. This algorithm can be configured to use the classifier of choice as well as the desired Attribute Evaluator and Search Method.
Figure 2.21. Coupling a Classifier and Attribute Selection in a Meta Algorithm in Weka
Multiple versions of this meta algorithm, configured with different variations of the attribute selection scheme, can be included in an experiment to see how they compare to each other.
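A sketch of how such a coupled setup might look in Java, assuming J48 as the base classifier and the same illustrative evaluator and search method as above (none of these choices are mandated by the thesis):

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class RunWithAttributeSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vacancies.arff");    // hypothetical data file
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new CfsSubsetEval());                  // attribute selection runs
        asc.setSearch(new BestFirst());                         // inside the meta classifier
        asc.setClassifier(new J48());                           // base classifier of choice

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));  // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}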
2.4.1. Classification trees. Specifying the Criteria for Predictive Accuracy
An operational definition of accurate prediction is hard to come by. To solve the problem of
defining predictive accuracy, the problem is "stood on its head," and the most accurate prediction is
operationally defined as the prediction with the minimum costs. The term costs need not seem
mystifying. In many typical applications, costs simply correspond to the proportion of misclassified
cases. The notion of costs was developed as a way to generalize, to a broader range of prediction
situations, the idea that the best prediction has the lowest misclassification rate.
The need for minimizing costs, rather than just the proportion of misclassified cases, arises
when some predictions that fail are more catastrophic than others, or when some predictions that
fail occur more frequently than others. The costs to a gambler of losing a single bet (or prediction)
on which the gambler's whole fortune is at stake are greater than the costs of losing many bets (or
predictions) on which a tiny part of the gambler's fortune is at stake. Conversely, the costs of losing
many small bets can be larger than the costs of losing just a few bigger bets. We should spend
proportionately more effort in minimizing losses on bets where losing (making errors in prediction)
costs us more.
Minimizing costs, however, does correspond to minimizing the proportion of misclassified
cases when Priors are taken to be proportional to the class sizes and when Misclassification costs
are taken to be equal for every class. We will address Priors first. Priors, or a priori probabilities,
specify how likely it is, without using any prior knowledge of the values for the predictor variables
in the model, that a case or object will fall into one of the classes. For example, in an educational
study of high school drop-outs, it may happen that, overall, there are fewer drop-outs than students
who stay in school (i.e., there are different base rates); thus, the a priori probability that a student drops out is lower than the probability that a student remains in school.
The a priori probabilities used in minimizing costs can greatly affect the classification of
cases or objects. If differential base rates are not of interest for the study, or if we know that there
are about an equal number of cases in each class, then we would use equal priors. If the differential
base rates are reflected in the class sizes (as they would be if the sample is a probability sample)
then we would use priors estimated by the class proportions of the sample. Finally, if there is specific knowledge about the base rates (for example, based on previous research), then it is reasonable to specify priors in accordance with that knowledge. For example, a priori probabilities for carriers of a
recessive gene could be specified as twice as high as for individuals who display a disorder caused
by the recessive gene. The general point is that the relative size of the priors assigned to each class
can be used to "adjust" the importance of misclassifications for each class. Minimizing costs
corresponds to minimizing the overall proportion of misclassified cases when Priors are taken to be
proportional to the class sizes (and Misclassification costs are taken to be equal for every class),
because prediction should be better in larger classes to produce an overall lower misclassification
rate.
Misclassification costs. Sometimes more accurate classification is desired for some classes
than others for reasons unrelated to relative class sizes. Regardless of their relative frequency,
carriers of a disease who are contagious to others might need to be more accurately predicted than
carriers of the disease who are not contagious to others. If it is expected that little is lost in avoiding
a non-contagious person but much is lost in not avoiding a contagious person, higher
misclassification costs could be specified for misclassifying a contagious carrier as non-contagious
than for misclassifying a non-contagious person as contagious. But to reiterate, minimizing costs
corresponds to minimizing the proportion of misclassified cases when Priors are taken to be
proportional to the class sizes and when Misclassification costs are taken to be equal for every class.
Case weights. A little less conceptually, the use of case weights on a weighting variable as case multipliers for aggregated data sets is also related to the issue of minimizing costs. Interestingly, as an alternative to using case weights for aggregated data sets, one can specify appropriate priors and/or misclassification costs and produce the same results while avoiding the additional processing required to analyze multiple cases with the same values for all variables. Suppose that in an aggregated data set with two classes having an equal number of cases, there are case weights of 2 for all the cases in the first class and case weights of 3 for all the cases in the second class. If priors of .4 and .6, respectively, are specified, together with equal misclassification costs, and the data are analyzed without case weights, the same misclassification rates are obtained as when priors are estimated by the class sizes, equal misclassification costs are specified, and the aggregated data set is analyzed using the case weights. The same misclassification rates can also be obtained by specifying equal priors, specifying the costs of misclassifying class 1 cases as class 2 cases to be 2/3 of the costs of misclassifying class 2 cases as class 1 cases, and analyzing the data without case weights.
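As a small worked check of the equivalence just described (a sketch introduced here; N denotes the common unweighted size of the two classes, a symbol not used in the source text): the case weights 2 and 3 act as class multipliers, so the implied priors are 2N / (2N + 3N) = 0.4 and 3N / (2N + 3N) = 0.6, and the implied ratio of the cost of misclassifying a class 1 case to the cost of misclassifying a class 2 case is 2 : 3. Specifying priors (0.4, 0.6) with equal costs, or equal priors with this 2 : 3 cost ratio, therefore reproduces the weighted analysis.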
The relationships between priors, misclassification costs, and case weights become quite complex in all but the simplest situations (for discussions, see Breiman et al., 1984; Ripley, 1996). In analyses where minimizing costs corresponds to minimizing the misclassification rate, however, these issues need not cause any concern. Priors, misclassification costs, and case weights are brought up here, however, to illustrate the wide variety of prediction situations that can be handled using the concept of minimizing costs, as compared to the rather limited (but probably typical) prediction situations that can be handled using the narrower (but simpler) idea of minimizing misclassification rates. Furthermore, minimizing costs is an underlying goal of classification tree analysis, and it is explicitly addressed in the fourth and final basic step in classification tree analysis, where, in trying to select the "right-sized" tree, the tree with the minimum estimated costs is chosen. Depending on the type of prediction problem users are trying to solve, understanding the idea of reduction of estimated costs may be important for understanding the results of the analysis.
2.4.2. Classification trees. Selecting Splits
The second basic step in classification tree analysis is to select the splits on the predictor
variables that are used to predict membership in the classes of the dependent variables for the cases
or objects in the analysis. Not surprisingly, given the hierarchical nature of classification trees, these
splits are selected one at a time, starting with the split at the root node, and continuing with splits of
resulting child nodes until splitting stops, and the child nodes that have not been split become
terminal nodes. Three Split selection methods are discussed here.
Discriminant-based univariate splits. The first step in split selection when the Discriminant-based univariate splits option is chosen is to determine the best terminal node to split in the current
tree, and which predictor variable to use to perform the split. For each terminal node, p-values are
computed for tests of the significance of the relationship of class membership with the levels of
each predictor variable. For categorical predictors, the p-values are computed for Chi-square tests of
independence of the classes and the levels of the categorical predictor that are present at the node.
For ordered predictors, the p-values are computed for ANOVAs of the relationship of the classes to
the values of the ordered predictor that are present at the node. If the smallest computed p-value is
smaller than the default Bonferroni-adjusted p-value for multiple comparisons of .05 (a different
threshold value can be used), the predictor variable producing that smallest p-value is chosen to
split the corresponding node. If no p-value smaller than the threshold p-value is found, p-values are
computed for statistical tests that are robust to distributional violations, such as Levene's F. Details
concerning node and predictor variable selection when no p-value is smaller than the specified
threshold are described in Loh and Shih (1997).
The next step is to determine the split. For ordered predictors, the 2-means clustering
algorithm of Hartigan and Wong (1979) is applied to create two "superclasses" for the node. The
two roots are found for a quadratic equation describing the difference in the means of the
"superclasses" on the ordered predictor, and the values for a split corresponding to each root are
computed. The split closest to a "superclass" mean is selected. For categorical predictors, dummy-coded variables representing the levels of the categorical predictor are constructed, and then
singular value decomposition methods are applied to transform the dummy-coded variables into a
set of non-redundant ordered predictors. The procedures for ordered predictors are then applied and
the obtained split is "mapped back" onto the original levels of the categorical variable and
represented as a contrast between two sets of levels of the categorical variable. Again, further
details about these procedures are described in Loh and Shih (1997). Although complicated, these
procedures reduce a bias in split selection that occurs when using the C&RT-style exhaustive search
method for selecting splits. This is the bias toward selecting variables with more levels for splits, a
bias that can skew the interpretation of the relative importance of the predictors in explaining
responses on the dependent variable (Breiman et. al., 1984).
Discriminant-based linear combination splits. The second split selection method is the
Discriminant-based linear combination split option for ordered predictor variables (however, the
predictors are assumed to be measured on at least interval scales). Surprisingly, this method works
by treating the continuous predictors from which linear combinations are formed in a manner that is
similar to the way categorical predictors are treated in the previous method. Singular value
decomposition methods are used to transform the continuous predictors into a new set of non-redundant predictors. The procedures for creating "superclasses" and finding the split closest to a
"superclass" mean are then applied, and the results are "mapped back" onto the original continuous
predictors and represented as a univariate split on a linear combination of predictor variables.
C&RT-style exhaustive search for univariate splits. The third split-selection method is the
C&RT-style exhaustive search for univariate splits method for categorical or ordered predictor
variables. With this method, all possible splits for each predictor variable at each node are examined
to find the split producing the largest improvement in goodness of fit (or equivalently, the largest
reduction in lack of fit). What determines the domain of possible splits at a node? For categorical
predictor variables with k levels present at a node, there are 2^(k-1) - 1 possible contrasts between
two sets of levels of the predictor. For ordered predictors with k distinct levels present at a node,
there are k -1 midpoints between distinct levels. Thus it can be seen that the number of possible
splits that must be examined can become very large when there are large numbers of predictors with
many levels that must be examined at many nodes.
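To make the size of this search space concrete, a small illustrative sketch (the numbers are examples only, not values from the thesis data):

public final class SplitCount {
    // categorical predictor with k levels present at a node: 2^(k-1) - 1 possible contrasts
    static long categoricalSplits(int k) {
        return (1L << (k - 1)) - 1;
    }

    // ordered predictor with k distinct values present at a node: k - 1 midpoints
    static long orderedSplits(int k) {
        return k - 1L;
    }

    public static void main(String[] args) {
        System.out.println(categoricalSplits(10)); // 511 candidate splits
        System.out.println(orderedSplits(10));     // 9 candidate splits
    }
}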
2.4.3. Classification trees. Determining When to Stop Splitting
The third step in classification tree analysis is to determine when to stop splitting. One
characteristic of classification trees is that if no limit is placed on the number of splits that are
performed, eventually "pure" classification will be achieved, with each terminal node containing
only one class of cases or objects. However, "pure" classification is usually unrealistic. Even a
simple classification tree such as a coin sorter can produce impure classifications for coins whose
sizes are distorted or if wear changes the lengths of the slots cut in the track. This potentially could
be remedied by further sorting of the coins that fall into each slot, but to be practical, at some point
the sorting would have to stop and we would have to accept that the coins have been reasonably
well sorted.
Likewise, if the observed classifications on the dependent variable or the levels on the
predicted variable in a classification tree analysis are measured with error or contain "noise," it is
unrealistic to continue to sort until every terminal node is "pure." Two options for controlling when
splitting stops will be discussed here. These two options are linked to the choice of the Stopping
rule specified for the analysis.
Minimum n. One option for controlling when splitting stops is to allow splitting to continue
until all terminal nodes are pure or contain no more than a specified minimum number of cases or
objects. The desired minimum number of cases can be specified as the Minimum n, and splitting
will stop when all terminal nodes containing more than one class have no more than the specified
number of cases or objects.
Fraction of objects. Another option for controlling when splitting stops is to allow splitting
to continue until all terminal nodes are pure or contain no more cases than a specified minimum
fraction of the sizes of one or more classes. The desired minimum fraction can be specified as the
Fraction of objects and, if the priors used in the analysis are equal and class sizes are equal, splitting
will stop when all terminal nodes containing more than one class have no more cases than the
specified fraction of the class sizes for one or more classes. If the priors used in the analysis are not
equal, splitting will stop when all terminal nodes containing more than one class have no more
cases than the specified fraction for one or more classes.
2.4.4. Classification trees. Selecting the "Right-Sized" Tree
Example. After a night at the horse track, a studious gambler computes a huge classification
tree with numerous splits that perfectly account for the win, place, show, and no show results for
every horse in every race. Expecting to become rich, the gambler takes a copy of the tree graph to
the races the next night, sorts the horses racing that night using the classification tree, makes his or
her predictions and places his or her bets, and leaves the race track later much less rich than had
been expected. The poor gambler has foolishly assumed that a classification tree computed from a
learning sample in which the outcomes are already known will perform equally well in predicting
outcomes in a second, independent test sample. The gambler's classification tree performed poorly
during cross-validation. The gambler's payoff might have been larger using a smaller classification
tree that did not classify perfectly in the learning sample, but which was expected to predict equally
well in the test sample.
Some generalizations can be offered about what constitutes the "right-sized" classification
tree. It should be sufficiently complex to account for the known facts, but at the same time it should
be as simple as possible. It should exploit information that increases predictive accuracy and ignore
information that does not. It should, if possible, lead to greater understanding of the phenomena that
it describes. Of course, these same characteristics apply to any scientific theory, so we must try to
be more specific about what constitutes the "right-sized" classification tree. One strategy is to grow
the tree to just the right size, where the right size is determined by the user from knowledge from
previous research, diagnostic information from previous analyses, or even intuition. The other
strategy is to use a set of well-documented, structured procedures developed by Breiman et al.
(1984) for selecting the "right-sized" tree. These procedures are not foolproof, as Breiman et al.
(1984) readily acknowledge, but at least they take subjective judgment out of the process of
selecting the "right-sized" tree.
FACT-style direct stopping. We will begin by describing the first strategy, in which the
researcher specifies the size to grow the classification tree. This strategy is followed by using
FACT-style direct stopping as the Stopping rule for the analysis and by specifying the Fraction of
objects, which allows the tree to grow to the desired size. There are several options for obtaining
diagnostic information to determine the reasonableness of the choice of size for the tree. Three
options for performing cross-validation of the selected classification tree are discussed below.
Test sample cross-validation. The first, and most preferred type of cross-validation is test
sample cross-validation. In this type of cross-validation, the classification tree is computed from the
learning sample, and its predictive accuracy is tested by applying it to predict class membership in
the test sample. If the costs for the test sample exceed the costs for the learning sample (remember,
costs equal the proportion of misclassified cases when priors are estimated and misclassification
costs are equal), this indicates poor cross-validation and that a different sized tree might cross-validate better. The test and learning samples can be formed by collecting two independent data
sets, or if a large learning sample is available, by reserving a randomly selected proportion of the
cases, say a third or a half, for use as the test sample.
V-fold cross-validation. This type of cross-validation is useful when no test sample is
available and the learning sample is too small to have the test sample taken from it. A specified V
value for V-fold cross-validation determines the number of random subsamples, as equal in size as
possible, that are formed from the learning sample. The classification tree of the specified size is
computed V times, each time leaving out one of the subsamples from the computations, and using
that subsample as a test sample for cross-validation, so that each subsample is used V - 1 times in
the learning sample and just once as the test sample. The CV costs computed for each of the V test
samples are then averaged to give the V-fold estimate of the CV costs.
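In Weka this estimate is available directly. The following sketch (illustrative file name and classifier, not the thesis code) computes the V-fold cross-validation error rate, which plays the role of the CV cost when priors are estimated and misclassification costs are equal:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class VFoldCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vacancies.arff");   // hypothetical data file
        data.setClassIndex(data.numAttributes() - 1);

        int v = 10;                                            // the V in V-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, v, new Random(1));

        System.out.println("V-fold CV error rate: " + eval.errorRate());
    }
}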
Global cross-validation. In global cross-validation, the entire analysis is replicated a
specified number of times holding out a fraction of the learning sample equal to 1 over the specified
number of times, and using each hold-out sample in turn as a test sample to cross-validate the
selected classification tree. This type of cross-validation is probably no more useful than V-fold
cross-validation when FACT-style direct stopping is used, but can be quite useful as a method
validation procedure when automatic tree selection techniques are used (for discussion, see Breiman
et al., 1984). This brings us to the second of the two strategies that can be used to select the "right-sized" tree, an automatic tree selection method based on a technique developed by Breiman et al.
(1984) called minimal cost-complexity cross-validation pruning.
Minimal cost-complexity cross-validation pruning. Two methods of pruning can be used
depending on the Stopping Rule we choose to use. Minimal cost-complexity cross-validation
pruning is performed when we decide to Prune on misclassification error (as a Stopping rule), and
minimal deviance-complexity cross-validation pruning is performed when we choose to Prune on
deviance (as a Stopping rule). The only difference in the two options is the measure of prediction
error that is used. Prune on misclassification error uses the costs that we have discussed repeatedly
(which equal the misclassification rate when priors are estimated and misclassification costs are
equal). Prune on deviance uses a measure, based on maximum-likelihood principles, called the
deviance (see Ripley, 1996). We will focus on cost-complexity cross-validation pruning (as
originated by Breiman et. al., 1984), since deviance-complexity pruning merely involves a different
measure of prediction error.
The costs needed to perform cost-complexity pruning are computed as the tree is being
grown, starting with the split at the root node up to its maximum size, as determined by the
specified Minimum n. The learning sample costs are computed as each split is added to the tree, so
that a sequence of generally decreasing costs (reflecting better classification) are obtained
corresponding to the number of splits in the tree. The learning sample costs are called resubstitution
costs to distinguish them from CV costs, because V-fold cross-validation is also performed as each
split is added to the tree. Use the estimated CV costs from V-fold cross-validation as the costs for
the root node. Note that tree size can be taken to be the number of terminal nodes, because for
binary trees the tree size starts at one (the root node) and increases by one with each added split.
Now, define a parameter called the complexity parameter whose initial value is zero, and for every
tree (including the first, containing only the root node), compute the value for a function defined as
the costs for the tree plus the complexity parameter times the tree size. Increase the complexity
parameter continuously until the value of the function for the largest tree exceeds the value of the
function for a smaller-sized tree. Take the smaller-sized tree to be the new largest tree, continue
increasing the complexity parameter continuously until the value of the function for the largest tree
exceeds the value of the function for a smaller-sized tree, and continue the process until the root
node is the largest tree. (Those who are familiar with numerical analysis will recognize the use of
a penalty function in this algorithm. The function is a linear combination of costs, which generally
decrease with tree size, and tree size, which increases linearly. As the complexity parameter is
increased, larger trees are penalized for their complexity more and more, until a discrete threshold is
reached at which a smaller-sized tree's higher costs are outweighed by the largest tree's higher
complexity).
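In symbols, the penalized function described above can be written as Rα(T) = R(T) + α · size(T), where R(T) is the cost of tree T (resubstitution or CV cost), size(T) is the number of terminal nodes, and α ≥ 0 is the complexity parameter. This is a standard formulation consistent with Breiman et al. (1984); the notation is introduced here rather than taken from the text. As α grows, the tree minimizing Rα(T) shrinks from the largest tree toward the root node.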
The sequence of largest trees obtained by this algorithm have a number of interesting
properties. They are nested, because successively pruned trees contain all the nodes of the next
smaller tree in the sequence. Initially, many nodes are often pruned going from one tree to the next
smaller tree in the sequence, but fewer nodes tend to be pruned as the root node is approached. The
sequence of largest trees is also optimally pruned, because for every size of tree in the sequence,
there is no other tree of the same size with lower costs. Proofs and/or explanations of these
properties can be found in Breiman et al. (1984).
Tree selection after pruning. We now select the "right-sized" tree from the sequence of
optimally pruned trees. A natural criterion is the CV costs. While there is nothing wrong with
choosing the tree with the minimum CV costs as the "right-sized" tree, oftentimes there will be
several trees with CV costs close to the minimum. Breiman et al. (1984) make the reasonable
suggestion that we should choose as the "right-sized" tree the smallest-sized (least complex) tree
whose CV costs do not differ appreciably from the minimum CV costs. They proposed a "1 SE
rule" for making this selection, i.e., choose as the "right-sized" tree the smallest-sized tree whose
CV costs do not exceed the minimum CV costs plus 1 times the Standard error of the CV costs for
the minimum CV costs tree.
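A minimal sketch of the 1 SE rule as a selection procedure (the cost and standard-error values below are illustrative only, not results from this thesis):

public final class OneSERule {
    /** cvCost[i] and cvSE[i] hold the CV cost and its standard error for the
     *  i-th optimally pruned tree, ordered from the smallest tree (i = 0) to the largest. */
    static int selectBySERule(double[] cvCost, double[] cvSE, double seFactor) {
        int best = 0;
        for (int i = 1; i < cvCost.length; i++) {
            if (cvCost[i] < cvCost[best]) best = i;            // tree with minimum CV cost
        }
        double threshold = cvCost[best] + seFactor * cvSE[best];
        for (int i = 0; i < cvCost.length; i++) {              // smallest tree within the threshold
            if (cvCost[i] <= threshold) return i;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] cost = {0.40, 0.28, 0.25, 0.24, 0.26};        // illustrative values only
        double[] se   = {0.03, 0.03, 0.02, 0.02, 0.02};
        System.out.println("Selected tree index: " + selectBySERule(cost, se, 1.0));
    }
}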
One distinct advantage of the "automatic" tree selection procedure is that it helps to avoid
"overfitting" and "underfitting" of the data. The graph from figure 2.22 shows a typical plot of the
Resubstitution costs and CV costs for the sequence of successively pruned trees.
Figure 2.22. Cost Sequence for PRICE.
As shown in this graph, the Resubstitution costs (e.g., the misclassification rate in the
learning sample) rather consistently decrease as tree size increases. The CV costs, on the other
hand, approach the minimum quickly as tree size initially increases, but actually start to rise as tree
size becomes very large. Note that the selected "right-sized" tree is close to the inflection point in
the curve, that is, close to the point where the initial sharp drop in CV costs with increased tree size
starts to level out. The "automatic" tree selection procedure is designed to select the simplest
(smallest) tree with close to minimum CV costs, and thereby avoid the loss in predictive accuracy
produced by "underfitting" or "overfitting" the data (note the similarity to the logic underlying the
use of a "scree plot" to determine the number of factors to retain in Factor Analysis; see also
Reviewing the Results of a Principal Components Analysis).
As has been seen, minimal cost-complexity cross-validation pruning and subsequent "right-sized" tree selection is a truly "automatic" process. The algorithms make all the decisions leading to
selection of the "right-sized" tree, except for, perhaps, specification of a value for the SE rule. One
issue that arises with the use of such "automatic" procedures is how well the results replicate, where
replication might involve the selection of trees of quite different sizes across replications, given the "automatic" selection process that is used. This is where global cross-validation can be very useful. As explained previously, in global cross-validation, the entire analysis is replicated a specified number of times (3 is the default) holding out a fraction of the cases to use as a test sample to cross-validate the selected classification tree. If the average of the costs for the test samples, called the
global CV costs, exceeds the CV costs for the selected tree, or if the standard error of the global CV
costs exceeds the standard error of the CV costs for the selected tree, this indicates that the
"automatic" tree selection procedure is allowing too much variability in tree selection rather than
consistently selecting a tree with minimum estimated costs.
Classification trees and traditional methods. As can be seen in the methods used in
computing classification trees, in a number of respects classification trees are decidedly different
from traditional statistical methods for predicting class membership on a categorical dependent
variable. They employ a hierarchy of predictions, with many predictions sometimes being applied
to particular cases, to sort the cases into predicted classes. Traditional methods use simultaneous
techniques to make one and only one class membership prediction for each and every case. In other
respects, such as having as its goal accurate prediction, classification tree analysis is
indistinguishable from traditional methods. Time will tell if classification tree analysis has enough
to commend itself to become as accepted as the traditional methods.
The distinction between the discriminant analysis and classification tree decision
processes can perhaps be made most clear by considering how each analysis would be performed in
Regression. Because risk in the example of Breiman et al. (1984) is a dichotomous dependent
variable, the Discriminant Analysis predictions could be reproduced by a simultaneous multiple
regression of risk on the three predictor variables for all patients. The classification tree predictions
could only be reproduced by three separate simple regression analyses, where risk is first regressed
on P for all patients, then risk is regressed on A for patients not classified as low risk in the first
regression, and finally, risk is regressed on T for patients not classified as low risk in the second
regression. This clearly illustrates the simultaneous nature of Discriminant Analysis decisions as
compared to the recursive, hierarchical nature of classification trees decisions, a characteristic of
classification trees that has far-reaching implications. Another distinctive characteristic of
classification trees is their flexibility. The ability of classification trees to examine the effects of the
predictor variables one at a time, rather than just all at once, has already been described, but there
are a number of other ways in which classification trees are more flexible than traditional analyses.
The ability of classification trees to perform univariate splits, examining the effects of predictors
one at a time, has implications for the variety of types of predictors that can be analyzed [14].
2.5. Realising OLAP functions in FastCube
FastCube enables us to analyze data and to build summary tables (data slices) as well as
create a variety of reports and graphs both easily and instantly. It's a handy tool for the efficient
analysis of data arrays (figure 2.23.). The advantages of this application are the following:
 FastCube components can be built into the interface of host applications;
 FastCube end users do not require high programming skills to build reports;
 FastCube is a set of OLAP Desktop components for Delphi/C++Builder/Lazarus;
 connection to databases can be made not only through the standard ADO or BDE components but also through any component based on TDataSet;
 instant downloading and handling of data arrays;
 ready-made templates can be built for summary tables, and it is possible to prohibit users from modifying the schema;
 all FastCube's settings may be accessed both programmatically and by the end user;
 its data can be saved in a compact format for data exchange and data storage.
Figure 2.23. Fast Cube 2 application.
III. PRACTICAL APPLICATIONS DESCRIPTION
3.1. Data preprocessing
To do any procedure in Weka it is necessary to convert the data file into the *.arff format, as this is the data format „understood” by the application.
In order to obtain this format, one option is to export the *.xls file as "CSV (MS-DOS) (*.csv)", open it in WordPad and simply change the extension. The result must be as in figure 3.1.
Figure 3.1. Converting data file into arff format file.
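As an alternative to the manual conversion through WordPad, the same conversion could be scripted with Weka's converters. This is only a sketch, and the file names are placeholders rather than the files actually used in the thesis:

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import java.io.File;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("vacancies.csv"));   // CSV exported from Excel (hypothetical name)
        Instances data = loader.getDataSet();

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("vacancies.arff"));     // the ARFF file Weka will work with
        saver.writeBatch();
    }
}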
It was found that, at this point, 20 people had been hired, while the other 50 vacancies were still free. Of these 20, 10 are from the domain of health, 2 from construction, 2 from education, another 2 from transport, and one each from engineering and management (figure 3.2.). The same thing may be done in Excel, but the Weka application makes the calculations automatically.
Other values which describe the data set can also be calculated, such as those in figure 3.3., which shows the maximum and minimum salary for this area of work.
Figure 3.2. How many vacancies were occupied by areas.
Figure 3.3. Numerical description of the salaries of people hired.
Since most of those hired were doctors, let us look at this area and see why and how these people were hired. One of the tasks of analysing the data was to consider the profile of the people hired.
One argument is the salary, which ranges lower than the general level in this locality, from 161 to 273 euro per month. Another factor is that there were people with the required education in the locality where these vacancies were.
3.2. Data classification
By sex, there were 4 men and 5 women hired for jobs in the health area (figure 3.4.). Women are in the majority, but they are also the majority of the population, so this does not mean that women are preferred as doctors in different areas.
Figure 3.4. People who were hired as doctors.
In order to answer the question "How long will it take to find a specific specialist?", classification is to be used as well. The query is written as code (not everything can be done in the visual application), so that a list of the numbers of days in which a person might find a job is obtained, with the probability of the prediction indicated in the same row (see figure 3.5.). The steps to predict these values are to build a decision tree and to choose the functions from figure 3.6. It can be noticed that no error is indicated, because this is a prediction and the error can be calculated only after the event has happened.
Figure 3.5. The results gained of how many days are needed to find a job.
It can be seen that several cases are analysed and, as a result, several possible outcomes are shown, with the probability corresponding to each one. It took much time to write the sequence of instructions by hand and to find out which instructions, in which order, are necessary to predict these values. The knowledge and skills of a programmer are required of whoever works with this function of data mining.
Figure 3.6. Instructions needed to view the results of the tree.
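A minimal sketch of the kind of Java code behind such a prediction, assuming a J48 decision tree and a nominal class attribute named days_to_hire; both the file name and the attribute name are hypothetical placeholders, not the exact instructions used in the thesis:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DaysToHirePrediction {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vacancies.arff");             // hypothetical data file
        data.setClassIndex(data.attribute("days_to_hire").index());     // hypothetical class attribute

        J48 tree = new J48();                                            // C4.5 decision tree
        tree.buildClassifier(data);

        // probability distribution over the possible "days" classes for one candidate
        double[] dist = tree.distributionForInstance(data.instance(0));
        for (int i = 0; i < dist.length; i++) {
            System.out.printf("%s: %.2f%n", data.classAttribute().value(i), dist[i]);
        }
    }
}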
3.3. Multidimensional Data Analysis OLAP
In order to build and analyze a data cube it is necessary to build the cube before using the Fast Cube 2.0 application, which needs a ready cube to work with.
Building a data cube is possible through SQL Server Analysis Services. When installing it, the author found that the work is done through Microsoft Visual Studio. With SQL Server installed in advance, it is necessary to indicate the source server to which the application connects in order to take the database that serves as the source for the cube. The imported database is shown in figure 3.7. It was necessary to create a data source and a data source view, in the right-hand menu of the application (figure 3.8.), with which the work of creating the data cube continues.
Figure 3.7. Data base imported in Microsoft Visual Studio to build a data cube.
Figure 3.8. The menu in which to work.
When working with Microsoft Visual Studio, and in particular when creating a cube for the first time, it is important not to confuse what is shown in the middle of the application with a view. A view is a collection of correlated tables that together form multidimensional correlated data, so there can be several views in a database. Following all the steps from the tutorials in the Microsoft Developer Videos, the first data cube is shown in figure 3.9.
Figure 3.9. The first data cube designed in this thesis.
The second data cube was made in the same way, in Wizard mode (figures 3.10, 3.11.).
Figure 3.10. Using wizard mode to build a cube.
Figure 3.11. Vacancies’ characteristics cube.
When the cube is built, the next step is to deploy it, using the instruction Deploy <title of multidimensional project> in the BUILD menu.
After that the data can be viewed in diagrams such as those in figures 3.12 and 3.13.
Figure 3.12. The medicine job demand tendency.
Figure 3.13. The IT salary tendency.
It can be seen that OLAP has the advantage of showing data by several criteria at the same time. Figure 3.13. shows the dimensions time, salary and the percentile range of the salaries that companies offer.
CONCLUSIONS
OLAP and data mining tools are used to solve more or less different analytic problems.
Data mining discovers hidden patterns in data. It operates at a detail level instead of a summary level and answers questions such as "How long will it take to find a specific specialist?". The application used for the functions of data preprocessing and data classification was Weka, version 3.6, which is free. By data preprocessing it was determined, for the job vacancies market, how many people out of the total number were hired, in which area most of them were hired, and what the range of their salaries was. By classification it was found what the gender distribution of these people was and, looking ahead, how many days it might take for a person to find a job.
The best solution for enterprises that need to collect and analyze a huge amount of data is to buy a computer with at least 3 GB of RAM and to hire one or two programmers with a background in data mining, because a part of the functions must be written as code, and skills and knowledge in the area are absolutely necessary.
The OLAP tool provides summary data and generates rich calculations. It was applied in the Microsoft Visual Studio application, which uses SQL Server. For example, OLAP answers questions like "How many men and women were hired at a certain job (in a certain locality, for a certain period)?". Multidimensional data analysis has the advantage of showing data by several criteria at the same time, or of representing several dimensions, and it also allows the criteria to be changed instantly as needed.
OLAP and data mining can complement each other. For example, OLAP might pinpoint problems with job vacancies in a certain region. Data mining could then be used to gain insight into the behavior of individual potential employees in the region. Finally, after data mining predicts something like a 5% increase in job vacancies, OLAP can be used to track the hirings. Or, data mining might be used to identify the most important attributes concerning job vacancies, and those attributes could be used to design the data model in OLAP. While searching for applications to experiment with, the author of the thesis found many applications that include both techniques.
The more ways there are to analyze data, the better it is processed.
Our country has a big potential to develop data mining because, although it is small, there are many individual enterprises and many people searching for work or for workers. If more agencies used a centralized data system with all the potential functions for data mining and OLAP, more citizens would benefit. This is not the only reason why people go abroad for work, but it is one of the causes that might be eliminated. Courses at universities and seminars for teachers are necessary to train specialists in this area and to contribute to the development of this field in our country.
For a future pool of research in Moldova, it remains a task to find out how data mining works with different types of data, such as both structured and unstructured databases, hypertext and text mining, and how to apply the notion of granularity in the process of analyzing data from different sources.
BIBLIOGRAPHY
1. THOMSEN, E. OLAP Solutions. Building Multidimensional Information Systems.
2nd ed. New York: ed. John Wiley & Sons, Inc., 2002. 661 p.
2. ILEANĂ, I., ROTAR, C., MUNTEAN, M. Inteligenţa artificială. Alba Iulia: ed.
Aeternitas, 2009. 298 p
3. Data Mining [on-line] Available on Internet:
<http://documents.software.dell.com/statistics/textbook/data-miningtechniques#mining> (visited: 9.12.2015)
4. MARINOVA, N. Instrumentele data mining – parte componenta a procesului de
descoperire a cunostintelor. In: Economica. 2005, nr. 2(50). ISSN 1810-9136.
5. Data Mining 101: Tools and Techniques [on-line] Available on Internet:
https://iaonline.theiia.org/data-mining-101-tools-and-techniques (visited: 8.12.2015)
6. FRAND, J. Data Mining: What is Data Mining? [on-line] Available on Internet:
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datami
ning.htm (visited: 10.12.2015)
7. HAN, J., KAMBER, M. Data Mining: Concepts and Techniques. San Francisco: ed.
Elsevier, 2006. 743 p.
8. Data warehousing [on-line] Available on Internet:
http://documents.software.dell.com/statistics/textbook/data-miningtechniques#warehousing (visited: 9.12.2015)
9. What is Data Mining (Predictive Analytics, Big Data), [on-line] Available on
Internet: http://www.statsoft.co.za/textbook/data-mining-techniques/ (visited:
10.12.2015)
10. КОЧ, К. Что такое ERP. In: CIO. 2001, 15 November, translated by Daulet Tynbaev
[on-line] Available on Internet: http://www.erp-online.ru/erp/ (visited: 10.12.2015)
11. Open Source ERP Software, [on-line] Available on Internet:
http://www.heliumv.org/en/opensource-Industry_Solution-3.html (visited:
10.12.2015)
12. Data Mining - Query Language [on-line] Available on Internet:
http://www.tutorialspoint.com/data_mining/dm_query_language.htm (visited:
9.12.2015)
13. ZHAO, Y. R and Data Mining: Examples and Case Studies. Amsterdam: ed.
Elsevier, 2013. 156 p.
14. Computational Methods, [on-line] Available on Internet:
http://www.statsoft.com/Textbook/Classification-Trees#computation (visited:
10.12.15)
15. Neural Networks, [on-line] Available on Internet:
http://documents.software.dell.com/statistics/textbook/data-miningtechniques#neural (visited: 10.12.2015)
16. How is OLAP Technology Used? [on-line] Available on Internet:
http://olap.com/olap-definition/ (visited: 10.12.2015)
17. BELLAACHIA, A. Data Warehousing and OLAP Technology [on-line] Available
on Internet: http://www.seas.gwu.edu/~bell/csci243/lectures/data_warehousing.pdf
(visited: 9.12.2015)
18. OLAP and Data Mining [on-line] Available on Internet:
http://docs.oracle.com/cd/B28359_01/server.111/b28313/bi.htm (visited: 8.12.2015)
19. On-Line Analytic Processing (OLAP) [on-line] Available on Internet:
http://documents.software.dell.com/statistics/textbook/data-mining-techniques#olap
(visited: 10.12.2015)
20. INMON, W. H. Building the Data Warehouse. 3rd ed.
21. A Conceptual Model for Combining Enhanced OLAP and Data Mining Systems [online] Available on Internet: https://www.researchgate.net/publication/221522065
(visited: 10.12.2015)
22. Attribute-Relation File Format (ARFF) [on-line] Available on Internet:
http://www.cs.waikato.ac.nz/ml/weka/arff.html (visited: 5.02.2016)