What does Knowledge Discovery in Databases (KDD) mean?
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge
from a collection of data. This widely used data mining technique is a process that includes
data preparation and selection, data cleansing, incorporation of prior knowledge about the
data sets, and interpretation of accurate solutions from the observed results.
Major KDD application areas include marketing, fraud detection, telecommunication and
manufacturing.
Traditionally, data mining and knowledge discovery were performed manually. Over time,
the amount of data in many systems grew beyond terabyte scale and could no longer be
maintained manually. Moreover, discovering underlying patterns in data is considered
essential for the successful existence of any business. As a result, several software tools
were developed to discover hidden patterns in data and draw inferences from them, work
that came to form part of artificial intelligence.
The KDD process has developed rapidly over the last ten years. It now encompasses many
different approaches to discovery, including inductive learning, Bayesian statistics, semantic
query optimization, knowledge acquisition for expert systems and information theory. The
ultimate goal is to extract high-level knowledge from low-level data.
KDD is a multidisciplinary activity. It encompasses data storage and access, scaling
algorithms to massive data sets, and interpreting results. The data cleansing and data
access processes built into data warehousing facilitate the KDD process. Artificial
intelligence also supports KDD by discovering empirical laws from experimentation and
observation. The patterns recognized in the data must remain valid on new data with some
degree of certainty; such patterns are considered new knowledge. The steps involved in the
entire KDD process are listed below (a short code sketch of how they might fit together
follows the list):
1. Identify the goal of the KDD process from the customer’s perspective.
2. Understand the application domains involved and the knowledge that is required.
3. Select a target data set or subset of data samples on which discovery is to be
performed.
4. Cleanse and preprocess data by deciding strategies to handle missing fields and
alter the data as per the requirements.
5. Simplify the data sets by removing unwanted variables. Then, analyze useful
features that can be used to represent the data, depending on the goal or task.
6. Match KDD goals with data mining methods to suggest hidden patterns.
7. Choose data mining algorithms to discover hidden patterns. This process includes
deciding which models and parameters might be appropriate for the overall KDD
process.
8. Search for patterns of interest in a particular representational form, which may include
classification rules or trees, regression and clustering.
9. Interpret essential knowledge from the mined patterns.
10. Use the knowledge and incorporate it into another system for further action.
11. Document it and make reports for interested parties.
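As an illustration only, here is that sketch in Python: a minimal pipeline that walks through selection, cleansing, mining and interpretation. The file name, column names and the choice of k-means clustering are assumptions made for the example, not part of the KDD definition.

import pandas as pd
from sklearn.cluster import KMeans

# Steps 3-4: select the target data set and cleanse it (simple missing-value strategy).
df = pd.read_csv("transactions.csv")          # hypothetical target data set
df = df.dropna(subset=["amount", "quantity"])

# Step 5: keep only the features relevant to the discovery goal.
features = df[["amount", "quantity"]]

# Steps 6-8: choose a mining method (here k-means clustering) and search for patterns.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
df["segment"] = model.fit_predict(features)

# Steps 9-11: interpret the mined patterns and report them to interested parties.
print(df.groupby("segment")[["amount", "quantity"]].mean())

In a real project each step would be far more elaborate, but the ordering mirrors the list above: preparation and selection first, then the mining algorithm, then interpretation of the results.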
IMPORTANCE OF BI
Business Intelligence is a concept that typically involves the delivery and integration of
relevant and useful business information within an organization. Companies use business
intelligence to detect significant events and to identify and monitor business trends so that
they can adapt quickly to a changing environment. Effective use of business intelligence in
your organization can improve decision-making at all levels of management and strengthen
your tactical and strategic management processes. Here are some of the top reasons for
investing in a proper business intelligence system.
To Get Insights into Consumer Behavior
One of the main advantages of investing in business intelligence software and skilled
personnel is the fact that it will boost your ability to analyze the current consumer buying
trends. Once you understand what your consumers are buying, you can use this information
to develop products that match the current consumption trends and consequently improve
your profitability since you will be able to attract valuable customers.
To Improve Visibility
If you want to improve your control over various important processes in your organization,
you should consider investing in a good business intelligence system. Business intelligence
software will improve the visibility of these processes and make it possible to identify any
areas that need improvement. Moreover, if you currently have to skim through hundreds of
pages of detailed periodic reports to assess the performance of your organization's
processes, you can save time and improve productivity by having skilled intelligence
analysts use the software instead.
To Turn Data into Actionable Information
A business intelligence system is an analytical tool that can give you the insight you need
to make successful strategic plans for your organization. This is because such a system
would be able to identify key trends and patterns in your organization's data and
consequently make it easier for you to make important connections between different areas
of your business that may otherwise seem unrelated. As such, a business intelligence
system can help you understand the implications of various organizational processes better
and enhance your ability to identify suitable opportunities for your organization, thus enabling
you to plan for a successful future.
To Improve Efficiency
One of the most important reasons to invest in an effective business intelligence system is
that such a system can improve efficiency within your organization and, as a result,
increase productivity. You can use business intelligence to
share information across different departments in your organization. This will enable you to
save time on reporting processes and analytics. This ease of information sharing is likely to
reduce duplication of roles and duties within the organization and to improve the accuracy
and usefulness of the data generated by different departments.
Conclusion
To reap the full benefits of an effective business intelligence system, invest in skilled
business intelligence personnel and in software designed for analytical efficiency and
accessibility. You should also make sure that the system you choose can
analyze both the content and context of data.
Data Cleaning
Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting
(or removing) corrupt or inaccurate records from a record set, table, or database. Used
mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate or
irrelevant parts of the data and then replacing, modifying, or deleting this dirty or coarse
data.[1]
After cleansing, a data set will be consistent with other similar data sets in the system. The
inconsistencies detected or removed may have been originally caused by user entry errors,
by corruption in transmission or storage, or by different data dictionary definitions of similar
entities in different stores.
Data cleansing differs from data validation in that validation almost invariably means data is
rejected from the system at entry and is performed at entry time, rather than on batches of
data.
The actual process of data cleansing may involve removing typographical errors or validating
and correcting values against a known list of entities. The validation may be strict (such as
rejecting any address that does not have a valid postal code) or fuzzy (such as correcting
records that partially match existing, known records).
Some data cleansing solutions will clean data by cross checking with a validated data set.
Data enhancement, where data is made more complete by adding related information, is
also a common data cleansing practice; for example, appending phone numbers to the
addresses they relate to.
Data cleansing may also involve harmonization and standardization of data. Harmonization
might, for example, expand short codes (St, Rd, etc.) into full words (street, road).
Standardization of data means changing a reference data set to a new standard, e.g., the
use of standard codes.
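As a rough illustration of harmonization and of fuzzy correction against a known list, a small Python sketch follows; the short-code table, the city list and the similarity cut-off are invented for the example.

import difflib

SHORT_CODES = {"st": "street", "rd": "road", "ave": "avenue"}   # hypothetical mapping
KNOWN_CITIES = ["Chicago", "Houston", "Phoenix"]                # hypothetical reference list

def harmonize_address(address):
    # Expand short codes such as "St" or "Rd" to full words.
    words = [SHORT_CODES.get(w.lower().rstrip("."), w) for w in address.split()]
    return " ".join(words)

def correct_city(city, cutoff=0.8):
    # Fuzzy-match a city name against the reference list; keep it unchanged if no close match.
    matches = difflib.get_close_matches(city, KNOWN_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else city

print(harmonize_address("221 Baker St."))   # "221 Baker street"
print(correct_city("Chicgo"))               # "Chicago"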
The process of data cleansing

• Data auditing: The data is audited with the use of statistical and database methods
to detect anomalies and contradictions: this eventually gives an indication of the
characteristics of the anomalies and their locations. Several commercial software
packages will let you specify constraints of various kinds (using a grammar that
conforms to that of a standard programming language, e.g., JavaScript or Visual
Basic) and then generate code that checks the data for violation of these
constraints. This process is referred to below in the bullets "workflow specification"
and "workflow execution." For users who lack access to high-end cleansing software,
Microcomputer database packages such as Microsoft Access or FileMaker Pro will
also let you perform such checks, on a constraint-by-constraint basis, interactively
with little or no programming required in many cases.

• Workflow specification: The detection and removal of anomalies is performed by a
sequence of operations on the data known as the workflow. It is specified after the
process of auditing the data and is crucial in achieving the end product of high-quality
data. In order to achieve a proper workflow, the causes of the anomalies and
errors in the data have to be closely considered.

• Workflow execution: In this stage, the workflow is executed after its specification is
complete and its correctness is verified. The implementation of the workflow should
be efficient, even on large sets of data, which inevitably poses a trade-off because
the execution of a data-cleansing operation can be computationally expensive.

• Post-processing and controlling: After executing the cleansing workflow, the results
are inspected to verify correctness. Data that could not be corrected during
execution of the workflow is manually corrected, if possible. The result is a new
cycle in the data-cleansing process where the data is audited again to allow the
specification of an additional workflow to further cleanse the data by automatic
processing.
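To make the auditing step more concrete, here is a minimal Python sketch of constraint-based checking of the kind described above; the records, constraints and field names are assumptions chosen purely for illustration.

# Hypothetical records and constraints; a real audit would read from a database.
records = [
    {"id": 1, "age": 34, "postal_code": "60601"},
    {"id": 2, "age": -5, "postal_code": "ABCDE"},
]

constraints = [
    (lambda r: 0 <= r["age"] <= 130, "age out of range"),
    (lambda r: r["postal_code"].isdigit() and len(r["postal_code"]) == 5,
     "postal code is not a 5-digit number"),
]

def audit(records, constraints):
    # Return (record id, violated messages) for every record that breaks a constraint.
    findings = []
    for record in records:
        violated = [msg for check, msg in constraints if not check(record)]
        if violated:
            findings.append((record["id"], violated))
    return findings

for record_id, problems in audit(records, constraints):
    print(record_id, problems)   # record 2 violates both constraints

The output of such an audit feeds the workflow specification: each flagged anomaly suggests a cleansing operation to be added to the workflow.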
OLAP server
Online Analytical Processing (OLAP) is a category of software tools that provides analysis
of data stored in a database. OLAP tools enable users to analyze different dimensions of
multidimensional data; for example, they provide time series and trend analysis views.
OLAP is often used in data mining.
The chief component of OLAP is the OLAP server, which sits between a client and a
database management system (DBMS). The OLAP server understands how data is
organized in the database and has special functions for analyzing the data.
The main types of OLAP servers are:
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in
between a relational back-end server and client front-end tools. They use a relational or
extended-relational DBMS to store and manage warehouse data, and OLAP middleware to
support missing pieces. ROLAP servers include optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools and services. ROLAP
technology tends to have greater scalability than MOLAP technology. Microstrategy's DSS
server and Informix's Metacube, for example, adopt the ROLAP approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of
data through array-based multidimensional storage engines. They map multidimensional
views directly to data cube array structures. For example, Essbase from Hyperion is a
MOLAP server. The advantage of using a data cube is that it allows fast indexing to
precomputed summarized data. Notice that with multidimensional data stores, the storage
utilization may be low if the data set is sparse. In such cases, sparse matrix compression
techniques should be explored. Many MOLAP servers adopt a two-level storage
representation to handle sparse and dense data sets: the dense subcubes are identified and
stored as array structures, while the sparse subcubes employ compression technology for
efficient storage utilization.
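A toy Python sketch of this two-level idea, with a dense subcube held as an array and sparse cells held in a dictionary, may help; the dimensions, products and values are invented for the example.

import numpy as np

# Dense subcube: sales counts by (month, region) for a well-populated slice of the cube.
dense_subcube = np.array([[120, 95, 88],
                          [101, 110, 77]])   # 2 months x 3 regions

# Sparse subcube: only non-empty cells are stored, keyed by (month, region, product).
sparse_cells = {
    (0, 2, "umbrella"): 4,
    (1, 0, "snow_shovel"): 9,
}

def sparse_value(coords):
    # Look up a sparse cell, treating absent cells as zero.
    return sparse_cells.get(coords, 0)

print(dense_subcube.sum(axis=0))         # roll up over months, one total per region
print(sparse_value((0, 2, "umbrella")))  # 4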
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of
MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored
in a relational database, while aggregations are kept in a separate MOLAP store. Microsoft
SQL Server 7.0 OLAP Services, for example, supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational
databases, some relational and data warehousing firms (e.g., Red Brick from Informix)
implement specialized SQL servers that provide advanced query language and query
processing support for SQL queries over star and snowflake schemas in a read-only
environment.
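To give a feel for the star-schema aggregations such servers target, here is a small pandas sketch that joins a fact table to two dimension tables and rolls the result up; all table and column names are made up for the example.

import pandas as pd

# Fact and dimension tables with invented contents.
sales_fact = pd.DataFrame({
    "product_id": [1, 1, 2, 3],
    "store_id":   [10, 11, 10, 11],
    "amount":     [20.0, 35.0, 12.5, 40.0],
})
product_dim = pd.DataFrame({"product_id": [1, 2, 3],
                            "category": ["dairy", "dairy", "bakery"]})
store_dim = pd.DataFrame({"store_id": [10, 11],
                          "region": ["north", "south"]})

# Roughly: SELECT category, region, SUM(amount) ... GROUP BY category, region
cube = (sales_fact
        .merge(product_dim, on="product_id")
        .merge(store_dim, on="store_id")
        .groupby(["category", "region"])["amount"]
        .sum())
print(cube)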
ASSOCIATION RULE MINING
Association rules are if/then statements that help uncover relationships between seemingly
unrelated data in a transactional database, relational database or other information
repository. An example of an association rule would be "If a customer buys a dozen eggs,
he is 80% likely to also purchase milk."
An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent
is an item found in the data. A consequent is an item that is found in combination with the
antecedent.
Association rules are created by analyzing data for frequent if/then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an
indication of how frequently the items appear together in the database. Confidence indicates
how often the if/then statement has been found to be true.
In data mining, association rules are useful for analyzing and predicting customer behavior.
They play an important part in shopping basket data analysis, product clustering, and catalog
design and store layout.
Programmers use association rules to build programs capable of machine learning. Machine
learning is a type of artificial intelligence (AI) that seeks to build programs with the ability to
learn and improve without being explicitly programmed.
• Confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
= P(B|A) = P(A ∪ B) / P(A)
• Support(A ⇒ B) = (# tuples containing both A and B) / (total number of tuples)
= P(A ∪ B)
• Frequent itemsets: the sets of items that have minimum support (denoted by Li for
the i-th itemset).
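A tiny Python sketch of computing support and confidence for the candidate rule {eggs} → {milk} over an invented list of transactions illustrates the formulas above.

# Invented transactions; each transaction is the set of items bought together.
transactions = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "butter"},
    {"milk", "bread"},
    {"eggs", "milk", "butter"},
]

antecedent, consequent = {"eggs"}, {"milk"}

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
antecedent_only = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)   # P(A ∪ B): fraction of transactions containing both
confidence = both / antecedent_only  # P(B|A): of transactions with A, fraction also having B

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# With these transactions: support = 0.60, confidence = 0.75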