DATA MINING
Data mining refers to extracting or “mining” knowledge from large amounts of data. Data Mining (DM) is
the science of finding new, interesting patterns and relationships in huge amounts of data. It is defined as
“the process of discovering meaningful new correlations, patterns, and trends by digging into large
amounts of data stored in warehouses”.
Applications of data mining to bioinformatics include gene finding, protein function domain detection,
function motif detection, protein function inference, disease diagnosis, disease prognosis, disease
treatment optimization, protein and gene interaction network reconstruction, data cleansing, and
protein sub-cellular location prediction.
KNOWLEDGE DISCOVERY IN DATABASES
Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. The goal is
the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of
data itself.
The Knowledge Discovery in Databases (KDD) process is commonly defined as the following stages:
1. Data cleaning to remove noise and inconsistent data.
2. Data integration, where multiple data sources may be combined.
3. Data selection, where data relevant to the analysis task are retrieved from the database.
4. Data transformation, where data are transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations.
5. Data mining, which is an essential process where intelligent methods are applied to extract data
patterns.
6. Pattern evaluation to identify the truly interesting patterns representing knowledge, based on
interestingness measures.
7. Knowledge presentation, where visualization and knowledge representation techniques are
used to present mined knowledge to users.
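
The seven stages above can be sketched as a chain of small functions. This is only an illustrative sketch, not a prescribed implementation: the toy records, the function names, the "total above the mean" pattern, and the threshold of 100 are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical toy records from two sources; None marks a missing value.
source_a = [
    {"customer": "c1", "amount": 20.0},
    {"customer": "c2", "amount": None},    # incomplete record
    {"customer": "c1", "amount": 30.0},
]
source_b = [
    {"customer": "c2", "amount": 50.0},
    {"customer": "c3", "amount": -5.0},    # inconsistent: negative amount
    {"customer": "c3", "amount": 500.0},
]

def clean(records):
    """Stage 1: drop records with missing or negative amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Stage 2: combine several sources into one list of records."""
    return [r for source in sources for r in source]

def select(records, customers):
    """Stage 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Stage 4: aggregate the records into a total amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals):
    """Stage 5: a deliberately simple pattern, customers whose total exceeds the mean."""
    average = mean(totals.values())
    return {c: t for c, t in totals.items() if t > average}

def evaluate(patterns, min_total):
    """Stage 6: keep only patterns that pass an assumed interestingness threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}

def present(patterns):
    """Stage 7: render the surviving patterns as readable text lines."""
    return [f"{c}: total {t:.2f}" for c, t in sorted(patterns.items())]

records = integrate(clean(source_a), clean(source_b))
totals = transform(select(records, {"c1", "c2", "c3"}))
report = present(evaluate(mine(totals), min_total=100.0))
for line in report:
    print(line)
```

In practice each stage is far more involved, and the stages are usually revisited iteratively rather than run once in a straight line.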
Data mining
Data mining involves six common classes of tasks:

Anomaly detection (Outlier/change/deviation detection) – Identifies unusual data records, which
may be interesting in themselves or may be data errors that require further investigation.

Association rule learning (Dependency modelling) – Searches for relationships between
variables.

Clustering – Discovers groups and structures in the data that are in some way or
another "similar", without using known structures in the data.

Classification – Generalizes known structure to apply to new data. For example, an
e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – Attempts to find a function which models the data with the least error.

Summarization – Provides a more compact representation of the data set, including
visualization and report generation.
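
Several of these task classes can be illustrated with very small functions. The sketch below covers anomaly detection, association rules, classification, regression, and summarization; clustering is left out for brevity. The specific techniques (z-scores, rule confidence, nearest centroid, a least-squares line), the function names, and the sample data are assumptions chosen for illustration, not methods prescribed by the text.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Anomaly detection: values more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def rule_confidence(transactions, antecedent, consequent):
    """Association rules: share of transactions containing the antecedent
    that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(1 for t in with_antecedent if consequent <= t) / len(with_antecedent)

def nearest_centroid(train, point):
    """Classification: assign `point` to the label whose training mean is closest."""
    centroids = {label: mean(vals) for label, vals in train.items()}
    return min(centroids, key=lambda label: abs(centroids[label] - point))

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def summarize(values):
    """Summarization: a compact description of a list of numbers."""
    return {"count": len(values), "min": min(values),
            "max": max(values), "mean": mean(values)}

# Hypothetical sample data.
readings = [10, 11, 9, 10, 12, 10, 11, 95]
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
# Assumed feature: number of suspicious words counted in each e-mail.
emails = {"legitimate": [0, 1, 1], "spam": [5, 6, 7]}

outliers = zscore_outliers(readings)
confidence = rule_confidence(baskets, {"bread"}, {"milk"})
label = nearest_centroid(emails, 4)
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
summary = summarize(readings)
print(outliers, confidence, label, slope, intercept, summary)
```

Real systems use more robust methods for each task; the point here is only to show what kind of input and output each task class has.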
Characteristics of a data mining system

Large quantities of data
The volume of data is so great that it has to be analyzed by automated techniques, e.g. satellite
information and credit card transactions.

Noisy, incomplete data
Imprecise data is a characteristic of all data collection.

Complex data structure
Conventional statistical analysis is not possible.

Heterogeneous data stored in legacy systems