DATA MINING

Data mining refers to extracting or “mining” knowledge from large amounts of data. Data mining (DM) is the science of finding new, interesting patterns and relationships in huge amounts of data. It is defined as “the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in warehouses”.

Applications of data mining to bioinformatics include gene finding, protein function domain detection, function motif detection, protein function inference, disease diagnosis, disease prognosis, disease treatment optimization, protein and gene interaction network reconstruction, data cleansing, and protein sub-cellular location prediction.

KNOWLEDGE DISCOVERY IN DATABASES

Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process. The goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of the data itself. The KDD process is commonly defined with the following stages:

1. Data cleaning, to remove noise and inconsistent data.
2. Data integration, where multiple data sources may be combined.
3. Data selection, where data relevant to the analysis task are retrieved from the database.
4. Data transformation, where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.
5. Data mining, which is an essential process where intelligent methods are applied to extract data patterns.
6. Pattern evaluation, to identify the truly interesting patterns representing knowledge, based on interestingness measures.
7. Knowledge presentation, where visualization and knowledge representation techniques are used to present mined knowledge to users.
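The seven KDD stages can be sketched as a chain of small functions. This is only an illustrative sketch using the Python standard library: the purchase records, the function names, and the 0.5 support threshold are all hypothetical and do not come from the text. The "mining" step here simply counts item pairs that occur together, which stands in for a real mining method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: two "sources" of purchase records (assumed for illustration).
SOURCE_A = [
    {"customer": "c1", "items": ["bread", "milk"]},
    {"customer": "c2", "items": ["bread", "butter", "milk"]},
    {"customer": "c3", "items": None},  # noisy record: items missing
]
SOURCE_B = [
    {"customer": "c4", "items": ["milk", "butter"]},
    {"customer": "c5", "items": ["bread", "milk", "milk"]},  # repeated item
]


def clean(records):
    """Stage 1: drop records whose item list is missing or empty."""
    return [r for r in records if r["items"]]


def integrate(*sources):
    """Stage 2: combine several data sources into one list."""
    return [r for source in sources for r in source]


def select(records):
    """Stage 3: keep only the field relevant to the task (the item lists)."""
    return [r["items"] for r in records]


def transform(item_lists):
    """Stage 4: consolidate each basket into a sorted tuple of unique items."""
    return [tuple(sorted(set(items))) for items in item_lists]


def mine(baskets):
    """Stage 5: count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Stage 6: keep pairs whose support (share of baskets) meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


def present(patterns):
    """Stage 7: render the surviving patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]


if __name__ == "__main__":
    baskets = transform(select(integrate(clean(SOURCE_A), clean(SOURCE_B))))
    for line in present(evaluate(mine(baskets), len(baskets))):
        print(line)
```

On this toy data, the run keeps two pairs, bread + milk (support 0.75) and butter + milk (support 0.50). In practice each stage would be far more involved; the point is only the order in which the stages feed one another.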
Data mining

Data mining involves six common classes of tasks:

Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or of data errors that require further investigation.
Association rule learning (dependency modelling): searches for relationships between variables.
Clustering: the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
Classification: the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression: attempts to find a function that models the data with the least error.
Summarization: providing a more compact representation of the data set, including visualization and report generation.

Characteristics of a data mining system

Large quantities of data: the volume of data is so great that it has to be analyzed by automated techniques, e.g. satellite information, credit card transactions, etc.
Noisy, incomplete data: imprecise data is characteristic of all data collection.
Complex data structure: conventional statistical analysis is not possible.
Heterogeneous data stored in legacy systems.
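To make the task classes more concrete, here is a minimal sketch of four of them (anomaly detection, regression, classification, and summarization) using only the Python standard library. The specific techniques chosen (a z-score rule, an ordinary least-squares line, a 1-nearest-neighbour classifier), the function names, and the sample data are assumptions made for illustration; the text does not prescribe any particular method.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Anomaly detection: flag values more than `threshold` standard
    deviations away from the mean (a simple z-score rule)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def fit_line(xs, ys):
    """Regression: ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def classify(labelled_points, point):
    """Classification: give `point` the label of its nearest labelled
    neighbour (1-nearest-neighbour, squared Euclidean distance)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    nearest = min(labelled_points, key=lambda pair: dist2(pair[0], point))
    return nearest[1]


def summarize(values):
    """Summarization: a compact description of a numeric data set."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Hypothetical e-mail features (e.g. two numeric scores per message), assumed
# only to mirror the "legitimate" vs "spam" example in the text.
EMAILS = [
    ((1.0, 1.0), "legitimate"),
    ((1.2, 0.8), "legitimate"),
    ((5.0, 5.0), "spam"),
    ((5.5, 4.5), "spam"),
]

if __name__ == "__main__":
    readings = [10, 11, 9, 10, 12, 10, 95]
    print(find_anomalies(readings))
    print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))
    print(classify(EMAILS, (4.8, 5.2)))
    print(summarize(readings))
```

With the sample readings, the single value 95 is flagged as an anomaly, the fitted line has slope 2 and intercept 1, and the new point lands closest to the "spam" examples. Association rule learning and clustering are left out here only to keep the sketch short.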