What does Knowledge Discovery in Databases (KDD) mean?

Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge about the data sets, and interpreting accurate solutions from the observed results. Major KDD application areas include marketing, fraud detection, telecommunications and manufacturing.

Traditionally, data mining and knowledge discovery were performed manually. As time passed, the amount of data in many systems grew beyond terabyte size and could no longer be maintained manually. Moreover, discovering underlying patterns in data is considered essential for the successful existence of any business. As a result, several software tools were developed to discover hidden data and make assumptions, which formed a part of artificial intelligence.

The KDD process has reached its peak in the last 10 years. It now houses many different approaches to discovery, including inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems and information theory. The ultimate goal is to extract high-level knowledge from low-level data.

KDD includes multidisciplinary activities. These encompass data storage and access, scaling algorithms to massive data sets and interpreting results. The data cleansing and data access processes included in data warehousing facilitate the KDD process. Artificial intelligence also supports KDD by discovering empirical laws from experimentation and observations. The patterns recognized in the data must be valid on new data and possess some degree of certainty. These patterns are considered new knowledge.

Steps involved in the entire KDD process are:
1. Identify the goal of the KDD process from the customer's perspective.
2. Understand the application domains involved and the knowledge that is required.
3. Select a target data set or subset of data samples on which discovery is to be performed.
4. Cleanse and preprocess the data by deciding on strategies to handle missing fields and altering the data as required.
5. Simplify the data sets by removing unwanted variables. Then analyze useful features that can be used to represent the data, depending on the goal or task.
6. Match the KDD goals with data mining methods to suggest hidden patterns.
7. Choose data mining algorithms to discover hidden patterns. This includes deciding which models and parameters might be appropriate for the overall KDD process.
8. Search for patterns of interest in a particular representational form, such as classification rules or trees, regression and clustering.
9. Interpret essential knowledge from the mined patterns.
10. Use the knowledge and incorporate it into another system for further action.
11. Document it and make reports for interested parties.

IMPORTANCE OF BI

Business Intelligence is a concept that typically involves the delivery and integration of relevant and useful business information in an organization. Companies use business intelligence to detect significant events and to identify and monitor business trends so that they can adapt quickly to a changing environment or scenario. If you use effective business intelligence training in your organization, you can improve the decision-making processes at all levels of management and improve your tactical and strategic management processes. Here are some of the top reasons for investing in a proper business intelligence system.

To Get Insights into Consumer Behavior

One of the main advantages of investing in business intelligence software and skilled personnel is that it will boost your ability to analyze current consumer buying trends.
Once you understand what your consumers are buying, you can use this information to develop products that match current consumption trends and consequently improve your profitability, since you will be able to attract valuable customers.

To Improve Visibility

If you want to improve your control over various important processes in your organization, you should consider investing in a good business intelligence system. Business intelligence software will improve the visibility of these processes and make it possible to identify any areas that need improvement. Moreover, if you currently have to skim through hundreds of pages of detailed periodic reports to assess the performance of your organization's processes, you can save time and improve productivity by having skilled intelligence analysts use the software.

To Turn Data into Actionable Information

A business intelligence system is an analytical tool that can give you the insight you need to make successful strategic plans for your organization. Such a system can identify key trends and patterns in your organization's data and consequently make it easier for you to make important connections between different areas of your business that may otherwise seem unrelated. As such, a business intelligence system can help you better understand the implications of various organizational processes and enhance your ability to identify suitable opportunities for your organization, thus enabling you to plan for a successful future.

To Improve Efficiency

One of the most important reasons to invest in an effective business intelligence system is that such a system can improve efficiency within your organization and, as a result, increase productivity. You can use business intelligence to share information across different departments in your organization. This will enable you to save time on reporting processes and analytics.
This ease in information sharing is likely to reduce duplication of roles and duties within the organization and improve the accuracy and usefulness of the data generated by different departments. Furthermore, information sharing also saves time and improves productivity.

Conclusion

In order to reap all the benefits of an effective business intelligence system, ensure you invest in skilled business intelligence personnel and in software designed for analytical efficiency and accessibility. You should also make sure that the system you choose can analyze both the content and the context of data.

Data Cleaning

Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting this dirty or coarse data.[1]

After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores.

Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data.

The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records). Some data cleansing solutions clean data by cross-checking it against a validated data set.
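The difference between strict and fuzzy validation can be sketched in Python. This is only an illustration under stated assumptions: the five-digit postal code format, the KNOWN_CITIES reference list, the sample records and the 0.8 similarity cutoff are all invented here and do not come from any particular cleansing product. The fuzzy step uses difflib.get_close_matches from the Python standard library.

```python
import difflib
import re

# Assumption: postal codes are exactly five digits; real formats vary by country.
POSTAL_CODE_RE = re.compile(r"\d{5}")

# Hypothetical validated reference list of known city names.
KNOWN_CITIES = ["Springfield", "Shelbyville", "Capital City"]


def has_valid_postal_code(record):
    """Strict check: True only if the postal code matches the assumed format."""
    return bool(POSTAL_CODE_RE.fullmatch(str(record.get("postal_code") or "")))


def fuzzy_correct(value, known_values, cutoff=0.8):
    """Fuzzy check: return the closest known value, or None if none is close enough."""
    matches = difflib.get_close_matches(value, known_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def cleanse(records):
    """Split records into corrected ones and ones that could not be repaired."""
    cleaned, rejected = [], []
    for record in records:
        city = fuzzy_correct(str(record.get("city") or ""), KNOWN_CITIES)
        if has_valid_postal_code(record) and city is not None:
            # Keep the record, replacing the city with its validated spelling.
            cleaned.append({**record, "city": city})
        else:
            rejected.append(record)
    return cleaned, rejected


# Hypothetical input records.
RECORDS = [
    {"city": "Springfeild", "postal_code": "12345"},  # typo that fuzzy matching can fix
    {"city": "Shelbyville", "postal_code": "1234"},   # fails the strict postal code check
    {"city": "Gotham", "postal_code": "54321"},       # no close match in the reference list
]

cleaned, rejected = cleanse(RECORDS)
print(cleaned)
print(rejected)
```

The cutoff is the main design choice: a lower value corrects more typos but raises the risk of mapping a genuinely new value onto the wrong known entity, so rejected records are returned for manual review rather than dropped.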
Data enhancement, where data is made more complete by adding related information, is also a common data cleansing practice: for example, appending to each address the phone numbers related to that address. Data cleansing may also involve activities such as harmonization and standardization of data. An example of harmonization is expanding short codes (St, Rd, etc.) to the actual words (Street, Road). Standardization of data is a means of changing a reference data set to a new standard, for example, the use of standard codes.

The process of data cleansing

• Data auditing: The data is audited with the use of statistical and database methods to detect anomalies and contradictions. This eventually gives an indication of the characteristics of the anomalies and their locations. Several commercial software packages will let you specify constraints of various kinds (using a grammar that conforms to that of a standard programming language, e.g., JavaScript or Visual Basic) and then generate code that checks the data for violation of these constraints. This process is referred to below in the bullets "workflow specification" and "workflow execution." For users who lack access to high-end cleansing software, microcomputer database packages such as Microsoft Access or FileMaker Pro will also let you perform such checks, on a constraint-by-constraint basis, interactively and with little or no programming required in many cases.
• Workflow specification: The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered.
• Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be computationally expensive.
• Post-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is manually corrected, if possible. The result is a new cycle in the data-cleansing process, where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.

OLAP server

Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a database management system (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. The types are:

Relational OLAP (ROLAP) servers: These are the intermediate servers that stand between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. MicroStrategy's DSS server and Informix's Metacube, for example, adopt the ROLAP approach.

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures.
For example, Essbase from Hyperion is a MOLAP server. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored. Many MOLAP servers adopt a two-level storage representation to handle sparse and dense data sets: the dense subcubes are identified and stored as array structures, while the sparse subcubes employ compression technology for efficient storage utilization.

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.

Specialized SQL servers: To meet the growing demand for OLAP processing in relational databases, some relational and data warehousing firms (e.g., Red Brick from Informix) implement specialized SQL servers that provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

ASSOCIATION RULE MINING

Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a transactional database, relational database or other information repository. An example of an association rule would be "If a customer buys a dozen eggs, he is 80% likely to also purchase milk."

An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data.
A consequent is an item that is found in combination with the antecedent.

Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the database. Confidence indicates how often the if/then statement has been found to be true.

In data mining, association rules are useful for analyzing and predicting customer behavior. They play an important part in shopping basket data analysis, product clustering, catalog design and store layout.

Programmers use association rules to build programs capable of machine learning. Machine learning is a type of artificial intelligence (AI) that seeks to build programs with the ability to become more efficient without being explicitly programmed.

In the formulas below, A ∪ B is the union of the two itemsets, so P(A ∪ B) is the probability that a tuple contains every item of both A and B.

• Confidence(A ⇒ B) = (number of tuples containing both A and B) / (number of tuples containing A) = P(B | A) = P(A ∪ B) / P(A)
• Support(A ⇒ B) = (number of tuples containing both A and B) / (total number of tuples) = P(A ∪ B)
• Frequent itemsets: the sets of items that have at least the minimum support (denoted by Li for the i-th itemset).
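The support and confidence definitions above can be sketched in a few lines of Python. The transactions are hypothetical shopping baskets invented for illustration, and the function names are likewise assumptions, not part of any library. The frequent-itemset search is a plain brute-force enumeration, not an optimized algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
TRANSACTIONS = [
    frozenset({"eggs", "milk", "bread"}),
    frozenset({"eggs", "milk"}),
    frozenset({"eggs", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"eggs", "milk", "butter"}),
]


def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain both antecedent and consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    n_antecedent = count_containing(transactions, antecedent)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    both = frozenset(antecedent) | frozenset(consequent)
    return count_containing(transactions, both) / n_antecedent


def frequent_itemsets(transactions, size, min_support):
    """Brute-force list of itemsets of the given size whose support is >= min_support."""
    all_items = sorted(set().union(*transactions))
    n = len(transactions)
    return [
        frozenset(candidate)
        for candidate in combinations(all_items, size)
        if count_containing(transactions, candidate) / n >= min_support
    ]


# Rule: "if eggs then milk"
print(support(TRANSACTIONS, {"eggs"}, {"milk"}))     # 0.6  (3 of 5 baskets)
print(confidence(TRANSACTIONS, {"eggs"}, {"milk"}))  # 0.75 (3 of the 4 baskets with eggs)
print(frequent_itemsets(TRANSACTIONS, 2, 0.5))       # only the {eggs, milk} pair qualifies
```

Confidence is computed from raw counts rather than by dividing two support values, which gives the same ratio while avoiding floating-point rounding in the intermediate fractions.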