DATA MINING AND DATA WAREHOUSING

1- Discuss the typical OLAP operations with an example.

An OLAP cube is a set of data, organized in a way that facilitates non-predetermined queries for aggregated information, or in other words, online analytical processing.[1] OLAP is one of the computer-based techniques for analyzing business data that are collectively called business intelligence.[2] OLAP cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. For example, a company might wish to analyze some financial data by product, by time period, by city, by type of revenue and cost, and by comparing actual data with a budget. These additional methods of analyzing the data are known as dimensions.[3] Because there can be more than three dimensions in an OLAP system, the term hypercube is sometimes used.

The OLAP cube consists of numeric facts called measures which are categorized by dimensions. The cube metadata (structure) may be created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.

A financial analyst might want to view or pivot the data in various ways, such as displaying all the cities down the page and all the products across the page. This could be for a specified period, version and type of expenditure. Having seen the data in this particular way, the analyst might then immediately wish to view it in another way. The cube could effectively be re-oriented so that the data displayed now has periods across the page and types of cost down the page. Because this re-orientation involves re-summarizing very large amounts of data, this new view of the data has to be generated efficiently to avoid wasting the analyst's time, i.e., within seconds rather than the hours a relational database and conventional report-writer might have taken.[4]

Each of the elements of a dimension can be summarized using a hierarchy.[5] The hierarchy is a series of parent-child relationships, typically where a parent member represents the consolidation of the members which are its children. Parent members can be further aggregated as the children of another parent.[6] For example, May 2005 could be summarized into Second Quarter 2005, which in turn would be summarized into the Year 2005. Similarly, the cities could be summarized into regions, countries and then global regions; products could be summarized into larger categories; and cost headings could be grouped into types of expenditure (this upward summarization is the roll-up operation). Conversely, the analyst could start at a highly summarized level, such as the total difference between the actual results and the budget, and drill down into the cube to discover which locations, products and periods had produced this difference.

2- Discuss how computations can be performed efficiently on data cubes.

Users of decision support systems often see data in the form of data cubes. The cube is used to represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database and the cells in the data cube represent the measure of interest. For example, they could contain a count of the number of times that attribute combination occurs in the database, or the minimum, maximum, sum or average value of some attribute.
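The pivot, roll-up and drill-down operations described above can be illustrated with a small sketch. The table, column names and values below are hypothetical, and pandas is assumed to be available; this is an illustration of the operations themselves, not the implementation of any particular OLAP server.

import pandas as pd

# Hypothetical fact table: one row per (month, city, product) with a sales measure.
sales = pd.DataFrame({
    "month":   ["2005-05", "2005-05", "2005-06", "2005-06"],
    "quarter": ["2005-Q2", "2005-Q2", "2005-Q2", "2005-Q2"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "product": ["Widget", "Widget", "Gadget", "Widget"],
    "amount":  [100, 150, 80, 120],
})

# Pivot: cities down the page, products across the page.
view1 = sales.pivot_table(index="city", columns="product",
                          values="amount", aggfunc="sum", fill_value=0)

# Re-orient (pivot again): months across the page, products down the page.
view2 = sales.pivot_table(index="product", columns="month",
                          values="amount", aggfunc="sum", fill_value=0)

# Roll-up: summarize months into quarters along the time hierarchy.
rollup = sales.groupby(["quarter", "city"])["amount"].sum()

# Drill-down: return from quarters to the more detailed month level.
drilldown = sales.groupby(["quarter", "month", "city"])["amount"].sum()

print(view1, view2, rollup, drilldown, sep="\n\n")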
Queries are performed on the cube to retrieve decision support information.

Example: We have a database that contains transaction information relating company sales of a part to a customer at a store location. The data cube formed from this database is a 3-dimensional representation, with each cell (p, c, s) of the cube representing a combination of values from part, customer and store location; a sample data cube can be built for this combination. The content of each cell is the count of the number of times that specific combination of values occurs together in the database. Cells that appear blank in fact have a value of zero. The cube can then be used to retrieve information within the database about, for example, which store should be given a certain part to sell in order to make the greatest sales.

The goal is to retrieve the decision support information from the data cube in the most efficient way possible. Three possible solutions are:
1. Pre-compute all cells in the cube.
2. Pre-compute no cells.
3. Pre-compute some of the cells.

If the whole cube is pre-computed, then queries run on the cube will be very fast. The disadvantage is that the pre-computed cube requires a lot of memory. The size of a cube for n attributes A1, ..., An with cardinalities |A1|, ..., |An| is the product |A1| x |A2| x ... x |An|. This size increases exponentially with the number of attributes and linearly with the cardinalities of those attributes. To minimize memory requirements, we can pre-compute none of the cells in the cube. The disadvantage here is that queries on the cube will run more slowly because the cube will need to be rebuilt for each query. As a compromise between these two, we can pre-compute only those cells in the cube which will most likely be used for decision support queries. The trade-off between memory space and computing time is called the space-time trade-off, and it often arises in data mining and in computer science in general.

m-Dimensional Array: A data cube built from m attributes can be stored as an m-dimensional array. Each element of the array contains the measure value, such as a count. The array itself can be represented as a 1-dimensional array. For example, a 2-dimensional array of size x by y can be stored as a 1-dimensional array of size x*y, where element (i, j) in the 2-D array is stored in location (y*i + j) in the 1-D array (for instance, element (2, 3) of a 4 by 5 array maps to position 5*2 + 3 = 13). The disadvantage of storing the cube directly as an array is that most data cubes are sparse, so the array will contain many empty elements (zero values).

3- Write short notes on data warehouse metadata.

The term metadata is used for two similar but fundamentally different concepts (types). The usual explanation is "data about data", but this does not apply to both concepts in the same way. In the first instance, structural metadata means the specification of data structures. This cannot be about data, because the actual data content is unknown when the data structures are being designed. In this case the correct description would be "data about the containers of data". Descriptive metadata, on the other hand, is about individual instances of application data, the data content. In this case, a useful description (resulting in a disambiguating neologism) would be "data about data content" or "content about content", thus metacontent. Descriptive, Guide and the National Information Standards Organization concept of administrative metadata are all subtypes of metacontent. Metadata (metacontent) is traditionally found in the card catalogs of libraries.
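The distinction between structural and descriptive metadata can be made concrete with a small sketch. The schema and record below are invented for illustration, and the 13-digit identifier is a placeholder rather than a real ISBN.

# Structural metadata: describes the container of the data, not any particular book.
book_table_schema = {
    "table": "books",
    "columns": [
        {"name": "isbn",   "type": "CHAR(13)"},
        {"name": "title",  "type": "VARCHAR(200)"},
        {"name": "author", "type": "VARCHAR(100)"},
    ],
}

# Descriptive metadata (metacontent): describes one concrete item of content.
book_record_metadata = {
    "isbn": "9781234567897",   # placeholder 13-digit identifier
    "title": "Example Title",
    "language": "en",
    "subjects": ["computer science", "data warehousing"],
}

print(book_table_schema["columns"][0]["name"], book_record_metadata["isbn"])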
As information has become increasingly digital, metadata is also used to describe digital data using metadata standards specific to a particular discipline. By describing the contents and context of data files, the quality of the original data/files is greatly increased. For example, a webpage may include metadata specifying what language it is written in, what tools were used to create it, and where to go for more on the subject, allowing browsers to automatically improve the experience of users.

Metadata is data. As such, metadata can be stored and managed in a database, often called a metadata registry or metadata repository.[1] However, without context and a point of reference, it can be impossible to identify metadata just by looking at it.[2] For example, by itself, a database containing several numbers, all 13 digits long, could be the results of calculations or a list of numbers to plug into an equation; without any other context, the numbers themselves can be perceived as the data. But if given the context that this database is a log of a book collection, those 13-digit numbers may now be ISBNs - information that refers to the book but is not itself the information within the book. The term "metadata" was coined in 1968 by Philip Bagley, one of the pioneers of computerized document retrieval.[3][4] Since then the fields of information management, information science, information technology, librarianship and GIS have widely adopted the term. In these fields the word metadata is defined as "data about data".[5] While this is the generally accepted definition, various disciplines have adopted their own more specific explanations and uses of the term.

4- Explain various methods of data cleaning in detail.

Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data. The main stages of the data-cleansing process are the following.

Data auditing: The data is audited with the use of statistical methods to detect anomalies and contradictions. This eventually gives an indication of the characteristics of the anomalies and their locations.

Workflow specification: The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered. For instance, if we find that an anomaly is a result of typing errors in data input stages, the layout of the keyboard can help in suggesting possible solutions.

Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified.
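As a rough sketch of the auditing and workflow-execution stages just described, the following assumes pandas is available; the column names, correction rules and values are invented for illustration, not a complete cleansing workflow.

import numpy as np
import pandas as pd

# Hypothetical record set containing some dirty values.
people = pd.DataFrame({
    "name": ["Alice", "Bob", "  carol ", None],
    "age":  [34.0, -5.0, 29.0, 41.0],
})

# Data auditing: simple rules/statistics flag anomalies and contradictions.
bad_age = (people["age"] < 0) | (people["age"] > 120)
missing_name = people["name"].isna()
print("anomalous ages:", int(bad_age.sum()), "missing names:", int(missing_name.sum()))

# Workflow specification and execution: a fixed sequence of correcting operations.
cleaned = people.copy()
cleaned["name"] = cleaned["name"].str.strip().str.title()  # normalize name formatting
cleaned.loc[bad_age, "age"] = np.nan                       # impossible ages become missing
cleaned = cleaned.dropna(subset=["name"])                  # drop records that cannot be repaired

# Re-audit the result to verify that the anomalies are gone.
assert not ((cleaned["age"] < 0) | (cleaned["age"] > 120)).any()
print(cleaned)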
The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be computationally expensive.

Post-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is manually corrected, if possible. The result is a new cycle in the data-cleansing process, where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.

5- Give an account of data mining query languages.

6- How is Attribute-Oriented Induction implemented? Explain in detail.

Data mining has become an important technique with tremendous potential in many commercial and industrial applications. Attribute-oriented induction is a powerful mining technique and has been successfully implemented in the data mining system DBMiner (Han et al., Proc. 1996 Int'l Conf. on Data Mining and Knowledge Discovery (KDD'96), Portland, Oregon, 1996). However, its induction capability is limited by unconditional concept generalization. Concept generalization can be extended to a rule-based concept hierarchy, which greatly enhances the induction power. When the previously proposed induction algorithm is applied to this more general rule-based case, a problem of induction anomaly occurs which impacts its efficiency. An efficient algorithm has been developed to facilitate induction on the rule-based case while avoiding the anomaly. Performance studies have shown that this algorithm is superior to a previously proposed algorithm based on backtracking. Related keywords: data mining; knowledge discovery in databases; rule-based concept generalization; rule-based concept hierarchy; attribute-oriented induction; inductive learning; learning and adaptive systems.

A novel star schema attribute induction has also been proposed as a new attribute induction paradigm that improves on the current attribute-oriented induction. It is examined against current attribute-oriented induction based on the characteristic rule and using a non-rule-based concept hierarchy, by implementing both approaches. In novel star schema attribute induction, several improvements are made: elimination of the threshold number as the maximum-tuple control for the generalization result, removal of "ANY" as the most general concept, replacement of the concept hierarchy with a concept tree, simplification of the generalization strategy steps, and elimination of the attribute-oriented induction algorithm. Novel star schema attribute induction is more powerful than the current attribute-oriented induction, since it can produce a small number of final generalized tuples and there is no "ANY" in the results. (A small illustrative sketch of basic attribute-oriented induction is given after question 8 below.)

7- Write and explain the algorithm for mining frequent item sets without candidate generation. Give a relevant example.

8- Discuss the approaches for mining multi-level association rules from transactional databases. Give a relevant example.
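To make the basic, non-rule-based attribute-oriented induction discussed under question 6 concrete, the following is a minimal sketch. The concept hierarchies, attribute names and generalization threshold are invented for illustration and do not reproduce DBMiner or the rule-based algorithm described above.

from collections import Counter

# Hypothetical concept hierarchies: map low-level values to higher-level concepts.
city_hierarchy = {"Delhi": "India", "Mumbai": "India", "Paris": "France", "Lyon": "France"}
major_hierarchy = {"physics": "science", "biology": "science", "history": "arts"}

# Initial relation: (city, major) tuples from a hypothetical student table.
tuples = [("Delhi", "physics"), ("Mumbai", "biology"), ("Delhi", "biology"),
          ("Paris", "history"), ("Lyon", "history"), ("Paris", "physics")]

def generalize(relation, hierarchy, attr_index):
    # Climb one level of the concept hierarchy on the given attribute.
    out = []
    for t in relation:
        t = list(t)
        t[attr_index] = hierarchy.get(t[attr_index], t[attr_index])
        out.append(tuple(t))
    return out

threshold = 2  # attribute generalization threshold: max distinct values allowed per attribute

relation = tuples
for idx, hierarchy in ((0, city_hierarchy), (1, major_hierarchy)):
    # Generalize the attribute while it has too many distinct low-level values.
    while len({t[idx] for t in relation}) > threshold:
        new_relation = generalize(relation, hierarchy, idx)
        if new_relation == relation:  # hierarchy exhausted; cannot generalize further
            break
        relation = new_relation

# Merge identical generalized tuples and keep a count, giving a characteristic summary.
summary = Counter(relation)
for generalized_tuple, count in summary.items():
    print(generalized_tuple, "count =", count)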