Eighth Sem Regular Examination-2015
Solution of Data and Web Mining
Branch: CSE

1. (a) Why are concept hierarchies useful in data mining tasks?
Ans. Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities. For a typical data mining task, the following basic steps are executed, and concept hierarchies play a key role in each of them:
1. Retrieval of the task-related data set and generation of a data cube.
2. Generalization of the raw data to a certain higher abstraction level.
3. Further generalization or specialization (multiple-level rule mining).
4. Display of the discovered knowledge.

(b) What are multidimensional association rules? Explain with an example.
Ans. Items in multidimensional association rules refer to two or more dimensions or predicates, e.g., "buys", "time_of_transaction", "customer_category". Attribute A in a rule is assumed to have value a, attribute B value b, and attribute C value c in the same tuple. For example, time_of_transaction(X, "evening") AND customer_category(X, "student") => buys(X, "coffee") is a multidimensional rule over three predicates.

(c) Differentiate between descriptive and predictive data mining.
Ans. Data analysis is commonly divided into predictive analysis (forecasting) and descriptive analysis (business intelligence and data mining). Predictive analytics turns data into valuable, actionable information: it uses data to determine the probable future outcome of an event or the likelihood of a situation occurring. Descriptive analytics looks at data and analyzes past events for insight into how to approach the future.

(d) Give examples of at least four categories of clustering methods.
Ans. Clustering methods are divided into hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods.

(e) How will you solve a classification problem using a decision tree?
Ans. Decision tree learning uses a decision tree as a predictive model that maps observations about an item to conclusions about the item's target value. It is one of the predictive modelling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.

(f) Discuss briefly whether or not each of the following activities is a data mining task:
(1) Predicting the outcomes of tossing a (fair) pair of dice.
(2) Predicting the future stock price of a company using historical records.
Ans. (1) No. The outcomes of tossing a fair pair of dice follow a known probability distribution, so nothing needs to be learned from data. (2) Yes. A predictive model must be built from the historical records, which is a data mining task.

(g) What is the difference between web content mining and web usage mining?
Ans. Web usage mining is the process of extracting useful information from server logs, i.e., finding out what users are looking for on the Internet. Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from web page content.

(h) How does the PageRank algorithm work?
Ans. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

(i) How does string matching work? Explain with an example.
Ans. Approximate string matching is the technique of finding strings that match a pattern approximately. The problem is classified into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that match the pattern approximately. For example, "kitten" matches "sitting" within an edit distance of 3 (substitute k->s, substitute e->i, insert g).
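For illustration, the edit distance underlying approximate matching can be computed with the classic dynamic-programming recurrence. The following Python sketch is only illustrative (the function name and test strings are not from the original answer):

def edit_distance(s, t):
    # dp[i][j] = number of edits needed to turn s[:i] into t[:j].
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # prints 3

A string matches a pattern "approximately" when this distance is below a chosen threshold.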
(j) What is opinion spam? Explain.
Ans. Opinion spamming refers to "illegal" activities (e.g., writing fake reviews, also called shilling) that try to mislead readers or automated opinion mining and sentiment analysis systems by giving undeserved positive opinions to some target entities in order to promote them, and/or by giving false negative opinions to other entities in order to damage their reputations. Opinion spam has many forms, e.g., fake reviews (also called bogus reviews), fake comments, fake blogs, fake social network postings, deceptions, and deceptive messages.

2. (a) Write and explain the algorithm for mining frequent itemsets without candidate generation.
Ans. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split into two separate steps:
1. First, minimum support is applied to find all frequent itemsets in the database.
2. Second, these frequent itemsets and the minimum confidence constraint are used to form rules.
While the second step is straightforward, the first step needs more attention. Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets (item combinations). The set of possible itemsets is the power set over I and has size 2^n - 1 (excluding the empty set, which is not a valid itemset). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward closure property of support (also called anti-monotonicity [6]), which guarantees that for a frequent itemset all its subsets are also frequent, and thus for an infrequent itemset all its supersets must also be infrequent. Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all frequent itemsets. The FP-growth algorithm goes further and mines frequent itemsets without candidate generation: it compresses the database into a frequent-pattern tree (FP-tree) and recursively mines conditional pattern bases from the tree, so candidate itemsets are never generated and tested.

(b) A database has nine transactions; let min_sup = 30%.

TID   List of Item_IDs
1     a, b, e
2     b, d
3     b, c
4     a, b, d
5     a, c
6     b, c
7     a, c
8     a, b, c, e
9     a, b, c

Find all frequent itemsets using the above algorithm.
Ans. With min_sup = 30% of 9 transactions, an itemset must occur in at least 3 transactions. The frequent 1-itemsets are {a} (support count 6), {b} (7), and {c} (6); d and e occur only twice each. The frequent 2-itemsets are {a,b} (4), {a,c} (4), and {b,c} (4). {a,b,c} occurs in only 2 transactions, so there is no frequent 3-itemset.
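These counts can be checked with a brute-force Python sketch (illustrative only; real FP-growth builds and mines a tree instead of enumerating combinations):

from itertools import combinations

transactions = [
    {'a', 'b', 'e'}, {'b', 'd'}, {'b', 'c'}, {'a', 'b', 'd'}, {'a', 'c'},
    {'b', 'c'}, {'a', 'c'}, {'a', 'b', 'c', 'e'}, {'a', 'b', 'c'},
]
min_count = 3  # 30% of 9 transactions, rounded up

items = sorted(set().union(*transactions))
for size in (1, 2, 3):
    for candidate in combinations(items, size):
        # Support count = number of transactions containing the candidate.
        count = sum(set(candidate) <= t for t in transactions)
        if count >= min_count:
            print(set(candidate), count)
# Output: {a}:6 {b}:7 {c}:6 {a,b}:4 {a,c}:4 {b,c}:4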
3. (a) Clustering has been popularly recognized as an important data mining task with broad applications. Give an application example for each of the following cases:
(a) An application that takes clustering as a major data mining function.
(b) An application that takes clustering as a preprocessing tool for data preparation for other data mining tasks.
Ans. Clustering is a technique that divides data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data with fewer clusters necessarily loses certain fine details (akin to lossy data compression) but achieves simplification: many data objects are represented by a few clusters, and hence the data is modelled by its clusters. Data modelling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. For case (a), customer segmentation in marketing takes clustering as the major data mining function: customers are grouped purely to discover market segments. For case (b), clustering can serve as a preprocessing tool, e.g., grouping data objects first and then running classification or outlier detection on each cluster separately.
Data mining applications add three complications to this general picture: (a) large databases, (b) many attributes, and (c) attributes of different types. This imposes severe computational requirements on the analysis. Data mining applications include scientific data exploration, information retrieval, text mining, spatial databases, web analysis, CRM, marketing, medical diagnostics, computational biology, and many others.

(b) Ans. The term "k-means" was first used by James MacQueen in 1967; the standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it was not published until 1982. K-means is a widely used partitional clustering method in industry. It is the most commonly used partitional clustering algorithm because it can be easily implemented and is among the most efficient in terms of execution time. Density-based clustering algorithms try to find clusters based on the density of data points in a region. The key idea of density-based clustering is that for each instance of a cluster, the neighbourhood of a given radius (Eps) has to contain at least a minimum number of instances (MinPts). One of the most well-known density-based clustering algorithms is DBSCAN. DBSCAN separates data points into three classes:
1. Core points: points in the interior of a cluster.
2. Border points: a border point is not a core point, but falls within the neighbourhood of a core point.
3. Noise points: a noise point is any point that is neither a core point nor a border point.

4. (a) Construct the FP-tree for the given transaction DB:

TID   Frequent Items
100   f, c, a, m, p
200   f, c, a, b, m
300   f, b
400   c, b, p
500   f, c, a, m, p

Ans. Header table (items in descending frequency order, each with a node-link into the tree): f, c, a, b, m, p. The FP-tree for the given data:

{}
|- f:4
|  |- c:3
|  |  |- a:3
|  |     |- m:2
|  |     |  |- p:2
|  |     |- b:1
|  |        |- m:1
|  |- b:1
|- c:1
   |- b:1
      |- p:1

(b) Preprocessing is an important task in data mining. Justify.
Ans. Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, and so on. Analyzing data that has not been carefully screened for such problems can produce misleading results; thus the representation and quality of the data come first and foremost, before running any analysis. Real-world data are generally:
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
- Noisy: containing errors or outliers.
- Inconsistent: containing discrepancies in codes or names.
If much irrelevant and redundant information is present, or the data is noisy and unreliable, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data preprocessing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data preprocessing is the final training set.
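A small pandas sketch of the cleaning steps just described (the column names, toy records, and repair rules are illustrative assumptions, not a prescribed pipeline):

import pandas as pd

# Toy records exhibiting the problems described above.
df = pd.DataFrame({
    'income': [52000, -100, None, 61000],    # -100 is out of range, None is missing
    'sex': ['M', 'M', 'F', 'F'],
    'pregnant': ['No', 'Yes', 'Yes', 'No'],   # 'M' with 'Yes' is impossible
})

# 1. Flag out-of-range values as missing.
df.loc[df['income'] < 0, 'income'] = None
# 2. Drop impossible attribute combinations.
df = df[~((df['sex'] == 'M') & (df['pregnant'] == 'Yes'))]
# 3. Impute remaining missing values (here: median imputation).
df['income'] = df['income'].fillna(df['income'].median())
print(df)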
5. (a) Explain the process of mining the WWW.
Ans. The advent of the World Wide Web (WWW) has overwhelmed the typical home computer user with an enormous flood of information. To cope with this abundance, users of the WWW need to rely on intelligent tools that assist them in finding, sorting, and filtering the available information. Just as data mining aims at discovering valuable information that is hidden in conventional databases, the emerging field of web mining aims at finding and extracting relevant information that is hidden in Web-related data, in particular in text documents published on the web. Depending on the nature of the data, one can distinguish three main areas of research within the web mining community:
1. Web content mining: application of data mining techniques to unstructured or semi-structured data, usually HTML documents.
2. Web structure mining: use of the hyperlink structure of the Web as an (additional) information source.
3. Web usage mining: analysis of user interactions with a web server (e.g., click-stream analysis), i.e., collecting data from web log records.
[Figure: process diagram for mining the WWW]

(b) Explain the ways in which descriptive mining of complex data objects is identified, with an example.
Ans. A major limitation of many commercial data warehouse and OLAP tools for multidimensional database analysis is their restriction on the allowable data types for dimensions and measures. Most data cube implementations confine dimensions to non-numeric data and measures to simple aggregated values. To introduce data mining and multidimensional data analysis for complex objects, this section examines how to perform generalization on complex structured objects and construct object cubes for OLAP and mining in object databases. The storage and access of complex structured data have been studied in object-relational and object-oriented database systems. These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies. Each object in a class is associated with:
1. An object identifier;
2. A set of attributes that may contain sophisticated data structures, set- or list-valued data, class composition hierarchies, multimedia data, and so on; and
3. A set of methods that specify the computational routines or rules associated with the object class.
To facilitate generalization and induction in object-relational and object-oriented databases, it is important to study how the generalized data can be used for multidimensional data analysis and data mining. Suppose that we have different pieces of land used for various agricultural purposes, such as the planting of vegetables, grains, and fruits. These pieces can be merged or aggregated into one large piece of agricultural land by a spatial merge. However, such a piece of agricultural land may contain highways, houses, small stores, and so on. If the majority of the land is used for agriculture, the scattered regions for other purposes can be ignored, and the whole region can be claimed as an agricultural area by approximation. A multimedia database may contain complex texts, graphics, images, video fragments, maps, voice, music, and other forms of audio/video information. Multimedia data are typically stored as sequences of bytes with variable lengths, and segments of data are linked together or indexed in a multidimensional way for easy reference. Recognition and extraction of the essential features and/or general patterns of such data make it possible to perform generalization on multimedia data. There are many ways to extract such information. For an image, aggregation and/or approximation can extract the size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions. For a segment of music, its melody can be summarized based on its tone, tempo, or the major musical instruments played. For an article, its abstract or general organizational structure (e.g., the table of contents, or the subject and index terms that frequently occur in it) may serve as its generalization. In general, it is a challenging task to generalize spatial and multimedia data in order to extract the interesting knowledge implicitly stored in them. Technologies developed for spatial and multimedia databases, such as spatial data accessing and analysis techniques, content-based image retrieval, and multidimensional indexing methods, should be integrated with data generalization and data mining techniques to achieve satisfactory results. Techniques for mining such data are discussed further in the following sections.
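As a toy illustration of aggregation-based generalization on multimedia data, the sketch below reduces an image to a few coarse features. It assumes the Pillow library is installed, and 'photo.jpg' is only a placeholder path:

from PIL import Image

def generalize_image(path):
    # Generalize an image to aggregate features: its size and the
    # average colour over all pixels (a crude stand-in for the richer
    # features, e.g. shape and texture, mentioned above).
    img = Image.open(path).convert('RGB')
    pixels = list(img.getdata())
    n = len(pixels)
    mean_rgb = tuple(sum(channel) // n for channel in zip(*pixels))
    return {'width': img.width, 'height': img.height, 'mean_rgb': mean_rgb}

print(generalize_image('photo.jpg'))  # e.g. {'width': ..., 'height': ..., 'mean_rgb': (r, g, b)}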
6. (a) Explain preprocessing in a web mining application.
Ans. Web usage mining is the activity of automatically discovering user access patterns from one or more web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, traditional market analysis techniques and strategies need to be revisited in this context. Organizations generate and collect large amounts of data in their daily activities; most of this information is generated automatically and collected by web servers in server access logs. Ideally, the input to the web usage mining process is a user session file that gives an accurate account of who visited the web site, which pages were requested and in what order, and how long each page was viewed. A user session is the set of pages a user reaches during one visit to a web site. However, the information contained in a raw web server log does not reliably represent a user session file, so the data must be preprocessed before mining. Data preprocessing generally covers data cleaning, user identification, session identification, and path completion.
[Figure: phases of data preprocessing in web mining]

(b) How can web mining tools answer which advertising campaign results in the most purchases?
Ans. As online advertising banners become more popular, companies using them need to measure accurately their overall return on advertising investment. This benefits both advertisers and the sites running the ads, because it allows advertising rates to vary according to their success. Proper measurement of advertising centers on two specific areas:
- Quantity: how many impressions were delivered for each ad banner and page, and how many people clicked on each ad? These are usually reported as impressions and click-throughs.
- Quality: of the people who clicked on an ad banner, how many actually purchased? This return is best measured by subtracting advertising expenses from the resulting revenue.
For companies offering ad space on their site, reporting ad impressions and click-through rates for any page running advertisements is important. For companies running banner ads on other sites, prospect quality can be measured. A manager should evaluate both the effectiveness of individual ad banners and the effectiveness of each web page carrying an ad. By combining these, an advertiser optimizes his or her advertising by selecting the best combination of ad banner and web page for additional ad placements.
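A minimal sketch of how such a tool could aggregate impressions, click-throughs, and purchases per campaign (the event-record format and campaign names are illustrative assumptions):

from collections import defaultdict

# Each event: (campaign_id, event_type), as a web mining tool might
# extract them from ad-server and transaction logs.
events = [
    ('summer_sale', 'impression'), ('summer_sale', 'impression'),
    ('summer_sale', 'click'), ('summer_sale', 'purchase'),
    ('new_launch', 'impression'), ('new_launch', 'click'),
]

stats = defaultdict(lambda: {'impression': 0, 'click': 0, 'purchase': 0})
for campaign, event in events:
    stats[campaign][event] += 1

for campaign, s in stats.items():
    ctr = s['click'] / s['impression'] if s['impression'] else 0.0
    conversion = s['purchase'] / s['click'] if s['click'] else 0.0
    print(campaign, 'CTR=%.0f%%' % (100 * ctr),
          'conversion=%.0f%%' % (100 * conversion))

Ranking campaigns by the purchase count (or by conversion rate) answers which campaign results in the most purchases.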
7. (a) Give an account of opinion mining.
Ans. Opinion mining, which is also called sentiment analysis, involves building a system to collect and categorize opinions about a product. Automated opinion mining often uses machine learning, a type of artificial intelligence (AI), to mine text for sentiment. Opinion mining can be useful in several ways. It can help marketers evaluate the success of an ad campaign or new product launch, determine which versions of a product or service are popular, and identify which demographics like or dislike particular product features. For example, a review on a website might be broadly positive about a digital camera but specifically negative about how heavy it is. Being able to identify this kind of information in a systematic way gives the vendor a much clearer picture of public opinion than surveys or focus groups do, because the data is created by the customers themselves.

(b) Give an account of the techniques of web usage pattern discovery used to find out which pages are being accessed most frequently.
Ans. Web usage mining, also known as web log mining, is the application of data mining techniques to large web log repositories to discover useful knowledge about users' behavioural patterns and website usage statistics that can be used for various website design tasks. The main source of data for web usage mining consists of the textual logs collected by numerous web servers all around the world. There are four stages in web usage mining:
1. Data collection: users' log data is collected from various sources such as the server side, the client side, proxy servers, and so on.
2. Preprocessing: performs a series of processing steps on the web log file, covering data cleaning, user identification, session identification, path completion, and transaction identification.
3. Pattern discovery: web log data cubes are constructed to give the user the flexibility of viewing the data from different perspectives.
4. Pattern analysis: ad hoc analytical queries are performed on these cubes. A typical web log ad hoc analysis example is querying how the overall usage of a web site has changed in the last quarter, or testing whether most server requests have been answered, hopefully with an expected or low level of errors. If some weeks or days are worse than others, the user might navigate further down into those levels, always looking for some reason that explains the observed anomalies. At each step, the user might add or remove a dimension, changing perspective, select a subset of the data at hand, drill down or roll up, and then inspect the new view of the data cube again. Each step of this process signifies a query or hypothesis, and each query follows from the result of the previous step.
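The most basic usage statistic, which pages are requested most often, can be sketched in a few lines of Python over a Common Log Format server log (the file name and format handling are illustrative assumptions):

from collections import Counter

def top_pages(log_path, n=10):
    # Common Log Format: host ident user [time] "METHOD /path HTTP/1.x" status bytes
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            try:
                request = line.split('"')[1]   # e.g. 'GET /index.html HTTP/1.0'
                path = request.split()[1]      # e.g. '/index.html'
            except IndexError:
                continue                       # skip malformed lines
            counts[path] += 1
    return counts.most_common(n)

for path, hits in top_pages('access.log'):
    print(hits, path)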
8. Write short notes on any two of the following:
(a) Bayesian classification
(b) Grid-based methods
(c) Wrapper generation
(d) Privacy-preserving data mining

(a) Ans. Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers: they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an apple, regardless of any possible correlations between the colour, roundness, and diameter features. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.

(b) Ans. The grid-based clustering approach differs from conventional clustering algorithms in that it is concerned not with the data points but with the value space that surrounds them. In general, a typical grid-based clustering algorithm consists of the following five basic steps (Grabusts and Borisov, 2002):
1. Creating the grid structure, i.e., partitioning the data space into a finite number of cells.
2. Calculating the cell density for each cell.
3. Sorting the cells according to their densities.
4. Identifying cluster centers.
5. Traversing neighbour cells.

(c) Ans. A wrapper is a program that extracts the content of a particular information source and translates it into a relational form for the data mining process. There are two main approaches to wrapper generation: wrapper induction and automated data extraction. Wrapper induction uses supervised learning to learn data extraction rules from manually labelled training examples. The disadvantages of wrapper induction are the time-consuming manual labelling process and the difficulty of wrapper maintenance. Because of the manual labelling effort, it is hard to extract data from a large number of sites, as each site has its own templates and requires separate manual labelling for wrapper learning. Wrapper maintenance is also a major issue because whenever a site changes, the wrappers built for it become obsolete. Due to these shortcomings, researchers have studied automated wrapper generation using unsupervised pattern mining. Automated extraction is possible because most web data objects follow fixed templates; discovering such templates or patterns enables the system to perform extraction automatically. Wrapper generation on the Web is an important problem with a wide range of applications: extraction of such data enables one to integrate data and information from multiple web sites to provide value-added services, e.g., comparative shopping, object search, and information integration.

(d) Ans. Privacy preservation has emerged as an important concern for the success of data mining. Privacy-preserving data mining (PPDM) deals with protecting the privacy of individual data or sensitive knowledge without sacrificing the utility of the data. People have become well aware of privacy intrusions on their personal data and are very reluctant to share their sensitive information, and this reluctance can distort the results of data mining. Several methods have been proposed that work within the constraints of privacy, but this branch of research is still in its infancy. The success of a privacy-preserving data mining algorithm is measured in terms of its performance, data utility, level of uncertainty, resistance to data mining algorithms, and so on. However, no privacy-preserving algorithm exists that outperforms all others on all possible criteria.
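One classic PPDM technique is randomization: each sensitive value is perturbed with zero-mean random noise before release, so individual records are masked while aggregate statistics remain approximately recoverable. A minimal sketch (the noise scale and toy data are illustrative assumptions):

import random

def perturb(values, sigma=10.0):
    # Randomization-based PPDM: add zero-mean Gaussian noise to each
    # sensitive value. Individual entries are masked, but aggregates
    # such as the mean are preserved in expectation.
    return [v + random.gauss(0, sigma) for v in values]

incomes = [52, 61, 47, 75, 58]            # sensitive values (in thousands)
released = perturb(incomes)
print(released)                           # individual values are distorted...
print(sum(incomes) / len(incomes),        # ...but the two means stay close
      sum(released) / len(released))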