Eighth sem Regular Examination-2015
Solution of Data and Web Mining
Branch-CSE
1. (a) Why are concept hierarchies useful in a data mining task?
Ans. Concept hierarchies facilitate drilling down and rolling up in data warehouses to view data at multiple
levels of granularity; for example, the location dimension may have the hierarchy street < city < province < country,
so sales can be rolled up from the city level to the country level. For a typical data mining task, the following
basic steps are executed, and concept hierarchies play a key role in them:
1. Retrieval of the task-related data set and generation of a data cube.
2. Generalization of raw data to a certain higher abstraction level.
3. Further generalization or specialization, and multiple-level rule mining.
4. Display of the discovered knowledge.
(b) What are multidimensional association rules? Explain with an example.
Ans. Items in multidimensional association rules refer to two or more dimensions or predicates, e.g., "buys",
"time_of_transaction", "customer_category"; attribute A in a rule is assumed to have value a, attribute B value b
and attribute C value c in the same tuple. For example, the rule
age(X, "20..29") AND occupation(X, "student") => buys(X, "laptop")
involves the three predicates age, occupation and buys.
(c) Differentiate between descriptive and predictive data mining.
Ans. Data analysis is broadly of two types:
Predictive (forecasting)
Descriptive (business intelligence and data mining)
Predictive analytics turns data into valuable, actionable information: it uses data to determine the probable
future outcome of an event or the likelihood of a situation occurring.
Descriptive analytics looks at data and analyzes past events for insight into how to approach the future.
(d) Give examples of at least four categories of clustering methods.
Ans. The clustering methods are divided into hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods.
(e) How will you solve a classification problem using a decision tree?
Ans. Decision tree learning uses a decision tree as a predictive model which maps observations
about an item to conclusions about the item's target value. It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models where the target
variable can take a finite set of values are called classification trees. In these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to those class labels.
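A minimal sketch, assuming scikit-learn is available, of how a classification problem can be solved with a decision tree; the feature names, toy data and parameters are illustrative, not taken from the question:

# Hypothetical example: classify whether a customer buys a computer
# from two illustrative features (age, income). Requires scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data: [age, income in thousands] -> class label
X = [[25, 30], [35, 60], [45, 80], [22, 20], [52, 90], [30, 40]]
y = ["no", "yes", "yes", "no", "yes", "no"]

# Fit a classification tree; leaves hold class labels,
# branches encode conjunctions of feature tests.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["age", "income"]))
print(clf.predict([[40, 70]]))   # predict the class of a new tuple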
(f) Discuss (briefly) whether or not each of the following activities is a data mining task:
(1)Predicting the outcomes of tossing a (fair) pair of dice.
(2) Predicting the future stock price of a company using historical records.
Ans. (1) Predicting the outcomes of tossing a (fair) pair of dice: No. The outcome follows known probability laws and requires no discovery of patterns from data.
(2) Predicting the future stock price of a company using historical records: Yes. A predictive model is built from historical data.
(g) What is the difference between web content mining and web usage mining?
Ans. Web usage mining is the process of extracting useful information from server logs, i.e., the process of
finding out what users are looking for on the Internet.
Web content mining is the mining, extraction and integration of useful data, information and knowledge from
Web page content.
(h) How does the PageRank algorithm work?
Ans. PageRank works by counting the number and quality of links to a page to determine a rough estimate of
how important the website is. The underlying assumption is that more important websites are likely to receive
more links from other websites.
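A hedged sketch of the underlying computation: PageRank can be approximated by power iteration over the link graph. The toy graph and the damping factor of 0.85 are illustrative assumptions:

# Power-iteration PageRank on a toy link graph (pure Python).
# graph maps each page to the pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, d=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniformly
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p, out_links in graph.items():
            if not out_links:                    # dangling page: spread evenly
                for q in pages:
                    new_rank[q] += d * rank[p] / n
            else:
                share = d * rank[p] / len(out_links)
                for q in out_links:              # pass rank along each link
                    new_rank[q] += share
        rank = new_rank
    return rank

print(pagerank(graph))   # C collects the most links, so it ranks highest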
(i) How does string matching work? Explain with an example.
Ans. String matching is the technique of finding occurrences of a pattern string in a text, either exactly or
approximately. Approximate string matching finds strings that match a pattern within some allowed number of
differences; for example, the pattern "color" matches the dictionary word "colour" with a single edit. The
problem of approximate string matching is classified into two sub-problems: finding approximate substring
matches inside a given string, and finding dictionary strings that match the pattern approximately.
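A small sketch of approximate matching via edit (Levenshtein) distance; the words used are only an illustration:

# Edit distance between two strings by dynamic programming.
def edit_distance(a: str, b: str) -> int:
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(a)][len(b)]

# "color" matches "colour" approximately: distance 1 (one insertion)
print(edit_distance("color", "colour"))   # -> 1
# Dictionary lookup: find words within distance 1 of the pattern
words = ["colour", "collar", "cooler"]
print([w for w in words if edit_distance("color", w) <= 1])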
(j) What is opinion spam? Explain.
Ans. Opinion Spamming: It refers to "illegal" activities (e.g., writing fake reviews, also called shilling)
that try to mislead readers or automated opinion mining and sentiment analysis systems by giving
undeserving positive opinions to some target entities in order to promote the entities and/or by
giving false negative opinions to some other entities in order to damage their reputations. Opinion
spam has many forms, e.g., fake reviews (also called bogus reviews), fake comments, fake blogs,
fake social network postings, deceptions, and deceptive messages.
2. (a) Write and explain the algorithm for mining frequent itemsets without candidate generation.
Ans. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into
two separate steps:
1. First, minimum support is applied to find all frequent itemsets in a database.
2. Second, these frequent itemsets and the minimum confidence constraint are used to form
rules.
While the second step is straightforward, the first step needs more attention.
Finding all frequent itemsets in a database is difficult since it involves searching all possible
itemsets (item combinations). The set of possible itemsets is the power set over I and has size 2^n - 1
(excluding the empty set, which is not a valid itemset). Although the size of the power set grows exponentially
in the number of items n in I, efficient search is possible using the downward-closure property of support (also
called anti-monotonicity), which guarantees that for a frequent itemset all its subsets are also frequent, and
thus for an infrequent itemset all its supersets must also be infrequent. Exploiting this property, efficient
algorithms (e.g., Apriori and Eclat) can find all frequent itemsets. FP-growth is the algorithm that mines
frequent itemsets without candidate generation: it compresses the database into a compact FP-tree that retains
the itemset association information and then recursively mines conditional FP-trees for each frequent item, so
no candidate sets are generated or tested.
(b) A database has nine transactions. Let min_sup = 30%.
TID    List of Item IDs
1      a, b, e
2      b, d
3      b, c
4      a, b, d
5      a, c
6      b, c
7      a, c
8      a, b, c, e
9      a, b, c
Find all frequent item sets using the above algorithm.
Ans. With min_sup = 30% (minimum support count = 3 of 9 transactions), the frequent 1-itemsets are {a} (support 6),
{b} (7) and {c} (6), and the frequent 2-itemsets are {a,b} (4), {a,c} (4) and {b,c} (4). {a,b,c} occurs in only 2
transactions, so there is no frequent 3-itemset.
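A brute-force sketch that verifies this answer on the nine transactions above; it enumerates candidate itemsets directly rather than implementing FP-growth, so it is only a check of the counts:

from itertools import combinations

# The nine transactions from the question
transactions = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"},
    {"a", "b", "d"}, {"a", "c"}, {"b", "c"},
    {"a", "c"}, {"a", "b", "c", "e"}, {"a", "b", "c"},
]
min_count = 3   # 30% of 9 transactions, rounded up

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= min_count:
            frequent[cand] = count

print(frequent)
# {('a',): 6, ('b',): 7, ('c',): 6, ('a','b'): 4, ('a','c'): 4, ('b','c'): 4}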
3. (a) Clustering has been popularly recognized as an important data mining task with broad
applications. Give an application example for each of the following cases:
(a) An application that takes clustering as a major data mining function.
(b) An application that takes clustering as a preprocessing tool for data preparation for other
data mining tasks.
Ans. Clustering is a technique that divides data into groups of similar objects. Each group, called a cluster,
consists of objects that are similar to one another and dissimilar to objects of other groups.
(a) An application that uses clustering as a major data mining function is customer segmentation in marketing:
customers are grouped by purchasing behaviour so that each segment can be targeted with its own campaign.
(b) An application that uses clustering as a preprocessing tool is classification of a very large data set: the
data is first clustered so that a compact, representative subset (or the cluster labels themselves) can be fed
to the subsequent classification or outlier-detection step.
Representing data with fewer clusters necessarily loses certain fine details (akin to lossy data compression),
but achieves simplification: it represents many data objects by a few clusters and hence models data by its
clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and
numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for
clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is
unsupervised learning of a hidden data concept. Data mining applications add three complications to this general
picture: (a) large databases, (b) many attributes, and (c) attributes of different types, which impose severe
computational requirements on the analysis. Data mining applications include scientific data exploration,
information retrieval, text mining, spatial databases, Web analysis, CRM, marketing, medical diagnostics,
computational biology, and many others.
(b) The term "k-means" was first used by James MacQueen in 1967. The standard algorithm was first proposed by
Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it was not published until 1982. K-means
is a widely used partitioning clustering method in industry. The k-means algorithm is the most commonly used
partitioning clustering algorithm because it can be easily implemented and is the most efficient one in terms of
execution time.
Density-based clustering algorithms try to find clusters based on density of data points in a
region. The key idea of density-based clustering is that for each instance of a cluster the
neighborhood of a given radius (Eps) has to contain at least a minimum number of instances (Min
Pts). One of the
most well known density-based clustering algorithms is the DBSCAN . DBSCAN separates data
points into three classes:
1. Core points: These are points that are at the interior of a cluster.
2. Border points: A border point is a point that is not a core point, but it falls within the
neighborhood of a core point.
3. Noise points: A noise point is any point that is not a core point or a border point.
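A brief sketch contrasting the two families with scikit-learn (assumed available); the synthetic data, Eps and MinPts values are illustrative:

# k-means (partitioning) vs DBSCAN (density-based) on toy 2-D data.
# Requires numpy and scikit-learn.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered noise points
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=7, size=(5, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)   # Eps and MinPts

print("k-means labels:", np.unique(kmeans.labels_))
# DBSCAN labels noise points as -1; core/border points get cluster ids
print("DBSCAN labels :", np.unique(dbscan.labels_))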
4. (a) Construct the FP-tree for the given transaction DB
TID    Frequent Itemsets
100    f, c, a, m, p
200    f, c, a, b, m
300    f, b
400    c, b, p
500    f, c, a, m, p
Ans. FP-tree for the given data. Item frequencies: f:4, c:4, a:3, b:3, m:3, p:3; header table order: f, c, a, b, m, p.

{}  (root)
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Each header table entry links to all nodes carrying that item (e.g., the two c nodes c:3 and c:1, and the three b:1 nodes).
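A compact sketch of how this FP-tree can be built programmatically with dictionary-based nodes; it reproduces the tree above but is not the full FP-growth miner:

# Sketch: building the FP-tree for the five transactions above.
transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

# Global item order taken from the header table above (descending support,
# ties broken as in the answer): f, c, a, b, m, p.
order = {item: i for i, item in enumerate("fcabmp")}

# Insert each transaction, sorted by that order, into a prefix tree,
# incrementing the count of every node on the path.
root = {"children": {}}
for t in transactions:
    node = root
    for item in sorted(t, key=order.get):
        child = node["children"].setdefault(item, {"count": 0, "children": {}})
        child["count"] += 1
        node = child

def show(node, indent=0):
    """Print the tree; the output mirrors the FP-tree drawn above."""
    for item, child in node["children"].items():
        print("  " * indent + f"{item}:{child['count']}")
        show(child, indent + 1)

show(root)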
(b) Pre-processing is an important task in data mining. Justify.
Ans. Data pre-processing is an important step in the data mining process. The phrase “garbage in, garbage out”
is particularly applicable to data mining and machine learning projects. Data-gathering methods are often
loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations (e.g.,
Sex: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such
problems can produce misleading results. Thus, the representation and quality of the data must be ensured
before running an analysis.
Real-world data are generally
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- Noisy: containing errors or outliers
- Inconsistent: containing discrepancies in codes or names
If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge
discovery during the training phase becomes more difficult. Data preparation and filtering steps can take a
considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation,
feature extraction and selection, etc. The product of data pre-processing is the final training set.
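An illustrative pandas sketch (assuming pandas and NumPy are available) of the cleaning step; the column names and rules are hypothetical instances of the problems listed above:

# Hypothetical cleaning of a small customer table with pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income":   [52000, -100, 61000, None],      # -100 is out of range
    "sex":      ["F", "M", "F", "M"],
    "pregnant": ["no", "yes", "yes", "no"],      # M + pregnant is impossible
})

# Out-of-range values -> missing, then fill missing with the median
df.loc[df["income"] < 0, "income"] = np.nan
df["income"] = df["income"].fillna(df["income"].median())

# Flag impossible combinations for manual review instead of silently keeping them
df["suspect"] = (df["sex"] == "M") & (df["pregnant"] == "yes")

print(df)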
5. (a) Explain the process of mining the WWW.
Ans. The advent of the World-Wide Web (WWW) has overwhelmed the typical home computer
user with an enormous flood of information. To be able to cope with the abundance of available
information, users of the WWW need to rely on intelligent tools that assist them in finding, sorting,
and filtering the available information. Just as data mining aims at discovering valuable information
that is hidden in conventional databases, the emerging field of Web mining aims at finding and
extracting relevant information that is hidden in Web-related data, in particular in text documents
that are published on the web. Depending on the nature of the data, one can distinguish three main areas of
research within the Web mining community:
1. Web Content Mining: application of data mining techniques to unstructured or semi-structured data, usually HTML documents.
2. Web Structure Mining: use of the hyperlink structure of the Web as an (additional) information source.
3. Web Usage Mining: analysis of user interactions with a Web server (e.g., click-stream analysis), i.e., collecting data from web log records.
Process diagram for mining the WWW.
(b) Explain the ways in which descriptive mining of complex data objects is identified, with an example.
Ans. A major limitation of many commercial data warehouse and OLAP tools for multidimensional database analysis
is their restriction on the allowable data types for dimensions and measures. Most data cube implementations
confine dimensions to nonnumeric data and measures to simple aggregated values.
To introduce data mining and multidimensional data analysis for complex objects, this section
examines
how to perform generalization on complex structured objects and construct object cubes for OLAP
and
mining in object databases.
The storage and access of complex structured data have been studied in object-relational and
object-oriented database systems. These systems organize a large set of complex data objects into
classes,
which are in turn organized into class/subclass hierarchies. Each object in a class is associated with
1) An object-identifier
2) A set of attributes that may contain sophisticated data structures, set- or list-valued data,
class composition and hierarchies, multimedia data and so on &
3) A set of methods that specify the computational routines or rules associated with the
object class.
To facilitate generalization and induction in object-relational and object-oriented databases, it is
important to study how the generalized data can be used for multidimensional data and analysis and
data mining.
Suppose that we have different pieces of land for various purposes of agricultural usage, such as
the planting of vegetables, grains, and fruits. These pieces can be merged or aggregated into one
large
piece of agricultural land by a spatial merge. However, such a piece of agricultural land may contain
highways, houses, small stores, and so on. If the majority of the land is used for agriculture, the
scattered
regions for other purposes can be ignored, and the whole region can be claimed as an agricultural
area by approximation. A multimedia database may contain complex texts, graphics, images, video
fragments, maps, voice, music, and other forms of audio/video information. Multimedia data are
typically stored as sequences of bytes with variable lengths, and segments of data are linked
together or indexed in a multidimensional way for easy reference. Recognition and extraction of the
essential features and/or general patterns of such data can perform generalization on multimedia
data. There are many ways to extract such information. For an image, aggregation and/or
approximation can extract the size, color, shape, texture, orientation, and relative positions and
structures of the contained objects or regions in the image. For a segment of music, its melody can
be summarized based on its tone, tempo, or the major musical instruments played. For an article, its
abstract or general organizational structure (e.g., the table of contents, the subject and index terms
that frequently occur in the article, etc.) may serve as its generalization.
In general, it is a challenging task to generalize spatial data and multimedia data in order to
extract interesting knowledge implicitly stored in the data. Technologies developed in spatial
databases and multimedia databases such as spatial data accessing and analysis techniques and
content-based image retrieval and multidimensional indexing methods should be integrated with data
generalization and data mining techniques to achieve satisfactory results. Techniques for mining such data are
discussed further in the following sections.
6. (a) Explain the preprocessing of a web mining application.
Ans. Web usage mining is concerned with automatically discovering the access patterns of users from one or more
web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, traditional
market analysis techniques and strategies should be revisited in this context. Organizations generate and
collect large amounts of data in their daily activities; most of this information is generated automatically and
collected by web servers in the server access logs.
Ideally, the input to the web usage mining process is a user session file that gives an accurate account of who
visited the web site, what pages were requested and in what order, and how long each page was viewed. However,
for several reasons the raw web server log does not reliably reflect user sessions, so the data must be
preprocessed before a user session file can be produced. In general, data preprocessing consists of data
cleaning, user identification, session identification and path completion, as shown in the figure below.
Phases of Data Preprocessing in Web Mining
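A hedged sketch of the data cleaning and session identification steps on made-up common-log-format lines; a complete preprocessing pipeline would also perform user identification and path completion:

# Parse toy access-log lines, drop image requests (cleaning),
# then split each IP's requests into sessions with a 30-minute timeout.
import re
from datetime import datetime, timedelta

LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) HTTP/\d\.\d" \d+ \d+')

raw = [
    '1.2.3.4 - - [10/Oct/2015:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '1.2.3.4 - - [10/Oct/2015:13:56:10 +0000] "GET /logo.png HTTP/1.0" 200 510',
    '1.2.3.4 - - [10/Oct/2015:15:10:02 +0000] "GET /products.html HTTP/1.0" 200 1043',
]

requests = []
for line in raw:
    m = LOG_RE.match(line)
    if not m:
        continue
    ip, ts, path = m.groups()
    if path.endswith((".png", ".gif", ".css", ".js")):   # data cleaning
        continue
    when = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
    requests.append((ip, when, path))

# Session identification: a new session starts when the same IP is idle
# for more than 30 minutes.
sessions = []           # each session: (ip, [paths])
last_seen = {}          # ip -> (session index, last timestamp)
timeout = timedelta(minutes=30)
for ip, when, path in sorted(requests, key=lambda r: (r[0], r[1])):
    if ip in last_seen and when - last_seen[ip][1] <= timeout:
        idx = last_seen[ip][0]
        sessions[idx][1].append(path)
    else:
        sessions.append((ip, [path]))
        idx = len(sessions) - 1
    last_seen[ip] = (idx, when)

print(sessions)   # two sessions for 1.2.3.4: /index.html, then /products.html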
(b) How web mining tools can answer which advertising campaign results in the most
purchases ?
Ans. As online advertising banners become more popular, companies using them need to measure the overall return
on advertising investment accurately. This benefits both advertisers and the sites running the ads because it
allows advertising rates to vary according to their success. Proper measurement of
advertising reports centers on two specific areas:
Quantity: How many impressions were delivered for each ad banner and page, and how many
people clicked on each ad? These are usually reported as impressions and click-throughs.
Quality: Of people who clicked on an ad banner, how many actually purchased? This return is best
measured by subtracting advertising expenses from the resulting revenue.
For companies offering ad space on their site, reporting ad impressions and click-through rates for
any page running advertisements is important. For companies running banner ads on other sites,
prospect quality can be measured. A manager should evaluate both the effectiveness of individual
ad banners and the effectiveness of each Web page with an ad. By combining these, an advertiser
optimizes his or her advertising by selecting the best combination of ad banner and Web page for
additional ad placements.
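A tiny worked example of the quantity and quality measures described above; all figures are invented:

# Hypothetical campaign figures for two ad banners.
campaigns = {
    "banner_A": {"impressions": 100_000, "clicks": 1_200, "purchases": 60,
                 "revenue": 4_500.0, "ad_cost": 2_000.0},
    "banner_B": {"impressions": 80_000,  "clicks": 2_000, "purchases": 30,
                 "revenue": 1_800.0, "ad_cost": 2_000.0},
}

for name, c in campaigns.items():
    ctr        = c["clicks"] / c["impressions"]          # quantity: click-through rate
    conversion = c["purchases"] / c["clicks"]            # quality: buyers per click
    net_return = c["revenue"] - c["ad_cost"]             # revenue minus ad spend
    print(f"{name}: CTR={ctr:.2%}, conversion={conversion:.2%}, return={net_return:+.0f}")
# banner_A produces more purchases per click and a positive return, so it is the
# better campaign despite banner_B's higher click count.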
7. (a) Give an account of opinion mining.
Ans. Opinion mining, which is also called sentiment analysis, involves building a system to collect
and categorize opinions about a product. Automated opinion mining often uses machine learning, a
type of artificial intelligence (AI), to mine text for sentiment.
Opinion mining can be useful in several ways. It can help marketers evaluate the success of
an ad campaign or new product launch, determine which versions of a product or service are
popular and identify which demographics like or dislike particular product features. For example, a
review on a website might be broadly positive about a digital camera, but be specifically negative
about how heavy it is. Being able to identify this kind of information in a systematic way gives the
vendor a much clearer picture of public opinion than surveys or focus groups do, because the data is
created by the customer.
(b) Give an account of the techniques of web usage patterns discovery to find out which pages
are being accessed most frequently?
Ans. Web usage mining, also known as web log mining, is the application of data mining techniques
on large web log repositories to discover useful knowledge about users' behavioral patterns and
website usage statistics that can be used for various website design tasks. The main source of data
for web usage mining consists of textual logs collected by numerous web servers all around the
world. There are four
stages in web usage mining.
Data Collection: users' log data is collected from various sources such as the server side, the client side,
proxy servers and so on.
Preprocessing: performs a series of processing steps on the web log file, covering data cleaning, user
identification, session identification, path completion and transaction identification.
Pattern Discovery: applies statistical, machine learning and data mining methods (e.g., association rules,
clustering, sequential patterns) to the preprocessed data.
Pattern Analysis: filters out uninteresting patterns and interprets the remaining ones, for example through
OLAP-style analysis of web log data cubes.
Web log data cubes are constructed to give the user the flexibility of viewing data from different
perspectives and performing ad hoc analytical queries. A typical Web log ad hoc analysis example is querying how
overall usage of the Web site has changed in the last quarter, testing whether most server
requests have been answered, hopefully with expected or low level of errors. If some weeks or days
are worse than the others, the user might navigate further down into those levels, always looking for
some reason to explain the observed anomalies. At each step, the user might add or remove some
dimension, changing their perspective, select subset of the data at hand, drill down, or roll up, and
then inspect the new view of the data cube again. Each step of this process signifies a query or
hypothesis, and each query follows the result of the previous step.
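A minimal sketch of the simplest such usage statistic, counting which pages are requested most often in a made-up click-stream:

# Count page requests in a toy click-stream to find the most visited pages.
from collections import Counter

page_requests = [
    "/index.html", "/products.html", "/index.html", "/cart.html",
    "/products.html", "/index.html", "/checkout.html", "/products.html",
]

hits = Counter(page_requests)
for page, count in hits.most_common(3):
    print(f"{page}: {count} requests")
# /index.html and /products.html come out as the most frequently accessed pages.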
8. Write Short Notes on any two of the following:
(a) Bayesian classification
(b) Grid-based methods
(c) Wrapper generation
(d) Privacy preserving data mining
(a) Ans. Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical
classifiers. Bayesian classifiers can predict class membership probabilities such as the probability
that a given tuple belongs to a particular class.
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to
problem instances, represented as vectors of feature values, where the class labels are drawn from
some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms
based on a common principle: all naive Bayes classifiers assume that the value of a particular
feature is independent of the value of any other feature, given the class variable. For example, a
fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes
classifier considers each of these features to contribute independently to the probability that this
fruit is an apple, regardless of any possible correlations between the color, roundness and diameter
features.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a
supervised learning setting. In many practical applications, parameter estimation for naive Bayes
models uses the method of maximum likelihood; in other words, one can work with the naive
Bayes model without accepting Bayesian probability or using any Bayesian methods.
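A short illustrative sketch with scikit-learn's Gaussian naive Bayes, echoing the fruit example; the numeric feature encoding and the data are assumptions:

# Illustrative naive Bayes classification with scikit-learn.
# Features: [redness 0-1, roundness 0-1, diameter in inches]; labels are fruit classes.
from sklearn.naive_bayes import GaussianNB

# Each feature contributes independently to P(class | features).
X = [[0.9, 0.9, 3.0],   # red, round, ~3 inches  -> apple
     [0.8, 0.9, 3.2],   # apple
     [0.3, 0.9, 3.0],   # orange-coloured, round -> other
     [0.9, 0.4, 5.0],   # red but elongated      -> other
     [0.2, 0.8, 2.8]]   # other
y = ["apple", "apple", "other", "other", "other"]

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.85, 0.9, 3.1]]))         # -> ['apple']
print(clf.predict_proba([[0.85, 0.9, 3.1]]))   # posterior class probabilities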
(b) Ans. The grid-based clustering approach differs from the conventional clustering algorithms in
that it is concerned not with the data points but with the value space that surrounds the data points.
In general, a typical grid-based clustering algorithm consists of the following five basic steps
(Grabusts and Borisov, 2002):
1. Creating the grid structure, i.e., partitioning the data space into a finite number of cells.
2. Calculating the cell density for each cell.
3. Sorting of the cells according to their densities.
4. Identifying cluster centers.
5. Traversal of neighbor cells.
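A hedged sketch of steps 1-3 on toy 2-D points (grid creation, cell density, density sorting); identifying cluster centres and traversing neighbour cells are only indicated in comments:

# Grid-based clustering sketch: bin 2-D points into cells and rank cells by density.
import numpy as np

rng = np.random.default_rng(1)
points = np.vstack([rng.normal([2, 2], 0.2, (40, 2)),
                    rng.normal([7, 7], 0.2, (40, 2)),
                    rng.uniform(0, 9, (10, 2))])      # background noise

cell_size = 1.0
cells = {}                                            # (i, j) -> number of points
for x, y in points:
    key = (int(x // cell_size), int(y // cell_size))  # step 1: grid structure
    cells[key] = cells.get(key, 0) + 1                # step 2: cell density

# Step 3: sort cells by density; dense cells would seed clusters (step 4),
# which are then grown by traversing neighbouring dense cells (step 5).
dense = sorted(cells.items(), key=lambda kv: kv[1], reverse=True)
print(dense[:3])    # the two cells around (2,2) and (7,7) dominate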
(c) Ans. A wrapper is a program that extracts the content of a particular information source and translates
it into a relational form in the data mining process. There are two main approaches to wrapper
generation: wrapper induction and automated data extraction. Wrapper induction uses supervised
learning to learn data extraction rules from manually labeled training examples. The disadvantages
of wrapper induction are
- the time-consuming manual labeling process, and
- the difficulty of wrapper maintenance.
Due to the manual labeling effort, it is hard to extract data from a large number of sites as each site
has its own templates and requires separate manual labeling for wrapper learning. Wrapper
maintenance is also a major issue because whenever a site changes the wrappers built for the site
become obsolete. Due to these shortcomings, researchers have studied automated wrapper
generation using unsupervised pattern mining. Automated extraction is possible because most Web
data objects follow fixed templates. Discovering such templates or patterns enables the system to
perform extraction automatically.
Wrapper generation on the Web is an important problem with a wide range of applications.
Extraction of such data enables one to integrate data/information from multiple Web sites to provide
value-added services, e.g., comparative shopping, object search, and information integration, and the
wrapped content can be further enhanced.
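A toy illustration of what a hand-written wrapper does: it turns a templated HTML fragment (invented here) into relational tuples with a fixed extraction rule; wrapper induction or automated extraction would learn such a rule instead of hard-coding it:

# Hand-coded wrapper for a hypothetical product-listing template.
import re

html = """
<div class="item"><span class="name">USB cable</span><span class="price">$4.99</span></div>
<div class="item"><span class="name">Mouse</span><span class="price">$12.50</span></div>
"""

# The extraction rule exploits the page's fixed template.
pattern = re.compile(
    r'<span class="name">(.*?)</span><span class="price">\$([\d.]+)</span>')

records = [{"name": name, "price": float(price)}
           for name, price in pattern.findall(html)]
print(records)
# [{'name': 'USB cable', 'price': 4.99}, {'name': 'Mouse', 'price': 12.5}]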
(d) Ans. Privacy preservation has emerged as an important concern for the success of data mining.
Privacy-preserving data mining (PPDM) deals with protecting the privacy of individual data or sensitive
knowledge without sacrificing the utility of the data. People have become well aware of privacy intrusions on
their personal data and are very reluctant to share their sensitive information, which can lead to incomplete or
inaccurate data mining results. Several methods have been proposed that work within the constraints of privacy,
but this branch of research is still in its infancy. The success of a privacy-preserving data mining algorithm
is measured in terms of its performance, data utility, level of uncertainty, resistance to data mining
algorithms, etc. However, no privacy-preserving algorithm exists that outperforms all others on all possible
criteria.
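One common PPDM technique is randomization by additive noise; a minimal sketch with made-up salary values, showing that individual records are masked while an aggregate statistic stays usable:

# Additive-noise perturbation: publish noisy salaries instead of true ones.
import numpy as np

rng = np.random.default_rng(42)
true_salaries = np.array([42_000, 55_000, 61_000, 48_000, 75_000], dtype=float)

noise = rng.laplace(loc=0.0, scale=5_000.0, size=true_salaries.shape)
released = true_salaries + noise          # individual values are distorted

print("released values:", np.round(released))
print("true mean    :", true_salaries.mean())
print("released mean:", released.mean())  # aggregate statistic stays close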