DMBI Question bank of Unit 2, 4, and 6
1. What is a data warehouse? What are the characteristics of a data warehouse?
 “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making
process.”
 A Data Warehouse
 is a Structured Repository of Historic Data.
 is developed in an Evolutionary Process by Integrating Data
from Non-integrated Legacy Systems.
 is a Collection of Data in support of management decision processes.
Characteristics:
1) Subject-Oriented
2) Integrated
3) Time-Variant
4) Nonvolatile
2. What is data warehousing?
 Data warehousing is the process of constructing and using data
warehouses.
3. A data warehouse is time-variant. Explain this.
 The time horizon for the data warehouse is significantly longer than that
of operational systems
o Operational database: current value data
o Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
o Contains an element of time, explicitly or implicitly
o But the key of operational data may or may not contain “time
element”
4. Explain the multi-tiered architecture of a data warehouse.
 Mainly three tiers:
o Data storage
o OLAP servers
o Front-end tools
5. Explain Extraction, Transformation, and Loading (ETL).
 Data extraction
o get data from multiple, heterogeneous, and external sources
 Data cleaning
o detect errors in the data and rectify them when possible
 Data transformation
o convert data from legacy or host format to warehouse format
 Load
o sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
 Refresh
o propagate the updates from the data sources to the warehouse
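As a concrete illustration of these steps, below is a minimal ETL sketch in Python using only the standard library. The source file sales.csv, its columns, and the sales_fact table in the SQLite warehouse are hypothetical names chosen for the example, not part of the question bank.

import csv
import sqlite3

# Extract: get data from an external source (hypothetical sales.csv)
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Clean / transform: drop records with missing amounts, convert to warehouse format
clean = [
    {"region": r["region"].strip().upper(), "amount": float(r["amount"])}
    for r in rows
    if r.get("amount")
]

# Load: create the warehouse table, insert the rows, and build an index
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_fact (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact (region, amount) VALUES (:region, :amount)", clean
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_region ON sales_fact (region)")

# Refresh would rerun these extract/transform/load steps to propagate source updates
conn.commit()
conn.close()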
6. Explain various Data Warehouse Models.
 Enterprise warehouse
o collects all of the information about subjects spanning the entire
organization
 Data Mart
o a subset of corporate-wide data that is of value to a specific group
of users.
o Its scope is confined to specific, selected groups, such as marketing
data mart
 Independent vs. dependent (directly from warehouse) data
mart
 Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
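For the virtual warehouse model specifically, a tiny sketch of the idea is shown below: an analytical view defined directly over an operational table. The SQLite table and column names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table (illustrative)
conn.execute("CREATE TABLE orders (order_date TEXT, region TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('2024-01-05', 'WEST', 120.0)")

# The "virtual warehouse" is just a view over the operational data;
# nothing is materialized unless we choose to store the summary.
conn.execute("""
    CREATE VIEW sales_summary AS
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
""")
print(conn.execute("SELECT * FROM sales_summary").fetchall())  # [('WEST', 120.0)]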
7. What is a Metadata Repository?
 Metadata is the data defining warehouse objects.
 It stores:
 Description of the structure of the data warehouse
o schema, views, dimensions, hierarchies, derived data definitions, data
mart locations and contents
 Operational meta-data
o data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
 The algorithms used for summarization
 The mapping from operational environment to the data warehouse
 Data related to system performance
o warehouse schema, view and derived data definitions
 Business data
o business terms and definitions, ownership of data, charging
policies
8. Differentiate OLTP and OLAP
 OLTP (Online transaction processing) is a class of information systems
that facilitate and manage transaction-oriented applications, typically for
data entry and retrieval transaction processing.
 OLAP (online analytical processing) is computer processing that enables
users to (easily and selectively) extract and view data from different points
of view.
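To make the contrast concrete, the following minimal Python/SQLite sketch places an OLTP-style transaction next to an OLAP-style aggregate query; the orders table and its columns are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: short, record-at-a-time transactions (data entry and lookup)
conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("WEST", 120.0))
conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("EAST", 75.5))
conn.commit()
one_order = conn.execute("SELECT * FROM orders WHERE id = ?", (1,)).fetchone()

# OLAP-style work: read-mostly aggregation over many rows for analysis
summary = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders GROUP BY region"
).fetchall()
print(one_order, summary)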
9. What is the confidence of a rule? Give the formula for confidence.
Confidence
The confidence of a rule is defined as:
conf(X → Y) = supp(X ∪ Y) / supp(X)
For the rule {milk, bread} => {butter} we have the following confidence:
supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65
This means that for 65% of the transactions containing milk and bread the rule is
correct. Confidence can be interpreted as an estimate of the conditional probability
P(Y | X), the probability of finding the RHS of the rule in transactions under the
condition that these transactions also contain the LHS.
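A small Python sketch of these definitions is given below. The toy transaction list is an illustrative assumption, so its numbers differ from the 0.26 / 0.4 example quoted above.

# Hypothetical transactions used only to illustrate supp() and conf()
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"beer"},
]

def supp(itemset):
    """Proportion of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def conf(lhs, rhs):
    """conf(X -> Y) = supp(X U Y) / supp(X)."""
    return supp(set(lhs) | set(rhs)) / supp(set(lhs))

# Confidence of the rule {milk, bread} => {butter} on the toy data
print(conf({"milk", "bread"}, {"butter"}))  # 0.4 / 0.6 = 0.666...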
10. What is an itemset? What is Frequent Itemset Mining? What are the
different Frequent Itemset Mining methods?
11. What are the applications of Frequent Itemset Mining?
 Applications
1. Basket data analysis,
2. cross-marketing,
3. catalog design,
4. sale campaign analysis,
5. Web log (click stream) analysis,
6. DNA sequence analysis,
7. Program/service selection
12. What does confidence tell us about a rule?
Confidence
The confidence of a rule is defined as:
conf(X → Y) = supp(X ∪ Y) / supp(X)
For the rule {milk, bread} => {butter} we have the following confidence:
supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65
This means that for 65% of the transactions containing milk and bread the rule is
correct. Confidence can be interpreted as an estimate of the conditional probability
P(Y | X), the probability of finding the RHS of the rule in transactions under the
condition that these transactions also contain the LHS.
13. Find the frequent itemsets in the following data for market-basket
analysis by applying the Apriori algorithm. Take a minimum support
count of 2.
TID  Items
1    B, C, D
2    A, B, C, E
3    B, D
4    A, D, E
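A minimal Python sketch of Apriori applied to these four transactions is shown below (candidate generation is simplified to unions of frequent (k-1)-itemsets, which is correct but less efficient than the classical join-and-prune step). With a minimum support count of 2 it reports the frequent 1-itemsets {A}, {B}, {C}, {D}, {E}, the frequent 2-itemsets {A, E}, {B, C}, {B, D}, and no frequent 3-itemsets.

# Transactions from the question (one set per TID)
transactions = [
    {"B", "C", "D"},
    {"A", "B", "C", "E"},
    {"B", "D"},
    {"A", "D", "E"},
]
min_support = 2  # minimum support count

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    items = {item for t in transactions for item in t}
    # L1: frequent 1-itemsets
    current = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    current = {s: c for s, c in current.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Simplified candidate generation: unions of two frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

for itemset, count in sorted(apriori(transactions, min_support).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)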
14. What is support?
Support
The support supp(X) of an itemset X is defined as the proportion of
transactions in the data set which contain the itemset.
supp(X)= no. of transactions which contain the itemset X / total no. of
transactions
In the example database, the itemset {milk, bread, butter} has a support of
4/15 ≈ 0.26, since it occurs in about 26% of all transactions. To be even more
explicit, we can point out that 4 is the number of transactions from the database
which contain the itemset {milk, bread, butter}, while 15 represents the total
number of transactions.
15. Write the Apriori algorithm. What are the limitations of the Apriori
algorithm? How can we improve the Apriori algorithm?
16. Explain various aspects of data mining.
17. Write a short note on opinion mining.
Opinion Mining
In today’s competitive world, everyone wants to give the best education to their
children, and every institute is trying to prove that it is the best among all.
Students talk about their institute face to face or behind its back. An institute’s
valuation depends on what students feel about the institute. Students compare
their institute with other institutes. Students write feedback about the course
every semester and discuss it with their friends through chat rooms, mail, social
media, etc. The aim of the proposed model is to extract facts from different
sources such as blogs, comments, feedback, social media, etc. In an academic
institution, student feedback about the course can be considered a significant
informative resource for improving the course.
The employees of the institution are also an important factor. Employees can
produce great ideas, and they too discuss the institute with their friends or
family members. Similarly, parents, industries, and other institutes related to
the institute pass comments formally or informally. They pass serious and
meaningful comments on the institution, either positive or negative.
Students give feedback every year or every semester about the course, the
institute, the faculty, and the facilities provided to them. Based on this
feedback, institutes take steps to eliminate the drawbacks pointed out by the
students. Industries also play a major role in the progress of an institution,
because in the end an institute is judged by the number of students recruited
into well-known companies and by their packages. So the aim of the institute is
to provide students with knowledge that will be helpful in any field, and this is
possible only if the institute is aware of the current trends and technologies
used by industry. Institutes can address their weak areas by using industry
feedback, and can thereby reduce the gap between industry and academia.
Parents and other people associated with the institute also discuss it with their
colleagues, family members, and friends. Such opinions can have an immense
impact on the institute.
18. Write a short note on Education Data Mining.
Education Data Mining
Across the spectrum of data mining applications, education is one of the most
important. Higher educational organizations are placed in a highly
competitive environment and aim to gain competitive advantages over
other competitors.
Educational organizations consider students and professors their
main assets, and they want to improve their key process
indicators by effective and efficient use of those assets.
To remain competitive in the educational domain, these organizations need
to adopt current trends, which are constantly evolving as new
ones emerge. By following these trends and technologies,
higher-education institutions will be able to prepare students to
become the next generation of productive employees and
innovative leaders the world needs.
Most higher-education institutions need to implement current educational
trends; the knowledge required for this can be extracted from the historical
and operational data that reside in the educational organization’s
databases.
One of the significant facts in higher learning institutions is the
explosive growth of educational data. These data are increasing rapidly
without any benefit to management. Data mining techniques
are analytical tools that can be used to extract meaningful
knowledge from large data sets. This topic addresses the
capabilities of data mining in higher learning institutions by proposing
new guidelines for applying data mining in education. It
focuses on how data mining may help to implement current
trends in higher learning institutions.
Data mining can be applied in almost all areas to produce a variety of
results. In recent years, researchers have found innovative
applications of data mining in the field of education as well. This
field is termed Education Data Mining, also referred to as EDM.
EDM can be defined as the area of scientific inquiry centered on the
development of methods for making discoveries within the unique
kinds of data that come from educational settings, and using those
methods to better understand students and the settings in which they
learn.
Educational data mining has emerged as an independent research area
in recent years, culminating in 2008 with the establishment of the
annual International Conference on Educational Data Mining, and the
Journal of Educational Data Mining.
The aim of using data mining in the education field is to enhance
educational performance in many ways. Knowledge discovery requires
a clear methodology that can be successfully applied in the education
sector. This can be obtained by using the CRoss-Industry
Standard Process for Data Mining (CRISP-DM). CRISP-DM is a
method for carrying out data mining and knowledge discovery on the
databases of schools and colleges. Three data mining methods, Naïve
Bayes, Nearest Neighbor, and the C4.5 decision tree, are implemented
on the school data. The results showed that the C4.5 decision tree is
significantly more accurate than the other methods.
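Purely as an illustrative sketch of such a comparison (not the original study or its data), the Python snippet below uses scikit-learn with synthetic records standing in for school data. Note that scikit-learn's DecisionTreeClassifier implements CART, so it is only a stand-in for C4.5.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder for real school records (e.g. attendance, marks -> outcome)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
    "Decision tree (CART stand-in for C4.5)": DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")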
19. Write a short note on text mining.
Text Mining
Text mining involves the application of techniques from areas such as information
retrieval, natural language processing, information extraction and data mining.
These various stages can be combined together into a single workflow.
Information Retrieval (IR) systems identify the documents in a collection which
match a user’s query. The most well-known IR systems are search engines such as
Google, which allow identification of a set of documents that relate to a set of key
words. As text mining involves applying very computationally intensive algorithms
to large document collections, IR can speed up the discovery cycle considerably by
reducing the number of documents found for analysis. For example, if a researcher
is interested in mining information only about protein interactions, he or she might
restrict the analysis to documents that contain the name of a protein, or some form
of the verb ‘to interact’, or one of its synonyms. Already, through application of IR,
the vast accumulation of scientific information can be reduced to a smaller subset
of relevant items.
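As a toy illustration of how an IR step narrows the collection before the heavier mining stages, the sketch below filters a hypothetical document list by query keywords; the documents and keyword list are invented for the example.

# A toy keyword filter standing in for an IR step; the documents and the
# keyword list are invented for illustration only.
documents = [
    "The protein BRCA1 interacts with RAD51 during repair.",
    "Annual rainfall statistics for the region.",
    "Two proteins were shown to interact in the assay.",
]
keywords = {"protein", "interact", "interacts", "interaction"}

def matches(doc, keywords):
    """Keep a document if it mentions any of the query keywords."""
    words = {w.strip(".,").lower() for w in doc.split()}
    return bool(words & keywords)

relevant = [d for d in documents if matches(d, keywords)]
print(relevant)  # only the two protein-interaction documents remain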
Natural Language Processing (NLP) is the analysis of human language so
that computers can understand research terms in the same way as
humans do. Although this goal is still some way off, NLP can perform
some types of analysis with a high degree of success. For example: