DMBI Question Bank: Units 2, 4, and 6

1. What is a data warehouse? What are the characteristics of a data warehouse?
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."
A data warehouse is a structured repository of historic data. It is developed in an evolutionary process by integrating data from non-integrated legacy systems, and it is a collection of data in support of management decision processes.
Characteristics:
1) Subject-oriented
2) Integrated
3) Time-variant
4) Nonvolatile

2. What is data warehousing?
Data warehousing is the process of constructing and using data warehouses.

3. A data warehouse is time-variant. Explain this.
The time horizon for the data warehouse is significantly longer than that of operational systems:
o Operational database: current value data
o Data warehouse: provides information from a historical perspective (e.g., the past 5-10 years)
Every key structure in the data warehouse:
o Contains an element of time, explicitly or implicitly
o The key of operational data, by contrast, may or may not contain a time element

4. Explain the multi-tiered architecture of a data warehouse.
Mainly 3 tiers:
o Data storage
o OLAP servers
o Front-end tools

5. Explain Extraction, Transformation, and Loading (ETL).
o Data extraction: get data from multiple, heterogeneous, and external sources
o Data cleaning: detect errors in the data and rectify them when possible
o Data transformation: convert data from legacy or host format to warehouse format
o Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
o Refresh: propagate the updates from the data sources to the warehouse

6. Explain the various data warehouse models.
o Enterprise warehouse: collects all of the information about subjects spanning the entire organization
o Data mart: a subset of corporate-wide data that is of value to a specific group of users.
  Its scope is confined to specific, selected groups, such as a marketing data mart. Data marts may be independent or dependent (populated directly from the warehouse).
o Virtual warehouse: a set of views over operational databases; only some of the possible summary views may be materialized.

7. What is a metadata repository?
Metadata is the data defining warehouse objects. A metadata repository stores:
o A description of the structure of the data warehouse: schema, views, dimensions, hierarchies, derived data definitions, and data mart locations and contents
o Operational metadata: data lineage (the history of migrated data and its transformation path), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, audit trails)
o The algorithms used for summarization
o The mapping from the operational environment to the data warehouse
o Data related to system performance: warehouse schema, view, and derived data definitions
o Business data: business terms and definitions, ownership of data, and charging policies

8. Differentiate OLTP and OLAP.
OLTP (online transaction processing) is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view.

9. What is the confidence of a rule? Give the formula for confidence.
The confidence of a rule is defined as:
conf(X → Y) = supp(X ∪ Y) / supp(X)
For the rule {milk, bread} => {butter} we have the following confidence:
supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65
This means that the rule is correct for 65% of the transactions containing milk and bread. Confidence can be interpreted as an estimate of the conditional probability P(Y | X): the probability of finding the RHS of the rule in transactions, given that these transactions also contain the LHS.

10. What is an itemset?
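The support and confidence formulas used in questions 9 and 14 can be checked with a short script. This is a minimal sketch; the five-transaction list below is an illustrative assumption, not the 15-transaction example database from the text:

```python
# Minimal sketch of support and confidence for association rules.
# The transaction list is a made-up illustration, not the example database.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk"},
    {"milk", "bread", "butter"},
]

def support(itemset, transactions):
    """supp(X): fraction of transactions that contain every item of X."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """conf(X -> Y) = supp(X ∪ Y) / supp(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"milk", "bread"}, transactions))                  # 0.6
print(confidence({"milk", "bread"}, {"butter"}, transactions))
```

Here conf({milk, bread} → {butter}) comes out to supp({milk, bread, butter}) / supp({milk, bread}) = 0.4 / 0.6 ≈ 0.67 on this toy data, mirroring the calculation pattern in the answer above.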
What is frequent itemset mining? Which are the frequent itemset mining methods?

11. Which are the applications of frequent itemset mining?
Applications:
1. Basket data analysis
2. Cross-marketing
3. Catalog design
4. Sale campaign analysis
5. Web log (click stream) analysis
6. DNA sequence analysis
7. Program/service selection

12. What does confidence tell us about a rule?
The confidence of a rule is defined as:
conf(X → Y) = supp(X ∪ Y) / supp(X)
For the rule {milk, bread} => {butter} we have the following confidence:
supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65
This means that the rule is correct for 65% of the transactions containing milk and bread. Confidence can be interpreted as an estimate of the conditional probability P(Y | X): the probability of finding the RHS of the rule in transactions, given that these transactions also contain the LHS.

13. Find the frequent itemsets from the following data for market-basket analysis by applying the Apriori algorithm. Take a minimum support value of 2.
1: B, C, D
2: A, B, C, E
3: B, D
4: A, D, E

14. What is support?
The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset:
supp(X) = (number of transactions which contain the itemset X) / (total number of transactions)
In the example database, the itemset {milk, bread, butter} has a support of 4/15 ≈ 0.26, since it occurs in 26% of all transactions. To be even more explicit: 4 is the number of transactions from the database which contain the itemset {milk, bread, butter}, while 15 represents the total number of transactions.

15. Write the Apriori algorithm. What are the limitations of the Apriori algorithm? How can we improve the Apriori algorithm?

16. Explain various aspects of data mining.

17. Write a short note on opinion mining.
Opinion Mining
In today's competitive world, everyone wants to give the best education to their children, and every institute tries to prove that it is the best of all.
Students talk about their institute face to face or behind its back, and an institute's valuation depends on what students feel about it. Students compare their institute with other institutes. They write feedback about the course every semester and discuss it with their friends through chat rooms, mail, social media, etc. The aim of the proposed model is to extract facts from different sources such as blogs, comments, feedback, and social media. In an academic institution, student feedback about a course can be considered a significant informative resource for improving the course.

The employees of the institution are also an important factor. Employees can produce great ideas, and they too discuss the institute with their friends and family members. Similarly, parents, industries, and other institutes related to the institution pass comments formally or informally. They pass serious and meaningful comments on the institution, either positive or negative.

Students give feedback every year or every semester about the course, the institute, the faculty, and the facilities provided to them, and based on this feedback institutes take steps to eliminate the drawbacks the students specify. Industries also play a major role in the progress of an institution, because ultimately an institute is known by the number of students recruited into well-known companies and their packages. The aim of the institute is therefore to provide students with knowledge that will be helpful in any field, which is possible only if the institute is aware of the current trends and technologies used by industry. Institutes can eliminate their weak areas by using industry feedback and can reduce the gap between industry and academia.

Parents and other people associated with the institute likewise discuss it with their colleagues, family members, and friends. Opinions can create an immense impact on an institute.
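Returning to the Apriori exercise in question 13, the frequent itemsets for those four transactions (minimum support count 2) can be enumerated level-wise with a short script. This is a brute-force illustrative sketch of the Apriori idea, not an optimized implementation:

```python
# Transactions from question 13; minimum support count = 2.
transactions = [{"B", "C", "D"}, {"A", "B", "C", "E"}, {"B", "D"}, {"A", "D", "E"}]
MIN_SUP = 2

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Level-wise (Apriori-style) search: only frequent k-itemsets are
# extended to (k+1)-itemset candidates.
items = sorted({i for t in transactions for i in t})
frequent = {}
level = [frozenset([i]) for i in items if count(frozenset([i])) >= MIN_SUP]
while level:
    for s in level:
        frequent[s] = count(s)
    # Candidate (k+1)-itemsets: unions of frequent k-itemsets.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if count(c) >= MIN_SUP]

for s, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(s), c)
```

On this data the frequent itemsets are the singletons {A}, {B}, {C}, {D}, {E} and the pairs {A, E}, {B, C}, {B, D}; the only candidate triple, {B, C, D}, occurs in just one transaction and is pruned.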
18. Write a short note on educational data mining.
Education Data Mining
Across the spectrum of data mining applications we have discussed, education is one of the most important. Higher educational organizations operate in a very competitive environment and aim to gain advantages over other competitors. Educational organizations consider students and professors their main assets and want to improve their key process indicators through effective and efficient use of those assets. To remain competitive in the educational domain, these organizations need to implement current trends, which are constantly evolving while new ones emerge. By following these trends and technologies, higher-education institutions will be able to prepare students to become the next generation of productive employees and innovative leaders the world needs.

Most higher-education institutions need to implement current higher-educational trends, and the knowledge required for this can be extracted from the historical and operational data that reside in the educational organization's database. One of the significant facts about higher learning institutions is the explosive growth of educational data; these data are increasing rapidly without any benefit to management. Data mining techniques are analytical tools that can be used to extract meaningful knowledge from large data sets. This chapter addresses the capabilities of data mining in higher learning institutions by proposing a new guideline for data mining applications in education, focusing on how data mining may help implement current trends in higher learning institutions.

Data mining can be applied in almost all areas to produce a variety of results. In recent years, researchers have found innovative applications of data mining in the field of education as well. This innovative science is termed Education Data Mining, also referred to as EDM.
EDM can be defined as the area of scientific inquiry centered on the development of methods for making discoveries within the unique kinds of data that come from educational settings, and on using those methods to better understand students and the settings in which they learn. Educational data mining emerged as an independent research area in recent years, culminating in 2008 with the establishment of the annual International Conference on Educational Data Mining and the Journal of Educational Data Mining. The aim of using data mining in the education field is to enhance educational performance in many ways.

Knowledge discovery requires a clear methodology that can be successfully applied in the education sector. This can be obtained by using the CRoss-Industry Standard Process for Data Mining (CRISP-DM). CRISP-DM is the method used to carry out data mining knowledge discovery on the database of schools/colleges. Three data mining methods, Naïve Bayes, nearest neighbor, and the C4.5 decision tree, were implemented on the school data. The results showed that the C4.5 decision tree is significantly more accurate than the other methods.

19. Write a short note on text mining.
Text Mining
Text mining involves the application of techniques from areas such as information retrieval, natural language processing, information extraction, and data mining. These various stages can be combined into a single workflow.

Information Retrieval (IR) systems identify the documents in a collection which match a user's query. The most well-known IR systems are search engines such as Google, which allow the identification of a set of documents that relate to a set of key words. As text mining involves applying very computationally intensive algorithms to large document collections, IR can speed up the discovery cycle considerably by reducing the number of documents to be analysed. For example, if a researcher is interested in mining information only about protein interactions, he/she might restrict the analysis to documents that contain the name of a protein, or some form of the verb 'to interact', or one of its synonyms. Already, through the application of IR, the vast accumulation of scientific information can be reduced to a smaller subset of relevant items.

Natural Language Processing (NLP) is the analysis of human language so that computers can understand research terms in the same way as humans do. Although this goal is still some way off, NLP can perform some types of analysis with a high degree of success. For example:
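The IR pre-filtering step described for the protein-interaction scenario can be sketched as a simple keyword filter. The documents, protein names, and synonym list below are illustrative assumptions, not a real corpus or lexicon:

```python
# Minimal sketch of IR-style pre-filtering before expensive text mining.
# Documents, protein names, and interaction terms are made-up illustrations.
documents = [
    "The kinase BRCA1 interacts with BARD1 in vivo.",
    "Annual rainfall in the region has declined sharply.",
    "Binding of p53 to MDM2 regulates the cell cycle.",
]

# Keep a document if it mentions a known protein name or a form of
# "to interact" (or one of its rough synonyms).
proteins = {"brca1", "bard1", "p53", "mdm2"}
interaction_terms = {"interact", "interacts", "interaction", "binds", "binding"}

def relevant(doc):
    words = {w.strip(".,").lower() for w in doc.split()}
    return bool(words & (proteins | interaction_terms))

subset = [d for d in documents if relevant(d)]
print(len(subset))  # the collection shrinks before the mining step
```

Only the two biomedical sentences survive the filter, so the downstream (computationally intensive) mining algorithms run on a much smaller collection, which is exactly the speed-up IR provides in the workflow above.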