
ROLL NO.:                NAME:

CS 543 – Data Warehousing
Midterm Exam Solution
April 13, 2005
Duration: 75 minutes (11:45 to 13:00)
CS 543 (Sp 04/05) – Dr. Asim Karim

1. The following is NOT a reason for the failure of many data warehousing projects in the 1990s:
a. Lack of planning
b. Unclear motivation and goal for the data warehouse
c. Poor project management
d. Underestimating cost of development
e. Insufficient hardware capabilities
f. None of the above

2. A dependent data mart results from a
a. top-down approach to DW design
b. bottom-up approach to DW design
c. hybrid (practical) approach to DW design

3. List at least 5 different sources from which data can be integrated into a data warehouse.
- Production data / operational system data
- Internal data / private data / personal data / spreadsheets / databases, etc.
- Archived data
- External data / subscription data / industry stats, etc.

Questions 4 to 10 are based on the following description.

The relational schema below defines an operational system of a university (say LUMS).

Underlined attribute(s) = primary key(s)
Asterisked* attribute(s) = foreign key(s)

Course table: course_id, dept_id*, level_offered, pre-requisites, units, major
  level_offered = offering level of the course (freshman, sophomore, junior, senior, graduate)
  pre-requisites = list of pre-requisite course_ids
  units = number of units
  major = major in which course is offered

Student table: student_id, name, standing, major, units_completed
  standing = rank of student (freshman, sophomore, junior, senior, graduate)

Faculty table: faculty_id, name, rank, dept_id*
  rank = rank of faculty (assistant, associate, full, adjunct)

Department table: dept_id, name, HoD_id, majors_offered
  HoD_id = faculty_id of current HoD
  majors_offered = list of majors offered by the department

Enrollment table: course_id, student_id, faculty_id, quarter, year, grade
  quarter, year = quarter and year of offering
  grade = letter grade obtained

Relationship cardinalities (1 = one, m = many):
  Department 1:m Course
  Department 1:m Faculty
  Course 1:m Enrollment
  Student 1:m Enrollment
  Faculty 1:m Enrollment

4.
Is the schema in 3NF? If not, identify the normal form(s) and their violations. Normalize the schema to remove these violation(s).

No, the schema is not in 3NF. The primary violations are:

pre-requisites is a multi-valued attribute, which violates 1NF. To fix this, include multiple attributes pre-req1, pre-req2, pre-req3, each with a single atomic value.

  Course: course_id, dept_id*, level_offered, pre-req1, pre-req2, pre-req3, units, major_id

majors_offered is also a multi-valued attribute, which violates 1NF. To fix this, the best approach would be to create a new relation called major and link it with the department and course tables.

  Major: major_id, major_name, dept_id*

In the enrollment table, quarter and year are not fully dependent on the primary key. This is a violation of 2NF. To fix this, make quarter and year part of the primary key.

5. How many new records will get added to the enrollment table every year (4 quarters) if a student takes on average 3 courses per quarter? Assume there are 200 active students, 50 active faculty members, and 200 courses offered in the year.

Each row in the enrollment table represents a student taking a particular course offered by a given faculty in a given quarter/year.

No. of rows/year = 200 students * 3 courses/quarter * 4 quarters = 2,400

6. Create a complete information package diagram for the information given in the schema.

Subject: University Enrollment Analysis

Dimensions and their attributes:
  Time:    Quarter, Year
  Course:  Title, Level_offered
  Student: Name, Standing
  Faculty: Name, Rank
  Dept.:   Name
  Major:   Name

Metrics or Facts: Units, Grade (GPA equivalent)

7. Create a star schema for the information showing all dimension tables, fact tables, attributes, and the relationships between the tables. Identify key attributes and relationship cardinalities.

Notice that enrollment is not an explicit fact or metric; it is implied from the fact table.
Dimension tables (each related 1:m to the fact table):
  Time:    Time_key, Quarter, Year
  Student: Student_key, Name, Standing
  Course:  Course_key, Title, Level_offered
  Faculty: Faculty_key, Name, Rank
  Dept:    Dept_key, Name
  Major:   Major_key, Major_name

Fact table:
  Time_key*, Student_key*, Course_key*, Faculty_key*, Dept_key*, Major_key*, Units, GPA

8. Suppose faculty XYZ gets promoted to the rank of ‘associate professor’. What type of update is this? How would you update the faculty dimension table? Show the new dimension table and its attributes.

This is a type 2 update (not type 3, since the faculty is not likely to be demoted to ‘assistant professor’). A new row is added to the faculty dimension table with the new rank value of ‘associate professor’. A new attribute eff_date may also be added to track the time of change.

9. How many joins (based on the star schema) would be needed to answer the query: what is the average enrollment in courses offered by faculty ABC? Explain briefly.

Two joins only. The faculty dimension table is joined with the fact table to find all students who have taken a course offered by ABC. To find the average, the number of rows in the join is divided by the number of distinct course_keys.

10. Suppose a large number of queries will require aggregations for courses offered in a major. Modify the star schema to cater for efficient processing of such queries.

An aggregate fact table is added to the schema in which each row represents the aggregates for all courses in the major against time. Notice that the metric GPA is semi-additive and its average is computed in the aggregate table.

Dimension tables (each related 1:m to the fact table):
  Time:    Time_key, Quarter, Year
  Student: Student_key, Name, Standing
  Course:  Course_key, Title, Level_offered
  Faculty: Faculty_key, Name, Rank
  Dept:    Dept_key, Name
  Major:   Major_key, Major_name

Fact table:
  Time_key*, Student_key*, Course_key*, Faculty_key*, Dept_key*, Major_key*, Units, GPA

Aggregate fact table:
  Time_key, Major_key, Units, Av_GPA

11. What is a factless fact table?

A factless fact table does not have any numeric fact or metric. The metric is implied by the presence of the row in the fact table.

12.
How is the metadata in a data warehouse different from the data dictionary of an operational system database? Explain briefly.

The DW metadata maintains information relevant to the entire data warehousing environment and caters to the needs of humans as well as software components. It serves as a glue that ties together all components of the data warehousing environment. The data dictionary, on the other hand, provides basic information on the data structures and semantics.

13. The most important decisions in the logical and physical design of a data warehouse are/is:
a. Level of data granularity and method of data partitioning
b. Sizing of physical storage
c. Type of data structures
d. Type of modeling technique
e. None of the above

14. How many multi-way aggregate fact tables are possible for a 4 dimension star schema in which each dimension has 5 hierarchical attributes?

5 * 5 * 5 * 5 = 625

15. The snowflake schema is better than the star schema in the following way:
a. It saves large percentage of space
b. It is easier to maintain and keep consistent
c. It is easier to understand by end-users
d. It is more efficient for transaction storage
e. None of the above

16. List at least 5 key roles/titles (besides project manager) in a data warehouse project team.
- Executive sponsor
- User liaison manager
- Lead architect
- Infrastructure specialist
- Business analyst
- Data modeler
- DW administrator
- ETL specialist
- Quality assurance analyst
- Testing coordinator
- End-user applications specialist
- Programmer
- Lead trainer

17. The term ‘information crisis’ refers to
a. Inability to save all generated information
b. Inability to process all generated information
c. Paucity of information
d. Excess data but limited knowledge
e. Lack of good quality information

18. Indicate true or false in front of the following statements.
a.
ERP systems may be substituted for data warehouses (FALSE)
b. Metadata standards facilitate deploying a combination of best-of-breed products (TRUE)
c. A separate data staging platform is necessary for a data warehousing environment (FALSE/TRUE)
d. Business dimensions can be identified from operational system databases (TRUE)
e. Transactional fact tables can grow in size very quickly over time (TRUE)

19. (10 points) Match the columns (write the matching statements’ letter in front of the number).

 1. information package diagram (H)     A. determine data extraction
 2. need for drill-down (F)             B. provide OLAP
 3. data transformations (J)            C. provide data feed
 4. data sources (A)                    D. influences load management
 5. data aging (I)                      E. query management in DBMS
 6. sophisticated analysis (B)          F. low levels of data
 7. simple and complex queries (E)      G. larger staging area
 8. data volume (D)                     H. influence data design
 9. specialized DSS (C)                 I. possible pollution source
10. corporate data warehouse (G)        J. data staging design
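
As a rough illustration of the computation described in the answer to Question 9, the following Python sketch uses the standard-library sqlite3 module to join a faculty dimension table with the fact table and divide the number of joined rows by the number of distinct course keys. The table names (faculty_dim, enrollment_fact), the column name faculty_rank, the helper names, and all sample rows are assumptions made for illustration; they are not part of the exam solution, and only the columns needed for this query are populated with made-up values.

```python
import sqlite3


def build_toy_warehouse():
    """Create a tiny in-memory star schema fragment (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE faculty_dim (
            faculty_key  INTEGER PRIMARY KEY,
            name         TEXT,
            faculty_rank TEXT
        );
        CREATE TABLE enrollment_fact (
            time_key    INTEGER,
            student_key INTEGER,
            course_key  INTEGER,
            faculty_key INTEGER REFERENCES faculty_dim(faculty_key),
            units       INTEGER,
            gpa         REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO faculty_dim VALUES (?, ?, ?)",
        [(1, "ABC", "assistant"), (2, "XYZ", "associate")],
    )
    # Each row: (time_key, student_key, course_key, faculty_key, units, gpa)
    conn.executemany(
        "INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, 101, 10, 1, 4, 3.3),
            (1, 102, 10, 1, 4, 3.7),
            (1, 103, 10, 1, 4, 2.7),
            (1, 101, 11, 1, 3, 3.0),
            (1, 104, 12, 2, 4, 4.0),
        ],
    )
    return conn


def average_enrollment(conn, faculty_name):
    """Rows in the faculty/fact join divided by the distinct course keys."""
    enrollments, courses = conn.execute(
        """
        SELECT COUNT(*), COUNT(DISTINCT f.course_key)
        FROM enrollment_fact AS f
        JOIN faculty_dim AS d ON d.faculty_key = f.faculty_key
        WHERE d.name = ?
        """,
        (faculty_name,),
    ).fetchone()
    # Avoid division by zero when the faculty member has no fact rows.
    return enrollments / courses if courses else 0.0


conn = build_toy_warehouse()
print(average_enrollment(conn, "ABC"))
```

In this toy data, faculty ABC has four fact rows spread over two distinct courses, so the sketch reports an average enrollment of 2.0. Note that no enrollment count is stored anywhere: it is implied by the presence of rows in the fact table, as the note in Question 7 points out.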
