Download ProjOrient - Zhangxi Lin

ISQS 6339, Business Intelligence
Supplemental Notes on the Term Project (Spring 2017)
Zhangxi Lin
Texas Tech University

Projects
- Two data warehousing projects (70%)
  - SQL Server based
  - Hadoop based
- Big data collaborative studies (30%)
  - One presentation: 30-40 minutes, plus another 10 minutes for discussion
  - Report & references
  - Videos and demonstrations
- Total 60 points

Term project
- 3-5 students form a team to complete a data mart development project.
- Stage 1 (10%): One-page project proposal. Due April 11
- Stage 2 (20%): Data mart implementation. Due April 20
- Stage 3 (30%): Collaborative study. Due April 27
- Stage 4 (20%): Hadoop project completed. Due May 4
- Stage 5 (20%): Final report. Due May 12
Detailed instructions: http://zlin.ba.ttu.edu/6339/Projects17.html

Merits of data warehousing projects
- Carefully developed project proposal demonstrating an understanding of the business requirements, attractive analytics themes, and a clearly defined project goal and objectives
- Comprehensive data mart design, such as multiple fact tables, with supporting analytic themes
- Applications of advanced ETL models or techniques, such as slowly changing dimensions, the use of containers, etc.
- Advanced OLAP cube design, and/or optional self-taught MDX scripting
- Rich data analysis outcomes
- Well-presented final report
- Demonstration of creative ideas and skillful data warehousing ability

HADOOP PROJECTS

Components and technology stack (diagram labels)
Load Balancer; Oozie; Solr, SolrCloud, SolrJ; HA; NewSQL; Kafka, Storm, Impala; REST; ZK; MySQL; Nginx/HA-Proxy; Flume; Sqoop; Ganglia; Tomcat, Jetty; Avro

Big Data Presentation Topics
1. Data warehousing
   Focus: Hadoop data warehouse design
   Components: HDFS, HBase, HIVE, NoSQL/NewSQL, Solr
   Team#: DW1
2. Publicly available big data services
   Focus: Tools and free resources
   Components: Hortonworks, CloudEra, HaaS, EC2
   Team#: DW2
3. MapReduce & Data mining
   Focus: Efficiency of distributed data/text mining
   Components: Mahout, H2O, R, Python
   Team#: DW3
4. Big data ETL
   Focus: Heterogeneous data processing across platforms
   Components: Kettle, Flume, Sqoop, Impala
   Team#: DW4
5. System management
   Focus: Load balancing and system efficiency
   Components: Oozie, ZooKeeper, Ambari, Loom, Ganglia
   Team#: DW5
6. Application development platform
   Focus: Algorithms and innovative development environments
   Components: Tomcat, Neo4J, Pig, Hue
   Team#: DW6
7. Tools & Visualizations
   Focus: Features for big data visualization and data utilization
   Components: Pentaho, Tableau, Saiku, Mondrian, Gephi
   Team#: DW7
8. Streaming data processing
   Focus: Efficiency and effectiveness of real-time data processing
   Components: Spark, Storm, Kafka, Avro
   Team#: Presentation

Data Warehousing Methodology - Implementing a data warehouse systematically

Dimensional Modeling Process
- Preparation
  - Identifying roles and participants
  - Understanding the data architecture strategy
  - Setting up the modeling environment
  - Establishing naming conventions
- Data profiling and research
  - Data profiling and source system exploration
  - Interacting with source system experts
  - Identifying core business users
  - Studying existing reporting systems
- Building dimensional models
  - High-level dimensional model design
  - Identifying dimension and fact attributes
  - Developing the detailed dimensional model
  - Testing the model
  - Reviewing and validating the model

Business Dimensional Lifecycle (diagram labels)
- Project Planning
- Business Req'ts Definition
- Technology track: Technical Arch. Design; Product Selection & Installation
- Data track: Dimensional Modeling; Physical Design; ETL Design & Development
- BI application track: BI Appl. Specification; BI Application Development
- Deployment
- Maintenance
- Growth
- Project Management

Data Profiling
- Data profiling is a methodology for learning about the characteristics of the data.
- It is a hierarchical process that attempts to build an assessment of the metadata associated with a collection of data sets.
- Three levels
  - Bottom: characterizing the values associated with individual attributes
  - Middle: assessing relationships between multiple columns within a single table
  - Highest: describing relationships that exist between data attributes across different tables
- A program can be run against the sandbox source system to obtain the needed information.
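The three profiling levels can be illustrated with a small program. The following is a minimal Python sketch, not the course's prescribed tool: the function names, the `customers` and `orders` sample tables, and their columns are all hypothetical, and a real project would more likely run equivalent queries against the sandbox source system.

```python
from collections import defaultdict


def profile_column(rows, column):
    """Bottom level: summarize the values of a single attribute."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def dependency_violations(rows, determinant, dependent):
    """Middle level: find determinant values that map to more than one
    dependent value within the same table."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


def orphan_keys(child_rows, child_key, parent_rows, parent_key):
    """Highest level: find child key values with no matching parent row."""
    parent_values = {row[parent_key] for row in parent_rows}
    return {
        row[child_key]
        for row in child_rows
        if row[child_key] is not None and row[child_key] not in parent_values
    }


# Hypothetical sample data standing in for a sandbox source system.
customers = [
    {"customer_id": 1, "zip": "79409", "city": "Lubbock"},
    {"customer_id": 2, "zip": "79409", "city": "Lubock"},  # misspelled city
    {"customer_id": 3, "zip": None, "city": "Austin"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 4},  # no such customer
]

zip_profile = profile_column(customers, "zip")
city_conflicts = dependency_violations(customers, "zip", "city")
orphans = orphan_keys(orders, "customer_id", customers, "customer_id")
```

Here `zip_profile` reports null and distinct counts for one column, `city_conflicts` flags a zip code associated with two different city spellings, and `orphans` lists order keys that reference a missing customer.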
ETL Methodology
- Develop a high-level map
- Build a sandbox source system (optional)
- Detailed data profiling
- Make decisions
  - The source-to-target mapping
  - How often to load tables
  - The strategy for partitioning the relational and Analysis Services fact tables
  - The strategy for extracting data from each source system
- De-duplicate key data from each source system (optional)
- Develop a strategy for distributing dimension tables across multiple database servers (optional)

Sandbox Source System
- Sandbox
  - A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system.
  - A term for the R&D department at many software and computer companies. The term is half-derisive, but reflects the truth that research is a form of creative play.
- In the DW/BI context, a sandbox source system is a subset of the source database used for analytic exploration tasks.
- How to create
  - Set up a static snapshot of the database
  - By sampling

Decision Issues in ETL System Design
- Source-to-target mapping
- Load frequency
- How much history is needed

Strategies for Extracting Data
- Extracting data from packaged source systems (self-contained data sources)
  - May not be good to use their APIs
  - May not be good to use their add-on analytic system
- Extracting directly from the source databases
  - Strategies vary depending on the nature of the source database
- Extracting data from incremental loads
  - How the source database records the changes of the rows
- Extracting historical data
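One common way to extract incremental loads is to keep a "watermark" of the latest change already loaded. The Python sketch below assumes the source table carries a `last_modified` timestamp column; that column, the `customer` table, and the `extract_changed_rows` helper are hypothetical, and sources that record changes differently (triggers, logs, change data capture) would need another approach. It uses an in-memory SQLite database only to stay self-contained.

```python
import sqlite3


def extract_changed_rows(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT customer_id, name, last_modified FROM customer "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed since the last load.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# Hypothetical source table; ISO-formatted timestamps compare correctly as text.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT)"
)
source.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "Alice", "2017-04-01 09:00:00"),
        (2, "Bob", "2017-04-02 10:30:00"),
    ],
)

# Historical load: a very old watermark pulls every existing row.
first_batch, watermark = extract_changed_rows(source, "1900-01-01 00:00:00")

# A row changes in the source system after the first load.
source.execute(
    "UPDATE customer SET name = ?, last_modified = ? WHERE customer_id = ?",
    ("Robert", "2017-04-03 08:15:00", 2),
)

# Incremental load: only the changed row comes back.
second_batch, watermark = extract_changed_rows(source, watermark)
```

The same function covers both the historical extract (an initial watermark older than any row) and later incremental extracts, which is why the load frequency and history decisions above can be settled independently of the extraction code.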
