ISQS 6339, Business Intelligence
Supplemental Notes on the Term Project (Spring 2017)
Zhangxi Lin
Texas Tech University
Projects
Two data warehousing projects (70%)
 SQL Server based
 Hadoop based
Big data collaborative studies (30%)
 One presentation of 30-40 minutes, plus another 10 minutes for discussion
 Report & references
 Videos and demonstrations
 Total 60 points
Term project
3-5 students form a team to complete a data mart development project.
 Stage 1 (10%): One-page project proposal. Due April 11
 Stage 2 (20%): Data mart implementation. Due April 20
 Stage 3 (30%): Collaborative study. Due April 27
 Stage 4 (20%): Hadoop project completed. Due May 4
 Stage 5 (20%): Final report. Due May 12
Detailed instructions: http://zlin.ba.ttu.edu/6339/Projects17.html
Merits of data warehousing projects
A carefully developed project proposal that demonstrates understanding of the business requirements, attractive analytics themes, and a clearly defined project goal and objectives
A comprehensive data mart design, such as multiple fact tables, that supports the analytic themes
Application of advanced ETL models or techniques, such as slowly changing dimensions, the use of containers, etc.
Advanced OLAP cube design, and/or optional self-taught MDX scripting
Rich data analysis outcomes
A well-presented final report
Demonstration of creative ideas and skillful data warehousing ability
HADOOP PROJECTS
Components
Load Balancer
Oozie
Solr, SolrCloud, SolrJ, HA
NewSQL
Kafka, Storm, Impala
REST
ZK
MySQL
Nginx/HA-Proxy
Flume
Sqoop
Ganglia
Technology stack
Tomcat, Jetty
Avro
Big Data Presentation Topics
No. | Topic | Focus | Components | Team#
1 | Data warehousing | Hadoop data warehouse design | HDFS, HBase, Hive, NoSQL/NewSQL, Solr | DW1
2 | Publicly available big data services | Tools and free resources | Hortonworks, Cloudera, HaaS, EC2 | DW2
3 | MapReduce & data mining | Efficiency of distributed data/text mining | Mahout, H2O, R, Python | DW3
4 | Big data ETL | Heterogeneous data processing across platforms | Kettle, Flume, Sqoop, Impala | DW4
5 | System management | Load balancing and system efficiency | Oozie, ZooKeeper, Ambari, Loom, Ganglia | DW5
6 | Application development platform | Algorithms and innovative development environments | Tomcat, Neo4j, Pig, Hue | DW6
7 | Tools & visualizations | Features for big data visualization and data utilization | Pentaho, Tableau, Saiku, Mondrian, Gephi | DW7
8 | Streaming data processing | Efficiency and effectiveness of real-time data processing | Spark, Storm, Kafka, Avro |
Presentation
Data Warehousing Methodology
- Implementing a data warehouse systematically
Dimensional Modeling Process
Preparation
 Identifying roles and participants
 Understanding the data architecture strategy
 Setting up the modeling environment
 Establishing naming conventions
Data profiling and research
 Data profiling and source system exploration
 Interacting with source system experts
 Identifying core business users
 Studying existing reporting systems
Building dimensional models
 High-level dimensional model design
 Identifying dimension and fact attributes
 Developing the detailed dimensional model
 Testing the model
 Reviewing and validating the model
Business Dimensional Lifecycle
[Figure: lifecycle flow diagram. Boxes: Project Planning; Business Requirements Definition; Technical Architecture Design; Product Selection & Installation; Dimensional Modeling; Physical Design; ETL Design & Development; BI Application Specification; BI Application Development; Deployment; Maintenance; Growth; Project Management]
Data Profiling
Data profiling is a methodology for learning about the characteristics of the data.
It is a hierarchical process that attempts to build an assessment of the metadata associated with a collection of data sets.
Three levels
 Bottom: characterizing the values associated with individual attributes
 Middle: assessing relationships between multiple columns within a single table
 Highest: describing relationships that exist between data attributes across different tables
A program can be run against the sandbox source system to obtain the needed information.
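The three profiling levels above can be illustrated with a minimal sketch. It assumes a small SQLite database standing in for the sandbox source system; the tables (customer, orders), their columns, and the helper names are hypothetical and chosen only for illustration. Table and column names are interpolated into the SQL, so the helpers are meant for trusted identifiers only.

```python
import sqlite3

def profile_column(conn, table, column):
    """Bottom level: summarize the values of one attribute."""
    total, non_null, distinct, lo, hi = conn.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}), "
        f"MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()
    return {"rows": total, "nulls": total - non_null,
            "distinct": distinct, "min": lo, "max": hi}

def dependency_violations(conn, table, determinant, dependent):
    """Middle level: determinant values that map to more than one dependent value."""
    sql = (f"SELECT {determinant} FROM {table} GROUP BY {determinant} "
           f"HAVING COUNT(DISTINCT {dependent}) > 1 ORDER BY {determinant}")
    return [row[0] for row in conn.execute(sql)]

def orphan_keys(conn, child, child_key, parent, parent_key):
    """Highest level: child key values with no matching row in the parent table."""
    sql = (f"SELECT DISTINCT c.{child_key} FROM {child} c "
           f"LEFT JOIN {parent} p ON c.{child_key} = p.{parent_key} "
           f"WHERE c.{child_key} IS NOT NULL AND p.{parent_key} IS NULL "
           f"ORDER BY c.{child_key}")
    return [row[0] for row in conn.execute(sql)]

def build_demo():
    """Create a tiny in-memory database with deliberately dirty demo data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER, zip TEXT, city TEXT);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
        INSERT INTO customer VALUES (1,'79409','Lubbock'), (2,'79409','Lubbock'),
                                    (3,'73301','Austin'),  (4,'73301','Dallas');
        INSERT INTO orders VALUES (10,1,25.0), (11,2,NULL), (12,9,40.0);
    """)
    return conn
```

In this demo data, profiling orders.amount reports one null, the zip-to-city check flags the zip code that appears with two different cities, and the cross-table check flags the order whose customer_id has no customer row.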
ETL Methodology
Develop a high-level map
Build a sandbox source system (optional)
Perform detailed data profiling
Make decisions
 The source-to-target mapping
 How often to load each table
 The strategy for partitioning the relational and Analysis Services fact tables
 The strategy for extracting data from each source system
De-duplicate key data from each source system (optional)
Develop a strategy for distributing dimension tables across multiple database servers (optional)
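As one possible illustration of the source-to-target mapping and the optional de-duplication step listed above, the sketch below expresses a mapping as a plain dictionary and applies it row by row. The column names (CUST_ID, CUST_NM, SIGNUP_DT), the target names, and the helper functions are hypothetical; a real project would typically do this inside an ETL tool rather than in hand-written code.

```python
from datetime import datetime

# Hypothetical source-to-target map: target column -> (source column, transform).
CUSTOMER_MAP = {
    "customer_key": ("CUST_ID", int),
    "customer_name": ("CUST_NM", lambda s: s.strip().title()),
    "signup_date": ("SIGNUP_DT",
                    lambda s: datetime.strptime(s, "%m/%d/%Y").date().isoformat()),
}

def apply_mapping(source_row, mapping):
    """Build one target row by applying each mapped transform to its source field."""
    return {target: transform(source_row[source])
            for target, (source, transform) in mapping.items()}

def dedupe_by_key(rows, key):
    """Keep the first row seen for each key value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())
```

Keeping the mapping as data rather than code makes it easy to review with source system experts before any load logic is written.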
Sandbox Source System
Sandbox
 A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system.
 A term for the R&D department at many software and computer companies. The term is half-derisive, but reflects the truth that research is a form of creative play.
In the DW/BI context, a sandbox source system is a subset of the source database used for analytic exploration tasks
How to create
 Set up a static snapshot of the database
 By sampling
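Both ways of creating a sandbox can be sketched with Python's built-in sqlite3 module. This is a minimal illustration that assumes the source is a SQLite database; the function names and the default seed are hypothetical, and a production source system would use its own snapshot or backup facilities instead.

```python
import random
import sqlite3

def snapshot(source_conn):
    """Static snapshot: copy the whole source database into an in-memory sandbox."""
    sandbox = sqlite3.connect(":memory:")
    source_conn.backup(sandbox)  # sqlite3.Connection.backup, available in Python 3.7+
    return sandbox

def sample_table(source_conn, sandbox_conn, table, fraction, seed=0):
    """Sampling: replace `table` in the sandbox with a random sample of source rows."""
    rows = source_conn.execute(f"SELECT * FROM {table}").fetchall()
    k = max(1, round(len(rows) * fraction)) if rows else 0
    picked = random.Random(seed).sample(rows, k)
    sandbox_conn.execute(f"DELETE FROM {table}")
    if picked:
        placeholders = ",".join("?" * len(picked[0]))
        sandbox_conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})", picked)
    sandbox_conn.commit()
    return len(picked)
```

Because the sandbox is a separate database, exploratory queries and even destructive statements run there leave the source untouched. Note that sampling each table independently can break relationships between tables, which matters for cross-table profiling.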
Decision Issues in ETL System Design
Source-to-target mapping
Load frequency
How much history is needed
Strategies for Extracting Data
Extracting data from packaged source systems (self-contained data sources)
 Using their APIs may not be a good choice
 Using their add-on analytic systems may not be a good choice
Extracting directly from the source databases
 Strategies vary depending on the nature of the source database
Extracting data for incremental loads
 Depends on how the source database records changes to rows
Extracting historical data
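One common way a source database records row changes is a last-modified timestamp column. The sketch below assumes that case: each run extracts only rows changed after a stored watermark, and an empty watermark doubles as a full historical extract. The column name updated_at, the ISO-8601 text timestamps, and the function name are assumptions for illustration; sources that use change logs or triggers instead need a different approach.

```python
import sqlite3

def extract_changes(conn, table, modified_col, watermark):
    """Return rows modified after `watermark`, plus the new watermark to store.

    Timestamps are assumed to be ISO-8601 strings, which sort chronologically
    when compared as text. Passing "" as the watermark extracts all history.
    """
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {modified_col} > ? ORDER BY {modified_col}",
        (watermark,),
    )
    names = [col[0] for col in cur.description]
    rows = [dict(zip(names, values)) for values in cur.fetchall()]
    # Advance the watermark only when something was actually extracted.
    new_watermark = rows[-1][modified_col] if rows else watermark
    return rows, new_watermark
```

The caller is expected to persist the returned watermark after a successful load and pass it in on the next run, so that each change is picked up once.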