BIG DATA TODAY
Field Reports from the Inquidia Rescue Team:
11 Steps to Make Sure Your Pentaho Deployment Succeeds and Thrives
Bryan Senseman
Partner, Inquidia Consulting
About This Session
Over the past 9 years, we’ve consulted with nearly 100 Pentaho customers. This session highlights 11 best practices
and pitfalls to avoid to best exploit the power of Pentaho.
Topics cover:
 Data Architecture
 Data Engineering
 BI Application Development
 Environment & Software Management
 Organizational Concerns (briefly…)
About Inquidia Consulting
We deliver full-spectrum data and analytics services: Strategy, Architecture, Engineering, Analytics, and Science.
 Pentaho Partner Since 2006 (Version 1.x)
 2-Time winner of Pentaho Partner Award
 60+ Successful Pentaho Projects
 Pioneered Pentaho Training Curriculum
‒ Original Trainer for public and on-site Pentaho training
 Member of Pentaho Big Data Launch Team
 Frequent Pentaho Marketplace Plug-in Development
1. Database Performance Matters
Perhaps the #1 reason ETL and BI applications fail is poor database design, tuning, and selection.
 The Zen of Database Design
‒ Intelligent Use of Indexes for both ETL and Query processing (see the sketch below)
‒ Turn off Referential Integrity
‒ Use Surrogate Keys
‒ Streamline Audit/Timestamp Column Usage
 You can “kill problems” with hardware
‒ Switching to SSDs provides a significant lift
 Multi-purpose vs Analytical Databases
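To make the index and surrogate-key advice concrete, here is a minimal sketch (not Pentaho code) of the drop-load-rebuild index pattern, using Python's sqlite3 as a stand-in for the warehouse database; the table, column, and index names are invented for illustration:

```python
import sqlite3

# SQLite stands in for the warehouse database; table, index, and data
# are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key, not the source system's ID
        source_id    TEXT,
        name         TEXT
    )
""")
conn.execute("CREATE INDEX ix_customer_source ON dim_customer (source_id)")

def bulk_load(rows):
    # Drop the secondary index before a large load and rebuild it afterward:
    # one rebuild is usually far cheaper than per-row index maintenance,
    # while queries still get the index once loading is done.
    conn.execute("DROP INDEX ix_customer_source")
    conn.executemany(
        "INSERT INTO dim_customer (source_id, name) VALUES (?, ?)", rows
    )
    conn.execute("CREATE INDEX ix_customer_source ON dim_customer (source_id)")
    conn.commit()

bulk_load([("A-100", "Acme"), ("B-200", "Bix")])
```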
2. Design ETL Pessimistically
ETL jobs will fail; it’s a matter of how elegantly they die and how efficiently you can remediate and restart.
 Expect Errors and Act on Failure!
‒ Bad Data, Late Data, Environment Issue
 Use the Data Validator step to improve robustness and failure messaging
‒ Data Types, Formats, Nulls, Valid Values, etc.
 Architect restartability into your jobs (sketched below)
‒ Always think about Delta processing
‒ “Upsert” vs Trunc/Load Patterns
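A minimal sketch of these habits together, validation, loud failure, and a restartable upsert, in Python with sqlite3; the schema and validation rules are invented, and ON CONFLICT upserts need SQLite 3.24+:

```python
import sqlite3

# Pessimistic ETL in miniature: validate first, fail loudly with an
# actionable message, and load via an idempotent upsert so the job can
# simply be re-run after remediation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (order_id TEXT PRIMARY KEY, amount REAL)")

def validate(row):
    order_id, amount = row
    if not order_id:
        raise ValueError(f"bad data: missing order_id in {row!r}")
    if amount is None or amount < 0:
        raise ValueError(f"bad data: invalid amount in {row!r}")

def load(rows):
    for row in rows:
        validate(row)  # act on failure: stop with a clear message
    # Upsert instead of truncate/load: re-running after a failure cannot
    # duplicate rows, so restartability and delta processing come for free.
    conn.executemany(
        """INSERT INTO fact_sales (order_id, amount) VALUES (?, ?)
           ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount""",
        rows,
    )
    conn.commit()

load([("o-1", 10.0), ("o-2", 25.5)])
load([("o-2", 27.0), ("o-3", 5.0)])  # safe re-run: o-2 updated, o-3 inserted
```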
3. Always Be Tuning (your ETL)
Data volumes change. Content skews. More jobs are added.
But your data processing window is likely shrinking.
 Set up Diagnostics
‒ Database Logging
‒ Operations Mart
 Tune Low-Hanging Fruit
‒ Parallel job/transform execution
‒ Multithread bottlenecking steps
‒ Choose the right lookup strategy (see the sketch below)
‒ Intelligent upsert logic vs the Insert/Update step
‒ ETL vs ELT: row vs set processing
 Smart Use of DB-Specific Features
‒ Indexes can help and hurt performance
‒ Selective use of DB links & stored procedures*
*Beware the downsides of DB links & stored procedures
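As an illustration of the lookup-strategy point, here is a minimal sketch contrasting per-row database lookups with a cached in-memory lookup, in Python with invented table names; the trade-off is roughly the one you face when choosing between PDI's Database Lookup and an in-memory Stream Lookup:

```python
import sqlite3

# Two lookup strategies for a key-resolution step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, sku TEXT UNIQUE)")
conn.executemany("INSERT INTO dim_product (sku) VALUES (?)", [("SKU-1",), ("SKU-2",)])

def lookup_per_row(skus):
    # One round trip per incoming row: simple, but a bottleneck at volume.
    return [conn.execute(
        "SELECT product_key FROM dim_product WHERE sku = ?", (sku,)
    ).fetchone()[0] for sku in skus]

def lookup_cached(skus):
    # Load the dimension once into memory, then resolve keys at hash speed.
    # Right choice when the dimension fits in RAM; wrong for huge dimensions.
    cache = dict(conn.execute("SELECT sku, product_key FROM dim_product"))
    return [cache[sku] for sku in skus]

assert lookup_per_row(["SKU-1", "SKU-2"]) == lookup_cached(["SKU-1", "SKU-2"])
```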
4. Tier Your Data Architecture
A 1-hop architecture from Sources to Star Schema is fragile.
 Modern architectures need to ingest more frequently than they process/analyze.
 Agile analytics often requires restructuring data for optimization
 Analytic database technology often forces bulk loads and an ELT approach
 Using 2+ tiers allows you to modularize your architecture.
[Diagram: Sources (DBs, Web Services, Files) → ETL → EDW / ODS / Historical ODS / etc. → ETL → Analytic DBs and Data Marts]
4. Tier Your Data Architecture (continued)
An example big-data implementation of the ingest/process/analyze tiers (a minimal code sketch follows the diagram):
[Diagram: an ingest/process/analyze pipeline. Web servers and external parties feed logs via Flume into a Hadoop cluster holding raw and aggregate data across data nodes; process orchestration runs Pentaho MapReduce, Hive, and Pig jobs with process logging; aggregate processing lands results in MongoDB and an analytic DB (configuration, dims, aggregates), which serve data services, an info-exchange access portal, real-time metrics dashboards, OLAP, and standard reports.]
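A minimal sketch of the modularity argument: three independently runnable tiers handing data off through a landing/staging/mart layout. The function names and file-based handoff are illustrative stand-ins for separately scheduled Pentaho jobs:

```python
import json
from pathlib import Path

LANDING = Path("landing")    # raw, ingested as-is, frequently
STAGING = Path("staging")    # conformed / cleansed
MART    = Path("mart")       # analysis-ready aggregates

def ingest(records):
    LANDING.mkdir(exist_ok=True)
    (LANDING / "batch.json").write_text(json.dumps(records))

def process():
    STAGING.mkdir(exist_ok=True)
    raw = json.loads((LANDING / "batch.json").read_text())
    clean = [r for r in raw if r.get("amount") is not None]
    (STAGING / "batch.json").write_text(json.dumps(clean))

def analyze():
    MART.mkdir(exist_ok=True)
    rows = json.loads((STAGING / "batch.json").read_text())
    total = sum(r["amount"] for r in rows)
    (MART / "summary.json").write_text(json.dumps({"total": total}))

# Each tier can be re-run or rescheduled independently:
# ingest hourly, process nightly, analyze on demand.
ingest([{"amount": 10}, {"amount": None}, {"amount": 5}])
process()
analyze()
```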
5. Use the Right Tool for the Job
Your information delivery mechanism matters.
 Cubes/Analysis Reports, best used when/for/by:
‒ Ad-hoc/exploratory analysis (pivot tables on ’roids)
‒ Aggregations with selective drilling
‒ Interactive crosstabs and charts
‒ Data Analysts
 Interactive Reports, best used for:
‒ Ad-hoc, interactive reports (break-group formatting)
‒ Small partitions of detail-grain data
‒ Operational list reporting
 PRPT Reports, best used when/for:
‒ Pixels and rendering matter
‒ Repeatable and standardized reports
‒ Compound subject areas (mashups)
 Dashboard Designer, best used for:
‒ KPI reporting with common graphics
‒ Speed of development
 C*Tools Apps, best used when:
‒ Richness of UI is the priority
‒ Embedding custom controls
6. Intelligent Agg Design for Pentaho Analysis
Aggregates are the top priority for Pentaho Analysis performance (a small worked example follows this list).
 Which combinations of dimensions should be aggregated?
‒ Ensure every (non-high-cardinality) dimension exists in at least one aggregate
‒ Capture and analyze Mondrian SQL logs to identify new candidates
‒ Order-of-magnitude rule of thumb
‒ Don’t throw away the “freebies”
 Performance test with real volume/cardinality
 Analytic databases still benefit from aggregates!
 One more thing: don’t forget that Mondrian caching matters too!
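A minimal sketch of what an aggregate table buys, in Python with sqlite3 and invented table names; in Mondrian, such tables are additionally declared in the cube schema so the engine can route queries to them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_sales (
    date_key INTEGER, product_key INTEGER, customer_key INTEGER, amount REAL)""")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(20150101, 1, 7, 10.0), (20150101, 1, 8, 4.0), (20150102, 2, 7, 6.0)],
)

# Collapse the high-cardinality customer dimension; keep date and product.
# Queries that don't slice by customer now read far fewer rows.
conn.execute("""
    CREATE TABLE agg_sales_date_product AS
    SELECT date_key, product_key,
           SUM(amount) AS amount_sum,
           COUNT(*)    AS fact_count
    FROM fact_sales
    GROUP BY date_key, product_key
""")
print(conn.execute("SELECT * FROM agg_sales_date_product").fetchall())
```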
7. Configure BA Server w/ Scalability in Mind
Your Pentaho environment is going to be a success; expect increasing usage.
 Scaling Memory
‒ Primary memory usage comes from Mondrian
‒ Less so from report generation
 Scaling Compute
‒ The primary driver of processor use is concurrent usage
‒ 100 : 10 : 1 rule of thumb
 Configure for Failover & Scalability
‒ Separate the Web Server from the App Server
‒ Separate the Repository DB Server from the App Server
‒ You can always configure this “later”…but why wait?
8. Assume Regular Software Upgrades
Upgrades are your friends; embrace them.
 Establish a Cadence in sync with Pentaho’s Release Schedule
‒ Upgrade to each major release annually. It’s mandatory! Plan (budget) for it.
‒ Upgrade to incremental point releases as your needs dictate
 Establish Separate Dev, QA/Test, and Prod Environments
‒ It’s OK to develop your initial solution in Prod, but too many clients stop there.
‒ You need separate environments!
 Regression Test before Deploying (see the sketch below)
‒ Copy Prod to QA and thoroughly regression test
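A minimal sketch of one way to regression test: run the same report queries against the prod copy and the upgraded QA instance and diff the results. The query list and database paths are placeholders:

```python
import sqlite3

# Hypothetical report queries to verify across environments.
QUERIES = {
    "daily_totals": "SELECT date_key, SUM(amount) FROM fact_sales GROUP BY date_key",
}

def snapshot(db_path):
    conn = sqlite3.connect(db_path)
    return {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}

def regression_diff(prod_copy, qa):
    before, after = snapshot(prod_copy), snapshot(qa)
    return {name: (before[name], after[name])
            for name in QUERIES if before[name] != after[name]}

# An empty dict means the upgrade changed nothing the tested reports depend on.
# print(regression_diff("prod_copy.db", "qa.db"))
```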
9. No Shortcuts to Team Productivity
Automate and standardize so your development team focuses on what matters.
 Implement code management (e.g., git)
‒ Establish an integration and deployment process and follow it
 Issue Management
‒ No more excuses for this with JIRA and Trello available
 Replicable/Consistent Developer Environments (see the sketch below)
‒ shared.xml, kettle.properties, JNDI
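A minimal sketch of one way to keep developer environments replicable: generate kettle.properties from a single checked-in definition per environment, so every developer resolves the same ${VARIABLES} in transformations. The variable names and paths are invented examples:

```python
from pathlib import Path

# One checked-in definition per environment; variables and paths are
# invented examples.
ENVIRONMENTS = {
    "dev":  {"DB_HOST": "localhost",     "STAGING_DIR": "/data/dev/staging"},
    "qa":   {"DB_HOST": "qa-db.local",   "STAGING_DIR": "/data/qa/staging"},
    "prod": {"DB_HOST": "prod-db.local", "STAGING_DIR": "/data/prod/staging"},
}

def write_kettle_properties(env, kettle_home):
    # Kettle reads kettle.properties from $KETTLE_HOME/.kettle (or ~/.kettle);
    # this writes to a scratch directory to avoid clobbering a real setup.
    kettle_home.mkdir(parents=True, exist_ok=True)
    lines = [f"{key}={value}" for key, value in sorted(ENVIRONMENTS[env].items())]
    (kettle_home / "kettle.properties").write_text("\n".join(lines) + "\n")

write_kettle_properties("dev", Path("build/.kettle"))
```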
10. Staff with the Right Skillsets
Finding staff who like what they do will motivate innovation.
 Software Engineering ≠ Data Engineering
‒ Very few web/app developers successfully become ETL/data programmers.
‒ Don’t force this!
 ETL support must be provided by Data Engineers
‒ Pure operators/admins are usually unable to remediate unexpected data anomalies and bugs
 BI Developers should be more concerned with communicating meaning than with visual design
11. Properly Fund the Ongoing Budget
Analytics solutions are not ERP implementations.
 Analytics is organic; plan for evergreen resourcing
‒ It’s not maintenance; it’s much broader than that
‒ Use consultants for a jumpstart and/or extra bandwidth, but…
‒ Make sure your staff becomes intimately involved so ownership can transition
 Deprecate, Deprecate, Deprecate
‒ BI components have a lifespan; continually prune the deadwood
 Leverage a BI Competency Center
‒ For user support/training and application extension/requirements
Summary
What We Covered Today:
 Common pitfalls to avoid and best practices to follow, both for Pentaho implementations and for analytics solutions generally
 With some diligence and, perhaps, some expert help, you can
have a successful Pentaho analytics solution!
Next Steps
Want to learn more?
 Visit the Inquidia Consulting booth in the Expo Hall
 Feel free to contact me:
Bryan Senseman
[email protected]
m: 317-514-4525
We’ll help you become data-driven.
Thank You
Join the Conversation
#PWorld15
Follow us on Twitter: @inquidia