BIG DATA TODAY
Field Reports from the Inquidia Rescue Team: 10+1 Steps to Make Sure Your Pentaho Deployment Succeeds and Thrives
Bryan Senseman, Partner, Inquidia Consulting

About This Session
Over the past 9 years we've consulted with nearly 100 Pentaho customers. This session highlights best practices to use and pitfalls to avoid so you can best exploit the power of Pentaho. Topics cover:
‒ Data Architecture
‒ Data Engineering
‒ BI Application Development
‒ Environment & Software Management
‒ Organizational Concerns (briefly…)

About Inquidia Consulting
We deliver full-spectrum data and analytics services: Strategy, Architecture, Engineering, Analytics and Science.
‒ Pentaho partner since 2006 (Version 1.x)
‒ 2-time winner of the Pentaho Partner Award
‒ 60+ successful Pentaho projects
‒ Pioneered the Pentaho training curriculum ‒ original trainer for public and on-site Pentaho training
‒ Member of the Pentaho Big Data launch team
‒ Frequent Pentaho Marketplace plug-in development

1. Database Performance Matters
Perhaps the #1 reason ETL and BI applications fail is poor database design, tuning and selection.
The Zen of database design:
‒ Intelligent use of indexes for both ETL and query processing
‒ Turn off referential integrity
‒ Use surrogate keys
‒ Streamline audit/timestamp column usage
You can "kill problems" with hardware:
‒ Switching to SSDs provides a significant lift
Consider multi-purpose vs. analytical databases.

2. Design ETL Pessimistically
ETL jobs will fail; it's a matter of how elegantly they die and how efficiently you can remediate and restart.
Expect errors and act on failure!
‒ Bad data, late data, environment issues
Use the Data Validator step to improve robustness and failure messaging.
‒ Data types, formats, nulls, valid values, etc.
Architect restartability into your jobs.
‒ Always think about delta processing
‒ "Upsert" vs. truncate/load patterns (see the sketch after step 4 below)

3. Always Be Tuning (Your ETL)
Data volumes change. Content skews. More jobs are added. And your data processing window is likely shrinking.
Set up diagnostics:
• Database logging
• Operations Mart
Tune the low-hanging fruit:
• Parallel job/transformation execution
• Multithread bottlenecking steps
• Choose the right lookup strategy
• Intelligent upsert logic vs. the Insert/Update step
• ETL vs. ELT: row vs. set processing
Make smart use of DB-specific features:
• Indexes can help and hurt performance
• Selective use of DB links & stored procedures*
*Beware the downsides of DB links & stored procedures.

4. Tier Your Data Architecture
A 1-hop architecture from sources to star schema is fragile.
‒ Modern architectures need to ingest more frequently than they process/analyze.
‒ Agile analytics often requires restructuring data for optimization.
‒ Analytic database technology often forces bulk loads and an ELT approach.
Using 2+ tiers allows you to modularize your architecture.
[Diagram: sources (source DBs, web services, files) → ETL → EDW / ODS / historical ODS → ETL → analytic DBs and data marts]
[Diagram: a big data variant of the same tiering ‒ INGEST (web servers, Flume, external parties, log processing), PROCESS (process orchestration of Pentaho MapReduce, Hive and Pig on a Hadoop cluster; MongoDB; aggregate processing over raw and aggregate data nodes; process logging), ANALYZE (analytic DB with configuration, dims and aggregates; OLAP; standard reports; real-time metrics dashboards; data services; access portal; info exchange with external repositories)]
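To make step 2's "upsert vs. truncate/load" point concrete, here is a minimal sketch of a restartable delta load, assuming a PostgreSQL 9.5+ target reached through psycopg2. The schema, table and column names and the connection string are hypothetical; in a real Pentaho job the same logic would typically live in a Table Output/Insert-Update step or a SQL job entry rather than a Python script.

```python
# Minimal sketch of a restartable "upsert" delta load (hypothetical schema).
# Assumes PostgreSQL 9.5+ (INSERT ... ON CONFLICT), psycopg2, and a unique
# constraint on dw.customer_dim.customer_key.
import psycopg2

UPSERT_SQL = """
INSERT INTO dw.customer_dim (customer_key, customer_name, city, updated_at)
SELECT customer_key, customer_name, city, updated_at
FROM staging.customer_delta
WHERE updated_at > %(last_load)s        -- delta processing: only new/changed rows
ON CONFLICT (customer_key)              -- upsert instead of truncate/load
DO UPDATE SET
    customer_name = EXCLUDED.customer_name,
    city          = EXCLUDED.city,
    updated_at    = EXCLUDED.updated_at;
"""

def load_customer_delta(conn, last_load_ts):
    """Apply one delta batch; safe to re-run after a failure."""
    with conn, conn.cursor() as cur:    # commit on success, roll back on error
        cur.execute(UPSERT_SQL, {"last_load": last_load_ts})
        return cur.rowcount             # rows inserted or updated

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=dw user=etl")  # hypothetical connection
    print(load_customer_delta(conn, "2015-01-01 00:00:00"))
```

Because the statement only touches rows changed since the last successful load, and updates rather than truncates, the job can simply be restarted after a failure without double-loading data.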
5. Use the Right Tool for the Job
Your information delivery mechanism matters.
Cubes/Analysis Reports ‒ best used for:
• Ad-hoc/exploratory analysis (pivot tables on 'roids)
• Aggregations with selective drilling
• Interactive crosstabs and charts
• Data analysts
Interactive Reports ‒ best used for:
• Ad-hoc, interactive reports (break-group formatting)
• Small partitions of detail-grain data
• Operational list reporting
PRPT Reports ‒ best used when:
• Pixels and rendering matter
• Output must be repeatable and standardized
• Compound subject areas (mashups) are needed
Dashboard Designer ‒ best used for:
• KPI reporting with common graphics
• Speed of development
C*Tools Apps ‒ best used when:
• Richness of UI is the priority
• Embedding custom controls

6. Intelligent Agg Design for Pentaho Analysis
Aggregates are the top priority for Pentaho Analysis performance.
Which combinations of dimensions should be aggregated?
‒ Ensure every (non-high-cardinality) dimension exists in at least one aggregate
‒ Capture and analyze Mondrian SQL logs to identify new candidates
‒ Apply the order-of-magnitude rule of thumb (an aggregate should be roughly an order of magnitude smaller than the table it summarizes)
‒ Don't throw away the "freebies"
Performance test with real volume and cardinality. (A hypothetical aggregate-table sketch follows step 10 below.)
Analytic databases still benefit from aggregates!
One more thing: don't forget that Mondrian caching matters too!

7. Configure the BA Server with Scalability in Mind
Your Pentaho environment is going to be a success; expect increasing usage.
Scaling memory
• Primary memory usage comes from Mondrian
• Less so from report generation
Scaling compute
• The primary driver of processor use is concurrent usage
• 100 : 10 : 1 rule of thumb (roughly 100 named users to 10 active users to 1 concurrent request)
Configure for failover and scalability
• Separate the web server from the app server
• Separate the repository DB server from the app server
• You can always configure this "later"… but why wait?

8. Assume Regular Software Upgrades
Upgrades are your friends; embrace them.
Establish a cadence in sync with Pentaho's release schedule
‒ Upgrade to each major release annually. It's mandatory! Plan (budget) for it.
‒ Upgrade to incremental point releases as your needs dictate.
Establish separate Dev, QA/Test and Prod environments
‒ It's OK to develop your initial solution in Prod, but too many clients stop there.
‒ You need separate environments!
Regression test before deploying
‒ Make a copy of Prod in QA and thoroughly regression test.

9. No Shortcuts to Team Productivity
Automate and standardize so your development team focuses on what matters.
Implement code management (e.g., git)
‒ Establish an integration and deployment process and follow it.
Issue management
‒ No more excuses for this with JIRA and Trello available.
Replicable, consistent developer environments
‒ shared.xml, kettle.properties, JNDI

10. Staff with the Right Skill Sets
Finding staff who like what they do will motivate innovation.
Software engineering <> data engineering
‒ Very few web/app developers successfully become ETL/data programmers.
‒ Don't force this!
ETL support must be provided by data engineers
‒ Pure operators/admins are usually unable to remediate unexpected data anomalies and bugs.
BI developers should be more concerned with communicating meaning than with visual design.
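Returning to step 6, below is a minimal, hypothetical sketch of one candidate aggregate for a sales star schema: it keeps the low-cardinality month and product-line dimensions and drops a high-cardinality customer dimension. Every schema, table and column name is invented for illustration, and whether this particular combination is worth materializing should be confirmed against your own Mondrian SQL logs and real data volumes.

```python
# Hypothetical aggregate build for step 6 (all object names are made up).
# Uses plain SQL via psycopg2; the same statement could run from a Pentaho SQL job entry.
import psycopg2

AGG_TABLE = "dw.agg_sales_month_productline"

BUILD_AGG_SQL = f"""
CREATE TABLE {AGG_TABLE} AS
SELECT
    d.month_key,
    p.product_line_key,
    SUM(f.sales_amount) AS sales_amount,
    SUM(f.quantity)     AS quantity,
    COUNT(*)            AS fact_count   -- row count column, conventional for Mondrian aggregate tables
FROM dw.sales_fact f
JOIN dw.date_dim    d ON d.date_key    = f.date_key
JOIN dw.product_dim p ON p.product_key = f.product_key
GROUP BY d.month_key, p.product_line_key;
"""

def rebuild_aggregate(conn):
    """Drop and rebuild the aggregate after each fact load (the simplest possible strategy)."""
    with conn, conn.cursor() as cur:
        cur.execute(f"DROP TABLE IF EXISTS {AGG_TABLE};")
        cur.execute(BUILD_AGG_SQL)

if __name__ == "__main__":
    rebuild_aggregate(psycopg2.connect("dbname=dw user=etl"))  # hypothetical connection
```

Keeping only the coarse keys makes the table far smaller than the fact table, which is where an aggregate usually starts paying for itself.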
11. Properly Fund the Ongoing Budget
Analytics solutions are not ERP implementations.
Analytics is organic; plan for evergreen resourcing
‒ It's not maintenance; it's much broader than that.
‒ Use consultants for a jumpstart and/or extra bandwidth, but…
‒ Make sure your own staff becomes intimately involved so the work can transition to them.
Deprecate, deprecate, deprecate
‒ BI components have a lifespan; continually prune the deadwood.
Leverage a BI Competency Center
‒ For user support/training and application extension/requirements

Summary
What we covered today: common pitfalls to avoid and best practices to use, both for Pentaho implementations and for analytics solutions generally.
With some diligence and, perhaps, some expert help, you can have a successful Pentaho analytics solution!

Next Steps
Want to learn more? Visit the Inquidia Consulting booth in the Expo Hall.
Feel free to contact me:
Bryan Senseman
[email protected]
m: 317-514-4525
We'll help you become data-driven.

Thank You
Join the conversation: #PWorld15
Follow us on Twitter: @inquidia