An “Agile Framework” for Data Warehouse Development

Goals
• To describe, in general, the essential aspects of a rapid (agile) data warehouse development environment
• We will specifically (and briefly) talk about:
  • Microsoft Visual Studio (SSDT) Database Projects
  • Microsoft SQL Server 2012 Integration Services (SSIS)
  • SVN and AnkhSVN for source code control
  • tSQLt for unit testing
  • Jenkins for continuous integration
  • Agile for project methodology
• Leave you with some references for more information

FYI, this was not my idea
• This was the system brought in by the Lead Architect at AHN, and was based on frameworks he used in the past
• Aaron Pugliese
• BI Data Architect at TeleTracking
• [email protected]

Who Am I?
• Steve Tirone
• [email protected] | @sttpgh | http://stephentirone.com

Quickly: What is Business Intelligence?
I like this definition:
“… the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes.”
-- from Wikipedia, http://en.wikipedia.org/wiki/Business_intelligence

How Do We Make BI Happen?
• We collect RAW DATA from various systems and STAGE that data in a staging area
• Then we TRANSFORM that data into a form more suitable for business analysis (the DATA WAREHOUSE)
• Typically this is a star schema design, and typically we further transform it into some type of OLAP database for efficient reporting
• We are going to stop at the data warehouse for this presentation. Stay tuned, though: I hope to get into self-service BI using the tabular cube model and the DAX query language at the next meetup

First, we need a database
• We used a Visual Studio 2012 (SSDT) Database project to create a Staging database which held all our staging tables
• Database projects are a complete representation of your database as SQL scripts
• Publishing a database using SSDT incorporates all the work you used to do manually to update a database, a great time saver
• In addition, you can create pre- and post-deployment scripts to customize the deployment of your database (see the sketch below)
• Publish Profiles allow you to publish to Dev, Test, and Prod, and can also be customized
• http://sqlblog.com/blogs/jamie_thomson/archive/2012/05/09/publish-profile-files-in-sql-server-data-tools-ssdt.aspx
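As a concrete illustration of that last point, here is a minimal sketch of what a post-deployment script could look like. It is not the project's actual script: the file name, table, and column names are assumptions, chosen to match the “unknown member” (-1) pattern used in the warehouse loads later in this deck.

  /*
   Script.PostDeployment.sql (hypothetical file name)
   SSDT appends a post-deployment script to the end of the generated
   deployment, so it runs after every publish of the database project.
  */

  -- Seed the "unknown member" row that the fact loads fall back to
  -- when a dimension key lookup fails (illustrative names only).
  IF NOT EXISTS (SELECT 1 FROM dbo.dimCustomer WHERE CustomerKey = -1)
  BEGIN
      INSERT INTO dbo.dimCustomer (CustomerKey, CustomerID, CustomerName)
      VALUES (-1, -1, 'UNKNOWN');
  END;

A publish profile per environment then supplies the target server, database, and any SQLCMD variable values, so the same project and script can be published unchanged to Dev, Test, and Prod.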
SSDT Magic!

Use INTEGRATED Source Code Control!
• Visual Studio is often used with TFS (Team Foundation Server)
• But, we didn’t have that
• SVN was used for version control of the projects and solutions
• The AnkhSVN plugin is available to interface with SVN from within Visual Studio
• Best practice: use source code control integrated into your IDE, so developers don’t need to make any extra effort to save the history of their changes

Let’s get some data into those Staging tables
• A Visual Studio 2012 Integration Services project was used to hold the SSIS packages which pulled data from the sources to the Staging database
• Here SQL Server 2012 provided some new benefits:
  • Project-level, parameterizable connection managers
  • Environment scripts were used for Dev, Test, and Prod
  • Project Deployment (vs. Package Deployment): all or none
• SSIS packages were TEMPLATES, automatically incorporating important features (such as logging) and allowing the developer to focus on the specific task at hand

A Staging Package

A Fact Table Package

Step 1: Set up a table variable to capture the MERGE output (used later for row counts)

  DECLARE @Operations TABLE
  (
      Operation    CHAR(1),
      OrderDateKey INT,
      OrderDate    DATETIME,
      ShipDateKey  INT,
      ShipDate     DATETIME,
      CustomerKey  INT,
      CustomerID   INT,
      RecordID     INT
  );

Step 2: Capture the staging row count

  -----------------------------------------
  -- Get row count of main staging table
  -- (Include WHERE clause if necessary)
  -----------------------------------------
  SELECT @SourceRowCount = COUNT(*)
  FROM ETLFrameworkSampleStaging.dbo.stgSalesOrderDetail;

Step 3: MERGE to the fact table

  MERGE dbo.factSales AS Target
  USING (SELECT ISNULL(da.DateKey, -1)  AS OrderDateKey,
                ISNULL(da2.DateKey, -1) AS ShipDateKey,
                ...
  OUTPUT LEFT($action, 1),
         INSERTED.OrderDateKey, Source.OrderDate,
         INSERTED.ShipDateKey,  Source.ShipDate,
         INSERTED.ProductKey,   Source.ProductID,
         INSERTED.TerritoryKey, Source.TerritoryID,
         INSERTED.CustomerKey,  Source.CustomerID,
         INSERTED.RecordId
  INTO @Operations;

Step 4: Log dimension key lookup errors

  -----------------------------------------
  -- Log Dimension Key Lookup Errors
  -----------------------------------------
  INSERT INTO neuETLFramework.Audit.PackageErrorLog
      (PackageExecutionLogKey, RecordID, BusinessKey, SurrogateKey)
  SELECT @PackageExecutionLogKey AS PackageExecutionLogKey,
         RecordID,
         OrderDate,
         OrderDateKey
         . . .
  FROM @Operations AS t1
  WHERE OrderDateKey = -1
     OR ShipDateKey  = -1
     OR ProductKey   = -1;

  SET @LookupErrorRowCount = @@ROWCOUNT;

Step 5: Return row counts

  -------------------------------------------
  -- Return Row Count Information to Package
  -------------------------------------------
  SELECT COUNT(CASE Operation WHEN 'U' THEN 1 END) AS UpdateRC,
         COUNT(CASE Operation WHEN 'I' THEN 1 END) AS InsertRC,
         @SourceRowCount      AS SrcRC,
         @LookupErrorRowCount AS LookupRC
  FROM @Operations;

Where does that information go?

So here’s our VS solution

But wait, there’s more!
• Billy Mays, famous pitchman
• Pittsburgh native
• Was a student at Sto-Rox HS and also at WVU, where he was a walk-on linebacker on the football team
• (all from Wikipedia)

Unit Testing
• How do we know our tables are accurate and correct?
• Use tSQLt to run unit tests on them
• tSQLt is a unit testing framework for Microsoft SQL Server that allows you to write and execute unit tests in T-SQL
• We used AssertObjectExists to make sure we had all our tables, and AssertResultSetsHaveSameMetaData to make sure each table’s structure was what we expected it to be (see the sketch below)
• This was just the beginning; much more unit testing can be done
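To make that concrete, here is a minimal sketch of what such tests could look like. It is not taken from the project's test database: the test class name testDWTables matches the deck's own example test further on, but the table names (dbo.factSales, Reference.factSales), the idea of keeping a structural reference copy, and the omitted database-name prefixes are assumptions for illustration.

  -- Create a schema (a tSQLt "test class") to hold the tests.
  EXEC tSQLt.NewTestClass 'testDWTables';
  GO

  -- Assert that an expected warehouse table was actually deployed.
  CREATE PROCEDURE testDWTables.[test factSales exists]
  AS
  BEGIN
      EXEC tSQLt.AssertObjectExists 'dbo.factSales';
  END;
  GO

  -- Assert that the deployed table's structure matches a reference copy
  -- (hypothetical Reference.factSales table holding the expected shape).
  CREATE PROCEDURE testDWTables.[test factSales has expected structure]
  AS
  BEGIN
      EXEC tSQLt.AssertResultSetsHaveSameMetaData
           'SELECT TOP 0 * FROM Reference.factSales',
           'SELECT TOP 0 * FROM dbo.factSales';
  END;
  GO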
How did we do this?
• Created a unit testing database
• Put all our tSQLt unit tests in there as stored procedures
• Called tSQLt.RunAll from a post-deployment script, so the tests run upon every deployment of the database
• The developer generally deployed the database in order to run the tests, but we will see in a moment how we automated this process

An example tSQLt test

  ALTER PROCEDURE testDWTables.[test dimBeneficiary has unknown member]
  AS
  BEGIN
      CREATE TABLE #expected (BeneficiaryKey int, DESYNPUF_ID varchar(16), . . .

      INSERT INTO #expected (BeneficiaryKey, DESYNPUF_ID, . . .
      VALUES (-1, '-1', -1, -1, -1, -1, 'UNKNOWN', '');

      SELECT BeneficiaryKey, DESYNPUF_ID, DateOfBirthKey, DateOfDeathKey,
             GenderKey, RaceKey, FakeName, FakeZIP5
      INTO #actual
      FROM DESynPUF_DataWarehouse.dbo.dimBeneficiary d
      WHERE d.BeneficiaryKey = -1;

      EXEC tSQLt.AssertEqualsTable '#expected', '#actual';
  END;

Running the Test

Here is our VS solution again

How did we automate unit testing?
• By using Jenkins
• Jenkins is a tool for managing builds
• What’s really cool is this allows for continuous integration
• Any time there is a check-in to SVN, Jenkins does a full build of the solution, including running the unit tests and deploying to the integration SQL Server
• Broken builds are detected immediately, and hopefully fixed immediately
• Martin Fowler: Continuous Delivery
• http://youtu.be/aoMfbgF2D_4 (17 minutes)

Jenkins Example

In Summary
Some best practices for data warehouse development:
• Templates implement a framework for rapid development and consistent structure, and keep every package conforming to best practices
• Unit testing can be easily integrated into the process, and will serve to find problems early and help promote consistency
• Continuous integration is important so that bugs are caught and fixed early, and the data warehouse code base is always in a state ready to deploy to production

But, There’s One More Thing

Agile Project Management Methodology
• We’ve described the platform for rapid development of a data warehouse, but not what to put into it, when, or why
• Enter Agile
• Agile (to me) is a methodology that emphasizes collaboration and iterative development: you get your product out to your customers quickly for their feedback, and you use that feedback to rapidly improve the product
• You need engaged stakeholders!
• All the development productivity in the world isn’t going to help if you don’t have engaged stakeholders clearly communicating with your development team

Back to Our Group Charter
• “to meet, network, and learn about strategies, best practices, and tools”

References
• Agile Data Warehousing
  • http://www.amazon.com/Agile-Data-Warehousing-World-Class-Intelligence/dp/0595471676
• tSQLt
  • http://tSQLt.org
• SVN and AnkhSVN
  • SVN -- https://subversion.apache.org/
  • AnkhSVN -- https://ankhsvn.open.collab.net/
• Jenkins
  • https://jenkins-ci.org/
• SQL Server Business Intelligence Technology Stack
  • http://www.microsoft.com/en-us/server-cloud/products/sql-server/

Thank You
• Slides will be somewhere on the meetup site