Intro to Data Factory
PASS Cloud Virtual Chapter
March 23, 2015
Steve Hughes, Architect
INTELLIGENT DATA SOLUTIONS WWW.PRAGMATICWORKS.COM

About the Presenter
Steve Hughes – Architect for Pragmatic Works
Blog: www.dataonwheels.com
Twitter: @dataonwheels
LinkedIn: linked.com/in/dataonwheels
Email: [email protected]

What is Data Factory?
• Cloud-based, highly scalable data movement and transformation tool
• Built on Azure for integrating all kinds of data
• Still in preview, so it is likely not yet feature complete (e.g. Machine Learning Activity added in December 2014)

Data Factory Components
• Linked Services
    • SQL Server Database – PaaS, IaaS, On Premise
    • Azure Storage – Blob, Table
• Datasets
    • Input/Output using JSON deployed with PowerShell
• Pipelines
    • Activities using JSON deployed with PowerShell
    • Copy, HDInsight, Azure Machine Learning

Current Activities Supported
• CopyActivity – copies data from a source to a sink (destination)
• HDInsightActivity – Pig, Hive, MapReduce transformations
• MLBatchScoringActivity – can be used to score data with the ML Batch Scoring API
• StoredProcedureActivity – executes stored procedures in an Azure SQL Database
• C# or .NET Custom Activity

Data for the Demo
• Movies.txt in Azure Blob Storage
• Movies table in Azure SQL Database

Building a Data Factory Pipeline
1. Create Data Factory
2. Create Linked Services
3. Create Input and Output Tables or Datasets
4. Create Pipeline
5. Set the Active Period for the Pipeline

Step 1: Create a Data Factory in Windows Azure

Step 2 – Create Linked Services
1. Click Linked Services tile
2. Add Data Stores
    1. Add Blob Storage
    2. Add SQL Database

Three Data Store Types Supported:
• Azure Storage Account
• Azure SQL Database
• SQL Server

Data Gateways can also be used for on premise SQL Server sources

Step 3 – Create Datasets/Tables
JSON File for Datasets
• Structure – Name, Type (String, Int, Decimal, Guid, Boolean, Date)
    { "name": "ThisName", "type": "String" }
• Location – Azure Table, Azure Blob, SQL Database
• Availability – “cadence in which a slice of the table is produced”

Step 3 – Input JSON
• Name is the dataset name
• Structure defines the structure of the data in the file
• Location defines the location and file format information
• Availability sets the cadence to once every 4 hours

    {
      "name": "MoviesFromBlob",
      "properties": {
        "structure": [
          { "name": "MovieTitle", "type": "String" },
          { "name": "Studio", "type": "String" },
          { "name": "YearReleased", "type": "Int" }
        ],
        "location": {
          "type": "AzureBlobLocation",
          "folderPath": "data-factory-files/Movies.csv",
          "format": {
            "type": "TextFormat",
            "columnDelimiter": ","
          },
          "linkedServiceName": "Shughes Blob Storage"
        },
        "availability": {
          "frequency": "hour",
          "interval": 4
        }
      }
    }

Step 3 – Output JSON
• Name is the dataset name
• Structure defines the table structure; only the targeted fields are mapped
• Location defines the location and the table name
• Availability sets the cadence to once every 4 hours

    {
      "name": "MoviesToSqlDb",
      "properties": {
        "structure": [
          { "name": "MovieName", "type": "String" },
          { "name": "Studio", "type": "String" },
          { "name": "YearReleased", "type": "Int" }
        ],
        "location": {
          "type": "AzureSQLTableLocation",
          "tableName": "Movies",
          "linkedServiceName": "Media Library DB"
        },
        "availability": {
          "frequency": "hour",
          "interval": 4
        }
      }
    }

Step 3 – Deploy Datasets
Deployment is done via PowerShell

    PS C:\> New-AzureDataFactoryTable -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesFromBlob.json
    PS C:\> New-AzureDataFactoryTable -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesToSqlDb.json

Step 4 – Pipeline JSON
• Name is the pipeline name
• Each activity has its own name
• Activity definition – type (CopyActivity), Input, Output
• CopyActivity transformation – source and sink
• Policy required for SqlSink – concurrency must be set or deployment fails

    {
      "name": "MoviesPipeline",
      "properties": {
        "description": "Copy data from csv file in Azure storage to Azure SQL database table",
        "activities": [
          {
            "name": "CopyMoviesFromBlobToSqlDb",
            "description": "Add new movies to the Media Library",
            "type": "CopyActivity",
            "inputs": [ { "name": "MoviesFromBlob" } ],
            "outputs": [ { "name": "MoviesToSqlDb" } ],
            "transformation": {
              "source": { "type": "BlobSource" },
              "sink": { "type": "SqlSink" }
            },
            "Policy": {
              "concurrency": 1,
              "executionPriorityOrder": "NewestFirst",
              "style": "StartOfInterval",
              "retry": 0,
              "timeout": "01:00:00"
            }
          }
        ]
      }
    }

Step 4 – Deploy Pipeline

    New-AzureDataFactoryPipeline -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesPipeline.json

Step 4 – Deployed Pipeline

Step 4 – Pipeline Diagram

Step 5 – Set Active Period

    Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -StartDateTime 2015-01-12 -EndDateTime 2015-01-14 -Name MoviesPipeline

This sets the period during which data slices are available to be processed. The frequency is set in the availability section of the dataset definitions.
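To make the relationship between the active period and the dataset availability concrete, here is a minimal Python sketch that lists the slice windows for the demo values (active period 2015-01-12 to 2015-01-14, one slice every 4 hours). The function name data_slices is hypothetical, and the sketch assumes slices are aligned to the start of the active period with an exclusive end; the scheduling inside the Data Factory service may differ.

```python
from datetime import datetime, timedelta


def data_slices(start, end, interval_hours):
    """Return (slice_start, slice_end) windows covering [start, end).

    Hypothetical helper for illustration only; it is not part of any
    Azure SDK. Assumes slices are aligned to the start of the period.
    """
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    step = timedelta(hours=interval_hours)
    slices = []
    current = start
    # Emit only whole slices that fit inside the active period (assumption).
    while current + step <= end:
        slices.append((current, current + step))
        current += step
    return slices


# Values from the slides: active period 2015-01-12 to 2015-01-14,
# availability of one slice every 4 hours.
slices = data_slices(datetime(2015, 1, 12), datetime(2015, 1, 14), 4)
print(len(slices))  # 12 (48 hours / 4 hours per slice)
for slice_start, slice_end in slices[:3]:
    print(slice_start.isoformat(), "->", slice_end.isoformat())
```

Under these assumptions the two-day active period yields 12 slices, so the copy activity would be expected to run once per 4-hour window.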

Exploring Blades in Azure Portal
• Start with the Diagram
• Drill to various details in the pipeline
• Latest update – full online design capability

Looking at Monitoring
• Review monitoring information in Azure portal

Common Use Cases
• Log Import for Analysis

Resources
• Azure Storage Explorer – Codeplex.com
• Azure.Microsoft.com – Data Factory
• Azure.Microsoft.com – Azure PowerShell

Questions?
Contact me at [email protected] [email protected]
Blog: www.dataonwheels.com
Pragmatic Works: www.pragmaticworks.com

Products – Improve the quality, productivity, and performance of your SQL Server and BI solutions.
Services – Speed development through training and rapid development services from Pragmatic Works.
Foundation – Helping those who don’t have the means to get into information technology and to achieve their dreams.