
Intro to Data Factory
PASS Cloud Virtual Chapter
March 23, 2015
Steve Hughes, Architect
INTELLIGENT DATA SOLUTIONS
WWW.PRAGMATICWORKS.COM
About the Presenter
Steve Hughes – Architect for Pragmatic Works
Blog: www.dataonwheels.com
Twitter: @dataonwheels
LinkedIn: linkedin.com/in/dataonwheels
Email: [email protected]
What is Data Factory?
Cloud-based, highly scalable data movement and transformation tool
Built on Azure for integrating all kinds of data
Still in preview, so it is likely not yet feature-complete (e.g., the Machine Learning Activity was added in December 2014)
Data Factory Components
Linked Services
SQL Server Database – PaaS, IaaS, on-premises
Azure Storage – Blob, Table
Datasets
Input/Output defined in JSON, deployed with PowerShell
Pipelines
Activities defined in JSON, deployed with PowerShell
Copy, HDInsight, Azure Machine Learning
Current Activities Supported
CopyActivity – copies data from a source to a sink (destination)
HDInsightActivity – Pig, Hive, MapReduce transformations
MLBatchScoringActivity – scores data with the ML Batch Scoring API
StoredProcedureActivity – executes stored procedures in an Azure SQL Database
Custom activities written in C#/.NET
Data for the Demo
Movies.txt in Azure Blob Storage
Movies table in Azure SQL Database
Building a Data Factory Pipeline
1. Create Data Factory
2. Create Linked Services
3. Create Input and Output Tables or Datasets
4. Create Pipeline
5. Set the Active Period for the Pipeline
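As a sketch, the five steps above map onto Azure PowerShell cmdlets. The deck itself demonstrates New-AzureDataFactoryTable, New-AzureDataFactoryPipeline, and Set-AzureDataFactoryPipelineActivePeriod; New-AzureDataFactory and New-AzureDataFactoryLinkedService are assumptions based on the preview-era module, and the resource names and file paths here are placeholders.

```powershell
# Sketch only – preview-era cmdlet names; names/paths are placeholders
New-AzureDataFactory -ResourceGroupName my-rg -Name my-datafactory -Location "West US"                               # 1. Create Data Factory
New-AzureDataFactoryLinkedService -ResourceGroupName my-rg -DataFactoryName my-datafactory -File .\BlobStore.json    # 2. Create Linked Services
New-AzureDataFactoryTable -ResourceGroupName my-rg -DataFactoryName my-datafactory -File .\InputDataset.json         # 3. Create Datasets
New-AzureDataFactoryPipeline -ResourceGroupName my-rg -DataFactoryName my-datafactory -File .\Pipeline.json          # 4. Create Pipeline
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName my-rg -DataFactoryName my-datafactory `
    -Name MyPipeline -StartDateTime 2015-01-12 -EndDateTime 2015-01-14                                               # 5. Set Active Period
```

Each definition step takes a JSON file, which is the pattern the following slides walk through in detail.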
Step 1 – Create a Data Factory in Windows Azure
Step 2 – Create Linked Services
1. Click the Linked Services tile
2. Add Data Stores
   1. Add Blob Storage
   2. Add SQL Database
Three data store types supported:
• Azure Storage Account
• Azure SQL Database
• SQL Server
Data Gateways can also be used for on-premises SQL Server sources
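Behind the portal tiles, a linked service is itself a small JSON document. The shape below is an assumption based on the preview-era schema (type and property names may have changed since), and the connection string is a placeholder:

```json
{
  "name": "Shughes Blob Storage",
  "properties":
  {
    "type": "AzureStorageLinkedService",
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
  }
}
```

The "name" given here is what datasets reference via "linkedServiceName" in the steps that follow.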
Step 3 – Create Datasets/Tables
JSON file for datasets:
• Structure – name, type (String, Int, Decimal, Guid, Boolean, Date)
  { "name": "ThisName", "type": "String" }
• Location – Azure Table, Azure Blob, SQL Database
• Availability – "cadence in which a slice of the table is produced"
Step 3 – Input JSON
{
  "name": "MoviesFromBlob",
  "properties":
  {
    "structure":
    [
      { "name": "MovieTitle", "type": "String" },
      { "name": "Studio", "type": "String" },
      { "name": "YearReleased", "type": "Int" }
    ],
    "location":
    {
      "type": "AzureBlobLocation",
      "folderPath": "data-factory-files/Movies.csv",
      "format":
      {
        "type": "TextFormat",
        "columnDelimiter": ","
      },
      "linkedServiceName": "Shughes Blob Storage"
    },
    "availability":
    {
      "frequency": "hour",
      "interval": 4
    }
  }
}
Name is the dataset name. Structure defines the structure of the data in the file. Location defines the location and file format information. Availability sets the cadence to once every 4 hours.
Step 3 – Output JSON
{
  "name": "MoviesToSqlDb",
  "properties":
  {
    "structure":
    [
      { "name": "MovieName", "type": "String" },
      { "name": "Studio", "type": "String" },
      { "name": "YearReleased", "type": "Int" }
    ],
    "location":
    {
      "type": "AzureSQLTableLocation",
      "tableName": "Movies",
      "linkedServiceName": "Media Library DB"
    },
    "availability":
    {
      "frequency": "hour",
      "interval": 4
    }
  }
}
Name is the dataset name. Structure defines the table structure; only the fields targeted are mapped. Location defines the location and the table name. Availability sets the cadence to once every 4 hours.
Step 3 – Deploy Datasets
Deployment is done via PowerShell:
PS C:\> New-AzureDataFactoryTable -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesFromBlob.json
PS C:\> New-AzureDataFactoryTable -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesToSqlDb.json
Step 4 – Pipeline JSON
{
  "name": "MoviesPipeline",
  "properties":
  {
    "description": "Copy data from csv file in Azure storage to Azure SQL database table",
    "activities":
    [
      {
        "name": "CopyMoviesFromBlobToSqlDb",
        "description": "Add new movies to the Media Library",
        "type": "CopyActivity",
        "inputs": [ { "name": "MoviesFromBlob" } ],
        "outputs": [ { "name": "MoviesToSqlDb" } ],
        "transformation":
        {
          "source":
          {
            "type": "BlobSource"
          },
          "sink":
          {
            "type": "SqlSink"
          }
        },
        "policy":
        {
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "style": "StartOfInterval",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
    ]
  }
}
Name is the pipeline name; the activity definition specifies its type (CopyActivity), inputs, and outputs. The CopyActivity transformation defines a source and a sink. A policy is required for SqlSink – concurrency must be set or the deployment fails.
Step 4 – Deploy Pipeline
New-AzureDataFactoryPipeline -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesPipeline.json
Step 4 – Deployed Pipeline
Step 4 – Pipeline Diagram
Step 5 – Set Active Period
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -StartDateTime 2015-01-12 -EndDateTime 2015-01-14 -Name MoviesPipeline
This sets the duration during which data slices will be available to be processed.
The frequency is set in the dataset parameters.
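To illustrate how the active period and the dataset cadence interact (my arithmetic, not from the deck): the two-day window above, at the 4-hour interval defined in the dataset JSON, produces 12 slices. This can be checked locally in PowerShell without touching Azure:

```powershell
# Illustration only: count the 4-hour slices in the active period above
$start  = Get-Date '2015-01-12'
$end    = Get-Date '2015-01-14'
$hours  = ($end - $start).TotalHours   # 48 hours
$slices = $hours / 4                   # 12 slices, one per 4-hour interval
$slices
```

Each slice is an independently schedulable (and re-runnable) unit of work for the pipeline.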
Exploring Blades in the Azure Portal
Start with the diagram
Drill to various details in the pipeline
Latest update: full online design capability
Looking at Monitoring
Review monitoring information in the Azure portal
Common Use Cases
Log Import for Analysis
Resources
Azure Storage Explorer – Codeplex.com
Azure.Microsoft.com – Data Factory
Azure.Microsoft.com – Azure PowerShell
Questions?
Contact me at
[email protected]
[email protected]
Blog: www.dataonwheels.com
Pragmatic Works:
www.pragmaticworks.com
Products – Improve the quality, productivity, and performance of your SQL Server and BI solutions.
Services – Speed development through training and rapid development services from Pragmatic Works.
Foundation – Helping those who don’t have the means get into information technology and achieve their dreams.