Introduction to Azure Data Factory (5/3/2017)

What is Azure Data Factory?
In the world of big data, how is existing data leveraged in business? Is it possible to enrich data generated in the cloud by using reference data from on-premises data sources or other disparate data sources? For example, a gaming company collects many logs produced by games in the cloud.
It wants to analyze these logs to gain insights into customer preferences, demographics, and usage behavior, so that it can identify up-sell and cross-sell opportunities, develop compelling new features to drive business growth, and provide a better experience to customers. To analyze these logs, the company needs to use reference data, such as customer information, game information, and marketing campaign information, that is in an on-premises data store. Therefore, the company wants to ingest log data from the cloud data store and reference data from the on-premises data store, process the data by using Hadoop in the cloud (Azure HDInsight), and publish the result data to a cloud data warehouse such as Azure SQL Data Warehouse or to an on-premises data store such as SQL Server. It wants this workflow to run once a week.

What is needed is a platform that allows the company to create a workflow that can ingest data from both on-premises and cloud data stores, transform or process that data by using existing compute services such as Hadoop, and publish the results to an on-premises or cloud data store for BI applications to consume. Azure Data Factory is the platform for this kind of scenario. It is a cloud-based data integration service that allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores; process or transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning; and publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume. It is more of an Extract-and-Load (EL) and then Transform-and-Load (TL) platform than a traditional Extract-Transform-and-Load (ETL) platform.
The transformations performed process data by using compute services, rather than performing transformations such as adding derived columns, counting rows, or sorting data. Currently, in Azure Data Factory, the data that is consumed and produced by workflows is time-sliced (hourly, daily, weekly, and so on). For example, a pipeline might read input data, process it, and produce output data once a day. You can also run a workflow just one time.

How does it work?
The pipelines (data-driven workflows) in Azure Data Factory typically perform the following three steps:

Connect and collect
Enterprises have data of various types located in disparate sources. The first step in building an information production system is to connect to all the required sources of data and processing, such as SaaS services, file shares, FTP, and web services, and to move the data as needed to a centralized location for subsequent processing. Without Data Factory, enterprises must build custom data movement components or write custom services to integrate these data sources and processing. Such systems are expensive and hard to integrate and maintain, and they often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer. With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can collect data in Azure Data Lake Store and transform the data later by using an Azure Data Lake Analytics compute service. Or, you can collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
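As a rough sketch of what such a copy step looks like inside a pipeline definition (shown as a Python dict so the example is self-contained; the dataset names and the Blob-to-SQL pairing are illustrative assumptions, not required values):

```python
# Illustrative sketch only: a copy step pairs one source data store with one
# sink data store, referenced through named datasets. All names are made up.
copy_step = {
    "name": "CopyFromBlobToSQL",
    "type": "Copy",
    "inputs": [{"name": "InputDataset"}],    # where the data comes from
    "outputs": [{"name": "OutputDataset"}],  # where the data lands
    "typeProperties": {
        "source": {"type": "BlobSource"},
        "sink": {"type": "SqlSink"},
    },
}

# A copy step always names exactly one source type and one sink type.
assert copy_step["typeProperties"]["source"]["type"].endswith("Source")
assert copy_step["typeProperties"]["sink"]["type"].endswith("Sink")
```

The same shape reappears in the full pipeline samples later in the service's JSON format.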
Transform and enrich
Once data is present in a centralized data store in the cloud, you want the collected data to be processed or transformed by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. You want to reliably produce transformed data on a maintainable and controlled schedule to feed production environments with trusted data.

Publish
Deliver transformed data from the cloud to on-premises sources like SQL Server, or keep it in your cloud storage sources for consumption by business intelligence (BI) and analytics tools and other applications.

Key components
An Azure subscription may have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of four key components that work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.

Pipeline
A data factory may have one or more pipelines. A pipeline is a group of activities. Together, the activities in a pipeline perform a task. For example, a pipeline could contain a group of activities that ingests data from an Azure blob and then runs a Hive query on an HDInsight cluster to partition the data. The benefit is that the pipeline allows you to manage the activities as a set instead of managing each one individually. For example, you can deploy and schedule the pipeline, instead of the activities independently.

Activity
A pipeline may have one or more activities. Activities define the actions to perform on your data. For example, you may use a Copy activity to copy data from one data store to another data store. Similarly, you may use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports two types of activities: data movement activities and data transformation activities.

Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store.
Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store.

| Category | Data store | Supported as a source | Supported as a sink |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| Azure | Azure Cosmos DB (DocumentDB API) | ✓ | ✓ |
| Azure | Azure Data Lake Store | ✓ | ✓ |
| Azure | Azure SQL Database | ✓ | ✓ |
| Azure | Azure SQL Data Warehouse | ✓ | ✓ |
| Azure | Azure Search Index | | ✓ |
| Azure | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| Databases | DB2* | ✓ | |
| Databases | MySQL* | ✓ | |
| Databases | Oracle* | ✓ | ✓ |
| Databases | PostgreSQL* | ✓ | |
| Databases | SAP Business Warehouse* | ✓ | |
| Databases | SAP HANA* | ✓ | |
| Databases | SQL Server* | ✓ | ✓ |
| Databases | Sybase* | ✓ | |
| Databases | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| NoSQL | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| File | File System* | ✓ | ✓ |
| File | FTP | ✓ | |
| File | HDFS* | ✓ | |
| File | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| Others | Generic OData | ✓ | |
| Others | Generic ODBC* | ✓ | |
| Others | Salesforce | ✓ | |
| Others | Web Table (table from HTML) | ✓ | |
| Others | GE Historian* | ✓ | |

For more information, see the Data Movement Activities article.

Data transformation activities
Azure Data Factory supports the following transformation activities, which can be added to pipelines either individually or chained with another activity.

| Data transformation activity | Compute environment |
| --- | --- |
| Hive | HDInsight [Hadoop] |
| Pig | HDInsight [Hadoop] |
| MapReduce | HDInsight [Hadoop] |
| Hadoop Streaming | HDInsight [Hadoop] |
| Spark | HDInsight [Hadoop] |
| Machine Learning activities: Batch Execution and Update Resource | Azure VM |
| Stored Procedure | Azure SQL, Azure SQL Data Warehouse, or SQL Server |
| Data Lake Analytics U-SQL | Azure Data Lake Analytics |
| DotNet | HDInsight [Hadoop] or Azure Batch |

NOTE: You can use the MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark programs from Azure Data Factory for details. You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script using Azure Data Factory.

For more information, see the Data Transformation Activities article.
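One way to read the data-store matrix above is as a capability lookup. The sketch below encodes a handful of representative rows (a subset only; the `(source, sink)` tuple encoding is this example's choice, and the linked connector articles remain the authoritative reference):

```python
# A few rows of the copy-support matrix above as (source?, sink?) flags.
# A trailing "*" marks stores that may be on-premises and need the gateway.
capabilities = {
    "Azure Blob storage": (True, True),
    "Azure SQL Database": (True, True),
    "Azure Search Index": (False, True),  # sink only
    "Amazon Redshift": (True, False),     # source only
    "SQL Server*": (True, True),
    "FTP": (True, False),
}

def can_copy(source_store: str, sink_store: str) -> bool:
    """True when source_store can act as a source and sink_store as a sink."""
    return capabilities[source_store][0] and capabilities[sink_store][1]

assert can_copy("Amazon Redshift", "Azure Blob storage")
assert not can_copy("Azure Search Index", "Azure SQL Database")  # not a source
```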
Custom .NET activities
If you need to move data to or from a data store that Copy Activity doesn't support, or to transform data using your own logic, create a custom .NET activity. For details on creating and using a custom activity, see Use custom activities in an Azure Data Factory pipeline.

Datasets
An activity takes zero or more datasets as inputs and one or more datasets as outputs. Datasets represent data structures within the data stores; they simply point to or reference the data you want to use in your activities as inputs or outputs. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the pipeline should read the data. Or, an Azure SQL Table dataset specifies the table to which the output data is written by the activity.

Linked services
Linked services are much like connection strings: they define the connection information that Data Factory needs to connect to external resources. Think of it this way: a linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies the connection string to connect to the Azure Storage account, and an Azure Blob dataset specifies the blob container and the folder that contains the data. Linked services are used for two purposes in Data Factory:

To represent a data store, including, but not limited to, an on-premises SQL Server database, an Oracle database, a file share, or an Azure Blob storage account. See the Data movement activities section for a list of supported data stores.

To represent a compute resource that can host the execution of an activity. For example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. See the Data transformation activities section for a list of supported compute environments.

Relationship between Data Factory entities
Figure 2.
Relationships between Dataset, Activity, Pipeline, and Linked service

Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data factory can access data stores and compute services in other Azure regions to move data between data stores or to process data by using compute services. Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the movement of data between supported data stores and the processing of data by using compute services in other regions or in an on-premises environment. It also lets you monitor and manage workflows by using both programmatic and UI mechanisms. Even though Data Factory is available only in the West US, East US, and North Europe regions, the service powering the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall, a Data Management Gateway installed in your on-premises environment moves the data instead. For example, assume that your compute environments, such as an Azure HDInsight cluster and Azure Machine Learning, are running in the West Europe region. You can create and use an Azure Data Factory instance in North Europe and use it to schedule jobs on your compute environments in West Europe. It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on your compute environment does not change.

Get started with creating a pipeline
You can use one of these tools or APIs to create data pipelines in Azure Data Factory: the Azure portal, Visual Studio, PowerShell, the .NET API, the REST API, or an Azure Resource Manager template. To learn how to build data factories with data pipelines, follow the step-by-step instructions in the following tutorials:

Move data between two cloud data stores: In this tutorial, you create a data factory with a pipeline that moves data from Blob storage to a SQL database.
Transform data using Hadoop cluster: In this tutorial, you build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster.

Move data between an on-premises data store and a cloud data store using Data Management Gateway: In this tutorial, you build a data factory with a pipeline that moves data from an on-premises SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine.

Pipelines and Activities in Azure Data Factory (6/13/2017)

This article helps you understand pipelines and activities in Azure Data Factory, and use them to construct end-to-end data-driven workflows for your data movement and data processing scenarios.

NOTE: This article assumes that you have gone through Introduction to Azure Data Factory. If you do not have hands-on experience with creating data factories, going through the data transformation tutorial and/or the data movement tutorial will help you understand this article better.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy activity to copy data from an on-premises SQL Server database to Azure Blob storage. Then, you might use a Hive activity that runs a Hive script on an Azure HDInsight cluster to process the data from the blob storage and produce output data. Finally, you might use a second copy activity to copy the output data to an Azure SQL Data Warehouse, on top of which business intelligence (BI) reporting solutions are built. An activity can take zero or more input datasets and produce one or more output datasets. The following diagram shows the relationship between pipeline, activity, and dataset in Data Factory:

A pipeline allows you to manage activities as a set instead of each one individually.
For example, you can deploy, schedule, suspend, and resume a pipeline, instead of dealing with the activities in the pipeline independently. Data Factory supports two types of activities: data movement activities and data transformation activities. Each activity can have zero or more input datasets and produce one or more output datasets. An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output of the activity. Datasets identify data within different data stores, such as tables, files, folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For example, a dataset can be an input/output dataset of a Copy activity or an HDInsightHive activity. For more information about datasets, see the Datasets in Azure Data Factory article.

Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store.

| Category | Data store | Supported as a source | Supported as a sink |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| Azure | Azure Cosmos DB (DocumentDB API) | ✓ | ✓ |
| Azure | Azure Data Lake Store | ✓ | ✓ |
| Azure | Azure SQL Database | ✓ | ✓ |
| Azure | Azure SQL Data Warehouse | ✓ | ✓ |
| Azure | Azure Search Index | | ✓ |
| Azure | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| Databases | DB2* | ✓ | |
| Databases | MySQL* | ✓ | |
| Databases | Oracle* | ✓ | ✓ |
| Databases | PostgreSQL* | ✓ | |
| Databases | SAP Business Warehouse* | ✓ | |
| Databases | SAP HANA* | ✓ | |
| Databases | SQL Server* | ✓ | ✓ |
| Databases | Sybase* | ✓ | |
| Databases | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| NoSQL | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| File | File System* | ✓ | ✓ |
| File | FTP | ✓ | |
| File | HDFS* | ✓ | |
| File | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| Others | Generic OData | ✓ | |
| Others | Generic ODBC* | ✓ | |
| Others | Salesforce | ✓ | |
| Others | Web Table (table from HTML) | ✓ | |
| Others | GE Historian* | ✓ | |

NOTE: Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.
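Per the note above, the asterisk effectively marks a store that needs a gateway. That rule is trivial to express (the helper name is made up for illustration):

```python
def needs_data_management_gateway(store_name: str) -> bool:
    """Stores marked with * in the table can be on-premises or on Azure IaaS
    and require Data Management Gateway on a machine that can reach them."""
    return store_name.endswith("*")

assert needs_data_management_gateway("SQL Server*")
assert needs_data_management_gateway("File System*")
assert not needs_data_management_gateway("Azure Blob storage")
```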
For more information, see the Data Movement Activities article.

Data transformation activities
Azure Data Factory supports the following transformation activities, which can be added to pipelines either individually or chained with another activity.

| Data transformation activity | Compute environment |
| --- | --- |
| Hive | HDInsight [Hadoop] |
| Pig | HDInsight [Hadoop] |
| MapReduce | HDInsight [Hadoop] |
| Hadoop Streaming | HDInsight [Hadoop] |
| Spark | HDInsight [Hadoop] |
| Machine Learning activities: Batch Execution and Update Resource | Azure VM |
| Stored Procedure | Azure SQL, Azure SQL Data Warehouse, or SQL Server |
| Data Lake Analytics U-SQL | Azure Data Lake Analytics |
| DotNet | HDInsight [Hadoop] or Azure Batch |

NOTE: You can use the MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark programs from Azure Data Factory for details. You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script using Azure Data Factory.

For more information, see the Data Transformation Activities article.

Custom .NET activities
If you need to move data to or from a data store that the Copy Activity doesn't support, or to transform data using your own logic, create a custom .NET activity. For details on creating and using a custom activity, see Use custom activities in an Azure Data Factory pipeline.

Schedule pipelines
A pipeline is active only between its start time and end time. It is not executed before the start time or after the end time. If the pipeline is paused, it is not executed at all, irrespective of its start and end time. For a pipeline to run, it should not be paused. See Scheduling and Execution to understand how scheduling and execution work in Azure Data Factory.

Pipeline JSON
Let us take a closer look at how a pipeline is defined in JSON format.
The generic structure for a pipeline looks as follows:

```json
{
    "name": "PipelineName",
    "properties": {
        "description": "pipeline description",
        "activities": [
        ],
        "start": "<start date-time>",
        "end": "<end date-time>",
        "isPaused": true/false,
        "pipelineMode": "scheduled/onetime",
        "expirationTime": "15.00:00:00",
        "datasets": [
        ]
    }
}
```

name (required): Name of the pipeline. Specify a name that represents the action that the pipeline performs. Maximum number of characters: 260. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: “.”, “+”, “?”, “/”, “<”, “>”, “*”, “%”, “&”, “:”, “\”.

description (required): Text describing what the pipeline is used for.

activities (required): The activities section can have one or more activities defined within it. See the next section for details about the activities JSON element.

start (optional): Start date-time for the pipeline. Must be in ISO format, for example 2016-10-14T16:32:41Z. It is possible to specify a local time, for example an EST time: 2016-02-27T06:00:00-05:00, which is 6 AM EST. The start and end properties together specify the active period for the pipeline; output slices are only produced within this active period. If you specify a value for the end property, you must also specify a value for the start property. The start and end times can both be empty to create a pipeline, but you must specify both values to set an active period for the pipeline to run. If you do not specify start and end times when creating a pipeline, you can set them later by using the Set-AzureRmDataFactoryPipelineActivePeriod cmdlet.

end (optional): End date-time for the pipeline. If specified, it must be in ISO format, for example 2016-10-14T17:32:41Z. It is possible to specify a local time, for example an EST time: 2016-02-27T06:00:00-05:00, which is 6 AM EST. If you specify a value for the start property, you must also specify a value for the end property; see the notes for the start property. To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property. A pipeline is active only between its start time and end time. It is not executed before the start time or after the end time. If the pipeline is paused, it is not executed, irrespective of its start and end time. For a pipeline to run, it should not be paused. See Scheduling and Execution to understand how scheduling and execution work in Azure Data Factory.

isPaused (optional): If set to true, the pipeline does not run; it is in the paused state. Default value: false. You can use this property to enable or disable a pipeline.

pipelineMode (optional): The method for scheduling runs for the pipeline. Allowed values are scheduled (default) and onetime. "Scheduled" indicates that the pipeline runs at a specified time interval according to its active period (start and end time). "Onetime" indicates that the pipeline runs only once. Onetime pipelines, once created, currently cannot be modified or updated. See Onetime pipeline for details about the onetime setting.

expirationTime (optional): Duration of time after creation for which the one-time pipeline is valid and should remain provisioned. If it does not have any active, failed, or pending runs, the pipeline is automatically deleted once it reaches the expiration time. Default value: "expirationTime": "3.00:00:00".

datasets (optional): List of datasets to be used by activities defined in the pipeline. This property can be used to define datasets that are specific to this pipeline and not defined within the data factory. Datasets defined within this pipeline can only be used by this pipeline and cannot be shared. See Scoped datasets for details.

Activity JSON
The activities section can have one or more activities defined within it.
Each activity has the following top-level structure:

```json
{
    "name": "ActivityName",
    "description": "description",
    "type": "<ActivityType>",
    "inputs": [ ],
    "outputs": [ ],
    "linkedServiceName": "MyLinkedService",
    "typeProperties": {
    },
    "policy": {
    },
    "scheduler": {
    }
}
```

The following list describes the properties in the activity JSON definition:

name (required): Name of the activity. Specify a name that represents the action that the activity performs. Maximum number of characters: 260. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: “.”, “+”, “?”, “/”, “<”, “>”, “*”, “%”, “&”, “:”, “\”.

description (required): Text describing what the activity is used for.

type (required): Type of the activity. See the Data movement activities and Data transformation activities sections for the different types of activities.

inputs (required): Input tables used by the activity. For example, one input table: "inputs": [ { "name": "inputtable1" } ]; two input tables: "inputs": [ { "name": "inputtable1" }, { "name": "inputtable2" } ].

outputs (required): Output tables used by the activity. For example, one output table: "outputs": [ { "name": "outputtable1" } ]; two output tables: "outputs": [ { "name": "outputtable1" }, { "name": "outputtable2" } ].

linkedServiceName (required for the HDInsight activities and the Azure Machine Learning Batch Scoring activity; optional for all others): Name of the linked service used by the activity. An activity may require that you specify the linked service that links to the required compute environment.

typeProperties (optional): Properties in the typeProperties section depend on the type of the activity. To see the type properties for an activity, click the links to the activity in the previous section.

policy (optional): Policies that affect the run-time behavior of the activity. If not specified, default policies are used.

scheduler (optional): The scheduler property is used to define the desired scheduling for the activity. Its subproperties are the same as those of the availability property in a dataset.

Policies
Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed. The following list provides the details.

concurrency (Integer; default: 1; maximum: 10): Number of concurrent executions of the activity. It determines the number of parallel activity executions that can happen on different slices. For example, if an activity needs to go through a large set of available data, a larger concurrency value speeds up the data processing.

executionPriorityOrder (NewestFirst or OldestFirst; default: OldestFirst): Determines the ordering of the data slices that are being processed. For example, suppose you have two slices (one at 4 PM and another at 5 PM), and both are pending execution. If you set executionPriorityOrder to NewestFirst, the 5 PM slice is processed first. Similarly, if you set executionPriorityOrder to OldestFirst, the 4 PM slice is processed first.

retry (Integer; default: 0; maximum: 10): Number of retries before the data processing for the slice is marked as Failure. Activity execution for a data slice is retried up to the specified retry count. The retry is done as soon as possible after the failure.

timeout (TimeSpan; default: 00:00:00): Timeout for the activity. Example: 00:10:00 implies a timeout of 10 minutes. If a value is not specified or is 0, the timeout is infinite. If the data processing time on a slice exceeds the timeout value, it is canceled, and the system attempts to retry the processing. The number of retries depends on the retry property. When a timeout occurs, the status is set to TimedOut.

delay (TimeSpan; default: 00:00:00): Specifies the delay before data processing of the slice starts. The execution of an activity for a data slice is started after the delay past the expected execution time. Example: 00:10:00 implies a delay of 10 minutes.

longRetry (Integer; default: 1; maximum: 10): The number of long retry attempts before the slice execution is failed. longRetry attempts are spaced by longRetryInterval, so if you need to specify a time between retry attempts, use longRetry. If both retry and longRetry are specified, each longRetry attempt includes retry attempts, and the maximum number of attempts is retry * longRetry. For example, suppose the activity policy has retry: 3, longRetry: 2, and longRetryInterval: 01:00:00. Assume there is only one slice to execute (status is Waiting) and the activity execution fails every time. Initially there would be 3 consecutive execution attempts; after each attempt, the slice status would be Retry. After the first 3 attempts are over, the slice status would be LongRetry. After an hour (that is, longRetryInterval's value), there would be another set of 3 consecutive execution attempts; after that, the slice status would be Failed, and no more retries would be attempted. Hence, overall 6 attempts were made. If any execution succeeds, the slice status would be Ready and no more retries are attempted. longRetry may be used in situations where dependent data arrives at non-deterministic times, or the overall environment in which data processing occurs is flaky. In such cases, doing retries one after another may not help, while doing so after an interval of time produces the desired output. A word of caution: do not set high values for longRetry or longRetryInterval; typically, higher values imply other systemic issues.

longRetryInterval (TimeSpan; default: 00:00:00): The delay between long retry attempts.

Sample copy pipeline
In the following sample pipeline, there is one activity of type Copy in the activities section. In this sample, the copy activity copies data from Azure Blob storage to an Azure SQL database.
```json
{
    "name": "CopyPipeline",
    "properties": {
        "description": "Copy data from a blob to Azure SQL table",
        "activities": [
            {
                "name": "CopyFromBlobToSQL",
                "type": "Copy",
                "inputs": [ { "name": "InputDataset" } ],
                "outputs": [ { "name": "OutputDataset" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60:00:00"
                    }
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ],
        "start": "2016-07-12T00:00:00Z",
        "end": "2016-07-13T00:00:00Z"
    }
}
```

Note the following points:

In the activities section, there is only one activity whose type is set to Copy.

Input for the activity is set to InputDataset, and output for the activity is set to OutputDataset. See the Datasets article for defining datasets in JSON.

In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink type. In the Data movement activities section, click the data store that you want to use as a source or a sink to learn more about moving data to and from that data store.

For a complete walkthrough of creating this pipeline, see Tutorial: Copy data from Blob Storage to SQL Database.

Sample transformation pipeline
In the following sample pipeline, there is one activity of type HDInsightHive in the activities section. In this sample, the HDInsight Hive activity transforms data from Azure Blob storage by running a Hive script file on an Azure HDInsight Hadoop cluster.
```json
{
    "name": "TransformPipeline",
    "properties": {
        "description": "My first Azure Data Factory pipeline",
        "activities": [
            {
                "type": "HDInsightHive",
                "typeProperties": {
                    "scriptPath": "adfgetstarted/script/partitionweblogs.hql",
                    "scriptLinkedService": "AzureStorageLinkedService",
                    "defines": {
                        "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
                        "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
                    }
                },
                "inputs": [ { "name": "AzureBlobInput" } ],
                "outputs": [ { "name": "AzureBlobOutput" } ],
                "policy": {
                    "concurrency": 1,
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Month",
                    "interval": 1
                },
                "name": "RunSampleHiveActivity",
                "linkedServiceName": "HDInsightOnDemandLinkedService"
            }
        ],
        "start": "2016-04-01T00:00:00Z",
        "end": "2016-04-02T00:00:00Z",
        "isPaused": false
    }
}
```

Note the following points:

In the activities section, there is only one activity whose type is set to HDInsightHive.

The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the scriptLinkedService, called AzureStorageLinkedService), in the script folder in the container adfgetstarted.

The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable} and ${hiveconf:partitionedtable}).

The typeProperties section is different for each transformation activity. To learn about the type properties supported for a transformation activity, click the transformation activity in the Data transformation activities table.

For a complete walkthrough of creating this pipeline, see Tutorial: Build your first pipeline to process data using Hadoop cluster.

Multiple activities in a pipeline
The previous two sample pipelines have only one activity in them. You can have more than one activity in a pipeline.
If you have multiple activities in a pipeline and the output of one activity is not an input of another activity, the activities may run in parallel if the input data slices for the activities are ready.

You can chain two activities by making the output dataset of one activity the input dataset of the other activity. The second activity executes only when the first one completes successfully.

In this sample, the pipeline has two activities: Activity1 and Activity2. Activity1 takes Dataset1 as an input and produces Dataset2 as an output. Activity2 takes Dataset2 as an input and produces Dataset3 as an output. Because the output of Activity1 (Dataset2) is the input of Activity2, Activity2 runs only after Activity1 completes successfully and produces the Dataset2 slice. If Activity1 fails for some reason and does not produce the Dataset2 slice, Activity2 does not run for that slice (for example: 9 AM to 10 AM).

You can also chain activities that are in different pipelines. In this sample, Pipeline1 has only one activity that takes Dataset1 as an input and produces Dataset2 as an output. Pipeline2 also has only one activity that takes Dataset2 as an input and produces Dataset3 as an output. For more information, see scheduling and execution.

Create and monitor pipelines
You can create pipelines by using one of these tools or SDKs:
- Copy Wizard
- Azure portal
- Visual Studio
- Azure PowerShell
- Azure Resource Manager template
- REST API
- .NET API

See the following tutorials for step-by-step instructions for creating pipelines by using one of these tools or SDKs:
- Build a pipeline with a data transformation activity
- Build a pipeline with a data movement activity

Once a pipeline is created/deployed, you can manage and monitor your pipelines by using the Azure portal blades or the Monitor and Manage App. See the following topics for step-by-step instructions:
- Monitor and manage pipelines by using Azure portal blades
- Monitor and manage pipelines by using the Monitor and Manage App

Onetime pipeline
You can create and schedule a pipeline to run periodically (for example: hourly or daily) within the start and end times you specify in the pipeline definition. See Scheduling activities for details. You can also create a pipeline that runs only once. To do so, you set the pipelineMode property in the pipeline definition to onetime, as shown in the following JSON sample. The default value for this property is scheduled.

{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "BlobSource", "recursive": false },
          "sink": { "type": "BlobSink", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" }
        },
        "inputs": [ { "name": "InputDataset" } ],
        "outputs": [ { "name": "OutputDataset" } ],
        "name": "CopyActivity-0"
      }
    ],
    "pipelineMode": "OneTime"
  }
}

Note the following:
- Start and end times for the pipeline are not specified.
- Availability of the input and output datasets is specified (frequency and interval), even though Data Factory does not use the values.
- Diagram view does not show one-time pipelines. This behavior is by design.
- One-time pipelines cannot be updated. You can clone a one-time pipeline, rename it, update properties, and deploy it to create another one.

Next Steps
- For more information about datasets, see the Create datasets article.
- For more information about how pipelines are scheduled and executed, see the Scheduling and execution in Azure Data Factory article.

Datasets in Azure Data Factory
5/1/2017 • 15 min to read

This article describes what datasets are, how they are defined in JSON format, and how they are used in Azure Data Factory pipelines. It provides details about each section (for example, structure, availability, and policy) in the dataset JSON definition. The article also provides examples for using the offset, anchorDateTime, and style properties in a dataset JSON definition.
NOTE If you are new to Data Factory, see Introduction to Azure Data Factory for an overview. If you do not have hands-on experience with creating data factories, you can gain a better understanding by reading the data transformation tutorial and the data movement tutorial. Overview A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy activity to copy data from an on-premises SQL Server to Azure Blob storage. Then, you might use a Hive activity that runs a Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data. Finally, you might use a second copy activity to copy the output data to Azure SQL Data Warehouse, on top of which business intelligence (BI) reporting solutions are built. For more information about pipelines and activities, see Pipelines and activities in Azure Data Factory. An activity can take zero or more input datasets, and produce one or more output datasets. An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output for the activity. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the pipeline should read the data. Before you create a dataset, create a linked service to link your data store to the data factory. Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Datasets identify data within the linked data stores, such as SQL tables, files, folders, and documents. For example, an Azure Storage linked service links a storage account to the data factory. 
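As a hedged illustration of that relationship, an Azure Storage linked service is a small JSON document whose connection string points at the storage account. The snippet below is a sketch; the account name and key values are placeholders:

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

A dataset then references this linked service by name through its linkedServiceName property.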
An Azure Blob dataset represents the blob container and the folder that contains the input blobs to be processed.

Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked services: Azure Storage and Azure SQL Database. Then, create two datasets: an Azure Blob dataset (which refers to the Azure Storage linked service) and an Azure SQL Table dataset (which refers to the Azure SQL Database linked service). The Azure Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at runtime to connect to your Azure Storage and Azure SQL Database, respectively. The Azure Blob dataset specifies the blob container and blob folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies the SQL table in your SQL database to which the data is to be copied.

The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory:

Dataset JSON
A dataset in Data Factory is defined in JSON format as follows:

{
  "name": "<name of dataset>",
  "properties": {
    "type": "<type of dataset: AzureBlob, AzureSql etc...>",
    "external": <boolean flag to indicate external data. only for input datasets>,
    "linkedServiceName": "<Name of the linked service that refers to a data store.>",
    "structure": [
      {
        "name": "<Name of the column>",
        "type": "<Name of the type>"
      }
    ],
    "typeProperties": {
      "<type specific property>": "<value>",
      "<type specific property 2>": "<value 2>"
    },
    "availability": {
      "frequency": "<Specifies the time unit for data slice production. Supported frequency: Minute, Hour, Day, Week, Month>",
      "interval": "<Specifies the interval within the defined frequency. For example, frequency set to 'Hour' and interval set to 1 indicates that new data slices should be produced hourly>"
    },
    "policy": {
    }
  }
}

The following table describes properties in the above JSON:

| PROPERTY | DESCRIPTION | REQUIRED | DEFAULT |
| --- | --- | --- | --- |
| name | Name of the dataset. See Azure Data Factory - Naming rules for naming rules. | Yes | NA |
| type | Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). For details, see Dataset type. | Yes | NA |
| structure | Schema of the dataset. For details, see Dataset structure. | No | NA |
| typeProperties | The type properties are different for each type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see Dataset type. | Yes | NA |
| external | Boolean flag to specify whether a dataset is explicitly produced by a data factory pipeline or not. If the input dataset for an activity is not produced by the current pipeline, set this flag to true. Set this flag to true for the input dataset of the first activity in the pipeline. | No | false |
| availability | Defines the processing window (for example, hourly or daily) or the slicing model for the dataset production. Each unit of data consumed and produced by an activity run is called a data slice. If the availability of an output dataset is set to daily (frequency - Day, interval - 1), a slice is produced daily. For details, see Dataset availability. For details on the dataset slicing model, see the Scheduling and execution article. | Yes | NA |
| policy | Defines the criteria or the condition that the dataset slices must fulfill. For details, see the Dataset policy section. | No | NA |

Dataset example
In the following example, the dataset represents a table named MyTable in a SQL database.

{
  "name": "DatasetSample",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "MyTable"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}

Note the following points:
- type is set to AzureSqlTable.
- tableName type property (specific to the AzureSqlTable type) is set to MyTable.
- linkedServiceName refers to a linked service of type AzureSqlDatabase, which is defined in the next JSON snippet.
- availability frequency is set to Day, and interval is set to 1. This means that the dataset slice is produced daily.

AzureSqlLinkedService is defined as follows:

{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "description": "",
    "typeProperties": {
      "connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>@<servername>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
    }
  }
}

In the preceding JSON snippet:
- type is set to AzureSqlDatabase.
- connectionString type property specifies information to connect to a SQL database.

As you can see, the linked service defines how to connect to a SQL database. The dataset defines what table is used as an input and output for the activity in a pipeline.

IMPORTANT
Unless a dataset is being produced by the pipeline, it should be marked as external. This setting generally applies to the inputs of the first activity in a pipeline.

Dataset type
The type of the dataset depends on the data store you use. See the following table for a list of data stores supported by Data Factory. Click a data store to learn how to create a linked service and a dataset for that data store.
| CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| | Azure Cosmos DB (DocumentDB API) | ✓ | ✓ |
| | Azure Data Lake Store | ✓ | ✓ |
| | Azure SQL Database | ✓ | ✓ |
| | Azure SQL Data Warehouse | ✓ | ✓ |
| | Azure Search Index | | ✓ |
| | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| | DB2* | ✓ | |
| | MySQL* | ✓ | |
| | Oracle* | ✓ | ✓ |
| | PostgreSQL* | ✓ | |
| | SAP Business Warehouse* | ✓ | |
| | SAP HANA* | ✓ | |
| | SQL Server* | ✓ | ✓ |
| | Sybase* | ✓ | |
| | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| | File System* | ✓ | ✓ |
| | FTP | ✓ | |
| | HDFS* | ✓ | |
| | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| | Generic OData | ✓ | |
| | Generic ODBC* | ✓ | |
| | Salesforce | ✓ | |
| | Web Table (table from HTML) | ✓ | |
| | GE Historian* | ✓ | |

NOTE
Data stores with * can be on-premises or on Azure infrastructure as a service (IaaS). These data stores require you to install Data Management Gateway.

In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly, for an Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following JSON:

{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "fileName": "input.log",
      "folderPath": "adfgetstarted/inputdata",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "availability": {
      "frequency": "Month",
      "interval": 1
    },
    "external": true,
    "policy": {}
  }
}

Dataset structure
The structure section is optional. It defines the schema of the dataset by containing a collection of names and data types of columns. You use the structure section to provide type information that is used to convert types and map columns from the source to the destination.

In the following example, the dataset has three columns: slicetimestamp, projectname, and pageviews. They are of type String, String, and Decimal, respectively.
structure: [ { "name": "slicetimestamp", "type": "String"}, { "name": "projectname", "type": "String"}, { "name": "pageviews", "type": "Decimal"} ] Each column in the structure contains the following properties: PROPERTY DESCRIPTION REQUIRED name Name of the column. Yes type Data type of the column. No culture .NET-based culture to be used when the type is a .NET type: Datetime or Datetimeoffset . The default is en-us . No format Format string to be used when the type is a .NET type: Datetime or Datetimeoffset . No The following guidelines help you determine when to include structure information, and what to include in the structure section. For structured data sources, specify the structure section only if you want map source columns to sink columns, and their names are not the same. This kind of structured data source stores data schema and type information along with the data itself. Examples of structured data sources include SQL Server, Oracle, and Azure table. As type information is already available for structured data sources, you should not include type information when you do include the structure section. For schema on read data sources (specifically Blob storage), you can choose to store data without storing any schema or type information with the data. For these types of data sources, include structure when you want to map source columns to sink columns. Also include structure when the dataset is an input for a copy activity, and data types of source dataset should be converted to native types for the sink. Data Factory supports the following values for providing type information in structure: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Bool, String, Guid, Datetime, Datetimeoffset, and Timespan. These values are Common Language Specification (CLS)compliant, .NET-based type values. Data Factory automatically performs type conversions when moving data from a source data store to a sink data store. 
Dataset availability
The availability section in a dataset defines the processing window (for example, hourly, daily, or weekly) for the dataset. For more information about activity windows, see Scheduling and execution.

The following availability section specifies that the output dataset is either produced hourly, or the input dataset is available hourly:

"availability": {
  "frequency": "Hour",
  "interval": 1
}

If the pipeline has the following start and end times:

"start": "2016-08-25T00:00:00Z",
"end": "2016-08-25T05:00:00Z",

the output dataset is produced hourly within the pipeline start and end times. Therefore, there are five dataset slices produced by this pipeline, one for each activity window (12 AM - 1 AM, 1 AM - 2 AM, 2 AM - 3 AM, 3 AM - 4 AM, 4 AM - 5 AM).

The following table describes properties you can use in the availability section:

| PROPERTY | DESCRIPTION | REQUIRED | DEFAULT |
| --- | --- | --- | --- |
| frequency | Specifies the time unit for dataset slice production. Supported frequency: Minute, Hour, Day, Week, Month. | Yes | NA |
| interval | Specifies a multiplier for frequency. "Frequency x interval" determines how often the slice is produced. For example, if you need the dataset to be sliced on an hourly basis, you set frequency to Hour, and interval to 1. Note that if you specify frequency as Minute, you should set the interval to no less than 15. | Yes | NA |
| style | Specifies whether the slice should be produced at the start or end of the interval. Supported values: StartOfInterval, EndOfInterval. If frequency is set to Month, and style is set to EndOfInterval, the slice is produced on the last day of the month. If style is set to StartOfInterval, the slice is produced on the first day of the month. If frequency is set to Day, and style is set to EndOfInterval, the slice is produced in the last hour of the day. If frequency is set to Hour, and style is set to EndOfInterval, the slice is produced at the end of the hour. For example, for a slice for the 1 PM - 2 PM period, the slice is produced at 2 PM. | No | EndOfInterval |
| anchorDateTime | Defines the absolute position in time used by the scheduler to compute dataset slice boundaries. Note that if this property has date parts that are more granular than the specified frequency, the more granular parts are ignored. For example, if the interval is hourly (frequency: hour and interval: 1), and the anchorDateTime contains minutes and seconds, then the minutes and seconds parts of anchorDateTime are ignored. | No | 01/01/0001 |
| offset | Timespan by which the start and end of all dataset slices are shifted. Note that if both anchorDateTime and offset are specified, the result is the combined shift. | No | NA |

offset example
By default, daily ( "frequency": "Day", "interval": 1 ) slices start at 12 AM (midnight) Coordinated Universal Time (UTC). If you want the start time to be 6 AM UTC instead, set the offset as shown in the following snippet:

"availability": {
  "frequency": "Day",
  "interval": 1,
  "offset": "06:00:00"
}

anchorDateTime example
In the following example, the dataset is produced once every 23 hours. The first slice starts at the time specified by anchorDateTime, which is set to 2017-04-19T08:00:00 (UTC).

"availability": {
  "frequency": "Hour",
  "interval": 23,
  "anchorDateTime": "2017-04-19T08:00:00"
}

offset/style example
The following dataset is monthly, and is produced on the 3rd of every month at 8:00 AM ( 3.08:00:00 ):

"availability": {
  "frequency": "Month",
  "interval": 1,
  "offset": "3.08:00:00",
  "style": "StartOfInterval"
}

Dataset policy
The policy section in the dataset definition defines the criteria or the condition that the dataset slices must fulfill.

Validation policies

| POLICY NAME | DESCRIPTION | APPLIED TO | REQUIRED | DEFAULT |
| --- | --- | --- | --- | --- |
| minimumSizeMB | Validates that the data in Azure Blob storage meets the minimum size requirements (in megabytes). | Azure Blob storage | No | NA |
| minimumRows | Validates that the data in an Azure SQL database or an Azure table contains the minimum number of rows. | Azure SQL database, Azure table | No | NA |

Examples

minimumSizeMB:

"policy": {
  "validation": {
    "minimumSizeMB": 10.0
  }
}

minimumRows:

"policy": {
  "validation": {
    "minimumRows": 100
  }
}

External datasets
External datasets are the ones that are not produced by a running pipeline in the data factory. If the dataset is marked as external, the ExternalData policy may be defined to influence the behavior of the dataset slice availability.

Unless a dataset is being produced by Data Factory, it should be marked as external. This setting generally applies to the inputs of the first activity in a pipeline, unless activity or pipeline chaining is being used.

| NAME | DESCRIPTION | REQUIRED | DEFAULT VALUE |
| --- | --- | --- | --- |
| dataDelay | The time to delay the check on the availability of the external data for the given slice. For example, you can delay an hourly check by using this setting. The setting only applies to the present time. For example, if it is 1:00 PM right now and this value is 10 minutes, the validation starts at 1:10 PM. Note that this setting does not affect slices in the past; slices with Slice End Time + dataDelay < Now are processed without any delay. Times greater than 23:59 hours should be specified by using the day.hours:minutes:seconds format. For example, to specify 24 hours, don't use 24:00:00; instead, use 1.00:00:00. If you use 24:00:00, it is treated as 24 days (24.00:00:00). For 1 day and 4 hours, specify 1.04:00:00. | No | 0 |
| retryInterval | The wait time between a failure and the next retry attempt. This setting applies to the present time; if the previous try failed, the next try is after the retryInterval period. Example: if it is 1:00 PM right now, we begin the first try. If the duration to complete the first validation check is 1 minute and the operation failed, the next retry is at 1:00 + 1 min (duration) + 1 min (retry interval) = 1:02 PM. For slices in the past, there is no delay; the retry happens immediately. | No | 00:01:00 (1 minute) |
| retryTimeout | The timeout for each retry attempt. If this property is set to 10 minutes, the validation should be completed within 10 minutes. If it takes longer than 10 minutes to perform the validation, the retry times out. If all attempts for the validation time out, the slice is marked as TimedOut. | No | 00:10:00 (10 minutes) |
| maximumRetry | The number of times to check for the availability of the external data. The maximum allowed value is 10. | No | 3 |

Create datasets
You can create datasets by using one of these tools or SDKs:
- Copy Wizard
- Azure portal
- Visual Studio
- PowerShell
- Azure Resource Manager template
- REST API
- .NET API

See the following tutorials for step-by-step instructions for creating pipelines and datasets by using one of these tools or SDKs:
- Build a pipeline with a data transformation activity
- Build a pipeline with a data movement activity

After a pipeline is created and deployed, you can manage and monitor your pipelines by using the Azure portal blades, or the Monitoring and Management app. See the following topics for step-by-step instructions:
- Monitor and manage pipelines by using Azure portal blades
- Monitor and manage pipelines by using the Monitoring and Management app

Scoped datasets
You can create datasets that are scoped to a pipeline by using the datasets property. These datasets can only be used by activities within this pipeline, not by activities in other pipelines. The following example defines a pipeline with two datasets (InputDataset-rdc and OutputDataset-rdc) to be used within the pipeline.

IMPORTANT
Scoped datasets are supported only with one-time pipelines (where pipelineMode is set to OneTime). See Onetime pipeline for details.
{ "name": "CopyPipeline-rdc", "properties": { "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "BlobSource", "recursive": false }, "sink": { "type": "BlobSink", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" } }, "inputs": [ { "name": "InputDataset-rdc" } ], "outputs": [ { "name": "OutputDataset-rdc" } ], "scheduler": { "frequency": "Day", "interval": 1, "style": "StartOfInterval" }, "name": "CopyActivity-0" "name": "CopyActivity-0" } ], "start": "2016-02-28T00:00:00Z", "end": "2016-02-28T00:00:00Z", "isPaused": false, "pipelineMode": "OneTime", "expirationTime": "15.00:00:00", "datasets": [ { "name": "InputDataset-rdc", "properties": { "type": "AzureBlob", "linkedServiceName": "InputLinkedService-rdc", "typeProperties": { "fileName": "emp.txt", "folderPath": "adftutorial/input", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "," } }, "availability": { "frequency": "Day", "interval": 1 }, "external": true, "policy": {} } }, { "name": "OutputDataset-rdc", "properties": { "type": "AzureBlob", "linkedServiceName": "OutputLinkedService-rdc", "typeProperties": { "fileName": "emp.txt", "folderPath": "adftutorial/output", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "," } }, "availability": { "frequency": "Day", "interval": 1 }, "external": false, "policy": {} } } ] } } Next steps For more information about pipelines, see Create pipelines. For more information about how pipelines are scheduled and executed, see Scheduling and execution in Azure Data Factory. Data Factory scheduling and execution 5/22/2017 • 22 min to read • Edit Online This article explains the scheduling and execution aspects of the Azure Data Factory application model. This article assumes that you understand basics of Data Factory application model concepts, including activity, pipelines, linked services, and datasets. 
For basic concepts of Azure Data Factory, see the following articles:
- Introduction to Data Factory
- Pipelines
- Datasets

Start and end times of pipeline
A pipeline is active only between its start time and end time. It is not executed before the start time or after the end time. If the pipeline is paused, it is not executed, irrespective of its start and end times. For a pipeline to run, it should not be paused. You find these settings (start, end, paused) in the pipeline definition:

"start": "2017-04-01T08:00:00Z",
"end": "2017-04-01T11:00:00Z",
"isPaused": false

For more information about these properties, see the create pipelines article.

Specify schedule for an activity
It is not the pipeline that is executed. It is the activities in the pipeline that are executed in the overall context of the pipeline. You can specify a recurring schedule for an activity by using the scheduler section of the activity JSON. For example, you can schedule an activity to run hourly as follows:

"scheduler": {
  "frequency": "Hour",
  "interval": 1
},

As shown in the following diagram, specifying a schedule for an activity creates a series of tumbling windows within the pipeline start and end times. Tumbling windows are a series of fixed-size, non-overlapping, contiguous time intervals. These logical tumbling windows for an activity are called activity windows.

The scheduler property for an activity is optional. If you do specify this property, it must match the cadence you specify in the definition of the output dataset for the activity. Currently, the output dataset is what drives the schedule. Therefore, you must create an output dataset even if the activity does not produce any output.

Specify schedule for a dataset
An activity in a Data Factory pipeline can take zero or more input datasets and produce one or more output datasets. For an activity, you can specify the cadence at which the input data is available or the output data is produced by using the availability section in the dataset definitions.
Frequency in the availability section specifies the time unit. The allowed values for frequency are: Minute, Hour, Day, Week, and Month. The interval property in the availability section specifies a multiplier for frequency. For example: if the frequency is set to Day and interval is set to 1 for an output dataset, the output data is produced daily. If you specify the frequency as Minute, we recommend that you set the interval to no less than 15.

In the following example, the input data is available hourly and the output data is produced hourly ( "frequency": "Hour", "interval": 1 ).

Input dataset:

{
  "name": "AzureSqlInput",
  "properties": {
    "published": false,
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "MyTable"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "external": true,
    "policy": {}
  }
}

Output dataset:

{
  "name": "AzureBlobOutput",
  "properties": {
    "published": false,
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mypath/{Year}/{Month}/{Day}/{Hour}",
      "format": { "type": "TextFormat" },
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ]
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

Currently, the output dataset drives the schedule. In other words, the schedule specified for the output dataset is used to run an activity at runtime. Therefore, you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset.
In the following pipeline definition, the scheduler property is used to specify a schedule for the activity. This property is optional. Currently, the schedule for the activity must match the schedule specified for the output dataset.

{
  "name": "SamplePipeline",
  "properties": {
    "description": "copy activity",
    "activities": [
      {
        "type": "Copy",
        "name": "AzureSQLtoBlob",
        "description": "copy activity",
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
          },
          "sink": {
            "type": "BlobSink",
            "writeBatchSize": 100000,
            "writeBatchTimeout": "00:05:00"
          }
        },
        "inputs": [ { "name": "AzureSQLInput" } ],
        "outputs": [ { "name": "AzureBlobOutput" } ],
        "scheduler": { "frequency": "Hour", "interval": 1 }
      }
    ],
    "start": "2017-04-01T08:00:00Z",
    "end": "2017-04-01T11:00:00Z"
  }
}

In this example, the activity runs hourly between the start and end times of the pipeline. The output data is produced hourly for three one-hour windows (8 AM - 9 AM, 9 AM - 10 AM, and 10 AM - 11 AM).

Each unit of data consumed or produced by an activity run is called a data slice. The following diagram shows an example of an activity with one input dataset and one output dataset:

The diagram shows the hourly data slices for the input and output datasets. It shows three input slices that are ready for processing. The 10-11 AM activity is in progress, producing the 10-11 AM output slice.

You can access the time interval associated with the current slice in the dataset JSON by using the variables SliceStart and SliceEnd. Similarly, you can access the time interval associated with an activity window by using WindowStart and WindowEnd. The schedule of an activity must match the schedule of the output dataset for the activity. Therefore, the SliceStart and SliceEnd values are the same as the WindowStart and WindowEnd values, respectively.
For more information on these variables, see the Data Factory functions and system variables article. You can use these variables for different purposes in your activity JSON. For example, you can use them to select data from input and output datasets representing time series data (for example: 8 AM to 9 AM). This example also uses WindowStart and WindowEnd to select the relevant data for an activity run and copy it to a blob with the appropriate folderPath. The folderPath is parameterized to have a separate folder for every hour.

In the preceding example, the schedule specified for the input and output datasets is the same (hourly). If the input dataset for the activity is available at a different frequency, say every 15 minutes, the activity that produces the output dataset still runs once an hour, because the output dataset is what drives the activity schedule. For more information, see Model datasets with different frequencies.

Dataset availability and policies

You have seen the use of the frequency and interval properties in the availability section of a dataset definition. There are a few other properties that affect the scheduling and execution of an activity.

Dataset availability

The following table describes properties you can use in the availability section:

| Property | Description | Required | Default |
|---|---|---|---|
| frequency | Specifies the time unit for dataset slice production. Supported frequencies: Minute, Hour, Day, Week, Month. | Yes | NA |
| interval | Specifies a multiplier for frequency. "Frequency x interval" determines how often the slice is produced. For example, if you need the dataset to be sliced on an hourly basis, set frequency to Hour and interval to 1. Note: If you specify frequency as Minute, we recommend that you set the interval to no less than 15. | Yes | NA |
| style | Specifies whether the slice should be produced at the start or end of the interval. Values: StartOfInterval, EndOfInterval. If frequency is set to Month and style is set to EndOfInterval, the slice is produced on the last day of the month. If style is set to StartOfInterval, the slice is produced on the first day of the month. If frequency is set to Day and style is set to EndOfInterval, the slice is produced in the last hour of the day. If frequency is set to Hour and style is set to EndOfInterval, the slice is produced at the end of the hour. For example, for the 1 PM - 2 PM slice, the slice is produced at 2 PM. | No | EndOfInterval |
| anchorDateTime | Defines the absolute position in time used by the scheduler to compute dataset slice boundaries. Note: If anchorDateTime has date parts that are more granular than the frequency, the more granular parts are ignored. For example, if the interval is hourly (frequency: Hour and interval: 1) and anchorDateTime contains minutes and seconds, the minutes and seconds parts are ignored. | No | 01/01/0001 |
| offset | Timespan by which the start and end of all dataset slices are shifted. Note: If both anchorDateTime and offset are specified, the result is the combined shift. | No | NA |

offset example

By default, daily ("frequency": "Day", "interval": 1) slices start at 12 AM UTC (midnight). If you want the start time to be 6 AM UTC instead, set the offset as shown in the following snippet:

```json
"availability": {
  "frequency": "Day",
  "interval": 1,
  "offset": "06:00:00"
}
```

anchorDateTime example

In the following example, the dataset is produced once every 23 hours. The first slice starts at the time specified by anchorDateTime, which is set to 2017-04-19T08:00:00 (UTC).
"availability": { "frequency": "Hour", "interval": 23, "anchorDateTime":"2017-04-19T08:00:00" } offset/style Example The following dataset is a monthly dataset and is produced on 3rd of every month at 8:00 AM ( 3.08:00:00 ): "availability": { "frequency": "Month", "interval": 1, "offset": "3.08:00:00", "style": "StartOfInterval" } Dataset policy A dataset can have a validation policy defined that specifies how the data generated by a slice execution can be validated before it is ready for consumption. In such cases, after the slice has finished execution, the output slice status is changed to Waiting with a substatus of Validation. After the slices are validated, the slice status changes to Ready. If a data slice has been produced but did not pass the validation, activity runs for downstream slices that depend on this slice are not processed. Monitor and manage pipelines covers the various states of data slices in Data Factory. The policy section in dataset definition defines the criteria or the condition that the dataset slices must fulfill. The following table describes properties you can use in the policy section: POLICY NAME DESCRIPTION APPLIED TO REQUIRED DEFAULT minimumSizeMB Validates that the data in an Azure blob meets the minimum size requirements (in megabytes). Azure Blob No NA minimumRows Validates that the data in an Azure SQL database or an Azure table contains the minimum number of rows. No NA Examples minimumSizeMB: "policy": { "validation": { "minimumSizeMB": 10.0 } } minimumRows "policy": { "validation": { "minimumRows": 100 } } Azure SQL Database Azure Table For more information about these properties and examples, see Create datasets article. Activity policies Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed. The following table provides the details. PROPERTY PERMITTED VALUES DEFAULT VALUE DESCRIPTION concurrency Integer 1 Number of concurrent executions of the activity. 
Max value: 10 It determines the number of parallel activity executions that can happen on different slices. For example, if an activity needs to go through a large set of available data, having a larger concurrency value speeds up the data processing. executionPriorityOrder NewestFirst OldestFirst OldestFirst Determines the ordering of data slices that are being processed. For example, if you have 2 slices (one happening at 4pm, and another one at 5pm), and both are pending execution. If you set the executionPriorityOrder to be NewestFirst, the slice at 5 PM is processed first. Similarly if you set the executionPriorityORder to be OldestFIrst, then the slice at 4 PM is processed. retry Integer Max value can be 10 0 Number of retries before the data processing for the slice is marked as Failure. Activity execution for a data slice is retried up to the specified retry count. The retry is done as soon as possible after the failure. PROPERTY PERMITTED VALUES DEFAULT VALUE DESCRIPTION timeout TimeSpan 00:00:00 Timeout for the activity. Example: 00:10:00 (implies timeout 10 mins) If a value is not specified or is 0, the timeout is infinite. If the data processing time on a slice exceeds the timeout value, it is canceled, and the system attempts to retry the processing. The number of retries depends on the retry property. When timeout occurs, the status is set to TimedOut. delay TimeSpan 00:00:00 Specify the delay before data processing of the slice starts. The execution of activity for a data slice is started after the Delay is past the expected execution time. Example: 00:10:00 (implies delay of 10 mins) longRetry Integer Max value: 10 1 The number of long retry attempts before the slice execution is failed. longRetry attempts are spaced by longRetryInterval. So if you need to specify a time between retry attempts, use longRetry. If both Retry and longRetry are specified, each longRetry attempt includes Retry attempts and the max number of attempts is Retry * longRetry. 
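The interplay of retry and longRetry can be sketched in Python. This is an illustrative simulation of the documented status sequence for a slice whose runs always fail, not Data Factory code.

```python
def slice_states(retry, long_retry):
    """Status after each failed attempt, per the retry/longRetry rules:
    'Retry' within a pass, 'LongRetry' at the end of each pass except the
    last, and 'Failed' after the final attempt."""
    states = []
    for long_attempt in range(long_retry):
        states.extend(["Retry"] * retry)           # one status per attempt
        states[-1] = "LongRetry" if long_attempt < long_retry - 1 else "Failed"
    return states

# retry: 3 with longRetry: 2 -> at most 3 * 2 = 6 execution attempts.
history = slice_states(3, 2)
# ['Retry', 'Retry', 'LongRetry', 'Retry', 'Retry', 'Failed']
```

If any attempt succeeds, the real scheduler instead sets the slice to Ready and stops retrying.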
For example, suppose the activity policy has the following settings:

retry: 3
longRetry: 2
longRetryInterval: 01:00:00

Assume there is only one slice to execute (status is Waiting) and the activity execution fails every time. Initially, there would be three consecutive execution attempts. After each attempt, the slice status would be Retry. After the first three attempts are over, the slice status would be LongRetry. After an hour (that is, longRetryInterval's value), there would be another set of three consecutive execution attempts. After that, the slice status would be Failed, and no more retries would be attempted. Hence, six attempts were made overall. If any execution succeeds, the slice status would be Ready and no more retries are attempted.

longRetry may be used in situations where dependent data arrives at nondeterministic times or the overall environment in which data processing occurs is flaky. In such cases, doing retries one after another may not help, whereas retrying after an interval of time produces the desired output. A word of caution: do not set high values for longRetry or longRetryInterval. Typically, higher values imply other systemic issues.

longRetryInterval (TimeSpan; default 00:00:00). The delay between long retry attempts.

For more information, see the Pipelines article.

Parallel processing of data slices

You can set the start date for the pipeline in the past. When you do so, Data Factory automatically calculates (backfills) all data slices in the past and begins processing them. For example, if you create a pipeline with start date 2017-04-01, the current date is 2017-04-10, and the cadence of the output dataset is daily, then Data Factory starts processing all the slices from 2017-04-01 to 2017-04-09 immediately, because the start date is in the past. The slice for 2017-04-10 is not processed yet, because the value of the style property in the availability section is EndOfInterval by default.
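The backfill calculation can be sketched in Python. This is an illustrative sketch under the default EndOfInterval style, not Data Factory code.

```python
from datetime import date, timedelta

def backfill_slices(start, today):
    """Daily slices whose EndOfInterval boundary has already passed are
    backfilled immediately; today's still-open slice is excluded."""
    slices = []
    d = start
    while d < today:
        slices.append(d)
        d += timedelta(days=1)
    return slices

# Pipeline start 2017-04-01 with current date 2017-04-10:
pending = backfill_slices(date(2017, 4, 1), date(2017, 4, 10))
# Nine slices, 2017-04-01 through 2017-04-09, begin processing immediately.
```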
The oldest slice is processed first because the default value of executionPriorityOrder is OldestFirst. For a description of the style property, see the dataset availability section. For a description of the executionPriorityOrder property, see the activity policies section.

You can configure back-filled data slices to be processed in parallel by setting the concurrency property in the policy section of the activity JSON. This property determines the number of parallel activity executions that can happen on different slices. The default value for the concurrency property is 1, so one slice is processed at a time by default. The maximum value is 10. When a pipeline needs to go through a large set of available data, having a larger concurrency value speeds up the data processing.

Rerun a failed data slice

When an error occurs while processing a data slice, you can find out why the processing failed by using the Azure portal blades or the Monitor and Manage app. See Monitoring and managing pipelines using Azure portal blades or Monitoring and Management app for details.

Consider the following example, which shows two activities: Activity1 and Activity2. Activity1 consumes a slice of Dataset1 and produces a slice of Dataset2, which is consumed as an input by Activity2 to produce a slice of the final dataset. The diagram shows that, of the three recent slices, there was a failure producing the 9-10 AM slice for Dataset2. Data Factory automatically tracks dependencies for the time series dataset. As a result, it does not start the activity run for the 9-10 AM downstream slice. Data Factory monitoring and management tools let you drill into the diagnostic logs for the failed slice to find the root cause of the issue and fix it. After you have fixed the issue, you can easily start the activity run to produce the failed slice.
For more information on how to rerun and understand state transitions for data slices, see Monitoring and managing pipelines using Azure portal blades or Monitoring and Management app. After you rerun the 9-10 AM slice for Dataset2, Data Factory starts the run for the dependent 9-10 AM slice on the final dataset.

Multiple activities in a pipeline

You can have more than one activity in a pipeline. If you have multiple activities in a pipeline and the output of one activity is not an input of another, the activities may run in parallel if the input data slices for the activities are ready.

You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. The activities can be in the same pipeline or in different pipelines. The second activity executes only when the first one finishes successfully.

For example, consider the following case where a pipeline has two activities:

1. Activity A1 requires the external input dataset D1 and produces the output dataset D2.
2. Activity A2 requires input from dataset D2 and produces the output dataset D3.

In this scenario, activities A1 and A2 are in the same pipeline. Activity A1 runs when the external data is available and the scheduled availability frequency is reached. Activity A2 runs when the scheduled slices from D2 become available and the scheduled availability frequency is reached. If there is an error in one of the slices in dataset D2, A2 does not run for that slice until it becomes available. The diagram view with both activities in the same pipeline would look like the following diagram. As mentioned earlier, the activities could be in different pipelines; in that scenario, the diagram view would look like the following diagram. See the copy sequentially section in the appendix for an example.
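The chaining behavior described above falls out of a simple dependency rule: an activity run starts only when every input slice it needs is Ready. A minimal Python sketch (illustrative only; the dataset and window names are hypothetical):

```python
def can_run(activity, ready):
    """An activity run for a window starts only when all of its input
    dataset slices for that window are Ready."""
    return all((ds, activity["window"]) in ready for ds in activity["inputs"])

ready = {("D1", "8-9AM"), ("D2", "8-9AM")}              # slices already produced
a2 = {"inputs": ["D2"], "outputs": ["D3"], "window": "8-9AM"}
runnable = can_run(a2, ready)
# A2's 8-9 AM run can start because A1 already produced D2 for that window.
```

If the D2 slice for a window had failed, `can_run` would stay false for A2's run on that window, which mirrors how Data Factory holds back downstream slices.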
Model datasets with different frequencies

In the samples so far, the frequencies for the input and output datasets and the activity schedule window were the same. Some scenarios require the ability to produce output at a frequency different from the frequencies of one or more inputs. Data Factory supports modeling these scenarios.

Sample 1: Produce a daily output report for input data that is available every hour

Consider a scenario in which you have input measurement data from sensors available every hour in Azure Blob storage. You want to produce a daily aggregate report with statistics such as mean, maximum, and minimum for the day by using the Data Factory hive activity. Here is how you can model this scenario with Data Factory:

Input dataset

The hourly input files are dropped in the folder for the given day. Availability for the input is set at Hour (frequency: Hour, interval: 1).

```json
{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
      ],
      "format": { "type": "TextFormat" }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

Output dataset

One output file is created every day in the day's folder. Availability of the output is set at Day (frequency: Day, interval: 1).
{ "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/", "partitionedBy": [ { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}}, { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}}, { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}} ], "format": { "type": "TextFormat" } }, "availability": { "frequency": "Day", "interval": 1 } } } Activity: hive activity in a pipeline The hive script receives the appropriate DateTime information as parameters that use the WindowStart variable as shown in the following snippet. The hive script uses this variable to load the data from the correct folder for the day and run the aggregation to generate the output. { "name":"SamplePipeline", "properties":{ "start":"2015-01-01T08:00:00", "end":"2015-01-01T11:00:00", "description":"hive activity", "activities": [ { "name": "SampleHiveActivity", "inputs": [ { "name": "AzureBlobInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "linkedServiceName": "HDInsightLinkedService", "type": "HDInsightHive", "typeProperties": { "scriptPath": "adftutorial\\hivequery.hql", "scriptLinkedService": "StorageLinkedService", "defines": { "Year": "$$Text.Format('{0:yyyy}',WindowStart)", "Month": "$$Text.Format('{0:MM}',WindowStart)", "Day": "$$Text.Format('{0:dd}',WindowStart)" } }, "scheduler": { "frequency": "Day", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 2, "timeout": "01:00:00" } } ] } } The following diagram shows the scenario from a data-dependency point of view. The output slice for every day depends on 24 hourly slices from an input dataset. Data Factory computes these dependencies automatically by figuring out the input data slices that fall in the same time period as the output slice to be produced. 
If any of the 24 input slices is not available, Data Factory waits for the input slice to be ready before starting the daily activity run.

Sample 2: Specify dependency with expressions and Data Factory functions

Let's consider another scenario. Suppose you have a hive activity that processes two input datasets: one gets new data daily, and the other gets new data weekly. Suppose you want to do a join across the two inputs and produce an output every day. The simple approach, in which Data Factory automatically figures out the right input slices to process by aligning to the output data slice's time period, does not work here. You must specify that for every activity run, Data Factory should use the last week's data slice for the weekly input dataset. You use Azure Data Factory functions as shown in the following snippet to implement this behavior.

Input1: Azure blob

The first input is the Azure blob that is updated daily.

```json
{
  "name": "AzureBlobInputDaily",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
      ],
      "format": { "type": "TextFormat" }
    },
    "external": true,
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```

Input2: Azure blob

Input2 is the Azure blob that is updated weekly.
{ "name": "AzureBlobInputWeekly", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/", "partitionedBy": [ { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}}, { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}}, { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}} ], "format": { "type": "TextFormat" } }, "external": true, "availability": { "frequency": "Day", "interval": 7 } } } Output: Azure blob One output file is created every day in the folder for the day. Availability of output is set to day (frequency: Day, interval: 1). { "name": "AzureBlobOutputDaily", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/", "partitionedBy": [ { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}}, { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}}, { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}} ], "format": { "type": "TextFormat" } }, "availability": { "frequency": "Day", "interval": 1 } } } Activity: hive activity in a pipeline The hive activity takes the two inputs and produces an output slice every day. You can specify every day’s output slice to depend on the previous week’s input slice for weekly input as follows. 
{ "name":"SamplePipeline", "properties":{ "start":"2015-01-01T08:00:00", "end":"2015-01-01T11:00:00", "description":"hive activity", "activities": [ { "name": "SampleHiveActivity", "inputs": [ { "name": "AzureBlobInputDaily" }, { "name": "AzureBlobInputWeekly", "startTime": "Date.AddDays(SliceStart, - Date.DayOfWeek(SliceStart))", "endTime": "Date.AddDays(SliceEnd, -Date.DayOfWeek(SliceEnd))" } ], "outputs": [ { "name": "AzureBlobOutputDaily" } ], "linkedServiceName": "HDInsightLinkedService", "type": "HDInsightHive", "typeProperties": { "scriptPath": "adftutorial\\hivequery.hql", "scriptLinkedService": "StorageLinkedService", "defines": { "Year": "$$Text.Format('{0:yyyy}',WindowStart)", "Month": "$$Text.Format('{0:MM}',WindowStart)", "Day": "$$Text.Format('{0:dd}',WindowStart)" } }, "scheduler": { "frequency": "Day", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 2, "timeout": "01:00:00" } } ] } } See Data Factory functions and system variables for a list of functions and system variables that Data Factory supports. Appendix Example: copy sequentially It is possible to run multiple copy operations one after another in a sequential/ordered manner. For example, you might have two copy activities in a pipeline (CopyActivity1 and CopyActivity2) with the following input data output datasets: CopyActivity1 Input: Dataset. Output: Dataset2. CopyActivity2 Input: Dataset2. Output: Dataset3. CopyActivity2 would run only if the CopyActivity1 has run successfully and Dataset2 is available. 
Here is the sample pipeline JSON:

```json
{
  "name": "ChainActivities",
  "properties": {
    "description": "Run activities in sequence",
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": {
            "type": "BlobSink",
            "copyBehavior": "PreserveHierarchy",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          }
        },
        "inputs": [ { "name": "Dataset1" } ],
        "outputs": [ { "name": "Dataset2" } ],
        "policy": { "timeout": "01:00:00" },
        "scheduler": { "frequency": "Hour", "interval": 1 },
        "name": "CopyFromBlob1ToBlob2",
        "description": "Copy data from a blob to another"
      },
      {
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": {
            "type": "BlobSink",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          }
        },
        "inputs": [ { "name": "Dataset2" } ],
        "outputs": [ { "name": "Dataset3" } ],
        "policy": { "timeout": "01:00:00" },
        "scheduler": { "frequency": "Hour", "interval": 1 },
        "name": "CopyFromBlob2ToBlob3",
        "description": "Copy data from a blob to another"
      }
    ],
    "start": "2016-08-25T01:00:00Z",
    "end": "2016-08-25T01:00:00Z",
    "isPaused": false
  }
}
```

Notice that in the example, the output dataset of the first copy activity (Dataset2) is specified as the input for the second activity. Therefore, the second activity runs only when the output dataset from the first activity is ready.

In the example, CopyActivity2 can have a different input, such as Dataset3, but you specify Dataset2 as an input to CopyActivity2, so the activity does not run until CopyActivity1 finishes. For example:

CopyActivity1. Input: Dataset1. Output: Dataset2.
CopyActivity2. Inputs: Dataset3, Dataset2. Output: Dataset4.
{ "name": "ChainActivities", "properties": { "description": "Run activities in sequence", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "BlobSink", "copyBehavior": "PreserveHierarchy", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" } }, "inputs": [ { "name": "Dataset1" } ], "outputs": [ { "name": "Dataset2" } ], "policy": { "timeout": "01:00:00" }, "scheduler": { "frequency": "Hour", "interval": 1 }, }, "name": "CopyFromBlobToBlob", "description": "Copy data from a blob to another" }, { "type": "Copy", "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "BlobSink", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" } }, "inputs": [ { "name": "Dataset3" }, { "name": "Dataset2" } ], "outputs": [ { "name": "Dataset4" } ], "policy": { "timeout": "01:00:00" }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "CopyFromBlob3ToBlob4", "description": "Copy data from a blob to another" } ], "start": "2017-04-25T01:00:00Z", "end": "2017-04-25T01:00:00Z", "isPaused": false } } Notice that in the example, two input datasets are specified for the second copy activity. When multiple inputs are specified, only the first input dataset is used for copying data, but other datasets are used as dependencies. CopyActivity2 would start only after the following conditions are met: CopyActivity1 has successfully completed and Dataset2 is available. This dataset is not used when copying data to Dataset4. It only acts as a scheduling dependency for CopyActivity2. Dataset3 is available. This dataset represents the data that is copied to the destination. Tutorial: Copy data from Blob Storage to SQL Database using Data Factory 4/28/2017 • 4 min to read • Edit Online In this tutorial, you create a data factory with a pipeline to copy data from Blob storage to SQL database. The Copy Activity performs the data movement in Azure Data Factory. 
It is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. See the Data Movement Activities article for details about the Copy Activity.

NOTE: For a detailed overview of the Data Factory service, see the Introduction to Azure Data Factory article.

Prerequisites for the tutorial

Before you begin this tutorial, you must have the following prerequisites:

Azure subscription. If you don't have a subscription, you can create a free trial account in just a couple of minutes. See the Free Trial article for details.
Azure storage account. You use blob storage as a source data store in this tutorial. If you don't have an Azure storage account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use an Azure SQL database as a destination data store in this tutorial. If you don't have an Azure SQL database that you can use in the tutorial, see How to create and configure an Azure SQL Database to create one.
SQL Server 2012/2014 or Visual Studio 2013. You use SQL Server Management Studio or Visual Studio to create a sample database and to view the resulting data in the database.

Collect blob storage account name and key

You need the account name and account key of your Azure storage account to do this tutorial.

1. Log in to the Azure portal.
2. Click More services on the left menu and select Storage Accounts.
3. In the Storage Accounts blade, select the Azure storage account that you want to use in this tutorial.
4. Select the Access keys link under SETTINGS.
5. Click the copy button next to the Storage account name text box and save/paste the name somewhere (for example, in a text file).
6. Repeat the previous step to copy or note down key1.
7. Close all the blades by clicking X.

Collect SQL server, database, and user names

You need the names of the Azure SQL server, database, and user to do this tutorial.
Note down the names of the server, database, and user for your Azure SQL database.

1. In the Azure portal, click More services on the left and select SQL databases.
2. In the SQL databases blade, select the database that you want to use in this tutorial. Note down the database name.
3. In the SQL database blade, click Properties under SETTINGS.
4. Note down the values for SERVER NAME and SERVER ADMIN LOGIN.
5. Close all the blades by clicking X.

Allow Azure services to access SQL server

Ensure that the Allow access to Azure services setting is turned ON for your Azure SQL server so that the Data Factory service can access it. To verify and turn on this setting, do the following steps:

1. Click the More services hub on the left and click SQL servers.
2. Select your server, and click Firewall under SETTINGS.
3. In the Firewall settings blade, click ON for Allow access to Azure services.
4. Close all the blades by clicking X.

Prepare Blob Storage and SQL Database

Now, prepare your Azure blob storage and Azure SQL database for the tutorial by performing the following steps:

1. Launch Notepad. Copy the following text and save it as emp.txt in the C:\ADFGetStarted folder on your hard drive.

   John, Doe
   Jane, Doe

2. Use tools such as Azure Storage Explorer to create the adftutorial container and to upload the emp.txt file to the container.

3. Use the following SQL script to create the emp table in your Azure SQL database.

```sql
CREATE TABLE dbo.emp
(
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);
```

If you have SQL Server 2012/2014 installed on your computer, follow the instructions from Managing Azure SQL Database using SQL Server Management Studio to connect to your Azure SQL server and run the SQL script. That article uses the classic Azure portal, not the new Azure portal, to configure the firewall for an Azure SQL server.
If your client is not allowed to access the Azure SQL server, you need to configure the firewall for your Azure SQL server to allow access from your machine (IP address). See this article for steps to configure the firewall for your Azure SQL server.

Create a data factory

You have completed the prerequisites. You can create a data factory using one of the following ways. Click one of the options in the drop-down list at the top, or one of the following links, to perform the tutorial:

Copy Wizard
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
.NET API

NOTE: The data pipeline in this tutorial copies data from a source data store to a destination data store. It does not transform input data to produce output data. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build your first pipeline to transform data using Hadoop cluster. You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.

Tutorial: Create a pipeline with Copy Activity using Data Factory Copy Wizard

5/16/2017 • 6 min to read

This tutorial shows you how to use the Copy Wizard to copy data from Azure blob storage to an Azure SQL database. The Azure Data Factory Copy Wizard allows you to quickly create a data pipeline that copies data from a supported source data store to a supported destination data store. Therefore, we recommend that you use the wizard as a first step to create a sample pipeline for your data movement scenario. For a list of data stores supported as sources and as destinations, see supported data stores. This tutorial shows you how to create an Azure data factory, launch the Copy Wizard, and go through a series of steps to provide details about your data ingestion/movement scenario.
When you finish the steps in the wizard, it automatically creates a pipeline with a Copy Activity to copy data from Azure blob storage to an Azure SQL database. For more information about Copy Activity, see data movement activities.

Prerequisites

Complete the prerequisites listed in the Tutorial Overview article before performing this tutorial.

Create data factory

In this step, you use the Azure portal to create an Azure data factory named ADFTutorialDataFactory.

1. Log in to the Azure portal.
2. Click + NEW from the top-left corner, click Data + analytics, and click Data Factory.
3. In the New data factory blade:
   a. Enter ADFTutorialDataFactory for the name. The name of the Azure data factory must be globally unique. If you receive the error "Data factory name 'ADFTutorialDataFactory' is not available", change the name of the data factory (for example, yournameADFTutorialDataFactoryYYYYMMDD) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
   b. Select your Azure subscription.
   c. For Resource Group, do one of the following steps:
      - Select Use existing to select an existing resource group.
      - Select Create new to enter a name for a resource group. Some of the steps in this tutorial assume that you use the name ADFTutorialResourceGroup for the resource group. To learn about resource groups, see Using resource groups to manage your Azure resources.
   d. Select a location for the data factory.
   e. Select the Pin to dashboard check box at the bottom of the blade.
   f. Click Create.
4. After the creation is complete, you see the Data Factory blade as shown in the following image:

Launch Copy Wizard

1. On the Data Factory blade, click Copy data [PREVIEW] to launch the Copy Wizard.
NOTE If you see that the web browser is stuck at "Authorizing...", disable/uncheck Block third-party cookies and site data setting in the browser settings (or) keep it enabled and create an exception for login.microsoftonline.com and then try launching the wizard again. 2. In the Properties page: a. Enter CopyFromBlobToAzureSql for Task name b. Enter description (optional). c. Change the Start date time and the End date time so that the end date is set to today and start date to five days earlier. d. Click Next. 3. On the Source data store page, click Azure Blob Storage tile. You use this page to specify the source data store for the copy task. 4. On the Specify the Azure Blob storage account page: a. b. c. d. Enter AzureStorageLinkedService for Linked service name. Confirm that From Azure subscriptions option is selected for Account selection method. Select your Azure subscription. Select an Azure storage account from the list of Azure storage accounts available in the selected subscription. You can also choose to enter storage account settings manually by selecting Enter manually option for the Account selection method, and then click Next. 5. On Choose the input file or folder page: a. Double-click adftutorial (folder). b. Select emp.txt, and click Choose 6. On the Choose the input file or folder page, click Next. Do not select Binary copy. 7. On the File format settings page, you see the delimiters and the schema that is auto-detected by the wizard by parsing the file. You can also enter the delimiters manually for the copy wizard to stop autodetecting or to override. Click Next after you review the delimiters and preview data. 8. On the Destination data store page, select Azure SQL Database, and click Next. 9. On Specify the Azure SQL database page: a. Enter AzureSqlLinkedService for the Connection name field. b. Confirm that From Azure subscriptions option is selected for Server / database selection c. d. e. f. method. Select your Azure subscription. 
   d. Select the Server name and Database.
   e. Enter the User name and Password.
   f. Click Next.
10. On the Table mapping page, select emp for the Destination field from the drop-down list, and click the down arrow (optional) to see the schema and to preview the data.
11. On the Schema mapping page, click Next.
12. On the Performance settings page, click Next.
13. Review the information on the Summary page, and click Finish. The wizard creates two linked services, two datasets (input and output), and one pipeline in the data factory (from where you launched the Copy Wizard).

Launch Monitor and Manage application

1. On the Deployment page, click the link: Click here to monitor copy pipeline.
2. The monitoring application is launched in a separate tab in your web browser.
3. To see the latest status of the daily slices, click the Refresh button in the ACTIVITY WINDOWS list at the bottom. You see five activity windows for the five days between the start and end times of the pipeline. The list is not automatically refreshed, so you may need to click Refresh a couple of times before you see all the activity windows in the Ready state.
4. Select an activity window in the list to see its details in the Activity Window Explorer on the right. Notice that the dates 11, 12, 13, 14, and 15 are in green, which means that the daily output slices for these dates have already been produced. You also see this color coding on the pipeline and the output dataset in the diagram view. In the previous step, notice that two slices have already been produced, one slice is currently being processed, and the other two are waiting to be processed (based on the color coding).

For more information on using this application, see the Monitor and manage pipeline using Monitoring App article.

Next steps

In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination data store in a copy operation.
The following table provides a list of data stores supported as sources and destinations by the copy activity:

| CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| Azure | Azure Cosmos DB (DocumentDB API) | ✓ | ✓ |
| Azure | Azure Data Lake Store | ✓ | ✓ |
| Azure | Azure SQL Database | ✓ | ✓ |
| Azure | Azure SQL Data Warehouse | ✓ | ✓ |
| Azure | Azure Search Index | | ✓ |
| Azure | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| Databases | DB2* | ✓ | |
| Databases | MySQL* | ✓ | |
| Databases | Oracle* | ✓ | ✓ |
| Databases | PostgreSQL* | ✓ | |
| Databases | SAP Business Warehouse* | ✓ | |
| Databases | SAP HANA* | ✓ | |
| Databases | SQL Server* | ✓ | ✓ |
| Databases | Sybase* | ✓ | |
| Databases | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| NoSQL | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| File | File System* | ✓ | ✓ |
| File | FTP | ✓ | |
| File | HDFS* | ✓ | |
| File | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| Others | Generic OData | ✓ | |
| Others | Generic ODBC* | ✓ | ✓ |
| Others | Salesforce | ✓ | |
| Others | Web Table (table from HTML) | ✓ | |
| Others | GE Historian* | ✓ | |

For details about the fields/properties that you see in the copy wizard for a data store, click the link for the data store in the table.

Tutorial: Use Azure portal to create a Data Factory pipeline to copy data

6/13/2017 • 18 min to read

In this article, you learn how to use the Azure portal to create a data factory with a pipeline that copies data from an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the Introduction to Azure Data Factory article before doing this tutorial. In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported source data store to a supported sink data store. For a list of data stores supported as sources and sinks, see supported data stores. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data Movement Activities. A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity.
For more information, see multiple activities in a pipeline. NOTE The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster. Prerequisites Complete prerequisites listed in the tutorial prerequisites article before performing this tutorial. Steps Here are the steps you perform as part of this tutorial: 1. Create an Azure data factory. In this step, you create a data factory named ADFTutorialDataFactory. 2. Create linked services in the data factory. In this step, you create two linked services of types: Azure Storage and Azure SQL Database. The AzureStorageLinkedService links your Azure storage account to the data factory. You created a container and uploaded data to this storage account as part of prerequisites. AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created a SQL table in this database as part of prerequisites. 3. Create input and output datasets in the data factory. The Azure storage linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset specifies the container and the folder that contains the input data. Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset specifies the table in the database to which the data from the blob storage is copied. 4. Create a pipeline in the data factory. In this step, you create a pipeline with a copy activity. The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL database. 
You can use a copy activity in a pipeline to copy data from any supported source to any supported destination. For a list of supported data stores, see the data movement activities article. 5. Monitor the pipeline. In this step, you monitor the slices of the input and output datasets by using the Azure portal.

Create data factory

IMPORTANT Complete the prerequisites for the tutorial if you haven't already done so.

A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store, and an HDInsight Hive activity to run a Hive script to transform input data to produce output data. Let's start by creating the data factory in this step.

1. After logging in to the Azure portal, click New on the left menu, click Data + Analytics, and click Data Factory.
2. In the New data factory blade:
   a. Enter ADFTutorialDataFactory for the name. The name of the Azure data factory must be globally unique. If you receive the error Data factory name "ADFTutorialDataFactory" is not available, change the name of the data factory (for example, yournameADFTutorialDataFactory) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
   b. Select the Azure subscription in which you want to create the data factory.
   c. For the Resource Group, do one of the following steps: Select Use existing, and select an existing resource group from the drop-down list. Select Create new, and enter the name of a resource group. Some of the steps in this tutorial assume that you use the name ADFTutorialResourceGroup for the resource group. To learn about resource groups, see Using resource groups to manage your Azure resources.
   d. Select the location for the data factory. Only regions supported by the Data Factory service are shown in the drop-down list.
   e. Select Pin to dashboard.
   f. Click Create.
IMPORTANT To create Data Factory instances, you must be a member of the Data Factory Contributor role at the subscription/resource group level. The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.

3. On the dashboard, you see the following tile with the status: Deploying data factory.
4. After the creation is complete, you see the Data Factory blade as shown in the image.

Create linked services

You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of types AzureStorage and AzureSqlDatabase. The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of the prerequisites. AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of the prerequisites.

Create Azure Storage linked service

In this step, you link your Azure storage account to your data factory. You specify the name and key of your Azure storage account in this section.

1. In the Data Factory blade, click the Author and deploy tile.
2. You see the Data Factory Editor as shown in the following image:
3. In the editor, click the New data store button on the toolbar and select Azure storage from the drop-down menu. You should see the JSON template for creating an Azure storage linked service in the right pane.
4. Replace <accountname> and <accountkey> with the account name and account key values for your Azure storage account.
5. Click Deploy on the toolbar.
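For reference, the JSON you deploy in this step typically looks as follows. This is a sketch based on the Data Factory (version 1) AzureStorage linked-service format; the <accountname> and <accountkey> placeholders are the same ones the editor template asks you to replace:

```json
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}
```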
You should see the deployed AzureStorageLinkedService in the tree view now. For more information about JSON properties in the linked service definition, see the Azure Blob Storage connector article.

Create a linked service for the Azure SQL Database

In this step, you link your Azure SQL database to your data factory. You specify the Azure SQL server name, database name, user name, and user password in this section.

1. In the Data Factory Editor, click the New data store button on the toolbar and select Azure SQL Database from the drop-down menu. You should see the JSON template for creating the Azure SQL linked service in the right pane.
2. Replace <servername> , <databasename> , <username>@<servername> , and <password> with the names of your Azure SQL server, database, user account, and password.
3. Click Deploy on the toolbar to create and deploy the AzureSqlLinkedService.
4. Confirm that you see AzureSqlLinkedService in the tree view under Linked services.

For more information about these JSON properties, see the Azure SQL Database connector.

Create datasets

In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that represent the input and output data stored in the data stores referred to by AzureStorageLinkedService and AzureSqlLinkedService respectively. The Azure storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and the folder that contains the input data. Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies the table in the database to which the data from the blob storage is copied.
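The AzureSqlLinkedService you deployed above follows a similar pattern. A sketch, again assuming the version 1 AzureSqlDatabase linked-service format, with the template placeholders left in:

```json
{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}
```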
Create input dataset

In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage account represented by the AzureStorageLinkedService linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder is copied to the destination. In this tutorial, you specify a value for the fileName.

1. In the Editor for the Data Factory, click ... More, click New dataset, and click Azure Blob storage from the drop-down menu.
2. Replace the JSON in the right pane with the following JSON snippet:

```json
{
  "name": "InputDataset",
  "properties": {
    "structure": [
      { "name": "FirstName", "type": "String" },
      { "name": "LastName", "type": "String" }
    ],
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "adftutorial/",
      "fileName": "emp.txt",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

The following table provides descriptions for the JSON properties used in the snippet:

| PROPERTY | DESCRIPTION |
| --- | --- |
| type | The type property is set to AzureBlob because the data resides in Azure blob storage. |
| linkedServiceName | Refers to the AzureStorageLinkedService that you created earlier. |
| folderPath | Specifies the blob container and the folder that contains the input blobs. In this tutorial, adftutorial is the blob container and the folder is the root folder. |
| fileName | This property is optional. If you omit it, all files from the folderPath are picked. In this tutorial, emp.txt is specified for the fileName, so only that file is picked up for processing. |
| format -> type | The input file is in text format, so we use TextFormat. |
| columnDelimiter | The columns in the input file are delimited by the comma character ( , ). |
| PROPERTY | DESCRIPTION |
| --- | --- |
| frequency/interval | The frequency is set to Hour and the interval is set to 1, which means that the input slices are available hourly. In other words, the Data Factory service looks for input data every hour in the root folder of the blob container (adftutorial) you specified. It looks for the data within the pipeline start and end times, not before or after these times. |
| external | This property is set to true if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true. |

For more information about these JSON properties, see the Azure Blob connector article.

3. Click Deploy on the toolbar to create and deploy the InputDataset dataset. Confirm that you see the InputDataset in the tree view.

Create output dataset

The Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. The output SQL table dataset (OutputDataset) you create in this step specifies the table in the database to which the data from the blob storage is copied.

1. In the Editor for the Data Factory, click ... More, click New dataset, and click Azure SQL from the drop-down menu.
2. Replace the JSON in the right pane with the following JSON snippet:

```json
{
  "name": "OutputDataset",
  "properties": {
    "structure": [
      { "name": "FirstName", "type": "String" },
      { "name": "LastName", "type": "String" }
    ],
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "emp"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

The following table provides descriptions for the JSON properties used in the snippet:

| PROPERTY | DESCRIPTION |
| --- | --- |
| type | The type property is set to AzureSqlTable because the data is copied to a table in an Azure SQL database. |
| linkedServiceName | Refers to the AzureSqlLinkedService that you created earlier. |
| PROPERTY | DESCRIPTION |
| --- | --- |
| tableName | Specifies the table to which the data is copied. |
| frequency/interval | The frequency is set to Hour and the interval is 1, which means that the output slices are produced hourly between the pipeline start and end times, not before or after these times. |

There are three columns – ID, FirstName, and LastName – in the emp table in the database. ID is an identity column, so you need to specify only FirstName and LastName here. For more information about these JSON properties, see the Azure SQL connector article.

3. Click Deploy on the toolbar to create and deploy the OutputDataset dataset. Confirm that you see the OutputDataset in the tree view under Datasets.

Create pipeline

In this step, you create a pipeline with a copy activity that uses InputDataset as an input and OutputDataset as an output. Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, that is, 24 hours. Therefore, 24 slices of the output dataset are produced by the pipeline.

1. In the Editor for the Data Factory, click ... More, and click New pipeline. Alternatively, you can right-click Pipelines in the tree view and click New pipeline.
2.
Replace the JSON in the right pane with the following JSON snippet:

```json
{
  "name": "ADFTutorialPipeline",
  "properties": {
    "description": "Copy data from a blob to Azure SQL table",
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "inputs": [ { "name": "InputDataset" } ],
        "outputs": [ { "name": "OutputDataset" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": {
            "type": "SqlSink",
            "writeBatchSize": 10000,
            "writeBatchTimeout": "60:00:00"
          }
        },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
    ],
    "start": "2017-05-11T00:00:00Z",
    "end": "2017-05-12T00:00:00Z"
  }
}
```

Note the following points:

- In the activities section, there is only one activity, whose type is set to Copy. For more information about the copy activity, see data movement activities. In Data Factory solutions, you can also use data transformation activities.
- Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
- In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see supported data stores. To learn how to use a specific supported data store as a source/sink, click the link in the table.
- Both the start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The end time is optional, but we use it in this tutorial. If you do not specify a value for the end property, it is calculated as "start + 48 hours". To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property.
- In the preceding example, there are 24 data slices, as each data slice is produced hourly.

For descriptions of JSON properties in a pipeline definition, see the create pipelines article. For descriptions of JSON properties in a copy activity definition, see data movement activities.
For descriptions of JSON properties supported by BlobSource, see the Azure Blob connector article. For descriptions of JSON properties supported by SqlSink, see the Azure SQL Database connector article.

3. Click Deploy on the toolbar to create and deploy the ADFTutorialPipeline. Confirm that you see the pipeline in the tree view.
4. Now, close the Editor blade by clicking X. Click X again to see the Data Factory home page for the ADFTutorialDataFactory.

Congratulations! You have successfully created an Azure data factory with a pipeline to copy data from an Azure blob storage to an Azure SQL database.

Monitor pipeline

In this step, you use the Azure portal to monitor what's going on in an Azure data factory.

Monitor pipeline using Monitor & Manage App

The following steps show you how to monitor pipelines in your data factory by using the Monitor & Manage application:

1. Click the Monitor & Manage tile on the home page of your data factory.
2. You should see the Monitor & Manage application in a separate tab.

NOTE If the web browser is stuck at "Authorizing...", do one of the following: clear the Block third-party cookies and site data check box (or) create an exception for login.microsoftonline.com, and then try to open the app again.

3. Change the Start time and End time to include the start (2017-05-11) and end (2017-05-12) times of your pipeline, and click Apply.
4. You see the activity windows associated with each hour between the pipeline start and end times in the list in the middle pane.
5. To see details about an activity window, select it in the Activity Windows list. In the Activity Window Explorer on the right, you see that the slices up to the current UTC time (8:12 PM) are all processed (in green). The 8 - 9 PM, 9 - 10 PM, 10 - 11 PM, and 11 PM - 12 AM slices are not processed yet. The Attempts section in the right pane provides information about the activity run for the data slice. If there was an error, it provides details about the error.
For example, if the input folder or container does not exist and the slice processing fails, you see an error message stating that the container or folder does not exist.

6. Launch SQL Server Management Studio, connect to the Azure SQL database, and verify that the rows are inserted into the emp table in the database.

For detailed information about using this application, see Monitor and manage Azure Data Factory pipelines using Monitoring and Management App.

Monitor pipeline using Diagram View

You can also monitor data pipelines by using the diagram view.

1. In the Data Factory blade, click Diagram.
2. You should see a diagram similar to the following image:
3. In the diagram view, double-click InputDataset to see slices for the dataset.
4. Click the See more link to see all the data slices. You see 24 hourly slices between the pipeline start and end times. Notice that all the data slices up to the current UTC time are Ready because the emp.txt file exists all the time in the blob container: adftutorial\input. The slices for future times are not in the Ready state yet. Confirm that no slices show up in the Recently failed slices section at the bottom.
5. Close the blades until you see the diagram view (or scroll left to see the diagram view). Then, double-click OutputDataset.
6. Click the See more link on the Table blade for OutputDataset to see all the slices.
7. Notice that all the slices up to the current UTC time move from the Pending execution state => In progress state => Ready state. The slices from the past (before the current time) are processed from latest to oldest by default. For example, if the current time is 8:12 PM UTC, the slice for 7 - 8 PM is processed ahead of the 6 - 7 PM slice. The 8 - 9 PM slice is processed at the end of its time interval by default, that is, after 9 PM.
8. Click any data slice from the list and you should see the Data slice blade. A piece of data associated with an activity window is called a slice.
A slice can be one file or multiple files. If the slice is not in the Ready state, you can see the upstream slices that are not Ready and are blocking the current slice from executing in the Upstream slices that are not ready list.

9. In the DATA SLICE blade, you should see all the activity runs in the list at the bottom. Click an activity run to see the Activity run details blade. In this blade, you see how long the copy operation took, what the throughput was, how many bytes of data were read and written, the run start time, the run end time, and so on.
10. Click X to close all the blades until you get back to the home blade for the ADFTutorialDataFactory.
11. (Optional) Click the Datasets tile or the Pipelines tile to get to the blades you have seen in the preceding steps.
12. Launch SQL Server Management Studio, connect to the Azure SQL database, and verify that the rows are inserted into the emp table in the database.

Summary

In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database. You used the Azure portal to create the data factory, linked services, datasets, and a pipeline. Here are the high-level steps you performed in this tutorial:

1. Created an Azure data factory.
2. Created linked services:
   a. An Azure Storage linked service to link your Azure Storage account that holds the input data.
   b. An Azure SQL linked service to link your Azure SQL database that holds the output data.
3. Created datasets that describe the input data and output data for pipelines.
4. Created a pipeline with a Copy Activity, with BlobSource as the source and SqlSink as the sink.

Next steps

In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination data store in a copy operation.
The following table provides a list of data stores supported as sources and destinations by the copy activity:

| CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| Azure | Azure Cosmos DB (DocumentDB API) | ✓ | ✓ |
| Azure | Azure Data Lake Store | ✓ | ✓ |
| Azure | Azure SQL Database | ✓ | ✓ |
| Azure | Azure SQL Data Warehouse | ✓ | ✓ |
| Azure | Azure Search Index | | ✓ |
| Azure | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| Databases | DB2* | ✓ | |
| Databases | MySQL* | ✓ | |
| Databases | Oracle* | ✓ | ✓ |
| Databases | PostgreSQL* | ✓ | |
| Databases | SAP Business Warehouse* | ✓ | |
| Databases | SAP HANA* | ✓ | |
| Databases | SQL Server* | ✓ | ✓ |
| Databases | Sybase* | ✓ | |
| Databases | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| NoSQL | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| File | File System* | ✓ | ✓ |
| File | FTP | ✓ | |
| File | HDFS* | ✓ | |
| File | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| Others | Generic OData | ✓ | |
| Others | Generic ODBC* | ✓ | ✓ |
| Others | Salesforce | ✓ | |
| Others | Web Table (table from HTML) | ✓ | |
| Others | GE Historian* | ✓ | |

To learn about how to copy data to/from a data store, click the link for the data store in the table.

Tutorial: Create a pipeline with Copy Activity using Visual Studio

5/22/2017 • 19 min to read

In this article, you learn how to use Microsoft Visual Studio to create a data factory with a pipeline that copies data from an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the Introduction to Azure Data Factory article before doing this tutorial. In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported source data store to a supported sink data store. For a list of data stores supported as sources and sinks, see supported data stores. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data Movement Activities. A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity.
For more information, see multiple activities in a pipeline.

NOTE The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites

1. Read through the Tutorial Overview article and complete the prerequisite steps.
2. To create Data Factory instances, you must be a member of the Data Factory Contributor role at the subscription/resource group level.
3. You must have the following installed on your computer:
   - Visual Studio 2013 or Visual Studio 2015
   - The Azure SDK for Visual Studio 2013 or Visual Studio 2015. Navigate to the Azure Download Page and click VS 2013 or VS 2015 in the .NET section.
   - The latest Azure Data Factory plugin for Visual Studio: VS 2013 or VS 2015. You can also update the plugin by doing the following steps: On the menu, click Tools -> Extensions and Updates -> Online -> Visual Studio Gallery -> Microsoft Azure Data Factory Tools for Visual Studio -> Update.

Steps

Here are the steps you perform as part of this tutorial:

1. Create linked services in the data factory. In this step, you create two linked services of types: Azure Storage and Azure SQL Database. The AzureStorageLinkedService links your Azure storage account to the data factory. You created a container and uploaded data to this storage account as part of the prerequisites. AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created a SQL table in this database as part of the prerequisites.
2. Create input and output datasets in the data factory. The Azure storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset specifies the container and the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset specifies the table in the database to which the data from the blob storage is copied.

3. Create a pipeline in the data factory. In this step, you create a pipeline with a copy activity. The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL database. You can use a copy activity in a pipeline to copy data from any supported source to any supported destination. For a list of supported data stores, see the data movement activities article.
4. Create an Azure data factory when deploying Data Factory entities (linked services, datasets/tables, and pipelines).

Create Visual Studio project

1. Launch Visual Studio 2015. Click File, point to New, and click Project. You should see the New Project dialog box.
2. In the New Project dialog, select the DataFactory template, and click Empty Data Factory Project.
3. Specify the name of the project, the location for the solution, and the name of the solution, and then click OK.

Create linked services

You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). Therefore, you create two linked services of types AzureStorage and AzureSqlDatabase. The Azure Storage linked service links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of the prerequisites. The Azure SQL linked service links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of the prerequisites.
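As a reminder, the emp table created in the prerequisites has an identity ID column plus FirstName and LastName columns. A T-SQL sketch of its shape is below; the varchar(50) lengths are an assumption here, so see the prerequisites article for the exact script:

```sql
CREATE TABLE dbo.emp
(
    ID int IDENTITY(1,1) NOT NULL, -- identity column; not mapped in the output dataset
    FirstName varchar(50),         -- filled from the first column of emp.txt
    LastName varchar(50)           -- filled from the second column of emp.txt
);
```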
Linked services link data stores or compute services to an Azure data factory. See supported data stores for all the sources and sinks supported by the Copy Activity. See compute linked services for the list of compute services supported by Data Factory. In this tutorial, you do not use any compute service.

Create the Azure Storage linked service

1. In Solution Explorer, right-click Linked Services, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure Storage Linked Service from the list, and click Add.
3. Replace <accountname> and <accountkey> with the name of your Azure storage account and its key.
4. Save the AzureStorageLinkedService1.json file.

For more information about JSON properties in the linked service definition, see the Azure Blob Storage connector article.

Create the Azure SQL linked service

1. Right-click the Linked Services node in Solution Explorer again, point to Add, and click New Item.
2. This time, select Azure SQL Linked Service, and click Add.
3. In the AzureSqlLinkedService1.json file, replace <servername> , <databasename> , <username@servername> , and <password> with the names of your Azure SQL server, database, user account, and password.
4. Save the AzureSqlLinkedService1.json file.

For more information about these JSON properties, see the Azure SQL Database connector.

Create datasets

In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that represent the input and output data stored in the data stores referred to by AzureStorageLinkedService1 and AzureSqlLinkedService1 respectively. The Azure storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies the table in the database to which the data from the blob storage is copied.

Create input dataset
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService1 linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the destination. In this tutorial, you specify a value for the fileName.

Here, you use the term "tables" rather than "datasets". A table is a rectangular dataset and is the only type of dataset supported right now.

1. Right-click Tables in Solution Explorer, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure Blob, and click Add.
3. Replace the JSON text with the following text and save the AzureBlobLocation1.json file.

{
    "name": "InputDataset",
    "properties": {
        "structure": [
            { "name": "FirstName", "type": "String" },
            { "name": "LastName", "type": "String" }
        ],
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService1",
        "typeProperties": {
            "fileName": "emp.txt",
            "folderPath": "adftutorial/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

The following table provides descriptions for the JSON properties used in the snippet:

| PROPERTY | DESCRIPTION |
| --- | --- |
| type | The type property is set to AzureBlob because data resides in an Azure blob storage. |
| linkedServiceName | Refers to the AzureStorageLinkedService that you created earlier. |
| folderPath | Specifies the blob container and the folder that contains input blobs. In this tutorial, adftutorial is the blob container and folder is the root folder. |
| PROPERTY | DESCRIPTION |
| --- | --- |
| fileName | This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, emp.txt is specified for the fileName, so only that file is picked up for processing. |
| format -> type | The input file is in the text format, so we use TextFormat. |
| columnDelimiter | The columns in the input file are delimited by the comma character (,). |
| frequency/interval | The frequency is set to Hour and interval is set to 1, which means that the input slices are available hourly. In other words, the Data Factory service looks for input data every hour in the root folder of the blob container (adftutorial) you specified. It looks for the data within the pipeline start and end times, not before or after these times. |
| external | This property is set to true if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true. |

For more information about these JSON properties, see the Azure Blob connector article.

Create output dataset
In this step, you create an output dataset named OutputDataset. This dataset points to a SQL table in the Azure SQL database represented by AzureSqlLinkedService1.

1. Right-click Tables in Solution Explorer again, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure SQL, and click Add.
3. Replace the JSON text with the following JSON and save the AzureSqlTableLocation1.json file.
{
    "name": "OutputDataset",
    "properties": {
        "structure": [
            { "name": "FirstName", "type": "String" },
            { "name": "LastName", "type": "String" }
        ],
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService1",
        "typeProperties": {
            "tableName": "emp"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

The following table provides descriptions for the JSON properties used in the snippet:

| PROPERTY | DESCRIPTION |
| --- | --- |
| type | The type property is set to AzureSqlTable because data is copied to a table in an Azure SQL database. |
| linkedServiceName | Refers to the AzureSqlLinkedService that you created earlier. |
| tableName | Specifies the table to which the data is copied. |
| frequency/interval | The frequency is set to Hour and interval is 1, which means that the output slices are produced hourly between the pipeline start and end times, not before or after these times. |

There are three columns – ID, FirstName, and LastName – in the emp table in the database. ID is an identity column, so you need to specify only FirstName and LastName here. For more information about these JSON properties, see the Azure SQL connector article.

Create pipeline
In this step, you create a pipeline with a copy activity that uses InputDataset as an input and OutputDataset as an output. Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours. Therefore, 24 slices of the output dataset are produced by the pipeline.

1. Right-click Pipelines in Solution Explorer, point to Add, and click New Item.
2. Select Copy Data Pipeline in the Add New Item dialog box and click Add.
3. Replace the JSON with the following JSON and save the CopyActivity1.json file.
{
    "name": "ADFTutorialPipeline",
    "properties": {
        "description": "Copy data from a blob to Azure SQL table",
        "activities": [
            {
                "name": "CopyFromBlobToSQL",
                "type": "Copy",
                "inputs": [ { "name": "InputDataset" } ],
                "outputs": [ { "name": "OutputDataset" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60:00:00"
                    }
                },
                "Policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ],
        "start": "2017-05-11T00:00:00Z",
        "end": "2017-05-12T00:00:00Z",
        "isPaused": false
    }
}

In the activities section, there is only one activity whose type is set to Copy. For more information about the copy activity, see data movement activities. In Data Factory solutions, you can also use data transformation activities.

Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.

In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see supported data stores. To learn how to use a specific supported data store as a source/sink, click the link in the table.

Replace the value of the start property with the current day and the end value with the next day. You can specify only the date part and skip the time part of the date time. For example, "2016-02-03", which is equivalent to "2016-02-03T00:00:00Z".

Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The end time is optional, but we use it in this tutorial. If you do not specify a value for the end property, it is calculated as "start + 48 hours". To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property.

In the preceding example, there are 24 data slices, as each data slice is produced hourly.
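The slicing arithmetic above (one slice per hour between start and end) can be sketched in Python. This is an illustration of the scheduling model only, not Data Factory code:

```python
from datetime import datetime, timedelta

def hourly_slices(start, end):
    """Enumerate the (slice_start, slice_end) pairs produced between start and end."""
    step = timedelta(hours=1)  # frequency: Hour, interval: 1
    slices = []
    t = start
    while t + step <= end:
        slices.append((t, t + step))
        t += step
    return slices

# A start and end one day apart yield 24 hourly slices, as in the tutorial.
slices = hourly_slices(datetime(2017, 5, 11), datetime(2017, 5, 12))
print(len(slices))  # 24
```

With the default end of "start + 48 hours", the same function would yield 48 slices.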
For descriptions of JSON properties in a pipeline definition, see the create pipelines article. For descriptions of JSON properties in a copy activity definition, see data movement activities. For descriptions of JSON properties supported by BlobSource, see the Azure Blob connector article. For descriptions of JSON properties supported by SqlSink, see the Azure SQL Database connector article.

Publish/deploy Data Factory entities
In this step, you publish the Data Factory entities (linked services, datasets, and pipeline) you created earlier. You also specify the name of the new data factory to be created to hold these entities.

1. Right-click the project in Solution Explorer, and click Publish.
2. If you see the Sign in to your Microsoft account dialog box, enter your credentials for the account that has an Azure subscription, and click Sign in.
3. You should see the following dialog box:
4. In the Configure data factory page, do the following steps:
   a. Select the Create New Data Factory option.
   b. Enter VSTutorialFactory for Name.
   IMPORTANT: The name of the Azure data factory must be globally unique. If you receive an error about the name of the data factory when publishing, change the name of the data factory (for example, yournameVSTutorialFactory) and try publishing again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
   c. Select your Azure subscription for the Subscription field.
   IMPORTANT: If you do not see any subscription, ensure that you logged in using an account that is an admin or co-admin of the subscription.
   d. Select the resource group for the data factory to be created.
   e. Select the region for the data factory. Only regions supported by the Data Factory service are shown in the drop-down list.
   f. Click Next to switch to the Publish Items page.
5. In the Publish Items page, ensure that all the Data Factory entities are selected, and click Next to switch to the Summary page.
6.
Review the summary and click Next to start the deployment process and view the Deployment Status.
7. In the Deployment Status page, you should see the status of the deployment process. Click Finish after the deployment is done.

Note the following points:

If you receive the error "This subscription is not registered to use namespace Microsoft.DataFactory", do one of the following and try publishing again:

In Azure PowerShell, run the following command to register the Data Factory provider:

Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

You can run the following command to confirm that the Data Factory provider is registered:

Get-AzureRmResourceProvider

Log in using the Azure subscription into the Azure portal and navigate to a Data Factory blade (or) create a data factory in the Azure portal. This action automatically registers the provider for you.

The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.

IMPORTANT: To create Data Factory instances, you need to be an admin/co-admin of the Azure subscription.

Monitor pipeline
Navigate to the home page for your data factory:
1. Log in to the Azure portal.
2. Click More services on the left menu, and click Data factories.
3. Start typing the name of your data factory.
4. Click your data factory in the results list to see the home page for your data factory.
5. Follow instructions from Monitor datasets and pipeline to monitor the pipeline and datasets you have created in this tutorial. Currently, Visual Studio does not support monitoring Data Factory pipelines.

Summary
In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database. You used Visual Studio to create the data factory, linked services, datasets, and a pipeline. Here are the high-level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a.
An Azure Storage linked service to link your Azure Storage account that holds input data.
b. An Azure SQL linked service to link your Azure SQL database that holds the output data.
3. Created datasets, which describe input data and output data for pipelines.
4. Created a pipeline with a Copy Activity, with BlobSource as source and SqlSink as sink.

To see how to use an HDInsight Hive Activity to transform data by using an Azure HDInsight cluster, see Tutorial: Build your first pipeline to transform data using Hadoop cluster. You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.

View all data factories in Server Explorer
This section describes how to use Server Explorer in Visual Studio to view all the data factories in your Azure subscription and create a Visual Studio project based on an existing data factory.
1. In Visual Studio, click View on the menu, and click Server Explorer.
2. In the Server Explorer window, expand Azure and expand Data Factory. If you see Sign in to Visual Studio, enter the account associated with your Azure subscription and click Continue. Enter the password, and click Sign in. Visual Studio tries to get information about all Azure data factories in your subscription. You see the status of this operation in the Data Factory Task List window.

Create a Visual Studio project for an existing data factory
Right-click a data factory in Server Explorer, and select Export Data Factory to New Project to create a Visual Studio project based on an existing data factory.

Update Data Factory tools for Visual Studio
To update Azure Data Factory tools for Visual Studio, do the following steps:
1. Click Tools on the menu and select Extensions and Updates.
2. Select Updates in the left pane and then select Visual Studio Gallery.
3.
Select Azure Data Factory tools for Visual Studio and click Update. If you do not see this entry, you already have the latest version of the tools.

Use configuration files
You can use configuration files in Visual Studio to configure properties for linked services/tables/pipelines differently for each environment. Consider the following JSON definition for an Azure Storage linked service. Suppose you want to specify connectionString with different values for accountname and accountkey based on the environment (Dev/Test/Production) to which you are deploying Data Factory entities. You can achieve this behavior by using a separate configuration file for each environment.

{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "description": "",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

Add a configuration file
Add a configuration file for each environment by performing the following steps:
1. Right-click the Data Factory project in your Visual Studio solution, point to Add, and click New item.
2. Select Config from the list of installed templates on the left, select Configuration File, enter a name for the configuration file, and click Add.
3.
Add configuration parameters and their values in the following format:

{
    "$schema": "http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json",
    "AzureStorageLinkedService1": [
        {
            "name": "$.properties.typeProperties.connectionString",
            "value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    ],
    "AzureSqlLinkedService1": [
        {
            "name": "$.properties.typeProperties.connectionString",
            "value": "Server=tcp:spsqlserver.database.windows.net,1433;Database=spsqldb;User ID=spelluru;Password=Sowmya123;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    ]
}

This example configures the connectionString property of an Azure Storage linked service and an Azure SQL linked service. Notice that the syntax for specifying name is JsonPath.

If the JSON has a property that has an array of values, as shown in the following code:

"structure": [
    { "name": "FirstName", "type": "String" },
    { "name": "LastName", "type": "String" }
],

configure the properties as shown in the following configuration file (use zero-based indexing):

{ "name": "$.properties.structure[0].name", "value": "FirstName" }
{ "name": "$.properties.structure[0].type", "value": "String" }
{ "name": "$.properties.structure[1].name", "value": "LastName" }
{ "name": "$.properties.structure[1].type", "value": "String" }

Property names with spaces
If a property name has spaces in it, use square brackets as shown in the following example (Database server name):

{ "name": "$.properties.activities[1].typeProperties.webServiceParameters.['Database server name']", "value": "MyAsqlServer.database.windows.net" }

Deploy solution using a configuration
When you publish Azure Data Factory entities in VS, you can specify the configuration that you want to use for that publishing operation. To publish entities in an Azure Data Factory project using a configuration file:
1. Right-click the Data Factory project and click Publish to see the Publish Items dialog box.
2.
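The substitution the tooling performs can be approximated in Python. This is a simplified sketch that handles only dotted paths and numeric indexes, not full JsonPath, and the helper name is invented for illustration:

```python
import re

def apply_override(entity, path, value):
    """Apply one config override such as '$.properties.structure[0].name'
    to an entity dict. Simplified: dotted segments and [n] indexes only."""
    parts = []
    for seg in path.lstrip("$.").split("."):
        m = re.match(r"(\w+)\[(\d+)\]$", seg)
        if m:
            # 'structure[0]' becomes the key 'structure' plus index 0
            parts.extend([m.group(1), int(m.group(2))])
        else:
            parts.append(seg)
    target = entity
    for key in parts[:-1]:
        target = target[key]
    target[parts[-1]] = value
    return entity

dataset = {"properties": {"structure": [{"name": "x", "type": "String"}]}}
apply_override(dataset, "$.properties.structure[0].name", "FirstName")
print(dataset["properties"]["structure"][0]["name"])  # FirstName
```

When you publish with a configuration, this kind of substitution happens over each entity's JSON before deployment.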
Select an existing data factory or specify values for creating a data factory on the Configure data factory page, and click Next.
3. On the Publish Items page, you see a drop-down list with available configurations for the Select Deployment Config field.
4. Select the configuration file that you would like to use and click Next.
5. Confirm that you see the name of the JSON file in the Summary page and click Next.
6. Click Finish after the deployment operation is finished.

When you deploy, the values from the configuration file are used to set values for properties in the JSON files before the entities are deployed to the Azure Data Factory service.

Use Azure Key Vault
It is not advisable, and often against security policy, to commit sensitive data such as connection strings to the code repository. See the ADF Secure Publish sample on GitHub to learn about storing sensitive information in Azure Key Vault and using it while publishing Data Factory entities. The Secure Publish extension for Visual Studio allows the secrets to be stored in Key Vault; only references to them are specified in linked services/deployment configurations. These references are resolved when you publish Data Factory entities to Azure. These files can then be committed to the source repository without exposing any secrets.

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination data store in a copy operation.
The following table provides a list of data stores supported as sources and destinations by the copy activity:

| CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| | Azure Cosmos DB (DocumentDB API) | ✓ | ✓ |
| | Azure Data Lake Store | ✓ | ✓ |
| | Azure SQL Database | ✓ | ✓ |
| | Azure SQL Data Warehouse | ✓ | ✓ |
| | Azure Search Index | | ✓ |
| | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| | DB2* | ✓ | |
| | MySQL* | ✓ | |
| | Oracle* | ✓ | ✓ |
| | PostgreSQL* | ✓ | |
| | SAP Business Warehouse* | ✓ | |
| | SAP HANA* | ✓ | |
| | SQL Server* | ✓ | ✓ |
| | Sybase* | ✓ | |
| | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| | File System* | ✓ | ✓ |
| | FTP | ✓ | |
| | HDFS* | ✓ | |
| | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| | Generic OData | ✓ | |
| | Generic ODBC* | ✓ | |
| | Salesforce | ✓ | |
| | Web Table (table from HTML) | ✓ | |
| | GE Historian* | ✓ | |

To learn about how to copy data to/from a data store, click the link for the data store in the table.

Tutorial: Create a Data Factory pipeline that moves data by using Azure PowerShell
6/13/2017 • 17 min to read • Edit Online

In this article, you learn how to use PowerShell to create a data factory with a pipeline that copies data from an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the Introduction to Azure Data Factory article before doing this tutorial.

In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see supported data stores. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data Movement Activities.

A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity.
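The chaining just described is expressed purely through dataset wiring in the pipeline JSON. As a hypothetical fragment (activity and dataset names here are invented for illustration):

```json
{
  "activities": [
    {
      "name": "CopyFromBlobToStaging",
      "type": "Copy",
      "inputs": [ { "name": "RawBlobDataset" } ],
      "outputs": [ { "name": "StagingDataset" } ]
    },
    {
      "name": "CopyFromStagingToSql",
      "type": "Copy",
      "inputs": [ { "name": "StagingDataset" } ],
      "outputs": [ { "name": "FinalSqlDataset" } ]
    }
  ]
}
```

Because StagingDataset is produced by the first activity and consumed by the second, the Data Factory service runs the second activity for a slice only after the first activity has produced that slice.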
For more information, see multiple activities in a pipeline.

NOTE: This article does not cover all the Data Factory cmdlets. See Data Factory Cmdlet Reference for comprehensive documentation on these cmdlets.

The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites
Complete the prerequisites listed in the tutorial prerequisites article.
Install Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.

Steps
Here are the steps you perform as part of this tutorial:
1. Create an Azure data factory. In this step, you create a data factory named ADFTutorialDataFactoryPSH.
2. Create linked services in the data factory. In this step, you create two linked services of types Azure Storage and Azure SQL Database. The AzureStorageLinkedService links your Azure storage account to the data factory. You created a container and uploaded data to this storage account as part of prerequisites. AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created a SQL table in this database as part of prerequisites.
3. Create input and output datasets in the data factory. The Azure Storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset specifies the container and the folder that contains the input data. Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset specifies the table in the database to which the data from the blob storage is copied.
4. Create a pipeline in the data factory.
In this step, you create a pipeline with a copy activity. The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL database. You can use a copy activity in a pipeline to copy data from any supported source to any supported destination. For a list of supported data stores, see the data movement activities article.
5. Monitor the pipeline. In this step, you monitor the slices of input and output datasets by using PowerShell.

Create a data factory
IMPORTANT: Complete the prerequisites for the tutorial if you haven't already done so.

A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store, and an HDInsight Hive activity to run a Hive script to transform input data to produce output data. Let's start with creating the data factory in this step.

1. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run the commands again.

Run the following command, and enter the user name and password that you use to sign in to the Azure portal:

Login-AzureRmAccount

Run the following command to view all the subscriptions for this account:

Get-AzureRmSubscription

Run the following command to select the subscription that you want to work with. Replace <NameOfAzureSubscription> with the name of your Azure subscription:

Get-AzureRmSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzureRmContext

2. Create an Azure resource group named ADFTutorialResourceGroup by running the following command:

New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"

Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup. If you use a different resource group, you need to use it in place of ADFTutorialResourceGroup in this tutorial.
3.
Run the New-AzureRmDataFactory cmdlet to create a data factory named ADFTutorialDataFactoryPSH:

$df = New-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name ADFTutorialDataFactoryPSH -Location "West US"

This name may already have been taken. Therefore, make the name of the data factory unique by adding a prefix or suffix (for example: ADFTutorialDataFactoryPSH05152017) and run the command again.

Note the following points:

The name of the Azure data factory must be globally unique. If you receive the following error, change the name (for example, yournameADFTutorialDataFactoryPSH). Use this name in place of ADFTutorialDataFactoryPSH while performing steps in this tutorial. See Data Factory - Naming Rules for Data Factory artifacts.

Data factory name "ADFTutorialDataFactoryPSH" is not available

To create Data Factory instances, you must be a contributor or administrator of the Azure subscription.

The name of the data factory may be registered as a DNS name in the future, and hence become publicly visible.

You may receive the following error: "This subscription is not registered to use namespace Microsoft.DataFactory." Do one of the following, and try publishing again:

In Azure PowerShell, run the following command to register the Data Factory provider:

Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

Run the following command to confirm that the Data Factory provider is registered:

Get-AzureRmResourceProvider

Sign in by using the Azure subscription to the Azure portal. Go to a Data Factory blade, or create a data factory in the Azure portal. This action automatically registers the provider for you.

Create linked services
You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination).
Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService, of types AzureStorage and AzureSqlDatabase.

The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of prerequisites.

AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of prerequisites.

Create a linked service for an Azure storage account
In this step, you link your Azure storage account to your data factory.

1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFGetStartedPSH folder with the following content. (Create the folder ADFGetStartedPSH if it does not already exist.)

IMPORTANT: Replace <accountname> and <accountkey> with the name and key of your Azure storage account before saving the file.

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

2. In Azure PowerShell, switch to the ADFGetStartedPSH folder.
3. Run the New-AzureRmDataFactoryLinkedService cmdlet to create the linked service AzureStorageLinkedService. This cmdlet, and the other Data Factory cmdlets you use in this tutorial, require you to pass values for the ResourceGroupName and DataFactoryName parameters. Alternatively, you can pass the DataFactory object returned by the New-AzureRmDataFactory cmdlet without typing ResourceGroupName and DataFactoryName each time you run a cmdlet.
New-AzureRmDataFactoryLinkedService $df -File .\AzureStorageLinkedService.json

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName   : ADFTutorialDataFactoryPSH0516
Properties        : Microsoft.Azure.Management.DataFactories.Models.LinkedServiceProperties
ProvisioningState : Succeeded

Another way of creating this linked service is to specify the resource group name and data factory name instead of specifying the DataFactory object:

New-AzureRmDataFactoryLinkedService -ResourceGroupName ADFTutorialResourceGroup -DataFactoryName <Name of your data factory> -File .\AzureStorageLinkedService.json

Create a linked service for an Azure SQL database
In this step, you link your Azure SQL database to your data factory.

1. Create a JSON file named AzureSqlLinkedService.json in the C:\ADFGetStartedPSH folder with the following content:

IMPORTANT: Replace <servername>, <databasename>, <username@servername>, and <password> with the names of your Azure SQL server, database, user account, and password.

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<databasename>;User ID=<user>@<server>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}

2. Run the following command to create a linked service:

New-AzureRmDataFactoryLinkedService $df -File .\AzureSqlLinkedService.json

Here is the sample output:

LinkedServiceName : AzureSqlLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName   : ADFTutorialDataFactoryPSH0516
Properties        : Microsoft.Azure.Management.DataFactories.Models.LinkedServiceProperties
ProvisioningState : Succeeded

Confirm that the Allow access to Azure services setting is turned on for your SQL database server. To verify and turn it on, do the following steps:
a. Log in to the Azure portal.
b. Click More services on the left, and click SQL servers in the DATABASES category.
c. Select your server in the list of SQL servers.
d. On the SQL server blade, click the Show firewall settings link.
e. In the Firewall settings blade, click ON for Allow access to Azure services.
f. Click Save on the toolbar.

Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that represent input and output data that is stored in the data stores referred to by AzureStorageLinkedService and AzureSqlLinkedService, respectively.

The Azure Storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and the folder that contains the input data. Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies the table in the database to which the data from the blob storage is copied.

Create an input dataset
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the destination. In this tutorial, you specify a value for the fileName.
1.
Create a JSON file named InputDataset.json in the C:\ADFGetStartedPSH folder, with the following content:

{
    "name": "InputDataset",
    "properties": {
        "structure": [
            { "name": "FirstName", "type": "String" },
            { "name": "LastName", "type": "String" }
        ],
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "emp.txt",
            "folderPath": "adftutorial/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

The following table provides descriptions for the JSON properties used in the snippet:

| PROPERTY | DESCRIPTION |
| --- | --- |
| type | The type property is set to AzureBlob because data resides in an Azure blob storage. |
| linkedServiceName | Refers to the AzureStorageLinkedService that you created earlier. |
| folderPath | Specifies the blob container and the folder that contains input blobs. In this tutorial, adftutorial is the blob container and folder is the root folder. |
| fileName | This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, emp.txt is specified for the fileName, so only that file is picked up for processing. |
| format -> type | The input file is in the text format, so we use TextFormat. |
| columnDelimiter | The columns in the input file are delimited by the comma character (,). |
| frequency/interval | The frequency is set to Hour and interval is set to 1, which means that the input slices are available hourly. In other words, the Data Factory service looks for input data every hour in the root folder of the blob container (adftutorial) you specified. It looks for the data within the pipeline start and end times, not before or after these times. |
| external | This property is set to true if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true. |
For more information about these JSON properties, see the Azure Blob connector article.

2. Run the following command to create the Data Factory dataset:

New-AzureRmDataFactoryDataset $df -File .\InputDataset.json

Here is the sample output:

DatasetName       : InputDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName   : ADFTutorialDataFactoryPSH0516
Availability      : Microsoft.Azure.Management.DataFactories.Common.Models.Availability
Location          : Microsoft.Azure.Management.DataFactories.Models.AzureBlobDataset
Policy            : Microsoft.Azure.Management.DataFactories.Common.Models.Policy
Structure         : {FirstName, LastName}
Properties        : Microsoft.Azure.Management.DataFactories.Models.DatasetProperties
ProvisioningState : Succeeded

Create an output dataset
In this part of the step, you create an output dataset named OutputDataset. This dataset points to a SQL table in the Azure SQL database represented by AzureSqlLinkedService.

1. Create a JSON file named OutputDataset.json in the C:\ADFGetStartedPSH folder with the following content:

{
  "name": "OutputDataset",
  "properties": {
    "structure": [
      { "name": "FirstName", "type": "String" },
      { "name": "LastName", "type": "String" }
    ],
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "emp"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY | DESCRIPTION
type | The type property is set to AzureSqlTable because the data is copied to a table in an Azure SQL database.
linkedServiceName | Refers to the AzureSqlLinkedService that you created earlier.
tableName | Specifies the table to which the data is copied.
frequency/interval | The frequency is set to Hour and the interval is 1, which means that the output slices are produced hourly between the pipeline start and end times, not before or after these times.

There are three columns in the emp table in the database: ID, FirstName, and LastName.
ID is an identity column, so you need to specify only FirstName and LastName here. For more information about these JSON properties, see the Azure SQL connector article.

2. Run the following command to create the data factory dataset:

New-AzureRmDataFactoryDataset $df -File .\OutputDataset.json

Here is the sample output:

DatasetName       : OutputDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName   : ADFTutorialDataFactoryPSH0516
Availability      : Microsoft.Azure.Management.DataFactories.Common.Models.Availability
Location          : Microsoft.Azure.Management.DataFactories.Models.AzureSqlTableDataset
Policy            :
Structure         : {FirstName, LastName}
Properties        : Microsoft.Azure.Management.DataFactories.Models.DatasetProperties
ProvisioningState : Succeeded

Create a pipeline
In this step, you create a pipeline with a copy activity that uses InputDataset as an input and OutputDataset as an output. Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, that is, 24 hours. Therefore, 24 slices of the output dataset are produced by the pipeline.

1. Create a JSON file named ADFTutorialPipeline.json in the C:\ADFGetStartedPSH folder, with the following content:

{
  "name": "ADFTutorialPipeline",
  "properties": {
    "description": "Copy data from a blob to Azure SQL table",
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "inputs": [ { "name": "InputDataset" } ],
        "outputs": [ { "name": "OutputDataset" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": {
            "type": "SqlSink",
            "writeBatchSize": 10000,
            "writeBatchTimeout": "60:00:00"
          }
        },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
    ],
    "start": "2017-05-11T00:00:00Z",
    "end": "2017-05-12T00:00:00Z"
  }
}

Note the following points:
- In the activities section, there is only one activity whose type is set to Copy.
  For more information about the copy activity, see data movement activities. In Data Factory solutions, you can also use data transformation activities.
- Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
- In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see supported data stores. To learn how to use a specific supported data store as a source/sink, click the link in the table.
- Replace the value of the start property with the current day and the end value with the next day. You can specify only the date part and skip the time part of the date time. For example, "2016-02-03" is equivalent to "2016-02-03T00:00:00Z". Both start and end datetimes must be in ISO format, for example: 2016-10-14T16:32:41Z. The end time is optional, but we use it in this tutorial. If you do not specify a value for the end property, it is calculated as "start + 48 hours". To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property.
- In the preceding example, there are 24 data slices, as each data slice is produced hourly.

For descriptions of JSON properties in a pipeline definition, see the create pipelines article. For descriptions of JSON properties in a copy activity definition, see data movement activities. For descriptions of JSON properties supported by BlobSource, see the Azure Blob connector article. For descriptions of JSON properties supported by SqlSink, see the Azure SQL Database connector article.

2. Run the following command to create the pipeline:
New-AzureRmDataFactoryPipeline $df -File .\ADFTutorialPipeline.json

Here is the sample output:

PipelineName      : ADFTutorialPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName   : ADFTutorialDataFactoryPSH0516
Properties        : Microsoft.Azure.Management.DataFactories.Models.PipelinePropertie
ProvisioningState : Succeeded

Congratulations! You have successfully created an Azure data factory with a pipeline to copy data from Azure blob storage to an Azure SQL database.

Monitor the pipeline
In this step, you use Azure PowerShell to monitor what’s going on in the Azure data factory.

1. Run Get-AzureRmDataFactory and assign the output to the variable $df. Replace <DataFactoryName> with the name of your data factory.

$df = Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name <DataFactoryName>

For example:

$df = Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name ADFTutorialDataFactoryPSH0516

Then, print the contents of $df to see the following output:

PS C:\ADFGetStartedPSH> $df

DataFactoryName   : ADFTutorialDataFactoryPSH0516
DataFactoryId     : 6f194b34-03b3-49ab-8f03-9f8a7b9d3e30
ResourceGroupName : ADFTutorialResourceGroup
Location          : West US
Tags              : {}
Properties        : Microsoft.Azure.Management.DataFactories.Models.DataFactoryProperties
ProvisioningState : Succeeded

2. Run Get-AzureRmDataFactorySlice to get details about all slices of OutputDataset, which is the output dataset of the pipeline:

Get-AzureRmDataFactorySlice $df -DatasetName OutputDataset -StartDateTime 2017-05-11T00:00:00Z

This setting should match the start value in the pipeline JSON. You should see 24 slices, one for each hour from 12 AM of the current day to 12 AM of the next day.
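The 24 slices follow directly from the arithmetic of the schedule: an hourly availability applied over a one-day start/end window. The following Python sketch illustrates that slicing model (it is an illustration of the scheduling arithmetic only, not a Data Factory API):

```python
from datetime import datetime, timedelta

def hourly_slices(start, end):
    """List the (sliceStart, sliceEnd) windows an hourly dataset
    produces between the pipeline start and end times."""
    slices = []
    cursor = start
    while cursor + timedelta(hours=1) <= end:
        slices.append((cursor, cursor + timedelta(hours=1)))
        cursor += timedelta(hours=1)
    return slices

start = datetime(2017, 5, 11)   # "start": "2017-05-11T00:00:00Z"
end = datetime(2017, 5, 12)     # "end":   "2017-05-12T00:00:00Z"

windows = hourly_slices(start, end)
print(len(windows))  # 24 slices, one per hour
```

With a 48-hour window and the same hourly availability, the same arithmetic would yield 48 slices, which is why omitting end (defaulted to start + 48 hours) doubles the slice count.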
Here are three sample slices from the output:

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName   : ADFTutorialDataFactoryPSH0516
DatasetName       : OutputDataset
Start             : 5/11/2017 11:00:00 PM
End               : 5/12/2017 12:00:00 AM
RetryCount        : 0
State             : Ready
SubState          :
LatencyStatus     :
LongRetryCount    : 0

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName   : ADFTutorialDataFactoryPSH0516
DatasetName       : OutputDataset
Start             : 5/11/2017 9:00:00 PM
End               : 5/11/2017 10:00:00 PM
RetryCount        : 0
State             : InProgress
SubState          :
LatencyStatus     :
LongRetryCount    : 0

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName   : ADFTutorialDataFactoryPSH0516
DatasetName       : OutputDataset
Start             : 5/11/2017 8:00:00 PM
End               : 5/11/2017 9:00:00 PM
RetryCount        : 0
State             : Waiting
SubState          : ConcurrencyLimit
LatencyStatus     :
LongRetryCount    : 0

3. Run Get-AzureRmDataFactoryRun to get the details of activity runs for a specific slice. Copy the date-time value from the output of the previous command to specify the value for the StartDateTime parameter:

Get-AzureRmDataFactoryRun $df -DatasetName OutputDataset -StartDateTime "5/11/2017 09:00:00 PM"

Here is the sample output:

Id                  : c0ddbd75-d0c7-4816-a775-704bbd7c7eab_636301332000000000_636301368000000000_OutputDataset
ResourceGroupName   : ADFTutorialResourceGroup
DataFactoryName     : ADFTutorialDataFactoryPSH0516
DatasetName         : OutputDataset
ProcessingStartTime : 5/16/2017 8:00:33 PM
ProcessingEndTime   : 5/16/2017 8:01:36 PM
PercentComplete     : 100
DataSliceStart      : 5/11/2017 9:00:00 PM
DataSliceEnd        : 5/11/2017 10:00:00 PM
Status              : Succeeded
Timestamp           : 5/16/2017 8:00:33 PM
RetryAttempt        : 0
Properties          : {}
ErrorMessage        :
ActivityName        : CopyFromBlobToSQL
PipelineName        : ADFTutorialPipeline
Type                : Copy

For comprehensive documentation on Data Factory cmdlets, see the Data Factory Cmdlet Reference.

Summary
In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database.
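When a pipeline produces many slices, it can be handier to tally them by state (Waiting, InProgress, Ready, and so on) than to read each record. A small Python sketch, using illustrative records shaped like the cmdlet output above (assumed sample data, not real cmdlet output):

```python
from collections import Counter

# Simplified slice records, shaped like the Get-AzureRmDataFactorySlice
# output above (illustrative data only).
slices = [
    {"Start": "5/11/2017 11:00:00 PM", "State": "Ready"},
    {"Start": "5/11/2017 9:00:00 PM", "State": "InProgress"},
    {"Start": "5/11/2017 8:00:00 PM", "State": "Waiting"},
]

def state_counts(slice_records):
    """Count slices by their State field."""
    return Counter(record["State"] for record in slice_records)

print(state_counts(slices))
```

In practice you could pipe the cmdlet output to a file (for example, as CSV) and feed the parsed rows to the same tally to spot slices stuck in Waiting.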
You used PowerShell to create the data factory, linked services, datasets, and a pipeline. Here are the high-level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
   a. An Azure Storage linked service to link your Azure storage account that holds input data.
   b. An Azure SQL linked service to link your SQL database that holds the output data.
3. Created datasets that describe input data and output data for pipelines.
4. Created a pipeline with Copy Activity, with BlobSource as the source and SqlSink as the sink.

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination data store in a copy operation. The following table provides a list of data stores supported as sources and destinations by the copy activity:

CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK
Azure | Azure Blob storage | ✓ | ✓
Azure | Azure Cosmos DB (DocumentDB API) | ✓ | ✓
Azure | Azure Data Lake Store | ✓ | ✓
Azure | Azure SQL Database | ✓ | ✓
Azure | Azure SQL Data Warehouse | ✓ | ✓
Azure | Azure Search Index | | ✓
Azure | Azure Table storage | ✓ | ✓
Databases | Amazon Redshift | ✓ |
Databases | DB2* | ✓ |
Databases | MySQL* | ✓ |
Databases | Oracle* | ✓ | ✓
Databases | PostgreSQL* | ✓ |
Databases | SAP Business Warehouse* | ✓ |
Databases | SAP HANA* | ✓ |
Databases | SQL Server* | ✓ | ✓
Databases | Sybase* | ✓ |
Databases | Teradata* | ✓ |
NoSQL | Cassandra* | ✓ |
NoSQL | MongoDB* | ✓ |
File | Amazon S3 | ✓ |
File | File System* | ✓ | ✓
File | FTP | ✓ |
File | HDFS* | ✓ |
File | SFTP | ✓ |
Others | Generic HTTP | ✓ |
Others | Generic OData | ✓ |
Others | Generic ODBC* | ✓ | ✓
Others | Salesforce | ✓ |
Others | Web Table (table from HTML) | ✓ |
Others | GE Historian* | ✓ |

To learn about how to copy data to/from a data store, click the link for the data store in the table.

Tutorial: Use Azure Resource Manager template to create a Data Factory pipeline to copy data
5/18/2017 • 13 min to read • Edit Online

This tutorial shows you how to use an Azure Resource Manager template to create an Azure data factory. The data pipeline in this tutorial copies data from a source data store to a destination data store.
It does not transform input data to produce output data. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see supported data stores. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data Movement Activities.

A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see multiple activities in a pipeline.

NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites
- Go through Tutorial Overview and Prerequisites and complete the prerequisite steps.
- Follow the instructions in the How to install and configure Azure PowerShell article to install the latest version of Azure PowerShell on your computer. In this tutorial, you use PowerShell to deploy Data Factory entities.
- (Optional) See Authoring Azure Resource Manager Templates to learn about Azure Resource Manager templates.

In this tutorial
In this tutorial, you create a data factory with the following Data Factory entities:

ENTITY | DESCRIPTION
Azure Storage linked service | Links your Azure Storage account to the data factory. Azure Storage is the source data store and Azure SQL database is the sink data store for the copy activity in the tutorial.
It specifies the storage account that contains the input data for the copy activity.
Azure SQL Database linked service | Links your Azure SQL database to the data factory. It specifies the Azure SQL database that holds the output data for the copy activity.
Azure Blob input dataset | Refers to the Azure Storage linked service. The linked service refers to an Azure Storage account, and the Azure Blob dataset specifies the container, folder, and file name in the storage that holds the input data.
Azure SQL output dataset | Refers to the Azure SQL linked service. The Azure SQL linked service refers to an Azure SQL server, and the Azure SQL dataset specifies the name of the table that holds the output data.
Data pipeline | The pipeline has one activity of type Copy that takes the Azure blob dataset as an input and the Azure SQL dataset as an output. The copy activity copies data from an Azure blob to a table in the Azure SQL database.

A data factory can have one or more pipelines. A pipeline can have one or more activities in it. There are two types of activities: data movement activities and data transformation activities. In this tutorial, you create a pipeline with one activity (copy activity).

The following section provides the complete Resource Manager template for defining Data Factory entities so that you can quickly run through the tutorial and test the template. To understand how each Data Factory entity is defined, see the Data Factory entities in the template section.

Data Factory JSON template
The top-level Resource Manager template for defining a data factory is:

{
  "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": { ... },
  "variables": { ... },
  "resources": [
    {
      "name": "[parameters('dataFactoryName')]",
      "apiVersion": "[variables('apiVersion')]",
      "type": "Microsoft.DataFactory/datafactories",
      "location": "westus",
      "resources": [
        { ... },
        { ... },
        { ... },
        {
          ...
        }
      ]
    }
  ]
}

Create a JSON file named ADFCopyTutorialARM.json in the C:\ADFGetStarted folder with the following content:

{
  "contentVersion": "1.0.0.0",
  "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "parameters": {
    "storageAccountName": { "type": "string", "metadata": { "description": "Name of the Azure storage account that contains the data to be copied." } },
    "storageAccountKey": { "type": "securestring", "metadata": { "description": "Key for the Azure storage account." } },
    "sourceBlobContainer": { "type": "string", "metadata": { "description": "Name of the blob container in the Azure Storage account." } },
    "sourceBlobName": { "type": "string", "metadata": { "description": "Name of the blob in the container that has the data to be copied to Azure SQL Database table" } },
    "sqlServerName": { "type": "string", "metadata": { "description": "Name of the Azure SQL Server that will hold the output/copied data." } },
    "databaseName": { "type": "string", "metadata": { "description": "Name of the Azure SQL Database in the Azure SQL server." } },
    "sqlServerUserName": { "type": "string", "metadata": { "description": "Name of the user that has access to the Azure SQL server." } },
    "sqlServerPassword": { "type": "securestring", "metadata": { "description": "Password for the user." } },
    "targetSQLTable": { "type": "string", "metadata": { "description": "Table in the Azure SQL Database that will hold the copied data." } }
  },
  "variables": {
    "dataFactoryName": "[concat('AzureBlobToAzureSQLDatabaseDF', uniqueString(resourceGroup().id))]",
    "azureSqlLinkedServiceName": "AzureSqlLinkedService",
    "azureStorageLinkedServiceName": "AzureStorageLinkedService",
    "blobInputDatasetName": "BlobInputDataset",
    "sqlOutputDatasetName": "SQLOutputDataset",
    "pipelineName": "Blob2SQLPipeline"
  },
  "resources": [
    {
      "name": "[variables('dataFactoryName')]",
      "apiVersion": "2015-10-01",
      "type": "Microsoft.DataFactory/datafactories",
      "location": "West US",
      "resources": [
        {
          "type": "linkedservices",
          "name": "[variables('azureStorageLinkedServiceName')]",
          "dependsOn": [ "[variables('dataFactoryName')]" ],
          "apiVersion": "2015-10-01",
          "properties": {
            "type": "AzureStorage",
            "description": "Azure Storage linked service",
            "typeProperties": {
              "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]"
            }
          }
        },
        {
          "type": "linkedservices",
          "name": "[variables('azureSqlLinkedServiceName')]",
          "dependsOn": [ "[variables('dataFactoryName')]" ],
          "apiVersion": "2015-10-01",
          "properties": {
            "type": "AzureSqlDatabase",
            "description": "Azure SQL linked service",
            "typeProperties": {
              "connectionString": "[concat('Server=tcp:',parameters('sqlServerName'),'.database.windows.net,1433;Database=', parameters('databaseName'), ';User ID=',parameters('sqlServerUserName'),';Password=',parameters('sqlServerPassword'),';Trusted_Connection=False;Encrypt=True;Connection Timeout=30')]"
            }
          }
        },
        {
          "type": "datasets",
          "name": "[variables('blobInputDatasetName')]",
          "dependsOn": [
            "[variables('dataFactoryName')]",
            "[variables('azureStorageLinkedServiceName')]"
          ],
          "apiVersion": "2015-10-01",
          "properties": {
            "type": "AzureBlob",
            "linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
            "structure": [
              { "name": "Column0", "type": "String" },
              { "name": "Column1", "type": "String" }
            ],
            "typeProperties": {
              "folderPath": "[concat(parameters('sourceBlobContainer'), '/')]",
              "fileName": "[parameters('sourceBlobName')]",
              "format": { "type": "TextFormat", "columnDelimiter": "," }
            },
            "availability": { "frequency": "Hour", "interval": 1 },
            "external": true
          }
        },
        {
          "type": "datasets",
          "name": "[variables('sqlOutputDatasetName')]",
          "dependsOn": [
            "[variables('dataFactoryName')]",
            "[variables('azureSqlLinkedServiceName')]"
          ],
          "apiVersion": "2015-10-01",
          "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": "[variables('azureSqlLinkedServiceName')]",
            "structure": [
              { "name": "FirstName", "type": "String" },
              { "name": "LastName", "type": "String" }
            ],
            "typeProperties": { "tableName": "[parameters('targetSQLTable')]" },
            "availability": { "frequency": "Hour", "interval": 1 }
          }
        },
        {
          "type": "datapipelines",
          "name": "[variables('pipelineName')]",
          "dependsOn": [
            "[variables('dataFactoryName')]",
            "[variables('azureStorageLinkedServiceName')]",
            "[variables('azureSqlLinkedServiceName')]",
            "[variables('blobInputDatasetName')]",
            "[variables('sqlOutputDatasetName')]"
          ],
          "apiVersion": "2015-10-01",
          "properties": {
            "activities": [
              {
                "name": "CopyFromAzureBlobToAzureSQL",
                "description": "Copy data from Azure blob to Azure SQL",
                "type": "Copy",
                "inputs": [ { "name": "[variables('blobInputDatasetName')]" } ],
                "outputs": [ { "name": "[variables('sqlOutputDatasetName')]" } ],
                "typeProperties": {
                  "source": { "type": "BlobSource" },
                  "sink": {
                    "type": "SqlSink",
                    "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')"
                  },
                  "translator": {
                    "type": "TabularTranslator",
                    "columnMappings": "Column0:FirstName,Column1:LastName"
                  }
                },
                "policy": {
                  "concurrency": 1,
                  "executionPriorityOrder": "NewestFirst",
                  "retry": 3,
                  "timeout": "01:00:00"
                }
              }
            ],
            "start": "2017-05-11T00:00:00Z",
            "end": "2017-05-12T00:00:00Z"
          }
        }
      ]
    }
  ]
}

Parameters JSON
Create a JSON file named ADFCopyTutorialARM-Parameters.json that contains parameters for the Azure Resource Manager
template.

IMPORTANT
Specify the name and key of your Azure Storage account for the storageAccountName and storageAccountKey parameters. Specify the Azure SQL server, database, user, and password for the sqlServerName, databaseName, sqlServerUserName, and sqlServerPassword parameters.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageAccountName": { "value": "<Name of the Azure storage account>" },
    "storageAccountKey": { "value": "<Key for the Azure storage account>" },
    "sourceBlobContainer": { "value": "adftutorial" },
    "sourceBlobName": { "value": "emp.txt" },
    "sqlServerName": { "value": "<Name of the Azure SQL server>" },
    "databaseName": { "value": "<Name of the Azure SQL database>" },
    "sqlServerUserName": { "value": "<Name of the user who has access to the Azure SQL database>" },
    "sqlServerPassword": { "value": "<password for the user>" },
    "targetSQLTable": { "value": "emp" }
  }
}

IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you can use with the same Data Factory JSON template. By using a PowerShell script, you can automate deploying Data Factory entities in these environments.

Create data factory
1. Start Azure PowerShell and do the following steps:
   a. Run the following command and enter the user name and password that you use to sign in to the Azure portal:
      Login-AzureRmAccount
   b. Run the following command to view all the subscriptions for this account:
      Get-AzureRmSubscription
   c. Run the following command to select the subscription that you want to work with:
      Get-AzureRmSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzureRmContext
2. Run the following command to deploy Data Factory entities using the Resource Manager template you created in Step 1.
New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFCopyTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFCopyTutorialARM-Parameters.json

Monitor pipeline
1. Log in to the Azure portal using your Azure account.
2. Click Data factories on the left menu (or) click More services and click Data factories under the INTELLIGENCE + ANALYTICS category.
3. In the Data factories page, search for and find your data factory (AzureBlobToAzureSQLDatabaseDF).
4. Click your Azure data factory. You see the home page for the data factory.
5. Follow instructions from Monitor datasets and pipeline to monitor the pipeline and datasets you have created in this tutorial. Currently, Visual Studio does not support monitoring Data Factory pipelines.
6. When a slice is in the Ready state, verify that the data is copied to the emp table in the Azure SQL database.

For more information on how to use Azure portal blades to monitor the pipeline and datasets you have created in this tutorial, see Monitor datasets and pipeline. For more information on how to use the Monitor & Manage application to monitor your data pipelines, see Monitor and manage Azure Data Factory pipelines using Monitoring App.

Data Factory entities in the template
Define data factory
You define a data factory in the Resource Manager template as shown in the following sample:

"resources": [
  {
    "name": "[variables('dataFactoryName')]",
    "apiVersion": "2015-10-01",
    "type": "Microsoft.DataFactory/datafactories",
    "location": "West US"
  }

The dataFactoryName is defined as:

"dataFactoryName": "[concat('AzureBlobToAzureSQLDatabaseDF', uniqueString(resourceGroup().id))]"

It is a unique string based on the resource group ID.

Defining Data Factory entities
The following Data Factory entities are defined in the JSON template:
1. Azure Storage linked service
2. Azure SQL linked service
3. Azure blob dataset
4. Azure SQL dataset
5. Data pipeline with a copy activity

Azure Storage linked service
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a container and uploaded data to this storage account as part of the prerequisites. You specify the name and key of your Azure storage account in this section. See Azure Storage linked service for details about the JSON properties used to define an Azure Storage linked service.

{
  "type": "linkedservices",
  "name": "[variables('azureStorageLinkedServiceName')]",
  "dependsOn": [ "[variables('dataFactoryName')]" ],
  "apiVersion": "2015-10-01",
  "properties": {
    "type": "AzureStorage",
    "description": "Azure Storage linked service",
    "typeProperties": {
      "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]"
    }
  }
}

The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these parameters are passed by using a configuration file. The definition also uses the variables azureStorageLinkedServiceName and dataFactoryName defined in the template.

Azure SQL Database linked service
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of the prerequisites. You specify the Azure SQL server name, database name, user name, and user password in this section. See Azure SQL linked service for details about the JSON properties used to define an Azure SQL linked service.
{
  "type": "linkedservices",
  "name": "[variables('azureSqlLinkedServiceName')]",
  "dependsOn": [ "[variables('dataFactoryName')]" ],
  "apiVersion": "2015-10-01",
  "properties": {
    "type": "AzureSqlDatabase",
    "description": "Azure SQL linked service",
    "typeProperties": {
      "connectionString": "[concat('Server=tcp:',parameters('sqlServerName'),'.database.windows.net,1433;Database=', parameters('databaseName'), ';User ID=',parameters('sqlServerUserName'),';Password=',parameters('sqlServerPassword'),';Trusted_Connection=False;Encrypt=True;Connection Timeout=30')]"
    }
  }
}

The connectionString uses the sqlServerName, databaseName, sqlServerUserName, and sqlServerPassword parameters, whose values are passed by using a configuration file. The definition also uses the following variables from the template: azureSqlLinkedServiceName and dataFactoryName.

Azure blob dataset
The Azure Storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. In the Azure blob dataset definition, you specify the names of the blob container, folder, and file that contain the input data. See Azure Blob dataset properties for details about the JSON properties used to define an Azure Blob dataset.
{
  "type": "datasets",
  "name": "[variables('blobInputDatasetName')]",
  "dependsOn": [
    "[variables('dataFactoryName')]",
    "[variables('azureStorageLinkedServiceName')]"
  ],
  "apiVersion": "2015-10-01",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
    "structure": [
      { "name": "Column0", "type": "String" },
      { "name": "Column1", "type": "String" }
    ],
    "typeProperties": {
      "folderPath": "[concat(parameters('sourceBlobContainer'), '/')]",
      "fileName": "[parameters('sourceBlobName')]",
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "availability": { "frequency": "Hour", "interval": 1 },
    "external": true
  }
}

Azure SQL dataset
You specify the name of the table in the Azure SQL database that holds the data copied from the Azure Blob storage. See Azure SQL dataset properties for details about the JSON properties used to define an Azure SQL dataset.

{
  "type": "datasets",
  "name": "[variables('sqlOutputDatasetName')]",
  "dependsOn": [
    "[variables('dataFactoryName')]",
    "[variables('azureSqlLinkedServiceName')]"
  ],
  "apiVersion": "2015-10-01",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "[variables('azureSqlLinkedServiceName')]",
    "structure": [
      { "name": "FirstName", "type": "String" },
      { "name": "LastName", "type": "String" }
    ],
    "typeProperties": { "tableName": "[parameters('targetSQLTable')]" },
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}

Data pipeline
You define a pipeline that copies data from the Azure blob dataset to the Azure SQL dataset. See Pipeline JSON for descriptions of the JSON elements used to define a pipeline in this example.
{
  "type": "datapipelines",
  "name": "[variables('pipelineName')]",
  "dependsOn": [
    "[variables('dataFactoryName')]",
    "[variables('azureStorageLinkedServiceName')]",
    "[variables('azureSqlLinkedServiceName')]",
    "[variables('blobInputDatasetName')]",
    "[variables('sqlOutputDatasetName')]"
  ],
  "apiVersion": "2015-10-01",
  "properties": {
    "activities": [
      {
        "name": "CopyFromAzureBlobToAzureSQL",
        "description": "Copy data from Azure blob to Azure SQL",
        "type": "Copy",
        "inputs": [ { "name": "[variables('blobInputDatasetName')]" } ],
        "outputs": [ { "name": "[variables('sqlOutputDatasetName')]" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": {
            "type": "SqlSink",
            "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')"
          },
          "translator": {
            "type": "TabularTranslator",
            "columnMappings": "Column0:FirstName,Column1:LastName"
          }
        },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 3,
          "timeout": "01:00:00"
        }
      }
    ],
    "start": "2017-05-11T00:00:00Z",
    "end": "2017-05-12T00:00:00Z"
  }
}

Reuse the template
In the tutorial, you created a template for defining Data Factory entities and a template for passing values for parameters. The pipeline copies data from an Azure Storage account to an Azure SQL database specified via parameters. To use the same template to deploy Data Factory entities to different environments, you create a parameter file for each environment and use it when deploying to that environment.
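One way to maintain those per-environment parameter files is to generate them from a shared base plus per-environment overrides. A hypothetical Python sketch (the storage account names and the subset of parameters shown are assumptions for illustration only; the output file names mirror the per-environment files used in this tutorial):

```python
import json
from copy import deepcopy

# Parameter values shared across environments (illustrative subset;
# real keys and passwords would come from a secure store, not source code).
base = {
    "sourceBlobContainer": {"value": "adftutorial"},
    "sourceBlobName": {"value": "emp.txt"},
    "targetSQLTable": {"value": "emp"},
}

# Per-environment overrides (hypothetical account names).
environments = {
    "Dev": {"storageAccountName": {"value": "adfdevstorage"}},
    "Test": {"storageAccountName": {"value": "adfteststorage"}},
    "Production": {"storageAccountName": {"value": "adfprodstorage"}},
}

def parameter_file(env):
    """Build the deployment parameter document for one environment."""
    params = deepcopy(base)
    params.update(environments[env])
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": params,
    }

for env in environments:
    with open(f"ADFCopyTutorialARM-Parameters-{env}.json", "w") as f:
        json.dump(parameter_file(env), f, indent=2)
```

Generating the files keeps the shared values in one place, so a change such as a new blob name only has to be made once instead of in every environment's file.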
Example:

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Dev.json

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Test.json

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Production.json

Notice that the first command uses the parameter file for the development environment, the second one for the test environment, and the third one for the production environment. You can also reuse the template to perform repeated tasks. For example, you need to create many data factories with one or more pipelines that implement the same logic but each data factory uses different Storage and SQL Database accounts. In this scenario, you use the same template in the same environment (dev, test, or production) with different parameter files to create data factories.

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination data store in a copy operation.
The following table provides a list of data stores supported as sources and destinations by the copy activity:

CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK
Azure | Azure Blob storage | ✓ | ✓
Azure | Azure Cosmos DB (DocumentDB API) | ✓ | ✓
Azure | Azure Data Lake Store | ✓ | ✓
Azure | Azure SQL Database | ✓ | ✓
Azure | Azure SQL Data Warehouse | ✓ | ✓
Azure | Azure Search Index | | ✓
Azure | Azure Table storage | ✓ | ✓
Databases | Amazon Redshift | ✓ |
Databases | DB2* | ✓ |
Databases | MySQL* | ✓ |
Databases | Oracle* | ✓ | ✓
Databases | PostgreSQL* | ✓ |
Databases | SAP Business Warehouse* | ✓ |
Databases | SAP HANA* | ✓ |
Databases | SQL Server* | ✓ | ✓
Databases | Sybase* | ✓ |
Databases | Teradata* | ✓ |
NoSQL | Cassandra* | ✓ |
NoSQL | MongoDB* | ✓ |
File | Amazon S3 | ✓ |
File | File System* | ✓ | ✓
File | FTP | ✓ |
File | HDFS* | ✓ |
File | SFTP | ✓ |
Others | Generic HTTP | ✓ |
Others | Generic OData | ✓ |
Others | Generic ODBC* | ✓ | ✓
Others | Salesforce | ✓ |
Others | Web Table (table from HTML) | ✓ |
Others | GE Historian* | ✓ |

To learn about how to copy data to/from a data store, click the link for the data store in the table.

Tutorial: Use REST API to create an Azure Data Factory pipeline to copy data
6/13/2017 • 17 min to read • Edit Online

In this article, you learn how to use the REST API to create a data factory with a pipeline that copies data from Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the Introduction to Azure Data Factory article before doing this tutorial.

In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see supported data stores. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data Movement Activities.

A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity.
For more information, see multiple activities in a pipeline.

NOTE This article does not cover all of the Data Factory REST API. See the Data Factory REST API Reference for comprehensive documentation on Data Factory operations.

The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites

Go through Tutorial Overview and complete the prerequisite steps. Install Curl on your machine. You use the Curl tool with REST commands to create a data factory. Follow instructions from this article to: 1. Create a Web application named ADFCopyTutorialApp in Azure Active Directory. 2. Get the client ID and secret key. 3. Get the tenant ID. 4. Assign the ADFCopyTutorialApp application to the Data Factory Contributor role. Install Azure PowerShell. Launch PowerShell and do the following steps. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen it, you need to run the commands again. 1. Run the following command and enter the user name and password that you use to sign in to the Azure portal: Login-AzureRmAccount 2. Run the following command to view all the subscriptions for this account: Get-AzureRmSubscription 3. Run the following command to select the subscription that you want to work with. Replace <NameOfAzureSubscription> with the name of your Azure subscription. Get-AzureRmSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzureRmContext 4. Create an Azure resource group named ADFTutorialResourceGroup by running the following command in PowerShell: New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US" If the resource group already exists, you specify whether to update it (Y) or keep it as is (N). Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup.
If you use a different resource group, you need to use the name of your resource group in place of ADFTutorialResourceGroup in this tutorial.

Create JSON definitions

Create the following JSON files in the folder where curl.exe is located.

datafactory.json

IMPORTANT The name must be globally unique, so you may want to prefix/suffix ADFCopyTutorialDF to make it a unique name.

```json
{
  "name": "ADFCopyTutorialDF",
  "location": "WestUS"
}
```

azurestoragelinkedservice.json

IMPORTANT Replace accountname and accountkey with the name and key of your Azure storage account. To learn how to get your storage access key, see View, copy and regenerate storage access keys.

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

For details about JSON properties, see Azure Storage linked service.

azuresqllinkedservice.json

IMPORTANT Replace servername, databasename, username, and password with the name of your Azure SQL server, the name of the SQL database, the user account, and the password for the account.

```json
{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "description": "",
    "typeProperties": {
      "connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
    }
  }
}
```

For details about JSON properties, see Azure SQL linked service.
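To avoid hand-editing, the definition files can also be generated. A minimal sketch (Python, run from the folder containing curl.exe; it covers two of the files, and the account placeholders are kept verbatim and must still be replaced with your own values):

```python
import json

# Definitions from this tutorial; <accountname>/<accountkey> placeholders are
# written as-is and must be replaced with real values before use.
definitions = {
    "datafactory.json": {"name": "ADFCopyTutorialDF", "location": "WestUS"},
    "azurestoragelinkedservice.json": {
        "name": "AzureStorageLinkedService",
        "properties": {
            "type": "AzureStorage",
            "typeProperties": {
                "connectionString": "DefaultEndpointsProtocol=https;"
                "AccountName=<accountname>;AccountKey=<accountkey>"
            },
        },
    },
}

# Write each definition next to curl.exe so the later --data "@file.json"
# arguments resolve.
for filename, body in definitions.items():
    with open(filename, "w") as f:
        json.dump(body, f, indent=2)
```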
inputdataset.json

```json
{
  "name": "AzureBlobInput",
  "properties": {
    "structure": [
      { "name": "FirstName", "type": "String" },
      { "name": "LastName", "type": "String" }
    ],
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "adftutorial/",
      "fileName": "emp.txt",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

The following table provides descriptions for the JSON properties used in the snippet:

| Property | Description |
| --- | --- |
| type | The type property is set to AzureBlob because data resides in an Azure blob storage. |
| linkedServiceName | Refers to the AzureStorageLinkedService that you created earlier. |
| folderPath | Specifies the blob container and the folder that contains input blobs. In this tutorial, adftutorial is the blob container and the folder is the root folder. |
| fileName | This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, emp.txt is specified for the fileName, so only that file is picked up for processing. |
| format -> type | The input file is in the text format, so we use TextFormat. |
| columnDelimiter | The columns in the input file are delimited by the comma character ( , ). |
| frequency/interval | The frequency is set to Hour and interval is set to 1, which means that the input slices are available hourly. In other words, the Data Factory service looks for input data every hour in the root folder of the blob container (adftutorial) you specified. It looks for the data within the pipeline start and end times, not before or after these times. |
| external | This property is set to true if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true. |

For more information about these JSON properties, see the Azure Blob connector article.
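The frequency/interval pair determines when slices are expected. A rough sketch of the hourly slice boundaries for a one-day pipeline window (Python; the dates match the start/end used later in pipeline.json, and this is only an illustration of the scheduling arithmetic, not the service's implementation):

```python
from datetime import datetime, timedelta

def hourly_slices(start: datetime, end: datetime):
    """Slice start times for frequency=Hour, interval=1, bounded by the
    pipeline start and end times (not before or after)."""
    t = start
    while t < end:
        yield t
        t += timedelta(hours=1)

# A pipeline window that is one day long yields 24 hourly slices.
window = list(hourly_slices(datetime(2017, 5, 11), datetime(2017, 5, 12)))
print(len(window))  # 24
```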
outputdataset.json

```json
{
  "name": "AzureSqlOutput",
  "properties": {
    "structure": [
      { "name": "FirstName", "type": "String" },
      { "name": "LastName", "type": "String" }
    ],
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "emp"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

The following table provides descriptions for the JSON properties used in the snippet:

| Property | Description |
| --- | --- |
| type | The type property is set to AzureSqlTable because data is copied to a table in an Azure SQL database. |
| linkedServiceName | Refers to the AzureSqlLinkedService that you created earlier. |
| tableName | Specifies the table to which the data is copied. |
| frequency/interval | The frequency is set to Hour and interval is 1, which means that the output slices are produced hourly between the pipeline start and end times, not before or after these times. |

There are three columns – ID, FirstName, and LastName – in the emp table in the database. ID is an identity column, so you need to specify only FirstName and LastName here. For more information about these JSON properties, see the Azure SQL connector article.

pipeline.json

```json
{
  "name": "ADFTutorialPipeline",
  "properties": {
    "description": "Copy data from a blob to Azure SQL table",
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "description": "Push Regional Effectiveness Campaign data to Azure SQL database",
        "type": "Copy",
        "inputs": [ { "name": "AzureBlobInput" } ],
        "outputs": [ { "name": "AzureSqlOutput" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": {
            "type": "SqlSink",
            "writeBatchSize": 10000,
            "writeBatchTimeout": "60:00:00"
          }
        },
        "Policy": {
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
    ],
    "start": "2017-05-11T00:00:00Z",
    "end": "2017-05-12T00:00:00Z"
  }
}
```

Note the following points: In the activities section, there is only one activity whose type is set to Copy.
For more information about the copy activity, see data movement activities. In Data Factory solutions, you can also use data transformation activities. Input for the activity is set to AzureBlobInput and output for the activity is set to AzureSqlOutput. In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see supported data stores. To learn how to use a specific supported data store as a source/sink, click the link in the table. Replace the value of the start property with the current day and the end value with the next day. You can specify only the date part and skip the time part of the date time. For example, "2017-02-03", which is equivalent to "2017-02-03T00:00:00Z". Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The end time is optional, but we use it in this tutorial. If you do not specify a value for the end property, it is calculated as "start + 48 hours". To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property. In the preceding example, there are 24 data slices, as each data slice is produced hourly. For descriptions of JSON properties in a pipeline definition, see the create pipelines article. For descriptions of JSON properties in a copy activity definition, see data movement activities. For descriptions of JSON properties supported by BlobSource, see the Azure Blob connector article. For descriptions of JSON properties supported by SqlSink, see the Azure SQL Database connector article.

Set global variables

In Azure PowerShell, execute the following commands after replacing the values with your own:

IMPORTANT See the Prerequisites section for instructions on getting the client ID, client secret, tenant ID, and subscription ID.
$client_id = "<client ID of application in AAD>" $client_secret = "<client key of application in AAD>" $tenant = "<Azure tenant ID>"; $subscription_id="<Azure subscription ID>"; $rg = "ADFTutorialResourceGroup"

Run the following command after updating the name of the data factory you are using: $adf = "ADFCopyTutorialDF"

Authenticate with AAD

Run the following command to authenticate with Azure Active Directory (AAD):

$cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token -F grant_type=client_credentials -F resource=https://management.core.windows.net/ -F client_id=$client_id -F client_secret=$client_secret }; $responseToken = Invoke-Command -scriptblock $cmd; $accessToken = (ConvertFrom-Json $responseToken).access_token; (ConvertFrom-Json $responseToken)

Create data factory

In this step, you create an Azure Data Factory named ADFCopyTutorialDF. A data factory can have one or more pipelines. A pipeline can have one or more activities in it; for example, a Copy Activity to copy data from a source to a destination data store, or an HDInsight Hive activity to run a Hive script that transforms input data to produce output data. Run the following commands to create the data factory:

1. Assign the command to a variable named cmd.

IMPORTANT Confirm that the name of the data factory you specify here (ADFCopyTutorialDF) matches the name specified in datafactory.json.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@datafactory.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/ADFCopyTutorialDF?api-version=2015-10-01};

2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd;

3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in the results; otherwise, you see an error message.
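The curl call above is a standard OAuth2 client-credentials request. The same request sketched in Python (no call is made here; only the URL and form fields are assembled, with the tutorial's placeholders left in place):

```python
# Assemble the AAD client-credentials token request that the curl command sends.
tenant = "<Azure tenant ID>"  # placeholder, as in the tutorial
token_url = f"https://login.microsoftonline.com/{tenant}/oauth2/token"

form = {
    "grant_type": "client_credentials",
    "resource": "https://management.core.windows.net/",
    "client_id": "<client ID of application in AAD>",
    "client_secret": "<client key of application in AAD>",
}

# POSTing `form` to token_url returns JSON whose access_token field is then
# sent as "Authorization: Bearer <token>" on every management call below.
print(token_url)
```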
Write-Host $results

Note the following points: The name of the Azure Data Factory must be globally unique. If you see the error in the results: Data factory name "ADFCopyTutorialDF" is not available, do the following steps: 1. Change the name (for example, yournameADFCopyTutorialDF) in the datafactory.json file. 2. In the first command where the $cmd variable is assigned a value, replace ADFCopyTutorialDF with the new name and run the command. 3. Run the next two commands to invoke the REST API to create the data factory and print the results of the operation. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts. To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription. The name of the data factory may be registered as a DNS name in the future and hence become publicly visible. If you receive the error: "This subscription is not registered to use namespace Microsoft.DataFactory", do one of the following and try publishing again: In Azure PowerShell, run the following command to register the Data Factory provider: Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory You can run the following command to confirm that the Data Factory provider is registered: Get-AzureRmResourceProvider Or, sign in to the Azure portal with the Azure subscription and navigate to a Data Factory blade, or create a data factory in the Azure portal. This action automatically registers the provider for you. Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to link the source and destination data stores to your data factory. Then, you define input and output datasets to represent data in the linked data stores. Finally, you create the pipeline with an activity that uses these datasets.

Create linked services

You create linked services in a data factory to link your data stores and compute services to the data factory.
In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of types AzureStorage and AzureSqlDatabase. The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of the prerequisites. AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of the prerequisites.

Create Azure Storage linked service

In this step, you link your Azure storage account to your data factory. You specify the name and key of your Azure storage account in this section. See Azure Storage linked service for details about the JSON properties used to define an Azure Storage linked service.

1. Assign the command to a variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@azurestoragelinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureStorageLinkedService?api-version=2015-10-01};

2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd;

3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the results; otherwise, you see an error message. Write-Host $results

Create Azure SQL linked service

In this step, you link your Azure SQL database to your data factory. You specify the Azure SQL server name, database name, user name, and user password in this section.
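Every curl call that creates a sub-resource in this tutorial targets a URL of the same shape. A small helper (Python sketch; the function name is hypothetical, and the data factory PUT itself differs slightly because it has no sub-resource segment) makes the pattern explicit:

```python
# Build the management REST URL for a Data Factory sub-resource.
# resource_type is "linkedservices", "datasets", or "datapipelines".
API_VERSION = "2015-10-01"

def adf_resource_url(subscription_id: str, resource_group: str,
                     factory: str, resource_type: str, name: str) -> str:
    return (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}/providers/Microsoft.DataFactory"
        f"/datafactories/{factory}/{resource_type}/{name}"
        f"?api-version={API_VERSION}"
    )

url = adf_resource_url("<sub-id>", "ADFTutorialResourceGroup",
                       "ADFCopyTutorialDF", "linkedservices",
                       "AzureStorageLinkedService")
print(url)
```

Each PUT in the following steps is this URL pattern plus an Authorization header and a `--data "@file.json"` body.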
See Azure SQL linked service for details about the JSON properties used to define an Azure SQL linked service.

1. Assign the command to a variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@azuresqllinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureSqlLinkedService?api-version=2015-10-01};

2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd;

3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the results; otherwise, you see an error message. Write-Host $results

Create datasets

In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to your data factory. In this step, you define two datasets named AzureBlobInput and AzureSqlOutput that represent the input and output data stored in the data stores referred to by AzureStorageLinkedService and AzureSqlLinkedService, respectively. The Azure storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (AzureBlobInput) specifies the container and the folder that contains the input data. Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset (AzureSqlOutput) specifies the table in the database to which the data from the blob storage is copied.

Create input dataset

In this step, you create a dataset named AzureBlobInput that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage account represented by the AzureStorageLinkedService linked service.
If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder is copied to the destination. In this tutorial, you specify a value for the fileName.

1. Assign the command to a variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@inputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobInput?api-version=2015-10-01};

2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd;

3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the results; otherwise, you see an error message. Write-Host $results

Create output dataset

The Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. The output SQL table dataset (AzureSqlOutput) you create in this step specifies the table in the database to which the data from the blob storage is copied.

1. Assign the command to a variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@outputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureSqlOutput?api-version=2015-10-01};

2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd;

3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the results; otherwise, you see an error message. Write-Host $results

Create pipeline

In this step, you create a pipeline with a copy activity that uses AzureBlobInput as an input and AzureSqlOutput as an output. Currently, the output dataset is what drives the schedule.
In this tutorial, the output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours. Therefore, 24 slices of the output dataset are produced by the pipeline.

1. Assign the command to a variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@pipeline.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datapipelines/MyFirstPipeline?api-version=2015-10-01};

2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd;

3. View the results. If the pipeline has been successfully created, you see the JSON for the pipeline in the results; otherwise, you see an error message. Write-Host $results

Congratulations! You have successfully created an Azure data factory, with a pipeline that copies data from Azure Blob Storage to an Azure SQL database.

Monitor pipeline

In this step, you use the Data Factory REST API to monitor slices being produced by the pipeline. $ds = "AzureSqlOutput"

IMPORTANT Make sure that the start and end times specified in the following command match the start and end times of the pipeline.

$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/$ds/slices?start=2017-05-11T00%3a00%3a00.0000000Z"&"end=2017-05-12T00%3a00%3a00.0000000Z"&"api-version=2015-10-01}; $results2 = Invoke-Command -scriptblock $cmd; IF ((ConvertFrom-Json $results2).value -ne $NULL) { ConvertFrom-Json $results2 | Select-Object -Expand value | Format-Table } else { (ConvertFrom-Json $results2).RemoteException }

Run the Invoke-Command and the next one until you see a slice in Ready state or Failed state.
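The %3a sequences in the monitoring URL are simply URL-encoded colons (percent-encoding is case-insensitive). A sketch of building the same query string with the Python standard library:

```python
from urllib.parse import urlencode, quote

# Slice-window query for GET .../datasets/{ds}/slices; colons in the
# timestamps must be percent-encoded inside the URL.
params = {
    "start": "2017-05-11T00:00:00.0000000Z",
    "end": "2017-05-12T00:00:00.0000000Z",
    "api-version": "2015-10-01",
}
# quote (rather than the default quote_plus) encodes ':' as '%3A'.
query = urlencode(params, quote_via=quote)
print(query)
```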
When the slice is in Ready state, check the emp table in your Azure SQL database for the output data. For each slice, two rows of data from the source file are copied to the emp table in the Azure SQL database. Therefore, you see 24 new records in the emp table when all the slices are successfully processed (in Ready state).

Summary

In this tutorial, you used the REST API to create an Azure data factory to copy data from an Azure blob to an Azure SQL database. Here are the high-level steps you performed in this tutorial: 1. Created an Azure data factory. 2. Created linked services: a. An Azure Storage linked service to link your Azure Storage account that holds the input data. b. An Azure SQL linked service to link your Azure SQL database that holds the output data. 3. Created datasets, which describe input data and output data for pipelines. 4. Created a pipeline with a Copy Activity with BlobSource as source and SqlSink as sink.

Next steps

In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination data store in a copy operation. For the list of data stores supported as sources and destinations by the copy activity, see supported data stores. To learn how to copy data to or from a specific data store, click the link for that data store in the list.
Tutorial: Create a pipeline with Copy Activity using .NET API

6/13/2017 • 14 min to read

In this article, you learn how to use the .NET API to create a data factory with a pipeline that copies data from an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the Introduction to Azure Data Factory article before doing this tutorial. In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported source data store to a supported sink data store. For a list of data stores supported as sources and sinks, see supported data stores. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data Movement Activities. A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see multiple activities in a pipeline.

NOTE For complete documentation on the .NET API for Data Factory, see the Data Factory .NET API Reference.

The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites

Go through Tutorial Overview and Prerequisites to get an overview of the tutorial and complete the prerequisite steps. Visual Studio 2012, 2013, or 2015. Download and install the Azure .NET SDK. Azure PowerShell. Follow the instructions in the How to install and configure Azure PowerShell article to install Azure PowerShell on your computer. You use Azure PowerShell to create an Azure Active Directory application.
Create an application in Azure Active Directory

Create an Azure Active Directory application, create a service principal for the application, and assign it to the Data Factory Contributor role. 1. Launch PowerShell. 2. Run the following command and enter the user name and password that you use to sign in to the Azure portal. Login-AzureRmAccount 3. Run the following command to view all the subscriptions for this account. Get-AzureRmSubscription 4. Run the following command to select the subscription that you want to work with. Replace <NameOfAzureSubscription> with the name of your Azure subscription. Get-AzureRmSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzureRmContext IMPORTANT Note down SubscriptionId and TenantId from the output of this command. 5. Create an Azure resource group named ADFTutorialResourceGroup by running the following command in PowerShell. New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US" If the resource group already exists, you specify whether to update it (Y) or keep it as is (N). If you use a different resource group, you need to use the name of your resource group in place of ADFTutorialResourceGroup in this tutorial. 6. Create an Azure Active Directory application. $azureAdApplication = New-AzureRmADApplication -DisplayName "ADFCopyTutorialApp" -HomePage "https://www.contoso.org" -IdentifierUris "https://www.adfcopytutorialapp.org/example" -Password "Pass@word1" If you get the following error, specify a different URL and run the command again. Another object with the same value for property identifierUris already exists. 7. Create the AD service principal. New-AzureRmADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId 8. Add the service principal to the Data Factory Contributor role. New-AzureRmRoleAssignment -RoleDefinitionName "Data Factory Contributor" -ServicePrincipalName $azureAdApplication.ApplicationId.Guid 9. Get the application ID.
$azureAdApplication Note down the application ID (applicationID) from the output. You should have the following four values from these steps: Tenant ID, Subscription ID, Application ID, and Password (specified in the first command).

Walkthrough

1. Using Visual Studio 2012/2013/2015, create a C# .NET console application. a. Launch Visual Studio 2012/2013/2015. b. Click File, point to New, and click Project. c. Expand Templates, and select Visual C#. In this walkthrough, you use C#, but you can use any .NET language. d. Select Console Application from the list of project types on the right. e. Enter DataFactoryAPITestApp for the Name. f. Select C:\ADFGetStarted for the Location. g. Click OK to create the project. 2. Click Tools, point to NuGet Package Manager, and click Package Manager Console. 3. In the Package Manager Console, do the following steps: a. Run the following command to install the Data Factory package: Install-Package Microsoft.Azure.Management.DataFactories b. Run the following command to install the Azure Active Directory package (you use the Active Directory API in the code): Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.19.208020213 4. Add the following appSettings section to the App.config file. These settings are used by the helper method GetAuthorizationHeader. Replace the values for <Application ID>, <Password>, <Subscription ID>, and <tenant ID> with your own values. <?xml version="1.0" encoding="utf-8" ?> <configuration> <appSettings> <add key="ActiveDirectoryEndpoint" value="https://login.windows.net/" /> <add key="ResourceManagerEndpoint" value="https://management.azure.com/" /> <add key="WindowsManagementUri" value="https://management.core.windows.net/" /> <add key="ApplicationId" value="your application ID" /> <add key="Password" value="Password you used while creating the AAD application" /> <add key="SubscriptionId" value="Subscription ID" /> <add key="ActiveDirectoryTenantId" value="Tenant ID" /> </appSettings> </configuration> 5.
Add the following using statements to the source file (Program.cs) in the project.

using System.Configuration;
using System.Collections.ObjectModel;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Common.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

6. Add the following code that creates an instance of the DataFactoryManagementClient class to the Main method. You use this object to create a data factory, a linked service, input and output datasets, and a pipeline. You also use this object to monitor slices of a dataset at runtime.

// create data factory management client
string resourceGroupName = "ADFTutorialResourceGroup";
string dataFactoryName = "APITutorialFactory";

TokenCloudCredentials aadTokenCredentials = new TokenCloudCredentials(
    ConfigurationManager.AppSettings["SubscriptionId"],
    GetAuthorizationHeader().Result);

Uri resourceManagerUri = new Uri(ConfigurationManager.AppSettings["ResourceManagerEndpoint"]);

DataFactoryManagementClient client = new DataFactoryManagementClient(aadTokenCredentials, resourceManagerUri);

IMPORTANT Replace the value of resourceGroupName with the name of your Azure resource group. Update the name of the data factory (dataFactoryName) to be unique. The name of the data factory must be globally unique. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.

7. Add the following code that creates a data factory to the Main method.

// create a data factory
Console.WriteLine("Creating a data factory");
client.DataFactories.CreateOrUpdate(resourceGroupName,
    new DataFactoryCreateOrUpdateParameters()
    {
        DataFactory = new DataFactory()
        {
            Name = dataFactoryName,
            Location = "westus",
            Properties = new DataFactoryProperties()
        }
    }
);

A data factory can have one or more pipelines. A pipeline can have one or more activities in it.
For example, a Copy Activity to copy data from a source to a destination data store, and an HDInsight Hive activity to run a Hive script that transforms input data to produce output data. Let's start with creating the data factory in this step.

8. Add the following code that creates an Azure Storage linked service to the Main method.

IMPORTANT Replace storageaccountname and accountkey with the name and key of your Azure Storage account.

// create a linked service for input data store: Azure Storage
Console.WriteLine("Creating Azure Storage linked service");
client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName,
    new LinkedServiceCreateOrUpdateParameters()
    {
        LinkedService = new LinkedService()
        {
            Name = "AzureStorageLinkedService",
            Properties = new LinkedServiceProperties
            (
                new AzureStorageLinkedService("DefaultEndpointsProtocol=https;AccountName=<storageaccountname>;AccountKey=<accountkey>")
            )
        }
    }
);

You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of types AzureStorage and AzureSqlDatabase. The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of the prerequisites.

9. Add the following code that creates an Azure SQL linked service to the Main method.

IMPORTANT Replace servername, databasename, username, and password with the names of your Azure SQL server, database, user, and password.
```csharp
// create a linked service for output data store: Azure SQL Database
Console.WriteLine("Creating Azure SQL Database linked service");
client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName,
    new LinkedServiceCreateOrUpdateParameters()
    {
        LinkedService = new LinkedService()
        {
            Name = "AzureSqlLinkedService",
            Properties = new LinkedServiceProperties
            (
                new AzureSqlDatabaseLinkedService("Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30")
            )
        }
    }
);
```

AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of the prerequisites.

10. Add the following code, which creates the input and output datasets, to the Main method.

```csharp
// create input and output datasets
Console.WriteLine("Creating input and output datasets");
string Dataset_Source = "InputDataset";
string Dataset_Destination = "OutputDataset";

Console.WriteLine("Creating input dataset of type: Azure Blob");
client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName,
    new DatasetCreateOrUpdateParameters()
    {
        Dataset = new Dataset()
        {
            Name = Dataset_Source,
            Properties = new DatasetProperties()
            {
                Structure = new List<DataElement>()
                {
                    new DataElement() { Name = "FirstName", Type = "String" },
                    new DataElement() { Name = "LastName", Type = "String" }
                },
                LinkedServiceName = "AzureStorageLinkedService",
                TypeProperties = new AzureBlobDataset()
                {
                    FolderPath = "adftutorial/",
                    FileName = "emp.txt"
                },
                External = true,
                Availability = new Availability()
                {
                    Frequency = SchedulePeriod.Hour,
                    Interval = 1,
                },
                Policy = new Policy()
                {
                    Validation = new ValidationPolicy()
                    {
                        MinimumRows = 1
                    }
                }
            }
        }
    });

Console.WriteLine("Creating output dataset of type: Azure SQL");
client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName,
    new DatasetCreateOrUpdateParameters()
    {
        Dataset = new Dataset()
        {
            Name = Dataset_Destination,
            Properties = new DatasetProperties()
            {
                Structure = new List<DataElement>()
                {
                    new DataElement() { Name = "FirstName", Type = "String" },
                    new DataElement() { Name = "LastName", Type = "String" }
                },
                LinkedServiceName = "AzureSqlLinkedService",
                TypeProperties = new AzureSqlTableDataset()
                {
                    TableName = "emp"
                },
                Availability = new Availability()
                {
                    Frequency = SchedulePeriod.Hour,
                    Interval = 1,
                },
            }
        }
    });
```

In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to your data factory. In this step, you define two datasets, named InputDataset and OutputDataset, that represent the input and output data stored in the data stores referred to by AzureStorageLinkedService and AzureSqlLinkedService respectively. The Azure Storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account, and the input blob dataset (InputDataset) specifies the container and the folder that contain the input data. Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database, and the output SQL table dataset (OutputDataset) specifies the table in the database to which the data from the blob storage is copied. The InputDataset dataset points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder is copied to the destination. In this tutorial, you specify a value for the fileName. In this step, you also create an output dataset named OutputDataset.
This dataset points to a SQL table in the Azure SQL database represented by AzureSqlLinkedService.

11. Add the following code, which creates and activates a pipeline, to the Main method. In this step, you create a pipeline with a copy activity that uses InputDataset as an input and OutputDataset as an output.

```csharp
// create a pipeline
Console.WriteLine("Creating a pipeline");
DateTime PipelineActivePeriodStartTime = new DateTime(2017, 5, 11, 0, 0, 0, 0, DateTimeKind.Utc);
DateTime PipelineActivePeriodEndTime = new DateTime(2017, 5, 12, 0, 0, 0, 0, DateTimeKind.Utc);
string PipelineName = "ADFTutorialPipeline";

client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName,
    new PipelineCreateOrUpdateParameters()
    {
        Pipeline = new Pipeline()
        {
            Name = PipelineName,
            Properties = new PipelineProperties()
            {
                Description = "Demo Pipeline for data transfer between blobs",

                // Initial value for pipeline's active period. With this, you won't need to set slice status
                Start = PipelineActivePeriodStartTime,
                End = PipelineActivePeriodEndTime,

                Activities = new List<Activity>()
                {
                    new Activity()
                    {
                        Name = "BlobToAzureSql",
                        Inputs = new List<ActivityInput>()
                        {
                            new ActivityInput() { Name = Dataset_Source }
                        },
                        Outputs = new List<ActivityOutput>()
                        {
                            new ActivityOutput() { Name = Dataset_Destination }
                        },
                        TypeProperties = new CopyActivity()
                        {
                            Source = new BlobSource(),
                            Sink = new SqlSink()
                            {
                                WriteBatchSize = 10000,
                                WriteBatchTimeout = TimeSpan.FromMinutes(10)
                            }
                        }
                    }
                }
            }
        }
    });
```

Note the following points:
- In the activities section, there is only one activity whose type is set to Copy. For more information about the copy activity, see data movement activities. In Data Factory solutions, you can also use data transformation activities.
- Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
- In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink type.
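The pipeline's active period (Start to End) spans one day, while the output dataset produces a slice every hour. The slice windows the scheduler derives from those settings can be sketched in a few lines (Python is used here for illustration rather than the tutorial's C#):

```python
from datetime import datetime, timedelta

def slice_windows(start, end, interval):
    """Enumerate the (sliceStart, sliceEnd) windows the scheduler would
    produce for a dataset with the given availability interval."""
    windows = []
    cursor = start
    while cursor < end:
        windows.append((cursor, cursor + interval))
        cursor += interval
    return windows

# Active period from the pipeline code: 2017-05-11T00:00Z to 2017-05-12T00:00Z
start = datetime(2017, 5, 11, 0, 0, 0)
end = datetime(2017, 5, 12, 0, 0, 0)
slices = slice_windows(start, end, timedelta(hours=1))
print(len(slices))  # 24 hourly slices in the one-day active period
```

This is the arithmetic behind the "24 slices" figure discussed later: a one-day active period divided into one-hour availability windows.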
For a complete list of data stores supported by the copy activity as sources and sinks, see supported data stores. To learn how to use a specific supported data store as a source/sink, click the link in the table. Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours. Therefore, 24 slices of the output dataset are produced by the pipeline.

12. Add the following code to the Main method to get the status of a data slice of the output dataset. There is only one slice expected in this sample.

```csharp
// Pulling status within a timeout threshold
DateTime start = DateTime.Now;
bool done = false;

while (DateTime.Now - start < TimeSpan.FromMinutes(5) && !done)
{
    Console.WriteLine("Pulling the slice status");
    // wait before the next status check
    Thread.Sleep(1000 * 12);

    var datalistResponse = client.DataSlices.List(resourceGroupName, dataFactoryName, Dataset_Destination,
        new DataSliceListParameters()
        {
            DataSliceRangeStartTime = PipelineActivePeriodStartTime.ConvertToISO8601DateTimeString(),
            DataSliceRangeEndTime = PipelineActivePeriodEndTime.ConvertToISO8601DateTimeString()
        });

    foreach (DataSlice slice in datalistResponse.DataSlices)
    {
        if (slice.State == DataSliceState.Failed || slice.State == DataSliceState.Ready)
        {
            Console.WriteLine("Slice execution is done with status: {0}", slice.State);
            done = true;
            break;
        }
        else
        {
            Console.WriteLine("Slice status is: {0}", slice.State);
        }
    }
}
```

13. Add the following code, which gets run details for a data slice, to the Main method.
```csharp
Console.WriteLine("Getting run details of a data slice");

// give it a few minutes for the output slice to be ready
Console.WriteLine("\nGive it a few minutes for the output slice to be ready and press any key.");
Console.ReadKey();

var datasliceRunListResponse = client.DataSliceRuns.List(
    resourceGroupName, dataFactoryName, Dataset_Destination,
    new DataSliceRunListParameters()
    {
        DataSliceStartTime = PipelineActivePeriodStartTime.ConvertToISO8601DateTimeString()
    }
);

foreach (DataSliceRun run in datasliceRunListResponse.DataSliceRuns)
{
    Console.WriteLine("Status: \t\t{0}", run.Status);
    Console.WriteLine("DataSliceStart: \t{0}", run.DataSliceStart);
    Console.WriteLine("DataSliceEnd: \t\t{0}", run.DataSliceEnd);
    Console.WriteLine("ActivityId: \t\t{0}", run.ActivityName);
    Console.WriteLine("ProcessingStartTime: \t{0}", run.ProcessingStartTime);
    Console.WriteLine("ProcessingEndTime: \t{0}", run.ProcessingEndTime);
    Console.WriteLine("ErrorMessage: \t{0}", run.ErrorMessage);
}

Console.WriteLine("\nPress any key to exit.");
Console.ReadKey();
```

14. Add the following helper method, used by the Main method, to the Program class.

NOTE
When you copy and paste the following code, make sure that the copied code is at the same level as the Main method.

```csharp
public static async Task<string> GetAuthorizationHeader()
{
    AuthenticationContext context = new AuthenticationContext(
        ConfigurationManager.AppSettings["ActiveDirectoryEndpoint"] +
        ConfigurationManager.AppSettings["ActiveDirectoryTenantId"]);

    ClientCredential credential = new ClientCredential(
        ConfigurationManager.AppSettings["ApplicationId"],
        ConfigurationManager.AppSettings["Password"]);

    AuthenticationResult result = await context.AcquireTokenAsync(
        resource: ConfigurationManager.AppSettings["WindowsManagementUri"],
        clientCredential: credential);

    if (result != null)
        return result.AccessToken;

    throw new InvalidOperationException("Failed to acquire token");
}
```

15.
In the Solution Explorer, expand the project (DataFactoryAPITestApp), right-click References, and click Add Reference. Select the check box for the System.Configuration assembly, and click OK.
16. Build the console application. Click Build on the menu and click Build Solution.
17. Confirm that there is at least one file in the adftutorial container in your Azure blob storage. If not, create an Emp.txt file in Notepad with the following content and upload it to the adftutorial container.

```
John, Doe
Jane, Doe
```

18. Run the sample by clicking Debug -> Start Debugging on the menu. When you see Getting run details of a data slice, wait for a few minutes, and press ENTER.
19. Use the Azure portal to verify that the data factory APITutorialFactory is created with the following artifacts (the names below match the ones used in the code):
- Linked services: AzureStorageLinkedService and AzureSqlLinkedService
- Datasets: InputDataset and OutputDataset
- Pipeline: ADFTutorialPipeline
20. Verify that the two employee records are created in the emp table in the specified Azure SQL database.

Next steps
For complete documentation on the .NET API for Data Factory, see the Data Factory .NET API Reference. In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination data store in a copy operation.
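The copy operation maps the two comma-delimited columns of Emp.txt onto the FirstName and LastName columns declared in the dataset Structure. A rough sketch of that row-to-record mapping (Python, for illustration only; the Copy Activity performs this conversion inside the service):

```python
def parse_emp_lines(text):
    """Split comma-delimited rows into records matching the dataset
    structure (FirstName, LastName), trimming whitespace around fields."""
    records = []
    for line in text.strip().splitlines():
        first, last = (field.strip() for field in line.split(","))
        records.append({"FirstName": first, "LastName": last})
    return records

sample = "John, Doe\nJane, Doe"
rows = parse_emp_lines(sample)
print(rows[0]["FirstName"])  # John
```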
The following table provides a list of data stores supported as sources and destinations by the copy activity:

| Category | Data store | Supported as a source | Supported as a sink |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| | Azure Cosmos DB (DocumentDB API) | ✓ | ✓ |
| | Azure Data Lake Store | ✓ | ✓ |
| | Azure SQL Database | ✓ | ✓ |
| | Azure SQL Data Warehouse | ✓ | ✓ |
| | Azure Search Index | | ✓ |
| | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| | DB2* | ✓ | |
| | MySQL* | ✓ | |
| | Oracle* | ✓ | ✓ |
| | PostgreSQL* | ✓ | |
| | SAP Business Warehouse* | ✓ | |
| | SAP HANA* | ✓ | |
| | SQL Server* | ✓ | ✓ |
| | Sybase* | ✓ | |
| | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| | File System* | ✓ | ✓ |
| | FTP | ✓ | |
| | HDFS* | ✓ | |
| | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| | Generic OData | ✓ | |
| | Generic ODBC* | ✓ | |
| | Salesforce | ✓ | |
| | Web Table (table from HTML) | ✓ | |
| | GE Historian* | ✓ | |

To learn about how to copy data to/from a data store, click the link for the data store in the table.

Tutorial: Build your first pipeline to transform data using Hadoop cluster 6/13/2017 • 4 min to read

In this tutorial, you build your first Azure data factory with a data pipeline. The pipeline transforms input data by running a Hive script on an Azure HDInsight (Hadoop) cluster to produce output data. This article provides an overview and the prerequisites for the tutorial. After you complete the prerequisites, you can do the tutorial using one of the following tools/SDKs: Azure portal, Visual Studio, PowerShell, Azure Resource Manager template, REST API. Select one of the options in the drop-down list at the beginning, or the links at the end, of this article to do the tutorial using one of these options.

Tutorial overview
In this tutorial, you perform the following steps:
1. Create a data factory. A data factory can contain one or more data pipelines that move and transform data. In this tutorial, you create one pipeline in the data factory.
2. Create a pipeline. A pipeline can have one or more activities (Examples: Copy Activity, HDInsight Hive Activity).
This sample uses the HDInsight Hive activity that runs a Hive script on an HDInsight Hadoop cluster. The script first creates a table that references the raw web log data stored in Azure blob storage and then partitions the raw data by year and month. In this tutorial, the pipeline uses the Hive Activity to transform data by running a Hive query on an Azure HDInsight Hadoop cluster.
3. Create linked services. You create a linked service to link a data store or a compute service to the data factory. A data store such as Azure Storage holds the input/output data of activities in the pipeline. A compute service such as an HDInsight Hadoop cluster processes/transforms data. In this tutorial, you create two linked services: Azure Storage and Azure HDInsight. The Azure Storage linked service links the Azure storage account that holds the input/output data to the data factory. The Azure HDInsight linked service links an Azure HDInsight cluster, which is used to transform data, to the data factory.
4. Create input and output datasets. An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output for the activity. In this tutorial, the input and output datasets specify the locations of input and output data in the Azure Blob Storage. The Azure Storage linked service specifies which Azure storage account is used. An input dataset specifies where the input files are located, and an output dataset specifies where the output files are placed.

See the Introduction to Azure Data Factory article for a detailed overview of Azure Data Factory. Here is the diagram view of the sample data factory you build in this tutorial. MyFirstPipeline has one activity of type Hive that consumes the AzureBlobInput dataset as an input and produces the AzureBlobOutput dataset as an output. In this tutorial, the inputdata folder of the adfgetstarted Azure blob container contains one file named input.log. This log file has entries from three months: January, February, and March of 2016.
Here are the sample rows for each month in the input file:

```
2016-01-01,02:01:09,SAMPLEWEBSITE,GET,/blogposts/mvc4/step2.png,X-ARR-LOG-ID=2ec4b8ad-3cf0-4442-93ab-837317ece6a1,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx,\N,200,0,0,53175,871
2016-02-01,02:01:10,SAMPLEWEBSITE,GET,/blogposts/mvc4/step7.png,X-ARR-LOG-ID=d7472a26-431a-4a4d-99eb-c7b4fda2cf4c,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx,\N,200,0,0,30184,871
2016-03-01,02:01:10,SAMPLEWEBSITE,GET,/blogposts/mvc4/step7.png,X-ARR-LOG-ID=d7472a26-431a-4a4d-99eb-c7b4fda2cf4c,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx,\N,200,0,0,30184,871
```

When the file is processed by the pipeline with the HDInsight Hive Activity, the activity runs a Hive script on the HDInsight cluster that partitions input data by year and month. The script creates three output folders that each contain a file with entries from one month:

```
adfgetstarted/partitioneddata/year=2016/month=1/000000_0
adfgetstarted/partitioneddata/year=2016/month=2/000000_0
adfgetstarted/partitioneddata/year=2016/month=3/000000_0
```

From the sample lines shown above, the first one (with 2016-01-01) is written to the 000000_0 file in the month=1 folder. Similarly, the second one is written to the file in the month=2 folder and the third one is written to the file in the month=3 folder.

Prerequisites
Before you begin this tutorial, you must have the following prerequisites:
1.
Azure subscription - If you don't have an Azure subscription, you can create a free trial account in just a couple of minutes. See the Free Trial article on how you can obtain a free trial account.
2. Azure Storage - You use an Azure storage account for storing the data in this tutorial. If you don't have an Azure storage account, see the Create a storage account article. After you have created the storage account, note down the account name and access key. See View, copy and regenerate storage access keys.
3. Download and review the Hive query file (HQL) located at: https://adftutorialfiles.blob.core.windows.net/hivetutorial/partitionweblogs.hql. This query transforms input data to produce output data.
4. Download and review the sample input file (input.log) located at: https://adftutorialfiles.blob.core.windows.net/hivetutorial/input.log
5. Create a blob container named adfgetstarted in your Azure Blob Storage.
6. Upload the partitionweblogs.hql file to the script folder in the adfgetstarted container. Use a tool such as Microsoft Azure Storage Explorer.
7. Upload the input.log file to the inputdata folder in the adfgetstarted container.

After you complete the prerequisites, select one of the following tools/SDKs to do the tutorial: Azure portal, Visual Studio, PowerShell, Resource Manager template, REST API. The Azure portal and Visual Studio provide a GUI way of building your data factories, whereas the PowerShell, Resource Manager template, and REST API options provide a scripting/programming way of building your data factories.

NOTE
The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source data store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database. You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity.
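The chaining rule just described — an activity whose input dataset is produced by another activity runs after it — can be sketched as a small ordering function. The activity and dataset names below are hypothetical, and Python is used for illustration rather than Data Factory's JSON model:

```python
def chain_order(activities):
    """Order activities so that any activity whose input dataset is
    produced by another activity runs after its producer."""
    produced = {out: act["name"] for act in activities for out in act["outputs"]}
    order, seen = [], set()

    def visit(act):
        if act["name"] in seen:
            return
        # visit producers of this activity's inputs first
        for inp in act["inputs"]:
            if inp in produced:
                visit(next(a for a in activities if a["name"] == produced[inp]))
        seen.add(act["name"])
        order.append(act["name"])

    for act in activities:
        visit(act)
    return order

# Hypothetical two-activity chain: a copy stages data, then Hive transforms it.
activities = [
    {"name": "HiveTransform", "inputs": ["StagedData"], "outputs": ["PartitionedData"]},
    {"name": "CopyFromBlob", "inputs": ["RawLogs"], "outputs": ["StagedData"]},
]
print(chain_order(activities))  # CopyFromBlob runs before HiveTransform
```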
See Scheduling and execution in Data Factory for detailed information.

Tutorial: Build your first Azure data factory using Azure portal 6/13/2017 • 14 min to read

In this article, you learn how to use the Azure portal to create your first Azure data factory. To do the tutorial using other tools/SDKs, select one of the options from the drop-down list. The pipeline in this tutorial has one activity: an HDInsight Hive activity. This activity runs a Hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times.

NOTE
The data pipeline in this tutorial transforms input data to produce output data. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database. A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution in Data Factory.

Prerequisites
1. Read through the Tutorial Overview article and complete the prerequisite steps.
2. This article does not provide a conceptual overview of the Azure Data Factory service. We recommend that you go through the Introduction to Azure Data Factory article for a detailed overview of the service.

Create data factory
A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store, and an HDInsight Hive activity to run a Hive script to transform input data to produce output data. Let's start with creating the data factory in this step.
1. Log in to the Azure portal.
2. Click NEW on the left menu, click Data + Analytics, and click Data Factory.
3. In the New data factory blade, enter GetStartedDF for the Name.
IMPORTANT
The name of the Azure data factory must be globally unique. If you receive the error Data factory name "GetStartedDF" is not available, change the name of the data factory (for example, yournameGetStartedDF) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts. The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.

4. Select the Azure subscription where you want the data factory to be created.
5. Select an existing resource group, or create a new one. For the tutorial, create a resource group named ADFGetStartedRG.
6. Select the location for the data factory. Only regions supported by the Data Factory service are shown in the drop-down list.
7. Select Pin to dashboard.
8. Click Create on the New data factory blade.

IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the subscription/resource group level.

9. On the dashboard, you see the following tile with the status Deploying data factory.
10. Congratulations! You have successfully created your first data factory. After the data factory has been created successfully, you see the data factory page, which shows you the contents of the data factory.

Before creating a pipeline in the data factory, you need to create a few Data Factory entities first. You first create linked services to link data stores/computes to your data factory, define input and output datasets to represent input/output data in the linked data stores, and then create the pipeline with an activity that uses these datasets.

Create linked services
In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run the Hive script specified in the activity of the pipeline in this sample.
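For reference in the steps that follow, the JSON the editor generates for an Azure Storage linked service has roughly the following shape. This is a sketch based on the Data Factory JSON schema, not the exact text the editor emits; replace the placeholders with your own account name and key:

```json
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}
```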
Identify what data store/compute services are used in your scenario and link those services to the data factory by creating linked services. Create Azure Storage linked service In this step, you link your Azure Storage account to your data factory. In this tutorial, you use the same Azure Storage account to store input/output data and the HQL script file. 1. Click Author and deploy on the DATA FACTORY blade for GetStartedDF. You should see the Data Factory Editor. 2. Click New data store and choose Azure storage. 3. You should see the JSON script for creating an Azure Storage linked service in the editor. 4. Replace account name with the name of your Azure storage account and account key with the access key of the Azure storage account. To learn how to get your storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your storage account. 5. Click Deploy on the command bar to deploy the linked service. After the linked service is deployed successfully, the Draft-1 window should disappear and you see AzureStorageLinkedService in the tree view on the left. Create Azure HDInsight linked service In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime and deleted after it is done processing and idle for the specified amount of time. 1. In the Data Factory Editor, click ... More, click New compute, and select On-demand HDInsight cluster. 2. Copy and paste the following snippet to the Draft-1 window. The JSON snippet describes the properties that are used to create the HDInsight cluster on-demand. 
```json
{
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterSize": 1,
            "timeToLive": "00:30:00",
            "linkedServiceName": "AzureStorageLinkedService"
        }
    }
}
```

The following table provides descriptions for the JSON properties used in the snippet:

| Property | Description |
| --- | --- |
| clusterSize | Specifies the size of the HDInsight cluster. |
| timeToLive | Specifies the idle time for the HDInsight cluster before it is deleted. |
| linkedServiceName | Specifies the storage account that is used to store the logs that are generated by HDInsight. |

Note the following points:
- The Data Factory creates a Windows-based HDInsight cluster for you with this JSON. You could also have it create a Linux-based HDInsight cluster. See On-demand HDInsight Linked Service for details.
- You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See HDInsight Linked Service for details.
- The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done. As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp". Use a tool such as Microsoft Storage Explorer to delete containers in your Azure blob storage. See On-demand HDInsight Linked Service for details.

3. Click Deploy on the command bar to deploy the linked service.
4.
Confirm that you see both AzureStorageLinkedService and HDInsightOnDemandLinkedService in the tree view on the left.

Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the AzureStorageLinkedService you created earlier in this tutorial. The linked service points to an Azure Storage account, and the datasets specify the container, folder, and file name in the storage that holds the input and output data.

Create input dataset
1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob storage.
2. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a dataset called AzureBlobInput that represents the input data for an activity in the pipeline. In addition, you specify that the input data is located in the blob container called adfgetstarted and the folder called inputdata.

```json
{
    "name": "AzureBlobInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "input.log",
            "folderPath": "adfgetstarted/inputdata",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Month",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}
```

The following table provides descriptions for the JSON properties used in the snippet:

| Property | Description |
| --- | --- |
| type | The type property is set to AzureBlob because the data resides in an Azure blob storage. |
| linkedServiceName | Refers to the AzureStorageLinkedService you created earlier. |
| folderPath | Specifies the blob container and the folder that contains the input blobs. |
| fileName | This property is optional. If you omit this property, all the files from the folderPath are picked. In this tutorial, only the input.log file is processed. |
| format/type | The log files are in text format, so we use TextFormat. |
| columnDelimiter | Columns in the log files are delimited by the comma character (,). |
| frequency/interval | frequency is set to Month and interval is 1, which means that the input slices are available monthly. |
| external | This property is set to true if the input data is not generated by this pipeline. In this tutorial, the input.log file is not generated by this pipeline, so we set the property to true. |

For more information about these JSON properties, see the Azure Blob connector article.

3. Click Deploy on the command bar to deploy the newly created dataset. You should see the dataset in the tree view on the left.

Create output dataset
Now, you create the output dataset to represent the output data stored in the Azure Blob storage.
1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob storage.
2. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a dataset called AzureBlobOutput for the data that is produced by the Hive script. In addition, you specify that the results are stored in the blob container called adfgetstarted and the folder called partitioneddata. The availability section specifies that the output dataset is produced on a monthly basis.

```json
{
    "name": "AzureBlobOutput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "adfgetstarted/partitioneddata",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Month",
            "interval": 1
        }
    }
}
```

See the Create input dataset section for descriptions of these properties. You do not set the external property on an output dataset because the dataset is produced by the Data Factory service.

3. Click Deploy on the command bar to deploy the newly created dataset.
4. Verify that the dataset is created successfully.

Create pipeline
In this step, you create your first pipeline with an HDInsightHive activity.
The input slice is available monthly (frequency: Month, interval: 1), the output slice is produced monthly, and the scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the following JSON are explained at the end of this section.

1. In the Data Factory Editor, click Ellipsis (…) More commands and then click New pipeline.
2. Copy and paste the following snippet to the Draft-1 window.

IMPORTANT
Replace storageaccountname with the name of your storage account in the JSON.

```json
{
    "name": "MyFirstPipeline",
    "properties": {
        "description": "My first Azure Data Factory pipeline",
        "activities": [
            {
                "type": "HDInsightHive",
                "typeProperties": {
                    "scriptPath": "adfgetstarted/script/partitionweblogs.hql",
                    "scriptLinkedService": "AzureStorageLinkedService",
                    "defines": {
                        "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
                        "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
                    }
                },
                "inputs": [
                    { "name": "AzureBlobInput" }
                ],
                "outputs": [
                    { "name": "AzureBlobOutput" }
                ],
                "policy": {
                    "concurrency": 1,
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Month",
                    "interval": 1
                },
                "name": "RunSampleHiveActivity",
                "linkedServiceName": "HDInsightOnDemandLinkedService"
            }
        ],
        "start": "2016-04-01T00:00:00Z",
        "end": "2016-04-02T00:00:00Z",
        "isPaused": false
    }
}
```

In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on an HDInsight cluster. The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the scriptLinkedService, called AzureStorageLinkedService), in the script folder of the container adfgetstarted.
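The values in the defines section surface inside the Hive script as ${hiveconf:...} variables. A rough illustration of that substitution (Python, for illustration only; the real expansion is done by Hive at run time, and the query text here is a hypothetical fragment, not the contents of partitionweblogs.hql):

```python
import re

def resolve_hiveconf(script, defines):
    """Replace ${hiveconf:name} placeholders with the values supplied
    through the activity's 'defines' section."""
    return re.sub(r"\$\{hiveconf:(\w+)\}",
                  lambda m: defines[m.group(1)], script)

# Hypothetical query fragment using the two defines from the pipeline JSON.
script = ("INSERT OVERWRITE TABLE ${hiveconf:partitionedtable} "
          "SELECT * FROM ${hiveconf:inputtable};")
defines = {
    "inputtable": "wasb://adfgetstarted@mystorageaccount.blob.core.windows.net/inputdata",
    "partitionedtable": "wasb://adfgetstarted@mystorageaccount.blob.core.windows.net/partitioneddata",
}
resolved = resolve_hiveconf(script, defines)
print("${hiveconf" not in resolved)  # all placeholders resolved
```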
The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}). The start and end properties of the pipeline specify the active period of the pipeline. In the activity JSON, you specify that the Hive script runs on the compute specified by the linkedServiceName: HDInsightOnDemandLinkedService.

NOTE
See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about the JSON properties used in the example.

3. Confirm the following:
a. The input.log file exists in the inputdata folder of the adfgetstarted container in the Azure blob storage.
b. The partitionweblogs.hql file exists in the script folder of the adfgetstarted container in the Azure blob storage. Complete the prerequisite steps in the Tutorial Overview if you don't see these files.
c. You replaced storageaccountname with the name of your storage account in the pipeline JSON.
4. Click Deploy on the command bar to deploy the pipeline. Because the start and end times are set in the past and isPaused is set to false, the pipeline (the activity in the pipeline) runs immediately after you deploy it.
5. Confirm that you see the pipeline in the tree view.
6. Congratulations, you have successfully created your first pipeline!

Monitor pipeline
Monitor pipeline using Diagram View
1. Click X to close the Data Factory Editor blades, navigate back to the Data Factory blade, and click Diagram.
2. In the Diagram View, you see an overview of the pipelines and datasets used in this tutorial.
3. To view all activities in the pipeline, right-click the pipeline in the diagram and click Open Pipeline.
4. Confirm that you see the HDInsightHive activity in the pipeline. To navigate back to the previous view, click Data factory in the breadcrumb menu at the top.
5. In the Diagram View, double-click the dataset AzureBlobInput. Confirm that the slice is in Ready state.
It may take a couple of minutes for the slice to show up in the Ready state. If it does not happen after you wait for some time, check whether the input file (input.log) is placed in the right container (adfgetstarted) and folder (inputdata).
6. Click X to close the AzureBlobInput blade.
7. In the Diagram View, double-click the dataset AzureBlobOutput. You see the slice that is currently being processed.
8. When processing is done, you see the slice in the Ready state.

IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately 30 minutes to process the slice.

9. When the slice is in the Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the output data.
10. Click the slice to see details about it in a Data slice blade.
11. Click an activity run in the Activity runs list to see details about an activity run (the Hive activity in our scenario) in an Activity run details window. From the log files, you can see the Hive query that was executed and status information. These logs are useful for troubleshooting issues. See the Monitor and manage pipelines using Azure portal blades article for more details.

IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

Monitor pipeline using Monitor & Manage App
You can also use the Monitor & Manage application to monitor your pipelines. For detailed information about using this application, see Monitor and manage Azure Data Factory pipelines using Monitoring and Management App.
1. Click the Monitor & Manage tile on the home page for your data factory.
2. You should see the Monitor & Manage application.
Change the Start time and End time to match the start (04-01-2016 12:00 AM) and end (04-02-2016 12:00 AM) times of your pipeline, and click Apply.
3. Select an activity window in the Activity Windows list to see details about it.

Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop cluster. You used the Data Factory Editor in the Azure portal to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
   a. An Azure Storage linked service to link your Azure blob storage, which holds the input/output files, to the data factory.
   b. An Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Azure Data Factory creates an HDInsight Hadoop cluster just-in-time to process input data and produce output data.
3. Created two datasets, which describe the input and output data for the HDInsight Hive activity in the pipeline.
4. Created a pipeline with an HDInsight Hive activity.

Next Steps
In this article, you created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure blob to Azure SQL, see Tutorial: Copy data from an Azure blob to Azure SQL.

See Also

| TOPIC | DESCRIPTION |
| --- | --- |
| Pipelines | This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business. |
| Datasets | This article helps you understand datasets in Azure Data Factory. |
| Scheduling and execution | This article explains the scheduling and execution aspects of the Azure Data Factory application model. |
| Monitor and manage pipelines using Monitoring App | This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App. |
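The pipeline in this tutorial contains a single Hive activity, but activities can be chained by making the output dataset of one activity the input dataset of the next. The following is a minimal, hypothetical sketch of that pattern; the Copy activity and the SqlOutputDataset name are illustrative assumptions, and required sections such as typeProperties, scheduler, policy, and the pipeline active period are omitted for brevity, so this is not deployable as-is:

```json
{
  "name": "ChainedPipelineSketch",
  "properties": {
    "activities": [
      {
        "type": "HDInsightHive",
        "name": "RunSampleHiveActivity",
        "inputs": [ { "name": "AzureBlobInput" } ],
        "outputs": [ { "name": "AzureBlobOutput" } ]
      },
      {
        "type": "Copy",
        "name": "CopyPartitionedData",
        "inputs": [ { "name": "AzureBlobOutput" } ],
        "outputs": [ { "name": "SqlOutputDataset" } ]
      }
    ]
  }
}
```

Because CopyPartitionedData consumes AzureBlobOutput, the Data Factory service runs it only after the Hive activity has produced that slice.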
Tutorial: Create a data factory by using Visual Studio
6/13/2017 • 22 min to read • Edit Online

This tutorial shows you how to create an Azure data factory by using Visual Studio. You create a Visual Studio project using the Data Factory project template, define Data Factory entities (linked services, datasets, and pipeline) in JSON format, and then publish/deploy these entities to the cloud. The pipeline in this tutorial has one activity: an HDInsight Hive activity. This activity runs a Hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times.

NOTE
This tutorial does not show how to copy data by using Azure Data Factory. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.

A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution in Data Factory.

Walkthrough: Create and publish Data Factory entities
Here are the steps you perform as part of this walkthrough:
1. Create two linked services: AzureStorageLinkedService1 and HDInsightOnDemandLinkedService1. In this tutorial, both the input and output data for the Hive activity are in the same Azure Blob Storage. You use an on-demand HDInsight cluster to process existing input data to produce output data. The on-demand HDInsight cluster is automatically created for you by Azure Data Factory at run time when the input data is ready to be processed. You need to link your data stores or computes to your data factory so that the Data Factory service can connect to them at runtime.
Therefore, you link your Azure Storage account to the data factory by using the AzureStorageLinkedService1, and link an on-demand HDInsight cluster by using the HDInsightOnDemandLinkedService1. When publishing, you specify the name for the data factory to be created, or an existing data factory.
2. Create two datasets: InputDataset and OutputDataset, which represent the input/output data that is stored in the Azure blob storage. These dataset definitions refer to the Azure Storage linked service you created in the previous step. For the InputDataset, you specify the blob container (adfgetstarted) and the folder (inputdata) that contains a blob with the input data. For the OutputDataset, you specify the blob container (adfgetstarted) and the folder (partitioneddata) that holds the output data. You also specify other properties such as structure, availability, and policy.
3. Create a pipeline named MyFirstPipeline. In this walkthrough, the pipeline has only one activity: an HDInsight Hive activity. This activity transforms input data to produce output data by running a Hive script on an on-demand HDInsight cluster. To learn more about the Hive activity, see Hive Activity.
4. Create a data factory named DataFactoryUsingVS. Deploy the data factory and all Data Factory entities (linked services, tables, and the pipeline).
5. After you publish, you use Azure portal blades and the Monitoring & Management App to monitor the pipeline.

Prerequisites
1. Read through the Tutorial Overview article and complete the prerequisite steps. You can also select the Overview and prerequisites option in the drop-down list at the top to switch to the article. After you complete the prerequisites, switch back to this article by selecting the Visual Studio option in the drop-down list.
2. To create Data Factory instances, you must be a member of the Data Factory Contributor role at the subscription/resource group level.
3.
You must have the following installed on your computer:
   Visual Studio 2013 or Visual Studio 2015.
   The Azure SDK for Visual Studio 2013 or Visual Studio 2015. Navigate to the Azure Download Page and click VS 2013 or VS 2015 in the .NET section.
   The latest Azure Data Factory plugin for Visual Studio: VS 2013 or VS 2015. You can also update the plugin by doing the following steps: on the menu, click Tools -> Extensions and Updates -> Online -> Visual Studio Gallery -> Microsoft Azure Data Factory Tools for Visual Studio -> Update.

Now, let's use Visual Studio to create an Azure data factory.

Create Visual Studio project
1. Launch Visual Studio 2013 or Visual Studio 2015. Click File, point to New, and click Project. You should see the New Project dialog box.
2. In the New Project dialog, select the DataFactory template, and click Empty Data Factory Project.
3. Enter a name for the project, a location, and a name for the solution, and click OK.

Create linked services
In this step, you create two linked services: Azure Storage and HDInsight on-demand.

The Azure Storage linked service links your Azure Storage account to the data factory by providing the connection information. The Data Factory service uses the connection string from the linked service setting to connect to the Azure storage at runtime. This storage holds the input and output data for the pipeline, and the Hive script file used by the Hive activity. With the on-demand HDInsight linked service, the HDInsight cluster is automatically created at runtime when the input data is ready to be processed. The cluster is deleted after it is done processing and has been idle for the specified amount of time.

NOTE
You create a data factory by specifying its name and settings at the time of publishing your Data Factory solution.

Create Azure Storage linked service
1. Right-click Linked Services in the solution explorer, point to Add, and click New Item.
2.
In the Add New Item dialog box, select Azure Storage Linked Service from the list, and click Add.
3. Replace <accountname> and <accountkey> with the name of your Azure storage account and its key. To learn how to get your storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your storage account.
4. Save the AzureStorageLinkedService1.json file.

Create Azure HDInsight linked service
1. In the Solution Explorer, right-click Linked Services, point to Add, and click New Item.
2. Select HDInsight On Demand Linked Service, and click Add.
3. Replace the JSON with the following JSON:

```json
{
  "name": "HDInsightOnDemandLinkedService",
  "properties": {
    "type": "HDInsightOnDemand",
    "typeProperties": {
      "clusterSize": 1,
      "timeToLive": "00:30:00",
      "linkedServiceName": "AzureStorageLinkedService1"
    }
  }
}
```

The following table provides descriptions for the JSON properties used in the snippet:

| PROPERTY | DESCRIPTION |
| --- | --- |
| clusterSize | Specifies the size of the HDInsight Hadoop cluster. |
| timeToLive | Specifies the idle time for the HDInsight cluster before it is deleted. |
| linkedServiceName | Specifies the storage account that is used to store the logs that are generated by the HDInsight Hadoop cluster. |

IMPORTANT
The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed, unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done. As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: adf<yourdatafactoryname>-<linkedservicename>-datetimestamp.
Use tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage. For more information about JSON properties, see the Compute linked services article.
4. Save the HDInsightOnDemandLinkedService1.json file.

Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the AzureStorageLinkedService1 you created earlier in this tutorial. The linked service points to an Azure Storage account, and the datasets specify the container, folder, and file name in the storage that holds the input and output data.

Create input dataset
1. In the Solution Explorer, right-click Tables, point to Add, and click New Item.
2. Select Azure Blob from the list, change the name of the file to InputDataSet.json, and click Add.
3. Replace the JSON in the editor with the following JSON snippet:

```json
{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService1",
    "typeProperties": {
      "fileName": "input.log",
      "folderPath": "adfgetstarted/inputdata",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "availability": {
      "frequency": "Month",
      "interval": 1
    },
    "external": true,
    "policy": {}
  }
}
```

This JSON snippet defines a dataset called AzureBlobInput that represents the input data for the Hive activity in the pipeline. You specify that the input data is located in the blob container called adfgetstarted and the folder called inputdata.

The following table provides descriptions for the JSON properties used in the snippet:

| PROPERTY | DESCRIPTION |
| --- | --- |
| type | The type property is set to AzureBlob because the data resides in Azure Blob Storage. |
| linkedServiceName | Refers to the AzureStorageLinkedService1 you created earlier. |
| fileName | This property is optional. If you omit this property, all the files from the folderPath are picked. In this case, only the input.log file is processed. |
| format/type | The log files are in text format, so we use TextFormat. |
| columnDelimiter | The columns in the log files are delimited by the comma character (,). |
| frequency/interval | The frequency is set to Month and the interval is 1, which means that the input slices are available monthly. |
| external | This property is set to true if the input data for the activity is not generated by the pipeline. This property is only specified on input datasets. For the input dataset of the first activity, always set it to true. |

4. Save the InputDataset.json file.

Create output dataset
Now, you create the output dataset to represent the output data stored in the Azure Blob storage.
1. In the Solution Explorer, right-click Tables, point to Add, and click New Item.
2. Select Azure Blob from the list, change the name of the file to OutputDataset.json, and click Add.
3. Replace the JSON in the editor with the following JSON:

```json
{
  "name": "AzureBlobOutput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService1",
    "typeProperties": {
      "folderPath": "adfgetstarted/partitioneddata",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "availability": {
      "frequency": "Month",
      "interval": 1
    }
  }
}
```

The JSON snippet defines a dataset called AzureBlobOutput that represents the output data produced by the Hive activity in the pipeline. You specify that the output data produced by the Hive activity is placed in the blob container called adfgetstarted and the folder called partitioneddata. The availability section specifies that the output dataset is produced on a monthly basis. The output dataset drives the schedule of the pipeline: the pipeline runs monthly between its start and end times. See the Create input dataset section for descriptions of these properties. You do not set the external property on an output dataset because the dataset is produced by the pipeline.
4. Save the OutputDataset.json file.

Create pipeline
You have created the Azure Storage linked service, and the input and output datasets so far.
Now, you create a pipeline with an HDInsightHive activity. The input for the Hive activity is set to AzureBlobInput and the output is set to AzureBlobOutput. A slice of the input dataset is available monthly (frequency: Month, interval: 1), and the output slice is produced monthly too.
1. In the Solution Explorer, right-click Pipelines, point to Add, and click New Item.
2. Select Hive Transformation Pipeline from the list, and click Add.
3. Replace the JSON with the following snippet:

   IMPORTANT
   Replace <storageaccountname> with the name of your storage account.

   ```json
   {
     "name": "MyFirstPipeline",
     "properties": {
       "description": "My first Azure Data Factory pipeline",
       "activities": [
         {
           "type": "HDInsightHive",
           "typeProperties": {
             "scriptPath": "adfgetstarted/script/partitionweblogs.hql",
             "scriptLinkedService": "AzureStorageLinkedService1",
             "defines": {
               "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
               "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
             }
           },
           "inputs": [ { "name": "AzureBlobInput" } ],
           "outputs": [ { "name": "AzureBlobOutput" } ],
           "policy": { "concurrency": 1, "retry": 3 },
           "scheduler": { "frequency": "Month", "interval": 1 },
           "name": "RunSampleHiveActivity",
           "linkedServiceName": "HDInsightOnDemandLinkedService"
         }
       ],
       "start": "2016-04-01T00:00:00Z",
       "end": "2016-04-02T00:00:00Z",
       "isPaused": false
     }
   }
   ```

   The JSON snippet defines a pipeline that consists of a single activity (Hive activity). This activity runs a Hive script to process input data on an on-demand HDInsight cluster to produce output data. In the activities section of the pipeline JSON, you see only one activity in the array, with type set to HDInsightHive.
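As a side note on the active period: the start and end values above span a single month, so only one monthly slice is produced. As a sketch (with hypothetical dates), extending the active period would produce one slice per month in the window, for example three slices for April, May, and June 2016:

```json
"start": "2016-04-01T00:00:00Z",
"end": "2016-07-01T00:00:00Z",
"isPaused": false
```

With frequency Month and interval 1 on the output dataset, each month in the active period gets its own slice, and the Hive activity runs once per slice.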
In the typeProperties section that is specific to the HDInsight Hive activity, you specify which Azure Storage linked service holds the Hive script file, the path to the script file, and the parameters to the script file. The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the scriptLinkedService), in the script folder of the adfgetstarted container. The defines section specifies the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}). The start and end properties of the pipeline specify the active period of the pipeline. You configured the dataset to be produced monthly; therefore, only one slice is produced by the pipeline (because the start and end dates fall in the same month). In the activity JSON, you specify that the Hive script runs on the compute specified by the linkedServiceName – HDInsightOnDemandLinkedService.
4. Save the HiveActivity1.json file.

Add partitionweblogs.hql and input.log as a dependency
1. Right-click Dependencies in the Solution Explorer window, point to Add, and click Existing Item.
2. Navigate to C:\ADFGettingStarted, select the partitionweblogs.hql and input.log files, and click Add. You created these two files as part of the prerequisites from the Tutorial Overview. When you publish the solution in the next step, the partitionweblogs.hql file is uploaded to the script folder in the adfgetstarted blob container.

Publish/deploy Data Factory entities
In this step, you publish the Data Factory entities (linked services, datasets, and pipeline) in your project to the Azure Data Factory service. In the process of publishing, you specify the name for your data factory.
1. Right-click the project in the Solution Explorer, and click Publish.
2. If you see the Sign in to your Microsoft account dialog box, enter the credentials for the account that has the Azure subscription, and click Sign in.
3.
You should see the following dialog box:
4. In the Configure data factory page, do the following steps:
   a. Select the Create New Data Factory option.
   b. Enter a unique name for the data factory. For example: DataFactoryUsingVS09152016. The name must be globally unique.
   c. Select the right subscription for the Subscription field.

      IMPORTANT
      If you do not see any subscription, ensure that you logged in using an account that is an admin or co-admin of the subscription.

   d. Select the resource group for the data factory to be created.
   e. Select the region for the data factory.
   f. Click Next to switch to the Publish Items page. (Press TAB to move out of the Name field if the Next button is disabled.)

   IMPORTANT
   If you receive the error Data factory name "DataFactoryUsingVS" is not available when publishing, change the name (for example, yournameDataFactoryUsingVS). See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.

5. In the Publish Items page, ensure that all the Data Factory entities are selected, and click Next to switch to the Summary page.
6. Review the summary and click Next to start the deployment process and view the Deployment Status.
7. In the Deployment Status page, you should see the status of the deployment process. Click Finish after the deployment is done.

Important points to note:
If you receive the error This subscription is not registered to use namespace Microsoft.DataFactory, do one of the following and try publishing again:
   In Azure PowerShell, run the following command to register the Data Factory provider:

   ```powershell
   Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory
   ```

   You can run the following command to confirm that the Data Factory provider is registered:

   ```powershell
   Get-AzureRmResourceProvider
   ```

   Log in to the Azure portal using the Azure subscription and navigate to a Data Factory blade, or create a data factory in the Azure portal. This action automatically registers the provider for you.
The name of the data factory may be registered as a DNS name in the future, and hence become publicly visible.

To create Data Factory instances, you need to be an admin or co-admin of the Azure subscription.

Monitor pipeline
In this step, you monitor the pipeline using the Diagram View of the data factory.

Monitor pipeline using Diagram View
1. Log in to the Azure portal, and do the following steps:
   a. Click More services and click Data factories.
   b. Select the name of your data factory (for example: DataFactoryUsingVS09152016) from the list of data factories.
2. In the home page for your data factory, click Diagram.
3. In the Diagram View, you see an overview of the pipelines and datasets used in this tutorial.
4. To view all activities in the pipeline, right-click the pipeline in the diagram and click Open Pipeline.
5. Confirm that you see the HDInsightHive activity in the pipeline. To navigate back to the previous view, click Data factory in the breadcrumb menu at the top.
6. In the Diagram View, double-click the dataset AzureBlobInput. Confirm that the slice is in the Ready state. It may take a couple of minutes for the slice to show up in the Ready state. If it does not happen after you wait for some time, check whether the input file (input.log) is placed in the right container (adfgetstarted) and folder (inputdata). And, make sure that the external property on the input dataset is set to true.
7. Click X to close the AzureBlobInput blade.
8. In the Diagram View, double-click the dataset AzureBlobOutput. You see the slice that is currently being processed.
9. When processing is done, you see the slice in the Ready state.

IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately 30 minutes to process the slice.

10. When the slice is in the Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the output data.
11. Click the slice to see details about it in a Data slice blade.
12. Click an activity run in the Activity runs list to see details about an activity run (the Hive activity in our scenario) in an Activity run details window. From the log files, you can see the Hive query that was executed and status information. These logs are useful for troubleshooting issues. See Monitor datasets and pipeline for instructions on how to use the Azure portal to monitor the pipeline and datasets you have created in this tutorial.

Monitor pipeline using Monitor & Manage App
You can also use the Monitor & Manage application to monitor your pipelines. For detailed information about using this application, see Monitor and manage Azure Data Factory pipelines using Monitoring and Management App.
1. Click the Monitor & Manage tile.
2. You should see the Monitor & Manage application. Change the Start time and End time to match the start (04-01-2016 12:00 AM) and end (04-02-2016 12:00 AM) times of your pipeline, and click Apply.
3. To see details about an activity window, select it in the Activity Windows list.

IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

Additional notes
- A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store, and an HDInsight Hive activity to run a Hive script to transform input data. See supported data stores for all the sources and sinks supported by the Copy Activity. See compute linked services for the list of compute services supported by Data Factory.
- Linked services link data stores or compute services to an Azure data factory. See supported data stores for all the sources and sinks supported by the Copy Activity.
- See compute linked services for the list of compute services supported by Data Factory and the transformation activities that can run on them.
- See Move data from/to Azure Blob for details about the JSON properties used in the Azure Storage linked service definition.
- You could use your own HDInsight cluster instead of an on-demand HDInsight cluster. See Compute Linked Services for details.
- Data Factory creates a Windows-based HDInsight cluster for you with the preceding JSON. You could also have it create a Linux-based HDInsight cluster. See On-demand HDInsight Linked Service for details.
- The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed, unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done. As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: adf<yourdatafactoryname>-<linkedservicename>-datetimestamp. Use tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage.
- Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset.
- This tutorial does not show how to copy data by using Azure Data Factory. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.

Use Server Explorer to view data factories
1. In Visual Studio, click View on the menu, and click Server Explorer.
2.
In the Server Explorer window, expand Azure and expand Data Factory. If you see Sign in to Visual Studio, enter the account associated with your Azure subscription and click Continue. Enter the password, and click Sign in. Visual Studio tries to get information about all Azure data factories in your subscription. You see the status of this operation in the Data Factory Task List window.
3. You can right-click a data factory, and select Export Data Factory to New Project to create a Visual Studio project based on an existing data factory.

Update Data Factory tools for Visual Studio
To update the Azure Data Factory tools for Visual Studio, do the following steps:
1. Click Tools on the menu and select Extensions and Updates.
2. Select Updates in the left pane and then select Visual Studio Gallery.
3. Select Azure Data Factory tools for Visual Studio and click Update. If you do not see this entry, you already have the latest version of the tools.

Use configuration files
You can use configuration files in Visual Studio to configure properties for linked services/tables/pipelines differently for each environment. Consider the following JSON definition for an Azure Storage linked service. Suppose that you want to specify connectionString with different values for accountname and accountkey based on the environment (Dev/Test/Production) to which you are deploying Data Factory entities. You can achieve this behavior by using a separate configuration file for each environment.

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "description": "",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

Add a configuration file
Add a configuration file for each environment by performing the following steps:
1. Right-click the Data Factory project in your Visual Studio solution, point to Add, and click New item.
2.
Select Config from the list of installed templates on the left, select Configuration File, enter a name for the configuration file, and click Add.
3. Add configuration parameters and their values in the following format:

```json
{
  "$schema": "http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json",
  "AzureStorageLinkedService1": [
    {
      "name": "$.properties.typeProperties.connectionString",
      "value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  ],
  "AzureSqlLinkedService1": [
    {
      "name": "$.properties.typeProperties.connectionString",
      "value": "Server=tcp:spsqlserver.database.windows.net,1433;Database=spsqldb;User ID=spelluru;Password=Sowmya123;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  ]
}
```

This example configures the connectionString property of an Azure Storage linked service and an Azure SQL linked service. Notice that the syntax for specifying name is JsonPath.

If the JSON has a property that has an array of values, as shown in the following code:

```json
"structure": [
  { "name": "FirstName", "type": "String" },
  { "name": "LastName", "type": "String" }
],
```

configure the properties as shown in the following configuration file (use zero-based indexing):

```json
{ "name": "$.properties.structure[0].name", "value": "FirstName" }
{ "name": "$.properties.structure[0].type", "value": "String" }
{ "name": "$.properties.structure[1].name", "value": "LastName" }
{ "name": "$.properties.structure[1].type", "value": "String" }
```

Property names with spaces
If a property name has spaces in it, use square brackets as shown in the following example (Database server name):

```json
{
  "name": "$.properties.activities[1].typeProperties.webServiceParameters.['Database server name']",
  "value": "MyAsqlServer.database.windows.net"
}
```

Deploy solution using a configuration
When you publish Azure Data Factory entities in Visual Studio, you can specify the configuration that you want to use for that publishing operation.
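To make the per-environment idea concrete, you might keep one configuration file per environment. The file names and account values below are hypothetical assumptions for illustration. A Config-Dev.json might override the storage connection string with a development value:

```json
{
  "$schema": "http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json",
  "AzureStorageLinkedService1": [
    {
      "name": "$.properties.typeProperties.connectionString",
      "value": "DefaultEndpointsProtocol=https;AccountName=<devaccountname>;AccountKey=<devaccountkey>"
    }
  ]
}
```

while a Config-Prod.json overrides the same JsonPath with the production value:

```json
{
  "$schema": "http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json",
  "AzureStorageLinkedService1": [
    {
      "name": "$.properties.typeProperties.connectionString",
      "value": "DefaultEndpointsProtocol=https;AccountName=<prodaccountname>;AccountKey=<prodaccountkey>"
    }
  ]
}
```

At publish time, choosing one of these files substitutes its values into the linked service JSON before the entities are deployed, so the JSON definitions themselves stay environment-neutral.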
To publish entities in an Azure Data Factory project by using a configuration file:
1. Right-click the Data Factory project and click Publish to see the Publish Items dialog box.
2. Select an existing data factory or specify values for creating a data factory on the Configure data factory page, and click Next.
3. On the Publish Items page, you see a drop-down list with the available configurations for the Select Deployment Config field.
4. Select the configuration file that you would like to use and click Next.
5. Confirm that you see the name of the JSON file in the Summary page and click Next.
6. Click Finish after the deployment operation is finished.
When you deploy, the values from the configuration file are used to set values for properties in the JSON files before the entities are deployed to the Azure Data Factory service.

Use Azure Key Vault
It is not advisable, and often against security policy, to commit sensitive data such as connection strings to the code repository. See the ADF Secure Publish sample on GitHub to learn about storing sensitive information in Azure Key Vault and using it while publishing Data Factory entities. The Secure Publish extension for Visual Studio allows the secrets to be stored in Key Vault; only references to them are specified in linked services and deployment configurations. These references are resolved when you publish Data Factory entities to Azure. The files can then be committed to the source repository without exposing any secrets.

Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop cluster. You used Visual Studio to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
   a. An Azure Storage linked service to link your Azure blob storage that holds the input/output files to the data factory.
   b. An Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory.
Azure Data Factory creates an HDInsight Hadoop cluster just-in-time to process input data and produce output data.

3. Created two datasets, which describe input and output data for the HDInsight Hive activity in the pipeline.
4. Created a pipeline with an HDInsight Hive activity.

Next Steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure SQL, see Tutorial: Copy data from an Azure blob to Azure SQL. You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.

See Also

Pipelines: This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct data-driven workflows for your scenario or business.
Datasets: This article helps you understand datasets in Azure Data Factory.
Data Transformation Activities: This article provides a list of data transformation activities (such as the HDInsight Hive transformation you used in this tutorial) supported by Azure Data Factory.
Scheduling and execution: This article explains the scheduling and execution aspects of the Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App: This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App.

Tutorial: Build your first Azure data factory using Azure PowerShell 6/13/2017 • 14 min to read

In this article, you use Azure PowerShell to create your first Azure data factory. To do the tutorial using other tools/SDKs, select one of the options from the drop-down list. The pipeline in this tutorial has one activity: HDInsight Hive activity.
This activity runs a Hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times.

NOTE
The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source data store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.

A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution in Data Factory.

Prerequisites
- Read through the Tutorial Overview article and complete the prerequisite steps.
- Follow the instructions in the How to install and configure Azure PowerShell article to install the latest version of Azure PowerShell on your computer.
- (Optional) This article does not cover all the Data Factory cmdlets. See the Data Factory Cmdlet Reference for comprehensive documentation on Data Factory cmdlets.

Create data factory
In this step, you use Azure PowerShell to create an Azure data factory named FirstDataFactoryPSH. A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store and an HDInsight Hive activity to run a Hive script to transform input data. Let's start with creating the data factory in this step.

1. Start Azure PowerShell and run the following command. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run these commands again.

   Run the following command and enter the user name and password that you use to sign in to the Azure portal.

       Login-AzureRmAccount

   Run the following command to view all the subscriptions for this account.
       Get-AzureRmSubscription

   Run the following command to select the subscription that you want to work with. This subscription should be the same as the one you used in the Azure portal.

       Get-AzureRmSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzureRmContext

2. Create an Azure resource group named ADFTutorialResourceGroup by running the following command:

       New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"

   Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup. If you use a different resource group, you need to use it in place of ADFTutorialResourceGroup in this tutorial.

3. Run the New-AzureRmDataFactory cmdlet to create a data factory named FirstDataFactoryPSH.

       New-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH -Location "West US"

   Note the following points:
   - The name of the Azure data factory must be globally unique. If you receive the error Data factory name "FirstDataFactoryPSH" is not available, change the name (for example, yournameFirstDataFactoryPSH). Use this name in place of FirstDataFactoryPSH while performing the steps in this tutorial. See the Data Factory Naming Rules topic for naming rules for Data Factory artifacts.
   - To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription.
   - The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.
   - If you receive the error "This subscription is not registered to use namespace Microsoft.DataFactory", do one of the following and try publishing again:

     In Azure PowerShell, run the following command to register the Data Factory provider:

         Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

     You can run the following command to confirm that the Data Factory provider is registered:

         Get-AzureRmResourceProvider

     Alternatively, sign in to the Azure portal with the Azure subscription and navigate to a Data Factory blade (or) create a data factory in the Azure portal. This action automatically registers the provider for you.

Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to link data stores/computes to your data factory, define input and output datasets to represent input/output data in linked data stores, and then create the pipeline with an activity that uses these datasets.

Create linked services
In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run a Hive script specified in the activity of the pipeline in this sample. Identify what data store/compute services are used in your scenario and link those services to the data factory by creating linked services.

Create Azure Storage linked service
In this step, you link your Azure Storage account to your data factory. You use the same Azure Storage account to store input/output data and the HQL script file.

1. Create a JSON file named StorageLinkedService.json in the C:\ADFGetStarted folder with the following content. Create the folder ADFGetStarted if it does not already exist.
       {
           "name": "StorageLinkedService",
           "properties": {
               "type": "AzureStorage",
               "description": "",
               "typeProperties": {
                   "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
               }
           }
       }

   Replace accountname with the name of your Azure storage account and accountkey with the access key of the Azure storage account. To learn how to get your storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your storage account.

2. In Azure PowerShell, switch to the ADFGetStarted folder.

3. You can use the New-AzureRmDataFactoryLinkedService cmdlet to create a linked service. This cmdlet, and other Data Factory cmdlets you use in this tutorial, require you to pass values for the ResourceGroupName and DataFactoryName parameters. Alternatively, you can use Get-AzureRmDataFactory to get a DataFactory object and pass the object without typing ResourceGroupName and DataFactoryName each time you run a cmdlet. Run the following command to assign the output of the Get-AzureRmDataFactory cmdlet to a $df variable.

       $df = Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH

4. Now, run the New-AzureRmDataFactoryLinkedService cmdlet to create the linked service StorageLinkedService.

       New-AzureRmDataFactoryLinkedService $df -File .\StorageLinkedService.json

   If you hadn't run the Get-AzureRmDataFactory cmdlet and assigned the output to the $df variable, you would have to specify values for the ResourceGroupName and DataFactoryName parameters as follows.

       New-AzureRmDataFactoryLinkedService -ResourceGroupName ADFTutorialResourceGroup -DataFactoryName FirstDataFactoryPSH -File .\StorageLinkedService.json

   If you close Azure PowerShell in the middle of the tutorial, you have to run the Get-AzureRmDataFactory cmdlet the next time you start Azure PowerShell to complete the tutorial.
Create Azure HDInsight linked service
In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime and deleted after it is done processing and has been idle for the specified amount of time. You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute Linked Services for details.

1. Create a JSON file named HDInsightOnDemandLinkedService.json in the C:\ADFGetStarted folder with the following content.

       {
           "name": "HDInsightOnDemandLinkedService",
           "properties": {
               "type": "HDInsightOnDemand",
               "typeProperties": {
                   "clusterSize": 1,
                   "timeToLive": "00:30:00",
                   "linkedServiceName": "StorageLinkedService"
               }
           }
       }

   The following list describes the JSON properties used in the snippet:

   clusterSize: Specifies the size of the HDInsight cluster.
   timeToLive: Specifies the idle time for the HDInsight cluster, before it is deleted.
   linkedServiceName: Specifies the storage account that is used to store the logs that are generated by HDInsight.

   Note the following points:
   - The Data Factory creates a Windows-based HDInsight cluster for you with the JSON. You could also have it create a Linux-based HDInsight cluster. See On-demand HDInsight Linked Service for details.
   - You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See HDInsight Linked Service for details.
   - The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done. As more slices are processed, you see many containers in your Azure blob storage.
     If you do not need them for troubleshooting the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp". Use tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage. See On-demand HDInsight Linked Service for details.

2. Run the New-AzureRmDataFactoryLinkedService cmdlet to create the linked service called HDInsightOnDemandLinkedService.

       New-AzureRmDataFactoryLinkedService $df -File .\HDInsightOnDemandLinkedService.json

Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the StorageLinkedService you created earlier in this tutorial. The linked service points to an Azure Storage account, and the datasets specify the container, folder, and file name in the storage that holds the input and output data.

Create input dataset
1. Create a JSON file named InputTable.json in the C:\ADFGetStarted folder with the following content:

       {
           "name": "AzureBlobInput",
           "properties": {
               "type": "AzureBlob",
               "linkedServiceName": "StorageLinkedService",
               "typeProperties": {
                   "fileName": "input.log",
                   "folderPath": "adfgetstarted/inputdata",
                   "format": {
                       "type": "TextFormat",
                       "columnDelimiter": ","
                   }
               },
               "availability": {
                   "frequency": "Month",
                   "interval": 1
               },
               "external": true,
               "policy": {}
           }
       }

   The JSON defines a dataset named AzureBlobInput, which represents input data for an activity in the pipeline. In addition, it specifies that the input data is located in the blob container called adfgetstarted and the folder called inputdata. The following list describes the JSON properties used in the snippet:

   type: The type property is set to AzureBlob because the data resides in Azure blob storage.
   linkedServiceName: Refers to the StorageLinkedService you created earlier.
   fileName: This property is optional.
   If you omit this property, all the files from the folderPath are picked up. Because fileName is specified here, only input.log is processed.
   type: The log files are in text format, so we use TextFormat.
   columnDelimiter: Columns in the log files are delimited by the comma character (,).
   frequency/interval: frequency is set to Month and interval is 1, which means that the input slices are available monthly.
   external: This property is set to true because the input data is not generated by the Data Factory service.

2. Run the following command in Azure PowerShell to create the Data Factory dataset:

       New-AzureRmDataFactoryDataset $df -File .\InputTable.json

Create output dataset
Now, you create the output dataset to represent the output data stored in the Azure Blob storage.

1. Create a JSON file named OutputTable.json in the C:\ADFGetStarted folder with the following content:

       {
           "name": "AzureBlobOutput",
           "properties": {
               "type": "AzureBlob",
               "linkedServiceName": "StorageLinkedService",
               "typeProperties": {
                   "folderPath": "adfgetstarted/partitioneddata",
                   "format": {
                       "type": "TextFormat",
                       "columnDelimiter": ","
                   }
               },
               "availability": {
                   "frequency": "Month",
                   "interval": 1
               }
           }
       }

   The JSON defines a dataset named AzureBlobOutput, which represents output data for an activity in the pipeline. In addition, it specifies that the results are stored in the blob container called adfgetstarted and the folder called partitioneddata. The availability section specifies that the output dataset is produced on a monthly basis.

2. Run the following command in Azure PowerShell to create the Data Factory dataset:

       New-AzureRmDataFactoryDataset $df -File .\OutputTable.json

Create pipeline
In this step, you create your first pipeline with an HDInsightHive activity. The input slice is available monthly (frequency: Month, interval: 1), the output slice is produced monthly, and the scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match.
Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the following JSON are explained at the end of this section.

1. Create a JSON file named MyFirstPipelinePSH.json in the C:\ADFGetStarted folder with the following content:

   IMPORTANT
   Replace storageaccountname with the name of your storage account in the JSON.

       {
           "name": "MyFirstPipeline",
           "properties": {
               "description": "My first Azure Data Factory pipeline",
               "activities": [
                   {
                       "type": "HDInsightHive",
                       "typeProperties": {
                           "scriptPath": "adfgetstarted/script/partitionweblogs.hql",
                           "scriptLinkedService": "StorageLinkedService",
                           "defines": {
                               "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
                               "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
                           }
                       },
                       "inputs": [ { "name": "AzureBlobInput" } ],
                       "outputs": [ { "name": "AzureBlobOutput" } ],
                       "policy": {
                           "concurrency": 1,
                           "retry": 3
                       },
                       "scheduler": {
                           "frequency": "Month",
                           "interval": 1
                       },
                       "name": "RunSampleHiveActivity",
                       "linkedServiceName": "HDInsightOnDemandLinkedService"
                   }
               ],
               "start": "2016-04-01T00:00:00Z",
               "end": "2016-04-02T00:00:00Z",
               "isPaused": false
           }
       }

   In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on an HDInsight cluster. The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the scriptLinkedService, called StorageLinkedService), in the script folder in the container adfgetstarted. The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}). The start and end properties of the pipeline specify the active period of the pipeline.
In the activity JSON, you specify that the Hive script runs on the compute specified by the linkedServiceName: HDInsightOnDemandLinkedService.

NOTE
See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties that are used in the example.

2. Confirm that you see the input.log file in the adfgetstarted/inputdata folder in the Azure blob storage, and run the following command to deploy the pipeline. Because the start and end times are set in the past and isPaused is set to false, the pipeline (the activity in the pipeline) runs immediately after you deploy it.

       New-AzureRmDataFactoryPipeline $df -File .\MyFirstPipelinePSH.json

3. Congratulations, you have successfully created your first pipeline using Azure PowerShell!

Monitor pipeline
In this step, you use Azure PowerShell to monitor what's going on in an Azure data factory.

1. Run Get-AzureRmDataFactory and assign the output to a $df variable.

       $df = Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH

2. Run Get-AzureRmDataFactorySlice to get details about all slices of AzureBlobOutput, which is the output dataset of the pipeline.

       Get-AzureRmDataFactorySlice $df -DatasetName AzureBlobOutput -StartDateTime 2016-04-01

   Notice that the StartDateTime you specify here is the same start time specified in the pipeline JSON. Here is the sample output:

       ResourceGroupName : ADFTutorialResourceGroup
       DataFactoryName   : FirstDataFactoryPSH
       DatasetName       : AzureBlobOutput
       Start             : 4/1/2016 12:00:00 AM
       End               : 4/2/2016 12:00:00 AM
       RetryCount        : 0
       State             : InProgress
       SubState          :
       LatencyStatus     :
       LongRetryCount    : 0

3. Run Get-AzureRmDataFactoryRun to get the details of activity runs for a specific slice.
       Get-AzureRmDataFactoryRun $df -DatasetName AzureBlobOutput -StartDateTime 2016-04-01

   Here is the sample output:

       Id                  : 0f6334f2-d56c-4d48-b427-d4f0fb4ef883_635268096000000000_635292288000000000_AzureBlobOutput
       ResourceGroupName   : ADFTutorialResourceGroup
       DataFactoryName     : FirstDataFactoryPSH
       DatasetName         : AzureBlobOutput
       ProcessingStartTime : 12/18/2015 4:50:33 AM
       ProcessingEndTime   : 12/31/9999 11:59:59 PM
       PercentComplete     : 0
       DataSliceStart      : 4/1/2016 12:00:00 AM
       DataSliceEnd        : 4/2/2016 12:00:00 AM
       Status              : AllocatingResources
       Timestamp           : 12/18/2015 4:50:33 AM
       RetryAttempt        : 0
       Properties          : {}
       ErrorMessage        :
       ActivityName        : RunSampleHiveActivity
       PipelineName        : MyFirstPipeline
       Type                : Script

   You can keep running this cmdlet until you see the slice in Ready state or Failed state. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the output data.

IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately 30 minutes to process the slice. The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop cluster. You used Azure PowerShell to do the following steps:

1. Created an Azure data factory.
2. Created two linked services:
   a. An Azure Storage linked service to link your Azure blob storage that holds input/output files to the data factory.
   b. An Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory.
Azure Data Factory creates an HDInsight Hadoop cluster just-in-time to process input data and produce output data.

3. Created two datasets, which describe the input and output data for the HDInsight Hive activity in the pipeline.
4. Created a pipeline with an HDInsight Hive activity.

Next steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure SQL, see Tutorial: Copy data from an Azure Blob to Azure SQL.

See Also

Data Factory Cmdlet Reference: See comprehensive documentation on Data Factory cmdlets.
Pipelines: This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business.
Datasets: This article helps you understand datasets in Azure Data Factory.
Scheduling and Execution: This article explains the scheduling and execution aspects of the Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App: This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App.

Tutorial: Build your first Azure data factory using Azure Resource Manager template 6/13/2017 • 12 min to read

In this article, you use an Azure Resource Manager template to create your first Azure data factory. To do the tutorial using other tools/SDKs, select one of the options from the drop-down list. The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a Hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times.

NOTE
The data pipeline in this tutorial transforms input data to produce output data.
For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.

The pipeline in this tutorial has only one activity of type HDInsightHive. A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution in Data Factory.

Prerequisites
- Read through the Tutorial Overview article and complete the prerequisite steps.
- Follow the instructions in the How to install and configure Azure PowerShell article to install the latest version of Azure PowerShell on your computer.
- See Authoring Azure Resource Manager Templates to learn about Azure Resource Manager templates.

In this tutorial

Azure Storage linked service: Links your Azure Storage account to the data factory. The Azure Storage account holds the input and output data for the pipeline in this sample.
HDInsight on-demand linked service: Links an on-demand HDInsight cluster to the data factory. The cluster is automatically created for you to process data and is deleted after the processing is done.
Azure Blob input dataset: Refers to the Azure Storage linked service. The linked service refers to an Azure Storage account and the Azure Blob dataset specifies the container, folder, and file name in the storage that holds the input data.
Azure Blob output dataset: Refers to the Azure Storage linked service. The linked service refers to an Azure Storage account and the Azure Blob dataset specifies the container, folder, and file name in the storage that holds the output data.
Data pipeline: The pipeline has one activity of type HDInsightHive, which consumes the input dataset and produces the output dataset.

A data factory can have one or more pipelines. A pipeline can have one or more activities in it.
There are two types of activities: data movement activities and data transformation activities. In this tutorial, you create a pipeline with one activity (Hive activity). The following section provides the complete Resource Manager template for defining Data Factory entities so that you can quickly run through the tutorial and test the template. To understand how each Data Factory entity is defined, see the Data Factory entities in the template section.

Data Factory JSON template
The top-level Resource Manager template for defining a data factory is:

    {
        "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": { ... },
        "variables": { ... },
        "resources": [
            {
                "name": "[parameters('dataFactoryName')]",
                "apiVersion": "[variables('apiVersion')]",
                "type": "Microsoft.DataFactory/datafactories",
                "location": "westus",
                "resources": [
                    { ... },
                    { ... },
                    { ... },
                    { ... }
                ]
            }
        ]
    }

Create a JSON file named ADFTutorialARM.json in the C:\ADFGetStarted folder with the following content:

    {
        "contentVersion": "1.0.0.0",
        "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "parameters": {
            "storageAccountName": {
                "type": "string",
                "metadata": { "description": "Name of the Azure storage account that contains the input/output data." }
            },
            "storageAccountKey": {
                "type": "securestring",
                "metadata": { "description": "Key for the Azure storage account." }
            },
            "blobContainer": {
                "type": "string",
                "metadata": { "description": "Name of the blob container in the Azure Storage account." }
            },
            "inputBlobFolder": {
                "type": "string",
                "metadata": { "description": "The folder in the blob container that has the input file." }
            },
            "inputBlobName": {
                "type": "string",
                "metadata": { "description": "Name of the input file/blob." }
            },
            "outputBlobFolder": {
                "type": "string",
                "metadata": { "description": "The folder in the blob container that will hold the transformed data." }
            },
            "hiveScriptFolder": {
                "type": "string",
                "metadata": { "description": "The folder in the blob container that contains the Hive query file." }
            },
            "hiveScriptFile": {
                "type": "string",
                "metadata": { "description": "Name of the hive query (HQL) file." }
            }
        },
        "variables": {
            "dataFactoryName": "[concat('HiveTransformDF', uniqueString(resourceGroup().id))]",
            "azureStorageLinkedServiceName": "AzureStorageLinkedService",
            "hdInsightOnDemandLinkedServiceName": "HDInsightOnDemandLinkedService",
            "blobInputDatasetName": "AzureBlobInput",
            "blobOutputDatasetName": "AzureBlobOutput",
            "pipelineName": "HiveTransformPipeline"
        },
        "resources": [
            {
                "name": "[variables('dataFactoryName')]",
                "apiVersion": "2015-10-01",
                "type": "Microsoft.DataFactory/datafactories",
                "location": "West US",
                "resources": [
                    {
                        "type": "linkedservices",
                        "name": "[variables('azureStorageLinkedServiceName')]",
                        "dependsOn": [ "[variables('dataFactoryName')]" ],
                        "apiVersion": "2015-10-01",
                        "properties": {
                            "type": "AzureStorage",
                            "description": "Azure Storage linked service",
                            "typeProperties": {
                                "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]"
                            }
                        }
                    },
                    {
                        "type": "linkedservices",
                        "name": "[variables('hdInsightOnDemandLinkedServiceName')]",
                        "dependsOn": [ "[variables('dataFactoryName')]", "[variables('azureStorageLinkedServiceName')]" ],
                        "apiVersion": "2015-10-01",
                        "properties": {
                            "type": "HDInsightOnDemand",
                            "typeProperties": {
                                "clusterSize": 1,
                                "timeToLive": "00:05:00",
                                "osType": "windows",
                                "linkedServiceName": "[variables('azureStorageLinkedServiceName')]"
                            }
                        }
                    },
                    {
                        "type": "datasets",
                        "name": "[variables('blobInputDatasetName')]",
                        "dependsOn": [ "[variables('dataFactoryName')]", "[variables('azureStorageLinkedServiceName')]" ],
                        "apiVersion": "2015-10-01",
                        "properties": {
                            "type": "AzureBlob",
                            "linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
                            "typeProperties": {
                                "fileName": "[parameters('inputBlobName')]",
                                "folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'))]",
                                "format": { "type": "TextFormat", "columnDelimiter": "," }
                            },
                            "availability": { "frequency": "Month", "interval": 1 },
                            "external": true
                        }
                    },
                    {
                        "type": "datasets",
                        "name": "[variables('blobOutputDatasetName')]",
                        "dependsOn": [ "[variables('dataFactoryName')]", "[variables('azureStorageLinkedServiceName')]" ],
                        "apiVersion": "2015-10-01",
                        "properties": {
                            "type": "AzureBlob",
                            "linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
                            "typeProperties": {
                                "folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'))]",
                                "format": { "type": "TextFormat", "columnDelimiter": "," }
                            },
                            "availability": { "frequency": "Month", "interval": 1 }
                        }
                    },
                    {
                        "type": "datapipelines",
                        "name": "[variables('pipelineName')]",
                        "dependsOn": [ "[variables('dataFactoryName')]", "[variables('azureStorageLinkedServiceName')]", "[variables('hdInsightOnDemandLinkedServiceName')]", "[variables('blobInputDatasetName')]", "[variables('blobOutputDatasetName')]" ],
                        "apiVersion": "2015-10-01",
                        "properties": {
                            "description": "Pipeline that transforms data using Hive script.",
                            "activities": [
                                {
                                    "type": "HDInsightHive",
                                    "typeProperties": {
                                        "scriptPath": "[concat(parameters('blobContainer'), '/', parameters('hiveScriptFolder'), '/', parameters('hiveScriptFile'))]",
                                        "scriptLinkedService": "[variables('azureStorageLinkedServiceName')]",
                                        "defines": {
                                            "inputtable": "[concat('wasb://', parameters('blobContainer'), '@', parameters('storageAccountName'), '.blob.core.windows.net/', parameters('inputBlobFolder'))]",
                                            "partitionedtable": "[concat('wasb://', parameters('blobContainer'), '@', parameters('storageAccountName'), '.blob.core.windows.net/', parameters('outputBlobFolder'))]"
                                        }
                                    },
                                    "inputs": [ { "name": "[variables('blobInputDatasetName')]" } ],
                                    "outputs": [ { "name": "[variables('blobOutputDatasetName')]" } ],
                                    "policy": {
                                        "concurrency": 1,
                                        "retry": 3
                                    },
                                    "scheduler": { "frequency": "Month", "interval": 1 },
                                    "name": "RunSampleHiveActivity",
                                    "linkedServiceName": "[variables('hdInsightOnDemandLinkedServiceName')]"
                                }
                            ],
                            "start": "2016-10-01T00:00:00Z",
                            "end": "2016-10-02T00:00:00Z",
                            "isPaused": false
                        }
                    }
                ]
            }
        ]
    }

NOTE
You can find another example of a Resource Manager template for creating an Azure data factory in Tutorial: Create a pipeline with Copy Activity using an Azure Resource Manager template.

Parameters JSON
Create a JSON file named ADFTutorialARM-Parameters.json that contains parameters for the Azure Resource Manager template.

IMPORTANT
Specify the name and key of your Azure Storage account for the storageAccountName and storageAccountKey parameters in this parameter file.

    {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "storageAccountName": { "value": "<Name of your Azure Storage account>" },
            "storageAccountKey": { "value": "<Key of your Azure Storage account>" },
            "blobContainer": { "value": "adfgetstarted" },
            "inputBlobFolder": { "value": "inputdata" },
            "inputBlobName": { "value": "input.log" },
            "outputBlobFolder": { "value": "partitioneddata" },
            "hiveScriptFolder": { "value": "script" },
            "hiveScriptFile": { "value": "partitionweblogs.hql" }
        }
    }

IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you can use with the same Data Factory JSON template. By using a PowerShell script, you can automate deploying Data Factory entities in these environments.

Create data factory
1. Start Azure PowerShell and run the following command:

   Run the following command and enter the user name and password that you use to sign in to the Azure portal.

       Login-AzureRmAccount

   Run the following command to view all the subscriptions for this account.
PowerShell Get-AzureRmSubscription Run the following command to select the subscription that you want to work with. This subscription should be the same as the one you used in the Azure portal. Get-AzureRmSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzureRmContext 2. Run the following command to deploy Data Factory entities using the Resource Manager template you created in Step 1. New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFTutorialARM-Parameters.json Monitor pipeline 1. After logging in to the Azure portal, click Browse and select Data factories. 2. In the Data Factories blade, click the data factory (TutorialFactoryARM) you created. 3. In the Data Factory blade for your data factory, click Diagram. 4. In the Diagram View, you see an overview of the pipelines and datasets used in this tutorial. 5. In the Diagram View, double-click the dataset AzureBlobOutput. You see the slice that is currently being processed. 6. When processing is done, you see the slice in the Ready state. Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately 30 minutes to process the slice. 7. When the slice is in the Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the output data. See Monitor datasets and pipeline for instructions on how to use the Azure portal blades to monitor the pipeline and datasets you have created in this tutorial. You can also use the Monitor and Manage App to monitor your data pipelines. See Monitor and manage Azure Data Factory pipelines using Monitoring App for details about using the application. IMPORTANT The input file gets deleted when the slice is processed successfully.
Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container. Data Factory entities in the template Define data factory You define a data factory in the Resource Manager template as shown in the following sample: "resources": [ { "name": "[variables('dataFactoryName')]", "apiVersion": "2015-10-01", "type": "Microsoft.DataFactory/datafactories", "location": "West US" } The dataFactoryName is defined as: "dataFactoryName": "[concat('HiveTransformDF', uniqueString(resourceGroup().id))]", It is a unique string based on the resource group ID. Defining Data Factory entities The following Data Factory entities are defined in the JSON template: Azure Storage linked service HDInsight on-demand linked service Azure blob input dataset Azure blob output dataset Data pipeline with an HDInsight Hive activity Azure Storage linked service You specify the name and key of your Azure storage account in this section. See Azure Storage linked service for details about JSON properties used to define an Azure Storage linked service. { "type": "linkedservices", "name": "[variables('azureStorageLinkedServiceName')]", "dependsOn": [ "[variables('dataFactoryName')]" ], "apiVersion": "2015-10-01", "properties": { "type": "AzureStorage", "description": "Azure Storage linked service", "typeProperties": { "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]" } } } The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these parameters are passed by using a configuration file. The definition also uses the azureStorageLinkedServiceName and dataFactoryName variables defined in the template. HDInsight on-demand linked service See the Compute linked services article for details about JSON properties used to define an HDInsight on-demand linked service.
{ "type": "linkedservices", "name": "[variables('hdInsightOnDemandLinkedServiceName')]", "dependsOn": [ "[variables('dataFactoryName')]" ], "apiVersion": "2015-10-01", "properties": { "type": "HDInsightOnDemand", "typeProperties": { "clusterSize": 1, "timeToLive": "00:05:00", "osType": "windows", "linkedServiceName": "[variables('azureStorageLinkedServiceName')]" } } } Note the following points: The Data Factory creates a Windows-based HDInsight cluster for you with the above JSON. You could also have it create a Linux-based HDInsight cluster. See On-demand HDInsight Linked Service for details. You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See HDInsight Linked Service for details. The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice needs to be processed unless there is an existing live cluster (timeToLive) and is deleted when the processing is done. As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp". Use tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage. See On-demand HDInsight Linked Service for details. Azure blob input dataset You specify the names of blob container, folder, and file that contains the input data. See Azure Blob dataset properties for details about JSON properties used to define an Azure Blob dataset. 
{ "type": "datasets", "name": "[variables('blobInputDatasetName')]", "dependsOn": [ "[variables('dataFactoryName')]", "[variables('azureStorageLinkedServiceName')]" ], "apiVersion": "2015-10-01", "properties": { "type": "AzureBlob", "linkedServiceName": "[variables('azureStorageLinkedServiceName')]", "typeProperties": { "fileName": "[parameters('inputBlobName')]", "folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'))]", "format": { "type": "TextFormat", "columnDelimiter": "," } }, "availability": { "frequency": "Month", "interval": 1 }, "external": true } } This definition uses the following parameters defined in parameter template: blobContainer, inputBlobFolder, and inputBlobName. Azure Blob output dataset You specify the names of blob container and folder that holds the output data. See Azure Blob dataset properties for details about JSON properties used to define an Azure Blob dataset. { "type": "datasets", "name": "[variables('blobOutputDatasetName')]", "dependsOn": [ "[variables('dataFactoryName')]", "[variables('azureStorageLinkedServiceName')]" ], "apiVersion": "2015-10-01", "properties": { "type": "AzureBlob", "linkedServiceName": "[variables('azureStorageLinkedServiceName')]", "typeProperties": { "folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'))]", "format": { "type": "TextFormat", "columnDelimiter": "," } }, "availability": { "frequency": "Month", "interval": 1 } } } This definition uses the following parameters defined in the parameter template: blobContainer and outputBlobFolder. Data pipeline You define a pipeline that transform data by running Hive script on an on-demand Azure HDInsight cluster. See Pipeline JSON for descriptions of JSON elements used to define a pipeline in this example. 
{ "type": "datapipelines", "name": "[variables('pipelineName')]", "dependsOn": [ "[variables('dataFactoryName')]", "[variables('azureStorageLinkedServiceName')]", "[variables('hdInsightOnDemandLinkedServiceName')]", "[variables('blobInputDatasetName')]", "[variables('blobOutputDatasetName')]" ], "apiVersion": "2015-10-01", "properties": { "description": "Pipeline that transforms data using Hive script.", "activities": [ { "type": "HDInsightHive", "typeProperties": { "scriptPath": "[concat(parameters('blobContainer'), '/', parameters('hiveScriptFolder'), '/', parameters('hiveScriptFile'))]", "scriptLinkedService": "[variables('azureStorageLinkedServiceName')]", "defines": { "inputtable": "[concat('wasb://', parameters('blobContainer'), '@', parameters('storageAccountName'), '.blob.core.windows.net/', parameters('inputBlobFolder'))]", "partitionedtable": "[concat('wasb://', parameters('blobContainer'), '@', parameters('storageAccountName'), '.blob.core.windows.net/', parameters('outputBlobFolder'))]" } }, "inputs": [ { "name": "[variables('blobInputDatasetName')]" } ], "outputs": [ { "name": "[variables('blobOutputDatasetName')]" } ], "policy": { "concurrency": 1, "retry": 3 }, "scheduler": { "frequency": "Month", "interval": 1 }, "name": "RunSampleHiveActivity", "linkedServiceName": "[variables('hdInsightOnDemandLinkedServiceName')]" } ], "start": "2016-10-01T00:00:00Z", "end": "2016-10-02T00:00:00Z", "isPaused": false } } Reuse the template In the tutorial, you created a template for defining Data Factory entities and a template for passing values for parameters. To use the same template to deploy Data Factory entities to different environments, you create a parameter file for each environment and use it when deploying to that environment. 
Example: New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Dev.json New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Test.json New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Production.json Notice that the first command uses the parameter file for the development environment, the second for the test environment, and the third for the production environment. You can also reuse the template to perform repeated tasks. For example, you need to create many data factories with one or more pipelines that implement the same logic, but each data factory uses different Azure storage and Azure SQL Database accounts. In this scenario, you use the same template in the same environment (dev, test, or production) with different parameter files to create the data factories. Resource Manager template for creating a gateway Here is a sample Resource Manager template for creating a logical gateway in the back end. Install a gateway on your on-premises computer or Azure IaaS VM and register the gateway with the Data Factory service using a key. See Move data between on-premises and cloud for details.
{ "contentVersion": "1.0.0.0", "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#", "parameters": { }, "variables": { "dataFactoryName": "GatewayUsingArmDF", "apiVersion": "2015-10-01", "singleQuote": "'" }, "resources": [ { "name": "[variables('dataFactoryName')]", "apiVersion": "[variables('apiVersion')]", "type": "Microsoft.DataFactory/datafactories", "location": "eastus", "resources": [ { "dependsOn": [ "[concat('Microsoft.DataFactory/dataFactories/', variables('dataFactoryName'))]" ], "type": "gateways", "apiVersion": "[variables('apiVersion')]", "name": "GatewayUsingARM", "properties": { "description": "my gateway" } } ] } ] } This template creates a data factory named GatewayUsingArmDF with a gateway named: GatewayUsingARM. See Also TOPIC DESCRIPTION Pipelines This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct endto-end data-driven workflows for your scenario or business. Datasets This article helps you understand datasets in Azure Data Factory. Scheduling and execution This article explains the scheduling and execution aspects of Azure Data Factory application model. Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App. Tutorial: Build your first Azure data factory using Data Factory REST API 6/13/2017 • 14 min to read • Edit Online In this article, you use Data Factory REST API to create your first Azure data factory. To do the tutorial using other tools/SDKs, select one of the options from the drop-down list. The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times. NOTE This article does not cover all the REST API. 
For comprehensive documentation on the REST API, see the Data Factory REST API Reference. A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution in Data Factory. Prerequisites Read through the Tutorial Overview article and complete the prerequisite steps. Install curl on your machine. You use the curl tool with REST commands to create a data factory. Follow instructions from this article to: 1. Create a Web application named ADFGetStartedApp in Azure Active Directory. 2. Get the client ID and secret key. 3. Get the tenant ID. 4. Assign the ADFGetStartedApp application to the Data Factory Contributor role. Install Azure PowerShell. Launch PowerShell and run the following commands. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run the commands again. 1. Run Login-AzureRmAccount and enter the user name and password that you use to sign in to the Azure portal. 2. Run Get-AzureRmSubscription to view all the subscriptions for this account. 3. Run Get-AzureRmSubscription -SubscriptionName NameOfAzureSubscription | Set-AzureRmContext to select the subscription that you want to work with. Replace NameOfAzureSubscription with the name of your Azure subscription. Create an Azure resource group named ADFTutorialResourceGroup by running the following command in PowerShell: New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US" Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup. If you use a different resource group, you need to use the name of your resource group in place of ADFTutorialResourceGroup in this tutorial. Create JSON definitions Create the following JSON files in the folder where curl.exe is located.
datafactory.json IMPORTANT The name must be globally unique, so you may want to prefix/suffix FirstDataFactoryREST to make it a unique name. { "name": "FirstDataFactoryREST", "location": "WestUS" } azurestoragelinkedservice.json IMPORTANT Replace accountname and accountkey with the name and key of your Azure storage account. To learn how to get your storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your storage account. { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" } } } hdinsightondemandlinkedservice.json { "name": "HDInsightOnDemandLinkedService", "properties": { "type": "HDInsightOnDemand", "typeProperties": { "clusterSize": 1, "timeToLive": "00:30:00", "linkedServiceName": "AzureStorageLinkedService" } } } The following list describes the JSON properties used in the snippet:
clusterSize: Size of the HDInsight cluster.
timeToLive: Specifies the idle time for the HDInsight cluster before it is deleted.
linkedServiceName: Specifies the storage account that is used to store the logs that are generated by HDInsight.
Note the following points: The Data Factory creates a Windows-based HDInsight cluster for you with the above JSON. You could also have it create a Linux-based HDInsight cluster. See On-demand HDInsight Linked Service for details. You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See HDInsight Linked Service for details. The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design.
With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed, unless an existing live cluster is available (within timeToLive); the cluster is deleted when the processing is done. As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp". Use tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage. See On-demand HDInsight Linked Service for details. inputdataset.json { "name": "AzureBlobInput", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "fileName": "input.log", "folderPath": "adfgetstarted/inputdata", "format": { "type": "TextFormat", "columnDelimiter": "," } }, "availability": { "frequency": "Month", "interval": 1 }, "external": true, "policy": {} } } The JSON defines a dataset named AzureBlobInput, which represents input data for an activity in the pipeline. In addition, it specifies that the input data is located in the blob container called adfgetstarted and the folder called inputdata. The following list describes the JSON properties used in the snippet:
type: The type property is set to AzureBlob because the data resides in Azure blob storage.
linkedServiceName: Refers to the AzureStorageLinkedService you created earlier.
fileName: This property is optional. If you omit this property, all the files from the folderPath are picked. In this case, only the input.log is processed.
type (under format): The log files are in text format, so we use TextFormat.
columnDelimiter: Columns in the log files are delimited by a comma character (,).
frequency/interval: frequency is set to Month and interval is 1, which means that the input slices are available monthly.
external: This property is set to true if the input data is not generated by the Data Factory service. outputdataset.json { "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "adfgetstarted/partitioneddata", "format": { "type": "TextFormat", "columnDelimiter": "," } }, "availability": { "frequency": "Month", "interval": 1 } } } The JSON defines a dataset named AzureBlobOutput, which represents output data for an activity in the pipeline. In addition, it specifies that the results are stored in the blob container called adfgetstarted and the folder called partitioneddata. The availability section specifies that the output dataset is produced on a monthly basis. pipeline.json IMPORTANT Replace storageaccountname with the name of your Azure storage account. { "name": "MyFirstPipeline", "properties": { "description": "My first Azure Data Factory pipeline", "activities": [{ "type": "HDInsightHive", "typeProperties": { "scriptPath": "adfgetstarted/script/partitionweblogs.hql", "scriptLinkedService": "AzureStorageLinkedService", "defines": { "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata", "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata" } }, "inputs": [{ "name": "AzureBlobInput" }], "outputs": [{ "name": "AzureBlobOutput" }], "policy": { "concurrency": 1, "retry": 3 }, "scheduler": { "frequency": "Month", "interval": 1 }, "name": "RunSampleHiveActivity", "linkedServiceName": "HDInsightOnDemandLinkedService" }], "start": "2016-07-10T00:00:00Z", "end": "2016-07-11T00:00:00Z", "isPaused": false } } In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on an HDInsight cluster.
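The inputtable and partitionedtable values in the defines section are plain string concatenations of the container, storage account, and folder names. As an illustration (not part of the tutorial files; mystorageaccount is a hypothetical account name), the same wasb:// URIs can be composed like this:

```python
# Illustrative sketch: compose the wasb:// URIs used by the Hive activity's
# defines section from the tutorial's container and folder names.
def wasb_uri(container, storage_account, folder):
    """Build a wasb:// URI of the form wasb://<container>@<account>.blob.core.windows.net/<folder>."""
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, storage_account, folder)

# Values from this tutorial; "mystorageaccount" stands in for your account name.
inputtable = wasb_uri("adfgetstarted", "mystorageaccount", "inputdata")
partitionedtable = wasb_uri("adfgetstarted", "mystorageaccount", "partitioneddata")
print(inputtable)
# wasb://adfgetstarted@mystorageaccount.blob.core.windows.net/inputdata
```

Substituting your real storage account name into both URIs is exactly the replacement the IMPORTANT note above asks for.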
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the scriptLinkedService, called AzureStorageLinkedService), in the script folder in the container adfgetstarted. The defines section specifies runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}). The start and end properties of the pipeline specify the active period of the pipeline. In the activity JSON, you specify that the Hive script runs on the compute specified by the linkedServiceName – HDInsightOnDemandLinkedService. NOTE See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties used in the preceding example. Set global variables In Azure PowerShell, execute the following commands after replacing the values with your own: IMPORTANT See the Prerequisites section for instructions on getting the client ID, client secret, tenant ID, and subscription ID. $client_id = "<client ID of application in AAD>" $client_secret = "<client key of application in AAD>" $tenant = "<Azure tenant ID>"; $subscription_id="<Azure subscription ID>"; $rg = "ADFTutorialResourceGroup" $adf = "FirstDataFactoryREST" Authenticate with AAD $cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token -F grant_type=client_credentials -F resource=https://management.core.windows.net/ -F client_id=$client_id -F client_secret=$client_secret }; $responseToken = Invoke-Command -scriptblock $cmd; $accessToken = (ConvertFrom-Json $responseToken).access_token; (ConvertFrom-Json $responseToken) Create data factory In this step, you create an Azure Data Factory named FirstDataFactoryREST. A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store and an HDInsight Hive activity to run a Hive script to transform data.
Run the following commands to create the data factory: 1. Assign the command to a variable named cmd. Confirm that the name of the data factory you specify here (FirstDataFactoryREST) matches the name specified in datafactory.json. $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@datafactory.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/FirstDataFactoryREST?api-version=2015-10-01}; 2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd; 3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in the results; otherwise, you see an error message. Write-Host $results Note the following points: The name of the Azure Data Factory must be globally unique. If you see the error in results: Data factory name "FirstDataFactoryREST" is not available, do the following steps: 1. Change the name (for example, yournameFirstDataFactoryREST) in the datafactory.json file. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts. 2. In the first command where the $cmd variable is assigned a value, replace FirstDataFactoryREST with the new name and run the command. 3. Run the next two commands to invoke the REST API to create the data factory and print the results of the operation. To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription. The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.
If you receive the error: "This subscription is not registered to use namespace Microsoft.DataFactory", do one of the following and try publishing again: In Azure PowerShell, run the following command to register the Data Factory provider: Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory You can run the following command to confirm that the Data Factory provider is registered: Get-AzureRmResourceProvider Log in to the Azure portal using the Azure subscription and navigate to a Data Factory blade (or) create a data factory in the Azure portal. This action automatically registers the provider for you. Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to link data stores/computes to your data factory, and then define input and output datasets to represent data in the linked data stores. Create linked services In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run a Hive script specified in the activity of the pipeline in this sample. Create Azure Storage linked service In this step, you link your Azure Storage account to your data factory. With this tutorial, you use the same Azure Storage account to store input/output data and the HQL script file. 1. Assign the command to a variable named cmd. $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@azurestoragelinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureStorageLinkedService?api-version=2015-10-01}; 2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd; 3. View the results.
If the linked service has been successfully created, you see the JSON for the linked service in the results; otherwise, you see an error message. Write-Host $results Create Azure HDInsight linked service In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime and deleted after it is done processing and has been idle for the specified amount of time. You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute Linked Services for details. 1. Assign the command to a variable named cmd. $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@hdinsightondemandlinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/hdinsightondemandlinkedservice?api-version=2015-10-01}; 2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd; 3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the results; otherwise, you see an error message. Write-Host $results Create datasets In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the AzureStorageLinkedService you created earlier in this tutorial. The linked service points to an Azure Storage account, and the datasets specify the container, folder, and file name in the storage that holds the input and output data. Create input dataset In this step, you create the input dataset to represent input data stored in the Azure Blob storage. 1. Assign the command to a variable named cmd.
$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@inputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobInput?api-version=2015-10-01}; 2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd; 3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the results; otherwise, you see an error message. Write-Host $results Create output dataset In this step, you create the output dataset to represent output data stored in the Azure Blob storage. 1. Assign the command to a variable named cmd. $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@outputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobOutput?api-version=2015-10-01}; 2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd; 3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the results; otherwise, you see an error message. Write-Host $results Create pipeline In this step, you create your first pipeline with an HDInsightHive activity. The input slice is available monthly (frequency: Month, interval: 1), the output slice is produced monthly, and the scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset.
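As a rough illustration of the scheduling described above (a simplified model only; Data Factory's actual slice alignment has more rules), a Month frequency divides the pipeline's active period into calendar-month windows, which is why the 2016-07-10 to 2016-07-11 window in pipeline.json yields a single monthly slice:

```python
from datetime import datetime

# Simplified sketch of Month-frequency slicing: enumerate the calendar-month
# windows that overlap the pipeline's [start, end) active period.
def month_start(dt):
    return datetime(dt.year, dt.month, 1)

def monthly_slices(start, end):
    """Return (slice_start, slice_end) month windows overlapping [start, end)."""
    slices = []
    cur = month_start(start)
    while cur < end:
        # Advance to the first day of the next month (handles December rollover).
        nxt = datetime(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        slices.append((cur, nxt))
        cur = nxt
    return slices

print(monthly_slices(datetime(2016, 7, 10), datetime(2016, 7, 11)))
# a single July 2016 slice
```

Because start and end fall inside the same month, the activity runs once, against that one slice.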
Confirm that you see the input.log file in the adfgetstarted/inputdata folder in the Azure blob storage, and run the following command to deploy the pipeline. Since the start and end times are set in the past and isPaused is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy. 1. Assign the command to a variable named cmd. $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@pipeline.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datapipelines/MyFirstPipeline?api-version=2015-10-01}; 2. Run the command by using Invoke-Command. $results = Invoke-Command -scriptblock $cmd; 3. View the results. If the pipeline has been successfully created, you see the JSON for the pipeline in the results; otherwise, you see an error message. Write-Host $results 4. Congratulations, you have successfully created your first pipeline using Azure PowerShell! Monitor pipeline In this step, you use the Data Factory REST API to monitor slices being produced by the pipeline. $ds = "AzureBlobOutput" $cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/$ds/slices?start=1970-01-01T00%3a00%3a00.0000000Z"&"end=2016-08-12T00%3a00%3a00.0000000Z"&"api-version=2015-10-01}; $results2 = Invoke-Command -scriptblock $cmd; IF ((ConvertFrom-Json $results2).value -ne $NULL) { ConvertFrom-Json $results2 | Select-Object -Expand value | Format-Table } else { (convertFrom-Json $results2).RemoteException } IMPORTANT Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately 30 minutes to process the slice. Run the Invoke-Command and the next one until you see the slice in the Ready or Failed state.
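The GET URL in the monitoring command above is simply the dataset's slices endpoint with percent-encoded start and end timestamps. A small Python sketch of how that URL is assembled (placeholder values; a real call would also need the Authorization header):

```python
from urllib.parse import quote

# Illustrative sketch: assemble the slice-monitoring GET URL used by the curl
# command in this tutorial. All names here are the tutorial's placeholders.
def slices_url(subscription_id, rg, adf, dataset, start, end):
    base = ("https://management.azure.com/subscriptions/{0}/resourcegroups/{1}"
            "/providers/Microsoft.DataFactory/datafactories/{2}/datasets/{3}/slices"
            ).format(subscription_id, rg, adf, dataset)
    # Percent-encode the timestamps so the colons become %3A in the query string.
    query = "start={0}&end={1}&api-version=2015-10-01".format(
        quote(start, safe=""), quote(end, safe=""))
    return base + "?" + query

url = slices_url("<subscription-id>", "ADFTutorialResourceGroup",
                 "FirstDataFactoryREST", "AzureBlobOutput",
                 "1970-01-01T00:00:00.0000000Z", "2016-08-12T00:00:00.0000000Z")
print(url)
```

A wide start/end window like this one returns every slice of the dataset, which is convenient while polling for the Ready or Failed state.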
When the slice is in the Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the output data. IMPORTANT The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container. You can also use the Azure portal to monitor slices and troubleshoot any issues. See Monitor pipelines using Azure portal for details. Summary In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop cluster. You used the Data Factory REST API to do the following steps: 1. Created an Azure data factory. 2. Created two linked services: a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data factory. b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Azure Data Factory creates an HDInsight Hadoop cluster just-in-time to process input data and produce output data. 3. Created two datasets, which describe input and output data for the HDInsight Hive activity in the pipeline. 4. Created a pipeline with an HDInsight Hive activity. Next steps In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure SQL, see Tutorial: Copy data from an Azure Blob to Azure SQL. See Also
Data Factory REST API Reference: See comprehensive documentation on the Data Factory REST API.
Pipelines: This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business.
Datasets: This article helps you understand datasets in Azure Data Factory.
Scheduling and Execution: This article explains the scheduling and execution aspects of the Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App: This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App.

Move data between on-premises sources and the cloud with Data Management Gateway 5/4/2017 • 14 min to read

This article provides an overview of data integration between on-premises data stores and cloud data stores using Data Factory. It builds on the Data Movement Activities article and other data factory core concepts articles: datasets and pipelines.

Data Management Gateway

You must install Data Management Gateway on your on-premises machine to enable moving data to/from an on-premises data store. The gateway can be installed on the same machine as the data store, or on a different machine as long as the gateway can connect to the data store.

IMPORTANT See the Data Management Gateway article for details about Data Management Gateway.

The following walkthrough shows you how to create a data factory with a pipeline that moves data from an on-premises SQL Server database to Azure blob storage. As part of the walkthrough, you install and configure the Data Management Gateway on your machine.

Walkthrough: copy on-premises data to cloud

Create data factory

In this step, you use the Azure portal to create an Azure Data Factory instance named ADFTutorialOnPremDF.
1. Log in to the Azure portal.
2. Click + NEW, click Intelligence + analytics, and click Data Factory.
3. In the New data factory blade, enter ADFTutorialOnPremDF for the Name.

IMPORTANT The name of the Azure data factory must be globally unique. If you receive the error: Data factory name "ADFTutorialOnPremDF" is not available, change the name of the data factory (for example, yournameADFTutorialOnPremDF) and try creating it again.
Use this name in place of ADFTutorialOnPremDF while performing the remaining steps in this tutorial. The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.
4. Select the Azure subscription where you want the data factory to be created.
5. Select an existing resource group or create a new one. For the tutorial, create a resource group named ADFTutorialResourceGroup.
6. Click Create on the New data factory blade.

IMPORTANT To create Data Factory instances, you must be a member of the Data Factory Contributor role at the subscription/resource group level.

7. After creation is complete, you see the Data Factory blade as shown in the following image:

Create gateway
1. In the Data Factory blade, click the Author and deploy tile to launch the Editor for the data factory.
2. In the Data Factory Editor, click ... More on the toolbar and then click New data gateway. Alternatively, you can right-click Data Gateways in the tree view, and click New data gateway.
3. In the Create blade, enter adftutorialgateway for the name, and click OK.
4. In the Configure blade, click Install directly on this computer. This action downloads the installation package for the gateway, and installs, configures, and registers the gateway on the computer.

NOTE Use Internet Explorer or a Microsoft ClickOnce-compatible web browser. If you are using Chrome, go to the Chrome web store, search with the "ClickOnce" keyword, choose one of the ClickOnce extensions, and install it. Do the same for Firefox (install the add-in): click the Open Menu button on the toolbar (three horizontal lines in the top-right corner), click Add-ons, search with the "ClickOnce" keyword, choose one of the ClickOnce extensions, and install it.

This is the easiest (one-click) way to download, install, configure, and register the gateway in a single step. You can see that the Microsoft Data Management Gateway Configuration Manager application is installed on your computer.
You can also find the executable ConfigManager.exe in the folder: C:\Program Files\Microsoft Data Management Gateway\2.0\Shared. You can also download and install the gateway manually by using the links in this blade, and register it by using the key shown in the NEW KEY text box. See the Data Management Gateway article for all the details about the gateway.

NOTE You must be an administrator on the local computer to install and configure the Data Management Gateway successfully. You can add additional users to the Data Management Gateway Users local Windows group. The members of this group can use the Data Management Gateway Configuration Manager tool to configure the gateway.

5. Wait for a couple of minutes or until you see the following notification message:
6. Launch the Data Management Gateway Configuration Manager application on your computer. In the Search window, type Data Management Gateway to access this utility. You can also find the executable ConfigManager.exe in the folder: C:\Program Files\Microsoft Data Management Gateway\2.0\Shared
7. Confirm that you see the "adftutorialgateway is connected to the cloud service" message. The status bar at the bottom displays Connected to the cloud service along with a green check mark. On the Home tab, you can also do the following operations:
Register a gateway with a key from the Azure portal by using the Register button.
Stop the Data Management Gateway Host Service running on your gateway machine.
Schedule updates to be installed at a specific time of the day.
View when the gateway was last updated.
Specify the time at which an update to the gateway can be installed.
8. Switch to the Settings tab. The certificate specified in the Certificate section is used to encrypt/decrypt credentials for the on-premises data store that you specify on the portal. (Optional) Click Change to use your own certificate instead. By default, the gateway uses the certificate that is auto-generated by the Data Factory service.
You can also do the following actions on the Settings tab:
View or export the certificate being used by the gateway.
Change the HTTPS endpoint used by the gateway.
Set an HTTP proxy to be used by the gateway.
9. (Optional) Switch to the Diagnostics tab, and check the Enable verbose logging option if you want to enable verbose logging that you can use to troubleshoot any issues with the gateway. The logging information can be found in Event Viewer under the Applications and Services Logs -> Data Management Gateway node. You can also perform the following actions in the Diagnostics tab:
Use the Test Connection section to test connectivity to an on-premises data source using the gateway.
Click View Logs to see the Data Management Gateway log in an Event Viewer window.
Click Send Logs to upload a zip file with the logs of the last seven days to Microsoft to facilitate troubleshooting of your issues.
10. On the Diagnostics tab, in the Test Connection section, select SqlServer for the type of the data store, enter the name of the database server and the name of the database, specify the authentication type, enter the user name and password, and click Test to verify whether the gateway can connect to the database.
11. Switch to the web browser, and in the Azure portal, click OK on the Configure blade and then on the New data gateway blade.
12. You should see adftutorialgateway under Data Gateways in the tree view on the left. If you click it, you should see the associated JSON.

Create linked services

In this step, you create two linked services: AzureStorageLinkedService and SqlServerLinkedService. The SqlServerLinkedService linked service links an on-premises SQL Server database, and the AzureStorageLinkedService linked service links an Azure blob store, to the data factory. You create a pipeline later in this walkthrough that copies data from the on-premises SQL Server database to the Azure blob store.

Add a linked service to an on-premises SQL Server database
1.
In the Data Factory Editor, click New data store on the toolbar and select SQL Server.
2. In the JSON editor on the right, do the following steps:
a. For the gatewayName, specify adftutorialgateway.
b. In the connectionString, do the following steps:
a. For servername, enter the name of the server that hosts the SQL Server database.
b. For databasename, enter the name of the database.
c. Click the Encrypt button on the toolbar. This downloads and launches the Credentials Manager application.
d. In the Setting Credentials dialog box, specify the authentication type, user name, and password, and click OK. If the connection is successful, the encrypted credentials are stored in the JSON and the dialog box closes.
e. Close the empty browser tab that launched the dialog box if it is not automatically closed, and get back to the tab with the Azure portal. On the gateway machine, these credentials are encrypted by using a certificate that the Data Factory service owns. If you want to use the certificate that is associated with the Data Management Gateway instead, see Set credentials securely.
c. Click Deploy on the command bar to deploy the SQL Server linked service. You should see the linked service in the tree view.

Add a linked service for an Azure storage account
1. In the Data Factory Editor, click New data store on the command bar and click Azure storage.
2. Enter the name of your Azure storage account for the Account name.
3. Enter the key for your Azure storage account for the Account key.
4. Click Deploy to deploy the AzureStorageLinkedService.

Create datasets

In this step, you create input and output datasets that represent the input and output data for the copy operation (on-premises SQL Server database => Azure blob storage). Before creating the datasets, do the following steps (detailed steps follow the list):
Create a table named emp in the SQL Server database you added as a linked service to the data factory, and insert a couple of sample entries into the table.
Create a blob container named adftutorial in the Azure blob storage account you added as a linked service to the data factory.

Prepare on-premises SQL Server for the tutorial
1. In the database you specified for the on-premises SQL Server linked service (SqlServerLinkedService), use the following SQL script to create the emp table in the database.

CREATE TABLE dbo.emp
(
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50),
    CONSTRAINT PK_emp PRIMARY KEY (ID)
)
GO

2. Insert some sample rows into the table:

INSERT INTO emp VALUES ('John', 'Doe')
INSERT INTO emp VALUES ('Jane', 'Doe')

Create input dataset
1. In the Data Factory Editor, click ... More, click New dataset on the command bar, and click SQL Server table.
2. Replace the JSON in the right pane with the following text:

{
    "name": "EmpOnPremSQLTable",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": "SqlServerLinkedService",
        "typeProperties": {
            "tableName": "emp"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "policy": {
            "externalData": {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}

Note the following points:
type is set to SqlServerTable.
tableName is set to emp.
linkedServiceName is set to SqlServerLinkedService (you created this linked service earlier in this walkthrough).
For an input dataset that is not generated by another pipeline in Azure Data Factory, you must set external to true. It denotes that the input data is produced externally to the Azure Data Factory service. You can optionally specify external data policies using the externalData element in the policy section.
See Move data to/from SQL Server for details about JSON properties.
3. Click Deploy on the command bar to deploy the dataset.

Create output dataset
1. In the Data Factory Editor, click New dataset on the command bar, and click Azure Blob storage.
2.
Replace the JSON in the right pane with the following text:

{
    "name": "OutputBlobTable",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "adftutorial/outfromonpremdf",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

Note the following points:
type is set to AzureBlob.
linkedServiceName is set to AzureStorageLinkedService (you created this linked service in Step 2).
folderPath is set to adftutorial/outfromonpremdf, where outfromonpremdf is the folder in the adftutorial container. Create the adftutorial container if it does not already exist.
The availability is set to hourly (frequency set to hour and interval set to 1). The Data Factory service generates an output data slice every hour in the adftutorial container in the Azure blob storage.
If you do not specify a fileName for an output table, the generated files in the folderPath are named in the following format: Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt).
To set the folderPath and fileName dynamically based on the SliceStart time, use the partitionedBy property. In the following example, folderPath uses the Year, Month, and Day from SliceStart (the start time of the slice being processed) and fileName uses the Hour from SliceStart. For example, if a slice is being produced for 2014-10-20T08:00:00, the folderName is set to wikidatagateway/wikisampledataout/2014/10/20 and the fileName is set to 08.csv.
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
    { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
    { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
    { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
    { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],

See Move data to/from Azure Blob Storage for details about JSON properties.
3. Click Deploy on the command bar to deploy the dataset. Confirm that you see both the datasets in the tree view.

Create pipeline

In this step, you create a pipeline with one Copy Activity that uses EmpOnPremSQLTable as input and OutputBlobTable as output.
1. In the Data Factory Editor, click ... More, and click New pipeline.
2. Replace the JSON in the right pane with the following text:

{
    "name": "ADFTutorialPipelineOnPrem",
    "properties": {
        "description": "This pipeline has one Copy activity that copies data from an on-prem SQL to Azure blob",
        "activities": [
            {
                "name": "CopyFromSQLtoBlob",
                "description": "Copy data from on-prem SQL server to blob",
                "type": "Copy",
                "inputs": [ { "name": "EmpOnPremSQLTable" } ],
                "outputs": [ { "name": "OutputBlobTable" } ],
                "typeProperties": {
                    "source": { "type": "SqlSource", "sqlReaderQuery": "select * from emp" },
                    "sink": { "type": "BlobSink" }
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ],
        "start": "2016-07-05T00:00:00Z",
        "end": "2016-07-06T00:00:00Z",
        "isPaused": false
    }
}

IMPORTANT Replace the value of the start property with the current day and the end value with the next day.

Note the following points:
In the activities section, there is only one activity, whose type is set to Copy. Input for the activity is set to EmpOnPremSQLTable and output for the activity is set to OutputBlobTable.
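To make the {Year}/{Month}/{Day}/{Hour} substitution concrete, here is a small Python sketch of how the partitionedBy variables resolve from a SliceStart value. This is a local illustration only, not part of the Data Factory service; the function name is invented for this example.

```python
from datetime import datetime

def resolve_partitioned_path(folder_path: str, file_name: str, slice_start: datetime):
    """Expand {Year}/{Month}/{Day}/{Hour} tokens the way partitionedBy with
    SliceStart and formats yyyy, MM, dd, hh would (hh is a 12-hour format)."""
    tokens = {
        "Year": slice_start.strftime("%Y"),   # yyyy
        "Month": slice_start.strftime("%m"),  # MM
        "Day": slice_start.strftime("%d"),    # dd
        "Hour": slice_start.strftime("%I"),   # hh
    }
    for name, value in tokens.items():
        folder_path = folder_path.replace("{%s}" % name, value)
        file_name = file_name.replace("{%s}" % name, value)
    return folder_path, file_name

folder, fname = resolve_partitioned_path(
    "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
    "{Hour}.csv",
    datetime(2014, 10, 20, 8, 0, 0))
# For the 2014-10-20T08:00:00 slice this yields
# wikidatagateway/wikisampledataout/2014/10/20 and 08.csv
```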
In the typeProperties section, SqlSource is specified as the source type and BlobSink is specified as the sink type. The SQL query select * from emp is specified for the sqlReaderQuery property of SqlSource.
Both start and end datetimes must be in ISO format. For example: 2014-10-14T16:32:41Z. The end time is optional, but we use it in this tutorial. If you do not specify a value for the end property, it is calculated as "start + 48 hours". To run the pipeline indefinitely, specify 9/9/9999 as the value for the end property.
You are defining the time duration in which the data slices are processed, based on the availability properties that were defined for each Azure Data Factory dataset. In the example, there are 24 data slices, as each data slice is produced hourly.
3. Click Deploy on the command bar to deploy the pipeline. Confirm that the pipeline shows up in the tree view under the Pipelines node.
4. Now, click X twice to close the blades to get back to the Data Factory blade for the ADFTutorialOnPremDF.
Congratulations! You have successfully created an Azure data factory, linked services, datasets, and a pipeline, and scheduled the pipeline.

View the data factory in a Diagram View
1. In the Azure portal, click the Diagram tile on the home page for the ADFTutorialOnPremDF data factory.
2. You should see a diagram similar to the following image: You can zoom in, zoom out, zoom to 100%, zoom to fit, automatically position pipelines and datasets, and show lineage information (highlights upstream and downstream items of selected items). You can double-click an object (input/output dataset or pipeline) to see its properties.

Monitor pipeline

In this step, you use the Azure portal to monitor what's going on in an Azure data factory. You can also use PowerShell cmdlets to monitor datasets and pipelines. For details about monitoring, see Monitor and Manage Pipelines.
1. In the diagram, double-click EmpOnPremSQLTable.
2.
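As a rough illustration of why the one-day active period above yields 24 hourly slices, the slice windows can be enumerated locally. This is only a sketch of the scheduling arithmetic, not a Data Factory API.

```python
from datetime import datetime, timedelta

def hourly_slices(start: datetime, end: datetime):
    """List (slice_start, slice_end) windows for a dataset whose
    availability is hourly (frequency Hour, interval 1)."""
    slices = []
    t = start
    while t < end:
        slices.append((t, t + timedelta(hours=1)))
        t += timedelta(hours=1)
    return slices

# The pipeline's active period from the JSON above: one full day.
windows = hourly_slices(datetime(2016, 7, 5), datetime(2016, 7, 6))
# 24 hourly windows between the start and end times
```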
Notice that all the data slices are in Ready state because the pipeline duration (start time to end time) is in the past. It is also because you have inserted the data in the SQL Server database and it is there all the time. Confirm that no slices show up in the Problem slices section at the bottom. To view all the slices, click See More at the bottom of the list of slices.
3. Now, in the Datasets blade, click OutputBlobTable.
4. Click any data slice from the list and you should see the Data Slice blade. You see the activity runs for the slice. Usually, you see only one activity run. If the slice is not in the Ready state, you can see the upstream slices that are not Ready and are blocking the current slice from executing in the Upstream slices that are not ready list.
5. Click the activity run from the list at the bottom to see activity run details. You see information such as throughput, duration, and the gateway used to transfer the data.
6. Click X to close all the blades until you get back to the home blade for the ADFTutorialOnPremDF.
7. (Optional) Click Pipelines, click ADFTutorialOnPremDF, and drill through input tables (Consumed) or output datasets (Produced).
8. Use tools such as Microsoft Storage Explorer to verify that a blob/file is created for each hour.

Next Steps
See the Data Management Gateway article for all the details about the Data Management Gateway.
See Copy data from Azure Blob to Azure SQL to learn about how to use Copy Activity to move data from a source data store to a sink data store.

Azure Data Factory - Frequently Asked Questions 4/27/2017 • 22 min to read

General questions

What is Azure Data Factory?
Data Factory is a cloud-based data integration service that automates the movement and transformation of data.
Just like a factory that runs equipment to take raw materials and transform them into finished goods, Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information. Data Factory allows you to create data-driven workflows to move data between both on-premises and cloud data stores, as well as to process/transform data using compute services such as Azure HDInsight and Azure Data Lake Analytics. After you create a pipeline that performs the action that you need, you can schedule it to run periodically (hourly, daily, weekly, and so on). For more information, see Overview & Key Concepts.

Where can I find pricing details for Azure Data Factory?
See the Data Factory Pricing Details page for the pricing details for Azure Data Factory.

How do I get started with Azure Data Factory?
For an overview of Azure Data Factory, see Introduction to Azure Data Factory.
For a tutorial on how to copy/move data using Copy Activity, see Copy data from Azure Blob Storage to Azure SQL Database.
For a tutorial on how to transform data using HDInsight Hive Activity, see Process data by running Hive script on Hadoop cluster.

What is the Data Factory's region availability?
Data Factory is available in US West and North Europe. The compute and storage services used by data factories can be in other regions. See Supported regions.

What are the limits on the number of data factories/pipelines/activities/datasets?
See the Azure Data Factory Limits section of the Azure Subscription and Service Limits, Quotas, and Constraints article.

What is the authoring/developer experience with the Azure Data Factory service?
You can author/create data factories using one of the following tools/SDKs:
Azure portal. The Data Factory blades in the Azure portal provide a rich user interface for you to create data factories and linked services.
The Data Factory Editor, which is also part of the portal, allows you to easily create linked services, tables, datasets, and pipelines by specifying JSON definitions for these artifacts. See Build your first data pipeline using Azure portal for an example of using the portal/editor to create and deploy a data factory.
Visual Studio. You can use Visual Studio to create an Azure data factory. See Build your first data pipeline using Visual Studio for details.
Azure PowerShell. See Create and monitor Azure Data Factory using Azure PowerShell for a tutorial/walkthrough for creating a data factory using PowerShell. See the Data Factory Cmdlet Reference content on MSDN Library for comprehensive documentation of Data Factory cmdlets.
.NET Class Library. You can programmatically create data factories by using the Data Factory .NET SDK. See Create, monitor, and manage data factories using .NET SDK for a walkthrough of creating a data factory using the .NET SDK. See the Data Factory Class Library Reference for comprehensive documentation of the Data Factory .NET SDK.
REST API. You can also use the REST API exposed by the Azure Data Factory service to create and deploy data factories. See the Data Factory REST API Reference for comprehensive documentation of the Data Factory REST API.
Azure Resource Manager Template. See Tutorial: Build your first Azure data factory using Azure Resource Manager template for details.

Can I rename a data factory?
No. Like other Azure resources, the name of an Azure data factory cannot be changed.

Can I move a data factory from one Azure subscription to another?
Yes. Use the Move button on your data factory blade as shown in the following diagram:

What are the compute environments supported by Data Factory?
The following table provides a list of compute environments supported by Data Factory and the activities that can run on them.
COMPUTE ENVIRONMENT: ACTIVITIES
On-demand HDInsight cluster or your own HDInsight cluster: DotNet, Hive, Pig, MapReduce, Hadoop Streaming
Azure Batch: DotNet
Azure Machine Learning: Machine Learning activities: Batch Execution and Update Resource
Azure Data Lake Analytics: Data Lake Analytics U-SQL
Azure SQL, Azure SQL Data Warehouse, SQL Server: Stored Procedure

How does Azure Data Factory compare with SQL Server Integration Services (SSIS)?
See the Azure Data Factory vs. SSIS presentation from one of our MVPs (Most Valued Professionals): Reza Rad. Some of the recent changes in Data Factory may not be listed in the slide deck. We are continuously adding more capabilities to Azure Data Factory. We will incorporate these updates into the comparison of data integration technologies from Microsoft sometime later this year.

Activities - FAQ

What are the different types of activities you can use in a Data Factory pipeline?
Data Movement Activities to move data.
Data Transformation Activities to process/transform data.

When does an activity run?
The availability configuration setting in the output data table determines when the activity is run. If input datasets are specified, the activity checks whether all the input data dependencies are satisfied (that is, in Ready state) before it starts running.

Copy Activity - FAQ

Is it better to have a pipeline with multiple activities or a separate pipeline for each activity?
Pipelines are supposed to bundle related activities. If the datasets that connect them are not consumed by any other activity outside the pipeline, you can keep the activities in one pipeline. This way, you would not need to chain pipeline active periods so that they align with each other. Also, the data integrity in the tables internal to the pipeline is better preserved when updating the pipeline.
Pipeline update essentially stops all the activities within the pipeline, removes them, and creates them again. From an authoring perspective, it might also be easier to see the flow of data within the related activities in one JSON file for the pipeline.

What are the supported data stores?
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store.

CATEGORY: Azure
Azure Blob storage: source ✓, sink ✓
Azure Cosmos DB (DocumentDB API): source ✓, sink ✓
Azure Data Lake Store: source ✓, sink ✓
Azure SQL Database: source ✓, sink ✓
Azure SQL Data Warehouse: source ✓, sink ✓
Azure Search Index: sink ✓
Azure Table storage: source ✓, sink ✓
CATEGORY: Databases
Amazon Redshift: source ✓
DB2*: source ✓
MySQL*: source ✓
Oracle*: source ✓, sink ✓
PostgreSQL*: source ✓
SAP Business Warehouse*: source ✓
SAP HANA*: source ✓
SQL Server*: source ✓, sink ✓
Sybase*: source ✓
Teradata*: source ✓
CATEGORY: NoSQL
Cassandra*: source ✓
MongoDB*: source ✓
CATEGORY: File
Amazon S3: source ✓
File System*: source ✓, sink ✓
FTP: source ✓
HDFS*: source ✓
SFTP: source ✓
CATEGORY: Others
Generic HTTP: source ✓
Generic OData: source ✓
Generic ODBC*: source ✓
Salesforce: source ✓
Web Table (table from HTML): source ✓
GE Historian*: source ✓

NOTE Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.

What are the supported file formats?

Specifying formats
Azure Data Factory supports the following format types:
Text Format
JSON Format
Avro Format
ORC Format
Parquet Format

Specifying TextFormat
If you want to parse text files or write the data in text format, set the format type property to TextFormat. You can also specify the following optional properties in the format section. See the TextFormat example section on how to configure them.

columnDelimiter: The character used to separate columns in a file. Consider using a rare unprintable character that is unlikely to exist in your data: for example, specify "\u0001", which represents Start of Heading (SOH). Allowed values: only one character; the default value is comma (','). Required: No.
rowDelimiter: The character used to separate rows in a file. Allowed values: only one character; the default value on read is any of ["\r\n", "\r", "\n"], and "\r\n" on write. Required: No.
escapeChar: The special character used to escape the column delimiter in the content of an input file. You cannot specify both escapeChar and quoteChar for a table. Example: if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: "Hello, world"), you can define '$' as the escape character and use the string "Hello$, world" in the source. Allowed values: only one character; no default value. Required: No.
quoteChar: The character used to quote a string value. The column and row delimiters inside the quote characters are treated as part of the string value. This property is applicable to both input and output datasets. You cannot specify both escapeChar and quoteChar for a table. For example, if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: "Hello, world"), you can define " (double quote) as the quote character and use the string "Hello, world" in the source. Allowed values: only one character; no default value. Required: No.
nullValue: One or more characters used to represent a null value. Allowed values: one or more characters; the default values are "\N" and "NULL" on read and "\N" on write. Required: No.
encodingName: Specify the encoding name. Allowed values: a valid encoding name; see the Encoding.EncodingName property. Examples: windows-1250 or shift_jis. The default value is UTF-8. Required: No.
firstRowAsHeader: Specifies whether to consider the first row as a header. For an input dataset, Data Factory reads the first row as a header. For an output dataset, Data Factory writes the first row as a header. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios. Allowed values: True, False (default). Required: No.
skipLineCount: Indicates the number of rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios. Allowed values: Integer. Required: No.
treatEmptyAsNull: Specifies whether to treat a null or empty string as a null value when reading data from an input file. Allowed values: True (default), False. Required: No.

To use a Unicode character for a delimiter, refer to Unicode Characters to get the corresponding code for it.

TextFormat example
The following sample shows some of the format properties for TextFormat.

"typeProperties": {
    "folderPath": "mycontainer/myfolder",
    "fileName": "myblobname",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "rowDelimiter": ";",
        "quoteChar": "\"",
        "NullValue": "NaN",
        "firstRowAsHeader": true,
        "skipLineCount": 0,
        "treatEmptyAsNull": true
    }
},

To use an escapeChar instead of a quoteChar, replace the line with quoteChar with the following escapeChar:

"escapeChar": "$",

Scenarios for using firstRowAsHeader and skipLineCount
You are copying from a non-file source to a text file and would like to add a header line containing the schema metadata (for example: SQL schema). Specify firstRowAsHeader as true in the output dataset for this scenario.
You are copying from a text file containing a header line to a non-file sink and would like to drop that line. Specify firstRowAsHeader as true in the input dataset.
You are copying from a text file and want to skip a few lines at the beginning that contain no data or header information. Specify skipLineCount to indicate the number of lines to be skipped. If the rest of the file contains a header line, you can also specify firstRowAsHeader.
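The escapeChar and quoteChar behavior described above mirrors ordinary delimited-text handling, which Python's csv module can demonstrate. This only illustrates the concept locally; Data Factory does its own parsing with these dataset properties.

```python
import csv
import io

# escapeChar illustration: '$' escapes the column delimiter, so the source
# text "Hello$, world" yields a single field containing "Hello, world".
escaped = io.StringIO("Hello$, world,42\n")
rows = list(csv.reader(escaped, delimiter=",", escapechar="$",
                       quoting=csv.QUOTE_NONE))
# rows[0] -> ["Hello, world", "42"]

# quoteChar illustration: a double quote protects the delimiter inside a
# value, so "Hello, world" stays one field.
quoted = io.StringIO('"Hello, world",42\n')
rows2 = list(csv.reader(quoted, delimiter=",", quotechar='"'))
# rows2[0] -> ["Hello, world", "42"]
```

Note that, as the table above states, a Data Factory dataset can use escapeChar or quoteChar but not both.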
If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file.

Specifying JsonFormat
To import/export JSON files as-is into/from Azure Cosmos DB, see the Import/export JSON documents section in the Azure Cosmos DB connector article for details. If you want to parse JSON files or write the data in JSON format, set the format type property to JsonFormat. You can also specify the following optional properties in the format section. See the JsonFormat example section on how to configure them.

filePattern: Indicates the pattern of data stored in each JSON file. Allowed values are setOfObjects and arrayOfObjects. The default value is setOfObjects. See the JSON file patterns section for details about these patterns. Required: No.
jsonNodeReference: If you want to iterate and extract data from the objects inside an array field with the same pattern, specify the JSON path of that array. This property is supported only when copying data from JSON files. Required: No.
jsonPathDefinition: Specify the JSON path expression for each column mapping with a customized column name (start with lowercase). This property is supported only when copying data from JSON files, and you can extract data from an object or array. For fields under the root object, start with root $; for fields inside the array chosen by the jsonNodeReference property, start from the array element. See the JsonFormat example section on how to configure it. Required: No.
encodingName: Specify the encoding name. For the list of valid encoding names, see the Encoding.EncodingName property. For example: windows-1250 or shift_jis. The default value is UTF-8. Required: No.
nestingSeparator: Character that is used to separate nesting levels. The default value is '.' (dot). Required: No.

JSON file patterns
Copy Activity can parse the following patterns of JSON files:
Type I: setOfObjects. Each file contains a single object, or line-delimited/concatenated multiple objects.
When this option is chosen in an output dataset, Copy Activity produces a single JSON file with one object per line (line-delimited).

single object JSON example

    {
        "time": "2015-04-29T07:12:20.9100000Z",
        "callingimsi": "466920403025604",
        "callingnum1": "678948008",
        "callingnum2": "567834760",
        "switch1": "China",
        "switch2": "Germany"
    }

line-delimited JSON example

    {"time":"2015-04-29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"567834760","switch1":"China","switch2":"Germany"}
    {"time":"2015-04-29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"789037573","switch1":"US","switch2":"UK"}
    {"time":"2015-04-29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":"345626404","switch1":"Germany","switch2":"UK"}

concatenated JSON example

    {
        "time": "2015-04-29T07:12:20.9100000Z",
        "callingimsi": "466920403025604",
        "callingnum1": "678948008",
        "callingnum2": "567834760",
        "switch1": "China",
        "switch2": "Germany"
    }
    {
        "time": "2015-04-29T07:13:21.0220000Z",
        "callingimsi": "466922202613463",
        "callingnum1": "123436380",
        "callingnum2": "789037573",
        "switch1": "US",
        "switch2": "UK"
    }
    {
        "time": "2015-04-29T07:13:21.4370000Z",
        "callingimsi": "466923101048691",
        "callingnum1": "678901578",
        "callingnum2": "345626404",
        "switch1": "Germany",
        "switch2": "UK"
    }

Type II: arrayOfObjects
Each file contains an array of objects.
    [
        {
            "time": "2015-04-29T07:12:20.9100000Z",
            "callingimsi": "466920403025604",
            "callingnum1": "678948008",
            "callingnum2": "567834760",
            "switch1": "China",
            "switch2": "Germany"
        },
        {
            "time": "2015-04-29T07:13:21.0220000Z",
            "callingimsi": "466922202613463",
            "callingnum1": "123436380",
            "callingnum2": "789037573",
            "switch1": "US",
            "switch2": "UK"
        },
        {
            "time": "2015-04-29T07:13:21.4370000Z",
            "callingimsi": "466923101048691",
            "callingnum1": "678901578",
            "callingnum2": "345626404",
            "switch1": "Germany",
            "switch2": "UK"
        }
    ]

JsonFormat example
Case 1: Copying data from JSON files
The following two samples show how to copy data from JSON files, along with the generic points to note.

Sample 1: extract data from object and array
In this sample, one root JSON object maps to a single record in the tabular result. If you have a JSON file with the following content:

    {
        "id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
        "context": {
            "device": {
                "type": "PC"
            },
            "custom": {
                "dimensions": [
                    { "TargetResourceType": "Microsoft.Compute/virtualMachines" },
                    { "ResourceManagmentProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3" },
                    { "OccurrenceTime": "1/13/2017 11:24:37 AM" }
                ]
            }
        }
    }

and you want to copy it into an Azure SQL table in the following format, by extracting data from both objects and the array:

| ID | DEVICETYPE | TARGETRESOURCETYPE | RESOURCEMANAGMENTPROCESSRUNID | OCCURRENCETIME |
| --- | --- | --- | --- | --- |
| ed0e4960-d9c5-11e6-85dc-d7996816aad3 | PC | Microsoft.Compute/virtualMachines | 827f8aaa-ab72-437c-ba48-d8917a7336a3 | 1/13/2017 11:24:37 AM |

The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More specifically:
The structure section defines the customized column names and the corresponding data types for converting to tabular data. This section is optional unless you need to do column mapping. See the Specifying structure definition for rectangular datasets section for more details.
jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from.
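The jsonPathDefinition lookups used in this sample can be illustrated with a tiny Python resolver. This is a sketch that supports only dot access and numeric indexes, a small subset of what Data Factory accepts; the function name is illustrative, not part of any ADF API:

```python
import json
import re

def get_json_path(obj, path):
    """Resolve a simple JSON path such as
    '$.context.custom.dimensions[0].TargetResourceType'.
    Supports only dot access and numeric indexes (illustrative subset)."""
    for token in re.findall(r'[^.\[\]$]+|\[\d+\]', path):
        if token.startswith('['):
            obj = obj[int(token[1:-1])]   # numeric array index, e.g. [0]
        else:
            obj = obj[token]              # object field access
    return obj

doc = json.loads('''{"id": "ed0e4960", "context": {"device": {"type": "PC"},
  "custom": {"dimensions": [{"TargetResourceType": "Microsoft.Compute/virtualMachines"}]}}}''')

# Column-name-to-path mapping, in the same shape as jsonPathDefinition.
json_path_definition = {
    "id": "$.id",
    "deviceType": "$.context.device.type",
    "targetResourceType": "$.context.custom.dimensions[0].TargetResourceType",
}
row = {col: get_json_path(doc, path) for col, path in json_path_definition.items()}
# row now holds one tabular record extracted from the root object.
```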
To copy data from an array, you can use array[x].property to extract the value of the given property from the xth object, or array[*].property to find the value from any object that contains the property.

    "properties": {
        "structure": [
            { "name": "id", "type": "String" },
            { "name": "deviceType", "type": "String" },
            { "name": "targetResourceType", "type": "String" },
            { "name": "resourceManagmentProcessRunId", "type": "String" },
            { "name": "occurrenceTime", "type": "DateTime" }
        ],
        "typeProperties": {
            "folderPath": "mycontainer/myfolder",
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects",
                "jsonPathDefinition": {
                    "id": "$.id",
                    "deviceType": "$.context.device.type",
                    "targetResourceType": "$.context.custom.dimensions[0].TargetResourceType",
                    "resourceManagmentProcessRunId": "$.context.custom.dimensions[1].ResourceManagmentProcessRunId",
                    "occurrenceTime": "$.context.custom.dimensions[2].OccurrenceTime"
                }
            }
        }
    }

Sample 2: cross apply multiple objects with the same pattern from an array
In this sample, one root JSON object is transformed into multiple records in the tabular result. If you have a JSON file with the following content:

    {
        "ordernumber": "01",
        "orderdate": "20170122",
        "orderlines": [
            { "prod": "p1", "price": 23 },
            { "prod": "p2", "price": 13 },
            { "prod": "p3", "price": 231 }
        ],
        "city": [ { "sanmateo": "No 1" } ]
    }

and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array and cross joining it with the common root information:

| ORDERNUMBER | ORDERDATE | ORDER_PD | ORDER_PRICE | CITY |
| --- | --- | --- | --- | --- |
| 01 | 20170122 | P1 | 23 | [{"sanmateo":"No 1"}] |
| 01 | 20170122 | P2 | 13 | [{"sanmateo":"No 1"}] |
| 01 | 20170122 | P3 | 231 | [{"sanmateo":"No 1"}] |

The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More specifically:
The structure section defines the customized column names and the corresponding data types for converting to tabular data.
This section is optional unless you need to do column mapping. See the Specifying structure definition for rectangular datasets section for more details.
jsonNodeReference indicates that Copy Activity should iterate and extract data from the objects with the same pattern under the array orderlines.
jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. In this example, ordernumber, orderdate, and city are under the root object, so their JSON paths start with "$.", while order_pd and order_price are defined with paths derived from the array element, without "$.".

    "properties": {
        "structure": [
            { "name": "ordernumber", "type": "String" },
            { "name": "orderdate", "type": "String" },
            { "name": "order_pd", "type": "String" },
            { "name": "order_price", "type": "Int64" },
            { "name": "city", "type": "String" }
        ],
        "typeProperties": {
            "folderPath": "mycontainer/myfolder",
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects",
                "jsonNodeReference": "$.orderlines",
                "jsonPathDefinition": {
                    "ordernumber": "$.ordernumber",
                    "orderdate": "$.orderdate",
                    "order_pd": "prod",
                    "order_price": "price",
                    "city": "$.city"
                }
            }
        }
    }

Note the following points:
- If structure and jsonPathDefinition are not defined in the Data Factory dataset, Copy Activity detects the schema from the first object and flattens the whole object.
- If the JSON input has an array, by default Copy Activity converts the entire array value into a string. You can choose to extract data from it by using jsonNodeReference and/or jsonPathDefinition, or skip it by not specifying it in jsonPathDefinition.
- If there are duplicate names at the same level, Copy Activity picks the last one.
- Property names are case-sensitive. Two properties with the same name but different casings are treated as two separate properties.
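The cross-apply behavior in Sample 2 (one output row per array element, with the root fields repeated on every row) can be sketched in Python. The function name and parameter names are illustrative, not Data Factory's implementation:

```python
import json

def cross_apply(doc, node_reference, root_columns, element_columns):
    """Produce one output row per element of the referenced array,
    repeating the root-level values on every row (a cross join)."""
    rows = []
    for element in doc[node_reference]:
        row = {col: doc[field] for col, field in root_columns.items()}
        row.update({col: element[field] for col, field in element_columns.items()})
        rows.append(row)
    return rows

doc = json.loads('''{"ordernumber": "01", "orderdate": "20170122",
  "orderlines": [{"prod": "p1", "price": 23}, {"prod": "p2", "price": 13}]}''')

rows = cross_apply(
    doc, "orderlines",
    root_columns={"ordernumber": "ordernumber", "orderdate": "orderdate"},
    element_columns={"order_pd": "prod", "order_price": "price"},
)
# Two rows, each carrying the shared ordernumber/orderdate plus one orderline.
```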
Case 2: Writing data to a JSON file
If you have the following table in SQL Database:

| ID | ORDER_DATE | ORDER_PRICE | ORDER_BY |
| --- | --- | --- | --- |
| 1 | 20170119 | 2000 | David |
| 2 | 20170120 | 3500 | Patrick |
| 3 | 20170121 | 4000 | Jason |

and for each record, you want to write a JSON object in the following format:

    {
        "id": "1",
        "order": {
            "date": "20170119",
            "price": 2000,
            "customer": "David"
        }
    }

The output dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More specifically, the structure section defines the customized property names in the destination file; nestingSeparator (default is ".") is used to identify the nesting level from the name. This section is optional unless you want to change the property names compared with the source column names, or nest some of the properties.

    "properties": {
        "structure": [
            { "name": "id", "type": "String" },
            { "name": "order.date", "type": "String" },
            { "name": "order.price", "type": "Int64" },
            { "name": "order.customer", "type": "String" }
        ],
        "typeProperties": {
            "folderPath": "mycontainer/myfolder",
            "format": {
                "type": "JsonFormat"
            }
        }
    }

Specifying AvroFormat
If you want to parse Avro files or write data in Avro format, set the format type property to AvroFormat. You do not need to specify any properties in the format section within the typeProperties section. Example:

    "format": {
        "type": "AvroFormat"
    }

To use Avro format in a Hive table, you can refer to Apache Hive's tutorial. Note the following point:
- Complex data types are not supported (records, enums, arrays, maps, unions, and fixed).

Specifying OrcFormat
If you want to parse ORC files or write data in ORC format, set the format type property to OrcFormat. You do not need to specify any properties in the format section within the typeProperties section. Example:

    "format": {
        "type": "OrcFormat"
    }

IMPORTANT
If you are not copying ORC files as-is between on-premises and cloud data stores, you need to install JRE 8 (Java Runtime Environment) on your gateway machine.
A 64-bit gateway requires a 64-bit JRE, and a 32-bit gateway requires a 32-bit JRE. You can find both versions here; choose the appropriate one.

Note the following points:
- Complex data types are not supported (STRUCT, MAP, LIST, UNION).
- An ORC file has three compression-related options: NONE, ZLIB, and SNAPPY. Data Factory supports reading data from an ORC file in any of these compressed formats. It uses the compression codec in the metadata to read the data. However, when writing to an ORC file, Data Factory chooses ZLIB, which is the default for ORC. Currently, there is no option to override this behavior.

Specifying ParquetFormat
If you want to parse Parquet files or write data in Parquet format, set the format type property to ParquetFormat. You do not need to specify any properties in the format section within the typeProperties section. Example:

    "format": {
        "type": "ParquetFormat"
    }

IMPORTANT
If you are not copying Parquet files as-is between on-premises and cloud data stores, you need to install JRE 8 (Java Runtime Environment) on your gateway machine. A 64-bit gateway requires a 64-bit JRE, and a 32-bit gateway requires a 32-bit JRE. You can find both versions here; choose the appropriate one.

Note the following points:
- Complex data types are not supported (MAP, LIST).
- A Parquet file has the following compression-related options: NONE, SNAPPY, GZIP, and LZO. Data Factory supports reading data from a Parquet file in any of these compressed formats. It uses the compression codec in the metadata to read the data. However, when writing to a Parquet file, Data Factory chooses SNAPPY, which is the default for Parquet format. Currently, there is no option to override this behavior.

Where is the copy operation performed?
See the Globally available data movement section for details. In short, when an on-premises data store is involved, the copy operation is performed by Data Management Gateway in your on-premises environment.
When the data movement is between two cloud stores, the copy operation is performed in the region closest to the sink location in the same geography.

HDInsight Activity - FAQ
What regions are supported by HDInsight?
See the Geographic Availability section of HDInsight Pricing Details.

What region is used by an on-demand HDInsight cluster?
The on-demand HDInsight cluster is created in the same region as the storage account that you specified to be used with the cluster.

How do I associate additional storage accounts with my HDInsight cluster?
If you are using your own HDInsight cluster (BYOC - Bring Your Own Cluster), see the following topics:
- Using an HDInsight Cluster with Alternate Storage Accounts and Metastores
- Use Additional Storage Accounts with HDInsight Hive

If you are using an on-demand cluster that is created by the Data Factory service, specify additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. In the JSON definition for the on-demand linked service, use the additionalLinkedServiceNames property to specify alternate storage accounts, as shown in the following JSON snippet:

    {
        "name": "MyHDInsightOnDemandLinkedService",
        "properties": {
            "type": "HDInsightOnDemandLinkedService",
            "typeProperties": {
                "clusterSize": 1,
                "timeToLive": "00:01:00",
                "linkedServiceName": "LinkedService-SampleData",
                "additionalLinkedServiceNames": [
                    "otherLinkedServiceName1",
                    "otherLinkedServiceName2"
                ]
            }
        }
    }

In the preceding example, otherLinkedServiceName1 and otherLinkedServiceName2 represent linked services whose definitions contain the credentials that the HDInsight cluster needs to access alternate storage accounts.

Slices - FAQ
Why are my input slices not in Ready state?
A common mistake is not setting the external property to true on the input dataset when the input data is external to the data factory (not produced by the data factory).
In the following example, you only need to set external to true on dataset1.

DataFactory1
Pipeline 1: dataset1 -> activity1 -> dataset2 -> activity2 -> dataset3
Pipeline 2: dataset3 -> activity3 -> dataset4

If you have another data factory with a pipeline that takes dataset4 (produced by pipeline 2 in data factory 1), mark dataset4 as an external dataset, because the dataset is produced by a different data factory (DataFactory1, not DataFactory2).

DataFactory2
Pipeline 1: dataset4 -> activity4 -> dataset5

If the external property is properly set, verify whether the input data exists in the location specified in the input dataset definition.

How do I run a daily slice at a time other than midnight?
Use the offset property to specify the time at which you want the slice to be produced. See the Dataset availability section for details about this property. Here is a quick example:

    "availability": {
        "frequency": "Day",
        "interval": 1,
        "offset": "06:00:00"
    }

Daily slices start at 6 AM instead of the default midnight.

How can I rerun a slice?
You can rerun a slice in one of the following ways:
- Use the Monitor and Manage App to rerun an activity window or slice. See Rerun selected activity windows for instructions.
- Click Run in the command bar on the DATA SLICE blade for the slice in the Azure portal.
- Run the Set-AzureRmDataFactorySliceStatus cmdlet with Status set to Waiting for the slice:

    Set-AzureRmDataFactorySliceStatus -Status Waiting -ResourceGroupName $ResourceGroup -DataFactoryName $df -TableName $table -StartDateTime "02/26/2015 19:00:00" -EndDateTime "02/26/2015 20:00:00"

See Set-AzureRmDataFactorySliceStatus for details about the cmdlet.

How long did it take to process a slice?
Use Activity Window Explorer in the Monitor & Manage App to find out how long it took to process a data slice. See Activity Window Explorer for details. You can also do the following in the Azure portal:
1. Click the Datasets tile on the DATA FACTORY blade for your data factory.
2. Click the specific dataset on the Datasets blade.
3. Select the slice that you are interested in from the Recent slices list on the TABLE blade.
4. Click the activity run from the Activity Runs list on the DATA SLICE blade.
5. Click the Properties tile on the ACTIVITY RUN DETAILS blade.
6. You should see the DURATION field with a value. This value is the time taken to process the slice.

How do I stop a running slice?
If you need to stop the pipeline from executing, you can use the Suspend-AzureRmDataFactoryPipeline cmdlet. Currently, suspending the pipeline does not stop the slice executions that are in progress. Once the in-progress executions finish, no extra slices are picked up. If you really want to stop all the executions immediately, the only way is to delete the pipeline and create it again. If you choose to delete the pipeline, you do NOT need to delete the tables and linked services used by the pipeline.

Move data by using Copy Activity
5/11/2017 • 9 min to read • Edit Online

Overview
In Azure Data Factory, you can use Copy Activity to copy data between on-premises and cloud data stores. After the data is copied, it can be further transformed and analyzed. You can also use Copy Activity to publish transformation and analysis results for business intelligence (BI) and application consumption. Copy Activity is powered by a secure, reliable, scalable, and globally available service. This article provides details on data movement in Data Factory and Copy Activity.

First, let's see how data migration occurs between two cloud data stores, and between an on-premises data store and a cloud data store.

NOTE
To learn about activities in general, see Understanding pipelines and activities.

Copy data between two cloud data stores
When both the source and sink data stores are in the cloud, Copy Activity goes through the following stages to copy data from the source to the sink. The service that powers Copy Activity:
1. Reads data from the source data store.
2. Performs serialization/deserialization, compression/decompression, column mapping, and type conversion. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
3. Writes data to the destination data store.

The service automatically chooses the optimal region to perform the data movement. This region is usually the one closest to the sink data store.

Copy data between an on-premises data store and a cloud data store
To securely move data between an on-premises data store and a cloud data store, install Data Management Gateway on your on-premises machine. Data Management Gateway is an agent that enables hybrid data movement and processing. You can install it on the same machine as the data store itself, or on a separate machine that has access to the data store. In this scenario, Data Management Gateway performs the serialization/deserialization, compression/decompression, column mapping, and type conversion. Data does not flow through the Azure Data Factory service. Instead, Data Management Gateway directly writes the data to the destination store.

See Move data between on-premises and cloud data stores for an introduction and walkthrough. See Data Management Gateway for detailed information about this agent.

You can also move data from/to supported data stores that are hosted on Azure IaaS virtual machines (VMs) by using Data Management Gateway. In this case, you can install Data Management Gateway on the same VM as the data store itself, or on a separate VM that has access to the data store.

Supported data stores and formats
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store.
NOTE
If you need to move data to/from a data store that Copy Activity doesn't support, use a custom activity in Data Factory with your own logic for copying/moving data. For details on creating and using a custom activity, see Use custom activities in an Azure Data Factory pipeline.

| CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| | Azure Cosmos DB (DocumentDB API) | ✓ | ✓ |
| | Azure Data Lake Store | ✓ | ✓ |
| | Azure SQL Database | ✓ | ✓ |
| | Azure SQL Data Warehouse | ✓ | ✓ |
| | Azure Search Index | | ✓ |
| | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| | DB2* | ✓ | |
| | MySQL* | ✓ | |
| | Oracle* | ✓ | ✓ |
| | PostgreSQL* | ✓ | |
| | SAP Business Warehouse* | ✓ | |
| | SAP HANA* | ✓ | |
| | SQL Server* | ✓ | ✓ |
| | Sybase* | ✓ | |
| | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| | File System* | ✓ | ✓ |
| | FTP | ✓ | |
| | HDFS* | ✓ | |
| | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| | Generic OData | ✓ | |
| | Generic ODBC* | ✓ | |
| | Salesforce | ✓ | |
| | Web Table (table from HTML) | ✓ | |
| | GE Historian* | ✓ | |

NOTE
Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.

Supported file formats
When you use Copy Activity to copy files as-is between two file-based data stores, you can skip the format section in both the input and output dataset definitions. The data is then copied efficiently without any serialization/deserialization.

Copy Activity also reads from and writes to files in specified formats: Text, JSON, Avro, ORC, and Parquet. The compression codecs GZip, Deflate, BZip2, and ZipDeflate are supported. See Supported file and compression formats for details. For example, you can do the following copy activities:
- Copy data from an on-premises SQL Server database and write it to Azure Data Lake Store in ORC format.
- Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format.
- Copy zipped files from an on-premises file system, decompress them, and land them in Azure Data Lake Store.
- Copy data in GZip compressed-text (CSV) format from Azure Blob storage and write it to Azure SQL Database.

Globally available data movement
Azure Data Factory is available only in the West US, East US, and North Europe regions. However, the service that powers Copy Activity is available globally in the following regions and geographies. The globally available topology ensures efficient data movement that usually avoids cross-region hops. See Services by region for the availability of Data Factory and data movement in a region.

Copy data between cloud data stores
When both the source and sink data stores are in the cloud, Data Factory uses a service deployment in the region that is closest to the sink in the same geography to move the data. Refer to the following table for the mapping:

| GEOGRAPHY OF THE DESTINATION DATA STORES | REGION OF THE DESTINATION DATA STORE | REGION USED FOR DATA MOVEMENT |
| --- | --- | --- |
| United States | East US | East US |
| | East US 2 | East US 2 |
| | Central US | Central US |
| | North Central US | North Central US |
| | South Central US | South Central US |
| | West Central US | West Central US |
| | West US | West US |
| | West US 2 | West US |
| Canada | Canada East | Canada Central |
| | Canada Central | Canada Central |
| Brazil | Brazil South | Brazil South |
| Europe | North Europe | North Europe |
| | West Europe | West Europe |
| United Kingdom | UK West | UK South |
| | UK South | UK South |
| Asia Pacific | Southeast Asia | Southeast Asia |
| | East Asia | Southeast Asia |
| Australia | Australia East | Australia East |
| | Australia Southeast | Australia Southeast |
| Japan | Japan East | Japan East |
| | Japan West | Japan East |
| India | Central India | Central India |
| | West India | Central India |
| | South India | Central India |

Alternatively, you can explicitly indicate the region of the Data Factory service to be used to perform the copy by specifying the executionLocation property under Copy Activity typeProperties. Supported values for this property are listed in the Region used for data movement column above.
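The region mapping above can be expressed as a small lookup table. The following Python sketch uses the mapping data from this article; the dictionary and function names are illustrative, not an official API:

```python
# Region used for data movement, keyed by the region of the destination
# data store (taken from the mapping table in this article).
DATA_MOVEMENT_REGION = {
    "East US": "East US", "East US 2": "East US 2",
    "Central US": "Central US", "North Central US": "North Central US",
    "South Central US": "South Central US", "West Central US": "West Central US",
    "West US": "West US", "West US 2": "West US",
    "Canada East": "Canada Central", "Canada Central": "Canada Central",
    "Brazil South": "Brazil South",
    "North Europe": "North Europe", "West Europe": "West Europe",
    "UK West": "UK South", "UK South": "UK South",
    "Southeast Asia": "Southeast Asia", "East Asia": "Southeast Asia",
    "Australia East": "Australia East", "Australia Southeast": "Australia Southeast",
    "Japan East": "Japan East", "Japan West": "Japan East",
    "Central India": "Central India", "West India": "Central India",
    "South India": "Central India",
}

def movement_region(sink_region):
    """Return the region that performs the copy, or None when the sink
    region is not in the supported list (in which case Copy Activity
    fails unless executionLocation is specified)."""
    return DATA_MOVEMENT_REGION.get(sink_region)
```

For example, a sink in East Asia is served from Southeast Asia, the closest supported region in the same geography.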
Note that your data goes through that region over the wire during the copy. For example, to copy between Azure stores in Korea, you can specify "executionLocation": "Japan East" to route the copy through the Japan region (see the sample JSON as a reference).

NOTE
If the region of the destination data store is not in the preceding list or is undetectable, by default Copy Activity fails instead of going through an alternative region, unless executionLocation is specified. The supported region list will be expanded over time.

Copy data between an on-premises data store and a cloud data store
When data is being copied between on-premises (or Azure virtual machines/IaaS) and cloud stores, Data Management Gateway performs the data movement on an on-premises machine or virtual machine. The data does not flow through the service in the cloud, unless you use the staged copy capability. In this case, data flows through the staging Azure Blob storage before it is written into the sink data store.

Create a pipeline with Copy Activity
You can create a pipeline with Copy Activity in a couple of ways:

By using the Copy Wizard
The Data Factory Copy Wizard helps you to create a pipeline with Copy Activity. This pipeline allows you to copy data from supported sources to destinations without writing JSON definitions for linked services, datasets, and pipelines. See Data Factory Copy Wizard for details about the wizard.

By using JSON scripts
You can use the Data Factory Editor in the Azure portal, Visual Studio, or Azure PowerShell to create a JSON definition for a pipeline (by using Copy Activity). Then, you can deploy it to create the pipeline in Data Factory. See Tutorial: Use Copy Activity in an Azure Data Factory pipeline for a tutorial with step-by-step instructions.

JSON properties (such as name, description, input and output tables, and policies) are available for all types of activities. Properties that are available in the typeProperties section of the activity vary with each activity type.
For Copy Activity, the typeProperties section varies depending on the types of sources and sinks. Click a source/sink in the Supported sources and sinks section to learn about the type properties that Copy Activity supports for that data store. Here's a sample JSON definition:

    {
        "name": "ADFTutorialPipeline",
        "properties": {
            "description": "Copy data from Azure blob to Azure SQL table",
            "activities": [
                {
                    "name": "CopyFromBlobToSQL",
                    "type": "Copy",
                    "inputs": [ { "name": "InputBlobTable" } ],
                    "outputs": [ { "name": "OutputSQLTable" } ],
                    "typeProperties": {
                        "source": { "type": "BlobSource" },
                        "sink": { "type": "SqlSink" },
                        "executionLocation": "Japan East"
                    },
                    "policy": {
                        "concurrency": 1,
                        "executionPriorityOrder": "NewestFirst",
                        "retry": 0,
                        "timeout": "01:00:00"
                    }
                }
            ],
            "start": "2016-07-12T00:00:00Z",
            "end": "2016-07-13T00:00:00Z"
        }
    }

The schedule that is defined in the output dataset determines when the activity runs (for example: daily, with frequency set to Day and interval set to 1). The activity copies data from an input dataset (source) to an output dataset (sink). You can specify more than one input dataset for Copy Activity. The additional datasets are used to verify the dependencies before the activity runs, but only the data from the first dataset is copied to the destination dataset. For more information, see Scheduling and execution.

Performance and tuning
See the Copy Activity performance and tuning guide, which describes key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory. It also lists the observed performance during internal testing and discusses various ways to optimize the performance of Copy Activity.

Scheduling and sequential copy
See Scheduling and execution for detailed information about how scheduling and execution work in Data Factory. It is possible to run multiple copy operations one after another in a sequential/ordered manner. See the Copy sequentially section.
Type conversions
Different data stores have different native type systems. Copy Activity performs automatic type conversions from source types to sink types with the following two-step approach:
1. Convert from native source types to a .NET type.
2. Convert from a .NET type to a native sink type.

The mapping from a native type system to a .NET type for a data store is in the respective data store article. (Click the specific link in the Supported data stores table.) You can use these mappings to determine appropriate types while creating your tables, so that Copy Activity performs the right conversions.

Next steps
To learn more about Copy Activity, see Copy data from Azure Blob storage to Azure SQL Database. To learn about moving data from an on-premises data store to a cloud data store, see Move data from on-premises to cloud data stores.

Azure Data Factory Copy Wizard
5/2/2017 • 4 min to read • Edit Online

The Azure Data Factory Copy Wizard eases the process of ingesting data, which is usually a first step in an end-to-end data integration scenario. When going through the Azure Data Factory Copy Wizard, you do not need to understand any JSON definitions for linked services, data sets, and pipelines. The wizard automatically creates a pipeline to copy data from the selected data source to the selected destination. In addition, the Copy Wizard helps you to validate the data being ingested at the time of authoring. This saves time, especially when you are ingesting data for the first time from the data source. To start the Copy Wizard, click the Copy data tile on the home page of your data factory.

Designed for big data
This wizard allows you to easily move data from a wide variety of sources to destinations in minutes. After you go through the wizard, a pipeline with a copy activity is automatically created for you, along with dependent Data Factory entities (linked services and data sets). No additional steps are required to create the pipeline.
NOTE
For step-by-step instructions to create a sample pipeline to copy data from an Azure blob to an Azure SQL Database table, see the Copy Wizard tutorial.

The wizard is designed with big data in mind from the start, with support for diverse data and object types. You can author Data Factory pipelines that move hundreds of folders, files, or tables. The wizard supports automatic data preview, schema capture and mapping, and data filtering.

Automatic data preview
You can preview part of the data from the selected data source in order to validate whether the data is what you want to copy. In addition, if the source data is in a text file, the Copy Wizard parses the text file to learn the row and column delimiters and the schema automatically.

Schema capture and mapping
The schema of the input data may not match the schema of the output data in some cases. In this scenario, you need to map columns from the source schema to columns from the destination schema.

TIP
When copying data from SQL Server or Azure SQL Database into Azure SQL Data Warehouse, if the table does not exist in the destination store, Data Factory supports automatic table creation by using the source's schema. Learn more from Move data to and from Azure SQL Data Warehouse using Azure Data Factory.

Use a drop-down list to select a column from the source schema to map to a column in the destination schema. The Copy Wizard tries to understand your pattern for column mapping. It applies the same pattern to the rest of the columns, so that you do not need to select each of the columns individually to complete the schema mapping. If you prefer, you can override these mappings by using the drop-down lists to map the columns one by one. The pattern becomes more accurate as you map more columns. The Copy Wizard constantly updates the pattern, and ultimately reaches the right pattern for the column mapping you want to achieve.
Filtering data
You can filter source data to select only the data that needs to be copied to the sink data store. Filtering reduces the volume of the data to be copied to the sink data store and therefore enhances the throughput of the copy operation. It provides a flexible way to filter data in a relational database by using the SQL query language, or files in an Azure blob folder by using Data Factory functions and variables.

Filtering of data in a database
The following screenshot shows a SQL query that uses the Text.Format function and the WindowStart variable.

Filtering of data in an Azure blob folder
You can use variables in the folder path to copy data from a folder that is determined at runtime based on system variables. The supported variables are: {year}, {month}, {day}, {hour}, {minute}, and {custom}. For example: inputfolder/{year}/{month}/{day}. Suppose that you have input folders in the following format:

    2016/03/01/01
    2016/03/01/02
    2016/03/01/03
    ...

Click the Browse button for File or folder, browse to one of these folders (for example, 2016->03->01->02), and click Choose. You should see 2016/03/01/02 in the text box. Now, replace 2016 with {year}, 03 with {month}, 01 with {day}, and 02 with {hour}, and press the Tab key. You should see drop-down lists to select the format for these four variables. As shown in the following screenshot, you can also use a custom variable and any supported format strings. To select a folder with that structure, use the Browse button first. Then replace a value with {custom}, and press the Tab key to see the text box where you can type the format string.

Scheduling options
You can run the copy operation once or on a schedule (hourly, daily, and so on). Both options can be used across the breadth of connectors and environments, including on-premises, cloud, and local desktop copy. A one-time copy operation enables data movement from a source to a destination only once.
It applies to data of any size and any supported format. The scheduled copy allows you to copy data on a prescribed recurrence. You can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy.

Next steps
For a quick walkthrough of using the Data Factory Copy Wizard to create a pipeline with Copy Activity, see Tutorial: Create a pipeline using the Copy Wizard.

Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Data Factory
6/6/2017 • 7 min to read

Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on massively parallel processing (MPP) architecture, SQL Data Warehouse is optimized for enterprise data warehouse workloads. It offers cloud elasticity with the flexibility to scale storage and compute independently.

Getting started with Azure SQL Data Warehouse is now easier than ever using Azure Data Factory. Azure Data Factory is a fully managed cloud-based data integration service. It can be used to populate a SQL Data Warehouse with data from your existing system, saving you valuable time while you evaluate SQL Data Warehouse and build your analytics solutions. Here are the key benefits of loading data into Azure SQL Data Warehouse using Azure Data Factory:

Easy to set up: an intuitive 5-step wizard with no scripting required.
Rich data store support: built-in support for a rich set of on-premises and cloud-based data stores.
Secure and compliant: data is transferred over HTTPS or ExpressRoute, and global service presence ensures your data never leaves the geographical boundary.
Unparalleled performance by using PolyBase: using PolyBase is the most efficient way to move data into Azure SQL Data Warehouse. Using the staging blob feature, you can achieve high load speeds from all types of data stores in addition to Azure Blob storage, which PolyBase supports by default.
This article shows you how to use the Data Factory Copy Wizard to load 1 TB of data from Azure Blob Storage into Azure SQL Data Warehouse in under 15 minutes, at over 1.2 GBps throughput, with step-by-step instructions.

NOTE
For general information about the capabilities of Data Factory in moving data to/from Azure SQL Data Warehouse, see the Move data to and from Azure SQL Data Warehouse using Azure Data Factory article. You can also build pipelines using the Azure portal, Visual Studio, PowerShell, etc. See Tutorial: Copy data from Azure Blob to Azure SQL Database for a quick walkthrough with step-by-step instructions for using the Copy Activity in Azure Data Factory.

Prerequisites
Azure Blob Storage: this experiment uses Azure Blob Storage (GRS) for storing the TPC-H testing dataset. If you do not have an Azure storage account, learn how to create a storage account.

TPC-H data: we are going to use TPC-H as the testing dataset. To do that, you need to use dbgen from the TPC-H toolkit, which helps you generate the dataset. You can either download the source code for dbgen from TPC Tools and compile it yourself, or download the compiled binary from GitHub. Run dbgen.exe with the following commands to generate 1 TB of flat files for the lineitem table, spread across 10 files:

dbgen -s 1000 -S 1 -C 10 -T L -v
dbgen -s 1000 -S 2 -C 10 -T L -v
…
dbgen -s 1000 -S 10 -C 10 -T L -v

Now copy the generated files to Azure Blob. Refer to Move data to and from an on-premises file system by using Azure Data Factory for how to do that using ADF Copy.

Azure SQL Data Warehouse: this experiment loads data into an Azure SQL Data Warehouse created with 6,000 DWUs. Refer to Create an Azure SQL Data Warehouse for detailed instructions on how to create a SQL Data Warehouse database.
To get the best possible load performance into SQL Data Warehouse using PolyBase, we choose the maximum number of Data Warehouse Units (DWUs) allowed in the Performance setting, which is 6,000 DWUs.

NOTE
When loading from Azure Blob, the data loading performance is directly proportional to the number of DWUs you configure on the SQL Data Warehouse:
Loading 1 TB into a 1,000 DWU SQL Data Warehouse takes 87 minutes (~200 MBps throughput)
Loading 1 TB into a 2,000 DWU SQL Data Warehouse takes 46 minutes (~380 MBps throughput)
Loading 1 TB into a 6,000 DWU SQL Data Warehouse takes 14 minutes (~1.2 GBps throughput)

To create a SQL Data Warehouse with 6,000 DWUs, move the Performance slider all the way to the right. For an existing database that is not configured with 6,000 DWUs, you can scale it up using the Azure portal. Navigate to the database in the Azure portal; there is a Scale button in the Overview panel, shown in the following image. Click the Scale button to open the following panel, move the slider to the maximum value, and click the Save button.

This experiment loads data into Azure SQL Data Warehouse using the xlargerc resource class. To achieve the best possible throughput, the copy needs to be performed using a SQL Data Warehouse user belonging to the xlargerc resource class. Learn how to do that by following the Change a user resource class example.
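The referenced example boils down to creating a dedicated loading user and adding it to the xlargerc database role. A sketch, assuming hypothetical login and user names:

```sql
-- In the master database of the logical server (names are hypothetical):
CREATE LOGIN LoaderLogin WITH PASSWORD = '<strong password here>';

-- In the SQL Data Warehouse database:
CREATE USER LoaderUser FOR LOGIN LoaderLogin;
EXEC sp_addrolemember 'xlargerc', 'LoaderUser';
```

In SQL Data Warehouse, resource classes are implemented as database roles, so membership in xlargerc is what grants the larger memory allocation to that user's load queries.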
Create the destination table schema in the Azure SQL Data Warehouse database by running the following DDL statement:

```sql
CREATE TABLE [dbo].[lineitem]
(
    [L_ORDERKEY] [bigint] NOT NULL,
    [L_PARTKEY] [bigint] NOT NULL,
    [L_SUPPKEY] [bigint] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] [decimal](15, 2) NULL,
    [L_EXTENDEDPRICE] [decimal](15, 2) NULL,
    [L_DISCOUNT] [decimal](15, 2) NULL,
    [L_TAX] [decimal](15, 2) NULL,
    [L_RETURNFLAG] [char](1) NULL,
    [L_LINESTATUS] [char](1) NULL,
    [L_SHIPDATE] [date] NULL,
    [L_COMMITDATE] [date] NULL,
    [L_RECEIPTDATE] [date] NULL,
    [L_SHIPINSTRUCT] [char](25) NULL,
    [L_SHIPMODE] [char](10) NULL,
    [L_COMMENT] [varchar](44) NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
```

With the prerequisite steps completed, we are now ready to configure the copy activity using the Copy Wizard.

Launch Copy Wizard
1. Log in to the Azure portal.
2. Click + NEW from the top-left corner, click Intelligence + analytics, and click Data Factory.
3. In the New data factory blade:
   a. Enter LoadIntoSQLDWDataFactory for the name. The name of the Azure data factory must be globally unique. If you receive the error "Data factory name “LoadIntoSQLDWDataFactory” is not available", change the name of the data factory (for example, yournameLoadIntoSQLDWDataFactory) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
   b. Select your Azure subscription.
   c. For Resource Group, do one of the following steps:
      - Select Use existing to select an existing resource group.
      - Select Create new to enter a name for a resource group.
   d. Select a location for the data factory.
   e. Select the Pin to dashboard check box at the bottom of the blade.
   f. Click Create.
4. After the creation is complete, you see the Data Factory blade as shown in the following image.
5. On the Data Factory home page, click the Copy data tile to launch the Copy Wizard.
NOTE
If you see that the web browser is stuck at "Authorizing...", disable the Block third-party cookies and site data setting (or keep it enabled and create an exception for login.microsoftonline.com), and then try launching the wizard again.

Step 1: Configure data loading schedule
The first step is to configure the data loading schedule. In the Properties page:
1. Enter CopyFromBlobToAzureSqlDataWarehouse for Task name.
2. Select the Run once now option.
3. Click Next.

Step 2: Configure source
This section shows you the steps to configure the source: the Azure blob containing the 1-TB TPC-H line item files.
1. Select Azure Blob Storage as the data store and click Next.
2. Fill in the connection information for the Azure Blob storage account, and click Next.
3. Choose the folder containing the TPC-H line item files and click Next.
4. Upon clicking Next, the file format settings are detected automatically. Check to make sure that the column delimiter is '|' instead of the default comma ','. Click Next after you have previewed the data.

Step 3: Configure destination
This section shows you how to configure the destination: the lineitem table in the Azure SQL Data Warehouse database.
1. Choose Azure SQL Data Warehouse as the destination store and click Next.
2. Fill in the connection information for Azure SQL Data Warehouse. Make sure you specify a user that is a member of the role xlargerc (see the prerequisites section for detailed instructions), and click Next.
3. Choose the destination table and click Next.
4. In the Schema mapping page, leave the "Apply column mapping" option unchecked and click Next.

Step 4: Performance settings
Allow polybase is checked by default. Click Next.

Step 5: Deploy and monitor load results
1. Click the Finish button to deploy.
2. After the deployment is complete, click "Click here to monitor copy pipeline" to monitor the copy run progress. Select the copy pipeline you created in the Activity Windows list.
You can view the copy run details in the Activity Window Explorer in the right panel, including the data volume read from the source and written into the destination, the duration, and the average throughput for the run. As you can see from the following screenshot, copying 1 TB from Azure Blob Storage into SQL Data Warehouse took 14 minutes, effectively achieving 1.22 GBps throughput.

Best practices
Here are a few best practices for running your Azure SQL Data Warehouse database:
Use a larger resource class when loading into a CLUSTERED COLUMNSTORE INDEX.
For more efficient joins, consider using hash distribution by a selected column instead of the default round-robin distribution.
For faster load speeds, consider using a heap for transient data.
Create statistics after you finish loading into Azure SQL Data Warehouse.
See Best practices for Azure SQL Data Warehouse for details.

Next steps
Data Factory Copy Wizard - This article provides details about the Copy Wizard.
Copy Activity performance and tuning guide - This article contains the reference performance measurements and tuning guide.

Copy Activity performance and tuning guide
5/16/2017 • 28 min to read

Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Blazing-fast data loading performance is key to ensuring you can focus on the core "big data" problem: building advanced analytics solutions and getting deep insights from all that data. Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a highly optimized data loading experience that is easy to configure and set up. With just a single copy activity, you can achieve:
Loading data into Azure SQL Data Warehouse at 1.2 GBps.
For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
Loading data into Azure Blob storage at 1.0 GBps.
Loading data into Azure Data Lake Store at 1.0 GBps.

This article describes:
Performance reference numbers for supported source and sink data stores to help you plan your project;
Features that can boost the copy throughput in different scenarios, including cloud data movement units, parallel copy, and staged copy;
Performance tuning guidance on how to tune the performance and the key factors that can impact copy performance.

NOTE
If you are not familiar with Copy Activity in general, see Move data by using Copy Activity before reading this article.

Performance reference

NOTE
You can achieve higher throughput by leveraging more data movement units (DMUs) than the default maximum of 32 DMUs for a cloud-to-cloud copy activity run. For example, with 100 DMUs, you can copy data from Azure Blob into Azure Data Lake Store at 1.0 GBps. See the Cloud data movement units section for details about this feature and the supported scenario. Contact Azure support to request more DMUs.

Points to note:
Throughput is calculated by using the following formula: [size of data read from source] / [Copy Activity run duration].
The performance reference numbers in the table were measured using the TPC-H data set in a single copy activity run.
To copy between cloud data stores, set cloudDataMovementUnits to 1 and 4 (or 8) for comparison. parallelCopies is not specified. See the Parallel copy section for details about these features.
In Azure data stores, the source and sink are in the same Azure region.
For hybrid (on-premises to cloud, or cloud to on-premises) data movement, a single instance of the gateway was running on a machine that was separate from the on-premises data store. The configuration is listed in the next table.
When a single activity was running on the gateway, the copy operation consumed only a small portion of the test machine's CPU, memory, or network bandwidth.

CPU: 32 cores, 2.20 GHz Intel Xeon E5-2660 v2
Memory: 128 GB
Network: Internet interface: 10 Gbps; intranet interface: 40 Gbps

Parallel copy
You can read data from the source or write data to the destination in parallel within a Copy Activity run. This feature enhances the throughput of a copy operation and reduces the time it takes to move data.

This setting is different from the concurrency property in the activity definition. The concurrency property determines the number of concurrent Copy Activity runs to process data from different activity windows (1 AM to 2 AM, 2 AM to 3 AM, 3 AM to 4 AM, and so on). This capability is helpful when you perform a historical load. The parallel copy capability applies to a single activity run.

Let's look at a sample scenario. In the following example, multiple slices from the past need to be processed. Data Factory runs an instance of Copy Activity (an activity run) for each slice:
The data slice from the first activity window (1 AM to 2 AM) ==> Activity run 1
The data slice from the second activity window (2 AM to 3 AM) ==> Activity run 2
The data slice from the third activity window (3 AM to 4 AM) ==> Activity run 3
And so on.

In this example, when the concurrency value is set to 2, Activity run 1 and Activity run 2 copy data from two activity windows concurrently to improve data movement performance. However, if multiple files are associated with Activity run 1, the data movement service copies files from the source to the destination one file at a time.

Cloud data movement units
A cloud data movement unit (DMU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. A DMU might be used in a cloud-to-cloud copy operation, but not in a hybrid copy.
By default, Data Factory uses a single cloud DMU to perform a single Copy Activity run. To override this default, specify a value for the cloudDataMovementUnits property as follows. For information about the level of performance gain you might get when you configure more units for a specific copy source and sink, see the performance reference.

```json
"activities": [
    {
        "name": "Sample copy activity",
        "description": "",
        "type": "Copy",
        "inputs": [ { "name": "InputDataset" } ],
        "outputs": [ { "name": "OutputDataset" } ],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "AzureDataLakeStoreSink" },
            "cloudDataMovementUnits": 32
        }
    }
]
```

The allowed values for the cloudDataMovementUnits property are 1 (default), 2, 4, 8, 16, and 32. The actual number of cloud DMUs that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern.

NOTE
If you need more cloud DMUs for a higher throughput, contact Azure support. A setting of 8 and above currently works only when you copy multiple files from Blob storage/Data Lake Store/Amazon S3/cloud FTP/cloud SFTP to Blob storage/Data Lake Store/Azure SQL Database.

parallelCopies
You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. You can think of this property as the maximum number of threads within Copy Activity that can read from your source or write to your sink data stores in parallel. For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store. The default number of parallel copies that it uses depends on the type of source and sink that you are using:

SOURCE AND SINK | DEFAULT PARALLEL COPY COUNT DETERMINED BY SERVICE
Copy data between file-based stores (Blob storage; Data Lake Store; Amazon S3; an on-premises file system; an on-premises HDFS) | Between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the gateway machine used for a hybrid copy (to copy data to or from an on-premises data store).
Copy data from any source data store to Azure Table storage | 4
All other source and sink pairs | 1

Usually, the default behavior should give you the best throughput. However, to control the load on machines that host your data stores, or to tune copy performance, you may choose to override the default value and specify a value for the parallelCopies property. The value must be between 1 and 32 (both inclusive). At run time, for the best performance, Copy Activity uses a value that is less than or equal to the value that you set.

```json
"activities": [
    {
        "name": "Sample copy activity",
        "description": "",
        "type": "Copy",
        "inputs": [ { "name": "InputDataset" } ],
        "outputs": [ { "name": "OutputDataset" } ],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "AzureDataLakeStoreSink" },
            "parallelCopies": 8
        }
    }
]
```

Points to note:
When you copy data between file-based stores, parallelCopies determines the parallelism at the file level. The chunking within a single file happens underneath automatically and transparently; it is designed to use the best suitable chunk size for a given source data store type to load data in parallel, orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile, Copy Activity cannot take advantage of file-level parallelism.
When you specify a value for the parallelCopies property, consider the load increase on your source and sink data stores, and on the gateway if it is a hybrid copy. This happens especially when you have multiple activities or concurrent runs of the same activities that run against the same data store.
If you notice that either the data store or the gateway is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
When you copy data from stores that are not file-based to stores that are file-based, the data movement service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.

NOTE
You must use Data Management Gateway version 1.11 or later to use the parallelCopies feature when you do a hybrid copy.

To better use these two properties, and to enhance your data movement throughput, see the sample use cases. You don't need to configure parallelCopies to take advantage of the default behavior. If you do configure it and parallelCopies is too small, multiple cloud DMUs might not be fully utilized.

Billing impact
It's important to remember that you are charged based on the total time of the copy operation. If a copy job used to take one hour with one cloud unit and now takes 15 minutes with four cloud units, the overall bill remains almost the same. For example, you use four cloud units. The first cloud unit spends 10 minutes, the second one 10 minutes, the third one 5 minutes, and the fourth one 5 minutes, all in one Copy Activity run. You are charged for the total copy (data movement) time, which is 10 + 10 + 5 + 5 = 30 minutes. Using parallelCopies does not affect billing.

Staged copy
When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Staging is especially useful in the following cases:
1. You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data Warehouse. However, the source data must be in Blob storage, and it must meet additional criteria. When you load data from a data store other than Blob storage, you can activate data copying via interim staging Blob storage.
In that case, Data Factory performs the required data transformations to ensure that the data meets the requirements of PolyBase. Then it uses PolyBase to load the data into SQL Data Warehouse. For more details, see Use PolyBase to load data into Azure SQL Data Warehouse. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
2. Sometimes it takes a while to perform a hybrid data movement (that is, to copy between an on-premises data store and a cloud data store) over a slow network connection. To improve performance, you can compress the data on-premises so that it takes less time to move data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load it into the destination data store.
3. You don't want to open ports other than port 80 and port 443 in your firewall, because of corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, take advantage of the gateway to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443. Then, load the data into SQL Database or SQL Data Warehouse from the Blob storage staging instance. In this flow, you don't need to enable port 1433.

How staged copy works
When you activate the staging feature, first the data is copied from the source data store to the staging data store (bring your own). Next, the data is copied from the staging data store to the sink data store. Data Factory automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the staging storage after the data movement is complete.

In the cloud copy scenario (both source and sink data stores are in the cloud), the gateway is not used. The Data Factory service performs the copy operations.
In the hybrid copy scenario (source is on-premises and sink is in the cloud), the gateway moves data from the source data store to a staging data store. The Data Factory service moves data from the staging data store to the sink data store. Copying data from a cloud data store to an on-premises data store via staging also is supported with the reversed flow.

When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before moving data from the source data store to an interim or staging data store, and then decompressed before moving data from the interim or staging data store to the sink data store.

Currently, you can't copy data between two on-premises data stores by using a staging store. We expect this option to be available soon.

Configuration
Configure the enableStaging setting in Copy Activity to specify whether you want the data to be staged in Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the additional properties listed in the next table. If you don't have one, you also need to create an Azure Storage or Storage shared access signature-linked service for staging.

PROPERTY | DESCRIPTION | DEFAULT VALUE | REQUIRED
enableStaging | Specify whether you want to copy data via an interim staging store. | False | No
linkedServiceName | Specify the name of an AzureStorage or AzureStorageSas linked service, which refers to the instance of Storage that you use as an interim staging store. You cannot use Storage with a shared access signature to load data into SQL Data Warehouse via PolyBase. You can use it in all other scenarios. | N/A | Yes, when enableStaging is set to TRUE
path | Specify the Blob storage path that you want to contain the staged data. If you do not provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or you require temporary data to be in a specific location. | N/A | No
enableCompression | Specifies whether data should be compressed before it is copied to the destination. This setting reduces the volume of data being transferred. | False | No

Here's a sample definition of Copy Activity with the properties that are described in the preceding table:

```json
"activities": [
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [ { "name": "OnpremisesSQLServerInput" } ],
        "outputs": [ { "name": "AzureSQLDBOutput" } ],
        "typeProperties": {
            "source": { "type": "SqlSource" },
            "sink": { "type": "SqlSink" },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": "MyStagingBlob",
                "path": "stagingcontainer/path",
                "enableCompression": true
            }
        }
    }
]
```

Billing impact
You are charged based on two steps: copy duration and copy type. When you use staging during a cloud copy (copying data from a cloud data store to another cloud data store), you are charged the [sum of copy duration for step 1 and step 2] x [cloud copy unit price]. When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data store), you are charged for [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price].

Performance tuning steps
We suggest that you take these steps to tune the performance of your Data Factory service with Copy Activity:
1. Establish a baseline. During the development phase, test your pipeline by using Copy Activity against a representative data sample. You can use the Data Factory slicing model to limit the amount of data you work with. Collect the execution time and performance characteristics by using the Monitoring and Management App. Choose Monitor & Manage on your Data Factory home page. In the tree view, choose the output dataset. In the Activity Windows list, choose the Copy Activity run.
The Activity Windows list shows the Copy Activity duration and the size of the data that's copied. The throughput is listed in Activity Window Explorer. To learn more about the app, see Monitor and manage Azure Data Factory pipelines by using the Monitoring and Management App. Later in the article, you can compare the performance and configuration of your scenario to Copy Activity's performance reference from our tests.
2. Diagnose and optimize performance. If the performance you observe doesn't meet your expectations, you need to identify performance bottlenecks. Then, optimize performance to remove or reduce the effect of bottlenecks. A full description of performance diagnosis is beyond the scope of this article, but here are some common considerations:
Performance features: parallel copy, cloud data movement units, and staged copy
Source
Sink
Serialization and deserialization
Compression
Column mapping
Data Management Gateway
Other considerations
3. Expand the configuration to your entire data set. When you're satisfied with the execution results and performance, you can expand the definition and pipeline active period to cover your entire data set.

Considerations for the source
General
Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it. For Microsoft data stores, see the monitoring and tuning topics that are specific to data stores; they help you understand data store performance characteristics, minimize response times, and maximize throughput.
If you copy data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
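In the pipeline JSON, PolyBase is switched on through the allowPolyBase property of a SqlDWSink. A minimal sketch of such a sink definition (the reject settings shown are optional tuning knobs):

```json
"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100,
        "useTypeDefault": true
    }
}
```

When the source data is not in Blob storage, allowPolyBase can be combined with the staged copy feature described earlier so that the data is staged in Blob storage first and then loaded via PolyBase.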
File-based data stores (includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)
Average file size and file count: Copy Activity transfers data one file at a time. With the same amount of data to be moved, the overall throughput is lower if the data consists of many small files rather than a few large files, because of the bootstrap phase for each file. Therefore, if possible, combine small files into larger files to gain higher throughput.
File format and compression: For more ways to improve performance, see the Considerations for serialization and deserialization and Considerations for compression sections.
For the on-premises file system scenario, in which Data Management Gateway is required, see the Considerations for Data Management Gateway section.

Relational data stores (includes SQL Database; SQL Data Warehouse; Amazon Redshift; SQL Server databases; and Oracle, MySQL, DB2, Teradata, Sybase, and PostgreSQL databases, etc.)
Data pattern: Your table schema affects copy throughput. To copy the same amount of data, a large row size gives you better performance than a small row size, because the database can more efficiently retrieve the same amount of data in fewer batches.
Query or stored procedure: Optimize the logic of the query or stored procedure you specify in the Copy Activity source to fetch data more efficiently.
For on-premises relational databases, such as SQL Server and Oracle, which require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.

Considerations for the sink
General
Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it. For Microsoft data stores, refer to the monitoring and tuning topics that are specific to data stores. These topics can help you understand data store performance characteristics and how to minimize response times and maximize throughput.
If you are copying data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.

File-based data stores (includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)
Copy behavior: If you copy data from a different file-based data store, Copy Activity has three options via the copyBehavior property: it preserves hierarchy, flattens hierarchy, or merges files. Either preserving or flattening hierarchy has little or no performance overhead, but merging files causes performance overhead to increase.
File format and compression: See the Considerations for serialization and deserialization and Considerations for compression sections for more ways to improve performance.
Blob storage: Currently, Blob storage supports only block blobs for optimized data transfer and throughput.
For on-premises file system scenarios that require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.

Relational data stores (includes SQL Database, SQL Data Warehouse, SQL Server databases, and Oracle databases)
Copy behavior: Depending on the properties you've set for sqlSink, Copy Activity writes data to the destination database in different ways. By default, the data movement service uses the Bulk Copy API to insert data in append mode, which provides the best performance. If you configure a stored procedure in the sink, the database applies the data one row at a time instead of as a bulk load, and performance drops significantly. If your data set is large, when applicable, consider switching to using the sqlWriterCleanupScript property.
If you configure the sqlWriterCleanupScript property, the service triggers the script for each Copy Activity run, and then uses the Bulk Copy API to insert the data. For example, to overwrite the entire table with the latest data, you can specify a script that first deletes all records before bulk loading the new data from the source.

Data pattern and batch size: Your table schema affects copy throughput. To copy the same amount of data, a large row size gives you better performance than a small row size because the database can more efficiently commit fewer batches of data. Copy Activity inserts data in a series of batches. You can set the number of rows in a batch by using the writeBatchSize property. If your data has small rows, you can set the writeBatchSize property to a higher value to benefit from lower batch overhead and higher throughput. If the row size of your data is large, be careful when you increase writeBatchSize. A high value might lead to a copy failure caused by overloading the database.

For on-premises relational databases like SQL Server and Oracle, which require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.

NoSQL stores (includes Table storage and Azure Cosmos DB)

For Table storage:

Partition: Writing data to interleaved partitions dramatically degrades performance. Sort your source data by partition key so that the data is inserted efficiently into one partition after another, or adjust the logic to write the data to a single partition.

For Azure Cosmos DB:

Batch size: The writeBatchSize property sets the number of parallel requests to the Azure Cosmos DB service to create documents. You can expect better performance when you increase writeBatchSize because more parallel requests are sent to Azure Cosmos DB. However, watch for throttling when you write to Azure Cosmos DB (the error message is "Request rate is large").
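Putting the sink-side properties above together, a SQL sink sketch with a batch size and a cleanup script might look like the following. The batch size value and the staging table name are hypothetical and would need tuning for your workload:

```json
"sink": {
    "type": "SqlSink",
    "writeBatchSize": 10000,
    "sqlWriterCleanupScript": "delete from dbo.TargetTable"
}
```

With this configuration, the cleanup script runs once per slice before the Bulk Copy API inserts the new data, which is how the "overwrite the entire table" pattern described above is usually implemented.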
Various factors can cause throttling, including document size, the number of terms in the documents, and the target collection's indexing policy. To achieve higher copy throughput, consider using a collection with a higher performance level (for example, S3).

Considerations for serialization and deserialization

Serialization and deserialization can occur when your input data set or output data set is a file. See Supported file and compression formats for details on the file formats supported by Copy Activity.

Copy behavior:

Copying files between file-based data stores: When the input and output data sets both have the same file format settings, or neither has any, the data movement service executes a binary copy without any serialization or deserialization. You see higher throughput than in scenarios where the source and sink file format settings differ.

When the input and output data sets are both in text format and only the encoding type differs, the data movement service does only encoding conversion. It doesn't do any serialization and deserialization, but the encoding conversion causes some performance overhead compared to a binary copy.

When the input and output data sets have different file formats or different configurations, like delimiters, the data movement service deserializes the source data to stream, transform, and then serialize it into the output format you indicated. This operation results in a much more significant performance overhead compared to the other scenarios.

When you copy files to or from a data store that is not file-based (for example, from a file-based store to a relational store), the serialization or deserialization step is required. This step results in significant performance overhead.

File format: The file format you choose might affect copy performance. For example, Avro is a compact binary format that stores metadata with data. It has broad support in the Hadoop ecosystem for processing and querying.
However, Avro is more expensive for serialization and deserialization, which results in lower copy throughput compared to text format. Choose your file format holistically across the processing flow. Consider the form in which the data is stored in source data stores or extracted from external systems; the best format for storage, analytical processing, and querying; and the format in which the data should be exported into data marts for reporting and visualization tools. Sometimes a file format that is suboptimal for read and write performance might be a good choice when you consider the overall analytical process.

Considerations for compression

When your input or output data set is a file, you can set Copy Activity to perform compression or decompression as it writes data to the destination. When you choose compression, you make a tradeoff between input/output (I/O) and CPU. Compressing the data costs extra in compute resources, but in return it reduces network I/O and storage. Depending on your data, you may see a boost in overall copy throughput.

Codec: Copy Activity supports gzip, bzip2, and Deflate compression types. Azure HDInsight can consume all three types for processing. Each compression codec has advantages. For example, bzip2 has the lowest copy throughput, but you get the best Hive query performance with bzip2 because you can split it for processing. Gzip is the most balanced option, and it is used the most often. Choose the codec that best suits your end-to-end scenario.

Level: You can choose from two options for each compression codec: fastest compressed and optimally compressed. The fastest compressed option compresses the data as quickly as possible, even if the resulting file is not optimally compressed. The optimally compressed option spends more time on compression and yields a minimal amount of data. You can test both options to see which provides better overall performance in your case.
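As a sketch of the codec and level choices above, compression is configured on the file-based dataset. The folder path here is hypothetical:

```json
"typeProperties": {
    "folderPath": "mycontainer/compresseddata",
    "format": {
        "type": "TextFormat"
    },
    "compression": {
        "type": "GZip",
        "level": "Optimal"
    }
}
```

Setting "level" to "Fastest" instead of "Optimal" trades output size for CPU time, matching the two options described above.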
To copy a large amount of data between an on-premises store and the cloud, consider using interim blob storage with compression. Using interim storage is helpful when the bandwidth between your corporate network and Azure services is the limiting factor, and you want both the input data set and the output data set to be in uncompressed form. More specifically, you can break a single copy activity into two copy activities. The first copy activity copies from the source to an interim or staging blob in compressed form. The second copy activity copies the compressed data from staging, and then decompresses it while it writes to the sink.

Considerations for column mapping

You can set the columnMappings property in Copy Activity to map all or a subset of the input columns to the output columns. After the data movement service reads the data from the source, it needs to perform column mapping on the data before it writes the data to the sink. This extra processing reduces copy throughput.

If your source data store is queryable (for example, a relational store like SQL Database or SQL Server, or a NoSQL store like Table storage or Azure Cosmos DB), consider pushing the column filtering and reordering logic to the query property instead of using column mapping. This way, the projection occurs while the data movement service reads data from the source data store, where it is much more efficient.

Considerations for Data Management Gateway

For gateway setup recommendations, see Considerations for using Data Management Gateway.

Gateway machine environment: We recommend that you use a dedicated machine to host Data Management Gateway. Use tools like PerfMon to examine CPU, memory, and bandwidth use during a copy operation on your gateway machine. Switch to a more powerful machine if CPU, memory, or network bandwidth becomes a bottleneck.
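When column mapping is unavoidable, it is expressed through a translator in the copy activity. A minimal sketch, with hypothetical column names, might look like this:

```json
"typeProperties": {
    "source": {
        "type": "SqlSource"
    },
    "sink": {
        "type": "BlobSink"
    },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
    }
}
```

As the section above explains, if the source is queryable it is usually faster to express the same projection in the source query and drop the translator entirely.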
Concurrent Copy Activity runs: A single instance of Data Management Gateway can serve multiple Copy Activity runs at the same time, or concurrently. The maximum number of concurrent jobs is calculated based on the gateway machine's hardware configuration. Additional copy jobs are queued until they are picked up by the gateway or until another job times out. To avoid resource contention on the gateway machine, you can stagger your Copy Activity schedule to reduce the number of copy jobs in the queue at a time, or consider splitting the load onto multiple gateway machines.

Other considerations

If the size of the data you want to copy is large, you can adjust your business logic to further partition the data by using the slicing mechanism in Data Factory. Then, schedule Copy Activity to run more frequently to reduce the data size for each Copy Activity run.

Be cautious about the number of data sets and copy activities that require Data Factory to connect to the same data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded performance, copy-job internal retries, and in some cases, execution failures.

Sample scenario: Copy from an on-premises SQL Server to Blob storage

Scenario: A pipeline is built to copy data from an on-premises SQL Server to Blob storage in CSV format. To make the copy job faster, the CSV files should be compressed into bzip2 format.

Test and analysis: The throughput of Copy Activity is less than 2 MBps, which is much slower than the performance benchmark.

Performance analysis and tuning: To troubleshoot the performance issue, let's look at how the data is processed and moved.

1. Read data: The gateway opens a connection to SQL Server and sends the query. SQL Server responds by sending the data stream to the gateway via the intranet.
2. Serialize and compress data: The gateway serializes the data stream to CSV format, and compresses the data to a bzip2 stream.
3. Write data: The gateway uploads the bzip2 stream to Blob storage via the Internet.

As you can see, the data is processed and moved in a streaming, sequential manner: SQL Server > LAN > gateway > WAN > Blob storage. The overall performance is gated by the minimum throughput across the pipeline. One or more of the following factors might cause the performance bottleneck:

Source: SQL Server itself has low throughput because of heavy loads.

Data Management Gateway:

LAN: The gateway is located far from the SQL Server machine and has a low-bandwidth connection.

Gateway: The gateway has reached its load limitations to perform the following operations:

Serialization: Serializing the data stream to CSV format has slow throughput.

Compression: You chose a slow compression codec (for example, bzip2, which is 2.8 MBps with Core i7).

WAN: The bandwidth between the corporate network and your Azure services is low (for example, T1 = 1,544 kbps; T2 = 6,312 kbps).

Sink: Blob storage has low throughput. (This scenario is unlikely because its SLA guarantees a minimum of 60 MBps.)

In this case, bzip2 data compression might be slowing down the entire pipeline. Switching to a gzip compression codec might ease this bottleneck.

Sample scenarios: Use parallel copy

Scenario I: Copy 1,000 1-MB files from the on-premises file system to Blob storage.

Analysis and performance tuning: For example, if you have installed the gateway on a quad-core machine, Data Factory uses 16 parallel copies to move files from the file system to Blob storage concurrently. This parallel execution should result in high throughput. You also can explicitly specify the parallel copies count. When you copy many small files, parallel copies dramatically help throughput by using resources more effectively.

Scenario II: Copy 20 blobs of 500 MB each from Blob storage to Data Lake Store, and then tune performance.
Analysis and performance tuning: In this scenario, Data Factory copies the data from Blob storage to Data Lake Store by using a single copy (parallelCopies set to 1) and a single cloud data movement unit. The throughput you observe will be close to that described in the performance reference section.

Scenario III: Individual file size is greater than dozens of MBs and total volume is large.

Analysis and performance tuning: Increasing parallelCopies doesn't result in better copy performance because of the resource limitations of a single cloud DMU. Instead, you should specify more cloud DMUs to get more resources to perform the data movement. Do not specify a value for the parallelCopies property; Data Factory handles the parallelism for you. In this case, if you set cloudDataMovementUnits to 4, you get about four times the throughput.

Reference

Here are performance monitoring and tuning references for some of the supported data stores:

Azure Storage (including Blob storage and Table storage): Azure Storage scalability targets and Azure Storage performance and scalability checklist
Azure SQL Database: You can monitor the performance and check the database transaction unit (DTU) percentage
Azure SQL Data Warehouse: Its capability is measured in data warehouse units (DWUs); see Manage compute power in Azure SQL Data Warehouse (Overview)
Azure Cosmos DB: Performance levels in Azure Cosmos DB
On-premises SQL Server: Monitor and tune for performance
On-premises file server: Performance tuning for file servers

Azure Data Factory - Security considerations for data movement

5/10/2017 • 10 min to read

Introduction

This article describes the basic security infrastructure that data movement services in Azure Data Factory use to secure your data. Azure Data Factory management resources are built on Azure security infrastructure and use all possible security measures offered by Azure. In a Data Factory solution, you create one or more data pipelines.
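The parallelism knobs described in the scenarios above live in the copy activity's typeProperties. A minimal sketch of Scenario III, with hypothetical values, might look like this:

```json
"typeProperties": {
    "source": {
        "type": "BlobSource"
    },
    "sink": {
        "type": "AzureDataLakeStoreSink"
    },
    "cloudDataMovementUnits": 4
}
```

For Scenario I, you would instead set an explicit "parallelCopies" value (or omit it and let Data Factory choose); as noted above, do not set parallelCopies when you are scaling with cloud DMUs.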
A pipeline is a logical grouping of activities that together perform a task. These pipelines reside in the region where the data factory was created. Even though Data Factory is available only in the West US, East US, and North Europe regions, the data movement service is available globally in several regions. The Data Factory service ensures that data does not leave a geographical area/region unless you explicitly instruct the service to use an alternate region when the data movement service is not yet deployed to that region.

Azure Data Factory itself does not store any data except for linked service credentials for cloud data stores, which are encrypted using certificates. It lets you create data-driven workflows to orchestrate the movement of data between supported data stores, and the processing of data using compute services in other regions or in an on-premises environment. It also allows you to monitor and manage workflows using both programmatic and UI mechanisms.

Data movement using Azure Data Factory has been certified for:

HIPAA/HITECH
ISO/IEC 27001
ISO/IEC 27018
CSA STAR

If you are interested in Azure compliance and how Azure secures its own infrastructure, visit the Microsoft Trust Center.

In this article, we review security considerations in the following two data movement scenarios:

Cloud scenario: In this scenario, both your source and destination are publicly accessible through the internet. These include managed cloud storage services like Azure Storage, Azure SQL Data Warehouse, Azure SQL Database, Azure Data Lake Store, Amazon S3, and Amazon Redshift; SaaS services such as Salesforce; and web protocols such as FTP and OData. You can find a complete list of supported data sources here.

Hybrid scenario: In this scenario, either your source or destination is behind a firewall or inside an on-premises corporate network, or the data store is in a private network/virtual network (most often the source) and is not publicly accessible.
Database servers hosted on virtual machines also fall under this scenario.

Cloud scenarios

Securing data store credentials

Azure Data Factory protects your data store credentials by encrypting them with certificates managed by Microsoft. These certificates are rotated every two years (which includes renewal of the certificate and migration of credentials). The encrypted credentials are securely stored in an Azure storage account managed by Azure Data Factory management services. For more information about Azure Storage security, see the Azure Storage Security Overview.

Data encryption in transit

If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data Factory and the cloud data store are via the secure channel HTTPS or TLS.

NOTE: All connections to Azure SQL Database and Azure SQL Data Warehouse always require encryption (SSL/TLS) while data is in transit to and from the database. While authoring a pipeline using a JSON editor, add the encryption property and set it to true in the connection string. When you use the Copy Wizard, the wizard sets this property by default. For Azure Storage, you can use HTTPS in the connection string.

Data encryption at rest

Some data stores support encryption of data at rest. We suggest that you enable the data encryption mechanism for those data stores.

Azure SQL Data Warehouse

Transparent Data Encryption (TDE) in Azure SQL Data Warehouse helps protect against the threat of malicious activity by performing real-time encryption and decryption of your data at rest. This behavior is transparent to the client. For more information, see Secure a database in SQL Data Warehouse.

Azure SQL Database

Azure SQL Database also supports transparent data encryption (TDE), which helps protect against the threat of malicious activity by performing real-time encryption and decryption of the data without requiring changes to the application. This behavior is transparent to the client.
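As a sketch of the note above, the encryption property is typically expressed as Encrypt=True in the linked service connection string. The server and database names here are hypothetical:

```json
{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;User ID=<user>@myserver;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}
```

The Copy Wizard emits an equivalent connection string automatically; this form is only needed when you author the linked service JSON by hand.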
For more information, see Transparent Data Encryption with Azure SQL Database.

Azure Data Lake Store

Azure Data Lake Store also provides encryption for data stored in the account. When enabled, Data Lake Store automatically encrypts data before persisting it and decrypts it before retrieval, making the encryption transparent to the client accessing the data. For more information, see Security in Azure Data Lake Store.

Azure Blob Storage and Azure Table Storage

Azure Blob storage and Azure Table storage support Storage Service Encryption (SSE), which automatically encrypts your data before persisting it to storage and decrypts it before retrieval. For more information, see Azure Storage Service Encryption for Data at Rest.

Amazon S3

Amazon S3 supports both client and server encryption of data at rest. For more information, see Protecting Data Using Encryption. Currently, Data Factory does not support Amazon S3 inside a virtual private cloud (VPC).

Amazon Redshift

Amazon Redshift supports cluster encryption for data at rest. For more information, see Amazon Redshift Database Encryption. Currently, Data Factory does not support Amazon Redshift inside a VPC.

Salesforce

Salesforce supports Shield Platform Encryption, which allows encryption of all files, attachments, and custom fields. For more information, see Understanding the Web Server OAuth Authentication Flow.

Hybrid scenarios (using Data Management Gateway)

Hybrid scenarios require Data Management Gateway to be installed in an on-premises network, inside a virtual network (Azure), or inside a virtual private cloud (Amazon). The gateway must be able to access the local data stores. For more information about the gateway, see Data Management Gateway.

The command channel allows communication between data movement services in Data Factory and Data Management Gateway. The communication contains information related to the activity. The data channel is used for transferring data between on-premises data stores and cloud data stores.
On-premises data store credentials

The credentials for your on-premises data stores are stored locally (not in the cloud). They can be set in the following ways:

Using plain text (less secure) via HTTPS from the Azure portal/Copy Wizard. The credentials are passed in plain text to the on-premises gateway.

Using the JavaScript Cryptography library from the Copy Wizard.

Using the click-once-based credential manager app. The click-once application executes on an on-premises machine that has access to the gateway and sets credentials for the data store. This option and the next one are the most secure options. The credential manager app, by default, uses port 8050 on the machine with the gateway for secure communication.

Using the New-AzureRmDataFactoryEncryptValue PowerShell cmdlet to encrypt credentials. The cmdlet uses the certificate that the gateway is configured to use to encrypt the credentials. You can take the encrypted credentials returned by this cmdlet and add them to the EncryptedCredential element of the connectionString in the JSON file that you use with the New-AzureRmDataFactoryLinkedService cmdlet, or in the JSON snippet in the Data Factory Editor in the portal. This option and the click-once application are the most secure options.

JavaScript Cryptography library-based encryption

You can encrypt data store credentials using the JavaScript Cryptography library from the Copy Wizard. When you select this option, the Copy Wizard retrieves the public key of the gateway and uses it to encrypt the data store credentials. The credentials are decrypted by the gateway machine and protected by Windows DPAPI.

Supported browsers: IE8, IE9, IE10, IE11, Microsoft Edge, and the latest Firefox, Chrome, Opera, and Safari browsers.

Click-once credential manager app

You can launch the click-once-based credential manager app from the Azure portal/Copy Wizard when authoring pipelines. This application ensures that credentials are not transferred in plain text over the wire.
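As a sketch of the PowerShell cmdlet option above, the value returned by New-AzureRmDataFactoryEncryptValue goes into the EncryptedCredential element of the connectionString in the linked service JSON. The server, database, and gateway names here are hypothetical, and the placeholder stands for the cmdlet output:

```json
{
    "name": "OnPremSqlLinkedService",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=myserver;Initial Catalog=mydb;Integrated Security=False;EncryptedCredential=<value returned by New-AzureRmDataFactoryEncryptValue>",
            "gatewayName": "MyGateway"
        }
    }
}
```

Because the credential is encrypted with the gateway's certificate, only that gateway can decrypt it; the plain-text password never leaves your machine.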
By default, it uses port 8050 on the machine with the gateway for secure communication. If necessary, this port can be changed.

Currently, Data Management Gateway uses a single certificate. This certificate is created during the gateway installation (applies to Data Management Gateway created after November 2016, version 2.4.xxxx.x or later). You can replace this certificate with your own SSL/TLS certificate. This certificate is used by the click-once credential manager application to securely connect to the gateway machine for setting data store credentials. It stores data store credentials securely on-premises by using Windows DPAPI on the machine with the gateway.

NOTE: Older gateways that were installed before November 2016, or of version 2.3.xxxx.x, continue to use credentials encrypted and stored in the cloud. Even if you upgrade the gateway to the latest version, the credentials are not migrated to an on-premises machine.

GATEWAY VERSION (DURING CREATION) | CREDENTIALS STORED | CREDENTIAL ENCRYPTION/SECURITY
<= 2.3.xxxx.x | In the cloud | Encrypted using a certificate (different from the one used by the credential manager app)
>= 2.4.xxxx.x | On-premises | Secured via DPAPI

Encryption in transit

All data transfers are via the secure channel HTTPS and TLS over TCP to prevent man-in-the-middle attacks during communication with Azure services. You can also use IPSec VPN or ExpressRoute to further secure the communication channel between your on-premises network and Azure.

A virtual network is a logical representation of your network in the cloud. You can connect an on-premises network to your Azure virtual network (VNet) by setting up IPSec VPN (site-to-site) or ExpressRoute (private peering).

The following table summarizes the network and gateway configuration recommendations based on different combinations of source and destination locations for hybrid data movement.
SOURCE | DESTINATION | NETWORK CONFIGURATION | GATEWAY SETUP
On-premises | Virtual machines and cloud services deployed in virtual networks | IPSec VPN (point-to-site or site-to-site) | Gateway can be installed either on-premises or on an Azure virtual machine (VM) in the VNet
On-premises | Virtual machines and cloud services deployed in virtual networks | ExpressRoute (private peering) | Gateway can be installed either on-premises or on an Azure VM in the VNet
On-premises | Azure-based services that have a public endpoint | ExpressRoute (public peering) | Gateway must be installed on-premises

The following images show the usage of Data Management Gateway for moving data between an on-premises database and Azure services using ExpressRoute and IPSec VPN (with a virtual network):

ExpressRoute:

IPSec VPN:

Firewall configurations and whitelisting IP address of gateway

Firewall requirements for on-premises/private network

In an enterprise, a corporate firewall runs on the central router of the organization, and Windows Firewall runs as a daemon on the local machine on which the gateway is installed. The following table provides outbound port and domain requirements for the corporate firewall:

DOMAIN NAMES | OUTBOUND PORTS | DESCRIPTION
*.servicebus.windows.net | 443, 80 | Required by the gateway to connect to data movement services in Data Factory
*.core.windows.net | 443 | Used by the gateway to connect to the Azure storage account when you use the staged copy feature
*.frontend.clouddatahub.net | 443 | Required by the gateway to connect to the Azure Data Factory service
*.database.windows.net | 1433 | (Optional) Needed when your destination is Azure SQL Database or Azure SQL Data Warehouse. Use the staged copy feature to copy data to Azure SQL Database/Azure SQL Data Warehouse without opening port 1433
*.azuredatalakestore.net | 443 | (Optional) Needed when your destination is Azure Data Lake Store

NOTE: You may have to manage ports/whitelist domains at the corporate firewall level as required by the respective data sources. This table uses only Azure SQL Database, Azure SQL Data Warehouse, and Azure Data Lake Store as examples.

The following table provides inbound port requirements for Windows Firewall:

INBOUND PORTS | DESCRIPTION
8050 (TCP) | Required by the credential manager application to securely set credentials for on-premises data stores on the gateway

IP configurations/whitelisting in data store

Some data stores in the cloud also require whitelisting of the IP address of the machine accessing them. Ensure that the IP address of the gateway machine is whitelisted/configured in the firewall appropriately. The following cloud data stores require whitelisting of the IP address of the gateway machine. Some of these data stores, by default, may not require whitelisting of the IP address.

Azure SQL Database
Azure SQL Data Warehouse
Azure Data Lake Store
Azure Cosmos DB
Amazon Redshift

Frequently asked questions

Question: Can the gateway be shared across different data factories?
Answer: We do not support this feature yet. We are actively working on it.

Question: What are the port requirements for the gateway to work?
Answer: The gateway makes HTTP-based connections to the open internet. Outbound ports 443 and 80 must be opened for the gateway to make this connection. Open inbound port 8050 only at the machine level (not at the corporate firewall level) for the credential manager application. If Azure SQL Database or Azure SQL Data Warehouse is used as the source/destination, you need to open port 1433 as well. For more information, see the Firewall configurations and whitelisting IP addresses section.

Question: What are the certificate requirements for the gateway?
Answer: The current gateway requires a certificate that is used by the credential manager application for securely setting data store credentials. This certificate is a self-signed certificate created and configured by the gateway setup. You can use your own TLS/SSL certificate instead. For more information, see the click-once credential manager application section.

Next steps

For information about the performance of the copy activity, see the Copy activity performance and tuning guide.

Move data from Amazon Redshift using Azure Data Factory

6/6/2017 • 7 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data from Amazon Redshift. The article builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.

You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see supported data stores. Data Factory currently supports moving data from Amazon Redshift to other data stores, but not moving data from other data stores to Amazon Redshift.

Prerequisites

If you are moving data to an on-premises data store, install Data Management Gateway on an on-premises machine. Then, grant Data Management Gateway (use the IP address of the machine) access to the Amazon Redshift cluster. See Authorize access to the cluster for instructions.

If you are moving data to an Azure data store, see Azure Data Center IP Ranges for the compute IP address and SQL ranges used by the Azure data centers.

Getting started

You can create a pipeline with a copy activity that moves data from an Amazon Redshift source by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.

Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an Amazon Redshift data store, see the JSON example: Copy data from Amazon Redshift to Azure Blob section of this article.

The following sections provide details about JSON properties that are used to define Data Factory entities specific to Amazon Redshift.

Linked service properties

The following table describes the JSON elements specific to an Amazon Redshift linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AmazonRedshift. | Yes
server | IP address or host name of the Amazon Redshift server. | Yes
port | The number of the TCP port that the Amazon Redshift server uses to listen for client connections. | No, default value: 5439
database | Name of the Amazon Redshift database. | Yes
username | Name of the user who has access to the database. | Yes
password | Password for the user account. | Yes

Dataset properties

For a full list of sections and properties available for defining datasets, see the Creating datasets article.
Sections such as structure, availability, and policy are similar for all dataset types (Azure SQL, Azure blob, Azure table, and so on). The typeProperties section is different for each type of dataset. It provides information about the location of the data in the data store. The typeProperties section for a dataset of type RelationalTable (which includes the Amazon Redshift dataset) has the following properties:

PROPERTY | DESCRIPTION | REQUIRED
tableName | Name of the table in the Amazon Redshift database that the linked service refers to. | No (if query of RelationalSource is specified)

Copy activity properties

For a full list of sections and properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities. In contrast, the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.

When the source of the copy activity is of type RelationalSource (which includes Amazon Redshift), the following properties are available in the typeProperties section:

PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No (if tableName of the dataset is specified)

JSON example: Copy data from Amazon Redshift to Azure Blob

This sample shows how to copy data from an Amazon Redshift database to Azure Blob storage. However, data can be copied directly to any of the sinks stated here by using the Copy Activity in Azure Data Factory.

The sample has the following data factory entities:

A linked service of type AmazonRedshift.
A linked service of type AzureStorage.
An input dataset of type RelationalTable.
An output dataset of type AzureBlob.
A pipeline with Copy Activity that uses RelationalSource and BlobSink.

The sample copies data from a query result in Amazon Redshift to a blob every hour.
The JSON properties used in these samples are described in sections following the samples.

Amazon Redshift linked service:

{
    "name": "AmazonRedshiftLinkedService",
    "properties": {
        "type": "AmazonRedshift",
        "typeProperties": {
            "server": "<The IP address or host name of the Amazon Redshift server>",
            "port": <The number of the TCP port that the Amazon Redshift server uses to listen for client connections>,
            "database": "<The database name of the Amazon Redshift database>",
            "username": "<username>",
            "password": "<password>"
        }
    }
}

Azure Storage linked service:

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

Amazon Redshift input dataset:

Setting "external": true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. Set this property to true on an input dataset that is not produced by an activity in the pipeline.

{
    "name": "AmazonRedshiftInputDataset",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "AmazonRedshiftLinkedService",
        "typeProperties": {
            "tableName": "<Table name>"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true
    }
}

Azure Blob output dataset:

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hours parts of the start time.
{
    "name": "AzureBlobOutputDataSet",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/fromamazonredshift/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "\t"
            },
            "partitionedBy": [
                { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
                { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
            ]
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

Copy activity in a pipeline with an Amazon Redshift source (RelationalSource) and a Blob sink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and the sink type is set to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy.
{
    "name": "CopyAmazonRedshiftToBlob",
    "properties": {
        "description": "pipeline for copy activity",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": {
                        "type": "BlobSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [ { "name": "AmazonRedshiftInputDataset" } ],
                "outputs": [ { "name": "AzureBlobOutputDataSet" } ],
                "policy": { "timeout": "01:00:00", "concurrency": 1 },
                "scheduler": { "frequency": "Hour", "interval": 1 },
                "name": "AmazonRedshiftToBlob"
            }
        ],
        "start": "2014-06-01T18:00:00Z",
        "end": "2014-06-01T19:00:00Z"
    }
}

Type mapping for Amazon Redshift

As mentioned in the data movement activities article, the Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:

1. Convert from native source types to .NET types.
2. Convert from .NET types to native sink types.

When moving data from Amazon Redshift, the following mappings are used from Amazon Redshift types to .NET types:

| AMAZON REDSHIFT TYPE | .NET BASED TYPE |
| --- | --- |
| SMALLINT | Int16 |
| INTEGER | Int32 |
| BIGINT | Int64 |
| DECIMAL | Decimal |
| REAL | Single |
| DOUBLE PRECISION | Double |
| BOOLEAN | String |
| CHAR | String |
| VARCHAR | String |
| DATE | DateTime |
| TIMESTAMP | DateTime |
| TEXT | String |

Map source to sink columns

To learn about mapping columns in the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory.

Repeatable read from relational sources

When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs.
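To see what the $$Text.Format expression in the pipeline's query property resolves to at runtime, here is an illustrative Python sketch (not Data Factory code). The function name is hypothetical; the .NET format specifier yyyy-MM-ddTHH:mm:ss is rendered with the equivalent Python strftime format:

```python
from datetime import datetime, timedelta

def format_slice_query(window_start: datetime, window_end: datetime) -> str:
    """Mimic $$Text.Format('select * from MyTable where timestamp >= ...',
    WindowStart, WindowEnd) for a one-hour slice."""
    fmt = "%Y-%m-%dT%H:%M:%S"  # Python equivalent of yyyy-MM-ddTHH:mm:ss
    return (
        "select * from MyTable "
        f"where timestamp >= '{window_start.strftime(fmt)}' "
        f"AND timestamp < '{window_end.strftime(fmt)}'"
    )

# WindowStart/WindowEnd for the 18:00 slice in the sample pipeline
start = datetime(2014, 6, 1, 18, 0, 0)
query = format_slice_query(start, start + timedelta(hours=1))
# The same slice always produces the same query text, which is what makes
# a rerun of the slice read the same rows from an append-only table.
```

Because the query is derived only from the slice window, rerunning a slice regenerates an identical query, which ties directly into the repeatable-read discussion.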
When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.

Performance and Tuning

See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Next Steps

See the following articles:

* Copy Activity tutorial for step-by-step instructions for creating a pipeline with a Copy Activity.

Move data from Amazon Simple Storage Service by using Azure Data Factory

4/19/2017 • 8 min to read

This article explains how to use the copy activity in Azure Data Factory to move data from Amazon Simple Storage Service (S3). It builds on the Data movement activities article, which presents a general overview of data movement with the copy activity. You can copy data from Amazon S3 to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from Amazon S3 to other data stores, not moving data from other data stores to Amazon S3.

Required permissions

To copy data from Amazon S3, make sure you have been granted the following permissions:

* s3:GetObject and s3:GetObjectVersion for Amazon S3 object operations.
* s3:ListBucket for Amazon S3 bucket operations. If you are using the Data Factory Copy Wizard, s3:ListAllMyBuckets is also required.

For details about the full list of Amazon S3 permissions, see Specifying Permissions in a Policy.

Getting started

You can create a pipeline with a copy activity that moves data from an Amazon S3 source by using different tools or APIs. The easiest way to create a pipeline is to use the Copy Wizard. For a quick walkthrough, see Tutorial: Create a pipeline using Copy Wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. For step-by-step instructions to create a pipeline with a copy activity, see the Copy activity tutorial. Whether you use tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools or APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an Amazon S3 data store, see the JSON example: Copy data from Amazon S3 to Azure Blob section of this article.

NOTE: For details about supported file and compression formats for a copy activity, see File and compression formats in Azure Data Factory.

The following sections provide details about JSON properties that are used to define Data Factory entities specific to Amazon S3.

Linked service properties

A linked service links a data store to a data factory. You create a linked service of type AwsAccessKey to link your Amazon S3 data store to your data factory. The following table provides descriptions of the JSON elements specific to the Amazon S3 (AwsAccessKey) linked service.

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| accessKeyID | ID of the secret access key. | string | Yes |
| secretAccessKey | The secret access key itself. | Encrypted secret string | Yes |

Here is an example:

{
    "name": "AmazonS3LinkedService",
    "properties": {
        "type": "AwsAccessKey",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": "<secret access key>"
        }
    }
}

Dataset properties

To specify a dataset to represent input data in Amazon S3, set the type property of the dataset to AmazonS3. Set the linkedServiceName property of the dataset to the name of the Amazon S3 linked service. For a full list of sections and properties available for defining datasets, see Creating datasets. Sections such as structure, availability, and policy are similar for all dataset types (such as SQL database, Azure blob, and Azure table). The typeProperties section is different for each type of dataset, and provides information about the location of the data in the data store. The typeProperties section for a dataset of type AmazonS3 has the following properties:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| bucketName | The S3 bucket name. | String | Yes |
| key | The S3 object key. | String | No |
| prefix | Prefix for the S3 object key. Objects whose keys start with this prefix are selected. Applies only when key is empty. | String | No |
| version | The version of the S3 object, if S3 versioning is enabled. | String | No |
| format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, Orc format, and Parquet format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | | No |
| compression | Specify the type and level of compression for the data. The supported types are: GZip, Deflate, BZip2, and ZipDeflate. The supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | | No |

NOTE: bucketName + key specifies the location of the S3 object, where bucketName is the root container for S3 objects, and key is the full path to the S3 object.

Sample dataset with prefix:

{
    "name": "dataset-s3",
    "properties": {
        "type": "AmazonS3",
        "linkedServiceName": "link-testS3",
        "typeProperties": {
            "prefix": "testFolder/test",
            "bucketName": "testbucket",
            "format": { "type": "OrcFormat" }
        },
        "availability": { "frequency": "Hour", "interval": 1 },
        "external": true
    }
}

Sample dataset (with version):

{
    "name": "dataset-s3",
    "properties": {
        "type": "AmazonS3",
        "linkedServiceName": "link-testS3",
        "typeProperties": {
            "key": "testFolder/test.orc",
            "bucketName": "testbucket",
            "version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL",
            "format": { "type": "OrcFormat" }
        },
        "availability": { "frequency": "Hour", "interval": 1 },
        "external": true
    }
}

Dynamic paths for S3

The preceding sample uses fixed values for the key and bucketName properties in the Amazon S3 dataset.

"key": "testFolder/test.orc",
"bucketName": "testbucket",

You can have Data Factory calculate these properties dynamically at runtime, by using system variables such as SliceStart.

"key": "$$Text.Format('{0:MM}/{0:dd}/test.orc', SliceStart)"
"bucketName": "$$Text.Format('{0:yyyy}', SliceStart)"

You can do the same for the prefix property of an Amazon S3 dataset. For a list of supported functions and variables, see Data Factory functions and system variables.

Copy activity properties

For a full list of sections and properties available for defining activities, see Creating pipelines. Properties such as name, description, input and output tables, and policies are available for all types of activities. Properties available in the typeProperties section of the activity vary with each activity type. For the copy activity, properties vary depending on the types of sources and sinks.
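As an aside on the dynamic path expressions described above, here is an illustrative Python sketch (the function name is hypothetical, not ADF code) showing what the SliceStart-based key and bucketName expressions resolve to at runtime:

```python
from datetime import datetime

def evaluate_s3_paths(slice_start: datetime) -> dict:
    """Stand-in for $$Text.Format with .NET specifiers:
    "key": "$$Text.Format('{0:MM}/{0:dd}/test.orc', SliceStart)"
    "bucketName": "$$Text.Format('{0:yyyy}', SliceStart)"
    """
    return {
        "key": slice_start.strftime("%m/%d/test.orc"),       # MM/dd/test.orc
        "bucketName": slice_start.strftime("%Y"),            # yyyy
    }

paths = evaluate_s3_paths(datetime(2017, 5, 3, 8, 0))
# {'key': '05/03/test.orc', 'bucketName': '2017'}
```

Each slice therefore reads from a bucket and key derived solely from the slice's start time, so the same slice always targets the same S3 location.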
When a source in the copy activity is of type FileSystemSource (which includes Amazon S3), the following property is available in the typeProperties section:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| recursive | Specifies whether to recursively list S3 objects under the directory. | true/false | No |

JSON example: Copy data from Amazon S3 to Azure Blob storage

This sample shows how to copy data from Amazon S3 to Azure Blob storage. However, data can be copied directly to any of the sinks that are supported, by using the copy activity in Data Factory. The sample provides JSON definitions for the following Data Factory entities. You can use these definitions to create a pipeline to copy data from Amazon S3 to Blob storage, by using the Azure portal, Visual Studio, or PowerShell.

* A linked service of type AwsAccessKey.
* A linked service of type AzureStorage.
* An input dataset of type AmazonS3.
* An output dataset of type AzureBlob.
* A pipeline with copy activity that uses FileSystemSource and BlobSink.

The sample copies data from Amazon S3 to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.

Amazon S3 linked service:

{
    "name": "AmazonS3LinkedService",
    "properties": {
        "type": "AwsAccessKey",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": "<secret access key>"
        }
    }
}

Azure Storage linked service:

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

Amazon S3 input dataset:

Setting "external": true informs the Data Factory service that the dataset is external to the data factory. Set this property to true on an input dataset that is not produced by an activity in the pipeline.
{
    "name": "AmazonS3InputDataset",
    "properties": {
        "type": "AmazonS3",
        "linkedServiceName": "AmazonS3LinkedService",
        "typeProperties": {
            "key": "testFolder/test.orc",
            "bucketName": "testbucket",
            "format": { "type": "OrcFormat" }
        },
        "availability": { "frequency": "Hour", "interval": 1 },
        "external": true
    }
}

Azure Blob output dataset:

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hours parts of the start time.

{
    "name": "AzureBlobOutputDataSet",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/fromamazons3/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "\t"
            },
            "partitionedBy": [
                { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
                { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
            ]
        },
        "availability": { "frequency": "Hour", "interval": 1 }
    }
}

Copy activity in a pipeline with an Amazon S3 source and a blob sink:

The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource, and the sink type is set to BlobSink.
{
    "name": "CopyAmazonS3ToBlob",
    "properties": {
        "description": "pipeline for copy activity",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "BlobSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [ { "name": "AmazonS3InputDataset" } ],
                "outputs": [ { "name": "AzureBlobOutputDataSet" } ],
                "policy": { "timeout": "01:00:00", "concurrency": 1 },
                "scheduler": { "frequency": "Hour", "interval": 1 },
                "name": "AmazonS3ToBlob"
            }
        ],
        "start": "2014-08-08T18:00:00Z",
        "end": "2014-08-08T19:00:00Z"
    }
}

NOTE: To map columns from a source dataset to columns from a sink dataset, see Mapping dataset columns in Azure Data Factory.

Next steps

See the following articles:

* To learn about key factors that impact performance of data movement (copy activity) in Data Factory, and various ways to optimize it, see the Copy activity performance and tuning guide.
* For step-by-step instructions for creating a pipeline with a copy activity, see the Copy activity tutorial.

Copy data to or from Azure Blob Storage using Azure Data Factory

5/11/2017 • 31 min to read

This article explains how to use the Copy Activity in Azure Data Factory to copy data to and from Azure Blob Storage. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.

Overview

You can copy data from any supported source data store to Azure Blob Storage, or from Azure Blob Storage to any supported sink data store. The following table provides a list of data stores supported as sources or sinks by the copy activity. For example, you can move data from a SQL Server database or an Azure SQL database to an Azure blob storage. And, you can copy data from Azure blob storage to an Azure SQL Data Warehouse or an Azure Cosmos DB collection.
Supported scenarios

You can copy data from Azure Blob Storage to the following data stores:

* Azure: Azure Blob storage, Azure Data Lake Store, Azure Cosmos DB (DocumentDB API), Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage
* Databases: SQL Server, Oracle
* File: File system

You can copy data from the following data stores to Azure Blob Storage:

* Azure: Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage
* Databases: Amazon Redshift, DB2, MySQL, Oracle, PostgreSQL, SAP Business Warehouse, SAP HANA, SQL Server, Sybase, Teradata
* NoSQL: Cassandra, MongoDB
* File: Amazon S3, File system, FTP, HDFS, SFTP
* Others: Generic HTTP, Generic OData, Generic ODBC, Salesforce, Web Table (table from HTML), GE Historian

IMPORTANT: Copy Activity supports copying data from/to both general-purpose Azure Storage accounts and Hot/Cool Blob storage. The activity supports reading from block, append, or page blobs, but supports writing to only block blobs. Azure Premium Storage is not supported as a sink because it is backed by page blobs. Copy Activity does not delete data from the source after the data is successfully copied to the destination. If you need to delete source data after a successful copy, create a custom activity to delete the data and use the activity in the pipeline. For an example, see the Delete blob or folder sample on GitHub.

Get started

You can create a pipeline with a copy activity that moves data to/from an Azure Blob Storage by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. This article has a walkthrough for creating a pipeline to copy data from an Azure Blob Storage location to another Azure Blob Storage location. For a tutorial on creating a pipeline to copy data from an Azure Blob Storage to Azure SQL Database, see Tutorial: Create a pipeline using Copy Wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are copying data from an Azure blob storage to an Azure SQL database, you create two linked services to link your Azure storage account and Azure SQL database to your data factory. For linked service properties that are specific to Azure Blob Storage, see the linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the SQL table in the Azure SQL database that holds the data copied from the blob storage. For dataset properties that are specific to Azure Blob Storage, see the dataset properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity. Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure Blob Storage, see the copy activity properties section.

For details on how to use a data store as a source or a sink, click the link in the previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure Blob Storage, see the JSON examples section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Blob Storage.

Linked service properties

There are two types of linked services you can use to link Azure Storage to an Azure data factory: the AzureStorage linked service and the AzureStorageSas linked service. The Azure Storage linked service provides the data factory with global access to the Azure Storage, whereas the Azure Storage SAS (Shared Access Signature) linked service provides the data factory with restricted/time-bound access. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services.

Azure Storage Linked Service

The Azure Storage linked service allows you to link an Azure storage account to an Azure data factory by using the account key, which provides the data factory with global access to the Azure Storage. The following table provides descriptions of the JSON elements specific to the Azure Storage linked service.

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: AzureStorage | Yes |
| connectionString | Specify information needed to connect to Azure storage for the connectionString property. | Yes |

See the following article for steps to view/copy the account key for an Azure Storage account: View, copy, and regenerate storage access keys.
Example:

{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

Azure Storage Sas Linked Service

A shared access signature (SAS) provides delegated access to resources in your storage account. It allows you to grant a client limited permissions to objects in your storage account for a specified period of time and with a specified set of permissions, without having to share your account access keys. The SAS is a URI that encompasses in its query parameters all the information necessary for authenticated access to a storage resource. To access storage resources with the SAS, the client only needs to pass in the SAS to the appropriate constructor or method. For detailed information about SAS, see Shared Access Signatures: Understanding the SAS Model. The Azure Storage SAS linked service allows you to link an Azure Storage account to an Azure data factory by using a shared access signature. It provides the data factory with restricted/time-bound access to all or specific resources (blob/container) in the storage. The following table provides descriptions of the JSON elements specific to the Azure Storage SAS linked service.

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: AzureStorageSas | Yes |
| sasUri | Specify the Shared Access Signature URI to the Azure Storage resources, such as blob, container, or table. | Yes |

Example:

{
    "name": "StorageSasLinkedService",
    "properties": {
        "type": "AzureStorageSas",
        "typeProperties": {
            "sasUri": "<storageUri>?<sasToken>"
        }
    }
}

When creating a SAS URI, consider the following:

* Azure Data Factory supports only Service SAS, not Account SAS. See Types of Shared Access Signatures for details about these two types.
* Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is used in your data factory.
* Set the expiry time appropriately.
* Make sure that the access to Azure Storage objects does not expire within the active period of the pipeline.
* The URI should be created at the right container/blob or table level based on the need. A SAS URI to an Azure blob allows the Data Factory service to access that particular blob. A SAS URI to an Azure blob container allows the Data Factory service to iterate through blobs in that container. If you need to provide access to more or fewer objects later, or to update the SAS URI, remember to update the linked service with the new URI.

Dataset properties

To specify a dataset to represent input or output data in an Azure Blob Storage, you set the type property of the dataset to: AzureBlob. Set the linkedServiceName property of the dataset to the name of the Azure Storage or Azure Storage SAS linked service. The type properties of the dataset specify the blob container and the folder in the blob storage. For a full list of JSON sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). Data Factory supports the following CLS-compliant .NET-based type values for providing type information in "structure" for schema-on-read data sources like Azure blob: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Bool, String, Guid, Datetime, Datetimeoffset, Timespan. Data Factory automatically performs type conversions when moving data from a source data store to a sink data store. The typeProperties section is different for each type of dataset and provides information about the location, format, etc., of the data in the data store. The typeProperties section for a dataset of type AzureBlob has the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| folderPath | Path to the container and folder in the blob storage. Example: myblobcontainer\myblobfolder\ | Yes |
| fileName | Name of the blob. fileName is optional and case-sensitive. If you specify a fileName, the activity (including Copy) works on the specific blob. When fileName is not specified, Copy includes all blobs in the folderPath for the input dataset. When fileName is not specified for an output dataset and preserveHierarchy is not specified in the activity sink, the name of the generated file is in the format Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). | No |
| partitionedBy | partitionedBy is an optional property. You can use it to specify a dynamic folderPath and fileName for time-series data. For example, folderPath can be parameterized for every hour of data. See the Using partitionedBy property section for details and examples. | No |
| format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | No |
| compression | Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | No |

Using partitionedBy property

As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time-series data with the partitionedBy property, Data Factory functions, and the system variables. For more information on time-series datasets, scheduling, and slices, see the Creating Datasets and Scheduling & Execution articles.
Sample 1:

"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy": [
    { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }
],

In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.

Sample 2:

"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
    { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
    { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
    { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
    { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],

In this example, the year, month, day, and time of SliceStart are extracted into separate variables that are used by the folderPath and fileName properties.

Copy activity properties

For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output datasets, and policies are available for all types of activities, whereas the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. If you are moving data from an Azure Blob Storage, you set the source type in the copy activity to BlobSource. Similarly, if you are moving data to an Azure Blob Storage, you set the sink type in the copy activity to BlobSink. This section provides a list of properties supported by BlobSource and BlobSink.
BlobSource supports the following properties in the typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED recursive Indicates whether the data is read recursively from the sub folders or only from the specified folder. True (default value), False No BlobSink supports the following properties typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED copyBehavior Defines the copy behavior when the source is BlobSource or FileSystem. PreserveHierarchy: preserves the file hierarchy in the target folder. The relative path of source file to source folder is identical to the relative path of target file to target folder. No FlattenHierarchy: all files from the source folder are in the first level of target folder. The target files have auto generated name. MergeFiles: merges all files from the source folder to one file. If the File/Blob Name is specified, the merged file name would be the specified name; otherwise, would be autogenerated file name. BlobSource also supports these two properties for backward compatibility. treatEmptyAsNull: Specifies whether to treat null or empty string as null value. skipHeaderLineCount - Specifies how many lines need be skipped. It is applicable only when input dataset is using TextFormat. Similarly, BlobSink supports the following property for backward compatibility. blobWriterAddHeader: Specifies whether to add a header of column definitions while writing to an output dataset. Datasets now support the following properties that implement the same functionality: treatEmptyAsNull, skipLineCount, firstRowAsHeader. The following table provides guidance on using the new dataset properties in place of these blob source/sink properties. COPY ACTIVITY PROPERTY DATASET PROPERTY skipHeaderLineCount on BlobSource skipLineCount and firstRowAsHeader. Lines are skipped first and then the first row is read as a header. 
- treatEmptyAsNull on BlobSource: use treatEmptyAsNull on the input dataset.
- blobWriterAddHeader on BlobSink: use firstRowAsHeader on the output dataset.

See the Specifying TextFormat section for detailed information on these properties.

recursive and copyBehavior examples

This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values. In each case, the source folder Folder1 has the following structure: File1, File2, and Subfolder1 containing File3, File4, and File5.

- recursive = true, copyBehavior = preserveHierarchy: The target folder Folder1 is created with the same structure as the source: File1, File2, and Subfolder1 containing File3, File4, and File5.
- recursive = true, copyBehavior = flattenHierarchy: The target folder Folder1 is created with all five files in its first level, each with an auto-generated name.
- recursive = true, copyBehavior = mergeFiles: The target folder Folder1 is created with a single file, with an auto-generated name, whose contents are File1 through File5 merged.
- recursive = false, copyBehavior = preserveHierarchy: The target folder Folder1 is created with File1 and File2 only. Subfolder1 with File3, File4, and File5 is not picked up.
- recursive = false, copyBehavior = flattenHierarchy: The target folder Folder1 is created with auto-generated names for File1 and File2 only. Subfolder1 with File3, File4, and File5 is not picked up.
- recursive = false, copyBehavior = mergeFiles: The target folder Folder1 is created with one file, with an auto-generated name, whose contents are File1 and File2 merged. Subfolder1 with File3, File4, and File5 is not picked up.

Walkthrough: Use Copy Wizard to copy data to/from Blob Storage

Let's look at how to quickly copy data to/from Azure blob storage. In this walkthrough, both the source and destination data stores are of type Azure Blob Storage. The pipeline copies data from one folder to another folder in the same blob container. This walkthrough is intentionally simple to show you the settings and properties available when using Blob Storage as a source or sink.

Prerequisites

1. Create a general-purpose Azure Storage account if you don't have one already. You use the blob storage as both the source and destination data store in this walkthrough. If you don't have an Azure storage account, see the Create a storage account article for steps to create one.
2. Create a blob container named adfblobconnector in the storage account.
3. Create a folder named input in the adfblobconnector container.
4. Create a file named emp.txt with the following content, and upload it to the input folder by using tools such as Azure Storage Explorer:

John, Doe
Jane, Doe

Create the data factory

5. Sign in to the Azure portal.
6. Click + NEW in the top-left corner, click Intelligence + analytics, and click Data Factory.
7. In the New data factory blade: a.
Enter ADFBlobConnectorDF for the name. The name of the Azure data factory must be globally unique. If you receive the error "Data factory name ADFBlobConnectorDF is not available", change the name of the data factory (for example, yournameADFBlobConnectorDF) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
b. Select your Azure subscription.
c. For Resource Group, select Use existing to select an existing resource group, or select Create new to enter a name for a new resource group.
d. Select a location for the data factory.
e. Select the Pin to dashboard check box at the bottom of the blade.
f. Click Create.
8. After the creation is complete, you see the Data Factory blade as shown in the following image:

Copy Wizard

1. On the Data Factory home page, click the Copy data [PREVIEW] tile to launch the Copy Data Wizard in a separate tab.

NOTE If you see that the web browser is stuck at "Authorizing...", disable/uncheck the Block third-party cookies and site data setting, or keep it enabled and create an exception for login.microsoftonline.com, and then try launching the wizard again.

2. On the Properties page:
a. Enter CopyPipeline for Task name. The task name is the name of the pipeline in your data factory.
b. Enter a description for the task (optional).
c. For Task cadence or Task schedule, keep the Run regularly on schedule option. If you want to run this task only once instead of repeatedly on a schedule, select Run once now. If you select the Run once now option, a one-time pipeline is created.
d. Keep the settings for Recurring pattern. This task runs daily between the start and end times you specify in the next step.
e. Change the Start date time to 04/21/2017.
f. Change the End date time to 04/25/2017. You may want to type the date instead of browsing through the calendar.
g. Click Next.
3. On the Source data store page, click the Azure Blob Storage tile.
You use this page to specify the source data store for the copy task. You can use an existing data store linked service or specify a new data store. To use an existing linked service, select FROM EXISTING LINKED SERVICES and select the right linked service.

4. On the Specify the Azure Blob storage account page:
a. Keep the auto-generated name for Connection name. The connection name is the name of the linked service of type Azure Storage.
b. Confirm that the From Azure subscriptions option is selected for Account selection method.
c. Select your Azure subscription or keep Select all for Azure subscription.
d. Select an Azure storage account from the list of Azure storage accounts available in the selected subscription. You can also choose to enter the storage account settings manually by selecting the Enter manually option for Account selection method.
e. Click Next.

5. On the Choose the input file or folder page:
a. Double-click adfblobconnector.
b. Select input, and click Choose. In this walkthrough, you select the input folder. You could also select the emp.txt file in the folder instead.

6. On the Choose the input file or folder page:
a. Confirm that the file or folder is set to adfblobconnector/input. If the files are in subfolders, for example, 2017/04/01, 2017/04/02, and so on, enter adfblobconnector/input/{year}/{month}/{day} for the file or folder. When you press TAB out of the text box, you see three drop-down lists to select formats for year (yyyy), month (MM), and day (dd).
b. Do not select Copy file recursively. Select this option only if you want to recursively traverse folders for files to be copied to the destination.
c. Do not select the Binary copy option. Select this option to perform a binary copy of the source file to the destination. Leave it unselected for this walkthrough so that you can see more options on the next pages.
d. Confirm that the Compression type is set to None.
Select a value for this option if your source files are compressed in one of the supported formats.
e. Click Next.

7. On the File format settings page, you see the delimiters and the schema that are auto-detected by the wizard by parsing the file.
a. Confirm that the file format is set to Text format. You can see all the supported formats in the drop-down list, for example: JSON, Avro, ORC, Parquet.
b. Confirm that the column delimiter is set to Comma (,). You can see the other column delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
c. Confirm that the row delimiter is set to Carriage Return + Line feed (\r\n). You can see the other row delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
d. Confirm that the skip line count is set to 0. If you want a few lines to be skipped at the top of the file, enter the number here.
e. Confirm that the first data row contains column names option is not selected. If the source files contain column names in the first row, select this option.
f. Confirm that the treat empty column value as null option is selected.
g. Expand Advanced settings to see the advanced options available.
h. At the bottom of the page, see the preview of data from the emp.txt file.
i. Click the SCHEMA tab at the bottom to see the schema that the copy wizard inferred by looking at the data in the source file.
j. Click Next after you review the delimiters and preview data.

8. On the Destination data store page, select Azure Blob Storage, and click Next. You are using Azure Blob Storage as both the source and destination data store in this walkthrough.

9. On the Specify the Azure Blob storage account page:
a. Enter AzureStorageLinkedService for the Connection name field.
b. Confirm that the From Azure subscriptions option is selected for Account selection method.
c. Select your Azure subscription.
d. Select your Azure storage account.
e. Click Next.
10.
On the Choose the output file or folder page:
a. Specify the folder path as adfblobconnector/output/{year}/{month}/{day}, and press TAB.
b. For the year, select yyyy.
c. For the month, confirm that it is set to MM.
d. For the day, confirm that it is set to dd.
e. Confirm that the compression type is set to None.
f. Confirm that the copy behavior is set to Merge files. If an output file with the same name already exists, the new content is appended to the end of the same file.
g. Click Next.

11. On the File format settings page, review the settings, and click Next. One of the additional options here is to add a header to the output file. If you select that option, a header row is added with the names of the columns from the schema of the source. You can rename the default column names when viewing the schema for the source. For example, you could change the first column to First Name and the second column to Last Name. The output file is then generated with a header that uses these names as column names.

12. On the Performance settings page, confirm that cloud units and parallel copies are set to Auto, and click Next. For details about these settings, see the Copy activity performance and tuning guide.

13. On the Summary page, review all settings (task properties, settings for source and destination, and copy settings), and click Next.

14. Review the information on the Summary page, and click Finish. The wizard creates two linked services, two datasets (input and output), and one pipeline in the data factory (from where you launched the Copy Wizard).

Monitor the pipeline (copy task)

1. Click the link Click here to monitor copy pipeline on the Deployment page.
2. You should see the Monitor and Manage application in a separate tab.
3. Change the start time at the top to 04/19/2017 and the end time to 04/27/2017, and then click Apply.
4. You should see five activity windows in the ACTIVITY WINDOWS list. The WindowStart times should cover all days from the pipeline start to the pipeline end times.
5.
Click the Refresh button for the ACTIVITY WINDOWS list a few times until the status of all the activity windows is set to Ready.

6. Now, verify that the output files are generated in the output folder of the adfblobconnector container. You should see the following folder structure in the output folder: 2017/04/21, 2017/04/22, 2017/04/23, 2017/04/24, 2017/04/25. For detailed information about monitoring and managing data factories, see the Monitor and manage Data Factory pipeline article.

Data Factory entities

Now, switch back to the tab with the Data Factory home page. Notice that there are now two linked services, two datasets, and one pipeline in your data factory. Click Author and deploy to launch the Data Factory Editor. You should see the following Data Factory entities in your data factory:

Two linked services: one for the source and the other for the destination. Both linked services refer to the same Azure Storage account in this walkthrough.
Two datasets: an input dataset and an output dataset. In this walkthrough, both use the same blob container but refer to different folders (input and output).
A pipeline: the pipeline contains a copy activity that uses a blob source and a blob sink to copy data from one Azure blob location to another.

The following sections provide more information about these entities.

Linked services

You should see two linked services, one for the source and the other for the destination. In this walkthrough, both definitions look the same except for the names. The type of the linked service is set to AzureStorage. The most important property of the linked service definition is the connectionString, which is used by Data Factory to connect to your Azure Storage account at runtime. Ignore the hubName property in the definition.
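The connectionString is a semicolon-delimited list of Key=Value pairs. A minimal Python sketch of how such a string can be split apart (illustrative only; real SDKs such as the Azure Storage client libraries do this internally):

```python
def parse_connection_string(conn):
    """Split an Azure-style 'Key=Value;...' connection string into a dict.
    str.partition('=') keeps any '=' padding inside base64 account keys."""
    parts = (p for p in conn.split(";") if p)
    return dict(p.partition("=")[::2] for p in parts)

conn = ("DefaultEndpointsProtocol=https;"
        "AccountName=mystorageaccount;AccountKey=abc123==")
settings = parse_connection_string(conn)
print(settings["AccountName"])  # mystorageaccount
print(settings["AccountKey"])   # abc123==
```

The account key shown here is a placeholder, as in the masked definitions below.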
Source blob storage linked service

{
    "name": "Source-BlobStorage-z4y",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
        }
    }
}

Destination blob storage linked service

{
    "name": "Destination-BlobStorage-z4y",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
        }
    }
}

For more information about the Azure Storage linked service, see the Linked service properties section.

Datasets

There are two datasets: an input dataset and an output dataset. The type of both datasets is set to AzureBlob. The input dataset points to the input folder of the adfblobconnector blob container. The external property is set to true for this dataset because the data is not produced by the pipeline with the copy activity that takes this dataset as an input. The output dataset points to the output folder of the same blob container. The output dataset also uses the year, month, and day of the SliceStart system variable to dynamically evaluate the path for the output file. For a list of functions and system variables supported by Data Factory, see Data Factory functions and system variables. The external property is set to false (the default value) because this dataset is produced by the pipeline. For more information about the properties supported by the Azure Blob dataset, see the Dataset properties section.
Input dataset

{
    "name": "InputDataset-z4y",
    "properties": {
        "structure": [
            { "name": "Prop_0", "type": "String" },
            { "name": "Prop_1", "type": "String" }
        ],
        "type": "AzureBlob",
        "linkedServiceName": "Source-BlobStorage-z4y",
        "typeProperties": {
            "folderPath": "adfblobconnector/input/",
            "format": { "type": "TextFormat", "columnDelimiter": "," }
        },
        "availability": { "frequency": "Day", "interval": 1 },
        "external": true,
        "policy": {}
    }
}

Output dataset

{
    "name": "OutputDataset-z4y",
    "properties": {
        "structure": [
            { "name": "Prop_0", "type": "String" },
            { "name": "Prop_1", "type": "String" }
        ],
        "type": "AzureBlob",
        "linkedServiceName": "Destination-BlobStorage-z4y",
        "typeProperties": {
            "folderPath": "adfblobconnector/output/{year}/{month}/{day}",
            "format": { "type": "TextFormat", "columnDelimiter": "," },
            "partitionedBy": [
                { "name": "year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
            ]
        },
        "availability": { "frequency": "Day", "interval": 1 },
        "external": false,
        "policy": {}
    }
}

Pipeline

The pipeline has just one activity. The type of the activity is set to Copy. In the type properties for the activity, there are two sections: one for the source and one for the sink. The source type is set to BlobSource because the activity copies data from blob storage, and the sink type is set to BlobSink because the activity copies data to blob storage. The copy activity takes InputDataset-z4y as the input and OutputDataset-z4y as the output. For more information about the properties supported by BlobSource and BlobSink, see the Copy activity properties section.
{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": { "type": "BlobSource", "recursive": false },
                    "sink": {
                        "type": "BlobSink",
                        "copyBehavior": "MergeFiles",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [ { "name": "InputDataset-z4y" } ],
                "outputs": [ { "name": "OutputDataset-z4y" } ],
                "policy": {
                    "timeout": "1.00:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 3,
                    "longRetry": 0,
                    "longRetryInterval": "00:00:00"
                },
                "scheduler": { "frequency": "Day", "interval": 1 },
                "name": "Activity-0-Blob path_ adfblobconnector_input_->OutputDataset-z4y"
            }
        ],
        "start": "2017-04-21T22:34:00Z",
        "end": "2017-04-25T05:00:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}

JSON examples for copying data to and from Blob Storage

The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data to and from Azure Blob Storage and Azure SQL Database. However, data can be copied directly from any of the supported sources to any of the supported sinks by using the Copy Activity in Azure Data Factory.

JSON Example: Copy data from Blob Storage to SQL Database

The following sample shows:

1. A linked service of type AzureSqlDatabase.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureBlob.
4. An output dataset of type AzureSqlTable.
5. A pipeline with a Copy activity that uses BlobSource and SqlSink.

The sample copies time-series data from an Azure blob to an Azure SQL table hourly. The JSON properties used in these samples are described in the sections following the samples.
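Both the walkthrough pipeline above and the hourly samples that follow rely on the scheduler slicing the pipeline's active period (start to end) into activity windows. A minimal sketch of that slicing, assuming windows align to the frequency boundary (this is not the actual Data Factory scheduler):

```python
from datetime import datetime, timedelta

def daily_windows(start, end):
    """Yield the window start times for a 'frequency': 'Day' scheduler,
    assuming windows align to midnight of the start day."""
    cursor = datetime(start.year, start.month, start.day)  # align to day boundary
    windows = []
    while cursor < end:
        windows.append(cursor)
        cursor += timedelta(days=1)
    return windows

# The walkthrough pipeline runs from 2017-04-21T22:34 to 2017-04-25T05:00.
starts = daily_windows(datetime(2017, 4, 21, 22, 34), datetime(2017, 4, 25, 5, 0))
print(len(starts))  # 5
```

Under that alignment assumption, this start/end pair yields five windows (April 21 through 25), which matches the five activity windows and the five dated output folders seen in the walkthrough's monitoring step.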
Azure SQL linked service:

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}

Azure Storage linked service:

{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key; for the latter one, you specify the Shared Access Signature (SAS) URI. See the Linked Services section for details.

Azure Blob input dataset:

Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The "external": true setting informs Data Factory that the table is external to the data factory and is not produced by an activity in the data factory.
{
    "name": "AzureBlobInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/",
            "fileName": "{Hour}.csv",
            "partitionedBy": [
                { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
                { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
            ],
            "format": { "type": "TextFormat", "columnDelimiter": ",", "rowDelimiter": "\n" }
        },
        "external": true,
        "availability": { "frequency": "Hour", "interval": 1 },
        "policy": {
            "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 }
        }
    }
}

Azure SQL output dataset:

The sample copies data to a table named "MyOutputTable" in an Azure SQL database. Create the table in your Azure SQL database with the same number of columns as you expect the blob CSV file to contain. New rows are added to the table every hour.

{
    "name": "AzureSqlOutput",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": { "tableName": "MyOutputTable" },
        "availability": { "frequency": "Hour", "interval": 1 }
    }
}

A copy activity in a pipeline with a blob source and a SQL sink:

The pipeline contains a Copy activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and the sink type is set to SqlSink.
{
    "name": "SamplePipeline",
    "properties": {
        "start": "2014-06-01T18:00:00",
        "end": "2014-06-01T19:00:00",
        "description": "pipeline with copy activity",
        "activities": [
            {
                "name": "AzureBlobtoSQL",
                "description": "Copy Activity",
                "type": "Copy",
                "inputs": [ { "name": "AzureBlobInput" } ],
                "outputs": [ { "name": "AzureSqlOutput" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}

JSON Example: Copy data from Azure SQL to Azure Blob

The following sample shows:

1. A linked service of type AzureSqlDatabase.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureSqlTable.
4. An output dataset of type AzureBlob.
5. A pipeline with a Copy activity that uses SqlSource and BlobSink.

The sample copies time-series data from an Azure SQL table to an Azure blob hourly. The JSON properties used in these samples are described in the sections following the samples.

Azure SQL linked service:

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}

Azure Storage linked service:

{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key; for the latter one, you specify the Shared Access Signature (SAS) URI. See the Linked Services section for details.
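This second sample reads one hour of time-series data per slice by filtering on a time column. A sketch of the per-slice filter it builds (illustrative Python only; window_start and window_end stand in for the Data Factory slice variables WindowStart and WindowEnd, and the table and column names match this sample):

```python
from datetime import datetime, timedelta

def slice_query(window_start, window_end):
    """Render the hourly time filter for one slice.
    '%Y-%m-%d %H:%M' is the strftime equivalent of the .NET
    format string 'yyyy-MM-dd HH:mm' used by the pipeline."""
    fmt = "%Y-%m-%d %H:%M"
    return ("select * from MyTable where timestampcolumn >= '%s' "
            "AND timestampcolumn < '%s'"
            % (window_start.strftime(fmt), window_end.strftime(fmt)))

start = datetime(2014, 6, 1, 18, 0)
print(slice_query(start, start + timedelta(hours=1)))
```

For the slice starting at 2014-06-01 18:00, this selects rows with timestamps in [18:00, 19:00), so consecutive hourly slices partition the data without overlap.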
Azure SQL input dataset:

The sample assumes you have created a table "MyTable" in Azure SQL and that it contains a column called "timestampcolumn" for time-series data. Setting "external": true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory.

{
    "name": "AzureSqlInput",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": { "tableName": "MyTable" },
        "external": true,
        "availability": { "frequency": "Hour", "interval": 1 },
        "policy": {
            "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 }
        }
    }
}

Azure Blob output dataset:

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.

{
    "name": "AzureBlobOutput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
            "partitionedBy": [
                { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
                { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
            ],
            "format": { "type": "TextFormat", "columnDelimiter": "\t", "rowDelimiter": "\n" }
        },
        "availability": { "frequency": "Hour", "interval": 1 }
    }
}

A copy activity in a pipeline with a SQL source and a blob sink:

The pipeline contains a Copy activity that is configured to use the input and output datasets and is scheduled to run every hour.
In the pipeline JSON definition, the source type is set to SqlSource and the sink type is set to BlobSink. The SQL query specified for the SqlReaderQuery property selects the data from the past hour to copy.

{
    "name": "SamplePipeline",
    "properties": {
        "start": "2014-06-01T18:00:00",
        "end": "2014-06-01T19:00:00",
        "description": "pipeline for copy activity",
        "activities": [
            {
                "name": "AzureSQLtoBlob",
                "description": "copy activity",
                "type": "Copy",
                "inputs": [ { "name": "AzureSQLInput" } ],
                "outputs": [ { "name": "AzureBlobOutput" } ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}

NOTE To map columns from the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory.

Performance and Tuning

See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Move data to and from Azure Cosmos DB using Azure Data Factory

5/11/2017 • 11 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Cosmos DB (DocumentDB API). It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from any supported source data store to Azure Cosmos DB, or from Azure Cosmos DB to any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity, see the Supported data stores table.

IMPORTANT The Azure Cosmos DB connector only supports the DocumentDB API.
To copy data as-is to/from JSON files or another Cosmos DB collection, see Import/Export JSON documents.

Getting started

You can create a pipeline with a copy activity that moves data to/from Azure Cosmos DB by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough of creating a pipeline with the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from Cosmos DB, see the JSON examples section of this article. The following sections provide details about the JSON properties that are used to define Data Factory entities specific to Cosmos DB.

Linked service properties

The following properties are specific to the Azure Cosmos DB linked service:

- type: The type property must be set to DocumentDb. Required: Yes.
- connectionString: Specify the information needed to connect to the Azure Cosmos DB database. Required:
Yes.

Dataset properties

For a full list of sections and properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, and so on). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type DocumentDbCollection has the following property:

- collectionName: Name of the Cosmos DB document collection. Required: Yes.

Example:

{
    "name": "PersonCosmosDbTable",
    "properties": {
        "type": "DocumentDbCollection",
        "linkedServiceName": "CosmosDbLinkedService",
        "typeProperties": { "collectionName": "Person" },
        "external": true,
        "availability": { "frequency": "Day", "interval": 1 }
    }
}

Schema by Data Factory

For schema-free data stores such as Azure Cosmos DB, the Data Factory service infers the schema in one of the following ways:

1. If you specify the structure of the data by using the structure property in the dataset definition, the Data Factory service honors this structure as the schema. In this case, if a row does not contain a value for a column, a null value is provided for it.
2. If you do not specify the structure of the data by using the structure property in the dataset definition, the Data Factory service infers the schema by using the first row of the data. In this case, if the first row does not contain the full schema, some columns will be missing from the result of the copy operation.

Therefore, for schema-free data sources, the best practice is to specify the structure of the data by using the structure property.

Copy activity properties

For a full list of sections and properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities.
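The two inference behaviors described in the Schema by Data Factory section above can be sketched in Python (illustrative only, not the Data Factory service itself):

```python
# Two sample documents from a schema-free store; the first one lacks 'city'.
docs = [
    {"id": 1, "name": "John"},
    {"id": 2, "name": "Jane", "city": "Seattle"},
]

def infer_from_first_row(rows):
    """Behavior 2: infer the schema from the first row only."""
    return list(rows[0].keys())

def project(rows, structure):
    """Behavior 1: with an explicit structure, missing values become null."""
    return [{col: row.get(col) for col in structure} for row in rows]

print(infer_from_first_row(docs))             # ['id', 'name'] -- 'city' is lost
print(project(docs, ["id", "name", "city"]))  # explicit structure keeps 'city'
```

This illustrates why the explicit structure property is the recommended practice: relying on the first row silently drops any column that row happens to omit.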
NOTE The Copy Activity takes only one input and produces only one output.

Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type; in the case of the Copy activity, they vary depending on the types of sources and sinks. When the source is of type DocumentDbCollectionSource, the following properties are available in the typeProperties section:

- query: Specify the query to read data. Allowed values: any query string supported by Azure Cosmos DB. Example: SELECT c.BusinessEntityID, c.PersonType, c.NameStyle, c.Title, c.Name.First AS FirstName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\". Required: No. If not specified, the SQL statement that is executed is: select <columns defined in structure> from mycollection.
- nestingSeparator: Special character to indicate that the document is nested. Allowed values: any character. Required: No.

Azure Cosmos DB is a NoSQL store for JSON documents, where nested structures are allowed. Azure Data Factory enables the user to denote the hierarchy via the nestingSeparator, which is . (dot) in the above examples. With the separator, the copy activity generates the "Name" object with three child elements First, Middle, and Last, according to "Name.First", "Name.Middle", and "Name.Last" in the table definition.

DocumentDbCollectionSink supports the following properties:

- nestingSeparator: A special character in the source column name, indicating that a nested document is needed. Allowed values: the character that is used to separate nesting levels. Default value is . (dot). For the example above, Name.First in the output table produces the following JSON structure in the Cosmos DB document: "Name": { "First": "John" }. Required: No.
- writeBatchSize: Number of parallel requests to the Azure Cosmos DB service to create documents.
Integer No (default: 5) timespan No You can fine-tune the performance when copying data to/from Cosmos DB by using this property. You can expect a better performance when you increase writeBatchSize because more parallel requests to Cosmos DB are sent. However you’ll need to avoid throttling that can throw the error message: "Request rate is large". Throttling is decided by a number of factors, including size of documents, number of terms in documents, indexing policy of target collection, etc. For copy operations, you can use a better collection (e.g. S3) to have the most throughput available (2,500 request units/second). writeBatchTimeout Wait time for the operation to complete before it times out. Example: “00:30:00” (30 minutes). Import/Export JSON documents Using this Cosmos DB connector, you can easily Import JSON documents from various sources into Cosmos DB, including Azure Blob, Azure Data Lake, onpremises File System or other file-based stores supported by Azure Data Factory. Export JSON documents from Cosmos DB collecton into various file-based stores. Migrate data between two Cosmos DB collections as-is. To achieve such schema-agnostic copy, When using copy wizard, check the "Export as-is to JSON files or Cosmos DB collection" option. When using JSON editing, do not specify the "structure" section in Cosmos DB dataset(s) nor "nestingSeparator" property on Cosmos DB source/sink in copy activity. To import from/export to JSON files, in the file store dataset specify format type as "JsonFormat", config "filePattern" and skip the rest format settings, see JSON format section on details. JSON examples The following examples provide sample JSON definitions that you can use to create a pipeline by using Azure portal or Visual Studio or Azure PowerShell. They show how to copy data to and from Azure Cosmos DB and Azure Blob Storage. 
However, data can be copied directly from any of the supported sources to any of the supported sinks by using the Copy Activity in Azure Data Factory.

Example: Copy data from Azure Cosmos DB to Azure Blob storage

The sample below shows:

1. A linked service of type DocumentDb.
2. A linked service of type AzureStorage.
3. An input dataset of type DocumentDbCollection.
4. An output dataset of type AzureBlob.
5. A pipeline with a Copy Activity that uses DocumentDbCollectionSource and BlobSink.

The sample copies data in Azure Cosmos DB to Azure Blob storage. The JSON properties used in these samples are described in sections following the samples.

Azure Cosmos DB linked service:

```json
{
  "name": "CosmosDbLinkedService",
  "properties": {
    "type": "DocumentDb",
    "typeProperties": {
      "connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
    }
  }
}
```

Azure Blob storage linked service:

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

Azure Cosmos DB input dataset:

The sample assumes you have a collection named Person in an Azure Cosmos DB database. Setting "external": true and specifying externalData policy informs the Azure Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory.

```json
{
  "name": "PersonCosmosDbTable",
  "properties": {
    "type": "DocumentDbCollection",
    "linkedServiceName": "CosmosDbLinkedService",
    "typeProperties": { "collectionName": "Person" },
    "external": true,
    "availability": { "frequency": "Day", "interval": 1 }
  }
}
```

Azure Blob output dataset:

Data is copied to a new blob once a day ("frequency": "Day", "interval": 1).
{ "name": "PersonBlobTableOut", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "docdb", "format": { "type": "TextFormat", "columnDelimiter": ",", "nullValue": "NULL" } }, "availability": { "frequency": "Day", "interval": 1 } } } Sample JSON document in the Person collection in a Cosmos DB database: { "PersonId": 2, "Name": { "First": "Jane", "Middle": "", "Last": "Doe" } } Cosmos DB supports querying documents using a SQL like syntax over hierarchical JSON documents. Example: SELECT Person.PersonId, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person The following pipeline copies data from the Person collection in the Azure Cosmos DB database to an Azure blob. As part of the copy activity the input and output datasets have been specified. { "name": "DocDbToBlobPipeline", "properties": { "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "DocumentDbCollectionSource", "query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person", "nestingSeparator": "." }, "sink": { "type": "BlobSink", "blobWriterAddHeader": true, "writeBatchSize": 1000, "writeBatchTimeout": "00:00:59" } }, "inputs": [ { "name": "PersonCosmosDbTable" } ], "outputs": [ { "name": "PersonBlobTableOut" } ], "policy": { "concurrency": 1 }, "name": "CopyFromDocDbToBlob" } ], "start": "2015-04-01T00:00:00Z", "end": "2015-04-02T00:00:00Z" } } Example: Copy data from Azure Blob to Azure Cosmos DB The sample below shows: 1. 2. 3. 4. 5. A linked service of type DocumentDb. A linked service of type AzureStorage. An input dataset of type AzureBlob. An output dataset of type DocumentDbCollection. A pipeline with Copy Activity that uses BlobSource and DocumentDbCollectionSink. The sample copies data from Azure blob to Azure Cosmos DB. 
The JSON properties used in these samples are described in sections following the samples.

Azure Blob storage linked service:

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

Azure Cosmos DB linked service:

```json
{
  "name": "CosmosDbLinkedService",
  "properties": {
    "type": "DocumentDb",
    "typeProperties": {
      "connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
    }
  }
}
```

Azure Blob input dataset:

```json
{
  "name": "PersonBlobTableIn",
  "properties": {
    "structure": [
      { "name": "Id", "type": "Int" },
      { "name": "FirstName", "type": "String" },
      { "name": "MiddleName", "type": "String" },
      { "name": "LastName", "type": "String" }
    ],
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "fileName": "input.csv",
      "folderPath": "docdb",
      "format": { "type": "TextFormat", "columnDelimiter": ",", "nullValue": "NULL" }
    },
    "external": true,
    "availability": { "frequency": "Day", "interval": 1 }
  }
}
```

Azure Cosmos DB output dataset:

The sample copies data to a collection named "Person".

```json
{
  "name": "PersonCosmosDbTableOut",
  "properties": {
    "structure": [
      { "name": "Id", "type": "Int" },
      { "name": "Name.First", "type": "String" },
      { "name": "Name.Middle", "type": "String" },
      { "name": "Name.Last", "type": "String" }
    ],
    "type": "DocumentDbCollection",
    "linkedServiceName": "CosmosDbLinkedService",
    "typeProperties": { "collectionName": "Person" },
    "availability": { "frequency": "Day", "interval": 1 }
  }
}
```

The following pipeline copies data from Azure Blob storage to the Person collection in Cosmos DB. The input and output datasets are specified as part of the copy activity.
{ "name": "BlobToDocDbPipeline", "properties": { "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "DocumentDbCollectionSink", "nestingSeparator": ".", "writeBatchSize": 2, "writeBatchTimeout": "00:00:00" } "translator": { "type": "TabularTranslator", "ColumnMappings": "FirstName: Name.First, MiddleName: Name.Middle, LastName: Name.Last, BusinessEntityID: BusinessEntityID, PersonType: PersonType, NameStyle: NameStyle, Title: Title, Suffix: Suffix, EmailPromotion: EmailPromotion, rowguid: rowguid, ModifiedDate: ModifiedDate" } }, "inputs": [ { "name": "PersonBlobTableIn" } ], "outputs": [ { "name": "PersonCosmosDbTableOut" } ], "policy": { "concurrency": 1 }, "name": "CopyFromBlobToDocDb" } ], "start": "2015-04-14T00:00:00Z", "end": "2015-04-15T00:00:00Z" } } If the sample blob input is as 1,John,,Doe Then the output JSON in Cosmos DB will be as: { "Id": 1, "Name": { "First": "John", "Middle": null, "Last": "Doe" }, "id": "a5e8595c-62ec-4554-a118-3940f4ff70b6" } Azure Cosmos DB is a NoSQL store for JSON documents, where nested structures are allowed. Azure Data Factory enables user to denote hierarchy via nestingSeparator, which is “.” in this example. With the separator, the copy activity will generate the “Name” object with three children elements First, Middle and Last, according to “Name.First”, “Name.Middle” and “Name.Last” in the table definition. Appendix 1. Question: Does the Copy Activity support update of existing records? Answer: No. 2. Question: How does a retry of a copy to Azure Cosmos DB deal with already copied records? Answer: If records have an "ID" field and the copy operation tries to insert a record with the same ID, the copy operation throws an error. 3. Question: Does Data Factory support range or hash-based data partitioning? Answer: No. 4. Question: Can I specify more than one Azure Cosmos DB collection for a table? Answer: No. Only one collection can be specified at this time. 
Performance and tuning

See the Copy Activity performance and tuning guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Copy data to and from Data Lake Store by using Data Factory
5/11/2017 • 18 min to read • Edit Online

This article explains how to use Copy Activity in Azure Data Factory to move data to and from Azure Data Lake Store. It builds on the Data movement activities article, an overview of data movement with Copy Activity.

Supported scenarios

You can copy data from Azure Data Lake Store to the following data stores:

| CATEGORY | DATA STORE |
| --- | --- |
| Azure | Azure Blob storage, Azure Data Lake Store, Azure Cosmos DB (DocumentDB API), Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage |
| Databases | SQL Server, Oracle |
| File | File system |

You can copy data from the following data stores to Azure Data Lake Store:

| CATEGORY | DATA STORE |
| --- | --- |
| Azure | Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage |
| Databases | Amazon Redshift, DB2, MySQL, Oracle, PostgreSQL, SAP Business Warehouse, SAP HANA, SQL Server, Sybase, Teradata |
| NoSQL | Cassandra, MongoDB |
| File | Amazon S3, File system, FTP, HDFS, SFTP |
| Others | Generic HTTP, Generic OData, Generic ODBC, Salesforce, Web table (table from HTML), GE Historian |

NOTE
Create a Data Lake Store account before creating a pipeline with Copy Activity. For more information, see Get started with Azure Data Lake Store.

Supported authentication types

The Data Lake Store connector supports these authentication types:

- Service principal authentication
- User credential (OAuth) authentication

We recommend that you use service principal authentication, especially for a scheduled data copy; token expiration behavior can occur with user credential authentication. For configuration details, see the Linked service properties section.
Get started

You can create a pipeline with a copy activity that moves data to or from an Azure Data Lake Store by using different tools and APIs. The easiest way to create a pipeline to copy data is to use the Copy Wizard. For a tutorial on creating a pipeline by using the Copy Wizard, see Tutorial: Create a pipeline using Copy Wizard.

You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are copying data from Azure Blob storage to an Azure Data Lake Store, you create two linked services to link your Azure storage account and Azure Data Lake Store to your data factory. For linked service properties that are specific to Azure Data Lake Store, see the linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data, and another dataset to specify the folder and file path in the Data Lake store that holds the data copied from the blob storage. For dataset properties that are specific to Azure Data Lake Store, see the dataset properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and AzureDataLakeStoreSink as a sink for the copy activity.
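The way these steps chain together (datasets reference linked services; a pipeline's copy activity references datasets) can be sketched as a small consistency check. This is an illustrative sketch of the entity relationships only, not a Data Factory SDK or API call, and all entity names are taken from the examples in this article.

```python
def validate_factory(linked_services, datasets, pipelines):
    """Check that every dataset references a defined linked service and every
    pipeline activity's inputs/outputs reference defined datasets.
    Sketch of how Data Factory entities relate; not an SDK function."""
    ls_names = {ls["name"] for ls in linked_services}
    ds_names = {ds["name"] for ds in datasets}
    for ds in datasets:
        if ds["linkedServiceName"] not in ls_names:
            raise ValueError("dataset %s: unknown linked service" % ds["name"])
    for pipeline in pipelines:
        for activity in pipeline["activities"]:
            for name in activity["inputs"] + activity["outputs"]:
                if name not in ds_names:
                    raise ValueError("activity %s: unknown dataset %s"
                                     % (activity["name"], name))
    return True

# Mirrors the blob-to-Data-Lake example: two linked services, two datasets,
# and one pipeline with a single copy activity.
ok = validate_factory(
    linked_services=[{"name": "StorageLinkedService"},
                     {"name": "AzureDataLakeStoreLinkedService"}],
    datasets=[
        {"name": "AzureBlobInput", "linkedServiceName": "StorageLinkedService"},
        {"name": "AzureDataLakeStoreOutput",
         "linkedServiceName": "AzureDataLakeStoreLinkedService"},
    ],
    pipelines=[{"activities": [{"name": "AzureBlobtoDataLake",
                                "inputs": ["AzureBlobInput"],
                                "outputs": ["AzureDataLakeStoreOutput"]}]}],
)
```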
Similarly, if you are copying from Azure Data Lake Store to Azure Blob storage, you use AzureDataLakeStoreSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure Data Lake Store, see the copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure Data Lake Store, see the JSON examples section of this article.

The following sections provide details about JSON properties that are used to define Data Factory entities specific to Data Lake Store.

Linked service properties

A linked service links a data store to a data factory. You create a linked service of type AzureDataLakeStore to link your Data Lake Store data to your data factory. The following table describes JSON elements specific to Data Lake Store linked services. You can choose between service principal and user credential authentication.

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to AzureDataLakeStore. | Yes |
| dataLakeStoreUri | Information about the Azure Data Lake Store account. This information takes one of the following formats: `https://[accountname].azuredatalakestore.net/webhdfs/v1` or `adl://[accountname].azuredatalakestore.net/`. | Yes |
| subscriptionId | Azure subscription ID to which the Data Lake Store account belongs. | Required for sink |
| resourceGroupName | Azure resource group name to which the Data Lake Store account belongs. | Required for sink |

Service principal authentication (recommended)

To use service principal authentication, register an application entity in Azure Active Directory (Azure AD) and grant it access to Data Lake Store. For detailed steps, see Service-to-service authentication. Make note of the following values, which you use to define the linked service:

- Application ID
- Application key
- Tenant ID

IMPORTANT
If you are using the Copy Wizard to author data pipelines, make sure that you grant the service principal at least a Reader role in access control (identity and access management) for the Data Lake Store account. Also, grant the service principal at least Read + Execute permission to your Data Lake Store root ("/") and its children. Otherwise you might see the message "The credentials provided are invalid."
After you create or update a service principal in Azure AD, it can take a few minutes for the changes to take effect. Check the service principal and Data Lake Store access control list (ACL) configurations. If you still see the message "The credentials provided are invalid," wait a while and try again.

Use service principal authentication by specifying the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| servicePrincipalId | Specify the application's client ID. | Yes |
| servicePrincipalKey | Specify the application's key. | Yes |
| tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. | Yes |

Example: Service principal authentication

```json
{
  "name": "AzureDataLakeStoreLinkedService",
  "properties": {
    "type": "AzureDataLakeStore",
    "typeProperties": {
      "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
      "servicePrincipalId": "<service principal id>",
      "servicePrincipalKey": "<service principal key>",
      "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
      "subscriptionId": "<subscription of ADLS>",
      "resourceGroupName": "<resource group of ADLS>"
    }
  }
}
```

User credential authentication

Alternatively, you can use user credential authentication to copy from or to Data Lake Store by specifying the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| authorization | Click the Authorize button in the Data Factory Editor and enter your credential, which assigns the autogenerated authorization URL to this property. | Yes |
| sessionId | OAuth session ID from the OAuth authorization session. Each session ID is unique and can be used only once. This setting is automatically generated when you use the Data Factory Editor. | Yes |

Example: User credential authentication

```json
{
  "name": "AzureDataLakeStoreLinkedService",
  "properties": {
    "type": "AzureDataLakeStore",
    "typeProperties": {
      "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
      "sessionId": "<session ID>",
      "authorization": "<authorization URL>",
      "subscriptionId": "<subscription of ADLS>",
      "resourceGroupName": "<resource group of ADLS>"
    }
  }
}
```

Token expiration

The authorization code that you generate by using the Authorize button expires after a certain amount of time. The following message means that the authentication token has expired:

```
Credential operation error: invalid_grant - AADSTS70002: Error validating credentials. AADSTS70008: The provided access grant is expired or revoked. Trace ID: d18629e8-af88-43c5-88e3-d8419eb1fca1 Correlation ID: fac30a0c-6be6-4e02-8d69-a776d2ffefd7 Timestamp: 2015-12-15 21-09-31Z.
```
The following table shows the expiration times of different types of user accounts:

| USER TYPE | EXPIRES AFTER |
| --- | --- |
| User accounts not managed by Azure Active Directory (for example, @hotmail.com or @live.com) | 12 hours |
| User accounts managed by Azure Active Directory | 14 days after the last slice run. 90 days, if a slice based on an OAuth-based linked service runs at least once every 14 days. |

If you change your password before the token expiration time, the token expires immediately, and you see the message mentioned earlier in this section. You can reauthorize the account by using the Authorize button when the token expires to redeploy the linked service. You can also generate values for the sessionId and authorization properties programmatically by using the following code:

```csharp
if (linkedService.Properties.TypeProperties is AzureDataLakeStoreLinkedService ||
    linkedService.Properties.TypeProperties is AzureDataLakeAnalyticsLinkedService)
{
    AuthorizationSessionGetResponse authorizationSession =
        this.Client.OAuth.Get(this.ResourceGroupName, this.DataFactoryName, linkedService.Properties.Type);

    WindowsFormsWebAuthenticationDialog authenticationDialog = new WindowsFormsWebAuthenticationDialog(null);
    string authorization = authenticationDialog.AuthenticateAAD(
        authorizationSession.AuthorizationSession.Endpoint,
        new Uri("urn:ietf:wg:oauth:2.0:oob"));

    AzureDataLakeStoreLinkedService azureDataLakeStoreProperties =
        linkedService.Properties.TypeProperties as AzureDataLakeStoreLinkedService;
    if (azureDataLakeStoreProperties != null)
    {
        azureDataLakeStoreProperties.SessionId = authorizationSession.AuthorizationSession.SessionId;
        azureDataLakeStoreProperties.Authorization = authorization;
    }

    AzureDataLakeAnalyticsLinkedService azureDataLakeAnalyticsProperties =
        linkedService.Properties.TypeProperties as AzureDataLakeAnalyticsLinkedService;
    if (azureDataLakeAnalyticsProperties != null)
    {
        azureDataLakeAnalyticsProperties.SessionId = authorizationSession.AuthorizationSession.SessionId;
        azureDataLakeAnalyticsProperties.Authorization = authorization;
    }
}
```

For details about the Data Factory classes used in the code, see the AzureDataLakeStoreLinkedService Class, AzureDataLakeAnalyticsLinkedService Class, and AuthorizationSessionGetResponse Class topics. Add a reference to version 2.9.10826.1824 of Microsoft.IdentityModel.Clients.ActiveDirectory.WindowsForms.dll for the WindowsFormsWebAuthenticationDialog class used in the code.

Dataset properties

To specify a dataset to represent input data in a Data Lake Store, set the type property of the dataset to AzureDataLakeStore, and set the linkedServiceName property of the dataset to the name of the Data Lake Store linked service. For a full list of JSON sections and properties available for defining datasets, see the Creating datasets article. Sections of a dataset in JSON, such as structure, availability, and policy, are similar for all dataset types (Azure SQL database, Azure blob, and Azure table, for example). The typeProperties section is different for each type of dataset and provides information such as location and format of the data in the data store. The typeProperties section for a dataset of type AzureDataLakeStore contains the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| folderPath | Path to the container and folder in Data Lake Store. | Yes |
| fileName | Name of the file in Azure Data Lake Store. The fileName property is optional and case-sensitive. If you specify fileName, the activity (including Copy) works on the specific file. When fileName is not specified, Copy includes all files in folderPath in the input dataset. When fileName is not specified for an output dataset and preserveHierarchy is not specified in the activity sink, the name of the generated file is in the format `Data.<Guid>.txt`. For example: `Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt`. | No |
| partitionedBy | The partitionedBy property is optional. You can use it to specify a dynamic path and file name for time-series data. For example, folderPath can be parameterized for every hour of data. For details and examples, see The partitionedBy property. | No |
| format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, ORC format, and Parquet format sections in the File and compression formats supported by Azure Data Factory article. If you want to copy files "as-is" between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | No |
| compression | Specify the type and level of compression for the data. Supported types are GZip, Deflate, BZip2, and ZipDeflate. Supported levels are Optimal and Fastest. For more information, see File and compression formats supported by Azure Data Factory. | No |

The partitionedBy property

You can specify dynamic folderPath and fileName properties for time-series data with the partitionedBy property, Data Factory functions, and system variables. For details, see the Azure Data Factory - functions and system variables article.

In the following example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the format specified (yyyyMMddHH). The name SliceStart refers to the start time of the slice. The folderPath property is different for each slice, as in `wikidatagateway/wikisampledataout/2014100103` or `wikidatagateway/wikisampledataout/2014100104`.
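The {Slice} substitution just described can be sketched in a few lines of Python. The helper name and the strftime mapping are illustrative assumptions; Data Factory itself uses .NET format strings such as yyyyMMddHH, which corresponds to strftime's %Y%m%d%H.

```python
from datetime import datetime

def resolve_path(template, partitioned_by, slice_start):
    """Substitute partitionedBy variables into a folderPath template, the way
    Data Factory resolves {Slice} (or {Year}/{Month}/...) from SliceStart.
    Illustrative sketch; "strftime" here stands in for the .NET format string."""
    values = {p["name"]: slice_start.strftime(p["strftime"]) for p in partitioned_by}
    return template.format(**values)

# yyyyMMddHH in .NET format corresponds to %Y%m%d%H in strftime.
partitions = [{"name": "Slice", "strftime": "%Y%m%d%H"}]
path = resolve_path("wikidatagateway/wikisampledataout/{Slice}", partitions,
                    datetime(2014, 10, 1, 3))
```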
"folderPath": "wikidatagateway/wikisampledataout/{Slice}", "partitionedBy": [ { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }, ], In the following example, the year, month, day, and time of that are used by the folderPath and fileName properties: SliceStart are extracted into separate variables "folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", "fileName": "{Hour}.csv", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } ], For more details on time-series datasets, scheduling, and slices, see the Datasets in Azure Data Factory and Data Factory scheduling and execution articles. Copy activity properties For a full list of sections and properties available for defining activities, see the Creating pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities. The properties available in the typeProperties section of an activity vary with each activity type. For a copy activity, they vary depending on the types of sources and sinks. AzureDataLakeStoreSource supports the following property in the typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED recursive Indicates whether the data is read recursively from the subfolders or only from the specified folder. True (default value), False No AzureDataLakeStoreSink supports the following properties in the typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED copyBehavior Specifies the copy behavior. PreserveHierarchy: Preserves the file hierarchy in the target folder. 
The relative path of source file to source folder is identical to the relative path of target file to target folder. No FlattenHierarchy: All files from the source folder are created in the first level of the target folder. The target files are created with autogenerated names. MergeFiles: Merges all files from the source folder to one file. If the file or blob name is specified, the merged file name is the specified name. Otherwise, the file name is autogenerated. recursive and copyBehavior examples This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values. RECURSIVE COPYBEHAVIOR RESULTING BEHAVIOR true preserveHierarchy For a source folder Folder1 with the following structure: Folder1 File1 File2 Subfolder1 File3 File4 File5 the target folder Folder1 is created with the same structure as the source Folder1 File1 File2 Subfolder1 File3 File4 File5. RECURSIVE COPYBEHAVIOR RESULTING BEHAVIOR true flattenHierarchy For a source folder Folder1 with the following structure: Folder1 File1 File2 Subfolder1 File3 File4 File5 the target Folder1 is created with the following structure: Folder1 auto-generated name for File1 auto-generated name for File2 auto-generated name for File3 auto-generated name for File4 auto-generated name for File5 true mergeFiles For a source folder Folder1 with the following structure: Folder1 File1 File2 Subfolder1 File3 File4 File5 the target Folder1 is created with the following structure: Folder1 File1 + File2 + File3 + File4 + File 5 contents are merged into one file with auto-generated file name RECURSIVE COPYBEHAVIOR RESULTING BEHAVIOR false preserveHierarchy For a source folder Folder1 with the following structure: Folder1 File1 File2 Subfolder1 File3 File4 File5 the target folder Folder1 is created with the following structure Folder1 File1 File2 Subfolder1 with File3, File4, and File5 are not picked up. 
false flattenHierarchy For a source folder Folder1 with the following structure: Folder1 File1 File2 Subfolder1 File3 File4 File5 the target folder Folder1 is created with the following structure Folder1 auto-generated name for File1 auto-generated name for File2 Subfolder1 with File3, File4, and File5 are not picked up. RECURSIVE COPYBEHAVIOR RESULTING BEHAVIOR false mergeFiles For a source folder Folder1 with the following structure: Folder1 File1 File2 Subfolder1 File3 File4 File5 the target folder Folder1 is created with the following structure Folder1 File1 + File2 contents are merged into one file with auto-generated file name. auto-generated name for File1 Subfolder1 with File3, File4, and File5 are not picked up. Supported file and compression formats For details, see the File and compression formats in Azure Data Factory article. JSON examples for copying data to and from Data Lake Store The following examples provide sample JSON definitions. You can use these sample definitions to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. The examples show how to copy data to and from Data Lake Store and Azure Blob storage. However, data can be copied directly from any of the sources to any of the supported sinks. For more information, see the section "Supported data stores and formats" in the Move data by using Copy Activity article. Example: Copy data from Azure Blob Storage to Azure Data Lake Store The example code in this section shows: A linked service of type AzureStorage. A linked service of type AzureDataLakeStore. An input dataset of type AzureBlob. An output dataset of type AzureDataLakeStore. A pipeline with a copy activity that uses BlobSource and AzureDataLakeStoreSink. The examples show how time-series data from Azure Blob Storage is copied to Data Lake Store every hour. 
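Before the examples, the recursive and copyBehavior combinations tabulated earlier can be simulated with a short sketch. File lists and the "auto(...)" naming are illustrative stand-ins; the real service generates names internally.

```python
def plan_copy(files, recursive, copy_behavior):
    """Simulate which source files a copy picks up and how they land in the
    target, per the recursive/copyBehavior table earlier in this article.
    Illustrative sketch; auto-generated names are shown as "auto(<name>)"."""
    # With recursive=False, files inside subfolders (path contains "/") are skipped.
    picked = files if recursive else [f for f in files if "/" not in f]
    if copy_behavior == "PreserveHierarchy":
        return picked                                   # relative paths kept
    if copy_behavior == "FlattenHierarchy":
        return ["auto(%s)" % f.split("/")[-1] for f in picked]
    if copy_behavior == "MergeFiles":
        return ["merged(auto-generated name)"]          # one combined file
    raise ValueError("unknown copyBehavior: %s" % copy_behavior)

# Source Folder1 from the table: File1, File2, Subfolder1/File3..File5.
source = ["File1", "File2", "Subfolder1/File3", "Subfolder1/File4", "Subfolder1/File5"]
```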
Azure Storage linked service

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

Azure Data Lake Store linked service

```json
{
  "name": "AzureDataLakeStoreLinkedService",
  "properties": {
    "type": "AzureDataLakeStore",
    "typeProperties": {
      "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
      "servicePrincipalId": "<service principal id>",
      "servicePrincipalKey": "<service principal key>",
      "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
      "subscriptionId": "<subscription of ADLS>",
      "resourceGroupName": "<resource group of ADLS>"
    }
  }
}
```

NOTE
For configuration details, see the Linked service properties section.

Azure blob input dataset

In the following example, data is picked up from a new blob every hour ("frequency": "Hour", "interval": 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day portion of the start time, and the file name uses the hour portion. The "external": true setting informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory.
{ "name": "AzureBlobInput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure Data Lake Store output dataset The following example copies data to Data Lake Store. New data is copied to Data Lake Store every hour. { "name": "AzureDataLakeStoreOutput", "properties": { "type": "AzureDataLakeStore", "linkedServiceName": "AzureDataLakeStoreLinkedService", "typeProperties": { "folderPath": "datalake/output/" }, "availability": { "frequency": "Hour", "interval": 1 } } } Copy activity in a pipeline with a blob source and a Data Lake Store sink In the following example, the pipeline contains a copy activity that is configured to use the input and output datasets. The copy activity is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource , and the sink type is set to AzureDataLakeStoreSink . 
{ "name":"SamplePipeline", "properties": { "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline with copy activity", "activities": [ { "name": "AzureBlobtoDataLake", "description": "Copy Activity", "type": "Copy", "inputs": [ { "name": "AzureBlobInput" } ], "outputs": [ { "name": "AzureDataLakeStoreOutput" } ], "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "AzureDataLakeStoreSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } Example: Copy data from Azure Data Lake Store to an Azure blob The example code in this section shows: A linked service of type AzureDataLakeStore. A linked service of type AzureStorage. An input dataset of type AzureDataLakeStore. An output dataset of type AzureBlob. A pipeline with a copy activity that uses AzureDataLakeStoreSource and BlobSink. The code copies time-series data from Data Lake Store to an Azure blob every hour. Azure Data Lake Store linked service { "name": "AzureDataLakeStoreLinkedService", "properties": { "type": "AzureDataLakeStore", "typeProperties": { "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1", "servicePrincipalId": "<service principal id>", "servicePrincipalKey": "<service principal key>", "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>" } } } NOTE For configuration details, see the Linked service properties section. Azure Storage linked service { "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } } Azure Data Lake input dataset In this example, setting "external" to true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. 
{ "name": "AzureDataLakeStoreInput", "properties": { "type": "AzureDataLakeStore", "linkedServiceName": "AzureDataLakeStoreLinkedService", "typeProperties": { "folderPath": "datalake/input/", "fileName": "SearchLog.tsv", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" } }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure blob output dataset In the following example, data is written to a new blob every hour ( "frequency": "Hour", "interval": 1 ). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hours portion of the start time. { "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ], "format": { "type": "TextFormat", "columnDelimiter": "\t", "rowDelimiter": "\n" } }, "availability": { "frequency": "Hour", "interval": 1 } } } A copy activity in a pipeline with an Azure Data Lake Store source and a blob sink In the following example, the pipeline contains a copy activity that is configured to use the input and output datasets. The copy activity is scheduled to run every hour. In the pipeline JSON definition, the source type is set to AzureDataLakeStoreSource , and the sink type is set to BlobSink . 
{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline for copy activity", "activities":[ { "name": "AzureDakeLaketoBlob", "description": "copy activity", "type": "Copy", "inputs": [ { "name": "AzureDataLakeStoreInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "typeProperties": { "source": { "type": "AzureDataLakeStoreSource", }, "sink": { "type": "BlobSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } In the copy activity definition, you can also map columns from the source dataset to columns in the sink dataset. For details, see Mapping dataset columns in Azure Data Factory. Performance and tuning To learn about the factors that affect Copy Activity performance and how to optimize it, see the Copy Activity performance and tuning guide article. Push data to an Azure Search index by using Azure Data Factory 6/5/2017 • 8 min to read • Edit Online This article describes how to use the Copy Activity to push data from a supported source data store to Azure Search index. Supported source data stores are listed in the Source column of the supported sources and sinks table. This article builds on the data movement activities article, which presents a general overview of data movement with Copy Activity and supported data store combinations. Enabling connectivity To allow Data Factory service connect to an on-premises data store, you install Data Management Gateway in your on-premises environment. You can install gateway on the same machine that hosts the source data store or on a separate machine to avoid competing for resources with the data store. Data Management Gateway connects on-premises data sources to cloud services in a secure and managed way. See Move data between on-premises and cloud article for details about Data Management Gateway. 
Getting started You can create a pipeline with a copy activity that pushes data from a source data store to an Azure Search index by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2. Create datasets to represent input and output data for the copy operation. 3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data to an Azure Search index, see the JSON example: Copy data from on-premises SQL Server to Azure Search index section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to the Azure Search index: Linked service properties The following table provides descriptions for JSON elements that are specific to the Azure Search linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: AzureSearch. Yes url URL for the Azure Search service. Yes key Admin key for the Azure Search service.
Yes Dataset properties For a full list of sections and properties that are available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. The typeProperties section is different for each type of dataset. The typeProperties section for a dataset of the type AzureSearchIndex has the following properties: PROPERTY DESCRIPTION REQUIRED type The type property must be set to AzureSearchIndex. Yes indexName Name of the Azure Search index. Data Factory does not create the index. The index must exist in Azure Search. Yes Copy activity properties For a full list of sections and properties that are available for defining activities, see the Creating pipelines article. Properties such as name, description, input and output tables, and various policies are available for all types of activities. In contrast, properties available in the typeProperties section vary with each activity type. For Copy Activity, they vary depending on the types of sources and sinks. For Copy Activity, when the sink is of the type AzureSearchIndexSink, the following properties are available in the typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED WriteBehavior Specifies whether to merge or replace when a document already exists in the index. See the WriteBehavior property. Merge (default) Upload No WriteBatchSize Uploads data into the Azure Search index when the buffer size reaches writeBatchSize. See the WriteBatchSize property for details. 1 to 1,000. Default value is 1000. No WriteBehavior property AzureSearchSink upserts when writing data. In other words, when writing a document, if the document key already exists in the Azure Search index, Azure Search updates the existing document rather than throwing a conflict exception.
The AzureSearchSink provides the following two upsert behaviors (by using the Azure Search SDK): Merge: combines all the columns in the new document with the existing one. For columns with a null value in the new document, the value in the existing one is preserved. Upload: the new document replaces the existing one. For columns not specified in the new document, the value is set to null whether there is a non-null value in the existing document or not. The default behavior is Merge. WriteBatchSize property The Azure Search service supports writing documents as a batch. A batch can contain 1 to 1,000 actions. An action handles one document to perform the upload/merge operation. Data type support The following table specifies whether an Azure Search data type is supported or not. AZURE SEARCH DATA TYPE SUPPORTED IN AZURE SEARCH SINK String Y Int32 Y Int64 Y Double Y Boolean Y DateTimeOffset Y String Array N GeographyPoint N JSON example: Copy data from on-premises SQL Server to Azure Search index The following sample shows: 1. A linked service of type AzureSearch. 2. A linked service of type OnPremisesSqlServer. 3. An input dataset of type SqlServerTable. 4. An output dataset of type AzureSearchIndex. 5. A pipeline with a Copy activity that uses SqlSource and AzureSearchIndexSink. The sample copies time-series data from an on-premises SQL Server database to an Azure Search index hourly. The JSON properties used in this sample are described in sections following the samples. As a first step, set up the Data Management Gateway on your on-premises machine. The instructions are in the moving data between on-premises locations and cloud article.
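The Merge/Upload and WriteBatchSize semantics above can be sketched in Python. This is an illustrative model only, not the sink's or the Azure Search SDK's actual code; the helper names upsert and batch_actions are hypothetical:

```python
def upsert(existing, new_doc, behavior="Merge"):
    """Model the two upsert behaviors of the Azure Search sink."""
    if behavior == "Upload":
        # The new document replaces the existing one; columns it omits become null.
        return {col: new_doc.get(col) for col in set(existing) | set(new_doc)}
    # Merge: null columns in the new document keep the existing value.
    merged = dict(existing)
    for col, val in new_doc.items():
        if val is not None:
            merged[col] = val
    return merged

def batch_actions(docs, write_batch_size=1000):
    """Group documents into batches; Azure Search accepts 1 to 1,000 actions per batch."""
    write_batch_size = max(1, min(write_batch_size, 1000))
    return [docs[i:i + write_batch_size] for i in range(0, len(docs), write_batch_size)]

existing = {"id": "1", "name": "widget", "price": 9.99}
new_doc = {"id": "1", "name": "gadget", "price": None}
merge_result = upsert(existing, new_doc, "Merge")    # price 9.99 is preserved
upload_result = upsert(existing, new_doc, "Upload")  # price becomes None
batches = batch_actions(list(range(2500)))           # batches of 1000, 1000, 500
```

Note how Merge keeps the existing price because the new document carries a null for that column, while Upload discards it.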
Azure Search linked service: { "name": "AzureSearchLinkedService", "properties": { "type": "AzureSearch", "typeProperties": { "url": "https://<service>.search.windows.net", "key": "<AdminKey>" } } } SQL Server linked service { "Name": "SqlServerLinkedService", "properties": { "type": "OnPremisesSqlServer", "typeProperties": { "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;", "gatewayName": "<gatewayname>" } } } SQL Server input dataset The sample assumes you have created a table “MyTable” in SQL Server and it contains a column called “timestampcolumn” for time series data. You can query over multiple tables within the same database using a single dataset, but a single table must be used for the dataset's tableName typeProperty. Setting “external”: ”true” informs Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. { "name": "SqlServerDataset", "properties": { "type": "SqlServerTable", "linkedServiceName": "SqlServerLinkedService", "typeProperties": { "tableName": "MyTable" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure Search output dataset: The sample copies data to an Azure Search index named products. Data Factory does not create the index. To test the sample, create an index with this name. Create the Azure Search index with the same number of columns as in the input dataset. New entries are added to the Azure Search index every hour. 
{ "name": "AzureSearchIndexDataset", "properties": { "type": "AzureSearchIndex", "linkedServiceName": "AzureSearchLinkedService", "typeProperties" : { "indexName": "products", }, "availability": { "frequency": "Minute", "interval": 15 } } } Copy activity in a pipeline with SQL source and Azure Search Index sink: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlSource and sink type is set to AzureSearchIndexSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to copy. { "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline for copy activity", "activities":[ { "name": "SqlServertoAzureSearchIndex", "description": "copy activity", "type": "Copy", "inputs": [ { "name": " SqlServerInput" } ], "outputs": [ { "name": "AzureSearchIndexDataset" } ], "typeProperties": { "source": { "type": "SqlSource", "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MMdd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" }, "sink": { "type": "AzureSearchIndexSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } If you are copying data from a cloud data store into Azure Search, executionLocation property is required. The following JSON snippet shows the change needed under Copy Activity typeProperties as an example. Check Copy data between cloud data stores section for supported values and more details. 
"typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "AzureSearchIndexSink" }, "executionLocation": "West US" } Copy from a cloud source If you are copying data from a cloud data store into Azure Search, executionLocation property is required. The following JSON snippet shows the change needed under Copy Activity typeProperties as an example. Check Copy data between cloud data stores section for supported values and more details. "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "AzureSearchIndexSink" }, "executionLocation": "West US" } You can also map columns from source dataset to columns from sink dataset in the copy activity definition. For details, see Mapping dataset columns in Azure Data Factory. Performance and tuning See the Copy Activity performance and tuning guide to learn about key factors that impact performance of data movement (Copy Activity) and various ways to optimize it. Next steps See the following articles: Copy Activity tutorial for step-by-step instructions for creating a pipeline with a Copy Activity. Copy data to and from Azure SQL Database using Azure Data Factory 6/9/2017 • 17 min to read • Edit Online This article explains how to use the Copy Activity in Azure Data Factory to move data to and from Azure SQL Database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. 
Supported scenarios You can copy data from Azure SQL Database to the following data stores: CATEGORY DATA STORE Azure Azure Blob storage Azure Data Lake Store Azure Cosmos DB (DocumentDB API) Azure SQL Database Azure SQL Data Warehouse Azure Search Index Azure Table storage Databases SQL Server Oracle File File system You can copy data from the following data stores to Azure SQL Database: CATEGORY DATA STORE Azure Azure Blob storage Azure Cosmos DB (DocumentDB API) Azure Data Lake Store Azure SQL Database Azure SQL Data Warehouse Azure Table storage Databases Amazon Redshift DB2 MySQL Oracle PostgreSQL SAP Business Warehouse SAP HANA SQL Server Sybase Teradata NoSQL Cassandra MongoDB CATEGORY DATA STORE File Amazon S3 File System FTP HDFS SFTP Others Generic HTTP Generic OData Generic ODBC Salesforce Web Table (table from HTML) GE Historian Supported authentication type Azure SQL Database connector supports basic authentication. Getting started You can create a pipeline with a copy activity that moves data to/from an Azure SQL Database by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create a data factory. A data factory may contain one or more pipelines. 2. Create linked services to link input and output data stores to your data factory. 
For example, if you are copying data from an Azure blob storage to an Azure SQL database, you create two linked services to link your Azure storage account and Azure SQL database to your data factory. For linked service properties that are specific to Azure SQL Database, see linked service properties section. 3. Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the SQL table in the Azure SQL database that holds the data copied from the blob storage. For dataset properties that are specific to Azure Data Lake Store, see dataset properties section. 4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity. Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure SQL Database, see copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure SQL Database, see JSON examples section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure SQL Database: Linked service properties An Azure SQL linked service links an Azure SQL database to your data factory. 
The following table provides descriptions for JSON elements specific to the Azure SQL linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: AzureSqlDatabase Yes connectionString Specify information needed to connect to the Azure SQL Database instance for the connectionString property. Only basic authentication is supported. Yes IMPORTANT Configure the Azure SQL Database firewall on the database server to allow Azure services to access the server. Additionally, if you are copying data to Azure SQL Database from outside Azure, including from on-premises data sources with the data factory gateway, configure an appropriate IP address range for the machine that is sending data to Azure SQL Database. Dataset properties To specify a dataset to represent input or output data in an Azure SQL database, you set the type property of the dataset to: AzureSqlTable. Set the linkedServiceName property of the dataset to the name of the Azure SQL linked service. For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for the dataset of type AzureSqlTable has the following properties: PROPERTY DESCRIPTION REQUIRED tableName Name of the table or view in the Azure SQL Database instance that the linked service refers to. Yes Copy activity properties For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities. NOTE The Copy Activity takes only one input and produces only one output.
In contrast, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. If you are moving data from an Azure SQL database, you set the source type in the copy activity to SqlSource. Similarly, if you are moving data to an Azure SQL database, you set the sink type in the copy activity to SqlSink. This section provides a list of properties supported by SqlSource and SqlSink. SqlSource In copy activity, when the source is of type SqlSource, the following properties are available in the typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED sqlReaderQuery Use the custom query to read data. SQL query string. Example: select * from MyTable . No sqlReaderStoredProcedureName Name of the stored procedure that reads data from the source table. Name of the stored procedure. The last SQL statement must be a SELECT statement in the stored procedure. No storedProcedureParameters Parameters for the stored procedure. Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. No If the sqlReaderQuery is specified for the SqlSource, the Copy Activity runs this query against the Azure SQL Database source to get the data. Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters). If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query ( select column1, column2 from mytable ) to run against the Azure SQL Database. If the dataset definition does not have the structure, all columns are selected from the table. NOTE When you use sqlReaderStoredProcedureName, you still need to specify a value for the tableName property in the dataset JSON.
However, no validation is performed against this table. SqlSource example "source": { "type": "SqlSource", "sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters", "storedProcedureParameters": { "stringData": { "value": "str3" }, "identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int"} } } The stored procedure definition: CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters ( @stringData varchar(20), @identifier int ) AS SET NOCOUNT ON; BEGIN select * from dbo.UnitTestSrcTable where dbo.UnitTestSrcTable.stringData != @stringData and dbo.UnitTestSrcTable.identifier != @identifier END GO SqlSink SqlSink supports the following properties: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED writeBatchTimeout Wait time for the batch insert operation to complete before it times out. timespan. Example: "00:30:00" (30 minutes). No writeBatchSize Inserts data into the SQL table when the buffer size reaches writeBatchSize. Integer (number of rows) No (default: 10000) sqlWriterCleanupScript Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. For more information, see repeatable copy. A query statement. No sliceIdentifierColumnName Specify a column name for Copy Activity to fill with an auto-generated slice identifier, which is used to clean up data of a specific slice when rerun. For more information, see repeatable copy. Column name of a column with data type of binary(32). No sqlWriterStoredProcedureName Name of the stored procedure that upserts (updates/inserts) data into the target table. Name of the stored procedure. No storedProcedureParameters Parameters for the stored procedure. Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. No sqlWriterTableType Specify a table type name to be used in the stored procedure.
Copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data being copied with existing data. A table type name. No SqlSink example "sink": { "type": "SqlSink", "writeBatchSize": 1000000, "writeBatchTimeout": "00:05:00", "sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters", "sqlWriterTableType": "CopyTestTableType", "storedProcedureParameters": { "identifier": { "value": "1", "type": "Int" }, "stringData": { "value": "str1" }, "decimalData": { "value": "1", "type": "Decimal" } } } JSON examples for copying data to and from SQL Database The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data to and from Azure SQL Database and Azure Blob Storage. However, data can be copied directly from any of the sources to any of the sinks stated here using the Copy Activity in Azure Data Factory. Example: Copy data from Azure SQL Database to Azure Blob The sample defines the following Data Factory entities: 1. A linked service of type AzureSqlDatabase. 2. A linked service of type AzureStorage. 3. An input dataset of type AzureSqlTable. 4. An output dataset of type AzureBlob. 5. A pipeline with a Copy activity that uses SqlSource and BlobSink. The sample copies time-series data (hourly, daily, etc.) from a table in an Azure SQL database to a blob every hour. The JSON properties used in these samples are described in sections following the samples.
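The SqlSource behavior described earlier (an explicit sqlReaderQuery wins, then a stored procedure, then a query built from the dataset's structure section, else all columns) can be sketched in Python. This is an illustrative model of the precedence rules, not Data Factory's implementation; build_source_query is a hypothetical helper:

```python
def build_source_query(type_properties, dataset):
    """Model how a SqlSource decides what to run against the database."""
    if "sqlReaderQuery" in type_properties:
        return type_properties["sqlReaderQuery"]
    if "sqlReaderStoredProcedureName" in type_properties:
        return "EXEC " + type_properties["sqlReaderStoredProcedureName"]
    # Fall back to the columns listed in the dataset's structure section.
    columns = [c["name"] for c in dataset.get("structure", [])]
    table = dataset["typeProperties"]["tableName"]
    if columns:
        return "select %s from %s" % (", ".join(columns), table)
    return "select * from %s" % table  # no structure: select all columns

dataset = {"typeProperties": {"tableName": "MyTable"},
           "structure": [{"name": "column1"}, {"name": "column2"}]}
query = build_source_query({}, dataset)  # "select column1, column2 from MyTable"
```

Passing a typeProperties dict that contains sqlReaderQuery returns that query unchanged, which matches the precedence the article describes.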
Azure SQL Database linked service: { "name": "AzureSqlLinkedService", "properties": { "type": "AzureSqlDatabase", "typeProperties": { "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" } } } See the Azure SQL Linked Service section for the list of properties supported by this linked service. Azure Blob storage linked service: { "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } } See the Azure Blob article for the list of properties supported by this linked service. Azure SQL input dataset: The sample assumes you have created a table “MyTable” in Azure SQL and it contains a column called “timestampcolumn” for time series data. Setting “external”: ”true” informs the Azure Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. { "name": "AzureSqlInput", "properties": { "type": "AzureSqlTable", "linkedServiceName": "AzureSqlLinkedService", "typeProperties": { "tableName": "MyTable" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } See the Azure SQL dataset type properties section for the list of properties supported by this dataset type. Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. 
{ "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ], "format": { "type": "TextFormat", "columnDelimiter": "\t", "rowDelimiter": "\n" } }, "availability": { "frequency": "Hour", "interval": 1 } } } See the Azure Blob dataset type properties section for the list of properties supported by this dataset type. A copy activity in a pipeline with SQL source and Blob sink: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlSource and sink type is set to BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to copy. 
{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline for copy activity", "activities":[ { "name": "AzureSQLtoBlob", "description": "copy activity", "type": "Copy", "inputs": [ { "name": "AzureSQLInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "typeProperties": { "source": { "type": "SqlSource", "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyyMM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" }, "sink": { "type": "BlobSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } In the example, sqlReaderQuery is specified for the SqlSource. The Copy Activity runs this query against the Azure SQL Database source to get the data. Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters). If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query to run against the Azure SQL Database. For example: select column1, column2 from mytable . If the dataset definition does not have the structure, all columns are selected from the table. See the Sql Source section and BlobSink for the list of properties supported by SqlSource and BlobSink. Example: Copy data from Azure Blob to Azure SQL Database The sample defines the following Data Factory entities: 1. 2. 3. 4. 5. A linked service of type AzureSqlDatabase. A linked service of type AzureStorage. An input dataset of type AzureBlob. An output dataset of type AzureSqlTable. A pipeline with Copy activity that uses BlobSource and SqlSink. The sample copies time-series data (hourly, daily, etc.) 
from Azure blob to a table in Azure SQL database every hour. The JSON properties used in these samples are described in sections following the samples. Azure SQL linked service: { "name": "AzureSqlLinkedService", "properties": { "type": "AzureSqlDatabase", "typeProperties": { "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" } } } See the Azure SQL Linked Service section for the list of properties supported by this linked service. Azure Blob storage linked service: { "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } } See the Azure Blob article for the list of properties supported by this linked service. Azure Blob input dataset: Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, and day part of the start time and file name uses the hour part of the start time. “external”: “true” setting informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory. 
{ "name": "AzureBlobInput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/", "fileName": "{Hour}.csv", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ], "format": { "type": "TextFormat", "columnDelimiter": ",", "rowDelimiter": "\n" } }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } See the Azure Blob dataset type properties section for the list of properties supported by this dataset type. Azure SQL Database output dataset: The sample copies data to a table named “MyTable” in Azure SQL. Create the table in Azure SQL with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour. { "name": "AzureSqlOutput", "properties": { "type": "AzureSqlTable", "linkedServiceName": "AzureSqlLinkedService", "typeProperties": { "tableName": "MyOutputTable" }, "availability": { "frequency": "Hour", "interval": 1 } } } See the Azure SQL dataset type properties section for the list of properties supported by this dataset type. A copy activity in a pipeline with Blob source and SQL sink: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set to SqlSink. 
{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline with copy activity", "activities":[ { "name": "AzureBlobtoSQL", "description": "Copy Activity", "type": "Copy", "inputs": [ { "name": "AzureBlobInput" } ], "outputs": [ { "name": "AzureSqlOutput" } ], "typeProperties": { "source": { "type": "BlobSource", "blobColumnSeparators": "," }, "sink": { "type": "SqlSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } See the Sql Sink section and BlobSource for the list of properties supported by SqlSink and BlobSource. Identity columns in the target database This section provides an example for copying data from a source table without an identity column to a destination table with an identity column. Source table: create table dbo.SourceTbl ( name varchar(100), age int ) Destination table: create table dbo.TargetTbl ( identifier int identity(1,1), name varchar(100), age int ) Notice that the target table has an identity column. Source dataset JSON definition { "name": "SampleSource", "properties": { "type": " SqlServerTable", "linkedServiceName": "TestIdentitySQL", "typeProperties": { "tableName": "SourceTbl" }, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": {} } } Destination dataset JSON definition { "name": "SampleTarget", "properties": { "structure": [ { "name": "name" }, { "name": "age" } ], "type": "AzureSqlTable", "linkedServiceName": "TestIdentitySQLSource", "typeProperties": { "tableName": "TargetTbl" }, "availability": { "frequency": "Hour", "interval": 1 }, "external": false, "policy": {} } } Notice that as your source and target table have different schema (target has an additional column with identity). 
In this scenario, you need to specify structure property in the target dataset definition, which doesn’t include the identity column. Invoke stored procedure from SQL sink For an example of invoking a stored procedure from SQL sink in a copy activity of a pipeline, see Invoke stored procedure for SQL sink in copy activity article. Type mapping for Azure SQL Database As mentioned in the data movement activities article Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: 1. Convert from native source types to .NET type 2. Convert from .NET type to native sink type When moving data to and from Azure SQL Database, the following mappings are used from SQL type to .NET type and vice versa. The mapping is same as the SQL Server Data Type Mapping for ADO.NET. SQL SERVER DATABASE ENGINE TYPE .NET FRAMEWORK TYPE bigint Int64 binary Byte[] bit Boolean char String, Char[] date DateTime Datetime DateTime datetime2 DateTime Datetimeoffset DateTimeOffset Decimal Decimal FILESTREAM attribute (varbinary(max)) Byte[] Float Double image Byte[] int Int32 money Decimal nchar String, Char[] ntext String, Char[] numeric Decimal nvarchar String, Char[] real Single rowversion Byte[] SQL SERVER DATABASE ENGINE TYPE .NET FRAMEWORK TYPE smalldatetime DateTime smallint Int16 smallmoney Decimal sql_variant Object * text String, Char[] time TimeSpan timestamp Byte[] tinyint Byte uniqueidentifier Guid varbinary Byte[] varchar String, Char[] xml Xml Map source to sink columns To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in Azure Data Factory. Repeatable copy When copying data to SQL Server Database, the copy activity appends data to the sink table by default. To perform an UPSERT instead, See Repeatable write to SqlSink article. When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. 
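One way to keep relational reads repeatable is to bound every read to its slice window, which is exactly what the $$Text.Format(... WindowStart, WindowEnd) queries in this article do. A minimal Python sketch of the query such an expression produces (not Data Factory code; the table and column names come from the sample):

```python
from datetime import datetime, timedelta

def slice_query(window_start, window_end):
    # Bound the read to one slice window, mirroring the
    # $$Text.Format('... >= {0:yyyy-MM-dd HH:mm} AND ... < {1:yyyy-MM-dd HH:mm}',
    # WindowStart, WindowEnd) expressions used in the pipeline samples.
    fmt = "%Y-%m-%d %H:%M"
    return ("select * from MyTable where timestampcolumn >= '{0}' "
            "AND timestampcolumn < '{1}'").format(
                window_start.strftime(fmt), window_end.strftime(fmt))

start = datetime(2015, 5, 1, 0, 0)
query = slice_query(start, start + timedelta(hours=1))
# The same slice always yields the same WHERE clause, so reruns read the same rows.
```

Because the bounds come only from the slice window, rerunning a slice produces an identical query, which is the repeatable-read property the text describes.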
In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. Either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.

Performance and Tuning

See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Copy data to and from Azure SQL Data Warehouse using Azure Data Factory

6/9/2017 • 25 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure SQL Data Warehouse. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.

TIP To achieve the best performance, use PolyBase to load data into Azure SQL Data Warehouse. The Use PolyBase to load data into Azure SQL Data Warehouse section has details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
Supported scenarios

You can copy data from Azure SQL Data Warehouse to the following data stores:

| Category | Data stores |
| --- | --- |
| Azure | Azure Blob storage, Azure Data Lake Store, Azure Cosmos DB (DocumentDB API), Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage |
| Databases | SQL Server, Oracle |
| File | File system |

You can copy data from the following data stores to Azure SQL Data Warehouse:

| Category | Data stores |
| --- | --- |
| Azure | Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage |
| Databases | Amazon Redshift, DB2, MySQL, Oracle, PostgreSQL, SAP Business Warehouse, SAP HANA, SQL Server, Sybase, Teradata |
| NoSQL | Cassandra, MongoDB |
| File | Amazon S3, File System, FTP, HDFS, SFTP |
| Others | Generic HTTP, Generic OData, Generic ODBC, Salesforce, Web Table (table from HTML), GE Historian |

TIP When copying data from SQL Server or Azure SQL Database to Azure SQL Data Warehouse, if the table does not exist in the destination store, Data Factory can automatically create the table in SQL Data Warehouse by using the schema of the table in the source data store. See Auto table creation for details.

Supported authentication type

The Azure SQL Data Warehouse connector supports basic authentication.

Getting started

You can create a pipeline with a copy activity that moves data to/from an Azure SQL Data Warehouse by using different tools/APIs. The easiest way to create a pipeline that copies data to/from Azure SQL Data Warehouse is to use the Copy data wizard. See Tutorial: Load data into SQL Data Warehouse with Data Factory for a quick walkthrough of creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are copying data from Azure blob storage to an Azure SQL data warehouse, you create two linked services to link your Azure storage account and Azure SQL data warehouse to your data factory. For linked service properties that are specific to Azure SQL Data Warehouse, see the linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data, and another dataset to specify the table in the Azure SQL data warehouse that holds the data copied from the blob storage. For dataset properties that are specific to Azure SQL Data Warehouse, see the dataset properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlDWSink as a sink for the copy activity. Similarly, if you are copying from Azure SQL Data Warehouse to Azure Blob Storage, you use SqlDWSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure SQL Data Warehouse, see the copy activity properties section.

For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format.
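The entity relationships in the steps above can be sketched as a minimal JSON skeleton: the pipeline's copy activity references the datasets by name and pairs a BlobSource with a SqlDWSink. All names here are illustrative placeholders, not required values:

```python
import json

# Hypothetical minimal pipeline skeleton following steps 2-4 above; real
# definitions also carry the typeProperties shown later in this article.
pipeline = {
    "name": "CopyBlobToSqlDW",
    "properties": {
        "activities": [{
            "name": "BlobToSqlDW",
            "type": "Copy",
            "inputs": [{"name": "AzureBlobInput"}],      # dataset from step 3
            "outputs": [{"name": "AzureSqlDWOutput"}],   # dataset from step 3
            "typeProperties": {
                "source": {"type": "BlobSource"},        # pairing from step 4
                "sink": {"type": "SqlDWSink"}
            }
        }]
    }
}
print(json.dumps(pipeline["properties"]["activities"][0]["typeProperties"], indent=2))
```

Reversing the direction only swaps the pairing: SqlDWSource as the source and BlobSink as the sink, with the dataset references exchanged accordingly.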
For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure SQL Data Warehouse, see the JSON examples section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure SQL Data Warehouse:

Linked service properties

The following table describes the JSON elements specific to the Azure SQL Data Warehouse linked service.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property must be set to: AzureSqlDW | Yes |
| connectionString | Specify the information needed to connect to the Azure SQL Data Warehouse instance for the connectionString property. Only basic authentication is supported. | Yes |

IMPORTANT Configure the Azure SQL Database Firewall and the database server to allow Azure Services to access the server. Additionally, if you are copying data to Azure SQL Data Warehouse from outside Azure, including from on-premises data sources with the data factory gateway, configure an appropriate IP address range for the machine that is sending data to Azure SQL Data Warehouse.

Dataset properties

For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type AzureSqlDWTable has the following properties:

| Property | Description | Required |
| --- | --- | --- |
| tableName | Name of the table or view in the Azure SQL Data Warehouse database that the linked service refers to. | Yes |

Copy activity properties

For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities.
NOTE The Copy Activity takes only one input and produces only one output.

Properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.

SqlDWSource

When the source is of type SqlDWSource, the following properties are available in the typeProperties section:

| Property | Description | Allowed values | Required |
| --- | --- | --- | --- |
| sqlReaderQuery | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No |
| sqlReaderStoredProcedureName | Name of the stored procedure that reads data from the source table. | Name of the stored procedure. The last SQL statement must be a SELECT statement in the stored procedure. | No |
| storedProcedureParameters | Parameters for the stored procedure. | Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No |

If sqlReaderQuery is specified for the SqlDWSource, the Copy Activity runs this query against the Azure SQL Data Warehouse source to get the data. Alternatively, you can specify a stored procedure by specifying sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters). If you specify neither sqlReaderQuery nor sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query to run against the Azure SQL Data Warehouse. Example: select column1, column2 from mytable. If the dataset definition does not have the structure, all columns are selected from the table.
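The fallback behavior just described (no query and no stored procedure given, so the dataset's structure columns are projected) can be sketched as follows. The helper is hypothetical, not part of any Data Factory SDK:

```python
def build_reader_query(table_name, structure=None):
    # Mimic the documented fallback: project the structure columns if
    # present, otherwise select all columns from the table.
    if structure:
        cols = ", ".join(col["name"] for col in structure)
        return "select {0} from {1}".format(cols, table_name)
    return "select * from {0}".format(table_name)

with_structure = build_reader_query(
    "mytable", [{"name": "column1"}, {"name": "column2"}])
# → "select column1, column2 from mytable"
without_structure = build_reader_query("mytable")
# → "select * from mytable"
```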
SqlDWSource example "source": { "type": "SqlDWSource", "sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters", "storedProcedureParameters": { "stringData": { "value": "str3" }, "identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int"} } } The stored procedure definition: CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters ( @stringData varchar(20), @identifier int ) AS SET NOCOUNT ON; BEGIN select * from dbo.UnitTestSrcTable where dbo.UnitTestSrcTable.stringData != stringData and dbo.UnitTestSrcTable.identifier != identifier END GO SqlDWSink SqlDWSink supports the following properties: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED sqlWriterCleanupScript Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. For details, see repeatability section. A query statement. No allowPolyBase Indicates whether to use PolyBase (when applicable) instead of BULKINSERT mechanism. True False (default) No Using PolyBase is the recommended way to load data into SQL Data Warehouse. See Use PolyBase to load data into Azure SQL Data Warehouse section for constraints and details. polyBaseSettings A group of properties that can be specified when the allowPolybase property is set to true. No PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED rejectValue Specifies the number or percentage of rows that can be rejected before the query fails. 0 (default), 1, 2, … No Learn more about the PolyBase’s reject options in the Arguments section of CREATE EXTERNAL TABLE (Transact-SQL) topic. rejectType Specifies whether the rejectValue option is specified as a literal value or a percentage. Value (default), Percentage No rejectSampleValue Determines the number of rows to retrieve before the PolyBase recalculates the percentage of rejected rows. 1, 2, … Yes, if rejectType is percentage useTypeDefault Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file. 
True, False (default) No Learn more about this property from the Arguments section in CREATE EXTERNAL FILE FORMAT (Transact-SQL). writeBatchSize Inserts data into the SQL table when the buffer size reaches writeBatchSize Integer (number of rows) No (default: 10000) writeBatchTimeout Wait time for the batch insert operation to complete before it times out. timespan No Example: “00:30:00” (30 minutes). SqlDWSink example "sink": { "type": "SqlDWSink", "allowPolyBase": true } Use PolyBase to load data into Azure SQL Data Warehouse Using PolyBase is an efficient way of loading large amount of data into Azure SQL Data Warehouse with high throughput. You can see a large gain in the throughput by using PolyBase instead of the default BULKINSERT mechanism. See copy performance reference number with detailed comparison. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory. If your source data is in Azure Blob or Azure Data Lake Store, and the format is compatible with PolyBase, you can directly copy to Azure SQL Data Warehouse using PolyBase. See Direct copy using PolyBase with details. If your source data store and format is not originally supported by PolyBase, you can use the Staged Copy using PolyBase feature instead. It also provides you better throughput by automatically converting the data into PolyBase-compatible format and storing the data in Azure Blob storage. It then loads data into SQL Data Warehouse. Set the allowPolyBase property to true as shown in the following example for Azure Data Factory to use PolyBase to copy data into Azure SQL Data Warehouse. When you set allowPolyBase to true, you can specify PolyBase specific properties using the polyBaseSettings property group. see the SqlDWSink section for details about properties that you can use with polyBaseSettings. 
"sink": { "type": "SqlDWSink", "allowPolyBase": true, "polyBaseSettings": { "rejectType": "percentage", "rejectValue": 10.0, "rejectSampleValue": 100, "useTypeDefault": true } } Direct copy using PolyBase SQL Data Warehouse PolyBase directly support Azure Blob and Azure Data Lake Store (using service principal) as source and with specific file format requirements. If your source data meets the criteria described in this section, you can directly copy from source data store to Azure SQL Data Warehouse using PolyBase. Otherwise, you can use Staged Copy using PolyBase. TIP To copy data from Data Lake Store to SQL Data Warehouse efficiently, learn more from Azure Data Factory makes it even easier and convenient to uncover insights from data when using Data Lake Store with SQL Data Warehouse. If the requirements are not met, Azure Data Factory checks the settings and automatically falls back to the BULKINSERT mechanism for the data movement. 1. Source linked service is of type: AzureStorage or AzureDataLakeStore with service principal authentication. 2. The input dataset is of type: AzureBlob or AzureDataLakeStore, and the format type under type properties is OrcFormat, or TextFormat with the following configurations: a. b. c. d. e. must be \n. nullValue is set to empty string (""), or treatEmptyAsNull is set to true. encodingName is set to utf-8, which is default value. escapeChar , quoteChar , firstRowAsHeader , and skipLineCount are not specified. compression can be no compression, GZip, or Deflate. rowDelimiter "typeProperties": { "folderPath": "<blobpath>", "format": { "type": "TextFormat", "columnDelimiter": "<any delimiter>", "rowDelimiter": "\n", "nullValue": "", "encodingName": "utf-8" }, "compression": { "type": "GZip", "level": "Optimal" } }, 3. There is no skipHeaderLineCount setting under BlobSource or AzureDataLakeStore for the Copy activity in the pipeline. 4. 
There is no sliceIdentifierColumnName setting under SqlDWSink for the Copy activity in the pipeline. (PolyBase guarantees that all data is updated or nothing is updated in a single run. To achieve repeatability, you could use sqlWriterCleanupScript ). 5. There is no columnMapping being used in the associated in Copy activity. Staged Copy using PolyBase When your source data doesn’t meet the criteria introduced in the previous section, you can enable copying data via an interim staging Azure Blob Storage (cannot be Premium Storage). In this case, Azure Data Factory automatically performs transformations on the data to meet data format requirements of PolyBase, then use PolyBase to load data into SQL Data Warehouse, and at last clean-up your temp data from the Blob storage. See Staged Copy for details on how copying data via a staging Azure Blob works in general. NOTE When copying data from an on-prem data store into Azure SQL Data Warehouse using PolyBase and staging, if your Data Management Gateway version is below 2.4, JRE (Java Runtime Environment) is required on your gateway machine that is used to transform your source data into proper format. Suggest you upgrade your gateway to the latest to avoid such dependency. 
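The TextFormat criteria from the direct-copy section above can be summarized as a quick eligibility check. This is a sketch under simplified assumptions (it inspects only the dataset's format dict), not Data Factory's actual fallback logic:

```python
def ok_for_direct_polybase(fmt):
    # `fmt` is the dataset's "format" dict. OrcFormat always qualifies;
    # TextFormat must satisfy the criteria listed above.
    if fmt.get("type") != "TextFormat":
        return fmt.get("type") == "OrcFormat"
    forbidden = ("escapeChar", "quoteChar", "firstRowAsHeader", "skipLineCount")
    return (fmt.get("rowDelimiter") == "\n"
            and (fmt.get("nullValue") == "" or fmt.get("treatEmptyAsNull") is True)
            and fmt.get("encodingName", "utf-8") == "utf-8"
            and not any(key in fmt for key in forbidden))

eligible = ok_for_direct_polybase(
    {"type": "TextFormat", "rowDelimiter": "\n",
     "nullValue": "", "encodingName": "utf-8"})       # True
ineligible = ok_for_direct_polybase(
    {"type": "TextFormat", "rowDelimiter": "\n",
     "nullValue": "", "quoteChar": "\""})             # False: quoteChar set
```

A dataset that fails the check would take the staged-copy path (or the BULKINSERT fallback) described in this section.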
To use this feature, create an Azure Storage linked service that refers to the Azure storage account that has the interim blob storage, then specify the enableStaging and stagingSettings properties for the Copy Activity as shown in the following code:

```json
"activities": [
  {
    "name": "Sample copy activity from SQL Server to SQL Data Warehouse via PolyBase",
    "type": "Copy",
    "inputs": [ { "name": "OnpremisesSQLServerInput" } ],
    "outputs": [ { "name": "AzureSQLDWOutput" } ],
    "typeProperties": {
      "source": {
        "type": "SqlSource"
      },
      "sink": {
        "type": "SqlDWSink",
        "allowPolyBase": true
      },
      "enableStaging": true,
      "stagingSettings": {
        "linkedServiceName": "MyStagingBlob"
      }
    }
  }
]
```

Best practices when using PolyBase

The following sections provide best practices in addition to the ones mentioned in Best practices for Azure SQL Data Warehouse.

Required database permission

To use PolyBase, the user that loads data into SQL Data Warehouse must have the "CONTROL" permission on the target database. One way to achieve that is to add the user as a member of the "db_owner" role. Learn how to do that by following this section.

Row size and data type limitation

PolyBase loads are limited to rows smaller than 1 MB and cannot load to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX). Refer to here. If you have source data with rows larger than 1 MB, you may want to split the source table vertically into several small ones, each with a largest row size that does not exceed the limit. The smaller tables can then be loaded using PolyBase and merged together in Azure SQL Data Warehouse.

SQL Data Warehouse resource class

To achieve the best possible throughput, consider assigning a larger resource class to the user that loads data into SQL Data Warehouse via PolyBase. Learn how to do that by following the Change a user resource class example.
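The 1 MB row-size limitation above can be screened for before loading. A rough Python sketch follows; sizing each value as UTF-8 text is a simplification for illustration, not PolyBase's exact accounting:

```python
POLYBASE_MAX_ROW_BYTES = 1024 * 1024  # rows must stay under the 1 MB limit above

def oversized_rows(rows):
    # Return the indexes of rows (dicts of column values) whose values,
    # rendered as UTF-8 text, meet or exceed the per-row limit.
    too_big = []
    for i, row in enumerate(rows):
        size = sum(len(str(v).encode("utf-8")) for v in row.values())
        if size >= POLYBASE_MAX_ROW_BYTES:
            too_big.append(i)
    return too_big

rows = [{"name": "ok", "blob": "x" * 10},
        {"name": "huge", "blob": "x" * (2 * 1024 * 1024)}]
flagged = oversized_rows(rows)  # only the second row is flagged
```

Rows flagged by such a check are candidates for the vertical-split approach described above, where wide tables are divided so each piece stays under the limit.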
tableName in Azure SQL Data Warehouse

The following table provides examples of how to specify the tableName property in the dataset JSON for various combinations of schema and table name.

| DB schema | Table name | tableName JSON property |
| --- | --- | --- |
| dbo | MyTable | MyTable or dbo.MyTable or [dbo].[MyTable] |
| dbo1 | MyTable | dbo1.MyTable or [dbo1].[MyTable] |
| dbo | My.Table | [My.Table] or [dbo].[My.Table] |
| dbo1 | My.Table | [dbo1].[My.Table] |

If you see the following error, it could be an issue with the value you specified for the tableName property. See the table for the correct way to specify values for the tableName JSON property.

```
Type=System.Data.SqlClient.SqlException,Message=Invalid object name 'stg.Account_test'.,Source=.Net SqlClient Data Provider
```

Columns with default values

Currently, the PolyBase feature in Data Factory only accepts the same number of columns as in the target table. Say you have a table with four columns, one of which is defined with a default value. The input data should still contain four columns. Providing a three-column input dataset yields an error similar to the following message:

All columns of the table must be specified in the INSERT BULK statement.

The NULL value is a special form of default value. If a column is nullable, the input data (in the blob) for that column can be empty (but cannot be missing from the input dataset). PolyBase inserts NULL for them in Azure SQL Data Warehouse.

Auto table creation

If you are using the Copy Wizard to copy data from SQL Server or Azure SQL Database to Azure SQL Data Warehouse and the table that corresponds to the source table does not exist in the destination store, Data Factory can automatically create the table in the data warehouse by using the source table schema. Data Factory creates the table in the destination store with the same table name as in the source data store. The data types for columns are chosen based on the following type mapping.
If needed, it performs type conversions to fix any incompatibilities between the source and destination stores. It also uses Round Robin table distribution.

| Source SQL Database column type | Destination SQL DW column type (size limitation) |
| --- | --- |
| Int | Int |
| BigInt | BigInt |
| SmallInt | SmallInt |
| TinyInt | TinyInt |
| Bit | Bit |
| Decimal | Decimal |
| Numeric | Decimal |
| Float | Float |
| Money | Money |
| Real | Real |
| SmallMoney | SmallMoney |
| Binary | Binary |
| Varbinary | Varbinary (up to 8000) |
| Date | Date |
| DateTime | DateTime |
| DateTime2 | DateTime2 |
| Time | Time |
| DateTimeOffset | DateTimeOffset |
| SmallDateTime | SmallDateTime |
| Text | Varchar (up to 8000) |
| NText | NVarChar (up to 4000) |
| Image | VarBinary (up to 8000) |
| UniqueIdentifier | UniqueIdentifier |
| Char | Char |
| NChar | NChar |
| VarChar | VarChar (up to 8000) |
| NVarChar | NVarChar (up to 4000) |
| Xml | Varchar (up to 8000) |

Repeatability during Copy

When copying data to Azure SQL/SQL Server from other data stores, you need to keep repeatability in mind to avoid unintended outcomes. When copying data to an Azure SQL/SQL Server Database, the copy activity appends the data set to the sink table by default. For example, after copying data from a CSV (comma-separated values) file source containing two records to an Azure SQL/SQL Server Database, the table looks like this:

| ID | Product | Quantity | ModifiedDate |
| --- | --- | --- | --- |
| ... | ... | ... | ... |
| 6 | Flat Washer | 3 | 2015-05-01 00:00:00 |
| 7 | Down Tube | 2 | 2015-05-01 00:00:00 |

Suppose you found errors in the source file and updated the quantity of Down Tube from 2 to 4 in the source file. If you rerun the data slice for that period, you'll find two new records appended to the Azure SQL/SQL Server Database. The following assumes none of the columns in the table have a primary key constraint.

| ID | Product | Quantity | ModifiedDate |
| --- | --- | --- | --- |
| ... | ... | ... | ... |
| 6 | Flat Washer | 3 | 2015-05-01 00:00:00 |
| 7 | Down Tube | 2 | 2015-05-01 00:00:00 |
| 6 | Flat Washer | 3 | 2015-05-01 00:00:00 |
| 7 | Down Tube | 4 | 2015-05-01 00:00:00 |

To avoid this, specify UPSERT semantics by using one of the following two mechanisms.

NOTE A slice can be rerun automatically in Azure Data Factory as per the retry policy specified.

Mechanism 1

You can use the sqlWriterCleanupScript property to first perform a cleanup action when a slice is run.

```json
"sink": {
  "type": "SqlSink",
  "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM table WHERE ModifiedDate >= \\'{0:yyyy-MM-dd HH:mm}\\' AND ModifiedDate < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
}
```

The cleanup script is executed first during copy for a given slice, deleting the data from the SQL table corresponding to that slice. The activity subsequently inserts the data into the SQL table. If the slice is now rerun, the quantity is updated as desired.

| ID | Product | Quantity | ModifiedDate |
| --- | --- | --- | --- |
| ... | ... | ... | ... |
| 6 | Flat Washer | 3 | 2015-05-01 00:00:00 |
| 7 | Down Tube | 4 | 2015-05-01 00:00:00 |

Suppose the Flat Washer record is removed from the original CSV. Then rerunning the slice produces the following result:

| ID | Product | Quantity | ModifiedDate |
| --- | --- | --- | --- |
| ... | ... | ... | ... |
| 7 | Down Tube | 4 | 2015-05-01 00:00:00 |

Nothing new had to be done. The copy activity ran the cleanup script to delete the corresponding data for that slice. Then it read the input from the CSV (which then contained only one record) and inserted it into the table.

Mechanism 2

IMPORTANT sliceIdentifierColumnName is not supported for Azure SQL Data Warehouse at this time.

Another mechanism to achieve repeatability is to have a dedicated column (sliceIdentifierColumnName) in the target table. This column is used by Azure Data Factory to ensure the source and destination stay synchronized. This approach works when there is flexibility in changing or defining the destination SQL table schema.
This column is used by Azure Data Factory for repeatability purposes, and in the process Azure Data Factory does not make any schema changes to the table. To use this approach:

1. Define a column of type binary(32) in the destination SQL table. There should be no constraints on this column. Let's name this column 'ColumnForADFuseOnly' for this example.
2. Use it in the copy activity as follows:

```json
"sink": {
  "type": "SqlSink",
  "sliceIdentifierColumnName": "ColumnForADFuseOnly"
}
```

Azure Data Factory populates this column as per its need to ensure the source and destination stay synchronized. The values of this column should not be used outside of this context by the user. Similar to mechanism 1, the Copy Activity automatically first cleans up the data for the given slice from the destination SQL table, and then runs the copy normally to insert the data from source to destination for that slice.

Type mapping for Azure SQL Data Warehouse

As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:

1. Convert from native source types to .NET type
2. Convert from .NET type to native sink type

When moving data to and from Azure SQL Data Warehouse, the following mappings are used from SQL type to .NET type and vice versa. The mapping is the same as the SQL Server Data Type Mapping for ADO.NET.
| SQL Server Database Engine type | .NET Framework type |
| --- | --- |
| bigint | Int64 |
| binary | Byte[] |
| bit | Boolean |
| char | String, Char[] |
| date | DateTime |
| datetime | DateTime |
| datetime2 | DateTime |
| datetimeoffset | DateTimeOffset |
| decimal | Decimal |
| FILESTREAM attribute (varbinary(max)) | Byte[] |
| float | Double |
| image | Byte[] |
| int | Int32 |
| money | Decimal |
| nchar | String, Char[] |
| ntext | String, Char[] |
| numeric | Decimal |
| nvarchar | String, Char[] |
| real | Single |
| rowversion | Byte[] |
| smalldatetime | DateTime |
| smallint | Int16 |
| smallmoney | Decimal |
| sql_variant | Object |
| text | String, Char[] |
| time | TimeSpan |
| timestamp | Byte[] |
| tinyint | Byte |
| uniqueidentifier | Guid |
| varbinary | Byte[] |
| varchar | String, Char[] |
| xml | Xml |

You can also map columns from the source dataset to columns from the sink dataset in the copy activity definition. For details, see Mapping dataset columns in Azure Data Factory.

JSON examples for copying data to and from SQL Data Warehouse

The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data to and from Azure SQL Data Warehouse and Azure Blob Storage. However, data can be copied directly from any of the sources to any of the sinks stated here using the Copy Activity in Azure Data Factory.

Example: Copy data from Azure SQL Data Warehouse to Azure Blob

The sample defines the following Data Factory entities:

1. A linked service of type AzureSqlDW.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureSqlDWTable.
4. An output dataset of type AzureBlob.
5. A pipeline with a Copy Activity that uses SqlDWSource and BlobSink.

The sample copies time-series data (hourly, daily, etc.) from a table in an Azure SQL Data Warehouse database to a blob every hour. The JSON properties used in these samples are described in the sections following the samples.
Azure SQL Data Warehouse linked service: { "name": "AzureSqlDWLinkedService", "properties": { "type": "AzureSqlDW", "typeProperties": { "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" } } } Azure Blob storage linked service: { "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } } Azure SQL Data Warehouse input dataset: The sample assumes you have created a table “MyTable” in Azure SQL Data Warehouse and it contains a column called “timestampcolumn” for time series data. Setting “external”: ”true” informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. { "name": "AzureSqlDWInput", "properties": { "type": "AzureSqlDWTable", "linkedServiceName": "AzureSqlDWLinkedService", "typeProperties": { "tableName": "MyTable" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. 
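The dynamic folder-path evaluation described above can be sketched in Python. This is an emulation of the partitionedBy expansion, not ADF's actual expression engine; the .NET-style format specifiers (yyyy, MM, dd, HH) are mapped to their strftime equivalents:

```python
from datetime import datetime

# Sketch of how the partitionedBy variables expand the folderPath for a slice.
# Year/Month/Day/Hour mirror the dataset JSON; SliceStart is the slice start time.
def expand_folder_path(template, slice_start):
    return template.format(
        Year=slice_start.strftime("%Y"),   # yyyy
        Month=slice_start.strftime("%m"),  # MM
        Day=slice_start.strftime("%d"),    # dd
        Hour=slice_start.strftime("%H"),   # HH
    )

path = expand_folder_path(
    "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
    datetime(2014, 6, 1, 18, 0),
)
# path == "mycontainer/myfolder/yearno=2014/monthno=06/dayno=01/hourno=18"
```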
{ "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ], "format": { "type": "TextFormat", "columnDelimiter": "\t", "rowDelimiter": "\n" } }, "availability": { "frequency": "Hour", "interval": 1 } } }

Copy activity in a pipeline with SqlDWSource and BlobSink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlDWSource and the sink type is set to BlobSink. The SQL query specified for the sqlReaderQuery property selects the data in the past hour to copy.

{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline for copy activity", "activities":[ { "name": "AzureSQLDWtoBlob", "description": "copy activity", "type": "Copy", "inputs": [ { "name": "AzureSqlDWInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "typeProperties": { "source": { "type": "SqlDWSource", "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" }, "sink": { "type": "BlobSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } }

NOTE In the example, sqlReaderQuery is specified for the SqlDWSource.
The Copy Activity runs this query against the Azure SQL Data Warehouse source to get the data. Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters). If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query (select column1, column2 from mytable) to run against the Azure SQL Data Warehouse. If the dataset definition does not have the structure, all columns are selected from the table.

Example: Copy data from Azure Blob to Azure SQL Data Warehouse

The sample defines the following Data Factory entities:

1. A linked service of type AzureSqlDW.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureBlob.
4. An output dataset of type AzureSqlDWTable.
5. A pipeline with a Copy activity that uses BlobSource and SqlDWSink.

The sample copies time-series data (hourly, daily, etc.) from an Azure blob to a table in an Azure SQL Data Warehouse database every hour. The JSON properties used in these samples are described in sections following the samples.

Azure SQL Data Warehouse linked service:

{ "name": "AzureSqlDWLinkedService", "properties": { "type": "AzureSqlDW", "typeProperties": { "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" } } }

Azure Blob storage linked service:

{ "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } }

Azure Blob input dataset:

Data is picked up from a new blob every hour (frequency: hour, interval: 1).
The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The “external”: “true” setting informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory.

{ "name": "AzureBlobInput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", "fileName": "{Hour}.csv", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ], "format": { "type": "TextFormat", "columnDelimiter": ",", "rowDelimiter": "\n" } }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } }

Azure SQL Data Warehouse output dataset:

The sample copies data to a table named “MyOutputTable” in Azure SQL Data Warehouse. Create the table in Azure SQL Data Warehouse with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.
{ "name": "AzureSqlDWOutput", "properties": { "type": "AzureSqlDWTable", "linkedServiceName": "AzureSqlDWLinkedService", "typeProperties": { "tableName": "MyOutputTable" }, "availability": { "frequency": "Hour", "interval": 1 } } }

Copy activity in a pipeline with BlobSource and SqlDWSink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and the sink type is set to SqlDWSink.

{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline with copy activity", "activities":[ { "name": "AzureBlobtoSQLDW", "description": "Copy Activity", "type": "Copy", "inputs": [ { "name": "AzureBlobInput" } ], "outputs": [ { "name": "AzureSqlDWOutput" } ], "typeProperties": { "source": { "type": "BlobSource", "blobColumnSeparators": "," }, "sink": { "type": "SqlDWSink", "allowPolyBase": true } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } }

For a walkthrough, see the Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory and Load data with Azure Data Factory articles in the Azure SQL Data Warehouse documentation.

Performance and Tuning

See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Move data to and from Azure Table using Azure Data Factory

6/5/2017 • 16 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Table Storage. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.
You can copy data from any supported source data store to Azure Table Storage or from Azure Table Storage to any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity, see the Supported data stores table.

Getting started

You can create a pipeline with a copy activity that moves data to/from Azure Table Storage by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from Azure Table Storage, see the JSON examples section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Table Storage:

Linked service properties

There are two types of linked services you can use to link an Azure storage account to an Azure data factory.
They are: the AzureStorage linked service and the AzureStorageSas linked service. The Azure Storage linked service provides the data factory with global access to Azure Storage, whereas the Azure Storage SAS (shared access signature) linked service provides the data factory with restricted/time-bound access to Azure Storage. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services.

Azure Storage linked service

The Azure Storage linked service allows you to link an Azure storage account to an Azure data factory by using the account key, which provides the data factory with global access to Azure Storage. The following table provides a description of the JSON elements specific to the Azure Storage linked service.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AzureStorage | Yes
connectionString | Specify information needed to connect to Azure storage for the connectionString property. | Yes

See the following article for steps to view/copy the account key for an Azure Storage account: View, copy, and regenerate storage access keys.

Example:

{ "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } }

Azure Storage SAS linked service

A shared access signature (SAS) provides delegated access to resources in your storage account. It allows you to grant a client limited permissions to objects in your storage account for a specified period of time and with a specified set of permissions, without having to share your account access keys. The SAS is a URI that encompasses in its query parameters all the information necessary for authenticated access to a storage resource. To access storage resources with the SAS, the client only needs to pass in the SAS to the appropriate constructor or method.
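The "resource URI plus query-parameter token" structure of a SAS URI can be illustrated with a short Python sketch; the account name and token values below are fabricated placeholders, not a real signature:

```python
from urllib.parse import urlsplit, parse_qs

# Sketch: a sasUri is the storage resource URI with the SAS token carried
# entirely in its query string. Splitting the two shows what each part holds.
def split_sas_uri(sas_uri):
    parts = urlsplit(sas_uri)
    resource = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)
    token = parse_qs(parts.query)
    return resource, token

resource, token = split_sas_uri(
    "https://myaccount.blob.core.windows.net/mycontainer?sv=2015-04-05&sig=abc123"
)
# resource -> the blob container URI; token -> {"sv": [...], "sig": [...]}
```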
For detailed information about SAS, see Shared Access Signatures: Understanding the SAS Model. The Azure Storage SAS linked service allows you to link an Azure storage account to an Azure data factory by using a shared access signature (SAS). It provides the data factory with restricted/time-bound access to all/specific resources (blob/container) in the storage. The following table provides a description of the JSON elements specific to the Azure Storage SAS linked service.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AzureStorageSas | Yes
sasUri | Specify the shared access signature URI to the Azure Storage resources such as blob, container, or table. | Yes

Example:

{ "name": "StorageSasLinkedService", "properties": { "type": "AzureStorageSas", "typeProperties": { "sasUri": "<storageUri>?<sasToken>" } } }

When creating a SAS URI, consider the following:

- Azure Data Factory supports only Service SAS, not Account SAS. See Types of Shared Access Signatures for details about these two types.
- Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is used in your data factory.
- Set the expiry time appropriately. Make sure that the access to Azure Storage objects does not expire within the active period of the pipeline.
- Create the URI at the right container/blob or table level based on the need. A SAS URI to an Azure blob allows the Data Factory service to access that particular blob. A SAS URI to an Azure blob container allows the Data Factory service to iterate through blobs in that container. If you need to provide access to more or fewer objects later, or to update the SAS URI, remember to update the linked service with the new URI.

Dataset properties

For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for the dataset of type AzureTable has the following properties.

PROPERTY | DESCRIPTION | REQUIRED
tableName | Name of the table in the Azure Table instance that the linked service refers to. | Yes. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination.

Schema by Data Factory

For schema-free data stores such as Azure Table, the Data Factory service infers the schema in one of the following ways:

1. If you specify the structure of data by using the structure property in the dataset definition, the Data Factory service honors this structure as the schema. In this case, if a row does not contain a value for a column, a null value is provided for it.
2. If you don't specify the structure of data by using the structure property in the dataset definition, Data Factory infers the schema by using the first row in the data. In this case, if the first row does not contain the full schema, some columns are missing from the result of the copy operation.

Therefore, for schema-free data sources, the best practice is to specify the structure of data using the structure property.

Copy activity properties

For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output datasets, and policies are available for all types of activities. Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
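The schema-inference behavior described above (explicit structure versus first-row inference) can be sketched as follows; the rows and column names are illustrative, not ADF internals:

```python
# Sketch of the two inference behaviors: an explicit structure pads missing
# values with None (null), while first-row inference silently drops any
# column absent from the first row.
def project_rows(rows, structure=None):
    if structure is None:
        # Infer the schema from the first row only.
        structure = list(rows[0].keys()) if rows else []
    return [{col: row.get(col) for col in structure} for row in rows]

rows = [{"a": 1}, {"a": 2, "b": 3}]
inferred = project_rows(rows)               # column "b" is lost
explicit = project_rows(rows, ["a", "b"])   # "b" kept, None where missing
```

This is why the article recommends always setting the structure property for schema-free sources.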
AzureTableSource supports the following properties in the typeProperties section:

PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
azureTableSourceQuery | Use the custom query to read data. | Azure table query string. See examples in the next section. | No. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination.
azureTableSourceIgnoreTableNotFound | Indicates whether to swallow the exception when the table does not exist. | TRUE, FALSE | No

azureTableSourceQuery examples

If the Azure Table column is of string type:

"azureTableSourceQuery": "$$Text.Format('PartitionKey ge \\'{0:yyyyMMddHH00_0000}\\' and PartitionKey le \\'{0:yyyyMMddHH00_9999}\\'', SliceStart)"

If the Azure Table column is of datetime type:

"azureTableSourceQuery": "$$Text.Format('DeploymentEndTime gt datetime\\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and DeploymentEndTime le datetime\\'{1:yyyy-MM-ddTHH:mm:ssZ}\\'', SliceStart, SliceEnd)"

AzureTableSink supports the following properties in the typeProperties section:

PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
azureTableDefaultPartitionKeyValue | Default partition key value that can be used by the sink. | A string value. | No
azureTablePartitionKeyName | Specify the name of the column whose values are used as partition keys. If not specified, azureTableDefaultPartitionKeyValue is used as the partition key. | A column name. | No
azureTableRowKeyName | Specify the name of the column whose values are used as the row key. If not specified, a GUID is used for each row. | A column name. | No
azureTableInsertType | The mode to insert data into Azure table. This property controls whether existing rows in the output table with matching partition and row keys have their values replaced or merged. | merge (default), replace | No
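The row-level difference between the two azureTableInsertType modes can be sketched in Python; the entity keys and values here are illustrative, and this models only the semantics, not the Table service itself:

```python
# Sketch of merge vs. replace for an entity with matching partition and row
# keys: merge keeps existing properties not present in the incoming entity,
# while replace discards them.
def upsert(existing, incoming, mode="merge"):
    if mode == "replace":
        return dict(incoming)
    merged = dict(existing)
    merged.update(incoming)
    return merged

existing = {"PartitionKey": "p1", "RowKey": "r1", "Age": 30, "City": "Paris"}
incoming = {"PartitionKey": "p1", "RowKey": "r1", "Age": 31}
merged = upsert(existing, incoming, "merge")      # City preserved
replaced = upsert(existing, incoming, "replace")  # City dropped
```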
To learn about how these settings (merge and replace) work, see the Insert or Merge Entity and Insert or Replace Entity topics. This setting applies at the row level, not the table level, and neither option deletes rows in the output table that do not exist in the input.

writeBatchSize | Inserts data into the Azure table when the writeBatchSize or writeBatchTimeout is hit. | Integer (number of rows) | No (default: 10000)
writeBatchTimeout | Inserts data into the Azure table when the writeBatchSize or writeBatchTimeout is hit. | timespan. Example: "00:20:00" (20 minutes) | No (default: the storage client default timeout value of 90 sec)

azureTablePartitionKeyName

Map a source column to a destination column using the translator JSON property before you can use the destination column as the azureTablePartitionKeyName. In the following example, the source column DivisionID is mapped to the destination column DivisionID:

"translator": { "type": "TabularTranslator", "columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName" }

The DivisionID is specified as the partition key.

"sink": { "type": "AzureTableSink", "azureTablePartitionKeyName": "DivisionID", "writeBatchSize": 100, "writeBatchTimeout": "01:00:00" }

JSON examples

The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data to and from Azure Table Storage and Azure Blob Storage. However, data can be copied directly from any of the sources to any of the supported sinks. For more information, see the section "Supported data stores and formats" in Move data by using Copy Activity.

Example: Copy data from Azure Table to Azure Blob

The following sample shows:

1. A linked service of type AzureStorage (used for both table & blob).
2. An input dataset of type AzureTable.
3. An output dataset of type AzureBlob.
4. A pipeline with a Copy activity that uses AzureTableSource and BlobSink.
The sample copies data belonging to the default partition in an Azure Table to a blob every hour. The JSON properties used in these samples are described in sections following the samples.

Azure storage linked service:

{ "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } }

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key, and for the latter one, you specify the shared access signature (SAS) URI. See the Linked service properties section for details.

Azure Table input dataset:

The sample assumes you have created a table “MyTable” in Azure Table. Setting “external”: ”true” informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.

{ "name": "AzureTableInput", "properties": { "type": "AzureTable", "linkedServiceName": "StorageLinkedService", "typeProperties": { "tableName": "MyTable" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } }

Azure Blob output dataset:

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hours parts of the start time.
{ "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ], "format": { "type": "TextFormat", "columnDelimiter": "\t", "rowDelimiter": "\n" } }, "availability": { "frequency": "Hour", "interval": 1 } } }

Copy activity in a pipeline with AzureTableSource and BlobSink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to AzureTableSource and the sink type is set to BlobSink. The Azure Table query specified with the azureTableSourceQuery property selects the data from the default partition every hour to copy.

{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline for copy activity", "activities":[ { "name": "AzureTabletoBlob", "description": "copy activity", "type": "Copy", "inputs": [ { "name": "AzureTableInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "typeProperties": { "source": { "type": "AzureTableSource", "AzureTableSourceQuery": "PartitionKey eq 'DefaultPartitionKey'" }, "sink": { "type": "BlobSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } }

Example: Copy data from Azure Blob to Azure Table

The following sample shows:
A linked service of type AzureStorage (used for both table & blob). An input dataset of type AzureBlob. An output dataset of type AzureTable. A pipeline with a Copy activity that uses BlobSource and AzureTableSink.

The sample copies time-series data from an Azure blob to an Azure table hourly. The JSON properties used in these samples are described in sections following the samples.

Azure storage (for both Azure Table & Blob) linked service:

{ "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } }

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key, and for the latter one, you specify the shared access signature (SAS) URI. See the Linked service properties section for details.

Azure Blob input dataset:

Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The “external”: “true” setting informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
{ "name": "AzureBlobInput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", "fileName": "{Hour}.csv", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ], "format": { "type": "TextFormat", "columnDelimiter": ",", "rowDelimiter": "\n" } }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } }

Azure Table output dataset:

The sample copies data to a table named “MyOutputTable” in Azure Table. Create an Azure table with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.

{ "name": "AzureTableOutput", "properties": { "type": "AzureTable", "linkedServiceName": "StorageLinkedService", "typeProperties": { "tableName": "MyOutputTable" }, "availability": { "frequency": "Hour", "interval": 1 } } }

Copy activity in a pipeline with BlobSource and AzureTableSink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and the sink type is set to AzureTableSink.
{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline with copy activity", "activities":[ { "name": "AzureBlobtoTable", "description": "Copy Activity", "type": "Copy", "inputs": [ { "name": "AzureBlobInput" } ], "outputs": [ { "name": "AzureTableOutput" } ], "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "AzureTableSink", "writeBatchSize": 100, "writeBatchTimeout": "01:00:00" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } }

Type Mapping for Azure Table

As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:

1. Convert from native source types to .NET type
2. Convert from .NET type to native sink type

When moving data to and from Azure Table, the following mappings defined by the Azure Table service are used from Azure Table OData types to .NET type and vice versa.

ODATA DATA TYPE | .NET TYPE | DETAILS
Edm.Binary | byte[] | An array of bytes up to 64 KB.
Edm.Boolean | bool | A Boolean value.
Edm.DateTime | DateTime | A 64-bit value expressed as Coordinated Universal Time (UTC). The supported DateTime range begins from 12:00 midnight, January 1, 1601 A.D. (C.E.), UTC. The range ends at December 31, 9999.
Edm.Double | double | A 64-bit floating point value.
Edm.Guid | Guid | A 128-bit globally unique identifier.
Edm.Int32 | Int32 | A 32-bit integer.
Edm.Int64 | Int64 | A 64-bit integer.
Edm.String | String | A UTF-16-encoded value. String values may be up to 64 KB.

Type Conversion Sample

The following sample is for copying data from an Azure Blob to Azure Table with type conversions. Suppose the Blob dataset is in CSV format and contains three columns.
One of them is a datetime column with a custom datetime format using abbreviated French names for the day of the week. Define the Blob source dataset as follows, along with type definitions for the columns.

{ "name": "AzureBlobInput", "properties": { "structure": [ { "name": "userid", "type": "Int64"}, { "name": "name", "type": "String"}, { "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-YYYY"} ], "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder", "fileName":"myfile.csv", "format": { "type": "TextFormat", "columnDelimiter": "," } }, "external": true, "availability": { "frequency": "Hour", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } }

Given the type mapping from Azure Table OData type to .NET type, you would define the table in Azure Table with the following schema.

Azure Table schema:

COLUMN NAME | TYPE
userid | Edm.Int64
name | Edm.String
lastlogindate | Edm.DateTime

Next, define the Azure Table dataset as follows. You do not need to specify the “structure” section with the type information since the type information is already specified in the underlying data store.

{ "name": "AzureTableOutput", "properties": { "type": "AzureTable", "linkedServiceName": "StorageLinkedService", "typeProperties": { "tableName": "MyOutputTable" }, "availability": { "frequency": "Hour", "interval": 1 } } }

In this case, Data Factory automatically does the type conversions, including the datetime field with the custom datetime format using the "fr-fr" culture, when moving data from Blob to Azure Table.

NOTE To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data Factory.
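The culture-aware parsing above can be illustrated without relying on system locales. This is a rough sketch, assuming values like "ven.-06-2014" (abbreviated French day name, month, year) and a hardcoded table of French day abbreviations; it is not how .NET's culture machinery actually works:

```python
# French abbreviated day names as commonly rendered with trailing periods
# (an assumption for this sketch, Monday through Sunday).
FR_DAY_ABBREV = ["lun.", "mar.", "mer.", "jeu.", "ven.", "sam.", "dim."]

def parse_fr_ddd_mm_yyyy(value):
    """Split a 'ddd-MM-YYYY' style value and validate the French day name."""
    day_abbrev, month, year = value.rsplit("-", 2)
    if day_abbrev not in FR_DAY_ABBREV:
        raise ValueError("unknown French day abbreviation: " + day_abbrev)
    return day_abbrev, int(month), int(year)

parsed = parse_fr_ddd_mm_yyyy("ven.-06-2014")
# parsed == ("ven.", 6, 2014)
```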
Performance and Tuning

To learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it, see Copy Activity Performance & Tuning Guide.

Move data from an on-premises Cassandra database using Azure Data Factory

5/4/2017 • 10 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises Cassandra database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an on-premises Cassandra data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from a Cassandra data store to other data stores, not moving data from other data stores to a Cassandra data store.

Supported versions

The Cassandra connector supports the following versions of Cassandra: 2.X.

Prerequisites

For the Azure Data Factory service to be able to connect to your on-premises Cassandra database, you must install a Data Management Gateway on the same machine that hosts the database, or on a separate machine to avoid competing for resources with the database. Data Management Gateway is a component that connects on-premises data sources to cloud services in a secure and managed way. See the Data Management Gateway article for details about Data Management Gateway. See the Move data from on-premises to cloud article for step-by-step instructions on setting up the gateway and a data pipeline to move data. You must use the gateway to connect to a Cassandra database even if the database is hosted in the cloud, for example, on an Azure IaaS VM. You can have the gateway on the same VM that hosts the database or on a separate VM, as long as the gateway can connect to the database.
When you install the gateway, it automatically installs a Microsoft Cassandra ODBC driver used to connect to Cassandra database. Therefore, you don't need to manually install any driver on the gateway machine when copying data from the Cassandra database. NOTE See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues. Getting started You can create a pipeline with a copy activity that moves data from an on-premises Cassandra data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2. Create datasets to represent input and output data for the copy operation. 3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises Cassandra data store, see JSON example: Copy data from Cassandra to Azure Blob section of this article. 
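The three entities created in the steps above (linked services, datasets, and a pipeline wiring an input dataset to an output dataset) can be sketched as plain JSON assembly; every name below is illustrative, not a required value:

```python
import json

# Illustrative assembly of the three Data Factory entity kinds; only the
# structure follows the steps above, the names are made up.
linked_service = {"name": "CassandraLinkedService",
                  "properties": {"type": "OnPremisesCassandra"}}
input_dataset = {"name": "CassandraInput",
                 "properties": {"type": "CassandraTable",
                                "linkedServiceName": linked_service["name"]}}
output_dataset = {"name": "AzureBlobOutput",
                  "properties": {"type": "AzureBlob",
                                 "linkedServiceName": "StorageLinkedService"}}
pipeline = {
    "name": "CassandraToBlobPipeline",
    "properties": {
        "activities": [{
            "name": "CassandraToAzureBlob",
            "type": "Copy",
            # the copy activity takes one dataset as input, one as output
            "inputs": [{"name": input_dataset["name"]}],
            "outputs": [{"name": output_dataset["name"]}],
        }]
    },
}
payload = json.dumps(pipeline)
```

This mirrors what the Copy Wizard generates for you, and what you author by hand when using the other tools/APIs.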
The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Cassandra data store: Linked service properties The following table provides description for JSON elements specific to Cassandra linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: OnPremisesCassandra Yes host One or more IP addresses or host names of Cassandra servers. Yes Specify a comma-separated list of IP addresses or host names to connect to all servers concurrently. port The TCP port that the Cassandra server uses to listen for client connections. No, default value: 9042 authenticationType Basic, or Anonymous Yes username Specify user name for the user account. Yes, if authenticationType is set to Basic. password Specify password for the user account. Yes, if authenticationType is set to Basic. gatewayName The name of the gateway that is used to connect to the on-premises Cassandra database. Yes encryptedCredential Credential encrypted by the gateway. No Dataset properties For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type CassandraTable has the following properties PROPERTY DESCRIPTION REQUIRED keyspace Name of the keyspace or schema in Cassandra database. Yes (If query for CassandraSource is not defined). tableName Name of the table in Cassandra database. Yes (If query for CassandraSource is not defined). Copy activity properties For a full list of sections & properties available for defining activities, see the Creating Pipelines article. 
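The required/optional rules in the linked-service table above can be checked mechanically. Here is a hedged sketch (the validator function is hypothetical; the property names come from the table):

```python
# Illustrative validation of the Cassandra linked-service rules described
# in the table above.
def validate_cassandra_linked_service(props: dict) -> list:
    errors = []
    if props.get("type") != "OnPremisesCassandra":
        errors.append("type must be OnPremisesCassandra")
    for required in ("host", "authenticationType", "gatewayName"):
        if required not in props:
            errors.append(f"missing required property: {required}")
    if props.get("authenticationType") == "Basic":
        # username/password are required only for Basic authentication
        for cred in ("username", "password"):
            if cred not in props:
                errors.append(f"{cred} is required for Basic authentication")
    return errors

service = {
    "type": "OnPremisesCassandra",
    "host": "mycassandraserver",
    "port": 9042,               # optional; defaults to 9042
    "authenticationType": "Basic",
    "username": "user",
    "password": "password",
    "gatewayName": "mygateway",
}
```

Note that port is optional (defaulting to 9042), while username and password only become required when authenticationType is Basic.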
Properties such as name, description, input and output tables, and policy are available for all types of activities. Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. When source is of type CassandraSource, the following properties are available in typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED query Use the custom query to read data. SQL-92 query or CQL query. See CQL reference. No (if tableName and keyspace on dataset are defined). When using SQL query, specify keyspace name.table name to represent the table you want to query. consistencyLevel The consistency level specifies how many replicas must respond to a read request before returning data to the client application. Cassandra checks the specified number of replicas for data to satisfy the read request. ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, LOCAL_ONE. See Configuring data consistency for details. No. Default value is ONE. JSON example: Copy data from Cassandra to Azure Blob This example provides sample JSON definitions that you can use to create a pipeline by using Azure portal or Visual Studio or Azure PowerShell. It shows how to copy data from an on-premises Cassandra database to an Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. IMPORTANT This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See moving data between on-premises locations and cloud article for step-by-step instructions. The sample has the following data factory entities: A linked service of type OnPremisesCassandra. A linked service of type AzureStorage. An input dataset of type CassandraTable. An output dataset of type AzureBlob. A pipeline with Copy Activity that uses CassandraSource and BlobSink. 
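The CassandraSource rules above (query optional when keyspace and tableName are set, consistencyLevel restricted to a fixed list) can be sketched like this; the helper itself is hypothetical:

```python
# Allowed consistencyLevel values per the table above.
ALLOWED_CONSISTENCY_LEVELS = {
    "ONE", "TWO", "THREE", "QUORUM", "ALL",
    "LOCAL_QUORUM", "EACH_QUORUM", "LOCAL_ONE",
}

def build_cassandra_source(keyspace=None, table=None, query=None,
                           consistency_level="ONE"):
    """Return the (query, consistencyLevel) pair for a CassandraSource."""
    if consistency_level not in ALLOWED_CONSISTENCY_LEVELS:
        raise ValueError(f"invalid consistency level: {consistency_level}")
    if query is None:
        if not (keyspace and table):
            raise ValueError("specify either a query, or keyspace and tableName")
        # "keyspace name.table name" identifies the table to query
        query = f"select * from {keyspace}.{table}"
    return query, consistency_level
```

When no custom query is given, the keyspace and tableName from the dataset determine what is read, and consistencyLevel defaults to ONE.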
Cassandra linked service: This example uses the Cassandra linked service. See Cassandra linked service section for the properties supported by this linked service. { "name": "CassandraLinkedService", "properties": { "type": "OnPremisesCassandra", "typeProperties": { "authenticationType": "Basic", "host": "mycassandraserver", "port": 9042, "username": "user", "password": "password", "gatewayName": "mygateway" } } } Azure Storage linked service: { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } } Cassandra input dataset: { "name": "CassandraInput", "properties": { "linkedServiceName": "CassandraLinkedService", "type": "CassandraTable", "typeProperties": { "tableName": "mytable", "keySpace": "mykeyspace" }, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Setting external to true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). { "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "adfgetstarted/fromcassandra" }, "availability": { "frequency": "Hour", "interval": 1 } } } Copy activity in a pipeline with Cassandra source and Blob sink: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to CassandraSource and sink type is set to BlobSink. See the Copy activity properties section for the list of properties supported by CassandraSource.
{ "name":"SamplePipeline", "properties":{ "start":"2016-06-01T18:00:00", "end":"2016-06-01T19:00:00", "description":"pipeline with copy activity", "activities":[ { "name": "CassandraToAzureBlob", "description": "Copy from Cassandra to an Azure blob", "type": "Copy", "inputs": [ { "name": "CassandraInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "typeProperties": { "source": { "type": "CassandraSource", "query": "select id, firstname, lastname from mykeyspace.mytable" }, "sink": { "type": "BlobSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } Type mapping for Cassandra CASSANDRA TYPE .NET BASED TYPE ASCII String BIGINT Int64 BLOB Byte[] BOOLEAN Boolean DECIMAL Decimal DOUBLE Double FLOAT Single INET String INT Int32 TEXT String TIMESTAMP DateTime TIMEUUID Guid UUID Guid VARCHAR String VARINT Decimal NOTE For collection types (map, set, list, etc.), refer to Work with Cassandra collection types using virtual table section. User-defined types are not supported. The length of Binary and String columns cannot be greater than 4000. Work with collections using virtual table Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your Cassandra database. For collection types including map, set and list, the driver renormalizes the data into corresponding virtual tables. Specifically, if a table contains any collection columns, the driver generates the following virtual tables: A base table, which contains the same data as the real table except for the collection columns. The base table uses the same name as the real table that it represents. A virtual table for each collection column, which expands the nested data. The virtual tables that represent collections are named using the name of the real table, a separator “vt”, and the name of the column.
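The Cassandra-to-.NET type mapping above is a straight lookup table; as a sketch:

```python
# Cassandra type -> .NET type, transcribed from the mapping table above.
CASSANDRA_TO_DOTNET = {
    "ASCII": "String", "BIGINT": "Int64", "BLOB": "Byte[]",
    "BOOLEAN": "Boolean", "DECIMAL": "Decimal", "DOUBLE": "Double",
    "FLOAT": "Single", "INET": "String", "INT": "Int32",
    "TEXT": "String", "TIMESTAMP": "DateTime", "TIMEUUID": "Guid",
    "UUID": "Guid", "VARCHAR": "String", "VARINT": "Decimal",
}

def dotnet_type_for(cassandra_type: str) -> str:
    """Look up the .NET type for a Cassandra column type (case-insensitive)."""
    try:
        return CASSANDRA_TO_DOTNET[cassandra_type.upper()]
    except KeyError:
        # user-defined types and collections are not in the scalar mapping
        raise ValueError(f"unsupported Cassandra type: {cassandra_type}")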
Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. See Example section for details. You can access the content of Cassandra collections by querying and joining the virtual tables. You can use the Copy Wizard to intuitively view the list of tables in Cassandra database including the virtual tables, and preview the data inside. You can also construct a query in the Copy Wizard and validate to see the result. Example For example, the following “ExampleTable” is a Cassandra database table that contains an integer primary key column named “pk_int”, a text column named value, a list column, a map column, and a set column (named “StringSet”). PK_INT VALUE LIST MAP STRINGSET 1 "sample value 1" ["1", "2", "3"] {"S1": "a", "S2": "b"} {"A", "B", "C"} 3 "sample value 3" ["100", "101", "102", "105"] {"S1": "t"} {"A", "E"} The driver would generate multiple virtual tables to represent this single table. The foreign key columns in the virtual tables reference the primary key columns in the real table, and indicate which real table row the virtual table row corresponds to. The first virtual table, the base table named “ExampleTable”, is shown in the following table. The base table contains the same data as the original database table except for the collections, which are omitted from this table and expanded in other virtual tables. PK_INT VALUE 1 "sample value 1" 3 "sample value 3" The following tables show the virtual tables that renormalize the data from the List, Map, and StringSet columns. The columns with names that end with “_index” or “_key” indicate the position of the data within the original list or map. The columns with names that end with “_value” contain the expanded data from the collection.
Table “ExampleTable_vt_List”: PK_INT LIST_INDEX LIST_VALUE 1 0 1 1 1 2 1 2 3 3 0 100 3 1 101 3 2 102 3 3 105 Table “ExampleTable_vt_Map”: PK_INT MAP_KEY MAP_VALUE 1 S1 a 1 S2 b 3 S1 t Table “ExampleTable_vt_StringSet”: PK_INT STRINGSET_VALUE 1 A 1 B 1 C 3 A 3 E Map source to sink columns To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in Azure Data Factory. Repeatable read from relational sources When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See Repeatable read from relational sources. Performance and Tuning See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. Move data from DB2 using Azure Data Factory 4/12/2017 • 9 min to read • Edit Online This article outlines how you can use the Copy Activity in an Azure data factory to copy data from an on-premises DB2 database to any data store listed under the Sink column in the Supported Sources and Sinks section. This article builds on the data movement activities article, which presents a general overview of data movement with copy activity and supported data store combinations. Data Factory currently supports only moving data from a DB2 database to supported sink data stores, but not moving data from other data stores to a DB2 database. Prerequisites Data Factory supports connecting to an on-premises DB2 database using the Data Management Gateway.
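The base-table/virtual-table split described above can be reproduced in a few lines; this is an illustrative sketch of the driver's renormalization, not the driver itself:

```python
def build_virtual_tables(table_name, rows, pk, collection_columns):
    """Split collection columns into a base table plus one virtual table per
    collection column, mirroring what the Cassandra ODBC driver does."""
    base_table = [
        {k: v for k, v in row.items() if k not in collection_columns}
        for row in rows
    ]
    virtual_tables = {}
    for col in collection_columns:
        vt_rows = []
        for row in rows:
            value = row[col]
            if isinstance(value, dict):        # map -> _key/_value columns
                for k, v in value.items():
                    vt_rows.append({pk: row[pk], col + "_key": k,
                                    col + "_value": v})
            elif isinstance(value, list):      # list -> _index/_value columns
                for i, v in enumerate(value):
                    vt_rows.append({pk: row[pk], col + "_index": i,
                                    col + "_value": v})
            else:                              # set -> _value column only
                for v in sorted(value):
                    vt_rows.append({pk: row[pk], col + "_value": v})
        virtual_tables[f"{table_name}_vt_{col}"] = vt_rows
    return base_table, virtual_tables

# The "ExampleTable" rows from the example above.
rows = [
    {"pk_int": 1, "value": "sample value 1",
     "List": ["1", "2", "3"], "Map": {"S1": "a", "S2": "b"},
     "StringSet": {"A", "B", "C"}},
    {"pk_int": 3, "value": "sample value 3",
     "List": ["100", "101", "102", "105"], "Map": {"S1": "t"},
     "StringSet": {"A", "E"}},
]
base, vts = build_virtual_tables("ExampleTable", rows, "pk_int",
                                 ["List", "Map", "StringSet"])
```

Each virtual-table row carries the primary key of the real row it came from, which is what lets you join the virtual tables back to the base table.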
See the Data Management Gateway article to learn about Data Management Gateway and the Move data from on-premises to cloud article for step-by-step instructions on setting up the gateway and a data pipeline to move data. The gateway is required even if the DB2 database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM as long as the gateway can connect to the database. The Data Management Gateway provides a built-in DB2 driver, therefore you don't need to manually install any driver when copying data from DB2. NOTE See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues. Supported versions This DB2 connector supports the following IBM DB2 platforms and versions with Distributed Relational Database Architecture (DRDA) SQL Access Manager (SQLAM) versions 9, 10, and 11: IBM DB2 for z/OS 11.1 IBM DB2 for z/OS 10.1 IBM DB2 for i 7.2 IBM DB2 for i 7.1 IBM DB2 for LUW 11 IBM DB2 for LUW 10.5 IBM DB2 for LUW 10.1 TIP If you hit an error stating "The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805", use a high-privilege account (power user or admin) to run the copy activity once; the needed package is then auto-created during the copy. Afterward, you can switch back to the normal user for your subsequent copy runs. Getting started You can create a pipeline with a copy activity that moves data from an on-premises DB2 data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2. Create datasets to represent input and output data for the copy operation. 3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises DB2 data store, see JSON example: Copy data from DB2 to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to a DB2 data store: Linked service properties The following table provides description for JSON elements specific to DB2 linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: OnPremisesDB2 Yes server Name of the DB2 server. Yes database Name of the DB2 database. Yes schema Name of the schema in the database. The schema name is case-sensitive. No authenticationType Type of authentication used to connect to the DB2 database. Possible values are: Anonymous, Basic, and Windows. Yes username Specify user name if you are using Basic or Windows authentication. No password Specify password for the user account you specified for the username. No gatewayName Name of the gateway that the Data Factory service should use to connect to the on-premises DB2 database. Yes Dataset properties For a full list of sections & properties available for defining datasets, see the Creating datasets article. 
Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type RelationalTable (which includes DB2 dataset) has the following properties. PROPERTY DESCRIPTION REQUIRED tableName Name of the table in the DB2 Database instance that linked service refers to. The tableName is case-sensitive. No (if query of RelationalSource is specified) Copy activity properties For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities. Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. For Copy Activity, when source is of type RelationalSource (which includes DB2), the following properties are available in typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED query Use the custom query to read data. SQL query string. For example: No (if tableName of dataset is specified) "query": "select * from \"MySchema\".\"MyTable\"" . NOTE Schema and table names are case-sensitive. Enclose the names in "" (double quotes) in the query. Example: "query": "select * from \"DB2ADMIN\".\"Customers\"" JSON example: Copy data from DB2 to Azure Blob This example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. It shows how to copy data from a DB2 database to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. The sample has the following data factory entities: 1.
A linked service of type OnPremisesDb2. 2. A linked service of type AzureStorage. 3. An input dataset of type RelationalTable. 4. An output dataset of type AzureBlob. 5. A pipeline with Copy Activity that uses RelationalSource and BlobSink. The sample copies data from a query result in a DB2 database to an Azure blob hourly. The JSON properties used in these samples are described in sections following the samples. As a first step, install and configure a data management gateway. Instructions are in the moving data between on-premises locations and cloud article. DB2 linked service: { "name": "OnPremDb2LinkedService", "properties": { "type": "OnPremisesDb2", "typeProperties": { "server": "<server>", "database": "<database>", "schema": "<schema>", "authenticationType": "<authentication type>", "username": "<username>", "password": "<password>", "gatewayName": "<gatewayName>" } } } Azure Blob storage linked service: { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey= <AccountKey>" } } } DB2 input dataset: The sample assumes you have created a table “MyTable” in DB2 and it contains a column called “timestamp” for time series data. Setting “external”: true informs the Data Factory service that this dataset is external to the data factory and is not produced by an activity in the data factory. Notice that the type is set to RelationalTable. { "name": "Db2DataSet", "properties": { "type": "RelationalTable", "linkedServiceName": "OnPremDb2LinkedService", "typeProperties": {}, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1).
The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. { "name": "AzureBlobDb2DataSet", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/db2/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } } Pipeline with Copy activity: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and sink type is set to BlobSink. The SQL query specified for the query property selects the data from the Orders table. 
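DB2 schema and table names are case-sensitive and must be enclosed in double quotes, which in turn must be escaped inside the JSON pipeline definition. A small sketch (the helper is hypothetical; json.dumps produces the escaping for you):

```python
import json

def db2_table_query(schema: str, table: str) -> str:
    """Build a case-sensitive DB2 query with double-quoted identifiers."""
    return f'select * from "{schema}"."{table}"'

# Embedding the query in a JSON pipeline definition: the inner double
# quotes must themselves be escaped, which json.dumps handles.
type_properties = json.dumps({"query": db2_table_query("DB2ADMIN", "Customers")})
```

Hand-writing the escaped form is error-prone; letting a JSON serializer emit the snippet avoids the unescaped-quote mistake.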
{ "name": "CopyDb2ToBlob", "properties": { "description": "pipeline for copy activity", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "RelationalSource", "query": "select * from \"Orders\"" }, "sink": { "type": "BlobSink" } }, "inputs": [ { "name": "Db2DataSet" } ], "outputs": [ { "name": "AzureBlobDb2DataSet" } ], "policy": { "timeout": "01:00:00", "concurrency": 1 }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "Db2ToBlob" } ], "start": "2014-06-01T18:00:00Z", "end": "2014-06-01T19:00:00Z" } } Type mapping for DB2 As mentioned in the data movement activities article, the Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: 1. Convert from native source types to .NET type 2. Convert from .NET type to native sink type When moving data from DB2, the following mappings are used from DB2 type to .NET type. DB2 DATABASE TYPE .NET FRAMEWORK TYPE SmallInt Int16 Integer Int32 BigInt Int64 Real Single Double Double Float Double Decimal Decimal DecimalFloat Decimal Numeric Decimal Date DateTime Time TimeSpan Timestamp DateTime Xml Byte[] Char String VarChar String LongVarChar String DB2DynArray String Binary Byte[] VarBinary Byte[] LongVarBinary Byte[] Graphic String VarGraphic String LongVarGraphic String Clob String Blob Byte[] DbClob String Map source to sink columns To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in Azure Data Factory. Repeatable read from relational sources When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually.
You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See Repeatable read from relational sources. Performance and Tuning See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. Copy data to and from an on-premises file system by using Azure Data Factory 6/9/2017 • 17 min to read • Edit Online This article explains how to use the Copy Activity in Azure Data Factory to copy data to/from an on-premises file system. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. Supported scenarios You can copy data from an on-premises file system to the following data stores: CATEGORY DATA STORE Azure Azure Blob storage Azure Data Lake Store Azure Cosmos DB (DocumentDB API) Azure SQL Database Azure SQL Data Warehouse Azure Search Index Azure Table storage Databases SQL Server Oracle File File system You can copy data from the following data stores to an on-premises file system: CATEGORY DATA STORE Azure Azure Blob storage Azure Cosmos DB (DocumentDB API) Azure Data Lake Store Azure SQL Database Azure SQL Data Warehouse Azure Table storage Databases Amazon Redshift DB2 MySQL Oracle PostgreSQL SAP Business Warehouse SAP HANA SQL Server Sybase Teradata NoSQL Cassandra MongoDB CATEGORY DATA STORE File Amazon S3 File System FTP HDFS SFTP Others Generic HTTP Generic OData Generic ODBC Salesforce Web Table (table from HTML) GE Historian NOTE Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline. 
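The NOTE above suggests a custom activity for deleting source files after a successful copy. Actual Data Factory custom activities are .NET classes; the Python sketch below only illustrates the cleanup logic itself, with a temporary file standing in for a copied source file:

```python
import os
import tempfile

def delete_copied_sources(file_paths):
    """Remove source files after a successful copy run and report what was
    deleted. (In Data Factory this logic would live in a .NET custom
    activity; this sketch just shows the idea.)"""
    deleted = []
    for path in file_paths:
        if os.path.exists(path):
            os.remove(path)
            deleted.append(path)
    return deleted

# Demo: a temp file stands in for a source file that was just copied.
handle, demo_path = tempfile.mkstemp()
os.close(handle)
removed = delete_copied_sources([demo_path])
```

Skipping files that no longer exist keeps the cleanup step safe to rerun, which matters because slices can be rerun.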
Enabling connectivity Data Factory supports connecting to and from an on-premises file system via Data Management Gateway. You must install the Data Management Gateway in your on-premises environment for the Data Factory service to connect to any supported on-premises data store including file system. To learn about Data Management Gateway and for step-by-step instructions on setting up the gateway, see Move data between on-premises sources and the cloud with Data Management Gateway. Apart from Data Management Gateway, no other binary files need to be installed to communicate to and from an on-premises file system. You must install and use the Data Management Gateway even if the file system is in Azure IaaS VM. For detailed information about the gateway, see Data Management Gateway. To use a Linux file share, install Samba on your Linux server, and install Data Management Gateway on a Windows server. Installing Data Management Gateway on a Linux server is not supported. Getting started You can create a pipeline with a copy activity that moves data to/from a file system by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create a data factory. A data factory may contain one or more pipelines. 2. Create linked services to link input and output data stores to your data factory. 
For example, if you are copying data from an Azure blob storage to an on-premises file system, you create two linked services to link your on-premises file system and Azure storage account to your data factory. For linked service properties that are specific to an on-premises file system, see linked service properties section. 3. Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the folder and file name (optional) in your file system. For dataset properties that are specific to on-premises file system, see dataset properties section. 4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and FileSystemSink as a sink for the copy activity. Similarly, if you are copying from on-premises file system to Azure Blob Storage, you use FileSystemSource and BlobSink in the copy activity. For copy activity properties that are specific to onpremises file system, see copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from a file system, see JSON examples section of this article. 
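The file-system JSON definitions used for these entities write Windows paths with doubled backslashes; that doubling is JSON string escaping, not part of the path. A quick check of how decoding works (the server and drive values are illustrative):

```python
import json

# Raw strings below contain the JSON text exactly as it appears in a
# linked-service definition; json.loads reveals the actual path values.
remote_share = json.loads(r'{"host": "\\\\myserver\\share"}')
local_root = json.loads(r'{"host": "D:\\"}')
```

So a host written as \\\\myserver\\share in JSON is the UNC path \\myserver\share once decoded, and D:\\ is simply D:\.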
The following sections provide details about JSON properties that are used to define Data Factory entities specific to file system: Linked service properties You can link an on-premises file system to an Azure data factory with the On-Premises File Server linked service. The following table provides descriptions for JSON elements that are specific to the On-Premises File Server linked service. PROPERTY DESCRIPTION REQUIRED type Ensure that the type property is set to OnPremisesFileServer. Yes host Specifies the root path of the folder that you want to copy. Use the escape character ‘\’ for special characters in the string. See Sample linked service and dataset definitions for examples. Yes userid Specify the ID of the user who has access to the server. No (if you choose encryptedCredential) password Specify the password for the user (userid). No (if you choose encryptedCredential) encryptedCredential Specify the encrypted credentials that you can get by running the New-AzureRmDataFactoryEncryptValue cmdlet. No (if you choose to specify userid and password in plain text) gatewayName Specifies the name of the gateway that Data Factory should use to connect to the on-premises file server.
Yes Sample linked service and dataset definitions SCENARIO HOST IN LINKED SERVICE DEFINITION FOLDERPATH IN DATASET DEFINITION Local folder on Data Management Gateway machine (examples: D:\* or D:\folder\subfolder\*): host is D:\\ (for Data Management Gateway 2.0 and later versions) or localhost (for versions earlier than Data Management Gateway 2.0); folderPath is .\\ or folder\\subfolder (for Data Management Gateway 2.0 and later versions), or D:\\ or D:\\folder\\subfolder (for gateway versions below 2.0). Remote shared folder (examples: \\myserver\share\* or \\myserver\share\folder\subfolder\*): host is \\\\myserver\\share; folderPath is .\\ or folder\\subfolder. Example: Using username and password in plain text { "Name": "OnPremisesFileServerLinkedService", "properties": { "type": "OnPremisesFileServer", "typeProperties": { "host": "\\\\Contosogame-Asia", "userid": "Admin", "password": "123456", "gatewayName": "mygateway" } } } Example: Using encryptedcredential { "Name": "OnPremisesFileServerLinkedService", "properties": { "type": "OnPremisesFileServer", "typeProperties": { "host": "D:\\", "encryptedCredential": "WFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5xxxxxxxxxxxxxxxxx", "gatewayName": "mygateway" } } } Dataset properties For a full list of sections and properties that are available for defining datasets, see Creating datasets. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. The typeProperties section is different for each type of dataset. It provides information such as the location and format of the data in the data store. The typeProperties section for the dataset of type FileShare has the following properties: PROPERTY DESCRIPTION REQUIRED folderPath Specifies the subpath to the folder. Use the escape character ‘\’ for special characters in the string. See Sample linked service and dataset definitions for examples. Yes You can combine this property with partitionedBy to have folder paths based on slice start/end date-times.
| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| fileName | Specify the name of the file in the folderPath if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder. When fileName is not specified for an output dataset and preserveHierarchy is not specified in the activity sink, the name of the generated file is in the following format: Data.&lt;Guid&gt;.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). | No |
| fileFilter | Specify a filter to be used to select a subset of files in the folderPath rather than all files. Allowed values are: * (multiple characters) and ? (single character). Example 1: "fileFilter": "*.log". Example 2: "fileFilter": "2014-1-?.txt". Note that fileFilter is applicable for an input FileShare dataset. | No |
| partitionedBy | You can use partitionedBy to specify a dynamic folderPath/fileName for time-series data. An example is folderPath parameterized for every hour of data. | No |
| format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | No |
| compression | Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. See File and compression formats in Azure Data Factory. | No |

NOTE
You cannot use fileName and fileFilter simultaneously.

Using partitionedBy property

As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time-series data with the partitionedBy property, Data Factory functions, and the system variables.
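As an illustrative sketch (not Data Factory code), the substitution that partitionedBy performs can be mimicked in Python: each named partition value is derived from the slice start time and spliced into the path templates. The templates below are hypothetical, and the Data Factory format strings yyyy, MM, dd, and HH correspond roughly to the strftime codes shown in the comments.

```python
from datetime import datetime

def resolve(folder_template, file_template, slice_start):
    """Substitute SliceStart-derived values into the path templates,
    the way partitionedBy does for a time-series dataset."""
    parts = {
        "Year":  slice_start.strftime("%Y"),  # Data Factory format "yyyy"
        "Month": slice_start.strftime("%m"),  # "MM"
        "Day":   slice_start.strftime("%d"),  # "dd"
        "Hour":  slice_start.strftime("%H"),  # "HH" (24-hour)
    }
    return folder_template.format(**parts), file_template.format(**parts)

# Hypothetical hourly slice starting at 2014-10-01 03:00.
folder, name = resolve(
    "mysharedfolder/yearno={Year}/monthno={Month}/dayno={Day}",
    "{Hour}.csv",
    datetime(2014, 10, 1, 3),
)
print(folder)  # mysharedfolder/yearno=2014/monthno=10/dayno=01
print(name)    # 03.csv
```

Each hourly slice therefore resolves to its own folder and file, which is what lets Data Factory process slices independently.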
To understand more details on time-series datasets, scheduling, and slices, see Creating datasets, Scheduling and execution, and Creating pipelines.

Sample 1:

```json
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy": [
  { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }
],
```

In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (YYYYMMDDHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.

Sample 2:

```json
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
  { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
  { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
  { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
  { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
```

In this example, the year, month, day, and time of SliceStart are extracted into separate variables that the folderPath and fileName properties use.

Copy activity properties

For a full list of sections and properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output datasets, and policies are available for all types of activities, whereas the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. If you are moving data from an on-premises file system, you set the source type in the copy activity to FileSystemSource. Similarly, if you are moving data to an on-premises file system, you set the sink type in the copy activity to FileSystemSink.
This section provides a list of properties supported by FileSystemSource and FileSystemSink.

FileSystemSource supports the following property:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. | True, False (default) | No |

FileSystemSink supports the following property:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| copyBehavior | Defines the copy behavior when the source is BlobSource or FileSystem. | PreserveHierarchy: Preserves the file hierarchy in the target folder; that is, the relative path of the source file to the source folder is the same as the relative path of the target file to the target folder. FlattenHierarchy: All files from the source folder are created in the first level of the target folder, with auto-generated names. MergeFiles: Merges all files from the source folder into one file. If the file name/blob name is specified, the merged file name is the specified name; otherwise, it is an auto-generated file name. | No |

recursive and copyBehavior examples

This section describes the resulting behavior of the Copy operation for different combinations of values for the recursive and copyBehavior properties.
All of the following examples assume a source folder Folder1 with this structure: Folder1 contains File1, File2, and Subfolder1, and Subfolder1 contains File3, File4, and File5.

| RECURSIVE VALUE | COPYBEHAVIOR VALUE | RESULTING BEHAVIOR |
| --- | --- | --- |
| true | preserveHierarchy | The target folder Folder1 is created with the same structure as the source: File1, File2, and Subfolder1 containing File3, File4, and File5. |
| true | flattenHierarchy | The target Folder1 is created with all five files in its first level, each with an auto-generated name. |
| true | mergeFiles | The target Folder1 is created with one file: the contents of File1 + File2 + File3 + File4 + File5 are merged into one file with an auto-generated file name. |
| false | preserveHierarchy | The target folder Folder1 is created with File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up. |
| false | flattenHierarchy | The target folder Folder1 is created with auto-generated names for File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up. |
| false | mergeFiles | The target folder Folder1 is created with one file: the contents of File1 + File2 are merged into one file with an auto-generated file name. Subfolder1 with File3, File4, and File5 is not picked up. |
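The recursive/copyBehavior combinations above can be summarized as a small sketch (illustrative only, not Data Factory code): recursive decides which files are picked, and copyBehavior decides what target paths they get. The file layout below is the hypothetical Folder1 structure from the examples.

```python
import uuid

# Hypothetical source layout: Folder1 contains File1, File2,
# and Subfolder1 with File3, File4, File5.
SOURCE = {
    "File1": "File1",
    "File2": "File2",
    "File3": "Subfolder1/File3",
    "File4": "Subfolder1/File4",
    "File5": "Subfolder1/File5",
}

def plan_copy(source, recursive, copy_behavior):
    """Return the target-side paths under Folder1 for one combination."""
    # recursive=False picks only files directly under the source folder.
    picked = {n: rel for n, rel in source.items() if recursive or "/" not in rel}
    if copy_behavior == "preserveHierarchy":
        return sorted(f"Folder1/{rel}" for rel in picked.values())
    if copy_behavior == "flattenHierarchy":
        # One auto-generated name per picked file, all in the first level.
        return [f"Folder1/{uuid.uuid4()}" for _ in picked]
    if copy_behavior == "mergeFiles":
        # All picked files merged into a single auto-named file.
        return [f"Folder1/{uuid.uuid4()}"]
    raise ValueError(copy_behavior)

print(plan_copy(SOURCE, recursive=True, copy_behavior="preserveHierarchy"))
# ['Folder1/File1', 'Folder1/File2', 'Folder1/Subfolder1/File3',
#  'Folder1/Subfolder1/File4', 'Folder1/Subfolder1/File5']
print(len(plan_copy(SOURCE, recursive=False, copy_behavior="mergeFiles")))  # 1
```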
Supported file and compression formats

See the File and compression formats in Azure Data Factory article for details.

JSON examples for copying data to and from a file system

The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data to and from an on-premises file system and Azure Blob storage. However, you can copy data directly from any of the sources to any of the sinks listed in Supported sources and sinks by using Copy Activity in Azure Data Factory.

Example: Copy data from an on-premises file system to Azure Blob storage

This sample shows how to copy data from an on-premises file system to Azure Blob storage. The sample has the following Data Factory entities:

- A linked service of type OnPremisesFileServer.
- A linked service of type AzureStorage.
- An input dataset of type FileShare.
- An output dataset of type AzureBlob.
- A pipeline with Copy Activity that uses FileSystemSource and BlobSink.

The following sample copies time-series data from an on-premises file system to Azure Blob storage every hour. The JSON properties that are used in these samples are described in the sections after the samples. As a first step, set up Data Management Gateway as per the instructions in Move data between on-premises sources and the cloud with Data Management Gateway.

On-Premises File Server linked service:

```json
{
  "Name": "OnPremisesFileServerLinkedService",
  "properties": {
    "type": "OnPremisesFileServer",
    "typeProperties": {
      "host": "\\\\Contosogame-Asia.<region>.corp.<company>.com",
      "userid": "Admin",
      "password": "123456",
      "gatewayName": "mygateway"
    }
  }
}
```

We recommend using the encryptedCredential property instead of the userid and password properties. See File Server linked service for details about this linked service.
Azure Storage linked service:

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

On-premises file system input dataset:

Data is picked up from a new file every hour. The folderPath and fileName properties are determined based on the start time of the slice. Setting "external": true informs Data Factory that the dataset is external to the data factory and is not produced by an activity in the data factory.

```json
{
  "name": "OnpremisesFileSystemInput",
  "properties": {
    "type": "FileShare",
    "linkedServiceName": "OnPremisesFileServerLinkedService",
    "typeProperties": {
      "folderPath": "mysharedfolder/yearno={Year}/monthno={Month}/dayno={Day}",
      "fileName": "{Hour}.csv",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ]
    },
    "external": true,
    "availability": { "frequency": "Hour", "interval": 1 },
    "policy": {
      "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 }
    }
  }
}
```

Azure Blob storage output dataset:

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
```json
{
  "name": "AzureBlobOutput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ],
      "format": { "type": "TextFormat", "columnDelimiter": "\t", "rowDelimiter": "\n" }
    },
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}
```

A copy activity in a pipeline with File System source and Blob sink:

The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource, and the sink type is set to BlobSink.

```json
{
  "name": "SamplePipeline",
  "properties": {
    "start": "2015-06-01T18:00:00",
    "end": "2015-06-01T19:00:00",
    "description": "Pipeline for copy activity",
    "activities": [
      {
        "name": "OnpremisesFileSystemtoBlob",
        "description": "copy activity",
        "type": "Copy",
        "inputs": [{ "name": "OnpremisesFileSystemInput" }],
        "outputs": [{ "name": "AzureBlobOutput" }],
        "typeProperties": {
          "source": { "type": "FileSystemSource" },
          "sink": { "type": "BlobSink" }
        },
        "scheduler": { "frequency": "Hour", "interval": 1 },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "OldestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
    ]
  }
}
```

Example: Copy data from Azure SQL Database to an on-premises file system

The following sample shows:

- A linked service of type AzureSqlDatabase.
- A linked service of type OnPremisesFileServer.
- An input dataset of type AzureSqlTable.
- An output dataset of type FileShare.
- A pipeline with a copy activity that uses SqlSource and FileSystemSink.

The sample copies time-series data from an Azure SQL table to an on-premises file system every hour. The JSON properties that are used in these samples are described in sections after the samples.

Azure SQL Database linked service:

```json
{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  }
}
```

On-Premises File Server linked service:

```json
{
  "Name": "OnPremisesFileServerLinkedService",
  "properties": {
    "type": "OnPremisesFileServer",
    "typeProperties": {
      "host": "\\\\Contosogame-Asia.<region>.corp.<company>.com",
      "userid": "Admin",
      "password": "123456",
      "gatewayName": "mygateway"
    }
  }
}
```

We recommend using the encryptedCredential property instead of using the userid and password properties. See File System linked service for details about this linked service.

Azure SQL input dataset:

The sample assumes that you've created a table "MyTable" in Azure SQL, and that it contains a column called "timestampcolumn" for time-series data. Setting "external": true informs Data Factory that the dataset is external to the data factory and is not produced by an activity in the data factory.

```json
{
  "name": "AzureSqlInput",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": { "tableName": "MyTable" },
    "external": true,
    "availability": { "frequency": "Hour", "interval": 1 },
    "policy": {
      "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 }
    }
  }
}
```

On-premises file system output dataset:

Data is copied to a new file every hour. The folderPath and fileName for the file are determined based on the start time of the slice.
```json
{
  "name": "OnpremisesFileSystemOutput",
  "properties": {
    "type": "FileShare",
    "linkedServiceName": "OnPremisesFileServerLinkedService",
    "typeProperties": {
      "folderPath": "mysharedfolder/yearno={Year}/monthno={Month}/dayno={Day}",
      "fileName": "{Hour}.csv",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ]
    },
    "external": true,
    "availability": { "frequency": "Hour", "interval": 1 },
    "policy": {
      "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 }
    }
  }
}
```

A copy activity in a pipeline with SQL source and File System sink:

The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlSource, and the sink type is set to FileSystemSink. The SQL query that is specified for the SqlReaderQuery property selects the data in the past hour to copy.
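The SqlReaderQuery for such a pipeline is typically built with $$Text.Format, which substitutes the slice window boundaries (WindowStart, WindowEnd) into the query text. An illustrative Python equivalent of that substitution (not Data Factory code; the window values are hypothetical):

```python
from datetime import datetime

def reader_query(window_start, window_end):
    # Mirrors: $$Text.Format('select * from MyTable where timestampcolumn >= \'{0:yyyy-MM-dd}\'
    #           AND timestampcolumn < \'{1:yyyy-MM-dd}\'', WindowStart, WindowEnd)
    return ("select * from MyTable "
            f"where timestampcolumn >= '{window_start:%Y-%m-%d}' "
            f"AND timestampcolumn < '{window_end:%Y-%m-%d}'")

print(reader_query(datetime(2015, 6, 1, 18), datetime(2015, 6, 2, 0)))
# select * from MyTable where timestampcolumn >= '2015-06-01' AND timestampcolumn < '2015-06-02'
```

Each hourly slice thus gets a query bounded by its own window, so reruns of a slice reprocess only that window's rows.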
```json
{
  "name": "SamplePipeline",
  "properties": {
    "start": "2015-06-01T18:00:00",
    "end": "2015-06-01T20:00:00",
    "description": "pipeline for copy activity",
    "activities": [
      {
        "name": "AzureSQLtoOnPremisesFile",
        "description": "copy activity",
        "type": "Copy",
        "inputs": [{ "name": "AzureSqlInput" }],
        "outputs": [{ "name": "OnpremisesFileSystemOutput" }],
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd}\\'', WindowStart, WindowEnd)"
          },
          "sink": { "type": "FileSystemSink" }
        },
        "scheduler": { "frequency": "Hour", "interval": 1 },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "OldestFirst",
          "retry": 3,
          "timeout": "01:00:00"
        }
      }
    ]
  }
}
```

You can also map columns from the source dataset to columns from the sink dataset in the copy activity definition. For details, see Mapping dataset columns in Azure Data Factory.

Performance and tuning

To learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it, see the Copy Activity performance and tuning guide.

Move data from an FTP server by using Azure Data Factory

4/18/2017 • 11 min to read

This article explains how to use the copy activity in Azure Data Factory to move data from an FTP server. It builds on the Data movement activities article, which presents a general overview of data movement with the copy activity. You can copy data from an FTP server to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the supported data stores table. Data Factory currently supports only moving data from an FTP server to other data stores, not moving data from other data stores to an FTP server. It supports both on-premises and cloud FTP servers.

NOTE
The copy activity does not delete the source file after it is successfully copied to the destination.
If you need to delete the source file after a successful copy, create a custom activity to delete the file, and use the activity in the pipeline. Enable connectivity If you are moving data from an on-premises FTP server to a cloud data store (for example, to Azure Blob storage), install and use Data Management Gateway. The Data Management Gateway is a client agent that is installed on your on-premises machine, and it allows cloud services to connect to an on-premises resource. For details, see Data Management Gateway. For step-by-step instructions on setting up the gateway and using it, see Moving data between on-premises locations and cloud. You use the gateway to connect to an FTP server, even if the server is on an Azure infrastructure as a service (IaaS) virtual machine (VM). It is possible to install the gateway on the same on-premises machine or IaaS VM as the FTP server. However, we recommend that you install the gateway on a separate machine or IaaS VM to avoid resource contention, and for better performance. When you install the gateway on a separate machine, the machine should be able to access the FTP server. Get started You can create a pipeline with a copy activity that moves data from an FTP source by using different tools or APIs. The easiest way to create a pipeline is to use the Data Factory Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2. Create datasets to represent input and output data for the copy operation. 3. 
Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools or APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an FTP data store, see the JSON example: Copy data from FTP server to Azure Blob storage section of this article.

NOTE
For details about supported file and compression formats to use, see File and compression formats in Azure Data Factory.

The following sections provide details about JSON properties that are used to define Data Factory entities specific to FTP.

Linked service properties

The following table describes JSON elements specific to an FTP linked service.

| PROPERTY | DESCRIPTION | REQUIRED | DEFAULT |
| --- | --- | --- | --- |
| type | Set this to FtpServer. | Yes | |
| host | Specify the name or IP address of the FTP server. | Yes | |
| authenticationType | Specify the authentication type. Allowed values: Basic, Anonymous. | Yes | |
| username | Specify the user who has access to the FTP server. | No | |
| password | Specify the password for the user (username). | No | |
| encryptedCredential | Specify the encrypted credential to access the FTP server. | No | |
| gatewayName | Specify the name of the gateway in Data Management Gateway to connect to an on-premises FTP server. | No | |
| port | Specify the port on which the FTP server is listening. | No | 21 |
| enableSsl | Specify whether to use FTP over an SSL/TLS channel. | No | true |
| enableServerCertificateValidation | Specify whether to enable server SSL certificate validation when you are using FTP over an SSL/TLS channel. | No | true |

Use Anonymous authentication

```json
{
  "name": "FTPLinkedService",
  "properties": {
    "type": "FtpServer",
    "typeProperties": {
      "authenticationType": "Anonymous",
      "host": "myftpserver.com"
    }
  }
}
```

Use username and password in plain text for basic authentication

```json
{
  "name": "FTPLinkedService",
  "properties": {
    "type": "FtpServer",
    "typeProperties": {
      "host": "myftpserver.com",
      "authenticationType": "Basic",
      "username": "Admin",
      "password": "123456"
    }
  }
}
```

Use port, enableSsl, enableServerCertificateValidation

```json
{
  "name": "FTPLinkedService",
  "properties": {
    "type": "FtpServer",
    "typeProperties": {
      "host": "myftpserver.com",
      "authenticationType": "Basic",
      "username": "Admin",
      "password": "123456",
      "port": "21",
      "enableSsl": true,
      "enableServerCertificateValidation": true
    }
  }
}
```

Use encryptedCredential for authentication and gateway

```json
{
  "name": "FTPLinkedService",
  "properties": {
    "type": "FtpServer",
    "typeProperties": {
      "host": "myftpserver.com",
      "authenticationType": "Basic",
      "encryptedCredential": "xxxxxxxxxxxxxxxxx",
      "gatewayName": "mygateway"
    }
  }
}
```

Dataset properties

For a full list of sections and properties available for defining datasets, see Creating datasets. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. The typeProperties section is different for each type of dataset; it provides information that is specific to the dataset type. The typeProperties section for a dataset of type FileShare has the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| folderPath | Subpath to the folder. Use the escape character '\' for special characters in the string. See Sample linked service and dataset definitions for examples. You can combine this property with partitionBy to have folder paths based on slice start and end date-times. | Yes |
| fileName | Specify the name of the file in the folderPath if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder. When fileName is not specified for an output dataset, the name of the generated file is in the following format: Data.&lt;Guid&gt;.txt (example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). | No |
| fileFilter | Specify a filter to be used to select a subset of files in the folderPath, rather than all files. Allowed values are: * (multiple characters) and ? (single character). Example 1: "fileFilter": "*.log". Example 2: "fileFilter": "2014-1-?.txt". fileFilter is applicable for an input FileShare dataset. This property is not supported with Hadoop Distributed File System (HDFS). | No |
| partitionedBy | Used to specify a dynamic folderPath and fileName for time-series data. For example, you can specify a folderPath that is parameterized for every hour of data. | No |
| format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. If you want to copy files as they are between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | No |
| compression | Specify the type and level of compression for the data. Supported types are GZip, Deflate, BZip2, and ZipDeflate; supported levels are Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | No |
| useBinaryTransfer | Specify whether to use the binary transfer mode. The values are true for binary mode (the default) and false for ASCII. This property can be used only when the associated linked service is of type FtpServer. | No |

NOTE
fileName and fileFilter cannot be used simultaneously.
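The fileFilter wildcard semantics above (* for multiple characters, ? for exactly one character) follow standard glob matching. As an illustrative sketch (not Data Factory code), Python's fnmatch module behaves the same way on a hypothetical file list:

```python
from fnmatch import fnmatch

files = ["2014-1-1.txt", "2014-1-5.txt", "2014-1-27.txt", "app.log", "app.txt"]

# "*.log" matches any file name ending in .log (* spans multiple characters).
logs = [f for f in files if fnmatch(f, "*.log")]

# "2014-1-?.txt" requires exactly one character at the ? position,
# so "2014-1-27.txt" (two characters) is excluded.
single_digit_days = [f for f in files if fnmatch(f, "2014-1-?.txt")]

print(logs)               # ['app.log']
print(single_digit_days)  # ['2014-1-1.txt', '2014-1-5.txt']
```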
Use the partitionedBy property

As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time-series data with the partitionedBy property. To learn about time-series datasets, scheduling, and slices, see Creating datasets, Scheduling and execution, and Creating pipelines.

Sample 1

```json
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy": [
  { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }
],
```

In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart, in the format specified (YYYYMMDDHH). SliceStart refers to the start time of the slice. The folder path is different for each slice. (For example, wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.)

Sample 2

```json
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
  { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
  { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
  { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
  { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
```

In this example, the year, month, day, and time of SliceStart are extracted into separate variables that are used by the folderPath and fileName properties.

Copy activity properties

For a full list of sections and properties available for defining activities, see Creating pipelines. Properties such as name, description, input and output tables, and policies are available for all types of activities. Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For the copy activity, the type properties vary depending on the types of sources and sinks.
In the copy activity, when the source is of type FileSystemSource, the following property is available in the typeProperties section:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| recursive | Indicates whether the data is read recursively from the subfolders, or only from the specified folder. | True, False (default) | No |

JSON example: Copy data from FTP server to Azure Blob storage

This sample shows how to copy data from an FTP server to Azure Blob storage. However, data can be copied directly to any of the sinks stated in the supported data stores and formats, by using the copy activity in Data Factory. The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or PowerShell:

- A linked service of type FtpServer
- A linked service of type AzureStorage
- An input dataset of type FileShare
- An output dataset of type AzureBlob
- A pipeline with a copy activity that uses FileSystemSource and BlobSink

The sample copies data from an FTP server to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.

FTP linked service

This example uses basic authentication, with the user name and password in plain text. You can also use one of the following ways:

- Anonymous authentication
- Basic authentication with encrypted credentials
- FTP over SSL/TLS (FTPS)

See the FTP linked service section for the different types of authentication you can use.

```json
{
  "name": "FTPLinkedService",
  "properties": {
    "type": "FtpServer",
    "typeProperties": {
      "host": "myftpserver.com",
      "authenticationType": "Basic",
      "username": "Admin",
      "password": "123456"
    }
  }
}
```

Azure Storage linked service

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

FTP input dataset

This dataset refers to the FTP folder mysharedfolder and the file test.csv. The pipeline copies the file to the destination. Setting external to true informs the Data Factory service that the dataset is external to the data factory, and is not produced by an activity in the data factory.

```json
{
  "name": "FTPFileInput",
  "properties": {
    "type": "FileShare",
    "linkedServiceName": "FTPLinkedService",
    "typeProperties": {
      "folderPath": "mysharedfolder",
      "fileName": "test.csv",
      "useBinaryTransfer": true
    },
    "external": true,
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}
```

Azure Blob output dataset

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated, based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.

```json
{
  "name": "AzureBlobOutput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/ftp/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
      "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" },
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ]
    },
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}
```

A copy activity in a pipeline with file system source and blob sink

The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource, and the sink type is set to BlobSink.
```json
{
  "name": "pipeline",
  "properties": {
    "activities": [{
      "name": "FTPToBlobCopy",
      "inputs": [{ "name": "FTPFileInput" }],
      "outputs": [{ "name": "AzureBlobOutput" }],
      "type": "Copy",
      "typeProperties": {
        "source": { "type": "FileSystemSource" },
        "sink": { "type": "BlobSink" }
      },
      "scheduler": { "frequency": "Hour", "interval": 1 },
      "policy": {
        "concurrency": 1,
        "executionPriorityOrder": "NewestFirst",
        "retry": 1,
        "timeout": "00:05:00"
      }
    }],
    "start": "2016-08-24T18:00:00Z",
    "end": "2016-08-24T19:00:00Z"
  }
}
```

NOTE
To map columns from the source dataset to columns from the sink dataset, see Mapping dataset columns in Azure Data Factory.

Next steps

See the following articles:

- To learn about key factors that impact performance of data movement (copy activity) in Data Factory, and various ways to optimize it, see the Copy activity performance and tuning guide.
- For step-by-step instructions for creating a pipeline with a copy activity, see the Copy activity tutorial.

Move data from on-premises HDFS using Azure Data Factory

5/22/2017 • 13 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises HDFS. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from HDFS to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from an on-premises HDFS to other data stores, not moving data from other data stores to an on-premises HDFS.

NOTE
Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline.

Enabling connectivity

The Data Factory service supports connecting to on-premises HDFS using the Data Management Gateway.
See the moving data between on-premises locations and cloud article to learn about the Data Management Gateway and for step-by-step instructions on setting up the gateway. Use the gateway to connect to HDFS even if it is hosted in an Azure IaaS VM.

NOTE
Make sure the Data Management Gateway can access all of the [name node server]:[name node port] and [data node servers]:[data node port] endpoints of the Hadoop cluster. The default [name node port] is 50070, and the default [data node port] is 50075.

While you can install the gateway on the same on-premises machine or Azure IaaS VM as the HDFS, we recommend that you install the gateway on a separate machine/Azure IaaS VM. Having the gateway on a separate machine reduces resource contention and improves performance. When you install the gateway on a separate machine, the machine should be able to access the machine with the HDFS.

Getting started
You can create a pipeline with a copy activity that moves data from an HDFS source by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
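The port reachability requirement noted above (name node on 50070, data nodes on 50075 by default) can be checked from the gateway machine with a short script before configuring the linked service. A sketch with hypothetical host names — substitute your cluster's actual name node and data node addresses:

```python
import socket

NAME_NODE_PORT = 50070   # default WebHDFS name node port
DATA_NODE_PORT = 50075   # default WebHDFS data node port

def can_reach(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical node names; list every name node and data node of the cluster.
nodes = [("namenode.example.com", NAME_NODE_PORT),
         ("datanode1.example.com", DATA_NODE_PORT)]
unreachable = [(host, port) for host, port in nodes if not can_reach(host, port)]
```

Any entry left in `unreachable` indicates a firewall or DNS issue to resolve before the gateway can copy from that node.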
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an HDFS data store, see the JSON example: Copy data from on-premises HDFS to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to HDFS:

Linked service properties
A linked service links a data store to a data factory. You create a linked service of type Hdfs to link an on-premises HDFS to your data factory. The following table describes the JSON elements specific to the HDFS linked service.

PROPERTY DESCRIPTION REQUIRED
type The type property must be set to: Hdfs. Yes
url URL to the HDFS. Yes
authenticationType Anonymous, or Windows. To use Kerberos authentication for the HDFS connector, refer to the Use Kerberos authentication for HDFS connector section to set up your on-premises environment accordingly. Yes
userName Username for Windows authentication. Yes (for Windows authentication)
password Password for Windows authentication. Yes (for Windows authentication)
gatewayName Name of the gateway that the Data Factory service should use to connect to the HDFS. Yes
encryptedCredential New-AzureRmDataFactoryEncryptValue output of the access credential.
No

Using Anonymous authentication
{ "name": "hdfs", "properties": { "type": "Hdfs", "typeProperties": { "authenticationType": "Anonymous", "userName": "hadoop", "url" : "http://<machine>:50070/webhdfs/v1/", "gatewayName": "mygateway" } } }

Using Windows authentication
{ "name": "hdfs", "properties": { "type": "Hdfs", "typeProperties": { "authenticationType": "Windows", "userName": "Administrator", "password": "password", "url" : "http://<machine>:50070/webhdfs/v1/", "gatewayName": "mygateway" } } }

Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type FileShare (which includes the HDFS dataset) has the following properties:

PROPERTY DESCRIPTION REQUIRED
folderPath Path to the folder. Example: myfolder. Use the escape character '\' for special characters in the string. For example: for folder\subfolder, specify folder\\subfolder, and for d:\samplefolder, specify d:\\samplefolder. You can combine this property with partitionedBy to have folder paths based on slice start/end date-times. Yes
fileName Specify the name of the file in the folderPath if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder. When fileName is not specified for an output dataset, the name of the generated file is in the following format: Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). No
partitionedBy partitionedBy can be used to specify a dynamic folderPath and fileName for time-series data.
Example: folderPath parameterized for every hour of data. No
format The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. No
compression Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. No

NOTE
fileName and fileFilter cannot be used simultaneously.

Using the partitionedBy property
As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time-series data with the partitionedBy property, Data Factory functions, and the system variables. To learn more about time-series datasets, scheduling, and slices, see the Creating Datasets, Scheduling & Execution, and Creating Pipelines articles.

Sample 1:
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy": [ { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } } ],

In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.
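The substitution in Sample 1 can be mimicked in a few lines of Python. This is a hypothetical re-implementation for illustration only (resolve_folder_path is not a Data Factory API); it maps the .NET-style date format yyyyMMddHH onto strftime:

```python
from datetime import datetime

# Map the .NET-style date formats used above to strftime equivalents.
FORMAT_MAP = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "yyyyMMddHH": "%Y%m%d%H"}

def resolve_folder_path(template, partitioned_by, slice_start):
    """Replace each {Name} token with SliceStart rendered in the format
    given by the matching partitionedBy entry."""
    for part in partitioned_by:
        fmt = FORMAT_MAP[part["value"]["format"]]
        template = template.replace("{%s}" % part["name"], slice_start.strftime(fmt))
    return template

partitioned_by = [{"name": "Slice",
                   "value": {"type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH"}}]
path = resolve_folder_path("wikidatagateway/wikisampledataout/{Slice}",
                           partitioned_by, datetime(2014, 10, 1, 3))
```

For the 03:00 slice, `path` evaluates to the first example folder shown above.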
Sample 2:
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } ],

In this example, the year, month, day, and time of SliceStart are extracted into separate variables that are used by the folderPath and fileName properties.

Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities, whereas the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. When the source is of type FileSystemSource, the following properties are available in the typeProperties section:

PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED
recursive Indicates whether the data is read recursively from the subfolders or only from the specified folder. True, False (default) No

Supported file and compression formats
See the File and compression formats in Azure Data Factory article for details.

JSON example: Copy data from on-premises HDFS to Azure Blob
This sample shows how to copy data from an on-premises HDFS to Azure Blob storage. However, data can be copied directly to any of the sinks stated here by using the Copy Activity in Azure Data Factory. The sample provides JSON definitions for the following Data Factory entities.
You can use these definitions to create a pipeline to copy data from HDFS to Azure Blob storage by using the Azure portal, Visual Studio, or Azure PowerShell.
1. A linked service of type OnPremisesHdfs.
2. A linked service of type AzureStorage.
3. An input dataset of type FileShare.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses FileSystemSource and BlobSink.

The sample copies data from an on-premises HDFS to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. As a first step, set up the Data Management Gateway per the instructions in the moving data between on-premises locations and cloud article.

HDFS linked service:
This example uses Windows authentication. See the HDFS linked service section for the different types of authentication you can use.
{ "name": "HDFSLinkedService", "properties": { "type": "Hdfs", "typeProperties": { "authenticationType": "Windows", "userName": "Administrator", "password": "password", "url" : "http://<machine>:50070/webhdfs/v1/", "gatewayName": "mygateway" } } }

Azure Storage linked service:
{ "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" } } }

HDFS input dataset:
This dataset refers to the HDFS folder DataTransfer/UnitTest/. The pipeline copies all the files in this folder to the destination. Setting "external": true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
{ "name": "InputDataset", "properties": { "type": "FileShare", "linkedServiceName": "HDFSLinkedService", "typeProperties": { "folderPath": "DataTransfer/UnitTest/" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 } } }

Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1).
The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
{ "name": "OutputDataset", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/hdfs/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } }

A copy activity in a pipeline with File System source and Blob sink:
The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource and the sink type is set to BlobSink.
{ "name": "pipeline", "properties": { "activities": [ { "name": "HdfsToBlobCopy", "inputs": [ {"name": "InputDataset"} ], "outputs": [ {"name": "OutputDataset"} ], "type": "Copy", "typeProperties": { "source": { "type": "FileSystemSource" }, "sink": { "type": "BlobSink" } }, "policy": { "concurrency": 1, "executionPriorityOrder": "NewestFirst", "retry": 1, "timeout": "00:05:00" } } ], "start": "2014-06-01T18:00:00Z", "end": "2014-06-01T19:00:00Z" } }

Use Kerberos authentication for HDFS connector
There are two options for setting up the on-premises environment to use Kerberos authentication with the HDFS connector.
Choose the one that better fits your case.
Option 1: Join the gateway machine to the Kerberos realm
Option 2: Enable mutual trust between the Windows domain and the Kerberos realm

Option 1: Join the gateway machine to the Kerberos realm
Requirement: The gateway machine needs to join the Kerberos realm and can't join any Windows domain.
How to configure:
On the gateway machine:
1. Run the Ksetup utility to configure the Kerberos KDC server and realm. The machine must be configured as a member of a workgroup, since a Kerberos realm is different from a Windows domain. This can be achieved by setting the Kerberos realm and adding a KDC server as follows. Replace REALM.COM with your own realm as needed.
C:> Ksetup /setdomain REALM.COM
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
Restart the machine after executing these two commands.
2. Verify the configuration with the Ksetup command. The output should look like:
C:> Ksetup
default realm = REALM.COM (external)
REALM.com: kdc = <your_kdc_server_address>
In Azure Data Factory: Configure the HDFS connector by using Windows authentication together with your Kerberos principal name and password to connect to the HDFS data source. See the HDFS linked service properties section for configuration details.

Option 2: Enable mutual trust between the Windows domain and the Kerberos realm
Requirement: The gateway machine must join a Windows domain. You need permission to update the domain controller's settings.
How to configure:
NOTE
Replace REALM.COM and AD.COM in the following tutorial with your own realm and domain controller as needed.
On the KDC server:
1. Edit the KDC configuration in the krb5.conf file to let the KDC trust the Windows domain, referring to the following configuration template. By default, the configuration is located at /etc/krb5.conf.
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log
[libdefaults]
 default_realm = REALM.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
[realms]
 REALM.COM = { kdc = node.REALM.COM admin_server = node.REALM.COM }
 AD.COM = { kdc = windc.ad.com admin_server = windc.ad.com }
[domain_realm]
 .REALM.COM = REALM.COM
 REALM.COM = REALM.COM
 .ad.com = AD.COM
 ad.com = AD.COM
[capaths]
 AD.COM = { REALM.COM = . }

Restart the KDC service after configuration.
2. Prepare a principal named krbtgt/[email protected] on the KDC server with the following command:
Kadmin> addprinc krbtgt/[email protected]
3. In the hadoop.security.auth_to_local HDFS service configuration property, add RULE:[1:$1@$0](.*@AD.COM)s/@.*// .
On the domain controller:
1. Run the following Ksetup commands to add a realm entry:
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM
2. Establish trust from the Windows domain to the Kerberos realm. [password] is the password for the principal krbtgt/[email protected].
C:> netdom trust REALM.COM /Domain:AD.COM /add /realm /passwordt:[password]
3. Select the encryption algorithm used in Kerberos.
a. Go to Server Manager > Group Policy Management > Domain > Group Policy Objects > Default or Active Domain Policy, and Edit.
b. In the Group Policy Management Editor popup window, go to Computer Configuration > Policies > Windows Settings > Security Settings > Local Policies > Security Options, and configure Network security: Configure encryption types allowed for Kerberos.
c. Select the encryption algorithm you want to use when connecting to the KDC. Commonly, you can simply select all the options.
d. Use the Ksetup command to specify the encryption algorithm to be used for the specific realm.
C:> ksetup /SetEncTypeAttr REALM.COM DES-CBC-CRC DES-CBC-MD5 RC4-HMAC-MD5 AES128-CTS-HMAC-SHA1-96 AES256-CTS-HMAC-SHA1-96
4.
Create the mapping between the domain account and the Kerberos principal, in order to use the Kerberos principal in the Windows domain.
a. Start Administrative Tools > Active Directory Users and Computers.
b. Configure advanced features by clicking View > Advanced Features.
c. Locate the account to which you want to create mappings, right-click it to view Name Mappings, and then click the Kerberos Names tab.
d. Add a principal from the realm.
On the gateway machine: Run the following Ksetup commands to add a realm entry.
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM
In Azure Data Factory: Configure the HDFS connector by using Windows authentication together with either your domain account or Kerberos principal to connect to the HDFS data source. See the HDFS linked service properties section for configuration details.

NOTE
To map columns from the source dataset to columns from the sink dataset, see Mapping dataset columns in Azure Data Factory.

Performance and Tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Move data from an HTTP source using Azure Data Factory
4/10/2017

This article outlines how to use the Copy Activity in Azure Data Factory to move data from an on-premises or cloud HTTP endpoint to a supported sink data store. This article builds on the data movement activities article, which presents a general overview of data movement with the copy activity and the list of data stores supported as sources/sinks. Data Factory currently supports only moving data from an HTTP source to other data stores, not moving data from other data stores to an HTTP destination.

Supported scenarios and authentication types
You can use this HTTP connector to retrieve data from both cloud and on-premises HTTP/S endpoints by using the HTTP GET or POST method.
The following authentication types are supported: Anonymous, Basic, Digest, Windows, and ClientCertificate. Note the difference between this connector and the Web table connector: the latter is used to extract table content from an HTML web page. When copying data from an on-premises HTTP endpoint, you need to install a Data Management Gateway in the on-premises environment/Azure VM. See the moving data between on-premises locations and cloud article to learn about the Data Management Gateway and for step-by-step instructions on setting up the gateway.

Getting started
You can create a pipeline with a copy activity that moves data from an HTTP source by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. For JSON samples to copy data from an HTTP source to Azure Blob storage, see the JSON examples section of this article.

Linked service properties
The following table describes the JSON elements specific to the HTTP linked service.

PROPERTY DESCRIPTION REQUIRED
type The type property must be set to: Http. Yes
url Base URL to the web server. Yes
authenticationType Specifies the authentication type. Allowed values are: Anonymous, Basic, Digest, Windows, ClientCertificate. Refer to the sections below this table for more properties and JSON samples for those authentication types, respectively. Yes
enableServerCertificateValidation Specify whether to enable server SSL certificate validation if the source is an HTTPS web server. No, default is true
gatewayName Name of the Data Management Gateway to connect to an on-premises HTTP source.
Yes, if copying data from an on-premises HTTP source.
encryptedCredential Encrypted credential to access the HTTP endpoint. Auto-generated when you configure the authentication information in the Copy Wizard or the ClickOnce popup dialog. No. Applies only when copying data from an on-premises HTTP server.
See Move data between on-premises sources and the cloud with Data Management Gateway for details about setting credentials for an on-premises HTTP connector data source.

Using Basic, Digest, or Windows authentication
Set authenticationType to Basic, Digest, or Windows, and specify the following properties besides the generic HTTP connector properties introduced above:

PROPERTY DESCRIPTION REQUIRED
username Username to access the HTTP endpoint. Yes
password Password for the user (username). Yes

Example: using Basic, Digest, or Windows authentication
{ "name": "HttpLinkedService", "properties": { "type": "Http", "typeProperties": { "authenticationType": "basic", "url" : "https://en.wikipedia.org/wiki/", "userName": "user name", "password": "password" } } }

Using ClientCertificate authentication
To use client-certificate authentication, set authenticationType to ClientCertificate, and specify the following properties besides the generic HTTP connector properties introduced above:

PROPERTY DESCRIPTION REQUIRED
embeddedCertData The Base64-encoded contents of the binary data of the Personal Information Exchange (PFX) file. Specify either embeddedCertData or certThumbprint.
certThumbprint The thumbprint of the certificate that was installed on your gateway machine's cert store. Applies only when copying data from an on-premises HTTP source. Specify either embeddedCertData or certThumbprint.
password Password associated with the certificate. No

If you use certThumbprint for authentication and the certificate is installed in the personal store of the local computer, you need to grant read permission to the gateway service:
1. Launch Microsoft Management Console (MMC).
Add the Certificates snap-in that targets the Local Computer.
2. Expand Certificates > Personal, and click Certificates.
3. Right-click the certificate from the personal store, and select All Tasks > Manage Private Keys...
4. On the Security tab, add the user account under which the Data Management Gateway Host Service is running, with read access to the certificate.

Example: using client certificate
This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate that is installed on the machine with Data Management Gateway installed.
{ "name": "HttpLinkedService", "properties": { "type": "Http", "typeProperties": { "authenticationType": "ClientCertificate", "url": "https://en.wikipedia.org/wiki/", "certThumbprint": "thumbprint of certificate", "gatewayName": "gateway name" } } }

Example: using client certificate in a file
This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate file on the machine with Data Management Gateway installed.
{ "name": "HttpLinkedService", "properties": { "type": "Http", "typeProperties": { "authenticationType": "ClientCertificate", "url": "https://en.wikipedia.org/wiki/", "embeddedCertData": "base64 encoded cert data", "password": "password of cert" } } }

Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type Http has the following properties:

PROPERTY DESCRIPTION REQUIRED
type Specifies the type of the dataset; it must be set to Http. Yes
relativeUrl A relative URL to the resource that contains the data.
When relativeUrl is not specified, only the URL specified in the linked service definition is used. No. To construct a dynamic URL, you can use Data Factory functions and system variables, e.g. "relativeUrl": "$$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart)".
requestMethod HTTP method. Allowed values are GET or POST. No. Default is GET.
additionalHeaders Additional HTTP request headers. No
requestBody Body for the HTTP request. No
format If you want to simply retrieve the data from the HTTP endpoint as-is without parsing it, skip the format settings. If you want to parse the HTTP response content during copy, the following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. No
compression Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. No

Example: using the GET (default) method
{ "name": "HttpSourceDataInput", "properties": { "type": "Http", "linkedServiceName": "HttpLinkedService", "typeProperties": { "relativeUrl": "XXX/test.xml", "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 } } }

Example: using the POST method
{ "name": "HttpSourceDataInput", "properties": { "type": "Http", "linkedServiceName": "HttpLinkedService", "typeProperties": { "relativeUrl": "/XXX/test.xml", "requestMethod": "Post", "requestBody": "body for POST HTTP request" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 } } }

Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article.
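To make the dataset settings above concrete, here is a sketch of the request an HttpSource-style copy would issue, combining the linked service's base url, the dataset's relativeUrl and additionalHeaders, and Basic credentials. build_http_request is a hypothetical helper for illustration, not part of any Data Factory SDK:

```python
import base64
import urllib.request

def build_http_request(base_url, relative_url, username=None, password=None,
                       additional_headers="", method="GET", request_body=None):
    """Assemble a urllib Request mirroring the HTTP connector settings:
    base url + relativeUrl, optional Basic credentials, extra headers."""
    url = base_url.rstrip("/") + "/" + relative_url.lstrip("/")
    data = request_body.encode("utf-8") if request_body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    if username is not None:
        token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
        req.add_header("Authorization", "Basic " + token)
    # additionalHeaders is a newline-separated "Name: value" string, as above.
    for line in additional_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            req.add_header(name.strip(), value.strip())
    return req

req = build_http_request("https://en.wikipedia.org/wiki/", "XXX/test.xml",
                         username="user name", password="password",
                         additional_headers="Connection: keep-alive\nUser-Agent: Mozilla/5.0\n")
```

No network call is made here; `urllib.request.urlopen(req)` would execute the request.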
Properties such as name, description, input and output tables, and policy are available for all types of activities, whereas the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. Currently, when the source in a copy activity is of type HttpSource, the following properties are supported:

PROPERTY DESCRIPTION REQUIRED
httpRequestTimeout The timeout (TimeSpan) for the HTTP request to get a response. It is the timeout to get a response, not the timeout to read response data. No. Default value: 00:01:40

Supported file and compression formats
See the File and compression formats in Azure Data Factory article for details.

JSON examples
The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from an HTTP source to Azure Blob storage. However, data can be copied directly from any of the sources to any of the sinks stated here by using the Copy Activity in Azure Data Factory.

Example: Copy data from an HTTP source to Azure Blob storage
The Data Factory solution for this sample contains the following Data Factory entities:
1. A linked service of type HTTP.
2. A linked service of type AzureStorage.
3. An input dataset of type Http.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses HttpSource and BlobSink.

The sample copies data from an HTTP source to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.

HTTP linked service
This example uses the HTTP linked service with anonymous authentication. See the HTTP linked service section for the different types of authentication you can use.
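The httpRequestTimeout default above is a .NET TimeSpan string. If you mirror the setting in your own tooling, it parses as hours:minutes:seconds; a small sketch (timespan_to_seconds is a hypothetical helper, not a Data Factory API):

```python
def timespan_to_seconds(timespan):
    """Convert an hh:mm:ss TimeSpan string such as '00:01:40' to seconds."""
    hours, minutes, seconds = (int(part) for part in timespan.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# The default httpRequestTimeout noted above works out to 100 seconds.
default_timeout = timespan_to_seconds("00:01:40")
```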
{ "name": "HttpLinkedService", "properties": { "type": "Http", "typeProperties": { "authenticationType": "Anonymous", "url" : "https://en.wikipedia.org/wiki/" } } } Azure Storage linked service { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } } HTTP input dataset Setting external to true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. { "name": "HttpSourceDataInput", "properties": { "type": "Http", "linkedServiceName": "HttpLinkedService", "typeProperties": { "relativeUrl": "$$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart)", "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 } } } Azure Blob output dataset Data is written to a new blob every hour (frequency: hour, interval: 1). { "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "adfgetstarted/Movies" }, "availability": { "frequency": "Hour", "interval": 1 } } } Pipeline with Copy activity The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to HttpSource and sink type is set to BlobSink. See HttpSource for the list of properties supported by the HttpSource. 
{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline with copy activity", "activities":[ { "name": "HttpSourceToAzureBlob", "description": "Copy from an HTTP source to an Azure blob", "type": "Copy", "inputs": [ { "name": "HttpSourceDataInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "typeProperties": { "source": { "type": "HttpSource" }, "sink": { "type": "BlobSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } }

NOTE
To map columns from the source dataset to columns from the sink dataset, see Mapping dataset columns in Azure Data Factory.

Performance and Tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Move data from MongoDB using Azure Data Factory
5/22/2017

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises MongoDB database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an on-premises MongoDB data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from a MongoDB data store to other data stores, not moving data from other data stores to a MongoDB data store.

Prerequisites
For the Azure Data Factory service to be able to connect to your on-premises MongoDB database, you must install the following components. Supported MongoDB versions are: 2.4, 2.6, 3.0, and 3.2.
Data Management Gateway, on the same machine that hosts the database or on a separate machine to avoid competing for resources with the database. Data Management Gateway is software that connects on-premises data sources to cloud services in a secure and managed way. See the Data Management Gateway article for details about the gateway. See the Move data from on-premises to cloud article for step-by-step instructions on setting up the gateway and a data pipeline to move data. When you install the gateway, it automatically installs a Microsoft MongoDB ODBC driver used to connect to MongoDB.

NOTE
You need to use the gateway to connect to MongoDB even if it is hosted in an Azure IaaS VM. If you are trying to connect to an instance of MongoDB hosted in the cloud, you can also install the gateway instance in the IaaS VM.

Getting started
You can create a pipeline with a copy activity that moves data from an on-premises MongoDB data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you.
When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises MongoDB data store, see the JSON example: Copy data from MongoDB to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to a MongoDB source:

Linked service properties
The following list describes the JSON elements specific to the OnPremisesMongoDb linked service.
type: The type property must be set to OnPremisesMongoDb. (Required: Yes)
server: IP address or host name of the MongoDB server. (Required: Yes)
port: TCP port that the MongoDB server uses to listen for client connections. (Optional; default value: 27017)
authenticationType: Basic or Anonymous. (Required: Yes)
username: User account to access MongoDB. (Required if basic authentication is used)
password: Password for the user. (Required if basic authentication is used)
authSource: Name of the MongoDB database that you want to use to check your credentials for authentication. (Optional for basic authentication; default: the admin account and the database specified with the databaseName property)
databaseName: Name of the MongoDB database that you want to access. (Required: Yes)
gatewayName: Name of the gateway that accesses the data store. (Required: Yes)
encryptedCredential: Credential encrypted by the gateway. (Optional)

Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store.
The typeProperties section for a dataset of type MongoDbCollection has the following property:
collectionName: Name of the collection in the MongoDB database. (Required: Yes)

Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities. Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. When the source is of type MongoDbSource, the following property is available in the typeProperties section:
query: Use the custom query to read data. Allowed values: a SQL-92 query string, for example: select * from MyTable. (Not required if collectionName of the dataset is specified)

JSON example: Copy data from MongoDB to Azure Blob
This example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. It shows how to copy data from an on-premises MongoDB database to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. The sample has the following data factory entities:
1. A linked service of type OnPremisesMongoDb.
2. A linked service of type AzureStorage.
3. An input dataset of type MongoDbCollection.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses MongoDbSource and BlobSink.
The sample copies data from a query result in a MongoDB database to a blob every hour. The JSON properties used in these samples are described in the sections following the samples. As a first step, set up the Data Management Gateway per the instructions in the Data Management Gateway article.
MongoDB linked service: { "name": "OnPremisesMongoDbLinkedService", "properties": { "type": "OnPremisesMongoDb", "typeProperties": { "authenticationType": "<Basic or Anonymous>", "server": "<IP address or host name of the MongoDB server>", "port": "<TCP port that the MongoDB server uses to listen for client connections>", "username": "<username>", "password": "<password>", "authSource": "<database that you want to use to check your credentials for authentication>", "databaseName": "<database name>", "gatewayName": "<mygateway>" } } }
Azure Storage linked service: { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" } } }
MongoDB input dataset: Setting "external": true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. { "name": "MongoDbInputDataset", "properties": { "type": "MongoDbCollection", "linkedServiceName": "OnPremisesMongoDbLinkedService", "typeProperties": { "collectionName": "<Collection name>" }, "availability": { "frequency": "Hour", "interval": 1 }, "external": true } }
Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
{ "name": "AzureBlobOutputDataSet", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/frommongodb/yearno={Year}/monthno={Month}/dayno={Day}/hourno= {Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } } Copy activity in a pipeline with MongoDB source and Blob sink: The pipeline contains a Copy Activity that is configured to use the above input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to MongoDbSource and sink type is set to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy. 
{ "name": "CopyMongoDBToBlob", "properties": { "description": "pipeline for copy activity", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "MongoDbSource", "query": "$$Text.Format('select * from MyTable where LastModifiedDate >= {{ts\'{0:yyyy-MM-dd HH:mm:ss}\'}} AND LastModifiedDate < {{ts\'{1:yyyy-MM-dd HH:mm:ss}\'}}', WindowStart, WindowEnd)" }, "sink": { "type": "BlobSink", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" } }, "inputs": [ { "name": "MongoDbInputDataset" } ], "outputs": [ { "name": "AzureBlobOutputDataSet" } ], "policy": { "timeout": "01:00:00", "concurrency": 1 }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "MongoDBToAzureBlob" } ], "start": "2016-06-01T18:00:00Z", "end": "2016-06-01T19:00:00Z" } } Schema by Data Factory Azure Data Factory service infers schema from a MongoDB collection by using the latest 100 documents in the collection. If these 100 documents do not contain full schema, some columns may be ignored during the copy operation. Type mapping for MongoDB As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: 1. Convert from native source types to .NET type 2. Convert from .NET type to native sink type When moving data to MongoDB the following mappings are used from MongoDB types to .NET types. MONGODB TYPE .NET FRAMEWORK TYPE Binary Byte[] Boolean Boolean Date DateTime NumberDouble Double NumberInt Int32 NumberLong Int64 ObjectID String String String UUID Guid Object Renormalized into flatten columns with “_” as nested separator NOTE To learn about support for arrays using virtual tables, refer to Support for complex types using virtual tables section below. 
Currently, the following MongoDB data types are not supported: DBPointer, JavaScript, Max/Min key, Regular Expression, Symbol, Timestamp, Undefined.

Support for complex types using virtual tables
Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your MongoDB database. For complex types such as arrays or objects with different types across the documents, the driver re-normalizes data into corresponding virtual tables. Specifically, if a table contains such columns, the driver generates the following virtual tables:
A base table, which contains the same data as the real table except for the complex type columns. The base table uses the same name as the real table that it represents.
A virtual table for each complex type column, which expands the nested data. The virtual tables are named using the name of the real table, a separator "_", and the name of the array or object.
Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. See the Example section below for details. You can access the content of MongoDB arrays by querying and joining the virtual tables. You can use the Copy Wizard to intuitively view the list of tables in the MongoDB database, including the virtual tables, and preview the data inside. You can also construct a query in the Copy Wizard and validate it to see the result.

Example
For example, "ExampleTable" below is a MongoDB table that has one column with an array of objects in each cell (Invoices) and one column with an array of scalar types (Ratings).

_ID | CUSTOMER NAME | INVOICES | SERVICE LEVEL | RATINGS
1111 | ABC | [{invoice_id:"123", item:"toaster", price:"456", discount:"0.2"}, {invoice_id:"124", item:"oven", price:"1235", discount:"0.2"}] | Silver | [5,6]
2222 | XYZ | [{invoice_id:"135", item:"fridge", price:"12543", discount:"0.0"}] | Gold | [1,2]

The driver would generate multiple virtual tables to represent this single table.
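The virtual-table generation described above can be sketched in Python; the base-table and index-column conventions below mirror this example, but the real ODBC driver may differ in details such as column-name casing:

```python
def virtual_tables(table_name, rows, key="_id"):
    """Split array-valued columns out of `rows` into per-array virtual tables
    (a sketch of the driver's re-normalization, not its actual code)."""
    base, virtual = [], {}
    for row in rows:
        base_row = {}
        for column, value in row.items():
            if isinstance(value, list):
                # Virtual table named <real table>_<array column>.
                vname = f"{table_name}_{column}"
                vrows = virtual.setdefault(vname, [])
                for index, element in enumerate(value):
                    # Each virtual row keeps the original key plus the
                    # position of the element within the array.
                    vrow = {key: row[key], f"{vname}_dim1_idx": index}
                    if isinstance(element, dict):  # array of objects
                        vrow.update(element)
                    else:                          # array of scalars
                        vrow[vname] = element
                    vrows.append(vrow)
            else:
                base_row[column] = value
        base.append(base_row)
    return base, virtual
```

Running this over the "ExampleTable" rows produces a base table without the array columns, plus one expanded table per array, matching the shape of the tables shown next.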
The first virtual table is the base table, named "ExampleTable", shown below. The base table contains all the data of the original table, but the data from the arrays has been omitted and is expanded in the virtual tables.

_ID | CUSTOMER NAME | SERVICE LEVEL
1111 | ABC | Silver
2222 | XYZ | Gold

The following tables show the virtual tables that represent the original arrays in the example. These tables contain:
A reference back to the original primary key column corresponding to the row of the original array (via the _id column)
An indication of the position of the data within the original array
The expanded data for each element within the array

Table "ExampleTable_Invoices":

_ID | EXAMPLETABLE_INVOICES_DIM1_IDX | INVOICE_ID | ITEM | PRICE | DISCOUNT
1111 | 0 | 123 | toaster | 456 | 0.2
1111 | 1 | 124 | oven | 1235 | 0.2
2222 | 0 | 135 | fridge | 12543 | 0.0

Table "ExampleTable_Ratings":

_ID | EXAMPLETABLE_RATINGS_DIM1_IDX | EXAMPLETABLE_RATINGS
1111 | 0 | 5
1111 | 1 | 6
2222 | 0 | 1
2222 | 1 | 2

Map source to sink columns
To learn about mapping columns in the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory.

Repeatable read from relational sources
When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.

Performance and Tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory and the various ways to optimize it.
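The $$Text.Format expressions used in the pipeline definitions above expand WindowStart and WindowEnd for each hourly slice. A rough Python analogue of that expansion (Python strftime codes standing in for the .NET format strings, and the ODBC {ts'...'} escape omitted for clarity; this illustrates the behavior, it is not service code):

```python
from datetime import datetime

def slice_query(template, window_start, window_end):
    """Expand {0} (WindowStart) and {1} (WindowEnd) in a query template,
    as $$Text.Format does for each slice."""
    return template.format(window_start, window_end)

# One hourly slice: 18:00-19:00 on 2016-06-01.
query = slice_query(
    "select * from MyTable where LastModifiedDate >= '{0:%Y-%m-%d %H:%M:%S}' "
    "AND LastModifiedDate < '{1:%Y-%m-%d %H:%M:%S}'",
    datetime(2016, 6, 1, 18, 0, 0),
    datetime(2016, 6, 1, 19, 0, 0),
)
```

Each slice therefore reads only the rows whose timestamps fall inside its own window, which is what makes the hourly copies repeatable.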
Next Steps
See the Move data between on-premises and cloud article for step-by-step instructions on creating a data pipeline that moves data from an on-premises data store to an Azure data store.

Move data from MySQL using Azure Data Factory
5/15/2017 • 8 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises MySQL database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an on-premises MySQL data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from a MySQL data store to other data stores, not moving data from other data stores to a MySQL data store.

Prerequisites
The Data Factory service supports connecting to on-premises MySQL sources using the Data Management Gateway. See the Moving data between on-premises locations and cloud article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. The gateway is required even if the MySQL database is hosted in an Azure IaaS virtual machine (VM). You can install the gateway on the same VM as the data store or on a different VM, as long as the gateway can connect to the database.

NOTE See Troubleshoot gateway issues for tips on troubleshooting connection/gateway-related issues.

Supported versions and installation
For Data Management Gateway to connect to the MySQL database, you need to install MySQL Connector/Net for Microsoft Windows (version 6.6.5 or above) on the same system as the Data Management Gateway. MySQL version 5.1 and above is supported.

TIP If you hit the error "Authentication failed because the remote party has closed the transport stream.", consider upgrading MySQL Connector/Net to a higher version.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises MySQL data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises MySQL data store, see the JSON example: Copy data from MySQL to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to a MySQL data store:

Linked service properties
The following list describes the JSON elements specific to the MySQL linked service.
type: The type property must be set to OnPremisesMySql. (Required: Yes)
server: Name of the MySQL server. (Required: Yes)
database: Name of the MySQL database. (Required: Yes)
schema: Name of the schema in the database. (Required: No)
authenticationType: Type of authentication used to connect to the MySQL database. Possible values: Basic. (Required: Yes)
username: User name to connect to the MySQL database. (Required: Yes)
password: Password for the user account you specified. (Required: Yes)
gatewayName: Name of the gateway that the Data Factory service should use to connect to the on-premises MySQL database. (Required: Yes)

Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type RelationalTable (which includes the MySQL dataset) has the following property:
tableName: Name of the table in the MySQL database instance that the linked service refers to. (Not required if query of RelationalSource is specified)

Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities, whereas properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. When the source in a copy activity is of type RelationalSource (which includes MySQL), the following property is available in the typeProperties section:
query: Use the custom query to read data. Allowed values: a SQL query string, for example: select * from MyTable. (Not required if tableName of the dataset is specified)

JSON example: Copy data from MySQL to Azure Blob
This example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. It shows how to copy data from an on-premises MySQL database to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory.
IMPORTANT This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See the Moving data between on-premises locations and cloud article for step-by-step instructions.
The sample has the following data factory entities:
1. A linked service of type OnPremisesMySql.
2. A linked service of type AzureStorage.
3. An input dataset of type RelationalTable.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses RelationalSource and BlobSink.
The sample copies data from a query result in a MySQL database to a blob hourly. The JSON properties used in these samples are described in the sections following the samples. As a first step, set up the Data Management Gateway. The instructions are in the Moving data between on-premises locations and cloud article.
MySQL linked service: { "name": "OnPremMySqlLinkedService", "properties": { "type": "OnPremisesMySql", "typeProperties": { "server": "<server name>", "database": "<database name>", "schema": "<schema name>", "authenticationType": "<authentication type>", "userName": "<user name>", "password": "<password>", "gatewayName": "<gateway>" } } }
Azure Storage linked service: { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" } } }
MySQL input dataset: The sample assumes you have created a table "MyTable" in MySQL that contains a column called "timestampcolumn" for time-series data.
Setting "external": true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. { "name": "MySqlDataSet", "properties": { "published": false, "type": "RelationalTable", "linkedServiceName": "OnPremMySqlLinkedService", "typeProperties": {}, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } }
Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time. { "name": "AzureBlobMySqlDataSet", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/mysql/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } }
Pipeline with Copy activity: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and the sink type is set to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy.
{ "name": "CopyMySqlToBlob", "properties": { "description": "pipeline for copy activity", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "RelationalSource", "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyyMM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)" }, "sink": { "type": "BlobSink", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" } }, "inputs": [ { "name": "MySqlDataSet" } ], "outputs": [ { "name": "AzureBlobMySqlDataSet" } ], "policy": { "timeout": "01:00:00", "concurrency": 1 }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "MySqlToBlob" } ], "start": "2014-06-01T18:00:00Z", "end": "2014-06-01T19:00:00Z" } } Type mapping for MySQL As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach: 1. Convert from native source types to .NET type 2. Convert from .NET type to native sink type When moving data to MySQL, the following mappings are used from MySQL types to .NET types. 
MYSQL DATABASE TYPE -> .NET FRAMEWORK TYPE
bigint unsigned -> Decimal
bigint -> Int64
bit -> Decimal
blob -> Byte[]
bool -> Boolean
char -> String
date -> Datetime
datetime -> Datetime
decimal -> Decimal
double precision -> Double
double -> Double
enum -> String
float -> Single
int unsigned -> Int64
int -> Int32
integer unsigned -> Int64
integer -> Int32
long varbinary -> Byte[]
long varchar -> String
longblob -> Byte[]
longtext -> String
mediumblob -> Byte[]
mediumint unsigned -> Int64
mediumint -> Int32
mediumtext -> String
numeric -> Decimal
real -> Double
set -> String
smallint unsigned -> Int32
smallint -> Int16
text -> String
time -> TimeSpan
timestamp -> Datetime
tinyblob -> Byte[]
tinyint unsigned -> Int16
tinyint -> Int16
tinytext -> String
varchar -> String
year -> Int

Map source to sink columns
To learn about mapping columns in the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory.

Repeatable read from relational sources
When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.

Performance and Tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory and the various ways to optimize it.

Move data from an OData source using Azure Data Factory
6/5/2017 • 9 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data from an OData source. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an OData source to any supported sink data store.
For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from an OData source to other data stores, not moving data from other data stores to an OData source.

Supported versions and authentication types
This OData connector supports OData versions 3.0 and 4.0, and you can copy data from both cloud OData and on-premises OData sources. For the latter, you need to install the Data Management Gateway. See the Move data between on-premises and cloud article for details about Data Management Gateway. The following authentication types are supported:
To access a cloud OData feed, you can use anonymous, basic (user name and password), or Azure Active Directory based OAuth authentication.
To access an on-premises OData feed, you can use anonymous, basic (user name and password), or Windows authentication.

Getting started
You can create a pipeline with a copy activity that moves data from an OData source by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an OData source, see the JSON example: Copy data from OData source to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to an OData source:

Linked service properties
The following list describes the JSON elements specific to the OData linked service.
type: The type property must be set to OData. (Required: Yes)
url: URL of the OData service. (Required: Yes)
authenticationType: Type of authentication used to connect to the OData source. For cloud OData, possible values are Anonymous, Basic, and OAuth (note that Azure Data Factory currently supports only Azure Active Directory based OAuth). For on-premises OData, possible values are Anonymous, Basic, and Windows. (Required: Yes)
username: Specify the user name if you are using Basic authentication. (Required only for Basic authentication)
password: Specify the password for the user account you specified for the username. (Required only for Basic authentication)
authorizedCredential: If you are using OAuth, click the Authorize button in the Data Factory Copy Wizard or Editor and enter your credentials; the value of this property is then auto-generated. (Required only for OAuth authentication)
gatewayName: Name of the gateway that the Data Factory service should use to connect to the on-premises OData service. Specify only if you are copying data from an on-premises OData source. (Required: No)

Using Basic authentication
{ "name": "inputLinkedService", "properties": { "type": "OData", "typeProperties": { "url": "http://services.odata.org/OData/OData.svc", "authenticationType": "Basic", "username": "username", "password": "password" } } }
Using Anonymous authentication
{ "name": "ODataLinkedService", "properties": { "type": "OData", "typeProperties": { "url": "http://services.odata.org/OData/OData.svc", "authenticationType": "Anonymous" } } }
Using Windows authentication to access an on-premises OData source
{ "name": "inputLinkedService", "properties": { "type": "OData", "typeProperties": { "url": "<endpoint of on-premises OData source, e.g. Dynamics CRM>", "authenticationType": "Windows", "username": "domain\\user", "password": "password", "gatewayName": "mygateway" } } }
Using OAuth authentication to access a cloud OData source
{ "name": "inputLinkedService", "properties": { "type": "OData", "typeProperties": { "url": "<endpoint of cloud OData source, e.g. https://<tenant>.crm.dynamics.com/XRMServices/2011/OrganizationData.svc>", "authenticationType": "OAuth", "authorizedCredential": "<auto-generated by clicking the Authorize button on the UI>" } } }

Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type ODataResource (which includes the OData dataset) has the following property:
path: Path to the OData resource. (Required: No)

Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article.
Properties such as name, description, input and output tables, and policy are available for all types of activities. Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. When the source is of type RelationalSource (which includes OData), the following property is available in the typeProperties section:
query: Use the custom query to read data. Example: "?$select=Name, Description&$top=5". (Required: No)

Type mapping for OData
As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:
1. Convert from native source types to .NET types
2. Convert from .NET types to native sink types
When moving data from OData, the following mappings are used from OData types to .NET types.

ODATA DATA TYPE -> .NET TYPE
Edm.Binary -> Byte[]
Edm.Boolean -> Bool
Edm.Byte -> Byte[]
Edm.DateTime -> DateTime
Edm.Decimal -> Decimal
Edm.Double -> Double
Edm.Single -> Single
Edm.Guid -> Guid
Edm.Int16 -> Int16
Edm.Int32 -> Int32
Edm.Int64 -> Int64
Edm.SByte -> Int16
Edm.String -> String
Edm.Time -> TimeSpan
Edm.DateTimeOffset -> DateTimeOffset

NOTE OData complex data types (e.g., Object) are not supported.

JSON example: Copy data from OData source to Azure Blob
This example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from an OData source to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. The sample has the following Data Factory entities:
1. A linked service of type OData.
2. A linked service of type AzureStorage.
3. An input dataset of type ODataResource.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses RelationalSource and BlobSink.
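A query value such as "?$select=Name, Description&$top=5" is appended to the resource URL when the feed is read. A hedged Python sketch of assembling such a URL from the $select and $top system query options (the service root and entity set here are illustrative):

```python
from urllib.parse import urlencode

def odata_url(service_root, entity_set, select=None, top=None):
    """Build an OData resource URL from optional $select/$top query options.
    A sketch of the URL convention, not Data Factory's own request code."""
    params = {}
    if select:
        params["$select"] = ",".join(select)
    if top is not None:
        params["$top"] = str(top)
    # Keep '$' and ',' unescaped so the options stay readable.
    query = f"?{urlencode(params, safe='$,')}" if params else ""
    return f"{service_root.rstrip('/')}/{entity_set}{query}"

url = odata_url("http://services.odata.org/OData/OData.svc",
                "Products", select=["Name", "Description"], top=5)
# url -> "http://services.odata.org/OData/OData.svc/Products?$select=Name,Description&$top=5"
```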
The sample copies data queried from an OData source to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. OData linked service: This example uses Anonymous authentication. See the OData linked service section for different types of authentication you can use. { "name": "ODataLinkedService", "properties": { "type": "OData", "typeProperties": { "url": "http://services.odata.org/OData/OData.svc", "authenticationType": "Anonymous" } } } Azure Storage linked service: { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" } } } OData input dataset: Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. { "name": "ODataDataset", "properties": { "type": "ODataResource", "typeProperties": { "path": "Products" }, "linkedServiceName": "ODataLinkedService", "structure": [], "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } Specifying path in the dataset definition is optional. Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
{ "name": "AzureBlobODataDataSet", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/odata/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } } Copy activity in a pipeline with OData source and Blob sink: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and the sink type is set to BlobSink. The OData query specified for the query property selects the Name and Description of the top five records from the OData source. { "name": "CopyODataToBlob", "properties": { "description": "pipeline for copy activity", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "RelationalSource", "query": "?$select=Name, Description&$top=5" }, "sink": { "type": "BlobSink", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" } }, "inputs": [ { "name": "ODataDataset" } ], "outputs": [ { "name": "AzureBlobODataDataSet" } ], "policy": { "timeout": "01:00:00", "concurrency": 1 }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "ODataToBlob" } ], "start": "2017-02-01T18:00:00Z", "end": "2017-02-03T19:00:00Z" } } Specifying query in the pipeline definition is optional.
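As the optional path and query settings above suggest, the service builds its request from three pieces: the linked service url (required), the dataset path (optional), and the pipeline query (optional). A minimal sketch of that composition, using a hypothetical compose_odata_url helper (an illustration only, not part of any Data Factory SDK):

```python
# Hypothetical helper (assumption for illustration, not a Data Factory API):
# linked service url (required) + dataset path (optional) + pipeline query (optional).
def compose_odata_url(service_url, path=None, query=None):
    url = service_url.rstrip("/")
    if path:
        url += "/" + path.strip("/")
    if query:
        url += query  # the query option string already starts with '?'
    return url

print(compose_odata_url(
    "http://services.odata.org/OData/OData.svc",
    path="Products",
    query="?$select=Name, Description&$top=5",
))
# http://services.odata.org/OData/OData.svc/Products?$select=Name, Description&$top=5
```

Omitting path or query simply leaves that piece out of the resulting URL.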
The URL that the Data Factory service uses to retrieve data is: URL specified in the linked service (required) + path specified in the dataset (optional) + query in the pipeline (optional). Type mapping for OData As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: 1. Convert from native source types to .NET type 2. Convert from .NET type to native sink type When moving data from OData data stores, OData data types are mapped to .NET types. Map source to sink columns To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in Azure Data Factory. Repeatable read from relational sources When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See Repeatable read from relational sources. Performance and Tuning See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. Move data From ODBC data stores using Azure Data Factory 5/16/2017 • 10 min to read • Edit Online This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises ODBC data store. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an ODBC data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. 
Data Factory currently supports only moving data from an ODBC data store to other data stores, not moving data from other data stores to an ODBC data store. Enabling connectivity The Data Factory service supports connecting to on-premises ODBC sources using the Data Management Gateway. See the moving data between on-premises locations and cloud article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. Use the gateway to connect to an ODBC data store even if it is hosted in an Azure IaaS VM. You can install the gateway on the same on-premises machine or Azure VM as the ODBC data store. However, we recommend that you install the gateway on a separate machine/Azure IaaS VM to avoid resource contention and for better performance. When you install the gateway on a separate machine, that machine should be able to access the machine with the ODBC data store. Apart from the Data Management Gateway, you also need to install the ODBC driver for the data store on the gateway machine. NOTE See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues. Getting started You can create a pipeline with a copy activity that moves data from an ODBC data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2.
Create datasets to represent input and output data for the copy operation. 3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an ODBC data store, see the JSON example: Copy data from ODBC data store to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to an ODBC data store: Linked service properties The following table provides descriptions for JSON elements specific to the ODBC linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: OnPremisesOdbc Yes connectionString The non-access credential portion of the connection string and an optional encrypted credential. See examples in the following sections. Yes credential The access credential portion of the connection string specified in driver-specific property-value format. Example: "Uid=;Pwd=;RefreshToken=;". No authenticationType Type of authentication used to connect to the ODBC data store. Possible values are: Anonymous and Basic. Yes username Specify the user name if you are using Basic authentication. No password Specify the password for the user account you specified for the username. No gatewayName Name of the gateway that the Data Factory service should use to connect to the ODBC data store.
Yes Using Basic authentication { "name": "odbc", "properties": { "type": "OnPremisesOdbc", "typeProperties": { "authenticationType": "Basic", "connectionString": "Driver={SQL Server};Server=Server.database.windows.net;Database=TestDatabase;", "userName": "username", "password": "password", "gatewayName": "mygateway" } } } Using Basic authentication with encrypted credentials You can encrypt the credentials using the New-AzureRMDataFactoryEncryptValue cmdlet (1.0 version of Azure PowerShell) or New-AzureDataFactoryEncryptValue (0.9 or earlier versions of Azure PowerShell). { "name": "odbc", "properties": { "type": "OnPremisesOdbc", "typeProperties": { "authenticationType": "Basic", "connectionString": "Driver={SQL Server};Server=myserver.database.windows.net;Database=TestDatabase;EncryptedCredential=eyJDb25uZWN0...........................", "gatewayName": "mygateway" } } } Using Anonymous authentication { "name": "odbc", "properties": { "type": "OnPremisesOdbc", "typeProperties": { "authenticationType": "Anonymous", "connectionString": "Driver={SQL Server};Server={servername}.database.windows.net;Database=TestDatabase;", "credential": "UID={uid};PWD={pwd}", "gatewayName": "mygateway" } } } Dataset properties For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type RelationalTable (which includes the ODBC dataset) has the following properties: PROPERTY DESCRIPTION REQUIRED tableName Name of the table in the ODBC data store. Yes Copy activity properties For a full list of sections & properties available for defining activities, see the Creating Pipelines article.
Properties such as name, description, input and output tables, and policies are available for all types of activities. Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. In Copy activity, when the source is of type RelationalSource (which includes ODBC), the following properties are available in the typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED query Use the custom query to read data. SQL query string. For example: select * from MyTable. Yes JSON example: Copy data from ODBC data store to Azure Blob This example provides JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. It shows how to copy data from an ODBC source to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. The sample has the following data factory entities: 1. A linked service of type OnPremisesOdbc. 2. A linked service of type AzureStorage. 3. An input dataset of type RelationalTable. 4. An output dataset of type AzureBlob. 5. A pipeline with Copy Activity that uses RelationalSource and BlobSink. The sample copies data from a query result in an ODBC data store to a blob every hour. The JSON properties used in these samples are described in sections following the samples. As a first step, set up the data management gateway. The instructions are in the moving data between on-premises locations and cloud article. ODBC linked service This example uses Basic authentication. See the ODBC linked service section for different types of authentication you can use.
{ "name": "OnPremOdbcLinkedService", "properties": { "type": "OnPremisesOdbc", "typeProperties": { "authenticationType": "Basic", "connectionString": "Driver={SQL Server};Server=Server.database.windows.net;Database=TestDatabase;", "userName": "username", "password": "password", "gatewayName": "mygateway" } } } Azure Storage linked service { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" } } } ODBC input dataset The sample assumes you have created a table "MyTable" in an ODBC database and that it contains a column called "timestampcolumn" for time series data. Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. { "name": "ODBCDataSet", "properties": { "published": false, "type": "RelationalTable", "linkedServiceName": "OnPremOdbcLinkedService", "typeProperties": { "tableName": "MyTable" }, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure Blob output dataset Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
{ "name": "AzureBlobOdbcDataSet", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/odbc/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } } Copy activity in a pipeline with ODBC source (RelationalSource) and Blob sink (BlobSink) The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and sink type is set to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy. 
{ "name": "CopyODBCToBlob", "properties": { "description": "pipeline for copy activity", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "RelationalSource", "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)" }, "sink": { "type": "BlobSink", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" } }, "inputs": [ { "name": "ODBCDataSet" } ], "outputs": [ { "name": "AzureBlobOdbcDataSet" } ], "policy": { "timeout": "01:00:00", "concurrency": 1 }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "OdbcToBlob" } ], "start": "2016-06-01T18:00:00Z", "end": "2016-06-01T19:00:00Z" } } Type mapping for ODBC As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach: 1. Convert from native source types to .NET type 2. Convert from .NET type to native sink type When moving data from ODBC data stores, ODBC data types are mapped to .NET types as mentioned in the ODBC Data Type Mappings topic. Map source to sink columns To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in Azure Data Factory. Repeatable read from relational sources When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See Repeatable read from relational sources.
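The $$Text.Format expression in the pipeline query above is a .NET-style composite format over the WindowStart and WindowEnd system variables. A sketch of what it expands to for the sample's one-hour window, using Python's strftime as a stand-in (an illustration of the expansion, not the service's implementation):

```python
from datetime import datetime

# Sketch (assumption): expanding the window query for the sample's
# 2016-06-01 18:00-19:00 slice, with strftime standing in for the
# .NET {0:yyyy-MM-ddTHH:mm:ss} format specifier.
window_start = datetime(2016, 6, 1, 18, 0, 0)
window_end = datetime(2016, 6, 1, 19, 0, 0)
query = "select * from MyTable where timestamp >= '{0}' AND timestamp < '{1}'".format(
    window_start.strftime("%Y-%m-%dT%H:%M:%S"),
    window_end.strftime("%Y-%m-%dT%H:%M:%S"),
)
print(query)
# select * from MyTable where timestamp >= '2016-06-01T18:00:00' AND timestamp < '2016-06-01T19:00:00'
```

Because each slice gets its own window, the expanded query reads exactly one hour of data per run, which is what makes the slice repeatable.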
GE Historian store You create an ODBC linked service to link a GE Proficy Historian (now GE Historian) data store to an Azure data factory as shown in the following example: { "name": "HistorianLinkedService", "properties": { "type": "OnPremisesOdbc", "typeProperties": { "connectionString": "DSN=<name of the GE Historian store>", "gatewayName": "<gateway name>", "authenticationType": "Basic", "userName": "<user name>", "password": "<password>" } } } Install Data Management Gateway on an on-premises machine and register the gateway with the portal. The gateway installed on your on-premises computer uses the ODBC driver for GE Historian to connect to the GE Historian data store. Therefore, install the driver if it is not already installed on the gateway machine. See Enabling connectivity section for details. Before you use the GE Historian store in a Data Factory solution, verify whether the gateway can connect to the data store using instructions in the next section. Read the article from the beginning for a detailed overview of using ODBC data stores as source data stores in a copy operation. Troubleshoot connectivity issues To troubleshoot connection issues, use the Diagnostics tab of Data Management Gateway Configuration Manager. 1. Launch Data Management Gateway Configuration Manager. You can either run "C:\Program Files\Microsoft Data Management Gateway\1.0\Shared\ConfigManager.exe" directly (or) search for Gateway to find a link to Microsoft Data Management Gateway application as shown in the following image. 2. Switch to the Diagnostics tab. 3. Select the type of data store (linked service). 4. Specify authentication and enter credentials (or) enter connection string that is used to connect to the data store. 5. Click Test connection to test the connection to the data store. 
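When you enter a connection string in the Diagnostics tab, or define an ODBC linked service, the non-credential connectionString and the driver-specific credential portion combine into the full string handed to the ODBC driver. A minimal sketch with a hypothetical build_odbc_connection_string helper (the simple concatenation rule here is an assumption based on the property descriptions earlier in this article):

```python
# Hypothetical helper (assumption, not a Data Factory or gateway API):
# join the non-credential connectionString with the driver-specific
# credential portion, normalizing the ';' separators.
def build_odbc_connection_string(connection_string, credential=None):
    s = connection_string.rstrip(";") + ";"
    if credential:
        s += credential.rstrip(";") + ";"
    return s

print(build_odbc_connection_string(
    "Driver={SQL Server};Server=myserver.database.windows.net;Database=TestDatabase",
    credential="UID=myuser;PWD=mypassword",
))
# Driver={SQL Server};Server=myserver.database.windows.net;Database=TestDatabase;UID=myuser;PWD=mypassword;
```

Keeping the credential portion separate is what allows the gateway to encrypt it while leaving the rest of the connection string readable.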
Performance and Tuning See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. Copy data to/from on-premises Oracle using Azure Data Factory 6/9/2017 • 15 min to read • Edit Online This article explains how to use the Copy Activity in Azure Data Factory to move data to/from an on-premises Oracle database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. Supported scenarios You can copy data from an Oracle database to the following data stores: CATEGORY DATA STORE Azure Azure Blob storage Azure Data Lake Store Azure Cosmos DB (DocumentDB API) Azure SQL Database Azure SQL Data Warehouse Azure Search Index Azure Table storage Databases SQL Server Oracle File File system You can copy data from the following data stores to an Oracle database: CATEGORY DATA STORE Azure Azure Blob storage Azure Cosmos DB (DocumentDB API) Azure Data Lake Store Azure SQL Database Azure SQL Data Warehouse Azure Table storage Databases Amazon Redshift DB2 MySQL Oracle PostgreSQL SAP Business Warehouse SAP HANA SQL Server Sybase Teradata NoSQL Cassandra MongoDB File Amazon S3 File System FTP HDFS SFTP Others Generic HTTP Generic OData Generic ODBC Salesforce Web Table (table from HTML) GE Historian Prerequisites Data Factory supports connecting to on-premises Oracle sources using the Data Management Gateway. See the Data Management Gateway article to learn about Data Management Gateway and the Move data from on-premises to cloud article for step-by-step instructions on setting up the gateway and a data pipeline to move data. The gateway is required even if the Oracle database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM as long as the gateway can connect to the database.
NOTE See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues. Supported versions and installation This Oracle connector supports two versions of drivers: Microsoft driver for Oracle (recommended): starting with Data Management Gateway version 2.7, a Microsoft driver for Oracle is automatically installed along with the gateway, so you don't need to handle the driver separately to establish connectivity to Oracle, and you can also get better copy performance with this driver. The following versions of Oracle Database are supported: Oracle 12c R1 (12.1) Oracle 11g R1, R2 (11.1, 11.2) Oracle 10g R1, R2 (10.1, 10.2) Oracle 9i R1, R2 (9.0.1, 9.2) Oracle 8i R3 (8.1.7) IMPORTANT Currently, the Microsoft driver for Oracle supports only copying data from Oracle; it does not support writing to Oracle. Also note that the test connection capability in the Data Management Gateway Diagnostics tab does not support this driver. Alternatively, you can use the Copy Wizard to validate connectivity. Oracle Data Provider for .NET: you can also choose to use Oracle Data Provider to copy data from/to Oracle. This component is included in Oracle Data Access Components for Windows. Install the appropriate version (32/64 bit) on the machine where the gateway is installed. Oracle Data Provider for .NET 12.1 can access Oracle Database 10g Release 2 or later. If you choose "XCopy Installation", follow the steps in readme.htm. We recommend that you choose the installer with a UI (the non-XCopy one). After installing the provider, restart the Data Management Gateway host service on your machine using the Services applet (or) Data Management Gateway Configuration Manager. If you use the Copy Wizard to author the copy pipeline, the driver type is auto-determined. The Microsoft driver is used by default, unless your gateway version is earlier than 2.7 or you choose Oracle as the sink.
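The driver auto-selection rule described above can be sketched as a small function; pick_oracle_driver is a hypothetical name for illustration, not a wizard API:

```python
# Sketch (assumption): the copy wizard's driver auto-selection rule as
# described above - Microsoft driver by default, ODP when the gateway is
# older than 2.7 or when Oracle is the sink.
def pick_oracle_driver(gateway_version, sink_is_oracle):
    major, minor = (int(x) for x in gateway_version.split(".")[:2])
    if (major, minor) < (2, 7) or sink_is_oracle:
        return "ODP"
    return "Microsoft"

print(pick_oracle_driver("2.7", sink_is_oracle=False))  # Microsoft
print(pick_oracle_driver("2.6", sink_is_oracle=False))  # ODP
print(pick_oracle_driver("2.7", sink_is_oracle=True))   # ODP
```

Setting driverType explicitly in the linked service overrides this default.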
Getting started You can create a pipeline with a copy activity that moves data to/from an on-premises Oracle database by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create a data factory. A data factory may contain one or more pipelines. 2. Create linked services to link input and output data stores to your data factory. For example, if you are copying data from an Oracle database to Azure Blob storage, you create two linked services to link your Oracle database and Azure storage account to your data factory. For linked service properties that are specific to Oracle, see the linked service properties section. 3. Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the table in your Oracle database that contains the input data. You also create another dataset to specify the blob container and the folder that holds the data copied from the Oracle database. For dataset properties that are specific to Oracle, see the dataset properties section. 4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use OracleSource as a source and BlobSink as a sink for the copy activity. Similarly, if you are copying from Azure Blob Storage to Oracle Database, you use BlobSource and OracleSink in the copy activity.
For copy activity properties that are specific to Oracle database, see the copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an on-premises Oracle database, see the JSON examples section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities: Linked service properties The following table provides descriptions for JSON elements specific to the Oracle linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: OnPremisesOracle Yes driverType Specify which driver to use to copy data from/to Oracle Database. Allowed values are Microsoft or ODP (default). See the Supported versions and installation section for driver details. No connectionString Specify information needed to connect to the Oracle Database instance for the connectionString property. Yes gatewayName Name of the gateway that is used to connect to the on-premises Oracle server. Yes Example: using Microsoft driver: { "name": "OnPremisesOracleLinkedService", "properties": { "type": "OnPremisesOracle", "typeProperties": { "driverType": "Microsoft", "connectionString":"Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;", "gatewayName": "<gateway name>" } } } Example: using ODP driver Refer to this site for the allowed formats.
{ "name": "OnPremisesOracleLinkedService", "properties": { "type": "OnPremisesOracle", "typeProperties": { "connectionString": "Data Source=(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<hostname>)(PORT=<port number>))(CONNECT_DATA=(SERVICE_NAME=<SID>))); User Id=<username>;Password=<password>;", "gatewayName": "<gateway name>" } } } Dataset properties For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Oracle, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for the dataset of type OracleTable has the following properties: PROPERTY DESCRIPTION REQUIRED tableName Name of the table in the Oracle Database that the linked service refers to. No (if oracleReaderQuery of OracleSource is specified) Copy activity properties For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities. NOTE The Copy Activity takes only one input and produces only one output. Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. OracleSource In Copy activity, when the source is of type OracleSource, the following properties are available in the typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED oracleReaderQuery Use the custom query to read data. SQL query string.
For example: select * from MyTable No (if tableName of dataset is specified) If not specified, the SQL statement that is executed is: select * from MyTable OracleSink OracleSink supports the following properties: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED writeBatchTimeout Wait time for the batch insert operation to complete before it times out. timespan Example: 00:30:00 (30 minutes). No writeBatchSize Inserts data into the SQL table when the buffer size reaches writeBatchSize. Integer (number of rows) No (default: 100) sqlWriterCleanupScript Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. A query statement. No sliceIdentifierColumnName Specify a column name for Copy Activity to fill with an auto-generated slice identifier, which is used to clean up data of a specific slice when rerun. Column name of a column with data type of binary(32). No JSON examples for copying data to and from Oracle database The following example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from/to an Oracle database to/from Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. Example: Copy data from Oracle to Azure Blob The sample has the following data factory entities: 1. A linked service of type OnPremisesOracle. 2. A linked service of type AzureStorage. 3. An input dataset of type OracleTable. 4. An output dataset of type AzureBlob. 5. A pipeline with Copy activity that uses OracleSource as source and BlobSink as sink. The sample copies data from a table in an on-premises Oracle database to a blob hourly. For more information on various properties used in the sample, see documentation in sections following the samples.
Oracle linked service: { "name": "OnPremisesOracleLinkedService", "properties": { "type": "OnPremisesOracle", "typeProperties": { "driverType": "Microsoft", "connectionString":"Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password= <password>;", "gatewayName": "<gateway name>" } } } Azure Blob storage linked service: { "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey= <Account key>" } } } Oracle input dataset: The sample assumes you have created a table “MyTable” in Oracle and it contains a column called “timestampcolumn” for time series data. Setting “external”: ”true” informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. { "name": "OracleInput", "properties": { "type": "OracleTable", "linkedServiceName": "OnPremisesOracleLinkedService", "typeProperties": { "tableName": "MyTable" }, "external": true, "availability": { "offset": "01:00:00", "interval": "1", "anchorDateTime": "2014-02-27T12:00:00", "frequency": "Hour" }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. 
{ "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ], "format": { "type": "TextFormat", "columnDelimiter": "\t", "rowDelimiter": "\n" } }, "availability": { "frequency": "Hour", "interval": 1 } } } Pipeline with Copy activity: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run hourly. In the pipeline JSON definition, the source type is set to OracleSource and sink type is set to BlobSink. The SQL query specified with oracleReaderQuery property selects the data in the past hour to copy. 
{ "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-01T19:00:00", "description":"pipeline for copy activity", "activities":[ { "name": "OracletoBlob", "description": "copy activity", "type": "Copy", "inputs": [ { "name": "OracleInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "typeProperties": { "source": { "type": "OracleSource", "oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" }, "sink": { "type": "BlobSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } Example: Copy data from Azure Blob to Oracle This sample shows how to copy data from Azure Blob storage to an on-premises Oracle database. However, data can be copied directly from any of the sources stated here using the Copy Activity in Azure Data Factory. The sample has the following data factory entities: 1. A linked service of type OnPremisesOracle. 2. A linked service of type AzureStorage. 3. An input dataset of type AzureBlob. 4. An output dataset of type OracleTable. 5. A pipeline with a Copy activity that uses BlobSource as the source and OracleSink as the sink. The sample copies data from a blob to a table in an on-premises Oracle database every day. For more information on the various properties used in the sample, see the documentation in the sections following the samples.
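The start/end window and scheduler frequency in these pipeline definitions determine how many activity windows (slices) run. A simplified Python sketch of hourly slicing, assuming no offset, anchorDateTime, or style adjustments (which the real scheduler also honors):

```python
from datetime import datetime, timedelta

def hourly_slices(start, end):
    # Simplified: one slice per hour in [start, end). Real ADF v1 scheduling
    # also applies offset, anchorDateTime, and style settings.
    slices, t = [], start
    while t < end:
        slices.append((t, t + timedelta(hours=1)))
        t += timedelta(hours=1)
    return slices

# The Oracle-to-blob pipeline above spans 18:00-19:00, so it produces one slice.
print(hourly_slices(datetime(2014, 6, 1, 18), datetime(2014, 6, 1, 19)))
```

Each slice's boundaries become the WindowStart and WindowEnd values substituted into the source query.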
Oracle linked service: { "name": "OnPremisesOracleLinkedService", "properties": { "type": "OnPremisesOracle", "typeProperties": { "connectionString": "Data Source=(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<hostname>)(PORT=<port number>))(CONNECT_DATA=(SERVICE_NAME=<SID>))); User Id=<username>;Password=<password>;", "gatewayName": "<gateway name>" } } } Azure Blob storage linked service: { "name": "StorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<Account key>" } } } Azure Blob input dataset: Data is picked up from a new blob every day (frequency: day, interval: 1), matching the dataset's availability settings. The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed; it uses the year, month, and day parts of the start time. The "external": true setting informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory. { "name": "AzureBlobInput", "properties": { "type": "AzureBlob", "linkedServiceName": "StorageLinkedService", "typeProperties": { "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } } ], "format": { "type": "TextFormat", "columnDelimiter": ",", "rowDelimiter": "\n" } }, "external": true, "availability": { "frequency": "Day", "interval": 1 }, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Oracle output dataset: The sample assumes you have created a table “MyTable” in Oracle.
Create the table in Oracle with the same number of columns as you expect the blob CSV file to contain. New rows are added to the table every day. { "name": "OracleOutput", "properties": { "type": "OracleTable", "linkedServiceName": "OnPremisesOracleLinkedService", "typeProperties": { "tableName": "MyTable" }, "availability": { "frequency": "Day", "interval": "1" } } } Pipeline with Copy activity: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every day. In the pipeline JSON definition, the source type is set to BlobSource and the sink type is set to OracleSink. { "name":"SamplePipeline", "properties":{ "start":"2014-06-01T18:00:00", "end":"2014-06-05T19:00:00", "description":"pipeline with copy activity", "activities":[ { "name": "AzureBlobtoOracle", "description": "Copy Activity", "type": "Copy", "inputs": [ { "name": "AzureBlobInput" } ], "outputs": [ { "name": "OracleOutput" } ], "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "OracleSink" } }, "scheduler": { "frequency": "Day", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } Troubleshooting tips Problem 1: .NET Framework Data Provider You see the following error message: Copy activity met invalid parameters: 'UnknownParameterName', Detailed message: Unable to find the requested .Net Framework Data Provider. It may not be installed. Possible causes: 1. The .NET Framework Data Provider for Oracle was not installed. 2. The .NET Framework Data Provider for Oracle was installed to .NET Framework 2.0 and is not found in the .NET Framework 4.0 folders. Resolution/Workaround: 1. If you haven't installed the .NET Provider for Oracle, install it and retry the scenario. 2. If you get the error message even after installing the provider, do the following steps: a.
Open the machine.config file of .NET 2.0 from the folder: :\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config. b. Search for Oracle Data Provider for .NET; you should be able to find an entry as shown in the following sample under system.data -> DbProviderFactories: “” 3. Copy this entry to the machine.config file in the following v4.0 folder: :\Windows\Microsoft.NET\Framework64\v4.0.30319\Config\machine.config, and change the version to 4.xxx.x.x. 4. Install “\11.2.0\client_1\odp.net\bin\4\Oracle.DataAccess.dll” into the global assembly cache (GAC) by running gacutil /i [provider path]. Problem 2: datetime formatting You see the following error message: Message=Operation failed in Oracle Database with the following error: 'ORA-01861: literal does not match format string'.,Source=,''Type=Oracle.DataAccess.Client.OracleException,Message=ORA-01861: literal does not match format string,Source=Oracle Data Provider for .NET,'. Resolution/Workaround: You may need to adjust the query string in your copy activity based on how dates are configured in your Oracle database, as shown in the following sample (using the to_date function): "oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= to_date(\\'{0:MM-dd-yyyy HH:mm}\\',\\'MM/DD/YYYY HH24:MI\\') AND timestampcolumn < to_date(\\'{1:MM-dd-yyyy HH:mm}\\',\\'MM/DD/YYYY HH24:MI\\') ', WindowStart, WindowEnd)" Type mapping for Oracle As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach: 1. Convert from native source types to a .NET type. 2. Convert from the .NET type to the native sink type. When moving data from Oracle, the following mappings are used from Oracle data type to .NET type and vice versa.
| ORACLE DATA TYPE | .NET FRAMEWORK DATA TYPE |
| --- | --- |
| BFILE | Byte[] |
| BLOB | Byte[] |
| CHAR | String |
| CLOB | String |
| DATE | DateTime |
| FLOAT | Decimal, String (if precision > 28) |
| INTEGER | Decimal, String (if precision > 28) |
| INTERVAL YEAR TO MONTH | Int32 |
| INTERVAL DAY TO SECOND | TimeSpan |
| LONG | String |
| LONG RAW | Byte[] |
| NCHAR | String |
| NCLOB | String |
| NUMBER | Decimal, String (if precision > 28) |
| NVARCHAR2 | String |
| RAW | Byte[] |
| ROWID | String |
| TIMESTAMP | DateTime |
| TIMESTAMP WITH LOCAL TIME ZONE | DateTime |
| TIMESTAMP WITH TIME ZONE | DateTime |
| UNSIGNED INTEGER | Number |
| VARCHAR2 | String |
| XML | String |

NOTE The data types INTERVAL YEAR TO MONTH and INTERVAL DAY TO SECOND are not supported when using the Microsoft driver. Map source to sink columns To learn about mapping columns in a source dataset to columns in a sink dataset, see Mapping dataset columns in Azure Data Factory. Repeatable read from relational sources When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See Repeatable read from relational sources. Performance and Tuning See the Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. Move data from PostgreSQL using Azure Data Factory 6/6/2017 • 8 min to read This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises PostgreSQL database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an on-premises PostgreSQL data store to any supported sink data store.
For a list of data stores supported as sinks by the copy activity, see supported data stores. Data Factory currently supports moving data from a PostgreSQL database to other data stores, but not moving data from other data stores to a PostgreSQL database. Prerequisites The Data Factory service supports connecting to on-premises PostgreSQL sources using the Data Management Gateway. See the Move data between on-premises locations and cloud article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. The gateway is required even if the PostgreSQL database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM, as long as the gateway can connect to the database. NOTE See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues. Supported versions and installation For Data Management Gateway to connect to the PostgreSQL database, install the Npgsql data provider for PostgreSQL 2.0.12 or above on the same system as the Data Management Gateway. PostgreSQL version 7.4 and above is supported. Getting started You can create a pipeline with a copy activity that moves data from an on-premises PostgreSQL data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2.
Create datasets to represent input and output data for the copy operation. 3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises PostgreSQL data store, see the JSON example: Copy data from PostgreSQL to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to a PostgreSQL data store: Linked service properties The following table provides descriptions of the JSON elements specific to the PostgreSQL linked service.

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: OnPremisesPostgreSql | Yes |
| server | Name of the PostgreSQL server. | Yes |
| database | Name of the PostgreSQL database. | Yes |
| schema | Name of the schema in the database. The schema name is case-sensitive. | No |
| authenticationType | Type of authentication used to connect to the PostgreSQL database. Possible values are: Anonymous, Basic, and Windows. | Yes |
| username | Specify user name if you are using Basic or Windows authentication. | No |
| password | Specify password for the user account you specified for the username. | No |
| gatewayName | Name of the gateway that the Data Factory service should use to connect to the on-premises PostgreSQL database. | Yes |

Dataset properties For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store.
The typeProperties section for a dataset of type RelationalTable (which includes the PostgreSQL dataset) has the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| tableName | Name of the table in the PostgreSQL Database instance that the linked service refers to. The tableName is case-sensitive. | No (if query of RelationalSource is specified) |

Copy activity properties For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities, whereas the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. When the source is of type RelationalSource (which includes PostgreSQL), the following properties are available in the typeProperties section:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| query | Use the custom query to read data. | SQL query string. For example: "query": "select * from \"MySchema\".\"MyTable\"". | No (if tableName of dataset is specified) |

NOTE Schema and table names are case-sensitive. Enclose them in "" (double quotes) in the query. Example: "query": "select * from \"MySchema\".\"MyTable\"" JSON example: Copy data from PostgreSQL to Azure Blob This example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from a PostgreSQL database to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. IMPORTANT This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See the Move data between on-premises locations and cloud article for step-by-step instructions. The sample has the following data factory entities: 1. A linked service of type OnPremisesPostgreSql.
2. A linked service of type AzureStorage. 3. An input dataset of type RelationalTable. 4. An output dataset of type AzureBlob. 5. A pipeline with a Copy Activity that uses RelationalSource and BlobSink. The sample copies data from a query result in a PostgreSQL database to a blob every hour. The JSON properties used in these samples are described in sections following the samples. As a first step, set up the Data Management Gateway. The instructions are in the Move data between on-premises locations and cloud article. PostgreSQL linked service: { "name": "OnPremPostgreSqlLinkedService", "properties": { "type": "OnPremisesPostgreSql", "typeProperties": { "server": "<server>", "database": "<database>", "schema": "<schema>", "authenticationType": "<authentication type>", "username": "<username>", "password": "<password>", "gatewayName": "<gatewayName>" } } } Azure Blob storage linked service: { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=<AccountKey>" } } } PostgreSQL input dataset: The sample assumes you have created a table “MyTable” in PostgreSQL and that it contains a column called “timestamp” for time series data. Setting "external": true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. { "name": "PostgreSqlDataSet", "properties": { "type": "RelationalTable", "linkedServiceName": "OnPremPostgreSqlLinkedService", "typeProperties": {}, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed.
The folder path uses the year, month, day, and hour parts of the start time. { "name": "AzureBlobPostgreSqlDataSet", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/postgresql/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } } Pipeline with Copy activity: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run hourly. In the pipeline JSON definition, the source type is set to RelationalSource and the sink type is set to BlobSink. The SQL query specified for the query property selects the data from the public.usstates table in the PostgreSQL database.
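PostgreSQL folds unquoted identifiers to lowercase, which is why the query property double-quotes the schema and table names. A small Python sketch (illustrative only) of building such a query and seeing how the quotes must be escaped inside the JSON definition:

```python
import json

def quoted_query(schema, table):
    # Double-quote identifiers so PostgreSQL treats them as case-sensitive.
    return f'select * from "{schema}"."{table}"'

print(quoted_query("public", "usstates"))
# In the pipeline JSON, the embedded double quotes must be escaped:
print(json.dumps({"query": quoted_query("MySchema", "MyTable")}))
```

This reproduces the escaped form shown in the NOTE earlier: "query": "select * from \"MySchema\".\"MyTable\"".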
{ "name": "CopyPostgreSqlToBlob", "properties": { "description": "pipeline for copy activity", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "RelationalSource", "query": "select * from \"public\".\"usstates\"" }, "sink": { "type": "BlobSink" } }, "inputs": [ { "name": "PostgreSqlDataSet" } ], "outputs": [ { "name": "AzureBlobPostgreSqlDataSet" } ], "policy": { "timeout": "01:00:00", "concurrency": 1 }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "PostgreSqlToBlob" } ], "start": "2014-06-01T18:00:00Z", "end": "2014-06-01T19:00:00Z" } } Type mapping for PostgreSQL As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach: 1. Convert from native source types to a .NET type. 2. Convert from the .NET type to the native sink type. When moving data from PostgreSQL, the following mappings are used from PostgreSQL type to .NET type.

| POSTGRESQL DATABASE TYPE | POSTGRESQL ALIASES | .NET FRAMEWORK TYPE |
| --- | --- | --- |
| abstime | | Datetime |
| bigint | int8 | Int64 |
| bigserial | serial8 | Int64 |
| bit [ (n) ] | | Byte[], String |
| bit varying [ (n) ] | varbit | Byte[], String |
| boolean | bool | Boolean |
| box | | Byte[], String |
| bytea | | Byte[], String |
| character [ (n) ] | char [ (n) ] | String |
| character varying [ (n) ] | varchar [ (n) ] | String |
| cid | | String |
| cidr | | String |
| circle | | Byte[], String |
| date | | Datetime |
| daterange | | String |
| double precision | float8 | Double |
| inet | | Byte[], String |
| intarry | | String |
| int4range | | String |
| int8range | | String |
| integer | int, int4 | Int32 |
| interval [ fields ] [ (p) ] | | Timespan |
| json | | String |
| jsonb | | Byte[] |
| line | | Byte[], String |
| lseg | | Byte[], String |
| macaddr | | Byte[], String |
| money | | Decimal |
| numeric [ (p, s) ] | decimal [ (p, s) ] | Decimal |
| numrange | | String |
| oid | | Int32 |
| path | | Byte[], String |
| pg_lsn | | Int64 |
| point | | Byte[], String |
| polygon | | Byte[], String |
| real | float4 | Single |
| smallint | int2 | Int16 |
| smallserial | serial2 | Int16 |
| serial | serial4 | Int32 |
| text | | String |

Map source to sink columns To learn about mapping columns in a source dataset to columns in a sink dataset, see Mapping dataset columns in Azure Data Factory. Repeatable read from relational sources When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See Repeatable read from relational sources. Performance and Tuning See the Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. Move data from Salesforce by using Azure Data Factory 6/5/2017 • 10 min to read This article outlines how you can use Copy Activity in an Azure data factory to copy data from Salesforce to any data store that is listed under the Sink column in the supported sources and sinks table. This article builds on the data movement activities article, which presents a general overview of data movement with Copy Activity and supported data store combinations. Azure Data Factory currently supports only moving data from Salesforce to supported sink data stores, and does not support moving data from other data stores to Salesforce. Supported versions This connector supports the following editions of Salesforce: Developer Edition, Professional Edition, Enterprise Edition, and Unlimited Edition. It supports copying data from Salesforce production, sandbox, and custom domains. Prerequisites API permission must be enabled. See How do I enable API access in Salesforce by permission set?
To copy data from Salesforce to on-premises data stores, you must have at least Data Management Gateway 2.0 installed in your on-premises environment. Salesforce request limits Salesforce has limits for both total API requests and concurrent API requests. Note the following points: If the number of concurrent requests exceeds the limit, throttling occurs and you will see random failures. If the total number of requests exceeds the limit, the Salesforce account will be blocked for 24 hours. You might also receive the "REQUEST_LIMIT_EXCEEDED" error in both scenarios. See the "API Request Limits" section in the Salesforce Developer Limits article for details. Getting started You can create a pipeline with a copy activity that moves data from Salesforce by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2. Create datasets to represent input and output data for the copy operation. 3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format.
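The REQUEST_LIMIT_EXCEEDED behavior described above is worth handling in any custom client code that calls Salesforce alongside Data Factory. The sketch below is not a Data Factory feature: it is a generic exponential-backoff pattern, with a hypothetical `request` callable standing in for an API call.

```python
import time

def call_with_backoff(request, max_attempts=5, base_delay=1.0):
    # Illustrative only: Data Factory applies its own retry policy. This retries
    # a throttled call with exponential backoff (1s, 2s, 4s, ...).
    for attempt in range(max_attempts):
        try:
            return request()
        except RuntimeError as err:
            if "REQUEST_LIMIT_EXCEEDED" not in str(err) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Backoff only helps with the concurrent-request limit; once the daily total-request limit is exhausted, the account stays blocked for 24 hours regardless of retries.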
For a sample with JSON definitions for Data Factory entities that are used to copy data from Salesforce, see the JSON example: Copy data from Salesforce to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to Salesforce: Linked service properties The following table provides descriptions for JSON elements that are specific to the Salesforce linked service.

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: Salesforce. | Yes |
| environmentUrl | Specify the URL of the Salesforce instance. Default is "https://login.salesforce.com". To copy data from sandbox, specify "https://test.salesforce.com". To copy data from a custom domain, specify, for example, "https://[domain].my.salesforce.com". | No |
| username | Specify a user name for the user account. | Yes |
| password | Specify a password for the user account. | Yes |
| securityToken | Specify a security token for the user account. See Get security token for instructions on how to reset/get a security token. To learn about security tokens in general, see Security and the API. | Yes |

Dataset properties For a full list of sections and properties that are available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, and so on). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of the type RelationalTable has the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| tableName | Name of the table in Salesforce. | No (if a query of RelationalSource is specified) |

IMPORTANT The "__c" part of the API Name is needed for any custom object.
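The "__c" suffix rule for custom-object API names can be captured in a small helper. This is a hypothetical convenience function for scripts that generate dataset JSON, not part of any Data Factory or Salesforce SDK.

```python
def salesforce_table_name(object_name, custom=True):
    # Hypothetical helper: custom Salesforce objects are addressed by their
    # API name, which always carries the "__c" suffix; standard objects are not.
    if custom and not object_name.endswith("__c"):
        return object_name + "__c"
    return object_name

print(salesforce_table_name("AllDataType"))      # AllDataType__c
print(salesforce_table_name("Account", custom=False))  # Account
```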
Copy activity properties For a full list of sections and properties that are available for defining activities, see the Creating pipelines article. Properties like name, description, input and output tables, and various policies are available for all types of activities. The properties that are available in the typeProperties section of the activity, on the other hand, vary with each activity type. For Copy Activity, they vary depending on the types of sources and sinks. In the copy activity, when the source is of the type RelationalSource (which includes Salesforce), the following properties are available in the typeProperties section:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| query | Use the custom query to read data. | A SQL-92 query or a Salesforce Object Query Language (SOQL) query. For example: select * from MyTable__c. | No (if the tableName of the dataset is specified) |

IMPORTANT The "__c" part of the API Name is needed for any custom object. Query tips Retrieving data using a where clause on a DateTime column When specifying the SOQL or SQL query, pay attention to the DateTime format difference. For example: SOQL sample: $$Text.Format('SELECT Id, Name, BillingCity FROM Account WHERE LastModifiedDate >= {0:yyyy-MM-ddTHH:mm:ssZ} AND LastModifiedDate < {1:yyyy-MM-ddTHH:mm:ssZ}', WindowStart, WindowEnd) SQL sample: Using the copy wizard to specify the query: $$Text.Format('SELECT * FROM Account WHERE LastModifiedDate >= {{ts\'{0:yyyy-MM-dd HH:mm:ss}\'}} AND LastModifiedDate < {{ts\'{1:yyyy-MM-dd HH:mm:ss}\'}}', WindowStart, WindowEnd) Using JSON editing to specify the query (escape char properly): $$Text.Format('SELECT * FROM Account WHERE LastModifiedDate >= {{ts\\'{0:yyyy-MM-dd HH:mm:ss}\\'}} AND LastModifiedDate < {{ts\\'{1:yyyy-MM-dd HH:mm:ss}\\'}}', WindowStart, WindowEnd) Retrieving data from a Salesforce report You can retrieve data from Salesforce reports by specifying the query as {call "<report name>"}. For example: "query": "{call \"TestReport\"}".
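The DateTime format difference between the SOQL and SQL paths can be sketched in Python. Illustrative only: the service itself expands these $$Text.Format expressions; the object and field names are the ones from the samples above.

```python
from datetime import datetime

def soql_window(ws, we):
    # SOQL expects unquoted ISO-8601 UTC literals: yyyy-MM-ddTHH:mm:ssZ.
    f = "%Y-%m-%dT%H:%M:%SZ"
    return (f"SELECT Id, Name, BillingCity FROM Account "
            f"WHERE LastModifiedDate >= {ws:{f}} AND LastModifiedDate < {we:{f}}")

def sql_window(ws, we):
    # The SQL-92 path uses ODBC timestamp escapes: {ts '...'}.
    f = "%Y-%m-%d %H:%M:%S"
    return (f"SELECT * FROM Account WHERE LastModifiedDate >= {{ts '{ws:{f}}'}} "
            f"AND LastModifiedDate < {{ts '{we:{f}}'}}")

print(soql_window(datetime(2016, 6, 1, 18), datetime(2016, 6, 1, 19)))
print(sql_window(datetime(2016, 6, 1, 18), datetime(2016, 6, 1, 19)))
```

Note that the SOQL literal is not quoted, while the SQL path wraps the value in an {ts '...'} escape; mixing the two formats is a common cause of query failures.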
Retrieving deleted records from the Salesforce Recycle Bin To query the soft-deleted records from the Salesforce Recycle Bin, you can specify "IsDeleted = 1" in your query. For example: To query only the deleted records, specify "select * from MyTable__c where IsDeleted = 1". To query all the records, including both the existing and the deleted, specify "select * from MyTable__c where IsDeleted = 0 or IsDeleted = 1". JSON example: Copy data from Salesforce to Azure Blob The following example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from Salesforce to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. Here are the Data Factory artifacts that you'll need to create to implement the scenario. The sections that follow the list provide details about these steps. A linked service of the type Salesforce A linked service of the type AzureStorage An input dataset of the type RelationalTable An output dataset of the type AzureBlob A pipeline with Copy Activity that uses RelationalSource and BlobSink Salesforce linked service This example uses the Salesforce linked service. See the Salesforce linked service section for the properties that are supported by this linked service. See Get security token for instructions on how to reset/get the security token.
{ "name": "SalesforceLinkedService", "properties": { "type": "Salesforce", "typeProperties": { "username": "<user name>", "password": "<password>", "securityToken": "<security token>" } } } Azure Storage linked service { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } } Salesforce input dataset { "name": "SalesforceInput", "properties": { "linkedServiceName": "SalesforceLinkedService", "type": "RelationalTable", "typeProperties": { "tableName": "AllDataType__c" }, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Setting external to true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. IMPORTANT The "__c" part of the API Name is needed for any custom object. Azure blob output dataset Data is written to a new blob every hour (frequency: hour, interval: 1). { "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "adfgetstarted/alltypes_c" }, "availability": { "frequency": "Hour", "interval": 1 } } } Pipeline with Copy Activity The pipeline contains Copy Activity, which is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource, and the sink type is set to BlobSink. See RelationalSource type properties for the list of properties that are supported by the RelationalSource. 
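As a quick sanity check on linked-service definitions like the Salesforce one above, a script can verify the required typeProperties before deployment. This is an illustrative validator, not a Data Factory API; it encodes the required/optional columns from the linked service properties table (environmentUrl is optional, the other three are required).

```python
REQUIRED = {"username", "password", "securityToken"}

def validate_salesforce_linked_service(definition):
    # Returns the sorted list of missing required typeProperties.
    props = definition.get("properties", {}).get("typeProperties", {})
    return sorted(REQUIRED - props.keys())

svc = {"name": "SalesforceLinkedService",
       "properties": {"type": "Salesforce",
                      "typeProperties": {"username": "u", "password": "p"}}}
print(validate_salesforce_linked_service(svc))  # ['securityToken']
```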
{ "name":"SamplePipeline", "properties":{ "start":"2016-06-01T18:00:00", "end":"2016-06-01T19:00:00", "description":"pipeline with copy activity", "activities":[ { "name": "SalesforceToAzureBlob", "description": "Copy from Salesforce to an Azure blob", "type": "Copy", "inputs": [ { "name": "SalesforceInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "typeProperties": { "source": { "type": "RelationalSource", "query": "SELECT Id, Col_AutoNumber__c, Col_Checkbox__c, Col_Currency__c, Col_Date__c, Col_DateTime__c, Col_Email__c, Col_Number__c, Col_Percent__c, Col_Phone__c, Col_Picklist__c, Col_Picklist_MultiSelect__c, Col_Text__c, Col_Text_Area__c, Col_Text_AreaLong__c, Col_Text_AreaRich__c, Col_URL__c, Col_Text_Encrypt__c, Col_Lookup__c FROM AllDataType__c" }, "sink": { "type": "BlobSink" } }, "scheduler": { "frequency": "Hour", "interval": 1 }, "policy": { "concurrency": 1, "executionPriorityOrder": "OldestFirst", "retry": 0, "timeout": "01:00:00" } } ] } } IMPORTANT The "__c" part of the API Name is needed for any custom object. Type mapping for Salesforce

| SALESFORCE TYPE | .NET-BASED TYPE |
| --- | --- |
| Auto Number | String |
| Checkbox | Boolean |
| Currency | Double |
| Date | DateTime |
| Date/Time | DateTime |
| Email | String |
| Id | String |
| Lookup Relationship | String |
| Multi-Select Picklist | String |
| Number | Double |
| Percent | Double |
| Phone | String |
| Picklist | String |
| Text | String |
| Text Area | String |
| Text Area (Long) | String |
| Text Area (Rich) | String |
| Text (Encrypted) | String |
| URL | String |

NOTE To map columns from the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory. Specifying structure definition for rectangular datasets The structure section in the dataset JSON is an optional section for rectangular tables (with rows & columns) and contains a collection of columns for the table. You use the structure section either to provide type information for type conversions or to do column mappings. The following sections describe these features in detail.
Each column contains the following properties:

name — Name of the column. Required: Yes.
type — Data type of the column. See the type conversions section below for details about when to specify type information. Required: No.
culture — .NET-based culture to be used when the type is specified and is the .NET type Datetime or Datetimeoffset. The default is "en-us". Required: No.
format — Format string to be used when the type is specified and is the .NET type Datetime or Datetimeoffset. Required: No.

The following sample shows the structure section JSON for a table that has three columns: userid, name, and lastlogindate.

"structure": [
  { "name": "userid" },
  { "name": "name" },
  { "name": "lastlogindate" }
],

Use the following guidelines for when to include structure information and what to include in the structure section.

For structured data sources that store data schema and type information along with the data itself (sources like SQL Server, Oracle, Azure table, etc.), specify the structure section only if you want to map specific source columns to specific columns in the sink whose names are not the same (see details in the column mapping section below). As mentioned above, type information is optional in the structure section. For structured sources, type information is already available as part of the dataset definition in the data store, so you should not include type information when you do include the structure section.

For schema-on-read data sources (specifically Azure blob), you can choose to store data without storing any schema or type information with the data. For these types of data sources, include structure in the following two cases: You want to do column mapping. When the dataset is a source in a Copy activity, you can provide type information in structure, and Data Factory uses this type information for conversion to native types for the sink. See the Move data to and from Azure Blob article for more information.
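For schema-on-read sources, the conversion driven by the structure section can be illustrated with a short local sketch. This is a hypothetical helper, not Data Factory code; the structure definition, the format-token mapping, and the convert_row function are all assumptions made for illustration:

```python
from datetime import datetime

# Hypothetical "structure" section for a blob dataset (schema on read),
# with type information and a .NET-style format string for the date column.
structure = [
    {"name": "userid", "type": "Int64"},
    {"name": "name", "type": "String"},
    {"name": "lastlogindate", "type": "Datetime", "format": "yyyy-MM-dd HH:mm:ss"},
]

# Map a few .NET format tokens to Python strptime directives (illustrative only).
NET_TO_PY = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "mm": "%M", "ss": "%S"}

def py_format(net_format):
    for net, py in NET_TO_PY.items():
        net_format = net_format.replace(net, py)
    return net_format

def convert_row(raw_values, structure):
    """Convert raw text values using the column types declared in the structure."""
    converted = {}
    for col, raw in zip(structure, raw_values):
        col_type = col.get("type", "String")
        if col_type == "Int64":
            converted[col["name"]] = int(raw)
        elif col_type == "Datetime":
            converted[col["name"]] = datetime.strptime(raw, py_format(col["format"]))
        else:
            converted[col["name"]] = raw
    return converted

row = convert_row(["42", "alice", "2016-06-01 18:00:00"], structure)
# row["userid"] is the integer 42; row["lastlogindate"] is a datetime
```

A structured source (SQL Server, Oracle) would not need any of this, because type information already travels with the data, which is why the guidance above says to include type information only for schema-on-read sources.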
Supported .NET-based types
Data Factory supports the following CLS-compliant .NET-based type values for providing type information in structure for schema-on-read data sources such as Azure blob:

Int16
Int32
Int64
Single
Double
Decimal
Byte[]
Bool
String
Guid
Datetime
Datetimeoffset
Timespan

For Datetime and Datetimeoffset, you can also optionally specify a culture and a format string to facilitate parsing of your custom Datetime strings. See the sample for type conversion below.

Performance and tuning
See the Copy Activity performance and tuning guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and the various ways to optimize it.

Move data from SAP Business Warehouse using Azure Data Factory
5/16/2017 • 9 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises SAP Business Warehouse (BW). It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an on-premises SAP Business Warehouse data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from SAP Business Warehouse to other data stores, not moving data from other data stores to SAP Business Warehouse.

Supported versions and installation
This connector supports SAP Business Warehouse version 7.x. It supports copying data from InfoCubes and QueryCubes (including BEx queries) using MDX queries. To enable connectivity to the SAP BW instance, install the following components:

Data Management Gateway: The Data Factory service supports connecting to on-premises data stores (including SAP Business Warehouse) using a component called Data Management Gateway.
To learn about Data Management Gateway and for step-by-step instructions for setting up the gateway, see the Move data between on-premises data stores and cloud data stores article. A gateway is required even if the SAP Business Warehouse is hosted in an Azure IaaS virtual machine (VM). You can install the gateway on the same VM as the data store or on a different VM, as long as the gateway can connect to the database.

SAP NetWeaver library on the gateway machine. You can get the SAP NetWeaver library from your SAP administrator, or directly from the SAP Software Download Center. Search for SAP Note #1025361 to get the download location for the most recent version. Make sure that the architecture of the SAP NetWeaver library (32-bit or 64-bit) matches your gateway installation. Then install all files included in the SAP NetWeaver RFC SDK according to the SAP Note. The SAP NetWeaver library is also included in the SAP Client Tools installation.

TIP
Put the DLLs extracted from the NetWeaver RFC SDK into the system32 folder.

Getting started
You can create a pipeline with a copy activity that moves data from an on-premises SAP Business Warehouse data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough of creating a pipeline by using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3.
Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises SAP Business Warehouse, see the JSON example: Copy data from SAP Business Warehouse to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to an SAP BW data store:

Linked service properties
The following list describes the JSON elements specific to the SAP Business Warehouse (BW) linked service.

server — Name of the server on which the SAP BW instance resides. Allowed values: string. Required: Yes.
systemNumber — System number of the SAP BW system. Allowed values: two-digit decimal number represented as a string. Required: Yes.
clientId — Client ID of the client in the SAP BW system. Allowed values: three-digit decimal number represented as a string. Required: Yes.
username — Name of the user who has access to the SAP server. Allowed values: string. Required: Yes.
password — Password for the user. Allowed values: string. Required: Yes.
gatewayName — Name of the gateway that the Data Factory service should use to connect to the on-premises SAP BW instance. Allowed values: string. Required: Yes.
encryptedCredential — The encrypted credential string. Allowed values: string. Required: No.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store.
There are no type-specific properties supported for the SAP BW dataset of type RelationalTable.

Copy activity properties
For a full list of sections and properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities, whereas the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. When the source in a copy activity is of type RelationalSource (which includes SAP BW), the following property is available in the typeProperties section:

query — Specifies the MDX query to read data from the SAP BW instance. Allowed values: MDX query. Required: Yes.

JSON example: Copy data from SAP Business Warehouse to Azure Blob
The following example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. This sample shows how to copy data from an on-premises SAP Business Warehouse to Azure Blob storage. However, data can be copied directly to any of the sinks stated here using the Copy Activity in Azure Data Factory.

IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See the moving data between on-premises locations and cloud article for step-by-step instructions.

The sample has the following data factory entities:
1. A linked service of type SapBw.
2. A linked service of type AzureStorage.
3. An input dataset of type RelationalTable.
4. An output dataset of type AzureBlob.
5. A pipeline with a Copy Activity that uses RelationalSource and BlobSink.

The sample copies data from an SAP Business Warehouse instance to an Azure blob hourly. The JSON properties used in these samples are described in the sections following the samples. As a first step, set up the Data Management Gateway.
The instructions are in the moving data between on-premises locations and cloud article.

SAP Business Warehouse linked service
This linked service links your SAP BW instance to the data factory. The type property is set to SapBw. The typeProperties section provides connection information for the SAP BW instance.

{
  "name": "SapBwLinkedService",
  "properties": {
    "type": "SapBw",
    "typeProperties": {
      "server": "<server name>",
      "systemNumber": "<system number>",
      "clientId": "<client id>",
      "username": "<SAP user>",
      "password": "<Password for SAP user>",
      "gatewayName": "<gateway name>"
    }
  }
}

Azure Storage linked service
This linked service links your Azure Storage account to the data factory. The type property is set to AzureStorage. The typeProperties section provides connection information for the Azure Storage account.

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}

SAP BW input dataset
This dataset defines the SAP Business Warehouse dataset. You set the type of the Data Factory dataset to RelationalTable. Currently, you do not specify any type-specific properties for an SAP BW dataset. The query in the Copy Activity definition specifies what data to read from the SAP BW instance. Setting the external property to true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. The frequency and interval properties define the schedule. In this case, the data is read from the SAP BW instance hourly.

{
  "name": "SapBwDataset",
  "properties": {
    "type": "RelationalTable",
    "linkedServiceName": "SapBwLinkedService",
    "typeProperties": {},
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "external": true
  }
}

Azure Blob output dataset
This dataset defines the output Azure Blob dataset. The type property is set to AzureBlob.
The typeProperties section specifies where the data copied from the SAP BW instance is stored. The data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.

{
  "name": "AzureBlobDataSet",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/sapbw/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
      "format": {
        "type": "TextFormat",
        "rowDelimiter": "\n",
        "columnDelimiter": "\t"
      },
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ]
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

Pipeline with Copy activity
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource (for the SAP BW source) and the sink type is set to BlobSink. The query specified for the query property selects the data in the past hour to copy.
{
  "name": "CopySapBwToBlob",
  "properties": {
    "description": "pipeline for copy activity",
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "RelationalSource",
            "query": "<MDX query for SAP BW>"
          },
          "sink": {
            "type": "BlobSink",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          }
        },
        "inputs": [
          { "name": "SapBwDataset" }
        ],
        "outputs": [
          { "name": "AzureBlobDataSet" }
        ],
        "policy": {
          "timeout": "01:00:00",
          "concurrency": 1
        },
        "scheduler": {
          "frequency": "Hour",
          "interval": 1
        },
        "name": "SapBwToBlob"
      }
    ],
    "start": "2017-03-01T18:00:00Z",
    "end": "2017-03-01T19:00:00Z"
  }
}

Type mapping for SAP BW
As mentioned in the data movement activities article, the Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:
1. Convert from native source types to a .NET type.
2. Convert from the .NET type to the native sink type.
When moving data from SAP BW, the following mappings are used from SAP BW types to .NET types.

DATA TYPE IN THE ABAP DICTIONARY    .NET DATA TYPE
ACCP                                Int
CHAR                                String
CLNT                                String
CURR                                Decimal
CUKY                                String
DEC                                 Decimal
FLTP                                Double
INT1                                Byte
INT2                                Int16
INT4                                Int
LANG                                String
LCHR                                String
LRAW                                Byte[]
PREC                                Int16
QUAN                                Decimal
RAW                                 Byte[]
RAWSTRING                           Byte[]
STRING                              String
UNIT                                String
DATS                                String
NUMC                                String
TIMS                                String

NOTE
To map columns from the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory.

Map source to sink columns
To learn about mapping columns in the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory.

Repeatable read from relational sources
When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs.
When a slice is rerun either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.

Performance and tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and the various ways to optimize it.

Move data from SAP HANA using Azure Data Factory
6/5/2017 • 9 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises SAP HANA data store. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an on-premises SAP HANA data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from SAP HANA to other data stores, not moving data from other data stores to SAP HANA.

Supported versions and installation
This connector supports any version of SAP HANA database. It supports copying data from HANA information models (such as Analytic and Calculation views) and Row/Column tables using SQL queries. To enable connectivity to the SAP HANA instance, install the following components:

Data Management Gateway: The Data Factory service supports connecting to on-premises data stores (including SAP HANA) using a component called Data Management Gateway. To learn about Data Management Gateway and for step-by-step instructions for setting up the gateway, see the Move data between on-premises data stores and cloud data stores article. A gateway is required even if SAP HANA is hosted in an Azure IaaS virtual machine (VM). You can install the gateway on the same VM as the data store or on a different VM, as long as the gateway can connect to the database.

SAP HANA ODBC driver on the gateway machine.
You can download the SAP HANA ODBC driver from the SAP Software Download Center. Search with the keyword SAP HANA CLIENT for Windows.

Getting started
You can create a pipeline with a copy activity that moves data from an on-premises SAP HANA data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough of creating a pipeline by using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises SAP HANA data store, see the JSON example: Copy data from SAP HANA to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to an SAP HANA data store:

Linked service properties
The following list describes the JSON elements specific to the SAP HANA linked service.

server — Name of the server on which the SAP HANA instance resides.
If your server uses a customized port, specify server:port. Allowed values: string. Required: Yes.
authenticationType — Type of authentication. Allowed values: string ("Basic" or "Windows"). Required: Yes.
username — Name of the user who has access to the SAP server. Allowed values: string. Required: Yes.
password — Password for the user. Allowed values: string. Required: Yes.
gatewayName — Name of the gateway that the Data Factory service should use to connect to the on-premises SAP HANA instance. Allowed values: string. Required: Yes.
encryptedCredential — The encrypted credential string. Allowed values: string. Required: No.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. There are no type-specific properties supported for the SAP HANA dataset of type RelationalTable.

Copy activity properties
For a full list of sections and properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities, whereas the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. When the source in a copy activity is of type RelationalSource (which includes SAP HANA), the following property is available in the typeProperties section:

query — Specifies the SQL query to read data from the SAP HANA instance. Allowed values: SQL query. Required: Yes.

JSON example: Copy data from SAP HANA to Azure Blob
The following sample provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. This sample shows how to copy data from an on-premises SAP HANA data store to Azure Blob storage.
However, data can be copied directly to any of the sinks listed here using the Copy Activity in Azure Data Factory.

IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See the moving data between on-premises locations and cloud article for step-by-step instructions.

The sample has the following data factory entities:
1. A linked service of type SapHana.
2. A linked service of type AzureStorage.
3. An input dataset of type RelationalTable.
4. An output dataset of type AzureBlob.
5. A pipeline with a Copy Activity that uses RelationalSource and BlobSink.

The sample copies data from an SAP HANA instance to an Azure blob hourly. The JSON properties used in these samples are described in the sections following the samples. As a first step, set up the Data Management Gateway. The instructions are in the moving data between on-premises locations and cloud article.

SAP HANA linked service
This linked service links your SAP HANA instance to the data factory. The type property is set to SapHana. The typeProperties section provides connection information for the SAP HANA instance.

{
  "name": "SapHanaLinkedService",
  "properties": {
    "type": "SapHana",
    "typeProperties": {
      "server": "<server name>",
      "authenticationType": "<Basic, or Windows>",
      "username": "<SAP user>",
      "password": "<Password for SAP user>",
      "gatewayName": "<gateway name>"
    }
  }
}

Azure Storage linked service
This linked service links your Azure Storage account to the data factory. The type property is set to AzureStorage. The typeProperties section provides connection information for the Azure Storage account.

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}

SAP HANA input dataset
This dataset defines the SAP HANA dataset. You set the type of the Data Factory dataset to RelationalTable.
Currently, you do not specify any type-specific properties for an SAP HANA dataset. The query in the Copy Activity definition specifies what data to read from the SAP HANA instance. Setting the external property to true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. The frequency and interval properties define the schedule. In this case, the data is read from the SAP HANA instance hourly.

{
  "name": "SapHanaDataset",
  "properties": {
    "type": "RelationalTable",
    "linkedServiceName": "SapHanaLinkedService",
    "typeProperties": {},
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "external": true
  }
}

Azure Blob output dataset
This dataset defines the output Azure Blob dataset. The type property is set to AzureBlob. The typeProperties section specifies where the data copied from the SAP HANA instance is stored. The data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
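The dynamic folder path evaluation described above can be mimicked with a small local sketch. The evaluate_folder_path helper is hypothetical (Data Factory performs this substitution in the service, not in user code), and the mapping from .NET format strings to strftime directives is an assumption for the four partitions used here:

```python
from datetime import datetime

def evaluate_folder_path(template, slice_start):
    # Format each partition variable from the slice start time.
    # .NET format strings ("yyyy", "MM", "dd", "HH") are mapped by hand
    # to the equivalent strftime directives.
    partitions = {
        "Year": slice_start.strftime("%Y"),
        "Month": slice_start.strftime("%m"),
        "Day": slice_start.strftime("%d"),
        "Hour": slice_start.strftime("%H"),
    }
    for name, value in partitions.items():
        template = template.replace("{" + name + "}", value)
    return template

path = evaluate_folder_path(
    "mycontainer/saphana/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
    datetime(2017, 3, 1, 18, 0),
)
# path == "mycontainer/saphana/yearno=2017/monthno=03/dayno=01/hourno=18"
```

Each hourly slice therefore lands in its own folder, which is what lets a rerun of a slice overwrite the same location instead of writing somewhere new.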
{
  "name": "AzureBlobDataSet",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/saphana/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
      "format": {
        "type": "TextFormat",
        "rowDelimiter": "\n",
        "columnDelimiter": "\t"
      },
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ]
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

Pipeline with Copy activity
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource (for the SAP HANA source) and the sink type is set to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy.

{
  "name": "CopySapHanaToBlob",
  "properties": {
    "description": "pipeline for copy activity",
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "RelationalSource",
            "query": "<SQL Query for HANA>"
          },
          "sink": {
            "type": "BlobSink",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          }
        },
        "inputs": [
          { "name": "SapHanaDataset" }
        ],
        "outputs": [
          { "name": "AzureBlobDataSet" }
        ],
        "policy": {
          "timeout": "01:00:00",
          "concurrency": 1
        },
        "scheduler": {
          "frequency": "Hour",
          "interval": 1
        },
        "name": "SapHanaToBlob"
      }
    ],
    "start": "2017-03-01T18:00:00Z",
    "end": "2017-03-01T19:00:00Z"
  }
}

Type mapping for SAP HANA
As mentioned in the data movement activities article, the Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:
1. Convert from native source types to a .NET type.
2.
Convert from the .NET type to the native sink type.
When moving data from SAP HANA, the following mappings are used from SAP HANA types to .NET types.

SAP HANA TYPE    .NET-BASED TYPE
TINYINT          Byte
SMALLINT         Int16
INT              Int32
BIGINT           Int64
REAL             Single
DOUBLE           Double
DECIMAL          Decimal
BOOLEAN          Byte
VARCHAR          String
NVARCHAR         String
CLOB             Byte[]
ALPHANUM         String
BLOB             Byte[]
DATE             DateTime
TIME             TimeSpan
TIMESTAMP        DateTime
SECONDDATE       DateTime

Known limitations
There are a few known limitations when copying data from SAP HANA:
NVARCHAR strings are truncated to a maximum length of 4000 Unicode characters.
SMALLDECIMAL is not supported.
VARBINARY is not supported.
Valid dates are between 1899/12/30 and 9999/12/31.

Map source to sink columns
To learn about mapping columns in the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory.

Repeatable read from relational sources
When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.

Performance and tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and the various ways to optimize it.

Move data from an SFTP server using Azure Data Factory
6/5/2017 • 11 min to read

This article outlines how to use the Copy Activity in Azure Data Factory to move data from an on-premises or cloud SFTP server to a supported sink data store. It builds on the data movement activities article, which presents a general overview of data movement with the copy activity and the list of data stores supported as sources/sinks.
Data Factory currently supports only moving data from an SFTP server to other data stores, not moving data from other data stores to an SFTP server. It supports both on-premises and cloud SFTP servers.

NOTE
Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline.

Supported scenarios and authentication types
You can use this SFTP connector to copy data from both cloud SFTP servers and on-premises SFTP servers. Basic and SshPublicKey authentication types are supported when connecting to the SFTP server. When copying data from an on-premises SFTP server, you need to install a Data Management Gateway in the on-premises environment/Azure VM. See Data Management Gateway for details on the gateway. See the moving data between on-premises locations and cloud article for step-by-step instructions on setting up the gateway and using it.

Getting started
You can create a pipeline with a copy activity that moves data from an SFTP source by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough of creating a pipeline by using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. For JSON samples to copy data from an SFTP server to Azure Blob storage, see the JSON Example: Copy data from SFTP server to Azure blob section of this article.

Linked service properties
The following list describes the JSON elements specific to the SFTP linked service.

type — The type property must be set to Sftp. Required:
Yes.
host — Name or IP address of the SFTP server. Required: Yes.
port — Port on which the SFTP server is listening. The default value is 22. Required: No.
authenticationType — Specify the authentication type. Allowed values: Basic, SshPublicKey. Refer to the Using basic authentication and Using SSH public key authentication sections for more properties and JSON samples, respectively. Required: Yes.
skipHostKeyValidation — Specify whether to skip host key validation. Required: No. The default value is false.
hostKeyFingerprint — Specify the fingerprint of the host key. Required: Yes, if skipHostKeyValidation is set to false.
gatewayName — Name of the Data Management Gateway to connect to an on-premises SFTP server. Required: Yes, if copying data from an on-premises SFTP server.
encryptedCredential — Encrypted credential to access the SFTP server. Auto-generated when you specify basic authentication (username + password) or SshPublicKey authentication (username + private key path or content) in the Copy Wizard or the ClickOnce popup dialog. Required: No. Applies only when copying data from an on-premises SFTP server.

Using basic authentication
To use basic authentication, set authenticationType to Basic, and specify the following properties besides the SFTP connector generic ones introduced in the last section:

username — User who has access to the SFTP server. Required: Yes.
password — Password for the user (username). Required:
Yes.

Example: Basic authentication

{
  "name": "SftpLinkedService",
  "properties": {
    "type": "Sftp",
    "typeProperties": {
      "host": "mysftpserver",
      "port": 22,
      "authenticationType": "Basic",
      "username": "xxx",
      "password": "xxx",
      "skipHostKeyValidation": false,
      "hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
      "gatewayName": "mygateway"
    }
  }
}

Example: Basic authentication with encrypted credential

{
  "name": "SftpLinkedService",
  "properties": {
    "type": "Sftp",
    "typeProperties": {
      "host": "mysftpserver",
      "port": 22,
      "authenticationType": "Basic",
      "username": "xxx",
      "encryptedCredential": "xxxxxxxxxxxxxxxxx",
      "skipHostKeyValidation": false,
      "hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
      "gatewayName": "mygateway"
    }
  }
}

Using SSH public key authentication
To use SSH public key authentication, set authenticationType to SshPublicKey, and specify the following properties besides the SFTP connector generic ones introduced in the last section:

username — User who has access to the SFTP server. Required: Yes.
privateKeyPath — Specify the absolute path to the private key file that the gateway can access. Required: Specify either privateKeyPath or privateKeyContent. Applies only when copying data from an on-premises SFTP server.
privateKeyContent — A serialized string of the private key content. The Copy Wizard can read the private key file and extract the private key content automatically. If you are using any other tool/SDK, use the privateKeyPath property instead. Required: Specify either privateKeyPath or privateKeyContent.
passPhrase — Specify the pass phrase/password to decrypt the private key if the key file is protected by a pass phrase. Required: Yes, if the private key file is protected by a pass phrase.

NOTE
The SFTP connector supports only OpenSSH keys. Make sure your key file is in the proper format. You can use the PuTTY tool to convert from .ppk to OpenSSH format.
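Before pointing privateKeyPath at a key file, it can help to check which format the file is in. The following is a local sketch (a hypothetical helper, not part of Data Factory or the gateway) that distinguishes a PuTTY .ppk header from a PEM/OpenSSH header based on the first line of the file:

```python
def key_format(first_line):
    """Classify a private key file by its first line (illustrative heuristic)."""
    line = first_line.strip()
    if line.startswith("PuTTY-User-Key-File"):
        # PuTTY .ppk: convert with PuTTYgen (Conversions > Export OpenSSH key).
        return "ppk"
    if line.startswith("-----BEGIN") and "PRIVATE KEY" in line:
        # Covers "BEGIN RSA PRIVATE KEY" (PEM) and "BEGIN OPENSSH PRIVATE KEY".
        return "pem/openssh"
    return "unknown"

print(key_format("PuTTY-User-Key-File-2: ssh-rsa"))   # ppk
print(key_format("-----BEGIN RSA PRIVATE KEY-----"))  # pem/openssh
```

A .ppk result means the key needs conversion with PuTTYgen before the connector can use it, per the note above.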
Example: SshPublicKey authentication using private key filePath

```json
{
  "name": "SftpLinkedServiceWithPrivateKeyPath",
  "properties": {
    "type": "Sftp",
    "typeProperties": {
      "host": "mysftpserver",
      "port": 22,
      "authenticationType": "SshPublicKey",
      "username": "xxx",
      "privateKeyPath": "D:\\privatekey_openssh",
      "passPhrase": "xxx",
      "skipHostKeyValidation": true,
      "gatewayName": "mygateway"
    }
  }
}
```

Example: SshPublicKey authentication using private key content

```json
{
  "name": "SftpLinkedServiceWithPrivateKeyContent",
  "properties": {
    "type": "Sftp",
    "typeProperties": {
      "host": "mysftpserver.westus.cloudapp.azure.com",
      "port": 22,
      "authenticationType": "SshPublicKey",
      "username": "xxx",
      "privateKeyContent": "<base64 string of the private key content>",
      "passPhrase": "xxx",
      "skipHostKeyValidation": true
    }
  }
}
```

Dataset properties

For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. The typeProperties section is different for each type of dataset; it provides information that is specific to the dataset type. The typeProperties section for a dataset of type FileShare has the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| folderPath | Sub path to the folder. Use the escape character ‘\’ for special characters in the string. See Sample linked service and dataset definitions for examples. You can combine this property with partitionedBy to have folder paths based on slice start/end date-times. | Yes |
| fileName | Specify the name of the file in the folderPath if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder. When fileName is not specified for an output dataset, the name of the generated file is in the following format: Data.<Guid>.txt (Example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). | No |
| fileFilter | Specify a filter to be used to select a subset of files in the folderPath rather than all files. Allowed values are: * (multiple characters) and ? (single character). Example 1: "fileFilter": "*.log". Example 2: "fileFilter": "2014-1-?.txt". fileFilter is applicable for an input FileShare dataset. This property is not supported with HDFS. | No |
| partitionedBy | partitionedBy can be used to specify a dynamic folderPath and fileName for time-series data. For example, a folderPath parameterized for every hour of data. | No |
| format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | No |
| compression | Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | No |
| useBinaryTransfer | Specify whether to use binary transfer mode. True for binary mode, false for ASCII. Default value: True. This property can be used only when the associated linked service is of type FtpServer. | No |

NOTE
fileName and fileFilter cannot be used simultaneously.

Using partitionedBy property

As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time-series data with partitionedBy. You can do so with the Data Factory macros and the system variables SliceStart and SliceEnd, which indicate the logical time period for a given data slice.
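The fileFilter wildcards described above (`*` for multiple characters, `?` for a single character) behave like shell-style globs. A quick Python sketch using the standard-library fnmatch module (an approximation of the connector's matching; exact semantics such as case sensitivity may differ):

```python
from fnmatch import fnmatch

files = ["app.log", "app.log.bak", "2014-1-5.txt", "2014-1-15.txt", "readme.md"]

# "*.log" keeps only names ending in .log
logs = [f for f in files if fnmatch(f, "*.log")]

# "2014-1-?.txt": ? matches exactly one character, so 2014-1-15.txt is excluded
days = [f for f in files if fnmatch(f, "2014-1-?.txt")]
```

Here `logs` contains only `app.log`, and `days` contains only `2014-1-5.txt`.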
To learn about time-series datasets, scheduling, and slices, see the Creating Datasets, Scheduling & Execution, and Creating Pipelines articles.

Sample 1:

```json
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
    { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }
],
```

In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. Example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.

Sample 2:

```json
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
    { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
    { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
    { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
    { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
```

In this example, the year, month, day, and time of SliceStart are extracted into separate variables that are used by the folderPath and fileName properties.

Copy activity properties

For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities, whereas the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, the type properties vary depending on the types of sources and sinks.
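The partitionedBy samples above format SliceStart with .NET-style date format strings. A Python sketch of the substitution, translating those format strings to strftime equivalents (the translation table and helper are assumptions for illustration, not Data Factory code):

```python
from datetime import datetime

# .NET-style date format strings used by partitionedBy, mapped to strftime
NET_TO_STRFTIME = {"yyyyMMddHH": "%Y%m%d%H", "yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}

def render_path(template, slice_start, partitions):
    """partitions maps a variable name to a .NET date format applied to SliceStart."""
    for name, net_fmt in partitions.items():
        template = template.replace("{%s}" % name,
                                    slice_start.strftime(NET_TO_STRFTIME[net_fmt]))
    return template

# Sample 1 from the article: a slice starting 2014-10-01 03:00
slice_start = datetime(2014, 10, 1, 3)
path = render_path("wikidatagateway/wikisampledataout/{Slice}",
                   slice_start, {"Slice": "yyyyMMddHH"})
```

`path` resolves to `wikidatagateway/wikisampledataout/2014100103`, matching the folder path shown for Sample 1.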
In Copy Activity, when the source is of type FileSystemSource, the following properties are available in the typeProperties section:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. | True, False (default) | No |

Supported file and compression formats

See the File and compression formats in Azure Data Factory article for details.

JSON Example: Copy data from SFTP server to Azure blob

The following example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from an SFTP source to Azure Blob storage. However, data can be copied directly from any of the supported sources to any of the supported sinks by using the Copy Activity in Azure Data Factory.

IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See the moving data between on-premises locations and cloud article for step-by-step instructions.

The sample has the following data factory entities:

A linked service of type Sftp.
A linked service of type AzureStorage.
An input dataset of type FileShare.
An output dataset of type AzureBlob.
A pipeline with a Copy Activity that uses FileSystemSource and BlobSink.

The sample copies data from an SFTP server to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.

SFTP linked service

This example uses basic authentication with the user name and password in plain text. You can also use one of the following ways:

Basic authentication with encrypted credentials
SSH public key authentication

See the SFTP linked service section for the different types of authentication you can use.

```json
{
  "name": "SftpLinkedService",
  "properties": {
    "type": "Sftp",
    "typeProperties": {
      "host": "mysftpserver",
      "port": 22,
      "authenticationType": "Basic",
      "username": "myuser",
      "password": "mypassword",
      "skipHostKeyValidation": false,
      "hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
      "gatewayName": "mygateway"
    }
  }
}
```

Azure Storage linked service

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

SFTP input dataset

This dataset refers to the SFTP folder mysharedfolder and the file test.csv. The pipeline copies the file to the destination.

Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.

```json
{
  "name": "SFTPFileInput",
  "properties": {
    "type": "FileShare",
    "linkedServiceName": "SftpLinkedService",
    "typeProperties": {
      "folderPath": "mysharedfolder",
      "fileName": "test.csv"
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

Azure Blob output dataset

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
```json
{
  "name": "AzureBlobOutput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/sftp/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
      "format": {
        "type": "TextFormat",
        "rowDelimiter": "\n",
        "columnDelimiter": "\t"
      },
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ]
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

Pipeline with Copy activity

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource and the sink type is set to BlobSink.

```json
{
  "name": "pipeline",
  "properties": {
    "activities": [{
      "name": "SFTPToBlobCopy",
      "inputs": [{ "name": "SFTPFileInput" }],
      "outputs": [{ "name": "AzureBlobOutput" }],
      "type": "Copy",
      "typeProperties": {
        "source": { "type": "FileSystemSource" },
        "sink": { "type": "BlobSink" }
      },
      "scheduler": { "frequency": "Hour", "interval": 1 },
      "policy": {
        "concurrency": 1,
        "executionPriorityOrder": "NewestFirst",
        "retry": 1,
        "timeout": "00:05:00"
      }
    }],
    "start": "2017-02-20T18:00:00Z",
    "end": "2017-02-20T19:00:00Z"
  }
}
```

Performance and Tuning

See the Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Next Steps

See the following articles:

Copy Activity tutorial for step-by-step instructions for creating a pipeline with a Copy Activity.
Move data to and from SQL Server on-premises or on IaaS (Azure VM) using Azure Data Factory

6/9/2017 • 18 min to read

This article explains how to use the Copy Activity in Azure Data Factory to move data to and from an on-premises SQL Server database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.

Supported scenarios

You can copy data from a SQL Server database to the following data stores:

| CATEGORY | DATA STORE |
| --- | --- |
| Azure | Azure Blob storage, Azure Data Lake Store, Azure Cosmos DB (DocumentDB API), Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage |
| Databases | SQL Server, Oracle |
| File | File system |

You can copy data from the following data stores to a SQL Server database:

| CATEGORY | DATA STORE |
| --- | --- |
| Azure | Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage |
| Databases | Amazon Redshift, DB2, MySQL, Oracle, PostgreSQL, SAP Business Warehouse, SAP HANA, SQL Server, Sybase, Teradata |
| NoSQL | Cassandra, MongoDB |
| File | Amazon S3, File System, FTP, HDFS, SFTP |
| Others | Generic HTTP, Generic OData, Generic ODBC, Salesforce, Web Table (table from HTML), GE Historian |

Supported SQL Server versions

This SQL Server connector supports copying data from/to the following versions of instances hosted on-premises or in Azure IaaS, using both SQL authentication and Windows authentication: SQL Server 2016, SQL Server 2014, SQL Server 2012, SQL Server 2008 R2, SQL Server 2008, SQL Server 2005.

Enabling connectivity

The concepts and steps needed for connecting with SQL Server hosted on-premises or in Azure IaaS (Infrastructure-as-a-Service) VMs are the same. In both cases, you need to use Data Management Gateway for connectivity. See the moving data between on-premises locations and cloud article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway.
Setting up a gateway instance is a prerequisite for connecting with SQL Server. While you can install the gateway on the same on-premises machine or cloud VM instance as SQL Server, we recommend that you install it on a separate machine for better performance. Having the gateway and SQL Server on separate machines reduces resource contention.

Getting started

You can create a pipeline with a copy activity that moves data to/from an on-premises SQL Server database by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.

Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are copying data from a SQL Server database to Azure blob storage, you create two linked services to link your SQL Server database and Azure storage account to your data factory. For linked service properties that are specific to SQL Server database, see the linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the SQL table in your SQL Server database that contains the input data, and another dataset to specify the blob container and the folder that holds the data copied from the SQL Server database.
For dataset properties that are specific to SQL Server database, see the dataset properties section.

4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use SqlSource as a source and BlobSink as a sink for the copy activity. Similarly, if you are copying from Azure Blob storage to SQL Server database, you use BlobSource and SqlSink in the copy activity. For copy activity properties that are specific to SQL Server database, see the copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an on-premises SQL Server database, see the JSON examples section of this article.

The following sections provide details about JSON properties that are used to define Data Factory entities specific to SQL Server:

Linked service properties

You create a linked service of type OnPremisesSqlServer to link an on-premises SQL Server database to a data factory. The following table provides a description of the JSON elements specific to the on-premises SQL Server linked service.

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property should be set to: OnPremisesSqlServer. | Yes |
| connectionString | Specify the connectionString information needed to connect to the on-premises SQL Server database using either SQL authentication or Windows authentication. | Yes |
| gatewayName | Name of the gateway that the Data Factory service should use to connect to the on-premises SQL Server database. | Yes |
| username | Specify the user name if you are using Windows authentication. Example: domainname\username. | No |
| password | Specify the password for the user account you specified for the username. | No |

You can encrypt credentials using the New-AzureRmDataFactoryEncryptValue cmdlet and use them in the connection string as shown in the following example (EncryptedCredential property):

```json
"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=True;EncryptedCredential=<encrypted credential>",
```

Samples

JSON for using SQL Authentication

```json
{
  "name": "MyOnPremisesSQLDB",
  "properties": {
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=False;User ID=<username>;Password=<password>;",
      "gatewayName": "<gateway name>"
    }
  }
}
```

JSON for using Windows Authentication

Data Management Gateway will impersonate the specified user account to connect to the on-premises SQL Server database.

```json
{
  "Name": "MyOnPremisesSQLDB",
  "Properties": {
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "ConnectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=True;",
      "username": "<domain\\username>",
      "password": "<password>",
      "gatewayName": "<gateway name>"
    }
  }
}
```

Dataset properties

In the samples, you have used a dataset of type SqlServerTable to represent a table in a SQL Server database. For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (SQL Server, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store.
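The connection strings shown above are semicolon-separated key=value pairs; whether a string uses Windows or SQL authentication is determined by its Integrated Security key. A naive Python parser sketch (helper name hypothetical; real connection-string grammars also allow quoting and escaping, which this ignores):

```python
def parse_connection_string(cs):
    """Split 'Key=Value;...' into a dict (naive: no quoting/escaping support)."""
    pairs = (p for p in cs.split(";") if p.strip())
    return {k.strip(): v.strip() for k, v in (p.split("=", 1) for p in pairs)}

cs = ("Data Source=myserver;Initial Catalog=MarketingCampaigns;"
      "Integrated Security=False;User ID=admin;Password=secret;")
props = parse_connection_string(cs)

# Integrated Security=True means Windows authentication; False means SQL authentication
uses_windows_auth = props.get("Integrated Security", "False").lower() == "true"
```

For this sample string, `uses_windows_auth` is False, so the User ID/Password pair is required.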
The typeProperties section for the dataset of type SqlServerTable has the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| tableName | Name of the table or view in the SQL Server database instance that the linked service refers to. | Yes |

Copy activity properties

If you are moving data from a SQL Server database, you set the source type in the copy activity to SqlSource. Similarly, if you are moving data to a SQL Server database, you set the sink type in the copy activity to SqlSink. This section provides a list of properties supported by SqlSource and SqlSink.

For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities.

NOTE
The Copy Activity takes only one input and produces only one output.

Whereas, the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.

SqlSource

When the source in a copy activity is of type SqlSource, the following properties are available in the typeProperties section:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| sqlReaderQuery | Use the custom query to read data. | SQL query string. For example: select * from MyTable. May reference multiple tables from the database referenced by the input dataset. If not specified, the SQL statement that is executed is: select * from MyTable. | No |
| sqlReaderStoredProcedureName | Name of the stored procedure that reads data from the source table. | Name of the stored procedure. The last SQL statement must be a SELECT statement in the stored procedure. | No |
| storedProcedureParameters | Parameters for the stored procedure. | Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No |

If the sqlReaderQuery is specified for the SqlSource, the Copy Activity runs this query against the SQL Server database source to get the data. Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters). If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section are used to build a select query to run against the SQL Server database. If the dataset definition does not have the structure, all columns are selected from the table.

NOTE
When you use sqlReaderStoredProcedureName, you still need to specify a value for the tableName property in the dataset JSON. There are no validations performed against this table though.

SqlSink

SqlSink supports the following properties:

| PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED |
| --- | --- | --- | --- |
| writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. | timespan. Example: "00:30:00" (30 minutes). | No |
| writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. | Integer (number of rows) | No (default: 10000) |
| sqlWriterCleanupScript | Specify a query for Copy Activity to execute such that the data of a specific slice is cleaned up. For more information, see the repeatable copy section. | A query statement. | No |
| sliceIdentifierColumnName | Specify a column name for Copy Activity to fill with an auto-generated slice identifier, which is used to clean up data of a specific slice when rerun. For more information, see the repeatable copy section. | Column name of a column with data type of binary(32). | No |
| sqlWriterStoredProcedureName | Name of the stored procedure that upserts (updates/inserts) data into the target table. | Name of the stored procedure. | No |
| storedProcedureParameters | Parameters for the stored procedure. | Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No |
| sqlWriterTableType | Specify a table type name to be used in the stored procedure. Copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data being copied with existing data. | A table type name. | No |

JSON examples for copying data from and to SQL Server

The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. The following samples show how to copy data to and from SQL Server and Azure Blob storage. However, data can be copied directly from any of the supported sources to any of the supported sinks by using the Copy Activity in Azure Data Factory.

Example: Copy data from SQL Server to Azure Blob

The following sample shows:

1. A linked service of type OnPremisesSqlServer.
2. A linked service of type AzureStorage.
3. An input dataset of type SqlServerTable.
4. An output dataset of type AzureBlob.
5. A pipeline with a Copy activity that uses SqlSource and BlobSink.

The sample copies time-series data from a SQL Server table to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.

As a first step, set up the Data Management Gateway. The instructions are in the moving data between on-premises locations and cloud article.
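As described in the SqlSource section above, when neither sqlReaderQuery nor sqlReaderStoredProcedureName is given, the columns in the dataset's structure section are projected into a select query, and with no structure all columns are selected. A Python sketch of that fallback (the function is hypothetical, not Data Factory's actual query builder):

```python
def build_source_query(table_name, structure=None):
    """Mimic the SqlSource fallback: project the structure columns, or select everything."""
    if structure:
        columns = ", ".join(col["name"] for col in structure)
        return "select %s from %s" % (columns, table_name)
    return "select * from %s" % table_name

# Dataset with a structure section vs. one without
q1 = build_source_query("MyTable", [{"name": "name"}, {"name": "age"}])
q2 = build_source_query("MyTable")
```

Here `q1` is `select name, age from MyTable` and `q2` is `select * from MyTable`.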
SQL Server linked service

```json
{
  "Name": "SqlServerLinkedService",
  "properties": {
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;",
      "gatewayName": "<gatewayname>"
    }
  }
}
```

Azure Blob storage linked service

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

SQL Server input dataset

The sample assumes you have created a table "MyTable" in SQL Server and that it contains a column called "timestampcolumn" for time-series data. You can query over multiple tables within the same database by using a single dataset, but a single table must be used for the dataset's tableName typeProperty.

Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.

```json
{
  "name": "SqlServerInput",
  "properties": {
    "type": "SqlServerTable",
    "linkedServiceName": "SqlServerLinkedService",
    "typeProperties": {
      "tableName": "MyTable"
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    }
  }
}
```

Azure Blob output dataset

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
```json
{
  "name": "AzureBlobOutput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ],
      "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t",
        "rowDelimiter": "\n"
      }
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

Pipeline with Copy activity

The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlSource and the sink type is set to BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to copy.

```json
{
  "name": "SamplePipeline",
  "properties": {
    "start": "2016-06-01T18:00:00",
    "end": "2016-06-01T19:00:00",
    "description": "pipeline for copy activity",
    "activities": [{
      "name": "SqlServertoBlob",
      "description": "copy activity",
      "type": "Copy",
      "inputs": [{ "name": "SqlServerInput" }],
      "outputs": [{ "name": "AzureBlobOutput" }],
      "typeProperties": {
        "source": {
          "type": "SqlSource",
          "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
        },
        "sink": {
          "type": "BlobSink"
        }
      },
      "scheduler": { "frequency": "Hour", "interval": 1 },
      "policy": {
        "concurrency": 1,
        "executionPriorityOrder": "OldestFirst",
        "retry": 0,
        "timeout": "01:00:00"
      }
    }]
  }
}
```

In this example, sqlReaderQuery is specified for the SqlSource.
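The $$Text.Format expression above substitutes WindowStart and WindowEnd into the query using .NET composite formatting. A rough Python equivalent of that substitution for the sample's one-hour window (the .NET-to-strftime format translation is an assumption for illustration):

```python
from datetime import datetime

# The slice window for the sample pipeline (start/end one hour apart)
window_start = datetime(2016, 6, 1, 18, 0)
window_end = datetime(2016, 6, 1, 19, 0)

# .NET '{0:yyyy-MM-dd HH:mm}' roughly corresponds to strftime '%Y-%m-%d %H:%M'
query = ("select * from MyTable where timestampcolumn >= '{0}' "
         "AND timestampcolumn < '{1}'").format(
    window_start.strftime("%Y-%m-%d %H:%M"),
    window_end.strftime("%Y-%m-%d %H:%M"))
```

The resulting query selects rows with `timestampcolumn` in [2016-06-01 18:00, 2016-06-01 19:00), i.e., exactly the slice being processed.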
The Copy Activity runs this query against the SQL Server database source to get the data. Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters). The sqlReaderQuery can reference multiple tables within the database referenced by the input dataset. It is not limited to only the table set as the dataset's tableName typeProperty. If you do not specify sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section are used to build a select query to run against the SQL Server database. If the dataset definition does not have the structure, all columns are selected from the table.

See the SqlSource section and BlobSink for the list of properties supported by SqlSource and BlobSink.

Example: Copy data from Azure Blob to SQL Server

The following sample shows:

1. A linked service of type OnPremisesSqlServer.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureBlob.
4. An output dataset of type SqlServerTable.
5. A pipeline with a Copy activity that uses BlobSource and SqlSink.

The sample copies time-series data from an Azure blob to a SQL Server table every hour. The JSON properties used in these samples are described in sections following the samples.

SQL Server linked service

```json
{
  "Name": "SqlServerLinkedService",
  "properties": {
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;",
      "gatewayName": "<gatewayname>"
    }
  }
}
```

Azure Blob storage linked service

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

Azure Blob input dataset

Data is picked up from a new blob every hour (frequency: hour, interval: 1).
The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The "external": "true" setting informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.

```json
{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}",
      "fileName": "{Hour}.csv",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ],
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "rowDelimiter": "\n"
      }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    }
  }
}
```

SQL Server output dataset

The sample copies data to a table named "MyOutputTable" in SQL Server. Create the table in SQL Server with the same number of columns as you expect the blob CSV file to contain. New rows are added to the table every hour.

```json
{
  "name": "SqlServerOutput",
  "properties": {
    "type": "SqlServerTable",
    "linkedServiceName": "SqlServerLinkedService",
    "typeProperties": {
      "tableName": "MyOutputTable"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

Pipeline with Copy activity

The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour.
In the pipeline JSON definition, the source type is set to BlobSource and the sink type is set to SqlSink.

```json
{
  "name": "SamplePipeline",
  "properties": {
    "start": "2014-06-01T18:00:00",
    "end": "2014-06-01T19:00:00",
    "description": "pipeline with copy activity",
    "activities": [{
      "name": "AzureBlobtoSQL",
      "description": "Copy Activity",
      "type": "Copy",
      "inputs": [{ "name": "AzureBlobInput" }],
      "outputs": [{ "name": "SqlServerOutput" }],
      "typeProperties": {
        "source": {
          "type": "BlobSource",
          "blobColumnSeparators": ","
        },
        "sink": {
          "type": "SqlSink"
        }
      },
      "scheduler": { "frequency": "Hour", "interval": 1 },
      "policy": {
        "concurrency": 1,
        "executionPriorityOrder": "OldestFirst",
        "retry": 0,
        "timeout": "01:00:00"
      }
    }]
  }
}
```

Troubleshooting connection issues

1. Configure your SQL Server instance to accept remote connections. Launch SQL Server Management Studio, right-click the server, and click Properties. Select Connections from the list and check Allow remote connections to the server. See Configure the remote access Server Configuration Option for detailed steps.
2. Launch SQL Server Configuration Manager. Expand SQL Server Network Configuration for the instance you want, and select Protocols for MSSQLSERVER. You should see the protocols in the right pane. Enable TCP/IP by right-clicking TCP/IP and clicking Enable. See Enable or Disable a Server Network Protocol for details and alternate ways of enabling the TCP/IP protocol.
3. In the same window, double-click TCP/IP to launch the TCP/IP Properties window.
4. Switch to the IP Addresses tab. Scroll down to see the IPAll section. Note down the TCP Port (the default is 1433).
5. Create a rule for the Windows Firewall on the machine to allow incoming traffic through this port.
6. Verify connection: To connect to the SQL Server using the fully qualified name, use SQL Server Management Studio from a different machine. For example: "..corp..com,1433."
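Step 6 above verifies connectivity with SQL Server Management Studio; a lighter first check is whether the machine can even reach the SQL Server port over TCP. A small Python sketch (this only tests TCP reachability, not the SQL Server handshake or authentication; the host name is a placeholder):

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. tcp_port_open("myserver.corp.example.com", 1433)
```

If this returns False from the gateway machine, revisit the protocol and firewall settings in steps 2 through 5 before troubleshooting the gateway itself.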
IMPORTANT See Move data between on-premises sources and the cloud with Data Management Gateway for detailed information. See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.

Identity columns in the target database
This section provides an example that copies data from a source table with no identity column to a destination table with an identity column.

Source table: create table dbo.SourceTbl ( name varchar(100), age int )

Destination table: create table dbo.TargetTbl ( identifier int identity(1,1), name varchar(100), age int )

Notice that the target table has an identity column.

Source dataset JSON definition
{ "name": "SampleSource", "properties": { "published": false, "type": "SqlServerTable", "linkedServiceName": "TestIdentitySQL", "typeProperties": { "tableName": "SourceTbl" }, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": {} } }

Destination dataset JSON definition
{ "name": "SampleTarget", "properties": { "structure": [ { "name": "name" }, { "name": "age" } ], "published": false, "type": "AzureSqlTable", "linkedServiceName": "TestIdentitySQLSource", "typeProperties": { "tableName": "TargetTbl" }, "availability": { "frequency": "Hour", "interval": 1 }, "external": false, "policy": {} } }

Because the source and target tables have different schemas (the target has an additional identity column), you need to specify the structure property in the target dataset definition, which doesn't include the identity column.

Invoke stored procedure from SQL sink
See the Invoke stored procedure for SQL sink in copy activity article for an example of invoking a stored procedure from a SQL sink in a copy activity of a pipeline.

Type mapping for SQL Server
As mentioned in the data movement activities article, the Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: 1. Convert from native source types to .NET type 2.
Convert from .NET type to native sink type. When moving data to and from SQL Server, the following mappings are used between SQL Server types and .NET types. The mapping is the same as the SQL Server Data Type Mapping for ADO.NET.

| SQL Server Database Engine type | .NET Framework type |
| --- | --- |
| bigint | Int64 |
| binary | Byte[] |
| bit | Boolean |
| char | String, Char[] |
| date | DateTime |
| datetime | DateTime |
| datetime2 | DateTime |
| datetimeoffset | DateTimeOffset |
| decimal | Decimal |
| FILESTREAM attribute (varbinary(max)) | Byte[] |
| float | Double |
| image | Byte[] |
| int | Int32 |
| money | Decimal |
| nchar | String, Char[] |
| ntext | String, Char[] |
| numeric | Decimal |
| nvarchar | String, Char[] |
| real | Single |
| rowversion | Byte[] |
| smalldatetime | DateTime |
| smallint | Int16 |
| smallmoney | Decimal |
| sql_variant | Object |
| text | String, Char[] |
| time | TimeSpan |
| timestamp | Byte[] |
| tinyint | Byte |
| uniqueidentifier | Guid |
| varbinary | Byte[] |
| varchar | String, Char[] |
| xml | Xml |

Mapping source to sink columns
To map columns from the source dataset to columns from the sink dataset, see Mapping dataset columns in Azure Data Factory.

Repeatable copy
When copying data to a SQL Server database, the copy activity appends data to the sink table by default. To perform an UPSERT instead, see the Repeatable write to SqlSink article. When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.

Performance and Tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and ways to optimize it.
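The mapping table above is a straight lookup. As a sketch of how step 1 of the conversion (native source type to .NET type) might be applied, here is a small subset of the entries; the table in the article lists some types with multiple .NET targets (e.g. "String, Char[]"), which this simplified dictionary collapses to the first:

```python
# Subset of the SQL Server -> .NET Framework type mapping from the table above.
SQL_TO_DOTNET = {
    "bigint": "Int64",
    "bit": "Boolean",
    "datetime": "DateTime",
    "decimal": "Decimal",
    "int": "Int32",
    "nvarchar": "String",      # table lists "String, Char[]"
    "uniqueidentifier": "Guid",
    "varbinary": "Byte[]",
}

def to_dotnet_type(sql_type: str) -> str:
    """Resolve a SQL Server engine type name to its .NET Framework type."""
    return SQL_TO_DOTNET[sql_type.lower()]

print(to_dotnet_type("bigint"))  # Int64
```

Step 2 (from the .NET type to the sink's native type) would be a second lookup against the sink connector's own mapping table.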
Move data from Sybase using Azure Data Factory 4/12/2017 • 8 min to read • Edit Online

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises Sybase database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an on-premises Sybase data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from a Sybase data store to other data stores, not moving data from other data stores to a Sybase data store.

Prerequisites
The Data Factory service supports connecting to on-premises Sybase sources using the Data Management Gateway. See the moving data between on-premises locations and cloud article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. The gateway is required even if the Sybase database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM, as long as the gateway can connect to the database.

NOTE See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.

Supported versions and installation
For Data Management Gateway to connect to the Sybase database, you need to install the data provider for Sybase iAnywhere.Data.SQLAnywhere 16 or above on the same system as the Data Management Gateway. Sybase version 16 and above is supported.

Getting started
You can create a pipeline with a copy activity that moves data from an on-premises Sybase data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2. Create datasets to represent input and output data for the copy operation. 3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises Sybase data store, see JSON example: Copy data from Sybase to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Sybase data store: Linked service properties The following table provides description for JSON elements specific to Sybase linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: OnPremisesSybase Yes server Name of the Sybase server. Yes database Name of the Sybase database. Yes schema Name of the schema in the database. No authenticationType Type of authentication used to connect to the Sybase database. Possible values are: Anonymous, Basic, and Windows. Yes username Specify user name if you are using Basic or Windows authentication. No password Specify password for the user account you specified for the username. 
No gatewayName Name of the gateway that the Data Factory service should use to connect to the on-premises Sybase database. Yes Dataset properties For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type RelationalTable (which includes Sybase dataset) has the following properties: PROPERTY DESCRIPTION REQUIRED tableName Name of the table in the Sybase Database instance that linked service refers to. No (if query of RelationalSource is specified) Copy activity properties For a full list of sections & properties available for defining activities, see Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities. Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. When the source is of type RelationalSource (which includes Sybase), the following properties are available in typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED query Use the custom query to read data. SQL query string. For example: select * from MyTable. No (if tableName of dataset is specified) JSON example: Copy data from Sybase to Azure Blob The following example provides sample JSON definitions that you can use to create a pipeline by using Azure portal or Visual Studio or Azure PowerShell. They show how to copy data from Sybase database to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. 
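The required/optional columns in the linked service property table above can be checked mechanically. The following is a hypothetical validator, not part of any Data Factory SDK, shown against a flattened view of the OnPremisesSybase properties purely to illustrate the rules in the table:

```python
# Required and optional properties per the Sybase linked service table above.
REQUIRED = {"type", "server", "database", "authenticationType", "gatewayName"}
OPTIONAL = {"schema", "username", "password"}

def validate_sybase_linked_service(props: dict) -> list:
    """Return a list of problems found in a flattened Sybase linked-service dict."""
    errors = [f"missing required property: {p}" for p in sorted(REQUIRED - props.keys())]
    if props.get("type") != "OnPremisesSybase":
        errors.append("type must be OnPremisesSybase")
    unknown = props.keys() - REQUIRED - OPTIONAL
    errors.extend(f"unknown property: {p}" for p in sorted(unknown))
    return errors
```

In the real JSON, type sits under properties while the rest sit under typeProperties; a production check would walk that nesting rather than a flat dict.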
The sample has the following Data Factory entities:
1. A linked service of type OnPremisesSybase.
2. A linked service of type AzureStorage.
3. An input dataset of type RelationalTable.
4. An output dataset of type AzureBlob.
5. A pipeline with a Copy Activity that uses RelationalSource and BlobSink.

The sample copies data from a query result in a Sybase database to a blob every hour. The JSON properties used in these samples are described in sections following the samples. As a first step, set up the Data Management Gateway. The instructions are in the moving data between on-premises locations and cloud article.

Sybase linked service:
{ "name": "OnPremSybaseLinkedService", "properties": { "type": "OnPremisesSybase", "typeProperties": { "server": "<server>", "database": "<database>", "schema": "<schema>", "authenticationType": "<authentication type>", "username": "<username>", "password": "<password>", "gatewayName": "<gatewayName>" } } }

Azure Blob storage linked service:
{ "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey= <AccountKey>" } } }

Sybase input dataset:
The sample assumes you have created a table "MyTable" in Sybase and that it contains a column called "timestamp" for time series data. Setting "external": true informs the Data Factory service that this dataset is external to the data factory and is not produced by an activity in the data factory. Notice that the type of the dataset is set to: RelationalTable.
{ "name": "SybaseDataSet", "properties": { "type": "RelationalTable", "linkedServiceName": "OnPremSybaseLinkedService", "typeProperties": {}, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } } Azure Blob output dataset: Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. { "name": "AzureBlobSybaseDataSet", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/sybase/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } } Pipeline with Copy activity: The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run hourly. In the pipeline JSON definition, the source type is set to RelationalSource and sink type is set to BlobSink. The SQL query specified for the query property selects the data from the DBA.Orders table in the database. 
{ "name": "CopySybaseToBlob", "properties": { "description": "pipeline for copy activity", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "RelationalSource", "query": "select * from DBA.Orders" }, "sink": { "type": "BlobSink" } }, "inputs": [ { "name": "SybaseDataSet" } ], "outputs": [ { "name": "AzureBlobSybaseDataSet" } ], "policy": { "timeout": "01:00:00", "concurrency": 1 }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "SybaseToBlob" } ], "start": "2014-06-01T18:00:00Z", "end": "2014-06-01T19:00:00Z" } } Type mapping for Sybase As mentioned in the Data Movement Activities article, the Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: 1. Convert from native source types to .NET type 2. Convert from .NET type to native sink type Sybase supports T-SQL and T-SQL types. For a mapping table from sql types to .NET type, see Azure SQL Connector article. Map source to sink columns To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in Azure Data Factory. Repeatable read from relational sources When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See Repeatable read from relational sources. Performance and Tuning See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. 
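The partitionedBy mechanism used in the blob output dataset above resolves {Year}/{Month}/{Day}/{Hour} from the start time of each slice. A sketch of that evaluation for the folder path in this sample:

```python
from datetime import datetime

def resolve_folder_path(template: str, slice_start: datetime) -> str:
    """Substitute Data Factory-style partition variables using the slice start time."""
    parts = {
        "Year": slice_start.strftime("%Y"),   # format "yyyy"
        "Month": slice_start.strftime("%m"),  # format "MM"
        "Day": slice_start.strftime("%d"),    # format "dd"
        "Hour": slice_start.strftime("%H"),   # format "HH"
    }
    for name, value in parts.items():
        template = template.replace("{" + name + "}", value)
    return template

path = resolve_folder_path(
    "mycontainer/sybase/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
    datetime(2014, 6, 1, 18, 0),  # the slice start from the pipeline's start/end window
)
print(path)  # mycontainer/sybase/yearno=2014/monthno=06/dayno=01/hourno=18
```

Each hourly slice therefore writes to its own folder, which is what makes a rerun of the same slice overwrite the same location rather than a new one.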
Move data from Teradata using Azure Data Factory 4/12/2017 • 8 min to read • Edit Online

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises Teradata database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity. You can copy data from an on-premises Teradata data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from a Teradata data store to other data stores, not moving data from other data stores to a Teradata data store.

Prerequisites
Data Factory supports connecting to on-premises Teradata sources via the Data Management Gateway. See the moving data between on-premises locations and cloud article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. The gateway is required even if the Teradata database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM, as long as the gateway can connect to the database.

NOTE See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.

Supported versions and installation
For Data Management Gateway to connect to the Teradata database, you need to install the .NET Data Provider for Teradata version 14 or above on the same system as the Data Management Gateway. Teradata version 12 and above is supported.

Getting started
You can create a pipeline with a copy activity that moves data from an on-premises Teradata data store by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: 1. Create linked services to link input and output data stores to your data factory. 2. Create datasets to represent input and output data for the copy operation. 3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises Teradata data store, see JSON example: Copy data from Teradata to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Teradata data store: Linked service properties The following table provides description for JSON elements specific to Teradata linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: OnPremisesTeradata Yes server Name of the Teradata server. Yes authenticationType Type of authentication used to connect to the Teradata database. Possible values are: Anonymous, Basic, and Windows. Yes username Specify user name if you are using Basic or Windows authentication. No password Specify password for the user account you specified for the username. No gatewayName Name of the gateway that the Data Factory service should use to connect to the on-premises Teradata database. 
Yes Dataset properties For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. Currently, there are no type properties supported for the Teradata dataset. Copy activity properties For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities. Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. When the source is of type RelationalSource (which includes Teradata), the following properties are available in typeProperties section: PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED query Use the custom query to read data. SQL query string. For example: select * from MyTable. Yes JSON example: Copy data from Teradata to Azure Blob The following example provides sample JSON definitions that you can use to create a pipeline by using Azure portal or Visual Studio or Azure PowerShell. They show how to copy data from Teradata to Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory. The sample has the following data factory entities: 1. 2. 3. 4. 5. A linked service of type OnPremisesTeradata. A linked service of type AzureStorage. An input dataset of type RelationalTable. An output dataset of type AzureBlob. The pipeline with Copy Activity that uses RelationalSource and BlobSink. The sample copies data from a query result in Teradata database to a blob every hour. 
The JSON properties used in these samples are described in sections following the samples. As a first step, set up the Data Management Gateway. The instructions are in the moving data between on-premises locations and cloud article.

Teradata linked service:
{ "name": "OnPremTeradataLinkedService", "properties": { "type": "OnPremisesTeradata", "typeProperties": { "server": "<server>", "authenticationType": "<authentication type>", "username": "<username>", "password": "<password>", "gatewayName": "<gatewayName>" } } }

Azure Blob storage linked service:
{ "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey= <AccountKey>" } } }

Teradata input dataset:
The sample assumes you have created a table "MyTable" in Teradata and that it contains a column called "timestamp" for time series data. Setting "external": true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory.

{ "name": "TeradataDataSet", "properties": { "published": false, "type": "RelationalTable", "linkedServiceName": "OnPremTeradataLinkedService", "typeProperties": { }, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": { "externalData": { "retryInterval": "00:01:00", "retryTimeout": "00:10:00", "maximumRetry": 3 } } } }

Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
{ "name": "AzureBlobTeradataDataSet", "properties": { "published": false, "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "mycontainer/teradata/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": "\t" }, "partitionedBy": [ { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } ] }, "availability": { "frequency": "Hour", "interval": 1 } } }

Pipeline with Copy activity:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run hourly. In the pipeline JSON definition, the source type is set to RelationalSource and the sink type is set to BlobSink. The SQL query specified for the query property selects the data from the past hour to copy.
{ "name": "CopyTeradataToBlob", "properties": { "description": "pipeline for copy activity", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "RelationalSource", "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart, SliceEnd)" }, "sink": { "type": "BlobSink", "writeBatchSize": 0, "writeBatchTimeout": "00:00:00" } }, "inputs": [ { "name": "TeradataDataSet" } ], "outputs": [ { "name": "AzureBlobTeradataDataSet" } ], "policy": { "timeout": "01:00:00", "concurrency": 1 }, "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "TeradataToBlob" } ], "start": "2014-06-01T18:00:00Z", "end": "2014-06-01T19:00:00Z", "isPaused": false } }

Type mapping for Teradata
As mentioned in the data movement activities article, the Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: 1. Convert from native source types to .NET type 2. Convert from .NET type to native sink type. When moving data from Teradata, the following mappings are used from Teradata type to .NET type.
| Teradata Database type | .NET Framework type |
| --- | --- |
| Char | String |
| Clob | String |
| Graphic | String |
| VarChar | String |
| VarGraphic | String |
| Blob | Byte[] |
| Byte | Byte[] |
| VarByte | Byte[] |
| BigInt | Int64 |
| ByteInt | Int16 |
| Decimal | Decimal |
| Double | Double |
| Integer | Int32 |
| Number | Double |
| SmallInt | Int16 |
| Date | DateTime |
| Time | TimeSpan |
| Time With Time Zone | String |
| Timestamp | DateTime |
| Timestamp With Time Zone | DateTimeOffset |
| Interval Day | TimeSpan |
| Interval Day To Hour | TimeSpan |
| Interval Day To Minute | TimeSpan |
| Interval Day To Second | TimeSpan |
| Interval Hour | TimeSpan |
| Interval Hour To Minute | TimeSpan |
| Interval Hour To Second | TimeSpan |
| Interval Minute | TimeSpan |
| Interval Minute To Second | TimeSpan |
| Interval Second | TimeSpan |
| Interval Year | String |
| Interval Year To Month | String |
| Interval Month | String |
| Period(Date) | String |
| Period(Time) | String |
| Period(Time With Time Zone) | String |
| Period(Timestamp) | String |
| Period(Timestamp With Time Zone) | String |
| Xml | String |

Map source to sink columns
To learn about mapping columns in the source dataset to columns in the sink dataset, see Mapping dataset columns in Azure Data Factory.

Repeatable read from relational sources
When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.

Performance and Tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and ways to optimize it.
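The $$Text.Format expression in the Teradata pipeline above expands to a per-slice WHERE clause, which is also what makes reads repeatable: rerunning a slice issues the same bounded query. A sketch of that expansion, with SliceStart and SliceEnd standing in for the values Data Factory supplies at run time:

```python
from datetime import datetime

def slice_query(slice_start: datetime, slice_end: datetime) -> str:
    """Build the hourly-window query the $$Text.Format expression expands to."""
    fmt = "%Y-%m-%dT%H:%M:%S"  # matches the yyyy-MM-ddTHH:mm:ss format string
    return (
        "select * from MyTable where timestamp >= '{0}' "
        "AND timestamp < '{1}'".format(slice_start.strftime(fmt),
                                       slice_end.strftime(fmt))
    )

print(slice_query(datetime(2014, 6, 1, 18), datetime(2014, 6, 1, 19)))
# select * from MyTable where timestamp >= '2014-06-01T18:00:00' AND timestamp < '2014-06-01T19:00:00'
```

The half-open window (>= start, < end) ensures adjacent slices never read the same row twice.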
Move data from a Web table source using Azure Data Factory 4/12/2017 • 6 min to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to move data from a table in a Web page to a supported sink data store. This article builds on the data movement activities article, which presents a general overview of data movement with the copy activity and the list of data stores supported as sources/sinks. Data Factory currently supports only moving data from a Web table to other data stores, not moving data from other data stores to a Web table destination.

IMPORTANT This Web connector currently supports only extracting table content from an HTML page. To retrieve data from an HTTP(S) endpoint, use the HTTP connector instead.

Getting started
You can create a pipeline with a copy activity that moves data from a Web table source by using different tools/APIs. The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard. You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you.
When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from a web table, see JSON example: Copy data from Web table to Azure Blob section of this article. The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Web table: Linked service properties The following table provides description for JSON elements specific to Web linked service. PROPERTY DESCRIPTION REQUIRED type The type property must be set to: Web Yes Url URL to the Web source Yes authenticationType Anonymous. Yes Using Anonymous authentication { "name": "web", "properties": { "type": "Web", "typeProperties": { "authenticationType": "Anonymous", "url" : "https://en.wikipedia.org/wiki/" } } } Dataset properties For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type WebTable has the following properties PROPERTY DESCRIPTION REQUIRED type type of the dataset. must be set to WebTable Yes path A relative URL to the resource that contains the table. No. When path is not specified, only the URL specified in the linked service definition is used. index The index of the table in the resource. See Get index of a table in an HTML page section for steps to getting index of a table in an HTML page. 
Yes

Example:
{ "name": "WebTableInput", "properties": { "type": "WebTable", "linkedServiceName": "WebLinkedService", "typeProperties": { "index": 1, "path": "AFI's_100_Years...100_Movies" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 } } }

Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities. Properties available in the typeProperties section of the activity vary with each activity type; for the Copy activity, they vary depending on the types of sources and sinks. Currently, when the source in a copy activity is of type WebSource, no additional properties are supported.

JSON example: Copy data from Web table to Azure Blob
The following sample shows:
1. A linked service of type Web.
2. A linked service of type AzureStorage.
3. An input dataset of type WebTable.
4. An output dataset of type AzureBlob.
5. A pipeline with a Copy Activity that uses WebSource and BlobSink.

The sample copies data from a Web table to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. However, data can be copied directly to any of the sinks stated in the Data Movement Activities article by using the Copy Activity in Azure Data Factory.

Web linked service
This example uses the Web linked service with anonymous authentication. See the Web linked service section for the different types of authentication you can use.
{ "name": "WebLinkedService", "properties": { "type": "Web", "typeProperties": { "authenticationType": "Anonymous", "url" : "https://en.wikipedia.org/wiki/" } } } Azure Storage linked service { "name": "AzureStorageLinkedService", "properties": { "type": "AzureStorage", "typeProperties": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey= <accountkey>" } } } WebTable input dataset Setting external to true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. NOTE See Get index of a table in an HTML page section for steps to getting index of a table in an HTML page. { "name": "WebTableInput", "properties": { "type": "WebTable", "linkedServiceName": "WebLinkedService", "typeProperties": { "index": 1, "path": "AFI's_100_Years...100_Movies" }, "external": true, "availability": { "frequency": "Hour", "interval": 1 } } } Azure Blob output dataset Data is written to a new blob every hour (frequency: hour, interval: 1). { "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "adfgetstarted/Movies" }, "availability": { "frequency": "Hour", "interval": 1 } } } Pipeline with Copy activity The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to WebSource and sink type is set to BlobSink. See WebSource type properties for the list of properties supported by the WebSource. 
{
    "name": "SamplePipeline",
    "properties": {
        "start": "2014-06-01T18:00:00",
        "end": "2014-06-01T19:00:00",
        "description": "pipeline with copy activity",
        "activities": [
            {
                "name": "WebTableToAzureBlob",
                "description": "Copy from a Web table to an Azure blob",
                "type": "Copy",
                "inputs": [
                    { "name": "WebTableInput" }
                ],
                "outputs": [
                    { "name": "AzureBlobOutput" }
                ],
                "typeProperties": {
                    "source": { "type": "WebSource" },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}

Get index of a table in an HTML page
1. Launch Excel 2016 and switch to the Data tab.
2. Click New Query on the toolbar, point to From Other Sources, and click From Web.
3. In the From Web dialog box, enter the URL that you would use in the linked service JSON (for example: https://en.wikipedia.org/wiki/) along with the path you would specify for the dataset (for example: AFI%27s_100_Years...100_Movies), and click OK. URL used in this example: https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies
4. If you see the Access Web content dialog box, select the right URL and authentication, and click Connect.
5. Click a table item in the tree view to see content from the table, and then click the Edit button at the bottom.
6. In the Query Editor window, click the Advanced Editor button on the toolbar.
7. In the Advanced Editor dialog box, the number next to "Source" is the index.

If you are using Excel 2013, use Microsoft Power Query for Excel to get the index. See the Connect to a web page article for details. The steps are similar if you are using Microsoft Power BI for Desktop.

NOTE
To map columns from the source dataset to columns from the sink dataset, see Mapping dataset columns in Azure Data Factory.
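The Excel/Power Query steps above can also be approximated in code. The following Python sketch is illustrative only; it is not part of Data Factory, the TableIndexer class and sample page are made up for this example, and the zero-based position it reports should be verified against what Power Query shows before you use it as the WebTable index.

```python
from html.parser import HTMLParser

class TableIndexer(HTMLParser):
    """Records the order in which top-level <table> elements appear in a page."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # current nesting level of <table> tags
        self.count = 0      # number of top-level tables seen so far
        self.indexes = []   # (zero-based index, attribute dict) per top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                self.indexes.append((self.count, dict(attrs)))
                self.count += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

# A tiny stand-in for a saved copy of the target web page.
page = """
<html><body>
<table id="toc"><tr><td>Contents</td></tr></table>
<table id="movies"><tr><td>Citizen Kane</td></tr></table>
</body></html>
"""
parser = TableIndexer()
parser.feed(page)
for index, attrs in parser.indexes:
    print(index, attrs.get("id"))
```

In this toy page, the second table ("movies") sits at index 1, which matches the style of index used in the WebTable dataset example above.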
Performance and Tuning
See the Copy Activity Performance & Tuning Guide to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Data Management Gateway
5/8/2017 • 21 min to read • Edit Online

The Data Management Gateway is a client agent that you must install in your on-premises environment to copy data between cloud and on-premises data stores. The on-premises data stores supported by Data Factory are listed in the Supported data sources section.

NOTE
Currently, the gateway supports only the copy activity and stored procedure activity in Data Factory. It is not possible to use the gateway from a custom activity to access on-premises data sources.

This article complements the walkthrough in the Move data between on-premises and cloud data stores article. In the walkthrough, you create a pipeline that uses the gateway to move data from an on-premises SQL Server database to an Azure blob. This article provides detailed, in-depth information about the Data Management Gateway.

Overview
Capabilities of Data Management Gateway
Data Management Gateway provides the following capabilities:
- Model on-premises data sources and cloud data sources within the same data factory and move data.
- Have a single pane of glass for monitoring and management, with visibility into gateway status from the Data Factory blade.
- Manage access to on-premises data sources securely. No changes are required to your corporate firewall; the gateway makes only outbound HTTP-based connections to the open internet.
- Encrypt credentials for your on-premises data stores with your certificate.
- Move data efficiently: data is transferred in parallel and is resilient to intermittent network issues, with automatic retry logic.

Command flow and data flow
When you use a copy activity to copy data between on-premises and cloud, the activity uses a gateway to transfer data from the on-premises data source to the cloud and vice versa.
Here is the high-level data flow and a summary of the steps for a copy operation with the data gateway:

1. The data developer creates a gateway for an Azure Data Factory using either the Azure portal or a PowerShell cmdlet.
2. The data developer creates a linked service for an on-premises data store by specifying the gateway. As part of setting up the linked service, the data developer uses the Setting Credentials application to specify authentication types and credentials. The Setting Credentials application dialog communicates with the data store to test the connection and with the gateway to save credentials.
3. The gateway encrypts the credentials with the certificate associated with the gateway (supplied by the data developer) before saving the credentials in the cloud.
4. The Data Factory service communicates with the gateway for scheduling and management of jobs via a control channel that uses a shared Azure Service Bus queue. When a copy activity job needs to be kicked off, Data Factory queues the request along with credential information. The gateway kicks off the job after polling the queue.
5. The gateway decrypts the credentials with the same certificate and then connects to the on-premises data store with the proper authentication type and credentials.
6. The gateway copies data from an on-premises store to cloud storage, or vice versa, depending on how the Copy Activity is configured in the data pipeline. For this step, the gateway directly communicates with cloud-based storage services such as Azure Blob Storage over a secure (HTTPS) channel.

Considerations for using gateway
A single instance of Data Management Gateway can be used for multiple on-premises data sources. However, a single gateway instance is tied to only one Azure data factory and cannot be shared with another data factory. You can have only one instance of Data Management Gateway installed on a single machine. If you have two data factories that need to access on-premises data sources, you need to install a gateway on each of two on-premises computers.
In other words, a gateway is tied to a specific data factory.

The gateway does not need to be on the same machine as the data source. However, having the gateway closer to the data source reduces the time for the gateway to connect to the data source. We recommend that you install the gateway on a machine that is different from the one that hosts the on-premises data source. When the gateway and data source are on different machines, the gateway does not compete for resources with the data source.

You can have multiple gateways on different machines connecting to the same on-premises data source. For example, you may have two gateways serving two data factories, with the same on-premises data source registered with both data factories.

If you already have a gateway installed on your computer serving a Power BI scenario, install a separate gateway for Azure Data Factory on another machine.

The gateway must be used even when you use ExpressRoute. Treat your data source as an on-premises data source (that is, behind a firewall) even when you use ExpressRoute, and use the gateway to establish connectivity between the service and the data source. You must use the gateway even if the data store is in the cloud on an Azure IaaS VM.

Installation
Prerequisites
The supported operating system versions are Windows 7, Windows 8/8.1, Windows 10, Windows Server 2008 R2, Windows Server 2012, and Windows Server 2012 R2. Installation of the Data Management Gateway on a domain controller is currently not supported.
.NET Framework 4.5.1 or above is required. If you are installing the gateway on a Windows 7 machine, install .NET Framework 4.5 or later. See .NET Framework System Requirements for details.
The recommended configuration for the gateway machine is at least a 2-GHz, 4-core CPU, 8 GB of RAM, and an 80-GB disk.
If the host machine hibernates, the gateway does not respond to data requests. Therefore, configure an appropriate power plan on the computer before installing the gateway.
If the machine is configured to hibernate, the gateway installation displays a warning message.

You must be an administrator on the machine to install and configure the Data Management Gateway successfully. You can add additional users to the Data Management Gateway Users local Windows group. The members of this group are able to use the Data Management Gateway Configuration Manager tool to configure the gateway.

As copy activity runs happen on a specific frequency, the resource usage (CPU, memory) on the machine also follows the same pattern, with peak and idle times. Resource utilization also depends heavily on the amount of data being moved. When multiple copy jobs are in progress, you see resource usage go up during peak times.

Installation options
Data Management Gateway can be installed in the following ways:
- By downloading an MSI setup package from the Microsoft Download Center. The MSI can also be used to upgrade an existing Data Management Gateway to the latest version, with all settings preserved.
- By clicking the Download and install data gateway link under MANUAL SETUP, or Install directly on this computer under EXPRESS SETUP. See the Move data between on-premises and cloud article for step-by-step instructions on using express setup. The manual step takes you to the download center. The instructions for downloading and installing the gateway from the download center are in the next section.

Installation best practices:
1. Configure a power plan on the host machine for the gateway so that the machine does not hibernate. If the host machine hibernates, the gateway does not respond to data requests.
2. Back up the certificate associated with the gateway.

Install gateway from download center
1. Navigate to the Microsoft Data Management Gateway download page.
2. Click Download, select the appropriate version (32-bit vs. 64-bit), and click Next.
3. Run the MSI directly, or save it to your hard disk and run it.
4. On the Welcome page, select a language and click Next.
5. Accept the End-User License Agreement and click Next.
6. Select the folder to install the gateway to and click Next.
7. On the Ready to install page, click Install.
8. Click Finish to complete the installation.
9. Get the key from the Azure portal. See the next section for step-by-step instructions.
10. On the Register gateway page of the Data Management Gateway Configuration Manager running on your machine, do the following steps:
    a. Paste the key in the text box.
    b. Optionally, click Show gateway key to see the key text.
    c. Click Register.

Register gateway using key
If you haven't already created a logical gateway in the portal: to create a gateway in the portal and get the key from the Configure blade, follow the steps in the walkthrough in the Move data between on-premises and cloud article.

If you have already created the logical gateway in the portal:
1. In the Azure portal, navigate to the Data Factory blade, and click the Linked Services tile.
2. In the Linked Services blade, select the logical gateway you created in the portal.
3. In the Data Gateway blade, click Download and install data gateway.
4. In the Configure blade, click Recreate key. Click Yes on the warning message after reading it carefully.
5. Click the Copy button next to the key. The key is copied to the clipboard.

System tray icons/notifications
The following image shows some of the tray icons that you see. If you move the cursor over the system tray icon or notification message, you see details about the state of the gateway or update operation in a popup window.

Ports and firewall
There are two firewalls to consider: the corporate firewall running on the central router of the organization, and the Windows firewall configured as a daemon on the local machine where the gateway is installed.
At the corporate firewall level, you need to configure the following domains and outbound ports:

DOMAIN NAMES                   PORTS     DESCRIPTION
*.servicebus.windows.net       443, 80   Used for communication with the Data Movement Service back end
*.core.windows.net             443       Used for staged copy using Azure Blob storage (if configured)
*.frontend.clouddatahub.net    443       Used for communication with the Data Movement Service back end

At the Windows firewall level, these outbound ports are normally enabled. If not, you can configure the domains and ports accordingly on the gateway machine.

NOTE
1. Based on your sources and sinks, you may have to whitelist additional domains and outbound ports in your corporate/Windows firewall.
2. For some cloud databases (for example, Azure SQL Database, Azure Data Lake, and so on), you may need to whitelist the IP address of the gateway machine in their firewall configuration.

Copy data from a source data store to a sink data store
Ensure that the firewall rules are enabled properly on the corporate firewall, the Windows firewall on the gateway machine, and the data store itself. Enabling these rules allows the gateway to connect to both source and sink successfully. Enable rules for each data store that is involved in the copy operation.

For example, to copy from an on-premises data store to an Azure SQL Database sink or an Azure SQL Data Warehouse sink, do the following steps:
- Allow outbound TCP communication on port 1433 for both the Windows firewall and the corporate firewall.
- Configure the firewall settings of the Azure SQL server to add the IP address of the gateway machine to the list of allowed IP addresses.

NOTE
If your firewall does not allow outbound port 1433, the gateway cannot access Azure SQL directly. In this case, you may use staged copy to Azure SQL Database/Azure SQL Data Warehouse. In this scenario, only HTTPS (port 443) is required for the data movement.
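A quick way to check whether the outbound ports in the table above are open from the gateway machine is a plain TCP connectivity probe. The Python sketch below is a generic check, not an official Data Factory tool, and the angle-bracket host names are placeholders: substitute your actual Service Bus namespace, storage account, and region endpoints before the results are meaningful.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused, or timed out
        return False

# Endpoints from the firewall table above; replace the placeholders with the
# concrete host names for your subscription before running in earnest.
endpoints = [
    ("<namespace>.servicebus.windows.net", 443),
    ("<account>.core.windows.net", 443),
    ("<region>.frontend.clouddatahub.net", 443),
]
for host, port in endpoints:
    print(host, port, "reachable" if can_reach(host, port) else "BLOCKED")
```

With the placeholder names left in, every probe reports BLOCKED (the names do not resolve); a real deployment check should see "reachable" on port 443 for each substituted host.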
Proxy server considerations
If your corporate network environment uses a proxy server to access the internet, configure Data Management Gateway to use appropriate proxy settings. You can set the proxy during the initial registration phase. The gateway uses the proxy server to connect to the cloud service. Click the Change link during initial setup to see the proxy setting dialog. There are three configuration options:
- Do not use proxy: The gateway does not explicitly use any proxy to connect to cloud services.
- Use system proxy: The gateway uses the proxy setting that is configured in diahost.exe.config and diawp.exe.config. If no proxy is configured in these files, the gateway connects to the cloud service directly, without going through a proxy.
- Use custom proxy: Configure the HTTP proxy setting to use for the gateway, instead of using the configurations in diahost.exe.config and diawp.exe.config. Address and Port are required. User Name and Password are optional, depending on your proxy's authentication setting. All settings are encrypted with the credential certificate of the gateway and stored locally on the gateway host machine.

The Data Management Gateway Host Service restarts automatically after you save the updated proxy settings.

After the gateway has been successfully registered, if you want to view or update the proxy settings, use the Data Management Gateway Configuration Manager:
1. Launch Data Management Gateway Configuration Manager.
2. Switch to the Settings tab.
3. Click the Change link in the HTTP Proxy section to launch the Set HTTP Proxy dialog.
4. After you click the Next button, you see a warning dialog asking for your permission to save the proxy setting and restart the Gateway Host Service.

You can view and update the HTTP proxy by using the Configuration Manager tool.

NOTE
If you set up a proxy server with NTLM authentication, the Gateway Host Service runs under the domain account.
If you change the password for the domain account later, remember to update the configuration settings for the service and restart it accordingly. Due to this requirement, we suggest that you use a dedicated domain account for accessing the proxy server that does not require you to update the password frequently.

Configure proxy server settings
If you select the Use system proxy setting for the HTTP proxy, the gateway uses the proxy setting in diahost.exe.config and diawp.exe.config. If no proxy is specified in these files, the gateway connects to the cloud service directly, without going through a proxy. The following procedure provides instructions for updating the diahost.exe.config file:

1. In File Explorer, make a safe copy of C:\Program Files\Microsoft Data Management Gateway\2.0\Shared\diahost.exe.config to back up the original file.
2. Launch Notepad.exe running as administrator, and open the text file C:\Program Files\Microsoft Data Management Gateway\2.0\Shared\diahost.exe.config. You find the default tag for system.net, as shown in the following code:

<system.net>
    <defaultProxy useDefaultCredentials="true" />
</system.net>

You can then add proxy server details as shown in the following example:

<system.net>
    <defaultProxy enabled="true">
        <proxy bypassonlocal="true" proxyaddress="http://proxy.domain.org:8888/" />
    </defaultProxy>
</system.net>

Additional properties are allowed inside the proxy tag to specify required settings such as scriptLocation. Refer to the proxy Element (Network Settings) documentation for the syntax:

<proxy
    autoDetect="true|false|unspecified"
    bypassonlocal="true|false|unspecified"
    proxyaddress="uriString"
    scriptLocation="uriString"
    usesystemdefault="true|false|unspecified" />

3. Save the configuration file to its original location, then restart the Data Management Gateway Host service, which picks up the changes.
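The edit in step 2 can also be scripted with a standard XML library instead of being made by hand in Notepad. The following Python sketch is purely illustrative: it operates on an in-memory copy of the default system.net section rather than the real file (against which you would read, back up, and rewrite diahost.exe.config), and the proxy address is the placeholder from the example above.

```python
import xml.etree.ElementTree as ET

# The default <system.net> section shipped in diahost.exe.config (see step 2).
config_xml = """<configuration>
  <system.net>
    <defaultProxy useDefaultCredentials="true" />
  </system.net>
</configuration>"""

root = ET.fromstring(config_xml)
default_proxy = root.find("./system.net/defaultProxy")

# Switch from the default-credentials form to an explicit proxy element,
# mirroring the hand-edited example shown above.
default_proxy.attrib.clear()
default_proxy.set("enabled", "true")
proxy = ET.SubElement(default_proxy, "proxy")
proxy.set("bypassonlocal", "true")
proxy.set("proxyaddress", "http://proxy.domain.org:8888/")

updated = ET.tostring(root, encoding="unicode")
print(updated)
```

Scripting the change this way also guarantees well-formed XML, which avoids the malformed-tag failure mode described in the restart note that follows.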
To restart the service: use the Services applet from Control Panel, or in the Data Management Gateway Configuration Manager, click the Stop Service button and then click Start Service. If the service does not start, it is likely that incorrect XML tag syntax was added to the application configuration file that you edited.

IMPORTANT
Do not forget to update both diahost.exe.config and diawp.exe.config.

In addition to these points, you also need to make sure Microsoft Azure is in your company's whitelist. The list of valid Microsoft Azure IP addresses can be downloaded from the Microsoft Download Center.

Possible symptoms for firewall and proxy server-related issues
If you encounter errors similar to the following ones, it is likely due to improper configuration of the firewall or proxy server, which blocks the gateway from connecting to Data Factory to authenticate itself. Refer to the previous section to ensure your firewall and proxy server are properly configured.
1. When you try to register the gateway, you receive the following error: "Failed to register the gateway key. Before trying to register the gateway key again, confirm that the Data Management Gateway is in a connected state and the Data Management Gateway Host Service is Started."
2. When you open Configuration Manager, you see the status as "Disconnected" or "Connecting." When viewing Windows event logs, under Event Viewer > Application and Services Logs > Data Management Gateway, you see error messages such as: "Unable to connect to the remote server. A component of Data Management Gateway has become unresponsive and restarts automatically. Component name: Gateway."

Open port 8050 for credential encryption
The Setting Credentials application uses the inbound port 8050 to relay credentials to the gateway when you set up an on-premises linked service in the Azure portal. During gateway setup, by default, the Data Management Gateway installation opens it on the gateway machine.
If you are using a third-party firewall, you can manually open port 8050. If you run into a firewall issue during gateway setup, you can try using the following command to install the gateway without configuring the firewall:

msiexec /q /i DataManagementGateway.msi NOFIREWALL=1

If you choose not to open port 8050 on the gateway machine, use mechanisms other than the Setting Credentials application to configure data store credentials. For example, you could use the New-AzureRmDataFactoryEncryptValue PowerShell cmdlet. See the Setting Credentials and Security section for how data store credentials can be set.

Update
By default, Data Management Gateway is automatically updated when a newer version of the gateway is available. The gateway is not updated until all the scheduled tasks are done, and no further tasks are processed by the gateway until the update operation is completed. If the update fails, the gateway is rolled back to the old version.

You see the scheduled update time in the following places:
- The gateway properties blade in the Azure portal.
- The home page of the Data Management Gateway Configuration Manager.
- The system tray notification message.

The Home tab of the Data Management Gateway Configuration Manager displays the update schedule and the last time the gateway was installed/updated. You can install the update right away or wait for the gateway to be automatically updated at the scheduled time. For example, the following image shows the notification message in the Gateway Configuration Manager along with the Update button that you can click to install it immediately. The notification message in the system tray would look as shown in the following image. You see the status of the update operation (manual or automatic) in the system tray. When you launch Gateway Configuration Manager the next time, you see a message on the notification bar that the gateway has been updated, along with a link to the what's new topic.
To disable/enable the auto-update feature
You can disable/enable the auto-update feature by doing the following steps:
1. Launch Windows PowerShell on the gateway machine.
2. Switch to the C:\Program Files\Microsoft Data Management Gateway\2.0\PowerShellScript folder.
3. Run the following command to turn the auto-update feature OFF (disable):

.\GatewayAutoUpdateToggle.ps1 -off

4. To turn it back on:

.\GatewayAutoUpdateToggle.ps1 -on

Configuration Manager
Once you install the gateway, you can launch Data Management Gateway Configuration Manager in one of the following ways:
- In the Search window, type Data Management Gateway to access this utility.
- Run the executable ConfigManager.exe in the folder C:\Program Files\Microsoft Data Management Gateway\2.0\Shared.

Home page
The Home page allows you to do the following actions:
- View the status of the gateway (connected to the cloud service, and so on).
- Register using a key from the portal.
- Stop and start the Data Management Gateway Host service on the gateway machine.
- Schedule updates at a specific time of day.
- View the date when the gateway was last updated.

Settings page
The Settings page allows you to do the following actions:
- View, change, and export the certificate used by the gateway. This certificate is used to encrypt data source credentials.
- Change the HTTPS port for the endpoint. The gateway opens a port for setting the data source credentials.
- View the status of the endpoint.
- View the SSL certificate used for SSL communication between the portal and the gateway to set credentials for data sources.

Diagnostics page
The Diagnostics page allows you to do the following actions:
- Enable verbose logging, view logs in the event viewer, and send logs to Microsoft if there was a failure.
- Test the connection to a data source.

Help page
The Help page displays the following information:
- Brief description of the gateway
- Version number
- Links to online help, the privacy statement, and the license agreement
Troubleshooting gateway issues
See the Troubleshooting gateway issues article for information and tips on troubleshooting issues with using the Data Management Gateway.

Move gateway from one machine to another
This section provides steps for moving the gateway client from one machine to another:
1. In the portal, navigate to the Data Factory home page, and click the Linked Services tile.
2. Select your gateway in the DATA GATEWAYS section of the Linked Services blade.
3. In the Data gateway blade, click Download and install data gateway.
4. In the Configure blade, click Download and install data gateway, and follow the instructions to install the data gateway on the machine.
5. Keep the Microsoft Data Management Gateway Configuration Manager open.
6. In the Configure blade in the portal, click Recreate key on the command bar, and click Yes for the warning message. Click the Copy button next to the key text to copy the key to the clipboard. The gateway on the old machine stops functioning as soon as you recreate the key.
7. Paste the key into the text box on the Register Gateway page of the Data Management Gateway Configuration Manager on your machine. (Optional) Click the Show gateway key check box to see the key text.
8. Click Register to register the gateway with the cloud service.
9. On the Settings tab, click Change to select the same certificate that was used with the old gateway, enter the password, and click Finish. You can export a certificate from the old gateway by doing the following steps: launch Data Management Gateway Configuration Manager on the old machine, switch to the Certificate tab, click the Export button, and follow the instructions.
10. After successful registration of the gateway, you should see Registration set to Registered and Status set to Started on the Home page of the Gateway Configuration Manager.

Encrypting credentials
To encrypt credentials in the Data Factory Editor, do the following steps:
1.
Launch a web browser on the gateway machine and navigate to the Azure portal. Search for your data factory if needed, open the data factory in the DATA FACTORY blade, and then click Author & Deploy to launch the Data Factory Editor.
2. Click an existing linked service in the tree view to see its JSON definition, or create a linked service that requires a Data Management Gateway (for example: SQL Server or Oracle).
3. In the JSON editor, for the gatewayName property, enter the name of the gateway.
4. Enter the server name for the Data Source property in the connectionString.
5. Enter the database name for the Initial Catalog property in the connectionString.
6. Click the Encrypt button on the command bar, which launches the click-once Credential Manager application. You should see the Setting Credentials dialog box.
7. In the Setting Credentials dialog box, do the following steps:
   a. Select the authentication that you want the Data Factory service to use to connect to the database.
   b. Enter the name of the user who has access to the database for the USERNAME setting.
   c. Enter the password for the user for the PASSWORD setting.
   d. Click OK to encrypt the credentials and close the dialog box.
8. You should see an EncryptedCredential property in the connectionString now:

{
    "name": "SqlServerLinkedService",
    "properties": {
        "type": "OnPremisesSqlServer",
        "description": "",
        "typeProperties": {
            "connectionString": "data source=myserver;initial catalog=mydatabase;Integrated Security=False;EncryptedCredential=eyJDb25uZWN0aW9uU3R",
            "gatewayName": "adftutorialgateway"
        }
    }
}

If you access the portal from a machine that is different from the gateway machine, you must make sure that the Credentials Manager application can connect to the gateway machine. If the application cannot reach the gateway machine, it does not allow you to set credentials for the data source or to test the connection to the data source.
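The connectionString shown above is an ordinary semicolon-delimited key/value string, so splicing in an EncryptedCredential segment is straightforward to do programmatically. As a hedged illustration: the helper function below is hypothetical, the truncated eyJDb25uZWN0aW9uU3R value is just the placeholder from the JSON above, and a real encrypted value must come from the Setting Credentials application or the New-AzureRmDataFactoryEncryptValue cmdlet.

```python
def with_encrypted_credential(connection_string: str, encrypted: str) -> str:
    """Return the connection string with an EncryptedCredential segment,
    replacing any existing one (keys compared case-insensitively)."""
    parts = [p for p in connection_string.split(";") if p]
    parts = [p for p in parts if not p.lower().startswith("encryptedcredential=")]
    parts.append(f"EncryptedCredential={encrypted}")
    return ";".join(parts)

base = "Data Source=myserver;Initial Catalog=mydatabase;Integrated Security=True"
print(with_encrypted_credential(base, "eyJDb25uZWN0aW9uU3R"))
```

Replacing rather than appending a second EncryptedCredential keeps the string valid if you re-run the encryption step.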
When you use the Setting Credentials application, the portal encrypts the credentials with the certificate specified in the Certificate tab of the Gateway Configuration Manager on the gateway machine.

If you are looking for an API-based approach for encrypting the credentials, you can use the New-AzureRmDataFactoryEncryptValue PowerShell cmdlet. The cmdlet uses the certificate that the gateway is configured to use to encrypt the credentials. You add the encrypted credentials to the EncryptedCredential element of the connectionString in the JSON, and then use the JSON with the New-AzureRmDataFactoryLinkedService cmdlet or in the Data Factory Editor.

"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=True;EncryptedCredential=<encrypted credential>",

There is one more approach for setting credentials, using the Data Factory Editor. If you create a SQL Server linked service by using the editor and you enter credentials in plain text, the credentials are encrypted using a certificate that the Data Factory service owns. It does NOT use the certificate that the gateway is configured to use. While this approach might be a little faster in some cases, it is less secure. Therefore, we recommend that you follow this approach only for development/testing purposes.

PowerShell cmdlets
This section describes how to create and register a gateway using Azure PowerShell cmdlets.
1. Launch Azure PowerShell in administrator mode.
2. Log in to your Azure account by running the following command and entering your Azure credentials:

Login-AzureRmAccount

3.
Use the New-AzureRmDataFactoryGateway cmdlet to create a logical gateway as follows:

$MyDMG = New-AzureRmDataFactoryGateway -Name <gatewayName> -DataFactoryName <dataFactoryName> -ResourceGroupName ADF -Description <desc>

Example command and output:

PS C:\> $MyDMG = New-AzureRmDataFactoryGateway -Name MyGateway -DataFactoryName $df -ResourceGroupName ADF -Description "gateway for walkthrough"

Name              : MyGateway
Description       : gateway for walkthrough
Version           :
Status            : NeedRegistration
VersionStatus     : None
CreateTime        : 9/28/2014 10:58:22
RegisterTime      :
LastConnectTime   :
ExpiryTime        :
ProvisioningState : Succeeded
Key               : ADF#00000000-0000-4fb8-a867-947877aef6cb@fda06d87-f446-43b1-948578af26b8bab0@4707262b-dc25-4fe5-881c-c8a7c3c569fe@wu#nfU4aBlq/heRyYFZ2Xt/CD+7i73PEO521Sj2AFOCmiI

4. In Azure PowerShell, switch to the folder C:\Program Files\Microsoft Data Management Gateway\2.0\PowerShellScript\. Run RegisterGateway.ps1, passing the gateway key as shown in the following command. This script registers the client agent installed on your machine with the logical gateway you created earlier.

PS C:\> .\RegisterGateway.ps1 $MyDMG.Key
Agent registration is successful!

You can register the gateway on a remote machine by using the IsRegisterOnRemoteMachine parameter. Example:

.\RegisterGateway.ps1 $MyDMG.Key -IsRegisterOnRemoteMachine true

5. You can use the Get-AzureRmDataFactoryGateway cmdlet to get the list of gateways in your data factory. When the Status shows Online, your gateway is ready to use.

Get-AzureRmDataFactoryGateway -DataFactoryName <dataFactoryName> -ResourceGroupName ADF

You can remove a gateway using the Remove-AzureRmDataFactoryGateway cmdlet and update the description for a gateway using the Set-AzureRmDataFactoryGateway cmdlet. For syntax and other details about these cmdlets, see the Data Factory Cmdlet Reference.
List gateways using PowerShell

Get-AzureRmDataFactoryGateway -DataFactoryName jasoncopyusingstoredprocedure -ResourceGroupName ADF_ResourceGroup

Remove gateway using PowerShell

Remove-AzureRmDataFactoryGateway -Name JasonHDMG_byPSRemote -ResourceGroupName ADF_ResourceGroup -DataFactoryName jasoncopyusingstoredprocedure -Force

Next Steps
See the Move data between on-premises and cloud data stores article. In the walkthrough, you create a pipeline that uses the gateway to move data from an on-premises SQL Server database to an Azure blob.

Transform data in Azure Data Factory
5/16/2017 • 3 min to read • Edit Online

Overview
This article explains the data transformation activities in Azure Data Factory that you can use to transform and process your raw data into predictions and insights. A transformation activity executes in a computing environment such as an Azure HDInsight cluster or Azure Batch. The article provides links to articles with detailed information on each transformation activity. Data Factory supports the following data transformation activities, which can be added to pipelines either individually or chained with another activity.

NOTE
For a walkthrough with step-by-step instructions, see the Create a pipeline with Hive transformation article.

HDInsight Hive activity
The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or an on-demand Windows/Linux-based HDInsight cluster. See the Hive Activity article for details about this activity.

HDInsight Pig activity
The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or an on-demand Windows/Linux-based HDInsight cluster. See the Pig Activity article for details about this activity.

HDInsight MapReduce activity
The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or an on-demand Windows/Linux-based HDInsight cluster. See the MapReduce Activity article for details about this activity.
HDInsight Streaming activity

The HDInsight Streaming activity in a Data Factory pipeline executes Hadoop Streaming programs on your own or on-demand Windows/Linux-based HDInsight cluster. See the HDInsight Streaming activity article for details about this activity.

HDInsight Spark activity

The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster. For details, see Invoke Spark programs from Azure Data Factory.

Machine Learning activities

Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning web service for predictive analytics. Using the Batch Execution activity in an Azure Data Factory pipeline, you can invoke a Machine Learning web service to make predictions on data in batch. Over time, the predictive models in the Machine Learning scoring experiments need to be retrained by using new input datasets. After retraining, you want to update the scoring web service with the retrained Machine Learning model. You can use the Update Resource activity to update the web service with the newly trained model. See Use Machine Learning activities for details about these Machine Learning activities.

Stored procedure activity

You can use the SQL Server Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in one of the following data stores: Azure SQL Database, Azure SQL Data Warehouse, or a SQL Server database in your enterprise or on an Azure VM. See the Stored Procedure Activity article for details.

Data Lake Analytics U-SQL activity

The Data Lake Analytics U-SQL activity runs a U-SQL script on an Azure Data Lake Analytics cluster. See the Data Lake Analytics U-SQL Activity article for details.

.NET custom activity

If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity with your own data processing logic and use the activity in the pipeline.
You can configure the custom .NET activity to run using either an Azure Batch service or an Azure HDInsight cluster. See the Use custom activities article for details. You can also create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script using Azure Data Factory.

Compute environments

You create a linked service for the compute environment and then use the linked service when defining a transformation activity. Data Factory supports two types of compute environments:

1. On-demand: The computing environment is fully managed by Data Factory. It is automatically created by the Data Factory service before a job is submitted to process data, and it is removed when the job is completed. You can configure and control granular settings of the on-demand compute environment for job execution, cluster management, and bootstrapping actions.
2. Bring your own: You register your own computing environment (for example, an HDInsight cluster) as a linked service in Data Factory. The computing environment is managed by you, and the Data Factory service uses it to execute the activities.

See the Compute Linked Services article to learn about the compute services supported by Data Factory.

Summary

Azure Data Factory supports the following data transformation activities and the compute environments for the activities. The transformation activities can be added to pipelines either individually or chained with another activity.
| Data transformation activity | Compute environment |
| --- | --- |
| Hive | HDInsight [Hadoop] |
| Pig | HDInsight [Hadoop] |
| MapReduce | HDInsight [Hadoop] |
| Hadoop Streaming | HDInsight [Hadoop] |
| Machine Learning activities: Batch Execution and Update Resource | Azure VM |
| Stored Procedure | Azure SQL, Azure SQL Data Warehouse, or SQL Server |
| Data Lake Analytics U-SQL | Azure Data Lake Analytics |
| DotNet | HDInsight [Hadoop] or Azure Batch |

Transform data using Hive Activity in Azure Data Factory
5/22/2017 • 4 min to read

The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand Windows/Linux-based HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.

Syntax

```json
{
    "name": "Hive Activity",
    "description": "description",
    "type": "HDInsightHive",
    "inputs": [ { "name": "input tables" } ],
    "outputs": [ { "name": "output tables" } ],
    "linkedServiceName": "MyHDInsightLinkedService",
    "typeProperties": {
        "script": "Hive script",
        "scriptPath": "<pathtotheHivescriptfileinAzureblobstorage>",
        "defines": {
            "param1": "param1Value"
        }
    },
    "scheduler": {
        "frequency": "Day",
        "interval": 1
    }
}
```

Syntax details

| Property | Description | Required |
| --- | --- | --- |
| name | Name of the activity | Yes |
| description | Text describing what the activity is used for | No |
| type | HDInsightHive | Yes |
| inputs | Inputs consumed by the Hive activity | No |
| outputs | Outputs produced by the Hive activity | Yes |
| linkedServiceName | Reference to the HDInsight cluster registered as a linked service in Data Factory | Yes |
| script | Specify the Hive script inline | No |
| scriptPath | Store the Hive script in Azure Blob storage and provide the path to the file. Use either the 'script' or the 'scriptPath' property; both cannot be used together. The file name is case-sensitive. | No |
| defines | Specify parameters as key/value pairs for referencing within the Hive script using 'hiveconf' | No |

Example

Let’s consider an example of game log analytics where you want to identify the time spent by users playing games launched by your company. The following sample game log is comma (,) separated and contains the fields ProfileID, SessionStart, Duration, SrcIPAddress, and GameType.

```
1809,2014-05-04 12:04:25.3470000,14,221.117.223.75,CaptureFlag
1703,2014-05-04 06:05:06.0090000,16,12.49.178.247,KingHill
1703,2014-05-04 10:21:57.3290000,10,199.118.18.179,CaptureFlag
1809,2014-05-04 05:24:22.2100000,23,192.84.66.141,KingHill
.....
```

The Hive script to process this data:

```sql
DROP TABLE IF EXISTS HiveSampleIn;
CREATE EXTERNAL TABLE HiveSampleIn
(
    ProfileID     string,
    SessionStart  string,
    Duration      int,
    SrcIPAddress  string,
    GameType      string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10'
STORED AS TEXTFILE LOCATION 'wasb://adfwalkthrough@<storageaccount>.blob.core.windows.net/samplein/';

DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut
(
    ProfileID  string,
    Duration   int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10'
STORED AS TEXTFILE LOCATION 'wasb://adfwalkthrough@<storageaccount>.blob.core.windows.net/sampleout/';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT ProfileID, SUM(Duration)
FROM HiveSampleIn
GROUP BY ProfileID;
```

To execute this Hive script in a Data Factory pipeline, do the following:

1. Create a linked service to register your own HDInsight compute cluster or configure an on-demand HDInsight compute cluster. Let’s call this linked service "HDInsightLinkedService".
2. Create a linked service to configure the connection to the Azure Blob storage hosting the data. Let’s call this linked service "StorageLinkedService".
3. Create datasets pointing to the input and the output data.
Let’s call the input dataset "HiveSampleIn" and the output dataset "HiveSampleOut".

4. Copy the Hive query as a file to the Azure Blob storage configured in step #2. If the storage hosting the data is different from the one hosting this query file, create a separate Azure Storage linked service and refer to it in the activity. Use **scriptPath** to specify the path to the Hive query file and **scriptLinkedService** to specify the Azure storage that contains the script file.

   NOTE: You can also provide the Hive script inline in the activity definition by using the script property. We do not recommend this approach, because all special characters in the script within the JSON document need to be escaped, which may cause debugging issues. The best practice is to follow step #4.

5. Create a pipeline with the HDInsightHive activity. The activity processes/transforms the data.

   ```json
   {
       "name": "HiveActivitySamplePipeline",
       "properties": {
           "activities": [
               {
                   "name": "HiveActivitySample",
                   "type": "HDInsightHive",
                   "inputs": [ { "name": "HiveSampleIn" } ],
                   "outputs": [ { "name": "HiveSampleOut" } ],
                   "linkedServiceName": "HDInsightLinkedService",
                   "typeProperties": {
                       "scriptPath": "adfwalkthrough\\scripts\\samplehive.hql",
                       "scriptLinkedService": "StorageLinkedService"
                   },
                   "scheduler": {
                       "frequency": "Hour",
                       "interval": 1
                   }
               }
           ]
       }
   }
   ```

6. Deploy the pipeline. See the Creating pipelines article for details.
7. Monitor the pipeline by using the Data Factory monitoring and management views. See the Monitor and manage Data Factory pipelines article for details.

Specifying parameters for a Hive script

In this example, game logs are ingested daily into Azure Blob storage and stored in a folder partitioned with date and time. You want to parameterize the Hive script, pass the input folder location dynamically during runtime, and also produce the output partitioned with date and time. To use a parameterized Hive script, do the following:

Define the parameters in defines.
```json
{
    "name": "HiveActivitySamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "HiveActivitySample",
                "type": "HDInsightHive",
                "inputs": [ { "name": "HiveSampleIn" } ],
                "outputs": [ { "name": "HiveSampleOut" } ],
                "linkedServiceName": "HDInsightLinkedService",
                "typeProperties": {
                    "scriptPath": "adfwalkthrough\\scripts\\samplehive.hql",
                    "scriptLinkedService": "StorageLinkedService",
                    "defines": {
                        "Input": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/samplein/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)",
                        "Output": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/sampleout/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)"
                    }
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                }
            }
        ]
    }
}
```

In the Hive script, refer to the parameters using ${hiveconf:parameterName}:

```sql
DROP TABLE IF EXISTS HiveSampleIn;
CREATE EXTERNAL TABLE HiveSampleIn
(
    ProfileID     string,
    SessionStart  string,
    Duration      int,
    SrcIPAddress  string,
    GameType      string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10'
STORED AS TEXTFILE LOCATION '${hiveconf:Input}';

DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut
(
    ProfileID  string,
    Duration   int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10'
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT ProfileID, SUM(Duration)
FROM HiveSampleIn
GROUP BY ProfileID;
```

See Also: Pig Activity, MapReduce Activity, Hadoop Streaming Activity, Invoke Spark programs, Invoke R scripts

Transform data using Pig Activity in Azure Data Factory
5/22/2017 • 3 min to read

The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand Windows/Linux-based HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.
Syntax

```json
{
    "name": "HiveActivitySamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "Pig Activity",
                "description": "description",
                "type": "HDInsightPig",
                "inputs": [ { "name": "input tables" } ],
                "outputs": [ { "name": "output tables" } ],
                "linkedServiceName": "MyHDInsightLinkedService",
                "typeProperties": {
                    "script": "Pig script",
                    "scriptPath": "<pathtothePigscriptfileinAzureblobstorage>",
                    "defines": {
                        "param1": "param1Value"
                    }
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                }
            }
        ]
    }
}
```

Syntax details

| Property | Description | Required |
| --- | --- | --- |
| name | Name of the activity | Yes |
| description | Text describing what the activity is used for | No |
| type | HDInsightPig | Yes |
| inputs | One or more inputs consumed by the Pig activity | No |
| outputs | One or more outputs produced by the Pig activity | Yes |
| linkedServiceName | Reference to the HDInsight cluster registered as a linked service in Data Factory | Yes |
| script | Specify the Pig script inline | No |
| scriptPath | Store the Pig script in Azure Blob storage and provide the path to the file. Use either the 'script' or the 'scriptPath' property; both cannot be used together. The file name is case-sensitive. | No |
| defines | Specify parameters as key/value pairs for referencing within the Pig script | No |

Example

Let’s consider an example of game log analytics where you want to identify the time spent by players playing games launched by your company. The following sample game log is a comma (,) separated file. It contains the fields ProfileID, SessionStart, Duration, SrcIPAddress, and GameType.

```
1809,2014-05-04 12:04:25.3470000,14,221.117.223.75,CaptureFlag
1703,2014-05-04 06:05:06.0090000,16,12.49.178.247,KingHill
1703,2014-05-04 10:21:57.3290000,10,199.118.18.179,CaptureFlag
1809,2014-05-04 05:24:22.2100000,23,192.84.66.141,KingHill
.....
```

The Pig script to process this data:

```
PigSampleIn = LOAD 'wasb://adfwalkthrough@<storageaccount>.blob.core.windows.net/samplein/' USING PigStorage(',') AS
    (ProfileID:chararray, SessionStart:chararray, Duration:int, SrcIPAddress:chararray, GameType:chararray);

GroupProfile = Group PigSampleIn all;

PigSampleOut = Foreach GroupProfile Generate PigSampleIn.ProfileID, SUM(PigSampleIn.Duration);

Store PigSampleOut into 'wasb://adfwalkthrough@<storageaccount>.blob.core.windows.net/sampleoutpig/' USING PigStorage (',');
```

To execute this Pig script in a Data Factory pipeline, do the following:

1. Create a linked service to register your own HDInsight compute cluster or configure an on-demand HDInsight compute cluster. Let’s call this linked service HDInsightLinkedService.
2. Create a linked service to configure the connection to the Azure Blob storage hosting the data. Let’s call this linked service StorageLinkedService.
3. Create datasets pointing to the input and the output data. Let’s call the input dataset PigSampleIn and the output dataset PigSampleOut.
4. Copy the Pig query as a file to the Azure Blob storage configured in step #2. If the Azure storage that hosts the data is different from the one that hosts the query file, create a separate Azure Storage linked service and refer to it in the activity configuration. Use **scriptPath** to specify the path to the Pig script file and **scriptLinkedService** to specify the Azure storage that contains the script file.

   NOTE: You can also provide the Pig script inline in the activity definition by using the script property. However, we do not recommend this approach, because all special characters in the script need to be escaped, which may cause debugging issues. The best practice is to follow step #4.

5. Create the pipeline with the HDInsightPig activity. The activity processes the input data by running a Pig script on the HDInsight cluster.
```json
{
    "name": "PigActivitySamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "PigActivitySample",
                "type": "HDInsightPig",
                "inputs": [ { "name": "PigSampleIn" } ],
                "outputs": [ { "name": "PigSampleOut" } ],
                "linkedServiceName": "HDInsightLinkedService",
                "typeProperties": {
                    "scriptPath": "adfwalkthrough\\scripts\\enrichlogs.pig",
                    "scriptLinkedService": "StorageLinkedService"
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                }
            }
        ]
    }
}
```

6. Deploy the pipeline. See the Creating pipelines article for details.
7. Monitor the pipeline by using the Data Factory monitoring and management views. See the Monitor and manage Data Factory pipelines article for details.

Specifying parameters for a Pig script

Consider the following example: game logs are ingested daily into Azure Blob storage and stored in a folder partitioned based on date and time. You want to parameterize the Pig script, pass the input folder location dynamically during runtime, and also produce the output partitioned with date and time. To use a parameterized Pig script, do the following:

Define the parameters in defines.
```json
{
    "name": "PigActivitySamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "PigActivitySample",
                "type": "HDInsightPig",
                "inputs": [ { "name": "PigSampleIn" } ],
                "outputs": [ { "name": "PigSampleOut" } ],
                "linkedServiceName": "HDInsightLinkedService",
                "typeProperties": {
                    "scriptPath": "adfwalkthrough\\scripts\\samplepig.hql",
                    "scriptLinkedService": "StorageLinkedService",
                    "defines": {
                        "Input": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/samplein/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)",
                        "Output": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/sampleout/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)"
                    }
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                }
            }
        ]
    }
}
```

In the Pig script, refer to the parameters using '$parameterName' as shown in the following example:

```
PigSampleIn = LOAD '$Input' USING PigStorage(',') AS
    (ProfileID:chararray, SessionStart:chararray, Duration:int, SrcIPAddress:chararray, GameType:chararray);

GroupProfile = Group PigSampleIn all;

PigSampleOut = Foreach GroupProfile Generate PigSampleIn.ProfileID, SUM(PigSampleIn.Duration);

Store PigSampleOut into '$Output' USING PigStorage (',');
```

See Also: Hive Activity, MapReduce Activity, Hadoop Streaming Activity, Invoke Spark programs, Invoke R scripts

Invoke MapReduce Programs from Data Factory
3/13/2017 • 4 min to read

The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or on-demand Windows/Linux-based HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.

Introduction

A pipeline in an Azure data factory processes data in linked storage services by using linked compute services. It contains a sequence of activities, where each activity performs a specific processing operation.
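Before looking at the activity JSON, it helps to recall what a MapReduce program does: a map phase turns each input record into key/value pairs, and a reduce phase aggregates the values for each key. The following is a minimal local Python illustration of the model (not the Hadoop API) using the word-count job that appears later in this article:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(map_phase(lines)))  # {'the': 2, 'quick': 1, ...}
```

On a real cluster, the framework distributes the map and reduce phases across nodes and handles the shuffle between them; the activity JSON below only tells HDInsight which JAR and class implement these phases.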
This article describes using the HDInsight MapReduce activity. See Pig and Hive for details about running Pig/Hive scripts on a Windows/Linux-based HDInsight cluster from a pipeline by using the HDInsight Pig and Hive activities.

JSON for HDInsight MapReduce Activity

In the JSON definition for the HDInsight MapReduce activity:

1. Set the type of the activity to HDInsightMapReduce.
2. Specify the name of the class for the className property.
3. Specify the path to the JAR file, including the file name, for the jarFilePath property.
4. For the jarLinkedService property, specify the linked service that refers to the Azure Blob storage that contains the JAR file.
5. Specify any arguments for the MapReduce program in the arguments section. At runtime, you see a few extra arguments (for example, mapreduce.job.tags) from the MapReduce framework. To differentiate your arguments from the MapReduce arguments, consider using both option and value as arguments, as shown in the following example (-s, --input, --output, and so on are options immediately followed by their values).

You can use the HDInsight MapReduce activity to run any MapReduce jar file on an HDInsight cluster. In the following sample JSON definition of a pipeline, the HDInsight activity is configured to run a Mahout JAR file.

```json
{
    "name": "MahoutMapReduceSamplePipeline",
    "properties": {
        "description": "Sample Pipeline to Run a Mahout Custom Map Reduce Jar. This job calculates an Item Similarity Matrix to determine the similarity between 2 items",
        "activities": [
            {
                "type": "HDInsightMapReduce",
                "typeProperties": {
                    "className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
                    "jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
                    "jarLinkedService": "StorageLinkedService",
                    "arguments": [
                        "-s", "SIMILARITY_LOGLIKELIHOOD",
                        "--input", "wasb://adfsamples@<storageaccount>.blob.core.windows.net/Mahout/input",
                        "--output", "wasb://adfsamples@<storageaccount>.blob.core.windows.net/Mahout/output/",
                        "--maxSimilaritiesPerItem", "500",
                        "--tempDir", "wasb://adfsamples@<storageaccount>.blob.core.windows.net/Mahout/temp/mahout"
                    ]
                },
                "inputs": [ { "name": "MahoutInput" } ],
                "outputs": [ { "name": "MahoutOutput" } ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "MahoutActivity",
                "description": "Custom Map Reduce to generate Mahout result",
                "linkedServiceName": "HDInsightLinkedService"
            }
        ],
        "start": "2017-01-03T00:00:00Z",
        "end": "2017-01-04T00:00:00Z"
    }
}
```

Sample on GitHub

You can download a sample for using the HDInsight MapReduce activity from Data Factory Samples on GitHub.

Running the Word Count program

The pipeline in this example runs the Word Count MapReduce program on your Azure HDInsight cluster.

Linked Services

First, you create a linked service to link the Azure storage that is used by the Azure HDInsight cluster to the Azure data factory. If you copy/paste the following code, do not forget to replace the account name and account key with the name and key of your Azure storage.

Azure Storage linked service {