Schema and data type mapping in Azure Data Factory while copying data from source to sink

Schema and data type mapping in copy activity

Default mapping

By default, copy activity maps source data to sink by column name in a case-sensitive manner. If the sink doesn't exist, for example when writing to file(s), the source field names are persisted as the sink names. If the sink already exists, it must contain all the columns being copied from the source. This default mapping supports flexible schemas and schema drift from source to sink across executions: all the data returned by the source data store can be copied to the sink.

If your source is a text file without a header line, explicit mapping is required because the source doesn't contain column names.
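
In that case, source columns can be referenced by ordinal position (starting at 1) instead of by name. A minimal sketch, assuming a two-column delimited file and hypothetical sink columns CustomerID and LastName:

{
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "ordinal": 1 }, "sink": { "name": "CustomerID" } },
        { "source": { "ordinal": 2 }, "sink": { "name": "LastName" } }
    ]
}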

Explicit mapping

You can also specify explicit mapping to customize the column/field mapping from source to sink based on your needs. With explicit mapping, you can copy only part of the source data to the sink, map source data to sink columns with different names, or reshape tabular/hierarchical data. Copy activity:

  1. Reads the data from the source and determines the source schema.
  2. Applies your defined mapping.
  3. Writes the data to sink.

You can configure the mapping on the Data Factory authoring UI -> copy activity -> Mapping tab, or programmatically specify the mapping in the copy activity -> translator property. In the translator -> mappings array, each object contains a source and a sink property that point to the specific column/field to map.
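
In skeleton form, the translator property that carries an explicit mapping looks like the following; the placeholder names are illustrative only, and the concrete samples in the sections below fill it in:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "<source column or field>" },
            "sink": { "name": "<sink column>" }
        }
    ]
}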

Data type mapping

Copy activity maps source types to sink types with the following flow:

  1. Convert from source native data types to Azure Data Factory interim data types.
  2. Automatically convert the interim data types as needed to match the corresponding sink types. This applies to both default mapping and explicit mapping.
  3. Convert from Azure Data Factory interim data types to sink native data types.

Copy activity currently supports the following interim data types: Boolean, Byte, Byte array, Datetime, DatetimeOffset, Decimal, Double, GUID, Int16, Int32, Int64, SByte, Single, String, Timespan, UInt16, UInt32, and UInt64.
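
When you need to force a specific conversion during explicit mapping, you can state the interim type on the source side of a mapping entry. A minimal, illustrative sketch with hypothetical column names:

{
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "LastModifiedDate", "type": "DateTime" },
            "sink": { "name": "ModifiedDate" }
        }
    ]
}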

Alternative column-mapping (legacy model)

You can specify copy activity -> translator -> columnMappings to map between tabular-shaped data. In this case, the "structure" section is required for both input and output datasets. Column mapping supports mapping all or a subset of the columns in the source dataset "structure" to all columns in the sink dataset "structure". The following error conditions result in an exception:

  • The source data store query result does not have a column name that is specified in the input dataset "structure" section.
  • The sink data store (if it has a pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
  • The "structure" of the sink dataset has fewer or more columns than are specified in the mapping.
  • A mapping is duplicated.
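
For reference, the legacy model expresses the whole mapping as a single columnMappings string of "sourceColumn: sinkColumn" pairs rather than a mappings array. An illustrative sketch with hypothetical column names:

"translator": {
    "type": "TabularTranslator",
    "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
}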

Tabular source to tabular sink

For example, to copy data from Salesforce to Azure SQL Database and explicitly map three columns:

  1. On the copy activity -> Mapping tab, click the Import schemas button to import both the source and sink schemas.
  2. Map the needed fields and exclude/delete the rest.

The same mapping can be configured as follows in the copy activity payload (see the translator property):

{
    "name": "CopyActivityTabularToTabular",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "SalesforceSource" },
        "sink": { "type": "SqlSink" },
        "translator": {
            "type": "TabularTranslator",
            "mappings": [
                {
                    "source": { "name": "Id" },
                    "sink": { "name": "CustomerID" }
                },
                {
                    "source": { "name": "Name" },
                    "sink": { "name": "LastName" }
                },
                {
                    "source": { "name": "LastModifiedDate" },
                    "sink": { "name": "ModifiedDate" }
                }
            ]
        }
    }
}
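
Explicit mapping can also reshape hierarchical data into a tabular sink, as noted earlier. In that case a source field is addressed by path rather than name, and a collectionReference can point to the repeating array whose items should be flattened into rows. The sketch below is illustrative only, with hypothetical field paths and sink columns:

"translator": {
    "type": "TabularTranslator",
    "collectionReference": "$['orders']",
    "mappings": [
        {
            "source": { "path": "$['customerName']" },
            "sink": { "name": "CustomerName" }
        },
        {
            "source": { "path": "['orderId']" },
            "sink": { "name": "OrderId" }
        }
    ]
}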

Parameterize mapping

If you want to create a templatized pipeline to copy a large number of objects dynamically, determine whether you can leverage the default mapping or whether you need to define explicit mapping for the respective objects.

If explicit mapping is needed, you can:

  1. Define a parameter of type object at the pipeline level, for example, mapping.
  2. Parameterize the mapping: on the copy activity -> Mapping tab, choose to add dynamic content and select the above parameter. The activity payload would then be as follows:

{
    "name": "CopyActivityHierarchicalToTabular",
    "type": "Copy",
    "typeProperties": {
        "source": {…},
        "sink": {…},
        "translator": {
            "value": "@pipeline().parameters.mapping",
            "type": "Expression"
        }
    }
}

Construct the value to pass into the mapping parameter. It should be the entire translator definition object; refer to the samples in the explicit mapping section. For example, for a tabular source to tabular sink copy, the value should be {"type":"TabularTranslator","mappings":[{"source":{"name":"Id"},"sink":{"name":"CustomerID"}},{"source":{"name":"Name"},"sink":{"name":"LastName"}},{"source":{"name":"LastModifiedDate"},"sink":{"name":"ModifiedDate"}}]}.
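
As a sketch of step 1 above, the pipeline-level parameter could be declared as follows, assuming the parameter is named mapping:

"parameters": {
    "mapping": {
        "type": "Object"
    }
}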

--

Anirban Das, Cloud, Data & AI Innovation Architect

Global Lead - Cloud, Data & AI Innovation. Leads AI innovation, focused on building and implementing breakthrough AI research and accelerating AI adoption across the organization.