Advantages of Azure Data Lake Storage Gen2 (ADLS)
Azure Data Lake Storage (ADLS) Gen2 reached general availability on February 7, 2019, and has continued to evolve and mature since then. This post will help you understand its advantages and what you need to know to get started. If you would like to become more familiar with the concepts of a data lake,
Prior to the introduction of ADLS Gen2, when we wanted cloud storage in Azure for a data lake implementation, we needed to decide between Azure Data Lake Storage Gen1 (formerly known as Azure Data Lake Store) and Azure Storage (specifically blob storage). This involved weighing the business and technical requirements against the features available in each service. While ADLS Gen1 offers optimizations that are important for analytic workloads and more granular security (see section 3 for details), Azure Storage has built-in features like geo-redundancy, hot/cool/archive tiers, additional metadata, and broader regional availability, which are very compelling. In the past, we either accepted some trade-offs or stored the data twice in certain situations.
The new ADLS Gen2 service is built upon Azure Storage as its foundation. When the hierarchical namespace (HNS) property is enabled (see section 2 for details), an otherwise standard general-purpose v2 storage account becomes ADLS Gen2. For this reason, you will not see ADLS Gen2 listed in Azure as its own service. Since ADLS Gen1 is its own service, this shift has been confusing for many people. There are a couple of ways to verify if ADLS Gen2 is enabled for a storage account:
When viewing the Azure Storage account, if the file system service is displayed this indicates that ADLS Gen2 is supported:
ADLS Gen2 converges the worlds of object storage and hierarchical file storage
Fundamentally, ADLS Gen2 is seeking to take advantage of file system benefits without giving up the type of scalability and cost-effectiveness available with an object store:
The three new areas depicted above include:
(1) File System. There is a terminology difference with ADLS Gen2. The concept of a container (from blob storage) is referred to as a file system in ADLS Gen2.
(2) Hierarchical Namespace. The hierarchical namespace (HNS), coupled with the DFS endpoint, is what enables the performance and security improvements, which are discussed in Section 3.
(3) DFS Endpoint and File System Driver. ADLS Gen2 utilizes the ABFS driver, which is part of Apache Hadoop. For connectivity to ADLS Gen2, the ABFS driver utilizes the DFS endpoint to invoke performance and security optimizations.
- ABFS = Azure Blob File System
- DFS = Distributed File System
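To make the endpoint terminology concrete, here is a minimal sketch of how the DFS endpoint and an ABFS path are typically formed from a storage account name and a file system name. The helper names and the sample account "mylake" are hypothetical and used only for illustration.

```python
def dfs_endpoint(account: str) -> str:
    """Return the DFS endpoint URL for a storage account name."""
    return f"https://{account}.dfs.core.windows.net"


def abfss_uri(file_system: str, account: str, path: str = "") -> str:
    """Build an abfss:// URI (ABFS over TLS) for a path inside a file system."""
    clean_path = path.lstrip("/")  # avoid a double slash after the host
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{clean_path}"


if __name__ == "__main__":
    # "mylake" and "raw" are made-up names for this example.
    print(dfs_endpoint("mylake"))
    print(abfss_uri("raw", "mylake", "/sales/2019"))
```

Note that the file system (the container, in blob storage terms) appears before the @ sign, and the account name is part of the host.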
ADLS Gen2 has significant performance and security advantages for analytical workloads
Both the object store model (such as Azure blob storage) and the hierarchical file system model (ADLS Gen1 and Gen2) are compatible with HDFS (Hadoop Distributed File System). This is achieved with drivers that implement server-side HDFS semantics to translate into remote storage APIs, allowing ADLS Gen2 to behave very similarly to native HDFS. However, there are important distinctions between object storage and hierarchical file system storage in terms of performance and security.
With object storage, folders are virtual only. Although it appears like we can create folders in object storage, they are just mimicked within the URI string (or sometimes metadata is used as an alternative). Although that might initially seem trivial, it has the following implications:
(1) Query Performance. When sending a query that is only retrieving a subset of data, with a hierarchical file system like ADLS Gen2 it is possible to leverage partition scans for data pruning (predicate pushdown). This can improve query performance dramatically for compute engines that understand how to take advantage of partition scans.
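As a rough illustration of the pruning idea, the sketch below filters a list of file paths down to the partitions a query actually needs. It assumes a Hive-style key=value folder layout (for example year=2019/month=02), which is a common convention but not something ADLS Gen2 requires. The function names and sample paths are hypothetical, and a real compute engine performs this step internally.

```python
from typing import Dict, Iterable, List


def parse_partitions(path: str) -> Dict[str, str]:
    """Extract key=value folder segments from a path."""
    partitions = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


def prune(paths: Iterable[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only the paths whose partition values match every filter."""
    kept = []
    for path in paths:
        partitions = parse_partitions(path)
        if all(partitions.get(key) == value for key, value in wanted.items()):
            kept.append(path)
    return kept


if __name__ == "__main__":
    files = [
        "sales/year=2018/month=12/part-0.parquet",
        "sales/year=2019/month=01/part-0.parquet",
        "sales/year=2019/month=02/part-0.parquet",
    ]
    # Only the files under year=2019 need to be read for this filter.
    print(prune(files, {"year": "2019"}))
```

The benefit comes from skipping whole directories: the files under year=2018 are never opened when the query filters on year=2019.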
Azure Data Lake Storage Gen2 (ADLS Gen2), the latest iteration of Azure Data Lake Storage, is designed for highly scalable big data analytics solutions. It combines the management and scalability features of Azure Blob Storage with those of Azure Data Lake Storage Gen1, including a hierarchical file system with granular security and lower-cost tiered storage, and it also offers processing capabilities, high availability and disaster recovery.
In this blog, I’ll cover all the latest and greatest features that ADLS Gen2 has to offer.
Multi-Protocol Access Capability
Recently, ADLS Gen2 introduced a new multi-protocol access capability to support solutions that use both object storage and analytics storage (note: at the time of writing, it is in public preview in the West US 2 and West Central US regions).
Multi-protocol access allows you to connect applications to your ADLS Gen2 storage account via the object store Blob API using the WASB driver, or via the ADLS Gen2 API using the new ABFS driver. With the hierarchical namespace enabled, both APIs can access data in ADLS Gen2 the same way. Using the Blob API, data access is routed through the hierarchical namespace to leverage the same directory operations and access control lists (ACLs) as the ADLS Gen2 API. This is great for existing solutions using the Blob API, as no code changes are required to take advantage of the new access control features on files and directories introduced by the hierarchical namespace. Even better? Multi-protocol access on ADLS Gen2 is interoperable with many Azure services like Azure Stream Analytics, IoT Hub, Power BI, Azure Data Factory and others.
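To illustrate that both drivers address the same data, here is a small sketch that rewrites a WASB-style URI (Blob endpoint) into the equivalent ABFS-style URI (DFS endpoint). The function name and the sample account are hypothetical, and the sketch assumes the standard public-cloud host suffixes.

```python
from urllib.parse import urlsplit

BLOB_SUFFIX = ".blob.core.windows.net"
DFS_SUFFIX = ".dfs.core.windows.net"


def wasbs_to_abfss(uri: str) -> str:
    """Rewrite a wasb(s):// URI to the matching abfs(s):// URI."""
    parts = urlsplit(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {uri}")
    # The authority has the form <container>@<account>.blob.core.windows.net
    container, _, host = parts.netloc.partition("@")
    if not container or not host.endswith(BLOB_SUFFIX):
        raise ValueError(f"unexpected authority: {parts.netloc}")
    account = host[: -len(BLOB_SUFFIX)]
    scheme = "abfss" if parts.scheme == "wasbs" else "abfs"
    return f"{scheme}://{container}@{account}{DFS_SUFFIX}{parts.path}"


if __name__ == "__main__":
    # "raw" and "mylake" are made-up names for this example.
    print(wasbs_to_abfss("wasbs://raw@mylake.blob.core.windows.net/sales/2019/data.csv"))
```

Only the scheme and the endpoint host change; the container (file system) and the path to the file stay the same.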
Hierarchical Namespace
With a true hierarchical namespace on top of Blob storage, ADLS Gen2 supports atomic directory manipulation. Historically, object stores like Blob storage only imitated a directory hierarchy by adopting a naming convention in which Blob object names contain slashes (/). This was inefficient because applications had to iterate through potentially millions of individual Blob objects to perform directory-level tasks. For example, deleting a directory containing several million objects in Blob storage required one delete operation per object. In contrast, with ADLS Gen2, deleting a directory is a single operation regardless of the number of files in the directory.
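The difference in operation counts can be sketched with two toy in-memory stores. This is only a conceptual model, not how either service is implemented: the class names are hypothetical, the flat store keeps object names containing slashes, and the hierarchical store keeps a nested dictionary as its directory tree.

```python
class FlatStore:
    """Object-store model: 'directories' exist only as name prefixes."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.operations = 0

    def delete_directory(self, prefix):
        prefix = prefix.rstrip("/") + "/"
        # Every object under the prefix needs its own delete call.
        for key in [k for k in self.keys if k.startswith(prefix)]:
            self.keys.remove(key)
            self.operations += 1


class HierarchicalStore:
    """File-system model: directories are real nodes in a tree."""

    def __init__(self, keys):
        self.root = {}
        self.operations = 0
        for key in keys:
            *dirs, name = key.split("/")
            node = self.root
            for d in dirs:
                node = node.setdefault(d, {})
            node[name] = None  # None marks a file

    def delete_directory(self, path):
        *parents, name = path.strip("/").split("/")
        node = self.root
        for d in parents:
            node = node[d]
        del node[name]  # unlinking the directory node removes the subtree
        self.operations += 1


if __name__ == "__main__":
    keys = [f"logs/2019/file{i}.csv" for i in range(1000)] + ["logs/2020/keep.csv"]
    flat, tree = FlatStore(keys), HierarchicalStore(keys)
    flat.delete_directory("logs/2019")
    tree.delete_directory("logs/2019")
    print(flat.operations, tree.operations)
```

The toy model does not even count the listing calls the flat store needs in order to find the objects, so the real gap is larger than the delete count suggests.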
Furthermore, the hierarchical namespace in ADLS Gen2 does not limit its scalability. ADLS Gen2 is designed to scale linearly in both data capacity (exabytes) and performance (Gbps of throughput).
Security
The hierarchical namespace allows you to define POSIX-style access control list (ACL) permissions on directories, subdirectories or individual files. You can also use role-based access control (RBAC) with Azure Active Directory (Azure AD) to govern resource management and data operations.
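To give a feel for how POSIX-style ACLs work, here is a simplified sketch of a permission check over an ACL written in the short text form (for example user::rwx,group::r-x,other::---). The evaluation order follows classic POSIX ACL rules as an assumption; the actual service applies additional rules (such as super-user handling and default ACLs) that are not modeled here. All function names and identities are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_acl(acl: str) -> Dict[Tuple[str, str], str]:
    """Parse 'kind:identity:perms' entries into a lookup table."""
    entries = {}
    for item in acl.split(","):
        kind, identity, perms = item.strip().split(":")
        entries[(kind, identity)] = perms
    return entries


def _grants(perms: str, needed: str, mask: Optional[str] = None) -> bool:
    """True if every needed letter is in perms (and in the mask, if given)."""
    return all(ch in perms and (mask is None or ch in mask) for ch in needed)


def is_allowed(acl: str, user: str, groups: Iterable[str],
               owner: str, owning_group: str, needed: str) -> bool:
    entries = parse_acl(acl)
    mask = entries.get(("mask", ""))
    # 1. The owning user is checked against user:: and is not masked.
    if user == owner:
        return _grants(entries.get(("user", ""), "---"), needed)
    # 2. A named user entry applies next, limited by the mask.
    if ("user", user) in entries:
        return _grants(entries[("user", user)], needed, mask)
    # 3. The owning group and named groups apply next, limited by the mask.
    group_perms = []
    member_of = set(groups)
    if owning_group in member_of and ("group", "") in entries:
        group_perms.append(entries[("group", "")])
    group_perms += [entries[("group", g)] for g in member_of if ("group", g) in entries]
    if group_perms:
        return any(_grants(p, needed, mask) for p in group_perms)
    # 4. Everyone else falls through to other::.
    return _grants(entries.get(("other", ""), "---"), needed)


if __name__ == "__main__":
    acl = "user::rwx,user:alice:rw-,group::r-x,group:analysts:r--,mask::r-x,other::---"
    print(is_allowed(acl, "alice", [], "bob", "staff", "r"))
    print(is_allowed(acl, "alice", [], "bob", "staff", "w"))
```

In the usage example, alice's entry lists rw-, but the mask r-x removes write, so her read check passes and her write check fails.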
Additionally, ADLS Gen2 supports both encryption in transit and encryption at rest. Encryption at rest is automatically enabled for all storage accounts via Storage Service Encryption (SSE), using either Microsoft-managed encryption keys or your own encryption keys. Encryption in transit is provided by transport-level encryption over HTTPS and can be enforced by enabling the Secure transfer required option for the storage account under Settings > Configuration. Client-side encryption is also supported with the Azure Storage Client Library for .NET.
In addition to access and encryption, ADLS Gen2 supports firewall and virtual network configurations. Network rules can be defined to restrict access to the storage account from a specific set of networks. For more information on firewalls and virtual networks for Azure Storage, check out Microsoft’s guide to Configure Azure Storage firewalls and virtual networks.
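Conceptually, a network rule set is an allow list with a default action. The sketch below models that idea with Python's standard ipaddress module. It is not how Azure evaluates rules internally, and the function name and the sample address ranges (taken from the documentation-only IP blocks) are hypothetical.

```python
import ipaddress
from typing import Iterable


def is_request_allowed(client_ip: str, allowed_ranges: Iterable[str],
                       default_action: str = "Deny") -> bool:
    """Allow the request if the client IP falls in any allowed CIDR range.

    When no range matches, the default action ("Allow" or "Deny") decides.
    """
    address = ipaddress.ip_address(client_ip)
    for cidr in allowed_ranges:
        if address in ipaddress.ip_network(cidr):
            return True
    return default_action == "Allow"


if __name__ == "__main__":
    rules = ["203.0.113.0/24"]  # example range, not a real deployment
    print(is_request_allowed("203.0.113.25", rules))
    print(is_request_allowed("198.51.100.7", rules))
```

Setting the default action to Deny is what turns the list of ranges into a restriction; with a default of Allow, the ranges have no effect on who can connect.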
Performance and Access Tiers
ADLS Gen2 is currently supported in Azure Storage accounts with the standard performance tier (magnetic disks). The premium performance tier is not currently supported for ADLS Gen2 accounts.
Both hot and cool access tiers are available for ADLS Gen2 storage accounts. The hot access tier is optimized for storing data that is accessed frequently, while the cool access tier is optimized for storing data that is infrequently accessed and stored for at least 30 days.
High-Availability and Disaster Recovery
Data in ADLS Gen2 storage accounts is always replicated to ensure durability and high availability. The replication option is selected when the storage account is created and can later be changed to a more durable and resilient option. You can select one of the following redundancy options:
- Locally-redundant storage (LRS)
- Zone-redundant storage (ZRS)
- Geo-redundant storage (GRS)
- Read-access geo-redundant storage (RA-GRS)
For more details on redundancy options for Azure Storage accounts, please read Microsoft’s Azure Storage redundancy guide.
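As a quick reference, the sketch below captures the four options in a small lookup table and filters them by requirement. The SKU names follow the Standard_* naming used for storage accounts, and the copy counts reflect three replicas per region. The data class and function names are hypothetical, so please confirm the details against Microsoft's redundancy guide before relying on them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Redundancy:
    sku: str                 # SKU name used when creating the account
    copies: int              # total replicas kept
    multi_region: bool       # replicated to a secondary region
    secondary_readable: bool # secondary region can serve reads


OPTIONS = {
    "LRS": Redundancy("Standard_LRS", 3, False, False),
    "ZRS": Redundancy("Standard_ZRS", 3, False, False),
    "GRS": Redundancy("Standard_GRS", 6, True, False),
    "RA-GRS": Redundancy("Standard_RAGRS", 6, True, True),
}


def survives_regional_outage(need_reads_from_secondary: bool = False) -> List[str]:
    """List the options that keep a copy of the data in a second region."""
    return [
        name for name, option in OPTIONS.items()
        if option.multi_region
        and (option.secondary_readable or not need_reads_from_secondary)
    ]


if __name__ == "__main__":
    print(survives_regional_outage())
    print(survives_regional_outage(need_reads_from_secondary=True))
```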
Roadmap
ADLS Gen2 continues to evolve rapidly as new access and interoperability features are introduced. For upcoming product updates and announcements, check out Microsoft’s Azure announcements.