Microsoft expands Azure Data Lake to unleash big data productivity
By T. K. “Ranga” Rengarajan, corporate vice president, Data Platform
In July of this year, Satya Nadella shared our broad vision for big data and analytics when he announced Cortana Analytics. Building on this vision, today we’re announcing a new and expanded Azure Data Lake that makes big data processing and analytics simpler and more accessible. The expanded Microsoft Azure Data Lake includes the following:
- Azure Data Lake Store, previously announced as Azure Data Lake, will be available in preview later this year. The Data Lake Store provides a single repository where you can easily capture data of any size, type and speed without forcing changes to your application as data scales. In the store, data can be securely shared for collaboration and is accessible for processing and analytics from HDFS applications and tools.
- Azure Data Lake Analytics, a new service built on Apache YARN that dynamically scales so you can focus on your business goals, not on distributed infrastructure. This service will be available in preview later this year and includes U-SQL, a language that unifies the benefits of SQL with the expressive power of user code. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the store and across SQL Servers in Azure, Azure SQL Database and Azure SQL Data Warehouse.
- Azure HDInsight, our fully managed Apache Hadoop cluster service with a broad range of open source analytics engines including Hive, Spark, HBase and Storm. Today, we are announcing general availability of managed clusters on Linux with an industry-leading 99.9% uptime SLA. HDInsight will be able to take advantage of capabilities in the Store for increased throughput, scale and security.
Supporting the Azure Data Lake:
- Azure Data Lake Tools for Visual Studio provide an integrated development environment that spans the Azure Data Lake, dramatically simplifying authoring, debugging and optimization for processing and analytics at any scale.
- Leading Hadoop ISV applications that span security, governance, data preparation and analytics can be easily deployed from the Azure Marketplace on top of Azure Data Lake.
Azure Data Lake
Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape and speed, and do all types of processing and analytics across platforms and languages. It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics. Azure Data Lake works with existing IT investments for identity, management, and security for simplified data management and governance. It also integrates seamlessly with operational stores and data warehouses so you can extend current data applications. We’ve drawn on the experience of working with enterprise customers and running some of the largest scale processing and analytics in the world for Microsoft businesses like Office 365, Xbox Live, Azure, Windows, Bing and Skype. Azure Data Lake solves many of the productivity and scalability challenges that prevent you from maximizing the value of your data assets with a service that’s ready to meet your current and future business needs.
“Hortonworks and Microsoft have partnered closely over many years to further the Hadoop platform for big data analytics, including contributions to YARN, Hive, and other Apache projects,” said Rob Bearden, CEO at Hortonworks. “Azure Data Lake, including Azure HDInsight powered by Hortonworks Data Platform, demonstrates our shared commitment to make it easier for everyone to work with big data.”
Azure Data Lake Store – A hyper-scale repository for big data processing and analytic workloads
The value of a data lake resides in the ability to develop solutions across data of all types – unstructured, semi-structured and structured. This begins with the Azure Data Lake Store, a single repository to capture and access any type of data for high-performance processing and analytics and low-latency workloads with enterprise-grade security. For example, data can be ingested in real time from sensors and devices for IoT solutions, or from online shopping websites, into the store without the fixed limits on account or file size that current offerings in the market impose. As part of Azure Data Lake, the store supports development of your big data solutions with the language or framework of your choice. The store in Azure Data Lake is HDFS compatible, so Hadoop distributions like Cloudera, Hortonworks®, and MapR can readily access the data for processing and analytics.
“Cloudera is pleased to be working closely with Microsoft to integrate our enterprise data hub with the Azure Data Lake Store,” said Mike Olson, founder and chief strategy officer at Cloudera. “Cloudera on Azure benefits from the Data Lake Store which acts as a cloud-based landing zone for data in your enterprise data hub. Because the store is compatible with WebHDFS, Cloudera can leverage Data Lake and provide customers with a secure and flexible big data solution.”
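To make the WebHDFS point concrete, here is a minimal Python sketch of how a client tool addresses files through the standard Apache Hadoop WebHDFS REST convention (`/webhdfs/v1/<path>?op=<OPERATION>`). The host name, port, scheme and file paths are hypothetical placeholders; this announcement does not specify the Data Lake Store endpoint or its authentication flow, so the sketch only builds URLs and sends nothing.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=443, scheme="https", **params):
    """Build a WebHDFS REST URL: scheme://host:port/webhdfs/v1/<path>?op=<OP>.

    host, port and scheme are assumptions for illustration only.
    """
    if not path.startswith("/"):
        path = "/" + path
    # The operation name goes first, followed by any operation-specific parameters.
    query = urlencode({"op": op, **params})
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# List a (hypothetical) landing directory.
list_url = webhdfs_url("example-store.invalid", "/landing/sensors", "LISTSTATUS")

# Read the first 1024 bytes of a (hypothetical) file.
open_url = webhdfs_url(
    "example-store.invalid",
    "/landing/sensors/day 1.csv",
    "OPEN",
    offset=0,
    length=1024,
)

print(list_url)
print(open_url)
```

Because every WebHDFS-aware tool speaks this same URL convention, a Hadoop distribution can in principle point at a compatible store without changing application code.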
Azure Data Lake Analytics – a new distributed processing and analytics service
Azure Data Lake Analytics lets you focus on the logic of your application, not the distributed infrastructure running it. Instead of deploying, configuring and tuning hardware, you write queries to transform your data and extract valuable insight. Built on Apache YARN and designed for the cloud, the analytics service can handle jobs of any scale instantly: you simply set the dial for how much power you need. The analytics service for Azure Data Lake is cost-efficient because you only pay for your job when it is running. Its support for Azure Active Directory lets you manage access and roles simply and integrate with your on-premises identity system.
We know that many developers and data scientists struggle to be successful with big data using existing technologies and tools. Code-based solutions offer great power but require significant investments to master, while SQL-based tools make it easy to get started but are difficult to extend. We’ve faced the same problems inside Microsoft, and that’s why we introduced U-SQL, a new query language that unifies the ease of use of SQL with the expressive power of C#. The U-SQL language is built on the same distributed runtime that powers the big data systems inside Microsoft. Millions of SQL and .NET developers can now process and analyze all of their data with the skills they already have. The U-SQL support in Azure Data Lake Tools for Visual Studio includes state-of-the-art authoring, debugging and advanced performance analysis features for increased productivity when optimizing jobs running across thousands of nodes.
“U-SQL was especially helpful because we were able to get up and running using our existing skills with .NET and SQL,” says Sam Vanhoutte, Chief Technology Officer at Codit. “This made big data easy because we didn’t have to learn a whole new paradigm. With Azure Data Lake, we were able to process data coming in from smart meters and combine it with the energy spot market prices to give our customers the ability to optimize their energy consumption and potentially save hundreds of thousands of dollars.”
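The idea of pairing a declarative query with user code can be illustrated with a small analogy. The sketch below is not U-SQL and does not use any Azure service; it uses Python's built-in `sqlite3` module, which lets a SQL query call a user-defined function. The table names, the `price_band` function, its thresholds and the sample values are all invented for illustration, loosely modeled on the smart-meter scenario described above.

```python
import sqlite3


def price_band(price_per_mwh):
    """User code callable from SQL: label a spot price (thresholds are invented)."""
    if price_per_mwh >= 60.0:
        return "peak"
    if price_per_mwh >= 30.0:
        return "normal"
    return "cheap"


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("price_band", 1, price_band)

conn.executescript("""
    CREATE TABLE meter_readings (meter_id TEXT, hour INTEGER, kwh REAL);
    CREATE TABLE spot_prices (hour INTEGER, price_per_mwh REAL);
""")
# Made-up sample data: two meters, two hours.
conn.executemany(
    "INSERT INTO meter_readings VALUES (?, ?, ?)",
    [("m1", 0, 1.5), ("m1", 1, 2.0), ("m2", 0, 0.5), ("m2", 1, 4.0)],
)
conn.executemany(
    "INSERT INTO spot_prices VALUES (?, ?)",
    [(0, 20.0), (1, 80.0)],
)

# Declarative part: join readings to prices and aggregate consumption.
# Custom part: price_band() runs as ordinary user code inside the query.
rows = conn.execute("""
    SELECT price_band(p.price_per_mwh) AS band,
           SUM(r.kwh) AS total_kwh
    FROM meter_readings AS r
    JOIN spot_prices AS p ON p.hour = r.hour
    GROUP BY band
    ORDER BY band
""").fetchall()

for band, total_kwh in rows:
    print(band, total_kwh)
```

The join and aggregation stay in SQL, where they are easy to read and optimize, while the domain-specific rule lives in a regular function that can be tested on its own.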
Azure HDInsight – Fully Managed Hadoop, Spark, Storm and HBase
Azure Data Lake also includes HDInsight, our Apache Hadoop-based service that allows you to spin up any number of nodes in minutes. As one of the fastest growing services in Azure, HDInsight gives you the breadth of the Hadoop ecosystem in a managed service that’s monitored and supported by Microsoft. Furthering our commitment to productivity, we’ve updated our Visual Studio Tools for authoring, advanced debugging, and tuning of Hive queries and Storm topologies running in HDInsight.
Today, we are announcing the general availability of HDInsight on Linux. We work closely with Hortonworks and Canonical to provide the HDP™ distribution on the Ubuntu operating system that powers the Linux version of HDInsight in the Data Lake. This is another strategic step by Microsoft to meet customers where they are and make it easier for you to run Hadoop workloads in the cloud.
Leading Hadoop ISVs on the Azure Data Lake
There is a growing set of leading data management applications for Azure Data Lake. This includes applications that provide end-to-end big data analytics like Datameer, technologies that address big data security and governance like Dataguise and BlueTalon, unified stream and batch processing with DataTorrent, and tools that give business users the ability to visualize and analyze data in compelling ways like AtScale and Zoomdata. Support from our partners ensures that you have the best applications available as you get started with Azure Data Lake.
We will continue to invest in solutions for big data processing and analytics to make it easier for everyone to work with data of any type, size and speed using the tools, languages and frameworks they want in a trusted cloud, hybrid or on-premises environment. Our goal is to make big data technology simpler and more accessible to the greatest number of people possible. This includes developers, data scientists, analysts, application developers, and also businesspeople and mainstream IT managers.
You can hear more about these announcements during my keynote at our free, virtual event AzureCon tomorrow or on-demand, and at Strata + Hadoop World in NYC.