Creating a data hub for your analytics with PolyBase
Data is the new currency of the digital world, and Microsoft is uniquely positioned to help businesses and consumers get the most out of their data assets. With the leading database in the world, SQL Server, and the rapidly growing Azure cloud data platform, Microsoft is delivering a modern data platform. We are witnessing a paradigm shift in data management where data in different silos are being brought together to create high-value data sets that are used to drive critical business divisions in industry verticals like retail, banking, healthcare, and more.
SQL Server 2016 introduced a new feature called PolyBase that enables your SQL Server instance to process Transact-SQL queries that read data from Hadoop. The same query can also join with relational tables in your SQL Server. SQL Server 2019 CTP 2.0 introduces new connectors for PolyBase including SQL Server, Oracle, Teradata, MongoDB, Azure SQL DB, Azure SQL DW, Cosmos DB, and virtually any ODBC-accessible data source.
The arrival of the new connectors enables customers to leverage PolyBase for creating a virtual data hub for a wide variety of data sources within the enterprise. It’s very common to find an enterprise scenario where data from an Oracle database needs to be joined with data from another SQL Server instance for serving a business purpose. Starting with SQL Server 2019, you can leverage PolyBase for creating a source-agnostic solution using external tables. This opens multiple possibilities like building a modern data warehousing solution spanning SQL Server, Oracle, and Teradata.
Let’s consider an example where we have a set of tables in multiple heterogeneous data sources such as Oracle which stores information that needs to be joined with tables in SQL Server. You can create an external table in SQL Server that retrieves data from the Oracle database, making the Oracle data available as a virtual part of the SQL Server database. The screenshot below shows how you can create an External Table using the Create External Table Wizard in Azure Data Studio.
The ability to reference an external data source like an Oracle database table in an SQL Server database table opens multiple possibilities:
- Unified Security Model – Users now have the ability to access external tables in other data sources such as Oracle and implement a common security model for applications to access the data using SQL Server database roles and permissions.
- Data Virtualization – This allows you to virtualize an external entity as a SQL Server table to build a data hub that abstracts the need for the application to know the actual schema, data source location, and protocols required to access different data sources. You can take this concept further by creating SQL Server views for applications to access the external tables which would further abstract and protect the application from any schema and design changes in the external tables in the future date due to a migration or upgrade.
- Performance – This gives you the ability to push computation to Hadoop or other data sources which will improve your query performance. For example, querying Hadoop through an external table creates MapReduce jobs and leverages Hadoop’s distributed computational resources.
- Data Marts – You can use external tables to create data marts within SQL Server by pulling data into In-Memory OLTP cache tables with Columnstore Indexes for serving high-frequency queries. The cache tables can be refreshed on-demand or on a schedule using a SQL Agent job. Using big data clusters for SQL Server 2019 public preview provides this capability.
- Intelligent Data – SQL Server Machine Learning Services can be used for executing R and Python code on SQL Server to train and operationalize machine learning models. The ability to reference an external table in SQL Server with Machine Learning Services allows you to train predictive models over a wider variety and volume of data through the use of data marts for querying the external tables.
Leveraging SQL Server’s PolyBase connectors allows you to build a modern data warehouse using SQL Server which spans beyond just the SQL Server database and opens a number of scenarios that weren’t previously possible. Use SQL Server 2019 Public Preview to try out our new PolyBase feature for Oracle, Teradata, SQL Server, and MongoDB.