Apache Spark Connector for SQL Server and Azure SQL is now open source
Accelerating big data analytics with the Spark connector for SQL Server
We’re happy to announce that we have open-sourced the Apache Spark Connector for SQL Server and Azure SQL on GitHub. Born out of Microsoft’s SQL Server Big Data Clusters investments, the Apache Spark Connector for SQL Server and Azure SQL is a high-performance connector that enables you to use transactional data in big data analytics and persist results for ad-hoc queries or reporting. The connector allows you to use any SQL database, on-premises or in the cloud, as an input data source or output data sink for Spark jobs.
Why use the Apache Spark Connector for SQL Server and Azure SQL
The Apache Spark Connector for SQL Server and Azure SQL is based on the Spark DataSourceV1 API and SQL Server Bulk API and uses the same interface as the built-in JDBC Spark-SQL connector. This allows you to easily integrate the connector and migrate your existing Spark jobs by simply updating the format parameter!
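As a rough illustration of that migration, the sketch below assumes the connector is registered under the format name `com.microsoft.sqlserver.jdbc.spark` (as described in the project README) and that it accepts the same `url`, `dbtable`, `user`, and `password` options as the built-in JDBC source. The helper names (`connection_options`, `write_table`, `read_table`) are hypothetical and exist only for this example. The connector JAR must be available to the Spark cluster for the read and write calls to work.

```python
# Format names: Spark's built-in JDBC source versus the connector.
# The connector's format string is taken from the project README.
BUILTIN_JDBC_FORMAT = "jdbc"
SQL_SPARK_CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def connection_options(url, table, user, password):
    """Build the options shared by both data sources (hypothetical helper)."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def write_table(df, options, fmt=SQL_SPARK_CONNECTOR_FORMAT, mode="append"):
    """Write a Spark DataFrame; only `fmt` differs between the two sources."""
    df.write.format(fmt).mode(mode).options(**options).save()


def read_table(spark, options, fmt=SQL_SPARK_CONNECTOR_FORMAT):
    """Read a table into a Spark DataFrame through the chosen source."""
    return spark.read.format(fmt).options(**options).load()
```

With an existing job that calls `write_table(df, opts, fmt=BUILTIN_JDBC_FORMAT)`, switching to the connector would amount to dropping the `fmt` argument or passing `SQL_SPARK_CONNECTOR_FORMAT`. The server name, database, table, and credentials passed to `connection_options` are placeholders you would supply yourself.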
Notable features and benefits of the connector:
- Support for all Spark bindings (Scala, Python, R).
- Basic authentication and Active Directory (AD) keytab support.
- Reordered DataFrame write support.
- Reliable connector support for a SQL Server single instance.
Depending on your scenario, the Apache Spark Connector for SQL Server and Azure SQL is up to 15x faster than the built-in JDBC connector. The connector takes advantage of Spark’s distributed architecture to move data in parallel, efficiently using all cluster resources.
Visit the GitHub page for the connector to download the project and get started!
Get involved
The release of the Apache Spark Connector for SQL Server and Azure SQL makes the interaction between SQL Server and Spark even more seamless. We are continuously evolving and improving the connector, and we look forward to your feedback and contributions!
Want to contribute or have feedback or questions? Check out the project on GitHub and follow us on Twitter at @SQLServer.