At its Data + AI Summit, Databricks today made the requisite number of announcements one would expect from a company's flagship developer event. Among them: Delta Lake 2.0, the next version of its platform for building data lakehouses; MLflow 2.0, the next generation of its platform for managing the machine learning pipeline, which now includes MLflow Pipelines with templates for bootstrapping model development; and a couple of announcements around the Apache Spark data analytics engine, which forms part of the core of the Databricks platform.
Databricks today announced Spark Connect, a new client and server interface for Spark based on the DataFrame API. In Spark, a DataFrame is a distributed collection of data organized into named columns and made available through an API in languages like Scala, Java, Python and R. Spark Connect takes this concept but decouples the client from the server, which the company says will improve stability and make remote connectivity a built-in feature.
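For readers unfamiliar with the DataFrame API, here is a minimal PySpark sketch of the idea; the connection endpoint in the final comment is a placeholder, since the exact connection details for Spark Connect weren't part of today's announcement.

```python
# Minimal PySpark sketch of the DataFrame API the article describes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Classic, in-process session: the client and the Spark driver live together.
spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A DataFrame is a distributed collection of rows organized into named columns.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    schema=["name", "age"],
)

# Operations are expressed declaratively and optimized before execution.
df.filter(F.col("age") > 30).select(F.avg("age")).show()

# With Spark Connect, the same DataFrame code would target a remote server
# instead of an embedded driver (hypothetical endpoint shown):
# spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()
```

Because the client only describes the computation, decoupling it from the server doesn't change the code a developer writes, only where it runs, which is what makes remote connectivity a natural fit for this API.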
What’s maybe more exciting, though, is something Databricks calls Project Lightspeed, which the company describes as the next generation of the Spark streaming engine. Databricks argues that as more applications now require streaming data, the requirements placed on streaming engines have also changed.
“Spark Structured Streaming has been widely adopted since the early days of streaming because of its ease of use, performance, large ecosystem, and developer communities,” the company explains in today’s announcement. “With that in mind, Databricks will collaborate with the community and encourage participation in Project Lightspeed to improve performance, ecosystem support for connectors, enhance functionality for processing data with new operators and APIs, and simplify deployment, operations, monitoring and troubleshooting.”
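To give a sense of what Spark Structured Streaming looks like in practice, here is a small, self-contained PySpark sketch using Spark's built-in rate source; it illustrates the existing API that Project Lightspeed aims to build on, not anything new announced today.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The built-in "rate" source generates rows continuously and needs no setup;
# real pipelines would read from Kafka, files or another connector.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The same DataFrame operators apply to streaming data: here, a windowed
# count with a watermark to bound how late data may arrive.
counts = (
    stream
    .withWatermark("timestamp", "10 seconds")
    .groupBy(F.window("timestamp", "5 seconds"))
    .count()
)

# Results are written incrementally to the console as new data arrives.
query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination(timeout=30)
```

Project Lightspeed's stated goals, higher throughput, lower latency, more connectors and simpler operations, are about what happens underneath code like this rather than the API surface itself.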
A Databricks spokesperson told me that the project will be led by Karthik Ramasamy, the company’s head of streaming, with a focus on delivering higher throughput, lower latency and lower cost, as well as an expanded ecosystem of connectors and additional data processing functionality.