Battle of Data Platforms: Databricks vs Snowflake.

Anurag Singh
5 min read · Feb 27, 2023

Before getting into the details of both platforms, let's look at a very basic Big Data architecture. The left-hand side of the spectrum deals with implementing a solid data ingestion pattern, and as we move right it is more about usability by the business, mostly as analytical reports or, more intelligently, as machine-learning-aided solutions.

So a platform can be called an integrated analytics platform only if it is able to cover the Big Data realm from the left-hand side to the right-hand side of the above diagram. It’s a win for organizations onboarding an integrated analytics platform, as their total cost of ownership decreases over time because they now have to manage and tame just one data platform tool.

Databricks started from the left-hand side of the spectrum as a robust ETL/ELT tool and positioned itself as a good contender for analytical workloads with the release of SQL Analytics.

Snowflake, on the other hand, started from the right-hand side of the spectrum as a cloud data warehouse with unlimited scale, slowly entered the ETL/ELT space with Snowpipe, and began catering to machine learning needs with the release of Snowpark in 2022.

Below are a few of my learnings from working on both platforms. Read on for a deep dive:

Cost- This factor is very important in deciding which data platform to invest in.

Databricks has transparent costing based on the reserved VM and the processing time, and it can be estimated with the Azure Pricing Calculator.

Snowflake charges a flat rate for storage ($23 per TB) plus a processing cost based on usage and the size of the Snowflake cluster.
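To make the two pricing models a little more concrete, here is a rough back-of-the-envelope sketch in Python. Only the $23 per TB storage figure comes from the comparison above; the function names, the credit price, the credits per hour, the VM rate and the hours are all hypothetical placeholders, so plug in the numbers from your own contract or the vendor's pricing calculator.

```python
# Flat storage rate quoted in the text; every other rate used below is a
# hypothetical placeholder, not a published price.
STORAGE_USD_PER_TB = 23.0


def snowflake_style_cost(storage_tb, credits_per_hour, hours, usd_per_credit):
    """Flat storage charge plus usage-based compute (cluster size x time)."""
    storage = storage_tb * STORAGE_USD_PER_TB
    compute = credits_per_hour * hours * usd_per_credit
    return storage + compute


def vm_time_cost(vm_usd_per_hour, node_count, hours):
    """Compute billed by VM size, number of nodes and processing time."""
    return vm_usd_per_hour * node_count * hours


if __name__ == "__main__":
    # Example inputs are made up purely for illustration.
    print(snowflake_style_cost(storage_tb=10, credits_per_hour=2, hours=100, usd_per_credit=3.0))
    print(vm_time_cost(vm_usd_per_hour=0.5, node_count=4, hours=100))
```

The point of the sketch is only the shape of each formula: one model has a fixed storage term plus a usage term, the other is driven by VM size and time.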

Support for Open Formats- This feature prevents vendor lock-in.

Databricks unifies your data ecosystem with open standards and formats. Data is stored in Delta Lake, an open format storage layer that delivers reliability, security, and performance on your data lake for both streaming and batch operations. Data is stored in your cloud storage account and can be read by any compatible reader, giving you the flexibility to leverage other compute platforms without moving the data.
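As a rough illustration of what "can be read by any compatible reader" means: a Delta table is a directory of Parquet data files plus a `_delta_log` folder of JSON commit files that record which files were added or removed. The sketch below uses only the Python standard library to replay those commits and list the data files that are currently part of the table. The function name `active_data_files` is my own, and the code deliberately ignores checkpoints and newer table features, so treat it as a way to understand the format and not as a replacement for a real Delta reader.

```python
import json
from pathlib import Path


def active_data_files(table_path):
    """Replay the JSON commits in _delta_log and return the live data files.

    Simplification: Parquet checkpoints and newer table features are ignored.
    """
    log_dir = Path(table_path) / "_delta_log"
    live = set()
    # Commit files are named with zero-padded version numbers,
    # so sorting by name gives commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one action per line
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)
```

Any engine that can follow this log and read Parquet can work with the same files in place, which is the practical meaning of an open format here.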

With Snowflake, data is stored in a proprietary format in Snowflake’s own account and is accessible only through Snowflake’s virtual warehouses. This results in vendor lock-in. Because Snowflake runs in its own tenant space, it can also be more complex to integrate into the surrounding Cloud Service Provider (CSP) ecosystem. Bifurcated security and metadata policies, and the need to copy data between tenant spaces, add to the deployment complexity.

End to End Integrated Analytics Platform- This feature prevents investment in multiple Big Data tools.

Databricks is a complete integrated analytics platform in itself, with no need for disparate tools to complement its capabilities.

Snowflake primarily supports Data Analysts and SQL Developers, while relying on partners to support DS/ML teams with separate tools. To capture the full value of data, analytics and AI with Snowflake, organizations must integrate multiple, disparate tools, resulting in increased complexity, and higher costs.

Languages Supported- This feature helps bring different personas and roles onto the platform.

Databricks allows developers to work in their programming language of choice (Python, SQL, Scala, R) and to natively leverage the most popular open-source ML libraries and leading frameworks, including TensorFlow and PyTorch.

Within Snowflake, Java, Scala and Python are supported through Snowpark, including user-defined functions. There is no native support for R.

Support for Structured/Unstructured and Semi Structured Data

Databricks Lakehouse Platform can natively ingest, process, store and power downstream use cases on all data types — including unstructured data such as images, video, audio, and text. Databricks enables optimized access to unstructured data that scales from a single node to massively parallel processing.

Snowflake was designed for structured and semi-structured data and has only recently announced limited support for unstructured data types. The announced capabilities focus on document management and governance. To analyze unstructured files, Snowflake requires developers to either write, compile and upload custom Java JAR files or make inefficient calls to external functions such as AWS Lambda.

Data Sharing to augment Data platform as a product

Databricks Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to share data with other organizations regardless of where the data lives or what platform the data provider and recipient are using. Users can directly connect to the shared data through pandas, Tableau, or dozens of other systems that implement the open protocol.
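For the pandas route mentioned above, a minimal sketch could look like the following. It assumes the open-source `delta-sharing` Python package and a profile file handed over by the data provider; the helper names and the share, schema and table names in the usage note are hypothetical.

```python
def shared_table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' address the client expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path, share, schema, table):
    """Read a shared table into a pandas DataFrame."""
    # Imported lazily so this module loads even where the package is missing.
    import delta_sharing  # pip install delta-sharing

    return delta_sharing.load_as_pandas(
        shared_table_url(profile_path, share, schema, table)
    )
```

A recipient would then call something like `load_shared_table("config.share", "sales", "public", "orders")` without needing an account on the provider's platform.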

Snowflake uses proprietary data sharing, which only works between Snowflake accounts and locks the community of customers, suppliers, and partners into a single vendor platform. For recipients who are not Snowflake customers, providers have to manage and pay for “Reader Accounts”, including the compute charges to query shared data, which places an unnecessary burden on the providers. If customers want to share data with consumers across different clouds or cloud regions, Snowflake’s approach is to create multiple replicas of a database. This approach leads to access control, data lineage and storage cost issues. Also, Snowflake replication for data sharing does not work if the primary database has external tables or if a primary database was created from a share.

Native ETL Capabilities

Databricks provides an end-to-end data engineering solution with declarative pipeline development, automatic data testing, and deep visibility for monitoring and recovery. Auto Loader simplifies data ingestion by incrementally and efficiently processing new data files. Delta Live Tables abstracts complexity for managing the ETL lifecycle by automating and maintaining data dependencies, leveraging built-in quality controls with monitoring, and providing deep visibility into pipeline operations with automatic recovery.
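As a small illustration of the Auto Loader part, here is a hedged PySpark sketch. It assumes a Databricks runtime where the `cloudFiles` source is available and an existing `spark` session; the function name and both paths are placeholders of mine.

```python
def autoloader_stream(spark, source_path, schema_path, file_format="json"):
    """Return a streaming DataFrame that incrementally picks up new files.

    `spark` is an existing SparkSession on a Databricks cluster; the paths
    are placeholders for your own storage locations.
    """
    return (
        spark.readStream.format("cloudFiles")               # Auto Loader source
        .option("cloudFiles.format", file_format)            # format of the incoming files
        .option("cloudFiles.schemaLocation", schema_path)    # where the inferred schema is tracked
        .load(source_path)
    )
```

Only files that have not been processed yet are read on each run, which is what makes the ingestion incremental.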

Snowflake provides limited native ETL capabilities, relying on Snowpipe for data ingestion and the COPY INTO <table> command for basic transformations. Organizations are forced to leverage third-party orchestration, data quality, and monitoring tools (e.g. dbt, Azure Data Factory, Informatica, Matillion, etc.) for complex ETL at scale.

Support for Streaming Data

Databricks can directly read from the most popular streaming sources, including Kafka, Amazon Kinesis, Azure Event Hubs, APIs, flat files, images, and videos, and supports programmatic interfaces that allow you to specify arbitrary data writers. Databricks streaming continuously reads changing data without additional complexity.
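Reading from Kafka, for instance, is a few lines of Spark Structured Streaming. The sketch below assumes an existing `spark` session with the Kafka connector available; the function name, broker address and topic are placeholders.

```python
def kafka_stream(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with Kafka keys and values as strings."""
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)  # e.g. "broker:9092"
        .option("subscribe", topic)                            # topic to read from
        .load()
        # Kafka delivers key and value as binary, so cast them for downstream use.
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )
```

The resulting DataFrame can be transformed like any batch DataFrame and then written out with `writeStream`.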

Snowflake has limited support for streaming and takes a data-warehouse-centric approach, loading records into a stage and then copying them to a table before transformations or analytics can be applied to the data. Snowflake does not support streaming unstructured data (images, video, audio). Snowflake only supports Kafka as a streaming source and offers a connector which has fault-tolerance limitations. Data from other common streaming sources must be saved to cloud storage before loading into Snowflake. While it supports ingestion from streaming sources, Snowflake does not have the ability to process data in a streaming manner once the data has been ingested.

Parting thoughts - Whoever wins this battle of data platforms, it is the end users who will benefit.

Anurag Singh

A visionary Gen AI, Data Science, Machine Learning, MLOPS and Big Data Leader/ Architect