Market research on versioning tools
Original issue: https://github.com/kedro-org/kedro/issues/3933
Data versioning (Miro Board)
Data versioning is the practice of tracking and managing changes to datasets over time. This includes capturing versions of data as it evolves, enabling reproducibility, rollback capabilities, and auditability. Data versioning is crucial for maintaining data integrity and ensuring that data pipelines and machine learning models are reproducible and reliable.
Click here to see Delta Lake's versioning workflow
Delta Lake, by Databricks, is an open-source storage layer that enables building a Lakehouse architecture on top of data lakes. It is designed to provide ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake is built on top of Apache Spark and enhances the capabilities of data lakes by addressing common challenges like data reliability and consistency.
Strengths
- ACID Transactions: Delta Lake provides strong consistency guarantees through ACID transactions, ensuring data integrity and reliability.
- Unified Batch and Streaming Processing: Delta Lake supports both batch and streaming data processing in a unified manner.
- Time Travel: Delta Lake's time travel feature allows users to query historical versions of data.
- Schema Enforcement and Evolution: Delta Lake enforces schemas at write time and supports schema evolution, allowing changes to the schema without breaking existing queries.
- Scalability and Performance: Delta Lake optimizes storage and querying through techniques like data compaction and Z-Ordering.
- Integration with Spark: Built on top of Apache Spark, Delta Lake integrates seamlessly with the Spark ecosystem, enabling powerful data processing capabilities.
- Rich Ecosystem and Enterprise Support: Backed by Databricks, Delta Lake benefits from a rich ecosystem and commercial enterprise support.
Weaknesses
- Limited Direct Support for Unstructured Data: Delta Lake is primarily designed for structured and semi-structured data.
- Complexity in Setup and Management: Setting up and managing Delta Lake can be complex, particularly for teams not familiar with Spark.
- Tight Coupling with Apache Spark: Delta Lake is heavily dependent on Apache Spark for its operations.
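To make the versioned-write and time-travel behaviour described above concrete, here is a minimal sketch using the deltalake Python package (the delta-rs bindings, so no Spark session is needed); the table path and data are purely illustrative.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Illustrative local path; in practice this could be an S3/ADLS/GCS URI.
table_path = "tmp/orders_delta"

# Version 0: the initial write creates the table and its transaction log.
write_deltalake(table_path, pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]}))

# Version 1: appending rows is recorded as a new table version.
write_deltalake(table_path, pd.DataFrame({"order_id": [3], "amount": [30.0]}), mode="append")

# Time travel: load the table exactly as it existed at version 0.
first_version = DeltaTable(table_path, version=0).to_pandas()

# Inspect the commit history that backs the versioning.
history = DeltaTable(table_path).history()
```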
Click here to see DVC's versioning workflow
DVC, or Data Version Control, is an open-source tool specifically designed for data science and machine learning projects. It combines the version control power of Git with functionality tailored for large datasets, allowing users to track data changes, collaborate efficiently, and ensure project reproducibility by referencing specific data versions. In short, DVC is "Git for data": just as Git tracks changes to your code, DVC tracks changes to your data.
Strengths
- Integration with Git: DVC seamlessly integrates with Git, leveraging familiar version control workflows for managing datasets and models. This integration makes it easy for teams already using Git to adopt DVC without significant changes to their workflow.
- Efficient Large File Management: DVC efficiently handles large files by storing them in remote storage backends and only keeping metadata in the Git repository. This avoids bloating the Git repository and ensures efficient data management.
- Reproducibility: DVC's pipeline management and experiment tracking features ensure that data workflows are reproducible. Users can recreate specific experiment runs by tracking versions of data, models, and code.
- Flexible Remote Storage: DVC supports various remote storage options, including AWS S3, Google Cloud Storage, Azure Blob Storage, and more. This flexibility allows users to choose storage solutions that best fit their needs.
- Experiment Management: DVC's experiment management capabilities, including checkpointing and comparing experiment runs, provide a robust framework for tracking and optimizing machine learning experiments.
- Open Source and Community Support: DVC is open source, with an active community contributing to its development and providing support. This ensures continuous improvement and a wealth of shared knowledge and resources.
Weaknesses
- CLI Focused: DVC introduces new concepts and CLI commands that users need to learn, which can be a barrier for those not familiar with command-line tools or version control systems.
- Limited Scalability for Very Large Datasets: Managing projects with very large datasets and complex data pipelines can become cumbersome with DVC, as it requires careful organization and management of DVC files and configurations.
- Limited Native UI: While DVC provides a powerful CLI, its native graphical user interface (UI) options are limited. Users often rely on third-party tools or custom-built interfaces for visualization and management.
- Dependency on Git: DVC's strong dependency on Git means that it might not be suitable for environments where Git is not the primary version control system, or where users are not familiar with Git workflows.
- Complexity of Collaborative Configurations: Collaboration with others requires multiple configurations such as setting up remote storage, defining roles, and providing access to each contributor, which can be frustrating and time-consuming.
- Inefficient Data Addition Process: Adding new data requires pulling the existing data and recalculating hashes before pushing the whole dataset back to remote storage.
- Lack of Relational Database Features: DVC does not provide relational database features such as querying or transactional updates, making it an unsuitable choice for teams whose workflows depend on them.
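To ground the "Git for data" idea, here is a minimal sketch that uses DVC's Python API to read a dataset pinned to a specific Git revision; the repository URL, file path, and tag are hypothetical.

```python
import dvc.api

# Hypothetical repository and DVC-tracked file; the Git tag "v1.0" pins both
# the code and the corresponding data version.
with dvc.api.open(
    "data/raw/iris.csv",
    repo="https://github.com/example-org/example-project",
    rev="v1.0",
) as f:
    header = f.readline()

# Resolve where that exact data version lives in remote storage.
url = dvc.api.get_url(
    "data/raw/iris.csv",
    repo="https://github.com/example-org/example-project",
    rev="v1.0",
)
```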
Click here to see Hudi's versioning workflow
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework that helps manage large datasets stored in data lakes. It brings core warehouse and database functionality directly to a data lake. Hudi is designed to provide efficient data ingestion, storage, and query capabilities with strong support for incremental data processing. It enables data engineers to build near real-time data pipelines with support for transactions, indexing, and upserts (updates and inserts).
Strengths
- Efficient Incremental Processing: Hudi excels at incremental data processing, allowing for efficient upserts (updates and inserts) and deletes.
- ACID Transactions: Hudi supports ACID transactions, ensuring data consistency and reliability.
- Real-Time Data Ingestion: Hudi is designed to support near real-time data ingestion and processing, making it suitable for streaming data applications.
- Time Travel and Historical Queries: Hudi supports time travel queries, allowing users to access historical versions of data efficiently.
- Schema Evolution: Supports schema evolution, allowing for changes to the schema without significant overhead.
- Integration with Big Data Ecosystem: Hudi integrates seamlessly with Apache Spark, Apache Hive, Presto, and other big data tools.
Weaknesses
- Complexity in Setup and Management: Hudi can be complex to set up and manage, particularly for teams not familiar with the Hadoop ecosystem.
- Limited Support for Unstructured Data: Hudi is primarily focused on structured and semi-structured data.
- Performance Overhead: Managing frequent updates and maintaining indexes can introduce performance overhead.
- Maturity and Ecosystem: While rapidly maturing, Hudi’s ecosystem may not be as mature as some traditional data management tools.
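The sketch below illustrates Hudi's upsert-oriented write path and point-in-time reads with PySpark; the table name, fields, and path are illustrative, a Spark session with the Hudi bundle on the classpath is assumed, and the exact option set can vary by Hudi version.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is available on the classpath.
spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

df = spark.createDataFrame(
    [(1, 10.0, "2024-01-01", "emea"), (2, 20.0, "2024-01-02", "amer")],
    ["order_id", "amount", "updated_at", "region"],
)

# Each upsert produces a new commit on Hudi's timeline.
df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/orders")

# Point-in-time ("time travel") read against a past commit instant.
snapshot = (
    spark.read.format("hudi")
    .option("as.of.instant", "20240101000000")
    .load("/tmp/hudi/orders")
)
```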
Click here to see Iceberg's versioning workflow
Apache Iceberg is an open-source table format for managing large-scale datasets in data lakes, designed for petabyte-scale data. It ensures data consistency, integrity, and performance, and works efficiently with big data processing engines like Apache Spark, Apache Flink, and Apache Hive. Iceberg combines the reliability and simplicity of SQL tables with high performance, enabling multiple engines to safely work with the same tables simultaneously.
Strengths
- Schema and Partition Evolution: Supports non-disruptive schema changes and partition evolution, allowing tables to adapt to changing requirements without data rewriting.
- Snapshot Isolation and Time Travel: Offers robust snapshot isolation, enabling time travel to query historical versions of data.
- Hidden Partitioning: Abstracts partitioning details from users, simplifying query writing while ensuring efficient data access.
- Integration with Multiple Big Data Engines: Supports integration with Apache Spark, Flink, Hive, and other big data processing engines.
- Atomic Operations: Ensures atomicity for operations like appends, deletes, and updates, providing strong consistency guarantees.
Weaknesses
- Complexity in Setup and Management: Setting up and managing Iceberg tables can be complex, particularly for teams not familiar with big data ecosystems.
- Limited Direct Support for Unstructured Data: Primarily designed for structured and semi-structured data.
- Ecosystem Maturity: While rapidly maturing, Apache Iceberg's ecosystem is newer compared to some competitors like Delta Lake.
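As a sketch of Iceberg's snapshot-based time travel from Spark SQL, the snippet below assumes a Spark session configured with the Iceberg runtime and a catalog named "demo"; the catalog, database, and table names are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar and an Iceberg catalog "demo" are configured.
spark = SparkSession.builder.appName("iceberg-time-travel-demo").getOrCreate()

# Normal reads hit the current snapshot of the table.
current = spark.sql("SELECT * FROM demo.db.orders")

# Time travel by timestamp: the table as it existed at a point in time.
as_of_time = spark.sql(
    "SELECT * FROM demo.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
)

# Time travel by snapshot id, taken from the table's snapshots metadata table.
snapshots = spark.sql("SELECT snapshot_id, committed_at FROM demo.db.orders.snapshots")
first_id = snapshots.first()["snapshot_id"]
as_of_snapshot = spark.sql(f"SELECT * FROM demo.db.orders VERSION AS OF {first_id}")
```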
Click here to see Pachyderm's versioning workflow
Pachyderm is an open-source data engineering platform that provides data versioning, pipeline management, and reproducibility for large-scale data processing. It combines data lineage and version control with the ability to manage complex data pipelines, making it an ideal tool for data science and machine learning workflows.
Strengths
- Comprehensive Data Lineage: Automatically tracks data transformations, making it easy to audit and trace the source of any data.
- Robust Versioning: Provides Git-like version control for data, ensuring all changes are tracked and reproducible.
- Scalability and Performance: Built to handle large datasets and complex workflows efficiently.
- Integration with Kubernetes: Benefits from Kubernetes’ powerful orchestration capabilities for scaling and managing resources.
- Reproducibility: Ensures that every step in a data pipeline can be reproduced exactly, which is critical for reliable data science and machine learning workflows.
Weaknesses
- Complexity: Can be complex to set up and manage, especially for users unfamiliar with Kubernetes.
- Learning Curve: Has a steep learning curve due to its powerful but intricate features.
- Resource Intensive: Requires significant computational resources, particularly for large-scale data processing tasks.
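A minimal sketch of committing a file into a versioned Pachyderm repo follows, assuming the python_pachyderm client and a reachable cluster; the client API has shifted between major releases, so treat the call names as indicative, and the repo and file names are hypothetical.

```python
import python_pachyderm

# Assumes pachd is reachable with default connection settings.
client = python_pachyderm.Client()

# Data lives in versioned repos; every change lands as a commit on a branch.
client.create_repo("raw-images")

with client.commit("raw-images", "master") as commit:
    # Each put_file call becomes part of this commit's immutable snapshot.
    client.put_file_bytes(commit, "/labels.csv", b"id,label\n1,cat\n")

# Walk the commit history, much like `git log` for data.
for info in client.list_commit("raw-images"):
    print(info.commit.id)
```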
Code versioning (Miro Board)
Code versioning is the practice of managing changes to source code over time. It involves tracking and controlling modifications to the codebase to ensure that all changes are recorded, identifiable, and reversible. Code versioning is a fundamental practice in software development and is typically facilitated by version control systems (VCS).
- Version Control Systems (VCS):
  - Centralized VCS: A single central repository where all versions of the code are stored.
  - Distributed VCS: Each developer has a local copy of the repository, including its full history.
- Repositories: A repository is a storage location for the codebase, including all versions of the code and its history.
- Commits: A commit is a record of changes made to the code. Each commit includes a unique identifier, a message describing the changes, and metadata such as the author and timestamp.
- Branches: Branches allow developers to work on different features, bug fixes, or experiments in parallel without affecting the main codebase. Branches can be merged back into the main branch once the changes are ready.
- Tags: Tags are used to mark specific points in the repository's history as significant, such as releases or milestones.
- Merging: Merging combines changes from different branches into a single branch, resolving any conflicts that arise from simultaneous modifications.
- Conflict Resolution: When changes from different branches conflict, developers must resolve these conflicts to integrate the changes.
Click here to see Git's versioning workflow
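As a minimal sketch of the commit, tag, branch, and merge concepts above, the snippet below drives Git programmatically with the GitPython library; the repository path, branch name, and file contents are illustrative.

```python
from pathlib import Path
from git import Repo

# Create a fresh repository (illustrative path).
repo_dir = Path("tmp/demo-repo")
repo_dir.mkdir(parents=True, exist_ok=True)
repo = Repo.init(repo_dir)

# Commit: record a change with a message and metadata.
(repo_dir / "pipeline.py").write_text("print('v1')\n")
repo.index.add(["pipeline.py"])
repo.index.commit("Add initial pipeline")

# Tag: mark this point in history as a release.
repo.create_tag("v0.1.0")

# Branch: work on a feature in parallel with the default branch.
default = repo.active_branch  # "main" or "master", depending on git config
feature = repo.create_head("feature/tuning")
feature.checkout()
(repo_dir / "pipeline.py").write_text("print('v2')\n")
repo.index.add(["pipeline.py"])
repo.index.commit("Tune pipeline")

# Merge: bring the feature branch back into the default branch.
default.checkout()
repo.git.merge("feature/tuning")
```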
Model versioning (Miro Board)
Model versioning refers to the practice of managing different versions of machine learning models to track changes, ensure reproducibility, and manage deployments. It involves maintaining records of model parameters, architecture, training data, and performance metrics for each version of the model. This practice is crucial for model experimentation, collaboration, auditability, and continuous integration/continuous deployment (CI/CD) processes in machine learning workflows.
Click here to see MLflow's versioning workflow
Click here to see DVC's versioning workflow
Click here to see W&B's versioning workflow
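To ground these ideas, here is a minimal sketch of creating model versions with MLflow's tracking and model registry APIs; the experiment and registered model names are hypothetical, a scikit-learn model is assumed, and a local SQLite-backed tracking store is assumed because the registry needs a database backend.

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# The model registry requires a database-backed store (assumed local SQLite here).
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("iris-classifier")

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

    # Record the parameters and metrics that describe this model version.
    mlflow.log_param("n_estimators", 50)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Logging with a registered model name creates a new version in the registry
    # (version 1, 2, ... each time this run is repeated).
    mlflow.sklearn.log_model(model, "model", registered_model_name="iris-classifier")
```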