lakeFS


lakeFS is an open-source data version control system for managing data stored in object storage.
It provides Git-like operations such as branching, committing, merging, and reverting for large-scale data stored in Amazon S3, Azure Blob Storage, Google Cloud Storage, and S3-compatible object stores. lakeFS is used in data engineering and machine learning workflows to manage changes to data, support reproducibility, and enable data governance across data lakes.
The software is available as an open-source project, as well as in enterprise and managed service offerings, including lakeFS Cloud.

History

lakeFS was created in 2020 by Einat Orr and Oz Katz at Treeverse.
Its first public release, version 0.8.1, appeared in August 2020 and introduced Git-style operations with support for Amazon S3.
In 2021, Treeverse raised $23 million in a Series A funding round led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.
The same year, lakeFS was included in InfoWorld’s Best of Open Source Software awards.
In June 2022, Treeverse introduced lakeFS Cloud, a managed service providing hosted lakeFS deployments for cloud-based data lakes.
Version 1.0 was released in October 2023, adding integrations with platforms such as Databricks and Apache Iceberg, as well as support for orchestration tools including Apache Airflow.
Public case studies and conference materials have described the use of lakeFS at organizations such as Microsoft, Volvo, and NASA.
In July 2025, Treeverse announced an additional $20 million in growth funding to support further development of lakeFS.
In November 2025, Treeverse announced the acquisition of the open-source data version control project DVC.

Software

Overview

lakeFS applies Git-like operations (branching, committing, merging, and reverting) to datasets stored in object storage. These operations are used to manage changes to data, test modifications in isolation, reproduce specific data states, and recover from errors or unintended updates.

Architecture

lakeFS operates as a metadata layer on top of object storage systems such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
It stores repository metadata describing commits, branches, and tags, enabling versioned views of data without copying underlying objects.
The system provides access through multiple interfaces, including a web user interface, command-line tools, a REST API, and software development kits.
It is designed to integrate with existing data engineering and machine learning workflows, and can be deployed either in self-hosted environments or as a managed service.
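
The idea of versioned views without copying objects can be illustrated with a small in-memory model. This is a conceptual sketch only, not the lakeFS implementation or its API: the ToyRepo class, its method names, and the object addresses are all hypothetical. It assumes that a commit is an immutable mapping from logical paths to object addresses and that a branch is just a named pointer to a commit.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: logical path -> physical object address."""
    id: int
    parent: Optional[int]
    message: str
    entries: Dict[str, str]


class ToyRepo:
    """Hypothetical in-memory stand-in for a metadata layer (not the lakeFS API)."""

    def __init__(self) -> None:
        root = Commit(id=0, parent=None, message="initial", entries={})
        self.commits: Dict[int, Commit] = {0: root}
        self.branches: Dict[str, int] = {"main": 0}  # branch name -> head commit id
        self.staged: Dict[str, Dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, source: str) -> None:
        # A new branch is only a new pointer; no objects are copied.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def put(self, branch: str, path: str, address: str) -> None:
        # Record where the object lives; the address is a made-up string here.
        self.staged[branch][path] = address

    def commit(self, branch: str, message: str) -> int:
        # Combine the branch head with staged changes into a new snapshot.
        head = self.commits[self.branches[branch]]
        entries = {**head.entries, **self.staged[branch]}
        new = Commit(id=len(self.commits), parent=head.id,
                     message=message, entries=entries)
        self.commits[new.id] = new
        self.branches[branch] = new.id
        self.staged[branch] = {}
        return new.id

    def read(self, branch: str, path: str) -> Optional[str]:
        # Committed view only; staged changes are ignored for simplicity.
        return self.commits[self.branches[branch]].entries.get(path)


repo = ToyRepo()
repo.put("main", "events/part-0.parquet", "s3://bucket/data/abc123")
repo.commit("main", "load events")

# Branching copies a pointer, so the experiment starts with main's data.
repo.create_branch("experiment", source="main")
repo.put("experiment", "events/part-0.parquet", "s3://bucket/data/def456")
repo.commit("experiment", "rewrite events")

main_view = repo.read("main", "events/part-0.parquet")
experiment_view = repo.read("experiment", "events/part-0.parquet")
```

In this model the two branches resolve the same logical path to different objects, and creating the branch touched only metadata. A real system would also need persistent storage, concurrency control, and garbage collection, none of which is modeled here.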

Functions

lakeFS provides version control functionality for data stored in object storage–based data lakes. Core features include:
  • Atomic commits and version tracking for datasets, supporting reproducibility and auditability.
  • Branching and merging mechanisms that allow isolated development and testing without duplicating data.
  • Configurable hooks that can validate data or trigger external processes during commit and merge operations.
  • The ability to revert repositories to earlier states to recover from data errors or failed changes.
  • Recording of commit history and associated metadata for lineage tracking.
  • Support for managing data across multiple object storage systems, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.
  • Use of fixed data versions to reproduce experiments and machine learning model training.
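
Two of the features listed above, validation hooks and reverting, can be sketched in the same simplified style. The code below is an illustration under stated assumptions rather than lakeFS's hook mechanism or revert semantics: the functions require_parquet, merge, and revert are hypothetical, a branch is modeled as a plain list of snapshots, and a revert is modeled as recording the previous state as a new snapshot.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, str]           # logical path -> object address
Hook = Callable[[Snapshot], None]   # raises ValueError to reject a change


def require_parquet(snapshot: Snapshot) -> None:
    """Example pre-merge check: every path must end in .parquet."""
    bad = [path for path in snapshot if not path.endswith(".parquet")]
    if bad:
        raise ValueError(f"non-Parquet files: {bad}")


def merge(history: List[Snapshot], incoming: Snapshot, hooks: List[Hook]) -> None:
    """Append the merged snapshot to history only if every hook passes."""
    candidate = {**history[-1], **incoming}
    for hook in hooks:
        hook(candidate)  # an exception aborts the merge; history is untouched
    history.append(candidate)


def revert(history: List[Snapshot]) -> None:
    """Undo the latest change by recording the previous state as a new snapshot."""
    history.append(dict(history[-2]))


main = [{"sales/2024.parquet": "obj-1"}]

# A change that fails validation never reaches the branch.
try:
    merge(main, {"sales/notes.csv": "obj-2"}, [require_parquet])
    rejected = False
except ValueError:
    rejected = True

# A valid change is merged, then undone by a revert.
merge(main, {"sales/2025.parquet": "obj-3"}, [require_parquet])
revert(main)
```

Running the hooks against the candidate state before appending it is what keeps a rejected change out of the history. Modeling the revert as an additional snapshot, rather than deleting the last one, preserves the record of what happened, which fits the auditability goal described in the list.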

Integrations

Coverage of lakeFS has described integrations with platforms such as Databricks and Apache Iceberg, as well as support for environments including Red Hat OpenShift.
Additional materials describe its use with Trino, including validation of data changes prior to merging in versioned data workflows, as well as compatibility with orchestration tools such as Apache Airflow.