In the fast-moving world of artificial intelligence and machine learning (AI/ML), everything seems to revolve around data. Entire careers are built around data. Data engineers, data analysts and data scientists are just a few of the roles that gather, synthesize, process, analyze and use data to help AI/ML solve real-world problems. As the volume of data continues to grow exponentially, managing it becomes both critical and increasingly challenging.
Challenges in data management
Data management brings many challenges in the AI/ML space, such as:
- Governance and compliance: Data governance often must meet the requirements of state or federal compliance policies for traceability, privacy and audit
- Data lineage: Tracking all variations or modifications to data is critical in replicating complex AI/ML workflows and outputs
- Data management: Managing the lifecycle of data, how users interact with it and where it is stored, along with choosing the right tooling, helps keep costs and productivity under control
- Knowledge and expertise: There are hundreds of AI/ML tools available to data scientists and engineers, and new ones appear every day. Introducing tools with a familiar workflow that are easy to understand and consume allows engineers to focus on their business goals rather than on learning new tools
The combination of Red Hat OpenShift AI as an AI/ML platform and data version control from lakeFS helps alleviate these challenges.
Data versioning with lakeFS
The most prevalent way developers manage source code today is with Git, through services such as GitHub and GitLab. There are many other tools available, but most follow a similar workflow. Git, however, is not intended for large objects such as data files, tarballs, container images or AI/ML models. For those file types, object storage is commonly used, often through an object storage solution that offers an Amazon S3-compatible interface. OpenShift AI has built-in support for interfacing with S3-accessible object storage.
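For example, a workbench notebook in OpenShift AI typically picks up the endpoint and credentials that a data connection injects as environment variables and talks to the storage with a standard S3 client such as boto3. The sketch below assumes the environment variable names used by OpenShift AI data connections; adjust them to match your setup.

```python
# Minimal sketch: use an OpenShift AI data connection (exposed as environment
# variables) to talk to S3-compatible object storage with boto3.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
)

# List the objects in the bucket named by the data connection.
bucket = os.environ["AWS_S3_BUCKET"]
for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
    print(obj["Key"])
```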
Imagine being able to manage AI/ML data, models, pipeline artifacts and other large object files in a Git-like manner, either through a web console or an API. lakeFS serves as an S3 gateway to many different object storage solutions, including Amazon S3, Azure Blob Storage, Google Cloud Storage, Red Hat OpenShift Data Foundation, MinIO and many more. Even better, it can be added between the OpenShift AI cluster and an existing object storage solution with very few changes to the environment. lakeFS can run locally in the OpenShift cluster, or OpenShift AI can connect to lakeFS in another on-premises environment, public cloud or private cloud.
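Because lakeFS speaks the S3 protocol, the client code above barely changes: point the endpoint at the lakeFS gateway and address objects as repository/branch/path. The endpoint, credentials, repository and branch names below are placeholders for illustration.

```python
# Minimal sketch: the same boto3 client, now pointed at a lakeFS S3 gateway.
# With lakeFS, the "bucket" is the repository and the key is prefixed with the
# branch, e.g. s3://fraud-detection/main/data/train.csv
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://lakefs.example.com:8000",   # lakeFS endpoint (placeholder)
    aws_access_key_id="<lakefs-access-key-id>",
    aws_secret_access_key="<lakefs-secret-access-key>",
)

with open("train.csv", "rb") as data:
    s3.put_object(Bucket="fraud-detection", Key="main/data/train.csv", Body=data)
```

In OpenShift AI, this usually amounts to creating a data connection whose endpoint is the lakeFS gateway and whose bucket is the lakeFS repository, rather than changing notebook or pipeline code.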
With lakeFS, data engineers can now create new repositories for their AI/ML data and models, create branches, make changes, merge changes and track the entire lineage of data. It offers a single, familiar interface regardless of where the data or models are stored.
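As a rough illustration of that workflow, here is a sketch using the high-level lakeFS Python SDK (pip install lakefs). The server address, credentials and repository names are placeholders, and method names may differ slightly across SDK versions, so treat this as an outline rather than a drop-in example.

```python
# Sketch of a Git-like data workflow with the high-level lakeFS Python SDK.
# All names and credentials below are placeholders.
import lakefs
from lakefs.client import Client

clt = Client(
    host="http://lakefs.example.com:8000",
    username="<lakefs-access-key-id>",
    password="<lakefs-secret-access-key>",
)

repo = lakefs.repository("fraud-detection", client=clt)

# Branch off main, add data, commit, then merge back -- just like source code.
exp = repo.branch("experiment-1").create(source_reference="main")
exp.object("data/train.csv").upload(data=open("train.csv", "rb").read())
exp.commit(message="Add cleaned training data")
exp.merge_into(repo.branch("main"))
```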
Try out lakeFS with OpenShift AI
Red Hat has worked closely with the lakeFS team at Treeverse to validate the integration of Red Hat OpenShift AI with lakeFS. We replicated the fraud detection demo found within the OpenShift AI documentation and adapted it to insert lakeFS in between OpenShift AI and a local instance of MinIO. With this validation complete, the lakeFS team has announced support for running lakeFS on an OpenShift cluster and its integration with OpenShift AI. Be sure to check out the Accelerating AI Innovation with lakeFS and OpenShift AI blog on the lakeFS site.
Follow the instructions on how to get your OpenShift AI environment up and running with lakeFS and MinIO, and perform the fraud detection demo with the changes outlined in the instructions. You’ll test pulling data from lakeFS, storing data via lakeFS, saving trained models to lakeFS, pulling models from lakeFS for serving and even explore how OpenShift AI pipelines use lakeFS for artifact storage.
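For instance, once a data connection points at the lakeFS gateway, saving a trained model and pulling it back for serving looks like any other S3 upload and download. The repository, branch and object paths below are illustrative, not the exact names used in the demo.

```python
# Illustrative sketch: store a trained model in lakeFS and retrieve it for
# serving through the S3 gateway. Names and paths are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://lakefs.example.com:8000",   # lakeFS endpoint (placeholder)
    aws_access_key_id="<lakefs-access-key-id>",
    aws_secret_access_key="<lakefs-secret-access-key>",
)

# Save the trained model to a branch of the repository...
s3.upload_file("model.onnx", "fraud-detection", "main/models/fraud/1/model.onnx")

# ...and later pull it back, for example into a model serving environment.
s3.download_file("fraud-detection", "main/models/fraud/1/model.onnx", "model.onnx")
```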
Happy data versioning!
About the author
Sean has been (back) at Red Hat since 2020, working with strategic Red Hat ecosystem partners to co-create integrated product solutions and get them to market.