
In the fast-moving world of artificial intelligence and machine learning (AI/ML), everything seems to revolve around data. Entire careers are built around it: data engineers, data analysts and data scientists are just a few of the roles that gather, synthesize, process, analyze and use data to help AI/ML solve real-world problems. As data volumes continue to grow, managing that data becomes both critical and a major challenge.

Challenges in data management

Data management brings many challenges in the AI/ML space, such as:

  • Governance and compliance: Data governance often must meet the requirements of state or federal compliance policies for traceability, privacy and audit
  • Data lineage: Tracking all variations or modifications to data is critical in replicating complex AI/ML workflows and outputs
  • Data management: Managing the lifecycle of data, how users interface with it and its storage requirements, and choosing the right tooling, helps control costs and maintain productivity
  • Knowledge and expertise: There are hundreds of AI/ML tools available to data science engineers, and new ones appear all the time. Introducing tools that have a familiar workflow, and that are easy to understand and consume, allows engineers to focus on their business goals rather than on learning new tools

The combination of Red Hat OpenShift AI as an AI/ML platform and data version control from lakeFS helps alleviate these challenges.

Data versioning with lakeFS

Most developers today manage source code with Git, using tools such as GitHub and GitLab. Many other tools are available, but most follow a similar workflow. Git, however, is not designed for large objects such as data files, tarballs, container images or AI/ML models. Those file types are commonly kept in object storage, often in a solution that offers an Amazon S3 interface. OpenShift AI has built-in support for interfacing with S3-accessible object storage.

Imagine being able to manage AI/ML data, models, pipeline artifacts and other large object files in a Git-like manner, either through a web console or an API. lakeFS serves as an S3 gateway to many different object storage solutions, including Amazon S3, Azure Blob Storage, Google Cloud Storage, Red Hat OpenShift Data Foundation, MinIO and many more. Even better, it can be added between the OpenShift AI cluster and an existing object storage solution with very few changes to the environment. lakeFS can run locally in the OpenShift cluster, or OpenShift AI can connect to lakeFS running in another on-premises environment, public cloud or private cloud.

 

Figure: Red Hat OpenShift AI uses lakeFS as a gateway to third-party object storage
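
One practical consequence of placing lakeFS in front of object storage is how objects are addressed. Through the lakeFS S3 gateway, the repository name takes the place of the bucket, and the branch (or another ref) becomes the first segment of the object key. The Python sketch below illustrates that mapping only. The repository, branch and file names are hypothetical placeholders, not values taken from the demo.

```python
from typing import NamedTuple


class S3Location(NamedTuple):
    """Bucket and key as an S3 client would see them."""
    bucket: str
    key: str


def lakefs_location(repository: str, ref: str, path: str) -> S3Location:
    """Map a lakeFS repository, ref (branch, tag or commit) and object path
    to the bucket/key pair used when talking to the lakeFS S3 gateway."""
    if not repository or not ref:
        raise ValueError("repository and ref must be non-empty")
    # The repository acts as the bucket; the ref prefixes the object key.
    return S3Location(bucket=repository, key=f"{ref}/{path.lstrip('/')}")


def s3_uri(location: S3Location) -> str:
    """Render the location as an s3:// URI."""
    return f"s3://{location.bucket}/{location.key}"


# Hypothetical names, for illustration only.
example = lakefs_location("fraud-detection", "main", "data/transactions.csv")
print(s3_uri(example))
```

Because only the bucket and key change, existing S3 client code usually needs little more than a new endpoint URL and credentials to read from a specific branch.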

 

With lakeFS, data engineers can now create new repositories for their AI/ML data and models, create branches, make changes, merge changes and track the entire lineage of data. It offers a single, familiar interface regardless of where the data or models are stored.
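
To make the branch, commit and merge workflow concrete, here is a minimal sketch that builds the corresponding REST requests using only the Python standard library. The endpoint paths follow the lakeFS OpenAPI specification as commonly documented, so verify them against the API reference for your lakeFS version. The host, credentials, repository and branch names are all hypothetical, and the code only constructs the requests; it does not send them.

```python
import base64
import json
import urllib.request

# Hypothetical lakeFS endpoint; replace with your own.
API_BASE = "http://lakefs.example.internal:8000/api/v1"


def build_request(method, path, access_key, secret_key, body=None):
    """Build (but do not send) an authenticated lakeFS API request."""
    token = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(API_BASE + path, data=data, method=method)
    # lakeFS access key ID and secret key are passed as HTTP basic auth.
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/json")
    return request


def workflow_requests(repo, branch, access_key, secret_key, source="main"):
    """Return the requests for: create branch, commit on it, merge it back."""
    return [
        build_request("POST", f"/repositories/{repo}/branches",
                      access_key, secret_key,
                      {"name": branch, "source": source}),
        build_request("POST", f"/repositories/{repo}/branches/{branch}/commits",
                      access_key, secret_key,
                      {"message": "Add retrained model"}),
        build_request("POST", f"/repositories/{repo}/refs/{branch}/merge/{source}",
                      access_key, secret_key,
                      {"message": f"Merge {branch} into {source}"}),
    ]


# Hypothetical credentials and names, for illustration only.
requests_to_send = workflow_requests(
    "fraud-detection", "experiment-1", "EXAMPLEKEY", "EXAMPLESECRET"
)
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Sending each request with `urllib.request.urlopen(req)` against a live lakeFS instance would perform the step. Between creating the branch and committing, objects would be written to the branch through the S3 gateway.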

Try out lakeFS with OpenShift AI

Red Hat has worked closely with the lakeFS team at Treeverse to validate the integration of Red Hat OpenShift AI with lakeFS. We replicated the fraud detection demo from the OpenShift AI documentation and adapted it to insert lakeFS between OpenShift AI and a local instance of MinIO. With this validation complete, the lakeFS team has announced support for running lakeFS on an OpenShift cluster and for its integration with OpenShift AI. Be sure to check out the Accelerating AI Innovation with lakeFS and OpenShift AI blog post on the lakeFS site.

Follow the instructions to get your OpenShift AI environment up and running with lakeFS and MinIO, then perform the fraud detection demo with the changes outlined in the instructions. You’ll test pulling data from lakeFS, storing data via lakeFS, saving trained models to lakeFS and pulling models to serve from lakeFS, and you can explore how OpenShift AI pipelines use lakeFS for artifact storage.
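
As a rough idea of what saving a trained model via lakeFS can look like from a workbench, the sketch below reads S3 settings from environment variables and uploads a file with boto3. It assumes the data connection exposes the usual `AWS_S3_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION` and `AWS_S3_BUCKET` variables, that the endpoint points at lakeFS and that the bucket value holds the lakeFS repository name. The helper names, branch and paths are hypothetical and are not taken from the demo instructions.

```python
import os
from typing import Mapping


def s3_client_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Collect boto3 client settings from data connection variables."""
    return {
        "endpoint_url": env["AWS_S3_ENDPOINT"],  # assumed to be the lakeFS URL
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }


def model_object_key(branch: str, model_path: str) -> str:
    """Prefix the object key with the lakeFS branch it should land on."""
    return f"{branch}/{model_path.lstrip('/')}"


def upload_model(local_path: str, branch: str, model_path: str,
                 env: Mapping[str, str] = os.environ) -> None:
    """Upload a local model file to a lakeFS branch through the S3 gateway."""
    import boto3  # third-party; imported here so the helpers above run without it

    client = boto3.client("s3", **s3_client_kwargs(env))
    client.upload_file(local_path, env["AWS_S3_BUCKET"],
                       model_object_key(branch, model_path))


# Hypothetical settings, for illustration only.
demo_env = {
    "AWS_S3_ENDPOINT": "http://lakefs.example.internal:8000",
    "AWS_ACCESS_KEY_ID": "EXAMPLEKEY",
    "AWS_SECRET_ACCESS_KEY": "EXAMPLESECRET",
    "AWS_S3_BUCKET": "fraud-detection",
}
print(s3_client_kwargs(demo_env)["endpoint_url"])
print(model_object_key("main", "models/fraud/1/model.onnx"))
```

Uploaded objects stay uncommitted on the branch until a commit is made in lakeFS, which is what allows a model upload to be reviewed or reverted before it is merged.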

Happy data versioning!


About the author

Sean has been (back) at Red Hat since 2020 working with strategic Red Hat ecosystem partners to co-create integrated product solutions and get them to market.
