Open Sourcing FlyteInteractive: Saving thousands of AI engineering hours in developing ML interactively
In the dynamic landscape of AI innovation, LinkedIn is embracing new possibilities, adopting and experimenting with advancements such as large ranking models with hundreds of millions of parameters and billion-parameter large language models. These innovations will enhance the member experience with more personalized and accurate job recommendations, networking opportunities, career insights, and much more.
The shift towards large and complex models necessitates enhancing development productivity and model iteration velocity, ensuring that we can quickly validate ideas and iterate on groundbreaking AI solutions using LinkedIn's rich data.
When we recognized inefficiencies in the existing development platform, we took steps to ensure that we could fulfill our ambitious goals. As part of this effort to improve machine learning (ML) developer productivity at LinkedIn, we developed FlyteInteractive. This tool, built on top of Flyte, provides engineers with an interactive environment inside Kubernetes pods, allowing them to easily debug their models in a production-like environment. This has reduced the number of iterations and the iteration time for tracking bugs and testing ideas by an impressive 96%. Today, we're excited to open source FlyteInteractive.
Previous challenges of ML development
Prior to FlyteInteractive, LinkedIn's ML model development process could be less than straightforward and presented our developers with many challenges. Our engineers would:
- Start by developing the model code locally.
- Write code to generate a mock dataset. This dataset could be used in a local test framework to train the model for fewer steps, allowing them to catch trivial issues like syntax errors early, before submitting the actual run job. The dataset was not a true reflection of the data used in the real-world run because it had fewer records and fewer feature fields.
- Run local tests, constrained in the size and structure of the models that could be tested. Engineers often had to shrink the model to pass the test, which did not truly reflect the production use case. Scalability issues surfaced only when training jobs were submitted with billion-parameter models on millions of training records; as we experimented with models at larger scales, this gap became even more pronounced and local tests became less effective. Because GPUs and a distributed environment were absent locally, tasks that are very common at LinkedIn, such as developing distributed training optimizations like accelerating gradient communication or optimizing CUDA kernels, could not be performed locally.
- Upload the pipeline to the cluster for training at scale. If it failed, go back to step 1.
Following these steps, developers entered a perpetual cycle of making local changes, resubmitting their workflow, and waiting for it to pass through the orchestrator's scheduling cycle. Because of the inherent gap between the development and production environments, the success rate was only 20%, and it typically took more than 15 minutes for the result of each attempt to become apparent.
This was an extremely inefficient process, exemplified by one developer who needed dozens of attempts to find a minor bug, taking nearly a week to resolve. This highlighted the necessity for a more efficient and streamlined approach to ML model development.
Modernizing our ML infrastructure with Flyte
In January 2023, we began investing in Flyte as our next-generation machine learning pipeline orchestrator. To date, we have migrated all of our Large Language Model (LLM) workloads, and some of our traditional workloads, to Flyte. This platform has provided numerous benefits, significantly accelerating and streamlining our ML development.
The team began to build an ecosystem on top of Flyte. We built our Component Hub, containing 20+ reusable components for different stages of the ML lifecycle. Output caching and fast code syncing have been instrumental in eliminating redundant and duplicative work. Writing a Flyte pipeline is as straightforward as scripting in Python, enhanced by strong-type checks that catch bugs at compile time, significantly reducing failure rates.
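The compile-time type checks can be illustrated with a small sketch. This is a toy model, not Flyte's actual implementation: it only shows the idea that mismatched annotations between pipeline steps are rejected before anything runs.

```python
import inspect

def check_pipeline(*steps):
    """Toy model of compile-time type checking: verify that each
    step's return annotation matches the next step's first input
    annotation before anything runs. Flyte performs a richer version
    of this when a workflow is compiled; this is only a sketch."""
    for producer, consumer in zip(steps, steps[1:]):
        out_type = inspect.signature(producer).return_annotation
        first_param = next(iter(inspect.signature(consumer).parameters.values()))
        if out_type is not first_param.annotation:
            raise TypeError(
                f"{producer.__name__} returns {out_type.__name__}, "
                f"but {consumer.__name__} expects {first_param.annotation.__name__}"
            )

def load_data(path: str) -> list:
    return [1, 2, 3]

def train(records: list) -> float:
    return sum(records) / len(records)

def bad_train(records: dict) -> float:   # wrong input type, caught before running
    return 0.0

check_pipeline(load_data, train)   # wiring is consistent: list -> list
```

Wiring `load_data` into `bad_train` would raise a `TypeError` immediately, which is the kind of early failure that keeps pipeline failure rates low.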
However, we experienced challenges from mismatches between the production and development environments, including GPU availability, data access, and memory capacity. We quickly recognized the need for an interactive development experience, and to resolve these issues, we built FlyteInteractive.
Interactively Develop with FlyteInteractive
With FlyteInteractive, developers can conduct thorough tests and debug thorny issues in a production-like environment. A code-server instance is started in the pod so users can develop in VS Code in the browser. All they need to do is add the `@vscode` decorator to seamlessly enable this feature in any existing Python Flyte task:
```diff
  @task
+ @vscode
  def training():
      ...
```
Without `@vscode`, the training task above runs as a batch job: users have to wait for it to terminate to see the result. With the one-line change of adding `@vscode`, the task becomes an interactive job running a VS Code server. Users can easily connect to it and develop in a remote environment with a full-featured IDE.
Notable features of FlyteInteractive
Remote environment access
FlyteInteractive allows interactive jobs to run in a remote setting with HDFS and multi-GPU access. This feature eliminates the need for mock data, which can be ineffective due to discrepancies in data characteristics. Developers can securely access and test on derived, non-PII data from HDFS, aligning closely with the data used in batch jobs and significantly reducing error risks. This works both for single-node jobs and for complex setups such as multi-GPU training with torchrun or multi-node, multi-GPU Ray jobs.
Code inspection and debugger
The code structure of large models frequently involves complexity, with highly entangled modules, multiple configurations, unrefined prototyping code, and numerous imported modules.
The VS Code IDE helps developers navigate this complexity by letting them click through the code to jump across files or delve into the details of a Python module's implementation, which is invaluable for understanding a complex codebase.
The VS Code IDE is also equipped with a debugger. Developers can execute Python programs under the debugger and set breakpoints, tracing and examining variables line by line, which greatly aids the debugging process.
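Alongside breakpoints set in the editor gutter, Python's built-in `breakpoint()` is handy for pausing only when something looks wrong. The toy training step below uses hypothetical names to illustrate the pattern:

```python
def training_step(batch):
    """Toy training step (hypothetical names): pause execution only
    when the loss looks suspicious, instead of stepping through every
    iteration. Python's built-in breakpoint() drops into the debugger
    at exactly this line when it is reached."""
    loss = sum(batch) / len(batch)   # stand-in for a real loss computation
    if loss > 0.9:                   # only suspicious values are worth a stop
        breakpoint()
    return loss
```

This conditional-stop pattern is especially useful in long training loops, where stepping through every iteration would be impractical.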
Jupyter notebook support
FlyteInteractive offers Jupyter Notebook support, which is beneficial in various scenarios, such as visual data analysis and iterative code development. Users can create a .ipynb file and execute it as a Jupyter Notebook. This feature is particularly valuable for tasks like visualizing data, exploring datasets, and prototyping code, providing an interactive environment for block-by-block debugging and analysis.
Resource Management
The plugin has a built-in garbage collector that periodically monitors active connections to manage resources efficiently. It can reclaim the environment based on either a time-to-live setting, which caps the maximum lifetime of the environment, or a max-idle-seconds setting, which deletes the environment if it has not been accessed for the configured duration. Users can configure both settings together for the best possible resource utilization.
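The two reclamation rules can be sketched as a single predicate. Parameter names and defaults here are illustrative, not the plugin's exact settings:

```python
def should_reclaim(created_at, last_access, now,
                   ttl_seconds=4 * 3600, max_idle_seconds=600):
    """Toy model of the garbage collector's two rules (names and
    defaults are illustrative): a time-to-live cap on the environment's
    total lifetime, plus an idle timeout that reclaims environments
    with no recent access. All arguments are epoch seconds."""
    if now - created_at >= ttl_seconds:        # hard lifetime cap
        return True
    if now - last_access >= max_idle_seconds:  # nobody connected recently
        return True
    return False
```

Combining the two rules means a forgotten session is cleaned up quickly, while even an actively used session cannot hold its GPUs forever.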
Interactively debug the task (in collaboration with the Flyte Community)
To run the task in VS Code, select "Run and Debug" from the left panel and execute the "interactive debugging" configuration. This will run your task with the inputs from the previous task.
Resume your task with updated code (in collaboration with the Flyte Community)
After you finish debugging, you can resume your task with the updated code by executing the "resume task" configuration. This terminates the code server and runs the task with the inputs from the previous task.
Impact and Testimony
The traditionally time-consuming ML development process is greatly streamlined with FlyteInteractive. Users no longer need to run less-effective local tests hampered by the gaps described above, then submit and wait for results. Instead, they can rapidly run experiments and see results within seconds using interactive jobs.
Based on user feedback and our internal metrics, FlyteInteractive has been very successful, saving thousands of AI engineering hours in developing ML interactively. It has become the default method for experimenting and debugging on the new Flyte-based platform at LinkedIn. Its effectiveness in debugging complex issues has been particularly notable.
AI engineers have praised our tool as transformative, highlighting its efficient code inspection, profiling, and benchmarking. Users utilize FlyteInteractive's code inspection and debugger features to resolve various challenges, such as AUC drops in distributed training and bottlenecks in transformer models.
Open Sourcing FlyteInteractive and next steps
We are excited to announce that we have open sourced FlyteInteractive, now available at flytekit-FlyteInteractive.
We hope other members of the community can leverage this and collectively help improve the overall ecosystem. Our vision is to transition FlyteInteractive from a plugin to a first-class feature in Flyte by making the developer experience even more seamless.
Looking ahead, we're focusing our efforts on transforming LinkedIn's AI infrastructure, aiming for quicker iteration, simpler experimentation, and an easier interface. FlyteInteractive is just one part of this journey, so continue visiting our engineering blog to learn more about how we're enhancing developer productivity.
Acknowledgements
This collaboration spans multiple organizations across LinkedIn and the Flyte Community, with contributions from many teammates.
Thanks to Byron (Pin-Lun) Hsu and Jason (Siyu) Zhu for initiating the idea, shepherding FlyteInteractive, and evangelizing it to broader users.
Thanks to Ankit Goyal, Biao He, Ben Levine, Clint Zhang, Haowen Ning, Keqiu Hu, Lingyu Lyu, Qilin Xu, Richard Li, Wenye Zhang, Chen Xie, Nithila Ilangovan, Yue Shang, and Yubo Wang from the Training and Infra team for advancing LinkedIn ML Infrastructure with Flyte.
Thanks to the Flyte Infra team - Shardool S, Yizhou Luo, Vinayak Agarwal, Binyao Jiang, and Kamal Duggireddy - for enabling us with a scalable Flyte service that is fast evolving to meet ML needs at LinkedIn.
Thanks to Benjamin Le, Ran Zhou, Sai Vivek Kanaparthy, Yanbin Jiang, Anastasiya Karpovich, Yun Dai, Chen Zhu, Ata Fatahi Baarzi, Vignesh Kothapalli, Siddharth Dangi, and Jitendra Agarwal for being the pilot users and providing precious feedback to improve the project.
Thanks to Yi Chiu, Han-Ru Chen, Jason Lai, Chi-Sheng Liu, and Fabio Gratz from the Flyte Community for driving the open sourcing effort, implementing advanced features, and promoting it to broader Flyte users.
Thanks to Kevin Su, Yee Hing Tong, Eduardo Apolinario, Ketan Umare, and Haytham Abuelfutuh from Union.ai for providing dedicated guidance on Flyte.
Thanks to our leadership team - Animesh Singh, Zheng Li, and Zhifei Song - for their advice and support.