Rebecca Li’s Post

AI Investor @ Costanoa VC; ex-Databricks; ex-Weights & Biases

LLMOps can be boiled down to three things:

1. Make sure the LLM system does what it is supposed to do: eval.
2. Make sure the LLM system doesn't do what it is not supposed to do: red-teaming, guardrails.
3. Have continuous visibility into the previous two: observability and analytics across dev/test/prod.

2 and 3 can be automated or heavily augmented. 1 is the hardest to automate, because every LLM application has its own intent. A design pattern for 1 is the continuous articulation of human intent to an AI evaluator, so the AI can act on it. The future of eval tools is to augment human evaluators by eliciting the important high-level intents from the human and translating them into low-level assertions (a rough sketch of that pattern follows the paper link below). Shreya Shankar did a wonderful job with the EvalGen framework. Who else is working on it?

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

arxiv.org
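For readers who want to see the shape of that pattern, here is a minimal, hypothetical sketch: a human states high-level criteria, and each criterion is compiled into a low-level assertion, either a plain code check or an LLM-as-judge call. The Criterion class, the llm_judge stub, and the example criteria are illustrative assumptions, not EvalGen's actual API.

```python
# Sketch of the "high-level intent -> low-level assertions" pattern.
# Everything here is illustrative; swap in a real grader model and your own criteria.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str                          # high-level intent articulated by a human
    assertion: Callable[[str], bool]   # low-level check derived from that intent


def llm_judge(question: str, output: str) -> bool:
    # Placeholder for an LLM-as-judge call (OpenAI, Anthropic, ...); hard-coded
    # so the sketch runs without a provider key.
    return True


# Crisp intents become code assertions; fuzzier ones fall back to the LLM grader.
criteria = [
    Criterion("concise", lambda out: len(out.split()) <= 150),
    Criterion("no apologies", lambda out: "sorry" not in out.lower()),
    Criterion("polite tone", lambda out: llm_judge("Is this reply polite?", out)),
]


def evaluate(output: str) -> dict:
    """Run every assertion; human review of disagreements refines the criteria."""
    return {c.name: c.assertion(output) for c in criteria}


print(evaluate("Sure - here is a short, friendly answer."))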

Brian Bradford Dunn

Founder & CEO of Rogatio.ai, an AI-native services company. ‘Retired’ Kearney Senior Partner with over 25 years of experience.

8mo

Rebecca Li I can't say we're working on this (we aren't!), but I'm at least seeing an extension to this problem… We run a host of independent LLM endpoints that all interact with each other (many-to-many). At our current scale we're mainly focused on #1 (2/3) and #2 (1/3), with - I agree - #1 being the most difficult (mission critical?). Because we are also 'multi-model' (OpenAI, Anthropic, Gemini) and 'multiple model' (for example, we combine GPT-3.5 and 4 into the same endpoint via Azure's prompt flow), I'm anticipating some real LLMOps complexity at scale. The rise of interest in Mixture-of-Agents (MoA) will likely lead others to some of these same challenges…
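To make the 'multiple models behind one endpoint' idea concrete, here is a toy sketch under assumed names and a naive routing heuristic; it is not Azure prompt flow's or Rogatio's actual setup.

```python
# Toy sketch: one endpoint fronting a cheap and a strong model.
# Model names, the route() heuristic, and call_model() are illustrative stubs.

def call_model(model: str, prompt: str) -> str:
    # Placeholder for the provider SDK call (OpenAI, Anthropic, Gemini, ...).
    return f"[{model}] response to: {prompt[:40]}"


def route(prompt: str) -> str:
    # Naive heuristic: short prompts go to the cheaper model, longer ones to the
    # stronger model. Real routing would use classifiers or eval scores.
    return "gpt-3.5-turbo" if len(prompt) < 200 else "gpt-4"


def endpoint(prompt: str) -> str:
    return call_model(route(prompt), prompt)


print(endpoint("Summarize our Q3 metrics."))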

Walter Haydock

I help AI-powered companies manage cyber, compliance, and privacy risk so they can innovate responsibly | ISO 42001, NIST AI RMF, HITRUST AI security, and EU AI Act expert | Harvard MBA | Marine veteran

8mo

A key thing for 1 & 2 that cannot be automated: DEFINING the business and security requirements before you start building. I frequently see engineering and product teams skip this step and just get right to it!

In my experience, human eval is the bottleneck when scaling up. Humans are just not very good at describing intentions precisely. I am working on combining an LLM with a rule engine to get the right mix of flexibility and consistency.
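A minimal sketch of what such a rule-engine-plus-LLM mix could look like, assuming a hand-written rule list and a stubbed grade_with_llm judge (both illustrative, not any specific product): deterministic rules give consistency and veto first, and the LLM grader handles the fuzzier intent.

```python
# Hybrid evaluator sketch: hard rules for consistency, LLM grading for flexibility.
# The rule set and grade_with_llm stub are assumptions for illustration.

import re

RULES = [
    ("no email-like strings", lambda text: not re.search(r"\b\S+@\S+\.\w+\b", text)),
    ("stays under 200 words", lambda text: len(text.split()) <= 200),
]


def grade_with_llm(text: str) -> bool:
    # Placeholder for an LLM-as-judge call on fuzzier intent (tone, helpfulness).
    return True


def evaluate(text: str) -> dict:
    rule_results = {name: check(text) for name, check in RULES}
    # Deterministic rules veto first; only rule-clean outputs reach the LLM grader.
    verdict = all(rule_results.values()) and grade_with_llm(text)
    return {"rules": rule_results, "pass": verdict}


print(evaluate("Thanks for reaching out! Here's a short, friendly summary."))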

Frederique De Letter

Analytics Leader @ Keller Williams | Data & AI Transformation

8mo

100% - it is still shocking to see instances of LLM dev only, without the ops, and with no path to sustainable and responsible scaling.

👋 Luca Baggi

AI Engineer @xtream • Maintainer @functime.ai • Machine Learning and Statistics @UniMi

8mo

Very effective summary! s/o Emanuele
