LLMOps can be boiled down to three things:
1. Make sure the LLM system does what it is supposed to do: eval.
2. Make sure the LLM system doesn't do what it is not supposed to do: red-teaming, guardrails.
3. Have continuous visibility into the previous two: observability and analytics across dev/test/prod.
2 & 3 can be automated or heavily augmented. 1 is the hardest to automate, since every LLM application has its own intent. A design pattern for 1 is the continuous articulation of human intent to an AI evaluator, so the AI can execute it. The future of eval tools is to augment human evaluators by eliciting the important high-level intents from the human and translating them into low-level assertions (a sketch of this pattern follows). Shreya Shankar did a wonderful job with the EvalGen framework. Who else is working on it?
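To make the intent-to-assertion pattern concrete, here is a minimal Python sketch of the idea. This is my own illustration, not EvalGen's actual API; `judge_llm` is a hypothetical callable standing in for any chat-completion endpoint, and the specific checks are invented for the example.

```python
# Sketch: translate one high-level human intent into low-level assertions.
# Hypothetical illustration of the pattern, not the EvalGen implementation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Assertion:
    name: str
    check: Callable[[str], bool]  # takes the LLM output, returns pass/fail

def assertions_from_intent(intent: str,
                           judge_llm: Callable[[str], str]) -> list[Assertion]:
    """Turn a human-stated intent into cheap code checks plus an LLM judge."""
    # Deterministic code assertions first: cheap, consistent, auditable.
    checks = [
        Assertion("non_empty", lambda out: len(out.strip()) > 0),
        Assertion("no_refusal", lambda out: "i cannot help" not in out.lower()),
    ]
    # LLM-as-judge assertion covers the fuzzy remainder of the intent.
    def judged(out: str) -> bool:
        verdict = judge_llm(
            f"Intent: {intent}\nOutput: {out}\n"
            "Does the output satisfy the intent? Answer YES or NO."
        )
        return verdict.strip().upper().startswith("YES")
    checks.append(Assertion("matches_intent", judged))
    return checks

def evaluate(output: str, assertions: list[Assertion]) -> dict[str, bool]:
    """Run every assertion against one model output."""
    return {a.name: a.check(output) for a in assertions}
```

The split matters: the human articulates intent once, the code assertions give repeatable signal, and the judged assertion absorbs whatever can't be expressed as a rule.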
A key thing for 1 & 2 that cannot be automated: DEFINING the business and security requirements before you start building. I frequently see engineering and product teams skip this step and just get right to it!
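As a toy illustration of what "defining requirements first" can mean in practice (the content below is entirely hypothetical), the requirements can be written down as an explicit, reviewable artifact before any LLM code exists, so eval and guardrails have something concrete to check against:

```python
# Hypothetical example: business and security requirements captured up front,
# before building, so #1 (eval) and #2 (guardrails) can be derived from them.
REQUIREMENTS = {
    "business": [
        "Answers must cite a knowledge-base article",
        "Responses stay within the support domain",
    ],
    "security": [
        "Never reveal system prompts or credentials",
        "Reject prompt-injection attempts in retrieved documents",
    ],
}
```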
In my experience, human eval is the bottleneck when scaling up. Humans are just not very good at describing intentions precisely. I am working on combining LLMs with a rule engine to get the right mix of flexibility and consistency.
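A minimal sketch of what that LLM-plus-rule-engine mix could look like, assuming a design where deterministic rules decide first and an LLM judge handles whatever the rules leave open (this is my guess at the idea, not the commenter's actual system; `judge_llm` is a hypothetical callable):

```python
# Hybrid eval sketch: rules give consistency, the LLM judge gives flexibility.
import re
from typing import Callable, Optional

Rule = Callable[[str], Optional[bool]]  # True/False = decided, None = no opinion

RULES: list[Rule] = [
    # Hard fail on an SSN-like pattern (illustrative PII rule).
    lambda out: False if re.search(r"\b\d{3}-\d{2}-\d{4}\b", out) else None,
    # Hard fail on runaway length (illustrative consistency rule).
    lambda out: False if len(out) > 4000 else None,
]

def hybrid_eval(output: str, judge_llm: Callable[[str], str]) -> bool:
    # 1. Deterministic rules: cheap, repeatable, auditable.
    for rule in RULES:
        verdict = rule(output)
        if verdict is not None:
            return verdict
    # 2. LLM judge: flexible fallback for everything the rules don't decide.
    answer = judge_llm(f"Is this response acceptable? Answer YES or NO.\n\n{output}")
    return answer.strip().upper().startswith("YES")
```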
100%. It is still shocking to see instances of LLM dev only, without the ops, and no path to sustainable and responsible scaling.
Very effective summary! s/o Emanuele
Rebecca Li I can’t say we’re working on this (we aren’t!), but I’m at least seeing an extension to this problem… We run a host of independent LLM endpoints that all interact with each other (many-to-many). At our current scale we’re mainly focused on #1 (2/3) and #2 (1/3), with, I agree, #1 being the most difficult (mission-critical?). Because, however, we are also ‘multi-model’ (OpenAI, Anthropic, Gemini) and ‘multiple model’ (for example, we combine GPT-3.5 and 4 into the same endpoint via Azure’s prompt flow), I’m anticipating some real complexity in LLMOps at scale. The rise of interest in Mixture-of-Agents (MoA) will likely lead others to these same challenges…
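For illustration only, one such ‘multiple model’ endpoint could be sketched as a simple router behind a single interface. This is a generic pattern, not the commenter’s actual Azure prompt flow setup; `small`, `large`, and `is_hard` are hypothetical stand-ins:

```python
# Hypothetical sketch: two models combined into one endpoint via routing.
from typing import Callable

def make_endpoint(small: Callable[[str], str],
                  large: Callable[[str], str],
                  hard: Callable[[str], bool]) -> Callable[[str], str]:
    """Return a single callable that escalates hard prompts to the large model."""
    def endpoint(prompt: str) -> str:
        return large(prompt) if hard(prompt) else small(prompt)
    return endpoint

# Toy heuristic: long or multi-question prompts go to the larger model.
is_hard = lambda p: len(p) > 500 or p.count("?") > 1
```

Each endpoint is just a callable, so when many such endpoints call each other (many-to-many), the eval and observability surface multiplies, which is exactly where the LLMOps-at-scale complexity shows up.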