🔔 Our CEO, Niall Murphy, writes about the big problems in #MLOps on foot of his recent presentation to the #ACM. Stanza’s work in data processing, statistics, analysis, and machine learning aligns well with questions of how to monitor and understand performance, cost, and reliability of pipelines and production systems generally. Please get in contact if you want to join our private beta! #Stanza #reliability #devops
I was honoured to be invited to speak to the ACM Queue editorial board on the topic of MLOps a few weeks ago. For those not already familiar, ACM Queue is a publication of the ACM, the premier professional association for folks working in computing and the organisation that selects the Turing Award recipients. The Queue develops a sense of what's important to the membership and tries to commission articles on those topics. I wouldn't describe myself as an expert on this, but colleagues(*) weighed in to improve my poorly informed opinions, framed in the meeting as "the unsolved problems of MLOps".

Model versioning and quality control. Applicable both internally and externally. From the user's point of view, a model can be called v123 today and v123 tomorrow, yet give radically different answers. A model can be the "same" from a versioning point of view but trained over different data and give different responses, and so on. Today the users pick up the tab for this. (There's a small sketch of one way to tighten this at the end of the post.)

Data and metadata management. Anyone doing model training has to organise their datasets and track provenance, permissible and/or suitable use cases, legal compliance, problem-domain relevance, quality scoring, and so on. Hardly anyone is doing this thoroughly and well. We have toolkits, not a methodology.

Data leakage from LLMs. Today, the only technique we know that certainly works to prevent inappropriate data leakage from an LLM is to remove what you care about from the training data. LLMs are also very difficult to control, as evidenced by the various jail-breaking contests. Nothing yet amounts to a useful body of practice.

Efficiency and costs. At the time of writing, open-market GPU hardware specifically enabling AI has a prototyping segment and a commodity segment - i.e. some components are really expensive in return for leading-edge performance, and the rest are not. Using resources efficiently is hugely important, but query-estimation approaches that rely on simple token-based load-balancing don't correctly estimate costs. In practice, inference loads are inherently spiky, so capacity ends up stranded.

Monitoring. Over half of survey respondents (link below) stated they didn't monitor ML in production: this is astonishing. Of the many reasons why, maybe the most important is that today there is rarely a way to quickly tell whether your ML system is working well, or as well as it used to. The only metric we really have confidence in is "closed-loop with external feedback" - e.g. sales, positive user reaction, and so on. Another big question is who does that verification. Today the answer is, disappointingly often, "the users". (A small drift-check sketch also appears at the end of the post.)

Summary. For MLOps, it sometimes feels like we're using old service paradigms to construct, monitor, and manage these new services, and suffering as a result. Stay tuned for more!

*Thanks to Todd Underwood, Demetrios Brinkmann, Betsy Beyer (and others) for providing much of the support for this.
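A minimal sketch of the versioning point above, assuming nothing about any particular tool: a model "version" that fingerprints the training data and training config alongside the human-readable label, so two builds both called v123 but trained on different data can still be told apart. The names here (ModelVersion, fingerprint_dataset, make_version) are invented for illustration, not anyone's production API.

```python
# Hypothetical sketch: make "which model is this really?" answerable by
# pairing the human-facing label with digests of the data and config.
import hashlib
import json
from dataclasses import dataclass


def fingerprint_dataset(paths: list[str]) -> str:
    """Hash the training-data files; any change in the data changes the digest."""
    h = hashlib.sha256()
    for path in sorted(paths):  # stable ordering so the digest is reproducible
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()


@dataclass(frozen=True)
class ModelVersion:
    label: str          # the human-facing "v123"
    data_digest: str    # fingerprint of the training data
    config_digest: str  # fingerprint of hyperparameters / training config

    @property
    def full_id(self) -> str:
        # What callers should log and compare, not just the label.
        return f"{self.label}+{self.data_digest[:12]}+{self.config_digest[:12]}"


def make_version(label: str, data_paths: list[str], config: dict) -> ModelVersion:
    config_digest = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()
    return ModelVersion(label, fingerprint_dataset(data_paths), config_digest)
```

Logging full_id rather than the bare label at serving time gives users and operators something they can actually compare across deployments.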
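And on the monitoring point: one cheap, partial answer to "is the model working as well as it used to" is to watch the shape of its outputs drift, ahead of the closed-loop business metrics moving. Below is a minimal sketch using the Population Stability Index on prediction scores; the synthetic data, the 0.2 threshold, and the weekly framing are illustrative assumptions, not a recommendation.

```python
# Hypothetical sketch: compare this week's score distribution against a
# reference week with the Population Stability Index (PSI).
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two score distributions."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf       # catch out-of-range scores
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    eps = 1e-6                                  # avoid log(0) on empty bins
    ref_frac, cur_frac = ref_frac + eps, cur_frac + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    last_week = rng.beta(2, 5, 10_000)          # reference scores (made up)
    this_week = rng.beta(2.6, 5, 10_000)        # slightly shifted scores (made up)
    score = psi(last_week, this_week)
    # A common rule of thumb treats PSI above ~0.2 as "investigate".
    print(f"PSI={score:.3f}", "ALERT" if score > 0.2 else "ok")
```

This doesn't tell you the model is right - only closed-loop external feedback does that - but it does tell you, quickly and cheaply, that something about its behaviour has changed.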