Prateek Joshi’s Post

Investor at Moxxie Ventures | Author of 13 AI books | Infinite Curiosity Pod and Newsletter | ex Nvidia | AI Builder

Can we trust machines to judge machines when it comes to LLMs? There's an ongoing battle between manual and automated evaluation of LLMs.

Manual evaluation is the predominant approach at many companies, especially large ones with sufficient budgets. Internal QA teams or external consultants manually create test cases and evaluate the outputs in spreadsheets. The process is time-consuming, but it can surface:
- the model's overall performance
- specific failure modes such as hallucinations
- tone discrepancies
- biases
- copyright violations
- privacy issues

Smaller companies without such budgets often end up spending expensive engineering time manually reviewing outputs, which is inefficient. The primary advantage of manual evaluation is the accuracy and reliability of human judgment. But it doesn't scale, and it introduces human biases that hurt the consistency of the evaluations. Manual processes also can't keep up with rapid development cycles or the volume of outputs modern LLMs generate.

Automated evaluation aims to reduce the reliance on human labor by using metrics and models to assess LLM performance. Techniques include:

- Metrics-Based Evaluation: uses established metrics such as BLEU score, ROUGE score, and perplexity to assess aspects like fluency, accuracy, and relevance. These metrics often correlate poorly with human judgment and can be inadequate for comprehensive evaluation (see the first sketch below).
- Heuristic-Based Functions: purpose-built functions that test particular capabilities or behaviors of the LLM, e.g. toxicity detection or PII (Personally Identifiable Information) detection. These can be tailored to specific use cases but may lack generalizability (see the second sketch below).
- Model-Based Evaluation: uses another, trusted language model to judge the outputs of the target LLM. This automates the evaluation process but requires careful calibration to ensure reliability (see the third sketch below).

Automating this process would be a huge unlock. But developers who attempt to build their own automated evaluation systems report very low confidence in the results. There's a lot of opportunity on this front.
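To make the metrics-based approach concrete, here's a minimal sketch scoring a model output against a reference with BLEU (via NLTK) and computing perplexity from per-token log-probabilities. The reference text, candidate output, and log-probabilities are made-up illustrative values, not real model data.

```python
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference answer and model output (illustrative only).
reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

# sentence_bleu takes a list of tokenized references and one tokenized
# candidate; smoothing avoids zero scores when higher-order n-grams miss.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")

# Perplexity from per-token log-probabilities (natural log), assuming your
# inference stack exposes them; these numbers are made up.
token_logprobs = [-0.21, -1.35, -0.08, -2.40, -0.60]
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(f"Perplexity: {perplexity:.2f}")
```

Note what this does and doesn't tell you: a high BLEU score only means n-gram overlap with one reference phrasing, which is exactly why these metrics can disagree with human judgment on open-ended generation.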
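For the heuristic-based approach, here's a minimal sketch of a PII check built on regular expressions. The patterns below (email, US phone, SSN) are deliberately simplified illustrations; production PII detection needs far more robust patterns or a dedicated library.

```python
import re

# Simplified, illustrative patterns; real PII detection needs much more.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return each PII category found in the text with its matches."""
    hits = {name: p.findall(text) for name, p in PII_PATTERNS.items()}
    return {name: matches for name, matches in hits.items() if matches}

# Hypothetical model output to screen before it reaches a user.
output = "Contact John at john.doe@example.com or 555-867-5309."
print(detect_pii(output))
# -> {'email': ['john.doe@example.com'], 'us_phone': ['555-867-5309']}
```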
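And for model-based evaluation, a sketch of the LLM-as-judge pattern. `call_judge_model` is a hypothetical stand-in for whatever trusted-model API you use, and the rubric, score scale, and parsing are assumptions you'd want to calibrate against human labels.

```python
import re

def call_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a trusted judge LLM.
    Replace with your actual inference API."""
    raise NotImplementedError("wire up your judge model here")

JUDGE_TEMPLATE = """You are grading another model's answer.

Question: {question}
Answer: {answer}

Rate the answer's factual accuracy and relevance on a scale of 1 to 5.
Respond with only the number."""

def judge(question: str, answer: str) -> int:
    """Ask the judge model for a 1-5 score and parse it defensively."""
    reply = call_judge_model(
        JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())
```

The calibration point from the post shows up right here: judge scores drift with prompt wording, the score scale, and the judge model itself, which is why teams typically validate against a small human-labeled set before trusting the numbers.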

LaSalle Browne

Quantum Thinker, Precision based personalization - Data + Systems + People, Biohacker, Traveler, Learning enthusiast, Reader, Sports & Fitness Lover

8mo

Not sure it's an either-or question. I suspect the best systems will be an outgrowth of human + machine = best outcome. Very similar to what Kasparov discovered in the world of chess, where the best outcomes were delivered by human + machine teams versus optimized machine or human players independently.
