
Commit

Update with lighteval
mlabonne committed Mar 22, 2024
1 parent 8b54211 commit e6ed76e
Showing 3 changed files with 27 additions and 18 deletions.
41 changes: 27 additions & 14 deletions README.md
@@ -17,7 +17,7 @@

## 🔍 Overview

LLM AutoEval **simplifies the process of evaluating LLMs** using a convenient [Colab notebook](https://colab.research.google.com/drive/164LsJ5mfCaaBrWhP6eHWJyy9jRmaJZ7l?usp=sharing). You just need to specify the name of your model, a GPU, and press run!
LLM AutoEval **simplifies the process of evaluating LLMs** using a convenient [Colab notebook](https://colab.research.google.com/drive/164LsJ5mfCaaBrWhP6eHWJyy9jRmaJZ7l?usp=sharing). You just need to specify the name of your model, a benchmark, a GPU, and press run!

### Key Features

@@ -31,26 +31,33 @@

## ⚡ Quick Start

### Evaluation parameters
### Evaluation

* **Benchmark suite**:
* **`MODEL_ID`**: Enter the model id from Hugging Face.
* **`BENCHMARK`**:
    * `nous`: List of tasks: AGIEval, GPT4All, TruthfulQA, and BigBench (popularized by [Teknium](https://github.com/teknium1) and [NousResearch](https://github.com/NousResearch)). This is the recommended suite.
    * `lighteval`: This is a [new library](https://github.com/huggingface/lighteval) from Hugging Face. It allows you to specify your tasks as shown in its readme. Check the list of [recommended tasks](https://github.com/huggingface/lighteval/blob/main/tasks_examples/recommended_set.txt) to see what you can use (e.g., HELM, PIQA, GSM8K, MATH, etc.).
    * `openllm`: List of tasks: ARC, HellaSwag, MMLU, Winogrande, GSM8K, and TruthfulQA (like the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). It uses the [vllm](https://docs.vllm.ai/) implementation to speed up evaluation (note that the results will not be identical to those obtained without vllm). MMLU is currently excluded because of a problem with vllm.
* **Model**: Enter the model id from Hugging Face.
* **GPU**: Select the GPU you want for evaluation (see prices [here](https://www.runpod.io/console/gpu-cloud)). I recommend using beefy GPUs (RTX 3090 or higher), especially for the Open LLM benchmark suite.
* **Number of GPUs**: Self-explanatory (more cost-efficient than bigger GPUs if you need more VRAM).
* **Container disk**: Size of the disk in GB.
* **Cloud type**: RunPod offers a community cloud (cheaper) and a secure cloud.
* **Repo**: If you made a fork of this repo, you can specify its URL here (the image only runs `runpod.sh`).
* **Trust remote code**: Models like Phi require this flag to run them.
* **Debug**: The pod will not be destroyed at the end of the run (not recommended).
* **`LIGHTEVAL_TASK`**: You can select one or several tasks as specified in the [readme](https://github.com/huggingface/lighteval?tab=readme-ov-file#usage) or in the list of [recommended tasks](https://github.com/huggingface/lighteval/blob/main/tasks_examples/recommended_set.txt); see the sketch after this list for an example.
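
For illustration, a sketch of how these fields might be filled in is shown below. The model id, task names, and few-shot counts are placeholder values, and the pipe-separated task format is an assumption based on the lighteval readme, so double-check both before running.

```python
# Illustrative values only; adapt them to your own model and tasks.
MODEL_ID = "mlabonne/NeuralBeagle14-7B"  # any Hugging Face model id

# Option 1: the Nous suite (recommended)
BENCHMARK = "nous"

# Option 2: lighteval with a comma-separated task string
# (assumed "suite|task|few_shots|truncate" format, check the lighteval readme)
# BENCHMARK = "lighteval"
# LIGHTEVAL_TASK = "lighteval|gsm8k|0|0,lighteval|hellaswag|0|0"
```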

### Cloud GPU

* **`GPU`**: Select the GPU you want for evaluation (see prices [here](https://www.runpod.io/console/gpu-cloud)). I recommend using beefy GPUs (RTX 3090 or higher), especially for the Open LLM benchmark suite. A sketch after this list shows how these settings map onto a pod.
* **`Number of GPUs`**: Self-explanatory (more cost-efficient than bigger GPUs if you need more VRAM).
* **`CONTAINER_DISK`**: Size of the disk in GB.
* **`CLOUD_TYPE`**: RunPod offers a community cloud (cheaper) and a secure cloud (more reliable).
* **`REPO`**: If you made a fork of this repo, you can specify its URL here (the image only runs `runpod.sh`).
* **`TRUST_REMOTE_CODE`**: Models like Phi require this flag to run.
* **`PRIVATE_GIST`**: (W.I.P.) Make the Gist with the results private (true) or public (false).
* **`DEBUG`**: The pod will not be destroyed at the end of the run (not recommended).
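
Under the hood, the notebook starts a RunPod pod with these settings. The sketch below is a rough outline of how they might map onto the `runpod` Python SDK; the image name, environment variables, and exact argument names are assumptions, not an excerpt from the notebook.

```python
import runpod  # official RunPod Python SDK (pip install runpod)

runpod.api_key = "<your RunPod API key>"

# Assumed mapping of the settings above onto runpod.create_pod();
# check the SDK documentation for the exact signature and accepted values.
pod = runpod.create_pod(
    name="llm-autoeval",
    image_name="<docker image that runs runpod.sh>",  # placeholder
    gpu_type_id="NVIDIA GeForce RTX 3090",            # GPU
    gpu_count=1,                                      # Number of GPUs
    container_disk_in_gb=100,                         # CONTAINER_DISK
    cloud_type="COMMUNITY",                           # CLOUD_TYPE ("COMMUNITY" or "SECURE")
    env={
        "BENCHMARK": "nous",
        "MODEL_ID": "<Hugging Face model id>",
        "DEBUG": "False",
    },
)
```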

### Tokens

Tokens are stored in Colab's Secrets tab. Create two secrets called "runpod" and "github" and add the corresponding tokens, which you can find as follows (a sketch after this list shows how they are read):

* **Runpod**: Please consider using my [referral link](https://runpod.io?ref=9nvk2srl) if you don't have an account yet. You can create your token [here](https://www.runpod.io/console/user/settings) under "API keys" (read & write permission). You'll also need to transfer some money there to start a pod.
* **GitHub**: You can create your token [here](https://github.com/settings/tokens) (read & write, can be restricted to "gist" only).
* **`RUNPOD_TOKEN`**: Please consider using my [referral link](https://runpod.io?ref=9nvk2srl) if you don't have an account yet. You can create your token [here](https://www.runpod.io/console/user/settings) under "API keys" (read & write permission). You'll also need to add some funds to your account to start a pod.
* **`GITHUB_TOKEN`**: You can create your token [here](https://github.com/settings/tokens) (read & write, can be restricted to "gist" only).
* **`HF_TOKEN`**: Optional. You can find your Hugging Face token [here](https://huggingface.co/settings/tokens) if you have an account.
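
Inside Colab, these secrets can be read with the `google.colab.userdata` API. A minimal sketch is shown below; the first two secret names match the ones created above, while the Hugging Face secret name is only an assumption since that token is optional.

```python
# Minimal sketch: read the secrets created in Colab's Secrets tab.
from google.colab import userdata

runpod_api_key = userdata.get("runpod")  # RUNPOD_TOKEN
github_token = userdata.get("github")    # GITHUB_TOKEN

# Optional Hugging Face token; the secret name "hf" is an assumption.
# hf_token = userdata.get("hf")
```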

## 📊 Benchmark suites

@@ -61,6 +68,10 @@ You can compare your results with:
* Models like [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B#benchmark-results), [Nous-Hermes-2-SOLAR-10.7B](https://huggingface.co/NousResearch/Nous-Hermes-2-SOLAR-10.7B), or [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B).
* Teknium stores his evaluations in his [LLM-Benchmark-Logs](https://github.com/teknium1/LLM-Benchmark-Logs).

### Lighteval

You can compare your results on a case-by-case basis, depending on the tasks you have selected.

### Open LLM

You can compare your results with those listed on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
@@ -81,5 +92,7 @@ Let me know if you're interested in creating your own leaderboard with your gist

## Acknowledgements

Special thanks to EleutherAI for the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [dmahan93](https://github.com/dmahan93) for his fork that adds agieval to the lm-evaluation-harness, [NousResearch](https://github.com/NousResearch) and [Teknium](https://github.com/teknium1) for the Nous benchmark suite, and


Special thanks to [burtenshaw](https://github.com/burtenshaw) for integrating lighteval, EleutherAI for the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [dmahan93](https://github.com/dmahan93) for his fork that adds agieval to the lm-evaluation-harness, Hugging Face for the [lighteval](https://github.com/huggingface/lighteval) library, [NousResearch](https://github.com/NousResearch) and [Teknium](https://github.com/teknium1) for the Nous benchmark suite, and
[vllm](https://docs.vllm.ai/) for the additional inference speed.
Binary file modified img/llmautoeval.png
4 changes: 0 additions & 4 deletions main.py
@@ -80,11 +80,7 @@ def _make_lighteval_summary(directory: str, elapsed_time: float) -> str:
result_dict = _get_result_dict(directory)
final_table = make_results_table(result_dict)
summary = f"## {MODEL_ID.split('/')[-1]} - {BENCHMARK.capitalize()}\n\n"
summary += (
f"Elapsed time: {time.strftime('%H:%M:%S', time.gmtime(elapsed_time))}\n\n"
)
summary += final_table
# TODO: Implement summarisation of results
return summary


