
Commit

Update with lighteval
mlabonne committed Mar 22, 2024
1 parent 8b54211 commit e6ed76e
Showing 3 changed files with 27 additions and 18 deletions.
41 changes: 27 additions & 14 deletions README.md
@@ -17,7 +17,7 @@

## 🔍 Overview

LLM AutoEval **simplifies the process of evaluating LLMs** using a convenient [Colab notebook](https://colab.research.google.com/drive/164LsJ5mfCaaBrWhP6eHWJyy9jRmaJZ7l?usp=sharing). You just need to specify the name of your model, a GPU, and press run!
LLM AutoEval **simplifies the process of evaluating LLMs** using a convenient [Colab notebook](https://colab.research.google.com/drive/164LsJ5mfCaaBrWhP6eHWJyy9jRmaJZ7l?usp=sharing). You just need to specify the name of your model, a benchmark, a GPU, and press run!

### Key Features

@@ -31,26 +31,33 @@

## ⚡ Quick Start

### Evaluation parameters
### Evaluation

* **Benchmark suite**:
* **`MODEL_ID`**: Enter the model id from Hugging Face.
* **`BENCHMARK`**:
    * `nous`: List of tasks: AGIEval, GPT4All, TruthfulQA, and BigBench (popularized by [Teknium](https://github.com/teknium1) and [NousResearch](https://github.com/NousResearch)). This is the recommended suite.
    * `lighteval`: This is a [new library](https://github.com/huggingface/lighteval) from Hugging Face. It allows you to specify your tasks as shown in its readme. Check the list of [recommended tasks](https://github.com/huggingface/lighteval/blob/main/tasks_examples/recommended_set.txt) to see what you can use (e.g., HELM, PIQA, GSM8K, MATH, etc.).
    * `openllm`: List of tasks: ARC, HellaSwag, MMLU, Winogrande, GSM8K, and TruthfulQA (like the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). It uses the [vllm](https://docs.vllm.ai/) implementation to speed up evaluation (note that the results will not be identical to those obtained without vllm). MMLU is currently excluded because of a problem with vllm.
* **Model**: Enter the model id from Hugging Face.
* **GPU**: Select the GPU you want for evaluation (see prices [here](https://www.runpod.io/console/gpu-cloud)). I recommend using beefy GPUs (RTX 3090 or higher), especially for the Open LLM benchmark suite.
* **Number of GPUs**: Self-explanatory (more cost-efficient than bigger GPUs if you need more VRAM).
* **Container disk**: Size of the disk in GB.
* **Cloud type**: RunPod offers a community cloud (cheaper) and a secure cloud.
* **Repo**: If you made a fork of this repo, you can specify its URL here (the image only runs `runpod.sh`).
* **Trust remote code**: Models like Phi require this flag to run them.
* **Debug**: The pod will not be destroyed at the end of the run (not recommended).
* **`LIGHTEVAL_TASK`**: You can select one or several tasks as specified in the [readme](https://github.com/huggingface/lighteval?tab=readme-ov-file#usage) or in the list of [recommended tasks](https://github.com/huggingface/lighteval/blob/main/tasks_examples/recommended_set.txt); see the sketch after this list for an example.
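
For illustration, a sketch of how these fields might be filled in is shown below. The model id, task names, and few-shot counts are placeholder values, and the pipe-separated task format is an assumption based on the lighteval readme, so double-check both before running.

```python
# Illustrative values only; adapt them to your own model and tasks.
MODEL_ID = "mlabonne/NeuralBeagle14-7B"  # any Hugging Face model id

# Option 1: the Nous suite (recommended)
BENCHMARK = "nous"

# Option 2: lighteval with a comma-separated task string
# (assumed "suite|task|few_shots|truncate" format, check the lighteval readme)
# BENCHMARK = "lighteval"
# LIGHTEVAL_TASK = "lighteval|gsm8k|0|0,lighteval|hellaswag|0|0"
```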

### Cloud GPU

* **`GPU`**: Select the GPU you want for evaluation (see prices [here](https://www.runpod.io/console/gpu-cloud)). I recommend using beefy GPUs (RTX 3090 or higher), especially for the Open LLM benchmark suite. A sketch after this list shows how these settings map onto a pod.
* **`Number of GPUs`**: Self-explanatory (more cost-efficient than bigger GPUs if you need more VRAM).
* **`CONTAINER_DISK`**: Size of the disk in GB.
* **`CLOUD_TYPE`**: RunPod offers a community cloud (cheaper) and a secure cloud (more reliable).
* **`REPO`**: If you made a fork of this repo, you can specify its URL here (the image only runs `runpod.sh`).
* **`TRUST_REMOTE_CODE`**: Models like Phi require this flag to run.
* **`PRIVATE_GIST`**: (W.I.P.) Make the Gist with the results private (true) or public (false).
* **`DEBUG`**: The pod will not be destroyed at the end of the run (not recommended).
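
Under the hood, the notebook starts a RunPod pod with these settings. The sketch below is a rough outline of how they might map onto the `runpod` Python SDK; the image name, environment variables, and exact argument names are assumptions, not an excerpt from the notebook.

```python
import runpod  # official RunPod Python SDK (pip install runpod)

runpod.api_key = "<your RunPod API key>"

# Assumed mapping of the settings above onto runpod.create_pod();
# check the SDK documentation for the exact signature and accepted values.
pod = runpod.create_pod(
    name="llm-autoeval",
    image_name="<docker image that runs runpod.sh>",  # placeholder
    gpu_type_id="NVIDIA GeForce RTX 3090",            # GPU
    gpu_count=1,                                      # Number of GPUs
    container_disk_in_gb=100,                         # CONTAINER_DISK
    cloud_type="COMMUNITY",                           # CLOUD_TYPE ("COMMUNITY" or "SECURE")
    env={
        "BENCHMARK": "nous",
        "MODEL_ID": "<Hugging Face model id>",
        "DEBUG": "False",
    },
)
```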

### Tokens

Tokens are stored in Colab's Secrets tab. Create two secrets called "runpod" and "github" and add the corresponding tokens, which you can find as follows (a sketch after this list shows how they are read):

* **Runpod**: Please consider using my [referral link](https://runpod.io?ref=9nvk2srl) if you don't have an account yet. You can create your token [here](https://www.runpod.io/console/user/settings) under "API keys" (read & write permission). You'll also need to transfer some money there to start a pod.
* **GitHub**: You can create your token [here](https://github.com/settings/tokens) (read & write, can be restricted to "gist" only).
* **`RUNPOD_TOKEN`**: Please consider using my [referral link](https://runpod.io?ref=9nvk2srl) if you don't have an account yet. You can create your token [here](https://www.runpod.io/console/user/settings) under "API keys" (read & write permission). You'll also need to add some funds to your account to start a pod.
* **`GITHUB_TOKEN`**: You can create your token [here](https://github.com/settings/tokens) (read & write, can be restricted to "gist" only).
* **`HF_TOKEN`**: Optional. You can find your Hugging Face token [here](https://huggingface.co/settings/tokens) if you have an account.
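
Inside Colab, these secrets can be read with the `google.colab.userdata` API. A minimal sketch is shown below; the first two secret names match the ones created above, while the Hugging Face secret name is only an assumption since that token is optional.

```python
# Minimal sketch: read the secrets created in Colab's Secrets tab.
from google.colab import userdata

runpod_api_key = userdata.get("runpod")  # RUNPOD_TOKEN
github_token = userdata.get("github")    # GITHUB_TOKEN

# Optional Hugging Face token; the secret name "hf" is an assumption.
# hf_token = userdata.get("hf")
```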

## 📊 Benchmark suites

@@ -61,6 +68,10 @@ You can compare your results with:
* Models like [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B#benchmark-results), [Nous-Hermes-2-SOLAR-10.7B](https://huggingface.co/NousResearch/Nous-Hermes-2-SOLAR-10.7B), or [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B).
* Teknium stores his evaluations in his [LLM-Benchmark-Logs](https://github.com/teknium1/LLM-Benchmark-Logs).

### Lighteval

You can compare your results on a case-by-case basis, depending on the tasks you have selected.

### Open LLM

You can compare your results with those listed on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
@@ -81,5 +92,7 @@ Let me know if you're interested in creating your own leaderboard with your gist

## Acknowledgements

Special thanks to EleutherAI for the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [dmahan93](https://github.com/dmahan93) for his fork that adds agieval to the lm-evaluation-harness, [NousResearch](https://github.com/NousResearch) and [Teknium](https://github.com/teknium1) for the Nous benchmark suite, and


Special thanks to [burtenshaw](https://github.com/burtenshaw) for integrating lighteval, EleutherAI for the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [dmahan93](https://github.com/dmahan93) for his fork that adds agieval to the lm-evaluation-harness, Hugging Face for the [lighteval](https://github.com/huggingface/lighteval) library, [NousResearch](https://github.com/NousResearch) and [Teknium](https://github.com/teknium1) for the Nous benchmark suite, and
[vllm](https://docs.vllm.ai/) for the additional inference speed.
Binary file modified img/llmautoeval.png
4 changes: 0 additions & 4 deletions main.py
@@ -80,11 +80,7 @@ def _make_lighteval_summary(directory: str, elapsed_time: float) -> str:
result_dict = _get_result_dict(directory)
final_table = make_results_table(result_dict)
summary = f"## {MODEL_ID.split('/')[-1]} - {BENCHMARK.capitalize()}\n\n"
summary += (
f"Elapsed time: {time.strftime('%H:%M:%S', time.gmtime(elapsed_time))}\n\n"
)
summary += final_table
# TODO: Implement summarisation of results
return summary


