Huge release by Daekeun Kim and Hyo (HK) Choi from our AI GBB team in Korea, sharing Korean-language benchmarks for our latest Azure OpenAI model, gpt-4o-mini. Like gpt-4o, the model excels on Korean-language tasks; in this case mini completely eclipses gpt-35-turbo and nearly achieves gpt-4-turbo performance. The benchmarks and code are open source and extensible beyond our Azure models, so we encourage our customers and the larger community to try them out and even fork/share your own findings. And please let me know in the comments what other Asian languages beyond Korean you may be looking to evaluate with your language models.
As different LLM/SLM models continue to emerge, many customers want to quickly see how an LLM/SLM performs on basic evaluation datasets. We have implemented and released benchmarking code that measures how accurately an LLM solves multiple-choice questions on the CLIcK (Cultural and Linguistic Intelligence in Korean) and HAE_RAE_BENCH 1.0 datasets. The code uses the implementation at https://lnkd.in/gnVsNtq9 as its skeleton, with many modifications and additions. Together with Hyo (HK) Choi, I performed benchmarking on 4 models: GPT-4o-mini (2024-07-18), GPT-4o (2024-05-13), GPT-4-turbo (2024-04-09), and GPT-3.5-turbo (2023-06-13). The results show that GPT-4o-mini's performance is very impressive: it outperforms GPT-3.5-turbo on all metrics and comes close to GPT-4-turbo on some. You can also benchmark custom models, including Hugging Face models, so feel free to compare other models using these results as a baseline. Code: https://lnkd.in/geCXW2Nz #azureopenai #gpt4omini #gpt4o
Welsh 🏴
Thank you to your team for the great contribution to the community. I'd like to note that our internal benchmarking for a specific application shows significant inconsistencies between gpt-4o and gpt-4o-mini. Using the mini model requires caution for applications that need strong reasoning ability.
Thai!