Steve Shirkey’s Post

Steve Shirkey

Director, Azure AI (ANZ, ASEAN, Korea) at Microsoft

Huge release by Daekeun Kim and Hyo (HK) Choi from our AI GBB team in Korea, sharing Korean-language benchmarks for our latest Azure OpenAI model, gpt-4o-mini. Like gpt-4o, the model excels on Korean-language tasks; in this case, mini completely eclipses gpt-35-turbo and nearly matches gpt-4-turbo. The benchmarks and code are open source and extensible beyond our Azure models, so we encourage our customers and the larger community to try it out and even fork/share your own findings. Please let me know in the comments what other Asian languages beyond Korean you may be looking to evaluate with your language models.

Daekeun Kim

AI Global Black Belt @ Microsoft | ex-AWS

As different LLM/SLM models continue to emerge, many customers want to quickly see how an LLM/SLM performs on basic evaluation datasets. We have implemented and released benchmarking code that measures how accurately an LLM solves the multiple-choice questions in the CLIcK (Cultural and Linguistic Intelligence in Korean) and HAE_RAE_BENCH 1.0 datasets. The code is based on the implementation at https://lnkd.in/gnVsNtq9, with many modifications and additions to make it suitable for benchmarking.

Together with Hyo (HK) Choi, I benchmarked four models: GPT-4o-mini (2024-07-18), GPT-4o (2024-05-13), GPT-4-turbo (2024-04-09), and GPT-3.5-turbo (2023-06-13). The results show that GPT-4o-mini's performance is very impressive: it outperforms GPT-3.5-turbo on all metrics and approaches GPT-4-turbo on some. You can also benchmark custom models, including Hugging Face models, so feel free to compare other models using these metrics as a baseline.

Code: https://lnkd.in/geCXW2Nz

#azureopenai #gpt4omini #gpt4o
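The core of a multiple-choice benchmark like this is simple: format each item as a lettered question, ask the model for a letter, and score accuracy against the gold answer. A minimal sketch is below; the `Item` fields and the `ask_model` stub are illustrative assumptions, not the actual schema or API of the linked repository.

```python
# Minimal sketch of multiple-choice LLM benchmarking.
# Field names and the ask_model callable are hypothetical, not the
# repository's real interface; swap in an Azure OpenAI or Hugging Face call.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]  # answer options, in order
    answer: str         # gold answer letter, e.g. "A"

def format_prompt(item: Item) -> str:
    """Render one item as a lettered multiple-choice prompt."""
    letters = "ABCDE"
    lines = [item.question]
    lines += [f"({letters[i]}) {c}" for i, c in enumerate(item.choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def evaluate(items: list[Item], ask_model) -> float:
    """Return accuracy: fraction of items where the model's predicted
    letter matches the gold answer. ask_model(prompt) -> str."""
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item))
        # Take the first alphabetic character as the predicted letter.
        pred = next((ch for ch in reply.upper() if ch.isalpha()), "")
        correct += int(pred == item.answer)
    return correct / len(items) if items else 0.0

# Smoke test with a dummy model that always answers "A".
sample = [
    Item("1 + 1 = ?", ["2", "3"], "A"),
    Item("Capital of Korea?", ["Busan", "Seoul"], "B"),
]
print(evaluate(sample, lambda prompt: "A"))  # 0.5
```

In practice the answer-extraction step is the fragile part: models often reply with "(B)" or a full sentence, so a benchmark harness typically parses the first letter or constrains the output format in the prompt, as sketched here.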

GitHub - daekeun-ml/evaluate-llm-on-korean-dataset: Performs benchmarking on two Korean datasets with minimal time and effort.


Dipen Mehta

Business focused technology executive passionate about technology driving customer value

3mo

Thai!

Myles Hosford

Zero to X Security & Technology Leader

3mo

Welsh 🏴󠁧󠁢󠁷󠁬󠁳󠁿

Kevin (SangWoo) Kim

Head of Data Biz Div. at SOCAR

3mo

Thank you to your team for the great contribution to the community. I'd like to note that our internal benchmarking for a specific application shows significant inconsistencies between gpt-4o and gpt-4o-mini. The mini model should be used with caution for applications that need strong reasoning ability.

