Massive Multitask Language Understanding (MMLU)
17
General Language Understanding Evaluation
3
Massive Multitask Language Understanding Pro
2
Semantic Textual Similarity Benchmark
2
Stanford Question Answering Dataset v2.0
2
1-shot Machine Translation
1
ARC-Challenge (Knowledge Q&A)
1
BIG-Bench-Hard (Mixed Evaluations)
1
Chatbot Arena - Hard Prompts
1
Chatbot Arena - Math and Coding
1
Chinese Language Understanding Evaluation
1
Chinese Question Answering
1
Chinese Tool-Use Benchmark - False Positive Error
1
Chinese Tool-Use Benchmark - Tool Input
1
Chinese Tool-Use Benchmark - Tool Selection
1
Code Development Performance
1
Comprehensive Retrieval Augmented Generation (CRAG)
1
Corpus of Linguistic Acceptability (CoLA)
1
Data-To-Text: Czech Restaurant (cs)
1
Data-To-Text: WebNLG (en)
1
Data-To-Text: WebNLG (ru)
1
DROP (Reasoning over Text)
1
English Generation: XSum (en)
1
English-to-French Translation
1
English-to-German Translation
1
English-to-Romanian Translation
1
GPQA (Graduate-Level Reasoning)
1
GRE Quantitative Reasoning
1
GSM8K (Grade School Math)
1
GSM8K (Grade School Math Problems)
1
HellaSwag (Common Knowledge)
1
HuggingFace Agent Benchmark - Run Mode - Code
1
HuggingFace Agent Benchmark - Run Mode - Tool Selection
1
HuggingFace Agent Benchmark - Run Mode - Tool Used
1
HumanEval and MBPP (3-shot)
1
HumanEval (Code Generation)
1
Japanese MT-Bench - Humanities and Social Sciences Tasks
1
Machine Translation Benchmark
1
MATH (Math Problem-Solving)
1
Medicine/Health Text Identification (High Filtering)
1
Medicine/Health Text Identification (Low Filtering)
1
Microsoft Research Paraphrase Corpus
1
Multi-Genre Natural Language Inference
1
Multimodal Machine Translation Benchmark
1
Natural Language Inference (MultiNLI)
1
Natural Language Inference (QNLI)
1
Question Natural Language Inference
1
Quora Question Pairs (QQP)
1
Recognizing Textual Entailment
1
ScienceWorld Goal Achievement
1
Sports Injury Text Identification (High Filtering)
1
Sports Injury Text Identification (Low Filtering)
1
Stanford Question Answering Dataset
1
Stanford Question Answering Dataset Exact Match
1
Stanford Question Answering Dataset F1
1
Stanford Question Answering Dataset v1.1
1
Stanford Sentiment Treebank
1
Torrance Tests of Creative Thinking
1
TriviaQA (unfiltered, test)
1
Visual Question Answering v2
1
Winograd Schema Challenge
1
WMT-14 English-to-French Translation
1
WMT-14 French-to-English Translation
1
WMT19 English-to-French Translation
1
WMT19 English-to-German Translation
1
Comparison with Google Gemini 1.5 Flash-8B
0
Comparison with Meta Llama 3.1 8B
0
Comparison with OpenAI GPT-4o
0
Comparison with OpenAI GPT-4o mini
0
Multitask Prompted Finetuning
0