This is the Sapienza NLP GitHub repository for ITA-Bench (Italian Benchmarks), a benchmark suite for the evaluation of Large Language Models (LLMs) on the Italian language. ITA-Bench is designed to evaluate the performance of LLMs on a variety of tasks, including question answering, commonsense reasoning, mathematical capabilities, named entity recognition, reading comprehension, and others.
ITA-Bench includes a variety of datasets for evaluating LLMs on Italian. These datasets are collected from various sources and cover a wide range of tasks.
Note
All the datasets are available on 🤗 Hugging Face Datasets!
The datasets are divided into two main categories:
- 🌐 Translations: These datasets are translations of existing English datasets into Italian. They are used to evaluate the performance of LLMs on tasks that have been previously studied in the English language, allowing for a direct comparison between models trained on different languages.
  - Pros: Translations allow for a direct comparison between models trained on different languages
  - Cons: Translations may introduce biases or errors that are not present in the original dataset
- 🔨 Adaptations: These datasets are converted from existing Italian datasets into a format that can be used to evaluate LLMs. They are used to evaluate the performance of LLMs on tasks that may be more specific to the Italian language.
  - Pros: The original datasets are already in Italian, so there is no need for translation that may introduce errors
  - Cons: These datasets were not originally designed for evaluating LLMs and the adaptation process may introduce biases or errors
ITA-Bench currently includes the following datasets:
Dataset | Task | Type | Description |
---|---|---|---|
ARC-Challenge | QA | 🌐 Translation | Commonsense and scientific knowledge |
ARC-Easy | QA | 🌐 Translation | Commonsense and scientific knowledge |
BoolQ | QA + passage | 🌐 Translation | Boolean questions |
GSM8K | QA | 🌐 Translation | Simple math word problems |
Hellaswag | Completion | 🌐 Translation | Commonsense reasoning |
MMLU | QA | 🌐 Translation | Advanced questions on 57 subjects |
PIQA | QA | 🌐 Translation | Physical interactions reasoning |
SciQ | QA + passage | 🌐 Translation | Scientific reading comprehension |
TruthfulQA | QA | 🌐 Translation | Questions on Web misconceptions |
WinoGrande | Completion | 🌐 Translation | Commonsense reasoning |
AMI | QA | 🔨 Adaptation | Misogyny detection |
Discotex | Completion | 🔨 Adaptation | Commonsense and world knowledge |
Ghigliottinai | QA | 🔨 Adaptation | Guess the missing concept |
NERMUD | NER | 🔨 Adaptation | Named entity recognition |
PreLearn | QA | 🔨 Adaptation | Reasoning about concept relationships |
PreTens | QA | 🔨 Adaptation | Reasoning about concept relationships |
QuandHO | QA | 🔨 Adaptation | Reading comprehension |
WiC | QA | 🔨 Adaptation | Word sense disambiguation |
ITA-Bench is designed to be easy to use and flexible. You can evaluate any LLM on the included datasets using the lm_eval command-line tool. The tool supports a variety of options to customize the evaluation process, including the model to evaluate, the number of few-shot examples, and the tasks to run.
We always recommend using a virtual environment to manage your dependencies, e.g., using venv or conda. To create a new environment with conda, you can run:
# Create a new environment with Conda
conda create -n ita-bench python=3.10
# Always remember to activate the environment before running any command!
conda activate ita-bench
Note
You can read more about managing environments with Conda in the official documentation.
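As a quick sanity check before installing anything, the sketch below reports which environment the current Python interpreter is running in. It is not part of ITA-Bench: the function name `active_environment` is hypothetical, and the check assumes that Conda exports `CONDA_DEFAULT_ENV` on activation and that a venv makes `sys.prefix` differ from `sys.base_prefix`.

```python
import os
import sys


def active_environment(environ=None, prefix=None, base_prefix=None):
    """Return a short label for the active environment, or None if there is none.

    The parameters exist only to make the function easy to test; by default
    it inspects the running interpreter.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix

    # Conda exports CONDA_DEFAULT_ENV with the name of the activated environment.
    conda_env = environ.get("CONDA_DEFAULT_ENV")
    if conda_env and conda_env != "base":
        return f"conda:{conda_env}"

    # Inside a venv, sys.prefix points at the environment directory and
    # therefore differs from sys.base_prefix.
    if prefix != base_prefix:
        return f"venv:{prefix}"

    return None
```

Calling `active_environment()` with no arguments after `conda activate ita-bench` should return `conda:ita-bench`; a return value of `None` suggests that no dedicated environment is active.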
To use ITA-Bench, you can follow these steps:
- Clone this repository:
git clone git@github.com:SapienzaNLP/ita-bench.git
cd ita-bench
- Install the required packages:
pip install -r requirements.txt
- Run the evaluation script:
lm_eval \
--model hf \
--model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
--num_fewshot 0 \
--log_samples \
--output_path outputs/ \
--tasks itabench_trans_it-it,itabench_adapt_cloze,itabench_adapt_mc \
--include tasks
This command will evaluate meta-llama/Meta-Llama-3.1-8B-Instruct on all the benchmarks in our suite. The results will be saved in the outputs/ directory.
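Once a run has finished, the scores can be collected programmatically. The following is a minimal sketch, not part of ITA-Bench, under two assumptions about the lm_eval output layout: that it writes files matching `results*.json` somewhere below the output directory, and that each file holds a top-level `"results"` mapping from task name to metrics. The helper names `collect_results` and `format_summary` are hypothetical.

```python
import json
from pathlib import Path


def collect_results(output_dir):
    """Return {task_name: {metric_name: value}} from the newest results file."""
    # Assumption: result files are named results*.json and may sit in a
    # model-specific subdirectory, so search recursively.
    files = sorted(Path(output_dir).rglob("results*.json"),
                   key=lambda path: path.stat().st_mtime)
    if not files:
        raise FileNotFoundError(f"no results*.json file found under {output_dir}")

    with files[-1].open(encoding="utf-8") as handle:
        report = json.load(handle)

    summary = {}
    for task, metrics in report.get("results", {}).items():
        # Keep only numeric entries; skip text fields such as task aliases.
        summary[task] = {
            name: value
            for name, value in metrics.items()
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
    return summary


def format_summary(summary):
    """Render the summary as aligned text lines, one per (task, metric) pair."""
    lines = []
    for task in sorted(summary):
        for name in sorted(summary[task]):
            lines.append(f"{task:<40} {name:<25} {summary[task][name]:.4f}")
    return lines
```

For example, `print("\n".join(format_summary(collect_results("outputs"))))` prints one line per task and metric. If your lm_eval version uses a different file layout, adjust the glob pattern and the `"results"` key accordingly.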
If you have multiple GPUs available, you can use the accelerate command to run the evaluation on multiple GPUs:
accelerate launch -m lm_eval \
--model hf \
--model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
--num_fewshot 0 \
--log_samples \
--output_path outputs/ \
--tasks itabench_trans_it-it,itabench_adapt_cloze,itabench_adapt_mc \
--include tasks
Note
You can read more about accelerate in the official documentation.
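The single-GPU and multi-GPU invocations differ only in the launcher prefix, so it can be convenient to assemble them from one place when sweeping over models, few-shot counts, or task groups. The sketch below is an illustration rather than part of ITA-Bench: `build_eval_command` and `as_shell` are hypothetical helpers, they use only the flags shown in the commands above, and they pass `--include tasks` in both modes on the assumption that the task definitions live in the repository's tasks directory.

```python
import shlex


def build_eval_command(model, tasks, num_fewshot=0, output_path="outputs/",
                       dtype="bfloat16", multi_gpu=False):
    """Assemble the lm_eval invocation as an argument list."""
    # Multi-GPU runs only change the launcher prefix.
    if multi_gpu:
        launcher = ["accelerate", "launch", "-m", "lm_eval"]
    else:
        launcher = ["lm_eval"]

    return launcher + [
        "--model", "hf",
        "--model_args", f"pretrained={model},dtype={dtype}",
        "--num_fewshot", str(num_fewshot),
        "--log_samples",
        "--output_path", output_path,
        # lm_eval expects a comma-separated list of task names.
        "--tasks", ",".join(tasks),
        # Assumption: custom task definitions are in ./tasks.
        "--include", "tasks",
    ]


def as_shell(command):
    """Return the argument list as a single shell-quoted string."""
    return shlex.join(command)
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root, or printed with `as_shell` to inspect the exact command before launching it.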
We welcome contributions to ITA-Bench!
The code in this repository is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.
However, the datasets included in ITA-Bench may have different licenses. Please refer to the original datasets for more information about their licenses.
Coming soon: a paper on our benchmark suite is under review. Stay tuned for updates!
- Future AI Research for supporting this work.
- CINECA for providing computational resources.
- Unbabel for building Tower-LLM.
- Thanks to the authors of the original datasets for making them available.
- Thanks to all the Multilingual Natural Language Processing course students of the Master's of Engineering in Computer Science (Dipartimento di Ingegneria Informatica, Automatica e Gestionale, DIAG) of Sapienza University of Rome for their help in adapting some datasets.