Add Russian tasks (RU-MTEB) #815
Conversation
Could you guide me on dataset sizes? There is a clustering task with 100 topics, 300k examples, and 10 splits. There are retrieval tasks with >2000 examples. The InappropriatenessClassification task has 5k examples per class, but they are collected from 18 different topics (~300 examples per topic per class). It seems reasonable to me. Is this size okay or too big? |
Hello @artemsnegirev, 5K samples per topic is too much for the benchmark, we'll go for 2048 samples in total. For subsampling: |
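A minimal sketch of what such subsampling can look like inside a task class, assuming the self.stratified_subsampling() helper referenced in the checklist at the end of this PR (parameter names and values are illustrative, and the exact signature may vary between mteb versions):

```python
# Sketch: cap the evaluation splits at 2048 examples via stratified
# subsampling, inside an mteb AbsTask subclass. Assumes the
# stratified_subsampling() helper mentioned in the PR checklist.
def dataset_transform(self):
    self.dataset = self.stratified_subsampling(
        self.dataset,
        seed=self.seed,
        splits=["test", "validation"],  # leave the train split untouched
        label="label",                  # column to stratify on
        n_samples=2048,                 # benchmark-wide cap discussed above
    )
```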
@imenelydiaker just a couple of questions for my understanding: |
A few comments - otherwise I believe the PR looks very promising.
mteb/tasks/Classification/rus/InappropriatenessClassification.py
We accepted it for training sets also, although imo it may affect performance and introduce biases due to the selected samples. I'd go for subsampling the test and validation sets only.
You can do the Fast version only. |
So we have 14 tasks for now in this PR. It should be 28 points to add, right? |
Hi! I'm a co-author. |
@AlexeyVatolin and @artemsnegirev we typically match the original benchmark so I would suggest that we change it. |
With the exception of the comment, I believe everything looks fine. Please add points as well as a list of tasks to benchmarks.py
@AlexeyVatolin First of all, thank you! The datasets are balanced for use in mteb, you can check it out. The size of the original test split exceeds the mteb limit, which is 2048. In this case it's impossible to use the original split here, so it's downsampled and balanced. Could you confirm this is okay for you? |
@artemsnegirev, sorry, I didn't notice that the dataset is balanced by class size. I thought you had taken a random part of the original dataset. In this case you can use accuracy, I agree. |
@KennethEnevoldsen I need to add a list of tasks in benchmarks.py. |
@artemsnegirev you can probably do it simpler now with the new format:
|
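Presumably along these lines, a sketch of the newer Benchmark format in benchmarks.py (the task list, names, and metadata here are illustrative assumptions, not the final RU-MTEB contents):

```python
import mteb
from mteb.benchmarks import Benchmark

# Hypothetical RU-MTEB entry for benchmarks.py; the task list is
# abbreviated and the name/description are placeholders.
MTEB_RU = Benchmark(
    name="MTEB(rus)",
    tasks=mteb.get_tasks(
        tasks=["TERRa", "InappropriatenessClassification"],  # + remaining RU tasks
        languages=["rus"],
    ),
    description="Russian expansion of MTEB (RU-MTEB).",
    reference=None,
    citation=None,
)
```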
@KennethEnevoldsen yes it looks good. I think everything is ready, could we merge it? |
Wonderful @artemsnegirev, thanks for the contributions. If you want feel free to examine if there is any section upon which you can expand in the paper (see #595 for a link) |
Hi! @artemsnegirev sent the PR for the Russian MTEB from us earlier. Thank you for the suggestion! We would be glad to contribute to your paper, but we did not directly discuss co-authorship conditions. While we initially planned to publish Russian MTEB alongside our new encoder models as a separate paper, we now believe it would make a more significant contribution as part of the multilingual benchmark you're working on; publishing the Russian part of MTEB in a separate paper seems redundant now. What are the conditions for co-authoring the paper you are submitting to EMNLP-2024? How did you plan to include the French/Chinese MTEBs: as citations, or as detailed descriptions with the respective contributors? At this point, we see two possible solutions:
We appreciate the opportunity for international collaboration and would prefer the first variant if that's okay with you. |
Hi @Alenush, we have decided that we sadly can't make it for EMNLP 2024, but we are planning a release of the paper to arxiv this summer and then a submission to a journal. Co-authorship is determined by points (see https://github.com/embeddings-benchmark/mteb/tree/main/docs/mmteb), but includes multiple stages e.g. paper writing, dataset additions etc. You can join in the paper writing (see #784, and #595). I believe an ideal approach would be a co-authorship on MMTEB (we would love your help and the addition of the benchmark is meaningful to MMTEB) and potentially cross-citing if you decide to go forward with the paper as well (whether you decide to do that is up to you guys). There is also additional effort that is currently not being worked on, such as updating the leaderboard to the new format (again see #784). |
Checklist for adding MMTEB dataset
Reason for dataset addition:
This PR adds a set of new tasks for Russian. They come as RU-MTEB tasks, just like PL-MTEB and C-MTEB were added previously.
- I have tested that the dataset runs with the mteb package.
- I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
- I have run tests locally to make sure nothing is broken, using make test.
- I have run the formatter to format the code, using make lint.
- I have added points for my submission to the points folder, using the PR number as the filename (e.g. 438.jsonl).

There are hard tasks (e.g. TERRa, InappropriatenessClassification) for non-instruct embedding models, so performance seems random, but it's not: as you can see, a model with random weights (untrained-multilingual) shows even worse results.
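For reference, a minimal sketch of how the checklist models can be run on one of these tasks from Python, assuming the classic mteb + sentence-transformers API (the task name and output folder are illustrative):

```python
# Sketch: evaluate one of the checklist models on a RU-MTEB task.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")
evaluation = MTEB(tasks=["TERRa"])
evaluation.run(model, output_folder="results/multilingual-e5-small")
```

The equivalent CLI invocation is the mteb -m {model_name} -t {task_name} command mentioned in the checklist.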