
Add Russian tasks (RU-MTEB) #815

Merged: 9 commits merged into embeddings-benchmark:main on Jun 2, 2024

Conversation

@artemsnegirev (Contributor) commented May 24, 2024

Checklist for adding MMTEB dataset

Reason for dataset addition:

This PR adds a set of new tasks for Russian. They come as RU-MTEB tasks, just as PL-MTEB and C-MTEB were added previously.

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command (see the sketch after this checklist).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
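
For reference, a minimal sketch of how the two models above can be run programmatically rather than via the CLI. It assumes only the public mteb and sentence-transformers APIs; the task name used here is one of the tasks added in this PR, and the output folder is an arbitrary choice.

    import mteb
    from sentence_transformers import SentenceTransformer

    # Programmatic equivalent of `mteb -m {model_name} -t {task_name}`.
    model = SentenceTransformer("intfloat/multilingual-e5-small")
    tasks = mteb.get_tasks(tasks=["GeoreviewClassification"])
    evaluation = mteb.MTEB(tasks=tasks)
    evaluation.run(model, output_folder="results")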

Some tasks (e.g. TERRa, InappropriatenessClassification) are hard for non-instruct embedding models, so performance may look random, but it is not: as you can see, a model with random weights (untrained-multilingual) scores even worse.

| Task | multilingual-e5-small | paraphrase-multilingual-MiniLM-L12-v2 | sbert_large_mt_nlu_ru | rubert-tiny2 | untrained-multilingual |
|---|---|---|---|---|---|
| GeoreviewClassification | 42.3 | 38.24 | 38.26 | 39.64 | 26.24 |
| GeoreviewClusteringP2P | 60.64 | 54.86 | 58.13 | 44.18 | 21.69 |
| HeadlineClassification | 73.74 | 68.3 | 76.28 | 74.19 | 23.09 |
| InappropriatenessClassification | 58.44 | 58.18 | 63.99 | 58.57 | 51.62 |
| KinopoiskClassification | 47.57 | 41.45 | 49.13 | 49.06 | 36.17 |
| RiaNewsRetrieval | 66.66 | 44.82 | 21.4 | 13.92 | 0.21 |
| RuBQRetrieval | 66.35 | 29.7 | 29.8 | 10.87 | 0.36 |
| RuReviewsClassification | 60.64 | 58.88 | 58.18 | 56.99 | 40.05 |
| RuSTSBenchmarkSTS | 77.72 | 79.55 | 71.22 | 69.43 | 47.03 |
| RuSciBenchGRNTIClassification | 53.59 | 53.19 | 53.88 | 45.63 | 9 |
| RuSciBenchGRNTIClusteringP2P | 48.33 | 48.93 | 52.25 | 41.41 | 11.86 |
| RuSciBenchOECDClassification | 40.35 | 41.41 | 41.8 | 35.48 | 7.62 |
| RuSciBenchOECDClusteringP2P | 43.08 | 43.71 | 46.57 | 38.09 | 12.32 |
| TERRa | 57.51 | 58.56 | 52.5 | 51.87 | 52.84 |

@artemsnegirev (Contributor, Author):

Could you guide me on dataset sizes?

There is a clustering task with 100 topics, 300k examples, and 10 splits. There are retrieval tasks with >2000 examples. The InappropriatenessClassification task has 5k examples per class, but they are collected from 18 different topics (~300 examples per topic per class). It seems reasonable to me.

Is this size okay or too big?

@imenelydiaker (Contributor) commented May 25, 2024

> Could you guide me on dataset sizes?
>
> There is a clustering task with 100 topics, 300k examples, and 10 splits. There are retrieval tasks with >2000 examples. The InappropriatenessClassification task has 5k examples per class, but they are collected from 18 different topics (~300 examples per topic per class). It seems reasonable to me.
>
> Is this size okay or too big?

Hello @artemsnegirev,

5K samples per topic is too much for the benchmark; we'll go for 2048 samples in total.

For subsampling:

  • For clustering and classification tasks only: if your dataset exceeds 2048 samples, use the stratified_subsampling() function to downsample to 2048 according to your label/class [docs, example] (see the sketch after this list).
  • For other tasks, no need to subsample.
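
A minimal sketch of what that subsampling can look like inside a task class, assuming the stratified_subsampling() helper from the checklist; the label column name ("label") and the exact keyword arguments are assumptions and may differ between mteb versions.

    # Fragment intended to live inside a classification/clustering task class.
    def dataset_transform(self):
        # Downsample each evaluation split to 2048 examples, stratified by class label.
        self.dataset = self.stratified_subsampling(
            self.dataset,
            seed=self.seed,
            splits=["test"],
            label="label",
            n_samples=2048,
        )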

@artemsnegirev (Contributor, Author):

@imenelydiaker just a couple of questions for my understanding:

  1. Do we use the 2048-sample limit for eval_splits (test, validation) only? Should I downsample the train split if it exceeds the limit?
  2. If I do downsampling for clustering tasks, I'll get 7-20 samples per class. That doesn't seem stable. Can I go with both fast and regular versions, as was done here for WikiClustering?

@KennethEnevoldsen (Contributor) left a comment:

A few comments, otherwise I believe the PR looks very promising.

Review threads (outdated, resolved) on:
  • mteb/tasks/Clustering/rus/RuSciBenchGRNTIClusteringP2P.py
  • mteb/tasks/Classification/rus/KinopoiskClassification.py
  • mteb/tasks/Classification/rus/HeadlineClassification.py
  • mteb/tasks/Classification/rus/GeoreviewClassification.py
  • mteb/tasks/Retrieval/rus/MMarcoRetrieval.py
@imenelydiaker (Contributor):

> 1. Do we use the 2048-sample limit for eval_splits (test, validation) only? Should I downsample the train split if it exceeds the limit?

We accepted it for training sets also, although imo it may affect performance and introduce biases due to the selected samples. I'd go for subsampling the test and validation sets only.

> 2. If I do downsampling for clustering tasks, I'll get 7-20 samples per class. That doesn't seem stable. Can I go with both fast and regular versions, as was done here for WikiClustering?

You can do the Fast version only.

@artemsnegirev (Contributor, Author):

So we have 14 tasks for now in this PR. It should be 28 points to add, right?

@AlexeyVatolin (Contributor):

Hi! RuSciBench co-author here. Why do you use the accuracy metric for all RuSciBench classification tasks? In both tasks the classes are unbalanced, and it would be better to use F1, as in our original benchmark.

@KennethEnevoldsen (Contributor):

@AlexeyVatolin and @artemsnegirev we typically match the original benchmark so I would suggest that we change it.

@KennethEnevoldsen (Contributor) left a comment:

With the exception of the comment above, I believe everything looks fine. Please add points, as well as a list of tasks, to benchmarks.py.

@artemsnegirev (Contributor, Author):

> Hi! RuSciBench co-author here. Why do you use the accuracy metric for all RuSciBench classification tasks? In both tasks the classes are unbalanced, and it would be better to use F1, as in our original benchmark.

@AlexeyVatolin First of all, thank you! The datasets are balanced for use in mteb; you can check it out. The original test split exceeds the mteb limit of 2048 examples, so it's impossible to use the original split here, and it is downsampled and balanced instead. Could you confirm this is okay with you?
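
For illustration only, a rough sketch (not the code used in this PR) of how a class-balanced subsample can be drawn with the Hugging Face datasets library; the column name "label" and the 2048 cap are assumptions.

    from collections import defaultdict

    import datasets

    def balanced_subsample(ds: datasets.Dataset, label_col: str = "label",
                           n_samples: int = 2048, seed: int = 42) -> datasets.Dataset:
        """Keep roughly n_samples examples, with an equal number per class."""
        ds = ds.shuffle(seed=seed)
        per_class = n_samples // len(set(ds[label_col]))
        kept = defaultdict(list)
        for idx, lab in enumerate(ds[label_col]):
            if len(kept[lab]) < per_class:
                kept[lab].append(idx)
        return ds.select(sorted(i for ids in kept.values() for i in ids))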

@AlexeyVatolin (Contributor):

@artemsnegirev, Sorry, I didn't notice that the dataset is balanced by class size. I thought you took a random part from the original dataset. In this case you can use accuracy, I agree.

@artemsnegirev (Contributor, Author):

@KennethEnevoldsen I need to add a list of tasks to benchmarks.py in the same way MTEB_MAIN_EN is defined there, right?

@KennethEnevoldsen (Contributor):

@artemsnegirev you can probably do it more simply now with the new format:

MTEB_RUS = mteb.get_tasks(tasks=["{task_name}", ...])
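
For concreteness, the 14 task names from the results table above could be plugged into that snippet like this; this is only an illustration, and the final list in benchmarks.py may differ.

    import mteb

    MTEB_RUS = mteb.get_tasks(tasks=[
        "GeoreviewClassification",
        "GeoreviewClusteringP2P",
        "HeadlineClassification",
        "InappropriatenessClassification",
        "KinopoiskClassification",
        "RiaNewsRetrieval",
        "RuBQRetrieval",
        "RuReviewsClassification",
        "RuSTSBenchmarkSTS",
        "RuSciBenchGRNTIClassification",
        "RuSciBenchGRNTIClusteringP2P",
        "RuSciBenchOECDClassification",
        "RuSciBenchOECDClusteringP2P",
        "TERRa",
    ])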

@artemsnegirev (Contributor, Author):

@KennethEnevoldsen yes, it looks good. I think everything is ready; could we merge it?

@KennethEnevoldsen (Contributor):

Wonderful @artemsnegirev, thanks for the contributions.

If you want, feel free to check whether there is any section in the paper you could expand (see #595 for a link).

@KennethEnevoldsen merged commit e9d61bb into embeddings-benchmark:main on Jun 2, 2024 (7 checks passed).
@Alenush commented Jun 3, 2024

> Wonderful @artemsnegirev, thanks for the contributions.
>
> If you want, feel free to check whether there is any section in the paper you could expand (see #595 for a link).

Hi!

@artemsnegirev sent the PR for the Russian MTEB on our behalf earlier. Thank you for the suggestion! We would be glad to contribute to your paper, but we have not directly discussed co-authorship conditions. While we initially planned to publish Russian MTEB alongside our new encoder models as a separate paper, we now believe it would make a more significant contribution as part of the multilingual benchmark you're working on. Publishing the Russian part of MTEB in a separate paper no longer seems relevant.

What are the conditions for co-authoring the paper you are submitting to EMNLP 2024? How did you plan to include the French/Chinese MTEBs: as citations, or as detailed descriptions with the respective contributors?

At this point, we see two possible solutions:

  1. We can contribute to your paper directly, writing the particular parts about our datasets, and you include us in the list of authors.
    In this case, in Overleaf (https://www.overleaf.com/8693731561tpntkddrhngs#41bcff), we can add the information about Russian MTEB and how we constructed and filtered it, help with the other paper sections, etc. We are eager to get involved in writing the article and are ready to start working on it immediately.
    We are a team of five people affiliated with SaluteDevice and HSE University.

  2. Another option is cross-citation. We can write a separate paper about the Russian MTEB benchmark and encoder models, as we planned initially. In this case, we cite your paper, and you note in yours that you added Russian MTEB (and cite our future paper).

We appreciate the opportunity for international collaboration and would prefer the first variant if that's okay with you.

@KennethEnevoldsen (Contributor):

Hi @Alenush, we have decided that we sadly can't make it for EMNLP 2024, but we are planning to release the paper on arXiv this summer and then submit it to a journal.

Co-authorship is determined by points (see https://github.com/embeddings-benchmark/mteb/tree/main/docs/mmteb), but it includes multiple stages, e.g. paper writing, dataset additions, etc.

You can join in the paper writing (see #784 and #595). I believe an ideal approach would be co-authorship on MMTEB (we would love your help, and the addition of the benchmark is meaningful to MMTEB), and potentially cross-citation if you decide to go forward with your own paper as well (whether you do that is up to you).

There is also additional work that is currently not being done, such as updating the leaderboard to the new format (again, see #784).

@Samoed mentioned this pull request on Jun 20, 2024.