
Add Russian tasks (RU-MTEB) #815

Merged: 9 commits merged into embeddings-benchmark:main on Jun 2, 2024

Conversation

@artemsnegirev (Contributor) commented May 24, 2024

Checklist for adding MMTEB dataset

Reason for dataset addition:

This PR adds a set of new tasks for Russian. They come as RU-MTEB tasks, just as PL-MTEB and C-MTEB were added previously.

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command (see the sketch after this checklist).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
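
For reference, a minimal sketch of how the two models above can be run programmatically rather than via the CLI. It assumes only the public mteb and sentence-transformers APIs; the task name used here is one of the tasks added in this PR, and the output folder is an arbitrary choice.

    import mteb
    from sentence_transformers import SentenceTransformer

    # Programmatic equivalent of `mteb -m {model_name} -t {task_name}`.
    model = SentenceTransformer("intfloat/multilingual-e5-small")
    tasks = mteb.get_tasks(tasks=["GeoreviewClassification"])
    evaluation = mteb.MTEB(tasks=tasks)
    evaluation.run(model, output_folder="results")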

Some tasks (e.g. TERRa, InappropriatenessClassification) are hard for non-instruct embedding models, so performance may look random, but it is not: as you can see, a model with random weights (untrained-multilingual) scores even worse.

| Task | multilingual-e5-small | paraphrase-multilingual-MiniLM-L12-v2 | sbert_large_mt_nlu_ru | rubert-tiny2 | untrained-multilingual |
|---|---|---|---|---|---|
| GeoreviewClassification | 42.3 | 38.24 | 38.26 | 39.64 | 26.24 |
| GeoreviewClusteringP2P | 60.64 | 54.86 | 58.13 | 44.18 | 21.69 |
| HeadlineClassification | 73.74 | 68.3 | 76.28 | 74.19 | 23.09 |
| InappropriatenessClassification | 58.44 | 58.18 | 63.99 | 58.57 | 51.62 |
| KinopoiskClassification | 47.57 | 41.45 | 49.13 | 49.06 | 36.17 |
| RiaNewsRetrieval | 66.66 | 44.82 | 21.4 | 13.92 | 0.21 |
| RuBQRetrieval | 66.35 | 29.7 | 29.8 | 10.87 | 0.36 |
| RuReviewsClassification | 60.64 | 58.88 | 58.18 | 56.99 | 40.05 |
| RuSTSBenchmarkSTS | 77.72 | 79.55 | 71.22 | 69.43 | 47.03 |
| RuSciBenchGRNTIClassification | 53.59 | 53.19 | 53.88 | 45.63 | 9 |
| RuSciBenchGRNTIClusteringP2P | 48.33 | 48.93 | 52.25 | 41.41 | 11.86 |
| RuSciBenchOECDClassification | 40.35 | 41.41 | 41.8 | 35.48 | 7.62 |
| RuSciBenchOECDClusteringP2P | 43.08 | 43.71 | 46.57 | 38.09 | 12.32 |
| TERRa | 57.51 | 58.56 | 52.5 | 51.87 | 52.84 |

@artemsnegirev (Contributor, Author):

Could you guide me on dataset sizes?

There is a clustering task with 100 topics, 300k examples, and 10 splits. There are retrieval tasks with >2000 examples. The InappropriatenessClassification task has 5k examples per class, but they are collected from 18 different topics (~300 examples per topic per class). It seems reasonable to me.

Is this size okay or too big?

@imenelydiaker (Contributor) commented May 25, 2024

> Could you guide me on dataset sizes?
>
> There is a clustering task with 100 topics, 300k examples, and 10 splits. There are retrieval tasks with >2000 examples. The InappropriatenessClassification task has 5k examples per class, but they are collected from 18 different topics (~300 examples per topic per class). It seems reasonable to me.
>
> Is this size okay or too big?

Hello @artemsnegirev,

5K samples per topic is too much for the benchmark; we'll go for 2048 samples in total.

For subsampling:

  • For clustering and classification tasks only: if your dataset exceeds 2048 samples, use the stratified_subsampling() function to downsample to 2048 according to your label/class [docs, example] (see the sketch after this list).
  • For other tasks, no need to subsample.
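
A minimal sketch of what that subsampling can look like inside a task class, assuming the stratified_subsampling() helper from the checklist; the label column name ("label") and the exact keyword arguments are assumptions and may differ between mteb versions.

    # Fragment intended to live inside a classification/clustering task class.
    def dataset_transform(self):
        # Downsample each evaluation split to 2048 examples, stratified by class label.
        self.dataset = self.stratified_subsampling(
            self.dataset,
            seed=self.seed,
            splits=["test"],
            label="label",
            n_samples=2048,
        )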

@artemsnegirev (Contributor, Author):

@imenelydiaker just a couple of questions for my understanding:

  1. Do we use the 2048-sample limit for eval_splits (test, validation) only? Should I downsample the train split if it exceeds the limit?
  2. If I do downsampling for clustering tasks, I'll get 7-20 samples per class. That doesn't seem stable. Can I go with both fast and regular versions, as was done here for WikiClustering?

@KennethEnevoldsen (Contributor) left a comment:

A few comments, otherwise I believe the PR looks very promising.

Review threads (outdated, resolved) on:
  • mteb/tasks/Clustering/rus/RuSciBenchGRNTIClusteringP2P.py
  • mteb/tasks/Classification/rus/KinopoiskClassification.py
  • mteb/tasks/Classification/rus/HeadlineClassification.py
  • mteb/tasks/Classification/rus/GeoreviewClassification.py
  • mteb/tasks/Retrieval/rus/MMarcoRetrieval.py
@imenelydiaker (Contributor):

> 1. Do we use the 2048-sample limit for eval_splits (test, validation) only? Should I downsample the train split if it exceeds the limit?

We accepted it for training sets also, although imo it may affect performance and introduce biases due to the selected samples. I'd go for subsampling the test and validation sets only.

> 2. If I do downsampling for clustering tasks, I'll get 7-20 samples per class. That doesn't seem stable. Can I go with both fast and regular versions, as was done here for WikiClustering?

You can do the Fast version only.

@artemsnegirev (Contributor, Author):

So we have 14 tasks for now in this PR. It should be 28 points to add, right?

@AlexeyVatolin (Contributor):

Hi! RuSciBench co-author here. Why do you use the accuracy metric for all RuSciBench classification tasks? In both tasks the classes are unbalanced, and it would be better to use F1, as in our original benchmark.

@KennethEnevoldsen (Contributor):

@AlexeyVatolin and @artemsnegirev we typically match the original benchmark so I would suggest that we change it.

@KennethEnevoldsen (Contributor) left a comment:

With the exception of the comment above, I believe everything looks fine. Please add points, as well as a list of tasks, to benchmarks.py.

@artemsnegirev (Contributor, Author):

> Hi! RuSciBench co-author here. Why do you use the accuracy metric for all RuSciBench classification tasks? In both tasks the classes are unbalanced, and it would be better to use F1, as in our original benchmark.

@AlexeyVatolin First of all, thank you! The datasets are balanced for use in mteb; you can check it out. The original test split exceeds the mteb limit of 2048 examples, so it's impossible to use the original split here, and it is downsampled and balanced instead. Could you confirm this is okay with you?
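
For illustration only, a rough sketch (not the code used in this PR) of how a class-balanced subsample can be drawn with the Hugging Face datasets library; the column name "label" and the 2048 cap are assumptions.

    from collections import defaultdict

    import datasets

    def balanced_subsample(ds: datasets.Dataset, label_col: str = "label",
                           n_samples: int = 2048, seed: int = 42) -> datasets.Dataset:
        """Keep roughly n_samples examples, with an equal number per class."""
        ds = ds.shuffle(seed=seed)
        per_class = n_samples // len(set(ds[label_col]))
        kept = defaultdict(list)
        for idx, lab in enumerate(ds[label_col]):
            if len(kept[lab]) < per_class:
                kept[lab].append(idx)
        return ds.select(sorted(i for ids in kept.values() for i in ids))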

@AlexeyVatolin (Contributor):

@artemsnegirev, Sorry, I didn't notice that the dataset is balanced by class size. I thought you took a random part from the original dataset. In this case you can use accuracy, I agree.

@artemsnegirev (Contributor, Author):

@KennethEnevoldsen I need to add a list of tasks to benchmarks.py in the same way MTEB_MAIN_EN is defined there, right?

@KennethEnevoldsen (Contributor):

@artemsnegirev you can probably do it more simply now with the new format:

MTEB_RUS = mteb.get_tasks(tasks=["{task_name}", ...])
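
For concreteness, the 14 task names from the results table above could be plugged into that snippet like this; this is only an illustration, and the final list in benchmarks.py may differ.

    import mteb

    MTEB_RUS = mteb.get_tasks(tasks=[
        "GeoreviewClassification",
        "GeoreviewClusteringP2P",
        "HeadlineClassification",
        "InappropriatenessClassification",
        "KinopoiskClassification",
        "RiaNewsRetrieval",
        "RuBQRetrieval",
        "RuReviewsClassification",
        "RuSTSBenchmarkSTS",
        "RuSciBenchGRNTIClassification",
        "RuSciBenchGRNTIClusteringP2P",
        "RuSciBenchOECDClassification",
        "RuSciBenchOECDClusteringP2P",
        "TERRa",
    ])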

@artemsnegirev (Contributor, Author):

@KennethEnevoldsen yes, it looks good. I think everything is ready; could we merge it?

@KennethEnevoldsen (Contributor):

Wonderful @artemsnegirev, thanks for the contributions.

If you want, feel free to check whether there is any section in the paper you could expand (see #595 for a link).

@KennethEnevoldsen merged commit e9d61bb into embeddings-benchmark:main on Jun 2, 2024 (7 checks passed).
@Alenush commented Jun 3, 2024

> Wonderful @artemsnegirev, thanks for the contributions.
>
> If you want, feel free to check whether there is any section in the paper you could expand (see #595 for a link).

Hi!

@artemsnegirev sent the PR for the Russian MTEB on our behalf earlier. Thank you for the suggestion! We would be glad to contribute to your paper, but we have not directly discussed co-authorship conditions. While we initially planned to publish Russian MTEB alongside our new encoder models as a separate paper, we now believe it would make a more significant contribution as part of the multilingual benchmark you're working on. Publishing the Russian part of MTEB in a separate paper no longer seems relevant.

What are the conditions for co-authoring the paper you are submitting to EMNLP 2024? How did you plan to include the French/Chinese MTEBs: as citations, or as detailed descriptions with the respective contributors?

At this point, we see two possible solutions:

  1. We can contribute to your paper directly, writing the particular parts about our datasets, and you include us in the list of authors.
    In this case, in Overleaf (https://www.overleaf.com/8693731561tpntkddrhngs#41bcff), we can add the information about Russian MTEB and how we constructed and filtered it, help with the other paper sections, etc. We are eager to get involved in writing the article and are ready to start working on it immediately.
    We are a team of five people affiliated with SaluteDevice and HSE University.

  2. Another option is cross-citation. We can write a separate paper about the Russian MTEB benchmark and encoder models, as we planned initially. In this case, we cite your paper, and you note in yours that you added Russian MTEB (and cite our future paper).

We appreciate the opportunity for international collaboration and would prefer the first variant if that's okay with you.

@KennethEnevoldsen (Contributor):

Hi @Alenush, we have decided that we sadly can't make it for EMNLP 2024, but we are planning to release the paper on arXiv this summer and then submit it to a journal.

Co-authorship is determined by points (see https://github.com/embeddings-benchmark/mteb/tree/main/docs/mmteb), but it includes multiple stages, e.g. paper writing, dataset additions, etc.

You can join in the paper writing (see #784 and #595). I believe an ideal approach would be co-authorship on MMTEB (we would love your help, and the addition of the benchmark is meaningful to MMTEB), and potentially cross-citation if you decide to go forward with your own paper as well (whether you do that is up to you).

There is also additional work that is currently not being done, such as updating the leaderboard to the new format (again, see #784).

@Samoed mentioned this pull request on Jun 20, 2024.