Fix leaderboard metrics and COIR tasks #26

Merged
merged 8 commits into embeddings-benchmark:main from add_rumteb on Sep 12, 2024

Conversation

@Samoed (Contributor) commented Aug 27, 2024

# Conflicts: EXTERNAL_MODEL_RESULTS.json, refresh.py, and default.jsonl files under all_data_tasks/ and boards_data/{bright,en,ru}/
# Conflicts: default.jsonl files under all_data_tasks/ and boards_data/{da,en,fr,no,other-sts,pl,se,zh}/
@Samoed changed the title from "add tasks from rumteb benchmark" to "Fix leaderboard metrics and COIR tasks" on Sep 11, 2024
@Samoed (Contributor, Author) commented Sep 11, 2024

After c21efc7, the metrics for datasets are taken from config.yaml for each dataset. However, the current implementation uses unique metrics taken directly from the dataset, not from the config. Also, the dataset name for COIR is incorrect in the config.
I've also updated the external model results after embeddings-benchmark/results#25.

COIR tab: [screenshot]
RuMTEB tab: [screenshot]

@KennethEnevoldsen @Muennighoff
Fixes #27
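
For reference, a minimal sketch of what reading the metric from config.yaml (rather than from whatever metric names appear in a dataset's results) could look like. The tasks/metric layout matches the config snippets shown in the diffs below, but the result-dict shape and the helper names are assumptions for illustration, not the leaderboard's actual code:

import yaml

def load_task_metrics(config_path: str = "config.yaml") -> dict:
    """Map task type (e.g. 'PairClassification') to its configured metric (e.g. 'max_ap')."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return {task_type: spec["metric"] for task_type, spec in config["tasks"].items()}

def score_for(result: dict, task_type: str, task_metrics: dict):
    """Pick the configured metric out of one dataset's result entry (shape assumed)."""
    wanted = task_metrics[task_type]             # e.g. 'max_ap', 'cosine_spearman', 'ndcg_at_10'
    return result.get("scores", {}).get(wanted)  # None if the model never reported that metric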

@KennethEnevoldsen (Contributor) left a comment

Would love a check from @Muennighoff and @orionw, but otherwise I can't see any issues here.

model_meta.yaml (review thread outdated and resolved)
@@ -20,7 +20,7 @@ tasks:
 task_description: "Clustering is the task of grouping similar documents together."
 PairClassification:
 icon: "🎭"
-metric: ap
+metric: max_ap
Contributor

Won't this cause issues with the external results? (@Muennighoff, I believe we have discussed this before.)

Contributor Author

I don't think so, but I'll add these metrics to refresh.py for compatibility.

Contributor Author

I've changed refresh.py, but I'll leave the comment open until @Muennighoff reviews.

@Samoed (Contributor, Author), Sep 11, 2024

I changed refresh.py, but I'll leave this comment open until @Muennighoff reviews it. I'd rather keep max_ap in the config, though, because after embeddings-benchmark/mteb#1037 there is no ap in the model results.
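
For illustration, the kind of compatibility shim being discussed might look roughly like this (the mapping and helper are hypothetical, not the actual refresh.py code): older external results that still report the pre-rename metric names are upgraded before the leaderboard filters on the configured metric.

# Hypothetical backward-compatibility mapping for legacy metric names.
OLD_TO_NEW_METRIC = {
    "ap": "max_ap",
    "spearman": "cosine_spearman",
}

def normalize_metrics(scores: dict) -> dict:
    """Return a copy of a result's scores with legacy metric names upgraded."""
    normalized = dict(scores)
    for old, new in OLD_TO_NEW_METRIC.items():
        if old in normalized and new not in normalized:
            normalized[new] = normalized.pop(old)
    return normalized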

@Samoed (Contributor, Author) commented Sep 11, 2024

CI fix in embeddings-benchmark/results#30

@orionw (Collaborator) left a comment

LGTM with the CI fixed!

@Muennighoff (Contributor) left a comment

LGTM (once CI is fixed)

 metric_description: "Spearman correlation based on the model's similarity metric (usually cosine)"
 task_description: "Semantic Textual Similarity is the task of determining how similar two texts are."
 Summarization:
 icon: "📜"
-metric: spearman
+metric: cosine_spearman
Contributor

I think we changed this so that future models can have their own distance metrics; it does not have to be cosine, and only the use of spearman would be the same across models. But since there are no such models yet, I think reverting this works for me! cc @KennethEnevoldsen

Contributor Author

Is it better to leave it as spearman?

Contributor

If the current code allows submitting results of models with other distance metrics, then maybe yes; @KennethEnevoldsen probably knows best?

@Samoed (Contributor, Author), Sep 11, 2024

For summarization, the available metrics are:

"pearson"
"spearman"
"cosine_spearman"
"cosine_pearson"
"dot_spearman"
"dot_pearson"

I checked main_score for the summarization tasks, and they have cosine_spearman as the main_score.
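
A check along these lines can be reproduced roughly as follows (assuming a recent mteb version that exposes get_tasks() and per-task metadata; this is a sketch, not necessarily the exact command used):

import mteb

# Print the main_score of every Summarization task registered in mteb.
for task in mteb.get_tasks(task_types=["Summarization"]):
    print(task.metadata.name, "->", task.metadata.main_score)
# Per the comment above, each task reports cosine_spearman as its main_score.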

Contributor

spearman will often just be cosine_spearman, but I think it is nicer to leave it up to the model developer to choose their comparison metric, i.e. I would leave it as spearman.

@Samoed (Contributor, Author), Sep 12, 2024

But the issue is that the metrics will be filtered based on the metric specified in the config, and in the results there is no metric named spearman. I can extend the metrics in the results file to avoid this, but I don't know if that is a good solution.
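
To make the filtering problem concrete (a hypothetical sketch, not the leaderboard code): if the config names spearman but the stored results only contain cosine_spearman, a strict lookup finds nothing, so either the config has to name the stored metric (the route taken in this PR) or the lookup needs a fallback, for example:

# Illustrative only: strict lookup with a fallback for the STS/Summarization case.
FALLBACKS = {"spearman": ["cosine_spearman", "dot_spearman"]}

def pick_score(scores: dict, configured_metric: str):
    """Try the configured metric first; fall back to known aliases if it is absent."""
    if configured_metric in scores:
        return scores[configured_metric]
    for candidate in FALLBACKS.get(configured_metric, []):
        if candidate in scores:
            return scores[candidate]
    return None  # metric missing: the model would drop out of that tab

print(pick_score({"cosine_spearman": 31.2}, "spearman"))  # -> 31.2 via the fallback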

Contributor

If this is the case, I am fine with keeping it as cosine_spearman (we can switch to custom similarity metrics at a later point).

# Conflicts: default.jsonl files under all_data_tasks/ and boards_data/{en,fr,other-sts,zh}/
@Samoed (Contributor, Author) commented Sep 12, 2024

@KennethEnevoldsen CI is now passing

@KennethEnevoldsen merged commit 01b06df into embeddings-benchmark:main on Sep 12, 2024
1 check passed
@Samoed deleted the add_rumteb branch on September 12, 2024 at 14:07
@Samoed mentioned this pull request on Sep 12, 2024