
fix: evaluate missing splits #1268

Open
wants to merge 5 commits into base: main

Conversation

@thivyanth thivyanth commented Oct 1, 2024

Addresses #1260

Checklist

@Muennighoff @isaac-chung

  • Run tests locally to make sure nothing is broken using make test.
    • 944 passed, 236 skipped, 55 warnings in 232.13s (0:03:52)
  • Run the formatter to format the code using make lint.

@isaac-chung (Collaborator) left a comment

@thivyanth Thanks for taking a first stab at it! For any new functionality, we usually add a test / test cases to help confirm that the added code works. Could you please add a test for this? Specifically the following cases:

  1. Results exist, and no splits are missing.
  2. Results exist, and one split is missing.

I can review the PR afterwards.
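The two requested cases could be sketched roughly like this, using a plain dict keyed by split name as a stand-in for the stored results object (the helper name and signature here are assumptions for illustration, not mteb's actual API):

```python
# Hypothetical stand-in for the missing-split check.
def find_missing_splits(existing_results: dict, eval_splits: list[str]) -> list[str]:
    """Return the requested splits that are absent from the stored results."""
    return [split for split in eval_splits if split not in existing_results]

# Case 1: results exist and cover every requested split -> nothing to rerun.
assert find_missing_splits({"train": {}, "test": {}}, ["train", "test"]) == []

# Case 2: results exist but one split is missing -> only that split reruns.
assert find_missing_splits({"train": {}}, ["train", "test"]) == ["test"]
```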

@KennethEnevoldsen (Contributor) left a comment

Looking good, but still a few things to adjust.

```python
if save_path.exists() and not overwrite_results:
    logger.info(
        f"{task.metadata.name} results already exists. Loading results from disk. Set overwrite_results=True to overwrite."
    )
    existing_results = self.load_existing_results(save_path)
```

You should use MTEBResults.from_disk(path) to load results (see line 354).
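For illustration only, a stored result file is roughly a JSON document with scores keyed by split, so the disk round trip can be sketched with the json module (the real loader is MTEBResults.from_disk; the exact file schema below is an assumption):

```python
import json
import tempfile
from pathlib import Path

# Assumed, simplified shape of a stored result file: scores keyed by split.
results = {"task_name": "NFCorpus", "scores": {"test": [{"ndcg_at_10": 0.3}]}}

path = Path(tempfile.mkdtemp()) / "NFCorpus.json"
path.write_text(json.dumps(results))

# Stand-in for MTEBResults.from_disk(path).
loaded = json.loads(path.read_text())
assert list(loaded["scores"]) == ["test"]  # "train" would be a missing split
```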

```python
logger.info(
    f"{task.metadata.name} results exist but missing splits: {missing_splits}. Running evaluation for missing splits."
)
task_eval_splits = missing_splits
```

This will overwrite the existing file further down, resulting in a new file where the previously evaluated splits are missing. You will have to merge the results objects.
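The merge might look roughly like the following sketch, with plain dicts standing in for MTEBResults objects (the function name and score layout are illustrative assumptions):

```python
def merge_split_scores(existing: dict, new: dict) -> dict:
    """Combine per-split scores, letting freshly computed splits win on conflict."""
    merged = dict(existing)
    merged.update(new)
    return merged

# The old file had only "train"; the new run evaluated only "test".
merged = merge_split_scores({"train": {"ndcg_at_10": 0.31}},
                            {"test": {"ndcg_at_10": 0.28}})
assert set(merged) == {"train", "test"}  # nothing is lost on rewrite
```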

mteb/evaluation/MTEB.py (outdated, resolved)
Comment on lines 512 to 517:

```python
def compare_splits_and_subsets(self, existing_results, task_eval_splits):
    missing_splits = []
    for split in task_eval_splits:
        if split not in existing_results:
            missing_splits.append(split)
    return missing_splits
```

Suggested change:

```python
@staticmethod
def compare_splits_and_subsets(existing_results: MTEBResults, task_eval_splits: list[str]) -> list[str]:
    missing_splits = []
    for split in task_eval_splits:  # this will need to be adapted to the MTEBResults object
        if split not in existing_results:
            missing_splits.append(split)
    return missing_splits
```

@isaac-chung isaac-chung changed the title Feature/missing-splits-evaluation fix: evaluate missing splits Oct 4, 2024
@thivyanth (Author) commented:

@KennethEnevoldsen Thank you for the suggestions. I've implemented the requested changes, rewriting them from scratch. I manually checked the logs to confirm that only the missing splits are evaluated, rather than verifying it with pytest. Could you suggest a way to verify with pytest that only the missing splits are evaluated? @isaac-chung @Muennighoff

(The latest commit has passed make test and make lint.)

@thivyanth (Author) commented:

Wrote a test as well.

(The latest commit has passed make test and make lint.)

@KennethEnevoldsen (Contributor) commented:

@thivyanth rerunning failed test to ensure that everything passes

@KennethEnevoldsen (Contributor) left a comment

This is already looking much better. A few more changes are needed, though; let me know if there are any questions.

```python
self.last_evaluated_splits[task.metadata_dict["name"]] = set()
self.last_evaluated_splits[task.metadata_dict["name"]].add(split)

new_results = MTEBResults.from_task_results(
```

We need to consider what to do with the MTEB version specification.

Ideally we should record multiple versions, e.g. "1.2.11, 1.2.16". We should also add a flag to only append to existing files if the version matches (it should probably be on by default).
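One way to sketch that idea (the function and parameter names below are illustrative, not mteb's actual schema):

```python
def record_version(versions: list[str], current: str) -> list[str]:
    """Track every mteb version that contributed to a result file."""
    return versions if current in versions else versions + [current]

def can_append(versions: list[str], current: str,
               require_version_match: bool = True) -> bool:
    """Only append to an existing file when the version matches (default on)."""
    return (not require_version_match) or current in versions

versions = ["1.2.11"]
assert not can_append(versions, "1.2.16")      # guarded by default
assert can_append(versions, "1.2.16", False)   # explicit opt-out
assert record_version(versions, "1.2.16") == ["1.2.11", "1.2.16"]
```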

```python
    f"{task.metadata.name} results already exist. Loading results from disk."
)
evaluation_results.append(existing_results)
self.last_evaluated_splits[task.metadata.name] = []  # Add this line
```

Seems like an error?

```diff
@@ -488,3 +563,11 @@ def _save_model_metadata(model_meta: ModelMeta, output_folder: Path) -> None:

     with save_path.open("w") as f:
         json.dump(model_meta.to_dict(), f)

+def get_last_evaluated_splits(self):
```

Is this needed? (I would just add some logging messages instead)

Comment on lines +489 to +490:

```python
if existing_results:
    merged_results = self._merge_results(existing_results, new_results)
```

Probably worth merging using the MTEBResults object, instead of directly on the dict:

```python
new_results = MTEBResults(...)
if existing_results:
    new_results.update(existing_results)
```

Comment on lines +10 to +12:

```python
@pytest.fixture
def model():
    return SentenceTransformer("all-MiniLM-L6-v2")
```

Use a mock model instead (see e.g. MockNumpyEncoder here).
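A mock encoder in that spirit might look like this sketch (the actual MockNumpyEncoder in the repository's tests may differ; this just shows the shape of the idea):

```python
import numpy as np

class MockNumpyEncoder:
    """Deterministic stand-in for SentenceTransformer: fast and offline."""

    def encode(self, sentences, **kwargs):
        # Fixed-size zero vectors; the shape mirrors a real encoder's output.
        return np.zeros((len(sentences), 8))

model = MockNumpyEncoder()
assert model.encode(["a", "b"]).shape == (2, 8)
```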

Comment on lines +15 to +17:

```python
@pytest.fixture
def nfcorpus_tasks():
    return mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
```

Use MockTask as well (see above).

Comment on lines +22 to +33:

```python
evaluation.run(
    model,
    eval_splits=["train", "test"],
    save_predictions=True,
    output_folder=str(tmp_path / "testcase1"),
    verbosity=2,
)
last_evaluated_splits = evaluation.get_last_evaluated_splits()
print(last_evaluated_splits)
assert "NFCorpus" in last_evaluated_splits
assert set(last_evaluated_splits["NFCorpus"]) == {"train", "test"}
assert len(last_evaluated_splits["NFCorpus"]) == 2
```

We can simplify tests a bit (general across tests). Suggested change:

```python
result_obj = evaluation.run(
    model,
    eval_splits=["train", "test"],
    save_predictions=True,
    output_folder=str(tmp_path / "testcase1"),
    verbosity=2,
)
# check splits here based on the object - no need for last_evaluated_splits
```


Any reason why save_predictions is True?

@Muennighoff (Contributor) commented:

@thivyanth great work thus far! It's a pretty useful PR so would be amazing to have it merged soon if you have time to address the comments?

@thivyanth (Author) commented:

> @thivyanth great work thus far! It's a pretty useful PR so would be amazing to have it merged soon if you have time to address the comments?

I just saw this message; I emailed you a few hours ago.

4 participants