This repository has been archived by the owner on Feb 1, 2025. It is now read-only.

Duplicates in data? #44

Open
geraltofrivia opened this issue Aug 18, 2021 · 0 comments

Comments

@geraltofrivia

Hi,

The following is most likely a misunderstanding on my part, but I notice that there are many duplicates and pseudo-duplicates in the JSONL files.

For instance, this line in lama/TREx/P17.jsonl:

{
	"uuid": "df10f035-6269-4cdf-88df-26395e0dc3b4",
	"obj_uri": "Q16",
	"obj_label": "Canada",
	"sub_uri": "Q7517499",
	"sub_label": "Simcoe Composite School",
	"predicate_id": "P17",
	"evidences": [{
		"sub_surface": "Simcoe Composite School",
		"obj_surface": "Canada",
		"masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK]."
	}, {
		"sub_surface": "Simcoe Composite School",
		"obj_surface": "Canada",
		"masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK]."
	}]
}            

has two evidence entries that are identical. This is not always the case; in many other records the evidences are different sentences.

Further, in the ConceptNet corpus, apparently every UUID appears twice. As an example, here are two instances with the same UUID:

{
	"sub": "alive",
	"obj": "think",
	"pred": "HasSubevent",
	"masked_sentences": ["One of the things you do when you are alive is [MASK]."],
	"obj_label": "think",
	"uuid": "d4f11631dde8a43beda613ec845ff7d1"
}

and

{
	"pred": "HasSubevent",
	"masked_sentences": ["One of the things you do when you are alive is [MASK]."],
	"obj_label": "think",
	"uuid": "d4f11631dde8a43beda613ec845ff7d1",
	"sub_label": "alive"
}

Here, the second instance does not have the sub and obj fields (it carries sub_label instead), but is otherwise unchanged.


So, based on this, my question is:

Are the duplicates intentional? For instance, when computing metrics for my model over the probe, should I treat the task as-is and, if need be, make predictions twice over the same instance?

Alternatively, I could easily remove the duplicates when processing the files. Should I do that instead? Have others done that?
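For concreteness, this is the kind of deduplication I have in mind: keep only the first record seen per uuid while reading a .jsonl stream. Keeping the first variant (rather than merging the two field layouts) is my own assumption, not anything prescribed by the probe:

```python
import io
import json

# Hedged sketch: read LAMA-style JSONL lines and keep one record per uuid.
def load_unique(lines):
    seen = set()
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec["uuid"] in seen:  # duplicate uuid: skip the later variant
            continue
        seen.add(rec["uuid"])
        records.append(rec)
    return records

# Usage with the two ConceptNet variants above (same uuid, different fields):
sample = io.StringIO(
    '{"sub": "alive", "obj": "think", "pred": "HasSubevent", '
    '"obj_label": "think", "uuid": "d4f11631dde8a43beda613ec845ff7d1"}\n'
    '{"pred": "HasSubevent", "obj_label": "think", '
    '"uuid": "d4f11631dde8a43beda613ec845ff7d1", "sub_label": "alive"}\n'
)
unique = load_unique(sample)  # only the first variant survives
```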

I know for a fact that LAMA on HuggingFace datasets (https://huggingface.co/datasets/lama) contains these duplicates.
