This repository has been archived by the owner on Feb 1, 2025. It is now read-only.

Duplicates in data? #44

Open
geraltofrivia opened this issue Aug 18, 2021 · 0 comments

Comments

@geraltofrivia

Hi,

The following is most likely a misunderstanding on my part, but I notice that there are many duplicates and pseudo-duplicates in the JSONL files.

For instance, this line in lama/TREx/P17.jsonl:

{
	"uuid": "df10f035-6269-4cdf-88df-26395e0dc3b4",
	"obj_uri": "Q16",
	"obj_label": "Canada",
	"sub_uri": "Q7517499",
	"sub_label": "Simcoe Composite School",
	"predicate_id": "P17",
	"evidences": [{
		"sub_surface": "Simcoe Composite School",
		"obj_surface": "Canada",
		"masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK]."
	}, {
		"sub_surface": "Simcoe Composite School",
		"obj_surface": "Canada",
		"masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK]."
	}]
}            

has two evidence entries that are identical. This is not always the case; in many other records the evidences are different sentences.

Further, in the ConceptNet corpus, apparently every UUID appears twice. As an example, here are two instances with the same UUID:

{
	"sub": "alive",
	"obj": "think",
	"pred": "HasSubevent",
	"masked_sentences": ["One of the things you do when you are alive is [MASK]."],
	"obj_label": "think",
	"uuid": "d4f11631dde8a43beda613ec845ff7d1"
}

and

{
	"pred": "HasSubevent",
	"masked_sentences": ["One of the things you do when you are alive is [MASK]."],
	"obj_label": "think",
	"uuid": "d4f11631dde8a43beda613ec845ff7d1",
	"sub_label": "alive"
}

Here, the second instance does not have the sub and obj fields (it carries sub_label instead), but is otherwise unchanged.


So, based on this, my question is:

Are the duplicates intentional? For instance, when computing metrics for my model over the probe, should I treat the task as-is and, if need be, make predictions twice over the same instance?

Alternatively, I could easily remove the duplicates when processing the files. Should I do that instead? Have others done that?
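For concreteness, this is the kind of deduplication I have in mind: keep only the first record seen per uuid while reading a .jsonl stream. Keeping the first variant (rather than merging the two field layouts) is my own assumption, not anything prescribed by the probe:

```python
import io
import json

# Hedged sketch: read LAMA-style JSONL lines and keep one record per uuid.
def load_unique(lines):
    seen = set()
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec["uuid"] in seen:  # duplicate uuid: skip the later variant
            continue
        seen.add(rec["uuid"])
        records.append(rec)
    return records

# Usage with the two ConceptNet variants above (same uuid, different fields):
sample = io.StringIO(
    '{"sub": "alive", "obj": "think", "pred": "HasSubevent", '
    '"obj_label": "think", "uuid": "d4f11631dde8a43beda613ec845ff7d1"}\n'
    '{"pred": "HasSubevent", "obj_label": "think", '
    '"uuid": "d4f11631dde8a43beda613ec845ff7d1", "sub_label": "alive"}\n'
)
unique = load_unique(sample)  # only the first variant survives
```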

I know for a fact that LAMA on HuggingFace datasets (https://huggingface.co/datasets/lama) contains these duplicates.
