Add GLUE datasets #26

PetrochukM · 2018-04-27T18:31:33Z

GLUE datasets are standard for evaluating NLU tasks.

"In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks."

PattynR · 2018-11-09T21:57:54Z

Hi, I am a Belgian student in computer engineering, and I am following an introductory course on open source. One of my goals this semester is to contribute to a project. My master's thesis will be related to NLP, which is why this project interests me. Is there a way I could help fix this issue (or maybe another issue related to this project)?

PetrochukM · 2018-11-10T00:31:37Z

Hi There!

Yeah, please fix this issue! GLUE datasets are a popular suite of datasets for evaluating NLP models. It'd be nice if there was support for those datasets. This issue should be an easy one to get started with.

I was recently in Belgium for EMNLP 2018, one of the best NLP conferences in the world.

PattynR · 2018-11-18T10:49:33Z

Hey, too bad I missed EMNLP! This is my first year working on NLP, and I had never heard of those conferences. I hope I'll be able to go next year.
About the issue, could you please confirm that my job is to add a new file, named "glue.py", to the torchnlp/datasets folder? I guess this is what I have to do, but I would prefer to be completely sure!

PetrochukM · 2018-11-18T16:38:04Z

Yeah that'd work!

PattynR · 2018-12-08T11:28:22Z

Hi,
I'm almost done. For the moment it works for all the GLUE datasets except QQP and SNLI. There is an issue with those files that I don't know how to handle: when I load the QQP and SNLI datasets, some lines in the files themselves don't have the right number of fields. Here is an example to illustrate what I mean.

The first line of each downloaded file lists the names of the features in the tsv file. In the 'train.tsv' file of SNLI, for example, there should be 11 features per line. However, a lot of lines (38,656 in total) contain more than 10 tabs, and therefore more than 11 features.
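A minimal sketch of how such lines could be counted, assuming the file is plain tab-separated text with a header on its first line. The function name `field_count_histogram` and the in-memory sample are hypothetical and are not part of torchnlp or the GLUE files.

```python
import io
from collections import Counter


def field_count_histogram(lines, delimiter="\t"):
    """Return the header's field count and a histogram of field counts per data row.

    `lines` is any iterable of text lines, e.g. an open file handle.
    (Hypothetical helper for illustration only.)
    """
    rows = iter(lines)
    header = next(rows).rstrip("\r\n").split(delimiter)
    counts = Counter(
        len(line.rstrip("\r\n").split(delimiter)) for line in rows
    )
    return len(header), counts


# Tiny stand-in for a real file: a 3-column header, one good row, one row with an extra tab.
sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n")
expected, counts = field_count_histogram(sample)

# Rows whose field count differs from the header's.
unexpected = {n: c for n, c in counts.items() if n != expected}
print(expected, dict(counts), unexpected)
```

On a real file, passing `open(path, encoding="utf-8")` instead of the sample would show how many rows have each field count, which helps tell a consistent extra column apart from scattered stray tabs.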

For the moment I decided not to add those lines to the Dataset object, but I know this is not what should be done. I've searched the internet for an explanation of those lines, but there is not much documentation about QQP and SNLI.
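One way to avoid dropping those lines silently is to set them aside together with their line numbers, so they can be inspected later. This is only a sketch under the same plain tab-separated assumption. `split_rows` is a hypothetical helper, not an existing torchnlp function, and it returns plain dicts rather than a torchnlp Dataset.

```python
import io


def split_rows(lines, delimiter="\t"):
    """Separate rows that match the header's field count from rows that do not.

    Returns (header, valid, malformed), where `valid` is a list of dicts keyed
    by header name and `malformed` is a list of (line_number, fields) pairs.
    (Hypothetical helper for illustration only.)
    """
    rows = iter(lines)
    header = next(rows).rstrip("\r\n").split(delimiter)
    valid, malformed = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, line in enumerate(rows, start=2):
        fields = line.rstrip("\r\n").split(delimiter)
        if len(fields) == len(header):
            valid.append(dict(zip(header, fields)))
        else:
            malformed.append((line_number, fields))
    return header, valid, malformed


# Tiny stand-in for a real file: the third line has one field too many.
sample_rows = io.StringIO("id\ttext\tlabel\n0\thello\tpos\n1\tbad\trow\tneg\n")
header, valid, malformed = split_rows(sample_rows)
print(len(valid), "valid rows;", len(malformed), "malformed rows")
```

Keeping the malformed rows in a separate list makes the number of skipped examples visible and leaves room to repair them once their cause is understood.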

So do you maybe know what I should do? Or should I add my file to the project and create a new issue? Someone who has already worked with those datasets should be able to fix it easily.

Thanks.

PetrochukM · 2020-07-04T03:57:33Z

Thanks for your attempt at contributing this function: #60 :)

karish-grover · 2021-08-29T13:30:03Z

Hey! I want to give this a try. Is there any way that I can do it still? It seems like it's too late to contribute to this project.

PetrochukM added the "enhancement", "help wanted", and "good first issue" labels on Apr 27, 2018
