Create dataset loader for CHEBI (Chapati) #113

leonweber · 2022-02-17T13:45:08Z

Task: NER
License: Creative Commons
Format: custom
Language: English
Citation: ???

Referenced and used by Habibi, Maryam, et al. "Deep learning with word embeddings improves biomedical named entity recognition." Bioinformatics.

Source: http://chebi.cvs.sourceforge.net/viewvc/chebi/chapati/

napsternxg · 2022-04-04T22:46:01Z

#self-assign

hakunanatasha · 2022-04-06T16:39:26Z

Hi @napsternxg, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8; no worries if you are not finished but intend to work on this. Please either ping me here at @hakunanatasha or ping the Discord admins (with @admins).

napsternxg · 2022-04-07T06:34:19Z

Hi @hakunanatasha yes I plan to work on this over the weekend.

napsternxg · 2022-04-11T21:34:30Z

I have started working on this dataset. I will send a PR soon.

napsternxg · 2022-04-12T15:12:08Z

Hi @hakunanatasha and @leonweber I have a few questions on how to parse the data. Code related to my questions is in: https://colab.research.google.com/drive/1Ne8A76yn0vxwKkpU7l_OzGI968B-YieJ?usp=sharing

The data is in a modified HTML format. I am able to parse it with the Beautiful Soup library, but that library is not part of our requirements file. What would be the best way to proceed? E.g., if I try to load the file via:

filepath = "./scrapbook/WO2007000651/source.xml"
reader = biocxml.BioCXMLDocumentReader(str(filepath))

I get the error:

AttributeError: 'BioCXMLDocumentReader' object has no attribute '_BioCXMLDocumentReader__document'

The data download requires CVS to be installed. How should I address this? Should I include a note about installing it, or is it better to just process the data and upload the processed data to the Hugging Face dataset hub?

jason-fries · 2022-04-19T21:53:30Z

Hi @napsternxg
Sorry about the delay in responding!

Let's remove the CVS dependency. The original gold data is open ("This work is distributed under the Creative Commons license: http://creativecommons.org/licenses/by/3.0/") so I would download the files and put them somewhere open (e.g., google drive link) and then we can eventually host the files on the biomedical community hub (see our BIOSSES example which does this).
The BioCXMLDocumentReader assumes you are using a BioC formatted file, so it won't work (that I know of) with standard or nonstandard XML files. The XML package available by default in Python might work here. If not, go ahead and use BeautifulSoup and we can discuss adding it to our supported packages.
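A minimal sketch of the standard-library route, assuming the annotations are inline elements wrapped around the entity text. The tag name `ne`, the `type` attribute, and the sample string are hypothetical placeholders, not the confirmed corpus schema, and `extract_entities` is an illustrative helper rather than anything in the repository:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the "ne" tag and "type" attribute are assumptions
# made for illustration; adjust them to whatever the real files use.
SAMPLE = (
    '<doc><p>Treatment with <ne type="CHEMICAL">aspirin</ne> '
    "reduced pain.</p></doc>"
)


def extract_entities(xml_text, tag="ne"):
    """Return (plain_text, entities) with character offsets into plain_text."""
    root = ET.fromstring(xml_text)
    parts = []     # text fragments collected in document order
    entities = []  # one dict per annotated span
    pos = 0        # running character offset into the plain text

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Text that follows a child element belongs to the parent.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        if elem.tag == tag:
            entities.append(
                {
                    "type": elem.get("type"),
                    "start": start,
                    "end": pos,
                    "text": "".join(parts)[start:pos],
                }
            )

    walk(root)
    return "".join(parts), entities


text, entities = extract_entities(SAMPLE)
print(text)
print(entities)
```

`ET.fromstring` raises `ParseError` on markup that is not well-formed XML, so if the modified HTML fails here, that is the point at which falling back to BeautifulSoup would make sense.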

napsternxg · 2022-04-21T11:36:55Z

Hi @jason-fries thanks for the response.
I will download the files and upload them somewhere.
I will try the XML parser in Python first and add BeautifulSoup only if that doesn't work.

I plan to submit it early next week.

napsternxg · 2022-04-27T05:06:05Z

Downloaded the files from CVS and am uploading them here for use. We can later move them to HF datasets and update the URL in the code.
PatentAnnotations_GoldStandard.tar.gz
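
Once the archive is available locally, reading the per-document files without unpacking everything could look roughly like this. It is a sketch that assumes the tarball keeps the `scrapbook/<patent id>/source.xml` layout seen in the CVS checkout; `iter_source_files` is a hypothetical helper name, and the archive path is whatever location the file was saved to:

```python
import tarfile


def iter_source_files(archive_path, filename="source.xml"):
    """Yield (member_name, text) for each file named `filename` in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Compare only the final path component, so any nesting depth works.
            if member.isfile() and member.name.rsplit("/", 1)[-1] == filename:
                with tar.extractfile(member) as handle:
                    # UTF-8 is an assumption; undecodable bytes are replaced.
                    yield member.name, handle.read().decode("utf-8", errors="replace")
```

Usage would be along the lines of `for name, xml_text in iter_source_files("PatentAnnotations_GoldStandard.tar.gz"): ...`, which avoids the CVS dependency entirely at load time.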

napsternxg · 2022-04-28T13:40:28Z

Added PR: #525

jason-fries added this to Biomedical Dataset Hackathon 2022 Feb 18, 2022

github-actions bot assigned napsternxg Apr 4, 2022

hakunanatasha moved this to In Progress in Biomedical Dataset Hackathon 2022 Apr 8, 2022

napsternxg added a commit to napsternxg/biomedical that referenced this issue Apr 11, 2022

Fixes bigscience-workshop#113 - Add Chebi (Chapti)

66090e3

napsternxg linked a pull request May 5, 2022 that will close this issue

Closes #113 - Add Chebi (Chapti) #525

