ERG Treebanks: summary about data #40
Comments
Here's the list of the files in the current release. I filled in the mapping where I could, but some files I cannot easily map to anything described in Flickinger 2011:
|
Can anyone help complete this table? |
OMW stands for Open Multilingual Wordnet. This is a sample of 2000 sentences drawn from the English synset definitions in @fcbond's OMW (http://compling.hss.ntu.edu.sg/omw/). As far as I know, this is mostly Princeton WordNet 3.0 with small fixes. |
I agree that we do need this documentation in the wiki. BTW, it is not always clear that WSJ is also part of the OntoNotes and PropBank datasets (see propbank/propbank-release#14). |
Is |
|
This just means, all corpora that start with "ws". |
Thank you, @oepen ! Here's the same table updated with the info from
|
CSLI is a rebranding of the legendary HP test suite: for WNB and WLB: PETE: |
Many thanks, @oepen . Let me know if you think that this table could go directly into the wiki e.g. into RedwoodsTop. |
What about adding some extra information about the size of each treebank? I am particularly interested in how many sentences we have with golden MRS. Does anyone have this number? Is there any other ERG golden analyzed treebank besides the data inside the ERG repository under In the % for f in Two profiles are 'virtual': wescience and redwoods. But redwoods mentions profiles that do not exist in the
Questions:
|
@oepen, is the CCS event the precursor of http://mrp.nlpl.eu/2020/index.php?page=14#companion? If so, what is the origin of the EDS data in the MRP datasets? Finally, there are sentences duplicated across the profiles:
some examples:
|
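For anyone who wants to reproduce the per-profile counts and the duplicate check, here is a minimal sketch using only the Python standard library. It rests on assumptions that are not confirmed in this thread: that each profile directory holds an `item` (or `item.gz`) file with `@`-separated fields, and that the sentence text is the 7th field. The names `read_items` and `find_duplicates` are hypothetical helpers, and the sketch counts all items, not only those with a fully disambiguated analysis.

```python
import gzip
from collections import defaultdict
from pathlib import Path

# Assumption: fields in the `item` file are separated by "@", the item id is
# the 1st field and the sentence text is the 7th. Check the profile's
# `relations` file if your layout differs. Escape sequences are left as-is.
I_ID, I_INPUT = 0, 6


def read_items(profile_dir):
    """Yield (item id, sentence) pairs from one profile directory."""
    profile = Path(profile_dir)
    plain, zipped = profile / "item", profile / "item.gz"
    if plain.exists():
        opener, path = open, plain
    elif zipped.exists():
        opener, path = gzip.open, zipped
    else:
        return  # no item file here, e.g. a 'virtual' profile
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("@")
            if len(fields) > I_INPUT:
                yield fields[I_ID], fields[I_INPUT]


def find_duplicates(profiles_root):
    """Return (items per profile, sentences that occur more than once)."""
    seen = defaultdict(list)
    counts = {}
    for profile in sorted(Path(profiles_root).iterdir()):
        if not profile.is_dir():
            continue
        n = 0
        for item_id, sentence in read_items(profile):
            seen[sentence.strip()].append((profile.name, item_id))
            n += 1
        counts[profile.name] = n
    duplicates = {s: where for s, where in seen.items() if len(where) > 1}
    return counts, duplicates
```

Calling `counts, duplicates = find_duplicates("path/to/profiles")` (placeholder path) gives the number of items per profile and, for each repeated sentence, the list of `(profile, item id)` pairs where it occurs. Exact string matching after stripping whitespace is a deliberate simplification; near-duplicates that differ in tokenization or punctuation would not be caught.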
Alex, the redwoods.xlsx file (which you can find in the release) has the sentence numbers! |
I found a link to the redwoods.xls file at https://github.com/delph-in/docs/wiki/RedwoodsTop, but the page points to http://svn.delph-in.net/erg/tags/1214/etc/redwoods.xls. In the etc folder on the trunk branch of the ERG repository, I found the new version of this file. If I am reading it right, we have 97,286 fully disambiguated sentences in the redwoods collection, right? That is still more than the 59,255 AMR sentences, but a less impressive number. Is this the actual number of sentences with golden MRS that we have available? What is the status of the sentences under the profiles not included in redwoods? I noticed that |
broadly speaking, i guess one could say that CCS (and a series of additional meetings in a similar spirit) was part of the build-up for the MRP shared tasks. but one could just as well say that the desire to compare different frameworks and specific analyses has been a motivating force for dan, emily, myself, and others for at least the past decade. sitting down to compare individual sentences in great depth (in the CCS spirit) is one technique we have used; the SDP and MRP shared tasks series was a different approach with some of the same underlying motivation. regarding the EDS data in MRP 2019 and 2020, it comes from the 1214 ERG release, aka DeepBank 1.1. |
yes, with the transition from the original [incr tsdb()]-based treebanking environment to FFTB, profiles became a lot smaller, seeing as only the packed forest is recorded rather than a 500-best list of full derivations for each input. that meant that dan could undo some sub-divisions of collections that logically belonged together (JH, TG, and SC). post-1214, he concatenated these profiles back together. |
So we also have DeepBank in addition to the wesearch and redwoods "virtual" profiles? According to https://github.com/delph-in/docs/wiki/DeepBank it is the |
I suggest that we expand the section about the datasets that constitute the ERG treebanks: https://github.com/delph-in/docs/wiki/RedwoodsTop
Currently, the wiki page refers the reader to Flickinger 2011, but that work is not easily available online (I don't think?). Furthermore, even if one has it, it is still not obvious how to map the datasets described there to the files in the ERG release (for some it is obvious, for others it is not).