names of the profiles in the last ERG treebanks #5

Open
arademaker opened this issue Nov 21, 2022 · 6 comments

@arademaker

arademaker commented Nov 21, 2022

This is related to delph-in/docs#40, and maybe @olzama and @danflick can add something.

The names of the ERG gold profiles in tsdb/gold have changed. The spreadsheet at http://svn.delph-in.net/erg/tags/2020/etc/redwoods.xls doesn't preserve the old names, which is pretty confusing. So, for example, wsj06c is now just wsj06, right?

How were the dev, test, and train sets defined for https://github.com/goodmami/mrs-to-penman/blob/master/convert-redwoods.sh#L8-L187? Can the new names affect the dev/test/train sets?

@arademaker
Author

arademaker commented Nov 21, 2022

I ended up with the following list:

testsuites=(
  "+csli"
  "+esd"
  "+fracas"
  "+mrs"
  "+trec"

  "?cb"
  "+ecoc"
  "+ecos"
  "=ecpa"
  "?ecpr"

  "+hike"
  "+jh"
  "?jhk"
  "?jhu"
  
  "+tg"
  "?tgk"
  "?tgu"
  "+ps"
  "?psk"
  "?psu"
  "?rondane"

  "+rtc000"
  "+rtc001"

  "+bcs"
  "+ccs"
  "+control"
  "+scm"
  "+peted"
  "?petet"

  "+pest"
  "+omw"
  "+ntucle"
  "+handp12"
  "+sh-spband-r"
  "+sh-spec"

  "+vm6"
  "+vm13"
  "+vm31"
  "?vm32"
  "+wlb03"
  "+wnb03"

  "+ws201"
  "+ws202"
  "+ws203"
  "+ws204"
  "+ws205"
  "+ws206"
  "+ws207"
  "+ws208"
  "+ws209"
  "+ws210"
  "+ws211"
  "=ws212"
  "?ws213"
  "?ws214"

  "+wsj00"
  "+wsj01"
  "+wsj02"
  "+wsj03"
  "+wsj04"
  "+wsj05"
  "+wsj06"
  "+wsj07"
  "+wsj08"
  "+wsj09"
  "+wsj10"
  "+wsj11"
  "+wsj12"
  "+wsj13"
  "+wsj14"
  "+wsj15"
  "+wsj16"
  "+wsj17"
  "+wsj18"
  "+wsj19"
  "=wsj20"
  "?wsj21"
  "+wsj23"
)
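
To check how many profiles carry each marker, the list can be grouped by its leading character. This is a minimal sketch in Python that uses only a handful of entries copied from the list above. The helper name `group_by_marker` is hypothetical, and the meaning of each marker is whatever convert-redwoods.sh defines; it is not restated here.

```python
from collections import defaultdict

def group_by_marker(entries):
    """Group 'Xname' entries by their one-character marker X."""
    groups = defaultdict(list)
    for entry in entries:
        marker, name = entry[0], entry[1:]
        groups[marker].append(name)
    return dict(groups)

# A few entries copied from the list above (not the full list).
sample = ["+csli", "?cb", "=ecpa", "+wsj19", "=wsj20", "?wsj21", "+wsj23"]
groups = group_by_marker(sample)

# Print one line per marker: the marker, the profile count, the names.
for marker, names in sorted(groups.items()):
    print(marker, len(names), " ".join(names))
```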

@arademaker
Author

arademaker commented Nov 21, 2022

In https://arxiv.org/pdf/1904.11564.pdf, you wrote:

About half of the training data comes from the Wall Street Journal (sections 00-21), while the rest spans a range of domains, including Wikipedia, e-commerce dialogues, tourism brochures, and the Brown corpus. The data is split into training, development and test sets with 72,190, 5,288, and 10,201 sentences, respectively.

Once the script had finished, I counted the graphs with:

(venv) ar@tenis mrs-to-penman % rg "^\(\)"  | wc -l
    2297
(venv) ar@tenis mrs-to-penman % rg "^\([0-9]"  | wc -l
   69319

So I am missing (72190+5288+10201)-69319 = 18,360 sentences...
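
The arithmetic above can be double-checked with a few lines of Python. The split sizes are the ones quoted from the paper and the graph counts are the two `rg` results; the second figure assumes that the 2,297 lines matching `^\(\)` are graphs that were attempted but came out empty, which the thread suggests but does not spell out.

```python
# Split sizes quoted from the paper, and the counts reported by rg above.
paper_train, paper_dev, paper_test = 72_190, 5_288, 10_201
converted_graphs = 69_319   # lines matching ^\([0-9]
empty_graphs = 2_297        # lines matching ^\(\)  (assumed: empty graphs)

paper_total = paper_train + paper_dev + paper_test
missing = paper_total - converted_graphs
missing_counting_empty = paper_total - (converted_graphs + empty_graphs)

print(paper_total, missing, missing_counting_empty)
```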

The item files in the profiles sum up to:

% find profiles -name 'item.*' | xargs gzcat | wc -l
  131401

@goodmami
Owner

How were the dev, test, and train sets defined for https://github.com/goodmami/mrs-to-penman/blob/master/convert-redwoods.sh#L8-L187?

These were taken from the redwoods.xls file linked in the comment above the code you linked to here.

Regarding the new distribution of the Redwoods 2020 data, I don't really know what changed or why, so I cannot comment on your proposed list.

Regarding the counts, a few things:

  • The number of lines in the item files is not always a good indicator of the number of items. Some items may be marked to be ignored (e.g., when they contain non-linguistic data scraped from a web page, like a table of numbers). You should keep only those where i-wf is 1.
  • MRSs that could not be converted to DMRS were dropped (possibly because the MRS was ill-formed).
  • Duplicate MRSs were dropped (as noted in the appendix of https://aclanthology.org/N19-1235/).

These may account for the discrepancies you saw.
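
The i-wf filter from the first point can be sketched with the Python standard library alone. This is an assumption-laden sketch, not the method used by the conversion script: it assumes each profile directory holds a `relations` file that lists table names at column 0 followed by indented column definitions, and an `item` (or `item.gz`) file with one row per line and `@` between fields. The function names are hypothetical.

```python
import gzip
from pathlib import Path

def read_schema(relations_path):
    """Map each table name to its ordered list of column names.

    Assumes the layout 'table:' at column 0, then indented
    'column-name :type ...' lines.
    """
    schema, table = {}, None
    for line in Path(relations_path).read_text().splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if not line[0].isspace() and stripped.endswith(":"):
            table = stripped[:-1]
            schema[table] = []
        elif table is not None:
            schema[table].append(stripped.split()[0])
    return schema

def read_rows(profile_dir, table):
    """Yield the rows of a table as lists of fields ('table' or 'table.gz')."""
    plain = Path(profile_dir) / table
    zipped = Path(profile_dir) / (table + ".gz")
    if zipped.exists():
        handle = gzip.open(zipped, "rt", encoding="utf-8")
    else:
        handle = open(plain, encoding="utf-8")
    with handle:
        for line in handle:
            yield line.rstrip("\n").split("@")

def count_wellformed(profile_dir):
    """Count the items whose i-wf field is exactly 1."""
    columns = read_schema(Path(profile_dir) / "relations")["item"]
    wf = columns.index("i-wf")
    return sum(1 for row in read_rows(profile_dir, "item") if row[wf] == "1")
```

Looking up the position of i-wf in the `relations` file, rather than hard-coding a column number, keeps the sketch usable if the item schema differs between profiles.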

@arademaker
Author

arademaker commented Nov 22, 2022

Sorry, I was reading the profile inputs, but I should have read the results:

% find profiles -name 'item.*' | xargs gzcat | wc -l
  131401
% find profiles -name 'result.*' | xargs gzcat | wc -l
   98924
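
A per-profile breakdown of the same counts can help locate where the numbers diverge. The sketch below is a Python counterpart of the `find ... | xargs gzcat | wc -l` pipeline, under the assumption that each profile is a directory containing the table as either a plain or a gzipped file; `rows_per_profile` is a hypothetical helper name.

```python
import gzip
from pathlib import Path

def count_lines(path):
    """Count the lines of a plain or gzip-compressed text file."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)

def rows_per_profile(root, table):
    """Map each profile directory name to the line count of one table."""
    counts = {}
    for path in sorted(Path(root).rglob(table + "*")):
        # Accept only 'table' or 'table.gz', not other files sharing the prefix.
        if path.name in (table, table + ".gz"):
            counts[path.parent.name] = count_lines(path)
    return counts
```

For example, `rows_per_profile("profiles", "result")` and `rows_per_profile("profiles", "item")` give two dictionaries that can be compared profile by profile.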

@arademaker
Author

I have already counted the cases of possibly invalid MRSs: that is my 2297 above.

@goodmami
Owner

Ah, yes, the result file is better because of course some items won't get a parse. Good catch.
