SyntaxGym redundant suites? #184

Open
mschrimpf opened this issue May 23, 2023 · 8 comments
Labels
question Further information is requested

Comments

@mschrimpf
Member

Is there a reason for specifying both the GitHub links and keeping the file contents themselves for the test suites?

@mschrimpf mschrimpf added the question Further information is requested label May 23, 2023
@jimn2
Contributor

jimn2 commented May 23, 2023

This is a result of Jon G and me doing it two different ways. Jon stored the file contents locally in the BrainScore repo and used them directly, but that struck me as not very flexible, so I later set it up to read the data externally using URLs. I left both in, figuring that other users would then have examples of how to do it either way. Technically, though, either method could be removed.
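
For illustration, the two approaches described above (reading a suite stored in the repo versus fetching it from a URL) could share a single loader. This is only a rough sketch and not the code in the repo: `load_suite` is a hypothetical helper name, and it assumes the suites are JSON files.

```python
import json
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def load_suite(source: str) -> dict:
    """Load a JSON test suite from a local path or from a URL.

    Hypothetical helper for illustration; assumes the suite is a JSON file.
    """
    scheme = urlparse(source).scheme
    if scheme in ("http", "https", "file"):
        # URL source: fetch the bytes, then parse them as JSON.
        with urlopen(source) as response:
            return json.loads(response.read().decode("utf-8"))
    # Anything else is treated as a path on the local filesystem.
    return json.loads(Path(source).read_text(encoding="utf-8"))
```

With something like this, keeping only one storage method would mostly be a question of which `source` strings the benchmark definitions pass in.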

@mschrimpf
Member Author

mschrimpf commented May 23, 2023

which one would you say is preferable? I think it would be clearer to have one consistent way. We can always re-introduce the secondary option if necessary, but right now this feels like YAGNI.

@jimn2
Contributor

jimn2 commented May 23, 2023

There are advantages and disadvantages to both ways, but my gut feeling was that using URLs is more general and also avoids the possible headache of somebody trying to store huge amounts of data in our repo.

@jimn2
Contributor

jimn2 commented May 23, 2023

I thought about this some more, and I think if we are only going to keep one, we should keep the one that best matches how most of the other benchmarks work. Certainly storing the data locally is simpler, and accessing external data is an added feature that maybe we ain't gonna need.

How is the data stored and accessed with most of the other benchmarks? If you let me know, I'll fix this today.

@mschrimpf
Member Author

Mostly the data is on S3 or some server, so the URL access you built is likely the most aligned with that, especially because you're already pointing to a specific version of the data, so the files are not going to change on us. Although maybe we want to add some checksums to verify integrity?
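
As a rough sketch of what such a checksum could look like: hash the downloaded bytes and compare against a digest recorded when the version was pinned. The helper names `sha256_hex` and `verify_checksum` are hypothetical and not part of the repo; SHA-256 is an assumed choice of hash.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_checksum(data: bytes, expected_sha256: str) -> bytes:
    """Return data unchanged if its digest matches, otherwise raise ValueError.

    Hypothetical helper: expected_sha256 would be recorded once for the
    pinned version of a file and stored next to its URL.
    """
    actual = sha256_hex(data)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )
    return data
```

The digest would be computed once from a trusted copy of the file and then checked after every download, before the contents are parsed.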

@jimn2
Contributor

jimn2 commented May 23, 2023

Aren't the specific values given in test_integration.py essentially acting as checks against these data files being changed? (I know it's possible to cleverly change those files and still get the same score, but it's unlikely.)

@mschrimpf
Member Author

true!

@jimn2
Contributor

jimn2 commented May 23, 2023

Thought about this a little more. The way the SyntaxGym benchmarks work, the resulting scores are very discrete and not unique. You can definitely change multiple tokens and still end up with the same score. So no matter what kind of checksums we add, there will almost always be the possibility of a slightly changed data file not being caught. What's there right now in test_integration.py is about as good as we probably want to do (anything more effective would cost a lot).
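
To make the point about discrete, non-unique scores concrete, here is a toy sketch. It is not the benchmark code: `suite_score`, the field names, and all numbers are made up for illustration, and the score is assumed to be the fraction of items for which a simple inequality holds.

```python
def suite_score(items):
    """Toy score: fraction of items where the ungrammatical value is higher.

    Hypothetical stand-in for a prediction-accuracy style score; the
    field names and values below are invented for illustration.
    """
    hits = sum(1 for item in items if item["ungrammatical"] > item["grammatical"])
    return hits / len(items)


original = [
    {"grammatical": 2.1, "ungrammatical": 5.4},
    {"grammatical": 3.0, "ungrammatical": 2.5},
    {"grammatical": 1.2, "ungrammatical": 4.0},
    {"grammatical": 2.2, "ungrammatical": 2.9},
]

# Two values are changed, but no inequality flips, so the score is identical.
altered = [
    {"grammatical": 2.1, "ungrammatical": 4.9},
    {"grammatical": 3.0, "ungrammatical": 2.5},
    {"grammatical": 1.2, "ungrammatical": 4.0},
    {"grammatical": 2.6, "ungrammatical": 2.9},
]

print(suite_score(original), suite_score(altered))
```

A test that pins the expected score would pass on both versions of the data, which is the kind of undetected change described above.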

2 participants