huggingface Dataset #2

Open
hunoutl opened this issue Jul 22, 2024 · 1 comment

Comments

@hunoutl

hunoutl commented Jul 22, 2024

Thank you for this repo, which saved me time when doing a quick analysis of CVPR24.
To give back to the community, I made a first draft of what paperlists could look like in Hugging Face datasets format: https://huggingface.co/datasets/hunoutl/paperlists

It is a raw dataset. I kept most of the keys and applied them to all of the papers. Do you have any ideas on how everything could be standardized?

For now I have some simple merging code. I will try to find time to clean it up and share it.
I also generated synthetic data for CVPR, using an LLM to fill in missing information and add new fields (such as country of affiliation). I'm thinking of doing the same for all the papers in the future.
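
One possible way to approach the standardization question above is to take the union of keys across all conferences and fill missing values with `None`, so that every row ends up with the same columns. This is only a sketch under assumptions: the function name `harmonize`, the `source_conference` column, and the sample fields are hypothetical and are not taken from the actual merging code or from paperlists.

```python
from typing import Any


def harmonize(records_by_conf: dict[str, list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Merge per-conference records so that every row has the same keys."""
    # Collect the union of keys, keeping first-seen order for stable columns.
    keys: list[str] = []
    for records in records_by_conf.values():
        for rec in records:
            for key in rec:
                if key not in keys:
                    keys.append(key)

    merged = []
    for conf, records in records_by_conf.items():
        for rec in records:
            # Missing keys become None instead of being dropped.
            row = {key: rec.get(key) for key in keys}
            row["source_conference"] = conf  # hypothetical provenance column
            merged.append(row)
    return merged


# Hypothetical sample data; real paperlists fields may differ.
sample = {
    "cvpr2024": [{"title": "Paper A", "author": "X", "aff": "Lab 1"}],
    "iclr2024": [{"title": "Paper B", "author": "Y", "rating": "6;8"}],
}
rows = harmonize(sample)
print(rows[1])
```

Because every row then shares the same keys, a list like `rows` should be loadable with `datasets.Dataset.from_list(rows)`, although column types may still need cleaning per field.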

@jingyangcarl
Contributor

jingyangcarl commented Jul 22, 2024

Hi @hunoutl

Thanks for using this papercopilot/paperlist repository. It's great to have it hosted on Hugging Face (huggingface/papercopilot) as well. I actually started an organization on Hugging Face, but I haven't posted anything there yet, lol.

I've also spent some time thinking about whether we can standardize everything during development, and I believe we can. This paper list is powered by papercopilot/paperbot and is currently organized into modules by conference.

I used to put all conference papers into a large data table and use the title as the key. However, there's a chance that papers could share the same title, making it difficult to identify missing papers. Therefore, I split them from the big standard output into smaller shards.
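
The title-collision problem described above can be illustrated with a small sketch. It assumes each paper is a dict with hypothetical `conference`, `year`, and `title` fields; the helper names are made up for illustration and are not part of paperbot.

```python
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(title.lower().split())


def title_collisions(papers: list[dict]) -> dict[str, list[dict]]:
    """Return normalized titles that are shared by more than one paper."""
    by_title = defaultdict(list)
    for paper in papers:
        by_title[normalize_title(paper["title"])].append(paper)
    return {t: ps for t, ps in by_title.items() if len(ps) > 1}


def shard_by_conference(papers: list[dict]) -> dict[tuple, list[dict]]:
    """Split one big table into per-(conference, year) shards."""
    shards = defaultdict(list)
    for paper in papers:
        shards[(paper["conference"], paper["year"])].append(paper)
    return dict(shards)


# Hypothetical records: two different papers that share a title.
papers = [
    {"conference": "cvpr", "year": 2024, "title": "Deep  Learning"},
    {"conference": "iclr", "year": 2024, "title": "deep learning"},
    {"conference": "iclr", "year": 2024, "title": "Another Paper"},
]
collisions = title_collisions(papers)
shards = shard_by_conference(papers)
print(sorted(collisions))
print({key: len(value) for key, value in shards.items()})
```

In one big table keyed by title, the first two records would overwrite each other; inside per-conference shards they stay distinct, which makes it easier to check each shard for missing papers.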

Still, it would be good for paperbot to produce a standardized output at export time, which would make it an easier tool to use.

Your contributions are welcome.

Best,
Jing
