The largest data set so far of Devfolio hackathons, sponsors, projects and much more, scraped from 60K+ Devfolio pages and hackathons.
The data set is huge. The Raw Datasets folder has two files: one with all external links and another with all internal links.
For specific use cases, this needs filtering and cleaning.
- Finding sponsors
- Building AI tools to find sponsors
- Exploring Projects being built on Devfolio
- Analysing patterns in Projects
- Tech trends among participants
- Whatever else you can think of...
- raw/ External Links => Contains all links that go outside of Devfolio, including projects, sponsors, etc.
- raw/ Internal Links => Contains all Devfolio internal links, such as Devfolio hackathons, etc.
- raw/ Site Structure => Has a lot of data specific to the Devfolio site structure, along with user profiles and descriptions. What would you do with it? IDK!
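
The internal/external split described above can be reproduced from any list of URLs. The following is a minimal sketch, not the script used to build the raw files: it assumes that "internal" means the host is `devfolio.co` or one of its subdomains, and the function names `is_internal` and `split_links` are made up for illustration.

```python
from urllib.parse import urlparse


def is_internal(url: str, site: str = "devfolio.co") -> bool:
    """Return True if the URL's host is `site` or a subdomain of it.

    `site` defaults to devfolio.co, which is an assumption of this sketch.
    URLs without a scheme have no parsed hostname and count as external.
    """
    host = (urlparse(url).hostname or "").lower()
    return host == site or host.endswith("." + site)


def split_links(urls):
    """Split an iterable of URLs into (internal, external) lists."""
    internal, external = [], []
    for url in urls:
        (internal if is_internal(url) else external).append(url)
    return internal, external
```

Matching on the parsed hostname rather than on a substring avoids treating look-alike domains (for example `notdevfolio.co`) as internal.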
- Create a dataset of Sponsors on Devfolio
- Filter all outlinks
- Remove all social media links
- Remove all web hosting service links
- Remove all readme file links
- Remove all GitHub links
- Remove duplicates
- Finally, run all the domains against the WHOIS database and drop any domain that is younger than 8 months or does not exist.
- We have a clean database 🎉
- Create a dataset of all Hackathons on Devfolio
- Create a dataset of sponsor frequency on Devfolio (top sponsors, etc.)
- Create a dataset of all Projects on Devfolio
- Use Google Data Studio to create nice graphs
- Create a dataset of sponsors along with their emails and points of contact.
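
The sponsor-cleaning steps listed above (drop social media, web hosting, readme and GitHub links, then deduplicate) could be sketched roughly as follows. This is only an illustration under stated assumptions: the blocklists are small made-up samples rather than the project's real lists, the helper names (`domain_of`, `is_blocked`, `sponsor_candidates`) are hypothetical, and the WHOIS age check is left out because it needs a network lookup.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative blocklists only (assumptions); extend them for real use.
SOCIAL = {"twitter.com", "facebook.com", "linkedin.com", "instagram.com", "youtube.com"}
HOSTING = {"netlify.app", "vercel.app", "herokuapp.com", "github.io"}
CODE_HOSTS = {"github.com"}
BLOCKED = SOCIAL | HOSTING | CODE_HOSTS


def domain_of(url: str) -> str:
    """Lower-cased hostname of a URL with a leading 'www.' removed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blocked(domain: str) -> bool:
    """True if the domain is a blocked domain or a subdomain of one."""
    return any(domain == b or domain.endswith("." + b) for b in BLOCKED)


def sponsor_candidates(urls) -> Counter:
    """Count how often each remaining domain appears in `urls`.

    The keys are the deduplicated candidate domains; the values are
    how many links pointed at each domain.
    """
    counts = Counter()
    for url in urls:
        domain = domain_of(url)
        if not domain or is_blocked(domain):
            continue
        # Treat any path whose last segment starts with "readme" as a readme link.
        last_segment = urlparse(url).path.rsplit("/", 1)[-1].lower()
        if last_segment.startswith("readme"):
            continue
        counts[domain] += 1
    return counts
```

Returning a `Counter` covers two of the items above at once: `list(counts)` gives the deduplicated domains, and `counts.most_common(10)` gives a first approximation of sponsor frequency.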
I am accepting contributions for this project and need help with cleaning and filtering data.
To contribute:
- Fork this project
- Pick one of the "Todo" items from above
- Check if a PR for your "Todo" already exists; if it does, either work with its author or choose another "Todo"
- Create a PR
- I will review and merge
The raw data, any dataset and all contributions to this project are released under GNU GPLv3. By contributing to or using this project you accept the GNU GPLv3 license.
If you use this dataset for any project, please mention this GitHub repo or me.