This repo collates a list of websites I've scraped, either for open source contributions (e.g., sourcing a Malaysian text dataset for fine-tuning Llama 2) or for my own personal practice/use.
I'm also hoping that this repo serves as a benchmark for my code quality over time. 🤣
- https://theedgemalaysia.com/
- https://timchew.net/
- https://techrakyat.com/
- https://mat-gaming.com/
- https://www.leaazleeya.com/
- https://www.bikesrepublic.com/
- https://en.wikipedia.org/wiki/Road_signs_in_Malaysia

The corresponding scraped datasets are published on Hugging Face:

- https://huggingface.co/datasets/wanadzhar913/crawl-theedgemalaysia
- https://huggingface.co/datasets/wanadzhar913/crawl-timchew
- https://huggingface.co/datasets/wanadzhar913/crawl-techrakyat
- https://huggingface.co/datasets/wanadzhar913/crawl-mat-gaming
- https://huggingface.co/datasets/wanadzhar913/crawl-leaazleeya
- https://huggingface.co/datasets/wanadzhar913/crawl-bikesrepublic
- https://huggingface.co/datasets/wanadzhar913/wikipedia-malaysian-road-sign-images
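
For anyone who wants to pull one of these datasets into Python, here is a minimal sketch. It assumes the Hugging Face `datasets` library is installed and that each dataset loads with its default configuration, which I haven't verified for every entry. The helper names `dataset_repo_id` and `load_crawl` are hypothetical and exist only for this illustration.

```python
# Account name taken from the Hugging Face dataset URLs listed above.
HF_USER = "wanadzhar913"


def dataset_repo_id(name: str) -> str:
    """Build a Hub repo id, e.g. 'wanadzhar913/crawl-timchew' (hypothetical helper)."""
    return f"{HF_USER}/{name}"


def load_crawl(name: str):
    """Download one of the datasets from the Hub (hypothetical helper).

    Assumption: the dataset loads with its default configuration.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(dataset_repo_id(name))
```

Calling `load_crawl("crawl-timchew")` should then fetch that dataset over the network; the available splits and columns depend on how each dataset was uploaded.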