-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1
Comments
ShareGPT Dataset: Zipped jsons with 90 000 conversations from sharegpt. Split in two files with 45k each: Format should work as is for training. Use clean tool to remove html markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md (Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.) The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset A pre-cleaned, English only, "unfiltered," and 2048 token split version of the ShareGPT dataset ready for finetuning is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered |
ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): |
Nice, em có thể cung cấp dữ liệu dưới dạng nguồn riêng lẻ để mix & match cho dễ đc ko? |
Dạ tại đây ạ: https://drive.google.com/drive/folders/156yLw2lZHMu6W4rnEIMNhXx5BLqRz_Yu?usp=sharing |
Kho dữ liệu huấn luyện chỉ dẫn, QnA, hội thoại ...
Dữ liệu nổi bật
Khác
The text was updated successfully, but these errors were encountered: