Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1

Open
2 of 7 tasks
tiendung opened this issue Mar 31, 2023 · 5 comments
Open
2 of 7 tasks

Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1

tiendung opened this issue Mar 31, 2023 · 5 comments
Assignees

Comments

@tiendung tiendung self-assigned this Mar 31, 2023
@tiendung
Copy link
Contributor Author

tiendung commented Apr 3, 2023

lm-sys/FastChat#90 (comment)

ShareGPT Dataset:

Zipped jsons with 90 000 conversations from sharegpt. Split in two files with 45k each:
part 1: https://files.catbox.moe/bhtp9i.zip
part 2: https://files.catbox.moe/ahoivx.zip

Format should work as is for training. Use clean tool to remove html markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md

(Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.)

The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset

A pre-cleaned, English only, "unfiltered," and 2048 token split version of the ShareGPT dataset ready for finetuning is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

@trinhdoduyhungss
Copy link

trinhdoduyhungss commented Apr 3, 2023

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+):
v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing
v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

tiendung pushed a commit that referenced this issue Apr 4, 2023
@tiendung
Copy link
Contributor Author

tiendung commented Apr 4, 2023

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

Nice, em có thể cung cấp dữ liệu dưới dạng nguồn riêng lẻ để mix & match cho dễ đc ko?

@trinhdoduyhungss
Copy link

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

Nice, em có thể cung cấp dữ liệu dưới dạng nguồn riêng lẻ để mix & match cho dễ đc ko?

Dạ tại đây ạ: https://drive.google.com/drive/folders/156yLw2lZHMu6W4rnEIMNhXx5BLqRz_Yu?usp=sharing

@tiendung
Copy link
Contributor Author

tiendung commented Apr 12, 2023

@tiendung tiendung changed the title Thêm dữ liệu Làm dữ liệu tốt hơn (GPT4, ShareGPT, Dolly 2.0 ...) Apr 15, 2023
@tiendung tiendung changed the title Làm dữ liệu tốt hơn (GPT4, ShareGPT, Dolly 2.0 ...) Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) Apr 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants