Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1

tiendung · 2023-03-31T12:53:37Z

Kho dữ liệu huấn luyện chỉ dẫn, QnA, hội thoại ...

Dữ liệu nổi bật

SharedGPT
https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
https://github.com/teknium1/GPTeacher
https://github.com/project-baize/baize-chatbot/tree/main/data dữ liệu multiple dialog (đã dịch quora)
https://github.com/lightaime/camel#data-hosted-on-hugging-face sinh bởi gpt-4
https://github.com/anthropics/hh-rlhf
https://huggingface.co/datasets/balisujohn/RWKV-oasst1

Khác

tiendung · 2023-04-03T08:40:50Z

ShareGPT Dataset:

Zipped jsons with 90 000 conversations from sharegpt. Split in two files with 45k each:
part 1: https://files.catbox.moe/bhtp9i.zip
part 2: https://files.catbox.moe/ahoivx.zip

Format should work as is for training. Use clean tool to remove html markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md

(Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.)

The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset

A pre-cleaned, English only, "unfiltered," and 2048 token split version of the ShareGPT dataset ready for finetuning is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

trinhdoduyhungss · 2023-04-03T15:53:07Z

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+):
v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing
v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

feat: split vi_alpaca

tiendung · 2023-04-04T08:50:21Z

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

Nice, em có thể cung cấp dữ liệu dưới dạng nguồn riêng lẻ để mix & match cho dễ đc ko?

trinhdoduyhungss · 2023-04-05T01:57:24Z

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

Nice, em có thể cung cấp dữ liệu dưới dạng nguồn riêng lẻ để mix & match cho dễ đc ko?

Dạ tại đây ạ: https://drive.google.com/drive/folders/156yLw2lZHMu6W4rnEIMNhXx5BLqRz_Yu?usp=sharing

tiendung · 2023-04-12T09:53:46Z

https://www.chatorg.ai/blog/chat-language-models-tracker

tiendung self-assigned this Mar 31, 2023

tiendung pushed a commit that referenced this issue Apr 4, 2023

Merge pull request #1 from stbhcm/feat/split_vi_alpaca

c444b8b

feat: split vi_alpaca

tiendung changed the title ~~Thêm dữ liệu~~ Làm dữ liệu tốt hơn (GPT4, ShareGPT, Dolly 2.0 ...) Apr 15, 2023

tiendung changed the title ~~Làm dữ liệu tốt hơn (GPT4, ShareGPT, Dolly 2.0 ...)~~ Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) Apr 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1

Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1

tiendung commented Mar 31, 2023 •

edited

Loading

tiendung commented Apr 3, 2023

trinhdoduyhungss commented Apr 3, 2023 •

edited

Loading

tiendung commented Apr 4, 2023

trinhdoduyhungss commented Apr 5, 2023

tiendung commented Apr 12, 2023 •

edited

Loading

Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1

Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1

Comments

tiendung commented Mar 31, 2023 • edited Loading

Dữ liệu nổi bật

Khác

tiendung commented Apr 3, 2023

trinhdoduyhungss commented Apr 3, 2023 • edited Loading

tiendung commented Apr 4, 2023

trinhdoduyhungss commented Apr 5, 2023

tiendung commented Apr 12, 2023 • edited Loading

tiendung commented Mar 31, 2023 •

edited

Loading

trinhdoduyhungss commented Apr 3, 2023 •

edited

Loading

tiendung commented Apr 12, 2023 •

edited

Loading