Any easier/faster way to obtain data? #8

piee-kun · 2023-03-21T23:19:05Z

I'm trying to use a dataset of 100k of my own messages among 3.5mil of other messages. The data collection takes over 5hr, but I'm not letting it finish due to time constraints. Is there any faster way to obtain training data as opposed to mining it from the server itself?

CakeCrusher · 2023-03-22T15:12:15Z

@piee-kun
Off the top of my head I know of one alternative, which is basically a web scraper, but I would imagine that is even slower (I'll link this solution later today).
Another thing you could do is request (download) your data from Discord itself (assuming they offer it; I'll look into that later today), preprocess it as necessary, and reformat it so you can train on your own data. The video in the README shows how that works.

piee-kun · 2023-03-22T15:27:51Z

I already have both my data and a DiscordChatExporter capture of the entire 3.5mil message channel. Can that be used?

CakeCrusher · 2023-03-22T20:27:49Z

@piee-kun
Yes, it can be used, but not out of the box. You need to format the data to fulfill these requirements:

Create a messages.csv file with two columns: author_id (the ID of the user) and content (the user's message).
The rows must be ordered from newest to oldest, so that the newest message is at index 0.
Then run the forge command and set it up as an "individually ran bot" (which will require your messages file).
Fill in the rest of the data.

The pipeline will then train the model with your messages file.
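
As a rough illustration of the formatting requirements above, here is a minimal sketch that turns a JSON export into messages.csv. It assumes the export has a top-level "messages" list whose entries carry "id", "content", and "author" -> "id" keys (believed to match DiscordChatExporter's JSON output, but check your own file). The function name and the skip_empty option are hypothetical and not part of Mimicbot. Ordering relies on Discord message IDs increasing over time, so sorting by ID in descending order puts the newest message at index 0.

```python
import csv
import json
from pathlib import Path


def export_to_messages_csv(export_path, out_path="messages.csv", skip_empty=True):
    """Write author_id/content rows, newest first. Returns the number of rows written."""
    data = json.loads(Path(export_path).read_text(encoding="utf-8"))

    # Assumed layout: {"messages": [{"id": ..., "content": ..., "author": {"id": ...}}]}
    messages = data["messages"]

    # Discord IDs are time-ordered, so a descending sort yields newest to oldest.
    messages = sorted(messages, key=lambda m: int(m["id"]), reverse=True)

    written = 0
    # newline="" lets the csv module handle line breaks inside message content.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["author_id", "content"])
        writer.writeheader()
        for message in messages:
            content = message.get("content", "")
            if skip_empty and not content.strip():
                continue  # e.g. attachment-only messages with no text
            writer.writerow({"author_id": message["author"]["id"], "content": content})
            written += 1
    return written
```

Calling export_to_messages_csv("export.json") would write messages.csv in the current directory. Whether empty messages should be dropped is a judgment call; pass skip_empty=False to keep them.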

piee-kun · 2023-03-22T20:33:21Z

Thanks so much for the info and for the time out of your day in order to help me. I'll get back to you with the results.

piee-kun · 2023-03-27T20:11:59Z

Well, I've formatted the data properly, and now I'm stuck on step 2 of initializing. I left it running for basically the entire day and nothing came of it.

CakeCrusher · 2023-03-28T22:20:29Z

@piee-kun I'm sorry to hear that. Could you please share the logs and anything else you think may be useful to figure out what is wrong?

In its current state Mimicbot is not very fault tolerant, so I can imagine a couple of things that could have gone wrong, but your logs will help a lot.

piee-kun · 2023-03-29T10:59:55Z

Where are the logs stored? I see none generated.

piee-kun · 2023-03-30T20:37:07Z

Okay, scratch that, I got to training and it's doing that.

And when it does finish I get this error: huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96:

CakeCrusher · 2023-04-06T22:57:14Z

@piee-kun Hi, sorry for the late reply. The loading bar is unfortunately broken and will never move, even though training is actually making progress.

That is due to how tqdm is used in the train.py file, in places like the following:

mimicbot/mimicbot_cli/train.py, lines 436 to 437 in 039c95b:

    epoch_iterator = tqdm(
        train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])

The error is telling you that on the first save, after the first epoch, the repo name is invalid. I am not sure whether you explicitly set the repo name. If you did, try changing the name to satisfy the requirements in the error. If you did not, I suggest explicitly setting a name (it is prompted for at the beginning of the forge command). Please let me know whether you explicitly set the name (I may need to update the repo if you didn't).
d8ahazard/sd_dreambooth_extension#626
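
Since the error only shows up at the first save, it may be worth checking a candidate repo name locally before starting a long training run. The sketch below simply encodes the rules as quoted in the error message; it is not huggingface_hub's own validator, and the function name is hypothetical. It checks the bare repo name only, not a "user/name" style ID.

```python
import re


def repo_name_problems(name):
    """Return a list of rule violations for a repo name (empty list means it looks OK)."""
    if not name:
        return ["name is empty"]

    problems = []
    # Rule: max length is 96.
    if len(name) > 96:
        problems.append("longer than 96 characters")
    # Rule: alphanumeric chars or '-', '_', '.' only.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", name):
        problems.append("contains characters other than alphanumerics, '-', '_', '.'")
    # Rule: '--' and '..' are forbidden.
    if "--" in name or ".." in name:
        problems.append("contains '--' or '..'")
    # Rule: '-' and '.' cannot start or end the name.
    if name[0] in "-." or name[-1] in "-.":
        problems.append("starts or ends with '-' or '.'")
    return problems


if __name__ == "__main__":
    for candidate in ["mimicbot-piee", "my bot", "-bot", "a..b"]:
        print(candidate, repo_name_problems(candidate) or "looks valid")
```

A name with spaces or other punctuation (a Discord display name, for example) would fail the character rule, which is one plausible way to end up with this error.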

Let me know how it goes.
