Any easier/faster way to obtain data? #8

Open
piee-kun opened this issue Mar 21, 2023 · 9 comments
Comments

@piee-kun

I'm trying to use a dataset of 100k of my own messages among 3.5mil of other messages. The data collection takes over 5hr, but I'm not letting it finish due to time constraints. Is there any faster way to obtain training data as opposed to mining it from the server itself?

@CakeCrusher
Owner

@piee-kun
Off the top of my head I know of one alternative, which is basically a web scraper, but I would imagine that is even slower (I'll link this solution later today).
Another thing you could do is request (download) your data from Discord itself (assuming they offer it; I'll look into that later today), preprocess it as necessary, and reformat it so you can train on your own data. The video in the README shows how that works.

@piee-kun
Author

I already have both my own data and a DiscordChatExporter capture of the entire 3.5mil-message channel. Can that be used?

@CakeCrusher
Owner

@piee-kun
Yes, it can be used, but not out of the box. You need to format the data to meet these requirements:

  1. Create a messages.csv file with two columns: author_id (the ID of the user) and content (the user's message).
  2. Order the rows from newest to oldest, so that the newest message is at index 0.
  3. Run the forge command and set it as "individually ran bot" (which will require your messages file). [image]
  4. Fill in the rest of the data. [image]

The pipeline will then train the model with your messages file.
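
For anyone starting from a DiscordChatExporter CSV, here is a minimal sketch of the conversion described above. It is not part of Mimicbot. The input column names (AuthorID, Content), the oldest-first order of the export, and the header row in the output are all assumptions, so check them against your own files. The function name convert_export is hypothetical.

```python
import csv


def convert_export(src_path, dst_path, author_col="AuthorID", content_col="Content"):
    """Write a two-column messages file (author_id, content), newest message first.

    Assumes the export lists messages oldest first, so reversing the rows
    puts the newest message at index 0. Column names are assumptions.
    """
    # utf-8-sig tolerates a byte-order mark at the start of the export, if any.
    with open(src_path, newline="", encoding="utf-8-sig") as src:
        rows = [(row[author_col], row[content_col]) for row in csv.DictReader(src)]

    rows.reverse()  # oldest-first -> newest-first

    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["author_id", "content"])  # header row is an assumption
        writer.writerows(rows)
    return len(rows)
```

Calling convert_export("export.csv", "messages.csv") returns the number of messages written. If your export is already newest first, drop the reverse step.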

@piee-kun
Author

Thanks so much for the info and for the time out of your day in order to help me. I'll get back to you with the results.

@piee-kun
Author

Well, I've formatted the data properly and now I'm stuck on step 2 of initializing. I left it running for basically the entire day and nothing came of it.

@CakeCrusher
Owner

@piee-kun I'm sorry to hear that. Could you please share the logs and anything else you think may be useful for figuring out what is wrong?

In its current state Mimicbot is not very fault tolerant, so I can imagine a couple of things that could go wrong, but your logs will help a lot.

@piee-kun
Author

Where are the logs stored? I see none generated.

@piee-kun
Author

piee-kun commented Mar 30, 2023

Okay scratch that, I got to training and it's doing that.
image

And when it does finish, I get this error: huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96:

@CakeCrusher
Owner

CakeCrusher commented Apr 6, 2023

@piee-kun Hi, sorry for the late reply. Unfortunately the loading bar is broken and will never move, even though training is actually making progress.

That is due to how tqdm is used in the train.py file, in places like the following:

epoch_iterator = tqdm(
    train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]
)

The error is telling you that, on the first save after the first epoch, the repo name is invalid. I am not sure whether you explicitly set the repo name. If you did, try changing the name to satisfy the requirements in the error. If you did not, I suggest explicitly setting a name (it is prompted for at the beginning of the forge command). Please let me know whether you explicitly set the name (I may need to update the repo if you didn't).
d8ahazard/sd_dreambooth_extension#626
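
To check a candidate name before starting a long training run, something like the sketch below may help. It only mirrors the rules quoted in the error message above and is not the huggingface_hub validator itself, which may enforce more than this. The helper name repo_name_problems is hypothetical, and it assumes you pass just the name, without a "user/" prefix.

```python
import re

# Characters the error message allows: alphanumerics plus '-', '_' and '.'.
_ALLOWED = re.compile(r"[A-Za-z0-9._-]+")


def repo_name_problems(name):
    """Return a list of rule violations for a repo name (empty list means it looks fine).

    Sketch based only on the rules in the quoted error message.
    """
    problems = []
    if not name:
        return ["name is empty"]
    if len(name) > 96:
        problems.append("longer than 96 characters")
    if not _ALLOWED.fullmatch(name):
        problems.append("contains characters other than alphanumerics, '-', '_', '.'")
    if "--" in name or ".." in name:
        problems.append("contains '--' or '..'")
    if name[0] in "-." or name[-1] in "-.":
        problems.append("starts or ends with '-' or '.'")
    return problems
```

An empty list means the name passes these checks. A name containing a space or one starting with '-' would come back with the matching complaint.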

Let me know how it goes.
