[Bug] fugashi.Tagger causes pickling error during multiprocessing in tokenizer (Japanese) #4031

Open
easyautoml opened this issue Oct 18, 2024 · 0 comments
Describe the bug

When fine-tuning the XTTS model with num_workers > 0 on a Japanese dataset, a TypeError related to fugashi.Tagger occurs.

Specifically, the error "self.c_tagger cannot be converted to a Python object for pickling" is raised because fugashi.Tagger, which the cutlet library uses for Japanese text processing, cannot be serialized for multiprocessing.
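
For reference, the failure can be reproduced outside of the training pipeline with a minimal sketch like the one below (assuming fugashi and a UniDic dictionary such as unidic-lite are installed); pickling the tagger directly raises the same TypeError that the DataLoader workers hit:

import pickle

import fugashi  # MeCab wrapper that cutlet uses for Japanese tokenization

tagger = fugashi.Tagger()

try:
    # DataLoader workers receive a pickled copy of the dataset (and its
    # tokenizer) when they are spawned, which is exactly the step that fails.
    pickle.dumps(tagger)
except TypeError as e:
    print(f"Pickling failed: {e}")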

To Reproduce

Steps to Reproduce:

  • Load training and evaluation samples using load_tts_samples().
  • Initialize the Trainer object.
  • Create a training DataLoader using trainer.get_train_dataloader().
  • Set num_workers=2 in the DataLoader to enable multiprocessing.
  • Attempt to iterate through the DataLoader and observe the error, as in the snippet below.

import pandas as pd
from torch.utils.data import DataLoader

from trainer import Trainer
from TTS.tts.datasets import load_tts_samples

# Load the training and evaluation samples
train_samples, eval_samples = load_tts_samples(
    # Your loading code here...
)

# Initialize the trainer
trainer = Trainer(
    # Trainer initialization code here...
)

# Build the training dataloader through the trainer and grab its dataset
train_loader = trainer.get_train_dataloader(
    {},
    train_samples,
    True,
)

dataset = train_loader.dataset

# Create a DataLoader with num_workers > 0, which uses multiprocessing and triggers the pickling issue
loader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=dataset.collate_fn,
    drop_last=False,
    sampler=None,
    num_workers=2,  # more than 0 workers means the dataset must be pickled for each worker
    pin_memory=False,
)

# Create an iterator from the dataloader
data_iter = iter(loader)

# Try to fetch the first batch; this triggers the pickling error
try:
    first_batch = next(data_iter)
    pd.DataFrame(list(first_batch.items()), columns=['Key', 'Value'])
except Exception as e:
    print(f"Error: {e}")

Expected behavior

The data should be processed without any errors, even with num_workers > 0.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3070 Laptop GPU"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.4.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Windows",
        "architecture": [
            "64bit",
            "WindowsPE"
        ],
        "processor": "Intel64 Family 6 Model 165 Stepping 2, GenuineIntel",
        "python": "3.9.19",
        "version": "10.0.22631"
    }
}

Additional context

This issue only occurs when processing Japanese text, because the tokenizer uses fugashi.Tagger (via cutlet), which cannot be pickled and therefore fails when DataLoader worker processes are spawned, as they are by default on Windows.
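
As a hedged illustration only (not a patch that exists in the library), one pattern that can sidestep this kind of error is to keep the unpicklable fugashi/cutlet object out of the pickled state and rebuild it lazily inside each worker process. The wrapper name PicklableKatsu below is hypothetical:

import cutlet

class PicklableKatsu:
    """Holds a cutlet.Cutlet instance but remains picklable for DataLoader workers."""

    def __init__(self):
        self._katsu = None  # created on first use, separately in each process

    @property
    def katsu(self):
        if self._katsu is None:
            self._katsu = cutlet.Cutlet()  # builds the fugashi.Tagger internally
        return self._katsu

    def romaji(self, text):
        return self.katsu.romaji(text)

    def __getstate__(self):
        # Drop the lazily built tagger before pickling; each worker rebuilds it
        # the first time romaji() is called.
        state = self.__dict__.copy()
        state["_katsu"] = None
        return state

A simpler user-side workaround is to train with num_workers=0, which keeps data loading in the main process and avoids pickling the dataset entirely, at the cost of slower loading.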

easyautoml added the bug label on Oct 18, 2024