"Fast" version #3

Open
Lundez opened this issue Mar 23, 2021 · 7 comments

@Lundez

Lundez commented Mar 23, 2021

Hi @stefan-it thanks for the awesome job of providing all these embeddings.

I'm wondering how you trained them and if I could perhaps create a "fast" version of the Swedish ones myself?
I need a somewhat smaller model to reduce the inference time 😄

@codemaster-22

Hi @Lundez, how do we get a fast version? Did you figure this out? @stefan-it, could you please help with this as soon as possible?

@Lundez
Author

Lundez commented Jun 21, 2021

I never got a response and didn't get started on training it myself.
I think I saw the width of the "fast" model mentioned somewhere, and I believe Stefan has mentioned that he used wiki + opus + opensubtitles.
That should get you started. If you complete a small model please share!

(P.S. For me it was enough to quantize the model after training my NER.)

@stefan-it
Member

Hi @codemaster-22 and @Lundez ,

unfortunately, I have no plans to re-train fast models.

You're right, if you want to have fast models, then you need to train the model from scratch.

You would need to train fast versions on your own with a decent training corpus. However, if you want to try smaller models, you could use e.g. the distilled version of multilingual BERT, provided by Hugging Face: https://huggingface.co/distilbert-base-multilingual-cased.

@stefan-it
Member

@Lundez if you want to train a fast version, you just need to use a smaller hidden size. "Fast" usually refers to models with a hidden size of 1024 instead of 2048.

You can use this example as orientation:

https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md#training-the-language-model
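To get a rough feeling for how much smaller a hidden size of 1024 makes the model compared with 2048, here is a back-of-the-envelope sketch in plain Python. It assumes an embedding -> single LSTM layer -> linear decoder architecture with PyTorch-style LSTM parameters, a character embedding size of 100, and a hypothetical dictionary of 300 characters. None of these numbers come from this thread, so treat the result as an estimate only.

```python
def lstm_lm_parameter_count(vocab_size, hidden_size, embedding_size=100, nlayers=1):
    """Rough parameter count for an embedding -> LSTM -> linear decoder LM.

    Assumption: PyTorch-style LSTM layers (4 gates, each with input weights,
    recurrent weights and two bias vectors). Not taken from the Flair source.
    """
    embedding = vocab_size * embedding_size

    lstm = 0
    input_size = embedding_size
    for _ in range(nlayers):
        lstm += 4 * (input_size * hidden_size
                     + hidden_size * hidden_size
                     + 2 * hidden_size)
        input_size = hidden_size  # deeper layers read the previous layer's output

    decoder = hidden_size * vocab_size + vocab_size  # weights + bias
    return embedding + lstm + decoder


VOCAB_SIZE = 300  # hypothetical character dictionary size

large = lstm_lm_parameter_count(VOCAB_SIZE, hidden_size=2048)
fast = lstm_lm_parameter_count(VOCAB_SIZE, hidden_size=1024)

print(f"hidden_size=2048: {large:,} parameters")
print(f"hidden_size=1024: {fast:,} parameters")
print(f"ratio: {large / fast:.2f}x")
```

Because the recurrent weight matrix grows with the square of the hidden size, halving the hidden size shrinks the model by considerably more than half under these assumptions.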

@stefan-it
Member

Here are the scripts (forward and backward lm training) that I've used for training e.g. the Swedish Flair Embeddings:

Forward lm:

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

from pathlib import Path

# are you training a forward or backward LM?
is_forward_lm = True 

# load a custom character dictionary from file
# (alternative: load the default one with Dictionary.load('chars'))
#dictionary: Dictionary = Dictionary.load('chars')
dictionary: Dictionary = Dictionary.load_from_file('dictionary.pkl')

# get your corpus, process forward and at the character level
corpus = TextCorpus(Path('./corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=2048,
                               dropout=0.1,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model_forward',
              sequence_length=250,
              mini_batch_size=50,
              max_epochs=1,
              checkpoint=True)

Backward lm:

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

from pathlib import Path

# are you training a forward or backward LM?
is_forward_lm = False 

# load a custom character dictionary from file
# (alternative: load the default one with Dictionary.load('chars'))
#dictionary: Dictionary = Dictionary.load('chars')
dictionary: Dictionary = Dictionary.load_from_file('dictionary.pkl')

# get your corpus, process backward (is_forward_lm is False) and at the character level
corpus = TextCorpus(Path('./corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=2048,
                               dropout=0.1,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model_backward',
              sequence_length=250,
              mini_batch_size=50,
              max_epochs=1,
              checkpoint=True)

The dictionary.pkl was created with this script:

import sys

from flair.data import Dictionary
char_dictionary: Dictionary = Dictionary()

# counter object
import collections
counter = collections.Counter()

processed = 0


file = sys.argv[1]

with open(file, 'r', encoding='utf-8') as f:
    tokens = 0
    for line in f:

        processed += 1            
        chars = list(line)
        tokens += len(chars)

        # Add chars to the dictionary
        counter.update(chars)

        # uncomment the next line to speed things up (if the corpus is too large)
        # if tokens > 50000000: break


total_count = 0
for letter, count in counter.most_common():
    total_count += count

print(total_count)
print(processed)

# running total of character counts (named so it does not shadow the built-in sum)
cumulative_count = 0
idx = 0
for letter, count in counter.most_common():
    cumulative_count += count
    percentile = (cumulative_count / total_count)

    # uncomment the next line to use only top X percentile of chars, otherwise filter later
    # if percentile < 0.00001: break

    char_dictionary.add_item(letter)
    idx += 1
    print('%d\t%s\t%7d\t%7d\t%f' % (idx, letter, count, cumulative_count, percentile))

print(char_dictionary.item2idx)

import pickle

output = sys.argv[2]

with open(output, 'wb') as f:
    mappings = {
        'idx2item': char_dictionary.idx2item,
        'item2idx': char_dictionary.item2idx
    }
    pickle.dump(mappings, f)

So if you want to train a faster model, just use hidden_size=1024 instead of hidden_size=2048 and you should be able to use the scripts above 🤗
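Before starting a long training run, it can be worth sanity-checking the dictionary.pkl written by the dictionary script. The following is a minimal sketch using only the standard library. The helper name `summarize_char_dictionary` is hypothetical, and the demo file is a made-up four-character dictionary that only mimics the `idx2item` / `item2idx` layout dumped above. The sketch accepts items stored either as bytes or as strings, since that may differ between Flair versions.

```python
import pickle
import tempfile
from pathlib import Path


def summarize_char_dictionary(path):
    """Load a pickled {'idx2item', 'item2idx'} mapping and check it is consistent.

    Hypothetical helper. Only unpickle files you created yourself, because
    pickle.load can execute arbitrary code.
    """
    with open(path, "rb") as f:
        mappings = pickle.load(f)

    idx2item = mappings["idx2item"]
    item2idx = mappings["item2idx"]

    # Every position in idx2item should map back to the same index.
    for idx, item in enumerate(idx2item):
        if item2idx[item] != idx:
            raise ValueError(f"inconsistent mapping for item {item!r} at index {idx}")

    def as_text(item):
        # Items may be stored as UTF-8 bytes or as plain strings.
        return item.decode("utf-8") if isinstance(item, bytes) else item

    return {
        "size": len(idx2item),
        "first_items": [as_text(item) for item in idx2item[:10]],
    }


# Demo with a tiny, made-up dictionary in the same layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "dictionary.pkl"
    items = [b"<unk>", b"e", b"a", b" "]
    with open(demo_path, "wb") as f:
        pickle.dump(
            {"idx2item": items, "item2idx": {item: i for i, item in enumerate(items)}},
            f,
        )
    summary = summarize_char_dictionary(demo_path)

print(summary["size"], summary["first_items"])
```

For a real run, pass the path of your own dictionary.pkl instead of the temporary demo file.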

@codemaster-22

Thanks a lot @stefan-it, I was looking for exactly this answer! Love you bruh for the amazing repo and quick responses!!

@codemaster-22

codemaster-22 commented Jun 26, 2021

I have a question. I have a small corpus of close to 1 million words in Hinglish, by which I mean Hindi written in Latin (English) script, e.g. "Mein kya karu": the text looks English but reads as Hindi. Should I fine-tune an existing English Flair embedding model such as news-X, or do I have to train from scratch? @stefan-it, @Lundez
