
MetaCAT BERT update to tutorials 4.1 and 4.2 #26

Merged · 8 commits into main · Aug 29, 2024

Conversation

shubham-s-agarwal (Collaborator)

Updates to Tutorials 4.1 and 4.2: Integration of BERT Implementation

  • Tutorial 4.1: No changes were necessary since BERT utilizes its own pre-trained tokenizer. A note has been added advising users to proceed directly to Tutorial 4.2 if using BERT.

  • Tutorial 4.2: BERT implementation has been added, showing training on the given dataset. Unlike BiLSTM, which fine-tunes a previously trained model pack, BERT is trained from scratch.

@mart-r (Collaborator) left a comment

4.1 looks good. Don't think there needs to be more in here for that.

4.2

  • The MetaCAT configuration and the Train MetaCAT (sub)section are now duplicated. It would be good to have these appear once. Otherwise we'll end up having to maintain 2 copies of the same text/code.
    The two MetaCAT instances (BERT vs BiLSTM) should provide the exact same interface anyway. The only difference is that some config entries don't affect one or the other.
  • Perhaps we can trim the included output a little? I know it's already long in the existing one, but just doing this comparison it was pretty annoying to have to scroll around the sizeable chunks of output.
  • There is no description of the model_variant config option in the tutorial.

EDIT:
Also, regarding 4.2: the changes to the config in the first code section of the For BERT section are the reason I was talking about creating a static method for getting a BERT-based meta-cat with defaults.
I.e. in ConfigMetaCAT we could have something along the lines of:

    @classmethod
    def get_default_bert_config(cls, category_name: str, model_name: str = 'bert', nclasses: int = 2) -> 'ConfigMetaCAT':
        config = cls()
        config.model.model_name = model_name
        config.model.nclasses = nclasses
        config.general.category_name = category_name
        return config
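To make the idea concrete, here is a self-contained sketch of such a factory, using minimal stand-in dataclasses (the real sub-configs live in medcat.config_meta_cat and have more fields; `_Model` and `_General` below are hypothetical placeholders):

```python
from dataclasses import dataclass, field

@dataclass
class _Model:
    model_name: str = 'lstm'
    nclasses: int = 2

@dataclass
class _General:
    category_name: str = ''

@dataclass
class ConfigMetaCAT:
    model: _Model = field(default_factory=_Model)
    general: _General = field(default_factory=_General)

    @classmethod
    def get_default_bert_config(cls, category_name: str,
                                model_name: str = 'bert',
                                nclasses: int = 2) -> 'ConfigMetaCAT':
        # Build and return an *instance*; mutating cls directly would
        # change class-level state and hand back the class itself.
        config = cls()
        config.model.model_name = model_name
        config.model.nclasses = nclasses
        config.general.category_name = category_name
        return config

config = ConfigMetaCAT.get_default_bert_config(category_name='Status')
print(config.model.model_name, config.general.category_name)  # bert Status
```

With this, the tutorial's For BERT section could replace several manual config assignments with a single factory call.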

@shubham-s-agarwal (Collaborator, Author) commented Aug 21, 2024

For 4.2:

  • The MetaCAT configuration and Train MetaCAT sections now have subsections for BERT and Bi-LSTM.
  • Since model_variant isn't commonly changed, I haven't added it in the tutorial. There is a link to all config variables in the section.
  • Regarding the config section for BERT, we have to explicitly change the model_name, category_name and other variables since we are starting the training from scratch. For Bi-LSTM, we are loading the model pack; if we were training Bi-LSTM from scratch, we would have to do the same.

@mart-r (Collaborator) left a comment

  • Since model_variant isn't commonly changed, I haven't added it in the tutorial. There is a link to all config variables in the section.

But you're using it in the tutorial - passing it to TokenizerWrapperBERT.load. Someone going through the tutorial won't have any idea what this is or what value it holds.
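To illustrate the reviewer's point, a hedged sketch of the kind of one-line explanation the tutorial could carry. The loader below is a stand-in, not MedCAT's real TokenizerWrapperBERT, and 'bert-base-uncased' is only an example Hugging Face model id:

```python
# model_variant names the Hugging Face model the BERT tokenizer is built
# from; the tutorial passes it to TokenizerWrapperBERT.load without
# explaining it. 'bert-base-uncased' is just an example value here.
model_variant = 'bert-base-uncased'

def load_tokenizer(model_variant: str) -> dict:
    # Stand-in for TokenizerWrapperBERT.load(...); the real call loads a
    # pre-trained tokenizer for the given model variant.
    return {'variant': model_variant}

tokenizer = load_tokenizer(model_variant)
```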

In 4.2 I don't think we need 2 different (almost identical) mc.train_from_json code blocks. We could just have 1 and get the suffix for the save file from config_metacat.model.model_name automatically.

Other than that, I think this looks good.

@shubham-s-agarwal (Collaborator, Author)

Fixed both!

@mart-r (Collaborator) left a comment

Thought I already approved yesterday, but I think I was waiting for GHA workflow just in case.

Looks good to me.

@shubham-s-agarwal shubham-s-agarwal merged commit ae31b0f into main Aug 29, 2024
6 checks passed
@shubham-s-agarwal shubham-s-agarwal deleted the MetaCAT_bert_update branch August 29, 2024 08:32