Relation extraction #173
Merged
+6,776
−6
Changes from 1 commit
119 commits
b20e7c8 Added files. (vladd-bit)
eec6c59 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
56220aa More additions to rel extraction. (vladd-bit)
7ad88f5 Rel base. (vladd-bit)
233ce36 Update. (vladd-bit)
85a7015 Updates. (vladd-bit)
5003548 Dependency parsing. (vladd-bit)
541b47d Updates. (vladd-bit)
c042b0d Added pre-training steps. (vladd-bit)
87d0c0c Added training & model utils. (vladd-bit)
4f42696 Cleanup & fixes. (vladd-bit)
018d811 Update. (vladd-bit)
f3d3f44 Evaluation updates for pretraining. (vladd-bit)
e5f354e Removed duplicate relation storage. (vladd-bit)
c69de67 Merged master. (vladd-bit)
031d256 Moved RE model file location. (vladd-bit)
2259a6b Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
1c469e9 Structure revisions. (vladd-bit)
423b4e1 Added custom config for RE. (vladd-bit)
8ae9abb Implemented custom dataset loader for RE. (vladd-bit)
186416c More changes. (vladd-bit)
451e33f Small fix. (vladd-bit)
8b36413 Latest additions to RelCAT (pipe + predictions) (vladd-bit)
2fb8fc9 Setup.py fix. (vladd-bit)
930dd11 RE utils update. (vladd-bit)
24b2841 rel model update. (vladd-bit)
193ecb1 rel dataset + tokenizer improvements. (vladd-bit)
03111a7 RelCAT updates. (vladd-bit)
7ab60f4 RelCAT saving/loading improvements. (vladd-bit)
40875f3 RelCAT saving/loading improvements. (vladd-bit)
810d1dc RelCAT model fixes. (vladd-bit)
11dcb32 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
72187f6 Attempted gpu learning fix. Dataset label generation fixes. (vladd-bit)
5f67a4c Minor train dataset gen fix. (vladd-bit)
cfc0e91 Minor train dataset gen fix No.2. (vladd-bit)
9f4b220 Config updates. (vladd-bit)
19afa81 Gpu support fixes. Added label stats. (vladd-bit)
8eb1665 Evaluation stat fixes. (vladd-bit)
6e86fa2 Cleaned stat output mode during training. (vladd-bit)
5cee8cf Build fix. (vladd-bit)
223ac9a removed unused dependencies and fixed code formatting (vladd-bit)
ea7d68c Mypy compliance. (vladd-bit)
1ea9738 Fixed linting. (vladd-bit)
9f6609e More Gpu mode train fixes. (vladd-bit)
1782c0b Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
fb86869 Fixed model saving/loading issues when using other baes models. (vladd-bit)
df21543 More fixes to stat evaluation. Added proper CAT integration of RelCAT. (vladd-bit)
92a5e08 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
87d1a9c Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
ced1627 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
7b69710 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
37fd212 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
f0eda2b Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
10269b9 Setup.py typo fix. (vladd-bit)
b8a45b2 Merge branch 'relation_extraction' of https://github.com/CogStack/Med… (vladd-bit)
20203ac Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
f057139 RelCAT loading fix. (vladd-bit)
197a27a Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
86fd509 RelCAT Config changes. (vladd-bit)
79dc069 Type fix. Minor additions to RelCAT model. (vladd-bit)
323c895 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
f1c56bf Type fixes. (vladd-bit)
a78ff86 Type corrections. (vladd-bit)
f09ceb2 RelCAT update. (vladd-bit)
32574f2 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
c081c3e Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
e2e48b5 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
4ce5ba3 Type fixes. (vladd-bit)
21c09ff Merge branch 'relation_extraction' of https://github.com/CogStack/Med… (vladd-bit)
8123689 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
57ab0c5 Fixed type issue. (vladd-bit)
9da5aa6 RelCATConfig: added seed param. (vladd-bit)
009e832 Adaptations to the new codebase + type fixes.. (vladd-bit)
1a7d130 Doc/type fixes. (vladd-bit)
53dba6a Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
92613ed Fixed input size issue for model. (vladd-bit)
a49a44a Fixed issue(s) with model size and config. (vladd-bit)
6456e6e Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
5aac9ab RelCAT: updated configs to new style. (vladd-bit)
9c50b30 RelCAT: removed old refs to logging. (vladd-bit)
b071607 Merge branches 'relation_extraction' and 'master' of https://github.c… (vladd-bit)
89d9128 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
e6e99cb Fixed GPU training + added extra stat print for train set. (vladd-bit)
307d194 Type fixes. (vladd-bit)
fb7efe3 Updated dev requirements. (vladd-bit)
c235daf Linting. (vladd-bit)
fcdf2e3 Merge branches 'relation_extraction' and 'master' of https://github.c… (vladd-bit)
aad0a73 Fixed pin_memory issue when training on CPU. (vladd-bit)
8a9026b Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
f94e349 Updated RelCAT dataset get + default config. (vladd-bit)
0770356 Updated RelDS generator + default config (vladd-bit)
bdf20f5 Linting. (vladd-bit)
f7b5aaf Updated RelDatset + config. (vladd-bit)
3e827cf Merge branch 'relation_extraction' of https://github.com/CogStack/Med… (vladd-bit)
aaf6533 Pushing updates to model (shubham-s-agarwal)
18f9bb8 Fixing formatting (shubham-s-agarwal)
503513c Update rel_dataset.py (shubham-s-agarwal)
040821b Update rel_dataset.py (shubham-s-agarwal)
ed7c8d5 Update rel_dataset.py (shubham-s-agarwal)
8d0bfe4 RelCAT: added test resource files. (vladd-bit)
3f3a780 RelCAT: Fixed model load/checkpointing. (vladd-bit)
3f56824 RelCAT: updated to pipe spacy doc call. (vladd-bit)
b7a4987 RelCAT: added tests. (vladd-bit)
77d27b0 Merge branch 'relation_extraction' of https://github.com/CogStack/Med… (vladd-bit)
a9258a2 Fixed lint/type issues & added rel tag to test DS. (vladd-bit)
0ed70fb Fixed ann id to token issue. (vladd-bit)
8db2e76 RelCAT: updated test dataset + tests. (vladd-bit)
6eea6b7 RelCAT: updates to requested changes + dataset improvements. (vladd-bit)
6972310 RelCAT: updated docs/logs according to commends. (vladd-bit)
d03316c Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
8cb12a4 RelCAT: type fix. (vladd-bit)
d10318a RelCAT: mct export dataset updates. (vladd-bit)
12acaeb RelCAT: test updates + requested changes p2. (vladd-bit)
4c14a3a Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
382cefc RelCAT: log for MCT export train. (vladd-bit)
35b0913 Updated docs + split train_test & dataset for benchmarks. (vladd-bit)
d48bc41 type fixes. (vladd-bit)
3068516 Merge branch 'master' of https://github.com/CogStack/MedCAT into rela… (vladd-bit)
72643fc Merge branch 'master' into relation_extraction (mart-r)
RE utils update.
commit 930dd11eafe41ec08e62893a64b9a49ad13256e4
@@ -21,10 +21,9 @@ def save_bin_file(file_name, data, path="./"):
 
 def load_state(net, optimizer, scheduler, path="./", model_name="BERT", file_prefix="train", load_best=False):
     """ Loads saved model and optimizer states if exists """
-    base_path = path
 
-    checkpoint_path = os.path.join(base_path, file_prefix + "_checkpoint_%s.pth.tar" % model_name)
-    best_path = os.path.join(base_path, file_prefix + "_model_best_%s.pth.tar" % model_name)
+    checkpoint_path = os.path.join(path, file_prefix + "_checkpoint_%s.dat" % model_name)
+    best_path = os.path.join(path, file_prefix + "_model_best_%s.dat" % model_name)
     start_epoch, best_pred, checkpoint = 0, 0, None
     if (load_best == True) and os.path.isfile(best_path):
         checkpoint = torch.load(best_path)
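The path selection in `load_state` could be isolated into a small helper, which would also make it easy to test without a model. The following is only a sketch: `resolve_checkpoint_path` is a hypothetical name that does not appear in the PR, and the fallback to the regular checkpoint when no best-model file exists is an assumption, since that branch is not visible in this hunk.

```python
import os
from typing import Optional


def resolve_checkpoint_path(path: str = "./", model_name: str = "BERT",
                            file_prefix: str = "train",
                            load_best: bool = False) -> Optional[str]:
    """Return the checkpoint file to load, or None if nothing is saved.

    Hypothetical helper (not part of the PR). File names follow the new
    ".dat" naming introduced in this hunk.
    """
    checkpoint_path = os.path.join(path, file_prefix + "_checkpoint_%s.dat" % model_name)
    best_path = os.path.join(path, file_prefix + "_model_best_%s.dat" % model_name)

    # "if load_best" is equivalent to "load_best == True" for a bool flag.
    if load_best and os.path.isfile(best_path):
        return best_path
    # Assumption: fall back to the latest checkpoint when no best model exists.
    if os.path.isfile(checkpoint_path):
        return checkpoint_path
    return None
```

With a helper like this, `load_state` would only call `torch.load` when the returned path is not `None`.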
@@ -43,26 +42,17 @@ def load_state(net, optimizer, scheduler, path="./", model_name="BERT", file_pre
     logging.info("Loaded model and optimizer.")
     return start_epoch, best_pred
 
-
-def save_results(losses_per_epoch, accuracy_per_epoch, model_name="BERT", path="./", file_prefix="train"):
-    save_bin_file(file_prefix + "_losses_per_epoch_%s.pkl" % model_name, losses_per_epoch, path)
-    save_bin_file(file_prefix + "_accuracy_per_epoch__%s.pkl" % model_name, accuracy_per_epoch, path)
+def save_results(data, model_name="BERT", path="./", file_prefix="train"):
+    save_bin_file(file_prefix + "_losses_accuracy_f1_per_epoch_%s.dat" % model_name, data, path)
 
 def load_results(path, model_name="BERT", file_prefix="train"):
-    losses_path = os.path.join(path, file_prefix + "_losses_per_epoch_%s.pkl" % model_name)
-    accuracy_path = os.path.join(path, file_prefix + "_accuracy_per_epoch_%s.pkl" % model_name)
-    f1_path = os.path.join(path, file_prefix + "_f1_per_epoch_%s.pkl" % model_name)
-
-    losses_per_epoch, accuracy_per_epoch, f1_per_epoch = [], [], []
+    data_dict_path = os.path.join(path, file_prefix + "_losses_accuracy_f1_per_epoch_%s.dat" % model_name)
 
-    if os.path.isfile(losses_path):
-        losses_per_epoch = load_bin_file(losses_path)
-    if os.path.isfile(accuracy_path):
-        accuracy_per_epoch = load_bin_file(accuracy_path)
-    if os.path.isfile(f1_path):
-        f1_per_epoch = load_bin_file(f1_path)
+    data_dict = {"losses_per_epoch" : [], "accuracy_per_epoch" : [], "f1_per_epoch" : []}
+    if os.path.isfile(data_dict_path):
+        data_dict = load_bin_file(data_dict_path)
 
-    return losses_per_epoch, accuracy_per_epoch, f1_per_epoch
+    return data_dict["losses_per_epoch"], data_dict["accuracy_per_epoch"], data_dict["f1_per_epoch"]
 
 def put_blanks(relation_data : List, blanking_threshold : float = 0.5):
     """

Review comment on `def load_results(path, model_name="BERT", file_prefix="train"):`
Type hints and doc string would help
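One possible shape for the requested type hints and docstrings, offered as a sketch rather than the final code. The bodies of `save_bin_file` and `load_bin_file` are not shown in this diff, so the pickle-based stand-ins below are assumptions included only to make the example self-contained; the `Dict[str, List[float]]` element type is likewise a guess at what the training loop stores.

```python
import os
import pickle
from typing import Any, Dict, List, Tuple


def save_bin_file(file_name: str, data: Any, path: str = "./") -> None:
    """Stand-in for the PR's helper; using pickle is an assumption."""
    with open(os.path.join(path, file_name), "wb") as f:
        pickle.dump(data, f)


def load_bin_file(file_path: str) -> Any:
    """Stand-in for the PR's helper; takes the full file path, as in load_results."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


def save_results(data: Dict[str, List[float]], model_name: str = "BERT",
                 path: str = "./", file_prefix: str = "train") -> None:
    """Save per-epoch losses, accuracy and F1 as a single dictionary file."""
    save_bin_file(file_prefix + "_losses_accuracy_f1_per_epoch_%s.dat" % model_name, data, path)


def load_results(path: str, model_name: str = "BERT",
                 file_prefix: str = "train") -> Tuple[List[float], List[float], List[float]]:
    """Load per-epoch metrics saved by save_results.

    Returns:
        (losses_per_epoch, accuracy_per_epoch, f1_per_epoch); each list is
        empty when no results file exists yet.
    """
    data_dict_path = os.path.join(path, file_prefix + "_losses_accuracy_f1_per_epoch_%s.dat" % model_name)

    data_dict: Dict[str, List[float]] = {"losses_per_epoch": [], "accuracy_per_epoch": [], "f1_per_epoch": []}
    if os.path.isfile(data_dict_path):
        data_dict = load_bin_file(data_dict_path)

    return data_dict["losses_per_epoch"], data_dict["accuracy_per_epoch"], data_dict["f1_per_epoch"]
```

Keeping the three-tuple return type means existing callers of `load_results` would not need to change after the switch to a single file.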
@@ -85,9 +75,9 @@ def put_blanks(relation_data : List, blanking_threshold : float = 0.5):
 
     return blanked_relation
 
-def create_tokenizer_pretrain(tokenizer, tokenizer_name="BERT"):
+def create_tokenizer_pretrain(tokenizer, tokenizer_path):
     tokenizer.hf_tokenizers.add_tokens(["[BLANK]", "[ENT1]", "[ENT2]", "[/ENT1]", "[/ENT2]"], special_tokens=True)
-    save_bin_file(file_name=tokenizer_name + "_tokenizer_relation_extraction.dat", data=tokenizer)
+    tokenizer.save(tokenizer_path)
 
 # Used for creating data sets for pretraining
 def tokenize(relations_dataset: Series, tokenizer : TokenizerWrapperBERT, mask_probability : float = 0.5) -> Tuple:

Review comment on `def create_tokenizer_pretrain(tokenizer, tokenizer_path):`
Type hints would help
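A sketch of how the requested type hints might look here. In the real module the parameter would presumably be annotated with `TokenizerWrapperBERT`, as `tokenize` already is; the two `Protocol` classes below are hypothetical stand-ins used only so the example runs without MedCAT installed, and the `RELATION_SPECIAL_TOKENS` constant is likewise an illustrative name.

```python
from typing import List, Protocol

# Special tokens added in the diff, pulled out into a constant (illustrative name).
RELATION_SPECIAL_TOKENS: List[str] = ["[BLANK]", "[ENT1]", "[ENT2]", "[/ENT1]", "[/ENT2]"]


class HFTokenizerLike(Protocol):
    """Hypothetical stand-in for the wrapped Hugging Face tokenizer."""

    def add_tokens(self, new_tokens: List[str], special_tokens: bool = False) -> int: ...


class TokenizerWrapperLike(Protocol):
    """Hypothetical stand-in for TokenizerWrapperBERT: only what this function uses."""

    hf_tokenizers: HFTokenizerLike

    def save(self, dir_path: str) -> None: ...


def create_tokenizer_pretrain(tokenizer: TokenizerWrapperLike, tokenizer_path: str) -> None:
    """Add the relation-extraction special tokens and save the tokenizer.

    Args:
        tokenizer: wrapper exposing `hf_tokenizers` and `save`.
        tokenizer_path: location the tokenizer is written to.
    """
    tokenizer.hf_tokenizers.add_tokens(RELATION_SPECIAL_TOKENS, special_tokens=True)
    tokenizer.save(tokenizer_path)
```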
@@ -132,9 +122,4 @@ def tokenize(relations_dataset: Series, tokenizer : TokenizerWrapperBERT, mask_p
     token_ids = tokenizer.hf_tokenizers.convert_tokens_to_ids(tokens)
     masked_for_pred = tokenizer.hf_tokenizers.convert_tokens_to_ids(masked_for_pred)
 
-    return token_ids, masked_for_pred, ent1_ent2_start
-
-
-def batch_split(dataset, chunk_size):
-    n = max(1, chunk_size)
-    return (dataset[i:i+n] for i in range(0, len(dataset), n))
+    return token_ids, masked_for_pred, ent1_ent2_start