My personal project - extracting information using pre-trained language models!
- Utilize PLMs (fine-tune on a benchmark dataset, run inference on real-world text data)
- Collect data from the real world (webpage crawling)
- Utilize LLMs to create data for entity types that have no benchmark datasets, i.e. no pre-existing annotations
- Crawl data - (a)
- Tokenize
- Prepare PLM - fine-tune BERT on CoNLL data - (b)
- Run inference on the crawled data (a) using the trained PLM (b)
- Postprocess
- Gather the results and use them in downstream applications (e.g. visualization)
Please see `requirements.txt`.
Tested on Python 3.10.
See `crawldata.py` (a hedged sketch of this kind of crawling logic follows the format example below).
Output of the code: `data.json` (omitted to avoid a potential, unforeseen license issue).
Format (example):

```json
{
  "data": [
    {
      "university": "Center for ABC | Harvard & Smithsonian",
      "department": "",
      "job_code": "FELLOWS",
      "job_title": "Center for ABC Fellowship | Harvard",
      "deadline": "2023/11/12 11:59PM*",
      "apply_link": "https://<URLs>",
      "description": "The Center for ABC | Harvard and DEF is a joint (omitted)"
    }
  ]
}
```
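For illustration only, the crawling step might look like the sketch below. The URL, CSS selectors, and HTML structure are hypothetical placeholders, not the actual targets used by `crawldata.py`; only the output fields mirror the format above.

```python
# Hypothetical crawling sketch (assumed URL and selectors, not the real ones).
import json
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.org/job-listings")  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

records = []
for posting in soup.select("div.posting"):  # placeholder selector
    records.append({
        "university": posting.select_one(".institution").get_text(strip=True),
        "department": "",
        "job_code": posting.select_one(".code").get_text(strip=True),
        "job_title": posting.select_one(".title").get_text(strip=True),
        "deadline": posting.select_one(".deadline").get_text(strip=True),
        "apply_link": posting.select_one("a")["href"],
        "description": posting.select_one(".description").get_text(strip=True),
    })

# Write the crawled records in the data.json format shown above.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump({"data": records}, f, ensure_ascii=False, indent=2)
```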
Learned: TBA
Using the off-the-shelf NLP toolkit Stanza by the StanfordNLP group (Qi et al., 2020; Zhang et al., 2021).
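For instance, tokenizing the crawled text with Stanza looks roughly like this (a minimal sketch; the sample sentence is illustrative):

```python
# Minimal Stanza tokenization sketch; the input sentence is illustrative.
import stanza

stanza.download("en")  # one-time English model download
nlp = stanza.Pipeline("en", processors="tokenize")

doc = nlp("The Center for ABC | Harvard and DEF is a joint center.")
tokens = [token.text for sentence in doc.sentences for token in sentence.tokens]
print(tokens)
```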
Using the Transformers library (version 4.35.2) and BERT, we fine-tuned a model and ran inference on our datasets.
Main code: `run_ner.py` will load the conll2003 dataset from the Hugging Face Hub.
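Internally, loading the benchmark data amounts to roughly the following sketch via the `datasets` library (the actual script also handles tokenization and label alignment):

```python
# Sketch: how the CoNLL-2003 benchmark is pulled from the Hugging Face Hub.
from datasets import load_dataset

dataset = load_dataset("conll2003")
print(dataset)                           # train / validation / test splits
print(dataset["train"][0]["tokens"])     # word-level tokens
print(dataset["train"][0]["ner_tags"])   # integer-encoded BIO labels
```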
```bash
# training
export OUTPUT_PATH="./outputs"
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir ${OUTPUT_PATH} \
  --do_train \
  --do_eval
```
Check `./outputs` for the report. Our model's performance:
- Precision: 0.9429
- Recall: 0.9497
- F1: 0.9463
- Accuracy: 0.9896
`pytorch_model.bin` is omitted as it is too large (435 MB) to upload to a GitHub repo (added to `.gitignore`).
Main code: `run_ner.py` will load our custom data processing script located in `custom_data/custom_data.py`.
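As a rough illustration, a `datasets` loading script for CoNLL-style unlabeled text could be structured like this. The builder name, file path, and label set below are assumptions, not the actual contents of `custom_data/custom_data.py`:

```python
# Hedged sketch of a Hugging Face `datasets` loading script; names, the file
# path, and the label set are assumptions, not the real custom_data.py.
import datasets

class CustomData(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "id": datasets.Value("string"),
                "tokens": datasets.Sequence(datasets.Value("string")),
                "ner_tags": datasets.Sequence(
                    datasets.ClassLabel(names=[
                        "O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                        "B-LOC", "I-LOC", "B-MISC", "I-MISC",
                    ])
                ),
            })
        )

    def _split_generators(self, dl_manager):
        # Unlabeled real-world text goes in the test split for prediction.
        return [datasets.SplitGenerator(
            name=datasets.Split.TEST,
            gen_kwargs={"filepath": "custom_data/test.txt"},  # assumed path
        )]

    def _generate_examples(self, filepath):
        # CoNLL-style: one token per line, blank line between sentences.
        with open(filepath, encoding="utf-8") as f:
            tokens, idx = [], 0
            for line in f:
                line = line.strip()
                if line:
                    tokens.append(line.split()[0])
                elif tokens:
                    # No gold labels exist, so fill with "O" placeholders.
                    yield idx, {"id": str(idx), "tokens": tokens,
                                "ner_tags": ["O"] * len(tokens)}
                    idx, tokens = idx + 1, []
            if tokens:
                yield idx, {"id": str(idx), "tokens": tokens,
                            "ner_tags": ["O"] * len(tokens)}
```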
The script will load our trained model from `$OUTPUT_PATH`.
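For reference, loading that checkpoint manually would look roughly like this (a sketch using the standard Transformers API; the sample input is illustrative):

```python
# Sketch: load the fine-tuned checkpoint from $OUTPUT_PATH and run NER on a
# sample sentence (illustrative input, not real project data).
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("./outputs")
tokenizer = AutoTokenizer.from_pretrained("./outputs")

ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("The Center for ABC Fellowship at Harvard"))
```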
Note that the train/dev/test datasets are omitted to avoid a potential, unforeseen license issue. These should be placed in the `custom_data` folder.
```bash
# prediction
python run_ner.py \
  --model_name_or_path ${OUTPUT_PATH} \
  --dataset_name custom_data \
  --output_dir ${OUTPUT_PATH}/eval \
  --do_eval --do_predict
```
Check `${OUTPUT_PATH}/eval` for the report. Note that our inference run is on non-annotated data, so there are no true labels and an F1 score of 0 is expected. `${OUTPUT_PATH}/eval/predictions.txt` is the final output of our prediction.
TBA
Outputs: `origin_predict_result.csv`
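As a hedged sketch, converting BIO-tagged predictions into entity rows for that CSV could look like the following (the sample tokens, tags, and output columns are illustrative assumptions, not the actual postprocessing code):

```python
# Hedged postprocessing sketch: merge BIO-tagged tokens into entity spans and
# write them as CSV rows. Sample inputs and output columns are assumptions.
import csv

def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # an "O" tag (or stray "I-") closes any open span
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

# Illustrative example input (not real model output).
tokens = ["Center", "for", "ABC", "|", "Harvard"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "B-ORG"]

with open("origin_predict_result.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["entity", "type"])
    writer.writerows(bio_to_spans(tokens, tags))
```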
TBA!