TurkishLegalNER: Named Entity Recognition in Turkish Legal Text

Source code for the paper Named-entity recognition in Turkish legal texts. This work presents the first legal domain-specific NER model for Turkish legal texts, together with a custom-made corpus and several NER architectures based on conditional random fields (CRFs) and bidirectional long short-term memory (BiLSTM) networks.

Data and Results

This repository is based on Berkay Yazıcıoğlu's LegalNER implementation. However, due to GitHub LFS storage and bandwidth restrictions, some data and result files were moved to Google Drive. The files under data/vectors/* and src/*_results/* are available under this Google Drive link. To get the full version of the repository, download the necessary files from Google Drive and place them according to the directory hierarchy of the paths above.

Data Format

Follow the structure of [data/train_and_test]:

  1. For name in {train, test}, create files {name}.words.txt and {name}.tags.txt that contain one sentence per line, with words / tags separated by spaces, using the IOBES tagging scheme.
  2. Create files vocab.words.txt, vocab.tags.txt and vocab.chars.txt that contain one token per line. This can be automated with the corresponding function in [src/preprocessing.py], setting the field DATASET_DIR to the location of the files from Step 1 and DATA_DIR to the output directory.
  3. Create a glove.*.npz file containing one embeddings array of shape (size_vocab_words, 300), built from the GloVe 840B vectors. This can be done with the corresponding function in [src/preprocessing.py] after completing Step 2, setting the field VECTOR_DIR to the desired output directory. A sketch of Steps 2 and 3 follows this list.
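As an illustration of Step 1, one line of train.words.txt might read "Mahkeme 4857 sayılı İş Kanunu uyarınca karar verdi" with the matching train.tags.txt line "O B-LAW I-LAW I-LAW E-LAW O O O" (sentence and entity labels here are hypothetical). Below is a minimal sketch of what the preprocessing functions in [src/preprocessing.py] automate for Steps 2 and 3; the directory constants mirror the fields named above, and the GloVe filename is an assumption:

# A minimal sketch of Steps 2 and 3; paths and the GloVe filename are
# assumptions, not the exact values used by src/preprocessing.py.
from collections import Counter
from pathlib import Path

import numpy as np

DATASET_DIR = Path("data/train_and_test")  # location of the Step 1 files
DATA_DIR = Path("data")                    # output directory for the vocab files
VECTOR_DIR = Path("data/vectors")          # output directory for the .npz file

# Step 2: collect every word and tag seen in the train/test files.
words, tags = Counter(), Counter()
for name in ("train", "test"):
    words.update((DATASET_DIR / f"{name}.words.txt").read_text(encoding="utf-8").split())
    tags.update((DATASET_DIR / f"{name}.tags.txt").read_text(encoding="utf-8").split())

vocab_words = sorted(words)
(DATA_DIR / "vocab.words.txt").write_text("\n".join(vocab_words), encoding="utf-8")
(DATA_DIR / "vocab.tags.txt").write_text("\n".join(sorted(tags)), encoding="utf-8")
vocab_chars = sorted({c for w in vocab_words for c in w})
(DATA_DIR / "vocab.chars.txt").write_text("\n".join(vocab_chars), encoding="utf-8")

# Step 3: build one (size_vocab_words, 300) "embeddings" array from the
# GloVe 840B vectors, keeping only rows for words in the vocabulary.
word_to_idx = {w: i for i, w in enumerate(vocab_words)}
embeddings = np.zeros((len(vocab_words), 300))
with open("glove.840B.300d.txt", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        parts = line.rstrip().split(" ")
        if len(parts) == 301 and parts[0] in word_to_idx:
            embeddings[word_to_idx[parts[0]]] = [float(x) for x in parts[1:]]
VECTOR_DIR.mkdir(parents=True, exist_ok=True)
np.savez_compressed(VECTOR_DIR / "glove.npz", embeddings=embeddings)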

Get Started

TensorFlow 1.15 should be used; other versions are untested. The remaining packages should work without a pinned version.
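For example, a compatible environment could be set up with pip (assuming Python 3.7, the newest version TensorFlow 1.15 supports):

pip install tensorflow==1.15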

Once all the required data files are produced, run main.py with the correct parameters. These are:

  1. model (-m): Three base architectures; lc for LSTM-CRF, llc for LSTM-LSTM-CRF and lcc for LSTM-CRF-CRF.
  2. embeddings (-e): Three embeddings to pair with a base architecture; glove for GloVe, m2v for Morph2Vec and hybrid for their combination. Make sure the correct .npz files are present in the data folder, and rerun preprocessing each time a different embedding is used.
  3. preprocessing (-p): Optional flag to run the preprocessing scripts. Must be passed whenever a new embedding is selected.
  4. mode (-a): Four modes; directories are used as specified in [src/data.json]. train trains a model from scratch, k_fold performs cross-validation (default=5), test tests and validates a specific input (default=None), and use generates output from a specified file with a trained model (default=None).
python main.py -m <model> -e <embed> -p (preprocess flag) -a <mode>
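For example (hypothetical runs combining the flags above):

python main.py -m lc -e glove -p -a train
python main.py -m llc -e hybrid -a k_fold

The first run preprocesses the GloVe data and trains an LSTM-CRF from scratch; the second runs 5-fold cross-validation for the LSTM-LSTM-CRF with hybrid embeddings, assuming preprocessing has already been run for that embedding.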

To perform multiple tests with slight changes to the parameters, check out the [src/multiple_run.sh] script.
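A sketch of the kind of parameter sweep such a script could perform (the exact parameters varied are defined in [src/multiple_run.sh] itself):

for m in lc llc lcc; do
  for e in glove m2v hybrid; do
    python main.py -m "$m" -e "$e" -p -a k_fold
  done
done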