This project is an entity normalisation engine developed for the Vector AI recruitment process. It supports entity normalisation for the following types of entities:
- Companies, businesses;
- Products, objects;
- Locations, cities, countries;
- Serial numbers;
- Street addresses.
The model takes as input a stream of strings belonging to the classes above. No context is provided for any entity.
For the first three types of entities, the model normalises each input by mapping it to a suitable Wikipedia article. Serial numbers and street addresses are unique by nature and have no such reference article, so they are instead normalised by their linguistic similarity to the other input entities, measured with the Levenshtein distance.
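As a rough illustration of the Levenshtein-based step, the sketch below groups input strings whose normalised edit distance to an existing group's first member is small. It is not the code from this repository: the function names `levenshtein`, `similarity` and `group_entities`, the lower-casing step, the greedy grouping strategy and the `0.8` threshold are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_entities(entities: Iterable[str],
                   threshold: float = 0.8) -> Dict[str, List[str]]:
    """Greedily assign each entity to the first group it is similar enough to.

    The first string seen in a group acts as its representative (an
    assumption of this sketch, not necessarily what the project does).
    """
    groups: Dict[str, List[str]] = {}
    for entity in entities:
        key = entity.strip().lower()
        for representative in groups:
            if similarity(key, representative.strip().lower()) >= threshold:
                groups[representative].append(entity)
                break
        else:
            groups[entity] = [entity]  # no close match: start a new group
    return groups
```

For example, `group_entities(["XYZ-13579", "xyz 13579", "ABC/4242"])` places the first two strings in one group and the third in its own. A greedy single pass keeps the sketch short; its result can depend on input order, which a clustering approach would avoid.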
The model accepts entities in any language supported by the Google Translator API.
To set up this project:
- Clone GitHub repo:
    git clone https://github.com/jleguina0/entity-normalization.git
- Create a suitable virtual environment and install dependencies:
  - With conda:
      cd entity-normalization
      conda env create -f environment.yml
      conda activate entity-norm37
  - Or else, create a virtual environment with Python 3.7 and do:
      pip install -r requirements.txt
- To run the normalization engine with some predefined examples in various languages:
    python entity_norm.py
Javier Leguina Peral - [email protected]
Project Link: https://github.com/jleguina0/entity-normalization