Extract the entities of a given URL using Google Cloud's Natural Language API
- Node.js installed
- Google Cloud credentials, follow this guide: https://cloud.google.com/iam/docs/keys-create-delete
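If you prefer the command line, you can also create the key with gcloud (the service-account email below is a placeholder; replace it with your own):

gcloud iam service-accounts keys create ./src/config/gcp.json \
  --iam-account=your-sa@your-project-id.iam.gserviceaccount.com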
- Download repo
- Add dependencies
yarn
or npm install
- Add your Google Cloud credentials (the key file is sketched below) in
./src/config/gcp.json
- Run it from the command line:
node src/index.js <url> <css_selector>
- The output with your entities will be in
./src/output/entities.csv
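For reference, the key file you place at ./src/config/gcp.json is the JSON you download from Google Cloud; it looks roughly like this (all values are placeholders):

{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "...",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "your-sa@your-project-id.iam.gserviceaccount.com",
  "client_id": "..."
}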
If you don't pass a selector, the whole body will be used (some words may come out garbled because the HTML-stripping logic is quite simple).
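For context, here is a minimal sketch of how such a pipeline can be built, assuming axios for fetching, cheerio for the CSS selector, and the official @google-cloud/language client; this is an illustration, not the repo's actual code:

const axios = require('axios');
const cheerio = require('cheerio');
const language = require('@google-cloud/language');

async function extractEntities(url, selector = 'body') {
  // Fetch the page and reduce it to plain text via the CSS selector
  const { data: html } = await axios.get(url);
  const text = cheerio.load(html)(selector).text();

  // Authenticate with the key file and request the entities
  const client = new language.LanguageServiceClient({ keyFilename: './src/config/gcp.json' });
  const [result] = await client.analyzeEntities({
    document: { content: text, type: 'PLAIN_TEXT' },
  });

  // Map entity name -> salience, the format shown in the examples below
  return Object.fromEntries(result.entities.map(e => [e.name, e.salience]));
}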
node src/index.js 'https://www.softonic.com/articulos/ahsoka-a-que-hora-se-estrena-la-nueva-serie-de-star-wars-en-disney-plus' 'article'
{
'Rosario Dawson': 0.20106926560401917,
Ahsoka: 0.16541194915771484,
Martes: 0.043081074953079224,
serie: 0.0016539701027795672,
fin: 0.0323215052485466,
'Disney Plus': 0.025058437138795853,
uno: 0.003629029495641589,
estrenos: 0.02174600400030613,
NoticiasAhsoka: 0.018189461901783943,
punto: 0.017994651570916176,
'Suscripción Anual Disney+': 0.01701190322637558,
series: 0.0016943010268732905,
'Star Wars': 0.011811340227723122,
videojuegos: 0.011043447069823742,
'aparición': 0.009717367589473724,
personaje: 0.009717367589473724,
'país': 0.00966811552643776,
juego: 0.009523184038698673,
espera: 0.0075791748240590096,
pistas: 0.0075791748240590096,
'The Mandalorian': 0.006372471340000629,
...
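The numbers are the salience scores returned by the Natural Language API: values between 0 and 1 that indicate how central each entity is to the text.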
node src/index.js 'https://nachomascort.com/scraping-content-hijacking-the-endpoint-calls-in-the-front-end/' '.post-container'
{
Scraping: 0.007178295403718948,
'\\ -H': 0.04049227386713028,
'https://github.com/NachoSEO/google-autocomplete-extractor': 0.036833275109529495,
payloadOnce: 0.029546057805418968,
Google: 0.004429913125932217,
Googlebot: 0.025536995381116867,
way: 0.00153245753608644,
call: 0.003964927978813648,
example: 0.0059328884817659855,
'\\/b\\u003e': 0.0017164398450404406,
'Scraping content': 0.01152738370001316,
endpoint: 0.005422821268439293,
order: 0.002042317995801568,
actions: 0.009126781485974789,
site: 0.001900155795738101,
...
If you want to extract the entities of several URLs instead of just one, use bulk mode.
- Download repo
- Add dependencies
yarn
or npm install
- Add your Google Cloud credentials in
./src/config/gcp.json
- Instead of passing the URL and the selector via the terminal, add that info to this file:
./src/input/input.txt
- Run it from the command line:
node src/bulk.js
- The output with your entities will be in
./src/output/entities.csv
Add one URL per line, followed by its selector; separate the two with a comma. The selector is optional: if none is provided, the entire body will be scraped (see the parsing sketch after the example).
Example:
https://nachomascort.com/scraping-content-hijacking-the-endpoint-calls-in-the-front-end/,.post-container
https://www.softonic.com/articulos/ahsoka-a-que-hora-se-estrena-la-nueva-serie-de-star-wars-en-disney-plus,article
https://github.com/NachoSEO/simple-entity-extractor
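As a rough illustration of the input format, bulk mode only needs to read the file and split each line on a comma; this is an assumed sketch, not necessarily the repo's actual implementation:

const fs = require('fs');

const lines = fs
  .readFileSync('./src/input/input.txt', 'utf8')
  .split('\n')
  .map(line => line.trim())
  .filter(Boolean); // skip empty lines

for (const [url, selector] of lines.map(line => line.split(','))) {
  // selector is undefined when a line has no comma, so the whole body gets scraped
  console.log(url, selector || 'body');
}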