links scraper

This links scraper scrapes the links of a single-page application (SPA) web page.

Usage

drc up -d
drc logs -f links-scraper

Then make a POST call to

http://localhost:8891/scrape/

with the body:

{
    "url": "https://publicatie.gelinkt-notuleren.demo.lblod.info"
}

and the Content-Type header set to application/json.
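
The same request can be sent from Node.js. The following is a minimal sketch, assuming Node.js 18 or later (for the built-in `fetch`) and that the service listens on port 8891. The helper names `buildScrapeRequest` and `requestScrape` are made up for illustration and are not part of the scraper.

```javascript
// Hypothetical helpers, not part of the scraper itself.
// Assumption: the service is reachable at http://localhost:8891/scrape/.
const SCRAPE_ENDPOINT = "http://localhost:8891/scrape/";

// Build the fetch() arguments for a scrape request.
function buildScrapeRequest(url) {
  return {
    endpoint: SCRAPE_ENDPOINT,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url }),
    },
  };
}

// Send the request; resolves with the HTTP status code.
async function requestScrape(url) {
  const { endpoint, options } = buildScrapeRequest(url);
  const response = await fetch(endpoint, options);
  return response.status;
}
```

Calling `requestScrape("https://publicatie.gelinkt-notuleren.demo.lblod.info")` sends the body shown above. The response format is not documented here, so the sketch only exposes the status code.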

After that, the triple store (http://localhost:8890/sparql) will contain the scraped information of the websites, and the scraper-data folder will contain a downloaded version of each encountered page.
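
To check what ended up in the triple store, a query along these lines may help. This is a sketch under assumptions: it supposes that every scraped page is typed `muExt:Page` and has a `muExt:url` (the names used in scraper/SPARQL.js), that the data lives in the graph http://mu.semte.ch/application, and that the endpoint accepts form-encoded SPARQL queries and returns SPARQL JSON results. The function names are hypothetical.

```javascript
// Assumptions: endpoint and graph as in the SPARQL config section; each
// scraped page is typed muExt:Page and carries a muExt:url.
const ENDPOINT = "http://localhost:8890/sparql";
const GRAPH = "http://mu.semte.ch/application";

// Build a SELECT query that lists page resources with their URL.
function buildListPagesQuery(graph, limit) {
  return [
    "PREFIX muExt: <http://mu.semte.ch/vocabularies/ext/>",
    "SELECT ?page ?url WHERE {",
    `  GRAPH <${graph}> { ?page a muExt:Page ; muExt:url ?url . }`,
    `} LIMIT ${Number(limit)}`,
  ].join("\n");
}

// Run the query (Node.js 18+ for fetch) and return the URLs found.
async function listPages(limit = 10) {
  const response = await fetch(ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Accept: "application/sparql-results+json",
    },
    body: new URLSearchParams({
      query: buildListPagesQuery(GRAPH, limit),
    }).toString(),
  });
  const json = await response.json();
  return json.results.bindings.map((binding) => binding.url.value);
}
```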

The data will be stored according to the scheme shown in schema.jpg in the repository.

options

SPARQL config

Two environment variables configure the SPARQL endpoint:

MU_SPARQL_ENDPOINT='http://database:8890/sparql'
MU_APPLICATION_GRAPH='http://mu.semte.ch/application'
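
As an illustration, this is roughly how a Node.js service could read these two variables. It is only a sketch: the helper `sparqlConfig` is hypothetical, and treating the values above as fallbacks is an assumption rather than documented behaviour.

```javascript
// Hypothetical helper: read the two variables, falling back to the values
// shown above when a variable is unset or empty (assumed behaviour).
function sparqlConfig(env = process.env) {
  return {
    endpoint: env.MU_SPARQL_ENDPOINT || "http://database:8890/sparql",
    graph: env.MU_APPLICATION_GRAPH || "http://mu.semte.ch/application",
  };
}
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain object instead of the real process environment.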

minimum rescraping age

The minimum age, in milliseconds, that must have passed before a duplicate page will be added to the triple store again. The value below, 14400000 ms, corresponds to 4 hours.

ENV MINIMUM_TIME_FOR_RESCRAPING=14400000
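
The age check can be pictured as follows. This is a minimal sketch, not the scraper's actual code: it assumes the time of the previous scrape is available as epoch milliseconds and that a page whose age equals the minimum exactly may be scraped again. Both function names are hypothetical.

```javascript
// Read the minimum age in milliseconds, falling back to 4 hours
// (14400000 ms) when the variable is missing or not a number.
function minimumRescrapingAge(env = process.env) {
  const parsed = Number.parseInt(env.MINIMUM_TIME_FOR_RESCRAPING, 10);
  return Number.isNaN(parsed) ? 14400000 : parsed;
}

// Assumption: timestamps are epoch milliseconds, as returned by Date.now().
function mayRescrape(lastScrapedMs, nowMs, minimumAgeMs) {
  return nowMs - lastScrapedMs >= minimumAgeMs;
}
```

For example, `mayRescrape(lastScraped, Date.now(), minimumRescrapingAge())` is false for a page scraped one hour ago when the 4 hour value is in effect.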

blacklisting

As we do not want to scrape the entire internet, there is a blacklist file that can be configured. The location of that file is given by an environment variable:

ENV BLACKLIST_FILE="/app/sites.blacklist"

The file contains only URLs. For each of them, the scraper checks whether it appears anywhere in the found links.
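
The matching described above could look like this. It is a sketch under assumptions: the file format is taken to be one URL per line with blank lines ignored, and "appears anywhere" is read as a plain substring match. The helpers `parseBlacklist` and `isBlacklisted` are hypothetical names.

```javascript
// Assumption: one URL per line; surrounding whitespace and blank lines
// are ignored.
function parseBlacklist(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// A link is blacklisted when any entry occurs somewhere inside it.
function isBlacklisted(link, entries) {
  return entries.some((entry) => link.includes(entry));
}
```

With substring matching, a short entry such as a bare domain excludes every link that contains it, including links on subpaths.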

datamodel

The data model can be changed by adjusting the scraper/SPARQL.js file, which documents all the predicates. More complex changes require looking at the queries that are being made. The file currently contains an object (SPARQL) that holds the representations as follows:

const SPARQL = {
    PREFIXES: "PREFIX mu:<http://mu.semte.ch/vocabularies/> PREFIX muExt:<http://mu.semte.ch/vocabularies/ext/> PREFIX dct:<http://purl.org/dc/terms/> ",
    TYPE_PAGE: "muExt:Page",
    TYPE_DOWNLOADED_PAGE: "muExt:DownloadedPage",
    PREDICATE_URL: "muExt:url",
    PREDICATE_LINKSTO: "muExt:linksTo",
    PREDICATE_DOWNLOADEDAS: "muExt:downloadedAs",
    PREDICATE_FILENAME: "muExt:filename",
    PREDICATE_MODIFIED: "muExt:modified",
    PREDICATE_UUID: "mu:uuid",
    RESOURCE_BASE_PAGE: "http://example.com/resources/pages/",
    RESOURCE_BASE_DOWNLOADED_PAGE: "http://example.com/resources/downloaded-pages/",
    query: query,
    update: update
};
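
To show why changing a constant changes the stored data, here is a sketch of how such constants can be composed into an update. It is not the query the scraper actually runs: `buildPageInsert` and `escapeLiteral` are hypothetical helpers, the graph is passed in by the caller, and only a subset of the constants is copied so the example stands alone.

```javascript
// Subset of the constants above, copied so the sketch is self-contained.
const SPARQL = {
  PREFIXES: "PREFIX mu:<http://mu.semte.ch/vocabularies/> PREFIX muExt:<http://mu.semte.ch/vocabularies/ext/> ",
  TYPE_PAGE: "muExt:Page",
  PREDICATE_URL: "muExt:url",
  PREDICATE_UUID: "mu:uuid",
  RESOURCE_BASE_PAGE: "http://example.com/resources/pages/",
};

// Escape backslashes and double quotes for use in a SPARQL string literal.
function escapeLiteral(value) {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Hypothetical helper: compose an INSERT DATA statement for one page.
function buildPageInsert(graph, uuid, url) {
  const subject = `<${SPARQL.RESOURCE_BASE_PAGE}${uuid}>`;
  return `${SPARQL.PREFIXES}
INSERT DATA {
  GRAPH <${graph}> {
    ${subject} a ${SPARQL.TYPE_PAGE} ;
      ${SPARQL.PREDICATE_UUID} "${escapeLiteral(uuid)}" ;
      ${SPARQL.PREDICATE_URL} "${escapeLiteral(url)}" .
  }
}`;
}
```

Because the statement is assembled from the constants, renaming for instance `PREDICATE_URL` changes every triple written with it, while structural changes still require editing the query text.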
