Make the API declarative #4

Closed
gajus opened this issue Jan 17, 2017 · 26 comments

@gajus
Owner

gajus commented Jan 17, 2017

The entire set of input, validation, and formatting rules can be declared using a simple JSON object.

The benefit of this approach is portability, i.e. it is easy to move a scraper from one programming language to another.

Furthermore, it is easier to enforce a consistent style and to manage the complexity of the code base.

We could even use https://www.npmjs.com/package/jsonscript.

@gajus
Owner Author

gajus commented Jan 17, 2017

To continue the example from the documentation:

x('body', {
  title: x('title'),
  articles: x('article {0,}', {
    body: x('.body@innerHTML'),
    summary: x('.body p {0,}[0]'),
    imageUrl: x('img@src'),
    title: x('.title')
  })
});

A declarative approach would look something like this:

{
  "selector": "body",
  "properties": {
    "title": "title",
    "articles": {
      "selector": "article {0,}",
      "properties": {
        "body": "[email protected]",
        "summary": ".body p {0,}[0]",
        "imageUrl": "img@src",
        "title": ".title"
      }
    }
  }
}

The equivalent in YAML is even shorter:

selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: .body@innerHTML
      summary: .body p {0,}[0]
      imageUrl: img@src
      title: .title

This also makes the syntax for declaring validators and formatters more intuitive, e.g.

selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: .body@innerHTML
      summary: .body p {0,}[0]
      imageUrl: img@src
      title:
        selector: .title
        test: /foo/
        format: "upperCase"

Where test (a property of title) is a regular expression used to validate the result, and format (also a property of title) is used to format the result (as requested in #3).
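
For illustration, a minimal sketch of how such test and format rules might be applied to an extracted value (a hypothetical runtime, not Surgeon's actual implementation; the upperCase formatter is assumed to be registered):

const formatters = {
  upperCase: (value) => {
    return value.toUpperCase();
  }
};

// Sketch: validate the extracted value against the "test" expression,
// then apply the named "format" function.
const applyRules = (value, rules) => {
  if (rules.test && !new RegExp(rules.test).test(value)) {
    throw new Error('Value does not match ' + rules.test + '.');
  }

  if (rules.format) {
    return formatters[rules.format](value);
  }

  return value;
};

applyRules('foo bar', {test: 'foo', format: 'upperCase'}); // 'FOO BAR'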

Minus the custom DSL used in the selector, this is becoming a lot like https://github.com/rla/dom-eee. (Might be a good thing.)

We could also add JSON schema support, e.g.

selector: .movie
schema: "movie"
properties:
  name: ".name"
  url: ".url"

Where schema: "movie" refers to a JSON schema loaded at the time of constructing a Surgeon instance.
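
A sketch of what registering a schema at construction time might look like (the schemas option shape is an assumption):

// Hypothetical: schemas registered when constructing the Surgeon instance
// and referenced by name (schema: "movie") from the declarative form.
const x = surgeon(document, {
  schemas: {
    movie: {
      properties: {
        name: {type: 'string'},
        url: {type: 'string'}
      },
      required: ['name', 'url'],
      type: 'object'
    }
  }
});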


I was looking for "declarative scraper" and found this: https://github.com/ContentMine/scraperJSON. It's not mature, but it demonstrates an attempt at writing a declarative scraper. There is even a Node.js implementation, https://github.com/ContentMine/thresher.

I like the idea of the regex capture groups:

regex - an Object specifying a regular expression whose groups should be captured as the results. The results will be an array of the captured groups. If the global flag (g) is specified, the result will be an array of arrays of captured groups. There are two keys allowed:
- source - a string specifying the regular expression to be executed. Required
- flags - an array specifying the regex flags to be used (g, m, i, etc.). Optional (omitting this key will cause the regex to be executed with no flags).
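
A minimal sketch of evaluating such a regex object against an extracted string, following the scraperJSON semantics quoted above:

// Capture groups become the result; with the global flag, the result is
// an array of arrays of captured groups.
const applyRegex = (input, {source, flags = []}) => {
  const regex = new RegExp(source, flags.join(''));

  if (!flags.includes('g')) {
    const match = input.match(regex);

    return match ? match.slice(1) : null;
  }

  const results = [];

  let match;

  while ((match = regex.exec(input)) !== null) {
    results.push(match.slice(1));
  }

  return results;
};

applyRegex('price: 42 EUR', {source: '(\\d+) (EUR|USD)'}); // ['42', 'EUR']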

There is also https://github.com/drbig/grabber

@gajus
Owner Author

gajus commented Jan 17, 2017

@rla any thoughts on this?

@sllvn

sllvn commented Jan 18, 2017

By limiting the schema to JSON, you do limit the transforms that can be done re: #3. Or would there be a way to define custom transforms and reference those by key (such that I could specify format: "myAwesomeTransform")?

@gajus
Owner Author

gajus commented Jan 18, 2017

By limiting the schema to JSON, you do limit the transforms that can be done re: #3. Or would there be a way to define custom transforms and reference those by key (such that I could specify format: "myAwesomeTransform")?

That's the idea.

I am convinced transforms should be added, though. Extracting data and formatting data are two very different tasks.
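
A sketch of how custom transforms could be registered and referenced by key (the formatters option is hypothetical here, in line with the discussion above):

// Hypothetical: formatters registered up-front and referenced by name from
// the declarative schema, e.g. format: "myAwesomeTransform".
const x = surgeon(document, {
  formatters: {
    myAwesomeTransform: (value) => {
      return value.trim().toLowerCase();
    }
  }
});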

@rla
Collaborator

rla commented Jan 18, 2017

The focus of DOM-EEE was mainly on the extraction part, as the code involved there had a tendency to get too complex. The idea was to get simpler objects from the DOM by cutting away most of the element noise. The simplified object tree is supposed to be easier to work with using basic language constructs, such as loops or Array methods. This is what I experienced in multiple projects. The second goal was extreme portability, which made JSON input-output mandatory. If the project is mainly to be used from JavaScript environments, then this might not be the optimal choice.

Validation and transformation can also be represented in the declarative form, as string identifiers or as arrays of them, like:

{
  selector: '.date',
  transform: 'convert-date',
  validity: 'date-not-in-future'
}

The actual transforms and validators need to be defined and registered with the library first. Nonexistent transforms and validators can then be easily checked for. I can see that defining them inline, directly in the declarative form, could make it too complex. The order in which to apply validations and transforms is not clear, though. We might want to check whether the selector matches at all, or actually validate the transformed date.

One of the DOM-EEE design aspects is that non-matching selectors return null, making catch-all validation easy: the output just has to be checked for nulls. This, combined with some amount of "manually executed" checks, has proven to cover a lot of cases for me.

JSON Schema can be applied to the output independently of this library. If we built in support, we would have to choose an implementation. As I understand it, there are drafts 3 and 4 of JSON Schema, with huge differences between them, and various packages pick arbitrarily what to support from each.

@gajus
Owner Author

gajus commented Jan 18, 2017

JSON Schema can be applied to the output independently of this library. If we built in support, we would have to choose an implementation. As I understand it, there are drafts 3 and 4 of JSON Schema, with huge differences between them, and various packages pick arbitrarily what to support from each.

In terms of choosing an implementation, Ajv (https://github.com/epoberezkin/ajv) is now a somewhat de facto standard in the JavaScript community.

However, I agree with your point that it can be done outside of Surgeon.
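
For example, Surgeon's output could be validated with Ajv as a separate step (here, result stands for the scraped object, and the schema is illustrative):

import Ajv from 'ajv';

const ajv = new Ajv();

const validateMovie = ajv.compile({
  properties: {
    name: {type: 'string'},
    url: {type: 'string'}
  },
  required: ['name', 'url'],
  type: 'object'
});

if (!validateMovie(result)) {
  throw new Error(ajv.errorsText(validateMovie.errors));
}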

One of the DOM-EEE design aspects is that non-matching selectors return null, making catch-all validation easy: the output just has to be checked for nulls. This, combined with some amount of "manually executed" checks, has proven to cover a lot of cases for me.

What about throwing an error (like Surgeon does at present)?

This way you are sure that no unexpected behaviour goes unnoticed.

The actual transforms and validators need to be defined and registered with the library first. Nonexistent transforms and validators can then be easily checked for. I can see that defining them inline, directly in the declarative form, could make it too complex.

Agree.

The order in which to apply validations and transforms is not clear, though. We might want to check whether the selector matches at all, or actually validate the transformed date.

What is a use case for wanting to apply a validator after formatting the message?

Wouldn't a formatter throw an error if it cannot format the data to the desired format?

@rla
Collaborator

rla commented Jan 18, 2017

@gajus,

What about throwing an error (like Surgeon does at present)?

This needs some mechanism for marking optional properties, for the case when an element only sometimes exists but you would like to use it when it does.

What is a use case for wanting to apply a validator after formatting the message?

To decouple date parsing from checking whether the date is in the future, for example. Composability; otherwise you need a single transform like parseDateButAlsoCheckItIsInFuture. I can see that some sort of pipeline could be defined, maybe even represented similarly to shell pipes, like apply: 'parseDate|checkDateInFuture', where validation is just an identity transform that throws an error when the condition does not hold.
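
A sketch of such a pipeline, where a validator is just an identity transform that throws (parseDate and checkDateInFuture are the hypothetical names from above):

const subroutines = {
  parseDate: (value) => {
    return new Date(value);
  },
  checkDateInFuture: (date) => {
    if (date.getTime() <= Date.now()) {
      throw new Error('Date is not in the future.');
    }

    return date;
  }
};

// Resolve 'parseDate|checkDateInFuture' against the registry and apply
// the subroutines left to right.
const applyPipeline = (value, pipeline) => {
  return pipeline
    .split('|')
    .map((name) => name.trim())
    .reduce((result, name) => subroutines[name](result), value);
};

applyPipeline('2030-01-01', 'parseDate|checkDateInFuture');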

@gajus
Owner Author

gajus commented Jan 18, 2017

Wouldn't you agree that checkDateInFuture is a filter feature rather than validation?

The fact that a document contains a date that is in the past does not make it an invalid date. It is just data that you are not interested in. "Validate" here would do no good, since it would throw an error (and break the scraper). A filter could be used, though.

@rla
Collaborator

rla commented Jan 18, 2017

@gajus,

Wouldn't you agree that checkDateInFuture is a filter feature rather than validation?

A better example is parsing a URL and checking whether it contains a specific query parameter. The validation here means guarding against a changed URL structure, not filtering a set of URLs on the page. Filtering probably needs to be described as well, maybe in a separate issue.
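
For instance, such a validator might parse the URL and assert that the expected query parameter is present (a sketch; the parameter name is illustrative):

// Guards against a changed URL structure: throws if the expected query
// parameter has disappeared.
const assertHasQueryParameter = (input, parameterName) => {
  if (!new URL(input).searchParams.has(parameterName)) {
    throw new Error('URL is missing the "' + parameterName + '" query parameter.');
  }

  return input;
};

assertHasQueryParameter('http://example.com/?sessionId=123', 'sessionId');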

@gajus
Owner Author

gajus commented Jan 18, 2017

Going back to the original question:

What is a use case for wanting to apply a validator after formatting the message?

If you have retrieved a URL, you can use the validator to assert that the URL schema has not changed. Where does the formatting come in?

@rla
Collaborator

rla commented Jan 18, 2017

The use case is one where I want to parse the URL only once, not once in the validator and again in the later steps.

@gajus
Owner Author

gajus commented Jan 18, 2017

The use case is one where I want to parse the URL only once, not once in the validator and again in the later steps.

For performance purposes?

@gajus
Owner Author

gajus commented Jan 18, 2017

That said, you wouldn't be parsing the URL for validation purposes; in most cases a regex would be enough.

Unless, of course, your intention is to ignore new parameters being added to the URL or parameters changing order. That sounds dangerous, though.

@gajus
Owner Author

gajus commented Jan 18, 2017

A bit of a case study.

I took an existing scraper (https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae/8943410d8e39d1eb013b11ec0d5ae50471829c09) and attempted to rewrite it using a declarative API.

Note:

It is a rather complicated example. I have chosen it intentionally to discover the edge cases.

Let's start with:

scrapeVenues

export const scrapeVenues = async () => {
  const $ = await request('get', 'http://www.mk2.com/', 'html');

  return mapSelector($('#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"]'), (venue) => {
    let nid;

    venue.find('span').remove();

    nid = venue.attr('href');
    nid = extractMatch(/\/salles\/(.+)/, nid);

    const url = 'http://www.mk2.com/salles/' + nid;
    const name = extractTextFromElement(venue);

    if (!_.includes(nid, 'mk2-')) {
      throw new Error('Unexpected nid.');
    }

    return {
      guide: {
        url
      },
      result: {
        name,
        nid: nid.substr(4),
        url
      }
    };
  });
};

This is simple:

export const scrapeVenues = async () => {
  const document = await request('get', 'http://www.mk2.com/', 'html');

  const x = surgeon(document);

  const venues = x({
    properties: {
      name: {
        selector: '::text()'
      },
      nid: {
        match: '/salles/mk2-(.+)',
        selector: '::attribute(href)'
      },
      url: {
        selector: '::attribute(href)'
      }
    },
    selector: '#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"] {1,}'
  });

  return venues.map((venue) => {
    return {
      guide: {
        url: 'http://www.mk2.com' + venue.url
      },
      result: {
        name: venue.name,
        nid: venue.nid,
        url: 'http://www.mk2.com' + venue.url
      }
    }
  });
};

It can be made even more succinct if we:

  • allow declaring a "properties" property value as a string, assuming a single query
  • allow inlining inbuilt methods into the query

Example:

const venues = x({
  properties: {
    name: '::text()',
    nid: '::attribute(href)::match("/salles/mk2-(.+)")',
    url: '::attribute(href)'
  },
  selector: '#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"] {1,}'
});

That's succinct and unambiguous.

scrapeMovies

Next comes:

export const scrapeMovies = async (guide) => {
  const $ = await request('get', guide.url, 'html');

  return mapSelector($('#seances .l-mk2-tables .l-session-table .fiche-film-info .fiche-film-title'), (movieElement) => {
    return {
      guide: {
        movieElement
      },
      result: {
        name: extractTextFromElement(movieElement)
      }
    };
  });
};

The first problem is the selector:

#seances .l-mk2-tables .l-session-table .fiche-film-info .fiche-film-title

scrapeMovies selects the movie elements, then passes an instance of the resulting cheerio selector to scrapeShowtimes, which then uses the parent selector tr to find the corresponding movie table row. Using the parent selector is bad because scrapeShowtimes should work only on the information it is given (the identifier of an element, the element itself, etc.); it shouldn't be able to traverse the DOM upwards.

We cannot fix this by changing the scrapeMovies selector to #seances .l-mk2-tables .l-session-table tr, because this would include rows that do not have the movie information. It can be solved using a parent selector, though. (I have raised proposal #8 to add a has() function.)

The next problem is that movieElement is an instance of a cheerio selector. This is bad because it makes it hard to log program inputs and outputs (guide.movieElement is being passed to scrapeShowtimes). A simple solution is to create a selector that uniquely represents the element. (I have created a proposal for a selector() function.)
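
One way such a selector() could be derived is by recording each ancestor's tag and position (a rough sketch using cheerio traversal, assuming element nodes expose tagName; this is not the actual proposal):

// Builds a selector such as 'html:nth-of-type(1) > body:nth-of-type(1) > tr:nth-of-type(3)'
// that uniquely identifies the element within the document.
const uniqueSelector = (element) => {
  const segments = [];

  let current = element;

  while (current.length && current.get(0).tagName) {
    const tagName = current.get(0).tagName;
    const index = current.prevAll(tagName).length + 1;

    segments.unshift(tagName + ':nth-of-type(' + index + ')');

    current = current.parent();
  }

  return segments.join(' > ');
};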

Which gives us something like this:

export const scrapeMovies = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document);

  const movies = x({
    properties: {
      "name": ".fiche-film-title",
      "movieElementSelector": "tr::selector()"
    },
    selector: '#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title) {0,}'
  });

  return movies.map((movie) => {
    return {
      guide: {
        url: movie.url,
        movieElementSelector: movie.movieElementSelector
      },
      result: {
        name: movie.name
      }
    }
  });
};

scrapeShowtimes

This takes us to scrapeShowtimes.

export const scrapeShowtimes = (guide) => {
  return mapSelector(guide.movieElement.parents('tr').find('.item-list a[href^="/reservation"]'), (timeElement) => {
    let date;
    let showtime;

    const text = extractTextFromElement(timeElement);
    const time = extractTime(text, 'HH[h]mm');
    const version = extractMatch(/(VOST|VO|VF)/, text);

    date = timeElement.parents('.l-session-table').find('.table-header .l-schedule-days').attr('id');
    date = extractDate(date, 'YYYYMMDD');

    showtime = {
      time: date + ' ' + time,
      url: 'http://www.mk2.com' + timeElement.attr('href')
    };

    showtime = _.assign(showtime, scrapeLanguageAttributes(version));

    return {
      result: showtime
    };
  });
};

Where do I start 🤦.

First, this reveals an error in scrapeMovies: the same movie appears multiple times in the document.

If you look at the target document, the structure is (pseudo markup) <date><movie /><movie /></date><date><movie /><movie /></date>.

That means that scrapeMovies needs to be modified to include a unique reference to the movie. I am going to use the movie URL for that.

const movies = x({
  properties: {
    name: '.fiche-film-title',
    movieUrl: 'a[href^="/films/"]::attribute(href)'
  },
  selector: '#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title) {0,}',
  uniqueBy: 'movieUrl'
});

I have added an ad hoc helper, uniqueBy (equivalent to _.uniqBy), to make the list of movies unique.
@todo Write a proposal.
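
Until that helper exists, the equivalent result can be achieved by post-processing the result with lodash:

import _ from 'lodash';

// Keep only the first occurrence of each movieUrl.
const uniqueMovies = _.uniqBy(movies, 'movieUrl');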

Now we need to iterate each date and find the movie.

const times = x({
  has: "a[href='${url}']",
  properties: {},
  selector: '#seances .l-mk2-tables .l-session-table'
}, {
  parameters: {
    url: '/films/fleur-tonnerre'
  }
});

I have added a parameters configuration. Parameters can be referred to using the ${parameter name} syntax. This allows us to filter elements using a dynamic condition.

@todo Write a proposal.
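
The interpolation itself could be as simple as the following sketch (a hypothetical helper):

// Substitute ${name} references in an expression with the configured parameters.
const interpolate = (expression, parameters) => {
  return expression.replace(/\$\{([^}]+)\}/g, (match, name) => {
    if (!(name in parameters)) {
      throw new Error('Unknown parameter: ' + name + '.');
    }

    return parameters[name];
  });
};

interpolate("a[href='${url}']", {url: '/films/fleur-tonnerre'});
// "a[href='/films/fleur-tonnerre']"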

Finally, we need to extract the data:

const x = surgeon(document, {
  parameters: {
    url: '/films/fleur-tonnerre'
  },
  formatters: {
    extractDate: (input, ...args) => {},
    extractTime: (input, ...args) => {}
  }
});

const times = x({
  has: "a[href='${url}']",
  properties: {
    date: '.table-header .l-schedule-days::attribute(id)::extractDate(YYYYMMDD)',
    times: {
      selector: '.item-list a[href^="/reservation"]',
      properties: {
        time: '::text()::extractTime(HH[h]mm)',
        version: '::text()::match("(VOST|VO|VF)")'
      }
    }
  },
  selector: '#seances .l-mk2-tables .l-session-table'
});

I am using formatters (helper functions) to format the result.
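
The formatters themselves might be thin wrappers around a date library such as moment (a sketch; the validity checks and output formats are assumptions):

import moment from 'moment';

// Parse the input using the given format and normalize the output.
const extractDate = (input, format) => {
  const date = moment(input, format);

  if (!date.isValid()) {
    throw new Error('Cannot extract a date from "' + input + '".');
  }

  return date.format('YYYY-MM-DD');
};

const extractTime = (input, format) => {
  const time = moment(input, format);

  if (!time.isValid()) {
    throw new Error('Cannot extract a time from "' + input + '".');
  }

  return time.format('HH:mm');
};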

Which gives us:

export const scrapeShowtimes = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document, {
    parameters: {
      url: guide.movieUrl
    },
    formatters: {
      extractDate,
      extractTime
    }
  });

  const dates = x({
    has: 'a[href="${url}"]',
    properties: {
      date: '.table-header .l-schedule-days::attribute(id)::extractDate(YYYYMMDD)',
      events: {
        selector: '.item-list a[href^="/reservation"]',
        properties: {
          url: '::attribute(href)',
          time: '::text()::extractTime(HH[h]mm)',
          version: '::text()::match("(VOST|VO|VF)")'
        }
      }
    },
    selector: '#seances .l-mk2-tables .l-session-table'
  });

  return _.flatten(dates.map((date) => {
    return date.events.map((event) => {
      return {
        time: date.date + ' ' + event.time,
        url: 'http://www.mk2.com' + event.url
      };
    });
  }));
};

The end result is this.

https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae/ba0926962fe57ef68b060664a50a5c01d272b8b2

There is quite a large chunk of development involved in making this work. I'd really appreciate a review and suggestions for improvement.

@gajus
Owner Author

gajus commented Jan 18, 2017

@bitshadow please have a look at this too.

@gajus
Owner Author

gajus commented Jan 30, 2017

I have implemented a variation of the above API in the declarative-api branch.

{
  "adopt": {
    "articles": {
      "imageUrl": {
        "extract": {
          "name": "href",
          "type": "attribute"
        },
        "select": "img"
      },
      "summary": "p:first-child",
      "title": ".title"
    },
    "pageTitle": "h1"
  },
  "select": "main"
}

Or even shorter, using action expressions:

{  
  "adopt": {
    "pageTitle": "::self",
    "articles": {
      "body": ".body @extract(attribute, innerHtml)",
      "imageUrl": "img @extract(attribute, src)",
      "summary": "p:first-child @extract(attribute, innerHtml)",
      "title": ".title"
    }
  },
  "pageTitle": "main > h1"
}

@DaniGuardiola

DaniGuardiola commented Jan 30, 2017

I just discovered this tool and also this alternative API and I'm quite hyped. If this becomes stable I'll be talking to my CTO to refactor our scrapers. I'm preparing the proposal already!

Keep it up 👍

@gajus
Owner Author

gajus commented Jan 30, 2017

DaniGuardiola commented 2 minutes ago

I just discovered this tool and also this alternative API and I'm quite hyped. If this becomes stable I'll be talking to my CTO to refactor our scrapers. I'm preparing the proposal already!

Keep it up 👍

Don't rush into it, @DaniGuardiola. This API is still going to change. (I appreciate the enthusiasm, though.)

I am going to post an update in the next 15 minutes describing what's wrong with the above and how it can be improved.

@gajus
Owner Author

gajus commented Jan 30, 2017

As I was saying, while working on the above API I realised that a much more powerful API would allow performing subroutines, e.g.

{
  "articles": [
    {
      "action": "select",
      "selector": "article"
    },
    {
      "action": "adopt",
      "children": {
        "body": [
          {
            "action": "select",
            "selector": ".body"
          },
          {
            "action": "extract",
            "name": "innerHTML",
            "type": "property"
          }
        ],
        "imageUrl": [
          {
            "action": "select",
            "selector": "img"
          },
          {
            "action": "extract",
            "name": "src",
            "type": "attribute"
          }
        ],
        "summary": [
          {
            "action": "select",
            "selector": ".body p:first-child"
          },
          {
            "action": "extract",
            "type": "property",
            "name": "innerHTML"
          },
          {
            "action": "format",
            "name": "text"
          }
        ],
        "title": [
          {
            "action": "select",
            "selector": ".title"
          },
          {
            "action": "extract",
            "name": "textContent",
            "type": "property"
          }
        ]
      }
    }
  ],
  "pageName": [
    {
      "action": "select",
      "selector": ".body"
    },
    {
      "action": "extract",
      "name": "innerHTML",
      "type": "property"
    }
  ]
}

Because of the formatting, this looks huge. However, if we go back to using the DSL, it becomes manageable:

{
  "articles": [
    "select article",
    {
      "body": [
        "select .body",
        "extract property innerHTML"
      ],
      "imageUrl": [
        "select img",
        "extract attribute src"
      ],
      "summary": [
        "select .body p:first-child",
        "extract property innerHTML",
        "format text"
      ],
      "title": [
        "select .title",
        "extract property textContent"
      ]
    }
  ],
  "pageName": [
    "select .body",
    "extract property innerHTML"
  ]
}

If we use YAML, the entire thing is even simpler to read:

articles:
- select article
- body:
  - select .body
  - extract property innerHTML
  imageUrl:
  - select img
  - extract attribute src
  summary:
  - select .body p:first-child
  - extract property innerHTML
  - format text
  title:
  - select .title
  - extract property textContent
pageName:
- select .body
- extract property innerHTML

The benefit of the latter approach over the current implementation is that it enables combining arbitrary test and format functions, e.g.

pageName:
- select .body
- extract property innerHTML
- format extractFirstTextNode
- format extractTime
- test timeInFuture

I also think that it is easier to read and debug, because all commands are read (and executed) from top to bottom. Therefore, following the progress log is as simple as following the schema.
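
A sketch of an interpreter for these subroutine lists, executing each instruction top-to-bottom (the command set is the one used in the examples above; the format action, quantifiers, and result cardinality are glossed over, and cheerio is assumed):

// Evaluate an instruction (a command string, an array of commands, or an
// object of named children) against the current cheerio selection.
const evaluate = (selection, instruction) => {
  if (typeof instruction === 'string') {
    const [action, ...args] = instruction.split(' ');

    if (action === 'select') {
      return selection.find(args.join(' '));
    }

    if (action === 'extract' && args[0] === 'property') {
      if (args[1] === 'innerHTML') {
        return selection.html();
      }

      if (args[1] === 'textContent') {
        return selection.text();
      }

      return selection.prop(args[1]);
    }

    if (action === 'extract' && args[0] === 'attribute') {
      return selection.attr(args[1]);
    }

    throw new Error('Unknown action: ' + action + '.');
  }

  if (Array.isArray(instruction)) {
    return instruction.reduce((result, command) => {
      return evaluate(result, command);
    }, selection);
  }

  // A plain object: evaluate each named child subroutine against the
  // same selection.
  const result = {};

  for (const [name, child] of Object.entries(instruction)) {
    result[name] = evaluate(selection, child);
  }

  return result;
};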

@gajus
Owner Author

gajus commented Jan 30, 2017

@rla @DaniGuardiola @licyeus what are your thoughts on this approach vs the earlier one?

@DaniGuardiola

DaniGuardiola commented Jan 30, 2017 via email

@gajus
Owner Author

gajus commented Jan 30, 2017

If we use YAML, the entire thing is even simpler to read:

articles:
- select article
- body:
  - select .body
  - extract property innerHTML
  imageUrl:
  - select img
  - extract attribute src
  summary:
  - select .body p:first-child
  - extract property innerHTML
  - format text
  title:
  - select .title
  - extract property textContent
pageName:
- select .body
- extract property innerHTML

This could be reduced even further using an (optional) pipe operator:

articles:
- select article
- body: select .body | extract property innerHTML
  imageUrl: select img | extract attribute src
  summary: select .body p:first-child | extract property innerHTML | format text
  title: select .title | extract property textContent
pageName: select .body | extract property innerHTML
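
Parsing the pipe form then only requires splitting the expression before dispatching each command (a sketch; selectors containing a literal | are ignored for brevity):

// Expand 'select .body | extract property innerHTML' into the equivalent
// array-of-commands form before evaluation.
const parsePipeline = (expression) => {
  return expression.split('|').map((command) => command.trim());
};

parsePipeline('select .body | extract property innerHTML');
// ['select .body', 'extract property innerHTML']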

@DaniGuardiola

DaniGuardiola commented Jan 30, 2017 via email

@gajus
Owner Author

gajus commented Jan 30, 2017

Also, thinking about performance, did you think about "stacking" cheerio
(or whatever) calls somehow? Like, for example:

That's already being done.

@DaniGuardiola

DaniGuardiola commented Jan 30, 2017

Edit: reformatted (email replies don't support markdown) and corrected some of the text (my English is not the best).

About the select and extract shorthand thing, I just came up with an idea:

This:

articles:
- select article
- body: select .body | extract property innerHTML
  imageUrl: select img | extract attribute src
  summary: select .body p:first-child | extract property innerHTML | format text
  title: select .title | extract property textContent
  pageName: select .body | extract property innerHTML

Would become this:

articles:
- select article
- body: .body { property innerHTML }
  imageUrl: img { attribute src }
  summary: .body p:first-child { property innerHTML } | format text
  title: .title { property textContent }
  pageName: .body { property innerHTML }

Also, a possible shorthand for getting textContent, which is extremely common:

articles: 
- select article
- title: .title {}

For attributes:

articles:
- select article
- imageUrl: img { [src] }

Some other ideas:

Use { propertyName } as a shorthand for properties when no action is
declared and [ attributeName ] for attributes.

Also, for innerHTML and textContent, writing "html" and "text" instead of the actual property names should make things clearer, as those are the two most commonly scraped properties.

In addition, it might even be a good idea to remove the need for piping right after using { } or [ ].

All of this, including the textContent shorthand, would result in:

articles:
- select article
- body: .body {html}
  imageUrl: img [src]
  summary: .body p:first-child {innerHTML} format text
  title: .title {} // or .title {text}
  pageName: .body {html}

Of course, all of these shorthand expressions would be optional, but I think they would make a great addition. This would really simplify scraping; it makes it even beautiful, I dare to say! (And that's a lot to say in the messy world of web scraping.)

Thank you! :)

@DaniGuardiola

I just saw a flaw in my proposal: selectors targeting attributes would conflict with the [ ] shorthand, but maybe something can be done about that.

gajus closed this as completed in a15cf4d on Jan 31, 2017.