add "selector" function #9

gajus · 2017-01-18T12:26:26Z

Sometimes different parts of the scraper script need to access the same element.

Consider this example:

scrapeMovies gets a list of movie names, https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae#file-mk2-js-L49-L62
scrapeShowtimes parses additional information about the parsed movies, https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae#file-mk2-js-L83-L106

The information is scraped from the same URL (therefore, the same document).

scrapeMovies selects the movie elements and passes an instance of the resulting cheerio selector to scrapeShowtimes. scrapeShowtimes then uses the parent selector tr to find the corresponding movie table row.

Using the parent selector is bad because scrapeShowtimes should work only with the information it is given (e.g., the identifier of an element); it should not be able to traverse the DOM upwards. Furthermore, this makes logging useless.

A better alternative would be to derive a unique selector that can be shared between the processes. The above example could then be rewritten as:

export const scrapeMovies = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document);

  const movies = x({
    properties: {
      "name": ".fiche-film-title",
      "movieElementSelector": "tr::selector()"
    },
    selector: '#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title) {0,}'
  });

  return movies.map((movie) => {
    return {
      guide: {
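        // The showtimes are in the same document, so pass the same URL on.
        // (movie.url is never set because "url" is not among the scraped properties.)
        url: guide.url,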
        movieElementSelector: movie.movieElementSelector
      },
      result: {
        name: movie.name
      }
    };
  });
};

export const scrapeShowtimes = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document);

  const events = x({
    properties: {
      time: '::text()',
      version: '(VOST|VO|VF)',
      url: '::attribute(href)'
    },
    selector: [
      guide.movieElementSelector,
      '.item-list a[href^="/reservation"]'
    ]
  });

  return events.map((event) => {
    return {
      result: {
        time: event.time,
        url: 'http://www.mk2.com' + event.url
      }
    };
  });
};

The idea is that tr::selector() returns a CSS selector that, given the same document, will select the same element.
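To make the idea concrete, here is a minimal sketch of how such a selector could be derived. It is not surgeon's implementation: deriveSelector, el and text are hypothetical helpers, and the sketch assumes htmlparser2-style nodes (the kind cheerio wraps) that expose type, name, parent and children. It walks from the element up to the root and records each element's position among its element siblings.

```javascript
// Node types treated as elements (assumption: htmlparser2-style nodes,
// where <script> and <style> carry their own type).
const ELEMENT_TYPES = ['tag', 'script', 'style'];

const isElement = (node) => Boolean(node) && ELEMENT_TYPES.includes(node.type);

// Hypothetical helper: builds a "tag:nth-child(n) > ..." path for an element.
function deriveSelector(element) {
  const segments = [];
  let node = element;

  while (isElement(node)) {
    if (!node.parent) {
      // A parentless element is identified by its tag name alone.
      segments.unshift(node.name);
      break;
    }

    // :nth-child counts element siblings only, so text nodes are skipped.
    const siblings = node.parent.children.filter(isElement);
    const position = siblings.indexOf(node) + 1;

    segments.unshift(`${node.name}:nth-child(${position})`);
    node = node.parent;
  }

  return segments.join(' > ');
}

// Hand-built tree standing in for a parsed document (illustration only).
function el(name, children = []) {
  const node = { type: 'tag', name, parent: null, children };
  children.forEach((child) => {
    child.parent = node;
  });
  return node;
}

const text = (data) => ({ type: 'text', data, parent: null, children: [] });

const firstRow = el('tr');
const secondRow = el('tr');
const table = el('table', [text('\n'), firstRow, text('\n'), secondRow]);
const html = el('html', [el('body', [table])]);

console.log(deriveSelector(secondRow));
// html > body:nth-child(1) > table:nth-child(1) > tr:nth-child(2)
```

A positional path like this is only stable while the document structure stays the same between requests. That matches the "given the same document" condition above, but it would break if the page reorders its rows.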

This example ignores "date" selection. The latter poses another complication.

gajus · 2017-01-18T12:30:10Z

The example used in this proposal also uses an array for the selector.

selector: [
  guide.movieElementSelector,
  '.item-list a[href^="/reservation"]'
]

That's simply for chaining multiple selectors. I guess it could be written as guide.movieElementSelector + ' .item-list a[href^="/reservation"]', but that would make selector parsing a lot more complicated (because a quantifier expression and other expressions could appear anywhere in the selector).
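As a rough illustration of why the array form is easier to parse, the sketch below treats every array item as one step and looks for a quantifier expression only at the end of that step. parseStep and parseSelectorChain are hypothetical names, and the quantifier syntax is assumed from the {0,} used in the scrapeMovies example; surgeon's actual grammar may differ.

```javascript
// Hypothetical helper: splits one step into its CSS part and an optional
// trailing quantifier such as "{0,}" or "{1,3}".
function parseStep(step) {
  const match = /^(.*?)\s*\{(\d+),(\d*)\}\s*$/.exec(step);

  if (!match) {
    return { css: step.trim(), quantifier: null };
  }

  return {
    css: match[1].trim(),
    quantifier: {
      min: Number(match[2]),
      // An open upper bound ("{0,}") is treated as unbounded.
      max: match[3] === '' ? Infinity : Number(match[3])
    }
  };
}

// Each array item is parsed on its own, so an expression in one step
// can never be confused with the start of the next step.
function parseSelectorChain(steps) {
  return steps.map(parseStep);
}

const chain = parseSelectorChain([
  'tr:nth-child(2)',
  '.item-list a[href^="/reservation"] {0,}'
]);

console.log(chain);
```

With a single concatenated string, the parser would first have to work out where one step ends and the next begins before it could look for expressions like these.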

This needs a separate proposal.

gajus added enhancement proposal labels Jan 18, 2017

gajus mentioned this issue Jan 18, 2017

Make the API declarative #4

Closed
