Generate some static site pages after JavaScript has run using Capybara #15640

henare · 2018-01-17T06:11:27Z

What does this do?

Extends the static site build process to allow some pages to be generated after JavaScript (specifically jQuery AJAX) has finished running.

Why was this needed?

As part of everypolitician/democratic-commons-tasks#36 (comment) we're about to add data to pages exclusively via JavaScript. This would result in pages missing that data using our current build process, resulting in broken pages (at least for browsers without JS).

This pull request allows us to specify pages that we want generated after JS has run and therefore the data on the page has been populated.

Relevant Issue(s)

Closes #15639.

Implementation notes

Notes to Reviewer

If this is working correctly then the output of the build process should be the same before and after this PR. This is because the only page it's scraping has duplicated logic in the Ruby code for populating data. Once this PR is merged we can remove that duplicated logic and just build that page in JS (along with adding our new pages, when they're ready to be added to the site).

Notes to Merger

tmtmtmtm · 2018-01-17T08:37:30Z

scripts/save_javascript_pages.rb

+end
+
+page = PageAfterAJAX.new
+javascript_pages_to_scrape = open(ARGV[0]).read.split


Probably nicer to explicitly check that there is an argument, and blow up early if not.

tmtmtmtm

The basic idea seems good, though I've a few questions/comments about the detail:

Approach

I'm not quite following the approach of serving up javascript_pages_to_scrape.txt through the app, rather than simply having that be a text file entirely outside the app that the script reads directly. If this is meant to be a parallel with how the wget equivalent works, I fear that that's a slightly misleading consistency that doesn't buy us much, and makes these seem more similar than they actually are. The wget base is a page because that version is a recursive process of fetching (and saving) a given page and all other pages linked from that, whereas here you're simply working down a single list non-recursively.

Implementation

The implementation of PageAfterAJAX is also little odd to me (or possibly the naming, depending on what the class is actually supposed to represent). The naming implies that it's a Page, but it actually seems to be a wrapper around Capybara which you then feed pages to.

In general a constructor that doesn't take any arguments is a bit of an anti-pattern, especially when lots of the methods then take a consistent argument (in this case url). That implies that either this class should actually represent a Page, or at least that there's a Page class lurking in here, looking to escape. (I don't know whether creating a new Capybara::Session is expensive enough to want to share that across the URLs, or whether it's lightweight enough to simply create a new one each time)

The implementation is also quite imperative at the minute. That's not necessarily a bad thing, and we certainly shouldn't try to needlessly shoe-horn it into an OO pattern, but giving it the trappings of being in a Class, but then not really being one, is a little confusing. So, for example, rather than create_parent_directories, you could have a containing_directory method which also takes care of making sure that the directory exists before returning it. (That's strictly a violation of command-query separation, but I think it'd be OK here.)

You should also be able to simplify some of the implementation by using Pathname everywhere, rather than doing so much string manipulation. It has built-ins for getting parts out of the path, transforming the name, creating the directory, writing the file, etc.

Setup

Having this branch off #15637 seems like a slightly needless dependency here (or a dependency in the wrong order). Is there a reason this PR can't go live before that one? It seems that whichever commit(s) are needed for both could be cherry-picked into here, and go live with whichever is merged first, if introducing them in their own PR first doesn't make sense.

henare · 2018-01-18T05:35:02Z

Thanks @tmtmtmtm for the detailed and very useful feedback 🌟

I've responded below to the detail but here's a summary of the actions I'd like to take, in roughly the order they should be done in:

Point this PR at master and remove the unnecessary dependency on Display seat count from Wikidata on legislature wikidata subpages #15637
Refactor class to use Pathname
Refactor class to be less imperative
Move the save_javascript_pages.rb script into the application somewhere, renaming the class in the process
Remove the unnecessary javascript_pages_to_scrape.txt endpoint
Have the process run from a Rake task

As this is a pretty significant rework I don't think there's going to be a lot of benefit in me using fixup commits so I'll probably rewrite history a bit as part of this process.

Approach

I'm not quite following the approach of serving up javascript_pages_to_scrape.txt through the app, rather than simply having that be a text file entirely outside the app that the script reads directly.

We want a way for the app to generate a list of URLs (e.g. a set of pages for every country we have) so I think I was thinking, "how can we communicate this to a process [the script] that's running outside the application?". In my initial notes for this work I was calling this a Rake task but then I don't even mention that in my discussion on how to approach this part. I think the obvious thing is to bring this process into the application itself, as I'd been originally thinking with a Raketask.

If this is meant to be a parallel with how the wget equivalent works, I fear that that's a slightly misleading consistency that doesn't buy us much, and makes these seem more similar than they actually are. The wget base is a page because that version is a recursive process of fetching (and saving) a given page and all other pages linked from that, whereas here you're simply working down a single list non-recursively.

👍

Implementation

The implementation of PageAfterAJAX is also little odd to me (or possibly the naming, depending on what the class is actually supposed to represent). The naming implies that it's a Page, but it actually seems to be a wrapper around Capybara which you then feed pages to.

It's probably the implementation and the naming 😄 It wasn't supposed to be a Page like in the application so the naming is definitely where the problems start.

If we bring it into the application as I suggest above then I'll try come up with a better name, and one that doesn't imply it's a Page.

In general a constructor that doesn't take any arguments is a bit of an anti-pattern, especially when lots of the methods then take a consistent argument (in this case url). That implies that either this class should actually represent a Page, or at least that there's a Page class lurking in here, looking to escape. (I don't know whether creating a new Capybara::Session is expensive enough to want to share that across the URLs, or whether it's lightweight enough to simply create a new one each time)

Yes, I agree this looks weird. It is because the intent is to share a Capybara::Session across URLs because these are expensive (akin to booting a browser AIUI).

The implementation is also quite imperative at the minute. That's not necessarily a bad thing, and we certainly shouldn't try to needlessly shoe-horn it into an OO pattern, but giving it the trappings of being in a Class, but then not really being one, is a little confusing. So, for example, rather than create_parent_directories, you could have a containing_directory method which also takes care of making sure that the directory exists before returning it. (That's strictly a violation of command-query separation, but I think it'd be OK here.)

Good call. I don't natively think in OO so it's not surprising it's quite imperative. I'll make that change and try to keep going with a refactor in that direction.

You should also be able to simplify some of the implementation by using Pathname everywhere, rather than doing so much string manipulation. It has built-ins for getting parts out of the path, transforming the name, creating the directory, writing the file, etc.

Great! I've not used that library so I'll give it a shot.

Setup

Having this branch off #15637 seems like a slightly needless dependency here (or a dependency in the wrong order). Is there a reason this PR can't go live before that one? It seems that whichever commit(s) are needed for both could be cherry-picked into here, and go live with whichever is merged first, if introducing them in their own PR first doesn't make sense.

Thank you for picking up on this. I don't know what my thinking was here. I'm guessing it was just because these bits of work had become conflated in my head, but of course they're not in code. I'll fix that up too.

henare · 2018-01-19T03:22:44Z

I've done a pretty major reworking of this now. It's called from a Rake task and the class now just takes an array of URLs to build. I think this all makes it a lot easier to understand and simplifies the implementation.

This still allows us to easily expand the set of pages we want to build by inserting more URLs into the array we pass to StaticSiteGenerator.

@tmtmtmtm could you please review this again? 🌻

We're going to use this gem to scrape pages as part of our static site generation, after JavaScript has finished running on those pages. We're putting this in our Gemfile under the test and development groups so you can use it in a Rakefile during development and by Travis under test when the static site is built.

We want to be able to generate some pages as part of the static site build after JavaScript has run on that page. This will allow these pages to be populated with data from JavaScript but appear to progressively enhance when deployed as part of our static site. This Rake task currently has a hardcoded list of one page that needs JavaScript run. This can be expanded pretty easily later. It then uses capybara-webkit to fetch that page, run JS, and save those pages to disk. The target files will be in the same location as `wget` uses which will allow us to overwrite the static files already generated by `wget` with these JS-enhanced versions.

In previous commits we added a way to generate some of our static pages after JavaScript had finished running. This commit adds this step into the static site generation build script. After `wget` has finished running this Rake task overwrites some of the static files it generated with the (currently) few pages we want enhanced with JS.

tmtmtmtm

@henare I'm having a bit of difficulty working out what to do with this. The PR description seems quite out of date now, the new class you've created has no tests, and the rake task that uses it seems to require some undocumented things to happen first before it will work. So it's a little tricky for me to actually play with this at the minute, particularly in terms of being able to test out some of the things I refer to below and actually give working examples instead of more abstract comments.

My suspicion is that adding some working tests for StaticSiteGenerator will resolve some of this, so that might be the place to start.

I'd also suggest running reek against this new class: in particular it highlights a couple of utility functions masquerading as methods here. As before, there still seems to be quite of a clash of styles going on, where this has the trappings of being a class, but is still mostly a bit of a grab-bag of semi-related (and still quite imperative) functionality.

One tip: This isn't really a hard-and-fast rule, but as a useful heuristic any class name ending in -er/-or class is often a red flag: see, for example, http://objology.blogspot.co.uk/2011/09/one-of-best-bits-of-programming-advice.html

If you think about each part in terms of what it is, then add some behaviour to that, I suspect a simpler, more maintainable design will pop out. (My strong suspicion is that there's still at two different classes mixed up together here — one to represent the session, and one to represent the webpage being spidered.)

In terms of interface, intuitively I find supplying a list of all the URLs to be scraped to the constructor to be a little odd. This seems like something you should be able to keep feeding new URLs to — especially if the reason for the class in the first place is because starting a new session is expensive.

tmtmtmtm · 2018-01-23T15:19:13Z

lib/static_site_generator.rb

+  def initialize(urls:)
+    Capybara::Webkit.configure(&:allow_unknown_urls)
+    @page = Capybara::Session.new(:webkit)
+    @urls = urls.map { |u| URI.parse(u) }


NB: our practice in this repo is to avoid doing work directly in the constructor — most of these seem like things that can be deferred to the methods that ask for them.

This splits up the generator class as was a problem suspected in this PR review: #15640 (review) It creates a single class for a Browser which extends the capybara-webkit browser to include a way to get a page's contents after jQuery's AJAX calls are done. This is in a StaticSite module just to namespace it and indicate where it's used in the application, rather than it's nature. In other words it's not a _static site browser_ but you could read it that way so maybe this needs changing? All the remaining procedural logic is pushed into the Rake task. From here we could then extract that if we see fit.

henare · 2018-01-30T03:44:05Z

Sorry this has been such a pain to review @tmtmtmtm

As suggested I went to add tests for that class. Straight away I came up against a road block in that most of that class's methods were private and so difficult to test.

Instead of battling through that I tried to take what you said in the rest of your comment and rethink each part in terms of what it is. What I've come up with is 358001c which I think is a bit simpler.

As I mention in the commit it feels like some of the complexity has just been pushed in to the rake task so there's probably room for improvement there. Do you have any suggestions for things I could try?

We also a need to test the new Browser class. Since the methods on that class don't do a whole lot by themselves I think this will mean some kind of integration-level test with a page fixture.

henare · 2019-05-02T00:34:24Z

Closing ancient pull requests 🗿

henare requested review from mhl and tmtmtmtm January 17, 2018 06:11

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 17, 2018 06:11 Inactive

tmtmtmtm reviewed Jan 17, 2018

View reviewed changes

henare changed the base branch from seat-count-wikidata to master January 18, 2018 05:37

henare force-pushed the javascript-page-generation-capybara-refactor branch from 7209e4b to 227e3a3 Compare January 18, 2018 05:47

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 18, 2018 05:47 Inactive

henare force-pushed the javascript-page-generation-capybara-refactor branch from 227e3a3 to a9c588f Compare January 18, 2018 06:44

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 18, 2018 06:44 Inactive

henare self-assigned this Jan 18, 2018

henare force-pushed the javascript-page-generation-capybara-refactor branch from a9c588f to c5efc41 Compare January 18, 2018 23:17

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 18, 2018 23:17 Inactive

henare force-pushed the javascript-page-generation-capybara-refactor branch from c5efc41 to e24d7da Compare January 19, 2018 00:54

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 19, 2018 00:54 Inactive

henare force-pushed the javascript-page-generation-capybara-refactor branch from e24d7da to 2ba56ca Compare January 19, 2018 02:19

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 19, 2018 02:19 Inactive

henare force-pushed the javascript-page-generation-capybara-refactor branch from 2ba56ca to 527a258 Compare January 19, 2018 02:58

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 19, 2018 02:58 Inactive

henare force-pushed the javascript-page-generation-capybara-refactor branch from 527a258 to 59b4a4c Compare January 19, 2018 03:07

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 19, 2018 03:08 Inactive

henare assigned tmtmtmtm Jan 19, 2018

henare force-pushed the javascript-page-generation-capybara-refactor branch from 59b4a4c to 13064bf Compare January 23, 2018 01:09

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 23, 2018 01:09 Inactive

henare added 3 commits January 23, 2018 12:29

henare force-pushed the javascript-page-generation-capybara-refactor branch from 13064bf to 9de377d Compare January 23, 2018 01:30

tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 23, 2018 01:30 Inactive

This was referenced Jan 23, 2018

[Wikidata] Create countries list page #15643

Closed

Generate static site pages after JavaScript has run #15639

Open

[Wikidata] Create country detail pages #15644

Closed

[Wikidata] Create legislature detail pages #15645

Closed

tmtmtmtm reviewed Jan 23, 2018

View reviewed changes

tmtmtmtm deployed to everypolitician-viewe-pr-15640 January 30, 2018 03:28 View deployment

henare closed this May 2, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate some static site pages after JavaScript has run using Capybara #15640

Generate some static site pages after JavaScript has run using Capybara #15640

henare commented Jan 17, 2018 •

edited

Loading

tmtmtmtm Jan 17, 2018

tmtmtmtm left a comment

henare commented Jan 18, 2018 •

edited

Loading

henare commented Jan 19, 2018

tmtmtmtm left a comment

tmtmtmtm Jan 23, 2018

henare commented Jan 30, 2018

henare commented May 2, 2019

Generate some static site pages after JavaScript has run using Capybara #15640

Generate some static site pages after JavaScript has run using Capybara #15640

Conversation

henare commented Jan 17, 2018 • edited Loading

What does this do?

Why was this needed?

Relevant Issue(s)

Implementation notes

Notes to Reviewer

Notes to Merger

tmtmtmtm Jan 17, 2018

Choose a reason for hiding this comment

tmtmtmtm left a comment

Choose a reason for hiding this comment

Approach

Implementation

Setup

henare commented Jan 18, 2018 • edited Loading

Approach

Implementation

Setup

henare commented Jan 19, 2018

tmtmtmtm left a comment

Choose a reason for hiding this comment

tmtmtmtm Jan 23, 2018

Choose a reason for hiding this comment

henare commented Jan 30, 2018

henare commented May 2, 2019

henare commented Jan 17, 2018 •

edited

Loading

henare commented Jan 18, 2018 •

edited

Loading