Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate some static site pages after JavaScript has run using Capybara #15640

Closed
wants to merge 4 commits into from

Conversation

henare
Copy link
Contributor

@henare henare commented Jan 17, 2018

What does this do?

Extends the static site build process to allow some pages to be generated after JavaScript (specifically jQuery AJAX) has finished running.

Why was this needed?

As part of everypolitician/democratic-commons-tasks#36 (comment) we're about to add data to pages exclusively via JavaScript. This would result in pages missing that data using our current build process, resulting in broken pages (at least for browsers without JS).

This pull request allows us to specify pages that we want generated after JS has run and therefore the data on the page has been populated.

Relevant Issue(s)

Closes #15639.

Implementation notes

Notes to Reviewer

If this is working correctly then the output of the build process should be the same before and after this PR. This is because the only page it's scraping has duplicated logic in the Ruby code for populating data. Once this PR is merged we can remove that duplicated logic and just build that page in JS (along with adding our new pages, when they're ready to be added to the site).

Notes to Merger

@henare henare requested review from mhl and tmtmtmtm January 17, 2018 06:11
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 17, 2018 06:11 Inactive
end

page = PageAfterAJAX.new
javascript_pages_to_scrape = open(ARGV[0]).read.split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably nicer to explicitly check that there is an argument, and blow up early if not.

Copy link
Contributor

@tmtmtmtm tmtmtmtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The basic idea seems good, though I've a few questions/comments about the detail:

Approach

I'm not quite following the approach of serving up javascript_pages_to_scrape.txt through the app, rather than simply having that be a text file entirely outside the app that the script reads directly. If this is meant to be a parallel with how the wget equivalent works, I fear that that's a slightly misleading consistency that doesn't buy us much, and makes these seem more similar than they actually are. The wget base is a page because that version is a recursive process of fetching (and saving) a given page and all other pages linked from that, whereas here you're simply working down a single list non-recursively.

Implementation

The implementation of PageAfterAJAX is also little odd to me (or possibly the naming, depending on what the class is actually supposed to represent). The naming implies that it's a Page, but it actually seems to be a wrapper around Capybara which you then feed pages to.

In general a constructor that doesn't take any arguments is a bit of an anti-pattern, especially when lots of the methods then take a consistent argument (in this case url). That implies that either this class should actually represent a Page, or at least that there's a Page class lurking in here, looking to escape. (I don't know whether creating a new Capybara::Session is expensive enough to want to share that across the URLs, or whether it's lightweight enough to simply create a new one each time)

The implementation is also quite imperative at the minute. That's not necessarily a bad thing, and we certainly shouldn't try to needlessly shoe-horn it into an OO pattern, but giving it the trappings of being in a Class, but then not really being one, is a little confusing. So, for example, rather than create_parent_directories, you could have a containing_directory method which also takes care of making sure that the directory exists before returning it. (That's strictly a violation of command-query separation, but I think it'd be OK here.)

You should also be able to simplify some of the implementation by using Pathname everywhere, rather than doing so much string manipulation. It has built-ins for getting parts out of the path, transforming the name, creating the directory, writing the file, etc.

Setup

Having this branch off #15637 seems like a slightly needless dependency here (or a dependency in the wrong order). Is there a reason this PR can't go live before that one? It seems that whichever commit(s) are needed for both could be cherry-picked into here, and go live with whichever is merged first, if introducing them in their own PR first doesn't make sense.

@henare
Copy link
Contributor Author

henare commented Jan 18, 2018

Thanks @tmtmtmtm for the detailed and very useful feedback 🌟

I've responded below to the detail but here's a summary of the actions I'd like to take, in roughly the order they should be done in:

  • Point this PR at master and remove the unnecessary dependency on Display seat count from Wikidata on legislature wikidata subpages #15637
  • Refactor class to use Pathname
  • Refactor class to be less imperative
  • Move the save_javascript_pages.rb script into the application somewhere, renaming the class in the process
  • Remove the unnecessary javascript_pages_to_scrape.txt endpoint
  • Have the process run from a Rake task

As this is a pretty significant rework I don't think there's going to be a lot of benefit in me using fixup commits so I'll probably rewrite history a bit as part of this process.

Approach

I'm not quite following the approach of serving up javascript_pages_to_scrape.txt through the app, rather than simply having that be a text file entirely outside the app that the script reads directly.

We want a way for the app to generate a list of URLs (e.g. a set of pages for every country we have) so I think I was thinking, "how can we communicate this to a process [the script] that's running outside the application?". In my initial notes for this work I was calling this a Rake task but then I don't even mention that in my discussion on how to approach this part. I think the obvious thing is to bring this process into the application itself, as I'd been originally thinking with a Raketask.

If this is meant to be a parallel with how the wget equivalent works, I fear that that's a slightly misleading consistency that doesn't buy us much, and makes these seem more similar than they actually are. The wget base is a page because that version is a recursive process of fetching (and saving) a given page and all other pages linked from that, whereas here you're simply working down a single list non-recursively.

👍

Implementation

The implementation of PageAfterAJAX is also little odd to me (or possibly the naming, depending on what the class is actually supposed to represent). The naming implies that it's a Page, but it actually seems to be a wrapper around Capybara which you then feed pages to.

It's probably the implementation and the naming 😄 It wasn't supposed to be a Page like in the application so the naming is definitely where the problems start.

If we bring it into the application as I suggest above then I'll try come up with a better name, and one that doesn't imply it's a Page.

In general a constructor that doesn't take any arguments is a bit of an anti-pattern, especially when lots of the methods then take a consistent argument (in this case url). That implies that either this class should actually represent a Page, or at least that there's a Page class lurking in here, looking to escape. (I don't know whether creating a new Capybara::Session is expensive enough to want to share that across the URLs, or whether it's lightweight enough to simply create a new one each time)

Yes, I agree this looks weird. It is because the intent is to share a Capybara::Session across URLs because these are expensive (akin to booting a browser AIUI).

The implementation is also quite imperative at the minute. That's not necessarily a bad thing, and we certainly shouldn't try to needlessly shoe-horn it into an OO pattern, but giving it the trappings of being in a Class, but then not really being one, is a little confusing. So, for example, rather than create_parent_directories, you could have a containing_directory method which also takes care of making sure that the directory exists before returning it. (That's strictly a violation of command-query separation, but I think it'd be OK here.)

Good call. I don't natively think in OO so it's not surprising it's quite imperative. I'll make that change and try to keep going with a refactor in that direction.

You should also be able to simplify some of the implementation by using Pathname everywhere, rather than doing so much string manipulation. It has built-ins for getting parts out of the path, transforming the name, creating the directory, writing the file, etc.

Great! I've not used that library so I'll give it a shot.

Setup

Having this branch off #15637 seems like a slightly needless dependency here (or a dependency in the wrong order). Is there a reason this PR can't go live before that one? It seems that whichever commit(s) are needed for both could be cherry-picked into here, and go live with whichever is merged first, if introducing them in their own PR first doesn't make sense.

Thank you for picking up on this. I don't know what my thinking was here. I'm guessing it was just because these bits of work had become conflated in my head, but of course they're not in code. I'll fix that up too.

@henare henare changed the base branch from seat-count-wikidata to master January 18, 2018 05:37
@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from 7209e4b to 227e3a3 Compare January 18, 2018 05:47
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 18, 2018 05:47 Inactive
@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from 227e3a3 to a9c588f Compare January 18, 2018 06:44
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 18, 2018 06:44 Inactive
@henare henare self-assigned this Jan 18, 2018
@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from a9c588f to c5efc41 Compare January 18, 2018 23:17
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 18, 2018 23:17 Inactive
@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from c5efc41 to e24d7da Compare January 19, 2018 00:54
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 19, 2018 00:54 Inactive
@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from e24d7da to 2ba56ca Compare January 19, 2018 02:19
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 19, 2018 02:19 Inactive
@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from 2ba56ca to 527a258 Compare January 19, 2018 02:58
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 19, 2018 02:58 Inactive
@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from 527a258 to 59b4a4c Compare January 19, 2018 03:07
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 19, 2018 03:08 Inactive
@henare
Copy link
Contributor Author

henare commented Jan 19, 2018

I've done a pretty major reworking of this now. It's called from a Rake task and the class now just takes an array of URLs to build. I think this all makes it a lot easier to understand and simplifies the implementation.

This still allows us to easily expand the set of pages we want to build by inserting more URLs into the array we pass to StaticSiteGenerator.

@tmtmtmtm could you please review this again? 🌻

@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from 59b4a4c to 13064bf Compare January 23, 2018 01:09
@tmtmtmtm tmtmtmtm temporarily deployed to everypolitician-viewe-pr-15640 January 23, 2018 01:09 Inactive
We're going to use this gem to scrape pages as part of our static site
generation, after JavaScript has finished running on those pages.

We're putting this in our Gemfile under the test and development groups
so you can use it in a Rakefile during development and by Travis under
test when the static site is built.
We want to be able to generate some pages as part of the static site
build after JavaScript has run on that page. This will allow these pages
to be populated with data from JavaScript but appear to progressively
enhance when deployed as part of our static site.

This Rake task currently has a hardcoded list of one page that needs
JavaScript run. This can be expanded pretty easily later. It then uses
capybara-webkit to fetch that page, run JS, and save those pages to
disk. The target files will be in the same location as `wget` uses which
will allow us to overwrite the static files already generated by `wget`
with these JS-enhanced versions.
In previous commits we added a way to generate some of our static pages
after JavaScript had finished running. This commit adds this step into
the static site generation build script.

After `wget` has finished running this Rake task overwrites some of the
static files it generated with the (currently) few pages we want
enhanced with JS.
@henare henare force-pushed the javascript-page-generation-capybara-refactor branch from 13064bf to 9de377d Compare January 23, 2018 01:30
Copy link
Contributor

@tmtmtmtm tmtmtmtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@henare I'm having a bit of difficulty working out what to do with this. The PR description seems quite out of date now, the new class you've created has no tests, and the rake task that uses it seems to require some undocumented things to happen first before it will work. So it's a little tricky for me to actually play with this at the minute, particularly in terms of being able to test out some of the things I refer to below and actually give working examples instead of more abstract comments.

My suspicion is that adding some working tests for StaticSiteGenerator will resolve some of this, so that might be the place to start.


I'd also suggest running reek against this new class: in particular it highlights a couple of utility functions masquerading as methods here. As before, there still seems to be quite of a clash of styles going on, where this has the trappings of being a class, but is still mostly a bit of a grab-bag of semi-related (and still quite imperative) functionality.

One tip: This isn't really a hard-and-fast rule, but as a useful heuristic any class name ending in -er/-or class is often a red flag: see, for example, http://objology.blogspot.co.uk/2011/09/one-of-best-bits-of-programming-advice.html

If you think about each part in terms of what it is, then add some behaviour to that, I suspect a simpler, more maintainable design will pop out. (My strong suspicion is that there's still at two different classes mixed up together here — one to represent the session, and one to represent the webpage being spidered.)

In terms of interface, intuitively I find supplying a list of all the URLs to be scraped to the constructor to be a little odd. This seems like something you should be able to keep feeding new URLs to — especially if the reason for the class in the first place is because starting a new session is expensive.

def initialize(urls:)
Capybara::Webkit.configure(&:allow_unknown_urls)
@page = Capybara::Session.new(:webkit)
@urls = urls.map { |u| URI.parse(u) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB: our practice in this repo is to avoid doing work directly in the constructor — most of these seem like things that can be deferred to the methods that ask for them.

henare added a commit that referenced this pull request Jan 26, 2018
This splits up the generator class as was a problem suspected in this PR
review:

#15640 (review)

It creates a single class for a Browser which extends the
capybara-webkit browser to include a way to get a page's contents after
jQuery's AJAX calls are done.

This is in a StaticSite module just to namespace it and indicate where
it's used in the application, rather than it's nature. In other words
it's not a _static site browser_ but you could read it that way so maybe
this needs changing?

All the remaining procedural logic is pushed into the Rake task. From
here we could then extract that if we see fit.
This splits up the generator class as was a problem suspected in this PR
review:

#15640 (review)

It creates a single class for a Browser which extends the
capybara-webkit browser to include a way to get a page's contents after
jQuery's AJAX calls are done.

This is in a StaticSite module just to namespace it and indicate where
it's used in the application, rather than it's nature. In other words
it's not a _static site browser_ but you could read it that way so maybe
this needs changing?

All the remaining procedural logic is pushed into the Rake task. From
here we could then extract that if we see fit.
@henare
Copy link
Contributor Author

henare commented Jan 30, 2018

Sorry this has been such a pain to review @tmtmtmtm :hurtrealbad:

As suggested I went to add tests for that class. Straight away I came up against a road block in that most of that class's methods were private and so difficult to test.

Instead of battling through that I tried to take what you said in the rest of your comment and rethink each part in terms of what it is. What I've come up with is 358001c which I think is a bit simpler.

As I mention in the commit it feels like some of the complexity has just been pushed in to the rake task so there's probably room for improvement there. Do you have any suggestions for things I could try?

We also a need to test the new Browser class. Since the methods on that class don't do a whole lot by themselves I think this will mean some kind of integration-level test with a page fixture.

@henare
Copy link
Contributor Author

henare commented May 2, 2019

Closing ancient pull requests 🗿

@henare henare closed this May 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generate static site pages after JavaScript has run
2 participants