Generate some static site pages after JavaScript has run using Capybara #15640
Conversation
scripts/save_javascript_pages.rb (Outdated)

```ruby
end

page = PageAfterAJAX.new
javascript_pages_to_scrape = open(ARGV[0]).read.split
```
Probably nicer to explicitly check that there is an argument, and blow up early if not.
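A minimal sketch of that early check, assuming the script still takes the URL list as its first argument (the usage message wording is hypothetical):

```ruby
# Fail fast with a usage message if the URL list wasn't supplied.
abort 'Usage: save_javascript_pages.rb JAVASCRIPT_PAGES_FILE' if ARGV.empty?

javascript_pages_to_scrape = File.read(ARGV[0]).split
```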
The basic idea seems good, though I've a few questions/comments about the detail:
Approach
I'm not quite following the approach of serving up `javascript_pages_to_scrape.txt` through the app, rather than simply having that be a text file entirely outside the app that the script reads directly. If this is meant to be a parallel with how the `wget` equivalent works, I fear that that's a slightly misleading consistency that doesn't buy us much, and makes these seem more similar than they actually are. The `wget` base is a page because that version is a recursive process of fetching (and saving) a given page and all other pages linked from it, whereas here you're simply working down a single list non-recursively.
Implementation
The implementation of `PageAfterAJAX` is also a little odd to me (or possibly the naming, depending on what the class is actually supposed to represent). The naming implies that it's a Page, but it actually seems to be a wrapper around Capybara which you then feed pages to.
In general a constructor that doesn't take any arguments is a bit of an anti-pattern, especially when lots of the methods then take a consistent argument (in this case `url`). That implies that either this class should actually represent a Page, or at least that there's a `Page` class lurking in here, looking to escape. (I don't know whether creating a new Capybara::Session is expensive enough to want to share that across the URLs, or whether it's lightweight enough to simply create a new one each time.)
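A hypothetical sketch of what that extraction might look like; the `Page` class, its `contents` method, and the injected `session` are all illustrative names, not code from this PR:

```ruby
# A Page knows its own URL; the (possibly expensive) Capybara session
# is created once elsewhere and shared across all the pages.
class Page
  def initialize(url:, session:)
    @url = url
    @session = session
  end

  # Visit the page and return its rendered HTML.
  def contents
    @session.visit(@url)
    @session.html
  end
end
```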
The implementation is also quite imperative at the minute. That's not necessarily a bad thing, and we certainly shouldn't try to needlessly shoe-horn it into an OO pattern, but giving it the trappings of being in a Class, but then not really being one, is a little confusing. So, for example, rather than `create_parent_directories`, you could have a `containing_directory` method which also takes care of making sure that the directory exists before returning it. (That's strictly a violation of command-query separation, but I think it'd be OK here.)
You should also be able to simplify some of the implementation by using Pathname everywhere, rather than doing so much string manipulation. It has built-ins for getting parts out of the path, transforming the name, creating the directory, writing the file, etc.
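A sketch of both suggestions together; `target_path` is a hypothetical accessor holding the full path the scraped page should be written to:

```ruby
require 'pathname'

# Returns the directory the scraped file will live in, making sure it
# exists before returning (the command-query trade-off mentioned above).
def containing_directory
  Pathname.new(target_path).dirname.tap(&:mkpath)
end

# With the directory guaranteed to exist, Pathname handles the write
# too, replacing the manual string manipulation.
def save(html)
  containing_directory # ensure the parent directory exists
  Pathname.new(target_path).write(html)
end
```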
Setup
Having this branch off #15637 seems like a slightly needless dependency here (or a dependency in the wrong order). Is there a reason this PR can't go live before that one? It seems that whichever commit(s) are needed for both could be cherry-picked into here, and go live with whichever is merged first, if introducing them in their own PR first doesn't make sense.
Thanks @tmtmtmtm for the detailed and very useful feedback 🌟 I've responded below to the detail but here's a summary of the actions I'd like to take, in roughly the order they should be done in:
As this is a pretty significant rework I don't think there's going to be a lot of benefit in me using fixup commits, so I'll probably rewrite history a bit as part of this process.
Approach
We want a way for the app to generate a list of URLs (e.g. a set of pages for every country we have), so I was thinking, "how can we communicate this to a process [the script] that's running outside the application?". In my initial notes for this work I was calling this a Rake task, but then I didn't even mention that in my discussion of how to approach this part. I think the obvious thing is to bring this process into the application itself, as I'd originally been thinking, with a Rake task.
👍
Implementation
It's probably both the implementation and the naming 😄 It wasn't supposed to be a Page like in the application, so the naming is definitely where the problems start. If we bring it into the application as I suggest above then I'll try to come up with a better name, one that doesn't imply it's a Page.
Yes, I agree this looks weird. The intent is to share a Capybara::Session across URLs because these are expensive to create (akin to booting a browser, AIUI).
Good call. I don't natively think in OO so it's not surprising it's quite imperative. I'll make that change and try to keep going with a refactor in that direction.
Great! I've not used that library so I'll give it a shot.
Setup
Thank you for picking up on this. I don't know what my thinking was here. I'm guessing it was just because these bits of work had become conflated in my head, but of course they're not conflated in code. I'll fix that up too.
I've done a pretty major reworking of this now. It's called from a Rake task and the class now just takes an array of URLs to build. I think this all makes it a lot easier to understand and simplifies the implementation. This still allows us to easily expand the set of pages we want to build by inserting more URLs into the array we pass to StaticSiteGenerator. @tmtmtmtm could you please review this again? 🌻
We're going to use this gem to scrape pages as part of our static site generation, after JavaScript has finished running on those pages. We're putting this in our Gemfile under the test and development groups so it can be used in a Rakefile during development, and by Travis under test when the static site is built.
We want to be able to generate some pages as part of the static site build after JavaScript has run on those pages. This will allow these pages to be populated with data from JavaScript, while still appearing progressively enhanced when deployed as part of our static site. This Rake task currently has a hardcoded list of one page that needs JavaScript run; this can be expanded pretty easily later. It then uses capybara-webkit to fetch each page, run its JavaScript, and save the result to disk. The target files will be in the same locations as `wget` uses, which will allow us to overwrite the static files already generated by `wget` with these JS-enhanced versions.
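A rough sketch of what such a Rake task might look like. The task name, the hardcoded URL, and the `_site` output directory are illustrative guesses rather than the actual code in this PR, and the wait-for-AJAX step is elided here (it's discussed further below):

```ruby
require 'capybara/webkit'
require 'fileutils'
require 'uri'

desc 'Regenerate pages that need JavaScript to run before being saved'
task :save_javascript_pages do
  Capybara::Webkit.configure(&:allow_unknown_urls)
  session = Capybara::Session.new(:webkit)

  # Hardcoded list of one page for now; add more URLs here later.
  %w[http://localhost:5000/countries.html].each do |url|
    session.visit(url)
    # Mirror the path layout wget produces so we overwrite its output.
    target = File.join('_site', URI.parse(url).path)
    FileUtils.mkdir_p(File.dirname(target))
    File.write(target, session.html)
  end
end
```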
In previous commits we added a way to generate some of our static pages after JavaScript had finished running. This commit adds this step into the static site generation build script. After `wget` has finished running, this Rake task overwrites some of the static files it generated with the (currently) few pages we want enhanced with JS.
@henare I'm having a bit of difficulty working out what to do with this. The PR description seems quite out of date now, the new class you've created has no tests, and the rake task that uses it seems to require some undocumented things to happen first before it will work. So it's a little tricky for me to actually play with this at the minute, particularly in terms of being able to test out some of the things I refer to below and actually give working examples instead of more abstract comments.
My suspicion is that adding some working tests for `StaticSiteGenerator` will resolve some of this, so that might be the place to start.
I'd also suggest running `reek` against this new class: in particular it highlights a couple of utility functions masquerading as methods here. As before, there still seems to be quite a clash of styles going on, where this has the trappings of being a class, but is still mostly a bit of a grab-bag of semi-related (and still quite imperative) functionality.
One tip: This isn't really a hard-and-fast rule, but as a useful heuristic any class name ending in -er/-or is often a red flag: see, for example, http://objology.blogspot.co.uk/2011/09/one-of-best-bits-of-programming-advice.html
If you think about each part in terms of what it is, then add some behaviour to that, I suspect a simpler, more maintainable design will pop out. (My strong suspicion is that there are still at least two different classes mixed up together here: one to represent the session, and one to represent the webpage being spidered.)
In terms of interface, intuitively I find supplying a list of all the URLs to be scraped to the constructor to be a little odd. This seems like something you should be able to keep feeding new URLs to — especially if the reason for the class in the first place is because starting a new session is expensive.
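One hypothetical shape for that interface, where the session is built once inside the object and URLs are fed in as they arrive; `save_page` and the URLs here are illustrative, not the PR's actual API:

```ruby
generator = StaticSiteGenerator.new

# The expensive session is started once, inside the generator, and
# reused for every page it's fed.
%w[
  http://localhost:5000/index.html
  http://localhost:5000/countries.html
].each { |url| generator.save_page(url) }

# Callers can keep feeding it new URLs later without paying the
# session start-up cost again.
generator.save_page('http://localhost:5000/another-page.html')
```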
lib/static_site_generator.rb (Outdated)

```ruby
def initialize(urls:)
  Capybara::Webkit.configure(&:allow_unknown_urls)
  @page = Capybara::Session.new(:webkit)
  @urls = urls.map { |u| URI.parse(u) }
end
```
NB: our practice in this repo is to avoid doing work directly in the constructor — most of these seem like things that can be deferred to the methods that ask for them.
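A sketch of how that deferral might look, memoising the session so the constructor only stores its inputs (assuming the same class shape as the diff above):

```ruby
def initialize(urls:)
  @urls = urls
end

private

# The session is only configured and created the first time something
# asks for it.
def page
  @page ||= begin
    Capybara::Webkit.configure(&:allow_unknown_urls)
    Capybara::Session.new(:webkit)
  end
end

# URI parsing can likewise wait until the URLs are actually used.
def urls
  @parsed_urls ||= @urls.map { |u| URI.parse(u) }
end
```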
This splits up the generator class, addressing a problem suspected in this PR review: #15640 (review) It creates a single class for a Browser which extends the capybara-webkit browser to include a way to get a page's contents after jQuery's AJAX calls are done. This is in a StaticSite module just to namespace it and indicate where it's used in the application, rather than its nature. In other words it's not a _static site browser_, but you could read it that way, so maybe this needs changing? All the remaining procedural logic is pushed into the Rake task. From here we could then extract that if we see fit.
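A minimal sketch of the wait-for-AJAX idea described above, assuming the common pattern of polling jQuery's `jQuery.active` counter. The module, class, and method intent are taken from the commit message, but the body is illustrative; for self-containment this wraps a Capybara::Session rather than extending capybara-webkit's own browser class, and the 10-second timeout is an arbitrary choice:

```ruby
require 'capybara/webkit'
require 'timeout'

module StaticSite
  class Browser
    # Visit the page, wait until jQuery reports no AJAX requests still
    # in flight, then return the rendered HTML.
    def contents_after_ajax(url)
      session.visit(url)
      Timeout.timeout(10) do
        sleep 0.1 until session.evaluate_script('jQuery.active').zero?
      end
      session.html
    end

    private

    def session
      @session ||= Capybara::Session.new(:webkit)
    end
  end
end
```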
Sorry this has been such a pain to review @tmtmtmtm As suggested, I went to add tests for that class. Straight away I came up against a roadblock in that most of that class's methods were private and so difficult to test. Instead of battling through that I tried to take what you said in the rest of your comment and rethink each part in terms of what it is. What I've come up with is 358001c which I think is a bit simpler. As I mention in the commit it feels like some of the complexity has just been pushed into the rake task, so there's probably room for improvement there. Do you have any suggestions for things I could try? We also need to test the new Browser class. Since the methods on that class don't do a whole lot by themselves I think this will mean some kind of integration-level test with a page fixture.
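One way such an integration-level test could look, assuming Minitest, the `Browser` sketch above, and a local HTML fixture whose JavaScript inserts a known marker; the fixture path, marker text, and the ability of the driver to load file:// URLs are all assumptions:

```ruby
require 'minitest/autorun'
require_relative '../lib/static_site/browser'

class StaticSiteBrowserTest < Minitest::Test
  def test_returns_contents_after_ajax_has_run
    # Point the browser at a fixture on disk whose JavaScript inserts a
    # known marker (assumes the driver can load file:// URLs).
    fixture = "file://#{File.expand_path('fixtures/ajax_page.html', __dir__)}"
    browser = StaticSite::Browser.new
    assert_includes browser.contents_after_ajax(fixture), 'populated-by-js'
  end
end
```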
Closing ancient pull requests 🗿
What does this do?
Extends the static site build process to allow some pages to be generated after JavaScript (specifically jQuery AJAX) has finished running.
Why was this needed?
As part of everypolitician/democratic-commons-tasks#36 (comment) we're about to add data to pages exclusively via JavaScript. Under our current build process this would result in pages missing that data, leaving them broken (at least for browsers without JS).
This pull request allows us to specify pages that we want generated after JS has run and therefore the data on the page has been populated.
Relevant Issue(s)
Closes #15639.
Implementation notes
Notes to Reviewer
If this is working correctly then the output of the build process should be the same before and after this PR. This is because the only page it's scraping has duplicated logic in the Ruby code for populating data. Once this PR is merged we can remove that duplicated logic and just build that page in JS (along with adding our new pages, when they're ready to be added to the site).
Notes to Merger