RFC: Weaken dependency on the CMS #12

Aatch · 2013-06-10T02:39:27Z

Use Case

At PocketRent, we want to crawl a remote site and extract data from it. However, our use case is for building a database, rather than importing remote content to the CMS. In fact, we don't even have the CMS module in our code base. Thus, being able to simply remove the CMS-specific code would allow us to use this module without pulling in the entire CMS module to do so, we would merely delete the unused files.

Proposal

Refactor the code to move all of the crawling, extracting and storing functionality into it's own separate sub-system, then re-implement the current CMS functionality on top of it.

Details

The current code, from what I can see, is fairly separated already. This means that the refactoring would mostly consist of making the separation explicit.

A second, lower-level, public API would also be created. This lower-level API would expose the crawling, extraction and storage functionality in a way the allows developers to implement their own control mechanisms. This API would also be (almost) entirely self-contained, relying only on phpQuery.

No current functionality would be significantly changed, this is merely a refactoring project that allows for more generic use of the existing code.

sminnee · 2013-06-10T05:33:33Z

Hi Aatch,

The current code is fairly tightly bound to the externalcontent module and I don't think it would make much sense to separate that. The externalcontent module hopefully doesn't require the CMS and if not, I would probably attack the CMS dependency there. There's no reason that externalcontent needs CMS - it should be usable as a general connection between the SilverStripe ORM and 3rd party data sources. The one thing that is baked into it is the notion of hierarchical data; perhaps this can be resolved, but I would suggest resolving it within externalcontent.

I'm concerned about your comments "we'll create another lower level API", and although I wouldn't necessarily say it's a bad idea until I see actual code, I'm not hearing a great advantage for most users of this module, and instead additional complexity.

sminnee · 2013-06-10T05:51:55Z

Another way of looking at this is "what impact does the URL have on the imported data?"

Right now, this is hard-coded - the URL is split by token, the missing URLs are filled out, and the parent URL is used as a starting point, and StaticSiteContentItem::stageChildren() is used to recursively import child nodes at each point.

I think for an importer based on URLs, maintaining the existence of a hierarchy is hard to get away from and I'd expect any attempts to do so to end up as a separate project.

Given this, I wouldn't expect the code behind StaticSiteContentItem to change much: it's basically just ties a URL hierarchy into the external content system off the back of StaticSiteUrlList.

The meat of your refactoring would then probably happen in StaticSitePageTransformer and StaticSiteImporter. The system already allows for the possibility that different data-objects are imported, we just happen to have hard-coded getExternalType()'s value to "sitetree".

In summary, the refactoring could involve the following:

Update StaticSiteImporter to return more sensible values for getExternalType(). Perhaps use the existing notion of a StaticSiteContentSource_ImportSchema to do this. This might also be used to define contentTransforms in the constructor.
Refactor the bulk of StaticSitePageTransformer into a generic StaticSiteDataTransformer class, making StaticSitePageTransformer a subclass that does the SiteTree-specific things.
Create your own subclass of StaticSiteDataTransformer and tie it to an appropriate URL sub-set.

sminnee · 2013-06-10T06:17:03Z

OK, so I've had a look at what happens if you install staticsiteconnector and external-content without the framework, and the good news is that it doesn't look insurmountable:

ExternalContentAdmin implements CurrentPageIdentifier, but I don't know that this is necessary
StaticSiteContentSource has a has_many relationship, "Pages", which would need to be removed. I'm not sure if it's used much at the moment, so not a biggie.
LeftAndMain::getSiteTreeFor() refers to CMSPagesController, but it's not really clear why it needs to. Commenting it out fixed, external-content; it's possible that this could removed, or refactored to only be passed if needed. The code here seems a bit ganky in any case.

If you resolve those 3 things then you can use the external content system without CMS.

Aatch · 2013-06-10T06:43:57Z

@sminnee I haven't taken a proper look at external-content, so I'll take a closer look at that. It seems it's more generic than I though. It also seems that it may be more coherent to change external-content to support our use case, and just remove any duplicated logic.

I'll check it out properly and amend my proposal.

sminnee · 2013-06-10T07:40:05Z

Cool, in general where looking at external-content as a pretty extensible tool for connecting 3rd party systems through a common interface, so it makes sense for us to enrichen it for your needs.

phptek · 2013-06-11T02:28:48Z

@Aatch Have you had any further thoughts on StaticSiteDataTransformer as alluded-to by @sminnee ? Only it kinda goes against the ideas put forward in #1 unless specific forks in the logic were catered for, for different "classes" of content e.g. "Page" ("SiteTree") and "File".

At this early stage of my own dev, I'm rather more keen on a staged approach - implement a close-copy of StaticSitePageTransformer as StaticSiteFileTransformer and see what kind of issues we encounter with that before abstracting too far/much.

Thoughts?

I'm obviously keen to avoid duplication of effort seeing as we seem to be working on similar functionality. Feel free to contact me.

sminnee · 2013-06-12T00:28:39Z

@phptek my idea was that StaticSiteDataTransformer was basically a common base class of StaticSitePageTransformer and StaticSiteFileTransformer. If there's not a lot of code, then it's probably not necessary, but I suspected there might be some cross-over.

Aatch mentioned this issue Jun 10, 2013

RFC: Post-processing extracted data. #13

Open

sminnee mentioned this issue Jun 10, 2013

Import files & images to assets #1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Weaken dependency on the CMS #12

RFC: Weaken dependency on the CMS #12

Aatch commented Jun 10, 2013

sminnee commented Jun 10, 2013

sminnee commented Jun 10, 2013

sminnee commented Jun 10, 2013

Aatch commented Jun 10, 2013

sminnee commented Jun 10, 2013

phptek commented Jun 11, 2013

sminnee commented Jun 12, 2013

RFC: Weaken dependency on the CMS #12

RFC: Weaken dependency on the CMS #12

Comments

Aatch commented Jun 10, 2013

Use Case

Proposal

Details

sminnee commented Jun 10, 2013

sminnee commented Jun 10, 2013

sminnee commented Jun 10, 2013

Aatch commented Jun 10, 2013

sminnee commented Jun 10, 2013

phptek commented Jun 11, 2013

sminnee commented Jun 12, 2013