# Make the API declarative #4
To continue the example from the documentation:

```js
x('body', {
  title: x('title'),
  articles: x('article {0,}', {
    body: x('[email protected]'),
    summary: x('.body p {0,}[0]'),
    imageUrl: x('img@src'),
    title: x('.title')
  })
});
```

A declarative approach would look something like this:

```json
{
  "selector": "body",
  "properties": {
    "title": "title",
    "articles": {
      "selector": "article {0,}",
      "properties": {
        "body": "[email protected]",
        "summary": ".body p {0,}[0]",
        "imageUrl": "img@src",
        "title": ".title"
      }
    }
  }
}
```
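A schema in this shape could be consumed by a small recursive walker. The sketch below is purely illustrative (the `normalize` name and node shape are assumptions, not Surgeon's actual API): it expands shorthand string values into full `{selector}` nodes that an interpreter could then evaluate.

```javascript
// Hypothetical sketch: recursively normalize the declarative schema so that
// shorthand string values become full {selector: ...} nodes. The function
// name and node shape are assumptions, not part of Surgeon's API.
const normalize = (node) => {
  if (typeof node === 'string') {
    return {selector: node};
  }
  const result = {selector: node.selector};
  if (node.properties) {
    result.properties = {};
    for (const [name, child] of Object.entries(node.properties)) {
      result.properties[name] = normalize(child);
    }
  }
  return result;
};
```

With normalization done up front, the evaluator only ever sees one node shape, which keeps the recursive extraction logic simple.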
The equivalent in YAML is even shorter:

```yaml
selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: [email protected]
      summary: .body p {0,}[0]
      imageUrl: img@src
      title: .title
```

This also makes the syntax for declaring validators and formatters more intuitive, e.g.

```yaml
selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: [email protected]
      summary: .body p {0,}[0]
      imageUrl: img@src
      title:
        selector: .title
        test: /foo/
        format: "upperCase"
```

Where
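One way such `test` and `format` rules could be applied to an extracted value is sketched below. The `formatters` registry and `applyRules` helper are assumptions made for illustration, not an existing API.

```javascript
// Illustrative sketch: apply a `test` regex and a named `format` function
// to an extracted value. The registry and helper names are hypothetical.
const formatters = {
  upperCase: (value) => value.toUpperCase(),
};

const applyRules = (value, rules) => {
  if (rules.test && !rules.test.test(value)) {
    throw new Error('Value does not pass test: ' + value);
  }
  return rules.format ? formatters[rules.format](value) : value;
};
```

Validation happens before formatting here; as discussed further down in the thread, the order in which the two should run is itself a design question.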
We could also add JSON schema support, e.g.

```yaml
selector: .movie
schema: "movie"
properties:
  name: ".name"
  url: ".url"
```

Where

I was looking for "declarative scraper" and found this: https://github.com/ContentMine/scraperJSON. It's not mature or anything, but it demonstrates an attempt to write a declarative scraper. There is even a Node.js implementation, https://github.com/ContentMine/thresher. I like the idea of the regex capture groups,

There is also https://github.com/drbig/grabber |
@rla any thoughts on this? |
By limiting the schema to JSON, you do limit the transforms that can be done re: #3. Or would there be a way to define custom transforms and reference those by key (such that I could specify |
That's the idea. I am convinced transforms should be added, though. Extracting data and formatting data are two very different tasks. |
The focus of DOM-EEE was mainly on the extraction part, as the code involved there had a tendency to get too complex. The idea was to get simpler objects from the DOM by cutting down most of the element noise. The simplified object tree is supposed to be easier to work with using basic language constructs, such as loops or Array methods. This is what I experienced in multiple projects. The second goal was extreme portability. This made JSON input-output mandatory. If the project is mainly to be used from JavaScript environments, then this might not be the optimal choice.

Validation and transformation can also be represented in the declarative form as string identifiers, or as arrays of them, like

```js
{
  selector: '.date',
  transform: 'convert-date',
  validity: 'date-not-in-future'
}
```

The actual transforms and validators need to be defined and registered with the library first. Non-existing transforms and validators can then be easily checked for. I see that defining them inline directly on the declarative form can make it too complex. The order in which to apply validations and transforms is not clear, though. We might want to check whether the selector matches at all, or actually validate the transformed date. One of the DOM-EEE design aspects is that non-matching selectors return null, making catch-all validation easy: the output just has to be checked for nulls. This, with some amount of "manually executed" checks, has proven to cover lots of cases for me.

JSON schema can be applied to the output independently of this library. If we built in support, we would have to choose an implementation. As I understand it, there are drafts 3 and 4 of JSON schema with huge differences, and various packages pick arbitrarily what to support from either of them. |
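The register-then-resolve idea above could look something like the following sketch. All names here (`registerTransform`, `resolveTransform`) are hypothetical, chosen only to illustrate failing fast on unknown keys.

```javascript
// Sketch of a transform registry: transforms are registered by key, and an
// unknown key fails immediately rather than at extraction time. All names
// are illustrative assumptions, not an existing API.
const transforms = new Map();

const registerTransform = (name, fn) => {
  transforms.set(name, fn);
};

const resolveTransform = (name) => {
  if (!transforms.has(name)) {
    throw new Error('Unknown transform: ' + name);
  }
  return transforms.get(name);
};

registerTransform('convert-date', (value) => new Date(value));
```

Resolving every identifier in a schema when it is loaded, before any scraping runs, would surface typos in transform names up front.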
In terms of choosing an implementation, Ajv (https://github.com/epoberezkin/ajv) is now a somewhat de facto standard in the JavaScript community. However, I agree with your point that it can be done outside of Surgeon.
What about throwing an error (like Surgeon does at present)? This way you are sure that no unexpected behaviour is left unseen.
Agree.
What is a use case for wanting to apply a validator after formatting the message? Wouldn't a formatter throw an error if it cannot format the data to the desired format? |
This needs some mechanism to mark optional properties, for the case where an element sometimes exists and sometimes does not, but you would like to use it when it exists.
To decouple date parsing from checking whether the date is in the future, for example. Composability; otherwise you need something like |
Wouldn't you agree that

The fact that a document contains a date that's in the past does not make it an invalid date. It is just a data set that you are not interested in. "validate" here would do no good, since it would throw an error (and break the scraper). A filter could be used, though. |
A better example is parsing a URL and checking whether it contains a specific query parameter. The validation here means guarding against a changed URL structure, not filtering a set of URLs on the page. Filtering probably needs to be described as well, maybe in a separate issue. |
Going back to the original question:
If you have retrieved a URL, you can use the validator to assert that the URL schema has not changed. Where does the formatting come in? |
The use case is where I want to parse the URL only once, not in the validator and not in the later steps. |
For performance purposes? |
That said, you wouldn't be parsing the URL for validation purposes... in most cases a regex would be enough. Unless, of course, your intention is to ignore new parameters being added to the URL, or parameters changing order. This sounds dangerous, though. |
A bit of case study. I took an existing scraper (https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae/8943410d8e39d1eb013b11ec0d5ae50471829c09) and attempted to rewrite it using a declarative API.
Let's start with:
|
@bitshadow please have a look at this too. |
I have implemented a variation of the above API in the declarative-api branch.

```json
{
  "adopt": {
    "articles": {
      "imageUrl": {
        "extract": {
          "name": "href",
          "type": "attribute"
        },
        "select": "img"
      },
      "summary": "p:first-child",
      "title": ".title"
    },
    "pageTitle": "h1"
  },
  "select": "main"
}
```
|
I just discovered this tool and also this alternative API and I'm quite hyped. If this becomes stable I'll be talking to my CTO to refactor our scrapers. I'm preparing the proposal already! Keep it up 👍 |
Don't rush into it @DaniGuardiola. This API is still going to change. (I appreciate the enthusiasm, though.) I am going to post an update in the next 15 minutes describing what's wrong with the above and how it can be improved. |
As I was saying, while working on the above API, I have realised that a much more powerful API would allow performing subroutines, e.g.

```json
{
  "articles": [
    {
      "action": "select",
      "selector": "article"
    },
    {
      "action": "adopt",
      "children": {
        "body": [
          {
            "action": "select",
            "selector": ".body"
          },
          {
            "action": "extract",
            "name": "innerHTML",
            "type": "property"
          }
        ],
        "imageUrl": [
          {
            "action": "select",
            "selector": "img"
          },
          {
            "action": "extract",
            "name": "src",
            "type": "attribute"
          }
        ],
        "summary": [
          {
            "action": "select",
            "selector": ".body p:first-child"
          },
          {
            "action": "extract",
            "type": "property",
            "name": "innerHTML"
          },
          {
            "action": "format",
            "name": "text"
          }
        ],
        "title": [
          {
            "action": "select",
            "selector": ".title"
          },
          {
            "action": "extract",
            "name": "textContent",
            "type": "property"
          }
        ]
      }
    }
  ],
  "pageName": [
    {
      "action": "select",
      "selector": ".body"
    },
    {
      "action": "extract",
      "name": "innerHTML",
      "type": "property"
    }
  ]
}
```
Because of the formatting, this looks huge. However, if we go back to using a DSL, it becomes manageable:

```json
{
  "articles": [
    "select article",
    {
      "body": [
        "select .body",
        "extract property innerHTML"
      ],
      "imageUrl": [
        "select img",
        "extract attribute src"
      ],
      "summary": [
        "select .body p:first-child",
        "extract property innerHTML",
        "format text"
      ],
      "title": [
        "select .title",
        "extract property textContent"
      ]
    }
  ],
  "pageName": [
    "select .body",
    "extract property innerHTML"
  ]
}
```
If we use YAML, the entire thing is even simpler to read:

```yaml
articles:
  - select article
  - body:
      - select .body
      - extract property innerHTML
    imageUrl:
      - select img
      - extract attribute src
    summary:
      - select .body p:first-child
      - extract property innerHTML
      - format text
    title:
      - select .title
      - extract property textContent
pageName:
  - select .body
  - extract property innerHTML
```
The benefit of the latter approach over the current implementation is that it enables combining arbitrary test and format functions, e.g.

```yaml
pageName:
  - select .body
  - extract property innerHTML
  - format extractFirstTextNode
  - format extractTime
  - test timeInFuture
```

I also think that it is easier to read and debug, because all commands are read (and executed) from top to bottom. Therefore, following the progress log is as simple as following the schema. |
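The top-to-bottom execution model can be illustrated with a toy interpreter. The sketch below runs instructions in order against plain objects instead of real DOM nodes (`select` looks up a child by key rather than by CSS selector); every name is an assumption made for illustration.

```javascript
// Toy interpreter for the command-list form: instructions run top-to-bottom,
// each receiving the output of the previous one. This operates on plain
// objects, not a real DOM; 'select' looks up a child by key.
const run = (node, instructions) => {
  return instructions.reduce((current, instruction) => {
    const [action, ...args] = instruction.split(' ');
    if (action === 'select') {
      return current.children[args[0]];
    }
    if (action === 'extract') {
      // e.g. 'extract property innerHTML' -> args = ['property', 'innerHTML']
      return current[args[1]];
    }
    throw new Error('Unknown action: ' + action);
  }, node);
};
```

Because each command only sees the output of the previous one, a progress log of executed commands reads the same way the schema does.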
@rla @DaniGuardiola @licyeus what are your thoughts about this approach vs the earlier? |
I love it! Especially the YAML formatting. So clean and clear... Awesome!
|
This could be even further reduced using an (optional) pipe operator:

```yaml
articles:
  - select article
  - body: select .body | extract property innerHTML
    imageUrl: select img | extract attribute src
    summary: select .body p:first-child | extract property innerHTML | format text
    title: select .title | extract property textContent
pageName: select .body | extract property innerHTML
```

|
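Parsing the piped form could be as simple as the sketch below, so that `'a | b'` and `['a', 'b']` are interchangeable. The `parseExpression` name is an assumption for illustration.

```javascript
// Minimal sketch: split a piped expression into an instruction list, so the
// pipe form and the list form are interchangeable. Name is hypothetical.
const parseExpression = (expression) => {
  if (Array.isArray(expression)) {
    return expression;
  }
  return expression.split('|').map((part) => part.trim());
};
```

A naive split breaks if a CSS selector itself contains `|`, so a real implementation would need an escape sequence or a proper tokenizer.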
YES

Maybe there's something that can be done to further reduce size by having an (optional) shorthand syntax for common actions like select and extract? Also, thinking about performance, did you think about "stacking" cheerio (or whatever) calls somehow? For example:

```yaml
titles:
  - select article
  - title: select .title | extract property textContent
ratings:
  - select article
  - rating: select .rating | extract attribute data-rating
```

This could be efficient by actually selecting "article" once and storing the result instead of selecting it every time.
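That caching idea could be sketched as a memoized wrapper around whatever select function is used. The `memoizeSelect` name and the `(root, selector)` signature are assumptions for illustration, not cheerio's API.

```javascript
// Sketch of caching repeated selects: wrap a select function so the same
// selector against the same root is only evaluated once. The signature is
// an illustrative assumption, not an existing API.
const memoizeSelect = (select, root) => {
  const cache = new Map();
  return (selector) => {
    if (!cache.has(selector)) {
      cache.set(selector, select(root, selector));
    }
    return cache.get(selector);
  };
};
```

A cache like this is only safe while the underlying document is not mutated between selects.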
|
That's already being done. |
Edit: reformatted (email replies don't support markdown) and corrected some of the text (my English is not the best).

About the select and extract shorthand thing, I just came up with an idea. This:

```yaml
articles:
  - select article
  - body: select .body | extract property innerHTML
    imageUrl: select img | extract attribute src
    summary: select .body p:first-child | extract property innerHTML | format text
    title: select .title | extract property textContent
pageName: select .body | extract property innerHTML
```

Would become this:

```yaml
articles:
  - select article
  - body: .body { property innerHTML }
    imageUrl: img { attribute src }
    summary: .body p:first-child { property innerHTML } | format text
    title: .title { property textContent }
pageName: .body { property innerHTML }
```

Also a possible shorthand for getting textContent, which is

```yaml
articles:
  - select article
  - title: .title {}
```

For attributes:

```yaml
articles:
  - select article
  - imageUrl: img { [src] }
```

Some other ideas: Use { propertyName } as a shorthand for properties when no action is

Also for innerHTML and textContent, writing "html" and "text" instead

In addition, it might even be a good idea to remove the need for

All of this, including the textContent shorthand, would result in:

```yaml
articles:
  - select article
  - body: .body {html}
    imageUrl: img [src]
    summary: .body p:first-child {innerHTML} format text
    title: .title {} # or .title {text}
pageName: .body {html}
```

Of course all of these shorthand expressions must be optional, but I

Thank you! :) |
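A proposal like this could be implemented as a pure expansion step into the long pipe form. The sketch below is an assumption based on the ideas above (the grammar, the `expandShorthand` name, and the `html`/`text` aliases are all hypothetical).

```javascript
// Hypothetical expansion of the proposed shorthand into the long pipe form:
//   'img [src]'    -> 'select img | extract attribute src'
//   '.title {}'    -> 'select .title | extract property textContent'
//   '.body {html}' -> 'select .body | extract property innerHTML'
// The grammar and aliases are illustrative assumptions.
const expandShorthand = (expression) => {
  const attribute = expression.match(/^(.+?)\s*\[(\S+)\]$/);
  if (attribute) {
    return `select ${attribute[1]} | extract attribute ${attribute[2]}`;
  }
  const property = expression.match(/^(.+?)\s*\{\s*(\S*?)\s*\}$/);
  if (property) {
    const aliases = {'': 'textContent', text: 'textContent', html: 'innerHTML'};
    const name = aliases[property[2]] !== undefined ? aliases[property[2]] : property[2];
    return `select ${property[1]} | extract property ${name}`;
  }
  return expression;
};
```

Note that attribute selectors such as `a[href]` would collide with the `[ ]` shorthand under this naive grammar, so a real parser would have to disambiguate them.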
I just saw a flaw in my proposal: selectors targeting attributes would conflict with the [ ] shorthand, but maybe something can be done about that. |
The entire input, validation, and formatting rules can be declared using a simple JSON object.
The benefit of this approach is portability, i.e. it is easy to move a scraper from one programming language to another.
Furthermore, it is easier to enforce a consistent style and to manage the complexity of the code base.
We could even use https://www.npmjs.com/package/jsonscript.