-
Notifications
You must be signed in to change notification settings - Fork 528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix additional details extraction in ATK scraper #1320
base: main
Are you sure you want to change the base?
Conversation
@@ -72,13 +72,36 @@ | |||
], | |||
"category": "Main Courses, Casseroles", | |||
"yields": "8 servings", | |||
"description": "Could we adapt and simplify this northern Italian classic for the American kitchen?", | |||
"description": "Could we adapt and simplify this northern Italian classic for the American kitchen? When we started thinking about a simple lasagna Bolognese recipe, there was no denying the appeal of no-boil noodles. After several tests, we found that a five-minute soak proved most effective for getting sturdy, al dente noodles that bound together the layers of ragu and béchamel without soaking up all the moisture. Stumbling through multiple rounds of meat sauce testing gave us the idea of combining the ragu and béchamel when both were lukewarm. The resulting sauce was thickened but easy to spread, with enough moisture for cooking the noodles in our simplified lasagna recipe.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking about this text a bit - basically I'm wondering whether it's a description relating to the recipe itself (information that we test against with the intent of confirming accuracy), or arguably other only-tangentially-related information. If it's the latter, then by including a fairly large amount of text here -- admittedly only for a few recipes, but even so -- we might be infringing on the source material's copyright without much of a justification to provide in response.
I don't think that checking for a subset of the text would be a great alternative -- because then we might find it difficult to confirm and code review that we're parsing recipe webpages correctly and maintaining integrity/authenticity.
Another alternative could be to omit description
from the test cases in situations where the value it contains seems to go off-topic. If that's the case, then users could still retrieve it, and I think we'd have to argue that our description
schema.org retrieval simply returns the first schema.org
description from the recipe webpage without altering it. That would be possible because description
is an optional test field, not a mandatory one.
I'll spend a bit more time thinking about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the concern about replicating creative content is valid, especially given that websites often include personal stories or input as a traffic driver, and this type of content could easily end up in the description category (as in the example here).
While I'm not sure of the best workaround at the moment, I certainly share your concern about potentially infringing on copyright. It’s worth exploring more to see which alternative best fits this use case.
Maybe description shouldn't be included in test coverage at all but still left as a possible field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe description shouldn't be included in test coverage at all but still left as a possible field?
A possible solution is to truncate the test on description
to the first X characters. You validate that the field is pulled correctly, but you aren't storing copyrighted material and have a strong defense that it is fair use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe description shouldn't be included in test coverage at all but still left as a possible field?
A possible solution is to truncate the test on
description
to the first X characters. You validate that the field is pulled correctly, but you aren't storing copyrighted material and have a strong defense that it is fair use.
Indeed, that's an option. The downsides that I can think of are (most important, in my opinion, first):
- It'd reduce our ability to state that we're checking the accuracy of all fields -- but perhaps that's more important for the core recipe fields, anyway.
- It could become a precedent/reason for reducing the accuracy of other field checks (for example, if we applied similar logic to
instructions
, then it could quickly become difficult to test/code review whether scrapers are working correctly. - The implementatation might be inelegant (it'd involve a special-case for a particular test field -- I don't think we should attempt to come up with any kind of rule-based system within the test data itself, or at least not yet).
I haven't been able to think of better options - so maybe we go with this. What do you think @jknndy? I could file a feature request to assert only on the first 100/150/200 characters of description
, and apply that to the existing test data.
Generally these changes are OK with me; the two things I'd like to figure out before merge are:
|
Normally, I believe backwards compatibility should be a priority, as long as it doesn't degrade performance. However, since it appears all the existing recipes have already been migrated to this new format (with all three test cases requiring updates to pass), it might not be necessary in this case. |
While developing the ATK scraper for v15 this was the initial format and I had to change it at least once during development. It might make sense to provide both until there is some confidence that it is stable. Though not sure how to discover that? Some |
Implements the fix recommended by @sudobash1 in #1317 and pulls new test data to represent current site structure/information.
Also, pulls new test data and updates tests for two related coverages,
cookscountry
&cooksillustrated
.Resolves #1317