Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix additional details extraction in ATK scraper #1320

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jknndy
Copy link
Collaborator

@jknndy jknndy commented Oct 22, 2024

Implements the fix recommended by @sudobash1 in #1317 and pulls new test data to represent current site structure/information.

Also, pulls new test data and updates tests for two related coverages, cookscountry & cooksillustrated.

Resolves #1317

@jknndy jknndy marked this pull request as ready for review October 22, 2024 21:03
@@ -72,13 +72,36 @@
],
"category": "Main Courses, Casseroles",
"yields": "8 servings",
"description": "Could we adapt and simplify this northern Italian classic for the American kitchen?",
"description": "Could we adapt and simplify this northern Italian classic for the American kitchen? When we started thinking about a simple lasagna Bolognese recipe, there was no denying the appeal of no-boil noodles. After several tests, we found that a five-minute soak proved most effective for getting sturdy, al dente noodles that bound together the layers of ragu and béchamel without soaking up all the moisture. Stumbling through multiple rounds of meat sauce testing gave us the idea of combining the ragu and béchamel when both were lukewarm. The resulting sauce was thickened but easy to spread, with enough moisture for cooking the noodles in our simplified lasagna recipe.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about this text a bit - basically I'm wondering whether it's a description relating to the recipe itself (information that we test against with the intent of confirming accuracy), or arguably other only-tangentially-related information. If it's the latter, then by including a fairly large amount of text here -- admittedly only for a few recipes, but even so -- we might be infringing on the source material's copyright without much of a justification to provide in response.

I don't think that checking for a subset of the text would be a great alternative -- because then we might find it difficult to confirm and code review that we're parsing recipe webpages correctly and maintaining integrity/authenticity.

Another alternative could be to omit description from the test cases in situations where the value it contains seems to go off-topic. If that's the case, then users could still retrieve it, and I think we'd have to argue that our description schema.org retrieval simply returns the first schema.org description from the recipe webpage without altering it. That would be possible because description is an optional test field, not a mandatory one.

I'll spend a bit more time thinking about it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the concern about replicating creative content is valid, especially given that websites often include personal stories or input as a traffic driver, and this type of content could easily end up in the description category (as in the example here).

While I'm not sure of the best workaround at the moment, I certainly share your concern about potentially infringing on copyright. It’s worth exploring more to see which alternative best fits this use case.

Maybe description shouldn't be included in test coverage at all but still left as a possible field?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe description shouldn't be included in test coverage at all but still left as a possible field?

A possible solution is to truncate the test on description to the first X characters. You validate that the field is pulled correctly, but you aren't storing copyrighted material and have a strong defense that it is fair use.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe description shouldn't be included in test coverage at all but still left as a possible field?

A possible solution is to truncate the test on description to the first X characters. You validate that the field is pulled correctly, but you aren't storing copyrighted material and have a strong defense that it is fair use.

Indeed, that's an option. The downsides that I can think of are (most important, in my opinion, first):

  • It'd reduce our ability to state that we're checking the accuracy of all fields -- but perhaps that's more important for the core recipe fields, anyway.
  • It could become a precedent/reason for reducing the accuracy of other field checks (for example, if we applied similar logic to instructions, then it could quickly become difficult to test/code review whether scrapers are working correctly.
  • The implementatation might be inelegant (it'd involve a special-case for a particular test field -- I don't think we should attempt to come up with any kind of rule-based system within the test data itself, or at least not yet).

I haven't been able to think of better options - so maybe we go with this. What do you think @jknndy? I could file a feature request to assert only on the first 100/150/200 characters of description, and apply that to the existing test data.

@jayaddison
Copy link
Collaborator

Generally these changes are OK with me; the two things I'd like to figure out before merge are:

  • Should we retain backwards-compatibility with the previous format?
  • What to do about the lengthy description test data.

@jknndy
Copy link
Collaborator Author

jknndy commented Oct 23, 2024

  • Should we retain backwards-compatibility with the previous format?

Normally, I believe backwards compatibility should be a priority, as long as it doesn't degrade performance. However, since it appears all the existing recipes have already been migrated to this new format (with all three test cases requiring updates to pass), it might not be necessary in this case.

@smilerz
Copy link
Contributor

smilerz commented Nov 1, 2024

  • Should we retain backwards-compatibility with the previous format?

Normally, I believe backwards compatibility should be a priority, as long as it doesn't degrade performance. However, since it appears all the existing recipes have already been migrated to this new format (with all three test cases requiring updates to pass), it might not be necessary in this case.

While developing the ATK scraper for v15 this was the initial format and I had to change it at least once during development. It might make sense to provide both until there is some confidence that it is stable. Though not sure how to discover that? Some stdout during tests that the legacy format was detected?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Americas Test Kitchen] - Scraper broken
3 participants