Fix additional details extraction in ATK scraper #1320

jknndy · 2024-10-22T20:42:25Z

Implements the fix recommended by @sudobash1 in #1317 and pulls new test data to represent current site structure/information.

Also, pulls new test data and updates tests for two related coverages, cookscountry & cooksillustrated.

Resolves #1317

jayaddison · 2024-10-23T17:35:18Z

tests/test_data/americastestkitchen.com/americastestkitchen.json

@@ -72,13 +72,36 @@
  ],
  "category": "Main Courses, Casseroles",
  "yields": "8 servings",
-  "description": "Could we adapt and simplify this northern Italian classic for the American kitchen?",
+  "description": "Could we adapt and simplify this northern Italian classic for the American kitchen? When we started thinking about a simple lasagna Bolognese recipe, there was no denying the appeal of no-boil noodles. After several tests, we found that a five-minute soak proved most effective for getting sturdy, al dente noodles that bound together the layers of ragu and béchamel without soaking up all the moisture. Stumbling through multiple rounds of meat sauce testing gave us the idea of combining the ragu and béchamel when both were lukewarm. The resulting sauce was thickened but easy to spread, with enough moisture for cooking the noodles in our simplified lasagna recipe.",


I'm thinking about this text a bit - basically I'm wondering whether it's a description relating to the recipe itself (information that we test against with the intent of confirming accuracy), or arguably other only-tangentially-related information. If it's the latter, then by including a fairly large amount of text here -- admittedly only for a few recipes, but even so -- we might be infringing on the source material's copyright without much of a justification to provide in response.

I don't think that checking for a subset of the text would be a great alternative -- because then we might find it difficult to confirm and code review that we're parsing recipe webpages correctly and maintaining integrity/authenticity.

Another alternative could be to omit description from the test cases in situations where the value it contains seems to go off-topic. If that's the case, then users could still retrieve it, and I think we'd have to argue that our description schema.org retrieval simply returns the first schema.org description from the recipe webpage without altering it. That would be possible because description is an optional test field, not a mandatory one.

I'll spend a bit more time thinking about it.

I agree that the concern about replicating creative content is valid, especially given that websites often include personal stories or input as a traffic driver, and this type of content could easily end up in the description category (as in the example here).

While I'm not sure of the best workaround at the moment, I certainly share your concern about potentially infringing on copyright. It’s worth exploring more to see which alternative best fits this use case.

Maybe description shouldn't be included in test coverage at all but still left as a possible field?

Maybe description shouldn't be included in test coverage at all but still left as a possible field?

A possible solution is to truncate the test on description to the first X characters. You validate that the field is pulled correctly, but you aren't storing copyrighted material and have a strong defense that it is fair use.

Maybe description shouldn't be included in test coverage at all but still left as a possible field?

A possible solution is to truncate the test on description to the first X characters. You validate that the field is pulled correctly, but you aren't storing copyrighted material and have a strong defense that it is fair use.

Indeed, that's an option. The downsides that I can think of are (most important, in my opinion, first):

It'd reduce our ability to state that we're checking the accuracy of all fields -- but perhaps that's more important for the core recipe fields, anyway.

It could become a precedent/reason for reducing the accuracy of other field checks (for example, if we applied similar logic to instructions, then it could quickly become difficult to test/code review whether scrapers are working correctly.

The implementatation might be inelegant (it'd involve a special-case for a particular test field -- I don't think we should attempt to come up with any kind of rule-based system within the test data itself, or at least not yet).

I haven't been able to think of better options - so maybe we go with this. What do you think @jknndy? I could file a feature request to assert only on the first 100/150/200 characters of description, and apply that to the existing test data.

jayaddison · 2024-10-23T17:37:17Z

Generally these changes are OK with me; the two things I'd like to figure out before merge are:

Should we retain backwards-compatibility with the previous format?
What to do about the lengthy description test data.

jknndy · 2024-10-23T21:12:59Z

Should we retain backwards-compatibility with the previous format?

Normally, I believe backwards compatibility should be a priority, as long as it doesn't degrade performance. However, since it appears all the existing recipes have already been migrated to this new format (with all three test cases requiring updates to pass), it might not be necessary in this case.

smilerz · 2024-11-01T11:43:21Z

Should we retain backwards-compatibility with the previous format?

Normally, I believe backwards compatibility should be a priority, as long as it doesn't degrade performance. However, since it appears all the existing recipes have already been migrated to this new format (with all three test cases requiring updates to pass), it might not be necessary in this case.

While developing the ATK scraper for v15 this was the initial format and I had to change it at least once during development. It might make sense to provide both until there is some confidence that it is stable. Though not sure how to discover that? Some stdout during tests that the legacy format was detected?

w/ comments

jknndy added 4 commits October 22, 2024 16:34

implement sudobash1's recommended fix

f6de07e

updated test data

7466962

cookscountry

11a8f4a

cooksillustrated

61f0b79

jknndy marked this pull request as ready for review October 22, 2024 21:03

jayaddison reviewed Oct 23, 2024

View reviewed changes

jayaddison mentioned this pull request Nov 1, 2024

[Americas Test Kitchen] - Scraper broken #1317

Open

jknndy added 3 commits November 3, 2024 14:35

rename

bee40c7

Test case for old format

f9060fd

Add fallback for old format

76415ab

w/ comments

jknndy requested review from smilerz and jayaddison November 3, 2024 19:37

smilerz approved these changes Nov 4, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix additional details extraction in ATK scraper #1320

Fix additional details extraction in ATK scraper #1320

jknndy commented Oct 22, 2024 •

edited

Loading

jayaddison Oct 23, 2024

jknndy Oct 23, 2024

smilerz Nov 1, 2024

jayaddison Nov 4, 2024

jayaddison commented Oct 23, 2024

jknndy commented Oct 23, 2024

smilerz commented Nov 1, 2024

Fix additional details extraction in ATK scraper #1320

Are you sure you want to change the base?

Fix additional details extraction in ATK scraper #1320

Conversation

jknndy commented Oct 22, 2024 • edited Loading

jayaddison Oct 23, 2024

Choose a reason for hiding this comment

jknndy Oct 23, 2024

Choose a reason for hiding this comment

smilerz Nov 1, 2024

Choose a reason for hiding this comment

jayaddison Nov 4, 2024

Choose a reason for hiding this comment

jayaddison commented Oct 23, 2024

jknndy commented Oct 23, 2024

smilerz commented Nov 1, 2024

jknndy commented Oct 22, 2024 •

edited

Loading