Fix additional details extraction in ATK scraper #1320

Open
wants to merge 7 commits into
base: main
Changes from 4 commits
3 changes: 1 addition & 2 deletions recipe_scrapers/americastestkitchen.py
@@ -71,5 +71,4 @@ def _parse_ingredient_item(ingredient_item):
     @functools.cached_property
     def _get_additional_details(self):
         j = json.loads(self.soup.find(type="application/json").string)
-        name = list(j["props"]["initialState"]["content"]["documents"])[0]
-        return j["props"]["initialState"]["content"]["documents"][name]
+        return j["props"]["pageProps"]["data"]
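For context, the change above reads the recipe payload from a different path in the JSON embedded in the page. A minimal, dependency-free sketch of that lookup follows. It uses the standard library's `html.parser` in place of the scraper's BeautifulSoup `soup`, and the sample HTML, the keys inside `data`, the `JsonScriptExtractor` class, and the `extract_page_data` helper are all hypothetical illustrations, not the project's code.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collects the text of the first <script type="application/json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        is_json = dict(attrs).get("type") == "application/json"
        if tag == "script" and is_json and not self._done:
            self._in_json_script = True

    def handle_data(self, data):
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self._done = True

    @property
    def payload(self):
        return "".join(self._chunks)


def extract_page_data(html):
    """Parse the embedded JSON and return the props.pageProps.data object."""
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    document = json.loads(parser.payload)
    return document["props"]["pageProps"]["data"]


# Hypothetical page; real pages carry a much larger payload.
SAMPLE_HTML = """
<html><body>
<script type="application/json">
{"props": {"pageProps": {"data": {"title": "Example recipe", "yields": "8 servings"}}}}
</script>
</body></html>
"""

page_data = extract_page_data(SAMPLE_HTML)
print(page_data["title"])
```

A fixed path like this raises `KeyError` if the site changes its page structure again, which at least makes such breakage visible in the test suite.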
31 changes: 27 additions & 4 deletions tests/test_data/americastestkitchen.com/americastestkitchen.json
@@ -72,13 +72,36 @@
 ],
 "category": "Main Courses, Casseroles",
 "yields": "8 servings",
-"description": "Could we adapt and simplify this northern Italian classic for the American kitchen?",
+"description": "Could we adapt and simplify this northern Italian classic for the American kitchen? When we started thinking about a simple lasagna Bolognese recipe, there was no denying the appeal of no-boil noodles. After several tests, we found that a five-minute soak proved most effective for getting sturdy, al dente noodles that bound together the layers of ragu and béchamel without soaking up all the moisture. Stumbling through multiple rounds of meat sauce testing gave us the idea of combining the ragu and béchamel when both were lukewarm. The resulting sauce was thickened but easy to spread, with enough moisture for cooking the noodles in our simplified lasagna recipe.",
Collaborator

I've been thinking about this text a bit. Basically, I'm wondering whether it's a description of the recipe itself (information that we test against to confirm accuracy) or other, arguably only tangentially related, information. If it's the latter, then by including a fairly large amount of text here (admittedly only for a few recipes, but even so) we might be infringing on the source material's copyright without much justification to offer in response.

I don't think that checking for a subset of the text would be a great alternative -- because then we might find it difficult to confirm and code review that we're parsing recipe webpages correctly and maintaining integrity/authenticity.

Another alternative could be to omit description from the test cases in situations where the value it contains seems to go off-topic. If that's the case, then users could still retrieve it, and I think we'd have to argue that our description schema.org retrieval simply returns the first schema.org description from the recipe webpage without altering it. That would be possible because description is an optional test field, not a mandatory one.

I'll spend a bit more time thinking about it.
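
As an illustration of the "optional test field" idea in the comment above, here is a minimal sketch of a comparison that only checks fields present in the stored test data. The field names in `MANDATORY_FIELDS`, the `find_mismatches` helper, and the sample recipe are hypothetical; the project's real test harness may work differently.

```python
# Hypothetical field names and helper; not the project's actual test harness.
MANDATORY_FIELDS = ("title", "ingredients", "instructions")


def find_mismatches(expected, actual):
    """Compare scraper output against stored test data.

    Mandatory fields must appear in the expected data; every other
    field is checked only when the expected data includes it.
    """
    problems = []
    for field in MANDATORY_FIELDS:
        if field not in expected:
            problems.append(f"missing mandatory test field: {field}")
    for field, expected_value in expected.items():
        if actual.get(field) != expected_value:
            problems.append(f"mismatch in field: {field}")
    return problems


scraped = {
    "title": "Example lasagna",
    "ingredients": ["noodles", "ragu"],
    "instructions": "Layer and bake.",
    "description": "A long passage that the scraper still returns unaltered.",
}

# The stored expectations simply leave out the optional description.
expected = {
    "title": "Example lasagna",
    "ingredients": ["noodles", "ragu"],
    "instructions": "Layer and bake.",
}

print(find_mismatches(expected, scraped))
```

Under this approach the scraper still returns the description to users, but nothing about it is stored or asserted in the test data.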

Collaborator Author

I agree that the concern about replicating creative content is valid, especially given that websites often include personal stories or input as a traffic driver, and this type of content could easily end up in the description category (as in the example here).

While I'm not sure of the best workaround at the moment, I certainly share your concern about potentially infringing on copyright. It’s worth exploring more to see which alternative best fits this use case.

Maybe description shouldn't be included in test coverage at all but still left as a possible field?

Contributor

> Maybe description shouldn't be included in test coverage at all but still left as a possible field?

A possible solution is to truncate the test on description to the first X characters. You validate that the field is pulled correctly, but you aren't storing copyrighted material and have a strong defense that it is fair use.

Collaborator

> Maybe description shouldn't be included in test coverage at all but still left as a possible field?
>
> A possible solution is to truncate the test on description to the first X characters. You validate that the field is pulled correctly, but you aren't storing copyrighted material and have a strong defense that it is fair use.

Indeed, that's an option. The downsides that I can think of are (most important, in my opinion, first):

  • It'd reduce our ability to state that we're checking the accuracy of all fields -- but perhaps that's more important for the core recipe fields, anyway.
  • It could become a precedent/reason for reducing the accuracy of other field checks (for example, if we applied similar logic to instructions, then it could quickly become difficult to test/code review whether scrapers are working correctly).
  • The implementation might be inelegant (it'd involve a special case for a particular test field -- I don't think we should attempt to come up with any kind of rule-based system within the test data itself, or at least not yet).

I haven't been able to think of better options - so maybe we go with this. What do you think @jknndy? I could file a feature request to assert only on the first 100/150/200 characters of description, and apply that to the existing test data.
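
The prefix-only assertion proposed above could look roughly like the sketch below. The helper names `truncate_for_storage` and `description_matches` and the sample strings are hypothetical, and the limit of 40 is chosen only to keep the example short; the thread suggests something in the range of 100 to 200 characters.

```python
def truncate_for_storage(description, limit):
    """Return the part of the description that would be kept in test data."""
    return description[:limit]


def description_matches(stored_prefix, scraped, limit):
    """Check only the first `limit` characters of the scraped description."""
    return scraped[:limit] == stored_prefix


LIMIT = 40  # illustrative value only

scraped_description = (
    "Could we adapt and simplify this northern Italian classic "
    "for the American kitchen? (remaining text omitted)"
)

stored = truncate_for_storage(scraped_description, LIMIT)
print(stored)
print(description_matches(stored, scraped_description, LIMIT))
print(description_matches(stored, "Different text entirely", LIMIT))
```

The special case stays in one helper rather than in the test data, which avoids a rule system inside the JSON files. The trade-off named in the thread still applies: text beyond the limit is no longer verified.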

 "total_time": 210,
 "cuisine": "Europe, Italian",
 "ratings": 4.4,
-"ratings_count": 80,
+"ratings_count": 98,
 "nutrients": {
-"calories": "5764 calories"
+"calories": "721",
+"fatContent": "40 grams",
+"saturatedFatContent": "20 grams",
+"unsaturatedFatContent": "14 grams",
+"transFatContent": "1 grams",
+"carbohydrateContent": "39 grams",
+"sugarContent": "13 grams",
+"proteinContent": "39 grams",
+"sodiumContent": "1165 miligrams",
+"cholesterolContent": "119 miligrams"
 },
-"image": "https://res.cloudinary.com/hksqkdlah/image/upload/ar_1:1,c_fill,dpr_2.0,f_auto,fl_lossy.progressive.strip_profile,g_faces:auto,q_auto:low,w_150/3801_so04-lasagnabolognes-article"
+"image": "https://res.cloudinary.com/hksqkdlah/image/upload/ar_1:1,c_fill,dpr_2.0,f_auto,fl_lossy.progressive.strip_profile,g_faces:auto,q_auto:low,w_150/3801_so04-lasagnabolognes-article",
+"keywords": [
+"Main Courses",
+"Europe",
+"Italian",
+"Pasta",
+"Grains",
+"Rice & Beans",
+"Eggs & Dairy",
+"Meat",
+"Cheese",
+"Beef",
+"Pork",
+"Casseroles"
+]
 }
6,663 changes: 3,395 additions & 3,268 deletions tests/test_data/americastestkitchen.com/americastestkitchen.testhtml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tests/test_data/cookscountry.com/cookscountry.json
@@ -73,7 +73,7 @@
 "total_time": 75,
 "cuisine": "Latin America & Caribbean, Mexican",
 "ratings": 4.2,
-"ratings_count": 39,
+"ratings_count": 56,
 "nutrients": {
 "calories": "687",
 "fatContent": "31 grams",
6,665 changes: 3,395 additions & 3,270 deletions tests/test_data/cookscountry.com/cookscountry.testhtml

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions tests/test_data/cooksillustrated.com/cooksillustrated.json
@@ -21,8 +21,8 @@
 "description": "Whisk butter into a little simmering water and—poof!—you've got beurre monté: liquid silk that pairs with any seasoning and gilds everything it touches. Beurre monté, emulsified melted butter, is a classic French preparation that can be drizzled over cooked meats, vegetables, pastas, and other dishes to add richness and a glossy appearance or used as a rich, creamy sauce base for pairing with a wide range of savory or sweet seasonings. Vigorously whisking cold butter into a measured amount of simmering water broke up the butterfat into tiny droplets that dispersed throughout the water, establishing a thick, creamy emulsion. Whisking oyster sauce plus orange zest and juice into the beurre monté produced a bold, rich, easy-to-make sauce that paired especially well with lean roasted or pan-seared proteins such as scallops, cod, pork tenderloin, or chicken breast; it was also great over pasta.",
 "total_time": 10,
 "cuisine": "French",
-"ratings": 4.5,
-"ratings_count": 26,
+"ratings": 4.2,
+"ratings_count": 44,
 "nutrients": {
 "calories": "207",
 "fatContent": "23 grams",