Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the parse function to accept an entity id #189

Closed
wants to merge 20 commits into from

Conversation

Rosencrantz
Copy link
Contributor

This includes a tweak to the parse function so that it generates an entity id before creating an entity. There are two ways in which this can occur

  1. Supply a list of xpath values that will be concatenated together and hashed in order to generate a unique key
  2. Do nothing and allow the parse function to automatically generate a key based on the url of the page that is being parsed.

@Rosencrantz Rosencrantz added this to the 2.4.2 milestone Sep 7, 2021
@Rosencrantz Rosencrantz requested a review from sunu September 7, 2021 08:40
@Rosencrantz Rosencrantz self-assigned this Sep 7, 2021
@@ -371,8 +371,13 @@ parse:
author: .//meta[@name="author"]/@content
publishedAt: .//*[@class="date"]/text()
description: .//meta[@property="og:description"]/@content
keys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest we use syntax similar to ftm mappings for consistency. Something like https://github.com/alephdata/aleph/blob/main/mappings/md_companies.yml#L15-L17

So the keys section will look like:

keys:
  - title
  - author

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keys section needs to be updated now I think?

hashlib.md5(key_string.encode("utf-8")).hexdigest()
!= hashlib.md5("".encode("utf-8")).hexdigest()
):
entity_id = hashlib.md5(key_string.encode("utf-8")).hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use make_id from memorious.helpers.key instead of making the key using hashlib.

for key, value in properties.items():
properties_dict[key] = html.xpath(value)

data["entity_id"] = hashlib.md5(data["url"].encode("utf-8")).hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably better and more consistent to make keys out of the supplied keys only instead of falling back on a default. We should instead raise an error in case of null keys so that the user can notice and change their keying strategy.

@Rosencrantz Rosencrantz requested a review from sunu September 8, 2021 10:04
countries: list[str] = list(context.params.get("countries", []))
mime_type: str = context.params.get("mime_type", "")

context.log.warn(languages)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray warning

published_at=data.get("published_at"),
headers=data.get("headers", {}),
keywords=data.get("keywords", []),
)

if data.get("aleph_folder_id"):
meta["parent"] = {"id": data.get("aleph_folder_id")}
Copy link
Contributor

@sunu sunu Sep 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using 2 different styles here. We are treating meta as a object in the line above and as a dict here. I think we should be consistent and use a single style throughout the function.

And imo we can merge _create_meta_object and _create_document_metadata together into just one function and only set whatever metadata is available.

@@ -0,0 +1,24 @@
# from typing_extensions import TypedDict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is not generic enough to be its own module. We should put it in the aleph operation module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's not a module. It's the definition of a type. I would type definitions seperate from modules that actually do stuff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting the type definition near the relevant code is the convention we're already using in other places. See for example https://github.com/alephdata/followthemoney/blob/master/followthemoney/schema.py#L24

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I'll move the meta file into the operations module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually meant the Meta type definition should live in operations/aleph.py and not in a separate file since it's very specific to the aleph_emit operations and unlikely to be reused anywhere else.

@@ -88,22 +92,41 @@ def parse_for_metadata(context, data, html):
if value is not None:
data[key] = value
break
meta_paths.update(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

temp_key = "".join(properties[key])

if not temp_key == "":
return make_id(temp_key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_id can take multiple arguments. You don't need to join them beforehand.

@@ -371,8 +371,13 @@ parse:
author: .//meta[@name="author"]/@content
publishedAt: .//*[@class="date"]/text()
description: .//meta[@property="og:description"]/@content
keys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keys section needs to be updated now I think?

@@ -45,6 +45,10 @@ pipeline:
author: .//meta[@name="author"]/@content
publishedAt: .//*[@class="date"]/text()
description: .//meta[@property="og:description"]/@content
keys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be the method needs to be changed to use the built-in parse method instead of a custom Python method to match the documentation example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. The problem with doing that though is that the scraper won't be able to extract the body of the article, which is why the custom script exists. I guess, as it's an example it doesn't really matter too much, but that is why we have a difference.

Personally I don't have an issue with having the documentation not match the example in the repo

@Rosencrantz Rosencrantz requested a review from sunu October 6, 2021 06:28
from typing import Optional, TypedDict


class MetaBase(TypedDict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetaBase vs Meta difference is no longer used. So I guess it's safe to merge these two together now?

@sunu
Copy link
Contributor

sunu commented Oct 19, 2021

Could you fix the merge conflict too? I think it's from the changes Simon made to fix a couple of issues in aleph_emit operation.

@Rosencrantz Rosencrantz requested a review from sunu October 20, 2021 14:41
Copy link
Contributor

@sunu sunu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the merge conflict wasn't fixed the way it should be. So the code is left in a broken state. A test for the aleph operations should have caught this. But we don't have any tests for it. May be it's worth adding one now since we are making changes to the aleph_emit operations.

from memorious.logic.context import Context


class Meta(MetaBase, total=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetaBase is undefined

make_key(collection_id, foreign_id, content_hash)
)

if document_id:
context.log.info("Skip aleph upload: %s", foreign_id)
data["aleph_id"] = document["id"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document is undefined

@Rosencrantz
Copy link
Contributor Author

This died

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using the standard parse function for creating entities does not generate an entity_id
2 participants