Script to import resmap collection & small fixes #209

Merged 4 commits into master on Dec 11, 2023

Conversation

leandroradusky
Contributor

Small changes & fixes:

  • The map zoom and boundaries were changed so that Africa is fully visible, which makes the work more comfortable (we are working with Ethiopian and Tanzanian data, and will also work with Somalian data).
  • The bin/import-dataset script ran successfully (tested with the raw dataset available on the Slack channel), but the facilities couldn't be visualized because the photo field was not set for facilities; it was added.
  • The es locale in development generated unnecessary problems, since the datasets we're working with have no Spanish data.

Import ResMap collection:

New script to import data, consuming it directly from ResourceMap (see the sketch after the list below).

  • ResourceMap::Collection will be filled with hashes representing Sites in ResourceMap.
  • ResmapSiteParser is a class that parses each of these Sites and generates an FPP Facility, as well as all the other entities (Categories, Services, Locations) with their hierarchies.
  • All this information is written to a folder passed as a parameter, to then be imported with the bin/import-dataset script.
  • Errors during the import are logged in the same folder (in the Tanzania dataset, the one used to test this script, there were Sites without coordinates and only one without a Category).
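
A rough sketch of how these pieces are meant to fit together (the constructor arguments and the method names sites, parse_site and write_dataset are illustrative placeholders, not the actual API introduced by this PR):

output_dir = ARGV[0]                            # folder passed as a parameter
collection_id, username, password = ARGV[1..3]  # hypothetical arguments; the real script's interface may differ

collection = ResourceMap::Collection.new(collection_id, username, password)
parser = ResmapSiteParser.new

collection.sites.each do |site|   # hashes representing ResourceMap Sites
  parser.parse_site(site)         # builds the FPP Facility plus Categories, Services and Locations
rescue StandardError => e
  File.open(File.join(output_dir, "errors.log"), "a") { |f| f.puts "#{site["id"]}: #{e.message}" }
end

parser.write_dataset(output_dir)  # files to be imported afterwards with bin/import-dataset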

Some notes:

  • The entities' IDs are generated with the Digest::MD5.hexdigest(name) function. The IDs are never displayed; they are only used in the database. Other datasets, like the raw dataset, used UUIDs. When importing Sites from ResMap the IDs are auto-incremented numbers, which could cause errors in FPP; that's why this decision was made. Any suggestions are welcome here.
  • Analyzing the Tanzania dataset, there is no information about the facility_types, which in FPP have a name and a priority. Since there was no data for this, a single type, "Institution", with priority "1" was hardcoded for all the facilities. This will have to be revised in the PdMing stage.
  • No photo information was present in the Tanzania dataset, so nothing was done with respect to this; it is also to be revised in the PdMing stage.
  • Nothing was done with respect to deleted sites and last update: given that for 10k sites (Tanzania dataset) the running times are on the order of seconds, I'm not sure whether it's more convenient to delete everything and reload it, or to update the existing data, which would require checking a lot of stuff.

@leandroradusky
Contributor Author

With respect to the categories and category groups: they are not displayed correctly, I will rework that. The rest of the items appear to be shown ok.

@leandroradusky leandroradusky linked an issue Nov 30, 2023 that may be closed by this pull request
@ysbaddaden ysbaddaden left a comment (Contributor)


This is very nice!

This can be merged as-is, or you can push further:

  1. Create a ResourceMap namespace and split a few files, for example:
  • app/models/resource_map/collection.rb
  • app/models/resource_map/site_parser.rb

note: we might want to reuse the "normalization" naming that is already used, rather than introduce a new "parser" term?

  2. Move the CSV generation out of the site parser and keep it in the script.

Then in a subsequent PR: use the metadata on the resmap fields to identify the fields to map, so we can use any collection in an abstract way, not just the collection we happen to have.


def generate_unique_id(name)
return nil if name.nil?
Digest::MD5.hexdigest(name) # TODO: use UUIDs? How to encode ids?
Contributor

The id can be used for in-place updates and it might be used through the API, so it must be stable (well, as stable as possible). Calculating the hash digest of a stable value is a good solution since the hash digest will be as stable as the input value.

Alternatively we could use UUIDv3 or UUIDv5 to generate stable UUIDs that are neither random nor time based (they internally use MD5 or SHA1) but I'm not sure this is worth it.

See https://en.wikipedia.org/wiki/Universally_unique_identifier#Versions_3_and_5_(namespace_name-based)
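
For reference, a minimal sketch of that alternative, assuming the ActiveSupport core extension is available in this Rails app (the namespace constant is an arbitrary choice):

require "active_support/core_ext/digest/uuid"

def generate_unique_id(name)
  return unless name
  Digest::UUID.uuid_v5(Digest::UUID::URL_NAMESPACE, name) # stable: the same name always yields the same UUID
end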

Contributor

nitpick: Digest::MD5.hexdigest(name) if name.presence feels more like Ruby.

return nil if category_name.nil?

# Create or find the category
unless categories.has_key?(category_name)
Contributor

bug:

Suggested change
unless categories.has_key?(category_name)
unless categories.has_key?(category_id)

category_group_name = properties["fac_type-2"] ? properties["fac_type-1"] : category_name
category_group_id = generate_unique_id(category_group_name)

return nil if category_name.nil?
Contributor

note: this should be right after the assignment of category_name.

nitpick:

Suggested change
return nil if category_name.nil?
return unless category_name

Comment on lines 201 to 208
unless locations.key?(location_id)
location = {}
location["id"] = location_id
location["name"] = value
location["parent_id"] = parent_ids[level - 1] # Parent from previous level

locations[location_id] = location
end
Contributor

nitpick: Ruby's idiomatic way is:

Suggested change
unless locations.key?(location_id)
location = {}
location["id"] = location_id
location["name"] = value
location["parent_id"] = parent_ids[level - 1] # Parent from previous level
locations[location_id] = location
end
locations[location_id] ||= {
"id" => location_id,
"name" => value,
"parent_id" => parent_ids[level - 1], # Parent from previous level
}

@ysbaddaden
Contributor

Additional notes:

The more stability we have, the less likely we are to break anything when we do an upgrade. For example, if we use random UUIDs then the IDs will change on each upgrade, and won't that break each facility URL?

I'd hash resmap-id for the facility id. It should be perfectly stable over time.

I'm not fond of deleting everything and reinserting everything. That would break FPP during the upgrade, and since everything's stored in ElasticSearch and not in PostgreSQL, we can't even count on a database transaction (:face_exhaling:). Now, maybe doing the whole update in one bulk request would be fast enough to be acceptable?
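
Purely as an illustration of that bulk idea, a sketch assuming the elasticsearch-ruby client (the index name and document shape are made up here, not taken from FPP's actual code):

require "elasticsearch"

client = Elasticsearch::Client.new

operations = facilities.map do |facility|  # facilities: the freshly parsed documents
  { index: { _index: "facilities", _id: facility["id"], data: facility } }
end

client.bulk(body: operations, refresh: true) # one round trip instead of one request per document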

@leandroradusky
Contributor Author

For facilities, the import script that is already there does this inside FPP:

id: @last_facility_id += 1,
source_id: f[:id].to_s

So I won't use UUIDs for them and will save the resmap one (as source_id). In the example dataset there were UUIDs everywhere, so I will use them for all the other entities.

@leandroradusky
Contributor Author

I realized that I wasn't 100% clear with respect to the deletion/updating of sites: this script creates a dataset to be imported with the bin/import-dataset script, which already exists and wasn't modified. I will create an issue to modify that script so it updates/deletes/creates facilities as appropriate, but that will go in a different PR.

Also, I will create an issue for your suggestion:

Then in a subsequent PR: use the metadata on the resmap fields to identify the fields to map, so we can use any collection in an abstract way, not just the collection we happen to have.

@ysbaddaden ysbaddaden left a comment (Contributor)

Just one typo and a couple of suggestions, then LGTM!

facility["last"] = nil # TODO
facility["photo"] = nil # TODO
facility
end
Contributor

nitpick: maybe:

def extract_facility(site, properties)
  return unless site["lat"] && site["lng"]
  {
    "lat" => site["lat"],
    ...
  }
end

facility["name"] = site["name"]
facility["lat"] = site["lat"]
return nil if facility["lat"].nil?
facility["lng"] = site["long"]
Contributor

Suggested change
facility["lng"] = site["long"]
facility["lng"] = site["lng"]

Contributor Author

No, in resmap sites the field name is long, not lng.

bin/import-resmap-collection (outdated comment, resolved)
@leandroradusky leandroradusky merged commit dc1e855 into master Dec 11, 2023

Successfully merging this pull request may close these issues.

Import data from ReMap