Script to import resmap collection & small fixes #209

Merged 4 commits into master on Dec 11, 2023

Conversation

leandroradusky
Contributor

Small changes & fixes:

  • The map zoom and boundaries were changed so that Africa is fully visible, which makes the work more comfortable (we are working with Ethiopian and Tanzanian data, and will also work with Somalian data).
  • The bin/import-dataset script ran successfully (tested with the raw dataset available on the Slack channel), but the facilities couldn't be visualized because the photo field was not set for facilities; it was added.
  • The es locale in development generated unnecessary problems, since the datasets we're working with have no Spanish data.

Import ResMap collection:

New script to import data, consuming it directly from ResourceMap (see the sketch after the list below).

  • ResourceMap::Collection will be filled with hashes representing Sites in ResourceMap.
  • ResmapSiteParser is a class that parses each of these Sites and generates an FPP Facility, as well as all the other entities (Categories, Services, Locations) with their hierarchies.
  • All this information is written to a folder passed as a parameter, to then be imported with the bin/import-dataset script.
  • Errors during the import are logged in the same folder (in the Tanzania dataset, the one used to test this script, there were Sites without coordinates and only one without a Category).
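
A rough sketch of how these pieces are meant to fit together (the constructor arguments and the method names sites, parse_site and write_dataset are illustrative placeholders, not the actual API introduced by this PR):

output_dir = ARGV[0]                            # folder passed as a parameter
collection_id, username, password = ARGV[1..3]  # hypothetical arguments; the real script's interface may differ

collection = ResourceMap::Collection.new(collection_id, username, password)
parser = ResmapSiteParser.new

collection.sites.each do |site|   # hashes representing ResourceMap Sites
  parser.parse_site(site)         # builds the FPP Facility plus Categories, Services and Locations
rescue StandardError => e
  File.open(File.join(output_dir, "errors.log"), "a") { |f| f.puts "#{site["id"]}: #{e.message}" }
end

parser.write_dataset(output_dir)  # files to be imported afterwards with bin/import-dataset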

Some notes:

  • The entities' IDs are generated with the Digest::MD5.hexdigest(name) function. The IDs are never displayed; they are only used in the database. Other datasets, like the raw dataset, used UUIDs. When importing Sites from ResMap the IDs are auto-incremented numbers, which could cause errors in FPP; that's why this decision was made. Any suggestions are welcome here.
  • Analyzing the Tanzania dataset, there is no information about the facility_types, which in FPP have a name and a priority. Since there was no data for this, a single type, "Institution", with priority "1" was hardcoded for all the facilities. This will have to be revised in the PdMing stage.
  • No photo information was present in the Tanzania dataset, so nothing was done with respect to this; it is also to be revised in the PdMing stage.
  • Nothing was done with respect to deleted sites and last update: given that for 10k sites (Tanzania dataset) the running times are on the order of seconds, I'm not sure whether it's more convenient to delete everything and reload it, or to update the existing data, which would require checking a lot of stuff.

@leandroradusky
Contributor Author

With respect to the categories and category groups: they are not displayed correctly, I will rework that. The rest of the items appear to be shown ok.

@leandroradusky leandroradusky linked an issue Nov 30, 2023 that may be closed by this pull request
@ysbaddaden ysbaddaden left a comment (Contributor)


This is very nice!

This can be merged as-is, or you can push further:

  1. Create a ResourceMap namespace and split a few files, for example:
  • app/models/resource_map/collection.rb
  • app/models/resource_map/site_parser.rb

note: we might want to reuse the "normalization" naming that is already used, rather than introduce a new "parser" term?

  2. Move the CSV generation out of the site parser and keep it in the script.

Then in a subsequent PR: use the metadata on the resmap fields to identify the fields to map, so we can use any collection in an abstract way, not just the collection we happen to have.


def generate_unique_id(name)
return nil if name.nil?
Digest::MD5.hexdigest(name) # TODO: use UUIDs? How to encode ids?
Contributor

The id can be used for in-place updates and it might be used through the API, so it must be stable (well, as stable as possible). Calculating the hash digest of a stable value is a good solution since the hash digest will be as stable as the input value.

Alternatively we could use UUIDv3 or UUIDv5 to generate stable UUIDs that are neither random nor time based (they internally use MD5 or SHA1) but I'm not sure this is worth it.

See https://en.wikipedia.org/wiki/Universally_unique_identifier#Versions_3_and_5_(namespace_name-based)
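
For reference, a minimal sketch of that alternative, assuming the ActiveSupport core extension is available in this Rails app (the namespace constant is an arbitrary choice):

require "active_support/core_ext/digest/uuid"

def generate_unique_id(name)
  return unless name
  Digest::UUID.uuid_v5(Digest::UUID::URL_NAMESPACE, name) # stable: the same name always yields the same UUID
end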

Contributor

nitpick: Digest::MD5.hexdigest(name) if name.presence feels more like Ruby.

return nil if category_name.nil?

# Create or find the category
unless categories.has_key?(category_name)
Contributor

bug:

Suggested change
unless categories.has_key?(category_name)
unless categories.has_key?(category_id)

category_group_name = properties["fac_type-2"] ? properties["fac_type-1"] : category_name
category_group_id = generate_unique_id(category_group_name)

return nil if category_name.nil?
Contributor

note: this should be right after the assignment of category_name.

nitpick:

Suggested change
return nil if category_name.nil?
return unless category_name

Comment on lines 201 to 208
unless locations.key?(location_id)
location = {}
location["id"] = location_id
location["name"] = value
location["parent_id"] = parent_ids[level - 1] # Parent from previous level

locations[location_id] = location
end
Contributor

nitpick: Ruby's idiomatic way is:

Suggested change
unless locations.key?(location_id)
location = {}
location["id"] = location_id
location["name"] = value
location["parent_id"] = parent_ids[level - 1] # Parent from previous level
locations[location_id] = location
end
locations[location_id] ||= {
"id" => location_id,
"name" => value,
"parent_id" => parent_ids[level - 1], # Parent from previous level
}

@ysbaddaden
Contributor

Additional notes:

The more stability we have, the less likely we are to break anything when we do an upgrade. For example, if we use random UUIDs then the IDs will change on each upgrade, and won't that break each facility URL?

I'd hash resmap-id for the facility id. It should be perfectly stable over time.

I'm not fond of deleting everything and reinserting everything. That would break FPP during the upgrade, and since everything's stored in ElasticSearch and not in PostgreSQL, we can't even count on a database transaction (:face_exhaling:). Now, maybe doing the whole update in one bulk request would be fast enough to be acceptable?
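
Purely as an illustration of that bulk idea, a sketch assuming the elasticsearch-ruby client (the index name and document shape are made up here, not taken from FPP's actual code):

require "elasticsearch"

client = Elasticsearch::Client.new

operations = facilities.map do |facility|  # facilities: the freshly parsed documents
  { index: { _index: "facilities", _id: facility["id"], data: facility } }
end

client.bulk(body: operations, refresh: true) # one round trip instead of one request per document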

@leandroradusky
Contributor Author

For facilities, the import script that is already there does this inside FPP:

id: @last_facility_id += 1,
source_id: f[:id].to_s

So I won't use UUIDs for them and will save the resmap one (as source_id). In the example dataset there were UUIDs everywhere, so I will use them for all the other entities.

@leandroradusky
Contributor Author

I realized that I wasn't 100% clear with respect to the deletion/updating of sites: this script creates a dataset to be imported with the bin/import-dataset script, which already exists and wasn't modified. I will create an issue to modify that script so it updates/deletes/creates facilities as appropriate, but that will go in a different PR.

Also, I will create an issue for your suggestion:

Then in a subsequent PR: use the metadata on the resmap fields to identify the fields to map, so we can use any collection in an abstract way, not just the collection we happen to have.

@ysbaddaden ysbaddaden left a comment (Contributor)

Just one typo and a couple of suggestions, then LGTM!

facility["last"] = nil # TODO
facility["photo"] = nil # TODO
facility
end
Contributor

nitpick: maybe:

def extract_facility(site, properties)
  return unless site["lat"] && site["lng"]
  {
    "lat" => site["lat"],
    ...
  }
end

facility["name"] = site["name"]
facility["lat"] = site["lat"]
return nil if facility["lat"].nil?
facility["lng"] = site["long"]
Contributor

Suggested change
facility["lng"] = site["long"]
facility["lng"] = site["lng"]

Contributor Author

No, in resmap sites the field name is long, not lng.

bin/import-resmap-collection (outdated comment, resolved)
@leandroradusky leandroradusky merged commit dc1e855 into master Dec 11, 2023

Successfully merging this pull request may close these issues.

Import data from ReMap