When we harvest a data package from schema.org, we create a canonical copy of the schema.org JSON-LD, and index that. If the SO entry contains a link to a more detailed metadata record as proposed in the SOSO guidelines, then we should also index that content. To do so, we need to resolve conflicts and issues of precedence (e.g., if the two metadata sources provide different titles), and determine how to merge them into a single package so they do not show up in the index as distinct data packages. This could involve creating an ORE and making both metadata docs members of the package, or other solutions.
Dave and I had a slack conversation on this, some of which is included below for context.
Matt Jones 11:25 AM
Hey @davev with the new schema.org harvester, are we also picking up and indexing related metadata records (like linked ISO or EML records)?
11:25
I’ve been talking about that as a strength of DataONE’s indexer, but I’ve recently realized that maybe under the new approach things would work differently
11:26
I think it would be good if we could index all metadata content if it’s linked in the SO record. Thoughts?
Dave Vieglais 11:27 AM
we could, it wouldn’t be much more work, except that it’s a bit confusing having multiple metadata
Matt Jones 11:27 AM
true
Dave Vieglais 11:27 AM
what would that even mean? treat the SO like a resourcemap?
Matt Jones 11:27 AM
but for groups that have both, seems like it would be a win
11:28
yeah, not sure. the big question is when the two metadata documents say different things — like SO and EML have different titles
Dave Vieglais 11:28 AM
yeah
Matt Jones 11:28 AM
it would be nice to treat them as additive
11:30
maybe SO is designated as primary… to resolve conflicts.. if that is how the records were harvested
11:30
when you pull in a SO dataset, do you create an ORE in GMN?
Dave Vieglais 11:30 AM
not right now, it’s just metadata. It’s easy enough to create the ORE, but version management gets painful
Matt Jones 11:31 AM
right now the ORE and other metadata documents are additive in terms of what is indexed, but I don’t think they overlap in content much. but we’ve talked about allowing that, so that PROV and semantic annotations can go in either the ORE or the metadata doc. Seems like the same issues exist with SO
Dave Vieglais 11:32 AM
yep. SO is just more metadata-ish than ORE
...
Matt Jones 11:37 AM
for IEDA nodes, are you indexing schema.org and ISO?
Dave Vieglais 11:37 AM
they are on the old pattern, which uses SO as a way to find the ISO, which is then retrieved, sys meta created, and served up to the CNs for indexing
Matt Jones 11:38 AM
ah, so the SO is discarded?
Dave Vieglais 11:39 AM
Yes I think so. Perhaps identifier and a couple other properties retained for sysmeta
Matt Jones 11:39 AM
it seems to me that the right thing for us to do over the long run is to index both, and have a well-established precedence for conflicts. Maybe we’re not ready to offer this to NEON yet….
Dave Vieglais 11:40 AM
we need to at least have a clear implementation pattern as to what goes where.
Matt Jones 11:41 AM
the SOSO guidelines say how to link in the extra metadata, so that seems like something we should follow and I think it would be pretty clear. (edited)
11:41
maybe we could add some language there about precedence for harvesters
Dave Vieglais 11:42 AM
ah, good point
Matt Jones 11:43 AM
which should be preferred for values — ISO/EML/etc, or the SO fields — when info is duplicated?
Dave Vieglais 11:43 AM
how would we handle that as an object in DataONE though? There’s two metadata docs, with separate PIDs that generate a single index record
Matt Jones 11:44 AM
yeah, that’s why I asked about the ORE
11:44
if we harvest it as a package, we could put both metadata docs in and link them via an ORE
11:44
and index them both with a precedence order
Dave Vieglais 11:45 AM
But they get indexed to separate index docs
Matt Jones 11:45 AM
we’ve always theoretically had the ability to have multiple metadata docs in a package
Dave Vieglais 11:45 AM
so there’s no precedence to consider - each populates a different index record
Matt Jones 11:46 AM
so the package shows up twice in searches? (edited)
Dave Vieglais 11:46 AM
potentially I guess - what happens now if there’s two metadata docs in one package?
Matt Jones 11:47 AM
I’m not sure we have encountered it
11:47
we’ve talked about doing it, but so far I think client tools avoid doing so
11:48
another use case for it is to have dataset metadata for data files (EML/ISO) and software metadata for software files (e.g., CodeMeta)
Dave Vieglais 11:49 AM
I guess the ORE really represents the single thing that is actually discovered
Matt Jones 11:49 AM
yeah
Dave Vieglais 11:49 AM
kind of flips the UI around a bit
Matt Jones 11:49 AM
but the indexer treats the METADATA records as primary, and then pulls in the ORE later to link to other parts of the package
11:49
so I think we straddle both models a bit
11:50
in theory I think the package is the right metaphor for an “entry” in our index
11:50
i.e., we should be indexing complex data packages and their content
Dave Vieglais 11:52 AM
yeah, resulting in one index row per package, with lots of properties on that row.
Matt Jones 11:52 AM
this is also the root of the DOI assignment issue between LTER and our other systems. In Metacat, we assign the DOI to the metadata doc, and it is used in the citation. In LTER they assign the DOI to the package, and it doesn’t show up properly in our citation. There’s an old issue around on this.
Dave Vieglais 11:54 AM
Yeah, I wondered about that. DOI should really point to the resource map, since from there you can discover the pieces of the package. imho
Matt Jones 11:54 AM
https://redmine.dataone.org/issues/8077
11:55
yeah, it just came from our historical use of EML as the “package” listing, with entities referenced in the EML, and ORE only added in later
Dave Vieglais 11:55 AM
yep
Matt Jones 11:56 AM
ok, well, thoughts on how I should respond to James given this context?
11:56
maybe I could tell them SO is an option, but then their EML wouldn’t be indexed, but that we hope to support both in the future?
11:57
or I could tell them SO is an option, and we could discuss the ramifications on a call?
11:58
sounds like it’s going to be low priority for them to keep their Metacat running
Dave Vieglais 11:59 AM
probably the second choice. SO option and a call to discuss consequences
Matt Jones 12:00 PM
sounds good
12:00
should we open an issue on resolving the multiple metadata problem in SO links in the future?
Dave Vieglais 12:01 PM
yeah, good point
Matt Jones 12:01 PM
a lot of this slack convo would be good background
12:01
where would that go? d1_cn_index_processor? (edited)
Dave Vieglais 12:03 PM
I’d be inclined to drop it in dataone
Matt Jones 12:04 PM
ah the top level repo?
12:04
ok, I’ll enter it there
Dave Vieglais 12:04 PM
there’s some other stuff in there at about the same level - and this SO+ thing touches on a bunch of stuff through the whole stack
12:04
thanks
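The "index them both with a precedence order" idea from the conversation above could look roughly like the sketch below. It is only an illustration, not how the indexer works today: the field names (`title`, `abstract`, `keywords`), the source labels, and the choice of SO as primary are all assumptions made for the example.

```python
from typing import Any

# Hypothetical precedence: sources listed first win conflicts on single-valued fields.
PRECEDENCE = ("schema.org", "eml")

# Hypothetical set of multi-valued index fields that are treated as additive.
MULTI_VALUED = frozenset({"keywords"})


def merge_index_fields(
    records: dict[str, dict[str, Any]],
    precedence: tuple[str, ...] = PRECEDENCE,
    multi_valued: frozenset[str] = MULTI_VALUED,
) -> dict[str, Any]:
    """Merge per-source field dicts into one index record for the package."""
    merged: dict[str, Any] = {}
    for source in precedence:
        for field, value in records.get(source, {}).items():
            if field in multi_valued:
                # Additive: union the values, keeping first-seen order.
                values = value if isinstance(value, list) else [value]
                bucket = merged.setdefault(field, [])
                for v in values:
                    if v not in bucket:
                        bucket.append(v)
            elif field not in merged:
                # Single-valued: the higher-precedence source already set it, or we set it now.
                merged[field] = value
    return merged


records = {
    "eml": {"title": "EML title", "abstract": "From EML", "keywords": ["a", "b"]},
    "schema.org": {"title": "SO title", "keywords": ["b", "c"]},
}
merged = merge_index_fields(records)
```

Here the SO title wins the conflict, the abstract is filled in from the EML because the SO record has none, and keywords from both documents are combined. Reversing `PRECEDENCE` would prefer the richer ISO/EML record instead, which is the open question in the thread.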
As for other solutions, I wonder how well it'd work if, when we encounter a reference to a more detailed record (via a schema:subjectOf triple with a suitable schema:encodingFormat for our systems), we just harvest and use that as the primary metadata record for dataset/DataPackage.
If we did want to hang on to the original JSON-LD and any other alternate formats we didn't harvest, an appropriate place might be in the ORE using rdfs:seeAlso or ore:similarTo (see Section 4.4 in the ORE Spec).
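As a rough sketch of the detection step described above, the following looks for `subjectOf` entries whose `encodingFormat` matches something we can index. It assumes compacted JSON-LD with schema.org as the default vocabulary (so the key is plain `subjectOf`); a real harvester would more likely expand the JSON-LD first. The `SUPPORTED_FORMATS` set and the function names are hypothetical.

```python
from typing import Any

# Assumed format identifiers our systems could index; illustrative only.
SUPPORTED_FORMATS = frozenset({
    "https://eml.ecoinformatics.org/eml-2.2.0",
    "http://www.isotc211.org/2005/gmd",
})


def as_list(value: Any) -> list:
    """JSON-LD allows a single value or an array; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_detailed_metadata(
    jsonld: dict[str, Any],
    supported: frozenset[str] = SUPPORTED_FORMATS,
) -> list[tuple[str, str]]:
    """Return (url, format) pairs for linked metadata records we could harvest."""
    links = []
    for doc in as_list(jsonld.get("subjectOf")):
        if not isinstance(doc, dict) or not doc.get("url"):
            continue
        for fmt in as_list(doc.get("encodingFormat")):
            if fmt in supported:
                links.append((doc["url"], fmt))
                break  # one match per linked document is enough
    return links


so_record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "subjectOf": {
        "@type": "DigitalDocument",
        "url": "https://example.org/metadata/eml.xml",
        "encodingFormat": ["application/xml", "https://eml.ecoinformatics.org/eml-2.2.0"],
    },
}
links = find_detailed_metadata(so_record)
```

If `links` is non-empty, the first entry could be fetched and used as the primary metadata record, with the original JSON-LD referenced from the ORE as suggested above.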