Releases: surfedushare/harvester
Overwrite API
Re-enables an old API that allows clients to overwrite certain data coming from sources. For now it only allows overwriting metrics data, which sources don't provide at all. In the future it may also allow overwriting content data.
New search client
This release is the prerelease of the new search-client inside the harvester.
There are significant changes and these stand out especially:
- If the data is invalid according to our new validation layer then a product/material will be considered "inactive". We'll need to hunt these down in the admin and correct the data in the source where possible (or decide to adjust validation)
- For certain queries (with "leenwoorden", i.e. loanwords) multilingual products are expected to rank higher without actually being more relevant. This is a drawback of the new index schema, which has no solution at the moment.
- Moving from multiple indices with different languages to a single index with multiple languages means that the single index takes on quite a bit of complexity. See for instance the new search fields variable. I’ve tried to cut down on all this repetition by introducing a way to shorthand fields. This field notation can be interpolated to all appropriate language fields.
- We’re still experimenting with Pydantic and although it looks very good on first use we still need to change a lot of things to fully leverage its potential.
- Once Harvester uses the new search-client on production and provides the new Metadata tree, it will be harder for people to update the filter translations. Translations for some values will need to be done twice.
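To illustrate the field shorthand idea mentioned above, here is a minimal sketch that expands one shorthand into a field per language. The `{lang}` placeholder, the language codes and the `expand_fields` name are assumptions made for illustration; they are not the notation the search client actually uses.

```python
# Hypothetical sketch: the "{lang}" placeholder and the language list are
# assumptions, not the search client's real shorthand notation.
LANGUAGES = ["nl", "en"]


def expand_fields(shorthands, languages=LANGUAGES):
    """Expand each shorthand into one concrete field name per language."""
    expanded = []
    for shorthand in shorthands:
        if "{lang}" in shorthand:
            # Interpolate the shorthand once for every supported language.
            expanded.extend(shorthand.replace("{lang}", lang) for lang in languages)
        else:
            # Fields without a placeholder are language independent.
            expanded.append(shorthand)
    return expanded


search_fields = expand_fields(["title.{lang}", "description.{lang}", "keywords"])
```

Writing each field once and interpolating it keeps the list of search fields short, at the cost of one extra expansion step before a query is built.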
Testing can be done on /api/v1/docs and needs to include the following.
- Search with no entities parameter. This should return the same results as previous releases.
- Search with entities=products:default. This should search the new index.
- English search for study vocabulary
- English search for consortium
- English and Dutch search for disciplines
- Switching language in Sharekit should result in improved search (when switching to correct language)
- Metadata tree should return the same tree as previous releases when no entity has been specified.
- Metadata tree should return a tree with different fields when entity=products:default has been given as a parameter.
  - "study_vocabulary.keyword" field replaces "study_vocabulary"
  - "disciplines_normalized.keyword" replaces "learning_material_disciplines_normalized"
  - "language" replaces "language.keyword"
  - "published_at" replaces "publisher_date" when trying to search alphabetically. The field "modified_at" can sort on last modified date.
  - "licenses" replaces "copyright" and filters based on all licenses for all files that belong to a product/material.
  - "technical_types" replaces "technical_type" and allows filtering on file type of all files that belong to a product/material.
- Autocomplete as well as suggestions have not changed, but use entities=products:default as parameter to use the new index.
- Stats has changed slightly. It will return counts per entity and you can test this by specifying entities=products:default as parameter.
- The "find document" endpoints now require an SRN instead of an external_id.
- There are a number of additional fields in API responses that can be used:
  - "entity" contains a string with the entity type like: products or projects (NB: a material is a product)
  - "score" contains the score given by the search engine to a result (default is 0.0).
  - Authors contain an "is_external" boolean, but currently it's always set to false.
  - For files "priority" has been added.
  - For Publinova "types" contains all file types for a product and "licenses" contains all copyright licenses for the product.
- And then there are minor updates to API fields. These should be double checked to make sure they don't accidentally break functionality on Publinova:
  - The "highlight" field might be null. If only one of its "text" or "description" properties is set, the other property will be null instead of undefined.
  - The fields "published_at" and "modified_at" contain dates and no longer times.
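Clients that switch to entities=products:default may want to translate old field names in one place. The sketch below collects the renames listed above in a single mapping. The shape of the `filters` argument (field name mapped to a list of values) and the `migrate_filters` name are assumptions for illustration, not part of the API.

```python
# Old field name -> new field name, taken from the list of renames above.
FIELD_RENAMES = {
    "study_vocabulary": "study_vocabulary.keyword",
    "learning_material_disciplines_normalized": "disciplines_normalized.keyword",
    "language.keyword": "language",
    "publisher_date": "published_at",
    "copyright": "licenses",
    "technical_type": "technical_types",
}


def migrate_filters(filters):
    """Return a copy of filters with old field names replaced by new ones.

    The dict-of-lists shape of `filters` is a hypothetical client-side format.
    """
    return {
        FIELD_RENAMES.get(field, field): values
        for field, values in filters.items()
    }
```

Note that "licenses" and "technical_types" cover all files of a product/material, so a renamed filter may match more results than the old one did.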
New search client
v1.41.24 Ups version for release (v1.41.23)
Sharekit new files structure
Adds properties to Sharekit files like "priority" and "copyright".
It also uses ETag hash values to determine whether files have changed and tasks need to be re-run.
Adds BUAS, Hanze and HKU to the new harvester, making the new harvester feature complete.
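The ETag check for Sharekit files described above could look roughly like the sketch below. The function name and the choice to re-run tasks when no ETag is available are assumptions; the harvester's actual logic may differ.

```python
def needs_rerun(stored_etag, current_etag):
    """Decide whether file tasks should re-run, based on an ETag comparison."""
    if current_etag is None:
        # Assumption: without an ETag we cannot prove the file is unchanged,
        # so we err on the side of re-running the tasks.
        return True
    # A differing ETag means the file content changed since the last harvest.
    return stored_etag != current_etag
```

Comparing hashes avoids downloading and processing files again when their content is the same as during the previous harvest.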
Django 4.2
This release updates Django because 3.2 no longer receives security updates.
This release prepares for the Python 3.12 update.
This is the last release to completely contain the old harvester code.
New harvester (live)
First version of new harvester that will function on production
New harvester
First release of basic new harvester functionalities
MBO educational level
This release does some minor updates before truly merging the new Harvester into the acceptance branch:
- Allow MBO educational level if it's accompanied by HBO or WO level.
- Fill in abstract study vocabulary terms based on concrete study vocabulary terms.
- Standardize the publisher_date format.
- Removes inactive logging for educational levels specifically.
- Updates some Python packages.
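The MBO rule from the list above can be expressed as a small check. This is a sketch only: the level labels "MBO", "HBO" and "WO" as plain strings and the function name are assumptions, and levels without MBO are simply accepted here because other validation rules are out of scope.

```python
# Assumption: educational levels are represented by these plain labels.
HIGHER_LEVELS = {"HBO", "WO"}


def mbo_level_is_allowed(levels):
    """Return True unless MBO appears without an accompanying HBO or WO level."""
    levels = set(levels)
    if "MBO" in levels:
        # MBO is only acceptable when HBO or WO is present as well.
        return bool(levels & HIGHER_LEVELS)
    # Without MBO this particular rule does not apply.
    return True
```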
Pipeline refactor v2
Another pre-release for the pipeline refactor.
This is the final pre-release and points to a commit that we may want to jump back to for reference in the future.
Pipeline refactor v1
Starts a major refactor that aims to make a lot of things possible in the future, especially harvesting entity types other than LearningMaterial and Product.
Phase 1 only cleans up some of the code and makes space for the new expected modules.
For Publinova it also introduces author ids based on names if no other ids are returned by the source.
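One way to derive author ids from names is to hash a normalized form of the name. The sketch below shows that idea. The hashing scheme, the "name:" prefix, the normalization and the dict keys are all assumptions for illustration and not necessarily how the harvester generates these ids.

```python
import hashlib


def author_id(author):
    """Return the source id if present, else a deterministic name-based id."""
    if author.get("id"):
        return author["id"]
    # Assumption: lowercase and collapse whitespace so that small spelling
    # differences in the source do not produce different ids.
    normalized = " ".join(author["name"].lower().split())
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    # The prefix marks the id as derived from a name rather than from a source.
    return "name:" + digest
```

A deterministic id lets the same author be recognized across harvests, but two different people with the same name would end up with the same id.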