updated documentation
b3v committed Oct 15, 2024
1 parent 9c55a70 commit f32ea6a
Showing 2 changed files with 176 additions and 0 deletions.
79 changes: 79 additions & 0 deletions api_documentation.md
# Digital Commonwealth API Guide

Basic information on available APIs and creating requests can be found here: https://www.digitalcommonwealth.org/api.

The sections below provide more detail on fetching metadata records and associated IIIF manifests, binary files, OCR text, etc.

## Content architecture
APIs can be used to retrieve information about:
1. **Items**, which are things like books, photographs, maps, newspaper issues, paintings, audio recordings, films, etc.
2. **Files**, which are binary objects like images, text files, audio files, video files, etc.

## Fetching metadata records
Use the JSON API to request a list of records, or the full metadata record for an item. A record contains the relevant information about an item, like its title, creator, date, topical coverage, format, etc.

A list of fields and their definitions is available here: [SolrDocument field reference: public API](https://github.com/boston-library/solr-core-conf/wiki/SolrDocument-field-reference:-public-API)

#### List of records
Any search that can be performed in the [front-end UI](https://www.digitalcommonwealth.org) can be turned into an API query by replacing `/search` with `/search.json` in the request URL.

List of records matching keyword "Boston":
```
https://www.digitalcommonwealth.org/search.json?q=Boston&search_field=all_fields
```
List of records matching keyword "Boston" and format="Maps":
```
https://www.digitalcommonwealth.org/search.json?f%5Bgenre_basic_ssim%5D%5B%5D=Maps&q=Boston&search_field=all_fields
```
List all records:
```
https://www.digitalcommonwealth.org/search.json?search_field=all_fields&q=
```

Each API call returns 20 records by default; this can be adjusted by setting `per_page` in the query params (100 max). Iterate over the pages of results to retrieve all matching records, as in the sketch below.
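
The following Python sketch (using the `requests` library) illustrates one way to page through search results. The `page` query parameter and the `response.docs` location of the record list are assumptions based on a typical Blacklight-style response and should be verified against a live `search.json` result.

```python
import requests

SEARCH_URL = "https://www.digitalcommonwealth.org/search.json"

def fetch_records(query, per_page=100, max_pages=None):
    """Yield metadata records for a keyword search, one page at a time."""
    page = 1
    while True:
        resp = requests.get(
            SEARCH_URL,
            params={
                "q": query,
                "search_field": "all_fields",
                "per_page": per_page,  # 100 is the documented maximum
                "page": page,          # assumed pagination parameter
            },
            timeout=30,
        )
        resp.raise_for_status()
        # Assumed response shape: the record list lives under response -> docs
        docs = resp.json().get("response", {}).get("docs", [])
        if not docs:
            break
        yield from docs
        if max_pages is not None and page >= max_pages:
            break
        page += 1

# Example: collect the first two pages of records matching "Boston"
records = list(fetch_records("Boston", max_pages=2))
```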

#### Full metadata for an item

To retrieve the full metadata record for an individual item as JSON, append `.json` to the page URL, as in the examples below:
```
# normal, return HTML
https://www.digitalcommonwealth.org/search/commonwealth:abcd12345
# return JSON
https://www.digitalcommonwealth.org/search/commonwealth:abcd12345.json
```

The _canonical URL_ for an item is found in the `identifier_uri_ss` value of the JSON metadata record and looks like:
```
https://ark.digitalcommonwealth.org/ark:/50959/:id
```
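
As a minimal sketch, the item record and its canonical URL can be fetched like this; the `response` -> `document` wrapper around the record is an assumption about the JSON shape and should be checked against a live response (the item ID is the placeholder from the example above).

```python
import requests

def fetch_item_record(item_id):
    """Fetch the full JSON metadata record for a single item."""
    url = f"https://www.digitalcommonwealth.org/search/{item_id}.json"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Assumed wrapper: the record is nested under response -> document
    return resp.json().get("response", {}).get("document", {})

record = fetch_item_record("commonwealth:abcd12345")   # placeholder ID from the example above
canonical_url = record.get("identifier_uri_ss")         # canonical ARK URL
manifest_url = record.get("identifier_iiif_manifest_ss")  # IIIF manifest URL, if present
```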

## Fetching image content using the IIIF Manifest

To retrieve the image content associated with an item, use the item's [IIIF Presentation Manifest](https://iiif.io/api/presentation/2.1/) to get the list of its images.

The IIIF manifest URL is found in the `identifier_iiif_manifest_ss` field in the JSON metadata record.

Alternatively, you can construct the manifest URL from the canonical URL (found in the `identifier_uri_ss` value of the JSON metadata record) by appending `/manifest`. The manifest lists the item's component images in order, for example:
```
https://ark.digitalcommonwealth.org/ark:/50959/abcd12345/manifest
```
(If no manifest exists, the URL will return a 404.)

In the manifest, there is a `"sequences"` key that provides a list of the item's image files ("canvases"). For each canvas, use the `["images"]["resource"]["service"]["@id"]` path to obtain the base IIIF image URL. This will look something like:
```
https://iiif.digitalcommonwealth.org/iiif/2/commonwealth:zyxw98765
```
You can use the IIIF Image API to obtain any size image you want (see the IIIF [Image API 2.1.1](https://iiif.io/api/image/2.1/) guide for more information). For example, to obtain the full size image in JPEG format, use this syntax:
```
https://iiif.digitalcommonwealth.org/iiif/2/commonwealth:zyxw98765/full/full/0/default.jpg
```
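
Putting these steps together, here is a short sketch that walks the `sequences` → `canvases` → `images` → `resource` → `service` → `@id` path described above and builds full-size JPEG URLs for an item; the ARK URL in the usage comment is the placeholder from the earlier examples.

```python
import requests

def iiif_image_urls(canonical_url):
    """Return full-size JPEG URLs for every image (canvas) in an item's IIIF manifest."""
    resp = requests.get(f"{canonical_url}/manifest", timeout=30)
    if resp.status_code == 404:   # no manifest exists for this item
        return []
    resp.raise_for_status()
    urls = []
    for sequence in resp.json().get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                base = image["resource"]["service"]["@id"]   # base IIIF Image API URL
                urls.append(f"{base}/full/full/0/default.jpg")
    return urls

# Placeholder ARK from the examples above:
# iiif_image_urls("https://ark.digitalcommonwealth.org/ark:/50959/abcd12345")
```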

## Fetching text content for an item

A plain-text serialization of the transcribed text for an item (if it exists) can be returned by appending `/text` to the canonical URL (found in the `identifier_uri_ss` value from the JSON metadata record), for example:
```
https://ark.digitalcommonwealth.org/ark:/50959/abcd12345/text
```
(If no text is available the URL will return a 404.)
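
A small sketch for retrieving the transcription, treating a 404 as "no text available":

```python
import requests

def fetch_item_text(canonical_url):
    """Return the item's plain-text transcription, or None if none exists."""
    resp = requests.get(f"{canonical_url}/text", timeout=30)
    if resp.status_code == 404:   # no transcribed text for this item
        return None
    resp.raise_for_status()
    return resp.text
```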

97 changes: 97 additions & 0 deletions dataset-documentation/DATASETDOC-Fall24.md
***Project Information***

* What is the project name?

Our name: LibRAG
Name given by Spark!: Boston Public Library: Retrieval Augmented Generation

* What is the link to your project’s GitHub repository?
https://github.com/BU-Spark/ml-bpl-rag/tree/main

* What is the link to your project’s Google Drive folder? ***This should be a Spark! owned Google Drive folder - please contact your PM if you do not have access***
https://drive.google.com/drive/u/1/folders/12_tsVcUgwdfUdXalD67NOgUL3tGeI6ss

* In your own words, what is this project about? What is the goal of this project?
For our project, we are creating a Retrieval-Augmented Generation pipeline for the Boston Public Library. The goal is to augment BPL's search function to support natural language queries.

* Who is the client for the project?
Eben English, the Digital Repository Services Manager at BPL.

* Who are the client contacts for the project?
* What class was this project part of?
CS/DS 549 Machine Learning Practicum

***Dataset Information***

* What data sets did you use in your project? Please provide a link to the data sets; this could be a link to a folder in your GitHub Repo, Spark! owned Google Drive Folder for this project, or a path on the SCC, etc.
* Please provide a link to any data dictionaries for the datasets in this project. If one does not exist, please create a data dictionary for the datasets used in this project. **(Example of data dictionary)**
* What keywords or tags would you attach to the data set?
* Domain(s) of Application: Computer Vision, Object Detection, OCR, Image Classification, Image Segmentation, Facial Recognition, NLP, Topic Modeling, Sentiment Analysis, Named Entity Recognition, Text Classification, Summarization, Anomaly Detection, Other
* Sustainability, Health, Civic Tech, Voting, Housing, Policing, Budget, Education, Transportation, etc.

*The following questions pertain to the datasets you used in your project.*
*Motivation*

* For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

*Composition*

* What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? What is the format of the instances (e.g., image data, text data, tabular data, audio data, video data, time series, graph data, geospatial data, multimodal (please specify), etc.)? Please provide a description.
* How many instances are there in total (of each type, if appropriate)?
* Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).
* What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.
* Is there any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include redacted text.
* Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them
* Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
* Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources,
* Are there guarantees that they will exist, and remain constant, over time;
* Are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created)?
* Are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points as appropriate.
* Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.
* Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
* Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.
* Dataset Snapshot, if there are multiple datasets please include multiple tables for each dataset.


| Size of dataset | |
| :---- | :---- |
| Number of instances | |
| Number of fields | |
| Labeled classes | |
| Number of labels | |



*Collection Process*

* What mechanisms or procedures were used to collect the data (e.g., API, artificially generated, crowdsourced - paid, crowdsourced - volunteer, scraped or crawled, survey, forms, or polls, taken from other existing datasets, provided by the client, etc)? How were these mechanisms or procedures validated?
* If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
* Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

*Preprocessing/cleaning/labeling*

* Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section.
* Were any transformations applied to the data (e.g., cleaning mismatched values, cleaning missing values, converting data types, data aggregation, dimensionality reduction, joining input sources, redaction or anonymization, etc.)? If so, please provide a description.
* Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data; this could be a link to a folder in your GitHub Repo, Spark! owned Google Drive Folder for this project, or a path on the SCC, etc.
* Is the code that was used to preprocess/clean the data available? If so, please provide a link to it (e.g., EDA notebook/EDA script in the GitHub repository).

*Uses*

* What tasks has the dataset been used for so far? Please provide a description.
* What (other) tasks could the dataset be used for?
* Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
* Are there tasks for which the dataset should not be used? If so, please provide a description.

*Distribution*

* Based on discussions with the client, what access type should this dataset be given (eg., Internal (Restricted), External Open Access, Other)?

*Maintenance*

* If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description.

*Other*

* Is there any other additional information that you would like to provide that has not already been covered in other sections?
