Re-structure Capella Bucket=>Scope=>Collection configuration #379

gopa-noaa · 2024-05-29T17:06:03Z

No change to the bucket: 3 scopes (development, integration, production) and 3 collections under each, currently just METAR, RAOB, and COMMON.

randytpierce · 2024-05-29T17:17:15Z

It implies that the metadata is moved from METAR collection to COMMON collection and that METAR collection will only have type "DD" documents (the same for RAOB collection). This will require code changes to ingest, metadata scripts, and client. randy


gopa-noaa · 2024-05-29T20:42:47Z

From some quick Googling, a scope cannot be renamed after it is created. I have sent an email to Couchbase ...
Worst case, we can do the following:

create new "development" scope
create a new "METAR" collection
re-configure our XDCR to vxdata=>development=>METAR
wait for data sync to complete
Delete the original _default=>METAR
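
As a rough sketch of the scope and collection parts of the steps above, the SQL++ statements could be generated like this. The helper name `migration_statements` is hypothetical, and the XDCR reconfiguration and the wait for sync happen outside these statements, so they appear only as comments.

```python
def migration_statements(bucket, new_scope, collection, old_scope="_default"):
    """Return SQL++ DDL for the scope/collection steps of the worst-case plan.

    Hypothetical helper for illustration; it only builds strings and does not
    talk to a cluster.
    """
    def keyspace(*parts):
        # Backtick-quote every path component so unusual names stay valid.
        return ".".join(f"`{p}`" for p in parts)

    return [
        # Step 1: create the new scope.
        f"CREATE SCOPE {keyspace(bucket, new_scope)}",
        # Step 2: create the collection inside it.
        f"CREATE COLLECTION {keyspace(bucket, new_scope, collection)}",
        # Steps 3 and 4 (repoint XDCR, wait for sync) are done outside SQL++.
        # Step 5: drop the original collection only after the sync completes.
        f"DROP COLLECTION {keyspace(bucket, old_scope, collection)}",
    ]


for statement in migration_statements("vxdata", "development", "METAR"):
    print(statement)
```

The drop is listed last on purpose: it should only be run once the new collection is confirmed to hold the synced data.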

ian-noaa · 2024-07-17T20:18:40Z

A couple of other questions:

Would it make sense to move more document types out into their own collections? I think we currently have MD (Metadata), DD (Data Document), and JOB/JOB-TEST documents. Are there other document types that would make sense to put in their own collections?
Could our document types be replaced by using collections more? If they are useful, when does it make sense to have a collection vs a document type field? E.g. - if the METAR collection solely contains type=DD documents, I could see dropping the type field unless there are reasons clients need to track that type.
Should the JOB-TEST docs be renamed to JOB and left in a "test" scope?
Does it make sense for the scorecard to be its own scope or does it make sense to be scoped with the rest of vxdata?
Can we XDCR at a scope level instead of a collection?

ian-noaa · 2024-07-19T15:45:20Z

To summarize the discussion from the dev meeting:

We decided we need to move this issue up and address how best to use collections, scopes, and buckets for our project & application.

We would like to come up with some use cases & whiteboard through how key parts of the application lifecycle would work with different data models. Ideally this would happen during the ingest meeting.

During the meeting we:

- Debated what would go into a common collection. The point was made that common is pretty generic (like default) and it could be better to have explicit & meaningful names to describe the data that collections hold so that we don't end up with a grab bag of data. However, we’re unsure of the performance tradeoffs of multiple collections.
- Called out that we will need scripts or SDK calls to create our DB schema if it becomes more complicated.

Information needed

What are collections, scopes, and buckets? What are their use cases?
How do collections, scopes, and buckets interact with XDCR & Time-To-Live fields?
It'd be useful to get a list of the Types, DocTypes, and Subsets we have in our documents and an idea of how we are using them. @randytpierce and @gopa-noaa may have the best input here.
Can we use collections/scopes/buckets to obviate some of the above fields (type, docType, and subset) in our documents? And do we want to? (I suspect no, to support our archiving & retrieval use case)
What use cases should we explore to ensure we have thought the DB schema through? This is something @ian-noaa, @randytpierce & @gopa-noaa should consider by the vxingest meeting. Off the top of my head I have:
- Ingesting data via cron, for various data types if relevant
- Ingesting data via event, for various data types if relevant
- Expiring data
- Retrieving archived data
- Querying data from MATS
Where does the scorecard fit into this? Should the data be stored in a separate bucket, scope, or collection?

Context

Couchbase Server 7 (released in 2021) introduced Scopes & Collections. Previously it was recommended to put all data in a “Bucket” and distinguish the documents with a type field. It appears scopes are recommended for data isolation (prod/dev environments, introducing schema changes, etc…) and collections are intended as a replacement for the previously recommended “type” field.
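
To make the difference concrete, here is a minimal sketch contrasting the two styles of query. The function names are hypothetical and the bucket, scope, and collection values are only examples. The first filters on a `type` field inside one keyspace; the second relies on the collection itself to select the documents.

```python
def query_by_type_field(bucket, doc_type):
    """Pre-collections style: one keyspace, documents told apart by a type field."""
    statement = f"SELECT d.* FROM `{bucket}` AS d WHERE d.type = $doc_type"
    # The value is passed as a named parameter instead of being spliced in.
    return statement, {"doc_type": doc_type}


def query_by_collection(bucket, scope, collection):
    """Collections style: the keyspace path already selects the documents."""
    return f"SELECT d.* FROM `{bucket}`.`{scope}`.`{collection}` AS d"


print(query_by_type_field("vxdata", "DD"))
print(query_by_collection("vxdata", "_default", "METAR"))
```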

gopa-noaa · 2024-07-22T17:27:02Z

This link explains Collections and Scope:
https://docs.couchbase.com/server/current/learn/data/scopes-and-collections.html

Just noting down some salient points below:

A collection is a data container.
Up to 1000 collections can be created per cluster.
A collection can be indexed; and it can be dropped. The data in a collection can be replicated, by means of XDCR.

A scope is a mechanism for the grouping of multiple collections. Up to 1000 scopes can be created per cluster.
A scope can be dropped. A scope cannot be indexed. The contents of a scope can be replicated, by means of XDCR.

Benefits of Scopes and Collections
The benefits of scopes and collections include:

The logical grouping of similar documents; potentially simplifying operations such as query, XDCR, and backup and restore.

The increased efficiency of indexing, due to the Data Service being able to provide documents from specific collections to the Index Service.

Simplified querying, since query statements are able to easily specify particular subsets of documents.

Easier migration from relational databases to Couchbase Server, since collections can be designed to correspond to pre-existing relational tables.

Secure isolation of different document-types, within a bucket; allowing applications to be specifically authorized to use only their appropriate subsets of data (see Access to Scopes and Collections, below).

This should help give us some guidance in organizing our document hierarchy. Let's plan to discuss further.

ian-noaa · 2024-07-22T21:58:55Z

Thanks, Gopa! That makes it sound like it would be beneficial to explore using collections more.

2. How do collections, scopes, and buckets interact with XDCR & Time-To-Live fields?

TTL fields

Couchbase can have a default TTL set on buckets and collections but not scopes. You can also use the SDK to set TTL individually for each document. If we went the second route, having the import process be in charge of setting TTL values would seem to make sense.
See Couchbase's Data Expiration docs.
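
If the import process were in charge of TTLs, a sketch of that decision might look like the following. The retention values and the `RETENTION` table are purely hypothetical placeholders, not agreed policy; the point is only that the TTL can be derived from fields already in each document.

```python
from datetime import timedelta

# Hypothetical retention policy keyed by document type. The numbers are
# placeholders for illustration, not values anyone has agreed on.
RETENTION = {
    "DD": timedelta(days=365),
    "DF": timedelta(days=30),
}


def ttl_for(doc, default=None):
    """Return the TTL the import process would attach to doc.

    None means "no per-document expiry", so any bucket or collection
    default would apply instead.
    """
    return RETENTION.get(doc.get("type"), default)


print(ttl_for({"type": "DD", "subset": "METAR"}))
print(ttl_for({"type": "MD", "subset": "COMMON"}))
```

The resulting value would then be handed to whatever expiry option the SDK in use exposes when the document is written.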

XDCR

XDCR is configured at the bucket level. However, filtering can be applied to map data to different collections or exclude collections/documents.
XDCR will not automatically create scopes and collections. Scopes & Collections must be preconfigured on each DB cluster.
See XDCR with Scopes & Collections
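
As a sketch of what an explicit mapping could look like, the snippet below builds a JSON object of `"scope.collection": "scope.collection"` pairs. My understanding is that this is the shape XDCR explicit mapping rules take, but that should be checked against the XDCR docs before relying on it. The function name and the default scope names are assumptions.

```python
import json


def xdcr_mapping_rules(collections, source_scope="_default", target_scope="development"):
    """Build a JSON string mapping source collections to target collections.

    Assumes rules of the form {"srcScope.srcColl": "dstScope.dstColl"};
    verify the exact format against the XDCR documentation.
    """
    rules = {
        f"{source_scope}.{name}": f"{target_scope}.{name}"
        for name in collections
    }
    # sort_keys keeps the output stable so it can be diffed or reviewed.
    return json.dumps(rules, sort_keys=True)


print(xdcr_mapping_rules(["METAR", "RAOB"]))
```

Because XDCR does not create scopes or collections, every target named in the rules has to exist on the destination cluster first.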

ian-noaa · 2024-07-25T12:55:42Z

During the dev meeting we confirmed that we:

Want the Database Scope to reflect the environment; development, test, and prod were mentioned.
- We also noted that we could use a new scope to distinguish between the on-prem and aws ingest systems.
- Do we want 3 copies of the data? How much data is retained/what data goes where?
Want the Database Collections to mirror the document subset fields
- We will need to redo our indices to take advantage of this
- The contents and naming of the "metadata" collection is still an open discussion. Do we have a singular metadata collection, or do we have multiple collections based on metadata type? (Job, Stations, etc...)
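
One way to picture "collections mirror the subset field" is a small routing function like the sketch below. It assumes that only type "DD" documents are routed by subset and that everything else lands in a single collection with the hypothetical name "metadata". The metadata question is still open, so treat that part as a placeholder.

```python
def collection_for(doc, metadata_collection="metadata"):
    """Pick a target collection for a document.

    Assumption: data documents (type "DD") go to the collection named after
    their subset; all other documents go to one metadata collection whose
    name here is a placeholder.
    """
    if doc.get("type") == "DD":
        return doc["subset"]
    return metadata_collection


print(collection_for({"type": "DD", "subset": "METAR", "docType": "obs"}))
print(collection_for({"type": "MD", "subset": "COMMON", "docType": "station"}))
```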

And we need the following for today:

@randytpierce & @gopa-noaa - To provide a list of document type, docType, and subset fields currently in use.
@randytpierce & @gopa-noaa - To consider what scenarios we want to whiteboard out. Currently, we have:

Ingesting data via cron, for various data types if relevant
Ingesting data via event, for various data types if relevant
Expiring data
Retrieving archived data
Querying data from MATS

randytpierce · 2024-07-25T16:14:28Z

So here are a few answers from my point of view.

- "We also noted that we could use a new scope to distinguish between the on-prem and aws ingest systems." No. There should not be a scope that distinguishes between "where" a system is deployed. That should not matter. The data is the data regardless of where it resides.
- "Do we want 3 copies of the data?" No. The test and development scopes should always be quite limited in size. Data duplication is not a good thing in this case, in my opinion.
- "Want the Database Collections to mirror the document subset fields" This is only because it is handy. The type "DD" documents are the lion's share of the data, and they are best suited to benefit from differentiated scopes. The type "MD" metadata documents won't benefit much from having different collections. In my opinion it would be fine to just put them all in a "metadata" collection named whatever. Indexing will be efficient because the data set will be small. I think the challenge for the metadata update scripts comes from querying the actual data anyway. Let's just pick a name and not worry about this too much.

The primary types that we have are ...

> select distinct raw type from vxdata._default.RAOB

[ "DD", "DF", "JOB", "MD" ]

- DD is data
- DF is data file (records what data file is already ingested)
- JOB is a JOB spec
- MD is a metadata document

I'm doing a more exhaustive query on the METAR collection, but that query may take a long time. I'll send those results later when the query finishes.

In addition to these there are test types, i.e. DD-TEST, MD-TEST, JOB-TEST, etc.

subsets are METAR, RAOB, and COMMON - same story as the above on the exhaustive list.

docType is a much more dynamic field than DD or MD.

> select distinct raw docType from vxdata._default.RAOB

[ "obs", "ingest", "ingest_mapping", "station", "stationReference" ]

These are all I have for RAOBS so far.

- obs are observations
- ingest are ingest templates
- ingest_mapping is used for prepbufr mnemonic mappings
- station is a station document
- stationReference is used for keeping a list of stations that we are interested in

In addition there will be...

- model - a model document
- partial_sums - a partial sums document
- ctc - a contingency document

and probably others.

For scenarios, use cases, etc. I think the five listed above are enough to get us started. Eventually we will need to do much more specific ones, but we do not have enough context yet to approach those.

randy


gopa-noaa · 2024-07-25T18:01:07Z

Here are the results from the METAR Collection:

select distinct raw docType from vxdata._default.METAR

[
  "obs",
  "model",
  "CTC",
  "SUMS"
]

gopa-noaa · 2024-07-25T18:06:10Z

And on the On-Prem Cluster:

select distinct raw docType from vxdata._default.METAR

[
  null,
  "CTC",
  "SUMS",
  "classic_stations",
  "ingest",
  "landUseTypes",
  "matsAux",
  "matsGui",
  "model",
  "obs",
  "region",
  "station"
]

gopa-noaa · 2024-07-25T18:32:52Z

On-Prem Cluster output for types:

select distinct raw type from vxdata._default.METAR

[
  "DF",
  "JOB-TEST",
  "JOB",
  "LJ",
  null,
  "DD",
  "MD-TEST",
  "MD",
  "DD-TEST"
]

JeffHamiltonNOAA · 2024-07-26T13:12:37Z

ian-noaa · 2024-07-31T20:15:19Z

To summarize the meeting last week:

We want to have two collection "types" largely based on document's subset field. The largest by far will be the "data" collections and will be based on the subset field of "Data" documents where type=DD. We also want to have one or more metadata collection(s).
It is important to retain subset and type fields in our documents so that those documents can be properly imported from the long term store.
Document IDs are constructed out of top-level predicates and follow this form: type:version:subset:docType:subdocType:<level>:<valid time epoch>. Note that <> items are optional.
MATS GUI documents should at a minimum have their own collection.
Currently we have a common subset - it encompasses regions and landuse and should be renamed to region as landuse is handled like a region. Potentially, we want to make this its own collection. Within that collection, it may be important to distinguish between region & landuse documents and to include metadata in the landuse documents specifying which landuse tables apply to which models.
If we have a "common" collection, it would be small so we may be able to create a primary index on it.
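
To illustrate the document ID form described above, here is a minimal sketch that joins the predicates with colons and appends the optional parts only when present. The function name and the example values (such as "V01" and "sub") are made up for illustration and are not taken from real documents.

```python
def build_doc_id(type_, version, subset, doc_type, sub_doc_type,
                 level=None, valid_epoch=None):
    """Join the top-level predicates into a colon-separated document ID.

    Follows type:version:subset:docType:subdocType:<level>:<valid time epoch>,
    where the two bracketed parts are optional and skipped when None.
    """
    parts = [type_, version, subset, doc_type, sub_doc_type]
    parts += [str(p) for p in (level, valid_epoch) if p is not None]
    return ":".join(parts)


# Example values are hypothetical.
print(build_doc_id("DD", "V01", "METAR", "obs", "sub", valid_epoch=1721923200))
print(build_doc_id("DD", "V01", "RAOB", "obs", "sub", level=500, valid_epoch=1721923200))
```

Since the subset and type are part of the ID and of the document body, a document pulled from the long term store still carries what is needed to decide where it belongs.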

Remaining questions:

where does the scorecard fit into this?
It's still unclear what we want to do with metadata. Should it go in a single collection or multiple? It was pointed out that the metadata may be small enough that we could set up a primary index on it. It sounds like common should be renamed to region.
scopes - federated db will need to write to a dev scope until it's operational. When it becomes operational, we'll need to determine if we use the existing prod scope, or create a new one and drop the old one.
What do we do with *_TEST types? Are they obviated if we have dev and test scopes?

I'm sure I missed a few things. 🙂

gopa-noaa · 2024-09-18T17:37:59Z

I forgot to take notes in the last meeting. If I remember correctly, here are the main points:

We will have multiple buckets, at least 2 for now: vxdata_prod and vxdata_dev. Both these buckets would be readily available for testing without resorting to any involved data/index setup.
Integration tests would be done against vxdata_prod bucket
Developers can create additional buckets for specific needs.
Currently we plan to use the "_default" scope
Multiple Collections under default scope like: METAR, RAOB etc

Questions:

What would be a good mechanism to load a subset of production data into another bucket?
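
One option to consider, sketched under the assumption that both buckets live on the same cluster and the subset is small enough for a query to handle: an SQL++ INSERT ... SELECT that copies matching documents and keeps their IDs. The helper name and the example filter are hypothetical, and this is not a tested recommendation.

```python
def copy_subset_statement(collection, where,
                          src_bucket="vxdata_prod", dst_bucket="vxdata_dev",
                          scope="_default"):
    """Build an SQL++ statement copying matching documents between buckets.

    The where clause is inserted as-is, so it must come from a trusted source.
    """
    src = f"`{src_bucket}`.`{scope}`.`{collection}`"
    dst = f"`{dst_bucket}`.`{scope}`.`{collection}`"
    return (
        f"INSERT INTO {dst} (KEY k, VALUE v) "
        f"SELECT META(d).id AS k, d AS v FROM {src} AS d WHERE {where}"
    )


# Hypothetical filter: only observation data documents.
print(copy_subset_statement("METAR", "d.type = 'DD' AND d.docType = 'obs'"))
```

The target collection has to exist before the statement is run, and the query needs a usable index on the source for the filter.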

gopa-noaa added the couchbase, VXingest (issues related to the VXingest project), and task (Tasks break a project down into discrete steps) labels May 29, 2024

gopa-noaa self-assigned this May 29, 2024

ian-noaa mentioned this issue Jul 24, 2024

Move metadata records into a common collection #376

Open

ian-noaa mentioned this issue Aug 21, 2024

Design of development, integration and production data storage hierarchy in Couchbase and Capella #411

Open
