Merge pull request #91 from LSSTDESC/u/stuart/v0.4.0

Placeholder branch for v0.4.0
LSSTDESC · Apr 27, 2024 · c93c847
2 parents bb23167 + fc2b7bc
commit c93c847
Showing 50 changed files with 3,174 additions and 1,810 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -137,9 +137,6 @@ jobs:
         run: |
           cd tests/end_to_end_tests
 
-          # Register more database entries using the CLI
-          bash create_test_entries_cli.sh
-
           # Run some test queries
           pytest -v test_*.py
 
@@ -193,8 +190,5 @@ jobs:
         run: |
           cd tests/end_to_end_tests
 
-          # Register more database entries using the CLI
-          bash create_test_entries_cli.sh
-
           # Run some test queries
           pytest -v -m "not skip" test_*.py
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,43 @@
+## Version 0.4.0
+
+Version 0.4.0 focuses on manipulating data already within the
+dataregistry, i.e., adding the ability to delete and modify existing datasets.
+
+### Changelog for developers:
+
+- `Registrar` now has a class for each table. Each inherits from a `BaseTable`
+  class, so shared functions, like deleting entries, are available for all
+  tables. (#92)
+- Working with tables via the python interface has slightly different syntax
+  (see user changelog below). (#92)
+- `is_valid` is removed as a `dataset` property. It has been replaced with
+  `status`, which is a bitmask (bit 0="valid", bit 1="deleted" and bit
+  2="archived"), so datasets can now be in a combination of multiple states. (#93)
+- `archive_date`, `archive_path`, `delete_date`, `delete_uid` and `move_date`
+  have been added as new `dataset` fields. (#93)
+- Database version bumped to `2.0.1` (#93)
+- `dataset` entries can be deleted (see below) (#94)
+- The CI for the CLI is now pure Python (i.e., there is no more bash script to
+  ingest dummy entries into the registry for testing).
+- Can no longer "bump" a dataset that has a version suffix (trying to do so
+  will raise an error). If a user wants to make a new version of a dataset with
+  a suffix they can still do so by manually specifying the version and suffix
+  (#97).
+- Dataset entries can be modified (see below, #100)
+
+### Changelog for users:
+
+- All database tables (`dataset`, `execution`, etc.) now share a common
+  syntax. The functionality is still accessed via the `Registrar` class, but
+  to register a dataset, for example, you now call `Registrar.dataset.register()`,
+  and similarly `Registrar.execution.register()` for an execution (#92). The docs
+  and tutorials have been updated (#95).
+- `dataset` entries can now be deleted using the
+  `Registrar.dataset.delete(dataset_id=...)` function. This will also delete
+  the raw data within the `root_dir`. Note that the entry in the database will
+  always remain (with an updated `status` field to indicate it has been
+  deleted). (#94)
+- Documentation has been updated for clarity and is now split into
+  more focused tutorials (#95).
+- Certain dataset quantities can be modified after registration (#100).
+  Documentation has been updated with examples.
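
The `status` bitmask described in the developer changelog above can be illustrated with a minimal Python sketch. Only the bit positions (0 = valid, 1 = deleted, 2 = archived) come from the changelog; the `DatasetStatus` class, its member names and the `has_flag` helper are hypothetical and are not part of the `dataregistry` API.

```python
from enum import IntFlag


class DatasetStatus(IntFlag):
    """Hypothetical flag names; only the bit positions come from the changelog."""

    VALID = 1 << 0     # bit 0
    DELETED = 1 << 1   # bit 1
    ARCHIVED = 1 << 2  # bit 2


def has_flag(status: int, flag: DatasetStatus) -> bool:
    """Return True if the given bit is set in an integer status value."""
    return bool(status & flag)


# A dataset that is both valid and archived has bits 0 and 2 set.
status = int(DatasetStatus.VALID | DatasetStatus.ARCHIVED)
print(status)                                   # 5
print(has_flag(status, DatasetStatus.DELETED))  # False
```

Storing the state as a single integer is what allows one dataset to carry several states at once, which a single boolean such as `is_valid` could not express.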
diff --git a/docs/source/reference_cli.rst b/docs/source/reference_cli.rst
@@ -8,18 +8,6 @@ The DESC data registry also comes with a Command Line Interface (CLI) tool,
 
 See the :ref:`tutorials section <tutorials-cli>` for a demonstration of its usage.
 
-Registering a new entry in the database
----------------------------------------
-
-.. autoprogram:: cli.cli:arg_register
-   :prog: dregs register
-
-Listing datasets within the data registry
------------------------------------------
-
-The ``dregs ls`` command can be used to quickly list the datasets within the
-DESC data registry. Two basic filters can be applied; on the `owner` and/or
-`owner_type`. All entries can also be retured using the ``--all`` flag.
-
-.. autoprogram:: cli.cli:arg_ls
-   :prog: dregs ls
+.. autoprogram:: dataregistry_cli.cli:get_parser()
+   :prog: dregs
+   :groups:
diff --git a/docs/source/reference_python.rst b/docs/source/reference_python.rst
@@ -18,9 +18,11 @@ It connects the user to the database, and serves as a wrapper to both the
 .. autoclass:: dataregistry.DataRegistry
    :members:
 
-   .. automethod:: dataregistry.Registrar.register_dataset
    .. automethod:: dataregistry.Registrar.get_owner_types
-   .. automethod:: dataregistry.Registrar.register_execution
-   .. automethod:: dataregistry.Registrar.register_dataset_alias
    .. automethod:: dataregistry.Query.find_datasets
 
+.. automethod:: dataregistry.registrar.dataset.DatasetTable.register
+
+.. automethod:: dataregistry.registrar.execution.ExecutionTable.register
+
+.. automethod:: dataregistry.registrar.dataset_alias.DatasetAliasTable.register
diff --git a/.../tutorial_notebooks/getting_started.ipynb → ...getting_started_1_register_datasets.ipynb b/.../tutorial_notebooks/getting_started.ipynb → ...getting_started_1_register_datasets.ipynb
diff --git a/docs/source/tutorial_notebooks/getting_started_2_query_datasets.ipynb b/docs/source/tutorial_notebooks/getting_started_2_query_datasets.ipynb
@@ -0,0 +1,270 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "9337f001-5e7c-4141-a60c-5e99052aee3d",
+   "metadata": {},
+   "source": [
+    "<div style=\"overflow: hidden;\">\n",
+    "    <img src=\"images/DREGS_logo_v2.png\" width=\"300\" style=\"float: left; margin-right: 10px;\">\n",
+    "</div>\n",
+    "\n",
+    "# Getting started: Part 2 - Simple queries\n",
+    "\n",
+    "Here we continue our getting started tutorial, introducing queries.\n",
+    "\n",
+    "### What we cover in this tutorial\n",
+    "\n",
+    "In this tutorial we will learn how to:\n",
+    "\n",
+    "1) Perform a simple query with a single filter\n",
+    "2) Perform a simple query with multiple filters\n",
+    "\n",
+    "### Before we begin\n",
+    "\n",
+    "If you haven't done so already, check out the [getting setup](https://lsstdesc.org/dataregistry/tutorial_setup.html) page from the documentation if you want to run this tutorial interactively.\n",
+    "\n",
+    "A quick way to check everything is set up correctly is to run the first cell below, which should load the `dataregistry` package, and print the package version."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ead9b84-4933-4213-93cb-301d79ef1167",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import dataregistry\n",
+    "print(\"Working with dataregistry version:\", dataregistry.__version__)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f48aec2e-2b35-49ed-be76-8818d9e79b2c",
+   "metadata": {},
+   "source": [
+    "## 1) Querying the data registry with a single filter\n",
+    "\n",
+    "Now that we have covered the basics of dataset registration, we can have a look at how to query entries in the database. Note you can only query for datasets within the schema you have connected to.\n",
+    "\n",
+    "We learned how to connect to the DESC data registry in the last tutorial using the `DataRegistry` class. Let's connect again using the defaults:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "66a6f3ac-15cc-4706-b230-63681ba3a4b7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from dataregistry import DataRegistry\n",
+    "\n",
+    "# Establish connection to database (using defaults)\n",
+    "datareg = DataRegistry()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fd1eb855-a3fd-4dd4-8e45-d444c3d1cad6",
+   "metadata": {},
+   "source": [
+    "### Constructing the query \n",
+    "\n",
+    "Queries are constructed from one or more boolean logic \"filters\", which translate to SQL `WHERE` clauses in the code. \n",
+    "\n",
+    "For example, a filter that queries for all datasets in the registry with the name \"my_desc_dataset\" can be created as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f9901d89-b1d7-48c9-8110-ce16ecba3a7e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Create a filter that queries on the dataset name\n",
+    "f = datareg.Query.gen_filter('dataset.name', '==', 'my_desc_dataset')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "305a8df6-6967-4280-a5e8-6ea8831eff09",
+   "metadata": {},
+   "source": [
+    "Here the first argument is the column name we are searching against, the second argument is the logic operator, and the final argument is the value to compare against.\n",
+    "\n",
+    "As in SQL, column names can be given with or without their table name as a prefix, for example `name` rather than `dataset.name`. However, the unprefixed form is only valid if the column name is unique across all tables in the database, which `name` is not. We recommend always being explicit and including the table name in filters.\n",
+    "\n",
+    "The allowed boolean logic operators are: `==`, `!=`, `<`, `<=`, `>` and `>=`.\n",
+    "\n",
+    "### Performing the query\n",
+    "\n",
+    "Now we can pass this filter through to a query using the `Query` extension of the `DataRegistry` class, e.g.,"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "00c6d355-dca0-42a1-ae82-7fdbd1a46afa",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Query the database\n",
+    "results = datareg.Query.find_datasets(['dataset.dataset_id', 'dataset.name', 'dataset.relative_path'], [f])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8dc05dc6-43e9-4d10-af44-0e4a9353c0b4",
+   "metadata": {},
+   "source": [
+    "`find_datasets` takes a list of the column names we want returned (in this case `dataset.dataset_id`, `dataset.name` and `dataset.relative_path`), and a list of filter objects for the query (just `f` in our case here).\n",
+    "\n",
+    "We can look at the results like so:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0841f472-4ae6-4ca1-810d-6996c58fa14a",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "print(results)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66444995",
+   "metadata": {},
+   "source": [
+    "### Query return formats\n",
+    "\n",
+    "Note that three return formats are supported, selected via the optional `return_format` argument passed to the `find_datasets` function:\n",
+    "\n",
+    "- `return_format=\"property_dict\"` : a dictionary with keys in the format `<table_name>.<column_name>` (default)\n",
+    "- `return_format=\"dataframe\"` : a pandas DataFrame with keys in the format `<table_name>.<column_name>`\n",
+    "- `return_format=\"cursorresult\"` : a SQLAlchemy CursorResult object (see [here](https://docs.sqlalchemy.org/en/20/core/connections.html#sqlalchemy.engine.CursorResult) for details)\n",
+    "\n",
+    "Note that for the `CursorResult` object, the property names are still in the format `<table_name>.<column_name>`. Because there is a `.` in the column names, to retrieve the properties you need to do `getattr(r, \"dataset.name\")`, where `r` is the row of the `CursorResult` object. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c48f5445",
+   "metadata": {},
+   "source": [
+    "To get a list of all columns in the database, along with what table they belong to, you can use the `Query.get_all_columns()` function, i.e.,"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "54a52029-2908-4056-bc68-4a87f6c3e6df",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(datareg.Query.get_all_columns())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84b25f55-eef4-43b6-9c60-8ada92488dd6",
+   "metadata": {},
+   "source": [
+    "## 2) Querying the data registry with multiple filters\n",
+    "\n",
+    "We are not limited to using a single filter during queries.\n",
+    "\n",
+    "Now let's say we want to return all datasets in the registry with a particular `owner` that were registered after a certain date. We also want the results in a pandas DataFrame.\n",
+    "\n",
+    "To do this we construct two filter objects, i.e.:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8eec33d8-2139-473f-ab27-3a04ebd5e7f1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a filter that queries on the owner\n",
+    "f = datareg.Query.gen_filter('dataset.owner', '==', 'DESC')\n",
+    "\n",
+    "# Create a 2nd filter that queries on the entry date\n",
+    "f2 = datareg.Query.gen_filter('dataset.creation_date', '>', '01-01-2024')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5a241887-2443-4552-a832-d5701d599229",
+   "metadata": {},
+   "source": [
+    "Then we query the database as before:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d21e982a-5b86-4f75-8b54-7923dec11e04",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Query the database\n",
+    "results = datareg.Query.find_datasets(['dataset.dataset_id', 'dataset.name', 'dataset.owner',\n",
+    "                                       'dataset.relative_path', 'dataset.creation_date'],\n",
+    "                                      [f,f2],\n",
+    "                                      return_format=\"dataframe\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "65c1392e-c9d9-4b3a-9f36-9163bf8edd02",
+   "metadata": {},
+   "source": [
+    "and print the results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "908aa870-c0a4-4e59-a11c-97185e4a3db1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(results)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
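
To illustrate how filters of the form (column, operator, value) can translate into SQL `WHERE` clauses, as the tutorial notebook above describes, here is a minimal, self-contained sketch using Python's built-in `sqlite3` module. It is not the `dataregistry` implementation: the `Filter` tuple, the `build_where` helper, the toy table and the assumption that multiple filters are combined with `AND` are all illustrative.

```python
import sqlite3
from typing import NamedTuple

# Operators listed in the tutorial, mapped to their SQL spelling.
SQL_OPERATORS = {"==": "=", "!=": "!=", "<": "<", "<=": "<=", ">": ">", ">=": ">="}


class Filter(NamedTuple):
    """Hypothetical stand-in for a filter object: (column, operator, value)."""

    column: str
    op: str
    value: object


def build_where(filters):
    """Build a parameterized WHERE clause, joining filters with AND (an assumption).

    Values are passed as bound parameters. Column names are interpolated
    directly, so real code should validate them against the known columns.
    """
    clauses, params = [], []
    for f in filters:
        if f.op not in SQL_OPERATORS:
            raise ValueError(f"Unsupported operator: {f.op}")
        clauses.append(f"{f.column} {SQL_OPERATORS[f.op]} ?")
        params.append(f.value)
    return " AND ".join(clauses), params


# Toy in-memory table standing in for the registry's `dataset` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (dataset_id INTEGER, name TEXT, owner TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [
        (1, "my_desc_dataset", "DESC"),
        (2, "other_dataset", "DESC"),
        (3, "my_desc_dataset", "user"),
    ],
)

where, params = build_where([
    Filter("dataset.name", "==", "my_desc_dataset"),
    Filter("dataset.owner", "==", "DESC"),
])
rows = conn.execute(
    f"SELECT dataset.dataset_id FROM dataset WHERE {where}", params
).fetchall()
print(where)  # dataset.name = ? AND dataset.owner = ?
print(rows)   # [(1,)]
```

Using the explicit `dataset.name` form in the sketch mirrors the tutorial's advice to always prefix column names with their table name.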