This document summarizes the usage of the
cfde_deriva.dashboard_queries
Python module to generate statistics
about a CFDE C2M2 catalog. It is part of the installable cfde_deriva
package. Examples throughout assume this bit of setup:
from cfde_deriva.dashboard_queries import StatsQuery, DashboardQueryHelper
The DashboardQueryHelper
class embodies a connection to an ERMrest
catalog instance containing CFDE C2M2 catalog content. This reusable
object is passed as a constructor argument to the StatsQuery
class
discussed below. It also provides a few demonstration and utility
functions of its own.
hostname = 'app.nih-cfde.org'
catalogid = '1'
helper = DashboardQueryHelper(hostname, catalogid)
This example connects to the initial CFDE deployment. It is important to understand that in practical use, clients may wish to target other catalogs to retrieve statistics about specific prepared sets of C2M2 content.
The DashboardQueryHelper
instance binds to a catalog and has
several sub-objects relevant for lower-level access to the
C2M2 portal catalog:
catalog
: anErmrestCatalog
instancebuilder
: a cached result ofcatalog.getPathBuilder()
, providing deriva-py's datapath API for data access
The builder
can be used to retrieve other supporting portal
content. For example:
# get a list of all DCCs known by the catalog
dccs = list(helper.builder.CFDE.dcc.path.fetch())
The StatsQuery2
class encapsulates an interface to build
multi-dimensional grouped statistics over C2M2 content. Conceptually,
the caller always makes a sequence of choices:
- Provide an appropriate helper bound to the target catalog.
- Specify the base entity type which is being summarized.
- Specify zero or more dimension categories for grouped statistics.
- Initiate query execution by fetching results.
The results are a set of rows identified by a combination of
coordinates in each dimension. Each coordinate is a set of zero or
more terms representing a combination of terms linked to entities for
that dimensional concept. Likewise, the row represents a specific
combination of such coordinates across all selected dimensions. Finally,
the row has a statistical summary of the selected entity type, e.g.
a num_files
count of linked files or total_size_in_bytes
sum of
all linked files' sizes.
The set of supported base entities is built into a release of the
dashboaord_queries
module. Each supported base entity type has its
own predfined set of statstical measures which are implicitly included
in the query result.
The base entity names are table names from C2M2. Currently these tables and respective measures are supported:
file
num_files
: count of file recordstotal_size_in_bytes
: sum of (non-null)total_size_in_bytes
for counted file records
biosample
num_biosamples
: count of biosample records
subject
num_subjects
: count of subject records
collection
num_collections
: count of collection records
file_stats = StatsQuery2(helper).entity('file')
biosample_stats = StatsQuery2(helper).entity('biosample')
subject_stats = StatsQuery2(helper).entity('subject')
The preceding queries could be fetched for global statistics, or optional grouping dimensions could be added prior to fetching in order to get grouped statistics.
By adding dimensions to a query, the caller can obtain grouped statistics for each unique term in that dimension. Adding n dimensions will result in a n -ary grouping key for statistical results.
The dimension names are related to C2M2 concepts. When a dimension is added, a new output field will contain a list of linked term information (each term represented as an object with various fields):
anatomy
: "slim" term mapped from originalbiosample.anatomy
termsassay_type
compression_type
: C2M2file_format
terms related byfile.compression_format
data_type
disease
dcc
:ethnicity
file_format
: C2M2file_format
terms related byfile.file_format
gene
mime_type
ncbi_taxonomy
race
sex
species
: a subset ofncbi_taxonomy
terms representing subject speciessubstance
subject_granularity
subject_role
The dcc
dimension has properties of the C2M2 dcc
table which
deviate from other vocabulary table definitions:
dcc_abbreviation
has no equivalent in vocabulariesdcc_name
instead ofname
dcc_description
instead ofdescription
These examples all involve only a base entity (for counting) and some metadata about that entity (including the attributed project which produced the entity).
Global file statistics:
StatsQuery2(helper).entity("file").fetch()
Per-DCC file statistics:
StatsQuery2(helper).entity("file").dimension("dcc").fetch()
Per-DCC file statistics further divided by file data type:
StatsQuery2(helper).entity("file").dimension("dcc").dimension("data_type").fetch()
Global biosample counts:
StatsQuery(helper).entity("biosample").fetch()
Per-DCC biosample counts:
StatsQuery(helper).entity("biosample").dimension("dcc").fetch()
Per-DCC biosample counts further divided by biosample assay type:
StatsQuery(helper).entity("biosample").dimension("dcc").dimension("assay_type").fetch()
Per-DCC biosample counts further divided by biosample anatomy and assay type:
StatsQuery(helper).entity("biosample").dimension("dcc").dimension("anatomy").dimension("assay_type").fetch()
The StatsQuery
class is a legacy API for accessing a
backwards-compatibility summary of C2M2 content. This summary is based
on a different warehousing approach, and uses single-term coordinates
for dimensions rather than term sets. This leads to multiple-counting
of entities in many scenarios, e.g. a file related to more than one
biosample might then be related to more than one anatomy term and end
up counted under each term. This legacy API also does not include support
for more recently added vocabulary concepts in C2M2, such as substance,
gene, or clinical concepts like sex or race.