From d1615d5c08c6fafd80e912c8c251c23feafcc514 Mon Sep 17 00:00:00 2001 From: Peter Desmet Date: Tue, 2 Jul 2024 17:31:56 +0200 Subject: [PATCH] Use single-line paragraphs --- .../docs/recipes/enum-labels-and-ordering.md | 273 +++--------------- content/docs/recipes/files-inside-archives.md | 17 +- content/docs/standard/security.md | 123 +++----- 3 files changed, 82 insertions(+), 331 deletions(-) diff --git a/content/docs/recipes/enum-labels-and-ordering.md b/content/docs/recipes/enum-labels-and-ordering.md index d0de5c3f..0a7964db 100644 --- a/content/docs/recipes/enum-labels-and-ordering.md +++ b/content/docs/recipes/enum-labels-and-ordering.md @@ -13,8 +13,7 @@ sidebar: ## Overview -Many software packages for manipulating and analyzing tabular data have special -features for working with categorical variables. These include: +Many software packages for manipulating and analyzing tabular data have special features for working with categorical variables. These include: - Value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf), [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm) @@ -23,59 +22,26 @@ features for working with categorical variables. These include: - [Factors (R)](https://www.stat.berkeley.edu/~s133/factors.html) - [CategoricalVectors (Julia)](https://dataframes.juliadata.org/stable/man/categorical/) -These features can result in more efficient storage and faster runtime -performance, but more importantly, facilitate analysis by indicating that a -variable should be treated as categorical and by permitting the logical order -of the categories to differ from their lexical order. And in the case of value -labels, they permit the analyst to work with variables in numeric form (e.g., -in expressions, when fitting models) while generating output (e.g., tables, -plots) that is labeled with informative strings. - -While these features are of limited use in some disciplines, others rely -heavily on them (e.g., social sciences, epidemiology, clinical research, -etc.). Thus, before these disciplines can begin to use Frictionless in a -meaningful way, both the standards and the software tools need to support -these features. This pattern addresses necessary extensions to the -[Table Schema](https://specs.frictionlessdata.io//table-schema/). +These features can result in more efficient storage and faster runtime performance, but more importantly, facilitate analysis by indicating that a variable should be treated as categorical and by permitting the logical order of the categories to differ from their lexical order. And in the case of value labels, they permit the analyst to work with variables in numeric form (e.g., in expressions, when fitting models) while generating output (e.g., tables, plots) that is labeled with informative strings. + +While these features are of limited use in some disciplines, others rely heavily on them (e.g., social sciences, epidemiology, clinical research, etc.). Thus, before these disciplines can begin to use Frictionless in a meaningful way, both the standards and the software tools need to support these features. This pattern addresses necessary extensions to the [Table Schema](https://specs.frictionlessdata.io//table-schema/). ## Principles -Before describing the proposed extensions, here are the principles on which -they are based: - -1. Extensions should be software agnostic (i.e., no additions to the official - schema targeted toward a specific piece of software). 
While the extensions
-   are intended to support the use of features not available in all software,
-   the resulting data package should continue to work as well as possible with
-   software that does not have those features.
-2. Related to (1), extensions should only include metadata that describe the
-   data themselves—not instructions for what a specific software package should
-   do with the data. Users who want to include the latter may do so within
-   a sub-namespace such as `custom` (e.g., see Issues [#103](https://github.com/frictionlessdata/specs/issues/103)
-   and [#663](https://github.com/frictionlessdata/specs/issues/663)).
-3. Extensions must be backward compatible (i.e., not break existing tools,
-   workflows, etc. for working with Frictionless packages).

-It is worth emphasizing that the scope of the proposed extensions is strictly
-limited to the information necessary to make use of the features for working
-with categorical data provided by the software packages listed above. Previous
-discussions of this issue have occasionally included references to additional
-variable-level metadata (e.g., multiple sets of category labels such as both
-"short labels" and longer "descriptions", or links to common data elements,
-controlled vocabularies or rdfTypes). While these additional metadata are
-undoubtedly useful, we speculate that the large majority of users who would
-benefit from the extensions propopsed here would not have and/or utilize such
-information, and therefore argue that these should be considered under a
-separate proposal.
+Before describing the proposed extensions, here are the principles on which they are based:

+1. Extensions should be software agnostic (i.e., no additions to the official schema targeted toward a specific piece of software). While the extensions are intended to support the use of features not available in all software, the resulting data package should continue to work as well as possible with software that does not have those features.
+2. Related to (1), extensions should only include metadata that describe the data themselves—not instructions for what a specific software package should do with the data. Users who want to include the latter may do so within a sub-namespace such as `custom` (e.g., see Issues [#103](https://github.com/frictionlessdata/specs/issues/103) and [#663](https://github.com/frictionlessdata/specs/issues/663)).
+3. Extensions must be backward compatible (i.e., not break existing tools, workflows, etc. for working with Frictionless packages).

+It is worth emphasizing that the scope of the proposed extensions is strictly limited to the information necessary to make use of the features for working with categorical data provided by the software packages listed above. Previous discussions of this issue have occasionally included references to additional variable-level metadata (e.g., multiple sets of category labels such as both "short labels" and longer "descriptions", or links to common data elements, controlled vocabularies or rdfTypes). While these additional metadata are undoubtedly useful, we speculate that the large majority of users who would benefit from the extensions proposed here would not have and/or utilize such information, and therefore argue that these should be considered under a separate proposal.

## Implementations

-Our proposal to add a field-specific `enumOrdered` property has been raised
-[here](https://github.com/frictionlessdata/specs/issues/739) and
+Our proposal to add a field-specific `enumOrdered` property has been raised [here](https://github.com/frictionlessdata/specs/issues/739) and
[here](https://github.com/frictionlessdata/specs/issues/156).

-Discussions regarding supporting software providing features for working with
-categorical variables appear in the following GitHub issues:
+Discussions regarding supporting software providing features for working with categorical variables appear in the following GitHub issues:

- [https://github.com/frictionlessdata/specs/issues/156](https://github.com/frictionlessdata/specs/issues/156)
- [https://github.com/frictionlessdata/specs/issues/739](https://github.com/frictionlessdata/specs/issues/739)

and in the Frictionless Data forum:

- [https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/](https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/)
- [https://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/](https://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/)

-Finally, while we are unaware of any existing implementations intended for
-general use, it is likely that many users are already exploiting the fact that
-arbitrary fields may be added to the
-[table schema](https://specs.frictionlessdata.io//table-schema/)
-to support internal implementations.
+Finally, while we are unaware of any existing implementations intended for general use, it is likely that many users are already exploiting the fact that arbitrary fields may be added to the [table schema](https://specs.frictionlessdata.io//table-schema/) to support internal implementations.

## Proposed extensions

We propose two extensions to [Table Schema](https://specs.frictionlessdata.io/table-schema/):

-1. Add an optional field-specific `enumOrdered` property, which can be used
-   when contructing a categorical (or factor) to indicate that the variable is
-   ordinal.
-2. Add an optional field-specific `enumLabels` property for use when data are
-   stored using integer or other codes rather than using the category labels.
-   This contains an object mapping the codes appearing in the data (keys) to
-   what they mean (values), and can be used by software to construct
-   corresponding value labels or categoricals (when supported) or to translate
-   the values when reading the data.
+1. Add an optional field-specific `enumOrdered` property, which can be used when constructing a categorical (or factor) to indicate that the variable is ordinal.
+2. Add an optional field-specific `enumLabels` property for use when data are stored using integer or other codes rather than using the category labels. This contains an object mapping the codes appearing in the data (keys) to what they mean (values), and can be used by software to construct corresponding value labels or categoricals (when supported) or to translate the values when reading the data.

-These extensions are fully backward compatible, since they are optional and
-not providing them is valid.
+These extensions are fully backward compatible, since they are optional and not providing them is valid.

Here is an example of a categorical variable using extension (1):

```
{
@@ -132,24 +88,9 @@ Here is an example of a categorical variable using extension (1):
}
```

-This is our preferred strategy, as it provides all of the information
-necessary to support the categorical functionality of the software packages
-listed above, while still yielding a useable result for software without such
-capability. As described below, value labels or categoricals can be created
-automatically based on the ordering of the values in the `enum` array, and the
-`missingValues` can be incorporated into the value labels or categoricals if
-desired. In those cases where it is desired to have more control over how the
-value labels are constructed, this information can be stored in a separate
-file in JSON format or as part of a custom extension to the table schema.
-Since such instructions do not describe the data themselves (but only how a
-specific software package should handle them), and since they are often
-software- and/or user-specific, we argue that they should not be included in
-the official table schema.
-
-Alternatively, those who wish to store their data in encoded form (e.g., this
-is the default for data exports from [REDCap](https://projectredcap.org), a
-commonly-used platform for collecting data for clinical studies) may use
-extension (2) to do so:
+This is our preferred strategy, as it provides all of the information necessary to support the categorical functionality of the software packages listed above, while still yielding a usable result for software without such capability. As described below, value labels or categoricals can be created automatically based on the ordering of the values in the `enum` array, and the `missingValues` can be incorporated into the value labels or categoricals if desired. In those cases where it is desired to have more control over how the value labels are constructed, this information can be stored in a separate file in JSON format or as part of a custom extension to the table schema. Since such instructions do not describe the data themselves (but only how a specific software package should handle them), and since they are often software- and/or user-specific, we argue that they should not be included in the official table schema.

+Alternatively, those who wish to store their data in encoded form (e.g., this is the default for data exports from [REDCap](https://projectredcap.org), a commonly-used platform for collecting data for clinical studies) may use extension (2) to do so:

```
{
@@ -174,9 +115,7 @@ extension (2) to do so:
}
```

-Note that although the field type is `integer`, the keys in the `enumLabels`
-object must be wrapped in double quotes because this is required by the JSON
-file format.
+Note that although the field type is `integer`, the keys in the `enumLabels` object must be wrapped in double quotes because this is required by the JSON file format.

A second variant of the example above is the following:

```
{
@@ -206,170 +145,44 @@ A second variant of the example above is the following:
}
```

-This represents encoded data exported from software with support for value
-labels. The values `.a`, `.b`, etc. are known as _extended missing values_
-(Stata and SAS only) and provide 26 unique missing values for numeric fields
-(both integer and float) in addition to the system missing value ("`.`"); in
-SPSS these would be replaced with specially designated integers, typically
-negative (e.g., -97, -98 and -99).
+This represents encoded data exported from software with support for value labels. The values `.a`, `.b`, etc. are known as _extended missing values_ (Stata and SAS only) and provide 26 unique missing values for numeric fields (both integer and float) in addition to the system missing value ("`.`"); in SPSS these would be replaced with specially designated integers, typically negative (e.g., -97, -98 and -99). ## Specification -1. A field with an `enum` constraint or an `enumLabels` property MAY have an - `enumOrdered` property that MUST be a boolean. A value of `true` indicates - that the field should be treated as having an ordinal scale of measurement, - with the ordering given by the order of the field's `enum` array or by the - lexical order of the `enumLabels` object's keys, with the latter taking - precedence. Fields without an `enum` constraint or an `enumLabels` property - or for which the `enumLabels` keys do not include all values observed - in the data (excluding any values specified in the `missingValues` - property) MUST NOT have an `enumOrdered` property since in that case the - correct ordering of the data is ambiguous. The absence of an `enumOrdered` - property MUST NOT be taken to imply `enumOrdered: false`. - -2. A field MAY have an `enumLabels` property that MUST be an object. This - property SHOULD be used to indicate how the values in the data (represented - by the object's keys) are to be labeled or translated (represented by the - corresponding value). As required by the JSON format, the object's keys - must be listed as strings (i.e., wrapped in double quotes). The keys MAY - include values that do not appear in the data and MAY omit some values that - do appear in the data. For clarity and to avoid unintentional loss of - information, the object's values SHOULD be unique. +1. A field with an `enum` constraint or an `enumLabels` property MAY have an `enumOrdered` property that MUST be a boolean. A value of `true` indicates that the field should be treated as having an ordinal scale of measurement, with the ordering given by the order of the field's `enum` array or by the lexical order of the `enumLabels` object's keys, with the latter taking precedence. Fields without an `enum` constraint or an `enumLabels` property or for which the `enumLabels` keys do not include all values observed in the data (excluding any values specified in the `missingValues` property) MUST NOT have an `enumOrdered` property since in that case the correct ordering of the data is ambiguous. The absence of an `enumOrdered` property MUST NOT be taken to imply `enumOrdered: false`. +2. A field MAY have an `enumLabels` property that MUST be an object. This property SHOULD be used to indicate how the values in the data (represented by the object's keys) are to be labeled or translated (represented by the corresponding value). As required by the JSON format, the object's keys must be listed as strings (i.e., wrapped in double quotes). The keys MAY include values that do not appear in the data and MAY omit some values that do appear in the data. For clarity and to avoid unintentional loss of information, the object's values SHOULD be unique. ## Suggested implementations -Note: The use cases below address only _reading data_ from a Frictionless data -package; it is assumed that implementations will also provide the ability to -write Frictionless data packages using the schema extensions proposed above. -We suggest two types of implementations: - -1. 
Additions to the official Python Frictionless Framework to generate - software-specific scripts that may be executed by a specific software - package to read data from a Frictionless data package and create the - appropriate value labels or categoricals, as described below. These - scripts can then be included along with the data in the package itself. - -2. Software-specific extension packages that may be installed to permit users - of that software to read data from a Frictionless data package directly, - automatically creating the appropriate value labels or categoricals as - described below. - -The advantage of (1) is that it doesn't require users to install another -software package, which may in some cases be difficult or impossible. The -advantage of (2) is that it provides native support for working with -Frictionless data packages, and may be both easier and faster once the package -is installed. We are in the process of implementing both approaches for Stata; -implementations for the other software listed above are straightforward. +Note: The use cases below address only _reading data_ from a Frictionless data package; it is assumed that implementations will also provide the ability to write Frictionless data packages using the schema extensions proposed above. We suggest two types of implementations: + +1. Additions to the official Python Frictionless Framework to generate software-specific scripts that may be executed by a specific software package to read data from a Frictionless data package and create the appropriate value labels or categoricals, as described below. These scripts can then be included along with the data in the package itself. +2. Software-specific extension packages that may be installed to permit users of that software to read data from a Frictionless data package directly, automatically creating the appropriate value labels or categoricals as described below. + +The advantage of (1) is that it doesn't require users to install another software package, which may in some cases be difficult or impossible. The advantage of (2) is that it provides native support for working with Frictionless data packages, and may be both easier and faster once the package is installed. We are in the process of implementing both approaches for Stata; implementations for the other software listed above are straightforward. ### Software that supports value labels (Stata, SAS or SPSS) -1. In cases where a field has an `enum` constraint but no `enumLabels` - property, automatically generate a value label mapping the integers 1, 2, - 3, ... to the `enum` values in order, use this to encode the field (thereby - changing its type from `string` to `integer`), and attach the value label - to the field. Provide option to skip automatically dropping values - specified in the `missingValues` property and instead add them in order to - the end of the value label, encoded using extended missing values if - supported. - -2. In cases where the data are stored in encoded form (e.g., as integers) and - a corresponding `enumLabels` property is present, and assuming that the - keys in the `enumLabels` object are limited to integers and extended - missing values (if supported), use the `enumLabels` object to generate a - value label and attach it to the field. As with (1), provide option to skip - automatically dropping values specified in the `missingValues` property and - instead add them in order to the end of the value label, encoded using - extended missing values if supported. - -3. 
Although none of Stata, SAS or SPSS currently permits designating a specific
-   variable as ordered, Stata permits attaching arbitrary metadata to
-   individual variables. Thus, in cases where the `enumOrdered` property is
-   present, this information can be stored in Stata to inform the analyst and
-   to prevent loss of information when generating Frictionless data packages
-   from within Stata.
+1. In cases where a field has an `enum` constraint but no `enumLabels` property, automatically generate a value label mapping the integers 1, 2, 3, ... to the `enum` values in order, use this to encode the field (thereby changing its type from `string` to `integer`), and attach the value label to the field. Provide an option to skip automatically dropping values specified in the `missingValues` property and instead add them in order to the end of the value label, encoded using extended missing values if supported.
+2. In cases where the data are stored in encoded form (e.g., as integers) and a corresponding `enumLabels` property is present, and assuming that the keys in the `enumLabels` object are limited to integers and extended missing values (if supported), use the `enumLabels` object to generate a value label and attach it to the field. As with (1), provide an option to skip automatically dropping values specified in the `missingValues` property and instead add them in order to the end of the value label, encoded using extended missing values if supported.
+3. Although none of Stata, SAS or SPSS currently permits designating a specific variable as ordered, Stata permits attaching arbitrary metadata to individual variables. Thus, in cases where the `enumOrdered` property is present, this information can be stored in Stata to inform the analyst and to prevent loss of information when generating Frictionless data packages from within Stata.

### Software that supports categoricals or factors (Pandas, R, Julia)

-1. In cases where a field has an `enum` constraint but no `enumLabels`
-   property, automatically define a categorical or factor using the `enum`
-   values in order, and convert the variable to categorical or factor type
-   using this definition. Provide option to skip automatically dropping values
-   specified in the `missingValues` property and instead add them in order to
-   the end of the `enum` values when defining the categorical or factor.
-
-2. In cases where the data are stored in encoded form (e.g., as integers) and
-   a corresponding `enumLabels` property is present, translate the data using
-   the `enumLabels` object, define a categorical or factor using the values of
-   the `enumLabels` object in lexical order of the keys, and convert the
-   variable to categorical or factor type using this definition. Provide
-   option to skip automatically dropping values specified in the
-   `missingValues` property and instead add them to the end of the
-   `enumLabels` values when defining the categorical or factor.
-
-3. In cases where a field has an `enumOrdered` property, use that when
-   defining the categorical or factor.
+1. In cases where a field has an `enum` constraint but no `enumLabels` property, automatically define a categorical or factor using the `enum` values in order, and convert the variable to categorical or factor type using this definition. Provide an option to skip automatically dropping values specified in the `missingValues` property and instead add them in order to the end of the `enum` values when defining the categorical or factor.
+2. In cases where the data are stored in encoded form (e.g., as integers) and a corresponding `enumLabels` property is present, translate the data using the `enumLabels` object, define a categorical or factor using the values of the `enumLabels` object in lexical order of the keys, and convert the variable to categorical or factor type using this definition. Provide an option to skip automatically dropping values specified in the `missingValues` property and instead add them to the end of the `enumLabels` values when defining the categorical or factor.
+3. In cases where a field has an `enumOrdered` property, use that when defining the categorical or factor.

### All software

-Although the extensions proposed here are intended primarily to support the
-use of value labels and categoricals in software that supports them, they also
-provide additional functionality when reading data into any software that can
-handle tabular data. Specifically, the `enumLabels` property may be used to
-support the use of enums even in cases where value labels or categoricals are
-not being used. For example, it is standard practice in software for analyzing
-genetic data to code sex as 0, 1 and 2 (corresponding to "Unknown", "Male" and
-"Female") and affection status as 0, 1 and 2 (corresponding to "Unknown",
-"Unaffected" and "Affected"). In such cases, the `enumLabels` property may be
-used to confirm that the data follow the standard convention or to indicate
-that they deviate from it; it may also be used to translate those codes into
-human-readable values, if desired.
+Although the extensions proposed here are intended primarily to support the use of value labels and categoricals in software that supports them, they also provide additional functionality when reading data into any software that can handle tabular data. Specifically, the `enumLabels` property may be used to support the use of enums even in cases where value labels or categoricals are not being used. For example, it is standard practice in software for analyzing genetic data to code sex as 0, 1 and 2 (corresponding to "Unknown", "Male" and "Female") and affection status as 0, 1 and 2 (corresponding to "Unknown", "Unaffected" and "Affected"). In such cases, the `enumLabels` property may be used to confirm that the data follow the standard convention or to indicate that they deviate from it; it may also be used to translate those codes into human-readable values, if desired.
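
+As a non-normative illustration, here is a rough sketch of how a reader might apply `enumLabels` in Pandas (the field descriptor, data and variable names below are hypothetical, and the code is not part of any existing implementation):
+
+```
+# Hypothetical sketch: building a pandas Categorical from a field that uses
+# the proposed enumLabels property (here, the affection-status convention).
+import pandas as pd
+
+field = {
+    "name": "affection_status",
+    "type": "integer",
+    "enumLabels": {"0": "Unknown", "1": "Unaffected", "2": "Affected"},
+}
+
+codes = pd.Series([2, 1, 1, 0, 2])  # the data as stored (encoded form)
+
+# Translate the stored codes; JSON object keys are strings, so convert them
+# to match the integer field type.
+labels = {int(k): v for k, v in field["enumLabels"].items()}
+translated = codes.map(labels)
+
+# Define the categories in lexical order of the enumLabels keys. Note that
+# the absence of enumOrdered must not be taken to imply enumOrdered: false.
+categories = [field["enumLabels"][k] for k in sorted(field["enumLabels"])]
+result = pd.Categorical(translated, categories=categories,
+                        ordered=field.get("enumOrdered", False))
+```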

## Notes

While this pattern is designed as an extension to [Table Schema](https://specs.frictionlessdata.io/table-schema/) fields, it could also be used to document `enum` values of properties in [profiles](https://specs.frictionlessdata.io/profiles/), such as contributor roles.

-This pattern originally included a proposal to add an optional field-specific
-`missingValues` property similar to that described in the pattern
-"[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)"
-appearing in this document above. The objective was to provide a mechnanism to
-distinguish between so-called _system missing values_ (i.e., values that
-indicate only that the corresponding data are missing) and other values that
-convey meaning but are typically excluded when fitting statistical models. The
-latter may be represented by _extended missing values_ (`.a`, `.b`, `.c`,
-etc.) in Stata and SAS, or in SPSS by negative integers that are then
-designated as missing by using the `MISSING VALUES` command. For example,
-values such as "NA", "Not applicable", ".", etc. could be specified in the
-resource level `missingValues` property, while values such as "Don't know" and
-"Refused"—often used when generating tabular summaries and occasionally used
-when fitting certain statistical models—could be specified in the
-corresponding field level `missingValues` property. The former would still be
-converted to `null` before type-specific string conversion (just as they are
-now), while the latter could be used by capable software when creating value
-labels or categoricals.

-While this proposal was consistent with the principles outlined at the
-beginning (in particular, existing software would still yield a usable result
-when reading the data), we realized that it would conflict with what appears
-to be an emerging consensus regarding field-specific `missingValues`; i.e.,
-that they should _replace_ the less specific resource level `missingValues`
-for the corresponding field rather than be combined with them (see the discussion
-[here](https://github.com/frictionlessdata/specs/issues/551) as well as the
-"[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)"
-pattern above). While there is good reason for replacing rather than combining
-here (e.g., it is more explicit), it would unfortunately conflict with the
-idea of using the field-specific `missingValues` in conjunction with the
-resource level `missingValues` as just described; namely, if the
-field-specific property replaced the resource level property then the system
-missing values would no longer be converted to `null`, as desired.

-For this reason, we have dropped the proposal to add a field-specific
-`missingValues` property from this pattern, and assert that implementation of
-this pattern by software should assume that if a field-specific `missingValues`
-property is added to the
-[table schema](https://specs.frictionlessdata.io//table-schema/)
-it should, if present, replace the resource level `missingValues` property for
-the corresponding field. We do not believe that this change represents a
-substantial limitation when creating value labels or categoricals, since
-system missing values can typically be easily distinguished from other missing
-values when exported in CSV format (e.g., "." in Stata or SAS, "NA" in R, or
-"" in Pandas).
+This pattern originally included a proposal to add an optional field-specific `missingValues` property similar to that described in the pattern "[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)" appearing in this document above. The objective was to provide a mechanism to distinguish between so-called _system missing values_ (i.e., values that indicate only that the corresponding data are missing) and other values that convey meaning but are typically excluded when fitting statistical models. The latter may be represented by _extended missing values_ (`.a`, `.b`, `.c`, etc.) in Stata and SAS, or in SPSS by negative integers that are then designated as missing by using the `MISSING VALUES` command. For example, values such as "NA", "Not applicable", ".", etc. could be specified in the resource level `missingValues` property, while values such as "Don't know" and "Refused"—often used when generating tabular summaries and occasionally used when fitting certain statistical models—could be specified in the corresponding field level `missingValues` property. The former would still be converted to `null` before type-specific string conversion (just as they are now), while the latter could be used by capable software when creating value labels or categoricals.

+While this proposal was consistent with the principles outlined at the beginning (in particular, existing software would still yield a usable result when reading the data), we realized that it would conflict with what appears to be an emerging consensus regarding field-specific `missingValues`; i.e., that they should _replace_ the less specific resource level `missingValues` for the corresponding field rather than be combined with them (see the discussion [here](https://github.com/frictionlessdata/specs/issues/551) as well as the "[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)" pattern above). While there is good reason for replacing rather than combining here (e.g., it is more explicit), it would unfortunately conflict with the idea of using the field-specific `missingValues` in conjunction with the resource level `missingValues` as just described; namely, if the field-specific property replaced the resource level property then the system missing values would no longer be converted to `null`, as desired.

+For this reason, we have dropped the proposal to add a field-specific `missingValues` property from this pattern, and assert that implementation of this pattern by software should assume that if a field-specific `missingValues` property is added to the [table schema](https://specs.frictionlessdata.io//table-schema/) it should, if present, replace the resource level `missingValues` property for the corresponding field. We do not believe that this change represents a substantial limitation when creating value labels or categoricals, since system missing values can typically be easily distinguished from other missing values when exported in CSV format (e.g., "." in Stata or SAS, "NA" in R, or "" in Pandas).
diff --git a/content/docs/recipes/files-inside-archives.md b/content/docs/recipes/files-inside-archives.md
index f4813bae..25159b93 100644
--- a/content/docs/recipes/files-inside-archives.md
+++ b/content/docs/recipes/files-inside-archives.md
@@ -9,12 +9,9 @@ title: Files Inside Archives



-Some datasets need to contain a Zip file (or tar, other formats) containing a
-set of files.
+Some datasets need to contain a Zip file (or tar, other formats) containing a set of files.

-This might happen for practical reasons (datasets containing thousands of files)
-or for technical limitations (for example, currently Zenodo doesn't support subdirectories and
-datasets might need subdirectory structures to be useful).
+This might happen for practical reasons (datasets containing thousands of files) or because of technical limitations (for example, currently Zenodo doesn't support subdirectories and datasets might need subdirectory structures to be useful).

## Implementations

@@ -22,8 +19,7 @@ There are no known implementations at present.

## Specification

-The `resources` in a `data-package` can contain "recursive resources": identifying
-a new resource.
+The `resources` in a `data-package` can contain "recursive resources", i.e. entries that themselves identify a new resource.

## Example

@@ -67,9 +63,7 @@ For a `.tar.gz` it would be the same changing the `"format"` and the

## Types of files

-Support for `Zip` and `tar.gz` might be enough: hopefully everything can be
-re-packaged using these formats.
+Support for `Zip` and `tar.gz` might be enough: hopefully everything can be re-packaged using these formats.

-To keep the implementation and testing testing: only one recursive level is
-possible. A `resource` can list `resources` inside (like in the example). But
-the inner resources cannot contain resources again.
+To keep the implementation and testing simple, only one recursive level is possible. A `resource` can list `resources` inside (like in the example). But the inner resources cannot contain resources again.
diff --git a/content/docs/standard/security.md b/content/docs/standard/security.md
index f5f8048d..f712479e 100644
--- a/content/docs/standard/security.md
+++ b/content/docs/standard/security.md
@@ -20,27 +20,18 @@ The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `S

## Usage Perspective

-Data packages is a container format that allows the creator to specify payload data (Resources) either as JSON
-objects/arrays or via pointers. There are two pointer formats:
+A Data Package is a container format that allows the creator to specify payload data (Resources) either as JSON objects/arrays or via pointers. There are two pointer formats:

-- local file system references. Those follow POSIX naming conventions and have to be relative to the Package Descriptor
-  file ("datapackage.json"). Absolute paths are disallowed as they would open data exfiltration attacks. They would also
-  be rarely useful, considering you typically cannot know the file system layout of the user's computer
-- URLs as pointers to remote Resources. They are intended to load datasets from sites like statistic's offices as the
-  basis of Data Packages. Only HTTP/HTTPS URLs are allowed, library maintainers have to filter out others like file-URLs
+- local file system references. Those follow POSIX naming conventions and have to be relative to the Package Descriptor file ("datapackage.json"). Absolute paths are disallowed as they would open the door to data exfiltration attacks. They would also be rarely useful, considering you typically cannot know the file system layout of the user's computer.
+- URLs as pointers to remote Resources. They are intended to load datasets from sites like statistics offices as the basis of Data Packages. Only HTTP/HTTPS URLs are allowed; library maintainers have to filter out others, such as file URLs.

-Both formats can open security holes that can be used to attack the user's computer and/or network. It is therefore
-STRONGLY recommended to limit the kind of Resource pointers you allow on your machines if you accept Data Packages
-from third party sources.
+Both formats can open security holes that can be used to attack the user's computer and/or network. It is therefore STRONGLY recommended to limit the kinds of Resource pointers you allow on your machines if you accept Data Packages from third-party sources.

-ONLY in a trusted environment (eg. your own computer during development of Data Packages) is it recommended to allow
-all kinds of Resource pointers. In every other environment, you MUST keep the various attack scenarios in mind and
-filter out potentially dangerous Resource pointer types
+ONLY in a trusted environment (e.g. your own computer during development of Data Packages) is it recommended to allow all kinds of Resource pointers. In every other environment, you MUST keep the various attack scenarios in mind and filter out potentially dangerous Resource pointer types.

### Dangerous Descriptor/Resource pointer combinations

-How to read the table: if your "datapackage.json"-file comes from one of the sources on the left, you should treat
-Resources in the format on the top as:
+How to read the table: if your "datapackage.json" file comes from one of the sources on the left, you should treat Resources in the format on the top as:

- red: disallowed
- yellow: potentially dangerous

@@ -50,113 +41,67 @@

#### Descriptor source is a URL

-If your descriptor is loaded via URL, and the server to which the URL points is not fully trusted, you
-SHOULD NOT allow Data Packages with Resource pointers in
+If your descriptor is loaded via URL, and the server to which the URL points is not fully trusted, you SHOULD NOT allow Data Packages with Resource pointers in

-- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author
-  of a Data Package can be used in a "keyhole" attack to probe your network layout.
-- Absolute file system references. Absolute paths can be used to exfiltrate system files (eg. /etc/passwd on
-  Unix-like systems). Relative paths will be converted to URLs relative to the descriptor URL, so they will
-  not load data from the local file system and are therefore safe.
+- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author of a Data Package can be used in a "keyhole" attack to probe your network layout.
+- Absolute file system references. Absolute paths can be used to exfiltrate system files (e.g. /etc/passwd on Unix-like systems). Relative paths will be converted to URLs relative to the descriptor URL, so they will not load data from the local file system and are therefore safe.

-URL-based Resource pointers can furthermore be used for denial of service attacks on either the user's system or a
-service hosting Resource data. A relatively small Data Package could still hold thousands of Resource URLs that
-each could point to very large CSV files hosted somewhere. The Data Package processing library would load all
-those CSV files which might overwhelm the user's computer. If an attacker were able to spread such a malicious
-Data Package, this could exhaust the resources of a hosting service.
+URL-based Resource pointers can furthermore be used for denial of service attacks on either the user's system or a service hosting Resource data. A relatively small Data Package could still hold thousands of Resource URLs that each could point to very large CSV files hosted somewhere. The Data Package processing library would load all those CSV files, which might overwhelm the user's computer. If an attacker were able to spread such a malicious Data Package, this could exhaust the resources of a hosting service.

#### Descriptor source is a local relative path

-If your descriptor is loaded via a local relative path, and the source of the Data Package is not fully trusted, you
-SHOULD NOT allow Data Packages with Resource pointers in
+If your descriptor is loaded via a local relative path, and the source of the Data Package is not fully trusted, you SHOULD NOT allow Data Packages with Resource pointers in

-- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author
-  of a Data Package can be used in a "keyhole" attack to probe your network layout.
-- Absolute file system references. Absolute paths can be used to exfiltrate system files (eg. /etc/passwd on
-  Unix-like systems). Relative paths will be converted to paths relative to the Descriptor file system reference,
-  so they are considered harmless.
+- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author of a Data Package can be used in a "keyhole" attack to probe your network layout.
+- Absolute file system references. Absolute paths can be used to exfiltrate system files (e.g. /etc/passwd on Unix-like systems). Relative paths will be converted to paths relative to the Descriptor file system reference, so they are considered harmless.

-As long as the producer of the Data Package is on the same local network as the computer/server parsing it, it is
-considered safe to reference Resources via URLs, as the creator could map the network from their own workstation just
-as well as crafting malicious Data Packages. In the above table, this case is therefore coded in yellow.
+As long as the producer of the Data Package is on the same local network as the computer/server parsing it, it is considered safe to reference Resources via URLs, as the creator could map the network from their own workstation just as well as crafting malicious Data Packages. In the above table, this case is therefore coded in yellow.

-If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the
-internet, it NEVER safe to accept Data Packages containing URL-based Resource pointers.
+If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the internet, it is NEVER safe to accept Data Packages containing URL-based Resource pointers.

#### Descriptor source is a local absolute path

-While it is never safe to accept absolute file paths for Resources, it is perfectly safe to accept them for Descriptor
-files. If your descriptor is loaded via a local absolute path, and the source of the Data Package is not fully
-trusted, you SHOULD NOT allow Data Packages with Resource pointers in
+While it is never safe to accept absolute file paths for Resources, it is perfectly safe to accept them for Descriptor files. If your descriptor is loaded via a local absolute path, and the source of the Data Package is not fully trusted, you SHOULD NOT allow Data Packages with Resource pointers in

-- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author
-  of a Data Package can be used in a "keyhole" attack to probe your network layout.
-- Absolute file system references. Absolute paths can be used to exfiltrate system files (eg. /etc/passwd on
-  Unix-like systems). Relative paths will be converted to paths relative to the Descriptor file system reference,
-  so they are considered harmless.
+- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author of a Data Package can be used in a "keyhole" attack to probe your network layout.
+- Absolute file system references. Absolute paths can be used to exfiltrate system files (e.g. /etc/passwd on Unix-like systems). Relative paths will be converted to paths relative to the Descriptor file system reference, so they are considered harmless.

-As long as the producer of the Data Package is on the same local network as the computer/server parsing it, it is
-considered safe to reference Resources via URLs, as the creator could map the network from their own workstation just
-as well as crafting malicious Data Packages. In the above table, this case is therefore coded in yellow.
+As long as the producer of the Data Package is on the same local network as the computer/server parsing it, it is considered safe to reference Resources via URLs, as the creator could map the network from their own workstation just as well as crafting malicious Data Packages. In the above table, this case is therefore coded in yellow.

-If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the
-internet, it NEVER safe to accept Data Packages containing URL-based Resource pointers.
+If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the internet, it is NEVER safe to accept Data Packages containing URL-based Resource pointers.

#### Descriptor source is a JSON object

-If the Descriptor is not loaded from file but created in-memory and the source of the Data Package is not fully
-trusted, you SHOULD NOT allow Data Packages with Resource pointers in
+If the Descriptor is not loaded from file but created in-memory and the source of the Data Package is not fully trusted, you SHOULD NOT allow Data Packages with Resource pointers in

-- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author
-  of a Data Package can be used in a "keyhole" attack to probe your network layout.
-- file system references, relative or absolute. Absolute paths can be used to exfiltrate system files
-  (eg. /etc/passwd on Unix-like systems). Relative paths would be constructed relative to the parsing software's working
-  directory and could be used to guess at configuration files to exfiltrate. OTOH, in creation of a Data Package,
-  and if the relative paths are confined to a subdirectory, it is safe to use relative paths.
+- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author of a Data Package can be used in a "keyhole" attack to probe your network layout.
+- file system references, relative or absolute. Absolute paths can be used to exfiltrate system files (e.g. /etc/passwd on Unix-like systems). Relative paths would be constructed relative to the parsing software's working directory and could be used to guess at configuration files to exfiltrate. On the other hand, during creation of a Data Package, and if the relative paths are confined to a subdirectory, it is safe to use relative paths.

-As long as the producer of the Data Package is on the same local network as the computer/server parsing it, it is
-considered safe to reference Resources via URLs, as the creator could map the network from their own workstation just
-as well as crafting malicious Data Packages. In the above table, this case is therefore coded in yellow.
+As long as the producer of the Data Package is on the same local network as the computer/server parsing it, it is considered safe to reference Resources via URLs, as the creator could map the network from their own workstation just as well as crafting malicious Data Packages. In the above table, this case is therefore coded in yellow.

-If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the
-internet, it NEVER safe to accept Data Packages containing URL-based Resource pointers.
+If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the internet, it is NEVER safe to accept Data Packages containing URL-based Resource pointers.

#### Descriptor source is a self-created JSON object

-If the Descriptor is not loaded from file or created via a third-party application but by your software, it is
-generally assumed you know what you do and therefore, loading Resources from URLs or file is considered safe. You
-still SHOULD NOT use absolute paths as a matter of precaution - and implementing libraries should filter them out.
+If the Descriptor is not loaded from file or created via a third-party application but by your software, it is generally assumed that you know what you are doing and, therefore, loading Resources from URLs or file is considered safe. You still SHOULD NOT use absolute paths as a matter of precaution, and implementing libraries should filter them out.

## Implementation Perspective

Two kinds of Resource pointers can never be guaranteed to be totally safe:

-- Absolute file system references. Absolute paths can be used to exfiltrate system files (eg. /etc/passwd on
-  Unix-like systems). In your implementation, you SHOULD either raise an error if an absolute local path is encountered
-  or relativize it to the Descriptor path.
-- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author
-  of a Data Package can be used in a "keyhole" attack to probe your user's network layout. It is up to the library creator
-  to create means that allow their users to mitigate this attack.
+- Absolute file system references. Absolute paths can be used to exfiltrate system files (e.g. /etc/passwd on Unix-like systems). In your implementation, you SHOULD either raise an error if an absolute local path is encountered or relativize it to the Descriptor path.
+- URLs. As described in [issue #650](https://github.com/frictionlessdata/specs/issues/650), URLs crafted by the author of a Data Package can be used in a "keyhole" attack to probe your user's network layout. It is up to the library creator to provide means that allow their users to mitigate this attack.

-As URLs are part of the DNA of Data Packages, it is not advisable to disallow their use completely. However, you should
-allow for a security setting that stops your implementation from loading URL-based Resources. This could be done
+As URLs are part of the DNA of Data Packages, it is not advisable to disallow their use completely. However, you should allow for a security setting that stops your implementation from loading URL-based Resources. This could be done:

-- via a setting switch (`insecure`/`default`) that allows the user of your library implementation to allow or
-  disallow absolute file paths and URL-based Resource pointers
-- via a pluggable security filter that is applied as an interceptor _before_ loading any pointer-based Resources. If
-  you decide to use such a scheme, you SHOULD provide default implementations for a filter disallowing URL-based
-  Resource and an insecure filter that allows loading of all Resources.
+- via a setting switch (`insecure`/`default`) that allows the user of your library implementation to allow or disallow absolute file paths and URL-based Resource pointers
+- via a pluggable security filter that is applied as an interceptor _before_ loading any pointer-based Resources. If you decide to use such a scheme, you SHOULD provide default implementations for a filter disallowing URL-based Resources and an insecure filter that allows loading of all Resources.
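
+A rough, non-normative sketch of the pluggable-filter idea follows (all names are illustrative and not part of any existing library):
+
+```
+# Illustrative sketch only; default_filter, insecure_filter and
+# load_resources are hypothetical names, not an existing API.
+from urllib.parse import urlparse
+
+def default_filter(pointer: str) -> bool:
+    """Allow only relative, scheme-less paths; reject URLs and absolute paths."""
+    if urlparse(pointer).scheme:  # http, https, file, ftp, ...
+        return False
+    return not pointer.startswith("/")
+
+def insecure_filter(pointer: str) -> bool:
+    """Allow every pointer; suitable only for trusted environments."""
+    return True
+
+def load_resources(descriptor: dict, allow=default_filter):
+    """Apply the filter as an interceptor before any Resource is loaded."""
+    for resource in descriptor.get("resources", []):
+        pointer = resource.get("path")
+        if isinstance(pointer, str) and not allow(pointer):
+            raise ValueError(f"Blocked resource pointer: {pointer}")
+        # ... actual loading would happen here ...
+```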

### Security Filters

-If disallowing all URL-based Resources is too heavy-handed and allowing all is too insecure, finer-grained filters
-should be implemented. Those finer security filters can be implemented as either blacklist or whitelist filters.
-Blacklist filters in principle allow all URLs and restrict some, whereas whitelist filters deny all as a default
-and have a limited list of allowed URLs.
+If disallowing all URL-based Resources is too heavy-handed and allowing all is too insecure, finer-grained filters should be implemented. Those finer security filters can be implemented as either blacklist or whitelist filters. Blacklist filters in principle allow all URLs and restrict some, whereas whitelist filters deny all as a default and have a limited list of allowed URLs.

-Blacklist filters in their most basic implementation would have to disallow all non-routed IP-addresses like the
-192.168.x.x range or the 10.100.x.x range. This would blunt mapping attacks against the internal network of your users
-but needs to be well thought out as even one omission could endanger network security
+Blacklist filters in their most basic implementation would have to disallow all non-routed IP addresses like the 192.168.x.x range or the 10.100.x.x range. This would blunt mapping attacks against the internal network of your users, but needs to be well thought out, as even one omission could endanger network security.

-Whitelist filters are much more secure as they allow the loading of Resources from a named list of domains only, but
-might be too restrictive for some uses.
+Whitelist filters are much more secure as they allow the loading of Resources from a named list of domains only, but might be too restrictive for some uses.
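
+As a rough, non-normative illustration of the two approaches (hypothetical helper names; a real filter would also have to deal with DNS rebinding, redirects and IPv6):
+
+```
+# Illustrative sketch only; not part of any existing library.
+import ipaddress
+import socket
+from urllib.parse import urlparse
+
+ALLOWED_DOMAINS = {"data.example.org"}  # hypothetical whitelist
+
+def blacklist_allows(url: str) -> bool:
+    """Reject URLs whose host resolves to a private or non-routable address."""
+    host = urlparse(url).hostname or ""
+    try:
+        addr = ipaddress.ip_address(socket.gethostbyname(host))
+    except (OSError, ValueError):
+        return False
+    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
+
+def whitelist_allows(url: str) -> bool:
+    """Allow only hosts on an explicit list of trusted domains."""
+    return urlparse(url).hostname in ALLOWED_DOMAINS
+```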