Simplify `base_document` column usage with auxiliary instructions in pipeline config #228

bbrowning · 2024-07-29T18:15:42Z

Currently, we expect users who create auxiliary instructions to also create a `base_document` column that contains the original document, and to make sure that column is included as a `dataset_type`. An example from our full pipeline config:

blocks:
  - name: duplicate_document_col
    type: DuplicateColumnsBlock
    config:
      columns_map:
        document: base_document
  - name: gen_spellcheck
    type: LLMBlock
    config:
      config_path: ../../configs/knowledge/spellcheck.yaml
      output_cols:
        - spellcheck
      gen_kwargs:
        max_tokens: 2048
  - name: flatten_auxiliary_columns
    type: FlattenColumnsBlock
    config:
      var_cols:
        - spellcheck
        - base_document
      value_name: corrected_document
      var_name: dataset_type
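
To make the data flow of the config above concrete, here is a rough Python sketch of what the duplicate and flatten steps appear to do to a single row. The functions `duplicate_columns` and `flatten_columns` are hypothetical stand-ins written for illustration, not the project's block implementations, and the sketch assumes the flatten block behaves like an unpivot over `var_cols`.

```python
# Hypothetical stand-ins for the blocks in the config above; not the project's code.

def duplicate_columns(rows, columns_map):
    """Copy each source column into a new column (source name -> new name)."""
    return [
        {**row, **{dst: row[src] for src, dst in columns_map.items()}}
        for row in rows
    ]


def flatten_columns(rows, var_cols, value_name, var_name):
    """Unpivot var_cols: emit one output row per (input row, column) pair."""
    flattened = []
    for row in rows:
        # Columns that are not being flattened are carried over unchanged.
        kept = {k: v for k, v in row.items() if k not in var_cols}
        for col in var_cols:
            flattened.append({**kept, var_name: col, value_name: row[col]})
    return flattened


# One sample row, as it might look after the gen_spellcheck step (made-up data).
rows = [{"document": "Teh cat sat.", "spellcheck": "The cat sat."}]
rows = duplicate_columns(rows, {"document": "base_document"})
flat = flatten_columns(
    rows,
    var_cols=["spellcheck", "base_document"],
    value_name="corrected_document",
    var_name="dataset_type",
)
for r in flat:
    print(r["dataset_type"], "->", r["corrected_document"])
```

Under these assumptions, each input row yields two output rows: one whose `dataset_type` is `spellcheck` and one whose `dataset_type` is `base_document`, which is why the config author currently has to list both columns by hand.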

Is there a way to simplify this for pipeline config authors, so that we handle the `base_document` dataset automatically and the user never needs to reference that column in their config? That specific `dataset_type` string has a special meaning in the code, but how would a user know to include it without reading the code?
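
One possible shape for this, sketched in Python purely as an illustration: a flatten step that always adds the original document under the special `dataset_type` value, so the config only has to list the auxiliary columns. The function `flatten_auxiliary`, its `document_col` parameter, and the `BASE_DOCUMENT` constant are hypothetical names, not existing code in the project.

```python
# A minimal sketch of the proposed behavior; all names here are hypothetical.

BASE_DOCUMENT = "base_document"  # assumed to be the special dataset_type string


def flatten_auxiliary(rows, aux_cols, value_name, var_name, document_col="document"):
    """Unpivot aux_cols and implicitly add the original document as a variant."""
    flattened = []
    for row in rows:
        # Columns that are not being flattened are carried over unchanged.
        kept = {k: v for k, v in row.items() if k not in aux_cols}
        variants = [(col, row[col]) for col in aux_cols]
        # The user never names base_document; it is added here automatically.
        variants.append((BASE_DOCUMENT, row[document_col]))
        for dataset_type, text in variants:
            flattened.append({**kept, var_name: dataset_type, value_name: text})
    return flattened


# Made-up sample row; the caller lists only the auxiliary column.
rows = [{"document": "Teh cat sat.", "spellcheck": "The cat sat."}]
flat = flatten_auxiliary(
    rows,
    aux_cols=["spellcheck"],
    value_name="corrected_document",
    var_name="dataset_type",
)
```

With something along these lines, the duplicate-column step and the explicit `base_document` entry under `var_cols` would no longer need to appear in the pipeline config. Whether this belongs in the flatten block itself or in a dedicated block is a design choice left open here.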

This issue is created to track a comment in another PR at #204 (comment) so we don't lose sight of improving this.


bbrowning mentioned this issue Jul 29, 2024

Add support for auxiliary dataset generation #204

Merged

nathan-weinberg added the enhancement New feature or request label Aug 20, 2024
