Simplify base_document column usage with auxiliary instructions in pipeline config #228

Open
bbrowning opened this issue Jul 29, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@bbrowning
Contributor

Currently, we expect users who create auxiliary instructions to add a base_document column containing the original document, and to make sure that column name ends up as a dataset_type value. An example from our full pipeline config:

blocks:
  - name: duplicate_document_col
    type: DuplicateColumnsBlock
    config:
      columns_map:
        document: base_document
  - name: gen_spellcheck
    type: LLMBlock
    config:
      config_path: ../../configs/knowledge/spellcheck.yaml
      output_cols:
        - spellcheck
      gen_kwargs:
        max_tokens: 2048
  - name: flatten_auxiliary_columns
    type: FlattenColumnsBlock
    config:
      var_cols:
        - spellcheck
        - base_document
      value_name: corrected_document
      var_name: dataset_type
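
To make the effect of that config concrete, here is a rough sketch of what the two non-LLM blocks appear to do to a single row. The semantics are an assumption inferred only from the block names and config keys shown above (a column copy, then a wide-to-long reshape); the helper functions `duplicate_columns` and `flatten_columns` and the sample row are hypothetical stand-ins, not the project's code.

```python
# Assumed semantics, inferred from the block names and config keys only.
# These helpers are hypothetical stand-ins for DuplicateColumnsBlock and
# FlattenColumnsBlock, not the project's implementation.

def duplicate_columns(rows, columns_map):
    """Copy each source column into a new column with the mapped name."""
    return [
        {**row, **{dst: row[src] for src, dst in columns_map.items()}}
        for row in rows
    ]


def flatten_columns(rows, var_cols, value_name, var_name):
    """Turn each column listed in var_cols into its own output row."""
    flattened = []
    for row in rows:
        # Columns not being flattened are carried over unchanged.
        kept = {k: v for k, v in row.items() if k not in var_cols}
        for col in var_cols:
            flattened.append({**kept, var_name: col, value_name: row[col]})
    return flattened


# Made-up sample row; "spellcheck" stands in for the LLMBlock output.
rows = [{"document": "Teh cat sat.", "spellcheck": "The cat sat."}]
rows = duplicate_columns(rows, {"document": "base_document"})
rows = flatten_columns(
    rows,
    var_cols=["spellcheck", "base_document"],
    value_name="corrected_document",
    var_name="dataset_type",
)

for row in rows:
    print(row["dataset_type"], "->", row["corrected_document"])
```

Under those assumptions, the only reason `base_document` shows up as a dataset_type is that the author listed it by hand in both `columns_map` and `var_cols`.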

Is there a way to simplify this for pipeline config authors, so that we handle the base_document dataset automatically and the user never needs to reference that column in their config? The base_document dataset_type string has a special meaning in the code, but how would a user know to include it without reading the code?
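
One possible direction, sketched under stated assumptions: expand the user's block list when the pipeline config is loaded, so the base_document plumbing is added implicitly. The helper `add_base_document` is hypothetical and not part of the project. It assumes the parsed config is a list of dicts shaped like the YAML above, that the source column is called `document`, and that the column already exists at the start of the pipeline, so the duplicate step can go first.

```python
import copy

# The dataset_type string that, per this issue, has special meaning in the code.
BASE_COL = "base_document"


def add_base_document(blocks, source_col="document"):
    """Return a copy of `blocks` with base_document handling added.

    Hypothetical helper: `blocks` is assumed to be the parsed `blocks:` list
    from a pipeline config, with each entry a dict with name/type/config keys.
    """
    blocks = copy.deepcopy(blocks)  # never mutate the user's config

    flatten_blocks = [b for b in blocks if b["type"] == "FlattenColumnsBlock"]
    if not flatten_blocks:
        return blocks  # no auxiliary flattening, nothing to add

    # Make sure every flatten step also emits the original document.
    for block in flatten_blocks:
        var_cols = block["config"].setdefault("var_cols", [])
        if BASE_COL not in var_cols:
            var_cols.append(BASE_COL)

    # Add the duplicate step only if the author did not already write one.
    already_duplicated = any(
        b["type"] == "DuplicateColumnsBlock"
        and BASE_COL in b["config"].get("columns_map", {}).values()
        for b in blocks
    )
    if not already_duplicated:
        blocks.insert(0, {
            "name": "duplicate_document_col",
            "type": "DuplicateColumnsBlock",
            "config": {"columns_map": {source_col: BASE_COL}},
        })
    return blocks


# What the author would write: no mention of base_document anywhere.
user_blocks = [
    {"name": "gen_spellcheck", "type": "LLMBlock",
     "config": {"output_cols": ["spellcheck"]}},
    {"name": "flatten_auxiliary_columns", "type": "FlattenColumnsBlock",
     "config": {"var_cols": ["spellcheck"],
                "value_name": "corrected_document",
                "var_name": "dataset_type"}},
]

full_blocks = add_base_document(user_blocks)
print([b["name"] for b in full_blocks])
print(full_blocks[-1]["config"]["var_cols"])
```

The expansion is idempotent, so configs that already spell out base_document would keep working unchanged. Whether this belongs at config load time or inside the flatten block itself is an open design question.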

This issue tracks a comment from another PR, #204 (comment), so we don't lose sight of improving this.
