Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip an operator if this is a duplicate op instead of raise error #102

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

chjuncn
Copy link
Collaborator

@chjuncn chjuncn commented Feb 1, 2025

Rationals

I checked how to generate basic nodes from Dataset and I think this is the best change to achieve what we want for the interface. But we might have better solutions if we change how we generate nodes from Dataset. If this change violates some principles of the design in PZ, I'm happy to discuss other solutions.

Final Effects

  1. Dataset() init only has one responsibility: wrap a datasource to a Dataset. I think this is a better interface.
  2. No extra convert() will be added to the plan.
  3. When users add the same op multiple times dataset.convert(File).convert(File), the system will just dedup the same op instead of raise error.

Issue

Currently Dataset(src, schema) initiation has 2 responsibilities:

  1. read source
  2. convert source to schema.

When we want to save schema param for users and use default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like:

  1. Read source to schema that DataSource provides. This schema is derived by the system, so the users don't know (don't need to know).
  2. Convert Source schema to DefaultSchema.

The system will make one more convert call to convert SourceSchema to DefaultSchema every time, which is definitely wrong.

Solution

  1. We use schema from Datasource if exists, which is reasonable.
  2. If we do 1, then we'll get a dataset node that no actual op as its input_schema ==output_schema, so I updated a line in optimizer to just skip the node if it doesn't do anything instead raiseerror.

Real Examples

Before

Generated plan:
0. MarshalAndScanDataOp -> PDFFile

  1. PDFFile -> LLMConvertBonded -> DefaultSchema (contents, filename, text_conte) -> (value) Model: Model.GPT_4o Prompt Strategy: PromptStrategy.COT_QA

  2. DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper (value) -> (contents, filename, paper_auth) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.0] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

  3. ScientificPaper -> LLMFilter -> ScientificPaper (contents, filename, paper_auth) -> (contents, filename, paper_auth) Model: Model.GPT_4o Filter: The paper mentions phosphorylation of Exo1

  4. ScientificPaper -> MixtureOfAgentsConvert -> Reference (contents, filename, paper_auth) -> (reference_first_author, refere) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.8] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

After the change

Generated plan:
0. MarshalAndScanDataOp -> PDFFile

  1. PDFFile -> LLMConvertBonded -> ScientificPaper (contents, filename, text_conte) -> (contents, filename, paper_auth) Model: Model.GPT_4o Prompt Strategy: PromptStrategy.COT_QA

  2. ScientificPaper -> LLMFilter -> ScientificPaper (contents, filename, paper_auth) -> (contents, filename, paper_auth) Model: Model.GPT_4o Filter: The paper mentions phosphorylation of Exo1

  3. ScientificPaper -> MixtureOfAgentsConvert -> Reference (contents, filename, paper_auth) -> (reference_first_author, refere) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.8] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

vitaglianog and others added 2 commits January 30, 2025 18:41
* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst
…error

#Final Effects
1. Dataset() init only has one responsibility: wrap a datasource to a Dataset. I think this is a better interface.
2. No extra convert() will be added to the plan.
3. When users add the same op multiple times dataset.convert(File).convert(File), the system will just dedup the same op instead of raise error.

#Issue
Currently Dataset(src, schema) initiation has 2 responsibilities:
1. read source
2. convert source to schema.

When we use default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like:
1. Read source to schema that DataSource provides. This schema is derived by system, so the users don't know (don't need to know).
2. Convert Source schema to DefaultSchema.

So everytime, the system will make one more convert call to convert SourceSchema to DefaultSchema, which is definitely wrong.

#Solution
1. We use schema from Datasource if exists, which is reasonable.
2. If we do 1, then we'll get a dataset node that no actual op as its input_schema ==output_schema, so I updated a line in optimizer to just skip the node if it doesn't do anything instead raiseerror.

#Real Examples
##Before
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> DefaultSchema
    (contents, filename, text_conte) -> (value)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper
    (value) -> (contents, filename, paper_auth)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.0]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

 3. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 4. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

##After
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> ScientificPaper
    (contents, filename, text_conte) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 3. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation
@chjuncn chjuncn requested a review from mdr223 February 1, 2025 15:05
chjuncn added a commit that referenced this pull request Feb 3, 2025
This change is based on #101 and #102, please review them first then this change.

1. This is to refactor all demos to use .sem_add_columns or .add_columns, and remove .convert().

2. Remove Schema from demos, except demos using ValidationDataSource and dataset.retrieve() that need schema now. We can refactor these cases later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants