Skip an operator if this is a duplicate op instead of raise error #102

chjuncn · 2025-02-01T15:05:25Z

Rationals

I checked how to generate basic nodes from Dataset and I think this is the best change to achieve what we want for the interface. But we might have better solutions if we change how we generate nodes from Dataset. If this change violates some principles of the design in PZ, I'm happy to discuss other solutions.

Final Effects

Dataset() init only has one responsibility: wrap a datasource to a Dataset. I think this is a better interface.
No extra convert() will be added to the plan.
When users add the same op multiple times dataset.convert(File).convert(File), the system will just dedup the same op instead of raise error.

Issue

Currently Dataset(src, schema) initiation has 2 responsibilities:

read source
convert source to schema.

When we want to save schema param for users and use default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like:

Read source to schema that DataSource provides. This schema is derived by the system, so the users don't know (don't need to know).
Convert Source schema to DefaultSchema.

The system will make one more convert call to convert SourceSchema to DefaultSchema every time, which is definitely wrong.

Solution

We use schema from Datasource if exists, which is reasonable.
If we do 1, then we'll get a dataset node that no actual op as its input_schema ==output_schema, so I updated a line in optimizer to just skip the node if it doesn't do anything instead raiseerror.

Real Examples

Before

Generated plan:
0. MarshalAndScanDataOp -> PDFFile

PDFFile -> LLMConvertBonded -> DefaultSchema (contents, filename, text_conte) -> (value) Model: Model.GPT_4o Prompt Strategy: PromptStrategy.COT_QA
DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper (value) -> (contents, filename, paper_auth) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.0] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation
ScientificPaper -> LLMFilter -> ScientificPaper (contents, filename, paper_auth) -> (contents, filename, paper_auth) Model: Model.GPT_4o Filter: The paper mentions phosphorylation of Exo1
ScientificPaper -> MixtureOfAgentsConvert -> Reference (contents, filename, paper_auth) -> (reference_first_author, refere) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.8] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

After the change

Generated plan:
0. MarshalAndScanDataOp -> PDFFile

PDFFile -> LLMConvertBonded -> ScientificPaper (contents, filename, text_conte) -> (contents, filename, paper_auth) Model: Model.GPT_4o Prompt Strategy: PromptStrategy.COT_QA
ScientificPaper -> LLMFilter -> ScientificPaper (contents, filename, paper_auth) -> (contents, filename, paper_auth) Model: Model.GPT_4o Filter: The paper mentions phosphorylation of Exo1
ScientificPaper -> MixtureOfAgentsConvert -> Reference (contents, filename, paper_auth) -> (reference_first_author, refere) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.8] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

* Create chat.rst * Update pyproject.toml Hotfix for chat * Update conf.py Hotfix for chat.rst

…error #Final Effects 1. Dataset() init only has one responsibility: wrap a datasource to a Dataset. I think this is a better interface. 2. No extra convert() will be added to the plan. 3. When users add the same op multiple times dataset.convert(File).convert(File), the system will just dedup the same op instead of raise error. #Issue Currently Dataset(src, schema) initiation has 2 responsibilities: 1. read source 2. convert source to schema. When we use default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like: 1. Read source to schema that DataSource provides. This schema is derived by system, so the users don't know (don't need to know). 2. Convert Source schema to DefaultSchema. So everytime, the system will make one more convert call to convert SourceSchema to DefaultSchema, which is definitely wrong. #Solution 1. We use schema from Datasource if exists, which is reasonable. 2. If we do 1, then we'll get a dataset node that no actual op as its input_schema ==output_schema, so I updated a line in optimizer to just skip the node if it doesn't do anything instead raiseerror. #Real Examples ##Before Generated plan: 0. MarshalAndScanDataOp -> PDFFile 1. PDFFile -> LLMConvertBonded -> DefaultSchema (contents, filename, text_conte) -> (value) Model: Model.GPT_4o Prompt Strategy: PromptStrategy.COT_QA 2. DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper (value) -> (contents, filename, paper_auth) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.0] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation 3. ScientificPaper -> LLMFilter -> ScientificPaper (contents, filename, paper_auth) -> (contents, filename, paper_auth) Model: Model.GPT_4o Filter: The paper mentions phosphorylation of Exo1 4. ScientificPaper -> MixtureOfAgentsConvert -> Reference (contents, filename, paper_auth) -> (reference_first_author, refere) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.8] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation ##After Generated plan: 0. MarshalAndScanDataOp -> PDFFile 1. PDFFile -> LLMConvertBonded -> ScientificPaper (contents, filename, text_conte) -> (contents, filename, paper_auth) Model: Model.GPT_4o Prompt Strategy: PromptStrategy.COT_QA 2. ScientificPaper -> LLMFilter -> ScientificPaper (contents, filename, paper_auth) -> (contents, filename, paper_auth) Model: Model.GPT_4o Filter: The paper mentions phosphorylation of Exo1 3. ScientificPaper -> MixtureOfAgentsConvert -> Reference (contents, filename, paper_auth) -> (reference_first_author, refere) Prompt Strategy: None Proposer Models: [GPT_4o] Temperatures: [0.8] Aggregator Model: Model.GPT_4o Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

This change is based on #101 and #102, please review them first then this change. 1. This is to refactor all demos to use .sem_add_columns or .add_columns, and remove .convert(). 2. Remove Schema from demos, except demos using ValidationDataSource and dataset.retrieve() that need schema now. We can refactor these cases later.

vitaglianog and others added 2 commits January 30, 2025 18:41

Create chat.rst (#96)

ef28691

* Create chat.rst * Update pyproject.toml Hotfix for chat * Update conf.py Hotfix for chat.rst

chjuncn requested a review from mdr223 February 1, 2025 15:05

chjuncn mentioned this pull request Feb 3, 2025

Refactor demos to use .sem_add_columns or .add_columns instead of convert(), remove Schema from demos when possible. #104

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skip an operator if this is a duplicate op instead of raise error #102

Skip an operator if this is a duplicate op instead of raise error #102

chjuncn commented Feb 1, 2025

Skip an operator if this is a duplicate op instead of raise error #102

Are you sure you want to change the base?

Skip an operator if this is a duplicate op instead of raise error #102

Conversation

chjuncn commented Feb 1, 2025

Rationals

Final Effects

Issue

Solution

Real Examples

Before

After the change