[Text] Regular expression for data cleansing #1131

Open
ShihChun-H opened this issue Oct 8, 2024 · 12 comments · May be fixed by instill-ai/pipeline-backend#760
Labels: component, feature, hacktoberfest, hacktoberfest2024, help-wanted, improvement, instill core

Comments

@ShihChun-H
Member

ShihChun-H commented Oct 8, 2024

Describe Your Proposed Tutorial

Issue Description

Current State

  • Users cannot clean their data in VDP with a simple flow.

Why We Want to Change?

  • We want to exclude some data to make the chunks cleaner, which can improve the efficiency of RAG.

Proposed Change

Pseudo Recipe

# VDP Version
version: v1beta

component:
  text-0:
    type: text
    input:
      # Array of texts to be cleaned.
      texts:
      setting:
        # Option 1: regular expressions
        clean-method: Regex
        # A text that matches a pattern is removed from the array of texts.
        exclude-patterns:
        # A text that matches a pattern is kept in the array of texts.
        include-patterns:

        # Option 2: substrings (an alternative to option 1, shown commented
        # out so that clean-method is not defined twice)
        # clean-method: Substring
        # A text that contains one of the substrings is removed from the array of texts.
        # exclude-substrings:
        # A text that contains one of the substrings is kept in the array of texts.
        # include-substrings:
        # Whether substring matching is case-sensitive. The default is false.
        # When true, cat matches only 'cat', not 'Cat' or 'CAT'.
        # When false, cat matches 'cat', 'Cat', 'CAT', and any other mix of upper and lower case.
        # case-sensitive:

    condition:
    task: TASK_CLEAN_DATA
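
To make the proposed semantics concrete, here is a minimal Python sketch of the two cleaning methods. It is not the component's implementation. The function names `clean_by_regex` and `clean_by_substring` are hypothetical, and the rule for combining the lists is an assumption the recipe does not spell out: a text is kept only if it satisfies the include list (when one is given) and hits nothing on the exclude list.

```python
import re
from typing import Iterable, List, Sequence


def clean_by_regex(
    texts: Iterable[str],
    exclude_patterns: Sequence[str] = (),
    include_patterns: Sequence[str] = (),
) -> List[str]:
    """Filter texts with regular expressions (hypothetical helper)."""
    exclude = [re.compile(p) for p in exclude_patterns]
    include = [re.compile(p) for p in include_patterns]
    kept = []
    for text in texts:
        # Assumption: when include patterns are given, a text must match one.
        if include and not any(rx.search(text) for rx in include):
            continue
        # A text that matches any exclude pattern is dropped.
        if any(rx.search(text) for rx in exclude):
            continue
        kept.append(text)
    return kept


def clean_by_substring(
    texts: Iterable[str],
    exclude_substrings: Sequence[str] = (),
    include_substrings: Sequence[str] = (),
    case_sensitive: bool = False,  # default taken from the recipe comment
) -> List[str]:
    """Filter texts by substring containment (hypothetical helper)."""

    def norm(s: str) -> str:
        # Case-insensitive matching compares case-folded copies.
        return s if case_sensitive else s.casefold()

    exclude = [norm(s) for s in exclude_substrings]
    include = [norm(s) for s in include_substrings]
    kept = []
    for text in texts:
        haystack = norm(text)
        if include and not any(s in haystack for s in include):
            continue
        if any(s in haystack for s in exclude):
            continue
        kept.append(text)  # keep the original, un-normalized text
    return kept
```

With these helpers, `clean_by_substring(["The Cat sat", "dog"], exclude_substrings=["cat"])` drops the first text under the default case-insensitive matching, while passing `case_sensitive=True` keeps both.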

Rules for the Component Hackathon

  • Each issue will only be assigned to one person/team at a time.
  • You can only work on one issue at a time.
  • To express interest in an issue, please comment on it and tag @kuroxx, allowing the Instill AI team to assign it to you.
  • Ensure you address all feedback and suggestions provided by the Instill AI team.
  • If no commits are made within five days, the issue may be reassigned to another contributor.
  • Join our Discord to engage in discussions and seek assistance in #hackathon channel. For technical queries, you can tag @chuang8511.

Component Contribution Guideline | Documentation | Official Go Tutorial

@ShihChun-H added the documentation, tutorial, and need-triage labels Oct 8, 2024

linear bot commented Oct 8, 2024

@ShihChun-H added the help-wanted, improvement, feature, instill core, component, and hacktoberfest2024 labels and removed the documentation, tutorial, and need-triage labels Oct 8, 2024
@NailaRais

@ShihChun-H I am passionate about making a positive contribution.

@ShihChun-H
Member Author

Hi @NailaRais, Fantastic! I've assigned the issue to you! Please make sure to read and follow the rules stated above 🙌🏻

@NailaRais

> Hi @NailaRais, Fantastic! I've assigned the issue to you! Please make sure to read and follow the rules stated above 🙌🏻

Thank You

@NailaRais

NailaRais commented Oct 13, 2024

> Hi @NailaRais, Fantastic! I've assigned the issue to you! Please make sure to read and follow the rules stated above 🙌🏻

Hello @ShihChun-H, some other fixes seem to be required as well: I am running into problems with the .env file and warnings from the Docker build.

Issues example

LegacyKeyValueFormat

Old Format (legacy): ENV key value

New Format: ENV key=value

Warning Example

3 warnings found (use --debug to expand):
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 32)
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 33)
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 37)
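
The warning asks for the `ENV key=value` form. As a rough sketch, the rewrite could be automated along these lines. `fix_legacy_env` is a hypothetical helper, not part of Docker or of this repository, and it assumes one ENV instruction per line with no line continuations.

```python
import re

# Matches the legacy form "ENV key value": a bare key followed by whitespace.
# "ENV key=value" does not match, because the key must be followed by a space.
LEGACY_ENV = re.compile(
    r"^(\s*ENV\s+)([A-Za-z_][A-Za-z0-9_]*)\s+(.+?)\s*$",
    re.IGNORECASE,
)


def fix_legacy_env(line: str) -> str:
    """Rewrite one Dockerfile line from 'ENV key value' to 'ENV key=value'."""
    match = LEGACY_ENV.match(line)
    if match is None:
        return line  # not a legacy ENV line, leave it untouched
    prefix, key, value = match.groups()
    already_quoted = value.startswith('"') and value.endswith('"')
    if any(ch.isspace() for ch in value) and not already_quoted:
        # In the legacy form the whole remainder of the line is the value,
        # so a value with spaces has to be quoted in the key=value form.
        value = '"' + value.replace('"', '\\"') + '"'
    return f"{prefix}{key}={value}"


def fix_dockerfile(text: str) -> str:
    """Apply fix_legacy_env to every line of a Dockerfile's text."""
    return "\n".join(fix_legacy_env(line) for line in text.split("\n"))
```

Reviewing the resulting diff by hand is still advisable, since values that rely on unusual quoting or escapes are outside what this sketch handles.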

naila@Naila MINGW64 ~/pipeline-backend/instill-core/pipeline-backend (main)

$ make dev
.env:2: *** missing separator.  Stop.
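
The `.env:2: *** missing separator.` message has the shape of a GNU make error, which suggests make is parsing `.env` and cannot read its second line as an assignment or rule. The following sketch lists lines worth inspecting. `suspicious_env_lines` is a hypothetical helper, and its two checks (lines that are not simple assignments, and Windows CRLF line endings) are guesses at common causes, not a diagnosis of the error above.

```python
import re
from typing import List, Tuple

# A simple make-compatible assignment: optional "export", a name, then
# one of the operators =, :=, ?=, +=.
ASSIGNMENT = re.compile(r"^\s*(export\s+)?[A-Za-z_][A-Za-z0-9_]*\s*[:?+]?=")


def suspicious_env_lines(content: str) -> List[Tuple[int, str]]:
    """Return (line number, reason) pairs for lines worth a closer look."""
    findings = []
    for number, line in enumerate(content.split("\n"), start=1):
        if line.endswith("\r"):
            # Assumption: CRLF endings can be a problem on some setups.
            findings.append((number, "carriage return (CRLF line ending)"))
            line = line[:-1]
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        if not ASSIGNMENT.match(line):
            findings.append((number, "not a KEY=VALUE assignment"))
    return findings
```

To check a real file, read it with `open(".env", newline="").read()` so that Python does not translate the line endings before the check runs.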

Should I fix those, too? Otherwise, I can't test and check my code. But I need time to do so.

Thank You!

@kuroxx
Collaborator

kuroxx commented Oct 14, 2024

Hey @NailaRais, could you commit your changes to a branch and submit a PR so that our team member can understand the changes you have made and give you better feedback to support you?

You can follow any tutorial about git branches like this one: https://www.freecodecamp.org/news/git-checkout-remote-branch-tutorial/

@NailaRais

> Hey @NailaRais, could you commit your changes to a branch and submit a PR so that our team member can understand the changes you have made and give you better feedback to support you?
>
> You can follow any tutorial about git branches like this one: https://www.freecodecamp.org/news/git-checkout-remote-branch-tutorial/

Sure, I have done that. Thank you!

@kuroxx
Collaborator

kuroxx commented Oct 14, 2024

Thanks @NailaRais.

@chuang8511 @donch1989 Please review this when you have time 🙏

@kuroxx
Collaborator

kuroxx commented Oct 23, 2024

Hey @NailaRais our team has provided you feedback in your PR instill-ai/pipeline-backend#760. Please check and update, thanks!

Also make sure to submit your contribution through this form to make it count: https://forms.gle/v3kdkKJKt8ZbSJYH6

@kuroxx
Collaborator

kuroxx commented Nov 4, 2024

Hi @NailaRais are you still working on this?

I wanted to let you know that we will need a PR by the end of this week (8th Nov) since we are closing this event.

Please submit:

to ensure your contribution is counted!

Alternatively, if you cannot complete this within the time frame but would still like to contribute, you are more than welcome to, but please note that it would fall outside the scope of Hacktoberfest 2024.

Thank you, and we look forward to your contribution! ✨

@NailaRais

> Hey @NailaRais our team has provided you feedback in your PR instill-ai/pipeline-backend#760. Please check and update, thanks!
>
> Also make sure to submit your contribution through this form to make it count: https://forms.gle/v3kdkKJKt8ZbSJYH6

Hello @kuroxx, I have updated the PR as per the review comments. Should I fill in the form now?

Thank you :)

@NailaRais

> Hi @NailaRais are you still working on this?
>
> I wanted to let you know that we will need a PR by the end of this week (8th Nov) since we are closing this event.
>
> Please submit:
>
> to ensure your contribution is counted!
>
> Alternatively, if you cannot complete this within the time frame but would still like to contribute, you are more than welcome to, but please note that it would fall outside the scope of Hacktoberfest 2024.
>
> Thank you, and we look forward to your contribution! ✨

Done, waiting for further review.

Projects
Status: In Progress
3 participants