feat: Add Flexible Data Cleansing Functionality for Enhanced Text Processing #760

Open
wants to merge 76 commits into
base: main

NailaRais

Because

  • Enhanced Functionality: Introduces a new data cleansing function, expanding the capabilities of the application to process and clean input text data effectively.
  • Improved Code Modularity: The addition of a dedicated CleanData function allows for better code organization and reuse, making it easier to manage and update cleansing logic in the future.
  • Supports Multiple Cleaning Methods: Allows users to choose between different cleaning methods (Regex or Substring), offering flexibility based on specific use cases and requirements.
  • Integration with Existing Framework: Seamlessly integrates with the existing Golang codebase without altering the original logic, maintaining stability while adding new features.
  • Enhanced Data Quality: Improves the overall quality of data processed by the application, which can lead to more accurate insights and decision-making based on clean data.

This commit

  • Introduced Data Cleansing Logic: Added the CleanData function and the CleanDataInput, CleanDataOutput, and DataCleaningSetting structs to facilitate data cleansing operations.
  • Implemented Cleaning Methods: Created methods for cleaning data using both regular expressions and substring matching, providing multiple options for data filtering.
  • Updated Execution Logic: Modified the Execute method to include a new case for handling data cleansing tasks, ensuring the system can process cleansing jobs appropriately.
  • Maintained Original Functionality: Ensured that the original functionality of the application remains intact, adhering to the principle of non-disruptive code changes.
  • Improved Error Handling: Enhanced error handling during the cleansing process to provide more informative feedback when data fails to clean or convert properly.
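
As a rough illustration of the two cleaning methods listed above, here is a minimal sketch. The type and field names are taken from the unit test posted later in this conversation; the function body is an assumption written for illustration and is not the code in this PR. In particular, skipping invalid patterns and returning the texts unchanged for an unknown method are choices made here, not confirmed behaviour.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Type and field names follow the PR's test code; the logic below is an
// assumption for illustration, not the PR's actual implementation.
type DataCleaningSetting struct {
	CleanMethod     string   // "Regex" or "Substring"
	ExcludePatterns []string // used when CleanMethod is "Regex"
	ExcludeSubstrs  []string // used when CleanMethod is "Substring"
}

type CleanDataInput struct {
	Texts   []string
	Setting DataCleaningSetting
}

type CleanDataOutput struct {
	CleanedTexts []string
}

// CleanData drops every text that matches an exclusion rule.
// An unknown CleanMethod leaves the texts unchanged.
func CleanData(in CleanDataInput) CleanDataOutput {
	var excluded func(string) bool
	switch in.Setting.CleanMethod {
	case "Regex":
		var res []*regexp.Regexp
		for _, p := range in.Setting.ExcludePatterns {
			// Invalid patterns are skipped in this sketch; a real
			// component would more likely report them as errors.
			if re, err := regexp.Compile(p); err == nil {
				res = append(res, re)
			}
		}
		excluded = func(s string) bool {
			for _, re := range res {
				if re.MatchString(s) {
					return true
				}
			}
			return false
		}
	case "Substring":
		excluded = func(s string) bool {
			for _, sub := range in.Setting.ExcludeSubstrs {
				if strings.Contains(s, sub) {
					return true
				}
			}
			return false
		}
	default:
		return CleanDataOutput{CleanedTexts: in.Texts}
	}

	cleaned := []string{}
	for _, t := range in.Texts {
		if !excluded(t) {
			cleaned = append(cleaned, t)
		}
	}
	return CleanDataOutput{CleanedTexts: cleaned}
}

func main() {
	out := CleanData(CleanDataInput{
		Texts: []string{"Hello World!", "This is a test.", "Goodbye!"},
		Setting: DataCleaningSetting{
			CleanMethod:     "Regex",
			ExcludePatterns: []string{"^Goodbye"},
		},
	})
	fmt.Println(out.CleanedTexts) // [Hello World! This is a test.]
}
```

Both branches reduce to a single "is this text excluded?" predicate, so the filtering loop is shared and only the matching rule differs.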

@NailaRais
Author

@chuang8511 Kindly review whenever you have time.

@chuang8511
Member

Hi @NailaRais
The Go code seems to be heading in the right direction.
Before we get into code review, could you do a few things for us?

  1. To run an end-to-end test, you still need to fetch this JSON schema to initialise the component definition.
  2. We will assume you have run the end-to-end test by the time your PR is ready, so please provide your recipe for us.
  3. If you have time, please also add unit tests to verify the functionality.

Thank you again!

@NailaRais
Author

Hello @chuang8511 @kuroxx

I have added a feature to fetch the JSON in main.go. Below are the unit test code, which should be added to main_test.go, and the recipe.

Updated main_test.go

package text

import (
	"context"
	"testing"

	"github.com/frankban/quicktest"

	"github.com/instill-ai/pipeline-backend/pkg/component/base"
	"github.com/instill-ai/pipeline-backend/pkg/component/internal/mock"
)

func TestOperator(t *testing.T) {
	c := quicktest.New(t)

	testcases := []struct {
		name  string
		task  string
		input ChunkTextInput
	}{
		{
			name: "chunk texts",
			task: "TASK_CHUNK_TEXT",
			input: ChunkTextInput{
				Text: "Hello world. This is a test.",
				Strategy: Strategy{
					Setting: Setting{
						ChunkMethod: "Token",
					},
				},
			},
		},
		{
			name:  "error case",
			task:  "FAKE_TASK",
			input: ChunkTextInput{},
		},
	}
	bc := base.Component{}
	ctx := context.Background()
	for i := range testcases {
		tc := &testcases[i]
		c.Run(tc.name, func(c *quicktest.C) {
			component := Init(bc)
			c.Assert(component, quicktest.IsNotNil)

			execution, err := component.CreateExecution(base.ComponentExecution{
				Component: component,
				Task:      tc.task,
			})
			c.Assert(err, quicktest.IsNil)
			c.Assert(execution, quicktest.IsNotNil)

			ir, ow, eh, job := mock.GenerateMockJob(c)
			ir.ReadDataMock.Set(func(ctx context.Context, v interface{}) error {
				*v.(*ChunkTextInput) = tc.input
				return nil
			})
			ow.WriteDataMock.Optional().Set(func(ctx context.Context, output interface{}) error {
				if tc.name == "error case" {
					c.Assert(output, quicktest.IsNil)
					return nil
				}
				return nil
			})
			if tc.name == "error case" {
				ir.ReadDataMock.Optional()
			}
			eh.ErrorMock.Optional().Set(func(ctx context.Context, err error) {
				if tc.name == "error case" {
					c.Assert(err, quicktest.ErrorMatches, "not supported task: FAKE_TASK")
				}
			})
			err = execution.Execute(ctx, []*base.Job{job})
			c.Assert(err, quicktest.IsNil)
		})
	}
}

// Additional tests for CleanData functionality
func TestCleanData(t *testing.T) {
	c := quicktest.New(t)

	testcases := []struct {
		name          string
		input         CleanDataInput
		expected      CleanDataOutput
		expectedError bool
	}{
		{
			name: "clean with regex",
			input: CleanDataInput{
				Texts: []string{"Hello World!", "This is a test.", "Goodbye!"},
				Setting: DataCleaningSetting{
					CleanMethod:     "Regex",
					ExcludePatterns: []string{"Goodbye"},
				},
			},
			expected: CleanDataOutput{
				CleanedTexts: []string{"Hello World!", "This is a test."},
			},
			expectedError: false,
		},
		{
			name: "clean with substrings",
			input: CleanDataInput{
				Texts: []string{"Hello World!", "This is a test.", "Goodbye!"},
				Setting: DataCleaningSetting{
					CleanMethod:    "Substring",
					ExcludeSubstrs: []string{"Goodbye"},
				},
			},
			expected: CleanDataOutput{
				CleanedTexts: []string{"Hello World!", "This is a test."},
			},
			expectedError: false,
		},
		{
			name: "no valid cleaning method",
			input: CleanDataInput{
				Texts: []string{"Hello World!", "This is a test."},
				Setting: DataCleaningSetting{
					CleanMethod: "InvalidMethod",
				},
			},
			expected: CleanDataOutput{
				CleanedTexts: []string{"Hello World!", "This is a test."},
			},
			expectedError: false,
		},
		{
			name: "error case",
			input: CleanDataInput{
				Texts:   []string{},
				Setting: DataCleaningSetting{},
			},
			expected:      CleanDataOutput{},
			expectedError: true,
		},
	}

	for _, tc := range testcases {
		c.Run(tc.name, func(c *quicktest.C) {
			output := CleanData(tc.input)
			if tc.expectedError {
				// CleanData returns no error value, so a failed clean is
				// signalled by an empty result (nil or zero-length).
				c.Assert(output.CleanedTexts, quicktest.HasLen, 0)
				return
			}
			c.Assert(output.CleanedTexts, quicktest.DeepEquals, tc.expected.CleanedTexts)
		})
	}
}

Recipe

variables:
  data:
    title: Data
    description: The texts to be cleansed.
    instill-format: array[string]

  settings:
    title: Data Cleaning Settings
    description: Configuration for data cleansing.
    instill-format: object
    properties:
      cleanMethod:
        type: string
        enum: ["Regex", "Substring"]
      excludePatterns:
        type: array[string]
      includePatterns:
        type: array[string]
      excludeSubstrings:
        type: array[string]
      includeSubstrings:
        type: array[string]
      caseSensitive:
        type: boolean

component:
  data-cleaner-0:
    type: data-cleaner
    task: TASK_CLEAN_DATA
    input:
      texts: ${variable.data}
      setting: ${variable.settings}

output:
  cleanedTexts:
    title: Cleaned Texts
    description: The output array of cleaned texts.
    value: ${data-cleaner-0.output.cleanedTexts}


@kuroxx

kuroxx commented Nov 4, 2024

Hey @NailaRais, could you commit your unit test file main_test.go to your PR instead of leaving it as a comment? Thanks! 🙏

@chuang8511
Member

Hi @NailaRais
Thanks for your contribution.

Your PR still lacks the JSON schema that we provided before. Please fetch the schema as part of this PR.

Also, please clean up your commit history with a command like $ git rebase -i HEAD~x.

Finally, I found that a regex can easily cover the substring use case, so we won't need the substring clean method now.
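
To make the point about regex covering the substring case concrete, here is a small sketch. It assumes the substring is escaped with `regexp.QuoteMeta` from Go's standard library so that it matches literally, and that the recipe's `caseSensitive` option could map to the `(?i)` flag. The helper name `containsViaRegex` is hypothetical and does not come from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// containsViaRegex reports whether text contains substr using only the
// regexp package. The name is hypothetical and used for illustration.
func containsViaRegex(text, substr string, caseSensitive bool) bool {
	// QuoteMeta escapes regex metacharacters, so substr matches literally.
	pattern := regexp.QuoteMeta(substr)
	if !caseSensitive {
		// (?i) turns on case-insensitive matching for the whole pattern.
		pattern = "(?i)" + pattern
	}
	// A quoted literal is always a valid pattern, so MustCompile cannot panic here.
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	fmt.Println(containsViaRegex("Price: $5.00 (sale)", "$5.00", true)) // true
	fmt.Println(containsViaRegex("Price: $5x00 (sale)", "$5.00", true)) // false
	fmt.Println(containsViaRegex("GOODBYE!", "goodbye", false))         // true
}
```

Without the escaping step, a substring such as `$5.00` would be read as a pattern in which `$` anchors the end of the text and `.` matches any character, so anyone migrating substring rules to regex rules would need to escape them first.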

Successfully merging this pull request may close these issues.

[Text] Regular expression for data cleansing