Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(googledrive): add the tasks for google drive #725

Merged
merged 28 commits into from
Nov 1, 2024

Conversation

chuang8511
Copy link
Member

@chuang8511 chuang8511 commented Oct 9, 2024

Because

  • we want to add the Google Drive to support sync the data into other data source

This commit

  • implement google drive component

Copy link

linear bot commented Oct 9, 2024

@chuang8511 chuang8511 marked this pull request as draft October 9, 2024 16:36
@chuang8511 chuang8511 force-pushed the chunhao/ins-6542-google-drive-component branch 2 times, most recently from b72493a to 3e38e55 Compare October 21, 2024 16:38
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Curious about why you extracted this client outside of the googledrive package. Are other packages meant to use this?

Also, if we're going to have several common clients perhaps I'd remove some nesting in the directories and I'd have

pkg/component/internal
 └──util
    └──client
       ├──googledrive.go
       ├──grpc.go
       └──http.go

or

/pkg/component/internal
 └──util
    └──client
       ├──googledrive
       │  └──googledrive.go
       ├──grpc
       │  └──grpc.go
       └──http
          └──http.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Curious about why you extracted this client outside of the googledrive package. Are other packages meant to use this?

Originally, I put Interface in googledrive package.
However, when we import mock service, it will cause import cycle.

And, it also makes me think if my original design of the interface is good.

I realised that when I put this interface in googledrive package, I'd think it is ok that client can know the knowledge in googledrive package. It means the output can be googledrive.file.

But, to make the test easier and have higher coverage, we should isolate the external part as much as possible.
So, in the current design, the external service doesn't have to know the knowledge in googledrive package.

And, with the context above, I decide to move out the service interacting with google drive API to googledriveclient package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/pkg/component/internal
 └──util
    └──client
       ├──googledrive
       │  └──googledrive.go
       ├──grpc
       │  └──grpc.go
       └──http
          └──http.go

I will do so! Thanks for advice!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, when we import mock service, it will cause import cycle.

But, to make the test easier and have higher coverage, we should isolate the external part as much as possible.
So, in the current design, the external service doesn't have to know the knowledge in googledrive package.

I do not agree with this. Both util/googledriveclient and /data/googledrive/v0are tied to the Google Drive SDK. And it is highly unlikely that any other package will use the Google Drive client so, if the purpose is just breaking a dependency cycle, I think I'd go for/pkg/component/data/googledrive[/v0]/client/client.go`.

But that wouldn't be my first option, though. I get your point about dependency cycles but it only happens in tests and the main cause is that you're defining an interface at the receiver package, which is a smell that tends to cause dependency cycle issues. In general, try to use interfaces when you want to communicate through packages and define them in the outer layer. Here you're using an interface to "increase the coverage" but in fact you're moving logic to another package that isn't covered, so that argument is a fallacy (better than no coverage but this strategy still leaves some core logic outside of the tests).

I'd try to avoid mocks here and inject the Google Drive API URL into the component (you may use a default URL) and use the WithEndpoint (or WithHTTPClient? I don't know if that would work) option to make requests against a local fake http server instead of the public GDrive API. Such strategy can be found in e.g. the openai package. It takes more effort as you need to manually define the HTTP handler(s) but the coverage is total. It this is not possible due to time constraints, I'd fall back to the option proposed above, keeping the interface isolated but within the googledrive component instead of under util.

Copy link
Member Author

@chuang8511 chuang8511 Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/pkg/component/data/googledrive[/v0]/client/client.go`

Oh, I think it is better.

Here you're using an interface to "increase the coverage" but in fact you're moving logic to another package that isn't covered, so that argument is a fallacy (better than no coverage but this strategy still leaves some core logic outside of the tests).

I get your point. My elaboration is not good. I think the discussion points are not these points.
But, actually, I haven't had a clear belief/principle now. So, I just write down what I think and feel recently.

I'd try to avoid mocks here and inject the Google Drive API URL into the component (you may use a default URL) and use the WithEndpoint (or WithHTTPClient? I don't know if that would work) option to make requests against a local fake http server instead of the public GDrive API. Such strategy can be found in e.g. the openai package. It takes more effort as you need to manually define the HTTP handler(s) but the coverage is total.

I also thought about mock server solution recently. But, I actually don't like this way. My "feelings" are like -

  1. It can increase the coverage, but are those parts covered by this solution meaningful?
    • In terms of integration test, it "could" be meaningful. But, for unit testing for business logic (product logic), is it meaningful?
  2. I took a look on some samples like openai, somehow it makes the test code look messy.
    • When we read the test code, we cannot easily check the external I/O, which helps the developer quickly pick up the product knowledge in test code. With the solution like mocking interface, we can easily know how internal component interact with external service.
    • Ideally, I hope test code can make the developer know the structure of the code. So, it means it'd be better to make test cases MECE.

In terms of integration test, it "could" be meaningful.

So, by this feeling, it means I also agree with adding the test code to googledriveclient, which I will move it under the googledrive.
Ideally, I should have added this client test to increase the test coverage.
But, in schedule wise, I usually don't add the client test code when there are other tasks in the team. The reasons are

  • The client code is usually covered by end-to-end test.
  • Though end-to-end test is hard to maintain in general. However, usually the stability of external parts are not controllable. So, even we add the mock part of the client code, it also cannot "protect" the product.

So, I just feel it has a very little value to add the test code to the client part.

In general, I think my point is what coverage is the real important parts we want to focus on.
In my previous company, we gradually move our focus from integration test to business logic testing, which is unit test.
Here, I am still thinking which tests are more important in terms of the project like VDP/ open-source / ....

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we read the test code, we cannot easily check the external I/O, which helps the developer quickly pick up the product knowledge in test code. With the solution like mocking interface, we can easily know how internal component interact with external service.

I don't think that's true, at least for components like openai, which interact with the external service directly through an HTTP client. That is what we fake, so the expected outcome of a component execution is a call to the OpenAI API with the provided characteristics (URL, headers, body, etc.). We could work on making the code more readable but if we introduce an interface in the middle, we're just creating an artificial boudary that only means "this is the furthest we'll test".

I get your argument for components like this or Slack, where use use an SDK. The outcome in the execution here can be seen as a call to the SDK, we don't need to worry about how that package works and we can even trust it's tested properly. Therefore, we may set a boundary there and consider our business logic "everything that happens before we call the SDK". I understand that point but I don't share that view, to me the outcome is still the API call. In the openapi example, if tomorrow we adopt an OpenAI Go SDK, we don't need to change our tests. If we use an SDK through an interface and we test it with a mock, if we switch SDKs tomorrow we'll need to change our tests. Also, faking the API we're targeting might be useful in the future to implement end-to-end tests, making requests to the public Instill AI API and checking the calls we make to faked external services (OpenAI, GitHub, Slack...).

In general, I think my point is what coverage is the real important parts we want to focus on.
In my previous company, we gradually move our focus from integration test to business logic testing, which is unit test.

I'm a bit skeptical about mocks as I've seen many times how they made it easy to codify a wrong assumption into the expectations and then have a green CI but business logic that is flawed in production. That's why in the repository package the tests run against a test database instead of a SQL mock. My experience is the opposite, investing more in service / integration tests to cover the business logic, while keeping unit tests for more fine-grained cases like error handling or how input values are transformed to calls to the dependencies injected to a package. I don't mean that in my experience I haven't cared about unit tests, they've always been the part with the most weight in the testing codebase, but when it comes to increasing confidence (not coverage) they were not enough.

In short, I don't love mocking the SDK packages but I won't consider it a blocker for component development, as long as the code structure and the dependencies are clean. In the future I'd like to unify our testing approaches (and, if the approach is faking HTTP servers, move towards a component testing framework that's easy to use and understand by devs).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if we introduce an interface in the middle, we're just creating an artificial boudary that only means "this is the furthest we'll test".

In the openapi example, if tomorrow we adopt an OpenAI Go SDK, we don't need to change our tests. If we use an SDK through an interface and we test it with a mock, if we switch SDKs tomorrow we'll need to change our tests.

Yeah, I think it is a good discussion point. I also have been thinking about it for a long time.
Previously, I did think that when we change SDK, we should also change our tests.
But, I am also doubt about if it is meaningful.
It is just like we test because we want to pass the test not because we want to protect the product.
However, on the other hand, in my previous POV, it somehow helps developers (at least me) to understand the code more. (but, not critical part actually. Mainly, I still read the production code to understand the structure.)
I don't have an answer now.

Also, faking the API we're targeting might be useful in the future to implement end-to-end tests, making requests to the public Instill AI API and checking the calls we make to faked external services (OpenAI, GitHub, Slack...).

Yeah, it is a good point. When it really happens, we should care about the time length of running CI.
Making real request is just much more time-consuming. I don't think we want to wait for a long time to pass CI.

I'm a bit skeptical about mocks as I've seen many times how they made it easy to codify a wrong assumption into the expectations and then have a green CI but business logic that is flawed in production.
That's why in the repository package the tests run against a test database instead of a SQL mock.

Yeah, I understand it. I also care about it.
But, mainly, I think the way to solve it is to check the input of mock service.

In the future I'd like to unify our testing approaches (and, if the approach is faking HTTP servers, move towards a component testing framework that's easy to use and understand by devs).

I can follow faking HTTP servers in the future development.
And, I will still keep this topic in my mind. For me, I don't have a strong belief now.

Really appreciate you share your view with me! Learn a lot! ❤️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chuang8511 chuang8511 force-pushed the chunhao/ins-6542-google-drive-component branch from 077ed8b to bd427e4 Compare October 29, 2024 16:59
@chuang8511 chuang8511 marked this pull request as ready for review October 29, 2024 19:03
@chuang8511
Copy link
Member Author

Will do end-to-end test again when other services are ready.

Copy link

codecov bot commented Oct 29, 2024

Codecov Report

Attention: Patch coverage is 37.34266% with 448 lines in your changes missing coverage. Please review.

Project coverage is 19.47%. Comparing base (f956ead) to head (7ad113b).
Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
...omponent/internal/mock/i_drive_service_mock.gen.go 25.29% 294 Missing and 25 partials ⚠️
pkg/component/data/googledrive/v0/client/client.go 0.00% 93 Missing ⚠️
pkg/component/data/googledrive/v0/main.go 89.65% 8 Missing and 4 partials ⚠️
...kg/component/data/googledrive/v0/task_read_file.go 65.71% 8 Missing and 4 partials ⚠️
.../component/data/googledrive/v0/task_read_folder.go 69.23% 8 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #725      +/-   ##
==========================================
- Coverage   19.70%   19.47%   -0.23%     
==========================================
  Files         321      334      +13     
  Lines       64847    66219    +1372     
==========================================
+ Hits        12776    12897     +121     
- Misses      50162    51389    +1227     
- Partials     1909     1933      +24     
Flag Coverage Δ
unittests 19.47% <37.34%> (-0.23%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@chuang8511 chuang8511 marked this pull request as draft October 31, 2024 10:28
@chuang8511 chuang8511 force-pushed the chunhao/ins-6542-google-drive-component branch from 7ad113b to 2ebb997 Compare October 31, 2024 11:35
@chuang8511 chuang8511 marked this pull request as ready for review November 1, 2024 10:29
@chuang8511 chuang8511 merged commit b6fe968 into main Nov 1, 2024
11 checks passed
@chuang8511 chuang8511 deleted the chunhao/ins-6542-google-drive-component branch November 1, 2024 10:53
donch1989 pushed a commit that referenced this pull request Nov 5, 2024
🤖 I have created a release *beep* *boop*
---


##
[0.46.0-beta](v0.45.2-beta...v0.46.0-beta)
(2024-11-05)


### Features

* add `condition` field support for iterator
([#803](#803))
([04b1252](04b1252))
* add markdowns per pages
([#792](#792))
([3ee428e](3ee428e))
* add resolution field
([#808](#808))
([f15f6bf](f15f6bf))
* add task sync
([#793](#793))
([41a1eeb](41a1eeb))
* **component,audio:** add TASK_DETECT_ACTIVITY and TASK_SEGMENT
([#762](#762))
([9e92a31](9e92a31))
* **component,http:** refactor `restapi` component to `http` component
([#797](#797))
([c2b1862](c2b1862))
* **component:** add error handling for missing conversation
([#806](#806))
([54cc616](54cc616))
* **component:** inject global secrets as environment variables
([#786](#786))
([8d842a6](8d842a6))
* convert time type to string
([#809](#809))
([7de8465](7de8465))
* **googledrive:** add the tasks for google drive
([#725](#725))
([b6fe968](b6fe968))
* **integration:** identify supported OAuth integrations through global
secrets
([#791](#791))
([5a96453](5a96453))
* **minio:** import updated minio package and add tag on file upload
([#779](#779))
([ef86318](ef86318))
* revamp Instill Format
([#774](#774))
([24153e2](24153e2))
* support `length` attribute for array data
([#810](#810))
([fb4f4f7](fb4f4f7))
* **web:** refactor the web operator
([#772](#772))
([ae4e3c2](ae4e3c2))


### Bug Fixes

* **component,image:** fix missing show score draw
([#801](#801))
([a405bf7](a405bf7))
* fix bug not to return error if there is no app or conversation
([#816](#816))
([a946cfd](a946cfd))
* fix iterator upstream check
([#794](#794))
([671971f](671971f))
* **run:** add metadata retention handler
([#800](#800))
([25ec0c2](25ec0c2))
* **run:** add namespace id in response
([#811](#811))
([8d29ffb](8d29ffb))
* **run:** rename pipeline run columns and fix tests
([#776](#776))
([98f1e00](98f1e00))
* **slack:** correct link to OAuth config in documentation
([#805](#805))
([aa0752d](aa0752d))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: No status
Development

Successfully merging this pull request may close these issues.

3 participants