feat(googledrive): add the tasks for google drive #725

chuang8511 · 2024-10-09T16:36:43Z

Because

we want to add the Google Drive to support sync the data into other data source

This commit

implement google drive component

linear · 2024-10-09T16:36:46Z

INS-6542 Milestone 1: Tasks Google Drive

jvallesm · 2024-10-23T05:46:45Z

pkg/component/internal/util/googledriveclient/googledriveclient.go

❓ Curious about why you extracted this client outside of the googledrive package. Are other packages meant to use this?

Also, if we're going to have several common clients perhaps I'd remove some nesting in the directories and I'd have

pkg/component/internal └──util └──client ├──googledrive.go ├──grpc.go └──http.go

or

/pkg/component/internal └──util └──client ├──googledrive │ └──googledrive.go ├──grpc │ └──grpc.go └──http └──http.go

❓ Curious about why you extracted this client outside of the googledrive package. Are other packages meant to use this?

Originally, I put Interface in googledrive package.
However, when we import mock service, it will cause import cycle.

And, it also makes me think if my original design of the interface is good.

I realised that when I put this interface in googledrive package, I'd think it is ok that client can know the knowledge in googledrive package. It means the output can be googledrive.file.

But, to make the test easier and have higher coverage, we should isolate the external part as much as possible.
So, in the current design, the external service doesn't have to know the knowledge in googledrive package.

And, with the context above, I decide to move out the service interacting with google drive API to googledriveclient package.

/pkg/component/internal └──util └──client ├──googledrive │ └──googledrive.go ├──grpc │ └──grpc.go └──http └──http.go

I will do so! Thanks for advice!

However, when we import mock service, it will cause import cycle.

But, to make the test easier and have higher coverage, we should isolate the external part as much as possible.
So, in the current design, the external service doesn't have to know the knowledge in googledrive package.

I do not agree with this. Both util/googledriveclient and /data/googledrive/v0are tied to the Google Drive SDK. And it is highly unlikely that any other package will use the Google Drive client so, if the purpose is just breaking a dependency cycle, I think I'd go for/pkg/component/data/googledrive[/v0]/client/client.go`.

But that wouldn't be my first option, though. I get your point about dependency cycles but it only happens in tests and the main cause is that you're defining an interface at the receiver package, which is a smell that tends to cause dependency cycle issues. In general, try to use interfaces when you want to communicate through packages and define them in the outer layer. Here you're using an interface to "increase the coverage" but in fact you're moving logic to another package that isn't covered, so that argument is a fallacy (better than no coverage but this strategy still leaves some core logic outside of the tests).

I'd try to avoid mocks here and inject the Google Drive API URL into the component (you may use a default URL) and use the WithEndpoint (or WithHTTPClient? I don't know if that would work) option to make requests against a local fake http server instead of the public GDrive API. Such strategy can be found in e.g. the openai package. It takes more effort as you need to manually define the HTTP handler(s) but the coverage is total. It this is not possible due to time constraints, I'd fall back to the option proposed above, keeping the interface isolated but within the googledrive component instead of under util.

/pkg/component/data/googledrive[/v0]/client/client.go`

Oh, I think it is better.

Here you're using an interface to "increase the coverage" but in fact you're moving logic to another package that isn't covered, so that argument is a fallacy (better than no coverage but this strategy still leaves some core logic outside of the tests).

I get your point. My elaboration is not good. I think the discussion points are not these points.
But, actually, I haven't had a clear belief/principle now. So, I just write down what I think and feel recently.

I'd try to avoid mocks here and inject the Google Drive API URL into the component (you may use a default URL) and use the WithEndpoint (or WithHTTPClient? I don't know if that would work) option to make requests against a local fake http server instead of the public GDrive API. Such strategy can be found in e.g. the openai package. It takes more effort as you need to manually define the HTTP handler(s) but the coverage is total.

I also thought about mock server solution recently. But, I actually don't like this way. My "feelings" are like -

It can increase the coverage, but are those parts covered by this solution meaningful?

In terms of integration test, it "could" be meaningful. But, for unit testing for business logic (product logic), is it meaningful?

I took a look on some samples like openai, somehow it makes the test code look messy.

When we read the test code, we cannot easily check the external I/O, which helps the developer quickly pick up the product knowledge in test code. With the solution like mocking interface, we can easily know how internal component interact with external service.

Ideally, I hope test code can make the developer know the structure of the code. So, it means it'd be better to make test cases MECE.

In terms of integration test, it "could" be meaningful.

So, by this feeling, it means I also agree with adding the test code to googledriveclient, which I will move it under the googledrive.
Ideally, I should have added this client test to increase the test coverage.
But, in schedule wise, I usually don't add the client test code when there are other tasks in the team. The reasons are

The client code is usually covered by end-to-end test.

Though end-to-end test is hard to maintain in general. However, usually the stability of external parts are not controllable. So, even we add the mock part of the client code, it also cannot "protect" the product.

So, I just feel it has a very little value to add the test code to the client part.

In general, I think my point is what coverage is the real important parts we want to focus on.
In my previous company, we gradually move our focus from integration test to business logic testing, which is unit test.
Here, I am still thinking which tests are more important in terms of the project like VDP/ open-source / ....

When we read the test code, we cannot easily check the external I/O, which helps the developer quickly pick up the product knowledge in test code. With the solution like mocking interface, we can easily know how internal component interact with external service.

I don't think that's true, at least for components like openai, which interact with the external service directly through an HTTP client. That is what we fake, so the expected outcome of a component execution is a call to the OpenAI API with the provided characteristics (URL, headers, body, etc.). We could work on making the code more readable but if we introduce an interface in the middle, we're just creating an artificial boudary that only means "this is the furthest we'll test".

I get your argument for components like this or Slack, where use use an SDK. The outcome in the execution here can be seen as a call to the SDK, we don't need to worry about how that package works and we can even trust it's tested properly. Therefore, we may set a boundary there and consider our business logic "everything that happens before we call the SDK". I understand that point but I don't share that view, to me the outcome is still the API call. In the openapi example, if tomorrow we adopt an OpenAI Go SDK, we don't need to change our tests. If we use an SDK through an interface and we test it with a mock, if we switch SDKs tomorrow we'll need to change our tests. Also, faking the API we're targeting might be useful in the future to implement end-to-end tests, making requests to the public Instill AI API and checking the calls we make to faked external services (OpenAI, GitHub, Slack...).

In general, I think my point is what coverage is the real important parts we want to focus on.
In my previous company, we gradually move our focus from integration test to business logic testing, which is unit test.

I'm a bit skeptical about mocks as I've seen many times how they made it easy to codify a wrong assumption into the expectations and then have a green CI but business logic that is flawed in production. That's why in the repository package the tests run against a test database instead of a SQL mock. My experience is the opposite, investing more in service / integration tests to cover the business logic, while keeping unit tests for more fine-grained cases like error handling or how input values are transformed to calls to the dependencies injected to a package. I don't mean that in my experience I haven't cared about unit tests, they've always been the part with the most weight in the testing codebase, but when it comes to increasing confidence (not coverage) they were not enough.

In short, I don't love mocking the SDK packages but I won't consider it a blocker for component development, as long as the code structure and the dependencies are clean. In the future I'd like to unify our testing approaches (and, if the approach is faking HTTP servers, move towards a component testing framework that's easy to use and understand by devs).

but if we introduce an interface in the middle, we're just creating an artificial boudary that only means "this is the furthest we'll test".

In the openapi example, if tomorrow we adopt an OpenAI Go SDK, we don't need to change our tests. If we use an SDK through an interface and we test it with a mock, if we switch SDKs tomorrow we'll need to change our tests.

Yeah, I think it is a good discussion point. I also have been thinking about it for a long time.
Previously, I did think that when we change SDK, we should also change our tests.
But, I am also doubt about if it is meaningful.
It is just like we test because we want to pass the test not because we want to protect the product.
However, on the other hand, in my previous POV, it somehow helps developers (at least me) to understand the code more. (but, not critical part actually. Mainly, I still read the production code to understand the structure.)
I don't have an answer now.

Also, faking the API we're targeting might be useful in the future to implement end-to-end tests, making requests to the public Instill AI API and checking the calls we make to faked external services (OpenAI, GitHub, Slack...).

Yeah, it is a good point. When it really happens, we should care about the time length of running CI.
Making real request is just much more time-consuming. I don't think we want to wait for a long time to pass CI.

I'm a bit skeptical about mocks as I've seen many times how they made it easy to codify a wrong assumption into the expectations and then have a green CI but business logic that is flawed in production.
That's why in the repository package the tests run against a test database instead of a SQL mock.

Yeah, I understand it. I also care about it.
But, mainly, I think the way to solve it is to check the input of mock service.

In the future I'd like to unify our testing approaches (and, if the approach is faking HTTP servers, move towards a component testing framework that's easy to use and understand by devs).

I can follow faking HTTP servers in the future development.
And, I will still keep this topic in my mind. For me, I don't have a strong belief now.

Really appreciate you share your view with me! Learn a lot! ❤️

I updated the contribution guideline for testing part.

pkg/component/data/googledrive/v0/main.go

chuang8511 · 2024-10-29T19:03:58Z

Will do end-to-end test again when other services are ready.

codecov · 2024-10-29T19:13:04Z

Codecov Report

Attention: Patch coverage is 37.34266% with 448 lines in your changes missing coverage. Please review.

Project coverage is 19.47%. Comparing base (f956ead) to head (7ad113b).
Report is 22 commits behind head on main.

Files with missing lines	Patch %	Lines
...omponent/internal/mock/i_drive_service_mock.gen.go	25.29%	294 Missing and 25 partials ⚠️
pkg/component/data/googledrive/v0/client/client.go	0.00%	93 Missing ⚠️
pkg/component/data/googledrive/v0/main.go	89.65%	8 Missing and 4 partials ⚠️
...kg/component/data/googledrive/v0/task_read_file.go	65.71%	8 Missing and 4 partials ⚠️
.../component/data/googledrive/v0/task_read_folder.go	69.23%	8 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #725      +/-   ##
==========================================
- Coverage   19.70%   19.47%   -0.23%     
==========================================
  Files         321      334      +13     
  Lines       64847    66219    +1372     
==========================================
+ Hits        12776    12897     +121     
- Misses      50162    51389    +1227     
- Partials     1909     1933      +24

Flag	Coverage Δ
unittests	`19.47% <37.34%> (-0.23%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

… drive details

chore: update contribution guideline for unit test

🤖 I have created a release *beep* *boop* --- ## [0.46.0-beta](v0.45.2-beta...v0.46.0-beta) (2024-11-05) ### Features * add `condition` field support for iterator ([#803](#803)) ([04b1252](04b1252)) * add markdowns per pages ([#792](#792)) ([3ee428e](3ee428e)) * add resolution field ([#808](#808)) ([f15f6bf](f15f6bf)) * add task sync ([#793](#793)) ([41a1eeb](41a1eeb)) * **component,audio:** add TASK_DETECT_ACTIVITY and TASK_SEGMENT ([#762](#762)) ([9e92a31](9e92a31)) * **component,http:** refactor `restapi` component to `http` component ([#797](#797)) ([c2b1862](c2b1862)) * **component:** add error handling for missing conversation ([#806](#806)) ([54cc616](54cc616)) * **component:** inject global secrets as environment variables ([#786](#786)) ([8d842a6](8d842a6)) * convert time type to string ([#809](#809)) ([7de8465](7de8465)) * **googledrive:** add the tasks for google drive ([#725](#725)) ([b6fe968](b6fe968)) * **integration:** identify supported OAuth integrations through global secrets ([#791](#791)) ([5a96453](5a96453)) * **minio:** import updated minio package and add tag on file upload ([#779](#779)) ([ef86318](ef86318)) * revamp Instill Format ([#774](#774)) ([24153e2](24153e2)) * support `length` attribute for array data ([#810](#810)) ([fb4f4f7](fb4f4f7)) * **web:** refactor the web operator ([#772](#772)) ([ae4e3c2](ae4e3c2)) ### Bug Fixes * **component,image:** fix missing show score draw ([#801](#801)) ([a405bf7](a405bf7)) * fix bug not to return error if there is no app or conversation ([#816](#816)) ([a946cfd](a946cfd)) * fix iterator upstream check ([#794](#794)) ([671971f](671971f)) * **run:** add metadata retention handler ([#800](#800)) ([25ec0c2](25ec0c2)) * **run:** add namespace id in response ([#811](#811)) ([8d29ffb](8d29ffb)) * **run:** rename pipeline run columns and fix tests ([#776](#776)) ([98f1e00](98f1e00)) * **slack:** correct link to OAuth config in documentation ([#805](#805)) ([aa0752d](aa0752d)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

chuang8511 requested review from GeorgeWilliamStrong, donch1989, pinglin, xiaofei-du and jvallesm as code owners October 9, 2024 16:36

chuang8511 marked this pull request as draft October 9, 2024 16:36

droplet-bot added instill vdp instill component labels Oct 9, 2024

chuang8511 force-pushed the chunhao/ins-6542-google-drive-component branch 2 times, most recently from b72493a to 3e38e55 Compare October 21, 2024 16:38

jvallesm reviewed Oct 23, 2024

View reviewed changes

chuang8511 commented Oct 24, 2024

View reviewed changes

pkg/component/data/googledrive/v0/main.go Outdated Show resolved Hide resolved

chuang8511 mentioned this pull request Oct 28, 2024

feat(instillmodel): migrate instill model component #777

Open

chuang8511 force-pushed the chunhao/ins-6542-google-drive-component branch from 077ed8b to bd427e4 Compare October 29, 2024 16:59

chuang8511 marked this pull request as ready for review October 29, 2024 19:03

chuang8511 added 12 commits October 31, 2024 09:23

feat(googledrive): add the tasks for google drive

7d2ad22

chore: update doc

b8f07b8

chore(googledrive): add icon

d6aecaa

feat(googledrive): design the structure about how to implement google…

5176ad3

… drive details

feat(googledrive): add credential and token setting

a2fe572

feat(googledrive): update input output design

3fc92ee

chore(googledrive): add scope to read users' basic info

8c36316

feat(googledrive): add client function

8de0bd1

feat(googledrive): add implementation code

06f4ae9

chore(googledrive): update doc

be0f77a

fix(googledrive): fix bug when fetching list from Google Drive

4e5c9d8

chore(googledrive): fix golangcilint

f26539b

chuang8511 added 14 commits October 31, 2024 09:23

feat(googledrive): isolate google drive client

cc56a03

feat(googledrive): modify the interface

eb062d2

feat(googledrive): add test code

3f83053

chore: add comments to exported functions

bdd8ae2

fix(googledrive): fix the bug about panic

f9238c6

feat(googledrive): add error cases and test code

7f22a59

chore: arrange pkg

8b1dc85

chore: move google drive client under the googledrive pkg

f9e7707

chore: update contribution guideline for unit test

d017c47

chore: update contribution guideline for unit test

chore: update testing comment

e427e73

chore: update injecting credentials

8b7c7d2

fix: update loading credentials

dfb7d13

chore: clean the googledrive pkg

4371f63

fix: adopt new interface

3832ff1

chuang8511 marked this pull request as draft October 31, 2024 10:28

chore: apply instill format to google drive

2ebb997

chuang8511 force-pushed the chunhao/ins-6542-google-drive-component branch from 7ad113b to 2ebb997 Compare October 31, 2024 11:35

feat: get client info in component

b7d8f74

chuang8511 marked this pull request as ready for review November 1, 2024 10:29

chuang8511 merged commit b6fe968 into main Nov 1, 2024
11 checks passed

chuang8511 deleted the chunhao/ins-6542-google-drive-component branch November 1, 2024 10:53

droplet-bot mentioned this pull request Nov 1, 2024

chore(main): release 0.46.0-beta #789

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(googledrive): add the tasks for google drive #725

feat(googledrive): add the tasks for google drive #725

chuang8511 commented Oct 9, 2024 •

edited

Loading

linear bot commented Oct 9, 2024

jvallesm Oct 23, 2024

chuang8511 Oct 23, 2024

chuang8511 Oct 23, 2024

jvallesm Oct 25, 2024

chuang8511 Oct 25, 2024 •

edited

Loading

jvallesm Oct 25, 2024

chuang8511 Oct 25, 2024

chuang8511 Oct 29, 2024

chuang8511 commented Oct 29, 2024

codecov bot commented Oct 29, 2024

feat(googledrive): add the tasks for google drive #725

feat(googledrive): add the tasks for google drive #725

Conversation

chuang8511 commented Oct 9, 2024 • edited Loading

linear bot commented Oct 9, 2024

jvallesm Oct 23, 2024

Choose a reason for hiding this comment

chuang8511 Oct 23, 2024

Choose a reason for hiding this comment

chuang8511 Oct 23, 2024

Choose a reason for hiding this comment

jvallesm Oct 25, 2024

Choose a reason for hiding this comment

chuang8511 Oct 25, 2024 • edited Loading

Choose a reason for hiding this comment

jvallesm Oct 25, 2024

Choose a reason for hiding this comment

chuang8511 Oct 25, 2024

Choose a reason for hiding this comment

chuang8511 Oct 29, 2024

Choose a reason for hiding this comment

chuang8511 commented Oct 29, 2024

codecov bot commented Oct 29, 2024

Codecov Report

chuang8511 commented Oct 9, 2024 •

edited

Loading

chuang8511 Oct 25, 2024 •

edited

Loading