Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guard against undefined partition worker pid #581

Conversation

urmastalimaa
Copy link
Contributor

I've observed a data race when starting a consumer from within another consumer (using co-partitioned topics), where the call to get_partition_worker fails due to badarg in is_process_alive.

It seems that 1f2290b ("Verify partition worker process is alive.", 2022-05-19) already tried to resolve the data race, but did not consider the possibility that the lookup returns undefined.

Example exit report from Elixir logger:

Last message: {:EXIT, #PID<0.2613.0>, {:badarg, [{:erlang, :is_process_alive, [:undefined], [error_info: %{module: :erl_erts_errors}]}, {:brod_client, :get_partition_worker, 2, [file: ~c"/app/deps/brod/src/brod_client.erl", line: 496]}, ...MyCallChain]}}

@urmastalimaa urmastalimaa force-pushed the guard_against_undefined_pid_in_get_partition_worker branch 2 times, most recently from 29a90eb to 95b7eb3 Compare June 14, 2024 10:00
@urmastalimaa urmastalimaa marked this pull request as ready for review June 14, 2024 10:02
@urmastalimaa urmastalimaa force-pushed the guard_against_undefined_pid_in_get_partition_worker branch from 95b7eb3 to ae7fba9 Compare June 14, 2024 10:02
@urmastalimaa
Copy link
Contributor Author

CI failing with

Pulling pause (gcr.io/google_containers/pause-amd64:3.0)...
3.0: Pulling from google_containers/pause-amd64
[DEPRECATION NOTICE] Docker Image Format v1 and Docker Image manifest version 2, schema 1 support is disabled by default and will be removed in an upcoming release. Suggest the author of gcr.io/google_containers/pause-amd64:3.0 to upgrade the image to the OCI Format or Docker Image manifest v2, schema 2. More information at https://docs.docker.com/go/deprecated-image-specs/
Error: Process completed with exit code 1.

@urmastalimaa urmastalimaa force-pushed the guard_against_undefined_pid_in_get_partition_worker branch from ae7fba9 to 3c749c6 Compare June 14, 2024 10:32
@zmstone
Copy link
Contributor

zmstone commented Jun 14, 2024

CI failing with

Pulling pause (gcr.io/google_containers/pause-amd64:3.0)...
3.0: Pulling from google_containers/pause-amd64
[DEPRECATION NOTICE] Docker Image Format v1 and Docker Image manifest version 2, schema 1 support is disabled by default and will be removed in an upcoming release. Suggest the author of gcr.io/google_containers/pause-amd64:3.0 to upgrade the image to the OCI Format or Docker Image manifest v2, schema 2. More information at https://docs.docker.com/go/deprecated-image-specs/
Error: Process completed with exit code 1.

fixed in kafka_protocol. will work on this one later.
please wait for my fix, then rebase. thank you.

@zmstone
Copy link
Contributor

zmstone commented Jun 14, 2024

ci fix ref: kafka4beam/kafka_protocol#121

I've observed a data race when starting a consumer from within another
consumer (using co-partitioned topics), where the call to
`get_partition_worker` fails due to badarg in `is_process_alive`.

It seems that 1f2290b ("Verify partition worker process is alive.",
2022-05-19) already tried to resolve the data race, but did not consider
the possibility that the lookup returns `undefined`.

Example exit report from Elixir logger:
```
Last message: {:EXIT, #PID<0.2613.0>, {:badarg, [{:erlang, :is_process_alive, [:undefined], [error_info: %{module: :erl_erts_errors}]}, {:brod_client, :get_partition_worker, 2, [file: ~c"/app/deps/brod/src/brod_client.erl", line: 496]}, ...MyCallChain]}}
```
@urmastalimaa urmastalimaa force-pushed the guard_against_undefined_pid_in_get_partition_worker branch from 3c749c6 to 43f8f9c Compare June 14, 2024 19:34
@zmstone
Copy link
Contributor

zmstone commented Jun 15, 2024

fixed in #582
please rebase (merge your changelog into 3.19.1)
thanks.

@zmstone
Copy link
Contributor

zmstone commented Jun 26, 2024

pulled the commits into this PR: #587

@zmstone zmstone closed this Jun 26, 2024
@urmastalimaa
Copy link
Contributor Author

Thanks for merging 👍

@urmastalimaa urmastalimaa deleted the guard_against_undefined_pid_in_get_partition_worker branch June 27, 2024 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants