
bigquery: RowIterator connection does not recovered from read: connection reset by peer error #11364

Open
HurSungYun opened this issue Jan 3, 2025 · 5 comments
Labels
api: bigquery (Issues related to the BigQuery API.) · priority: p2 (Moderately-important priority. Fix may not be included in next release.)

Comments

@HurSungYun

Client

Bigquery

Environment

  • Docker golang:1.22-bullseye image, AWS Batch on Fargate
  • go 1.22.0

Code and Dependencies

	cli, err := bigquery.NewClient(ctx, projectID, opt...) // credential options only
	if err != nil {
		...
	}

	ri, err := cli.Query(query).Read(ctx)
	if err != nil {
                ...
	}

	var (
		rowsChan              = make(chan []map[string]interface{})
		consecutiveErrorCount = 0
	)

	go func() { // drain the iterator, batching rows onto rowsChan
		rows := make([]map[string]bigquery.Value, 0, loadSize)
		for {
			var row map[string]bigquery.Value
			err := ri.Next(&row)

			if errors.Is(err, iterator.Done) {
				if len(rows) > 0 {
					converted := convertRows(rows)
					rowsChan <- converted
				}
				close(rowsChan)
				break
			}
			if err != nil {
				consecutiveErrorCount++

				// log errors

				if consecutiveErrorCount >= consecutiveErrorThreshold { // consecutiveErrorThreshold is 10 in my case
					close(rowsChan)
					break
				}

				if strings.Contains(err.Error(), "read: connection reset") {
					// time.Sleep with exponential backoff. At least one second.
				}

				continue
			}

			...
		}
	}()

Expected behavior

When a read: connection reset by peer error occurs, subsequent calls to ri.Next(&row) should attempt to re-establish the TCP connection and retry the requests.

Actual behavior

It seems that the RowIterator does not re-establish the TCP connection after encountering a read: connection reset by peer error; instead, it continues retrying on the same socket.

The same ephemeral port is logged repeatedly (I might be wrong; please correct me if so):

       "error": "read tcp 10.128.114.155:55904->10.55.192.1:443: read: connection reset by peer"

Additional context

This code is running on AWS Fargate, where intermediate routers or load balancers might close the TCP connection (e.g., due to idleness or bandwidth constraints) without notifying the client.

I believe the connection should be re-established in such cases.

It's practically impossible to eliminate read: connection reset by peer errors entirely in a production environment. It would be helpful if RowIterator could handle these scenarios by internally re-establishing the TCP connection.

I understand there are two APIs involved: the Job API and the Storage API. Handling only the Job API is acceptable for my use case.
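
In the meantime, a caller-side workaround sketch along these lines may help, assuming the Job API path only: re-create the iterator from the already-completed job instead of retrying a broken one, and use RowIterator.StartIndex to skip the rows that were already consumed. The helper name readWithRecovery, the threshold of 10, and the backoff are hypothetical placeholders, not part of the library.

	package bqread

	import (
		"context"
		"errors"
		"strings"
		"time"

		"cloud.google.com/go/bigquery"
		"google.golang.org/api/iterator"
	)

	// readWithRecovery drains all rows of a finished query job. When Next keeps
	// failing with "connection reset by peer", it asks the job for a fresh
	// RowIterator and resumes at the first unread row via StartIndex.
	func readWithRecovery(ctx context.Context, job *bigquery.Job) ([]map[string]bigquery.Value, error) {
		it, err := job.Read(ctx)
		if err != nil {
			return nil, err
		}

		var rows []map[string]bigquery.Value
		consecutiveErrors := 0

		for {
			var row map[string]bigquery.Value
			err := it.Next(&row)
			if errors.Is(err, iterator.Done) {
				return rows, nil
			}
			if err != nil {
				consecutiveErrors++
				if consecutiveErrors >= 10 { // give up after repeated failures
					return rows, err
				}
				if strings.Contains(err.Error(), "connection reset") {
					time.Sleep(time.Duration(consecutiveErrors) * time.Second) // crude backoff
					// Re-create the iterator from the same job and skip rows we already have.
					it, err = job.Read(ctx)
					if err != nil {
						return rows, err
					}
					it.StartIndex = uint64(len(rows))
				}
				continue
			}
			consecutiveErrors = 0
			rows = append(rows, row)
		}
	}

This re-reads job results from the client side rather than fixing anything in the library, so a retry inside RowIterator would still be the better long-term fix.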

@HurSungYun HurSungYun added the triage me I really want to be triaged. label Jan 3, 2025
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the BigQuery API. label Jan 3, 2025
@HurSungYun HurSungYun changed the title from "bigquery: RowIterator connection does not re from read: connection reset by peer error" to "bigquery: RowIterator connection does not recovered from read: connection reset by peer error" on Jan 3, 2025
@alvarowolfx alvarowolfx added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Jan 6, 2025
@alvarowolfx
Contributor

hey @HurSungYun, thanks for the report. You mentioned the Storage API: are you using the Storage Read API acceleration with the EnableStorageReadClient method? Just trying to narrow it down to the APIs you're using for now.

We already have some retry logic for both the BigQuery v2 and the Storage API. The BigQuery v2 surface (which includes the jobs and query APIs) is HTTP based, so it doesn't need to keep a TCP connection open; it just retries the call. For the BQ Storage API things are a bit different, as we keep a gRPC connection open and try to reuse it as much as possible.

I'll try to reproduce your scenario here and see what we can improve with regard to retries.
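
For anyone following along, acceleration is opted in explicitly on the client. A minimal sketch (the project ID and query are placeholders, and credentials come from the default environment):

	package main

	import (
		"context"
		"errors"
		"log"

		"cloud.google.com/go/bigquery"
		"google.golang.org/api/iterator"
	)

	func main() {
		ctx := context.Background()

		client, err := bigquery.NewClient(ctx, "my-project") // placeholder project ID
		if err != nil {
			log.Fatal(err)
		}
		defer client.Close()

		// Opt in to the Storage Read API; without this call the iterator
		// stays on the HTTP-based jobs/query path.
		if err := client.EnableStorageReadClient(ctx); err != nil {
			log.Fatal(err)
		}

		it, err := client.Query("SELECT 1 AS x").Read(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for {
			var row map[string]bigquery.Value
			err := it.Next(&row)
			if errors.Is(err, iterator.Done) {
				break
			}
			if err != nil {
				log.Fatal(err)
			}
			log.Println(row, "accelerated:", it.IsAccelerated())
		}
	}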

@HurSungYun
Author

HurSungYun commented Jan 6, 2025

@alvarowolfx Thanks for the reply.

You mentioned the Storage API, are you using the Storage Read API acceleration with the EnableStorageReadClient method ?

Yes. In my configuration, EnableStorageReadClient is called so that reads are accelerated with the Storage Read API. However, the query affected by this situation is shown below.

SELECT
  CONCAT(column_a, "#", column_b) AS id,
  ANY_VALUE(column_c) AS column_c,
FROM
  (
    SELECT 
      column_a,
      column_b,
      FIRST_VALUE(column_c) OVER (PARTITION BY column_c ORDER BY created_at DESC) as column_c,
    FROM `some_table`
    WHERE
      TIMESTAMP_TRUNC(created_at, DAY) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  )
WHERE MOD(CAST(column_a AS INT), 50) = 1 -- table is not partitioned by column_a
GROUP BY 1

I believe functions like MOD prevent the Storage API from being used internally. Please let me know if I'm wrong.

(JobID() returned a valid ID for this query. Does that mean the query was not accelerated?)

Thank you again for investigating my issue.

@shollyman
Contributor

The query text doesn't restrict storage acceleration. You can verify if the row iterator is trying to use the read API by consulting https://pkg.go.dev/cloud.google.com/go/bigquery#RowIterator.IsAccelerated.

If it's not being accelerated, possibly this is related to http2? If it is accelerated (which is likely here), then we may need to look deeper at ReadRows retries.
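
If the HTTP path does turn out to be the culprit, one client-side experiment (purely a sketch under that assumption, not something the library does today) is to give the BigQuery client an HTTP/2 transport with health-check pings enabled, so a silently dropped connection is detected and replaced instead of being reused:

	package main

	import (
		"context"
		"log"
		"net/http"
		"time"

		"cloud.google.com/go/bigquery"
		"golang.org/x/net/http2"
		"google.golang.org/api/option"
		htransport "google.golang.org/api/transport/http"
	)

	// newClientWithPings builds a BigQuery client whose HTTP/2 connections send
	// periodic pings, so connections killed by middleboxes are torn down quickly.
	func newClientWithPings(ctx context.Context, projectID string) (*bigquery.Client, error) {
		base := &http.Transport{Proxy: http.ProxyFromEnvironment}

		// Enable HTTP/2 on the base transport and turn on keepalive pings.
		h2, err := http2.ConfigureTransports(base)
		if err != nil {
			return nil, err
		}
		h2.ReadIdleTimeout = 30 * time.Second
		h2.PingTimeout = 10 * time.Second

		// Wrap the base transport with the standard Google auth layer.
		rt, err := htransport.NewTransport(ctx, base, option.WithScopes(bigquery.Scope))
		if err != nil {
			return nil, err
		}

		return bigquery.NewClient(ctx, projectID, option.WithHTTPClient(&http.Client{Transport: rt}))
	}

	func main() {
		client, err := newClientWithPings(context.Background(), "my-project") // placeholder project ID
		if err != nil {
			log.Fatal(err)
		}
		defer client.Close()
	}

This only affects the HTTP (jobs/query) path; the gRPC-based Storage Read API connections are managed separately.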

@shollyman shollyman self-assigned this Jan 10, 2025
@HurSungYun
Author

@shollyman Thank you for your reply.

The query text doesn't restrict storage acceleration.

Thank you for letting me know.

If it's not being accelerated, possibly this is related to http2? If it is accelerated (which is likely here), then we may need to look deeper at ReadRows retries.

Understood. I am going to verify whether the row iterator is accelerated using the RowIterator.IsAccelerated function.

I’ll let you know once I have the results. It may take about a week.

@HurSungYun
Author

HurSungYun commented Jan 20, 2025

@shollyman

The error was reproduced. My code and log are below.

	ri, err := client.Query(query).Read(ctx)
	if err != nil {
		// error
	}

	isAccelerated := ri.IsAccelerated()
	logger.
		WithExtras(logger.Extras{
			"is_accelerated": isAccelerated,
		}).
		Infof(ctx, "bigquery: row iterator is accelerated")

The log shows that the query is not accelerated.

[screenshot: log output showing is_accelerated: false]

Can you please look deeper based on this information? Please let me know if more information is needed.
