
bigquery: RowIterator connection does not recovered from read: connection reset by peer error #11364

Open
HurSungYun opened this issue Jan 3, 2025 · 5 comments
Labels
api: bigquery (Issues related to the BigQuery API.) · priority: p2 (Moderately-important priority. Fix may not be included in next release.)

Comments

@HurSungYun

Client

Bigquery

Environment

  • Docker golang:1.22-bullseye image, AWS Batch on Fargate
  • go 1.22.0

Code and Dependencies

	cli, err := bigquery.NewClient(ctx, projectID, opt...) // credential options only
	if err != nil {
		...
	}

	ri, err := cli.Query(query).Read(ctx)
	if err != nil {
                ...
	}

	var (
		rowsChan              = make(chan []map[string]interface{})
		consecutiveErrorCount = 0
	)

	go func() { // drain the iterator, batching rows onto rowsChan
		rows := make([]map[string]bigquery.Value, 0, loadSize)
		for {
			var row map[string]bigquery.Value
			err := ri.Next(&row)

			if errors.Is(err, iterator.Done) {
				if len(rows) > 0 {
					converted := convertRows(rows)
					rowsChan <- converted
				}
				close(rowsChan)
				break
			}
			if err != nil {
				consecutiveErrorCount++

				// log errors

				if consecutiveErrorCount >= consecutiveErrorThreshold { // consecutiveErrorThreshold is 10 in my case
					close(rowsChan)
					break
				}

				if strings.Contains(err.Error(), "read: connection reset") {
					// time.Sleep with exponential backoff. At least one second.
				}

				continue
			}

			...
		}
	}()

Expected behavior

When a read: connection reset by peer error occurs, subsequent calls to ri.Next(&row) should attempt to re-establish the TCP connection and retry the requests.

Actual behavior

It seems that the RowIterator does not re-establish the TCP connection after encountering a read: connection reset by peer error; instead, it continues retrying on the same socket.

The same ephemeral port is logged repeatedly (I might be wrong; please correct me if so):

       "error": "read tcp 10.128.114.155:55904->10.55.192.1:443: read: connection reset by peer"

Additional context

This code is running on AWS Fargate, where intermediate routers or load balancers might close the TCP connection (e.g., due to idleness or bandwidth constraints) without notifying the client.

I believe the connection should be re-established in such cases.

It's practically impossible to eliminate read: connection reset by peer errors entirely in a production environment. It would be helpful if RowIterator could handle these scenarios by internally re-establishing the TCP connection.

I understand there are two APIs involved: the Job API and the Storage API. Handling only the Job API is acceptable for my use case.
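
In the meantime, a caller-side workaround sketch along these lines may help, assuming the Job API path only: re-create the iterator from the already-completed job instead of retrying a broken one, and use RowIterator.StartIndex to skip the rows that were already consumed. The helper name readWithRecovery, the threshold of 10, and the backoff are hypothetical placeholders, not part of the library.

	package bqread

	import (
		"context"
		"errors"
		"strings"
		"time"

		"cloud.google.com/go/bigquery"
		"google.golang.org/api/iterator"
	)

	// readWithRecovery drains all rows of a finished query job. When Next keeps
	// failing with "connection reset by peer", it asks the job for a fresh
	// RowIterator and resumes at the first unread row via StartIndex.
	func readWithRecovery(ctx context.Context, job *bigquery.Job) ([]map[string]bigquery.Value, error) {
		it, err := job.Read(ctx)
		if err != nil {
			return nil, err
		}

		var rows []map[string]bigquery.Value
		consecutiveErrors := 0

		for {
			var row map[string]bigquery.Value
			err := it.Next(&row)
			if errors.Is(err, iterator.Done) {
				return rows, nil
			}
			if err != nil {
				consecutiveErrors++
				if consecutiveErrors >= 10 { // give up after repeated failures
					return rows, err
				}
				if strings.Contains(err.Error(), "connection reset") {
					time.Sleep(time.Duration(consecutiveErrors) * time.Second) // crude backoff
					// Re-create the iterator from the same job and skip rows we already have.
					it, err = job.Read(ctx)
					if err != nil {
						return rows, err
					}
					it.StartIndex = uint64(len(rows))
				}
				continue
			}
			consecutiveErrors = 0
			rows = append(rows, row)
		}
	}

This re-reads job results from the client side rather than fixing anything in the library, so a retry inside RowIterator would still be the better long-term fix.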

@HurSungYun HurSungYun added the triage me I really want to be triaged. label Jan 3, 2025
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the BigQuery API. label Jan 3, 2025
@HurSungYun HurSungYun changed the title from "bigquery: RowIterator connection does not re from read: connection reset by peer error" to "bigquery: RowIterator connection does not recovered from read: connection reset by peer error" on Jan 3, 2025
@alvarowolfx alvarowolfx added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Jan 6, 2025
@alvarowolfx
Contributor

hey @HurSungYun, thanks for the report. You mentioned the Storage API: are you using the Storage Read API acceleration with the EnableStorageReadClient method? Just trying to narrow it down to the APIs you're using for now.

We already have some retry logic for both the BigQuery v2 and the Storage API. The BigQuery v2 surface (which includes the jobs and query APIs) is HTTP based, so it doesn't need to keep a TCP connection open; it just retries the call. For the BQ Storage API things are a bit different, as we keep a gRPC connection open and try to reuse it as much as possible.

I'll try to reproduce your scenario here and see what we can improve with regard to retries.
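
For anyone following along, acceleration is opted in explicitly on the client. A minimal sketch (the project ID and query are placeholders, and credentials come from the default environment):

	package main

	import (
		"context"
		"errors"
		"log"

		"cloud.google.com/go/bigquery"
		"google.golang.org/api/iterator"
	)

	func main() {
		ctx := context.Background()

		client, err := bigquery.NewClient(ctx, "my-project") // placeholder project ID
		if err != nil {
			log.Fatal(err)
		}
		defer client.Close()

		// Opt in to the Storage Read API; without this call the iterator
		// stays on the HTTP-based jobs/query path.
		if err := client.EnableStorageReadClient(ctx); err != nil {
			log.Fatal(err)
		}

		it, err := client.Query("SELECT 1 AS x").Read(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for {
			var row map[string]bigquery.Value
			err := it.Next(&row)
			if errors.Is(err, iterator.Done) {
				break
			}
			if err != nil {
				log.Fatal(err)
			}
			log.Println(row, "accelerated:", it.IsAccelerated())
		}
	}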

@HurSungYun
Author

HurSungYun commented Jan 6, 2025

@alvarowolfx Thanks for the reply.

You mentioned the Storage API, are you using the Storage Read API acceleration with the EnableStorageReadClient method ?

Yes. In my configuration, EnableStorageReadClient is called so that reads are accelerated with the Storage Read API. However, the query affected by this situation is shown below.

SELECT
  CONCAT(column_a, "#", column_b) AS id,
  ANY_VALUE(column_c) AS column_c,
FROM
  (
    SELECT 
      column_a,
      column_b,
      FIRST_VALUE(column_c) OVER (PARTITION BY column_c ORDER BY created_at DESC) as column_c,
    FROM `some_table`
    WHERE
      TIMESTAMP_TRUNC(created_at, DAY) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  )
WHERE MOD(CAST(column_a AS INT), 50) = 1 -- table is not partitioned by column_a
GROUP BY 1

I believe functions like MOD prevent the Storage API from being used internally. Please let me know if I'm wrong.

(JobID() returned a valid ID for this query. Does that mean the query was not accelerated?)

Thank you again for investigating my issue.

@shollyman
Contributor

The query text doesn't restrict storage acceleration. You can verify if the row iterator is trying to use the read API by consulting https://pkg.go.dev/cloud.google.com/go/bigquery#RowIterator.IsAccelerated.

If it's not being accelerated, possibly this is related to http2? If it is accelerated (which is likely here), then we may need to look deeper at ReadRows retries.
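
If the HTTP path does turn out to be the culprit, one client-side experiment (purely a sketch under that assumption, not something the library does today) is to give the BigQuery client an HTTP/2 transport with health-check pings enabled, so a silently dropped connection is detected and replaced instead of being reused:

	package main

	import (
		"context"
		"log"
		"net/http"
		"time"

		"cloud.google.com/go/bigquery"
		"golang.org/x/net/http2"
		"google.golang.org/api/option"
		htransport "google.golang.org/api/transport/http"
	)

	// newClientWithPings builds a BigQuery client whose HTTP/2 connections send
	// periodic pings, so connections killed by middleboxes are torn down quickly.
	func newClientWithPings(ctx context.Context, projectID string) (*bigquery.Client, error) {
		base := &http.Transport{Proxy: http.ProxyFromEnvironment}

		// Enable HTTP/2 on the base transport and turn on keepalive pings.
		h2, err := http2.ConfigureTransports(base)
		if err != nil {
			return nil, err
		}
		h2.ReadIdleTimeout = 30 * time.Second
		h2.PingTimeout = 10 * time.Second

		// Wrap the base transport with the standard Google auth layer.
		rt, err := htransport.NewTransport(ctx, base, option.WithScopes(bigquery.Scope))
		if err != nil {
			return nil, err
		}

		return bigquery.NewClient(ctx, projectID, option.WithHTTPClient(&http.Client{Transport: rt}))
	}

	func main() {
		client, err := newClientWithPings(context.Background(), "my-project") // placeholder project ID
		if err != nil {
			log.Fatal(err)
		}
		defer client.Close()
	}

This only affects the HTTP (jobs/query) path; the gRPC-based Storage Read API connections are managed separately.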

@shollyman shollyman self-assigned this Jan 10, 2025
@HurSungYun
Author

@shollyman Thank you for your reply.

The query text doesn't restrict storage acceleration.

Thank you for letting me know.

If it's not being accelerated, possibly this is related to http2? If it is accelerated (which is likely here), then we may need to look deeper at ReadRows retries.

Understood. I am going to verify whether the row iterator is accelerated using the RowIterator.IsAccelerated function.

I’ll let you know once I have the results. It may take about a week.

@HurSungYun
Author

HurSungYun commented Jan 20, 2025

@shollyman

The error was reproduced. My code and log are below.

	ri, err := client.Query(query).Read(ctx)
	if err != nil {
		// error
	}

	isAccelerated := ri.IsAccelerated()
	logger.
		WithExtras(logger.Extras{
			"is_accelerated": isAccelerated,
		}).
		Infof(ctx, "bigquery: row iterator is accelerated")

The log shows that the query is not accelerated.

[screenshot: log output showing is_accelerated: false]

Can you please look deeper based on this information? Please let me know if more information is needed.
