validate column: PostgreSQL False negative validation of timestamps with epoch seconds at high volume #1391

Open

nj1973 opened this issue Jan 8, 2025 · 1 comment

Labels: priority: p1 (High priority. Fix may be included in the next release.), type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

nj1973 commented Jan 8, 2025

When validating timestamps using --wildcard-include-timestamp, we can get a failed validation for columns that should match.

Test tables

Oracle

CREATE TABLE dvt_test.tab_many_ts
( id         number(10) primary key
, col_ts     timestamp(0));
INSERT INTO dvt_test.tab_many_ts
SELECT ROWNUM
,      to_date('9999-12-31 00:01:00','YYYY-MM-DD HH24:MI:SS') - mod(ROWNUM,100)
FROM   dual
CONNECT BY ROWNUM <= 50000;
INSERT INTO dvt_test.tab_many_ts VALUES (0,TO_DATE('1910-02-01 00:01:59','YYYY-MM-DD HH24:MI:SS'));
COMMIT;

PostgreSQL

CREATE TABLE dvt_test.tab_many_ts
( id         int primary key
, col_ts     timestamp(0));
DELETE FROM dvt_test.tab_many_ts;
INSERT INTO dvt_test.tab_many_ts
SELECT n
,      ('9999-12-31'::date - mod(n,100))::timestamp + interval '1 minute'
FROM   generate_series(1,50000) AS n;
INSERT INTO dvt_test.tab_many_ts VALUES (0,'1910-02-01 00:01:59'::timestamp);

BigQuery

CREATE TABLE dvt_test.tab_many_ts
( id         int64
, col_ts     datetime);
INSERT INTO dvt_test.tab_many_ts
SELECT id
,      DATE_ADD(DATETIME'9999-12-31 00:01:00',INTERVAL -MOD(id,100) DAY)
FROM   UNNEST(GENERATE_ARRAY(1, 50000)) id;
INSERT INTO dvt_test.tab_many_ts VALUES (0,DATETIME'1910-02-01 00:01:59');

Test commands

Oracle vs BigQuery (success):


data-validation -v validate column -sc ora -tc bq \
 --tables-list=dvt_test.tab_many_ts --sum="col_ts" --wildcard-include-timestamp
...
╒════════════════════════════════════════════╤═══════════════════╤══════════════════════╤═══════════════════════════════════════╤════════════════════╤════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name                            │ validation_type   │ source_table_name    │ source_column_name                    │   source_agg_value │   target_agg_value │   pct_difference │ validation_status   │ run_id                               │
╞════════════════════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════════════════╪════════════════════╪════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ sum__epoch_seconds__cast_timestamp__col_ts │ Column            │ dvt_test.tab_many_ts │ epoch_seconds__cast_timestamp__col_ts │  12669894992222519 │  12669894992222519 │                0 │ success             │ a7d06a94-d288-473a-ad61-ab6cff66b8f3 │
├────────────────────────────────────────────┼───────────────────┼──────────────────────┼───────────────────────────────────────┼────────────────────┼────────────────────┼──────────────────┼─────────────────────┼──────────────────────────────────────┤
│ count                                      │ Column            │ dvt_test.tab_many_ts │                                       │              50001 │              50001 │                0 │ success             │ a7d06a94-d288-473a-ad61-ab6cff66b8f3 │
╘════════════════════════════════════════════╧═══════════════════╧══════════════════════╧═══════════════════════════════════════╧════════════════════╧════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

Oracle vs PostgreSQL (fail):

data-validation -v validate column -sc ora -tc pg \
 --tables-list=dvt_test.tab_many_ts --sum="col_ts" --wildcard-include-timestamp
...
╒════════════════════════════╤═══════════════════╤══════════════════════╤═══════════════════════╤════════════════════╤════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name            │ validation_type   │ source_table_name    │ source_column_name    │   source_agg_value │   target_agg_value │   pct_difference │ validation_status   │ run_id                               │
╞════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════╪════════════════════╪════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ sum__epoch_seconds__col_ts │ Column            │ dvt_test.tab_many_ts │ epoch_seconds__col_ts │  12669894992222519 │  12669894992222520 │      7.89273e-15 │ fail                │ 85cabc18-8e62-4e0f-ade7-8d3ac751c621 │
├────────────────────────────┼───────────────────┼──────────────────────┼───────────────────────┼────────────────────┼────────────────────┼──────────────────┼─────────────────────┼──────────────────────────────────────┤
│ count                      │ Column            │ dvt_test.tab_many_ts │                       │              50001 │              50001 │      0           │ success             │ 85cabc18-8e62-4e0f-ade7-8d3ac751c621 │
╘════════════════════════════╧═══════════════════╧══════════════════════╧═══════════════════════╧════════════════════╧════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

Research
The PostgreSQL query was the one below, which returns the correct result:

01/08/2025 04:04:09 PM-INFO: -- ** Target Query ** --
01/08/2025 04:04:09 PM-INFO: SELECT count(*) AS count, sum(t0.epoch_seconds__col_ts) AS sum__epoch_seconds__col_ts
FROM (SELECT t1.id AS id, t1.col_ts AS col_ts, CAST(EXTRACT(epoch FROM t1.col_ts) AS BIGINT) AS epoch_seconds__col_ts
FROM dvt_test.tab_many_ts AS t1) AS t0;

 count | sum__epoch_seconds__col_ts
-------+----------------------------
 50001 |          12669894992222519
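
For context, the magnitude of this SUM already sits beyond 2**53, the limit up to which float64 can represent every integer exactly, so any step that coerces the value to a float risks losing the last digit. A quick plain-Python sketch of the magnitudes involved (illustrative only, not DVT code):

from datetime import datetime, timezone

# Each year-9999 row contributes roughly 2.5e11 epoch seconds, so
# 50,000 such rows sum to ~1.27e16, beyond float64's exact integer
# range of 2**53 (~9.0e15).
epoch = int(datetime(9999, 12, 31, 0, 1, tzinfo=timezone.utc).timestamp())
print(epoch)                  # 253402214460
print(epoch * 50000 > 2**53)  # True: the SUM cannot survive a float64 round-trip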

I debugged the validation and, at the point where we fetch data from PostgreSQL, we get the correct result:

> /.../professional-services-data-validator/.venv/lib/python3.11/site-packages/pandas/core/frame.py(2229)from_records()
-> dtype = None
(Pdb) p first_row
(50001, Decimal('12669894992222519'))

But in the same function we then enter this section of code (note that pandas sees Decimal as an object):

2267 ->	            if coerce_float:
2268 	                for i, arr in enumerate(arrays):
2269 	                    if arr.dtype == object:
2270 	                        # error: Argument 1 to "maybe_convert_objects" has
2271 	                        # incompatible type "Union[ExtensionArray, ndarray]";
2272 	                        # expected "ndarray"
2273 	                        arrays[i] = lib.maybe_convert_objects(
2274 	                            arr,  # type: ignore[arg-type]
2275 	                            try_float=True,
2276 	                        )

And after the coerce_float section completes, the data has been converted from an integer to a floating-point value and has become lossy:

(Pdb) p arrays
[array([50001]), array([1.2669895e+16])]
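
The loss can be reproduced with the public pandas API alone. A standalone sketch (column names are illustrative; coerce_float=True mirrors the flag seen in the debugger above):

from decimal import Decimal
import pandas as pd

# With coerce_float=True, pandas converts the Decimal to float64.
# 12669894992222519 is odd and above 2**53, so it rounds to the
# nearest representable float64, 12669894992222520 -- exactly the
# target_agg_value reported in the failing validation.
row = (50001, Decimal("12669894992222519"))
df = pd.DataFrame.from_records([row], columns=["count", "sum_col"], coerce_float=True)
print(df["sum_col"].dtype)                  # float64
print(int(df["sum_col"].iloc[0]))           # 12669894992222520
print(float(Decimal("12669894992222519")))  # 1.266989499222252e+16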

nj1973 added the type: bug label Jan 8, 2025

nj1973 commented Jan 8, 2025

Actually, I fear this test case is flawed; therefore this issue should be left on hold for the time being.

In order to run the test above, I temporarily made a change described in #1390, which then gave me the Decimal results noted in the body of the issue.

I need to repeat the test with a lot more data, instead of using a small number of extreme dates to push the SUM result to a high precision.
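
For sizing that repeat test, the float64 cliff is easy to characterise (plain Python, assuming the same epoch-seconds SUM):

# float64 has a 53-bit significand: all integers up to 2**53 are exact,
# and adjacent integers start to collide beyond it.
print(2**53)                             # 9007199254740992
print(float(2**53) == float(2**53 + 1))  # True: precision loss begins here

# Rows needed for an epoch-seconds SUM to cross 2**53:
print(2**53 // 253402214460)    # ~35,500 year-9999 rows (the test used 50,001)
print(2**53 // 1_700_000_000)   # ~5.3 million current-era rows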

nj1973 self-assigned this Jan 14, 2025
nj1973 added the priority: p1 label Jan 14, 2025