
[TD to BQ] DVT does not account for differences in encoding for primary keys #1375

Open
karcot1 opened this issue Dec 20, 2024 · 1 comment
Labels
type: question Request for information or clarification. Not an issue.

Comments


karcot1 commented Dec 20, 2024

Hello team,

We have encountered a scenario where a Teradata CHAR column that has special characters is evaluated differently depending on whether or not the column is passed as a primary key.

If the column is not a primary key, the row hash process decodes TD values and translates them to UTF-8 before comparing against the BigQuery value. This allows identical data with different encodings to still pass a validation.

If the column is a primary key, DVT does not alter the value in any way before joining the source and target data together. In this case, the column will be interpreted as two different values in TD and BQ. This results in two distinct rows in the output which both fail, instead of one row that succeeds.

We are exploring custom query options to work around this issue, but ideally DVT would account for the different encodings in primary keys the same way it accounts for them in a row hash.

Example value: ‘ABCDEFGÂ ‘ (the final character is a non-breaking space)

The hex code in Teradata is 004100420043004400450046004700C200A0
The hex code in BigQuery is 41424344454647c382c2a0

The key differences here are the following values:

  1. "C2" in TD which maps to "c382" in BQ
  2. "A0" in TD which maps to "c2a0" in BQ

If this column is not a primary key, the row hash validation succeeds. If the column is a primary key, DVT does not recognize these two rows as the same in the two systems.
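For illustration only (this is not DVT code): decoding the two hex dumps above — the Teradata dump reads as UTF-16 code units and the BigQuery dump as UTF-8 bytes, an interpretation inferred from the values themselves — yields the same string, which is the equivalence a primary-key join would need:

```python
# Hex dumps quoted in the report above: same logical string, different encodings.
td_bytes = bytes.fromhex("004100420043004400450046004700C200A0")
bq_bytes = bytes.fromhex("41424344454647c382c2a0")

# The raw bytes differ, so an encoding-naive join sees two distinct values.
assert td_bytes != bq_bytes

# Decoding each dump with its own encoding yields the identical string:
# "ABCDEFG" + "\xc2" (A-circumflex) + "\xa0" (no-break space).
assert td_bytes.decode("utf-16-be") == bq_bytes.decode("utf-8")
```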

sundar-mudupalli-work (Collaborator) commented Dec 24, 2024

Team,

After some investigation, I realized that the reported behavior is not occurring; there could be another issue. Consider the following BQ and TD tables:

CREATE TABLE data_validator_issues.issue1375 (
id INT64,
latin_pkey STRING,
my_value  NUMERIC);

INSERT INTO data_validator_issues.issue1375 (id, latin_pkey, my_value) VALUES
(1, 'row1', 30.30000),
(2, 'row2', 40.45000),
(3, 'GASPÉ', 50.56100),
(4, 'SAINT-RENÉ', 60.47230);
CREATE TABLE udfs.issue1375 (
id INTEGER,
latin_pkey VARCHAR(30) CHARACTER SET LATIN,
my_value DECIMAL(9,5))
UNIQUE PRIMARY INDEX test_decimals_pk (id)
;

INSERT INTO udfs.issue1375 (id, latin_pkey, my_value) VALUES
(1, 'row1', 30.30000);
INSERT INTO udfs.issue1375 (id, latin_pkey, my_value) VALUES
(2, 'row2', 40.45000);
INSERT INTO udfs.issue1375 (id, latin_pkey, my_value) VALUES
(3, _latin '47415350C9'XCV, 50.56100);
INSERT INTO udfs.issue1375 (id, latin_pkey, my_value) VALUES
(4, _latin '5341494E542D52454EC9'XCV, 60.47230);
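As a side check (illustrative Python, not part of DVT), the Latin byte literal used in the INSERTs above decodes to the intended string, while BigQuery stores the same string as different bytes:

```python
# The XCV literal '47415350C9' from the Teradata INSERT, read as Latin-1.
raw = bytes.fromhex("47415350C9")
assert raw.decode("latin-1") == "GASPÉ"            # same logical value as the BQ row

# BigQuery stores UTF-8, where É (U+00C9) takes two bytes.
assert "GASPÉ".encode("utf-8") == b"GASP\xc3\x89"
```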

Clearly the strings are stored as UTF-8 in BigQuery and as Latin-1 in TD, and they contain non-ASCII characters, so the data validation comparison could fail. However:

data-validation validate row -tc teradata -sc bq -tsp -hash='*' -pk=latin_pkey -tbls=data_validator_issues.issue1375=udfs.issue1375
╒═══════════════════╤═══════════════════╤═════════════════════════════════╤══════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name   │ validation_type   │ source_table_name               │ source_column_name   │ source_agg_value                                                 │ target_agg_value                                                 │ pct_difference   │ validation_status   │ run_id                               │
╞═══════════════════╪═══════════════════╪═════════════════════════════════╪══════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ hash__all         │ Row               │ data_validator_issues.issue1375 │ hash__all            │ eb4549b5031fee02ad2a7e8b908612e93a70cb22dfd26811d242fa0916fc110e │ eb4549b5031fee02ad2a7e8b908612e93a70cb22dfd26811d242fa0916fc110e │                  │ success             │ d589ea97-0e7e-44ab-bac6-ea643061d115 │
├───────────────────┼───────────────────┼─────────────────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────┼─────────────────────┼──────────────────────────────────────┤
│ hash__all         │ Row               │ data_validator_issues.issue1375 │ hash__all            │ 17e7e93b03a4f31e2869282709cc0d5dbb6cb7348d976550f07540aa0ac40d40 │ 17e7e93b03a4f31e2869282709cc0d5dbb6cb7348d976550f07540aa0ac40d40 │                  │ success             │ d589ea97-0e7e-44ab-bac6-ea643061d115 │
├───────────────────┼───────────────────┼─────────────────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────┼─────────────────────┼──────────────────────────────────────┤
│ hash__all         │ Row               │ data_validator_issues.issue1375 │ hash__all            │ 9f9435b197faf5a8cfef58df16250f808c8cf3b0f36425f756f0908b63de64eb │ 9f9435b197faf5a8cfef58df16250f808c8cf3b0f36425f756f0908b63de64eb │                  │ success             │ d589ea97-0e7e-44ab-bac6-ea643061d115 │
├───────────────────┼───────────────────┼─────────────────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────┼─────────────────────┼──────────────────────────────────────┤
│ hash__all         │ Row               │ data_validator_issues.issue1375 │ hash__all            │ 5575ee8b92cf6976c3e70c9dfe7c09a7422ce47c50aaef7453eb245074b56c79 │ 5575ee8b92cf6976c3e70c9dfe7c09a7422ce47c50aaef7453eb245074b56c79 │                  │ success             │ d589ea97-0e7e-44ab-bac6-ea643061d115 │
╘═══════════════════╧═══════════════════╧═════════════════════════════════╧══════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

So if different values are stored, how does DVT succeed in this case? The answer lies in what DVT fetches from the database. For primary keys, DVT fetches character strings, which are decoded to UTF-8 on retrieval; that matches the encoding used in BigQuery, so identical strings are fetched and compared.

So what was the issue with Latin characters earlier? In that case, we ask TD (the database) to calculate a hash. If the characters are ASCII, their byte representation is the same in Latin-1 and UTF-8, so the generated hashes were the same in both TD and BQ. If a character is a non-ASCII Latin character, the byte representation differs, so the hash turned out to be different in TD and BQ. DVT therefore needed to ask TD to convert the characters to UTF-8 first and then perform the hash. That is the difference.
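That difference can be sketched in a few lines, using SHA-256 as a stand-in for DVT's row hash (the actual hash function and column concatenation DVT uses are not reproduced here):

```python
import hashlib

value = "GASPÉ"
latin_bytes = value.encode("latin-1")  # b'GASP\xc9'      (Teradata storage)
utf8_bytes = value.encode("utf-8")     # b'GASP\xc3\x89'  (BigQuery storage)

# Hashing the raw stored bytes diverges once a non-ASCII character appears.
assert hashlib.sha256(latin_bytes).hexdigest() != hashlib.sha256(utf8_bytes).hexdigest()

# Normalizing the Teradata side to UTF-8 before hashing makes the hashes agree,
# which is the conversion DVT pushes into the Teradata row-hash expression.
normalized = latin_bytes.decode("latin-1").encode("utf-8")
assert hashlib.sha256(normalized).hexdigest() == hashlib.sha256(utf8_bytes).hexdigest()
```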

Hope that helps.

Sundar Mudupalli

@helensilva14 helensilva14 added the type: question Request for information or clarification. Not an issue. label Jan 4, 2025