[TD to BQ] DVT does not account for differences in encoding for primary keys #1375
Team,

After some investigation, I realized that the reported behavior is not happening; there could be another issue. Consider the following BQ and TD tables.

Clearly the strings are stored in UTF-8 in BigQuery and in Latin1 in Teradata, and they contain non-ASCII characters, so the data validation comparison could fail. However, it succeeds.

So if different values are stored, how does DVT work in this case? The answer lies in what DVT fetches from the database. For primary keys, DVT fetches character strings, which arrive encoded in UTF-8. That matches the encoding used in BigQuery, so the correct strings are brought over and compared.

So what was the issue with Latin characters earlier? In that case, we ask TD (the database) to calculate a hash. If the characters are ASCII, their byte representation is the same in Latin1 and UTF-8, and the generated hashes were identical in TD and BQ. If a character is a Latin character, its byte representation differs, so the hash turned out to be different in TD and BQ. DVT therefore needed to ask TD to convert the characters to UTF-8 first and then compute the hash. That is the difference.

Hope that helps.

Sundar Mudupalli
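The byte-level behavior described above can be sketched in a few lines of Python (this is not DVT code, just an illustration; `sha256_hex` stands in for whatever hash function the databases compute):

```python
import hashlib

def sha256_hex(b: bytes) -> str:
    """Hash raw bytes, as a stand-in for a database-side hash function."""
    return hashlib.sha256(b).hexdigest()

# ASCII-only value: Latin1 and UTF-8 bytes are identical, so hashes
# computed over TD's stored bytes and BQ's stored bytes agree.
ascii_val = "ABCDEFG"
assert sha256_hex(ascii_val.encode("latin-1")) == sha256_hex(ascii_val.encode("utf-8"))

# Value containing a Latin character: the byte representations diverge
# (0xC2 in Latin1 vs. 0xC3 0x82 in UTF-8), so the hashes differ.
latin_val = "ABCDEFGÂ"
assert latin_val.encode("latin-1") != latin_val.encode("utf-8")
assert sha256_hex(latin_val.encode("latin-1")) != sha256_hex(latin_val.encode("utf-8"))

# The fix: have TD convert its Latin1 data to UTF-8 before hashing,
# so both systems hash identical bytes.
td_as_utf8 = latin_val.encode("latin-1").decode("latin-1").encode("utf-8")
assert sha256_hex(td_as_utf8) == sha256_hex(latin_val.encode("utf-8"))
```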
Hello team,
We have encountered a scenario where a Teradata CHAR column that has special characters is evaluated differently depending on whether or not the column is passed as a primary key.
If the column is not a primary key, the row hash process decodes the TD values and translates them to UTF-8 before comparing against the BigQuery values. This allows identical data with different encodings to still pass validation.
If the column is a primary key, DVT does not alter the value in any way before joining the source and target data. The column is then interpreted as two different values in TD and BQ, which produces two distinct failing rows in the output instead of one row that succeeds.
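A minimal Python sketch of the join behavior just described (not DVT code; the dict-based join and the raw byte values are illustrative assumptions):

```python
value = "ABCDEFG\u00c2\u00a0"  # 'ABCDEFGÂ' plus a non-breaking space

# Raw key bytes as each system might hand them back for the same logical value.
td_key = value.encode("latin-1")  # Teradata Latin session bytes (assumed)
bq_key = value.encode("utf-8")    # BigQuery UTF-8 bytes

# Joining on the untransformed bytes: no match, so the validation reports
# one "source only" row and one "target only" row, both failures.
source = {td_key: "row from TD"}
target = {bq_key: "row from BQ"}
matched = source.keys() & target.keys()
assert matched == set()

# Normalizing both sides to text before joining yields one matching row,
# mirroring what the row hash path already does.
norm_source = {k.decode("latin-1"): v for k, v in source.items()}
norm_target = {k.decode("utf-8"): v for k, v in target.items()}
assert norm_source.keys() == norm_target.keys()
```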
We are exploring custom query options to work around this issue, but ideally DVT would account for the different encodings in primary keys the same way it does in the row hash.
Example value: 'ABCDEFGÂ ' (the final character is a non-breaking space, U+00A0)
The hex code in Teradata is 004100420043004400450046004700C200A0
The hex code in BigQuery is 41424344454647c382c2a0
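Both hex strings can be verified in Python: the BigQuery bytes are the UTF-8 encoding of the value, while the Teradata bytes match UTF-16BE code units (two bytes per character), consistent with Teradata's internal representation of UNICODE character data:

```python
# 'ABCDEFGÂ' followed by U+00A0 (non-breaking space)
value = "ABCDEFG\u00c2\u00a0"

# BigQuery stores strings as UTF-8; this reproduces the reported hex.
assert value.encode("utf-8").hex() == "41424344454647c382c2a0"

# The reported Teradata hex corresponds to UTF-16BE code units
# (two bytes per character).
assert value.encode("utf-16-be").hex() == "004100420043004400450046004700c200a0"
```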
The key differences are in the bytes encoding the final two characters, Â and the non-breaking space.
If this column is not a primary key, the row hash validation succeeds. If the column is a primary key, DVT does not recognize these two rows as the same in the two systems.