Commit 10a9ceb

Merge branch 'develop' into 1389-some-engines-lack-test-for-query-feature
gcf-merge-on-green[bot] authored Jan 21, 2025
2 parents 6565136 + de6dad0 commit 10a9ceb
Showing 3 changed files with 222 additions and 112 deletions.
115 changes: 3 additions & 112 deletions samples/oracle/README.md
@@ -1,114 +1,5 @@
# Data Validation of Oracle BLOB
# Oracle Samples

For row hash validations DVT uses the standard hash function available in each SQL engine. For Oracle that is called `STANDARD_HASH()`. Unfortunately that function is incompatible with Oracle BLOB columns:
- [Oracle LOB validations](lob_validations.md)
- [Horizontally scaling Oracle LOB validations](scaling_lob_validations.md)

```
SQL> SELECT STANDARD_HASH(col_blob,'SHA256') FROM dvt_test.tab_blob;
SELECT STANDARD_HASH(col_blob,'SHA256') FROM dvt_test.tab_blob
*
ERROR at line 1:
ORA-00902: invalid datatype
```

An alternative is to use the Oracle-provided [DBMS_CRYPTO package](https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_CRYPTO.html#GUID-4B200807-A740-4A2E-8828-AC0CFF6127D5). Unfortunately, that package is not created by default and, even when it has been created, is not granted to all users. Therefore it is not feasible to natively integrate this hash function with DVT.

Regardless, this README documents how one might perform row hash validations for Oracle BLOB data using [DVT Custom Query feature](https://github.com/GoogleCloudPlatform/professional-services-data-validator?tab=readme-ov-file#custom-query-row-validations).


## Oracle setup

```
sqlplus sys@localhost:1521/service_name as sysdba
@?/rdbms/admin/dbmsobtk.sql
@?/rdbms/admin/prvtobtk.plb
```

Grant execute on DBMS_CRYPTO to your read-only DVT database user:
```
GRANT execute ON DBMS_CRYPTO TO dvt_user;
```


## DBMS_CRYPTO equivalent of STANDARD_HASH

```
SELECT DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('some binary data'),4)
, STANDARD_HASH(UTL_RAW.CAST_TO_RAW('some binary data'),'SHA256')
FROM dual;
DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('SOMEBINARYDATA'),4)
--------------------------------------------------------------------------------
STANDARD_HASH(UTL_RAW.CAST_TO_RAW('SOMEBINARYDATA'),'SHA256')
----------------------------------------------------------------
DFB1BCF4EAE00791A2CB06899495F46A3FB9E08467EB81499B5122A0917085A8
DFB1BCF4EAE00791A2CB06899495F46A3FB9E08467EB81499B5122A0917085A8
```

NULL handling is different:
- `DBMS_CRYPTO.HASH` raises an exception for a NULL BLOB input
- `STANDARD_HASH` coerces NULLs to empty string

`EMPTY_BLOB()` is handled as empty string by `DBMS_CRYPTO.HASH`:
```
SELECT DBMS_CRYPTO.HASH((EMPTY_BLOB()),4)
, STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256')
FROM dual;
DBMS_CRYPTO.HASH((EMPTY_BLOB()),4)
--------------------------------------------------------------------------------
STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256')
----------------------------------------------------------------
E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
```

The key thing is to understand how `EMPTY_BLOB()` was inserted in the target system and cater for it accordingly.

## Oracle to PostgreSQL validation example

This example assumes BLOB data has been copied to PostgreSQL bytea and that any `EMPTY_BLOB()` values in Oracle became NULL in PostgreSQL.

It also assumes the BLOB data is in a table called `dvt_test.tab_blob`, in a column called `col_blob`.

Oracle query, including the primary key column for matching with target data, and preventing NULLs from reaching DBMS_CRYPTO.HASH():
```
SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob;
```

PostgreSQL query, including the primary key column for matching with source data:
```
SELECT id,encode(sha256(col_blob),'hex') AS col_blob FROM dvt_test.tab_blob;
```

DVT command:
```
data-validation validate custom-query row \
-sc ora_conn -tc pg_conn \
--source-query="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob" \
--target-query="SELECT id,encode(sha256(col_blob),'hex') AS col_blob FROM dvt_test.tab_blob" \
--primary-keys=id \
--comparison-fields=col_blob \
--format=text
```

The hash values from both systems match (output edited to stack the values):
```
source_column_name | col_blob
source_agg_value | dfb1bcf4eae00791a2cb06899495f46a3fb9e08467eb81499b5122a0917085a8
target_agg_value | dfb1bcf4eae00791a2cb06899495f46a3fb9e08467eb81499b5122a0917085a8
validation_status | success
```

# Data Validation of Oracle CLOB

The method above can also be used on CLOB columns with only a small change to
the PostgreSQL `--target-query`, for example:
```
data-validation validate custom-query row \
-sc ora_conn -tc pg_conn \
--source-query="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_clob) = 0 OR col_clob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_clob,4)) END col_clob FROM dvt_test.tab_clob" \
--target-query="SELECT id,encode(sha256(convert_to(col_clob,'UTF8')),'hex') AS col_clob FROM dvt_test.tab_clob" \
--primary-keys=id \
--comparison-fields=col_clob \
--format=text
```
112 changes: 112 additions & 0 deletions samples/oracle/lob_validations.md
@@ -0,0 +1,112 @@
# Data Validation of Oracle BLOB

For row hash validations DVT uses the standard hash function available in each SQL engine. For Oracle that is called `STANDARD_HASH()`. Unfortunately that function is incompatible with Oracle BLOB columns:

```
SQL> SELECT STANDARD_HASH(col_blob,'SHA256') FROM dvt_test.tab_blob;
SELECT STANDARD_HASH(col_blob,'SHA256') FROM dvt_test.tab_blob
*
ERROR at line 1:
ORA-00902: invalid datatype
```

An alternative is to use the Oracle-provided [DBMS_CRYPTO package](https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_CRYPTO.html#GUID-4B200807-A740-4A2E-8828-AC0CFF6127D5). Unfortunately, that package is not created by default and, even when it has been created, is not granted to all users. Therefore it is not feasible to natively integrate this hash function with DVT.

Regardless, this sample documents how one might perform row hash validations for Oracle BLOB data using the [DVT Custom Query feature](https://github.com/GoogleCloudPlatform/professional-services-data-validator?tab=readme-ov-file#custom-query-row-validations).


## Oracle setup

```
sqlplus sys@localhost:1521/service_name as sysdba
@?/rdbms/admin/dbmsobtk.sql
@?/rdbms/admin/prvtobtk.plb
```

Grant execute on DBMS_CRYPTO to your read-only DVT database user:
```sql
GRANT execute ON DBMS_CRYPTO TO dvt_user;
```
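
A quick way to confirm the grant took effect is to run a small hash as the DVT user (a hypothetical check; any literal value will do):
```sql
-- Connected as dvt_user: this should return a 64-character SHA256 hash.
-- An error referencing DBMS_CRYPTO would indicate the grant (or the package
-- installation above) is missing.
SELECT DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('test'),4) FROM dual;
```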


## DBMS_CRYPTO equivalent of STANDARD_HASH

```
SELECT DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('some binary data'),4)
, STANDARD_HASH(UTL_RAW.CAST_TO_RAW('some binary data'),'SHA256')
FROM dual;
DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('SOMEBINARYDATA'),4)
--------------------------------------------------------------------------------
STANDARD_HASH(UTL_RAW.CAST_TO_RAW('SOMEBINARYDATA'),'SHA256')
----------------------------------------------------------------
DFB1BCF4EAE00791A2CB06899495F46A3FB9E08467EB81499B5122A0917085A8
DFB1BCF4EAE00791A2CB06899495F46A3FB9E08467EB81499B5122A0917085A8
```

NULL handling is different, as illustrated below:
- `DBMS_CRYPTO.HASH` raises an exception for a NULL BLOB input
- `STANDARD_HASH` coerces a NULL input to the empty string
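
A minimal illustration of the difference (the exact ORA- error raised for the NULL input may vary by version):
```sql
-- UTL_RAW.CAST_TO_RAW('') is NULL in Oracle, so this is a NULL input;
-- STANDARD_HASH returns the empty-string SHA256 digest (E3B0C442...).
SELECT STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256') FROM dual;

-- DBMS_CRYPTO.HASH raises an exception for a NULL BLOB, which is why the
-- validation queries later in this document wrap it in a CASE expression.
SELECT DBMS_CRYPTO.HASH(TO_BLOB(UTL_RAW.CAST_TO_RAW('')),4) FROM dual;
```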

`EMPTY_BLOB()` is handled as empty string by `DBMS_CRYPTO.HASH`:
```
SELECT DBMS_CRYPTO.HASH((EMPTY_BLOB()),4)
, STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256')
FROM dual;
DBMS_CRYPTO.HASH((EMPTY_BLOB()),4)
--------------------------------------------------------------------------------
STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256')
----------------------------------------------------------------
E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
```

The key thing is to understand how `EMPTY_BLOB()` was inserted in the target system and cater for it accordingly.
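
For example, a quick way to see which case applies in the source data (a hypothetical check against the sample table used in this document):
```sql
-- Rows inserted with EMPTY_BLOB() have zero length; genuinely missing values are NULL.
-- Knowing the split tells you which representation the target system must match.
SELECT COUNT(CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 THEN 1 END) AS empty_blobs
,      COUNT(CASE WHEN col_blob IS NULL THEN 1 END)                 AS null_blobs
FROM   dvt_test.tab_blob;
```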

## Oracle to PostgreSQL validation example

This example assumes BLOB data has been copied to PostgreSQL bytea and that any `EMPTY_BLOB()` values in Oracle became NULL in PostgreSQL.

It also assumes the BLOB data is in a table called `dvt_test.tab_blob`, in a column called `col_blob`.

Oracle query, including the primary key column for matching with target data, and preventing NULLs from reaching `DBMS_CRYPTO.HASH()`:
```sql
SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob;
```

PostgreSQL query, including the primary key column for matching with source data:
```sql
SELECT id,encode(sha256(col_blob),'hex') AS col_blob FROM dvt_test.tab_blob;
```
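
If the migration instead kept Oracle's `EMPTY_BLOB()` values as zero-length `bytea` rather than NULL, a hedged variant of the target query (an assumption, not part of the validated example above) could map them to NULL so that both sides agree:
```sql
-- Hypothetical variant: treat zero-length bytea the same way the Oracle query
-- treats EMPTY_BLOB(), i.e. compare it as NULL.
SELECT id
,      CASE WHEN col_blob IS NULL OR octet_length(col_blob) = 0
            THEN NULL
            ELSE encode(sha256(col_blob),'hex')
       END AS col_blob
FROM   dvt_test.tab_blob;
```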

DVT command:
```sh
data-validation validate custom-query row \
-sc ora_conn -tc pg_conn \
--source-query="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob" \
--target-query="SELECT id,encode(sha256(col_blob),'hex') AS col_blob FROM dvt_test.tab_blob" \
--primary-keys=id \
--comparison-fields=col_blob \
--format=text
```

The hash values from both systems match (output edited to stack the values):
```
source_column_name | col_blob
source_agg_value | dfb1bcf4eae00791a2cb06899495f46a3fb9e08467eb81499b5122a0917085a8
target_agg_value | dfb1bcf4eae00791a2cb06899495f46a3fb9e08467eb81499b5122a0917085a8
validation_status | success
```

The DVT command above can also be used on CLOB columns with only a small change to
the PostgreSQL `--target-query`, for example:
```sh
data-validation validate custom-query row \
-sc ora_conn -tc pg_conn \
--source-query="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_clob) = 0 OR col_clob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_clob,4)) END col_clob FROM dvt_test.tab_clob" \
--target-query="SELECT id,encode(sha256(convert_to(col_clob,'UTF8')),'hex') AS col_clob FROM dvt_test.tab_clob" \
--primary-keys=id \
--comparison-fields=col_clob \
--format=text
```
107 changes: 107 additions & 0 deletions samples/oracle/scaling_lob_validations.md
@@ -0,0 +1,107 @@
# Scaling Out Oracle LOB Validations

This sample shows how row hash validations of Oracle BLOBs can be partitioned to minimize memory requirements and scale processing horizontally. The same principles apply to Oracle CLOBs.

## Test table
Below is the DDL for the 4 million row test table used in this sample.

The contents of the BLOB column are not relevant to this test because we are assessing DVT functionality, and DVT only deals with the hash values, not the column contents. However, bear in mind that queries over large LOB columns may be resource-intensive on the Oracle instance.

```sql
CREATE TABLE dvt_test.tab_blob_4m
( id NUMBER(8) PRIMARY KEY
, col_blob BLOB)
PCTFREE 0;

INSERT INTO dvt_test.tab_blob_4m
SELECT CAST(ROWNUM AS NUMBER(8))
, UTL_RAW.CAST_TO_RAW('DVT')
FROM dual
CONNECT BY ROWNUM <= 2000000;
INSERT INTO dvt_test.tab_blob_4m
SELECT id+2000000, col_blob FROM dvt_test.tab_blob_4m;
COMMIT;
```
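
A quick sanity check of the generated volume (a hypothetical verification step):
```sql
-- Expect 4,000,000 rows: 2,000,000 from the CONNECT BY insert plus the same
-- again from the insert-select that doubles the table.
SELECT COUNT(*) FROM dvt_test.tab_blob_4m;
```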

## Serial test
The commands below are based on [Oracle LOB validations](lob_validations.md).

```sh
Q="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob_4m"

time data-validation validate custom-query row \
-sc ora23_gcp -tc ora23_gcp \
--primary-keys=id \
--source-query="${Q}" \
--target-query="${Q}" \
--comparison-fields=col_blob \
--format=csv > /tmp/col_blob.csv
```

The job took 3 minutes to complete; Linux showed a memory usage delta of approximately 6.5GB and peak CPU usage of 100% of a single CPU thread.
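
One way to capture these figures (an assumption about the measurement method; the original text does not say how the numbers were gathered) is to run the same command under GNU time:
```sh
# /usr/bin/time -v reports "Maximum resident set size" and "Percent of CPU this
# job got", which correspond to the memory delta and single-CPU ceiling noted above.
/usr/bin/time -v data-validation validate custom-query row \
  -sc ora23_gcp -tc ora23_gcp \
  --primary-keys=id \
  --source-query="${Q}" \
  --target-query="${Q}" \
  --comparison-fields=col_blob \
  --format=csv > /tmp/col_blob.csv
```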

## Scaling horizontally
The serial test above shows that we are limited by DVT only being able to use a single CPU thread. It also shows that, for tables containing hundreds of millions of rows, available RAM would become a limiting factor.

### Generate partition YAML files
Even though we must use custom queries for BLOB validations, we can still partition the validation.

This sample splits the validation into 20 partitions, each processing 200,000 rows. Because there is a startup overhead for each task in Cloud Run, `--parts-per-file` has been used to include 2 validations per configuration file, giving 10 configuration files in total:

```sh
BUCKET=dvt-bucket
export PSO_DV_CONN_HOME=gs://${BUCKET}/conn

Q="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob_4m"

data-validation generate-table-partitions \
-sc ora23_gcp \
-tc ora23_gcp \
--primary-keys=id \
--source-query="${Q}" \
--target-query="${Q}" \
--comparison-fields=col_blob \
--format=csv \
--config-dir gs://${BUCKET}/col_blob/config \
--partition-num 20 \
--parts-per-file=2
```

The output shows it took only a couple of seconds to partition the 4 million row table:
```
01/17/2025 05:27:12 PM-INFO: Writing table partition configs to directory: gs://dvt-bucket/col_blob/config
01/17/2025 05:27:13 PM-INFO: Success! Table partition configs written to directory: gs://dvt-bucket/col_blob/config
```
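
Before launching the Cloud Run job it can be worth listing what was generated (a hypothetical check, assuming `gsutil` is installed and authenticated):
```sh
# With --partition-num 20 and --parts-per-file=2 we expect 10 YAML files under a
# sub-directory created by generate-table-partitions for the custom query.
gsutil ls -r gs://${BUCKET}/col_blob/config/
```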

### Execute via Cloud Run
The commands below execute the validation over the 10 configuration files in 4 parallel streams:

```sh
PROJECT=my-project
LOCATION=my-region
REPO=dbbb-dvt
BUCKET=dvt-bucket
NETWORK=net
PSO_DV_CONN_HOME=gs://${BUCKET}/conn
YAML_DIR="gs://${BUCKET}/col_blob/config/custom.caudu"
JOB_NAME="dvt-$(date +'%Y%m%d%H%M%S')"

gcloud run jobs create ${JOB_NAME} \
--project ${PROJECT} --region ${LOCATION} --network=${NETWORK} \
--image ${LOCATION}-docker.pkg.dev/${PROJECT}/${REPO}/dvt:640 \
--tasks 10 --max-retries 1 --parallelism 4 \
--task-timeout=900s --execute-now \
--memory=2Gi \
--set-env-vars PSO_DV_CONN_HOME=${PSO_DV_CONN_HOME} \
--args="configs,run,-kc,-cdir=${YAML_DIR}"

gcloud run jobs executions list --job ${JOB_NAME} \
--project ${PROJECT} --region ${LOCATION}
```

From the Cloud Run console we can see that all tasks completed, with a total elapsed time of around 2 minutes:
```
Execution ID Creation time Tasks End time
------------------------ ------------------------ --------------- ------------------------
dvt-20250117173257-8zspt Jan 17, 2025, 5:32:59 PM 10/10 completed Jan 17, 2025, 5:35:10 PM
```
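
If more detail is needed than the list output, a single execution can be inspected by ID (a hypothetical follow-up, using the execution ID shown above):
```sh
# Shows per-task status, timings and any failure messages for the execution.
gcloud run jobs executions describe dvt-20250117173257-8zspt \
  --project ${PROJECT} --region ${LOCATION}
```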
