Commit 10a9ceb

Merge branch 'develop' into 1389-some-engines-lack-test-for-query-feature
gcf-merge-on-green[bot] authored Jan 21, 2025
2 parents 6565136 + de6dad0 commit 10a9ceb
Showing 3 changed files with 222 additions and 112 deletions.
115 changes: 3 additions & 112 deletions samples/oracle/README.md
@@ -1,114 +1,5 @@
# Data Validation of Oracle BLOB
# Oracle Samples

For row hash validations DVT uses the standard hash function available in each SQL engine. For Oracle that is called `STANDARD_HASH()`. Unfortunately that function is incompatible with Oracle BLOB columns:
- [Oracle LOB validations](lob_validations.md)
- [Horizontally scaling Oracle LOB validations](scaling_lob_validations.md)

```
SQL> SELECT STANDARD_HASH(col_blob,'SHA256') FROM dvt_test.tab_blob;
SELECT STANDARD_HASH(col_blob,'SHA256') FROM dvt_test.tab_blob
*
ERROR at line 1:
ORA-00902: invalid datatype
```

An alternative is to use the Oracle-provided [DBMS_CRYPTO package](https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_CRYPTO.html#GUID-4B200807-A740-4A2E-8828-AC0CFF6127D5). Unfortunately, that package is not created by default and, even when it has been created, is not granted to all users. Therefore it is not feasible to natively integrate this hash function with DVT.

Regardless, this README documents how one might perform row hash validations for Oracle BLOB data using [DVT Custom Query feature](https://github.com/GoogleCloudPlatform/professional-services-data-validator?tab=readme-ov-file#custom-query-row-validations).


## Oracle setup

```
sqlplus sys@localhost:1521/service_name as sysdba
@?/rdbms/admin/dbmsobtk.sql
@?/rdbms/admin/prvtobtk.plb
```

Grant execute on DBMS_CRYPTO to your read-only DVT database user:
```
GRANT execute ON DBMS_CRYPTO TO dvt_user;
```


## DBMS_CRYPTO equivalent of STANDARD_HASH

```
SELECT DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('some binary data'),4)
, STANDARD_HASH(UTL_RAW.CAST_TO_RAW('some binary data'),'SHA256')
FROM dual;
DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('SOMEBINARYDATA'),4)
--------------------------------------------------------------------------------
STANDARD_HASH(UTL_RAW.CAST_TO_RAW('SOMEBINARYDATA'),'SHA256')
----------------------------------------------------------------
DFB1BCF4EAE00791A2CB06899495F46A3FB9E08467EB81499B5122A0917085A8
DFB1BCF4EAE00791A2CB06899495F46A3FB9E08467EB81499B5122A0917085A8
```

NULL handling is different:
- `DBMS_CRYPTO.HASH` raises an exception for a NULL BLOB input
- `STANDARD_HASH` coerces NULLs to empty string

`EMPTY_BLOB()` is handled as empty string by `DBMS_CRYPTO.HASH`:
```
SELECT DBMS_CRYPTO.HASH((EMPTY_BLOB()),4)
, STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256')
FROM dual;
DBMS_CRYPTO.HASH((EMPTY_BLOB()),4)
--------------------------------------------------------------------------------
STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256')
----------------------------------------------------------------
E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
```

The key thing is to understand how `EMPTY_BLOB()` was inserted in the target system and cater for it accordingly.

## Oracle to PostgreSQL validation example

This example assumes BLOB data has been copied to PostgreSQL bytea and that any `EMPTY_BLOB()` values in Oracle became NULL in PostgreSQL.

It also assumes the BLOB data is in a table called `dvt_test.tab_blob`, in a column called `col_blob`.

Oracle query, including the primary key column for matching with target data, and preventing NULLs from reaching DBMS_CRYPTO.HASH():
```
SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob;
```

PostgreSQL query, including the primary key column for matching with source data:
```
SELECT id,encode(sha256(col_blob),'hex') AS col_blob FROM dvt_test.tab_blob;
```

DVT command:
```
data-validation validate custom-query row \
-sc ora_conn -tc pg_conn \
--source-query="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob" \
--target-query="SELECT id,encode(sha256(col_blob),'hex') AS col_blob FROM dvt_test.tab_blob" \
--primary-keys=id \
--comparison-fields=col_blob \
--format=text
```

The hash values from both systems match (output edited to stack the values):
```
source_column_name | col_blob
source_agg_value | dfb1bcf4eae00791a2cb06899495f46a3fb9e08467eb81499b5122a0917085a8
target_agg_value | dfb1bcf4eae00791a2cb06899495f46a3fb9e08467eb81499b5122a0917085a8
validation_status | success
```

# Data Validation of Oracle CLOB

The method above can also be used on CLOB columns with only a small change to
the PostgreSQL `--target-query`, for example:
```
data-validation validate custom-query row \
-sc ora_conn -tc pg_conn \
--source-query="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_clob) = 0 OR col_clob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_clob,4)) END col_clob FROM dvt_test.tab_clob" \
--target-query="SELECT id,encode(sha256(convert_to(col_clob,'UTF8')),'hex') AS col_clob FROM dvt_test.tab_clob" \
--primary-keys=id \
--comparison-fields=col_clob \
--format=text
```
112 changes: 112 additions & 0 deletions samples/oracle/lob_validations.md
@@ -0,0 +1,112 @@
# Data Validation of Oracle BLOB

For row hash validations DVT uses the standard hash function available in each SQL engine. For Oracle that is called `STANDARD_HASH()`. Unfortunately that function is incompatible with Oracle BLOB columns:

```
SQL> SELECT STANDARD_HASH(col_blob,'SHA256') FROM dvt_test.tab_blob;
SELECT STANDARD_HASH(col_blob,'SHA256') FROM dvt_test.tab_blob
*
ERROR at line 1:
ORA-00902: invalid datatype
```

An alternative is to use the Oracle-provided [DBMS_CRYPTO package](https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_CRYPTO.html#GUID-4B200807-A740-4A2E-8828-AC0CFF6127D5). Unfortunately, that package is not created by default and, even when it has been created, is not granted to all users. Therefore it is not feasible to natively integrate this hash function with DVT.

Regardless, this sample documents how one might perform row hash validations for Oracle BLOB data using the [DVT Custom Query feature](https://github.com/GoogleCloudPlatform/professional-services-data-validator?tab=readme-ov-file#custom-query-row-validations).


## Oracle setup

```
sqlplus sys@localhost:1521/service_name as sysdba
@?/rdbms/admin/dbmsobtk.sql
@?/rdbms/admin/prvtobtk.plb
```

Grant execute on DBMS_CRYPTO to your read-only DVT database user:
```sql
GRANT execute ON DBMS_CRYPTO TO dvt_user;
```
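
A quick way to confirm the grant took effect is to run a small hash as the DVT user (a hypothetical check; any literal value will do):
```sql
-- Connected as dvt_user: this should return a 64-character SHA256 hash.
-- An error referencing DBMS_CRYPTO would indicate the grant (or the package
-- installation above) is missing.
SELECT DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('test'),4) FROM dual;
```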


## DBMS_CRYPTO equivalent of STANDARD_HASH

```
SELECT DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('some binary data'),4)
, STANDARD_HASH(UTL_RAW.CAST_TO_RAW('some binary data'),'SHA256')
FROM dual;
DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW('SOMEBINARYDATA'),4)
--------------------------------------------------------------------------------
STANDARD_HASH(UTL_RAW.CAST_TO_RAW('SOMEBINARYDATA'),'SHA256')
----------------------------------------------------------------
DFB1BCF4EAE00791A2CB06899495F46A3FB9E08467EB81499B5122A0917085A8
DFB1BCF4EAE00791A2CB06899495F46A3FB9E08467EB81499B5122A0917085A8
```

NULL handling is different, as illustrated below:
- `DBMS_CRYPTO.HASH` raises an exception for a NULL BLOB input
- `STANDARD_HASH` coerces a NULL input to the empty string
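
A minimal illustration of the difference (the exact ORA- error raised for the NULL input may vary by version):
```sql
-- UTL_RAW.CAST_TO_RAW('') is NULL in Oracle, so this is a NULL input;
-- STANDARD_HASH returns the empty-string SHA256 digest (E3B0C442...).
SELECT STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256') FROM dual;

-- DBMS_CRYPTO.HASH raises an exception for a NULL BLOB, which is why the
-- validation queries later in this document wrap it in a CASE expression.
SELECT DBMS_CRYPTO.HASH(TO_BLOB(UTL_RAW.CAST_TO_RAW('')),4) FROM dual;
```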

`EMPTY_BLOB()` is handled as empty string by `DBMS_CRYPTO.HASH`:
```
SELECT DBMS_CRYPTO.HASH((EMPTY_BLOB()),4)
, STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256')
FROM dual;
DBMS_CRYPTO.HASH((EMPTY_BLOB()),4)
--------------------------------------------------------------------------------
STANDARD_HASH(UTL_RAW.CAST_TO_RAW(''),'SHA256')
----------------------------------------------------------------
E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
```

The key thing is to understand how `EMPTY_BLOB()` was inserted in the target system and cater for it accordingly.
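
For example, a quick way to see which case applies in the source data (a hypothetical check against the sample table used in this document):
```sql
-- Rows inserted with EMPTY_BLOB() have zero length; genuinely missing values are NULL.
-- Knowing the split tells you which representation the target system must match.
SELECT COUNT(CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 THEN 1 END) AS empty_blobs
,      COUNT(CASE WHEN col_blob IS NULL THEN 1 END)                 AS null_blobs
FROM   dvt_test.tab_blob;
```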

## Oracle to PostgreSQL validation example

This example assumes BLOB data has been copied to PostgreSQL bytea and that any `EMPTY_BLOB()` values in Oracle became NULL in PostgreSQL.

It also assumes the BLOB data is in a table called `dvt_test.tab_blob`, in a column called `col_blob`.

Oracle query, including the primary key column for matching with target data, and preventing NULLs from reaching `DBMS_CRYPTO.HASH()`:
```sql
SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob;
```

PostgreSQL query, including the primary key column for matching with source data:
```sql
SELECT id,encode(sha256(col_blob),'hex') AS col_blob FROM dvt_test.tab_blob;
```
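
If the migration instead kept Oracle's `EMPTY_BLOB()` values as zero-length `bytea` rather than NULL, a hedged variant of the target query (an assumption, not part of the validated example above) could map them to NULL so that both sides agree:
```sql
-- Hypothetical variant: treat zero-length bytea the same way the Oracle query
-- treats EMPTY_BLOB(), i.e. compare it as NULL.
SELECT id
,      CASE WHEN col_blob IS NULL OR octet_length(col_blob) = 0
            THEN NULL
            ELSE encode(sha256(col_blob),'hex')
       END AS col_blob
FROM   dvt_test.tab_blob;
```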

DVT command:
```sh
data-validation validate custom-query row \
-sc ora_conn -tc pg_conn \
--source-query="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob" \
--target-query="SELECT id,encode(sha256(col_blob),'hex') AS col_blob FROM dvt_test.tab_blob" \
--primary-keys=id \
--comparison-fields=col_blob \
--format=text
```

The hash values from both systems match (output edited to stack the values):
```
source_column_name | col_blob
source_agg_value | dfb1bcf4eae00791a2cb06899495f46a3fb9e08467eb81499b5122a0917085a8
target_agg_value | dfb1bcf4eae00791a2cb06899495f46a3fb9e08467eb81499b5122a0917085a8
validation_status | success
```

The DVT command above can also be used on CLOB columns with only a small change to
the PostgreSQL `--target-query`, for example:
```sh
data-validation validate custom-query row \
-sc ora_conn -tc pg_conn \
--source-query="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_clob) = 0 OR col_clob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_clob,4)) END col_clob FROM dvt_test.tab_clob" \
--target-query="SELECT id,encode(sha256(convert_to(col_clob,'UTF8')),'hex') AS col_clob FROM dvt_test.tab_clob" \
--primary-keys=id \
--comparison-fields=col_clob \
--format=text
```
107 changes: 107 additions & 0 deletions samples/oracle/scaling_lob_validations.md
@@ -0,0 +1,107 @@
# Scaling Out Oracle LOB Validations

This sample shows how row hash validations of Oracle BLOBs can be partitioned to minimize memory requirements and scale processing horizontally. The same principles apply to Oracle CLOBs.

## Test table
Below is the DDL for the 4 million row test table used in this sample.

The contents of the BLOB column are not relevant to this test because we are assessing DVT functionality, and DVT only deals with the hash values, not the column contents. However, bear in mind that queries over large LOB columns may be resource-intensive on the Oracle instance.

```sql
CREATE TABLE dvt_test.tab_blob_4m
( id NUMBER(8) PRIMARY KEY
, col_blob BLOB)
PCTFREE 0;

INSERT INTO dvt_test.tab_blob_4m
SELECT CAST(ROWNUM AS NUMBER(8))
, UTL_RAW.CAST_TO_RAW('DVT')
FROM dual
CONNECT BY ROWNUM <= 2000000;
INSERT INTO dvt_test.tab_blob_4m
SELECT id+2000000, col_blob FROM dvt_test.tab_blob_4m;
COMMIT;
```
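
A quick sanity check of the generated volume (a hypothetical verification step):
```sql
-- Expect 4,000,000 rows: 2,000,000 from the CONNECT BY insert plus the same
-- again from the insert-select that doubles the table.
SELECT COUNT(*) FROM dvt_test.tab_blob_4m;
```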

## Serial test
The commands below are based on [Oracle LOB validations](lob_validations.md).

```sh
Q="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob_4m"

time data-validation validate custom-query row \
-sc ora23_gcp -tc ora23_gcp \
--primary-keys=id \
--source-query="${Q}" \
--target-query="${Q}" \
--comparison-fields=col_blob \
--format=csv > /tmp/col_blob.csv
```

The job took 3 minutes to complete; Linux showed a memory usage delta of approximately 6.5GB and peak CPU usage of 100% of a single CPU thread.
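
One way to capture these figures (an assumption about the measurement method; the original text does not say how the numbers were gathered) is to run the same command under GNU time:
```sh
# /usr/bin/time -v reports "Maximum resident set size" and "Percent of CPU this
# job got", which correspond to the memory delta and single-CPU ceiling noted above.
/usr/bin/time -v data-validation validate custom-query row \
  -sc ora23_gcp -tc ora23_gcp \
  --primary-keys=id \
  --source-query="${Q}" \
  --target-query="${Q}" \
  --comparison-fields=col_blob \
  --format=csv > /tmp/col_blob.csv
```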

## Scaling horizontally
The serial test above shows that we are limited by DVT only being able to use a single CPU thread. It also shows that, for tables containing hundreds of millions of rows, available RAM would become a limiting factor.

### Generate partition YAML files
Even though we must use custom queries for BLOB validations, we can still partition the validation.

This sample splits the validation into 20 partitions, each processing 200,000 rows. Because there is a startup overhead for each task in Cloud Run, `--parts-per-file` has been used to include 2 validations per configuration file, giving 10 configuration files in total:

```sh
BUCKET=dvt-bucket
export PSO_DV_CONN_HOME=gs://${BUCKET}/conn

Q="SELECT id,CASE WHEN DBMS_LOB.GETLENGTH(col_blob) = 0 OR col_blob IS NULL THEN NULL ELSE LOWER(DBMS_CRYPTO.HASH(col_blob,4)) END col_blob FROM dvt_test.tab_blob_4m"

data-validation generate-table-partitions \
-sc ora23_gcp \
-tc ora23_gcp \
--primary-keys=id \
--source-query="${Q}" \
--target-query="${Q}" \
--comparison-fields=col_blob \
--format=csv \
--config-dir gs://${BUCKET}/col_blob/config \
--partition-num 20 \
--parts-per-file=2
```

The output shows it took only a couple of seconds to partition the 4 million row table:
```
01/17/2025 05:27:12 PM-INFO: Writing table partition configs to directory: gs://dvt-bucket/col_blob/config
01/17/2025 05:27:13 PM-INFO: Success! Table partition configs written to directory: gs://dvt-bucket/col_blob/config
```
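
Before launching the Cloud Run job it can be worth listing what was generated (a hypothetical check, assuming `gsutil` is installed and authenticated):
```sh
# With --partition-num 20 and --parts-per-file=2 we expect 10 YAML files under a
# sub-directory created by generate-table-partitions for the custom query.
gsutil ls -r gs://${BUCKET}/col_blob/config/
```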

### Execute via Cloud Run
The commands below execute the validation over the 10 configuration files in 4 parallel streams:

```sh
PROJECT=my-project
LOCATION=my-region
REPO=dbbb-dvt
BUCKET=dvt-bucket
NETWORK=net
PSO_DV_CONN_HOME=gs://${BUCKET}/conn
YAML_DIR="gs://${BUCKET}/col_blob/config/custom.caudu"
JOB_NAME="dvt-$(date +'%Y%m%d%H%M%S')"

gcloud run jobs create ${JOB_NAME} \
--project ${PROJECT} --region ${LOCATION} --network=${NETWORK} \
--image ${LOCATION}-docker.pkg.dev/${PROJECT}/${REPO}/dvt:640 \
--tasks 10 --max-retries 1 --parallelism 4 \
--task-timeout=900s --execute-now \
--memory=2Gi \
--set-env-vars PSO_DV_CONN_HOME=${PSO_DV_CONN_HOME} \
--args="configs,run,-kc,-cdir=${YAML_DIR}"

gcloud run jobs executions list --job ${JOB_NAME} \
--project ${PROJECT} --region ${LOCATION}
```

From the Cloud Run console we can see that all tasks completed, with a total elapsed time of around 2 minutes:
```
Execution ID Creation time Tasks End time
------------------------ ------------------------ --------------- ------------------------
dvt-20250117173257-8zspt Jan 17, 2025, 5:32:59 PM 10/10 completed Jan 17, 2025, 5:35:10 PM
```
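
If more detail is needed than the list output, a single execution can be inspected by ID (a hypothetical follow-up, using the execution ID shown above):
```sh
# Shows per-task status, timings and any failure messages for the execution.
gcloud run jobs executions describe dvt-20250117173257-8zspt \
  --project ${PROJECT} --region ${LOCATION}
```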
