---
description: >-
  The API documentation for comparing datasets programmatically using Data Diff.
---

# Data Diff

## Creating a new Diff

This method creates a new data diff job asynchronously and returns the id that can later be used to retrieve the diff results.

|                       |                        |
| --------------------- | ---------------------- |
| Endpoint              | `/api/v1/datadiffs`    |
| Method                | `POST`                 |
| Request Content-Type  | `application/json`     |
| Response Content-Type | `application/json`     |

{% tabs %} {% tab title="Request" %} Example request body:

```json
{
    "data_source1_id": 10,
    "data_source2_id": 10,
    "table1": ["df", "prod", "page_views"],
    "query2": "SELECT * FROM df.dev.page_views WHERE date > '2020-01-01'",
    "materialize_dataset1": true,
    "pk_columns": ["organization_id", "user_email"],
    "filter1": "date > '2020-01-01'",
    "sampling_tolerance": 0.0001,
    "sampling_confidence": 99.9,
    "diff_tolerances_per_column": [
        {"column_name": "col_name",
         "tolerance_value": 0.001,
         "tolerance_mode": "absolute"}]
}
```
| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| data_source1_id | required | integer | Id of the data source. You can look it up in the Datafold UI: Settings -> Data Sources -> Connection; it is the last number in the URL. |
| data_source2_id | required | integer | Id of the second data source. It should be the same as data_source1_id, unless you are doing cross-database diffs. |
| table1 | | [string] | Use this table as the first dataset. The table is specified as a list of segments. Names are case-sensitive. |
| query1 | | string | Use the provided query as the first dataset. Either table1 or query1 must be specified. |
| table2 | | [string] | Use this table as the second dataset. |
| query2 | | string | Use the provided query as the second dataset. Either table2 or query2 must be specified. |
| materialize_dataset1 | optional | bool | Force materialization of dataset 1. Useful if table1 is a view that needs to be materialized prior to diffing. Ignored if a query is used as the dataset. |
| materialize_dataset2 | optional | bool | Force materialization of dataset 2. |
| pk_columns | required | [string] | List of primary key columns. Names are case-sensitive. |
| include_columns | optional | [string] | Include only the specified columns, besides the PK columns. |
| exclude_columns | optional | [string] | Exclude these columns from the diff. Useful for excluding data transfer timestamps and the like. |
| time_column | optional | string | Time column to use for the Data Diff time view. |
| filter1 | optional | string | SQL filter for dataset 1. It is injected as-is into WHERE clauses during diffing. |
| filter2 | optional | string | SQL filter for dataset 2. |
| time_travel_point1 | optional | string | Selects a specific version of the table. Snowflake: relative, using a negative integer (e.g. -220) as an offset in seconds; at a point in time, using a timestamp (e.g. 2021-08-13T00:00:00); at a certain statement, using the query id. BigQuery: at a point in time, using a timestamp (e.g. 2021-08-13T00:00:00). See the sketch after the tabs below. |
| time_travel_point2 | optional | string | Same as above, for the second data source. |
| sampling_tolerance | optional | float | Acceptable ratio of differing rows to the total number of rows. Used only in sampling. |
| sampling_confidence | optional | float | Confidence that the number of differing rows is within sampling_tolerance, in percent, e.g. 99.9. |
| diff_tolerances_per_column | optional | [object] | List of objects configuring a tolerance value per column. |
| diff_tolerances_per_column.column_name | required | string | The name of the column the tolerance applies to. The column must be of type integer or float; the tolerance is ignored otherwise. |
| diff_tolerances_per_column.tolerance_value | required | float | The value X applied by the tolerance mode. |
| diff_tolerances_per_column.tolerance_mode | required | string | The tolerance mode to apply. absolute: values match if abs(b - a) < X. relative: values match if abs((b - a) / a) < X. See the example below. |
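To make the two tolerance modes concrete, here is a minimal sketch of the checks they describe; the column values and tolerance below are hypothetical:

```python
# Hypothetical pair of values for the same row and column in the two datasets.
a, b = 100.00, 100.04
X = 0.001  # tolerance_value

# absolute mode: the values match if abs(b - a) < X
absolute_match = abs(b - a) < X        # 0.04 < 0.001 -> False

# relative mode: the values match if abs((b - a) / a) < X
relative_match = abs((b - a) / a) < X  # 0.0004 < 0.001 -> True
```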

{% endtab %}

{% tab title="Response" %} The call returns all data from the request with missing values filled in with either nulls or defaults. In addition to that, it returns the diff id that can be used later to get diff results.

```json
{
  "id": 69004,
  "user_id": 1,
  "data_source1_id": 7,
  "data_source2_id": 7,
  "table1": ["df", "prod", "page_views"],
  "table2": null,
  "query1": null,
  "query2": "SELECT * FROM df.dev.page_views WHERE date > '2020-01-01'",
  "filter1": null,
  "filter2": null,
  "pk_columns": ["col0"],
  "include_columns": [],
  "exclude_columns": [],
  "time_column": null,
  "time_aggregate": null,
  "time_interval_start": null,
  "time_interval_end": null,
  "sampling_tolerance": null,
  "sampling_confidence": null,
  "materialize_dataset1": false,
  "materialize_dataset2": false,
  "done": false,
  "result_statuses": {}
}
```

{% endtab %} {% endtabs %}
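As referenced in the parameter table above, time travel can be combined with materialization, for example to diff a Snowflake table against its own state an hour earlier. The request body below is a hedged sketch (the data source id, table, primary key, and offset are hypothetical), written as a Python dict in the style of the example at the end of this page:

```python
# Hypothetical request: compare df.prod.page_views with its own state
# one hour ago (Snowflake time travel, offset in seconds), forcing
# materialization of the first dataset in case it is a view.
TIME_TRAVEL_REQUEST = {
    'data_source1_id': 10,
    'data_source2_id': 10,
    'table1': ['df', 'prod', 'page_views'],
    'table2': ['df', 'prod', 'page_views'],
    'pk_columns': ['user_id'],
    'materialize_dataset1': True,
    'time_travel_point2': '-3600',
}
```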

## Getting Diff Results

This method retrieves the results of a previously created data diff.

|                       |                                        |
| --------------------- | -------------------------------------- |
| Endpoint              | `/api/v1/datadiffs/<id>/summary_results` |
| Method                | `GET`                                  |
| Response Content-Type | `application/json`                     |

{% tabs %} {% tab title="Response" %} Immediately after the diff is submitted, the status will show that the job is still waiting or running:

```json
{
  "status": "running"
}
```

When the diff task is done, the status will be set to success and additional fields will contain high-level statistics:

```json
{
  "status": "success",
  "schema": {
    "columns_mismatched": [3, 4],
    "column_type_mismatches": 0,
    "column_reorders": 0,
    "column_counts": [20, 21]
  },
  "pks": {
    "total_rows": [17343000, 19294000],
    "nulls": [0, 0],
    "dupes": [0, 0],
    "exclusives": [0, 1951000],
    "distincts": [17343000, 19294000]
  },
  "values": {
    "total_rows": 17343000,
    "rows_with_differences": 13780000,
    "total_values": 277488000,
    "values_with_differences": 19505000,
    "compared_columns": 16,
    "columns_with_differences": 5
  }
}
```
| Field Path | Description |
| --- | --- |
| schema.columns_mismatched | The number of columns across both datasets that didn't match. Datafold matches columns by name and type. |
| schema.column_type_mismatches | The number of columns across both datasets that had a matching name but different types. |
| schema.column_reorders | The approximate number of columns that were reordered (changed ordinal position between datasets). |
| schema.column_counts | Column counts in each dataset that were considered for diffing (after include/exclude filters). |
| pks.total_rows | The total number of rows in each dataset. |
| pks.nulls | The number of NULL primary keys in each dataset. For a composite primary key, if any part of it is NULL, the whole key is counted as NULL. |
| pks.dupes | The number of rows with duplicate primary keys in each dataset. |
| pks.exclusives | The number of primary keys that are exclusive to the first and to the second dataset, respectively. |
| pks.distincts | The number of distinct primary keys in each dataset. |
| values.total_rows | The total number of rows for which value comparison was done. Rows with exclusive, duplicate, and NULL primary keys are excluded from the comparison. |
| values.compared_columns | The number of columns that were compared between the datasets (not counting primary keys). Mismatched columns (different names or types) are not compared. |
| values.total_values | The number of values (individual dataset cells) that were compared, excluding primary keys. |
| values.rows_with_differences | The number of rows with at least one value differing between the datasets. |
| values.columns_with_differences | The number of columns with at least one value differing between the datasets. |
| values.values_with_differences | The number of individual cells across both rows and columns that differ between the datasets. |
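As a hedged illustration of how these fields might be combined in practice, the Python sketch below derives a few ratios from the example response above (the payload is truncated to the fields used here):

```python
# Fields taken from the example summary_results response above.
summary = {
    'values': {
        'total_rows': 17343000,
        'rows_with_differences': 13780000,
        'total_values': 277488000,
        'values_with_differences': 19505000,
    },
}

vals = summary['values']
# Share of compared rows with at least one differing value (~79.5%).
row_diff_ratio = vals['rows_with_differences'] / vals['total_rows']
# Share of compared cells that differ (~7.0%).
value_diff_ratio = vals['values_with_differences'] / vals['total_values']

print(f'{row_diff_ratio:.1%} of rows and {value_diff_ratio:.1%} of values differ')
```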
{% endtab %}

{% tab title="Second Tab" %}

{% endtab %} {% endtabs %}

In case of an error, the status will be set to error and the error field will contain the error description:

```json
{
   "status": "error",
   "error": {
      "error_type": "NoSuchColumnException",
      "error_value": "unique_key"
   }
}
```

## Example: using Data Diff with curl

```bash
$ cat >diff_request.json << EOF
{
  "data_source1_id": 138,
  "data_source2_id": 138,
  "table1": ["datadiff-demo", "public", "taxi_trips"],
  "table2": ["datadiff-demo", "dev", "taxi_trips"],
  "pk_columns": ["trip_id"]
}
EOF

$ curl -d "@diff_request.json" \
     -X POST -H 'Content-Type: application/json' \
     -H 'Authorization: Key YOUR_DATAFOLD_API_KEY' \
     'https://app.datafold.com/api/v1/datadiffs' | json_pp

# Use the diff id returned by the previous request
$ curl -H 'Authorization: Key YOUR_DATAFOLD_API_KEY' \
     'https://app.datafold.com/api/v1/datadiffs/12345/summary_results' | json_pp
```
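Here json_pp is used only to pretty-print the JSON responses and can be omitted. Replace YOUR_DATAFOLD_API_KEY with your own API key, and substitute 12345 with the id field returned by the first call.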

## Example: using Data Diff with Python and the requests library

```python
#!/usr/bin/env python3
import pprint
import time

import requests

HEADERS = {'Authorization': 'Key YOUR_DATAFOLD_API_KEY'}
DIFF_REQUEST = {
    'data_source1_id': 138,
    'data_source2_id': 138,
    'table1': ['datadiff-demo', 'public', 'taxi_trips'],
    'table2': ['datadiff-demo', 'dev', 'taxi_trips'],
    'pk_columns': ['unique_key'],
}

# Create a new Diff task
resp = requests.post(
    'https://app.datafold.com/api/v1/datadiffs',
    headers=HEADERS,
    json=DIFF_REQUEST,
)
diff = resp.json()

# Poll until we get results back
while True:
    resp = requests.get(
        f'https://app.datafold.com/api/v1/datadiffs/{diff["id"]}/'
        'summary_results',
        headers=HEADERS,
    )
    results = resp.json()

    if results['status'] == 'error':
        print('ERROR!', results['error'])
        break
    elif results['status'] == 'success':
        pprint.pprint(results)
        break

    time.sleep(20)
```
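The script polls every 20 seconds, which is a reasonable default for the demo tables used here; for diffs over very large datasets you may want a longer interval or an exponential backoff, plus an overall timeout so the loop cannot run indefinitely.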