Run your `googlecloud-to-neo4j` pipeline locally

Prerequisites

Java 21
Apache Maven
Docker
GCP account with Dataflow enabled
GCS bucket accessible for writes
gcloud CLI
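
Before building, it can help to confirm that the command-line prerequisites are installed. The following is a minimal sketch, not part of this project: the helper name `missing_tools` is hypothetical, and the command names (`java`, `mvn`, `docker`, `gcloud`) are assumed to be how the tools above appear on your PATH. It does not check the Java version, the Dataflow API, or bucket permissions.

```shell
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not provided by this project.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Command names assumed for the prerequisites listed above.
missing=$(missing_tools java mvn docker gcloud)
if [ -n "$missing" ]; then
  printf 'Missing prerequisites:\n%s\n' "$missing"
else
  echo "All command-line prerequisites found."
fi
```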

Building the CLI

First, clone the DataflowTemplates repository, which contains the googlecloud-to-neo4j template:

git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git

NOTE: If you want to align with the template version currently deployed in your GCP region, run the following commands after cloning the DataflowTemplates repository (here the region is set to europe-west8):
tag=$(gsutil ls gs://dataflow-templates-europe-west8/ | grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' | sort -V -r | head -n 1 | cut -d/ -f4)
git -C DataflowTemplates checkout "${tag}"
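
To see what the tag selection does without touching GCS, the same filter can be run on a hand-written listing. This is only a sketch: the bucket entries below are made-up sample data, and `latest_tag` is a hypothetical helper name. It keeps the lines that contain a date, sorts them newest first, and cuts out the folder name (the fourth `/`-separated field).

```shell
# Read `gsutil ls`-style lines on stdin and print the most recent
# date-named folder. (latest_tag is a hypothetical helper name.)
latest_tag() {
  grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' | sort -V -r | head -n 1 | cut -d/ -f4
}

# Made-up listing for illustration; real bucket contents will differ.
printf '%s\n' \
  'gs://dataflow-templates-europe-west8/2024-01-09-00_RC00/' \
  'gs://dataflow-templates-europe-west8/2024-03-12-00_RC00/' \
  'gs://dataflow-templates-europe-west8/latest/' | latest_tag
# prints 2024-03-12-00_RC00
```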

Run the following to install the template into your local Maven repository:

mvn --file DataflowTemplates/pom.xml --also-make --projects v2/googlecloud-to-neo4j install -DskipTests -Djib.skip

Then, go back to this project and run:

mvn package

You should then be able to run:

java -jar target/local-runner-1.0-SNAPSHOT-shaded.jar --help

And see some output similar to:

Usage: local-dataflow [-hV] -b=<bucket> [-i=<checkInterval>] -p=<project>
                      -r=<region> -s=<spec> [-t=<maxTimeout>]
                      [-c=<countQueryChecks>]...
  -b, --bucket=<bucket>     GCS bucket
  -c, --count-query-check=<countQueryChecks>
                            Count query checks (syntax: "<count>:<Cypher count
                              query>" with a single "count" column)
  -h, --help                Show this help message and exit.
  -i, --interval-check-duration=<checkInterval>
                            Execution completion check interval
  -p, --project=<project>   GCP project
  -r, --region=<region>     GCP region
  -s, --spec=<spec>         Path to local googlecloud-to-neo4j spec file
  -t, --max-timeout=<maxTimeout>
                            Execution timeout
  -V, --version             Print version information and exit.

Quick start

For this guide, you will need:

the CLI built locally (see previous section)
your GCP project name
the name of a GCS bucket accessible for writes
a running Docker daemon
Google Application Default Credentials set up

Create a local spec file and save it somewhere (the rest of this guide assumes /path/to/spec.json):

{
  "sources": [
    {
      "type": "text",
      "name": "persons",
      "ordered_field_names": "id",
      "data": [
        ["person0"],
        ["person1"],
        ["person2"],
        ["person3"],
        ["person4"]
      ]
    }
  ],
  "targets": [
    {
      "node": {
        "source": "persons",
        "name": "person import",
        "mode": "merge",
        "mappings": {
          "labels": [
            "\"Person\""
          ],
          "properties": {
            "keys": [
              {"id": "id"}
            ]
          }
        }
      }
    }
  ]
}

If you have not already set up Google authentication through the gcloud CLI, run:

gcloud auth application-default login

Assuming the current location is the root of this project, now run:

java -jar ./target/local-runner-1.0-SNAPSHOT-shaded.jar \
  --project=<YOUR GCP PROJECT> \
  --region=<YOUR GCP REGION> \
  --bucket=<YOUR GCS BUCKET> \
  --spec=/path/to/spec.json

And that's it! A local Neo4j instance is started via Docker and the pipeline runs directly on your machine. All logs go to standard output. Once the execution is done, the container is shut down.

You can also specify Cypher query checks to make sure the data is created in the way you expect:

java -jar ./target/local-runner-1.0-SNAPSHOT-shaded.jar \
  --project=<YOUR GCP PROJECT> \
  --region=<YOUR GCP REGION> \
  --bucket=<YOUR GCS BUCKET> \
  --spec=/path/to/spec.json \
  --count-query-check="5:MATCH (p:Person) RETURN count(p) AS count" \
  --count-query-check="0:MATCH (p:Person) WHERE NOT p.id STARTS WITH 'person' RETURN count(p) AS count"
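
Each check value follows the "<count>:<Cypher count query>" syntax from the help output. Since the query itself may contain colons (as in `p:Person`), the expected count is presumably everything before the first colon. The sketch below illustrates that reading with plain shell parameter expansion. The helper names `check_count` and `check_query` are hypothetical, and this is an assumption about the format, not the CLI's actual parsing code.

```shell
# Split a "<count>:<Cypher count query>" value at the FIRST colon only.
# (Hypothetical helpers; the CLI's real parsing is not shown in this guide.)
check_count() { printf '%s\n' "${1%%:*}"; }  # text before the first colon
check_query() { printf '%s\n' "${1#*:}"; }   # text after the first colon

check='5:MATCH (p:Person) RETURN count(p) AS count'
printf 'expected count: %s\n' "$(check_count "$check")"
printf 'query: %s\n' "$(check_query "$check")"
```

Under this reading, the first check above expects exactly 5 Person nodes, and the second expects no Person node whose id does not start with "person".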
