Initial Spec For Credential Management and SQLAlchemy Database Connectors #1420

Open · sundarshankar89 wants to merge 27 commits into main from feature/credential_manager

Commits (27)
33078aa
init commit
sundarshankar89 Jan 15, 2025
e12db4d
Added Base Connector
sundarshankar89 Jan 15, 2025
7fc59ff
Moved the Abstract class to private
sundarshankar89 Jan 15, 2025
978f8bc
Added pyodbc dependency
sundarshankar89 Jan 15, 2025
39cdc3e
fmt fixes
sundarshankar89 Jan 15, 2025
a9d799f
Merge branch 'main' into feature/credential_manager
sundarshankar89 Jan 15, 2025
d1263b5
Added Vault Manager
sundarshankar89 Jan 16, 2025
0c40eca
Added TODO
sundarshankar89 Jan 17, 2025
d7eed08
Adding credential manager for multiple secret
sundarshankar89 Jan 20, 2025
bd0d012
Added reading credentials from env and then falling back to key itself
sundarshankar89 Jan 20, 2025
c7f91f5
fixed case agnostic connection creation.
sundarshankar89 Jan 20, 2025
1dcff15
Added UT
sundarshankar89 Jan 20, 2025
211944e
fmt fixes
sundarshankar89 Jan 20, 2025
a445aba
initial test case setup
sundarshankar89 Jan 20, 2025
3cb9c05
test case setup
sundarshankar89 Jan 21, 2025
8b1c254
Refactored to better
sundarshankar89 Jan 21, 2025
74030d3
Added Integration Test
sundarshankar89 Jan 24, 2025
29b14be
Added Integration Test
sundarshankar89 Jan 24, 2025
0aa457b
fmt fixes
sundarshankar89 Jan 24, 2025
9e1f7fd
added fixture
sundarshankar89 Jan 24, 2025
ee162b0
Merge branch 'main' into feature/credential_manager
sundarshankar89 Jan 27, 2025
79e3a86
add acceptance (#1428)
sundarshankar89 Jan 28, 2025
8e9dea6
Merge branch 'main' into feature/credential_manager
sundarshankar89 Jan 29, 2025
f44a09e
fmt fixes
sundarshankar89 Jan 29, 2025
355b76d
Simplified installation journey
sundarshankar89 Feb 3, 2025
5570790
Merge branch 'main' into feature/simplified_installation
sundarshankar89 Feb 5, 2025
a0eee07
Merge branch 'feature/simplified_installation' into feature/credentia…
sundarshankar89 Feb 5, 2025
13 changes: 13 additions & 0 deletions .github/scripts/setup_mssql_odbc.sh
@@ -0,0 +1,13 @@
#!/usr/bin/env bash

set -xve
#Repurposed from https://github.com/Yarden-zamir/install-mssql-odbc

curl -sSL -O https://packages.microsoft.com/config/ubuntu/$(grep VERSION_ID /etc/os-release | cut -d '"' -f 2)/packages-microsoft-prod.deb

sudo dpkg -i packages-microsoft-prod.deb
#rm packages-microsoft-prod.deb

sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18

58 changes: 58 additions & 0 deletions .github/workflows/acceptance.yml
@@ -0,0 +1,58 @@
name: acceptance

on:
pull_request:
types: [ opened, synchronize, ready_for_review ]
merge_group:
types: [ checks_requested ]
push:
branches:
- main

permissions:
id-token: write
contents: read
pull-requests: write

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

jobs:
integration:
if: github.event_name == 'pull_request' && github.event.pull_request.draft == false
environment: tool
runs-on: larger
steps:
- name: Checkout Code
uses: actions/checkout@v4
with:
fetch-depth: 0

- name: Install Python
uses: actions/setup-python@v5
with:
cache: 'pip'
cache-dependency-path: '**/pyproject.toml'
python-version: '3.10'

- name: Install hatch
run: pip install hatch==1.9.4

- name: Install MSSQL ODBC Driver
run: |
chmod +x $GITHUB_WORKSPACE/.github/scripts/setup_mssql_odbc.sh
$GITHUB_WORKSPACE/.github/scripts/setup_mssql_odbc.sh

- name: Run integration tests
uses: databrickslabs/sandbox/acceptance@acceptance/v0.4.2
with:
vault_uri: ${{ secrets.VAULT_URI }}
directory: ${{ github.workspace }}
timeout: 2h
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
TEST_ENV: 'ACCEPTANCE'

4 changes: 2 additions & 2 deletions Makefile
@@ -17,10 +17,10 @@ fmt:
setup_spark_remote:
.github/scripts/setup_spark_remote.sh

test: setup_spark_remote
test:
hatch run test

integration:
integration: setup_spark_remote
hatch run integration

coverage:
10 changes: 7 additions & 3 deletions labs.yml
@@ -3,9 +3,7 @@ name: remorph
description: Code Transpiler and Data Reconciliation tool for Accelerating Data onboarding to Databricks from EDW, CDW and other ETL sources.
install:
min_runtime_version: 13.3
require_running_cluster: false
require_databricks_connect: false
script: src/databricks/labs/remorph/install.py
script: src/databricks/labs/remorph/base_install.py
uninstall:
script: src/databricks/labs/remorph/uninstall.py
entrypoint: src/databricks/labs/remorph/cli.py
@@ -66,3 +64,9 @@ commands:
description: Utility to setup Scope and Secrets on Databricks Workspace
- name: debug-me
description: "[INTERNAL] Debug SDK connectivity"
- name: install-assessment
description: "Install Assessment"
- name: install-transpile
description: "Install Transpile"
- name: install-reconcile
description: "Install Reconcile"
6 changes: 5 additions & 1 deletion pyproject.toml
@@ -18,7 +18,10 @@ dependencies = [
"databricks-labs-blueprint[yaml]>=0.2.3",
"databricks-labs-lsql>=0.7.5,<0.14.0", # TODO: Limit the LSQL version until dependencies are correct.
"cryptography>=41.0.3",
"pyodbc",
"SQLAlchemy",
"pygls>=2.0.0a2",

]

[project.urls]
@@ -50,6 +53,7 @@ dependencies = [
"pytest",
"pytest-cov>=5.0.0,<6.0.0",
"pytest-asyncio>=0.24.0",
"pytest-xdist~=3.5.0",
"black>=23.1.0",
"ruff>=0.0.243",
"databricks-connect==15.1",
@@ -66,7 +70,7 @@ reconcile = "databricks.labs.remorph.reconcile.execute:main"
[tool.hatch.envs.default.scripts]
test = "pytest --cov src --cov-report=xml tests/unit"
coverage = "pytest --cov src tests/unit --cov-report=html"
integration = "pytest --cov src tests/integration --durations 20"
integration = "pytest --cov src tests/integration --durations 20 --ignore=tests/integration/connections"
fmt = ["black .",
"ruff check . --fix",
"mypy --disable-error-code 'annotation-unchecked' .",
9 changes: 9 additions & 0 deletions src/databricks/labs/remorph/base_install.py
@@ -0,0 +1,9 @@
import logging

from databricks.labs.blueprint.entrypoint import get_logger, is_in_debug

if __name__ == "__main__":
logger = get_logger(__file__)
logger.setLevel("INFO")
if is_in_debug():
logger.getLogger("databricks").setLevel(logger.setLevel("DEBUG"))

logger.info("Successfully Setup Remorph Components Locally")
48 changes: 48 additions & 0 deletions src/databricks/labs/remorph/cli.py
@@ -8,6 +8,8 @@
from databricks.labs.remorph.config import TranspileConfig
from databricks.labs.remorph.contexts.application import ApplicationContext
from databricks.labs.remorph.helpers.recon_config_utils import ReconConfigPrompts
from databricks.labs.remorph.__about__ import __version__
from databricks.labs.remorph.install import WorkspaceInstaller
from databricks.labs.remorph.reconcile.runner import ReconcileRunner
from databricks.labs.remorph.lineage import lineage_generator
from databricks.labs.remorph.transpiler.execute import transpile as do_transpile
@@ -34,6 +36,32 @@ def raise_validation_exception(msg: str) -> Exception:
proxy_command(remorph, "debug-bundle")


def _installer(ws: WorkspaceClient) -> WorkspaceInstaller:
app_context = ApplicationContext(_verify_workspace_client(ws))
return WorkspaceInstaller(
app_context.workspace_client,
app_context.prompts,
app_context.installation,
app_context.install_state,
app_context.product_info,
app_context.resource_configurator,
app_context.workspace_installation,
)


def _verify_workspace_client(ws: WorkspaceClient) -> WorkspaceClient:
"""
[Private] Verifies and updates the workspace client configuration.
"""

    # Use reflection to set the right _product_info value for telemetry
product_info = getattr(ws.config, '_product_info')
if product_info[0] != "remorph":
setattr(ws.config, '_product_info', ('remorph', __version__))

return ws


@remorph.command
def transpile(
w: WorkspaceClient,
@@ -168,5 +196,25 @@ def configure_secrets(w: WorkspaceClient):
recon_conf.prompt_and_save_connection_details()


@remorph.command(is_unauthenticated=True)
def install_assessment():
"""Install the Remorph Assessment package"""
raise NotImplementedError("Assessment package is not available yet.")


@remorph.command()
def install_transpile(w: WorkspaceClient):
"""Install the Remorph Transpile package"""
installer = _installer(w)
installer.run(module="transpile")


@remorph.command(is_unauthenticated=False)
def install_reconcile(w: WorkspaceClient):
"""Install the Remorph Reconcile package"""
installer = _installer(w)
installer.run(module="reconcile")


if __name__ == "__main__":
remorph()
49 changes: 49 additions & 0 deletions src/databricks/labs/remorph/connections/credential_manager.py
@@ -0,0 +1,49 @@
from pathlib import Path
import logging
import yaml

from databricks.labs.blueprint.wheels import ProductInfo
from databricks.labs.remorph.connections.env_getter import EnvGetter

logger = logging.getLogger(__name__)


class Credentials:
def __init__(self, product_info: ProductInfo, env: EnvGetter) -> None:
self._product_info = product_info
self._env = env
self._credentials: dict[str, str] = self._load_credentials(self._get_local_version_file_path())

def _get_local_version_file_path(self) -> Path:
        user_home = Path.home()
        return user_home / ".databricks" / "labs" / self._product_info.product_name() / "credentials.yml"

def _load_credentials(self, file_path: Path) -> dict[str, str]:
with open(file_path, encoding="utf-8") as f:
return yaml.safe_load(f)

    def load(self, source: str) -> dict[str, str]:
        value = self._credentials.get(source)
        if isinstance(value, dict):
            return {k: self._get_secret_value(v) for k, v in value.items()}
        raise KeyError(f"Credentials for source system '{source}' not found in credentials.yml")

def _get_secret_value(self, key: str) -> str:
secret_vault_type = self._credentials.get('secret_vault_type', 'local').lower()
if secret_vault_type == 'local':
return key
if secret_vault_type == 'env':
try:
value = self._env.get(str(key)) # Port numbers can be int
except KeyError:
logger.debug(f"Environment variable {key} not found Failing back to actual string value")
return key
return value

if secret_vault_type == 'databricks':
raise NotImplementedError("Databricks secret vault not implemented")

raise ValueError(f"Unsupported secret vault type: {secret_vault_type}")
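
Note (not part of the diff): the layout of credentials.yml is not documented in this PR, so the following is a minimal, hypothetical sketch of the structure that Credentials.load() and _get_secret_value() appear to expect. The resolution logic is mirrored inline rather than imported so the snippet runs standalone; every key and value under mssql is an illustrative assumption.

import yaml

# Hypothetical credentials.yml contents; key names under "mssql" are illustrative.
EXAMPLE_CREDENTIALS_YML = """
secret_vault_type: env   # supported: local, env (databricks is not implemented yet)
mssql:
  user: MSSQL_USER       # with 'env', each value is treated as an environment variable name
  password: MSSQL_PASSWORD
  server: example.database.windows.net
  database: sales_db
  driver: ODBC Driver 18 for SQL Server
"""

creds = yaml.safe_load(EXAMPLE_CREDENTIALS_YML)


def resolve(value: str, env: dict[str, str], vault_type: str) -> str:
    """Mirrors Credentials._get_secret_value: env lookup, falling back to the literal value."""
    if vault_type == "env":
        return env.get(str(value), str(value))
    return str(value)  # 'local' keeps the value exactly as written in the file


fake_env = {"MSSQL_USER": "app_user", "MSSQL_PASSWORD": "s3cr3t"}
resolved = {k: resolve(v, fake_env, creds["secret_vault_type"]) for k, v in creds["mssql"].items()}
print(resolved["user"], resolved["server"])  # app_user example.database.windows.net

Running the actual Credentials class against such a file differs only in that it reads ~/.databricks/labs/<product>/credentials.yml and takes its environment from EnvGetter.
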
88 changes: 88 additions & 0 deletions src/databricks/labs/remorph/connections/database_manager.py
@@ -0,0 +1,88 @@
import logging
from abc import ABC, abstractmethod
from typing import Any

from sqlalchemy import create_engine
from sqlalchemy.engine import Engine, Result, URL
from sqlalchemy.orm import sessionmaker
from sqlalchemy import text
from sqlalchemy.exc import OperationalError

logger = logging.getLogger(__name__)


class _ISourceSystemConnector(ABC):
@abstractmethod
def _connect(self) -> Engine:
pass

@abstractmethod
def execute_query(self, query: str) -> Result[Any]:
pass


class _BaseConnector(_ISourceSystemConnector):
def __init__(self, config: dict[str, Any]):
self.config = config
self.engine: Engine = self._connect()

def _connect(self) -> Engine:
raise NotImplementedError("Subclasses should implement this method")

def execute_query(self, query: str) -> Result[Any]:
if not self.engine:
raise ConnectionError("Not connected to the database.")
try:
            session_factory = sessionmaker(bind=self.engine)
            session = session_factory()
            return session.execute(text(query))
        except OperationalError:
            raise ConnectionError("Error connecting to the database; check credentials") from None


def _create_connector(db_type: str, config: dict[str, Any]) -> _ISourceSystemConnector:
connectors = {
"snowflake": SnowflakeConnector,
"mssql": MSSQLConnector,
"tsql": MSSQLConnector,
"synapse": MSSQLConnector,
}

connector_class = connectors.get(db_type.lower())

if connector_class is None:
raise ValueError(f"Unsupported database type: {db_type}")

return connector_class(config)


class SnowflakeConnector(_BaseConnector):
def _connect(self) -> Engine:
raise NotImplementedError("Snowflake connector not implemented")


class MSSQLConnector(_BaseConnector):
def _connect(self) -> Engine:
query_params = {"driver": self.config['driver']}

for key, value in self.config.items():
if key not in ["user", "password", "server", "database", "port"]:
query_params[key] = value
connection_string = URL.create(
"mssql+pyodbc",
username=self.config['user'],
password=self.config['password'],
host=self.config['server'],
port=self.config.get('port', 1433),
database=self.config['database'],
query=query_params,
)
return create_engine(connection_string)


class DatabaseManager:
def __init__(self, db_type: str, config: dict[str, Any]):
self.connector = _create_connector(db_type, config)

def execute_query(self, query: str) -> Result[Any]:
return self.connector.execute_query(query)
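
Note (not part of the diff): a usage sketch of the new DatabaseManager with the MSSQL connector. Host, database, and credential values are placeholders; the driver name assumes the msodbcsql18 package that the new workflow script installs.

from databricks.labs.remorph.connections.database_manager import DatabaseManager

# Placeholder values; MSSQLConnector builds an mssql+pyodbc URL from these keys.
config = {
    "user": "app_user",
    "password": "s3cr3t",
    "server": "example.database.windows.net",
    "database": "sales_db",
    "port": 1433,                               # optional, defaults to 1433
    "driver": "ODBC Driver 18 for SQL Server",  # matches the msodbcsql18 install
}

db = DatabaseManager("mssql", config)           # "tsql" and "synapse" resolve to the same connector
result = db.execute_query("SELECT 1 AS probe")  # returns a SQLAlchemy Result; raises ConnectionError on failure
print(result.scalar())                          # -> 1 when the server is reachable

Since create_engine is lazy, constructing DatabaseManager does not open a connection; the database is only contacted when execute_query runs.
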
24 changes: 24 additions & 0 deletions src/databricks/labs/remorph/connections/env_getter.py
@@ -0,0 +1,24 @@
import os
import json
import logging


class EnvGetter:
def __init__(self, is_debug: bool = False):
Reviewer comment (Contributor): Why not read debug-env.json when the app starts and set env variables accordingly (using os.environ['my_name'] = 'my_value')?

self.env = self._get_debug_env() if is_debug else dict(os.environ)

def get(self, key: str) -> str:
if key in self.env:
return self.env[key]
raise KeyError(f"not in env: {key}")

def _get_debug_env(self) -> dict:
try:
debug_env_file = f"{os.path.expanduser('~')}/.databricks/debug-env.json"
with open(debug_env_file, 'r', encoding='utf-8') as file:
contents = file.read()
logging.debug(f"Found debug env file: {debug_env_file}")
raw = json.loads(contents)
return raw.get("ucws", {})
except FileNotFoundError:
return dict(os.environ)
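
Note (not part of the diff): a quick sketch of EnvGetter's two modes; the environment variable name is illustrative.

import os

from databricks.labs.remorph.connections.env_getter import EnvGetter

os.environ["MSSQL_PASSWORD"] = "s3cr3t"  # illustrative variable
env = EnvGetter()                        # non-debug: a thin wrapper over os.environ
print(env.get("MSSQL_PASSWORD"))         # -> s3cr3t

try:
    env.get("NOT_SET")
except KeyError as err:
    print(err)                           # -> 'not in env: NOT_SET'

# Debug mode reads the "ucws" section of ~/.databricks/debug-env.json when the file
# exists, and falls back to os.environ when it does not.
debug_env = EnvGetter(is_debug=True)

Credentials._get_secret_value relies on the KeyError shown above to fall back to the literal value from credentials.yml.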