Azure HDInsight (#24)
* Cleanup region azure

* mend
mwiewior authored Nov 15, 2022
1 parent c99b44f commit 2e5c325
Showing 20 changed files with 406 additions and 14 deletions.
60 changes: 59 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -24,6 +24,7 @@ Table of Contents
* [EMR Serverless](#emr-serverless)
* [Azure](#azure-1)
* [Login](#login)
* [HDInsight](#hdinsight)
* [AKS](#aks)
* [GCP](#gcp-1)
* [Login](#login-2)
@@ -67,6 +68,7 @@ as well. Check code comments for details.
| GCP | Dataproc |2.0.27-ubuntu18| 3.1.3 | 1.0.0 | 0.3.3 | -|
| GCP | Dataproc Serverless|1.0.21| 3.2.2 | 1.1.0 | 0.4.1 | gcr.io/${TF_VAR_project_name}/spark-py:pysequila-0.4.1-dataproc-latest |
| Azure | AKS |1.23.12|3.2.2|1.1.0|0.4.1| docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest|
| Azure | HDInsight| 5.0.300.1 | 3.2.2 | 1.1.0 | 0.4.1 |- |
| AWS | EKS|1.23.9 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-eks-latest|
| AWS | EMR Serverless|emr-6.7.0 | 3.2.1 | 1.1.0 | 0.4.1 |- |

@@ -118,8 +120,10 @@ terraform init

## Using SeQuiLa cli Docker image for Azure
```bash
export TF_VAR_region=westeurope
docker pull biodatageeks/sequila-cloud-cli:latest
docker run --rm -it \
-e TF_VAR_region=${TF_VAR_region} \
-e TF_VAR_pysequila_version=${TF_VAR_pysequila_version} \
-e TF_VAR_sequila_version=${TF_VAR_sequila_version} \
-e TF_VAR_pysequila_image_aks=${TF_VAR_pysequila_image_aks} \
@@ -162,7 +166,7 @@ terraform init

## Azure
* [AKS (Azure Kubernetes Service)](#AKS): :white_check_mark:

* [HDInsight](#hdinsight): :white_check_mark:
## AWS
* [EMR Serverless](#emr-serverless): :white_check_mark:
* [EKS(Elastic Kubernetes Service)](#EKS): :white_check_mark:
@@ -277,6 +281,60 @@ az login
az account set --subscription "Azure subscription 1"
```
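
Before deploying, it can be worth confirming that the intended subscription is the active one. The following is a minimal sketch, assuming the subscription name used above; `require_subscription` is a hypothetical helper (not part of this repository) built on `az account show`.

```bash
# Hypothetical helper: succeed only if the active Azure subscription has the
# expected name.
require_subscription() {
  local expected="$1" active
  # Ask the Azure CLI for the name of the currently selected subscription.
  active=$(az account show --query name --output tsv) || return 1
  if [ "$active" != "$expected" ]; then
    echo "active subscription is '$active', expected '$expected'" >&2
    return 1
  fi
}
```

For example, `require_subscription "Azure subscription 1"` could be run as a guard before `terraform apply`.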

## HDInsight
:bulb: According to the [release notes](https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-50-component-versioning?source=recommendations),
HDInsight 5.0 comes with Apache Spark 3.1.2. Unfortunately, the deployed version is actually 3.0.2:

![img.png](doc/images/hdinsight-spark-version.png)

Since HDInsight is in fact a full-fledged Hadoop cluster, we were able to add support for Apache Spark 3.2.2 to the Terraform module using
the [script action](https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux) mechanism.


### Deploy
```bash
export TF_VAR_hdinsight_gateway_password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16 ; echo '')
export TF_VAR_hdinsight_ssh_password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16 ; echo '')
terraform apply -var-file=../../env/azure.tfvars -var-file=../../env/azure-hdinsight.tfvars -var-file=../../env/_all.tfvars
```
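
The one-liners above draw only from `A-Za-z0-9`, so a generated password may occasionally lack a digit or an uppercase letter. The following is a minimal sketch, assuming you want every password to contain at least one lowercase letter, one uppercase letter, and one digit. `gen_password` is a hypothetical helper that is not part of this repository, and the password policy actually enforced by the cluster is not stated here.

```bash
# Hypothetical helper (not part of this repository): print a random
# alphanumeric password containing at least one lowercase letter, one
# uppercase letter, and one digit.
gen_password() {
  local length="${1:-16}" candidate
  if [ "$length" -lt 3 ]; then
    echo "gen_password: length must be at least 3" >&2
    return 1
  fi
  while :; do
    # Read a bounded chunk of random bytes, keep only alphanumerics, and
    # truncate to the requested length. LC_ALL=C makes tr treat the input
    # as raw bytes, which can avoid "Illegal byte sequence" errors on macOS.
    candidate=$(head -c "$((length * 16))" /dev/urandom \
      | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c "1-${length}")
    # Retry if the chunk was too short or a character class is missing.
    [ "${#candidate}" -eq "$length" ] || continue
    case $candidate in *[[:lower:]]*) ;; *) continue ;; esac
    case $candidate in *[[:upper:]]*) ;; *) continue ;; esac
    case $candidate in *[[:digit:]]*) ;; *) continue ;; esac
    printf '%s\n' "$candidate"
    return 0
  done
}

# Same variables as in the deploy step, set through the helper.
export TF_VAR_hdinsight_gateway_password="$(gen_password 16)"
export TF_VAR_hdinsight_ssh_password="$(gen_password 16)"
```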
Check the Terraform output variables for the SSH connection string, the credentials, and the `spark-submit` command, e.g.:
```bash
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:

hdinsight_gateway_password = "w8aN6oVSJobq7eu4"
hdinsight_ssh_password = "wun6RzBBPWD9z9ke"
pysequila_submit_command = <<EOT
export SPARK_HOME=/opt/spark
spark-submit \
--master yarn \
--packages org.biodatageeks:sequila_2.12:1.1.0 \
--conf spark.pyspark.python=/usr/bin/miniforge/envs/py38/bin/python3 \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=3g \
--conf spark.executor.instances=1 \
--conf spark.files=wasb://<container>@<storage_account>.blob.core.windows.net/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta,wasb://<container>@<storage_account>.blob.core.windows.net/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai \
wasb://<container>@<storage_account>.blob.core.windows.net/jobs/pysequila/sequila-pileup.py
EOT
ssh_command = "ssh <user>@<host>"
```
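
Rather than copying values out of the `terraform apply` log, individual outputs can be read back with `terraform output -raw <name>`, which prints a string output without quotes. The sketch below wraps that call in a hypothetical helper, `tf_output`, and treats the `"No HDInsight setup."` placeholder (the fallback defined in `cloud/azure/output.tf`) as an error. It assumes it is run from the same directory as `terraform apply`.

```bash
# Hypothetical helper: print one Terraform output as a raw string and fail
# when the HDInsight module was not deployed.
tf_output() {
  local name="$1" value
  value=$(terraform output -raw "$name") || return 1
  # output.tf falls back to this placeholder when module.hdinsight is absent.
  if [ "$value" = "No HDInsight setup." ]; then
    echo "tf_output: '$name' is unavailable (HDInsight not deployed)" >&2
    return 1
  fi
  printf '%s\n' "$value"
}
```

For example, `tf_output ssh_command` prints the connection string, and `tf_output pysequila_submit_command > submit.sh` saves the submit command to a file.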

### Run
1. Use `ssh_command` and `hdinsight_ssh_password` to connect to the head node.
2. Run the command from the `pysequila_submit_command` output.

![img.png](doc/images/hdinsight-job.png)
![img.png](doc/images/hdinsight-job-2.png)
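
The two steps can also be scripted from the machine where Terraform ran. This is a sketch only, assuming `ssh_command` is a plain `ssh <user>@<host>` invocation and that `bash` is available on the head node; `run_pysequila_job` is a hypothetical name. The submit command is streamed over standard input to a remote shell, while `ssh` prompts for `hdinsight_ssh_password` on the terminal.

```bash
# Hypothetical helper: run the generated spark-submit command on the head node.
run_pysequila_job() {
  local ssh_cmd submit_cmd
  ssh_cmd=$(terraform output -raw ssh_command) || return 1
  submit_cmd=$(terraform output -raw pysequila_submit_command) || return 1
  # Refuse to continue if the output is not an ssh invocation, for example
  # when it holds the "No HDInsight setup." placeholder.
  case $ssh_cmd in
    "ssh "*) ;;
    *) echo "run_pysequila_job: unexpected ssh_command: $ssh_cmd" >&2; return 1 ;;
  esac
  # ssh_cmd is intentionally unquoted so that "ssh user@host" splits into words.
  # shellcheck disable=SC2086
  printf '%s\n' "$submit_cmd" | $ssh_cmd 'bash -s'
}
```

Streaming the command over standard input avoids quoting the multi-line `spark-submit` invocation on the local command line.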

### Cleanup
```bash
terraform destroy -var-file=../../env/azure.tfvars -var-file=../../env/azure-hdinsight.tfvars -var-file=../../env/_all.tfvars
```

## AKS
### Deploy

1 change: 1 addition & 0 deletions cloud/aws/main.tf
@@ -39,6 +39,7 @@ module "emr-job" {
source = "../../modules/aws/emr-serverless"
aws-emr-release = var.aws-emr-release
bucket = module.storage.bucket
spark_version = var.spark_version
pysequila_version = var.pysequila_version
sequila_version = var.sequila_version
data_files = [for f in var.data_files : "s3://${module.storage.bucket}/data/${f}" if length(regexall("fasta", f)) > 0]
2 changes: 1 addition & 1 deletion cloud/aws/variables.tf
@@ -55,4 +55,4 @@ variable "eks_preemptible" {
type = bool
default = true
description = "Enable preemptible (spot) instances in a Kubernetes pool"
}
}
12 changes: 10 additions & 2 deletions cloud/azure/README.md
@@ -19,6 +19,7 @@ No providers.
| <a name="module_aks"></a> [aks](#module\_aks) | ../../modules/azure/aks | n/a |
| <a name="module_azure-resources"></a> [azure-resources](#module\_azure-resources) | ../../modules/azure/resource-mgmt | n/a |
| <a name="module_azure-staging-blob"></a> [azure-staging-blob](#module\_azure-staging-blob) | ../../modules/azure/jobs-code | n/a |
| <a name="module_hdinsight"></a> [hdinsight](#module\_hdinsight) | ../../modules/azure/hdinsight | n/a |
| <a name="module_spark-on-k8s-operator-aks"></a> [spark-on-k8s-operator-aks](#module\_spark-on-k8s-operator-aks) | ../../modules/kubernetes/spark-on-k8s-operator | n/a |

## Resources
@@ -30,15 +31,22 @@ No resources.
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_azure-aks-deploy"></a> [azure-aks-deploy](#input\_azure-aks-deploy) | Deploy AKS cluster | `bool` | `false` | no |
| <a name="input_azure-hdinsight-deploy"></a> [azure-hdinsight-deploy](#input\_azure-hdinsight-deploy) | Deploy HDInsight cluster | `bool` | `false` | no |
| <a name="input_data_files"></a> [data\_files](#input\_data\_files) | Data files to copy to staging bucket | `list(string)` | n/a | yes |
| <a name="input_hdinsight_gateway_password"></a> [hdinsight\_gateway\_password](#input\_hdinsight\_gateway\_password) | Hadoop gateway password (i.e. Ambari, YARN UI console, etc) | `string` | `null` | no |
| <a name="input_hdinsight_ssh_password"></a> [hdinsight\_ssh\_password](#input\_hdinsight\_ssh\_password) | SSH password to all nodes in the cluster | `string` | `null` | no |
| <a name="input_pysequila_image_aks"></a> [pysequila\_image\_aks](#input\_pysequila\_image\_aks) | AKS PySeQuiLa image | `string` | n/a | yes |
| <a name="input_pysequila_version"></a> [pysequila\_version](#input\_pysequila\_version) | PySeQuiLa version | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | Location of the cluster | `string` | n/a | yes |
| <a name="input_sequila_version"></a> [sequila\_version](#input\_sequila\_version) | SeQuiLa version | `string` | n/a | yes |
| <a name="input_spark_version"></a> [spark\_version](#input\_spark\_version) | Apache Spark version | `string` | `"3.2.2"` | no |
| <a name="input_zone"></a> [zone](#input\_zone) | Zone of the cluster | `string` | n/a | yes |

## Outputs

No outputs.
| Name | Description |
|------|-------------|
| <a name="output_hdinsight_gateway_password"></a> [hdinsight\_gateway\_password](#output\_hdinsight\_gateway\_password) | n/a |
| <a name="output_hdinsight_ssh_password"></a> [hdinsight\_ssh\_password](#output\_hdinsight\_ssh\_password) | n/a |
| <a name="output_pysequila_submit_command"></a> [pysequila\_submit\_command](#output\_pysequila\_submit\_command) | n/a |
| <a name="output_ssh_command"></a> [ssh\_command](#output\_ssh\_command) | n/a |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
23 changes: 21 additions & 2 deletions cloud/azure/main.tf
@@ -3,7 +3,7 @@
module "azure-resources" {
source = "../../modules/azure/resource-mgmt"
region = var.region
count = var.azure-aks-deploy ? 1 : 0
count = (var.azure-aks-deploy || var.azure-hdinsight-deploy) ? 1 : 0
}

module "azure-staging-blob" {
@@ -16,7 +16,26 @@ module "azure-staging-blob" {
pysequila_version = var.pysequila_version
sequila_version = var.sequila_version
pysequila_image_aks = var.pysequila_image_aks
count = var.azure-aks-deploy ? 1 : 0
count = (var.azure-aks-deploy || var.azure-hdinsight-deploy) ? 1 : 0
}

#### Azure HDInsight
module "hdinsight" {
depends_on = [module.azure-staging-blob]
source = "../../modules/azure/hdinsight"
region = var.region
resource_group = module.azure-resources[0].resource_group
storage_container_id = module.azure-resources[0].storage_container_id
storage_account_access_key = module.azure-resources[0].storage_account_access_key
storage_container_name = module.azure-resources[0].azurerm_storage_container
storage_account_name = module.azure-resources[0].storage_account
pysequila_version = var.pysequila_version
sequila_version = var.sequila_version
node_ssh_password = var.hdinsight_ssh_password
gateway_password = var.hdinsight_gateway_password
data_files = [for f in var.data_files : "wasb://sequila@${module.azure-resources[0].storage_account}.blob.core.windows.net/data/${f}" if length(regexall("fasta", f)) > 0]
count = var.azure-hdinsight-deploy ? 1 : 0

}


15 changes: 15 additions & 0 deletions cloud/azure/output.tf
@@ -0,0 +1,15 @@
output "hdinsight_gateway_password" {
value = try(module.hdinsight[0].hdinsight_gateway_password, "No HDInsight setup.")
}

output "hdinsight_ssh_password" {
value = try(module.hdinsight[0].hdinsight_ssh_password, "No HDInsight setup.")
}

output "ssh_command" {
value = try(module.hdinsight[0].ssh_command, "No HDInsight setup.")
}

output "pysequila_submit_command" {
value = try(module.hdinsight[0].pysequila_submit_command, "No HDInsight setup.")
}
23 changes: 18 additions & 5 deletions cloud/azure/variables.tf
@@ -23,11 +23,6 @@ variable "region" {
description = "Location of the cluster"
}

variable "zone" {
type = string
description = "Zone of the cluster"
}

variable "data_files" {
type = list(string)
description = "Data files to copy to staging bucket"
@@ -37,4 +32,22 @@ variable "azure-aks-deploy" {
type = bool
default = false
description = "Deploy AKS cluster"
}

variable "azure-hdinsight-deploy" {
type = bool
default = false
description = "Deploy HDInsight cluster"
}

variable "hdinsight_gateway_password" {
type = string
default = null
description = "Hadoop gateway password (i.e. Ambari, YARN UI console, etc)"
}

variable "hdinsight_ssh_password" {
type = string
default = null
description = "SSH password to all nodes in the cluster"
}
Binary file added doc/images/hdinsight-job-2.png
Binary file added doc/images/hdinsight-job.png
Binary file added doc/images/hdinsight-spark-version.png
1 change: 1 addition & 0 deletions env/azure-hdinsight.tfvars
@@ -0,0 +1 @@
azure-hdinsight-deploy = true
1 change: 0 additions & 1 deletion env/azure.tfvars
@@ -1 +0,0 @@
region = "westeurope"
2 changes: 1 addition & 1 deletion modules/azure/aks/README.md
@@ -28,7 +28,7 @@ No modules.
| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | Azure machine type | `string` | `"Standard_D2_v2"` | no |
| <a name="input_max_node_count"></a> [max\_node\_count](#input\_max\_node\_count) | Maximum number of AKS nodes | `number` | `2` | no |
| <a name="input_region"></a> [region](#input\_region) | Location of the cluster | `string` | n/a | yes |
| <a name="input_resource_group"></a> [resource\_group](#input\_resource\_group) | n/a | `string` | n/a | yes |
| <a name="input_resource_group"></a> [resource\_group](#input\_resource\_group) | Azure resource group | `string` | n/a | yes |

## Outputs

3 changes: 2 additions & 1 deletion modules/azure/aks/variables.tf
@@ -4,7 +4,8 @@ variable "region" {
}

variable "resource_group" {
type = string
type = string
description = "Azure resource group"
}

variable "machine_type" {
54 changes: 54 additions & 0 deletions modules/azure/hdinsight/README.md
@@ -0,0 +1,54 @@
# hdinsight

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

| Name | Version |
|------|---------|
| <a name="provider_azurerm"></a> [azurerm](#provider\_azurerm) | n/a |
| <a name="provider_random"></a> [random](#provider\_random) | n/a |

## Modules

No modules.

## Resources

| Name | Type |
|------|------|
| [azurerm_hdinsight_spark_cluster.sequila](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/hdinsight_spark_cluster) | resource |
| [azurerm_storage_blob.sequila](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_blob) | resource |
| [azurerm_storage_blob.spark](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_blob) | resource |
| [random_string.random-suffix](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/string) | resource |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_data_files"></a> [data\_files](#input\_data\_files) | Data files to copy to staging bucket | `list(string)` | n/a | yes |
| <a name="input_gateway_password"></a> [gateway\_password](#input\_gateway\_password) | Hadoop gateway password (i.e. Ambari, YARN UI console, etc) | `string` | n/a | yes |
| <a name="input_hdinsight_version"></a> [hdinsight\_version](#input\_hdinsight\_version) | HDInsight version | `string` | `"5.0"` | no |
| <a name="input_node_ssh_password"></a> [node\_ssh\_password](#input\_node\_ssh\_password) | SSH password to all nodes in the cluster | `string` | n/a | yes |
| <a name="input_pysequila_version"></a> [pysequila\_version](#input\_pysequila\_version) | PySeQuiLa version | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | Location of the cluster | `string` | n/a | yes |
| <a name="input_resource_group"></a> [resource\_group](#input\_resource\_group) | Azure resource group | `string` | n/a | yes |
| <a name="input_sequila_version"></a> [sequila\_version](#input\_sequila\_version) | SeQuiLa version | `string` | n/a | yes |
| <a name="input_spark_version"></a> [spark\_version](#input\_spark\_version) | Apache Spark version | `string` | `"3.2.2"` | no |
| <a name="input_storage_account_access_key"></a> [storage\_account\_access\_key](#input\_storage\_account\_access\_key) | Storage account access key | `string` | n/a | yes |
| <a name="input_storage_account_name"></a> [storage\_account\_name](#input\_storage\_account\_name) | n/a | `string` | n/a | yes |
| <a name="input_storage_container_id"></a> [storage\_container\_id](#input\_storage\_container\_id) | Azure storage container | `string` | n/a | yes |
| <a name="input_storage_container_name"></a> [storage\_container\_name](#input\_storage\_container\_name) | n/a | `string` | n/a | yes |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_hdinsight_gateway_password"></a> [hdinsight\_gateway\_password](#output\_hdinsight\_gateway\_password) | n/a |
| <a name="output_hdinsight_ssh_password"></a> [hdinsight\_ssh\_password](#output\_hdinsight\_ssh\_password) | n/a |
| <a name="output_pysequila_submit_command"></a> [pysequila\_submit\_command](#output\_pysequila\_submit\_command) | n/a |
| <a name="output_ssh_command"></a> [ssh\_command](#output\_ssh\_command) | n/a |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->