Merge pull request #883 from FederatedAI/develop-1.11.1
Merge develop-1.11.1 into master for release purposes
wfangchi authored May 10, 2023
2 parents 8e3c880 + 6b705ab commit 33ee34f
Showing 52 changed files with 402 additions and 300 deletions.
8 changes: 4 additions & 4 deletions build/ci/docker-deploy/docker_deploy.sh
@@ -21,11 +21,11 @@ cd ${target_dir}
tar -xzf confs-${target_party_id}.tar

cd confs-${target_party_id}
docker-compose down
docker compose down
docker volume rm -f confs-${target_party_id}_shared_dir_examples
docker volume rm -f confs-${target_party_id}_shared_dir_federatedml
# exclude client service to save time !
docker-compose up -d
docker compose up -d

cd ../
rm -f confs-${target_party_id}.tar
@@ -34,8 +34,8 @@ echo "# party ${target_party_id} training cluster deploy is ok!"
echo "# serving cluster deploy begin"
tar -xzf serving-${target_party_id}.tar
cd serving-${target_party_id}
docker-compose down
docker-compose up -d
docker compose down
docker compose up -d

cd ../
rm -f serving-${target_party_id}.tar
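
The hunks above switch from the standalone `docker-compose` binary to the `docker compose` plugin. On hosts where the plugin may not be installed yet, a fallback could look like the sketch below. The function name `detect_compose` is hypothetical and not part of this commit.

```bash
#!/usr/bin/env bash
# Print the Compose invocation available on this host.
# Prefers the `docker compose` plugin, falls back to the legacy binary.
detect_compose() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    echo "[ERROR]: no Docker Compose installation found" >&2
    return 1
  fi
}
```

A script could then run `compose_cmd="$(detect_compose)" || exit 1` once and call `$compose_cmd down` and `$compose_cmd up -d`, leaving `$compose_cmd` unquoted so that the two-word form splits into separate arguments.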
2 changes: 1 addition & 1 deletion docker-deploy/.env
@@ -1,5 +1,5 @@
RegistryURI=
TAG=1.10.0-release
TAG=1.11.1-release
SERVING_TAG=2.1.6-release
SSH_PORT=22

35 changes: 26 additions & 9 deletions docker-deploy/README.md
@@ -7,8 +7,8 @@ This guide describes the process of deploying FATE using Docker Compose.
The nodes (target nodes) to install FATE must meet the following requirements:

1. A Linux host
2. Docker: 18+
3. Docker-Compose: 1.24+
2. Docker: 19.03.0+
3. Docker Compose: 1.27.0+
4. The deployment machine has access to the Internet, and the hosts can communicate with each other;
5. A network connection to the Internet to pull container images from Docker Hub. If Internet access is not available, consider setting up [Harbor as a local registry](../registry/README.md) or using [offline images](https://github.com/FederatedAI/FATE/tree/master/build/docker-build).
6. A host running FATE is recommended to be with 8 CPUs and 16G RAM.
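
The version requirements above can be checked before deploying. The following is a minimal sketch, assuming GNU `sort` with `-V` (version sort) is available on the target host. The helper name `version_ge` is hypothetical.

```bash
#!/usr/bin/env bash
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's version ordering: the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage on a target host (commented out so the sketch has no side effects):
# docker_version="$(docker version --format '{{.Server.Version}}')"
# compose_version="$(docker compose version --short)"
# version_ge "$docker_version" "19.03.0"  || echo "Docker is older than 19.03.0"
# version_ge "$compose_version" "1.27.0"  || echo "Docker Compose is older than 1.27.0"
```

Comparing with `sort -V` avoids the pitfalls of plain string comparison, where `1.9.0` would sort after `1.27.0`.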
@@ -117,6 +117,23 @@ bash ./generate_config.sh

Now, tar files have been generated for each party including the exchange node (party). They are named as ```confs-<party-id>.tar``` and ```serving-<party-id>.tar```.

### GPU support

Starting from v1.11.1, Docker Compose deployment supports deploying FATE with GPUs. To use a GPU, first set up a GPU-enabled Docker environment; see the official Docker documentation (<https://docs.docker.com/config/containers/resource_constraints/#gpu>).

To use the GPU, set both `algorithm` and `device` in the configuration as follows; `gpu_count` is optional and defaults to 1:

```sh
algorithm=NN
device=GPU

gpu_count=1
```

Only the fateflow component uses the GPU, so each party needs at least one GPU.

*`gpu_count` is mapped to `count`; see [Docker compose GPU support](https://docs.docker.com/compose/gpu-support/).*
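
To make the mapping concrete, the sketch below prints the device reservation that `gpu_count` ends up in. The keys follow the fragment that `generate_config.sh` inserts in this commit; the indentation is an assumption, because the diff view has lost the original whitespace, and the function name `render_gpu_reservation` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the Compose `deploy` fragment for a given GPU count (default 1).
# Indentation here is illustrative; the real file nests this under the fateflow service.
render_gpu_reservation() {
  local count="${1:-1}"
  cat <<EOF
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: ${count}
          capabilities: [gpu]
EOF
}
```

Calling `render_gpu_reservation 2` prints the fragment with `count: 2`, which is how a `gpu_count=2` setting would appear in the generated `docker-compose.yml`.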

### Deploying FATE to target hosts

**Note:** Before running the below commands, all target hosts must
@@ -166,12 +183,12 @@ CONTAINER ID IMAGE COMMAND
3dca43f3c9d5 federatedai/serving-admin:2.1.5-release "/bin/sh -c 'java -c…" 5 minutes ago Up 5 minutes 0.0.0.0:8350->8350/tcp, :::8350->8350/tcp serving-9999_serving-admin_1
fe924918509b federatedai/serving-proxy:2.1.5-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8059->8059/tcp, :::8059->8059/tcp, 0.0.0.0:8869->8869/tcp, :::8869->8869/tcp, 8879/tcp serving-9999_serving-proxy_1
b62ed8ba42b7 bitnami/zookeeper:3.7.0 "/opt/bitnami/script…" 5 minutes ago Up 5 minutes 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 8080/tcp, 0.0.0.0:49226->2888/tcp, :::49226->2888/tcp, 0.0.0.0:49225->3888/tcp, :::49225->3888/tcp serving-9999_serving-zookeeper_1
3c643324066f federatedai/client:1.10.0-release "/bin/sh -c 'flow in…" 5 minutes ago Up 5 minutes 0.0.0.0:20000->20000/tcp, :::20000->20000/tcp confs-9999_client_1
3fe0af1ebd71 federatedai/fateboard:1.10.0-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp confs-9999_fateboard_1
635b7d99357e federatedai/fateflow:1.10.0-release "container-entrypoin…" 5 minutes ago Up 5 minutes (healthy) 0.0.0.0:9360->9360/tcp, :::9360->9360/tcp, 8080/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp confs-9999_fateflow_1
8b515f08add3 federatedai/eggroll:1.10.0-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 8080/tcp, 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp confs-9999_rollsite_1
108cc061c191 federatedai/eggroll:1.10.0-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4670/tcp, 8080/tcp confs-9999_clustermanager_1
f10575e76899 federatedai/eggroll:1.10.0-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4671/tcp, 8080/tcp confs-9999_nodemanager_1
3c643324066f federatedai/client:1.11.1-release "/bin/sh -c 'flow in…" 5 minutes ago Up 5 minutes 0.0.0.0:20000->20000/tcp, :::20000->20000/tcp confs-9999_client_1
3fe0af1ebd71 federatedai/fateboard:1.11.1-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp confs-9999_fateboard_1
635b7d99357e federatedai/fateflow:1.11.1-release "container-entrypoin…" 5 minutes ago Up 5 minutes (healthy) 0.0.0.0:9360->9360/tcp, :::9360->9360/tcp, 8080/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp confs-9999_fateflow_1
8b515f08add3 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 8080/tcp, 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp confs-9999_rollsite_1
108cc061c191 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4670/tcp, 8080/tcp confs-9999_clustermanager_1
f10575e76899 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4671/tcp, 8080/tcp confs-9999_nodemanager_1
aa0a0002de93 mysql:8.0.28 "docker-entrypoint.s…" 5 minutes ago Up 5 minutes 3306/tcp, 33060/tcp confs-9999_mysql_1
```

@@ -474,6 +491,6 @@ To delete the cluster completely, log in to each host and run the commands as fo

```bash
cd /data/projects/fate/confs-<id>/ # id of party
docker-compose down
docker compose down
rm -rf ../confs-<id>/ # delete the legacy files
```
37 changes: 27 additions & 10 deletions docker-deploy/README_zh.md
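@@ -17,8 +17,8 @@ Compose is a tool for defining and running multi-container Docker applications. With Comp
## Prerequisites

1. Two hosts (physical or virtual machines, both running CentOS 7);
2. Docker installed on all hosts, version 18+;
3. Docker-Compose installed on all hosts, version 1.24+;
2. Docker installed on all hosts, version 19.03.0+;
3. Docker Compose installed on all hosts, version 1.27.0+;
4. The deployment machine has Internet access, and the hosts can reach each other over the network;
5. The target hosts have already downloaded the images of the FATE components. If Docker Hub is unreachable, consider using Harbor ([Harbor as a local registry](../registry/README.md)) or an offline deployment (to build images offline, see [building images](https://github.com/FederatedAI/FATE/tree/master/build/docker-build)).
6. Hosts running FATE are recommended to have 8 CPUs and 16G RAM.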
@@ -138,9 +138,9 @@ compute_core=4
# Set the user's password
[user@localhost]$ sudo passwd fate
# Create the docker-compose deployment directory
[user@localhost]$ sudo mkdir -p /data/projects/fate
[user@localhost]$ sudo mkdir -p /data/projects/fate /home/fate
# Change the owner and group of the docker-compose deployment directory
[user@localhost]$ sudo chown -R fate:docker /data/projects/fate
[user@localhost]$ sudo chown -R fate:docker /data/projects/fate /home/fate
# Switch to the fate user
[user@localhost]$ sudo su fate
# Check whether the user has Docker permissions
@@ -152,6 +152,23 @@ total 0
drwxr-xr-x. 2 fate docker 6 May 27 00:51 fate
```

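### GPU support

Starting from v1.11.1, Docker Compose deployment supports deploying FATE with GPUs. To use a GPU, first set up a GPU-enabled Docker environment; see the official Docker documentation (<https://docs.docker.com/config/containers/resource_constraints/#gpu>).

To use the GPU, modify the configuration; both `algorithm` and `device` must be changed: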

```sh
algorithm=NN
device=GPU

gpu_count=1
```

Only the fateflow component uses the GPU, so each party needs at least one GPU.

*`gpu_count` is mapped to `count`; see [Docker compose GPU support](https://docs.docker.com/compose/gpu-support/).*

### Run the deployment script

The following changes can be made on any machine.
@@ -185,12 +202,12 @@ CONTAINER ID IMAGE COMMAND
3dca43f3c9d5 federatedai/serving-admin:2.1.5-release "/bin/sh -c 'java -c…" 5 minutes ago Up 5 minutes 0.0.0.0:8350->8350/tcp, :::8350->8350/tcp serving-9999_serving-admin_1
fe924918509b federatedai/serving-proxy:2.1.5-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8059->8059/tcp, :::8059->8059/tcp, 0.0.0.0:8869->8869/tcp, :::8869->8869/tcp, 8879/tcp serving-9999_serving-proxy_1
b62ed8ba42b7 bitnami/zookeeper:3.7.0 "/opt/bitnami/script…" 5 minutes ago Up 5 minutes 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 8080/tcp, 0.0.0.0:49226->2888/tcp, :::49226->2888/tcp, 0.0.0.0:49225->3888/tcp, :::49225->3888/tcp serving-9999_serving-zookeeper_1
3c643324066f federatedai/client:1.10.0-release "/bin/sh -c 'flow in…" 5 minutes ago Up 5 minutes 0.0.0.0:20000->20000/tcp, :::20000->20000/tcp confs-9999_client_1
3fe0af1ebd71 federatedai/fateboard:1.10.0-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp confs-9999_fateboard_1
635b7d99357e federatedai/fateflow:1.10.0-release "container-entrypoin…" 5 minutes ago Up 5 minutes (healthy) 0.0.0.0:9360->9360/tcp, :::9360->9360/tcp, 8080/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp confs-9999_fateflow_1
8b515f08add3 federatedai/eggroll:1.10.0-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 8080/tcp, 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp confs-9999_rollsite_1
108cc061c191 federatedai/eggroll:1.10.0-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4670/tcp, 8080/tcp confs-9999_clustermanager_1
f10575e76899 federatedai/eggroll:1.10.0-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4671/tcp, 8080/tcp confs-9999_nodemanager_1
3c643324066f federatedai/client:1.11.1-release "/bin/sh -c 'flow in…" 5 minutes ago Up 5 minutes 0.0.0.0:20000->20000/tcp, :::20000->20000/tcp confs-9999_client_1
3fe0af1ebd71 federatedai/fateboard:1.11.1-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp confs-9999_fateboard_1
635b7d99357e federatedai/fateflow:1.11.1-release "container-entrypoin…" 5 minutes ago Up 5 minutes (healthy) 0.0.0.0:9360->9360/tcp, :::9360->9360/tcp, 8080/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp confs-9999_fateflow_1
8b515f08add3 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 8080/tcp, 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp confs-9999_rollsite_1
108cc061c191 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4670/tcp, 8080/tcp confs-9999_clustermanager_1
f10575e76899 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4671/tcp, 8080/tcp confs-9999_nodemanager_1
aa0a0002de93 mysql:8.0.28 "docker-entrypoint.s…" 5 minutes ago Up 5 minutes 3306/tcp, 33060/tcp confs-9999_mysql_1
```

22 changes: 11 additions & 11 deletions docker-deploy/docker_deploy.sh
@@ -163,10 +163,10 @@ mv ~/confs-$target_party_id.tar $dir
cd $dir
tar -xzf confs-$target_party_id.tar
cd confs-$target_party_id
docker-compose down
docker compose down
docker volume rm -f confs-${target_party_id}_shared_dir_examples
docker volume rm -f confs-${target_party_id}_shared_dir_federatedml
docker-compose up -d
docker compose up -d
cd ../
rm -f confs-${target_party_id}.tar
exit
@@ -214,8 +214,8 @@ mv ~/serving-$target_party_id.tar $dir
cd $dir
tar -xzf serving-$target_party_id.tar
cd serving-$target_party_id
docker-compose down
docker-compose up -d
docker compose down
docker compose up -d
cd ../
rm -f serving-$target_party_id.tar
exit
@@ -250,15 +250,15 @@ DeleteCluster() {
if [ "$cluster_type" == "--training" ]; then
ssh -p ${SSH_PORT} -tt $user@$target_party_ip <<eeooff
cd $dir/confs-$target_party_id
docker-compose down
docker compose down
exit
eeooff
echo "party $target_party_id training cluster is deleted!"
# delete serving cluster
elif [ "$cluster_type" == "--serving" ]; then
ssh -p ${SSH_PORT} -tt $user@$target_party_serving_ip <<eeooff
cd $dir/serving-$target_party_id
docker-compose down
docker compose down
exit
eeooff
echo "party $target_party_id serving cluster is deleted!"
@@ -268,18 +268,18 @@ eeooff
if [ "$target_party_id" == "exchange" ]; then
ssh -p ${SSH_PORT} -tt $user@$target_party_ip <<eeooff
cd $dir/confs-$target_party_id
docker-compose down
docker compose down
exit
eeooff
else
ssh -p ${SSH_PORT} -tt $user@$target_party_serving_ip <<eeooff
cd $dir/serving-$target_party_id
docker-compose down
docker compose down
exit
eeooff
ssh -p ${SSH_PORT} -tt $user@$target_party_ip <<eeooff
cd $dir/confs-$target_party_id
docker-compose down
docker compose down
exit
eeooff
echo "party $target_party_id training cluster is deleted!"
@@ -300,8 +300,8 @@ handleLocally() {
mkdir -p $dir
tar -xf ${WORKINGDIR}/outputs/${type}-${target_party_id}.tar -C $dir
cd ${dir}/${type}-${target_party_id}
docker-compose down
docker-compose up -d
docker compose down
docker compose up -d
local_flag="true"
return 0
fi
33 changes: 30 additions & 3 deletions docker-deploy/generate_config.sh
@@ -60,7 +60,7 @@ function CheckConfig(){
computing_list="Eggroll Spark Spark_local"
spark_federation_list="RabbitMQ Pulsar"
algorithm_list="Basic NN"
device_list="CPU IPCL"
device_list="CPU IPCL GPU"

if ! `list_include_item "$computing_list" "$computing"`; then
echo "[ERROR]: Please check whether computing is one of $computing_list"
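
The check above relies on a helper, `list_include_item`, whose definition is not shown in this diff. A minimal sketch of such a helper, written here as an assumption rather than the repository's actual implementation, tests whether a word appears in a space-separated list:

```bash
#!/usr/bin/env bash
# Succeeds when $2 appears as a whole word in the space-separated list $1.
# Hypothetical re-implementation; the real helper in generate_config.sh may differ.
list_include_item() {
  local list="$1" item="$2"
  case " $list " in
    *" $item "*) return 0 ;;  # padded with spaces so "CP" does not match "CPU"
    *)           return 1 ;;
  esac
}
```

With the device list extended by this commit, `list_include_item "CPU IPCL GPU" "GPU"` succeeds, while an unknown value such as `TPU` fails and would trigger the `[ERROR]` message.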
@@ -154,6 +154,9 @@ GenerateConfig() {

eval exchange_ip=${exchangeip}

# gpu_count defaults to 1
eval gpu_count=${gpu_count:-1}

echo package $party_id start!

rm -rf confs-$party_id/
@@ -220,10 +223,10 @@ GenerateConfig() {
# federation
if [ "$federation" == "RabbitMQ" ]; then
cp -r training_template/backends/spark/rabbitmq confs-$party_id/confs/
sed -i '147,161d' confs-$party_id/docker-compose.yml
sed -i '147,159d' confs-$party_id/docker-compose.yml
elif [ "$federation" == "Pulsar" ]; then
cp -r training_template/backends/spark/pulsar confs-$party_id/confs/
sed -i '129,145d' confs-$party_id/docker-compose.yml
sed -i '127,143d' confs-$party_id/docker-compose.yml
fi
fi
fi
@@ -247,6 +250,9 @@ GenerateConfig() {
if [ "$device" == "IPCL" ]; then
Suffix=$Suffix"-ipcl"
fi
if [ "$device" == "GPU" ]; then
Suffix=$Suffix"-gpu"
fi

# federatedai/fateflow-${computing}-${algorithm}-${device}:${version}

@@ -259,6 +265,27 @@ GenerateConfig() {
sed -i "s#image: \"federatedai/spark-worker:\${TAG}\"#image: \"federatedai/spark-worker${Suffix}:\${TAG}\"#g" ./confs-$party_id/docker-compose.yml
fi

# GPU
if [ "$device" == "GPU" ]; then
line=0 # line refers to the line number of the fateflow `command` line in docker-compose.yml
if [ "$computing" == "Eggroll" ]; then
line=137
fi
if [ "$computing" == "Spark" ]; then
line=84
fi
if [ "$computing" == "Spark_local" ]; then
line=85
fi
sed -i "${line}i\\
deploy:\\
resources:\\
reservations:\\
devices:\\
- driver: nvidia\\
count: $gpu_count\\
capabilities: [gpu]" ./confs-$party_id/docker-compose.yml
fi
# RegistryURI
if [ "$RegistryURI" != "" ]; then
sed -i 's#federatedai#${RegistryURI}/federatedai#g' ./confs-$party_id/docker-compose.yml
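
The GPU hunk above hardcodes the line number of the fateflow `command` line for each computing backend (137, 84, 85), so it depends on the template files not shifting. One alternative, sketched below, looks the line up instead. This is not what the commit does; it assumes the service key is written as `fateflow:` on its own line and that a `command:` key follows inside that service. The function name `find_fateflow_command_line` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the line number of the first `command:` key that follows the
# `fateflow:` service key in the given Compose file.
find_fateflow_command_line() {
  awk '
    /^[[:space:]]*fateflow:[[:space:]]*$/ { in_service = 1; next }
    in_service && /^[[:space:]]*command:/ { print NR; exit }
  ' "$1"
}
```

The result could replace the fixed values, for example `line="$(find_fateflow_command_line ./confs-$party_id/docker-compose.yml)"`. The sketch does not detect where the fateflow service ends, so a service without a `command:` key would match a later service's key; a stricter version would track indentation.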
3 changes: 3 additions & 0 deletions docker-deploy/parties.conf
@@ -21,6 +21,9 @@ device=CPU
# spark and eggroll
compute_core=4

# Configure this parameter only when you want to use the GPU; the default value is 1
gpu_count=1

# default
exchangeip=
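
`generate_config.sh` applies the default with `${gpu_count:-1}`, which substitutes 1 when the variable is unset or empty. The sketch below shows that default together with a basic sanity check. The check is an addition for illustration, not part of this commit, and it deliberately accepts only positive integers even though Compose also allows `count: all`. The function name `normalize_gpu_count` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the effective GPU count: the given value, or 1 when it is empty or unset.
# Fails for anything that is not a positive integer.
normalize_gpu_count() {
  local value="${1:-1}"
  case "$value" in
    *[!0-9]*)
      echo "[ERROR]: gpu_count must be a positive integer, got '$value'" >&2
      return 1
      ;;
  esac
  if [ "$value" -lt 1 ]; then
    echo "[ERROR]: gpu_count must be at least 1" >&2
    return 1
  fi
  echo "$value"
}
```

For example, `gpu_count="$(normalize_gpu_count "$gpu_count")" || exit 1` keeps a configured value such as 2, falls back to 1 when `parties.conf` leaves the parameter out, and stops on a value such as `abc`.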

6 changes: 3 additions & 3 deletions docker-deploy/test.sh
@@ -83,7 +83,7 @@ toy_example() {
cd $dir
cd confs-$target_party_id
docker-compose exec -T client bash -c '
docker compose exec -T client bash -c '
flow test toy --guest-party-id $guest --host-party-id $host
'
@@ -108,7 +108,7 @@ upload_data() {
cd $dir
cd confs-$target_party_id
docker-compose exec -T python bash -c '
docker compose exec -T python bash -c '
cd examples/scripts;
python upload_default_data.py -f 1
'
@@ -142,7 +142,7 @@ min_test_task(){
cd $dir
cd confs-$target_party_id
docker-compose exec -T python bash -c '
docker compose exec -T python bash -c '
cd examples/min_test_task;
python run_task.py -gid ${guest_id} -hid ${host_id} -aid ${arbiter_id}
'