refactor hyperparameter tuning & add Dask-ML. (#44)
luweizheng authored May 8, 2024
1 parent 01256be commit 9236a0a
Showing 49 changed files with 6,146 additions and 2,575 deletions.
16 changes: 9 additions & 7 deletions _toc.yml
@@ -13,12 +13,15 @@ subtrees:
entries:
- file: ch-data-science/data-science-lifecycle
- file: ch-data-science/machine-learning
- file: ch-data-science/deep-learning
- file: ch-data-science/hyperparameter
- file: ch-data-science/python-ecosystem
- file: ch-dask/index
entries:
- file: ch-dask/dask-intro
- file: ch-dask/dask-dataframe-intro
- file: ch-dask/dask-distributed
- file: ch-dask/gpu
- file: ch-dask/task-graph-partitioning
- file: ch-dask-dataframe/index
entries:
@@ -29,6 +32,8 @@
- file: ch-dask-dataframe/shuffle
- file: ch-dask-ml/index
entries:
- file: ch-dask-ml/preprocessing
- file: ch-dask-ml/hyperparameter
- file: ch-dask-ml/distributed-training
- file: ch-ray-core/index
entries:
@@ -47,14 +52,11 @@
- file: ch-ray-data/data-load-inspect-save
- file: ch-ray-data/data-transform
- file: ch-ray-data/preprocessor
- file: ch-ray-train-tune/index
- file: ch-ray-ml/index
entries:
- file: ch-ray-train-tune/ray-train
- file: ch-ray-train-tune/ray-tune
- file: ch-ray-train-tune/tune-algorithm-scheduler
- file: ch-ray-serve/index
entries:
- file: ch-ray-serve/ray-serve
- file: ch-ray-ml/ray-train
- file: ch-ray-ml/ray-tune
- file: ch-ray-ml/ray-serve
- file: ch-mpi/index
entries:
- file: ch-mpi/mpi-intro
2 changes: 1 addition & 1 deletion ch-dask-dataframe/indexing.ipynb
@@ -33,7 +33,7 @@
"import os\n",
"import sys\n",
"sys.path.append(\"..\")\n",
"from datasets import nyc_flights\n",
"from utils import nyc_flights\n",
"\n",
"import dask\n",
"dask.config.set({'dataframe.query-planning': False})\n",
2 changes: 1 addition & 1 deletion ch-dask-dataframe/map-partitions.ipynb
@@ -30,7 +30,7 @@
"source": [
"import sys\n",
"sys.path.append(\"..\")\n",
"from datasets import nyc_taxi\n",
"from utils import nyc_taxi\n",
"\n",
"import pandas as pd\n",
"import dask\n",
2 changes: 1 addition & 1 deletion ch-dask-dataframe/read-write.ipynb
@@ -56,7 +56,7 @@
"\n",
"import sys\n",
"sys.path.append(\"..\")\n",
"from datasets import nyc_flights\n",
"from utils import nyc_flights\n",
"\n",
"import warnings\n",
"warnings.simplefilter(action='ignore', category=FutureWarning)\n",
15 changes: 9 additions & 6 deletions ch-dask-ml/distributed-training.ipynb
@@ -9,8 +9,8 @@
"\n",
"如果训练数据量很大,Dask-ML 提供了分布式机器学习功能,可以在集群上对大数据进行训练。目前,Dask 提供了两类分布式机器学习 API:\n",
"\n",
"* scikit-learn:与 scikit-learn 的调用方式类似\n",
"* XGBoost 和 LightGBM:与 XGBoost 和 LightGBM 的调用方式类似\n",
"* scikit-learn:与 scikit-learn 的调用方式类似\n",
"* XGBoost 和 LightGBM 决策树式:与 XGBoost 和 LightGBM 的调用方式类似\n",
"\n",
"## scikit-learn API\n",
"\n",
@@ -463,7 +463,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"训练好的模型可以用来预测(`predict`),也可以计算准确度(`score`)。"
"训练好的模型可以用来预测(`predict()`),也可以计算准确度(`score()`)。"
]
},
{
@@ -515,7 +515,10 @@
"\n",
"尽管 Dask-ML 这种分布式训练的 API 与 scikit-learn 极其相似,scikit-learn 只能使用单核,Dask-ML 可以使用多核甚至集群,但并不意味着所有场景下都选择 Dask-ML,因为有些时候 Dask-ML 并非性能或性价比最优的选择。这一点与 Dask DataFrame 和 pandas 关系一样,如果数据量能放进单机内存,原生的 pandas 、NumPy 和 scikit-learn 的性能和兼容性总是最优的。\n",
"\n",
"下面的代码对不同规模的训练数据进行了性能分析,在单机多核且数据量较小的场景,Dask-ML 的性能并不比 scikit-learn 更快。主要因为:很多机器学习算法是迭代式的,scikit-learn 中,迭代式算法使用 Python 原生 `for` 循环来实现;Dask-ML 参考了这种 `for` 循环,但对于 Dask 的 Task Graph 来说,`for` 循环会使得 Task Graph 很臃肿,执行效率并不是很高。\n",
"下面的代码对不同规模的训练数据进行了性能分析,在单机多核且数据量较小的场景,Dask-ML 的性能并不比 scikit-learn 更快。原因有很多,包括:\n",
"\n",
"* 很多机器学习算法是迭代式的,scikit-learn 中,迭代式算法使用 Python 原生 `for` 循环来实现;Dask-ML 参考了这种 `for` 循环,但对于 Dask 的 Task Graph 来说,`for` 循环会使得 Task Graph 很臃肿,执行效率并不是很高。\n",
"* 分布式实现需要在不同进程间分发和收集数据,相比单机单进程,额外增加了很多数据同步和通信开销。\n",
"\n",
"你也可以根据你所拥有的内存来测试一下性能。"
]
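As a rough illustration of the comparison the cell above refers to (the notebook's own benchmark code is not shown in this hunk), something like the following contrasts scikit-learn and Dask-ML on small, in-memory data; the data size and the models are assumptions.

```python
# Hedged sketch: on small, in-memory data, plain scikit-learn is usually faster.
import time

import numpy as np
from sklearn.linear_model import LogisticRegression as SkLogisticRegression

from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogisticRegression

X, y = make_classification(n_samples=5_000, n_features=20, chunks=1_000)
X_np, y_np = np.asarray(X), np.asarray(y)   # materialize in memory for scikit-learn

for name, model, data in [
    ("scikit-learn", SkLogisticRegression(), (X_np, y_np)),
    ("Dask-ML", DaskLogisticRegression(), (X, y)),
]:
    start = time.time()
    model.fit(*data)
    print(f"{name}: {time.time() - start:.2f} s")
```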
@@ -2115,9 +2118,9 @@
"\n",
"XGBoost 和 LightGBM 是两种决策树模型的实现,他们本身就对分布式训练友好,且集成了 Dask 的分布式能力。下面以 XGBoost 为例,介绍 XGBoost 如何基于 Dask 实现分布式训练,LightGBM 与之类似。\n",
"\n",
"在 XGBoost 中,训练一个模型既可以使用 `train` 方法,也可以使用 scikit-learn 式的 `fit` 方法。这两种方式都支持 Dask 分布式训练。\n",
"在 XGBoost 中,训练一个模型既可以使用 `train` 方法,也可以使用 scikit-learn 式的 `fit()` 方法。这两种方式都支持 Dask 分布式训练。\n",
"\n",
"下面的代码对单机的 XGBoost 和 Dask 分布式训练两种方式进行了性能对比。如果使用 Dask,需要将 [`xgboost.DMatrix`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.DMatrix) 修改为 [`xgboost.dask.DaskDMatrix`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.dask.DaskDMatrix), [`xgboost.train`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.train) 修改为 [`xgboost.dask.train`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.dask.train);并传入 Dask 集群客户端 `client`。"
"下面的代码对单机的 XGBoost 和 Dask 分布式训练两种方式进行了性能对比。如果使用 Dask,需要将 [`xgboost.DMatrix`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.DMatrix) 修改为 [`xgboost.dask.DaskDMatrix`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.dask.DaskDMatrix),`xgboost.dask.DaskDMatrix` 可以将分布式的 Dask Array 或 Dask DataFrame 转化为 XGBoost 所需要的数据格式;再将 [`xgboost.train`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.train) 修改为 [`xgboost.dask.train`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.dask.train);并传入 Dask 集群客户端 `client`。"
]
},
{
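To make the API changes described in that cell concrete, here is a hedged sketch of Dask-distributed XGBoost training using the names mentioned above; it is not the notebook's code, and the cluster setup, synthetic dataset, and parameters are illustrative assumptions.

```python
# Hedged sketch: distributed XGBoost training on Dask via xgboost.dask.
import xgboost as xgb
from dask.distributed import Client, LocalCluster
from dask_ml.datasets import make_classification

client = Client(LocalCluster())                 # Dask cluster client

# Distributed training data as Dask arrays.
X, y = make_classification(n_samples=100_000, n_features=20, chunks=10_000)

# DaskDMatrix wraps Dask collections in the format XGBoost expects.
dtrain = xgb.dask.DaskDMatrix(client, X, y)

# xgboost.dask.train mirrors xgboost.train but takes the client as its first argument;
# it returns a dict holding the trained booster and the training history.
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=50,
)

booster = output["booster"]                     # the trained model
pred = xgb.dask.predict(client, booster, X)     # lazy Dask array of predictions
print(pred.compute()[:5])
```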