From d77be7d6e1ec90b1b777d8769e5b00c7689e72d5 Mon Sep 17 00:00:00 2001
From: Nate Parsons <4307001+thehomebrewnerd@users.noreply.github.com>
Date: Wed, 25 Mar 2020 13:50:08 -0500
Subject: [PATCH] Update docs for testing Dask and Spark integrations (#867)

* update docs for testing dask and spark integrations

* update changelog.rst

* update form url
---
 docs/source/changelog.rst               | 1 +
 docs/source/featuretools_enterprise.rst | 6 ++++--
 docs/source/guides/parallel.rst         | 6 ++++++
 docs/source/guides/performance.rst      | 6 ++++--
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/docs/source/changelog.rst b/docs/source/changelog.rst
index 235a383842..e85fa3411b 100644
--- a/docs/source/changelog.rst
+++ b/docs/source/changelog.rst
@@ -10,6 +10,7 @@ Changelog
     * Documentation Changes
         * Add links to primitives.featurelabs.com (:pr:`860`)
         * Add source code links to API reference (:pr:`862`)
+        * Update links for testing Dask/Spark integrations (:pr:`867`)
     * Testing Changes
        * Miscellaneous changes (:pr:`861`)

diff --git a/docs/source/featuretools_enterprise.rst b/docs/source/featuretools_enterprise.rst
index ddaffa1036..1d7360a3aa 100644
--- a/docs/source/featuretools_enterprise.rst
+++ b/docs/source/featuretools_enterprise.rst
@@ -8,10 +8,12 @@ Premium Primitives
 Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Featuretools Enterprise contains over 100 domain-specific premium primitives to help you build better features for more accurate models. A list of all premium primitives can be obtained by visiting `primitives.featurelabs.com `__.

-Spark and Dask
---------------
+Running Featuretools with Spark and Dask
+----------------------------------------
 Looking to easily scale Featuretools to bigger datasets or integrate it into your existing big data infrastructure? Whether it’s on-premise or in the cloud, you can run Featuretools Enterprise with Apache Spark and Dask. We have yet to encounter a dataset that is too large to handle.

+The Featuretools development team is continually working to improve integration with Dask and Spark for performing feature engineering at scale. If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing `this simple request form `__.
+
 Expert Support
 --------------

diff --git a/docs/source/guides/parallel.rst b/docs/source/guides/parallel.rst
index 7c14e984b0..fde4450db0 100644
--- a/docs/source/guides/parallel.rst
+++ b/docs/source/guides/parallel.rst
@@ -12,6 +12,12 @@ Featuretools can optionally compute features on multiple cores. The simplest way
 The above command will start 2 processes to compute chunks of the feature matrix in parallel. Each process receives its own copy of the entity set, so memory use will be proportional to the number of parallel processes. Because the entity set has to be copied to each process, there is overhead to perform this operation before calculation can begin. To avoid this overhead on successive calls to ``calculate_feature_matrix``, read the section below on using a persistent cluster.

+Running Featuretools with Spark and Dask
+----------------------------------------
+The Featuretools development team is continually working to improve integration with Dask and Spark for performing feature engineering at scale. If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing `this simple request form `__.
+
+Continue reading below to learn how to perform parallel feature computation with the current integrations.
+
 Using persistent cluster
 ------------------------
 Behind the scenes, Featuretools uses `dask's `_ distributed scheduler to implement multiprocessing. When you only specify the ``n_jobs`` parameter, a cluster will be created for that specific feature matrix calculation and destroyed once calculations have finished. A drawback of this is that each time a feature matrix is calculated, the entity set has to be transmitted to the workers again. To avoid this, we would like to reuse the same cluster between calls. The way to do this is by creating a cluster first and telling featuretools to use it with the ``dask_kwargs`` parameter::

diff --git a/docs/source/guides/performance.rst b/docs/source/guides/performance.rst
index 2fd134bbf1..859592b0e0 100644
--- a/docs/source/guides/performance.rst
+++ b/docs/source/guides/performance.rst
@@ -46,8 +46,10 @@ An additional example of partitioning data to distribute on multiple cores or a
 For a similar partition and distribute implementation using Apache Spark with PySpark, refer to the `Feature Engineering on Spark notebook `_. This implementation shows how to carry out feature engineering on a cluster of EC2 instances using Spark as the distributed framework. A write-up of this approach is described in the `Featuretools on Spark article `_ on the Feature Labs engineering blog.

+Running Featuretools with Spark and Dask
+----------------------------------------
+The Featuretools development team is continually working to improve integration with Dask and Spark for performing feature engineering at scale. If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing `this simple request form `__.
+
 Featuretools Enterprise
 -----------------------
 If you don't want to build it yourself, Featuretools Enterprise has native integrations with Apache Spark and Dask. More information is available `here `__.
-
-If you would like to test `Featuretools Enterprise APIs `_ for running Featuretools natively on Apache Spark or Dask, please let us know `here `__.
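For reference, here is a minimal sketch of the two parallelism options the updated ``parallel.rst`` guide describes: per-call parallelism through the ``n_jobs`` parameter, and a persistent Dask cluster reused across calls through ``dask_kwargs``. The demo entity set, the ``customers`` target entity, and the default ``LocalCluster`` settings are illustrative assumptions rather than part of this patch; passing the cluster as ``dask_kwargs={'cluster': cluster}`` follows the guide's persistent-cluster section::

    import featuretools as ft
    from dask.distributed import LocalCluster

    # Any EntitySet works here; the bundled demo data is only a stand-in.
    es = ft.demo.load_mock_customer(return_entityset=True)

    # Build the feature definitions once so they can be reused for every calculation.
    feature_defs = ft.dfs(entityset=es,
                          target_entity="customers",
                          features_only=True)

    # Simplest option: Featuretools starts (and tears down) 2 worker processes
    # for this single call, copying the entity set to each process.
    fm = ft.calculate_feature_matrix(features=feature_defs,
                                     entityset=es,
                                     n_jobs=2)

    # Persistent-cluster option: create the cluster once and reuse it between
    # calls so the entity set is not retransmitted to the workers each time.
    cluster = LocalCluster()
    fm = ft.calculate_feature_matrix(features=feature_defs,
                                     entityset=es,
                                     dask_kwargs={'cluster': cluster})

Reusing the same ``cluster`` object between calls avoids the per-call overhead of copying the entity set to the workers, which is the cost the guide highlights when only ``n_jobs`` is specified.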