Update docs for testing Dask and Spark integrations (#867)
* update docs for testing dask and spark integrations

* update changelog.rst

* update form url
thehomebrewnerd authored Mar 25, 2020
1 parent 8540c28 commit d77be7d
Showing 4 changed files with 15 additions and 4 deletions.
1 change: 1 addition & 0 deletions docs/source/changelog.rst
@@ -10,6 +10,7 @@ Changelog
* Documentation Changes
* Add links to primitives.featurelabs.com (:pr:`860`)
* Add source code links to API reference (:pr:`862`)
* Update links for testing Dask/Spark integrations (:pr:`867`)
* Testing Changes
* Miscellaneous changes (:pr:`861`)

6 changes: 4 additions & 2 deletions docs/source/featuretools_enterprise.rst
@@ -8,10 +8,12 @@ Premium Primitives
Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Featuretools Enterprise contains over 100 domain-specific premium primitives to help you build better features for more accurate models. A list of all premium primitives can be obtained by visiting `primitives.featurelabs.com <https://primitives.featurelabs.com/>`__.


Spark and Dask
--------------
Running Featuretools with Spark and Dask
----------------------------------------
Looking to easily scale Featuretools to bigger datasets or integrate it into your existing big data infrastructure? Whether it’s on-premises or in the cloud, you can run Featuretools Enterprise with Apache Spark and Dask. We have yet to encounter a dataset that is too large to handle.

The Featuretools development team is continually working to improve integration with Dask and Spark for performing feature engineering at scale. If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing `this simple request form <https://forms.office.com/Pages/ResponsePage.aspx?id=2TkvUj0wj0id66bXfx6v2ASd4JAap6pFigRj7EKGsuBUNDI4WDlGSzI1VVRHTUdMS0gyR1EyMkdJVi4u>`__.


Expert Support
--------------
6 changes: 6 additions & 0 deletions docs/source/guides/parallel.rst
@@ -12,6 +12,12 @@ Featuretools can optionally compute features on multiple cores. The simplest way to control the amount of parallelism is to specify the ``n_jobs`` parameter of ``calculate_feature_matrix``.

The above command will start 2 processes to compute chunks of the feature matrix in parallel. Each process receives its own copy of the entity set, so memory use will be proportional to the number of parallel processes. Because the entity set has to be copied to each process, there is overhead to perform this operation before calculation can begin. To avoid this overhead on successive calls to ``calculate_feature_matrix``, read the section below on using a persistent cluster.

Running Featuretools with Spark and Dask
----------------------------------------
The Featuretools development team is continually working to improve integration with Dask and Spark for performing feature engineering at scale. If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing `this simple request form <https://forms.office.com/Pages/ResponsePage.aspx?id=2TkvUj0wj0id66bXfx6v2ASd4JAap6pFigRj7EKGsuBUNDI4WDlGSzI1VVRHTUdMS0gyR1EyMkdJVi4u>`__.

Continue reading below to learn how to perform parallel feature computation with the current integrations.

Using persistent cluster
------------------------
Behind the scenes, Featuretools uses `Dask's <http://dask.pydata.org/>`_ distributed scheduler to implement multiprocessing. When you only specify the ``n_jobs`` parameter, a cluster is created for that specific feature matrix calculation and destroyed once the calculation finishes. A drawback of this is that each time a feature matrix is calculated, the entity set has to be transmitted to the workers again. To avoid this, we would like to reuse the same cluster between calls. The way to do this is to create a cluster first and tell Featuretools to use it with the ``dask_kwargs`` parameter::
6 changes: 4 additions & 2 deletions docs/source/guides/performance.rst
@@ -46,8 +46,10 @@ An additional example of partitioning data to distribute on multiple cores or a cluster.

For a similar partition-and-distribute implementation using Apache Spark with PySpark, refer to the `Feature Engineering on Spark notebook <https://github.com/Featuretools/predicting-customer-churn/blob/master/churn/4.%20Feature%20Engineering%20on%20Spark.ipynb>`_. This implementation shows how to carry out feature engineering on a cluster of EC2 instances using Spark as the distributed framework. A write-up of this approach is available in the `Featuretools on Spark article <https://blog.featurelabs.com/featuretools-on-spark-2/>`_ on the Feature Labs engineering blog.
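The partition-and-distribute pattern referenced above can be sketched with the standard library alone; the real examples use Dask or PySpark, and the toy transaction data and helper names here are assumptions for illustration only:

```python
# Hedged stdlib sketch of partition-and-distribute: split data by key, compute
# features per partition in parallel workers, then collect the results.
from concurrent.futures import ProcessPoolExecutor

transactions = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 1, "amount": 7.5},
]

def partition_by_customer(rows):
    # Group rows by customer_id; each group becomes one independent partition.
    parts = {}
    for r in rows:
        parts.setdefault(r["customer_id"], []).append(r)
    return list(parts.values())

def features_for_partition(rows):
    # Stand-in for per-partition feature engineering (e.g. DFS on one partition).
    cid = rows[0]["customer_id"]
    return {"customer_id": cid,
            "total": sum(r["amount"] for r in rows),
            "count": len(rows)}

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        feature_rows = list(pool.map(features_for_partition,
                                     partition_by_customer(transactions)))
    print(sorted(feature_rows, key=lambda d: d["customer_id"]))
```

With Dask or Spark the structure is the same; only the executor changes, and the per-partition feature rows are concatenated into one feature matrix at the end.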

Running Featuretools with Spark and Dask
----------------------------------------
The Featuretools development team is continually working to improve integration with Dask and Spark for performing feature engineering at scale. If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing `this simple request form <https://forms.office.com/Pages/ResponsePage.aspx?id=2TkvUj0wj0id66bXfx6v2ASd4JAap6pFigRj7EKGsuBUNDI4WDlGSzI1VVRHTUdMS0gyR1EyMkdJVi4u>`__.

Featuretools Enterprise
-----------------------
If you don't want to build it yourself, Featuretools Enterprise has native integrations with Apache Spark and Dask. More information is available `here <https://www.featurelabs.com/featuretools>`__.

If you would like to test `Featuretools Enterprise APIs <https://docs.featurelabs.com/>`_ for running Featuretools natively on Apache Spark or Dask, please let us know `here <https://forms.gle/TtFTH5QKM4gZtu7U7>`__.
