Commit
* Simplify data I/O utils. Previously, the data I/O utilities were responsible for aggregating hierarchical time series data from a single source CSV. We instead delegate this to pyspark in this PR.
* Allow tree models to have max_forecast_steps=None.
* Robust target_seq_index for ForecasterEnsemble. This commit converts target_seq_index into a property for the ForecasterEnsemble. This allows us to enforce a single value of target_seq_index for all models.
* Bugfix w/ TimeSeries from string index dataframe.
* Create merlion.spark API with pyspark Pandas UDFs. We now have pyspark pandas UDFs for forecasting (parallelized over time series ID), anomaly detection (parallelized over time series ID), and hierarchical time series reconciliation (parallelized over time).
* Fix minor edge cases.
* Update Java install instructions.
* Switch from yaml to json.
* Allow datasets to not have data_cols specified.
* Allow null index_cols values in spark datasets.
* Simplify pyspark forecasting app. Move visualization features to a different file.
* Fix spark data processing bug when >1 index_cols.
* Create pyspark app for anomaly detection.
* Add Dockerfile.
* Add docs & clean up code.
* Remove wheel & pytest dependencies.
* Add backward compatibility with Spark 3.1.1.
* Remove strict pyspark version requirement.
* Remove pyspark session helper.
* Update Dockerfile to extend pyspark image.
* Minor rearrangement of Dockerfile.
* Update version to 1.2.2.
* Add .dockerignore.
* Streamline Dockerfile.
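The commit message says target_seq_index becomes a property on ForecasterEnsemble so that one value is enforced across all models. The sketch below illustrates that general pattern only. `ToyForecaster` and `ToyEnsemble` are hypothetical stand-ins, not Merlion classes, and the conflict-handling behavior is an assumption rather than something the commit confirms.

```python
class ToyForecaster:
    """Hypothetical stand-in for a forecasting model (not a Merlion class)."""

    def __init__(self, target_seq_index=None):
        self.target_seq_index = target_seq_index


class ToyEnsemble:
    """Hypothetical ensemble that keeps one target_seq_index for all members."""

    def __init__(self, models, target_seq_index=None):
        self.models = list(models)
        if target_seq_index is None:
            # Infer the shared value from the members, provided they agree.
            values = {m.target_seq_index for m in self.models if m.target_seq_index is not None}
            if len(values) > 1:
                raise ValueError(f"Conflicting target_seq_index values: {sorted(values)}")
            target_seq_index = values.pop() if values else None
        # Assigning through the property pushes the value to every member.
        self.target_seq_index = target_seq_index

    @property
    def target_seq_index(self):
        return self._target_seq_index

    @target_seq_index.setter
    def target_seq_index(self, value):
        self._target_seq_index = value
        for model in self.models:
            model.target_seq_index = value


ensemble = ToyEnsemble([ToyForecaster(), ToyForecaster(target_seq_index=2)])
```

Because every write goes through the setter, the ensemble and its members cannot drift apart after construction, which is the main reason to prefer a property over a plain attribute here.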
Showing 21 changed files with 687 additions and 130 deletions.
@@ -0,0 +1,22 @@
# package
__pycache__
*.egg-info
data
docs
tmp
ts_datasets
# pytest
.pytest_cache
.coverage*
htmlcov
# IDE/system
.idea
*.swp
.DS_Store
sandbox
.vscode
Icon?
# build files
docs/build/*
.ipynb_checkpoints
venv/
@@ -0,0 +1,34 @@
ARG spark_uid=185
FROM apache/spark-py:v3.2.1

# Change to root user for installation steps
USER 0

# Uninstall existing python and replace it with miniconda.
# This is to get the right version of Python in Debian, since Prophet doesn't play nice with Python 3.9+.
# FIXME: maybe optimize the size? this image is currently 3.2GB.
RUN apt-get update && \
    apt-get remove -y python3 python3-pip && \
    apt-get install -y --no-install-recommends curl && \
    apt-get autoremove -yqq --purge && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh && \
    # Install prophet while we're at it, since this is easier to conda install than pip install
    /opt/conda/bin/conda install -y prophet && \
    /opt/conda/bin/conda clean -ya
ENV PATH="/opt/conda/bin:${SPARK_HOME}/bin:${PATH}"

# Install pyarrow (for spark-sql) and Merlion; get pyspark & py4j from the PYTHONPATH
ENV PYTHONPATH="${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:${PYTHONPATH}"
COPY *.md ./
COPY setup.py ./
COPY merlion merlion
RUN pip install pyarrow "./[prophet]" && pip uninstall -y py4j

# Copy Merlion pyspark apps
COPY spark /opt/spark/apps
USER ${spark_uid}
@@ -0,0 +1,33 @@
merlion.spark package
=====================
This module implements APIs to integrate Merlion with PySpark. The expected use case is to
use distributed computing to train and run inference on multiple time series in parallel.

.. automodule:: merlion.spark
    :members:
    :undoc-members:
    :show-inheritance:

.. autosummary::
    dataset
    pandas_udf

Submodules
----------

merlion.spark.dataset module
----------------------------

.. automodule:: merlion.spark.dataset
    :members:
    :undoc-members:
    :show-inheritance:

merlion.spark.pandas\_udf module
--------------------------------

.. automodule:: merlion.spark.pandas_udf
    :members:
    :undoc-members:
    :show-inheritance:
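
The module description above targets training and inference over many time series in parallel. As a rough illustration of that group-then-apply pattern, the sketch below groups rows by series ID and runs one model call per group. It uses only the Python standard library as a stand-in for Spark's grouped execution, and `naive_forecast` and `forecast_all` are hypothetical names, not part of merlion.spark. In PySpark, the comparable mechanism would be a grouped pandas function such as `groupBy(...).applyInPandas(...)`. Whether merlion.spark uses that exact call is not shown in this diff.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def naive_forecast(series_id, values, horizon):
    """Hypothetical per-series model: repeat the mean of the last 3 points."""
    window = values[-3:]
    level = sum(window) / len(window)
    return series_id, [level] * horizon


def forecast_all(rows, horizon=2, max_workers=4):
    """Group (series_id, timestamp, value) rows by ID and forecast each group."""
    groups = defaultdict(list)
    # Sort by (series_id, timestamp) so each group's values are in time order.
    for series_id, _timestamp, value in sorted(rows, key=lambda r: (r[0], r[1])):
        groups[series_id].append(value)
    # One task per series; a thread pool stands in for Spark executors here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(naive_forecast, sid, vals, horizon) for sid, vals in groups.items()]
        return dict(f.result() for f in futures)


rows = [
    ("store_a", 1, 10.0), ("store_a", 2, 12.0), ("store_a", 3, 14.0),
    ("store_b", 1, 5.0), ("store_b", 2, 5.0),
]
forecasts = forecast_all(rows)
```

Each series is handled independently, so the per-group function needs no knowledge of the other series. That independence is what makes the work easy to distribute.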