Merge pull request #1 from rapidsai/branch-0.14
Merge updates
bsuryadevara authored May 5, 2020
2 parents 998cd9e + 5803d9a commit 8258e86
Showing 17 changed files with 1,035 additions and 79 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -2,11 +2,16 @@

## New Features
- PR #141 CUDA BERT Tokenizer
- PR #152 Local gpuCI build script
- PR #133 Phishing detection using BERT

## Improvements
- PR #149 Add Versioneer
- PR #151 README and CONTRIBUTING updates

## Bug Fixes
- PR #150 Fix splunk alert workflow test
- PR #154 Local gpuCI build fix

# clx 0.13.0 (Date TBD)

7 changes: 1 addition & 6 deletions CONTRIBUTING.md
@@ -85,7 +85,7 @@ To install CLX from source, ensure the dependencies are met and follow the steps
1) Clone the repository and submodules

```bash
# Set the localtion to CLX in an environment variable CLX_HOME
# Set the location to CLX in an environment variable CLX_HOME
export CLX_HOME=$(pwd)/clx

# Download the CLX repo
@@ -173,11 +173,6 @@ $ ./build.sh libclx -n # compile libclx but do not install

Note: This conda installation only applies to Linux and Python versions 3.6/3.7.

### Building and Testing on a gpuCI image locally

Before submitting a pull request, you can do a local build and test on your machine that mimics our gpuCI environment using the `ci/local/build.sh` script.
For detailed information on usage of this script, see [here](ci/local/README.md).

## Creating documentation

Python API documentation can be generated from [docs](docs) directory.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -15,7 +15,7 @@ RUN apt update -y --fix-missing && \
apt install -y vim

RUN source activate rapids \
&& conda install -c pytorch pytorch==1.3.1 torchvision=0.4.2 datashader>=0.10.* panel=0.6.* geopandas>=0.6.* pyppeteer s3fs \
&& conda install -c pytorch pytorch==1.3.1 torchvision=0.4.2 datashader>=0.10.* panel=0.6.* geopandas>=0.6.* pyppeteer s3fs ipywidgets \
&& pip install "git+https://github.com/rapidsai/cudatashader.git"

# libclx build/install
112 changes: 59 additions & 53 deletions README.md
@@ -71,31 +71,69 @@ for rule in alerts_per_day_piv.columns:

```

## Installation
CLX is available in a Docker container, by building from source, and through Conda installation. There are multiple ways to start the CLX container, depending on whether you want a container with only RAPIDS and CLX or multiple containers that enable SIEM integration and data ingest.
## Getting Started With Workflows

In addition to traditional Python files and Jupyter notebooks, CLX also includes structure in the form of a workflow. A workflow is a series of data transformations performed on a [GPU dataframe](https://github.com/rapidsai/cudf) that contains raw cyber data, with the goal of surfacing meaningful cyber analytical output. Multiple I/O methods are available, including Kafka and on-disk file stores.

Example workflow reading from and writing to a file:

```python
from clx.workflow import netflow_workflow

source = {
"type": "fs",
"input_format": "csv",
"input_path": "/path/to/input",
"schema": ["firstname","lastname","gender"],
"delimiter": ",",
"required_cols": ["firstname","lastname","gender"],
"dtype": ["str","str","str"],
"header": "0"
}
dest = {
"type": "fs",
"output_format": "csv",
"output_path": "/path/to/output"
}
wf = netflow_workflow.NetflowWorkflow(source=source, destination=dest, name="my-netflow-workflow")
wf.run_workflow()
```
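
A Kafka-backed workflow follows the same pattern, swapping out the source and destination dictionaries. The sketch below is illustrative only; the `"type": "kafka"` value, broker address, topic names, and key names (for example `kafka_brokers`, `consumer_kafka_topics`, `publisher_kafka_topic`) are assumptions and should be verified against the CLX workflow I/O documentation.

```python
from clx.workflow import netflow_workflow

# Hypothetical Kafka source/destination configs -- the key names below are
# assumptions, not a confirmed CLX API; check the CLX I/O reader/writer docs.
source = {
    "type": "kafka",
    "kafka_brokers": "localhost:9092",
    "group_id": "clx-netflow",
    "batch_size": 100,
    "consumer_kafka_topics": ["netflow_input"],
    "time_window": 5
}
dest = {
    "type": "kafka",
    "kafka_brokers": "localhost:9092",
    "batch_size": 100,
    "publisher_kafka_topic": "netflow_output",
    "output_delimiter": ","
}
wf = netflow_workflow.NetflowWorkflow(source=source, destination=dest, name="my-kafka-netflow-workflow")
wf.run_workflow()
```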

For additional examples, browse our complete [API documentation](https://rapidsai.github.io/clx/), or check out our more detailed [notebooks](https://github.com/rapidsai/clx/tree/master/notebooks).

### Docker Container without SIEM Integration

#### Install via CLX Docker Container

## Getting CLX
### Intro
There are 3 ways to get CLX:
1. [Quick Start with CLX Docker Container](#quick)
1. [Conda Installation](#conda)
1. [Build from Source](#source)

<a name="quick"></a>

## Quick Start Docker Container

Prerequisites

* NVIDIA Pascal™ GPU architecture or better
* CUDA 9.2 or 10.0 compatible NVIDIA driver
* CUDA 10.0+ compatible NVIDIA driver
* Ubuntu 16.04/18.04 or CentOS 7
* Docker CE v18+
* nvidia-docker v2+

Pull the RAPIDS image suitable for your environment and build the CLX image.

```bash
docker pull rapidsai/rapidsai-dev-nightly:0.12-cuda9.2-devel-ubuntu18.04-py3.7
docker build --build-arg image=rapidsai/rapidsai-dev-nightly:0.12-cuda9.2-devel-ubuntu18.04-py3.7 -t clx:latest .
docker pull rapidsai/rapidsai-dev-nightly:0.14-cuda10.1-devel-ubuntu18.04-py3.7
docker build --build-arg image=rapidsai/rapidsai-dev-nightly:0.14-cuda10.1-devel-ubuntu18.04-py3.7 -t clx:latest .
```

Now start the container and the notebook server. There are multiple ways to do this, depending on what version of Docker you have.
### Docker Container without SIEM Integration

##### Preferred - Docker CE v19+ and nvidia-container-toolkit
Start the container and the notebook server. There are multiple ways to do this, depending on what version of Docker you have.

#### Preferred - Docker CE v19+ and nvidia-container-toolkit
```bash
docker run --gpus '"device=0"' \
--rm -d \
@@ -105,7 +143,7 @@
clx:latest
```

##### Legacy - Docker CE v18 and nvidia-docker2
#### Legacy - Docker CE v18 and nvidia-docker2
```bash
docker run --runtime=nvidia \
--rm -d \
@@ -117,61 +155,29 @@

### Docker Container with SIEM Integration

If you want a CLX container with SIEM integration (including data ingest), follow the steps above to pull and build the CLX container. Then use `docker-compose` to start multiple containers running CLX, Kafka, and Zookeeper.
If you want a CLX container with SIEM integration (including data ingest), follow the steps above to build the CLX image. Then use `docker-compose` to start multiple containers running CLX, Kafka, and Zookeeper.

```bash
docker-compose up
```

### Install from Source
You can install CLX from source on an existing RAPIDS container. A RAPIDS image suitable for your environment can be pulled from [https://hub.docker.com/r/rapidsai/rapidsai/](https://hub.docker.com/r/rapidsai/rapidsai/).

```bash
# Run tests
pip install pytest
pytest

# Build and install
python setup.py install
```
### Conda Install
You can conda install CLX on an existing RAPIDS container. A RAPIDS image suitable for your environment can be pulled from [https://hub.docker.com/r/rapidsai/rapidsai/](https://hub.docker.com/r/rapidsai/rapidsai/).
<a name="conda"></a>

```
conda install -c rapidsai-nightly -c rapidsai -c nvidia -c pytorch -c conda-forge -c defaults clx
```
## Conda Install
It is easy to install CLX using conda. You can get a minimal conda installation with Miniconda or get the full installation with Anaconda.

## Getting Started With Workflows

In addition to traditional Python files and Jupyter notebooks, CLX also includes structure in the form of a workflow. A workflow is a series of data transformations performed on a [GPU dataframe](https://github.com/rapidsai/cudf) that contains raw cyber data, with the goal of surfacing meaningful cyber analytical output. Multiple I/O methods are available, including Kafka and on-disk file stores.
Install and update CLX using the conda command:

Example workflow reading from and writing to a file:

```python
from clx.workflow import netflow_workflow

source = {
"type": "fs",
"input_format": "csv",
"input_path": "/path/to/input",
"schema": ["firstname","lastname","gender"],
"delimiter": ",",
"required_cols": ["firstname","lastname","gender"],
"dtype": ["str","str","str"],
"header": "0"
}
dest = {
"type": "fs",
"output_format": "csv",
"output_path": "/path/to/output"
}
wf = netflow_workflow.NetflowWorkflow(source=source, destination=dest, name="my-netflow-workflow")
wf.run_workflow()
```
conda install -c rapidsai-nightly -c nvidia -c pytorch -c conda-forge -c defaults clx
```
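
After the install completes, a quick check from a Python interpreter confirms the package is importable. The availability of `clx.__version__` is an assumption based on the Versioneer support added in this release (PR #149); if it is absent, the bare import still serves as a smoke test.

```python
# Minimal post-install smoke test; clx.__version__ assumes Versioneer metadata
# (added in PR #149) is present in the installed package.
import clx

print(clx.__version__)
```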

For additional examples, browse our complete [API documentation](https://rapidsai.github.io/clx/), or check out our more detailed [notebooks](https://github.com/rapidsai/clx/tree/master/notebooks).


## Contributing
<a name="source"></a>

## Building from Source and Contributing

For contributing guidelines, please reference our [guide for contributing](https://github.com/rapidsai/clx/blob/master/CONTRIBUTING.md).
For contributing guidelines, please reference our [guide for contributing](CONTRIBUTING.md).
2 changes: 1 addition & 1 deletion ci/cpu/clx/upload-anaconda.sh
@@ -23,4 +23,4 @@ fi

echo "Upload"
echo ${UPLOADFILE}
anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --force ${UPLOADFILE}
anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --skip-existing ${UPLOADFILE}
2 changes: 1 addition & 1 deletion ci/cpu/libclx/upload-anaconda.sh
@@ -29,7 +29,7 @@ if [ "$UPLOAD_LIBCLX" == "1" ]; then

echo "Upload"
echo ${UPLOADFILE}
anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --force ${UPLOADFILE}
anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --skip-existing ${UPLOADFILE}
else
echo "Skipping libclx upload"
fi
1 change: 1 addition & 0 deletions ci/gpu/build.sh
@@ -76,6 +76,7 @@ $WORKSPACE/build.sh clean libclx clx
if hasArg --skip-tests; then
logger "Skipping Tests..."
else
cd ${WORKSPACE}/python
py.test --ignore=ci --cache-clear --junitxml=${WORKSPACE}/junit-clx.xml -v
${WORKSPACE}/ci/gpu/test-notebooks.sh 2>&1 | tee nbtest.log
python ${WORKSPACE}/ci/utils/nbtestlog2junitxml.py nbtest.log
2 changes: 1 addition & 1 deletion ci/gpu/test-notebooks.sh
@@ -10,7 +10,7 @@ TOPLEVEL_NB_FOLDERS=$(find . -name *.ipynb |cut -d'/' -f2|sort -u)

# Add notebooks that should be skipped here
# (space-separated list of filenames without paths)
SKIPNBS="DGA_Detection.ipynb FLAIR_DNS_Log_Parsing.ipynb Alert_Analysis_with_CLX.ipynb cybert_example_training.ipynb CLX_Workflow_Notebook1.ipynb CLX_Workflow_Notebook2.ipynb CLX_Workflow_Notebook3.ipynb Network_Mapping_With_RAPIDS_And_CLX.ipynb"
SKIPNBS="DGA_Detection.ipynb FLAIR_DNS_Log_Parsing.ipynb Alert_Analysis_with_CLX.ipynb cybert_example_training.ipynb CLX_Workflow_Notebook1.ipynb CLX_Workflow_Notebook2.ipynb CLX_Workflow_Notebook3.ipynb Network_Mapping_With_RAPIDS_And_CLX.ipynb Phishing_Detection_using_Bert_CLX.ipynb"


## Check env
57 changes: 57 additions & 0 deletions ci/local/README.md
@@ -0,0 +1,57 @@
## Purpose

This script is designed for developer and contributor use. It mimics the actions of gpuCI on your local machine, allowing you to test and even debug your code inside a gpuCI base container before pushing it as a GitHub commit.
The script can be helpful in locally triaging and debugging RAPIDS continuous integration failures.

## Requirements

```
nvidia-docker
```

## Usage

```
bash build.sh [-h] [-H] [-s] [-r <repo_dir>] [-i <image_name>]
Build and test your local repository using a base gpuCI Docker image
where:
-H Show this help text
-r Path to repository (defaults to working directory)
-i Use Docker image (default is gpuci/rapidsai-base:cuda10.0-ubuntu16.04-gcc5-py3.6)
-s Skip building and testing and start an interactive shell in a container of the Docker image
```

Example Usage:
`bash build.sh -r ~/rapids/clx -i gpuci/rapidsai-base:cuda10.1-ubuntu16.04-gcc5-py3.6`

For a full list of available gpuCI docker images, visit our [DockerHub](https://hub.docker.com/r/gpuci/rapidsai-base/tags) page.

Style Check:
```bash
$ bash ci/local/build.sh -r ~/rapids/clx -s
$ source activate gdf #Activate gpuCI conda environment
$ cd rapids
$ flake8 python
```

## Information

There are some caveats to be aware of when using this script, especially if you plan on developing from within the container itself.


### Docker Image Build Repository

The Docker image will generate build artifacts in a folder on your machine, located in the root directory of the repository you passed to the script. For the above example, the directory is named `~/rapids/clx/build_rapidsai-base_cuda10.1-ubuntu16.04-gcc5-py3.6/`. Feel free to remove this directory after the script is finished.

*Note*: The script *will not* overwrite your local build repository. Your local environment stays intact.


### Where The User is Dumped

The script will build your repository and run all tests. If any tests fail, it dumps you into the Docker container itself so you can debug from within the container. If all the tests pass as expected, the container exits and is automatically removed. Remember to exit the container if tests fail and you do not wish to debug within the container itself.


### Container File Structure

Your repository will be located in the `/rapids/` folder of the container. This folder is volume mounted from the local machine. Any changes to the code in this repository are replicated onto the local machine. The `cpp/build` and `python/build` directories within your repository are on a separate mount to avoid conflicting with your local build artifacts.
