Developers

Developer Setup

The following steps are only required for local development and testing. The containerized version is recommended for production use.

  1. Install the following packages using your OS package manager (apt, yum, homebrew, etc.):

    1. make
    2. shellcheck
    3. shfmt
  2. Clone this repository.

    git clone [email protected]:center-for-threat-informed-defense/tram.git
  3. Change to the TRAM directory.

    cd tram/
  4. Create a virtual environment and activate it.

    1. Mac and Linux

      python3 -m venv venv
      source venv/bin/activate
    2. Windows

      python -m venv venv
      venv\Scripts\activate.bat
  5. Install Python application requirements.

    pip install -r requirements/requirements.txt
    pip install -r requirements/test-requirements.txt
  6. Install the pre-commit hooks.

    pre-commit install
  7. Set up the application database.

    tram makemigrations tram
    tram migrate
  8. Load the ATT&CK and training data, then train the machine learning models.

    tram attackdata load
    tram pipeline load-training-data
    tram pipeline train --model nb
    tram pipeline train --model logreg
    tram pipeline train --model nn_cls
  9. Download the pre-trained tokenizer and BERT models.

    python3 -c "import os; import transformers; mdl = transformers.AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased'); mdl.save_pretrained('data/ml-models/priv-allenai-scibert-scivocab-uncased')"
    
    mkdir -p data/ml-models/bert_model
    curl -L "https://ctidtram.blob.core.windows.net/tram-models/single-label-202308303/config.json" \
        -o data/ml-models/bert_model/config.json
    curl -L "https://ctidtram.blob.core.windows.net/tram-models/single-label-202308303/pytorch_model.bin" \
        -o data/ml-models/bert_model/pytorch_model.bin
  10. Create a superuser (web login).

    tram createsuperuser
  11. Run the application server.

    DJANGO_DEBUG=1 tram runserver
  12. Open the application in your web browser.

    1. Navigate to http://localhost:8000 and log in with the superuser account created above.
  13. In a separate terminal window, run the ML pipeline.

    cd tram/
    source venv/bin/activate
    tram pipeline run
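
As a quick sanity check, note that the tram commands above are all standard Django management commands, which suggests that tram wraps Django's manage.py; if so, listing the available commands is a reasonable smoke test (the help command here is an assumption based on that pattern, not something documented above):

    tram help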

Makefile Targets

The repository includes a Makefile with handy shortcuts for common development tasks:

  • Run TRAM application
    • make start-container
  • Stop TRAM application
    • make stop-container
  • View TRAM logs
    • make container-logs
  • Build Python virtualenv
    • make venv
  • Install production Python dependencies
    • make install
  • Install prod and dev Python dependencies
    • make install-dev
  • Manually run pre-commit hooks without performing a commit
    • make pre-commit-run
  • Build container image (by default, the container is tagged with a timestamp and the git hash of the codebase; see the note below about custom CA certificates in the Docker build)
    • make build-container
  • Run linting locally
    • make lint
  • Run unit tests, safety, and bandit locally
    • make test
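
For a typical local iteration, these targets compose naturally; for example (a sketch using only the targets listed above):

# Build the virtualenv, install dev dependencies, then lint and test:
$ make venv
$ make install-dev
$ make lint
$ make test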

Testing

The automated test suite runs inside tox, which guarantees a consistent testing environment, but also has considerable overhead. When writing code, it may be useful to run pytest directly, which is considerably faster and can also be used to run a specific test. Here are some useful pytest commands:

# Run the entire test suite:
$ pytest tests/

# Run tests in a specific file:
$ pytest tests/tram/test_models.py

# Run a test by name:
$ pytest tests/ -k test_mapping_repr_is_correct

# Run tests with code coverage tracking, and show which lines are missing coverage:
$ pytest --cov=tram --cov-report=term-missing tests/
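
To get the consistent, isolated run described above, invoke tox directly (assuming tox is installed, e.g. via the test requirements; the exact environments are defined in the repository's tox.ini):

# Run the complete suite in tox's managed environments:
$ tox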

Custom CA Certificate

If you are building the container in an environment that intercepts SSL connections, you can specify a root CA certificate to inject into the container at build time. (This is only necessary for the TRAM application container. The TRAM Nginx container does not make outbound connections.)

Export the following two variables in your environment.

$ export TRAM_CA_URL="http://your.domain.com/root.crt"
$ export TRAM_CA_THUMBPRINT="C7:E0:F9:69:09:A4:A3:E7:A9:76:32:5F:68:79:9A:85:FD:F9:B3:BD"

The first variable is a URL to a PEM certificate containing a root certificate that you want to inject into the container. (If you use an https URL, then certificate checking is disabled.) The second variable is a SHA-1 certificate thumbprint that is used to verify that the correct certificate was downloaded. You can obtain the thumbprint with the following OpenSSL command:

$ openssl x509 -in <your-cert.crt> -fingerprint -noout
SHA1 Fingerprint=C7:E0:F9:69:09:A4:A3:E7:A9:76:32:5F:68:79:9A:85:FD:F9:B3:BD
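
If the certificate only exists at the URL, one way to obtain the thumbprint is to download it first and fingerprint the local copy (this assumes the URL serves a PEM-encoded certificate, as described above):

$ curl -L "$TRAM_CA_URL" -o root.crt
$ openssl x509 -in root.crt -fingerprint -sha1 -noout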

After exporting these two variables, you can run make build-container as usual and the TRAM container will contain your specified root certificate.

Machine Learning Development

All source code related to machine learning is located in src/tram/ml.

Existing ML Models

TRAM has five machine learning models that can be used out of the box:

  • SKLearn models
    1. LogisticRegressionModel - Uses SKLearn's Logistic Regression.
    2. NaiveBayesModel - Uses SKLearn's Multinomial NB.
    3. Multilayer Perceptron - Uses SKLearn's MLPClassifier.
    4. DummyModel - Uses SKLearn's Dummy Classifier for testing purposes.
  • Large Language Models (PyTorch)
    1. BERT Classifier - Uses Hugging Face's transformers library with a fine-tuned BERT model.

The SKLearn models are each implemented as an SKLearn Pipeline. Machine learning engineers will find that it's pretty easy to plug in a new SKLearn model (see Creating Your Own SKLearn Model).

Creating Your Own SKLearn Model

To write your own model, take the following steps:

  1. Create a subclass of tram.ml.base.SKLearnModel that implements the get_model function. See Existing ML Models for examples that can be copied; a fuller sketch also appears after this list.

    from sklearn.dummy import DummyClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline

    from tram.ml.base import SKLearnModel

    class DummyModel(SKLearnModel):
        def get_model(self):
            # Your model goes here
            return Pipeline([
                ("features", CountVectorizer(lowercase=True, stop_words='english', min_df=3)),
                ("clf", DummyClassifier(strategy='uniform'))
            ])
  2. Add your model to the ModelManager registry. (Note: this registration method can be improved; pull requests welcome!)

    class ModelManager(object):
        model_registry = {
            'dummy': DummyModel,
            'nb': NaiveBayesModel,
            'logreg': LogisticRegressionModel,
            # Your model on the line below
            'your-model': python.path.to.your.model
        }
  3. You can now train your model, and the model will appear in the application interface.

    tram pipeline train --model your-model
  4. If you are interested in sharing your model with the community, thank you! Please open a Pull Request with your model, and please include performance statistics in your Pull Request description.
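
Putting the steps together, a complete new model might look like the following sketch. The TfidfVectorizer and RandomForestClassifier choices, the RandomForestModel class name, and the 'rf' registry key are illustrative assumptions, not part of TRAM:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline

    from tram.ml.base import SKLearnModel

    class RandomForestModel(SKLearnModel):
        def get_model(self):
            # TF-IDF features feeding a random forest classifier
            return Pipeline([
                ("features", TfidfVectorizer(lowercase=True, stop_words='english', min_df=3)),
                ("clf", RandomForestClassifier(n_estimators=100))
            ])

After registering it in model_registry under a key such as 'rf', it could be trained with tram pipeline train --model rf.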