This repository contains scripts for parsing training data and models related to the WWW 2021 paper: Towards a Lightweight, Hybrid Approach for Detecting DOM XSS Vulnerabilities with Machine Learning.
@inproceedings{domxss-ml:www2021,
title = {Towards a Lightweight, Hybrid Approach for Detecting {DOM} {XSS} Vulnerabilities with Machine Learning},
author = {William Melicher and Clement Fung and Lujo Bauer and Limin Jia},
booktitle = {Proceedings of The Web Conference},
year = 2021,
url = {https://www.ece.cmu.edu/~lbauer/papers/2021/www2021-dom-xss-dnn.pdf},
doi = {10.1145/3442381.3450062},
copyright = {International World Wide Web Conference Committee},
license = {CC-BY 4.0},
}
The datasets used in this study, as well as files for pre-trained models, can also be found at the following link.
This project relies on TensorFlow 1.14. Additional C modules for parsing word_bag objects are also required; they are provided in js-build/libword_bag_ops.so.
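Before training, it can help to confirm that the shared library is where the configuration expects it. The following is a minimal sketch, not part of the repository: the helper names op_library_path and check_op_library are hypothetical, and the only facts taken from this README are the directory js-build and the file name libword_bag_ops.so.

```python
import os

# File name taken from this README; the helpers below are hypothetical.
LIB_NAME = "libword_bag_ops.so"


def op_library_path(c_module_dir, lib_name=LIB_NAME):
    """Build the path of the shared library inside c_module_dir."""
    return os.path.join(c_module_dir, lib_name)


def check_op_library(c_module_dir):
    """Return the library path, or raise if the file does not exist."""
    path = op_library_path(c_module_dir)
    if not os.path.isfile(path):
        raise FileNotFoundError("C module not found: " + path)
    return path
```

Calling check_op_library("js-build") from the repository root would fail early with a clear message if the C dependencies are missing, rather than partway through a training run.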
word_bag/word_bag.tf is the main script that handles all training, testing, and evaluation of the models.
As a minimal example, the scripts train.sh and eval.sh are provided as a quickstart.
To load configuration options, use the scripts provided in config.sh and the configs directory.
c_module_dir: Points to the C dependencies. Relative paths are okay, so in most cases, js-build is appropriate.
n_features: The hashsize of the word bag hashing, used as the size of the input to the embedding layer. In our study, a hashsize of 2^18, or 262144, was used.
dnn_embedding_size: The size of the first embedding layer. In our study, we used an embedding size of 64.
batch_size: Batch size used for training or evaluation.
classifier_name: custom_dnn_classifier for a DNN, linear_classifier for a linear model.
dnn_hidden_units: A nested list of integers, where each added number adds a new layer of the given size. For a first layer of N, our convention was [N, N/2, N/4].
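To make the roles of n_features and dnn_hidden_units concrete, here is a small sketch of hashed word-bag indexing and of the [N, N/2, N/4] layer convention. It is illustrative only: the use of CRC32 as the hash function is an assumption (the hash used by the C module is not described here), and the function names bucket, word_bag, and hidden_units are hypothetical.

```python
import zlib

# 2^18 = 262144, the hash size reported for the study.
N_FEATURES = 2 ** 18


def bucket(token, n_features=N_FEATURES):
    """Map a token to an index in [0, n_features).

    CRC32 is an assumption made for this sketch, not the project's hash.
    """
    return zlib.crc32(token.encode("utf-8")) % n_features


def word_bag(tokens, n_features=N_FEATURES):
    """Count tokens per hash bucket, returning {bucket_index: count}."""
    counts = {}
    for token in tokens:
        index = bucket(token, n_features)
        counts[index] = counts.get(index, 0) + 1
    return counts


def hidden_units(first_layer):
    """Layer sizes following the [N, N/2, N/4] convention."""
    return [first_layer, first_layer // 2, first_layer // 4]
```

Every index produced by bucket is smaller than n_features, which is why n_features also fixes the input size of the embedding layer.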
The data is provided in both GZIP and LZMA formats. The same data-loading configuration is used for both training and testing.
compression_type: GZIP
compression_type: LZMA
file_format: lines-cache
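For inspecting the data files outside of the training script, the two compression options map directly onto Python's standard gzip and lzma modules. The sketch below is an assumption-laden convenience, not the project's loader: open_compressed is a hypothetical helper, and it assumes the decompressed content is UTF-8 text.

```python
import gzip
import lzma

# Keys mirror the compression_type values listed above.
_OPENERS = {"GZIP": gzip.open, "LZMA": lzma.open}


def open_compressed(path, compression_type):
    """Open a compressed file for reading text.

    Assumes UTF-8 text content; this helper is hypothetical.
    """
    try:
        opener = _OPENERS[compression_type]
    except KeyError:
        raise ValueError("unsupported compression_type: " + repr(compression_type))
    return opener(path, "rt", encoding="utf-8")
```

Using a lookup table keeps the accepted compression_type values in one place, so an unsupported value fails with an explicit error instead of a decoding failure.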