This README gives precise steps to reproduce the results presented in "Leveraging Automated Unit Tests for Unsupervised Code Translation".
The scripts that create the offline and online data for self-training are in `codegen_sources/test_generation/`.
First, follow the instructions of the CodeGen repository to create a function-level monolingual dataset for Java:
- Follow the BigQuery instructions and download the raw Java data
- Run `preprocess.py`:

```bash
# <DATASET_PATH>: folder containing the raw data, i.e. the json.gz files
# --langs: languages to preprocess
# --mode: dataset mode
# --local: run on your local machine if True; if False, run on a cluster (requires a submitit setup)
# --train_splits: number of splits for the training data - corresponds to the number of GPUs you have
python -m codegen_sources.preprocessing.preprocess \
    <DATASET_PATH> \
    --langs java \
    --mode=monolingual_functions \
    --local=False \
    --bpe_mode=fast \
    --train_splits=NGPU
```
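Before launching the preprocessing, you can check that `<DATASET_PATH>` actually contains the raw `.json.gz` shards the command expects (an illustrative one-liner, not part of the pipeline):

```bash
# Quick sanity check: the preprocessing step reads the raw BigQuery dump
# as .json.gz files from <DATASET_PATH>.
ls <DATASET_PATH>/*.json.gz | head
```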
Then download the released TransCoder and DOBF models and run `create_self_training_dataset.sh` to create the online and offline self-training datasets (from the root of this repository):
```bash
JAVA_FUNC_DATASET=<PATH_TO_THE_JAVA_DATASET>
MODELS_PATH=<PATH_TO_YOUR_MODELS>
OUTPUT_DIRECTORY=<OUTPUT_PATH>

# Download models
wget https://dl.fbaipublicfiles.com/transcoder/pre_trained_models/translator_transcoder_size_from_DOBF.pth -P $MODELS_PATH
wget https://dl.fbaipublicfiles.com/transcoder/pre_trained_models/TransCoder_model_1.pth -P $MODELS_PATH

# Create data (it will take a while)
bash codegen_sources/test_generation/create_self_training_dataset.sh $JAVA_FUNC_DATASET $MODELS_PATH $OUTPUT_DIRECTORY
```
Your files for online training will be created in `${OUTPUT_DIRECTORY}/online_ST_files/`.
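Based on the paths used by the training command below, that directory should contain `translated_tests.json` and an `initial_cache/` folder; you can check with the following (the exact file listing may vary):

```bash
# The training command below reads translated_tests.json and initial_cache/
# from this directory.
ls ${OUTPUT_DIRECTORY}/online_ST_files/
```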
You can then train a model with online self-training:

```bash
ONLINE_ST_FILES=${OUTPUT_DIRECTORY}/online_ST_files/
DUMP_PATH=<DUMP_PATH>
MODEL_TO_RELOAD="${MODELS_PATH}/TransCoder_model_1.pth"

# The flags below are grouped in order: general parameters, model parameters,
# optimization parameters, evaluation parameters, and self-training parameters.
python codegen_sources/model/train.py \
    --dump_path $DUMP_PATH \
    --exp_name online_st \
    --data_path $ONLINE_ST_FILES \
    --unit_tests_path "${ONLINE_ST_FILES}/translated_tests.json" \
    --encoder_only false \
    --n_layers 0 \
    --n_layers_decoder 6 \
    --n_layers_encoder 6 \
    --emb_dim 1024 \
    --n_heads 8 \
    --dropout '0.1' \
    --lgs 'cpp_sa-java_sa-python_sa' \
    --max_vocab 64000 \
    --max_len 512 \
    --reload_model "${MODEL_TO_RELOAD},${MODEL_TO_RELOAD}" \
    --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.00003,weight_decay=0.01' \
    --amp 2 \
    --fp16 true \
    --max_batch_size 128 \
    --tokens_per_batch 4000 \
    --epoch_size 2500 \
    --max_epoch 10000000 \
    --clip_grad_norm 1 \
    --stopping_criterion 'valid_java_sa-python_sa_mt_comp_acc,25' \
    --validation_metrics 'valid_java_sa-python_sa_mt_comp_acc,valid_java_sa-cpp_sa_mt_comp_acc,valid_python_sa-java_sa_mt_comp_acc,valid_python_sa-cpp_sa_mt_comp_acc,valid_cpp_sa-java_sa_mt_comp_acc,valid_cpp_sa-python_sa_mt_comp_acc' \
    --has_sentence_ids 'valid|para,test|para,self_training|java_sa' \
    --eval_bleu true \
    --eval_computation true \
    --generate_hypothesis true \
    --eval_st false \
    --eval_only false \
    --st_steps 'java_sa-python_sa|cpp_sa' \
    --st_beam_size 20 \
    --lambda_st 1 \
    --robin_cache false \
    --st_sample_size 200 \
    --st_limit_tokens_per_batch true \
    --st_remove_proba '0.3' \
    --st_sample_cache_ratio '0.5' \
    --cache_init_path "${ONLINE_ST_FILES}/initial_cache/" \
    --cache_warmup 500
```
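Note that, in this XLM-style trainer, `--stopping_criterion 'valid_java_sa-python_sa_mt_comp_acc,25'` stops training once the Java -> Python computational accuracy on the validation set has not improved for 25 consecutive evaluations, while `--validation_metrics` tracks the computational accuracy of all six translation directions.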
To evaluate a model, simply replace the variable `MODEL_TO_RELOAD` with the path to your model in the command above and add `--eval_only true`.
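For instance (a sketch; the checkpoint path is a placeholder for your own trained model):

```bash
# Evaluation-only run: reuse the full training command above, pointing
# MODEL_TO_RELOAD at your own checkpoint and appending --eval_only true.
MODEL_TO_RELOAD=<PATH_TO_YOUR_CHECKPOINT>
# python codegen_sources/model/train.py ... \
#     --reload_model "${MODEL_TO_RELOAD},${MODEL_TO_RELOAD}" \
#     ... \
#     --eval_only true
```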
Here are the best checkpoints for each direction:
- CPP -> Java: Online_ST_CPP_Java.pth
- CPP -> Python: Online_ST_CPP_Python.pth
- Java -> CPP: Online_ST_Java_CPP.pth
- Java -> Python: Online_ST_Java_Python.pth
- Python -> CPP: Online_ST_Python_CPP.pth
- Python -> Java: Online_ST_Python_Java.pth
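To evaluate one of these released checkpoints, download it and plug it into the evaluation command above (a sketch; `<CHECKPOINT_URL>` is a placeholder for the actual download link of the checkpoint you want):

```bash
# Download a released checkpoint (URL placeholder) and point the
# evaluation command above at it.
wget <CHECKPOINT_URL>/Online_ST_Java_Python.pth -P $MODELS_PATH
MODEL_TO_RELOAD="${MODELS_PATH}/Online_ST_Java_Python.pth"
```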