Follow these steps to train Qwen-2.5-0.5B with the EBAR (Embedding-Based Auto-Regression) and EBAE (Embedding-Based Auto-Encoding) training methods from the paper 'Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval'.
Run the following notebook to extract the first 1000 articles from Spanish Wikipedia and save them as a pickle file:
`get_spanish_wiki.ipynb`
This notebook handles the data extraction and saves the articles for further processing.
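For a rough idea of what the extraction step does, here is a minimal sketch. It assumes the Hugging Face `datasets` library and the `wikimedia/wikipedia` dump; the actual notebook may use a different source, dump date, or output filename:

```python
import pickle
from datasets import load_dataset

# Stream the Spanish Wikipedia dump so the full corpus is never downloaded.
# The dump name/date is an assumption; check the notebook for the exact source.
wiki = load_dataset("wikimedia/wikipedia", "20231101.es",
                    split="train", streaming=True)

# Keep only the first 1000 articles.
articles = []
for i, article in enumerate(wiki):
    if i >= 1000:
        break
    articles.append({"title": article["title"], "text": article["text"]})

# Save the raw articles as a pickle file for the preprocessing notebook.
# "spanish_wiki_1000.pkl" is a placeholder filename.
with open("spanish_wiki_1000.pkl", "wb") as f:
    pickle.dump(articles, f)
```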
Use this notebook to preprocess the Wikipedia data and prepare it for EBAR and EBAE training:
`prepare_dataset_for_ebar_ebae.ipynb`
The preprocessing includes (a minimal sketch follows this list):
- Tokenizing the articles.
- Splitting the token stream into chunks that serve as input prompts, each paired with the chunk that follows it as the "next sentence".
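As an illustration of the chunking logic, here is a hedged sketch. It assumes fixed-length token chunks where each chunk's successor serves as its next sentence for EBAR; the notebook's exact chunk size, pairing scheme, and filenames may differ:

```python
import pickle
from transformers import AutoTokenizer

# Tokenizer for the model being adapted.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

CHUNK_LEN = 128  # assumed chunk length in tokens; check the notebook

with open("spanish_wiki_1000.pkl", "rb") as f:  # placeholder filename
    articles = pickle.load(f)

pairs = []  # (input_chunk, next_chunk) token-id pairs
for article in articles:
    ids = tokenizer(article["text"], add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + CHUNK_LEN] for i in range(0, len(ids), CHUNK_LEN)]
    # Pair each chunk with the one that follows it: the input chunk is
    # reconstructed by EBAE, and the following chunk is predicted by EBAR.
    for cur, nxt in zip(chunks, chunks[1:]):
        pairs.append((cur, nxt))

with open("ebae_ebar_pairs.pkl", "wb") as f:  # placeholder filename
    pickle.dump(pairs, f)
```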
Finally, run the main training notebook to adapt the model with the EBAR and EBAE methods:
`llama_2_vec.ipynb`
This notebook will:
- Load the preprocessed data.
- Train the model using the EBAR and EBAE loss functions from the paper (a simplified sketch of the objective follows this list).
- Save the trained model for further use.
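To give a sense of the objective, here is a simplified, hedged sketch of the EBAE/EBAR losses. Following the Llama2Vec paper, the hidden state of the prompt's final token is treated as the text embedding and projected through the LM head to score the tokens of a target sequence: the input chunk itself for EBAE, the next chunk for EBAR. Prompt templates, pooling, and batching details are assumptions; the notebook is authoritative:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

def embedding_to_token_loss(prompt_ids, target_ids):
    """Score the target tokens from a single text embedding.

    Both arguments are 1-D LongTensors of token ids. The embedding is
    the last hidden state of the prompt's final token; it is projected
    through the LM head and every target token is predicted from it
    (a bag-of-tokens style objective, simplified from the paper).
    """
    out = model(input_ids=prompt_ids.unsqueeze(0), output_hidden_states=True)
    embedding = out.hidden_states[-1][0, -1]   # (hidden_size,)
    logits = model.lm_head(embedding)          # (vocab_size,)
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs[target_ids].mean()

def ebae_ebar_loss(input_ids, next_ids):
    # EBAE: reconstruct the input chunk from its own embedding.
    ebae = embedding_to_token_loss(input_ids, input_ids)
    # EBAR: predict the next chunk from the input chunk's embedding.
    ebar = embedding_to_token_loss(input_ids, next_ids)
    return ebae + ebar
```

In practice this loss would be wrapped in a standard training loop with an optimizer, iterating over the (input, next) pairs produced by the preprocessing notebook.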
Notes:
- Ensure all dependencies are installed before running the notebooks.
- For detailed explanations of the training procedure, refer to the documentation in each notebook.
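Once training finishes, the saved model can be used as a dense encoder. The snippet below is a hedged usage sketch: the output directory name is a placeholder, and last-token pooling is an assumption carried over from the loss sketch above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "qwen2.5-0.5b-llama2vec" is a placeholder for wherever the notebook
# saved the adapted model.
tokenizer = AutoTokenizer.from_pretrained("qwen2.5-0.5b-llama2vec")
model = AutoModelForCausalLM.from_pretrained("qwen2.5-0.5b-llama2vec")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return the last-token hidden state as the sentence embedding."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

query = embed("¿Cuál es la capital de España?")
doc = embed("Madrid es la capital de España.")
score = torch.cosine_similarity(query, doc, dim=0)
print(f"similarity: {score.item():.3f}")
```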
Happy training!