-
These scripts expect a data/trivago subdirectory...
- mkdir data
- put the trivago dataset in data/ (so that it ends up at data/trivago) or, if using ada, create a link in data/ to the trivago folder on the storage node.
- the files you need are: train.csv, validation.csv, confirmation.csv
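Before running anything, it can help to confirm that the files are where the scripts will look for them. The snippet below is only a sketch: `missing_files` is a hypothetical helper that is not part of this repository, and it assumes the three CSV files sit directly inside data/trivago.

```python
from pathlib import Path

# File names come from the list above; the directory layout is an assumption.
REQUIRED_FILES = ("train.csv", "validation.csv", "confirmation.csv")


def missing_files(data_dir="data/trivago", required=REQUIRED_FILES):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("missing from data/trivago:", ", ".join(missing))
    else:
        print("all required files found")
```

An empty list means the feature scripts should be able to find their inputs.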
-
compute features for the experiments:
- python3 load_trivago.py
- python3 hash_user_id.py
- python3 load_trivago_blind.py
-
run the experiments:
- python3 train_eval.py
- python3 train_eval_blind.py
This script creates a readable parquet file containing all the data required to train the models. The script also defines a set of dataclasses that helped me understand the dataset and make feature extraction more intuitive and readable. The objects are:
- Hotel
- Interaction
- Session
- UserProfile
- SessionData
Following LogicAI's 2019 recsys strategy, I computed user features on a rolling basis to prevent overfitting. Specifically, I sorted sessions by their starting timestamp, and added each interaction to the user profile and graph only after features had been extracted from that interaction.
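The extract-then-update ordering described above can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the `Interaction` and `Session` classes here are stripped-down stand-ins with hypothetical fields, and the two features (`n_prior_interactions`, `seen_item_before`) are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Interaction:
    # Hypothetical minimal fields; the real dataclass likely carries more.
    user_id: str
    item_id: str
    timestamp: int


@dataclass
class Session:
    # Assumed to hold at least one interaction, in chronological order.
    session_id: str
    interactions: list = field(default_factory=list)

    @property
    def start(self):
        return self.interactions[0].timestamp


def rolling_user_features(sessions):
    """Yield one feature row per interaction, using only earlier history."""
    seen_items = defaultdict(set)  # user_id -> items interacted with so far
    n_seen = defaultdict(int)      # user_id -> number of past interactions
    for session in sorted(sessions, key=lambda s: s.start):
        for it in session.interactions:
            # 1) Read features from the profile as it stood before this interaction.
            yield {
                "user_id": it.user_id,
                "item_id": it.item_id,
                "n_prior_interactions": n_seen[it.user_id],
                "seen_item_before": it.item_id in seen_items[it.user_id],
            }
            # 2) Only now fold the interaction into the profile.
            seen_items[it.user_id].add(it.item_id)
            n_seen[it.user_id] += 1
```

Because the update happens after the row is emitted, no feature ever depends on the interaction it is meant to help predict.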
This script computes the click-through ratios (CTRs) for all items.
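As a rough illustration of what a per-item CTR looks like, the sketch below counts how often each item was clicked versus how often it was shown. It is not the script itself: the function name `item_ctrs` is hypothetical, and it assumes rows shaped like `csv.DictReader` output with `action_type`, `reference`, and pipe-separated `impressions` columns, where clickout rows have `action_type == "clickout item"`.

```python
from collections import Counter


def item_ctrs(rows):
    """Return {item_id: clicks / impressions} from an iterable of row dicts."""
    clicks, shown = Counter(), Counter()
    for row in rows:
        if row["action_type"] != "clickout item":
            continue  # assumption: only clickout rows carry an impression list
        shown.update(row["impressions"].split("|"))
        clicks[row["reference"]] += 1
    # Items that were shown but never clicked get a ratio of 0.0.
    # A clicked item is assumed to appear in its own impression list.
    return {item: clicks[item] / n for item, n in shown.items()}
```

Items that never appear in any impression list have no defined ratio and are simply absent from the result.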
Construct hotel (item-based) features for learning.
We are interested in knowing what happens when we pretend that all users are new visitors to the site. We can't just null out the user-based features because one of our features comes from a graph-based model, which may still be useful even if edges only occur between sessions and items (rather than users and items).
This script is remarkably similar to the one above, but it excludes user profiles and computes fewer features. It also constructs a session-item graph in place of the user-item graph, which is sparser.
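To make the difference between the two graphs concrete, here is a minimal sketch that builds either one from the same interactions. The helper `build_bipartite_graph` and the dict keys `user_id`, `session_id`, and `item_id` are assumptions for illustration; the actual script may represent the graph differently.

```python
from collections import defaultdict


def build_bipartite_graph(interactions, key):
    """Map each left-hand node (user or session) to the set of items it touched.

    `interactions` is an iterable of dicts with "user_id", "session_id" and
    "item_id"; `key` selects which field becomes the left-hand node.
    """
    graph = defaultdict(set)
    for it in interactions:
        graph[it[key]].add(it["item_id"])
    return dict(graph)


interactions = [
    {"user_id": "u1", "session_id": "s1", "item_id": "h1"},
    {"user_id": "u1", "session_id": "s2", "item_id": "h2"},
    {"user_id": "u2", "session_id": "s3", "item_id": "h1"},
]
user_graph = build_bipartite_graph(interactions, key="user_id")
session_graph = build_bipartite_graph(interactions, key="session_id")
```

In this toy data the user graph links h1 and h2 through u1, while the session graph has more left-hand nodes and no path between the two items, which is the sense in which it is sparser.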
I tackled a limited range of features which were considered most important in the challenge, while using my best judgement to exclude ones which I considered "cheating". For example, the first-place submission had
A few more features might have helped boost our final test MRR (0.576).
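For reference, MRR (mean reciprocal rank) averages 1/rank of the clicked item over all evaluated clickouts. The function below is a generic sketch of that metric with a hypothetical name and signature, not the evaluation code used in these experiments.

```python
def mean_reciprocal_rank(ranked_lists, clicked_items):
    """Average of 1/rank of the clicked item over all queries (rank starts at 1).

    A query whose clicked item is missing from its ranking contributes 0.
    """
    total = 0.0
    for ranking, clicked in zip(ranked_lists, clicked_items):
        if clicked in ranking:
            total += 1.0 / (ranking.index(clicked) + 1)
    return total / len(clicked_items)
```

For example, if the clicked item is ranked first in one list and third in another, the score is (1 + 1/3) / 2.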
In these experiments, I found that the user's identity mattered little when recommending items over such a short period. That is, people likely do not plan multiple, different vacations or work trips in one week.