Recommendation datasets are often modelled as graph datasets because of high predictive prowess of Graph Neural Networks. Here, we create a graph dataset by using either a column or a group of columns from ShareChat dataset and train a supervised GraphSage model on it with an emphasis on ensuring no information from test data is used during training and all test examples are inferred independently i.e. no two test edges can see eachother.
This is an end-to-end README, Train and Inference instructions are both included.
- Create working directory and subfolders for storing intermediate data files
mkdir data && cd data
mkdir supGNN_graph_data sharechat_recsys2023_data
cd supGNN_graph_data
mkdir full_graph train_graph test_graph
- Download ShareChat dataset inside
sharechat_recsys2023_data
folder. It should containtrain
andtest
folders with the given csv files.
Please see outline for section 1 in following diagram -
We model Group dataset into nodes based on features f_6 for role-1 and f_2, f_4, and f_16 for role-2.
- Convert tabular training data into training graph
Using only
./sharechat_recsys2023_data/train
folder, we create./data/supGNN_graph_data/train_graph/train_edges.csv.gz
which will be used for supGNN train/localdisk/akakne/recsys2023/assets/recsys_supgnn.jpging later.
python3 0_train/2_supGNN/convert_train_tabular_data_to_edge_list.py
- Create full graph for supGNN inference
Using both
./sharechat_recsys2023_data/train
and./sharechat_recsys2023_data/test
folders, we create./supGNN_graph_data/full_edges.csv.gz
which will be used for supGNN inference later.
python3 1_inference/2_supGNN/convert_full_tabular_data_to_edge_list.py
- Convert training graph into train CSVDataset (for GNN training)
python3 0_train/2_supGNN/convert_train_edge_list_to_CSVDataset.py
Please see newly created files inside ./data/supGNN_graph_data/train_graph/recsys_graph
folder.
- Convert full graph into full CSVDataset (for GNN inference)
python3 1_inference/2_supGNN/convert_full_edge_list_to_CSVDataset.py
Please see newly created files inside ./data/supGNN_graph_data/full_graph/recsys_graph
folder.
- Train supervised GNN on train graph using
./data/supGNN_graph_data/train_graph/recsys_graph
.
python3 0_train/2_supGNN/train_supervised_graphsage.py
Above script saves best model (across epochs) as ./data/supGNN_graph_data/train_graph/model.pt
IMPORTANT NOTE : we are using only train graph in this step i.e. GNN model will never see any test data during training.
- Run inference on full graph using saved GNN model
python3 1_inference/2_supGNN/infer_supervised_graphsage.py
IMPORTANT NOTE : to ensure indepedent testing for all test edges, we disable message passing through test edges i.e. any test edge e
will see only train and validation edges around it's source and destination nodes. Edge e
will never see another test edge e'
. This is done by setting the probablity of picking test edges as neighbours to 0. For more details see our inference code in file ./1_inference/2_supGNN/infer_supervised_graphsage.py
line 112 and 20 respectively. We also handle unseen nodes in this script.
Please see outline for section 2 in following diagram -
python3 1_inference/2_supGNN/map_node_emb_to_edge_list.py
this script will map GNN-generated node embeddings to their respective edges and save GNN-boosted features for train and test split separately.
- Merge train supGNN data with train FE data
python3 0_train/2_supGNN/merge_train_FE_and_train_supgnn_data.py
- Merge test supGNN data with test FE data
python3 1_inference/2_supGNN/merge_test_FE_and_test_supgnn_data.py
- Train XGB on merged train data
python3 0_train/2_supGNN/train_xgb.py
- Infer XGB on merged test data
python3 1_inference/2_supGNN/infer_xgb.py
Finally, we get ./data/supGNN_graph_data/test_graph/submission.csv
as our final submission file which we upload to the leaderboard for evaluation on test set.