-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Neural Click Models #8
base: main
Are you sure you want to change the base?
Conversation
sim4rec/response/nn_response.py
Outdated
print("Warning: the historical data is empty") | ||
hist_data = spark.createDataFrame([], schema=SIM_LOG_SCHEMA) | ||
# filter users whom we don't need | ||
hist_data = hist_data.join(new_recs, on="user_idx", how="inner").select( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is going on here? why do you join all new_recs columns? you need to take at least new_recs.select("user_idx").distinct() and do not select(hist_data["*"])) after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is really a huge bug. I think, it is an artifact of one of the intermediate versions, where I tried to work with tables whose user_idx is unique and each row represent the whole itertaion. I'll fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems the fix was wrong, see the new suggestion #8 (comment)
sim4rec/response/nn_response.py
Outdated
print("Warning: the simulator log is empty") | ||
simlog = spark.createDataFrame([], schema=SIM_LOG_SCHEMA) | ||
# filter users whom we don't need | ||
simlog = simlog.join(new_recs, on="user_idx", how="inner").select(simlog["*"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sim4rec/response/nn_response.py
Outdated
) | ||
) | ||
|
||
# not very optimal way, it makes one worker to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to discuss. you batch id should not influence the partitioning. one partition != one batch and the users are grouped to batches within one partition. do not now how to implement it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BatchID won't influence the partition, because each batch must consist of the whole interaction history of a specific group of users. I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i see, just remove the comment
self.backbone_response_model = None | ||
|
||
def _fit(self, train_data): | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls describe the dataframe format here and for transform. what should be included to properly convert dataframe to the RecommendationData. pls add corresponding docstrings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is exactly the same as the simulator logs format. Please give me advice, where I can obtain it's description.
Thank you for your contribution! Please, have a look at the comments and add a time measurements to the notebook to show the speed of the main stages of simulation pipeline. |
sim4rec/response/nn_response.py
Outdated
""" | ||
Predict responses for given dataframe with recommendations. | ||
|
||
:param dataframe: new recommendations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the param name is not correct, should be new_recs
print("Warning: the historical data is empty") | ||
hist_data = spark.createDataFrame([], schema=SIM_LOG_SCHEMA) | ||
# filter users whom we don't need | ||
hist_data = hist_data.join(new_recs, on="user_idx", how="semi") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you really want to leave the history of only distinct users from new_recs
in hist_data
.
hist_data = hist_data.join(new_recs, on="user_idx", how="semi") | |
hist_data = hist_data.join(sf.broadcast(new_recs.select("user_idx").distinct()), on="user_idx", how="inner") |
print("Warning: the simulator log is empty") | ||
simlog = spark.createDataFrame([], schema=SIM_LOG_SCHEMA) | ||
# filter users whom we don't need | ||
simlog = simlog.join(new_recs, on="user_idx", how="semi") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as for hist data...
simlog = simlog.join(new_recs, on="user_idx", how="semi") | |
simlog = simlog.join(sf.broadcast(new_recs.select("user_idx").distinct()), on="user_idx", how="inner") |
No description provided.