hash.mapping of interaction might be incorrect #67

Open
wush978 opened this issue Mar 26, 2015 · 7 comments

@wush978
Owner

wush978 commented Mar 26, 2015

The construction of interaction terms in hash.mapping uses the inverse_mapping, whose entries might overlap when hashes collide.

wush978 added the bug label Mar 26, 2015
wush978 self-assigned this Mar 26, 2015
wush978 added this to the CRAN v0.9 milestone Mar 26, 2015
@pommedeterresautee
Contributor

What do you mean by overlap by collision? If there are collisions, I supposed that some redundancy in the mapping was expected. Or do you mean that something else is increasing the number of collisions in the mapping?

@wush978
Owner Author

wush978 commented Mar 26, 2015

It is an internal error: the resulting mapping might miss some interaction features if the main effects (non-interacted features) collide in the space of 2^32 values (before taking the modulo). The probability should be low, but I need to figure out a way to eliminate it.
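As a rough way to gauge how low that probability is, here is a back-of-the-envelope sketch using the birthday approximation. It assumes the hash spreads names uniformly over the 2^32 values, which is an idealization and not a measured property of the package's hash function; `collision_probability` is a hypothetical helper, not part of FeatureHashing.

```python
import math


def collision_probability(n_names: int, space_bits: int = 32) -> float:
    """Approximate P(at least one collision) among n_names hashed values.

    Birthday approximation: 1 - exp(-n(n-1) / (2 * 2^space_bits)).
    Assumes a uniform hash, which is an idealization.
    """
    space = 2 ** space_bits
    exponent = n_names * (n_names - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for n in (3, 1_000, 77_163):
        print(n, collision_probability(n))
```

Under this assumption, three names collide with a probability of about 7e-10, while it takes roughly 77,000 names that must all be pairwise distinct before the chance of at least one collision reaches one half.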

@wush978
Owner Author

wush978 commented Mar 26, 2015

This should be the last change before submitting v0.9.

@wush978
Owner Author

wush978 commented Mar 27, 2015

Well, after reviewing the code, I think the probability of this bug occurring is very, very small.

The bug only occurs if there is an interaction between any kind of feature and an array-type feature, and the hashed values of the array-type feature collide within one record. For example, suppose there are two columns a and b whose values are 1 and 1,2,3 respectively, and the user marks column b with split. If the hashed values of b1 and b2 collide, then the interaction a1:b1 will not appear in the returned mapping table. Note that the collision needs to be in the space {0, 1, 2, ..., 2^32 - 1}, before taking the modulo.
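The example above can be mimicked with a toy sketch. This is not the package's actual code: the hash values are hard-coded (hypothetical) so that b1 and b2 collide, and the inverse mapping is assumed to be a plain hash-to-name table in which a later name overwrites an earlier one.

```python
from itertools import product

# Hypothetical 32-bit hash values, hard-coded so that b1 and b2 collide.
TOY_HASH = {
    "a1": 0x0000A001,
    "b1": 0x0000B00B,
    "b2": 0x0000B00B,  # same value as b1: the collision
    "b3": 0x0000B003,
}


def build_inverse_mapping(names):
    """Map hash value -> name; a colliding name overwrites the earlier one."""
    inverse = {}
    for name in names:
        inverse[TOY_HASH[name]] = name
    return inverse


def interaction_names(left, right):
    """Recover interaction names from hashed values via the inverse mapping."""
    inverse = build_inverse_mapping(left + right)
    left_hashes = [TOY_HASH[name] for name in left]
    right_hashes = [TOY_HASH[name] for name in right]
    return {
        f"{inverse[lh]}:{inverse[rh]}"
        for lh, rh in product(left_hashes, right_hashes)
    }


names = interaction_names(["a1"], ["b1", "b2", "b3"])
print(sorted(names))  # ['a1:b2', 'a1:b3']
```

Because b1's entry is overwritten by b2, the name a1:b1 can never be recovered from the hashed values, so it is absent from the result even though the record contains it.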

To resolve this issue, I would need to sacrifice a lot of efficiency under the current structure. Therefore, I prefer to do nothing.

This package was originally designed for predictive analysis, and the mapping should not play an important role in predictive analysis.

I'll leave this issue open to see if anyone can give me sufficient reason to fix it. Fixing it requires either sacrificing efficiency or rewriting the core functions.

wush978 removed this from the CRAN v0.9 milestone Mar 27, 2015
@pommedeterresautee
Contributor

I fully agree with not fixing it. There will be collisions anyway. However, it would be worth adding a note about this in the details section of the hashing function's documentation. What do you think?

Kind regards,
Michaël

@wush978
Owner Author

wush978 commented Mar 29, 2015

You're right!

Thanks for reminding me.

wush978 added a commit that referenced this issue Mar 29, 2015
wush978 added this to the CRAN v0.9.1 milestone May 9, 2015
@wush978
Owner Author

wush978 commented Sep 15, 2015

I still have not found any real-world occurrence of this bug. It is still theoretical.

I'll remove this from milestone 0.9.1 because it blocks me.

wush978 removed this from the CRAN v0.9.1 milestone Sep 15, 2015