-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement 'weights' and 'axis' in sample at DataFrame and Series #1893
base: master
Are you sure you want to change the base?
Conversation
… such as n, axis, weights. Now there are two unsupported situations: 1.does not support axis=1 2.If the value of the frac parameter > 1, the weights parameter is not supported
Codecov Report
@@ Coverage Diff @@
## master #1893 +/- ##
==========================================
+ Coverage 94.20% 94.22% +0.02%
==========================================
Files 40 41 +1
Lines 9939 10031 +92
==========================================
+ Hits 9363 9452 +89
- Misses 576 579 +3
Continue to review full report at Codecov.
|
@itholic can you review this please? |
Sure, let me take a look |
Notes | ||
----- | ||
If `frac` > 1, `replacement` should be set to `True`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
databricks/koalas/frame.py
Outdated
# Because ks.Series currently does not support the Series.__iter__ method, | ||
# It cannot be initialized to the pandas Series, so here is to_pandas. | ||
if isinstance(weights, ks.Series): | ||
weights = pd.Series(weights.to_pandas(), dtype="float64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is weights
always expected to small enough so that good to use to_pandas()
??
If not, I think we better don't support weights
as a Series
for now since it could occur serious performance degradation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right!
If the series is relatively large, it may indeed cause performance problems.
The series weight is not supported here right now.
databricks/koalas/frame.py
Outdated
raise ValueError("weight vector may not include `inf` values") | ||
|
||
if (weights < 0).any(): | ||
raise ValueError("weight vector many not include negative values") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: many -> may
Of course I understand that you've just followed pandas' message 👍 , but It looks obviously typo.
withReplacement=replace, fraction=float(frac), seed=random_state | ||
) | ||
return DataFrame(self._internal.with_new_sdf(sdf)) | ||
locs = rs.choice(axis_length, size=n, replace=replace, p=weights) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the estimated size of locs
? Will it also be much huge?
e.g., for the case A random 50% sample of the ``DataFrame`` with replacement
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rs.choice
method is similar to the numpy.random.RandomState.choice.
The return value of rs.choice
is the size of the size
parameter, in this case is n
.
So the size of locs
here is n
.
If parameter n
is set, locs
is equal to n
, otherwise locs
is equal to n = int(round(frac * axis_length))
.
For the case A random 50% sample of the ``DataFrame`` with replacement
, the size of locs
will be one-half of the size of the DataFrame
, to be precise, it should be int(round(0.5 * DataFrame.nrows))
.
If n
is large, or frac
is large, locs
will be large.
Performance depends on the performance of the take
method, which is actually the performance of the iloc
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, I wouldn't recommend to use take
or iloc
for this.
Thinking of Koalas workload, the length of DataFrame
could be so huge, then locs
will be too huge for a single Driver node.
Also, row access by its row number is essentially heavy on Spark (to be exact, Spark doesn't provide the way to access rows by its row number), and so iloc
is heavy in Koalas, especially if the locs
is huge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah!
I agree with you, because spark does not have the concept of rowindex, iloc is a heavy operation.
But here the weights parameter of the sample function is the same as iloc, which must depend on the row index.
There may be no other way to support the weights parameter right now.
The weights parameter specifies the corresponding weight for the corresponding row, so it may be necessary to
access rows by its row number.
Just like iloc is a heavy operation, but it must also be implemented based on row index right now.
Maybe the current sample function supports weights operation and must also be based on row index.
…port str and series temporarily 2. Optimize part of the code based on review comments.
Hi @chi2liu , since Koalas is ported to Spark, would you like to migrate this PR to Spark repo? If not, I will port it next week. |
Implement sample. Resolves #1887
Support the remaining parameters of the sample function of DataFrame, such as n, axis, weights.
Now there are two unsupported situations:
1.does not support axis=1
2.If the value of the frac parameter > 1, the weights parameter is not supported