Error when filtering a Series using a condition from a DataFrame #2199

Open
lamesjaidler opened this issue Sep 13, 2021 · 1 comment

@lamesjaidler

I want to filter a Koalas Series using a boolean condition derived from a related Koalas DataFrame:

X = ks.DataFrame({
    'A': [1,2,3,4,5]
})
y = ks.Series([1,0,1,0,1])
y[X['A']>3]

However, running the last line gives the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/hw/y20855_146x8gbvbpqmtpdp00000gq/T/ipykernel_27136/1423666562.py in <module>
      3 })
      4 y = ks.Series([1,0,1,0,1])
----> 5 y[X['A']>3]

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/series.py in __getitem__(self, key)
   6134                 # with ints, searches based on index values when the value is int.
   6135                 return self.iloc[key]
-> 6136             return self.loc[key]
   6137         except SparkPandasIndexingError:
   6138             raise KeyError(

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/indexing.py in __getitem__(self, key)
    419 
    420                 kdf[temp_col] = key
--> 421                 return type(self)(kdf[self._kdf_or_kser.name])[kdf[temp_col]]
    422 
    423             cond, limit, remaining_index = self._select_rows(key)

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/frame.py in __getitem__(self, key)
  11705 
  11706         if key is None:
> 11707             raise KeyError("none key")
  11708         elif isinstance(key, Series):
  11709             return self.loc[key.astype(bool)]

KeyError: 'none key'

The same syntax works as expected with pandas Series/DataFrames:

X = pd.DataFrame({
    'A': [1,2,3,4,5]
})
y = pd.Series([1,0,1,0,1])
y[X['A']>3]

Gives:

3    0
4    1
dtype: int64

Note that I have the following option set:

ks.set_option('compute.ops_on_diff_frames', True)

Is this a bug, or does the filtering need to be done in a different way?

@xinrong-meng
Contributor

xinrong-meng commented Sep 20, 2021

Unfortunately, the exact use case above is not supported.

However, there is a workaround: give the original Series (y) the same name as the conditioning Series.

For the above example:

>>> X = ks.DataFrame({
...     'A': [1,2,3,4,5]
... })
>>> y = ks.Series([1,0,1,0,1], name='A')
>>> y[X['A']>3]
3    0
4    1
Name: A, dtype: int64
>>> 
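
The workaround above can be wrapped in a small helper. This is only a minimal sketch, not part of Koalas: `filter_with_matching_name` is a hypothetical name, and pandas stands in for Koalas so the snippet runs without a Spark session. It assumes that `Series.rename` with a scalar argument sets the Series name in Koalas as it does in pandas, and that `compute.ops_on_diff_frames` is enabled when the helper is used with Koalas objects.

```python
import pandas as pd  # stand-in; with Koalas, use `import databricks.koalas as ks`


def filter_with_matching_name(series, mask):
    """Filter `series` by boolean `mask` after giving it the mask's name.

    Hypothetical helper illustrating the workaround: the unnamed Series is
    renamed to match the conditioning Series before boolean indexing.
    """
    # rename() with a scalar returns a renamed copy; the input is not mutated.
    named = series.rename(mask.name)
    return named[mask]


X = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
y = pd.Series([1, 0, 1, 0, 1])  # no name, as in the original report

result = filter_with_matching_name(y, X['A'] > 3)
print(result)
```

Note that the filtered result carries the conditioning Series' name ('A' here) rather than the original Series' name.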

We will improve this in the pandas API on Spark; the work is tracked under https://issues.apache.org/jira/browse/SPARK-36394.

Thanks for letting us know!

FYI @ueshin @HyukjinKwon @itholic

ueshin pushed a commit to apache/spark that referenced this issue Sep 22, 2021
### What changes were proposed in this pull request?
Fix filtering a Series (without a name) by a boolean Series.

### Why are the changes needed?
A bugfix. The issue is raised as databricks/koalas#2199.

### Does this PR introduce _any_ user-facing change?
Yes.

#### From
```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
Traceback (most recent call last):
...
KeyError: 'none key'
```

#### To
```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
0    0
1    1
2    2
dtype: int64
```

### How was this patch tested?
Unit test.

Closes #34061 from xinrong-databricks/filter_series.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
(cherry picked from commit 6a5ee02)
Signed-off-by: Takuya UESHIN <[email protected]>