API to get only row keys #12

Closed
vskr opened this issue Jan 16, 2013 · 18 comments

Comments

@vskr

vskr commented Jan 16, 2013

Is there a way to get just the row keys (without the column data) from HBase using HappyBase?

Use case:
I am trying to implement pagination on rows. My row keys are random integers; they are unique but not sequential.

The closest to efficient pagination I could think of is:

a. Get all the row keys
b. Loop through row keys (in batch of 100) and get the column data, when needed

@wbolster
Member

No, this is not possible, and functionality like that is not in the Thrift API either.

I think you should rethink your design. Scanning rows like you suggested is horribly inefficient, since it results in a lot of useless I/O on the region servers (the data is still read from disk, even though it will not be used). A better option is to keep aggregate counters when inserting data (use Table.counter_inc() for that) and build your pagination using that information.
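A minimal sketch of the counter idea, assuming a running row count is kept in a dedicated counter cell. The names `put_with_row_count`, `page_count`, `COUNTER_ROW`, and `COUNTER_COLUMN` are hypothetical, and the counter location is only an illustration; `Table.put()`, `Table.counter_inc()`, and `Table.counter_get()` are the HappyBase calls being used.

```python
# Hypothetical location of the counter cell; pick whatever fits your schema.
COUNTER_ROW = b'meta-row-count'
COUNTER_COLUMN = b'stats:rows'


def put_with_row_count(table, row_key, data):
    """Store a row and increment a running row count.

    Assumption: every call inserts a *new* row. Overwriting an existing
    row would inflate the count, so real code needs to handle that case.
    """
    table.put(row_key, data)
    # counter_inc() increments the counter and returns its new value.
    return table.counter_inc(COUNTER_ROW, COUNTER_COLUMN)


def page_count(table, page_size):
    """Derive the number of pages from the stored row count."""
    total = table.counter_get(COUNTER_ROW, COUNTER_COLUMN)
    return -(-total // page_size)  # ceiling division
```

If the counter cell lives in the same table as the data, scans over the data need a start/stop range that excludes it.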

@wbolster
Member

Oh, and for the 'next page' link you should remember the last row key from the current page and scan from that row onwards.
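A rough sketch of that "next page" approach, assuming one extra row is fetched per page so its key can serve as the start of the following page. `fetch_page` is a hypothetical helper; `row_start` and `limit` are existing arguments of `Table.scan()`.

```python
def fetch_page(table, page_size, row_start=None):
    """Return (rows, next_row_start) for a single page.

    rows is a list of (row_key, data) tuples. next_row_start is the key
    to pass as row_start for the following page, or None on the last page.
    """
    # Ask for one row more than needed: if it exists, there is a next page,
    # and its key is exactly where the next scan should begin.
    rows = list(table.scan(row_start=row_start, limit=page_size + 1))
    if len(rows) > page_size:
        next_row_start = rows[page_size][0]
        rows = rows[:page_size]
    else:
        next_row_start = None
    return rows, next_row_start
```

The returned `next_row_start` can be embedded in the 'next page' link and passed back in unchanged on the following request.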

@vskr vskr closed this as completed Jan 17, 2013
@wbolster
Member

Hi, out of curiosity: is your problem solved?

@vskr
Author

vskr commented Jan 25, 2013

Not really. My problem is getting the row keys within a given range, not all row keys.

So it would look like get_all_row_keys(start_row, end_row)
and return [row_key_1, row_key_2, ..., row_key_last_index]

I was looking at KeyOnlyFilter() (http://hbase.apache.org/book/thrift.html), but that returns the column keys too.

@wbolster
Member

Have you looked at the part about scanners in the tutorial? That can be used to specify start and stop keys. Combine it with FirstKeyOnlyFilter to avoid sending complete rows (but only a single cell per row) over the wire. I think it's not going to get any better than that with the current Thrift API (and not with the Java API either).

@wbolster wbolster reopened this Jan 25, 2013
@vskr
Author

vskr commented Jan 25, 2013

I missed FirstKeyOnlyFilter. It looks like that returns only the row key plus the first column key and value for each row, which is definitely better than getting all column keys (and values).

@wbolster
Member

Code example (untested):

scanner = table.scan(row_start=b'aaa', row_stop=b'bbb', filter=b'FirstKeyOnlyFilter()')
row_keys = [key for key, data in scanner] 

@vskr
Author

vskr commented Jan 25, 2013

Yeah, I tested using my data and it works

@vskr
Author

vskr commented Jan 25, 2013

Cool thanks!!

@wbolster
Member

I have just opened issue #14. Ideas and patches welcome. :)

@vskr
Author

vskr commented Jan 26, 2013

haha! Sure, but I like what you have currently. Thin client which acts as "pass through" to Thrift service on hbase server. This way, you don't have to update python-client-api whenever Thrift service updates list of commands/filters it supports.

I will add some examples on how to construct proper filter string expressions

@wbolster
Member

Great, thanks. I agree with you about keeping up to date, but some helper functions might be useful nonetheless, mostly for properly escaping binary data and so on.
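As an illustration of what such a helper might look like, here is a small sketch. Both `quote_filter_arg` and `prefix_filter` are hypothetical names, not part of HappyBase, and the escaping rule (a single quote inside a string argument is escaped by doubling it) is an assumption based on the HBase filter language documentation that should be checked against your HBase version.

```python
def quote_filter_arg(value):
    """Quote a bytes value for use as a string argument in a filter string.

    Assumption: the HBase filter language wraps string arguments in single
    quotes and escapes an embedded single quote by doubling it.
    """
    return b"'" + value.replace(b"'", b"''") + b"'"


def prefix_filter(prefix):
    """Build a PrefixFilter expression for the given row key prefix."""
    return b"PrefixFilter(" + quote_filter_arg(prefix) + b")"
```

The result could then be passed as the `filter` argument of `table.scan()`, for example `table.scan(filter=prefix_filter(b'user-'))`.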

@wbolster
Member

Just figured out that filter=b'KeyOnlyFilter() AND FirstKeyOnlyFilter()' is even better for your use case (counting rows).

@wbolster
Member

Complete answer/example:

scanner = table.scan(
    row_start=b'aaa',
    row_stop=b'bbb',
    filter=b'KeyOnlyFilter() AND FirstKeyOnlyFilter()',
)

for row_key, data in scanner:
    pass  # do something with row_key
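Building on the example above, the original plan (collect the row keys, then fetch the column data in batches of 100 when needed) could be sketched roughly as follows. `iter_row_keys` and `iter_rows_in_batches` are hypothetical helper names; `Table.scan()` and `Table.rows()` are the HappyBase calls used.

```python
KEYS_ONLY_FILTER = b'KeyOnlyFilter() AND FirstKeyOnlyFilter()'


def iter_row_keys(table, row_start=None, row_stop=None):
    """Yield only the row keys in the range [row_start, row_stop)."""
    scanner = table.scan(
        row_start=row_start,
        row_stop=row_stop,
        filter=KEYS_ONLY_FILTER,
    )
    for row_key, _data in scanner:
        yield row_key


def iter_rows_in_batches(table, row_keys, batch_size=100):
    """Fetch full rows for the given keys, batch_size keys per request.

    Each yielded item is the result of one table.rows() call, which is a
    list of (row_key, data) tuples.
    """
    batch = []
    for key in row_keys:
        batch.append(key)
        if len(batch) == batch_size:
            yield table.rows(batch)
            batch = []
    if batch:  # remaining keys that did not fill a whole batch
        yield table.rows(batch)
```

Because both helpers are generators, no more than one batch of keys is held in memory at a time.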

@vskr
Author

vskr commented Jan 29, 2013

Yeah, I probably should have updated this thread. But I was already using the above compound filter

@wbolster
Member

I assumed so, but I posted it anyway for posterity and for others on the internet who may stumble upon this issue. :)

@ghost

ghost commented Feb 2, 2015

table.scan(filter=b'KeyOnlyFilter() AND FirstKeyOnlyFilter()')

works like a charm!
Thank you @wbolster!

@bhalgat20

table.scan(filter=b'KeyOnlyFilter() AND FirstKeyOnlyFilter()')

This is the most efficient approach I have found so far. Thanks wbolster

@python-happybase python-happybase deleted a comment from UmfintechWtc Mar 6, 2020