Start using bloscpack for the text serialization too. #34

Closed
wants to merge 5 commits into from

@esc
Contributor

esc commented Aug 18, 2015

This still needs to be benchmarked for speed and memory usage. The blosc_args can probably be tuned as well.
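
For anyone picking this up, a minimal sketch of what such a benchmark could look like. It is not the code in this PR: it uses stdlib `zlib` as a stand-in for bloscpack, and the compression level stands in for the kind of knob that blosc_args exposes. The names `serialize_text`, `deserialize_text` and `profile` are hypothetical, and the sketch assumes the strings contain no NUL characters.

```python
import time
import tracemalloc
import zlib


def serialize_text(strings, level):
    """Stand-in serializer: join with NUL separators, then zlib-compress.

    Assumes no string contains a NUL character.
    """
    payload = "\x00".join(strings).encode("utf-8")
    return zlib.compress(payload, level)


def deserialize_text(blob):
    """Inverse of serialize_text for a non-empty list of strings."""
    return zlib.decompress(blob).decode("utf-8").split("\x00")


def profile(strings, level):
    """Time one serialization and record peak Python-level allocations."""
    tracemalloc.start()
    start = time.perf_counter()
    blob = serialize_text(strings, level)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "level": level,
        "seconds": elapsed,
        "peak_bytes": peak,
        "compressed_bytes": len(blob),
    }


if __name__ == "__main__":
    # Synthetic strings of varying lengths; real text data would be better.
    sample = [("comment number %d " % i) * (i % 7 + 1) for i in range(10000)]
    for level in (1, 5, 9):
        print(profile(sample, level))
```

Swapping the body of `serialize_text` for the real bloscpack-based path would give comparable numbers for each candidate setting.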

@esc
Contributor Author

esc commented Aug 18, 2015

And eventually support for object arrays should probably move to bloscpack itself.

@esc
Contributor Author

esc commented Aug 18, 2015

@mrocklin @jcrist do you have a benchmark handy for profiling text data storage with castra?

@mrocklin
Member

I have historically used actual datasets for this; I don't have anything artificial. @jcrist was recently working on the reddit data dumps. He might have something interesting to work with (if you're willing to download a bit of data).

@jcrist
Member

jcrist commented Aug 18, 2015

The reddit data provides a pretty good benchmark: it contains a wide variety of string data.

This script will convert the comment data from here into a castra. The body column is composed of ~55 million strings of varying lengths. Note that the datafile is ~5 GB compressed, and 32 GB decompressed (no need to decompress, the script does that in a streaming fashion). Conversion took around 45 minutes on my computer.
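
To illustrate the streaming part, here is a rough sketch of reading a compressed dump without decompressing it to disk first. It is not the linked script. It assumes the dump is a bz2-compressed file with one JSON object per line and a `body` field; the function names are hypothetical.

```python
import bz2
import json


def iter_bodies(path, field="body"):
    """Yield one text field per JSON line, decompressing incrementally."""
    with bz2.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)[field]


def length_stats(path):
    """Return (count, total characters, longest string) in a single pass."""
    count = total = longest = 0
    for body in iter_bodies(path):
        count += 1
        total += len(body)
        longest = max(longest, len(body))
    return count, total, longest
```

Because `iter_bodies` is a generator, only one line is held in memory at a time, so the 32 GB decompressed size never has to fit on disk or in RAM.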

@esc
Contributor Author

esc commented Aug 28, 2015

I am closing this as I won't have time to finish it for now. Perhaps another time.

@esc esc closed this Aug 28, 2015
3 participants