Bad performance when reading large array of strings. #24

Open
benhamad opened this issue Feb 23, 2021 · 3 comments

benhamad commented Feb 23, 2021

keys(LazyJSON.value(json_file))

The call above scales badly when json_file contains a large array of strings.
I ran this code:

using JSON, LazyJSON

for i in 1:5
    items = i * 10000000
    json_file = open("/tmp/json", "w")
    write(json_file, JSON.json(Dict("a"=> "a", "b"=>repeat(["test"], items))))
    close(json_file)  # flush the written data before reopening the file
    json_file = open("/tmp/json")
    t = @elapsed collect(keys(LazyJSON.value(json_file)))
    println("$items $t")
    close(json_file)
end

As the results show, the running time is far from linear (the first column is the number of items in the array, the second is the time in seconds):

10000000 75.786150509
20000000 317.985342906
30000000 724.489721802
40000000 1305.421886045
50000000 2040.987945434
60000000 2977.542937743

Compare this to JSON.parse, which gives:

10000000 8.384795834
20000000 18.123253007
30000000 27.854969659
40000000 38.360378806
50000000 51.391322248
60000000 73.577127605
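
To put a number on "far from linear", one can estimate the scaling exponent from the first and last rows of each table: if the time grows like n^k, then k ≈ log(t2/t1) / log(n2/n1). The following is a minimal sketch that uses only the timings posted above; the helper name `scaling_exponent` is made up for this illustration.

```julia
# Estimate k in t ∝ n^k from two (n, t) measurements.
scaling_exponent(n1, t1, n2, t2) = log(t2 / t1) / log(n2 / n1)

# First and last rows of the two tables above.
k_lazy = scaling_exponent(10_000_000, 75.786150509, 60_000_000, 2977.542937743)
k_json = scaling_exponent(10_000_000, 8.384795834, 60_000_000, 73.577127605)

println("LazyJSON on a file stream: k ≈ ", round(k_lazy, digits = 2))
println("JSON.parse:                k ≈ ", round(k_json, digits = 2))
```

The LazyJSON exponent comes out close to 2, which points to roughly quadratic behaviour, while the JSON.parse exponent stays only a little above 1.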

We did some profiling and it seems that most of the time is spent in

LazyJSON.jl/src/LazyJSON.jl

Lines 478 to 496 in 53c63f0

function scan_string(s, i)
    i, c = next_ic(s, i)
    has_escape = false
    while c != '"'
        if isnull(c) || c == IOStrings.ASCII_ETB
            throw(JSON.ParseError(s, i, c, "input incomplete"))
        end
        escape = c == '\\'
        i, c = next_ic(s, i)
        if escape && !(isnull(c) || c == IOStrings.ASCII_ETB)
            has_escape = true
            i, c = next_ic(s, i)
        end
    end
    return i, has_escape
end

[Image: Screen Shot 2021-02-23 at 23 09 31]

@mattBrzezinski
Member

This package has not been touched in a very long time, and I don't think anyone in the Julia community still understands its inner workings.

@notinaboat

Hi @Chakerbh,

The README suggests using mmap for large files.

This seems much faster:

julia> for i in 1:5
           items = i * 10000000
           json_file = open("/tmp/json", "w")
           write(json_file, JSON.json(Dict("a"=> "a", "b"=>repeat(["test"], items))))
           close(json_file)
           json_file = open("/tmp/json")
           s = String(Mmap.mmap(json_file))
           t = @elapsed collect(keys(LazyJSON.value(s)))
           println("$items $t")
           close(json_file)
       end
10000000 0.184056137
20000000 0.368260537
30000000 0.594284688
40000000 0.792973539
50000000 1.011099886

Even using LazyJSON.value(read(json_file)) is much faster than reading directly from the file stream.

The problem seems to be here on line 98 where data is read using readavailable:

function pump(s::IOString)
    if eof(s.io)
        Base.ensureroom(s.buf, 1)
        s.buf.data[s.buf.size + 1] = 0x00
        @assert !incomplete(s)
    else
        write(s.buf, readavailable(s.io))

When I print the number of bytes read at that point, readavailable appears to return only 32k at a time, even though many MB are available. Is this a bug in Base?

I think the IOString.jl interface is intended for streaming from network sockets; it may be that readavailable behaves differently for network IO.
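
The chunk-size observation can be checked without LazyJSON. The sketch below records how many bytes each of the first few `readavailable` calls returns for a local file. The helper name `chunk_sizes` and the throwaway 1 MiB test file are assumptions made for this illustration, and the sizes it prints depend on the platform and Julia version, so no particular value is claimed here.

```julia
# Record how many bytes each of the first few readavailable calls returns.
function chunk_sizes(path::AbstractString; max_chunks::Integer = 5)
    sizes = Int[]
    open(path) do io
        while length(sizes) < max_chunks && !eof(io)
            push!(sizes, length(readavailable(io)))
        end
    end
    return sizes
end

# Demo on a throwaway 1 MiB file.
path, io = mktemp()
write(io, fill(UInt8('a'), 2^20))
close(io)

sizes = chunk_sizes(path)
println(sizes)
rm(path)
```

If every entry is much smaller than the file, the stream is being consumed in many small pieces, which matches the behaviour described above.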

Anyway, if I replace line 98 with write(s.buf, read(s.io, 100_000_000; all=false)), then it is much faster:

julia> for i in 1:5
           items = i * 10000000
           json_file = open("/tmp/json", "w")
           write(json_file, JSON.json(Dict("a"=> "a", "b"=>repeat(["test"], items))));
           close(json_file)
           json_file = open("/tmp/json")
           t = @elapsed collect(keys(LazyJSON.value(json_file)))
           println("$items $t");
           close(json_file)
       end
10000000 0.673818205
20000000 0.817360912
30000000 1.864231564
40000000 2.132910674
50000000 2.84851389

However, if you are dealing with local files, use Mmap.

@ghost

ghost commented Jan 31, 2023

If you make a String from the memory-mapped file, you are copying the buffer. You want to use StringViews:

s = StringView(Mmap.mmap(json_file))
j = LazyJSON.value(s)
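
The difference between a view and a copy can be seen on a small in-memory buffer. This is only a sketch: it assumes the StringViews.jl package is installed, uses an ordinary byte vector as a stand-in for the array returned by Mmap.mmap, and makes the copy explicit with `copy` to mimic what the comment above describes. It does not check whether LazyJSON.value accepts a StringView.

```julia
using StringViews  # assumes the StringViews.jl package is installed

# Stand-in for the byte array that Mmap.mmap would return.
bytes = Vector{UInt8}("[\"test\"]")

sv = StringView(bytes)      # wraps `bytes` without copying
str = String(copy(bytes))   # independent String built from a copy

bytes[3] = UInt8('b')       # mutate the underlying buffer

println(sv)   # the view reflects the change
println(str)  # the copy does not
```

Because the view shares memory with the buffer, it costs nothing extra for a multi-gigabyte mapping, but the mapping must stay alive and unmodified for as long as the view is in use.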
