Breaking Down korge-core into Different Modules #2088

itboy87 · 2023-12-24T11:22:54Z

itboy87
Dec 24, 2023

Hi, I'm testing korge-core in one of my Kotlin Multiplatform projects, ksoup (an HTML & XML parser), and it has been really helpful. However, I have a few concerns about korge-core, and addressing them could greatly benefit the community. I would like to use only the charset encoding, streaming API, and gzip compression from korge-core, but the library is coupled with a lot of extra functionality.

It would be helpful if the library could be divided into modules: korge-core (containing only the core, without network, compression, or parser code) plus separate modules such as korge-network, korge-compression, korge-parser, etc. This modular approach would make the library more versatile for the Kotlin Multiplatform community.

Additionally, do you have any plans for implementing gzip compression with a streaming API, similar to GZIPInputStream, that decompresses files on the fly so that only the necessary bytes are decompressed?

soywiz · 2023-12-26T19:24:17Z

soywiz
Dec 26, 2023
Maintainer

Hi, I'm testing korge-core in one of my Kotlin Multiplatform projects, ksoup (an HTML & XML parser), and it has been really helpful. However, I have a few concerns about korge-core, and addressing them could greatly benefit the community. I would like to use only the charset encoding, streaming API, and gzip compression from korge-core, but the library is coupled with a lot of extra functionality.

It would be helpful if the library could be divided into modules: korge-core (containing only the core, without network, compression, or parser code) plus separate modules such as korge-network, korge-compression, korge-parser, etc. This modular approach would make the library more versatile for the Kotlin Multiplatform community.

I'm fine with it. Before splitting korge-core, korge-foundation should probably be split first. There is this PR, for example: #2072. This is the issue for that: #2040. I likely won't have time for that myself, but PRs are welcome.

Additionally, do you have any plans for implementing gzip compression with a streaming API, similar to GZIPInputStream, that decompresses files on the fly so that only the necessary bytes are decompressed?

That's already supported:

fun Deflate(windowBits: Int): CompressionMethod
class GZIP : CompressionMethod

suspend fun CompressionMethod.uncompress(i: AsyncInputStream, o: AsyncOutputStream): Unit

And to use it synchronously, when the source and destination are ultimately synchronous:

fun <T : Any> runBlockingNoSuspensions(callback: suspend () -> T): T
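
For illustration, here is a minimal sketch of how a suspend-based API can be driven synchronously when nothing actually suspends. It is not korge's implementation of runBlockingNoSuspensions: runWithoutSuspending, MemorySource and readAll are hypothetical names, and only the Kotlin standard library is used.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.startCoroutine

// Hypothetical helper (not korge's implementation): runs a suspend block that
// is expected to finish without ever actually suspending.
fun <T : Any> runWithoutSuspending(block: suspend () -> T): T {
    var outcome: Result<T>? = null
    // The completion callback fires synchronously if the block never suspends.
    block.startCoroutine(Continuation<T>(EmptyCoroutineContext) { outcome = it })
    val result = outcome ?: error("The block suspended, so it cannot be run synchronously")
    return result.getOrThrow()
}

// Hypothetical in-memory "async" source: suspend signature, synchronous body.
class MemorySource(private val data: ByteArray) {
    private var position = 0

    suspend fun read(out: ByteArray): Int {
        val count = minOf(out.size, data.size - position)
        data.copyInto(out, 0, position, position + count)
        position += count
        return count
    }
}

// Drains the source in small chunks, the way a streaming consumer would.
suspend fun readAll(source: MemorySource, chunkSize: Int = 4): ByteArray {
    val chunk = ByteArray(chunkSize)
    var result = ByteArray(0)
    while (true) {
        val count = source.read(chunk)
        if (count <= 0) break
        result += chunk.copyOf(count)
    }
    return result
}
```

With these definitions, runWithoutSuspending { readAll(MemorySource(bytes)) } returns the copied bytes on the calling thread, and it fails with an error if the block ever really suspends.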

itboy87 · 2023-12-28T09:22:02Z

itboy87
Dec 28, 2023
Author

I'm fine with it. Before splitting korge-core, I believe korge-foundation should be split first. For example, there is this PR #2072 and the corresponding issue #2040. I likely won't have time for that myself, but PRs are welcome.

Ok, I will try my best to contribute there.

That's already supported:

suspend fun CompressionMethod.uncompress(i: AsyncInputStream, o: AsyncOutputStream): Unit

Sorry, I missed that. Thanks for pointing it out. I think GZIP.uncompressStream is what I'm looking for.

By the way, I was checking and found a few things missing for my scenario:

AsyncStream doesn't support a markable stream like SyncStream.
When opening a file in SyncStream with readAsSyncStream, it's reading all bytes. I think it should not, or maybe I'm using the wrong function.
There is no CharReaderFromAsyncStream like CharReaderFromSyncStream.
Both AsyncStream and SyncStream have functions like readCharArray, which I think always read or write the exact byte size of a char. This may cause issues when reading a stream of mixed charset strings.
I'm looking for an alternative to java.io.InputStreamReader(InputStream in, Charset cs), which can read char arrays from streams. Do we have something like that? I think CharReaderFromSyncStream exists but doesn't support mark and reset. Can we make it markable and also create CharReaderFromAsyncStream?

Here is an example of Java code that I'm trying to achieve in Kotlin using korge-core. These are Java APIs and are very helpful for the community for IO operations:

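val stream = GZIPInputStream(FileInputStream(file)) // InputStream
stream.mark(readLimit)   // on the JVM, mark/reset need a BufferedInputStream wrapper to take effect
stream.reset()
stream.skip(byteCount)

val inputStreamReader = InputStreamReader(input, Charset.forName(charsetName))
inputStreamReader.read(charArray, offset, len)
inputStreamReader.mark(readAheadLimit)   // likewise, a BufferedReader wrapper is needed on the JVM
inputStreamReader.reset()
inputStreamReader.skip(charCount)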

I've also modified CharReaderFromSyncStream to add mark/reset support, but it's not efficient. I'm duplicating tempStringBuilder and remembering the state of the stream reads and the buffer, but there's still room for improvement. Below is the modified code:

internal class CharReaderSyncStream(
    private val stream: SyncStream,
    private val charset: Charset,
    private val chunkSize: Int
) : CharReader {
    private var temp = ByteArray(chunkSize)
    private val buffer = ByteArrayDeque()
    private var tempStringBuilder = StringBuilder()

    private var markedTempStringBuilder: StringBuilder? = null
    private var currentState: ReaderState = ReaderState(consumedBytes = 0, consumedBuffer = 0, lastBytesRead = 0)
    private var markedState: ReaderState? = null

    override fun clone(): CharReader = CharReaderFromSyncStream(stream.clone(), charset, chunkSize)

    init {
        stream.mark(SharedConstants.DefaultBufferSize)
    }

    private fun bufferUp() {
        while (buffer.availableRead < temp.size) {
            val readCount = stream.read(temp)
            if (readCount <= 0) break

            currentState = currentState.copy(
                consumedBytes = currentState.consumedBytes + readCount,
                lastBytesRead = readCount
            )

            buffer.write(temp, 0, readCount)
            if (currentState.applyMarkedState) {
                currentState = currentState.copy(applyMarkedState = false)
                if (currentState.consumedBuffer > 0) {
                    buffer.skip(currentState.consumedBuffer)
                }
            } else {
                currentState = currentState.copy(consumedBuffer = 0)
            }
        }
    }

    override fun read(out: StringBuilder, count: Int): Int {
        bufferUp()

        while (tempStringBuilder.length < count) {
            val readCount = buffer.peek(temp)
            val consumed = charset.decode(tempStringBuilder, temp, 0, readCount)
            if (consumed <= 0) break
            currentState = currentState.copy(consumedBuffer = currentState.consumedBuffer + consumed)
            buffer.skip(consumed)
        }

        val slice = tempStringBuilder.substring(0, kotlin.math.min(count, tempStringBuilder.length))
        tempStringBuilder = StringBuilder(slice.length).append(tempStringBuilder.substring(slice.length))

        out.append(slice)
        return slice.length
    }

    fun mark(readLimit: Int) {
        this.markedTempStringBuilder = StringBuilder(tempStringBuilder)
        this.markedState = this.currentState
    }

    fun reset() {
        // Nothing to restore unless mark() was called first.
        val marked = markedState ?: return

        // Restore the decoded-but-unread characters captured by mark().
        markedTempStringBuilder?.let { tempStringBuilder = StringBuilder(it) }
        markedTempStringBuilder = null

        buffer.clear()
        temp = ByteArray(chunkSize)
        // Rewind to the start of the last chunk read before mark() was called.
        val skippedBytes = marked.consumedBytes - marked.lastBytesRead
        currentState = marked.copy(consumedBytes = skippedBytes, applyMarkedState = true)

        stream.reset()
        stream.mark(SharedConstants.DefaultBufferSize)
        if (skippedBytes > 0) {
            stream.skip(skippedBytes)
        }

        markedState = null
    }

    fun skip(count: Int) {
        this.read(count)
    }
}


private data class ReaderState(
    val consumedBytes: Int,
    val consumedBuffer: Int,
    val lastBytesRead: Int,
    val applyMarkedState: Boolean = false
)

Lastly, I've identified some issues in URL.resolve which I've already addressed and will be submitting a pull request for.

I genuinely appreciate all your hard work on this. It has been tremendously helpful for me, and I hope the broader community will also benefit from these improvements.

I apologize if I have initiated several discussions here. Thank you once again for your understanding and assistance.

1 reply

soywiz Dec 29, 2023
Maintainer

Let me write one separate message for each topic, so we can reply to each thing individually in a separate thread.

soywiz · 2023-12-29T08:29:45Z

soywiz
Dec 29, 2023
Maintainer

AsyncStream doesn't support a markable stream like SyncStream.

Why do you need it? I mean, can't you just duplicate the AsyncStream to emulate a mark, with AsyncStream.duplicate()? In the end, an AsyncStream is an AsyncStreamBase plus a position.
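
A minimal sketch of that idea, assuming a stream is just shared data plus a cursor. SharedBase and PositionedStream are hypothetical stand-ins rather than korge classes, and the reads are synchronous here only so the sketch runs without a coroutine library.

```kotlin
// Hypothetical stand-ins, not korge's classes: a shared base holding the data
// and a lightweight cursor that only adds a position on top of it.
class SharedBase(val data: ByteArray)

class PositionedStream(private val base: SharedBase, var position: Int = 0) {
    // Copies the cursor only; the underlying data is shared, so this is cheap.
    fun duplicate(): PositionedStream = PositionedStream(base, position)

    fun read(count: Int): ByteArray {
        val end = minOf(base.data.size, position + count)
        val chunk = base.data.copyOfRange(position, end)
        position = end
        return chunk
    }
}

fun peekThenRewind(): Pair<String, String> {
    val stream = PositionedStream(SharedBase("<html><head>".encodeToByteArray()))
    val mark = stream.duplicate()                 // plays the role of mark()
    val peeked = stream.read(6).decodeToString()  // advances only `stream`
    // Plays the role of reset(): continue reading from the saved copy.
    return peeked to mark.read(6).decodeToString()
}
```

Both reads return the same six bytes, because the duplicate kept the position it had at the moment it was created.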

soywiz · 2023-12-29T08:31:51Z

soywiz
Dec 29, 2023
Maintainer

When opening a file in SyncStream with readAsSyncStream, it's reading all bytes. I think it should not, or maybe I'm using the wrong function.

It works like that because I/O in korio is all asynchronous. There is a separate API for sync I/O, but the point of korio is to also work on JS, where typically only async I/O is available, and to make it difficult to introduce I/O-related bottlenecks. Typically, when reading binary data, you read the header asynchronously; from the header you get the size of the dynamic header information you need and read that full block asynchronously; then you process the block synchronously in memory for the sake of performance.

soywiz · 2023-12-29T08:32:35Z

soywiz
Dec 29, 2023
Maintainer

There is no CharReaderFromAsyncStream like CharReaderFromSyncStream.

What's your scenario for a CharReaderFromAsyncStream? I mean, asynchronous I/O is typically slow, so you typically read a full block asynchronously and process it synchronously in memory.

2 replies

itboy87 Dec 31, 2023
Author

I am currently developing an HTML & XML parser, ksoup. In this parser, users have the option to load large files for parsing. However, I aim to avoid loading the entire file into memory, especially in scenarios where users only need to read the header of a large HTML file. In such cases, the parser loads just the data within the head tag from the file and parses it.

The parser primarily operates with CharReader using streams. While reading blocks with AsyncStream and then converting them to SyncStream is a possibility, it is not feasible for me as I need to pass a single stream to CharReader.

Therefore, I need to pass a stream to CharReader that can come from either a file or a string. I believe AsyncStream is the best option here, as it can be used for both. However, it lacks compatibility with CharReader, and I think the required code would be similar to CharReaderFromSyncStream, with a few adjustments.
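
To make the head-only scenario concrete, here is a rough sketch that stops reading as soon as a marker has been seen. It uses java.io.Reader purely as a stand-in for whatever character source ends up being used; readThrough and headOnly are hypothetical names, not ksoup or korge APIs.

```kotlin
import java.io.Reader
import java.io.StringReader

// Hypothetical helper: pulls fixed-size chunks from a Reader and stops as soon
// as the marker has been seen, so the rest of the input is never read.
fun readThrough(reader: Reader, marker: String, chunkSize: Int = 16): String {
    val collected = StringBuilder()
    val chunk = CharArray(chunkSize)
    while (true) {
        val count = reader.read(chunk, 0, chunk.size)
        if (count < 0) break                       // end of input, marker not found
        collected.append(chunk, 0, count)
        // Re-scan only the tail that could contain a marker split across chunks.
        val from = maxOf(0, collected.length - count - marker.length + 1)
        val index = collected.indexOf(marker, from)
        if (index >= 0) return collected.substring(0, index + marker.length)
    }
    return collected.toString()
}

// A large document whose body is never consumed by the reader.
fun headOnly(): String {
    val html = "<html><head><title>t</title></head><body>" + "x".repeat(10_000) + "</body></html>"
    return readThrough(StringReader(html), "</head>")
}
```

At most one extra chunk past the marker is pulled from the source, so memory use depends on where the head ends rather than on the file size.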

soywiz Jan 3, 2024
Maintainer

Would this work for you? (untested)

vfsFile.openUse { stream ->
    val deque = DequeSyncStream()
    val charReader = CharReaderFromSyncStream(deque)

    while (deque.hasMore || !stream.eof) {
        if (deque.available < 512) {
            deque.write(stream.readBytes(512))
        }
        // ...
        println(charReader.read(8))
    }
}

soywiz · 2023-12-29T08:34:20Z

soywiz
Dec 29, 2023
Maintainer

Both AsyncStream and SyncStream have functions like readCharArray, which I think always read or write the exact byte size of a char. This may cause issues when reading a stream of mixed charset strings.

Yeah, because Char is the Kotlin type, not a code point. The binary representation of characters changes depending on the Charset. In any case, I understand that the name can be misleading. But that API is for reading arrays, not strings. If you have ideas on APIs to improve this, feel free to share your thoughts.
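
A small illustration of why a fixed byte width per Char cannot be assumed. This sketch uses the JVM charsets from the Kotlin standard library only for demonstration; encodedSizes and decodeSplitChunk are hypothetical helper names.

```kotlin
// Encoded byte count of the same text under two charsets (JVM charsets are
// used here for illustration only).
fun encodedSizes(text: String): Pair<Int, Int> =
    text.toByteArray(Charsets.UTF_8).size to text.toByteArray(Charsets.UTF_16LE).size

// Decoding a chunk that ends in the middle of a multi-byte sequence loses data:
// the incomplete tail cannot be decoded back into the original character.
fun decodeSplitChunk(): String {
    val bytes = "a\u20AC".toByteArray(Charsets.UTF_8)  // 'a' is 1 byte, the euro sign is 3 bytes
    return bytes.copyOf(2).toString(Charsets.UTF_8)    // cuts the euro sign after its first byte
}
```

For the two-character string "\u00E9\u20AC" the sizes are 5 bytes in UTF-8 and 4 bytes in UTF-16LE, which is why a char reader has to decode through the charset and carry incomplete sequences over to the next chunk.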

soywiz · 2023-12-29T08:35:27Z

soywiz
Dec 29, 2023
Maintainer

I'm looking for an alternative to java.io.InputStreamReader(InputStream in, Charset cs), which can read char arrays from streams. Do we have something like that? I think CharReaderFromSyncStream exists but doesn't support mark and reset. Can we make it markable and also create CharReaderFromAsyncStream?

Not implemented AFAIK. Can you give a concrete example of a case where you would need mark and reset, so I can understand the use case?

I've also modified CharReaderFromSyncStream to add mark/reset support, but it's not efficient. I'm duplicating tempStringBuilder and remembering the state of the stream reads and the buffer, but there's still room for improvement. Below is the modified code:

We can make adjustments so it is efficient. But first I want to understand the use case, to know whether it's possible with the current API without adding extra complexity to the implementation.

2 replies

itboy87 Dec 31, 2023
Author

In the parser, reading characters instead of streams is the requirement. There are several use cases for this approach, including:

Tokenization and Parsing
Configuration File Parsing
Template Processing
Partial Document Parsing

The CharReaderFromSyncStream.clone method has a problem. In scenarios where the user reads an amount less than the chunkSize and subsequently clones it, there is a discrepancy. For instance:

Input data size: 8, ChunkSize: 3, User reads: 2 characters
After the user reads 2 characters, 1 character remains in the buffer, and the stream has already read up to 3 bytes. Upon cloning, it starts from position 4 of the stream, omitting the 3rd byte, which has been read from the stream but remains unread in the CharReader buffer.
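
The numbers above can be reproduced with a miniature model. This is a hypothetical simplification (one byte per character, invented class names), not the actual CharReaderFromSyncStream code, and fixedClone shows only one possible way to address it.

```kotlin
// Hypothetical miniature of a chunked reader, only to reproduce the numbers
// above; one byte equals one character here.
class ByteSource(private val data: ByteArray, var position: Int = 0) {
    fun read(count: Int): ByteArray {
        val end = minOf(data.size, position + count)
        return data.copyOfRange(position, end).also { position = end }
    }

    fun clone(): ByteSource = ByteSource(data, position)
}

class ChunkedCharReader(private val source: ByteSource, private val chunkSize: Int) {
    private val pending = StringBuilder()   // decoded but not yet handed out

    fun read(count: Int): String {
        while (pending.length < count) {
            val chunk = source.read(chunkSize)
            if (chunk.isEmpty()) break
            pending.append(chunk.decodeToString())
        }
        val taken = pending.take(count).toString()
        pending.delete(0, taken.length)
        return taken
    }

    // Naive clone: copies the stream position but forgets the pending characters.
    fun naiveClone() = ChunkedCharReader(source.clone(), chunkSize)

    // One possible fix: carry the pending characters over as well.
    fun fixedClone() =
        ChunkedCharReader(source.clone(), chunkSize).also { it.pending.append(pending.toString()) }
}

fun cloneDemo(): Triple<String, String, String> {
    val reader = ChunkedCharReader(ByteSource("abcdefgh".encodeToByteArray()), chunkSize = 3)
    val first = reader.read(2)   // consumes "ab"; "c" stays pending, stream is at offset 3
    return Triple(first, reader.naiveClone().read(2), reader.fixedClone().read(2))
}
```

Here the naive clone continues with "de" and silently drops "c", while the clone that copies the pending buffer continues with "cd".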

soywiz Jan 3, 2024
Maintainer

Wouldn't a small layer on top of that work? Like buffering and/or a CharDeque?

You can read, say, 1024 bytes at a time from an asynchronous source and keep a deque filled. Then you can peek at characters and only consume them when you are sure you want to consume them. Or consume them by default and re-add them to the start of the deque.
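
A minimal sketch of such a layer, using only the Kotlin standard library. PushbackCharDeque is a hypothetical name, not an existing korge class, and feeding it from an asynchronous source is left out.

```kotlin
// Hypothetical buffer (not a korge class): characters are appended at the back
// as chunks arrive, peeked without consuming, and can be pushed back in front.
class PushbackCharDeque {
    private val chars = ArrayDeque<Char>()

    val available: Int get() = chars.size

    fun write(text: CharSequence) {
        text.forEach { chars.addLast(it) }
    }

    // Looks at the next characters without consuming them.
    fun peek(count: Int): String = chars.take(count).joinToString("")

    // Removes and returns up to `count` characters.
    fun consume(count: Int): String = buildString {
        repeat(minOf(count, chars.size)) { append(chars.removeFirst()) }
    }

    // Re-adds already consumed text to the start, which emulates a rewind.
    fun unread(text: CharSequence) {
        for (i in text.indices.reversed()) chars.addFirst(text[i])
    }
}

fun dequeDemo(): List<String> {
    val deque = PushbackCharDeque()
    deque.write("<!DOCTYPE html>")
    val peeked = deque.peek(2)     // nothing is consumed yet
    val tag = deque.consume(9)     // takes "<!DOCTYPE"
    deque.unread(tag)              // puts it back at the front
    return listOf(peeked, tag, deque.consume(15))
}
```

After the unread call the full "<!DOCTYPE html>" can be consumed again, so a tokenizer can look ahead without the underlying stream having to support mark and reset.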

For templates you would typically have everything in memory. It is true that parsing big XMLs or JSONs might require streaming. But in the case of XMLs or JSONs, I don't see specific reasons for a parser wanting to rewind something already consumed.

itboy87 · 2023-12-31T09:53:31Z

itboy87
Dec 31, 2023
Author

@soywiz Why not concentrate exclusively on the AsyncStream API instead of SyncStream? This is a Kotlin library, and every Kotlin platform supports coroutines. On platforms without coroutines, it can seamlessly function with callbacks. Additionally, we can create blocking APIs on top of suspend functions. This would simplify the development of streams and add flexibility. AsyncStream remains versatile, serving both files and in-memory data.

In Java there is one base class for streams, InputStream. All streaming APIs are built upon this class, and the Reader classes also use InputStream as their basis.

We might require a class similar to InputStream that can be utilized either asynchronously or synchronously.

1 reply

soywiz Jan 3, 2024
Maintainer

From my experience, the asynchronous stuff introduces significant overhead, especially in performance-sensitive scenarios dealing with binary formats. In fact, there is a kind of stream crafted specifically for in-memory ByteArrays, for cases where the overhead of SyncStream was too big. It is used, for example, in TTF parsing, since a TTF file might include a lot of data.

I'm open to introducing async variants as required, if you need them and they make sense, but I would like to keep at least the synchronous variants that currently exist.
