Improve performance and reduce memory consumption #101
I also noted the large memory usage, and how large the
|
Hey folks, just want to chime in on this and say that option (1) mentioned by @Lundez is a great idea that would greatly benefit some people I'm sure, but for some it would be a complete deal breaker making the library unusable. For the production environments we deploy lingua into, having the library try to fetch things from the Internet at runtime and having the library write to some local filesystem location would make the library unusable for us. Both these things would fail because of the locked-down deployment environment. |
Hello, so I am trying to contribute to this task; I hope this is helpful.
I absolutely agree.
A proprietary format can lead to the best performance.
The gain should be negligible.
The JVM stores fields sequentially, so 1
I agree with this. I elaborated on your consideration; maybe 16-bit precision can be enough (the smallest positive frequency is ~2.3E-41). The following POC contains only bitwise operations, so conversions should require few CPU cycles.
@JvmInline // required for a value class on the JVM
value class Frequency(val rawBits: Short) {
constructor(float: Float) : this(float.toBits().shr(14).toShort())
fun toFloat() = Float.fromBits(rawBits.toInt().shl(14))
override fun toString() = toFloat().toString()
}
This class requires the same memory as a
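A minimal round-trip sketch (assuming a Kotlin version that supports secondary constructors in value classes; the input value is arbitrary):
fun main() {
val frequency = Frequency(2868.0F / 1994813.0F)
// prints an approximation of ~0.0014377; only about 9 mantissa bits of precision remain
println(frequency)
}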
|
Are you sure about this? These maps also make up a (large?) portion of the memory usage, especially when using a custom binary encoding. It has been some time since I last experimented with this, but if I recall correctly increasing the load factor reduced memory usage, possibly at the cost of slower lookup. It would probably still be beneficial to investigate this further. |
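For reference, a minimal sketch of how a higher load factor could be configured with fastutil (the 0.9 value and the helper function are illustrative, not what Lingua currently uses):
import it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap
fun buildFrequencyMap(entries: List<Pair<String, Double>>): Object2DoubleOpenHashMap<String> {
// expected size is known up front, and the load factor is raised from the default 0.75 to 0.9,
// trading slightly slower lookups for fewer (and therefore less memory-hungry) buckets
val map = Object2DoubleOpenHashMap<String>(entries.size, 0.9f)
for ((ngram, frequency) in entries) map.put(ngram, frequency)
return map
}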
No, I am not. I created a POC for your future investigation.
class RelativeFrequencies(val ngramLength: Int) {
private var ngrams = charArrayOf()
private var frequencies = floatArrayOf()
val size get() = frequencies.size
fun load(count: Int, data: Iterator<Pair<String, Float>>) {
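// the ngrams must arrive in lexicographically sorted order, because get() relies on binary search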
ngrams = CharArray(count * ngramLength)
frequencies = FloatArray(count)
repeat(count) { i ->
val (ngram, frequency) = data.next()
val base = i * ngramLength
ngram.forEachIndexed { p, c -> ngrams[base + p] = c }
frequencies[i] = frequency
}
}
operator fun get(ngram: String): Float? {
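// binary search over the sorted ngram array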
var low = 0
var high = size - 1
while (low <= high) {
val middle = (low + high) / 2
val diff = compareNgram(middle, ngram)
if (diff < 0) low = middle + 1
else if (diff > 0) high = middle - 1
else return frequencies[middle]
}
return null
}
private fun compareNgram(pos: Int, ngram: String): Int {
val base = pos * ngramLength
repeat(ngramLength) { i ->
val diff = ngrams[base + i].compareTo(ngram[i])
if (diff != 0) return diff
}
return 0
}
}
fun main() {
val rf = RelativeFrequencies(ngramLength = 2)
rf.load(count = 3, listOf("aa" to 0.1F, "ab" to 0.2F, "bb" to 0.3F).iterator())
println(rf["aa"])
println(rf["ab"])
println(rf["bb"])
println(rf["zz"])
} |
I made a little test to understand the memory requirements of my previous POC; I attach it. All languages should require about 370MB of RAM.
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.jsonObject
import kotlinx.serialization.json.jsonPrimitive
import java.io.DataInputStream
import java.io.DataOutputStream
import java.io.File
import java.util.*
import java.util.concurrent.Callable
import java.util.concurrent.ForkJoinPool
fun main() {
// content of https://github.com/pemistahl/lingua/tree/main/src/main/resources/language-models
val jsonModelsDir = File("language-models")
require(jsonModelsDir.isDirectory)
val datModelsDir = File("language-models.dat")
// serialize to binary
if (datModelsDir.mkdir()) translateModel(jsonModelsDir, datModelsDir)
// read en from binary
val en = DataInputStream(File(datModelsDir, "en").inputStream().buffered())
.use(RelativeFrequencies.Companion::read)
println("${en["pc"]} == ${2868.0/1994813.0}")
}
fun translateModel(jsonModelsDir: File, datModelsDir: File) {
val executor = ForkJoinPool.commonPool()
jsonModelsDir.listFiles().forEach { dir ->
val out = DataOutputStream(File(datModelsDir, dir.name).outputStream().buffered())
listOf("unigrams.json", "bigrams.json", "trigrams.json", "quadrigrams.json", "fivegrams.json")
.map { File(dir, it).readText(Charsets.UTF_8) }
.map { content ->
executor.submit(Callable {
val json = Json.parseToJsonElement(content).jsonObject
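// a TreeMap keeps the ngrams sorted, which the binary-search lookup in RelativeFrequencies relies on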
val map = TreeMap<String, Float>()
json.getValue("ngrams").jsonObject.forEach { (key, value) ->
val frequency = key.split('/').let { (n, d) -> (n.toDouble() / d.toDouble()).toFloat() }
value.jsonPrimitive.content.split(' ').forEach { ngram ->
map[ngram] = frequency
}
}
return@Callable map
})
}.map { it.get() }
.forEach { map ->
out.writeInt(map.size)
map.keys.forEach(out::writeChars)
map.values.forEach(out::writeFloat)
}
out.close()
}
}
class RelativeFrequencies private constructor(private val data: Array<Entries>) {
operator fun get(ngram: String): Float = data[ngram.length - 1][ngram]
private class Entries(private val chars: CharArray, private val frequencies: FloatArray) {
val size get() = frequencies.size
operator fun get(ngram: String): Float {
var low = 0
var high = size - 1
while (low <= high) {
val middle = (low + high) / 2
val diff = compareNgram(middle, ngram)
if (diff < 0) low = middle + 1
else if (diff > 0) high = middle - 1
else return frequencies[middle]
}
return 0F
}
private fun compareNgram(pos: Int, ngram: String): Int {
val base = pos * ngram.length
repeat(ngram.length) { i ->
val diff = chars[base + i].compareTo(ngram[i])
if (diff != 0) return diff
}
return 0
}
}
companion object {
fun read(dataInputStream: DataInputStream): RelativeFrequencies {
val data = Array(5) { n ->
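// one entry table per ngram length (1..5), read back in the same order translateModel wrote them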
val size = dataInputStream.readInt()
val chars = CharArray(size * (n + 1)) { dataInputStream.readChar() }
val float = FloatArray(size) { dataInputStream.readFloat() }
Entries(chars, float)
}
return RelativeFrequencies(data)
}
}
} |
I created a pull request (#127) containing my previous considerations. |
It is nice to read that. Can you share your results, please? |
@fvasco Below are the results of @sigpwned's benchmark on my local machine.
Lingua 1.1.1
Lingua 1.2.1 - high accuracy mode
Lingua 1.2.1 - low accuracy mode
The new low accuracy mode is the most performant one as it uses trigrams only. But the accuracy is worse for short text. |
This is super interesting work, @pemistahl. I'm not terribly familiar with the build and release process you use for lingua, but would it be useful to have a set of benchmarks integrated into the build and/or CI/CD for the library? If so, I can certainly take a look at how that might work. |
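Not a full integration proposal, just a rough idea of what a single JMH benchmark could look like in Kotlin (the class name and sample text are made up; real benchmark texts would come from the benchmark suite mentioned above):
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State
@State(Scope.Benchmark)
open class DetectionBenchmark {
// the detector is built once, outside the measured method, so model loading is not part of the measurement
private val detector = LanguageDetectorBuilder.fromAllLanguages().build()
@Benchmark
fun detectShortText() = detector.detectLanguageOf("languages are awesome")
}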
@sigpwned Sure, feel free to open a PR. Having a benchmark integrated into the project is not a bad idea. I find JMH a bit complicated, so any help is appreciated. |
Over the past months I have been tinkering with Lingua performance optimizations, and I think I have now got it to a somewhat maintainable state: Would be great if you could give it a try and let me know what you think! The changes are pretty extensive and the code is not easily maintainable anymore so I assume you, @pemistahl, won't be interested in most of these changes. But if you notice any particular change you want to include in Lingua, feel free to let me know (or apply the change yourself). I might also try to backport some of the less extensive changes to Lingua (if you are interested). |
Hi @Marcono1234, thank you for all the hard work you have done, both now and over the past months. You have been of great help for this project. :) May I ask what motivates you to contribute so much to my work? Is it just for fun, for your own learning purposes, or because you need a language detector for your own work? I will gladly study your changes and certainly backport some of them. But if some of the code turns out to be too complicated to maintain, I will most probably refrain from merging it. Of all your changes, the binary format has the greatest effect, I suppose. Am I right? This is something I would probably adopt. JSON is obviously not the best fit here. |
Originally I became aware of Lingua and started tinkering around with it because its memory usage was too high for our use case (mojira/arisa-kt#656 (comment)). But I think soon afterwards it was mostly for my own learning purposes and for experimenting with Kotlin. Then I stopped for multiple months and continued with it this year after the discussion here became more active again. Though as a warning, my fork is definitely still experimental, might contain bugs, and most likely contains premature optimization. Some of the code also originates from the time when Lingua had more language model files (before you removed them in e9757c7), so the code might be redundant or could be simplified (Marcono1234#1).
That would mostly address loading speed and the size of the JAR, but not detection speed or runtime memory consumption (which is certainly a good start nonetheless). And it can probably be implemented without having to adjust any of the language detection code. However, you can most likely not reuse much of the code from my fork because it is strongly tied to how the language models are represented in memory. Interestingly, the JAR file of my fork is not much smaller despite using a binary encoding (69.7 MB compared to Lingua's 76.7 MB). Most likely the ZIP compression of the JAR file already removes most of the redundancy in the JSON. 7-Zip says the |
As pointed out in #39 and #57, Lingua's great accuracy comes at the cost of high memory usage. This poses a problem for some projects trying to use Lingua.
In this issue I will try to highlight some main areas where performance can be improved; some of this is already covered by #98. Note that some of the proposed changes might decrease execution speed or require some larger refactoring.
Model files
Instead of storing the model data in JSON format, a binary format could be used matching the in-memory format (see the "In-memory models" section). This would have the advantage that the Char2DoubleOpenHashMap could be created with the expected size, avoiding rehashing of the maps during deserialization.
Note that even though the fastutil maps are Serializable, using JDK serialization might introduce unnecessary overhead and would make this library dependent on the internal serialization format of the fastutil maps. Instead the data could be written manually to a DataOutputStream.
Model file loading
kotlinx-serialization-json does not seem to support streaming yet. Therefore the complete model files are currently loaded as a String before being parsed. This is (likely) slow and requires large amounts of memory. Instead, streaming JSON libraries such as https://github.com/square/moshi should be used.
Note that this point becomes obsolete if a binary format (as described in the "Model files" section above) is used.
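Returning to the binary-format idea above, a minimal sketch of what reading a manually written unigram table could look like (the layout, a count followed by char/frequency pairs, is illustrative and not an agreed format):
import it.unimi.dsi.fastutil.chars.Char2DoubleOpenHashMap
import java.io.DataInputStream
// reads a unigram table written as: entry count, then (char, frequency) pairs;
// pre-sizing the map with the stored count avoids rehashing while it is filled
fun readUnigramTable(input: DataInputStream): Char2DoubleOpenHashMap {
val size = input.readInt()
val map = Char2DoubleOpenHashMap(size)
repeat(size) {
val ngram = input.readChar()
val frequency = input.readDouble()
map.put(ngram, frequency)
}
return map
}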
In-memory models
The Object2DoubleOpenHashMap load factor can be increased from the default 0.75 to a higher value. This reduces memory usage but might slow down execution.
Ngrams can be encoded using primitives. Since this project uses only up to fivegrams (5 chars), most of the ngrams (and for some languages even ngrams of all lengths) can be encoded as JVM primitives using bitwise operations, e.g.:
unigrams: Byte or Char
bigrams: Short or Int
trigrams: Int or Long
quadrigrams: Int or Long
fivegrams: Long, or in the worst case a String object.
Note that at least for fivegrams the binary encoding should probably be offset based, so one char is the base code point and the remaining bits of the Long encode the offsets of the other chars relative to the base char. This allows encoding alphabets such as Georgian where each char needs more than Long.SIZE_BITS / 5 bits. This might even increase execution speed since it avoids hashCode() and equals(...) calls when looking up frequencies (the speed-up, if any, has to be tested though). A rough sketch of such bitwise packing follows at the end of this section.
Reduce the frequency accuracy for in-memory models and model files from 64-bit Double to 32-bit. This can have a big impact on memory usage, saving more than 100MB with all models preloaded. However, instead of using a 32-bit Float to store the frequency, a custom 32-bit encoding can (and maybe should) be used, since Float 'wastes' some bits for the sign (the frequency will never be negative) and the exponent (the frequency will never be >= 1.0), though this might decrease language detection speed due to the decoding overhead.
Remove Korean fivegrams (and quadrigrams?). The Korean language models are quite large; additionally, due to the large range of Korean code points, a great majority (> 1,000,000 fivegrams (?)) cannot be encoded with the primitive encoding approach outlined above. Chinese and Japanese don't seem to have quadrigram and fivegram models either; not sure if this is due to how the languages work, but maybe it would be acceptable to drop them for Korean as well, also because detection of Korean seems to be rather unambiguous.
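As a rough illustration of the bitwise packing mentioned above (plain 16-bits-per-char packing for bigrams and trigrams; the offset-based scheme for longer ngrams is not shown):
// packs a 2-char ngram into an Int and a 3-char ngram into a Long, using 16 bits per char
fun encodeBigram(ngram: String): Int =
(ngram[0].code shl 16) or ngram[1].code
fun encodeTrigram(ngram: String): Long =
(ngram[0].code.toLong() shl 32) or (ngram[1].code.toLong() shl 16) or ngram[2].code.toLong()
fun decodeBigram(encoded: Int): String {
val first = (encoded ushr 16).toChar()
val second = (encoded and 0xFFFF).toChar()
return "$first$second"
}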
Runtime performance
Alphabet: the Alphabet class can probably be removed; Character.UnicodeScript seems to be an exact substitute and might allow avoiding some indirection, e.g. only look up the UnicodeScript for a Char once and then compare it with the expected ones instead of having each Alphabet look up the UnicodeScript. A small sketch of this follows after the conclusion.
Ngram objects: similar to the primitive encoding described in "In-memory models" above, Ngram objects created as part of splitting up the text can be avoided as well (with a different encoding). A Kotlin inline class can be used to still get type safety and have some convenience functions. Primitive encoding can only support trigrams reliably without too much overhead / too complicated encoding, but that is probably fine because since d0f7a7c at most trigrams will be used for longer texts.
The lazy frequency lookup does not need to happen in every iteration; it might be faster to access it once at the beginning and then use it directly (though this could also be premature optimization).
Conclusion
With some / all of these suggestions applied, memory usage can be reduced and execution speed can be increased without affecting accuracy. However, some of the suggestions might be premature optimization, and they only work for 16-bit Char but not for supplementary code points (> 16-bit) (though the current implementation, mainly Ngram creation, seems to have that limitation as well).
I have implemented some of these optimizations and some other minor improvements in https://github.com/Marcono1234/lingua/tree/experimental/performance. However, these changes are pretty experimental: the Git history is not very nice to look at; in some commits I fixed bugs I introduced before or reverted changes again. Additionally, the unit tests and model file writing are broken. Some of the changes might also be premature optimization. Though maybe it is interesting nonetheless: it appears the memory usage with all languages preloaded went down to about 640MB (Edit: 920MB, I made a mistake in the binary encoding) on AdoptOpenJDK 11.
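To illustrate the Character.UnicodeScript substitution mentioned under "Runtime performance" (a hypothetical helper, not code from Lingua or from the fork linked above):
import java.lang.Character.UnicodeScript
// looks up the script of each char exactly once and counts the chars matching the expected script,
// instead of having every Alphabet re-derive the script for the same char
fun countCharsInScript(text: String, expected: UnicodeScript): Int =
text.count { char -> UnicodeScript.of(char.code) == expected }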