Add unit tests #9
base: master
Conversation
@snowyu Sorry, I was just putting this into the new testing infrastructure! Check it out. This is all through the Python API. I'm not too worried about it being different from the underlying C++; it's a really thin wrapper, for tokenization especially. And if something breaks, it'll show up as a difference from the HuggingFace output anyway. And incidentally, these tests all pass.
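As a rough illustration of the parity check being described, here is a minimal sketch. The `bert_cpp` module name and its `BertModel.tokenize` method are hypothetical stand-ins for the project's actual Python binding, and the file names are placeholders:

```python
# Hedged sketch of a tokenization parity test against HuggingFace.
# `bert_cpp` and `BertModel.tokenize` are hypothetical names, not the
# project's actual API; model and file names are placeholders too.
from transformers import AutoTokenizer

import bert_cpp  # hypothetical Python binding over the C++ library

def test_tokenize_matches_hf():
    hf_tok = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")
    cpp_model = bert_cpp.BertModel("bge-large-zh-v1.5.gguf")
    with open("test_prompts.txt", encoding="utf-8") as f:
        prompts = [line.strip() for line in f if line.strip()]
    for prompt in prompts:
        expected = hf_tok(prompt)["input_ids"]   # reference token IDs
        actual = cpp_model.tokenize(prompt)       # C++ tokenizer via wrapper
        assert actual == expected, f"token mismatch on {prompt!r}"
```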
I thought that converting text to tokens is only necessary when embedding (bert_forward). If you are using Python, JS, or Rust, you can directly use the official tokenizer, so it is not needed on the C++ side. This situation will continue until llama.cpp completely implements HF's dynamic tokenizer. The current problem is that llama.cpp's GGUF format lacks the necessary configuration required by the tokenizer. See: ggerganov/llama.cpp#5143. Regarding bge-m3's tokenizer, it does special processing in the normalizer and further processing in the subsequent post_processor; see its tokenizer.json.
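Those extra pipeline stages are easy to see in the file itself. A quick sketch, assuming bge-m3's tokenizer.json has been downloaded locally (e.g. from the BAAI/bge-m3 repo on the Hugging Face Hub):

```python
# Inspect the tokenizer pipeline stages that sit outside the plain
# vocab, which is what gets lost when only vocab data reaches GGUF.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

# For bge-m3, the normalizer and post_processor are non-trivial;
# print each stage's type (None if the stage is absent).
for stage in ("normalizer", "pre_tokenizer", "post_processor"):
    cfg = tok.get(stage)
    print(stage, "->", cfg["type"] if cfg else None)
```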
We still need to support tokenization on the C++ side of things. People aren't always going to want to use the Python interface, if they're doing inference on edge devices for instance. I'm not sure putting the full tokenizer config in the GGUF file is the best option here. In the end, the difficult part is implementing the tokenizer logic in C++. As we get more coverage, we can include tokenizer params in the GGUF that reflect that, and try to do as much pre-processing as possible in Python land at conversion time. Realistically, there aren't that many tokenizer types out there. I think we should err on the side of keeping things as simple as possible.
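For concreteness, here is a hedged sketch of what writing tokenizer params at conversion time could look like with the `gguf` Python package from the llama.cpp ecosystem. The toy vocab, token IDs, and the extra KV key are illustrative assumptions, not a proposed schema:

```python
# Sketch: stashing tokenizer parameters into a GGUF file at conversion
# time. Which keys the project should standardize on is exactly the
# open question in this thread; everything below is illustrative.
import gguf

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # toy vocab

writer = gguf.GGUFWriter("model.gguf", arch="bert")
writer.add_tokenizer_model("bert")   # tokenizer.ggml.model
writer.add_token_list(vocab)         # tokenizer.ggml.tokens
writer.add_bos_token_id(2)           # [CLS] in this toy vocab
writer.add_eos_token_id(3)           # [SEP]
# Params nothing consumes yet can still ride along as plain KV pairs:
writer.add_bool("tokenizer.do_lower_case", True)  # assumed key name

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```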
Yes, indeed. But we need a workaround currently.
What's the best option, then? We need a way to move forward now. If GGUF does not embed the tokenizer-related parameters, how do we proceed? Whether we implement the tokenizer in C++ or use a tokenizer in another language, the parameters are required. If the necessary parameters aren't there and we still need to use GGUF, the only (ugly) way is to copy tokenizer.json and tokenizer_config.json into the directory where the GGUF file is located. Therefore, I think putting the tokenizer parameters in GGUF is the first step, and implementing the tokenizer is the second step. In fact, some parameters are already in the GGUF file, and you are already using them. I don't understand why there is a problem with pre-writing tokenizer parameters that are temporarily unused. Btw, transformers.js has its own tokenizer implementation in JS.
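That sidecar workaround is easy to sketch: a minimal fallback loader, assuming the HF `tokenizers` package and a tokenizer.json copied next to the .gguf file (the paths and helper name are illustrative):

```python
# Fallback for a GGUF that lacks tokenizer params: look for a
# tokenizer.json sitting next to the .gguf file and load that instead.
from pathlib import Path
from tokenizers import Tokenizer

def load_sidecar_tokenizer(gguf_path: str) -> Tokenizer:
    sidecar = Path(gguf_path).with_name("tokenizer.json")
    if not sidecar.exists():
        raise FileNotFoundError(
            f"no tokenizer params in GGUF and no sidecar at {sidecar}"
        )
    return Tokenizer.from_file(str(sidecar))

tok = load_sidecar_tokenizer("models/bge-m3/model.gguf")  # example path
print(tok.encode("hello world").ids)
```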
Defaults to bge-large-zh-v1.5 for testing; see TEST_MODEL_NAME in tests/CMakeLists.txt to change it.
Test tokens are in test_prompts.txt.
Run the unit tests:
make -C build test
It will: