-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
StringNormalizer drops strings when they only contain stop words #1031
Conversation
Signed-off-by: Xavier Dupre <[email protected]>
@xadupre Hi! Any updates on this problem ? As I understand the fast fix would be to process every string independently. RIght ? |
That would work. |
Signed-off-by: Xavier Dupre <[email protected]>
My tests show that performance on InferenceSession run is twice slower without vectorization - 400 rps instead of 800 rps on my server. |
If you are using a loop, it is not really suprising. There is no parallelization even though each row is processed independently. |
May you suggest some impovement in this case ? My guess is if I know indices of dropped strings then I can rerun inference session only on them instead loop over all input items. |
Also I want to notice that I don't user stop_words parameter. How exactly set stop_words parameter to empty or None while converting to ONNX model ? |
One issue is StringNormalizer is defined in onnx. To change its behaviour, it has to be changed in onnx and onnxruntime. It is long. onnxruntime-extensions is a project implementing custom kernel without going through the process of validating the definition by onnx mainteners. The best option here is to implement is to implement a custom kernel. I'm hesitating in adding more in onnxruntime or creating a similar projet to host custom kernel dedicated to standard machine learning. But a custom kernel would solve the issue. |
@xadupre What do you mean by custom kernel exactly ? |
A C++ implementation of an operator. Let's assume we define the operator StringNormalizer2. onnxruntime does not know it because it is new so it cannot run the inference unless we give it a piece of code compiled into an assembly. onnxruntime lets you do that. See https://onnxruntime.ai/docs/reference/operators/add-custom-op.html. |
This issue seems to be fixed with scikit-learn==1.5.0. Could you check on your side? |
…x#1031) * investigate Signed-off-by: Xavier Dupre <[email protected]> * fix unit tests Signed-off-by: Xavier Dupre <[email protected]> --------- Signed-off-by: Xavier Dupre <[email protected]>
* Extend CI to test with onnxruntime==1.18.0 (#1093) * Extend CI to test with onnxruntime==1.18.0 Signed-off-by: Xavier Dupre <[email protected]> * update doc Signed-off-by: Xavier Dupre <[email protected]> * simplify pipelines Signed-off-by: Xavier Dupre <[email protected]> * rename master into main Signed-off-by: Xavier Dupre <[email protected]> * action Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * fix ci Signed-off-by: Xavier Dupre <[email protected]> * example Signed-off-by: Xavier Dupre <[email protected]> * remove benchmark Signed-off-by: Xavier Dupre <[email protected]> * doc Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * fix ci Signed-off-by: Xavier Dupre <[email protected]> * fix unittest Signed-off-by: Xavier Dupre <[email protected]> * fix ci Signed-off-by: Xavier Dupre <[email protected]> * fix ci Signed-off-by: Xavier Dupre <[email protected]> --------- Signed-off-by: Xavier Dupre <[email protected]> * Increase last supported opset to 19 in readme (#1087) See: https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/__init__.py#L12 Signed-off-by: XiangRongLin <[email protected]> Signed-off-by: Xavier Dupre <[email protected]> * Fix the converters for scikit-learn==1.5.0 (#1095) * Extend CI to test with onnxruntime==1.18.0 Signed-off-by: Xavier Dupre <[email protected]> * update doc Signed-off-by: Xavier Dupre <[email protected]> * simplify pipelines Signed-off-by: Xavier Dupre <[email protected]> * rename master into main Signed-off-by: Xavier Dupre <[email protected]> * action Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * update CI Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * fix ci Signed-off-by: Xavier Dupre <[email protected]> * example Signed-off-by: Xavier Dupre <[email protected]> * remove benchmark Signed-off-by: Xavier Dupre <[email protected]> * doc Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * fix ci Signed-off-by: Xavier Dupre <[email protected]> * fix unittest Signed-off-by: Xavier Dupre <[email protected]> * fix ci Signed-off-by: Xavier Dupre <[email protected]> * fix ci Signed-off-by: Xavier Dupre <[email protected]> * fix title Signed-off-by: Xavier Dupre <[email protected]> * fix disc Signed-off-by: Xavier Dupre <[email protected]> * better ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * linear Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * fix two unit tests Signed-off-by: Xavier Dupre <[email protected]> * fix PLSRegression Signed-off-by: Xavier Dupre <[email protected]> * fix version Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * ci Signed-off-by: Xavier Dupre <[email protected]> * precision Signed-off-by: Xavier Dupre <[email protected]> * fix unit test Signed-off-by: Xavier Dupre <[email protected]> --------- Signed-off-by: Xavier Dupre <[email protected]> * StringNormalizer drops strings when they only contain stop words (#1031) * investigate Signed-off-by: Xavier Dupre <[email protected]> * fix unit tests Signed-off-by: Xavier Dupre <[email protected]> --------- Signed-off-by: Xavier Dupre <[email protected]> * Fix for gaussian process Signed-off-by: Xavier Dupre <[email protected]> * add test for issue 1073 Signed-off-by: Xavier Dupre <[email protected]> * Fix multi dimensional GaussianRegressor Signed-off-by: Xavier Dupre <[email protected]> * doc Signed-off-by: Xavier Dupre <[email protected]> --------- Signed-off-by: Xavier Dupre <[email protected]> Signed-off-by: XiangRongLin <[email protected]> Co-authored-by: XiangRongLin <[email protected]> Co-authored-by: Liberty Askew <[email protected]> Co-authored-by: dreivmeister <[email protected]>
Issue raised in microsoft/onnxruntime#17795 and #998. After investigation, operator StringNormalizer drops strings containing only stopwords. The expected output can be obtained by running the inference for every input independantly.
A possible fix would be to update onnx specifications to not drop strings. Another fix could be to update the converter and loop over rows but that's not very efficient.