[BUG] Unable to process PDF files containing Traditional Chinese characters, reporting encoding errors #2129
Comments
I have a similar issue: private-gpt % cat version.txt
@ulnit @yaziciali Hi, can you provide some test data? I've tried both Simplified Chinese and Traditional Chinese PDFs, and everything works fine. settings:
Description
10:36:32.953 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449514-0.PDF
10:36:32.954 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449527-0.PDF
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/site-packages/injector/__init__.py", line 800, in get
return self._context[key]
~~~~~~~~~~~~~^^^^^
KeyError: <class 'private_gpt.server.ingest.ingest_service.IngestService'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
documents = IngestionHelper._load_file_to_documents(file_name, file_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
return string_reader.load_data([file_data.read_text()])
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/pathlib.py", line 1059, in read_text
return f.read()
^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/software/private-gpt/scripts/ingest_folder.py", line 121, in <module>
worker.ingest_folder(root_path, args.ignored)
File "/data/software/private-gpt/scripts/ingest_folder.py", line 57, in ingest_folder
self._ingest_all(self._files_under_root_folder)
File "/data/software/private-gpt/scripts/ingest_folder.py", line 61, in _ingest_all
self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
File "/data/software/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
documents = self.ingest_component.bulk_ingest(files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 279, in bulk_ingest
self._ingest_work_pool.starmap(self.ingest, files)
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 375, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 264, in ingest
documents = self._file_to_documents_work_pool.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 360, in apply
return self.apply_async(func, args, kwds).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
make: *** [Makefile:52:ingest] error 1
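The final error can be reproduced outside private-gpt. The traceback shows `file_data.read_text()` being called on a `.PDF` file, so the raw PDF bytes are decoded as UTF-8 text. The sketch below assumes a typical binary PDF header (the `ASSUMED_PDF_HEADER` bytes and the `try_decode` helper are illustrative, not taken from the reporter's files) and shows that such a header fails in the same way as the log above:

```python
# Assumption: many PDF writers emit a comment line of high-bit bytes right
# after the version line to mark the file as binary. These bytes are
# illustrative and were not taken from the reporter's files.
ASSUMED_PDF_HEADER = b"%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n"


def try_decode(data: bytes) -> str:
    """Decode as UTF-8 and describe the first failure, if any."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        bad_byte = data[exc.start]
        return f"{exc.reason} at position {exc.start} (byte {bad_byte:#x})"
    return "decoded cleanly"


print(try_decode(ASSUMED_PDF_HEADER))
# prints: invalid start byte at position 11 (byte 0xb5)
```

If this assumption holds, the error is not specific to Traditional Chinese content: any binary PDF read as plain text would fail in the same place.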
Steps to Reproduce
#PGPT_PROFILES=ollama-pg make ingest /data/private_gpt_data/s_reports/s_hk_reports/ -- --watch
Expected Behavior
The PDF files should be ingested normally.
Actual Behavior
UnicodeDecodeError
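One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the reader is chosen by a case-sensitive extension lookup, so the upper-case `.PDF` in the log misses the `.pdf` entry and falls through to the plain-text path. A minimal sketch of that behaviour follows; `READERS_BY_SUFFIX`, `FALLBACK_READER`, and `choose_reader` are hypothetical names for illustration, not private-gpt's actual code:

```python
from pathlib import Path

# Hypothetical mapping for illustration only; not private-gpt's real table.
READERS_BY_SUFFIX = {".pdf": "pdf_reader", ".docx": "docx_reader"}
FALLBACK_READER = "plain_text_reader"


def choose_reader(file_name: str, normalize_case: bool) -> str:
    """Pick a reader by file extension, optionally ignoring case."""
    suffix = Path(file_name).suffix
    if normalize_case:
        suffix = suffix.lower()
    return READERS_BY_SUFFIX.get(suffix, FALLBACK_READER)


# File name taken from the log above.
print(choose_reader("11449514-0.PDF", normalize_case=False))  # plain_text_reader
print(choose_reader("11449514-0.PDF", normalize_case=True))   # pdf_reader
```

If this is the cause, renaming the files to a lower-case `.pdf` extension before ingesting would be a quick way to test the hypothesis.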
Environment
CPU Python 3.11.10
Additional Information
No response
Version
No response