
[BUG] Unable to process PDF files containing Traditional Chinese characters, reporting encoding errors #2129

Open
ulnit opened this issue Nov 25, 2024 · 2 comments
Labels
bug Something isn't working

Comments

ulnit commented Nov 25, 2024

Pre-check

  • I have searched the existing issues and none cover this bug.

Description

10:36:32.953 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449514-0.PDF
10:36:32.954 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449527-0.PDF
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/site-packages/injector/init.py", line 800, in get
return self._context[key]
~~~~~~~~~~~~~^^^^^
KeyError: <class 'private_gpt.server.ingest.ingest_service.IngestService'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
documents = IngestionHelper._load_file_to_documents(file_name, file_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
return string_reader.load_data([file_data.read_text()])
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/pathlib.py", line 1059, in read_text
return f.read()
^^^^^^^^
File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/software/private-gpt/scripts/ingest_folder.py", line 121, in
worker.ingest_folder(root_path, args.ignored)
File "/data/software/private-gpt/scripts/ingest_folder.py", line 57, in ingest_folder
self._ingest_all(self._files_under_root_folder)
File "/data/software/private-gpt/scripts/ingest_folder.py", line 61, in _ingest_all
self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
File "/data/software/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
documents = self.ingest_component.bulk_ingest(files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 279, in bulk_ingest
self._ingest_work_pool.starmap(self.ingest, files)
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 375, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 264, in ingest
documents = self._file_to_documents_work_pool.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 360, in apply
return self.apply_async(func, args, kwds).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
make: *** [Makefile:52:ingest] error 1
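
The fallback visible in the traceback explains the failure: _load_file_to_documents (ingest_helper.py, line 92) calls string_reader.load_data([file_data.read_text()]), and Path.read_text() decodes the file's raw bytes as UTF-8. A PDF is a binary format, so the decode fails on the first byte that is not valid UTF-8. A minimal sketch reproducing the same error outside private-gpt (file name and bytes are illustrative, not taken from the report):

from pathlib import Path

# Illustrative bytes: a PDF header line followed by the binary comment line
# many PDF writers emit. 0xb5 is a UTF-8 continuation byte, so decoding it
# where a start byte is expected raises "invalid start byte".
pdf_bytes = b"%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n"

path = Path("sample.PDF")
path.write_bytes(pdf_bytes)
path.read_text()  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11

This suggests the problem is not the Traditional Chinese text itself, but that the file's bytes were read as plain text instead of being routed to a PDF reader.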

Steps to Reproduce

#PGPT_PROFILES=ollama-pg make ingest /data/private_gpt_data/s_reports/s_hk_reports/ -- --watch
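
Note that the log lines above show an uppercase .PDF suffix (11449514-0.PDF). If the extension-to-reader lookup is case-sensitive and keyed on lowercase suffixes (an assumption about ingest_helper.py, not verified here), such files would miss the PDF reader and fall through to the plain-text fallback that raised the error. A hedged workaround sketch under that assumption, run before make ingest:

from pathlib import Path

# Hypothetical workaround: lowercase every file suffix so a case-sensitive
# extension-to-reader lookup matches; the folder path is from the command above.
root = Path("/data/private_gpt_data/s_reports/s_hk_reports")
for p in root.rglob("*"):
    if p.is_file() and p.suffix != p.suffix.lower():
        p.rename(p.with_suffix(p.suffix.lower()))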

Expected Behavior

Ingestion should complete normally.

Actual Behavior

UnicodeDecodeError

Environment

CPU; Python 3.11.10

Additional Information

No response

Version

No response

Setup Checklist

  • Confirm that you have followed the installation instructions in the project’s documentation.
  • Check that you are using the latest version of the project.
  • Verify disk space availability for model storage and data processing.
  • Ensure that you have the necessary permissions to run the project.

NVIDIA GPU Setup Checklist

  • Check that all CUDA dependencies are installed and compatible with your GPU (refer to CUDA's documentation).
  • Ensure an NVIDIA GPU is installed and recognized by the system (run nvidia-smi to verify).
  • Ensure proper permissions are set for accessing GPU resources.
  • Docker users - Verify that the NVIDIA Container Toolkit is configured correctly (e.g. run sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi)
ulnit added the bug label on Nov 25, 2024
yaziciali commented Nov 28, 2024

I have a similar issue:
Generating embeddings: 0it [00:00, ?it/s]
Traceback (most recent call last):
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 122, in
worker.ingest_folder(root_path, args.ignored)
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 58, in ingest_folder
self._ingest_all(self._files_under_root_folder)
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 62, in _ingest_all
self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
File "/Users/user/AI/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
documents = self.ingest_component.bulk_ingest(files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_component.py", line 132, in bulk_ingest
documents = IngestionHelper.transform_file_into_documents(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
documents = IngestionHelper._load_file_to_documents(file_name, file_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
return string_reader.load_data([file_data.read_text()])
^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1059, in read_text
return f.read()
^^^^^^^^
File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 14: invalid start byte
make: *** [ingest] Error 1

private-gpt % cat version.txt
0.6.2
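
The byte 0xff in this traceback is never valid anywhere in a UTF-8 stream, which again points at a binary file being decoded by the plain-text fallback rather than an encoding problem in the document text. A small pre-flight sketch (the folder path and helper name are made up) to list the files that would crash that fallback before ingesting:

from pathlib import Path

# Hypothetical pre-flight check: flag files whose bytes are not valid UTF-8,
# since those are the ones the plain-text fallback reader cannot decode.
def is_utf8_text(path: Path) -> bool:
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for p in Path("docs_to_ingest").rglob("*"):
    if p.is_file() and not is_utf8_text(p):
        print(f"will fail the UTF-8 fallback: {p}")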

navono commented Jan 10, 2025

@ulnit @yaziciali Hi, can you provide some test data? I've tried Simplified and Traditional Chinese PDFs, and everything works fine.

settings:

server:
  env_name: ${APP_ENV:ollama}

llm:
  mode: ollama
  max_new_tokens: 512
  context_window: 3900
  temperature: 0.1     #The temperature of the model. Increasing the temperature will make the model answer more creatively. A value of 0.1 would be more factual. (Default: 0.1)

embedding:
  mode: ollama

ollama:
  llm_model: llama3.2
  embedding_model: bge-m3
  api_base: http://localhost:11434
  embedding_api_base: http://localhost:11434  # change if your embedding model runs on another ollama
  keep_alive: 5m
  tfs_z: 1.0              # Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.
  top_k: 40               # Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)
  top_p: 0.9              # Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)
  repeat_last_n: 64       # Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
  repeat_penalty: 1.2     # Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)
  request_timeout: 120.0  # Time elapsed until ollama times out the request. Default is 120s. Format is float.

vectorstore:
  database: qdrant

qdrant:
  path: local_data/private_gpt/qdrant

