
[BUG] Unable to process PDF files containing Traditional Chinese characters, reporting encoding errors #2129

Open
ulnit opened this issue Nov 25, 2024 · 2 comments
Labels
bug Something isn't working

Comments

ulnit commented Nov 25, 2024

Pre-check

  • I have searched the existing issues and none cover this bug.

Description

10:36:32.953 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449514-0.PDF
10:36:32.954 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449527-0.PDF
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/site-packages/injector/init.py", line 800, in get
return self._context[key]
~~~~~~~~~~~~~^^^^^
KeyError: <class 'private_gpt.server.ingest.ingest_service.IngestService'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
documents = IngestionHelper._load_file_to_documents(file_name, file_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
return string_reader.load_data([file_data.read_text()])
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/pathlib.py", line 1059, in read_text
return f.read()
^^^^^^^^
File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/software/private-gpt/scripts/ingest_folder.py", line 121, in
worker.ingest_folder(root_path, args.ignored)
File "/data/software/private-gpt/scripts/ingest_folder.py", line 57, in ingest_folder
self._ingest_all(self._files_under_root_folder)
File "/data/software/private-gpt/scripts/ingest_folder.py", line 61, in _ingest_all
self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
File "/data/software/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
documents = self.ingest_component.bulk_ingest(files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 279, in bulk_ingest
self._ingest_work_pool.starmap(self.ingest, files)
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 375, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 264, in ingest
documents = self._file_to_documents_work_pool.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 360, in apply
return self.apply_async(func, args, kwds).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
make: *** [Makefile:52:ingest] error 1
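
The fallback visible in the traceback explains the failure: _load_file_to_documents (ingest_helper.py, line 92) calls string_reader.load_data([file_data.read_text()]), and Path.read_text() decodes the file's raw bytes as UTF-8. A PDF is a binary format, so the decode fails on the first byte that is not valid UTF-8. A minimal sketch reproducing the same error outside private-gpt (file name and bytes are illustrative, not taken from the report):

from pathlib import Path

# Illustrative bytes: a PDF header line followed by the binary comment line
# many PDF writers emit. 0xb5 is a UTF-8 continuation byte, so decoding it
# where a start byte is expected raises "invalid start byte".
pdf_bytes = b"%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n"

path = Path("sample.PDF")
path.write_bytes(pdf_bytes)
path.read_text()  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11

This suggests the problem is not the Traditional Chinese text itself, but that the file's bytes were read as plain text instead of being routed to a PDF reader.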

Steps to Reproduce

#PGPT_PROFILES=ollama-pg make ingest /data/private_gpt_data/s_reports/s_hk_reports/ -- --watch
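
Note that the log lines above show an uppercase .PDF suffix (11449514-0.PDF). If the extension-to-reader lookup is case-sensitive and keyed on lowercase suffixes (an assumption about ingest_helper.py, not verified here), such files would miss the PDF reader and fall through to the plain-text fallback that raised the error. A hedged workaround sketch under that assumption, run before make ingest:

from pathlib import Path

# Hypothetical workaround: lowercase every file suffix so a case-sensitive
# extension-to-reader lookup matches; the folder path is from the command above.
root = Path("/data/private_gpt_data/s_reports/s_hk_reports")
for p in root.rglob("*"):
    if p.is_file() and p.suffix != p.suffix.lower():
        p.rename(p.with_suffix(p.suffix.lower()))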

Expected Behavior

Ingestion should complete normally.

Actual Behavior

UnicodeDecodeError

Environment

CPU; Python 3.11.10

Additional Information

No response

Version

No response

Setup Checklist

  • Confirm that you have followed the installation instructions in the project’s documentation.
  • Check that you are using the latest version of the project.
  • Verify disk space availability for model storage and data processing.
  • Ensure that you have the necessary permissions to run the project.

NVIDIA GPU Setup Checklist

  • Check that all CUDA dependencies are installed and compatible with your GPU (refer to CUDA's documentation).
  • Ensure an NVIDIA GPU is installed and recognized by the system (run nvidia-smi to verify).
  • Ensure proper permissions are set for accessing GPU resources.
  • Docker users - Verify that the NVIDIA Container Toolkit is configured correctly (e.g. run sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi)
ulnit added the bug label on Nov 25, 2024
yaziciali commented Nov 28, 2024

I have a similar issue:
Generating embeddings: 0it [00:00, ?it/s]
Traceback (most recent call last):
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 122, in
worker.ingest_folder(root_path, args.ignored)
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 58, in ingest_folder
self._ingest_all(self._files_under_root_folder)
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 62, in _ingest_all
self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
File "/Users/user/AI/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
documents = self.ingest_component.bulk_ingest(files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_component.py", line 132, in bulk_ingest
documents = IngestionHelper.transform_file_into_documents(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
documents = IngestionHelper._load_file_to_documents(file_name, file_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
return string_reader.load_data([file_data.read_text()])
^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1059, in read_text
return f.read()
^^^^^^^^
File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 14: invalid start byte
make: *** [ingest] Error 1

private-gpt % cat version.txt
0.6.2
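
The byte 0xff in this traceback is never valid anywhere in a UTF-8 stream, which again points at a binary file being decoded by the plain-text fallback rather than an encoding problem in the document text. A small pre-flight sketch (the folder path and helper name are made up) to list the files that would crash that fallback before ingesting:

from pathlib import Path

# Hypothetical pre-flight check: flag files whose bytes are not valid UTF-8,
# since those are the ones the plain-text fallback reader cannot decode.
def is_utf8_text(path: Path) -> bool:
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for p in Path("docs_to_ingest").rglob("*"):
    if p.is_file() and not is_utf8_text(p):
        print(f"will fail the UTF-8 fallback: {p}")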

navono commented Jan 10, 2025

@ulnit @yaziciali Hi, can you provide some test data? I've tried Simplified and Traditional Chinese PDFs, and everything works fine.

settings:

server:
  env_name: ${APP_ENV:ollama}

llm:
  mode: ollama
  max_new_tokens: 512
  context_window: 3900
  temperature: 0.1     #The temperature of the model. Increasing the temperature will make the model answer more creatively. A value of 0.1 would be more factual. (Default: 0.1)

embedding:
  mode: ollama

ollama:
  llm_model: llama3.2
  embedding_model: bge-m3
  api_base: http://localhost:11434
  embedding_api_base: http://localhost:11434  # change if your embedding model runs on another ollama
  keep_alive: 5m
  tfs_z: 1.0              # Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.
  top_k: 40               # Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)
  top_p: 0.9              # Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)
  repeat_last_n: 64       # Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
  repeat_penalty: 1.2     # Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)
  request_timeout: 120.0  # Time elapsed until ollama times out the request. Default is 120s. Format is float.

vectorstore:
  database: qdrant

qdrant:
  path: local_data/private_gpt/qdrant

