Takes in OCR text of receipt images, then transforms it to JSON using an LLM.

RecieptsParse/OCR_TO_JSON

Reciept Parse

People are continuously purchasing things from stores. By examining receipts, this project helps uncover the nuances of the types of items that people buy and the types of stores that people frequent.

About

Converts receipts in Optical Character Recognition (OCR) text format into JSON objects that represent the way humans might classify receipts, with the assistance of ChatGPT4. Once the JSON objects were created, K-Nearest Neighbors was used to assist in the classification of vendor and product categories.

The findings are displayed through visualizations on a dashboard, which can be found in this repository.

Development Setup

Initial Setup

  1. Clone the Repository:
    • Clone the repository to your local machine.
  2. Add Text Data:
    • Place your .txt receipt data files in a folder within the cloned repository.
    • For example, you can place the .txt files in the data/txt_data folder.
  3. Add OpenAI API Key:
    • Include your OpenAI API key on line 17 of master.py.

Generating JSON Receipts from Text Files

To convert text receipt files to JSON format, follow these steps:

  1. Navigate to the Repository Directory:
    • Open your terminal and navigate to the cloned repository's directory.
  2. Run the Conversion Script:
    • Execute the master.py script with the command:
      python master.py <path_to_receipt_txt_folder> <desired_path_of_output>
    • path_to_receipt_txt_folder: Path to your receipt text folder. This must be a folder containing the .txt files.
    • desired_path_of_output: Path where you want to save the output JSON file. If the specified file exists, new data will be appended to it. This path must end with .json.
    • The output JSON file at desired_path_of_output will have JSON objects stored by line. This means that every line in the file will have a unique JSON object associated with it. JSON objects will not extend past a single line.
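
To illustrate the one-object-per-line layout described above, here is a minimal Python sketch that appends to and reads back such a file. It is not taken from master.py: the helper names and the receipt fields (merchant, total) are hypothetical and only stand in for whatever the script actually writes.

```python
import json


def append_receipts(path, receipts):
    """Append each receipt dict to the file as a single-line JSON object."""
    # Mode "a" creates the file if it is missing and appends otherwise.
    with open(path, "a", encoding="utf-8") as f:
        for receipt in receipts:
            # json.dumps without indent emits no raw newlines,
            # so each object stays on exactly one line.
            f.write(json.dumps(receipt) + "\n")


def read_receipts(path):
    """Read a file that stores one JSON object per line, skipping blank lines."""
    receipts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                receipts.append(json.loads(line))
    return receipts
```

Reading line by line means a file that has been appended to over several runs can still be loaded without parsing it as a single JSON document.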

Adding Classifications to JSON Receipts

To add classifications to your JSON receipts, follow these steps:

  1. Run the Classification Script:
    • In the same directory, execute the classification.py script using the command:
      python classification.py <path_to_previous_output> <new_output_path>
    • path_to_previous_output: Path to the JSON file generated by master.py.
    • new_output_path: Path for the new JSON file, which will include classifications. The file will be created at this path, which must end with .json.
    • Contrary to the prior output file, the output JSON file at new_output_path will have the JSON objects indented for easier readability. JSON objects will now span multiple lines and will include indents based on the hierarchical level.
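
Because the indented objects span several lines, this file cannot be read line by line. The sketch below shows one way to load it in Python. The README does not say whether the file holds a single JSON array or several objects written one after another, so the hypothetical helper load_json_objects is written to accept either layout; it is an assumption, not part of classification.py.

```python
import json


def load_json_objects(text):
    """Return all JSON objects found in text.

    Accepts either a single top-level array or several
    top-level values written back to back.
    """
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            objects.extend(value)
        else:
            objects.append(value)
    return objects
```

A typical call would read the whole file first, for example load_json_objects(open("classified_json_receipts.json", encoding="utf-8").read()).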

Workflow Example

Here's an example of the entire workflow:

  1. Convert text receipts to JSON:
    python master.py data/txt_data json_receipts.json
  2. Create new JSON file with classifications added:
    python classification.py json_receipts.json classified_json_receipts.json
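
The two commands can also be chained from a small Python wrapper. This is a sketch under stated assumptions: build_commands and run_pipeline are hypothetical helpers that are not part of the repository, and the wrapper is assumed to run from the repository directory so that master.py and classification.py resolve.

```python
import subprocess
import sys


def build_commands(txt_folder, raw_json, classified_json):
    """Build the argument lists for both steps, checking the .json suffix."""
    for path in (raw_json, classified_json):
        if not path.endswith(".json"):
            raise ValueError(f"output path must end with .json: {path}")
    return [
        [sys.executable, "master.py", txt_folder, raw_json],
        [sys.executable, "classification.py", raw_json, classified_json],
    ]


def run_pipeline(txt_folder, raw_json, classified_json):
    """Run both steps in order, stopping if either one fails."""
    for cmd in build_commands(txt_folder, raw_json, classified_json):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("data/txt_data", "json_receipts.json", "classified_json_receipts.json")
```

Validating the suffix before launching anything avoids starting the first step only to have a path rejected later.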

Team

  • Jeremiah Dy
  • Kylie Higashionna
  • Grayson Levy
  • Amanda Nitta

Acknowledgement: Fall 2023 Big Data Analytics Course with Dr. Mahdi Belcaid
