People are continuously purchasing things from stores. This project helps uncover the nuances of the types of items that people buy, and the types of stores that people frequent, by examining receipts.
It converts receipts in Optical Character Recognition (OCR) text format into JSON objects that reflect how humans might classify receipts, with the assistance of ChatGPT4. Once the JSON objects were created, K-Nearest Neighbors was used to assist in classifying vendor and product categories.
The findings are displayed through visualizations on a dashboard, which can be found in this repository.
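To make the K-Nearest Neighbors step more concrete, here is a minimal sketch of the idea: label a new item by majority vote among the most similar already-labeled items. The features, similarity measure, labels, and example data below are all assumptions chosen for illustration; the repository's `classification.py` may represent receipts and measure distance quite differently.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into a set of whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(query, labeled_examples, k=3):
    """Label `query` by majority vote among its k most similar labeled examples."""
    query_tokens = tokenize(query)
    ranked = sorted(
        labeled_examples,
        key=lambda example: jaccard(query_tokens, tokenize(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled items; real categories would come from the project's data.
examples = [
    ("whole milk 1 gal", "grocery"),
    ("organic bananas", "grocery"),
    ("2% milk half gal", "grocery"),
    ("unleaded gasoline", "fuel"),
    ("premium gasoline", "fuel"),
    ("diesel fuel", "fuel"),
]

print(knn_classify("skim milk 1 gal", examples))  # prints "grocery"
```

Token-set overlap is used here only because it needs no extra libraries; any vector representation with a distance function would fit the same voting scheme.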
- Clone the Repository:
  - Clone the repository to your local machine.
- Add Text Data:
  - Place your `.txt` receipt data files in a folder within the cloned repository.
  - For example, you can place the `.txt` files in the `data/txt_data` folder.
- Add OpenAI API Key:
  - Include your OpenAI API key on line 17 of `master.py`.
To convert text receipt files to JSON format, follow these steps:
- Navigate to the Repository Directory:
  - Open your terminal and navigate to the cloned repository's directory.
- Run the Conversion Script:
  - Execute the `master.py` script with the command: `python master.py <path_to_receipt_txt_folder> <desired_path_of_output>`
    - `path_to_receipt_txt_folder`: Path to your receipt text folder. This must be a folder containing the `.txt` files.
    - `desired_path_of_output`: Path where you want to save the output JSON file. If the specified file exists, new data will be appended to it. This path must end with `.json`.
  - The output JSON file at `desired_path_of_output` will have JSON objects stored by line. This means that every line in the file holds its own JSON object, and no JSON object extends past a single line.
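Because the conversion output stores one JSON object per line, it cannot be loaded with a single `json.load` call once it holds more than one receipt. The sketch below shows one way to read such a file line by line. The function name `read_json_lines` is hypothetical and not part of this repository, and nothing is assumed about the fields inside each receipt.

```python
import json


def read_json_lines(path):
    """Return a list with one parsed JSON value per non-empty line of `path`."""
    receipts = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                receipts.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report which line failed so a bad record is easy to find.
                raise ValueError(f"{path}, line {line_number}: invalid JSON") from err
    return receipts
```

For example, `read_json_lines("json_receipts.json")` would return one Python object per receipt if the file was produced as described above.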
To add classifications to your JSON receipts, follow these steps:
- Run the Classification Script:
  - In the same directory, execute the `classification.py` script using the command: `python classification.py <path_to_previous_output> <new_output_path>`
    - `path_to_previous_output`: Path to the JSON file generated by `master.py`.
    - `new_output_path`: Path for the new JSON file which will include classifications. This file will be created at this path, which must end with `.json`.
  - Contrary to the prior output file, the output JSON file at `new_output_path` will have the JSON objects indented for easier readability. JSON objects will now span multiple lines and will include indents based on the hierarchical level.
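Since the classified output is indented and spans multiple lines, a line-by-line reader no longer works for it. The top-level layout of that file is not specified here, so the sketch below makes no assumption about it: it accepts either a single JSON array of receipts or several indented objects written one after another. The function name `parse_json_documents` is hypothetical.

```python
import json


def parse_json_documents(text):
    """Parse every JSON value found in `text`, whether indented or not.

    If the text holds exactly one top-level array, its items are returned;
    otherwise the list of top-level values is returned.
    """
    decoder = json.JSONDecoder()
    documents = []
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while position < len(text) and text[position].isspace():
            position += 1
        if position >= len(text):
            break
        value, position = decoder.raw_decode(text, position)
        documents.append(value)
    if len(documents) == 1 and isinstance(documents[0], list):
        return documents[0]
    return documents
```

A typical call would read the file first, for example `parse_json_documents(open("classified_json_receipts.json", encoding="utf-8").read())`.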
Here's an example of the entire workflow:
- Convert text receipts to JSON:
  - `python master.py data/txt_data json_receipts.json`
- Create new JSON file with classifications added:
  - `python classification.py json_receipts.json classified_json_receipts.json`
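The two commands above can also be chained from a small driver script. This is only a sketch: `build_commands` and `run_pipeline` are hypothetical helpers, not part of this repository, and the sketch assumes it is run from the repository directory so that `master.py` and `classification.py` resolve. It uses `sys.executable` in place of `python` so both steps run under the same interpreter.

```python
import subprocess
import sys


def build_commands(txt_folder, raw_json, classified_json):
    """Build the argument lists for the conversion and classification steps."""
    for path in (raw_json, classified_json):
        # Both scripts require their output path to end with .json.
        if not path.endswith(".json"):
            raise ValueError(f"output path must end with .json: {path}")
    return [
        [sys.executable, "master.py", txt_folder, raw_json],
        [sys.executable, "classification.py", raw_json, classified_json],
    ]


def run_pipeline(txt_folder, raw_json, classified_json):
    """Run both steps in order, stopping at the first one that fails."""
    for command in build_commands(txt_folder, raw_json, classified_json):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("data/txt_data", "json_receipts.json", "classified_json_receipts.json")` would mirror the example workflow. Note that `master.py` appends to an existing output file, so rerunning the pipeline with the same paths adds to the earlier results.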
- Jeremiah Dy
- Kylie Higashionna
- Grayson Levy
- Amanda Nitta
Acknowledgement: Fall 2023 Big Data Analytics Course with Dr. Mahdi Belcaid