Here’s the complete and updated README.md
:
This project provides a script to consolidate the contents of a Git repository into Markdown files, split them into manageable chunks, and save the resulting files in a designated folder. The tool is designed to address challenges posed by the limitations of large language models (LLMs) and other processing systems.
The primary goal of this tool is to enable the seamless integration of full Git repositories into large language models (LLMs) that are constrained by upload, token, or size limitations. Many LLMs cannot access external files, interpret compressed formats like ZIP, TAR, or RAR, or process Git-specific formats such as .bundle
. By converting repositories into fully marked-down, human-readable files—either as a single consolidated document or split into manageable parts—this tool empowers users to overcome these technical barriers.
This approach has a wide range of applications beyond LLM ingestion:
- AI Code Review: Facilitates comprehensive AI analysis of repositories for documentation improvement, bug identification, and code optimization.
- Data Archiving: Serves as a practical solution for archiving and sharing repositories in a format that is universally accessible without specialized tools.
- Collaboration: Provides non-technical team members with an easy way to view repository contents.
- Static Repository Analysis: Prepares repositories for offline or air-gapped environments where file extraction and inspection are critical.
- Code Documentation: Enhances onboarding processes by generating structured, easy-to-navigate summaries of repositories.
- Education and Research: Converts large repositories into study materials, allowing educators and researchers to focus on specific segments without overwhelming detail.
By offering a Markdown-based solution, this tool ensures accessibility, portability, and ease of use, while making repositories compatible with a variety of downstream applications.
-
Consolidate Repository:
- Combines all files in a Git repository into a single Markdown file.
- Organizes content with clear section headers and code blocks.
-
Split Into Chunks:
- Splits the consolidated file into multiple Markdown files, each containing up to 500 lines.
-
Excludes
.git
and Ignored Files:- Skips files and folders specified in
.gitignore
. - Ensures
.git
folders and submodules are fully excluded.
- Skips files and folders specified in
-
trainers/
:- Contains the split Markdown files named sequentially as
consolidated_repository_part_<n>.md
in therepo_dir
- Contains the split Markdown files named sequentially as
-
ai_ingest.md
:- Instructions to give to LLMs to ingest the generated files.
-
Python 3.8+
- Ensure Python is installed and available in your PATH.
-
Virtual Environment (Optional but Recommended):
python3 -m venv venv source venv/bin/activate
-
Required Python Libraries: Install dependencies using pip:
pip install pathspec
-
Clone the Repository:
git clone https://github.com/username/repository.git cd repository
-
Run the Script: Execute the script to consolidate and split repository files:
python consolidate_and_split.py
-
Output:
- A full consolidated file (
consolidated_repository_full.md
) is created in thetrainers/
folder. - The file is split into chunks of 500 lines, saved as
consolidated_repository_part_<n>.md
.
- A full consolidated file (
-
Ingest Instructions: Refer to
injest.md
for detailed instructions on processing the generated files.
If the repository has 1200 lines of content, the script generates:
trainers/consolidated_repository_part_1.md
(lines 1–500)trainers/consolidated_repository_part_2.md
(lines 501–1000)trainers/consolidated_repository_part_3.md
(lines 1001–1200)
Building on the foundational functionality, this tool has the potential to evolve into a more robust and versatile solution. Here are some ideas for future enhancements:
-
Advanced Splitting Options:
- Allow users to customize splitting parameters (e.g., lines, tokens, or file size limits) to accommodate specific LLM constraints.
-
Syntax Highlighting:
- Automatically detect and apply syntax highlighting for code blocks based on file types, enhancing readability.
-
Customizable Exclusion Rules:
- Provide an interface for users to specify additional exclusion patterns beyond
.gitignore
.
- Provide an interface for users to specify additional exclusion patterns beyond
-
Interactive Dashboard:
- Develop a web-based or command-line dashboard to visualize repository structure and generate split files dynamically.
-
Integration with LLM APIs:
- Include features to directly upload the split files to LLMs that support API-based interactions.
-
Content Summarization:
- Integrate summarization tools to create concise overviews of each file or folder in the repository.
-
Preservation of Repository Structure:
- Include options to maintain the directory hierarchy within the Markdown files for better contextual understanding.
-
Metadata Inclusion:
- Add metadata such as file sizes, modification dates, and Git commit information to the output for enhanced insights.
-
Support for Non-Markdown Formats:
- Expand output formats to include JSON, CSV, or plain text, catering to varied use cases and tools.
-
Error Reporting and Debugging:
- Improve error handling to provide detailed logs or reports for files that fail to process.
-
Git Integration:
- Introduce a feature to fetch repositories directly from remote Git URLs and process them without requiring a local clone.
-
Collaborative Sharing:
- Enable easy sharing and commenting on the generated Markdown files for team-based review processes.
By implementing these features, this tool could become a cornerstone for managing, analyzing, and interacting with repositories in innovative ways.
Contributions to improve the script or extend functionality are welcome! Feel free to fork the repository and submit a pull request.
This project is licensed under the MIT License. See the https://opensource.org/license/mit file for details.