Skip to content

This project provides a script to consolidate the contents of a Git repository into Markdown files, split them into manageable chunks, and save the resulting files in a designated folder. The tool is designed to address challenges posed by the limitations of large language models (LLMs) and other processing systems.

Notifications You must be signed in to change notification settings

tmcleod3/repconsplit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Here’s the complete and updated README.md:


Repository Consolidation and Splitting Tool

This project provides a script to consolidate the contents of a Git repository into Markdown files, split them into manageable chunks, and save the resulting files in a designated folder. The tool is designed to address challenges posed by the limitations of large language models (LLMs) and other processing systems.


Purpose of the Tool

The primary goal of this tool is to enable the seamless integration of full Git repositories into large language models (LLMs) that are constrained by upload, token, or size limitations. Many LLMs cannot access external files, interpret compressed formats like ZIP, TAR, or RAR, or process Git-specific formats such as .bundle. By converting repositories into fully marked-down, human-readable files—either as a single consolidated document or split into manageable parts—this tool empowers users to overcome these technical barriers.

This approach has a wide range of applications beyond LLM ingestion:

  • AI Code Review: Facilitates comprehensive AI analysis of repositories for documentation improvement, bug identification, and code optimization.
  • Data Archiving: Serves as a practical solution for archiving and sharing repositories in a format that is universally accessible without specialized tools.
  • Collaboration: Provides non-technical team members with an easy way to view repository contents.
  • Static Repository Analysis: Prepares repositories for offline or air-gapped environments where file extraction and inspection are critical.
  • Code Documentation: Enhances onboarding processes by generating structured, easy-to-navigate summaries of repositories.
  • Education and Research: Converts large repositories into study materials, allowing educators and researchers to focus on specific segments without overwhelming detail.

By offering a Markdown-based solution, this tool ensures accessibility, portability, and ease of use, while making repositories compatible with a variety of downstream applications.


Features

  • Consolidate Repository:

    • Combines all files in a Git repository into a single Markdown file.
    • Organizes content with clear section headers and code blocks.
  • Split Into Chunks:

    • Splits the consolidated file into multiple Markdown files, each containing up to 500 lines.
  • Excludes .git and Ignored Files:

    • Skips files and folders specified in .gitignore.
    • Ensures .git folders and submodules are fully excluded.

Folder Structure

  • trainers/:

    • Contains the split Markdown files named sequentially as consolidated_repository_part_<n>.md in the repo_dir
  • ai_ingest.md:

    • Instructions to give to LLMs to ingest the generated files.

Prerequisites

  1. Python 3.8+

    • Ensure Python is installed and available in your PATH.
  2. Virtual Environment (Optional but Recommended):

    python3 -m venv venv
    source venv/bin/activate
  3. Required Python Libraries: Install dependencies using pip:

    pip install pathspec

Usage

  1. Clone the Repository:

    git clone https://github.com/username/repository.git
    cd repository
  2. Run the Script: Execute the script to consolidate and split repository files:

    python consolidate_and_split.py
  3. Output:

    • A full consolidated file (consolidated_repository_full.md) is created in the trainers/ folder.
    • The file is split into chunks of 500 lines, saved as consolidated_repository_part_<n>.md.
  4. Ingest Instructions: Refer to injest.md for detailed instructions on processing the generated files.


Example Output

If the repository has 1200 lines of content, the script generates:

  • trainers/consolidated_repository_part_1.md (lines 1–500)
  • trainers/consolidated_repository_part_2.md (lines 501–1000)
  • trainers/consolidated_repository_part_3.md (lines 1001–1200)

Future Plans

Building on the foundational functionality, this tool has the potential to evolve into a more robust and versatile solution. Here are some ideas for future enhancements:

  1. Advanced Splitting Options:

    • Allow users to customize splitting parameters (e.g., lines, tokens, or file size limits) to accommodate specific LLM constraints.
  2. Syntax Highlighting:

    • Automatically detect and apply syntax highlighting for code blocks based on file types, enhancing readability.
  3. Customizable Exclusion Rules:

    • Provide an interface for users to specify additional exclusion patterns beyond .gitignore.
  4. Interactive Dashboard:

    • Develop a web-based or command-line dashboard to visualize repository structure and generate split files dynamically.
  5. Integration with LLM APIs:

    • Include features to directly upload the split files to LLMs that support API-based interactions.
  6. Content Summarization:

    • Integrate summarization tools to create concise overviews of each file or folder in the repository.
  7. Preservation of Repository Structure:

    • Include options to maintain the directory hierarchy within the Markdown files for better contextual understanding.
  8. Metadata Inclusion:

    • Add metadata such as file sizes, modification dates, and Git commit information to the output for enhanced insights.
  9. Support for Non-Markdown Formats:

    • Expand output formats to include JSON, CSV, or plain text, catering to varied use cases and tools.
  10. Error Reporting and Debugging:

    • Improve error handling to provide detailed logs or reports for files that fail to process.
  11. Git Integration:

    • Introduce a feature to fetch repositories directly from remote Git URLs and process them without requiring a local clone.
  12. Collaborative Sharing:

    • Enable easy sharing and commenting on the generated Markdown files for team-based review processes.

By implementing these features, this tool could become a cornerstone for managing, analyzing, and interacting with repositories in innovative ways.


Contributing

Contributions to improve the script or extend functionality are welcome! Feel free to fork the repository and submit a pull request.


License

This project is licensed under the MIT License. See the https://opensource.org/license/mit file for details.

About

This project provides a script to consolidate the contents of a Git repository into Markdown files, split them into manageable chunks, and save the resulting files in a designated folder. The tool is designed to address challenges posed by the limitations of large language models (LLMs) and other processing systems.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages