Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GenericLoader.from_filesystem "exclude" not working #13751

Closed
2 of 14 tasks
giancarloerra opened this issue Nov 22, 2023 · 4 comments
Closed
2 of 14 tasks

GenericLoader.from_filesystem "exclude" not working #13751

giancarloerra opened this issue Nov 22, 2023 · 4 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@giancarloerra
Copy link

System Info

Python 3.9.6, Langchain 0.0.334

Who can help?

@eyurtsev

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

I'm experimenting with some simple code to load a local repository to test CodeLlama, but the "exclude" in GenericLoader.from_filesystem seems not working:

`from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import Language

repo_path = "../../my/laravel/project/"

Load

loader = GenericLoader.from_filesystem(
repo_path,
glob="**/*",
suffixes=[".php"],
parser=LanguageParser(
parser_threshold=2000,
),
exclude=["../../my/laravel/project/vendor/", "../../my/laravel/project/node_modules/", "../../my/laravel/project/storage/", "../../my/laravel/project/public/", "../../my/laravel/project/tests/", "../../my/laravel/project/resources/"]
)

documents = loader.load()
len(documents)
`

Am I missing something obvious? I cannot find any example...with or without the exclude, the length of docs is the same (and if I just print "documents" I see files in the folders I excluded).

Expected behavior

I would expect that listing subpaths from the main path then these would be excluded.

Copy link

dosubot bot commented Nov 22, 2023

🤖

Hello @giancarloerra! It's good to see you back. I hope you've been doing well.

Based on the information you've provided and the code you're using, it seems like you're correctly using the "exclude" parameter in the GenericLoader.from_filesystem method. This parameter is designed to exclude files that match the patterns you've specified.

However, it's important to note that the "exclude" parameter uses glob patterns to match the files. This means that the patterns should match the entire path of the files you want to exclude. In your case, you're providing relative paths to the directories you want to exclude. It's possible that these paths are not correctly matching the files you want to exclude.

One thing you could try is to use absolute paths in your "exclude" parameter. This would ensure that the paths you're providing are correctly matching the files you want to exclude. Here's how you could modify your code:

import os

repo_path = "../../my/laravel/project/"

exclude_paths = [
    "vendor/",
    "node_modules/",
    "storage/",
    "public/",
    "tests/",
    "resources/",
]

# Convert the relative exclude paths to absolute paths
exclude_paths = [os.path.join(repo_path, path) for path in exclude_paths]

# Load
loader = GenericLoader.from_filesystem(
    repo_path,
    glob="**/*",
    suffixes=[".php"],
    parser=LanguageParser(
        parser_threshold=2000,
    ),
    exclude=exclude_paths
)

documents = loader.load()
len(documents)

In this code, os.path.join(repo_path, path) is used to convert the relative exclude paths to absolute paths. This should ensure that the paths you're providing in the "exclude" parameter are correctly matching the files you want to exclude.

If this doesn't solve your issue, it would be helpful if you could provide more information about the files that are being included despite being in the directories you're trying to exclude. This could help identify if there's a specific pattern that's not being correctly matched by the "exclude" parameter.

Sources

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Nov 22, 2023
@giancarloerra
Copy link
Author

I'm afraid it doesn't seem to work. I got exclusion of first level in the folders using:

"vendor/**/*"

But it doesn't seem to be possible, no matter what I do, to make it exclude ALL the content in that folder (including all its subfolders). Is the exclude working only for files or group of files?

It's quite common to have local repositories with vendor modules, that need to be excluded from any attempt at loading code for analysis. It's also quite common to have several folders to exclude, recursively.

Maybe is not yet implemented?

@johnlarkin1
Copy link

johnlarkin1 commented Nov 27, 2023

Yeah, so I actually ran into this issue as well. I don't really think it's because of Langchain's implementation, I think it's just the fact that pathlib's Path .match method is a bit of a nightmare.

For example, there's been a myriad of complaints:

I think the current state is that the recursive wildcard isn't available until Python 3.13 it would appear. I don't think langchain is operating under the assumption that we have that installed.

(i.e. see here:
python/cpython@49f90ba#diff-1134e36a94ecfde1df43bee5efd285de5d67426fbca086425201ecc753f9139cR591-R594 )

So yeah, I took a stab at changing this so it's a bit more ergonomic. If you want to pull that down and ensure it's usable for you, you should be able to do that.

#13894

I don't really love it to be honest. However, it should be flexible, still allow true glob syntax similar to the pathlib Path and all of that.

cc: @giancarloerra

@giancarloerra
Copy link
Author

Thank you, that looks very useful! At the moment I solved the problem loading the subfolders I need and so avoiding the ones I don't need...simple and clean to do for a Laravel project where I only need to filter one level. But I appreciate not necesssarily always the case.

Very interesting to see all those links you pasted, something you would think it's very obvious and I'm not surprised generates some confusion also in many others.

Thanks a lot for your research, detailed reply and the PR!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 26, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 13, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)
Projects
None yet
Development

No branches or pull requests

2 participants