GenericLoader.from_filesystem "exclude" not working #13751
Comments
🤖 Hello @giancarloerra! It's good to see you back. I hope you've been doing well.
Based on the information you've provided and the code you're using, it seems like you're using the "exclude" parameter correctly. However, it's important to note that the "exclude" parameter uses glob patterns to match the files. This means that the patterns should match the entire path of the files you want to exclude. In your case, you're providing relative paths to the directories you want to exclude. It's possible that these paths are not correctly matching the files you want to exclude.
One thing you could try is to use absolute paths in your "exclude" parameter. This would ensure that the paths you're providing are correctly matching the files you want to exclude. Here's how you could modify your code:
import os
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
repo_path = "../../my/laravel/project/"
exclude_paths = [
"vendor/",
"node_modules/",
"storage/",
"public/",
"tests/",
"resources/",
]
# Convert the relative exclude paths to absolute paths
# (os.path.join alone would leave them relative to repo_path)
exclude_paths = [os.path.abspath(os.path.join(repo_path, path)) for path in exclude_paths]
# Load
loader = GenericLoader.from_filesystem(
repo_path,
glob="**/*",
suffixes=[".php"],
parser=LanguageParser(
parser_threshold=2000,
),
exclude=exclude_paths
)
documents = loader.load()
len(documents)
If this doesn't solve your issue, it would be helpful if you could provide more information about the files that are being included despite being in the directories you're trying to exclude. This could help identify if there's a specific pattern that's not being correctly matched by the "exclude" parameter.
I'm afraid it doesn't seem to work. I got exclusion of the first level in the folders using: "vendor/**/*" But it doesn't seem to be possible, no matter what I do, to make it exclude ALL the content in that folder (including all its subfolders). Does the exclude work only for files or groups of files? It's quite common to have local repositories with vendor modules that need to be excluded from any attempt at loading code for analysis. It's also quite common to have several folders to exclude, recursively. Maybe it is not yet implemented?
Yeah, so I actually ran into this issue as well. I don't really think it's because of Langchain's implementation, I think it's just the fact that For example, there's been a myriad of complaints:
I think the current state is that the recursive wildcard isn't available until Python 3.13, it would appear. I don't think LangChain is operating under the assumption that we have that installed.
So yeah, I took a stab at changing this so it's a bit more ergonomic. If you want to pull that down and ensure it's usable for you, you should be able to do that. I don't really love it, to be honest. However, it should be flexible, still allow true
cc: @giancarloerra
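For illustration, the behaviour described in this comment can be checked with pathlib alone. This is a minimal sketch, assuming the loader tests each "exclude" pattern with `Path.match` (an assumption about the implementation, not something confirmed in this thread). The paths are made up, and `is_excluded` is a hypothetical helper, not part of LangChain.

```python
from pathlib import PurePosixPath

# PurePath.match() treats "**" like a single "*", and a relative pattern
# is matched against the right-hand end of the path.
one_level = PurePosixPath("project/vendor/pkg/Lib.php").match("vendor/**/*")
two_levels = PurePosixPath("project/vendor/pkg/src/Lib.php").match("vendor/**/*")


def is_excluded(path, root, excluded_dirs):
    """Hypothetical helper: True if `path` lies anywhere below an excluded folder.

    `excluded_dirs` are folder paths relative to `root`, e.g. "vendor/".
    """
    rel_parts = PurePosixPath(path).relative_to(root).parts
    for folder in excluded_dirs:
        folder_parts = PurePosixPath(folder).parts
        # Compare the leading path components instead of using glob matching.
        if rel_parts[:len(folder_parts)] == folder_parts:
            return True
    return False
```

With this pattern only the file exactly one folder below `vendor` is matched, which is consistent with the "first level only" exclusion reported above. Comparing leading path components sidesteps glob semantics entirely, at the cost of not supporting wildcards.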
Thank you, that looks very useful! At the moment I solved the problem by loading only the subfolders I need, and so avoiding the ones I don't need. That is simple and clean to do for a Laravel project where I only need to filter one level, but I appreciate that's not necessarily always the case. Very interesting to see all those links you pasted. It's something you would think is very obvious, and I'm not surprised it generates some confusion in many others too. Thanks a lot for your research, detailed reply and the PR!
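The workaround mentioned in this comment (loading only the wanted subfolders instead of excluding the unwanted ones) could look roughly like the sketch below. It uses only pathlib; `collect_files` and the folder names are hypothetical. With LangChain, the same idea would presumably mean calling `GenericLoader.from_filesystem` once per wanted subfolder and concatenating the results, which is not tested here.

```python
from pathlib import Path


def collect_files(repo_path, include_dirs, suffix=".php"):
    """Return files ending in `suffix` found recursively under the chosen subfolders."""
    root = Path(repo_path)
    found = []
    for name in include_dirs:
        folder = root / name
        if not folder.is_dir():
            continue  # silently skip folders that do not exist
        # rglob descends into all subfolders of the included folder
        found.extend(sorted(p for p in folder.rglob(f"*{suffix}") if p.is_file()))
    return found
```

For example, `collect_files("../../my/laravel/project/", ["app", "routes"])` would never touch `vendor/` or `node_modules/`, because those folders are simply never walked.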
System Info
Python 3.9.6, Langchain 0.0.334
Who can help?
@eyurtsev
Reproduction
I'm experimenting with some simple code to load a local repository to test CodeLlama, but the "exclude" in GenericLoader.from_filesystem does not seem to work:
`from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import Language
repo_path = "../../my/laravel/project/"
# Load
loader = GenericLoader.from_filesystem(
repo_path,
glob="**/*",
suffixes=[".php"],
parser=LanguageParser(
parser_threshold=2000,
),
exclude=["../../my/laravel/project/vendor/", "../../my/laravel/project/node_modules/", "../../my/laravel/project/storage/", "../../my/laravel/project/public/", "../../my/laravel/project/tests/", "../../my/laravel/project/resources/"]
)
documents = loader.load()
len(documents)
`
Am I missing something obvious? I cannot find any example...with or without the exclude, the length of docs is the same (and if I just print "documents" I see files in the folders I excluded).
Expected behavior
I would expect that subpaths of the main path listed in "exclude" would be excluded, together with everything below them.