Code indexer #1050
base: main
Conversation
@@ -286,7 +286,7 @@
"lint-fix": "eslint src --ext ts --fix && npm run lint-fix --prefix webview-ui",
"lint-fix-local": "eslint -c .eslintrc.local.json src --ext ts --fix && npm run lint-fix --prefix webview-ui",
"package": "npm run build:webview && npm run check-types && npm run lint && node esbuild.js --production",
"pretest": "npm run compile && npm run compile:integration",
This was an unnecessary step for just running jest, and we already run it as part of test:integration.
savedKey = process.env.OPENAI_API_KEY
process.env.OPENAI_API_KEY = "fake"

nock.back.fixtures = path.join(__dirname, "..", "__fixtures__")
I don't want to mock anything, so I recorded the OpenAI embeddings API requests using nock.
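For readers unfamiliar with record/replay fixtures: the first run hits the real API and saves the response to a fixture file, and later runs replay the saved file. A minimal stdlib-only sketch of that pattern (this is not nock's implementation; `withFixture` is a hypothetical name):

```typescript
import * as fs from "fs"
import * as path from "path"

// Record/replay fixture cache in the spirit of nock.back: the first call
// invokes the real fetcher and writes the result to a fixture file; any
// later call with the same name replays the saved fixture instead.
function withFixture<T>(fixtureDir: string, name: string, fetcher: () => T): T {
	const file = path.join(fixtureDir, `${name}.json`)
	if (fs.existsSync(file)) {
		return JSON.parse(fs.readFileSync(file, "utf8")) as T // replay
	}
	const result = fetcher() // record
	fs.mkdirSync(fixtureDir, { recursive: true })
	fs.writeFileSync(file, JSON.stringify(result))
	return result
}
```

nock.back does the same thing at the HTTP layer, intercepting requests rather than wrapping a fetcher, which is why the recorded OpenAI responses can be replayed in CI without a real API key.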
public async initialize() {
	this.connection = await connect(this.dbPath)

	const fnCreator = getRegistry().get("openai")
To start we'll only support OpenAI / text-embedding-ada-002, which means everyone will need an API profile with an OpenAI API key. Over time we can add more embedding options, including local options.
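The `getRegistry().get("openai")` call in the diff above suggests a small provider registry that can grow as more embedding options land. A hypothetical sketch of that shape (names and signatures are assumptions, not the PR's actual code):

```typescript
// Hypothetical registry mapping provider names to embedding functions.
// A real "openai" entry would call the embeddings API; the fake entry
// registered below only demonstrates the registration/lookup mechanics.
type EmbeddingFn = (texts: string[]) => Promise<number[][]>

class EmbeddingRegistry {
	private providers = new Map<string, EmbeddingFn>()

	register(name: string, fn: EmbeddingFn): void {
		this.providers.set(name, fn)
	}

	get(name: string): EmbeddingFn {
		const fn = this.providers.get(name)
		if (!fn) {
			throw new Error(`Unknown embedding provider: ${name}`)
		}
		return fn
	}
}

const registry = new EmbeddingRegistry()
registry.register("openai", async (texts) => texts.map(() => [0, 0, 0]))
```

Adding a local option later would then just be a matter of registering another provider under a new name.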
Planning on giving this a spin locally and will take a look at the PR too. However, specifically on the embedding model, should we go with "text-embedding-3-small" to begin with instead? It seems more performant and also cheaper -> https://platform.openai.com/docs/guides/embeddings/embedding-models#embedding-models
I think you could use the free embedding service from NVIDIA, or we don't have to take the RAG route at all, just like repoprompt.com did.
@@ -1,15 +1,11 @@
// npx jest src/services/tree-sitter/__tests__/index.test.ts
This test was mocking too much, and isn't compatible with the latest version of WASM tree-sitter. I updated it appropriately.
@@ -1,118 +1,106 @@
// npx jest src/services/tree-sitter/__tests__/languageParser.test.ts
This test was mocking too much, and isn't compatible with the latest version of WASM tree-sitter. I updated it appropriately.
	}
}

async function loadLanguage(langName: string) {
	return await Parser.Language.load(path.join(__dirname, `tree-sitter-${langName}.wasm`))
	if (process.env.NODE_ENV === "test") {
Inspired by continue.dev; allow tests to load language syntax trees so we don't have to mock.
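A hedged sketch of what that env-gated loading can look like: tests resolve the WASM grammars from a source-tree location so nothing needs to be mocked, while the packaged build uses the bundled directory (both directory parameters here are illustrative, not necessarily the PR's layout):

```typescript
import * as path from "path"

// Pick where tree-sitter WASM grammars are loaded from: tests read them
// from a source-tree location so no mocking is needed, while production
// uses the bundled directory. Directory names are illustrative only.
function wasmDir(bundledDir: string, sourceDir: string): string {
	return process.env.NODE_ENV === "test" ? sourceDir : bundledDir
}

function wasmPath(langName: string, bundledDir: string, sourceDir: string): string {
	return path.join(wasmDir(bundledDir, sourceDir), `tree-sitter-${langName}.wasm`)
}
```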
How do we mitigate code staleness?
My plan is to do something similar to this: https://github.com/continuedev/continue/blob/main/core/indexing/README.md
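Roughly, that approach keys indexed content so unchanged files can be skipped on re-index. A minimal sketch of content-hash-based staleness detection (an assumed shape, not the linked implementation):

```typescript
import * as crypto from "crypto"

// Detect stale index entries by comparing content hashes: a file needs
// re-indexing only when its current hash differs from the hash that was
// stored when it was last indexed.
function contentHash(contents: string): string {
	return crypto.createHash("sha256").update(contents).digest("hex")
}

function filesToReindex(
	current: Map<string, string>, // path -> current file contents
	indexed: Map<string, string>, // path -> hash stored at index time
): string[] {
	const stale: string[] = []
	for (const [file, contents] of current) {
		if (indexed.get(file) !== contentHash(contents)) {
			stale.push(file)
		}
	}
	return stale
}
```

Files that were deleted since the last index would also need their embeddings removed; that side of the bookkeeping is omitted here.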
Is it possible to use this as another tool to make code insertion more precise? I tried a diff insert on a single file with > 3000 lines, and roo-code deleted all the lines instead of inserting between them.
This is a nice feature, please make it happen @mrubens
Description
Using continue.dev as a reference, implement a basic "code indexer", which consists of three components:
The idea is that we'll:
This PR just shows a small piece of this system; namely how to index new code so that it's semantically searchable.
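Conceptually, the chunking, embedding, and searching components fit together as below. This is a toy sketch only: it chunks by fixed line counts and uses a bag-of-letters "embedding" with cosine similarity, whereas the actual PR uses tree-sitter chunking, OpenAI embeddings, and LanceDB.

```typescript
// End-to-end sketch of the indexer pipeline: chunk source by lines,
// embed each chunk with a toy letter-frequency vector, then rank
// chunks against a query by cosine similarity.
function chunkByLines(source: string, linesPerChunk: number): string[] {
	const lines = source.split("\n")
	const chunks: string[] = []
	for (let i = 0; i < lines.length; i += linesPerChunk) {
		chunks.push(lines.slice(i, i + linesPerChunk).join("\n"))
	}
	return chunks
}

function toyEmbed(text: string): number[] {
	const vec = new Array(26).fill(0) // one dimension per letter a-z
	for (const ch of text.toLowerCase()) {
		const i = ch.charCodeAt(0) - 97
		if (i >= 0 && i < 26) vec[i]++
	}
	return vec
}

function cosine(a: number[], b: number[]): number {
	let dot = 0, na = 0, nb = 0
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1)
}

function search(query: string, chunks: string[]): string {
	const q = toyEmbed(query)
	let best = chunks[0]
	let bestScore = -Infinity
	for (const chunk of chunks) {
		const score = cosine(q, toyEmbed(chunk))
		if (score > bestScore) {
			bestScore = score
			best = chunk
		}
	}
	return best
}
```

Swapping `toyEmbed` for a real embedding model and the linear scan for a vector database is what turns this sketch into the semantic search the PR describes.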
Known issues:

- Increases the `.vsix` size significantly (Continue.dev's is about 80mb, and ours will be similar)
- Fixed! The `@lancedb/lancedb` npm package doesn't play nicely with `CommonJS`, so we'll need to update our integration test setup to use a more ESM-friendly configuration, which is going to be a bit annoying.

Type of change
How Has This Been Tested?
Checklist:
Additional context
Related Issues
Reviewers
Important
Implements a code indexer with chunking, embedding, and searching capabilities, adds tests, and updates integration setup for LanceDB.

- `CodeSearch` in `code-search.ts` for indexing and searching code chunks using LanceDB.
- `getChunks()` in `chunker.ts` to parse and chunk code files.
- `supportedLanguages` in `chunker.ts`.
- `chunker.test.ts`, `code-search.test.ts`, and `uri.test.ts` for chunking, indexing, and URI handling.
- `index.test.ts` and `languageParser.test.ts` to remove mock parsers and use real file reading.
- `package.json` to fix integration test setup for ESM compatibility.
- `.vscodeignore` to manage `.vsix` size.
- `readFile()` in `fs.ts` for easier mocking in tests.
- `tsconfig.integration.json` to set `rootDir` to `integration-tests`.

This description was created by for 4587c75. It will automatically update as commits are pushed.