Code indexer #1050

Open
wants to merge 5 commits into
base: main
Conversation

cte
Collaborator

@cte cte commented Feb 18, 2025

Description

Using continue.dev as a reference, implement a basic "code indexer", which consists of three components:

The idea is that we'll:

  1. Listen for workspace update events
  2. Walk the workspace tree
  3. Check for potentially changed files based on modification time and...
  4. Add / remove / update the index based on the status of the modified files
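
As a rough illustration of steps 3 and 4 (this is not code from this PR), the change detection could be sketched as a comparison between the files currently on disk and the modification times recorded at the last indexing run. The names `FileStat`, `IndexPlan`, and `planIndexUpdates` are hypothetical, and the sketch uses modification time as the only signal:

```typescript
// Hypothetical types for illustration; none of these names come from the PR.
type FileStat = { path: string; mtimeMs: number }
type IndexPlan = { add: string[]; update: string[]; remove: string[] }

// Compare the files found while walking the workspace against the
// modification times stored when each file was last indexed.
function planIndexUpdates(current: FileStat[], indexed: Map<string, number>): IndexPlan {
	const plan: IndexPlan = { add: [], update: [], remove: [] }
	const seen = new Set<string>()

	for (const { path, mtimeMs } of current) {
		seen.add(path)
		const lastIndexed = indexed.get(path)
		if (lastIndexed === undefined) {
			plan.add.push(path) // never indexed before
		} else if (mtimeMs > lastIndexed) {
			plan.update.push(path) // modified since the last indexing run
		}
	}

	// Anything in the index that no longer exists on disk gets removed.
	indexed.forEach((_mtime, path) => {
		if (!seen.has(path)) plan.remove.push(path)
	})

	return plan
}

const examplePlan = planIndexUpdates(
	[
		{ path: "src/a.ts", mtimeMs: 200 },
		{ path: "src/b.ts", mtimeMs: 100 },
		{ path: "src/new.ts", mtimeMs: 300 },
	],
	new Map([
		["src/a.ts", 100],
		["src/b.ts", 100],
		["src/gone.ts", 50],
	]),
)
console.log(examplePlan)
```

Keeping the planning step as a pure function would make it easy to test without touching the file system or the vector store.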

This PR just shows a small piece of this system; namely how to index new code so that it's semantically searchable.
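
To make "semantically searchable" concrete, here is a minimal in-memory sketch of the idea: embed each chunk as a vector, embed the query the same way, and rank chunks by cosine similarity. It deliberately swaps out both pieces the PR actually uses: `toyEmbed` is a hypothetical word-hashing stand-in for the OpenAI embeddings API, and a plain array stands in for LanceDB. All names here are assumptions for illustration only.

```typescript
// Hypothetical types and helpers; not the CodeSearch implementation from this PR.
type Chunk = { file: string; text: string }
type IndexedChunk = Chunk & { vector: number[] }

const DIMENSIONS = 64

// Toy stand-in for an embedding model: hash each word into one of 64 buckets.
// A real embedding captures meaning; this only captures shared words.
function toyEmbed(text: string): number[] {
	const vector = new Array<number>(DIMENSIONS).fill(0)
	for (const word of text.toLowerCase().split(/[^a-z0-9]+/)) {
		if (word.length === 0) continue
		let hash = 0
		for (let i = 0; i < word.length; i++) {
			hash = (hash * 31 + word.charCodeAt(i)) % DIMENSIONS
		}
		vector[hash] += 1
	}
	return vector
}

function cosineSimilarity(a: number[], b: number[]): number {
	let dot = 0
	let normA = 0
	let normB = 0
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// "Indexing" here just means storing each chunk next to its vector.
function indexChunks(chunks: Chunk[]): IndexedChunk[] {
	return chunks.map((chunk) => ({ ...chunk, vector: toyEmbed(chunk.text) }))
}

// Rank all chunks by similarity to the query and keep the best `limit`.
function search(index: IndexedChunk[], query: string, limit = 3): IndexedChunk[] {
	const queryVector = toyEmbed(query)
	return [...index]
		.sort((x, y) => cosineSimilarity(y.vector, queryVector) - cosineSimilarity(x.vector, queryVector))
		.slice(0, limit)
}

const exampleIndex = indexChunks([
	{ file: "server.ts", text: "class HttpServer { listen(port: number) {} }" },
	{ file: "json.ts", text: "function parseJson(input: string) { return JSON.parse(input) }" },
])
console.log(search(exampleIndex, "parse json input", 1).map((chunk) => chunk.file))
```

A vector database replaces the linear scan in `search` with an approximate nearest-neighbor lookup, which is what makes this practical for a whole workspace.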

Known issues:

  • The LanceDB binary is going to increase our .vsix size significantly (Continue.dev's is about 80 MB, and ours will be similar)
  • The @lancedb/lancedb npm package doesn't play nicely with CommonJS, so we'll need to update our integration test setup to use a more ESM-friendly configuration, which is going to be a bit annoying. Fixed!

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Checklist:

  • My code follows the patterns of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

Additional context

Related Issues

Reviewers


Important

Implements a code indexer with chunking, embedding, and searching capabilities, adds tests, and updates integration setup for LanceDB.

  • Code Indexer:
    • Implements CodeSearch in code-search.ts for indexing and searching code chunks using LanceDB.
    • Adds getChunks() in chunker.ts to parse and chunk code files.
    • Supports multiple languages via supportedLanguages in chunker.ts.
  • Testing:
    • Adds tests in chunker.test.ts, code-search.test.ts, and uri.test.ts for chunking, indexing, and URI handling.
    • Updates index.test.ts and languageParser.test.ts to remove mock parsers and use real file reading.
  • Integration:
    • Updates package.json to fix integration test setup for ESM compatibility.
    • Adds LanceDB to .vscodeignore to manage .vsix size.
  • Utilities:
    • Adds readFile() in fs.ts for easier mocking in tests.
    • Updates tsconfig.integration.json to set rootDir to integration-tests.

This description was created by Ellipsis for 4587c75. It will automatically update as commits are pushed.

changeset-bot bot commented Feb 18, 2025

⚠️ No Changeset found

Latest commit: 4587c75

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@@ -286,7 +286,7 @@
"lint-fix": "eslint src --ext ts --fix && npm run lint-fix --prefix webview-ui",
"lint-fix-local": "eslint -c .eslintrc.local.json src --ext ts --fix && npm run lint-fix --prefix webview-ui",
"package": "npm run build:webview && npm run check-types && npm run lint && node esbuild.js --production",
"pretest": "npm run compile && npm run compile:integration",
Collaborator Author

This was an unnecessary step for just running jest, and we already run it as part of test:integration.

savedKey = process.env.OPENAI_API_KEY
process.env.OPENAI_API_KEY = "fake"

nock.back.fixtures = path.join(__dirname, "..", "__fixtures__")
Collaborator Author

I don't want to mock anything, so I recorded the OpenAI embeddings API requests using nock.

public async initialize() {
this.connection = await connect(this.dbPath)

const fnCreator = getRegistry().get("openai")
Collaborator Author

To start we'll only support OpenAI / text-embedding-ada-002, which means everyone will need an API profile with an OpenAI API key. Over time we can add more embedding options, including local options.

Planning on giving this a spin locally, and I'll take a look at the PR too. On the embedding model specifically: should we go with "text-embedding-3-small" to begin with? It seems more performant and also cheaper: https://platform.openai.com/docs/guides/embeddings/embedding-models#embedding-models

I think you could use the free embedding service from NVIDIA. Alternatively, we don't have to take the RAG route at all, just like repoprompt.com did.

@@ -1,15 +1,11 @@
// npx jest src/services/tree-sitter/__tests__/index.test.ts
Collaborator Author

This test was mocking too much, and isn't compatible with the latest version of WASM tree-sitter. I updated it appropriately.

@@ -1,118 +1,106 @@
// npx jest src/services/tree-sitter/__tests__/languageParser.test.ts
Collaborator Author

This test was mocking too much, and isn't compatible with the latest version of WASM tree-sitter. I updated it appropriately.

}
}

async function loadLanguage(langName: string) {
return await Parser.Language.load(path.join(__dirname, `tree-sitter-${langName}.wasm`))
if (process.env.NODE_ENV === "test") {
Collaborator Author

Inspired by continue.dev: allow tests to load the real language syntax trees so we don't have to mock them.
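
The body of the `NODE_ENV === "test"` branch is not shown in this excerpt, so the following is only a sketch of one way such a switch could work, not the code from this PR. It assumes the grammars live in one directory for the packaged extension and another when running under jest; `resolveGrammarPath`, `distDir`, and `testGrammarDir` are hypothetical names.

```typescript
import * as path from "path"

// Hypothetical options: where the .wasm grammars live in each environment.
type GrammarLocations = {
	nodeEnv?: string
	distDir: string // assumed location in the bundled extension
	testGrammarDir: string // assumed location when running tests
}

// Pick the directory based on the environment, then build the file name
// using the same `tree-sitter-${langName}.wasm` pattern as loadLanguage().
function resolveGrammarPath(langName: string, locations: GrammarLocations): string {
	const baseDir = locations.nodeEnv === "test" ? locations.testGrammarDir : locations.distDir
	return path.join(baseDir, `tree-sitter-${langName}.wasm`)
}

console.log(
	resolveGrammarPath("python", {
		nodeEnv: process.env.NODE_ENV,
		distDir: "dist",
		testGrammarDir: "grammars",
	}),
)
```

The resulting path would then be passed to `Parser.Language.load(...)` as in the existing code.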

@angginurfasilah321

How do we mitigate code staleness?

@cte
Collaborator Author

cte commented Feb 18, 2025

> How do we mitigate code staleness?

My plan is to do something similar to this: https://github.com/continuedev/continue/blob/main/core/indexing/README.md

@angginurfasilah321

> > How do we mitigate code staleness?
>
> My plan is to do something similar to this: https://github.com/continuedev/continue/blob/main/core/indexing/README.md

Is it possible to use this as another tool to make code insertion more precise? I tried a diff insert on a single file with more than 3000 lines, and roo-code deleted all the lines instead of inserting between lines.

@wwicak

wwicak commented Feb 22, 2025

This is a nice feature, please make it happen @mrubens

4 participants