Code indexer #1050

cte · 2025-02-18T09:32:00Z

Description

Using continue.dev as a reference, implement a basic "code indexer", which consists of three components:

Code chunker (Tree Sitter)
Embeddings (OpenAI for now)
Vector database (LanceDB)

The idea is that we'll:

Listen for workspace update events
Walk the workspace tree
Check for potentially changed files based on modification time and...
Add / remove / update the index based on the status of the modified files
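
The refresh steps above can be sketched as a pure diff between the last indexed snapshot and a fresh walk of the workspace. This is only an illustration under stated assumptions: `Snapshot`, `IndexPlan`, and `planIndexUpdate` are hypothetical names, and modification time is the only signal used because the list leaves its second criterion open.

```typescript
// Hypothetical sketch, not code from this PR.
// A snapshot maps a workspace-relative path to its last modification time (ms).
type Snapshot = Record<string, number>

interface IndexPlan {
	add: string[] // present now, never indexed
	update: string[] // indexed, but the modification time differs
	remove: string[] // indexed, but no longer in the workspace
}

function planIndexUpdate(indexedFiles: Snapshot, currentFiles: Snapshot): IndexPlan {
	const plan: IndexPlan = { add: [], update: [], remove: [] }

	for (const file of Object.keys(currentFiles)) {
		const before = indexedFiles[file]
		const now = currentFiles[file]
		if (before === undefined) {
			plan.add.push(file)
		} else if (now !== before) {
			// Any difference counts: a checkout can also move mtime backwards.
			plan.update.push(file)
		}
	}

	for (const file of Object.keys(indexedFiles)) {
		if (!(file in currentFiles)) {
			plan.remove.push(file)
		}
	}

	return plan
}

const indexedFiles: Snapshot = { "a.ts": 100, "b.ts": 100, "c.ts": 100 }
const currentFiles: Snapshot = { "a.ts": 100, "b.ts": 250, "d.ts": 300 }
const plan = planIndexUpdate(indexedFiles, currentFiles)
console.log(plan)
```

Keeping the diff separate from the file system walk and from the database writes would make this step easy to unit test without touching disk.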

This PR just shows a small piece of this system; namely how to index new code so that it's semantically searchable.
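
To make "semantically searchable" concrete, here is a minimal in-memory sketch of the lookup step: each chunk is stored with an embedding vector, and a query is answered by ranking chunks by similarity to the query vector. The hand-written 3-dimensional vectors stand in for real OpenAI embeddings, the array stands in for a LanceDB table, and the names (`IndexedChunk`, `cosineSimilarity`, `search`) are illustrative assumptions rather than the PR's `CodeSearch` API. Cosine similarity is one common choice of metric; the metric the PR's table actually uses isn't stated here.

```typescript
// Hypothetical sketch: toy vectors instead of OpenAI embeddings, an array instead of LanceDB.
interface IndexedChunk {
	file: string
	text: string
	vector: number[]
}

function cosineSimilarity(a: number[], b: number[]): number {
	if (a.length !== b.length) {
		throw new Error("vectors must have the same dimension")
	}
	let dot = 0
	let normA = 0
	let normB = 0
	for (let i = 0; i < a.length; i++) {
		const x = a[i] ?? 0
		const y = b[i] ?? 0
		dot += x * y
		normA += x * x
		normB += y * y
	}
	const denominator = Math.sqrt(normA) * Math.sqrt(normB)
	return denominator === 0 ? 0 : dot / denominator
}

// Returns the `limit` chunks most similar to the query vector, best first.
function search(chunks: IndexedChunk[], queryVector: number[], limit: number): IndexedChunk[] {
	return chunks
		.map((chunk) => ({ chunk, score: cosineSimilarity(chunk.vector, queryVector) }))
		.sort((left, right) => right.score - left.score)
		.slice(0, limit)
		.map((scored) => scored.chunk)
}

const chunks: IndexedChunk[] = [
	{ file: "auth.ts", text: "function login(user, password) {}", vector: [0.9, 0.1, 0.0] },
	{ file: "math.ts", text: "function add(a, b) {}", vector: [0.0, 0.2, 0.9] },
	{ file: "session.ts", text: "function logout(session) {}", vector: [0.7, 0.6, 0.1] },
]

// Pretend [1, 0, 0] is the embedding of a query such as "sign in".
const results = search(chunks, [1, 0, 0], 2)
console.log(results.map((chunk) => chunk.file))
```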

Known issues:

The LanceDB binary is going to increase our .vsix size significantly (Continue.dev's is about 80 MB, and ours will be similar)
The @lancedb/lancedb npm package doesn't play nicely with CommonJS, so we'll need to update our integration test setup to use a more ESM-friendly configuration, which is going to be a bit annoying. Fixed!

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Checklist:

My code follows the patterns of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation

Additional context

Related Issues

Reviewers

Important

Implements a code indexer with chunking, embedding, and searching capabilities, adds tests, and updates integration setup for LanceDB.

Code Indexer:
- Implements CodeSearch in code-search.ts for indexing and searching code chunks using LanceDB.
- Adds getChunks() in chunker.ts to parse and chunk code files.
- Supports multiple languages via supportedLanguages in chunker.ts.
Testing:
- Adds tests in chunker.test.ts, code-search.test.ts, and uri.test.ts for chunking, indexing, and URI handling.
- Updates index.test.ts and languageParser.test.ts to remove mock parsers and use real file reading.
Integration:
- Updates package.json to fix integration test setup for ESM compatibility.
- Adds LanceDB to .vscodeignore to manage .vsix size.
Utilities:
- Adds readFile() in fs.ts for easier mocking in tests.
- Updates tsconfig.integration.json to set rootDir to integration-tests.
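
As a rough illustration of syntax-aware chunking (the general idea behind a Tree Sitter based chunker), the sketch below walks a toy syntax tree and emits a node whole when it fits a size budget, descending into its children otherwise. `ToyNode`, `chunkNode`, and the character budget are assumptions made for this example; they are not the PR's `getChunks()` implementation or Tree Sitter's API.

```typescript
// Hypothetical sketch: a toy tree stands in for Tree Sitter syntax nodes.
interface ToyNode {
	type: string
	text: string
	children: ToyNode[]
}

function chunkNode(node: ToyNode, maxChars: number): string[] {
	// A node that fits the budget becomes a single chunk.
	if (node.text.length <= maxChars) {
		return [node.text]
	}

	// An oversized leaf has no structure left to exploit: fall back to fixed-size slices.
	if (node.children.length === 0) {
		const slices: string[] = []
		for (let start = 0; start < node.text.length; start += maxChars) {
			slices.push(node.text.slice(start, start + maxChars))
		}
		return slices
	}

	// Otherwise descend. Text owned by the parent but by no child
	// (separators, for example) is dropped in this simplified version.
	const collected: string[] = []
	for (const child of node.children) {
		collected.push(...chunkNode(child, maxChars))
	}
	return collected
}

const leaf = (type: string, text: string): ToyNode => ({ type, text, children: [] })
const fnA = leaf("function_declaration", "function a() { return 1 }")
const fnB = leaf("function_declaration", "function b() { return 2 }")
const program: ToyNode = { type: "program", text: fnA.text + "\n" + fnB.text, children: [fnA, fnB] }

// The whole program is too big for a 30-character budget, so each function becomes its own chunk.
const pieces = chunkNode(program, 30)
console.log(pieces)
```

Splitting along syntax boundaries keeps each chunk a meaningful unit (a function, a class), which tends to embed better than arbitrary line windows.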

This description was generated by a bot for 4587c75. It will automatically update as commits are pushed.

changeset-bot · 2025-02-18T09:32:04Z

⚠️ No Changeset found

Latest commit: 4587c75

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

cte · 2025-02-18T09:32:54Z

package.json

@@ -286,7 +286,7 @@
 		"lint-fix": "eslint src --ext ts --fix && npm run lint-fix --prefix webview-ui",
 		"lint-fix-local": "eslint -c .eslintrc.local.json src --ext ts --fix && npm run lint-fix --prefix webview-ui",
 		"package": "npm run build:webview && npm run check-types && npm run lint && node esbuild.js --production",
-		"pretest": "npm run compile && npm run compile:integration",


This was an unnecessary step for just running jest, and we already run it as part of test:integration.

cte · 2025-02-18T09:34:23Z

src/services/code-indexer/__tests__/code-search.test.ts

+		savedKey = process.env.OPENAI_API_KEY
+		process.env.OPENAI_API_KEY = "fake"
+
+		nock.back.fixtures = path.join(__dirname, "..", "__fixtures__")


I don't want to mock anything, so I recorded the OpenAI embeddings API requests using nock.
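
The record-and-replay idea can be shown without any HTTP library. The sketch below is a stand-in, not nock: it wraps a synchronous fake embedder so that the first call writes its result to a JSON fixture file and later calls read from that file instead of calling the live function. nock applies the same idea at the HTTP layer, and a real embeddings call would be asynchronous. `withFixture`, `Embed`, and the fixture layout are hypothetical.

```typescript
import * as fs from "fs"
import * as os from "os"
import * as path from "path"

// Hypothetical sketch of record/replay; nock does this for HTTP requests instead.
type Embed = (input: string) => number[]

// Wraps `live` so each distinct input reaches it at most once.
// Results are persisted to `fixturePath` and replayed on later runs.
function withFixture(fixturePath: string, live: Embed): Embed {
	return (input) => {
		const recorded: Record<string, number[]> = fs.existsSync(fixturePath)
			? JSON.parse(fs.readFileSync(fixturePath, "utf8"))
			: {}

		if (Object.prototype.hasOwnProperty.call(recorded, input)) {
			const replayed = recorded[input]
			if (replayed !== undefined) {
				return replayed
			}
		}

		const fresh = live(input)
		recorded[input] = fresh
		fs.writeFileSync(fixturePath, JSON.stringify(recorded, null, 2))
		return fresh
	}
}

const fixtureDir = fs.mkdtempSync(path.join(os.tmpdir(), "fixtures-"))
const fixturePath = path.join(fixtureDir, "embeddings.json")

let liveCalls = 0
const fakeLive: Embed = (input) => {
	liveCalls++
	return [input.length, 0, 1]
}

// First run records; the second run replays even though its live function would throw.
const firstRun = withFixture(fixturePath, fakeLive)("hello")
const secondRun = withFixture(fixturePath, () => {
	throw new Error("network disabled")
})("hello")
console.log(firstRun, secondRun, liveCalls)
```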

cte · 2025-02-18T09:36:21Z

src/services/code-indexer/code-search.ts

+	public async initialize() {
+		this.connection = await connect(this.dbPath)
+
+		const fnCreator = getRegistry().get("openai")


To start we'll only support OpenAI / text-embedding-ada-002, which means everyone will need an API profile with an OpenAI API key. Over time we can add more embedding options, including local options.

Planning on giving this a spin locally, and I'll take a look at the PR too. On the embedding model specifically, though: should we go with "text-embedding-3-small" to begin with? It seems more performant and also cheaper: https://platform.openai.com/docs/guides/embeddings/embedding-models#embedding-models

I think you could use the free embedding service from NVIDIA. Or we don't have to take the RAG route at all, just like repoprompt.com did.

src/services/code-indexer/chunker.ts

src/services/tree-sitter/index.ts

cte · 2025-02-18T09:37:37Z

src/services/tree-sitter/__tests__/index.test.ts

@@ -1,15 +1,11 @@
+// npx jest src/services/tree-sitter/__tests__/index.test.ts


This test was mocking too much, and isn't compatible with the latest version of WASM tree-sitter. I updated it appropriately.

cte · 2025-02-18T09:37:43Z

src/services/tree-sitter/__tests__/languageParser.test.ts

@@ -1,118 +1,106 @@
+// npx jest src/services/tree-sitter/__tests__/languageParser.test.ts


This test was mocking too much, and isn't compatible with the latest version of WASM tree-sitter. I updated it appropriately.

cte · 2025-02-18T09:39:21Z

src/services/tree-sitter/languageParser.ts

 	}
 }

 async function loadLanguage(langName: string) {
-	return await Parser.Language.load(path.join(__dirname, `tree-sitter-${langName}.wasm`))
+	if (process.env.NODE_ENV === "test") {


Inspired by continue.dev; allow tests to load language syntax trees so we don't have to mock.
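
The diff above is truncated, so the body of the test branch isn't visible. As a hedged sketch of the general pattern, a helper can resolve the grammar path differently when `NODE_ENV` is `"test"`, so Jest loads real `.wasm` files from a known location instead of the bundled output directory. The helper name `resolveGrammarPath` and both directory arguments are hypothetical; only the `tree-sitter-${langName}.wasm` file name comes from the original code.

```typescript
import * as path from "path"

// Hypothetical helper: picks the directory to load a grammar from based on the environment.
function resolveGrammarPath(
	langName: string,
	nodeEnv: string | undefined,
	bundleDir: string, // where the packaged extension keeps its .wasm files (assumed)
	testGrammarDir: string, // where tests can find the same files on disk (assumed)
): string {
	const fileName = `tree-sitter-${langName}.wasm`
	const baseDir = nodeEnv === "test" ? testGrammarDir : bundleDir
	return path.join(baseDir, fileName)
}

const inTest = resolveGrammarPath("typescript", "test", "/ext/dist", "/repo/grammars")
const inProduction = resolveGrammarPath("typescript", "production", "/ext/dist", "/repo/grammars")
console.log(inTest, inProduction)
```

Passing the environment and directories in as arguments, rather than reading `process.env` and `__dirname` inside the helper, keeps the branch itself testable.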

angginurfasilah321 · 2025-02-18T12:30:51Z

How do we mitigate code staleness?

cte · 2025-02-18T17:38:36Z

How do we mitigate code staleness?

My plan is to do something similar to this: https://github.com/continuedev/continue/blob/main/core/indexing/README.md

angginurfasilah321 · 2025-02-18T22:57:27Z

How do we mitigate code staleness?

My plan is to do something similar to this: https://github.com/continuedev/continue/blob/main/core/indexing/README.md

Is it possible to use this as another tool to make code insertion more precise? I tried a diff insert on a single file with more than 3000 lines, and roo-code deleted all the lines instead of inserting between lines.

wwicak · 2025-02-22T06:40:50Z

This is a nice feature, please make it happen @mrubens

Code indexer

bfb434a

cte requested review from stea9499, ColemanRoo and mrubens as code owners February 18, 2025 09:32

ellipsis-dev bot reviewed Feb 18, 2025

src/services/code-indexer/chunker.ts (outdated, resolved)

src/services/tree-sitter/index.ts (outdated, resolved)

Remove debugging

5e51fbb

cte added 3 commits February 18, 2025 01:45

Update fixtures

df31b8c

Fix integration tests

32b5b08

PR feedback

4587c75

lupuletic Feb 27, 2025

wwicak Mar 1, 2025
