
Issue with Loading Data into Collection and HuggingFace Model #1132

Open · 1 of 5 tasks

AsTeriaa09 opened this issue Jan 2, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@AsTeriaa09

System Info

Platform: HP Laptop 15s-du1xxx
Node Version: v20.18.0

huggingface & others:
"@datastax/astra-db-ts": "^1.5.0",
"@huggingface/transformers": "^3.2.4",
"@xenova/transformers": "^2.17.2",
"langchain": "^0.3.8",

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

The collection is created successfully using the @datastax/astra-db-ts library, but, as the error shows, the embedding/chunk size is reported as too large, even though I kept the description minimal and it shouldn't be causing the issue.

I am using the following HuggingFace model:

// const hugModel = await pipeline("feature-extraction","sentence-transformers/all-MiniLM-L6-v2");
const hugModel = await pipeline("feature-extraction","Xenova/bge-large-en-v1.5");
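
For reference, a minimal sketch of how the extractor is called and what it returns, assuming the standard @xenova/transformers feature-extraction pipeline (the 1024 in the error below matches the 1024-dimensional embeddings of bge-large-en-v1.5; the input text and variable names here are only illustrative):

import { pipeline } from "@xenova/transformers";

const extractor = await pipeline("feature-extraction", "Xenova/bge-large-en-v1.5");
const output = await extractor("a short test sentence", { pooling: "mean", normalize: true });

console.log(output.dims);        // e.g. [1, 1024] -- one pooled 1024-dimensional embedding
console.log(output.data.length); // 1024 float values stored in a typed array under 'data'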

Reproduction

Expected Behavior:

  • Data from sample-data.json should load into the collection without errors.
  • The HuggingFace model should generate embeddings for text chunks successfully.

Observed Behavior:

  • Data exceeds the maximum allowed size even after ensuring it is short.
  • Fetching chunks fails and the embeddings come back undefined.

Code Snippet:

import {DataAPIClient} from "@datastax/astra-db-ts"
import {RecursiveCharacterTextSplitter} from "langchain/text_splitter"
import { pipeline } from "@xenova/transformers";
import 'dotenv/config'
// import { pipeline } from '@huggingface/transformers';
import sampleData from './sample-data.json' with { type: "json" };
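
// NOTE: db, hugModel and createCollection are defined elsewhere in the script (not shown here).
// Assumed shape of that setup, based on the astra-db-ts docs and the model mentioned above;
// the environment variable names are only illustrative:
// const client = new DataAPIClient(process.env.ASTRA_DB_APPLICATION_TOKEN);
// const db = client.db(process.env.ASTRA_DB_API_ENDPOINT);
// const hugModel = await pipeline("feature-extraction", "Xenova/bge-large-en-v1.5");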

const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 300,
    chunkOverlap: 50,
});

const loadData = async () => {
    const collection = await db.collection("portfolio");
    for await (const { id, info, description } of sampleData) {
        const chunks = await splitter.splitText(description);
        // iterate through chunks
        for await (const chunk of chunks) {
            const embedding = await hugModel(chunk, { pooling: "mean", normalize: true });
            // console.log("Chunk :", chunk);

            try {
                const res = await collection.insertOne({
                    document_id: id,
                    $vector: embedding[0],
                    info,
                    description: chunk,
                });
                console.log("data added successfully!", res);
            } catch (err) {
                console.error(`Failed to insert chunk: ${chunk}`, err);
            }
        }
    }
};

createCollection().then(() => loadData());

Error Message:

Failed to insert chunk: name DataAPIResponseError: Document size limitation violated: number of properties an indexable Object (property 'data') has (1024) exceeds maximum allowed (1000)
errorDescriptors: [
    {
      errorCode: 'SHRED_DOC_LIMIT_VIOLATION',
      message: "Document size limitation violated: number of properties an indexable Object (property 'data') has (1024) exceeds maximum allowed (1000)",
      attributes: [Object]
    }
  ],
  detailedErrorDescriptors: [
    {
      errorDescriptors: [Array],
      command: [Object],
      rawResponse: [Object]
    }
  ]
}
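
The 'data' property named in the error has 1024 entries, which is the same shape as the Tensor's underlying typed array, so the whole Tensor object may be what ends up in $vector. A minimal sketch of flattening the output into a plain number array before insert, assuming the transformers.js Tensor API (this is a guess at the cause, not a confirmed fix):

const embedding = await hugModel(chunk, { pooling: "mean", normalize: true });
const vector = Array.from(embedding.data); // plain number[] of length 1024 for bge-large-en-v1.5

const res = await collection.insertOne({
  document_id: id,
  $vector: vector,
  info,
  description: chunk,
});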


AsTeriaa09 added the bug label on Jan 2, 2025