Indexing Pipelines and Arrow datasets #6491
Unanswered
demongolem-biz2 asked this question in Questions
Replies: 1 comment
-
Hi @demongolem-biz2 Unfortunately we don't have a component in Haystack that you can readily use to read in an Arrow dataset. What you could try as a workaround is to not use a pipeline but call the PreProcessor directly:

```python
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
# doc1, doc2: Document objects built from the rows of the Arrow dataset
preprocessed_docs = preprocessor.process([doc1, doc2])
```

In Haystack 2.0, we will make it a lot easier to create custom components, and then you will also find it easier to build your custom indexing pipeline. The beta version of Haystack 2.0 is being tested out as we speak. 🙂
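To round out this workaround, the preprocessed documents can then be written to the store directly, still without a pipeline. A one-line sketch, assuming a `document_store` (for example an `InMemoryDocumentStore`) has already been initialized:

```python
# Hypothetical final step: persist the cleaned, split documents directly.
document_store.write_documents(preprocessed_docs)
```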
-
For testing purposes I have code that loads an Arrow dataset into a DocumentStore. This works perfectly well; nothing is wrong with the operation itself (other than perhaps the choice of DocumentStore, which is a performance concern rather than a correctness one).
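A minimal sketch of what such a loading step might look like, assuming the Hugging Face `datasets` library and Haystack 1.x with an `InMemoryDocumentStore`; the file path and column name here are illustrative:

```python
# Illustrative sketch: load an Arrow-backed dataset with Hugging Face
# `datasets` and write each row to a Haystack DocumentStore.
from datasets import Dataset
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore

dataset = Dataset.from_file("my_dataset.arrow")  # hypothetical path
document_store = InMemoryDocumentStore()

docs = [
    Document(content=row["text"])  # "text" is an assumed column name
    for row in dataset
]
document_store.write_documents(docs)
```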
My question is: can the above be turned into an indexing pipeline? The first element is often a FileConverter, and I have read in PDF and TXT files before with the respective pipeline components. But is there a first element for an Arrow dataset? I would like to use an indexing pipeline because I want to put a PreProcessor component in front of the write to the document store, and I don't see how to use a PreProcessor outside an indexing pipeline.
It looks like, right now, all the splitting and cleaning would have to be done before the dataset was created and saved. However, I am in a spot where the entries are too large and need a little more cleaning, so I want to do this preprocessing at indexing time (roughly along the lines of the sketch below).
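For concreteness, a hypothetical sketch of the kind of indexing pipeline meant here, assuming Haystack 1.x and that `docs` and `document_store` have already been set up as in the loading sketch above (there is no ready-made first node that reads Arrow directly):

```python
# Hypothetical indexing pipeline: PreProcessor -> DocumentStore.
from haystack import Pipeline
from haystack.nodes import PreProcessor

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(
    component=PreProcessor(split_by="word", split_length=100),
    name="PreProcessor",
    inputs=["File"],  # root node name for indexing pipelines
)
indexing_pipeline.add_node(
    component=document_store, name="DocumentStore", inputs=["PreProcessor"]
)

# With no Arrow-reading node available, the documents are passed in directly.
indexing_pipeline.run(documents=docs)
```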