Indexing Pipelines and Arrow datasets #6491
Unanswered
demongolem-biz2 asked this question in Questions
Replies: 1 comment
-
Hi @demongolem-biz2 Unfortunately we don't have a component in Haystack that you can readily use to read in an Arrow dataset. What you could try as a workaround is to not use a pipeline but call the PreProcessor directly:

```python
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
# doc1, doc2: Document objects built from the rows of the Arrow dataset
preprocessed_docs = preprocessor.process([doc1, doc2])
```

In Haystack 2.0, we will make it a lot easier to create custom components, and then you will also find it easier to build your custom indexing pipeline. The beta version of Haystack 2.0 is being tested out as we speak. 🙂
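To round out this workaround, the preprocessed documents can then be written to the store directly, still without a pipeline. A one-line sketch, assuming a `document_store` (for example an `InMemoryDocumentStore`) has already been initialized:

```python
# Hypothetical final step: persist the cleaned, split documents directly.
document_store.write_documents(preprocessed_docs)
```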
-
For testing purposes I have code that loads an Arrow dataset into a DocumentStore. This works perfectly well; nothing is wrong with the operation itself (other than perhaps the choice of DocumentStore, which is a performance concern rather than a correctness one).
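A minimal sketch of what such a loading step might look like, assuming the Hugging Face `datasets` library and Haystack 1.x with an `InMemoryDocumentStore`; the file path and column name here are illustrative:

```python
# Illustrative sketch: load an Arrow-backed dataset with Hugging Face
# `datasets` and write each row to a Haystack DocumentStore.
from datasets import Dataset
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore

dataset = Dataset.from_file("my_dataset.arrow")  # hypothetical path
document_store = InMemoryDocumentStore()

docs = [
    Document(content=row["text"])  # "text" is an assumed column name
    for row in dataset
]
document_store.write_documents(docs)
```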
My question is: can the above be turned into an indexing pipeline? The first element is often a FileConverter, and I have read in PDF and TXT files before with the respective pipeline components. But is there a first element for an Arrow dataset? I would like to use an indexing pipeline because I want to put a PreProcessor component in front of the write to the document store, and I don't see how to use a PreProcessor outside an indexing pipeline.
It looks like, right now, all the splitting and cleaning would have to be done before the dataset was created and saved. However, I am in a spot where the entries are too large and need a little more cleaning, so I want to do this preprocessing at indexing time (roughly along the lines of the sketch below).
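For concreteness, a hypothetical sketch of the kind of indexing pipeline meant here, assuming Haystack 1.x and that `docs` and `document_store` have already been set up as in the loading sketch above (there is no ready-made first node that reads Arrow directly):

```python
# Hypothetical indexing pipeline: PreProcessor -> DocumentStore.
from haystack import Pipeline
from haystack.nodes import PreProcessor

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(
    component=PreProcessor(split_by="word", split_length=100),
    name="PreProcessor",
    inputs=["File"],  # root node name for indexing pipelines
)
indexing_pipeline.add_node(
    component=document_store, name="DocumentStore", inputs=["PreProcessor"]
)

# With no Arrow-reading node available, the documents are passed in directly.
indexing_pipeline.run(documents=docs)
```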