split sentences also by double newline
eeroel committed Jan 18, 2024
1 parent 8d803d5 commit 9795516
Showing 4 changed files with 7 additions and 2 deletions.
6 changes: 4 additions & 2 deletions hae.cc
@@ -94,7 +94,7 @@ std::vector<std::string> combine_chunks(std::vector<std::string> &chunks, int64_

std::string buffer = "";
for (size_t i = 0; i < chunks.size(); ++i) {
- buffer += chunks[i] + "\n";
+ buffer += chunks[i] + "\n\n";
// If the chunk has multiple lines, just append it
if (buffer.length() > min_size) {
combined.push_back(buffer.substr(0, buffer.length() - 1));
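One thing worth checking in this hunk: the separator grows from one character to two, but the `substr` call still drops only the last character. The sketch below reproduces that expression in isolation; `drop_last_char` and `buffer_for` are hypothetical helpers introduced here for illustration and are not part of hae.cc. Whether the remaining trailing newline is intended is not stated in the commit.

```cpp
#include <string>

// Hypothetical helper mirroring the expression pushed into `combined`
// in the diff: buffer.substr(0, buffer.length() - 1).
std::string drop_last_char(const std::string& buffer) {
    return buffer.substr(0, buffer.length() - 1);
}

// Hypothetical helper: builds the buffer for a single chunk with the given
// separator, then trims it the same way the loop does.
std::string buffer_for(const std::string& chunk, const std::string& separator) {
    return drop_last_char(chunk + separator);
}
```

With the old separator, `buffer_for("chunk", "\n")` yields `"chunk"`; with the new one, `buffer_for("chunk", "\n\n")` yields `"chunk\n"`, so each combined chunk keeps one trailing newline.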
@@ -111,7 +111,9 @@ std::vector<std::string> combine_chunks(std::vector<std::string> &chunks, int64_

std::vector<std::string> split_sentences(const std::string& text) {
std::string wiki_citation_re = "(\\^\\[[0-9]+\\])*";
- std::regex full_re(":\\n" + wiki_citation_re + "|[.!?]" + wiki_citation_re + "\\s");
+ std::string double_newline_re = "\r?\n\r?\n";
+
+ std::regex full_re(":\\n" + wiki_citation_re + "|[.!?]" + wiki_citation_re + "\\s" + "|" + double_newline_re);
size_t prev = 0;
std::vector<std::string> sentences;
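To illustrate what the extra alternative changes, here is a minimal sketch that builds the pattern with and without the double-newline branch and cuts a string at the end of every match. The rest of `split_sentences` is not visible in this diff, so `split_on` and `make_full_re` are hypothetical stand-ins, and keeping the delimiter attached to the preceding piece is an assumption, not necessarily what hae.cc does.

```cpp
#include <regex>
#include <string>
#include <vector>

// Builds the pattern from the diff. The flag is a hypothetical addition
// that lets the old and new behaviour be compared side by side.
std::regex make_full_re(bool with_double_newline) {
    std::string wiki_citation_re = "(\\^\\[[0-9]+\\])*";
    std::string double_newline_re = "\r?\n\r?\n";
    std::string pattern = ":\\n" + wiki_citation_re + "|[.!?]" + wiki_citation_re + "\\s";
    if (with_double_newline) {
        pattern += "|" + double_newline_re;
    }
    return std::regex(pattern);
}

// Hypothetical splitter: cuts `text` at the end of every match, keeping the
// matched delimiter attached to the preceding piece (an assumption).
std::vector<std::string> split_on(const std::string& text, const std::regex& re) {
    std::vector<std::string> parts;
    size_t prev = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        size_t cut = static_cast<size_t>(it->position() + it->length());
        parts.push_back(text.substr(prev, cut - prev));
        prev = cut;
    }
    // Whatever follows the last delimiter is the final piece.
    if (prev < text.size()) {
        parts.push_back(text.substr(prev));
    }
    return parts;
}
```

For the input `"First paragraph without a full stop\n\nSecond one. Third"`, the old pattern produces two pieces because the blank line is not a boundary, while the new pattern produces three: the blank line now ends the first piece even though it has no terminating punctuation.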

1 change: 1 addition & 0 deletions test/t3.txt
@@ -1,4 +1,5 @@

This is a test



1 change: 1 addition & 0 deletions test/t4.txt
@@ -1,3 +1,4 @@
This is a test



1 change: 1 addition & 0 deletions test/t5.txt
@@ -3,3 +3,4 @@
1. 


