Update lucene to version 8.11.2 #16

tuomas2 · 2024-07-19T16:00:46Z

Replaces #15

This gives access to some newer Lucene features, such as regular expression search. It is a major refactor because it moves Lucene forward by 5 major versions.

I tested several languages (English, Czech, Chinese, Japanese, Thai), and search works in all of them. I am not able to judge whether the stemming is good for every language, so more testing by native speakers is needed.
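For readers unfamiliar with term-level regex search, here is a minimal sketch of the idea in plain Java. It is not the code from this PR and does not use Lucene: `RegexSearchSketch`, the verse map, and the ordinal range are all hypothetical, and `java.util.regex` syntax differs from Lucene's regular expression syntax. It only illustrates two behaviours mentioned in the commits: matching is case insensitive, and the search is limited to a requested range rather than the whole text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an index: verse ordinal -> verse text. */
class RegexSearchSketch {
    /**
     * Returns the ordinals of verses in [from, to] that contain at least one
     * whitespace-separated token fully matching the regex, ignoring case.
     */
    static List<Integer> search(TreeMap<Integer, String> verses, String regex, int from, int to) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        List<Integer> hits = new ArrayList<>();
        // subMap restricts the scan to the requested range (both ends inclusive).
        for (Map.Entry<Integer, String> entry : verses.subMap(from, true, to, true).entrySet()) {
            for (String token : entry.getValue().split("\\s+")) {
                // matches() requires the whole token to match, not a substring.
                if (pattern.matcher(token).matches()) {
                    hits.add(entry.getKey());
                    break; // one matching token is enough for this verse
                }
            }
        }
        return hits;
    }
}
```

Matching whole tokens rather than substrings is a deliberate choice in this sketch, since an index stores terms, so a pattern such as `begin` would need to be written as `begin.*` to find "beginning".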


tuomas2 · 2024-07-19T16:08:23Z

@JJK96, to summarize, I would like us to try to:

- Remove AbstractBookAnalyzer altogether, along with all custom analyzers based on it.
- Use StopwordAnalyzer as the base class for our custom analyzers (KeyAnalyzer etc.).
- Modify the properties file / factory accordingly to use classes from core and other libs.
- Change the filter classes (used by some analyzers like KeyAnalyzer) so that they do not store the book (as it does not seem to be used).

(related to discussion started here: #15 (comment))


JJK96 · 2024-10-14T18:41:24Z

- Remove AbstractBookAnalyzer altogether, along with all custom analyzers based on it.
- Use StopwordAnalyzer as the base class for our custom analyzers (KeyAnalyzer etc.).
  - I used Analyzer as the base, since stopwording was not used by these classes.
- Modify the properties file / factory accordingly to use classes from core and other libs.
- Change the filter classes (used by some analyzers like KeyAnalyzer) so that they do not store the book (as it does not seem to be used).
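As an illustration of the "optional but disabled by default" stopwording behaviour from the commits, here is a minimal sketch in plain Java. It is not the PR's implementation and does not use Lucene: `SimpleAnalyzerSketch`, `setDoStopWords`, and the stop word set are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical analyzer-like class; not a Lucene or JSword type. */
class SimpleAnalyzerSketch {
    private final Set<String> stopWords;
    private boolean doStopWords = false; // stopwording is off unless enabled explicitly

    SimpleAnalyzerSketch(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    void setDoStopWords(boolean doStopWords) {
        this.doStopWords = doStopWords;
    }

    /** Splits on non-letter/non-digit characters, lowercases, optionally drops stop words. */
    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (doStopWords && stopWords.contains(token)) {
                continue;
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

With the default setting every word stays searchable, including short words that are stop words in one sense but meaningful in another.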


JJK96 added 17 commits July 8, 2024 19:54

Compiles

8258128

Uncleaned version that supports regex searching

41a8b6d

For regex queries search in full non-canonical text, while for other …

fbeaac7

…queries, search as before

Add switch for regex search type

982ce80

Make Regex search case insensitive

4239e9c

Fix Thai analyzer

4c92c9c

Fix Hebrew analyser

a06ecda

Fix Arabic

c784ccc

Fix Persian

7c43cca

Remove local.properties

d7616bc

Fix analyzer references

02fa61f

Fix tests

54c73b6

Add local.properties to gitignore

a4f26c2

Add smartcn analyzer

c3933c7

Fix Chinese and Japanese

d26a312

Fix French stemmer test

f00f512

I think "a" is not a stop word in this context, because it is a verb here. But my French is not that good.

Fix all tests

f355696

I don't speak all of these languages, so I sometimes just changed the test to reflect the output. At least that should prevent regression.

tuomas2 changed the base branch from master to develop July 19, 2024 16:01

tuomas2 changed the base branch from develop to master July 19, 2024 16:02

tuomas2 mentioned this pull request Jul 19, 2024

Update lucene to version 8.11.2 #15

Closed

tuomas2 assigned JJK96 Jul 19, 2024

tuomas2 mentioned this pull request Aug 10, 2024

How best to extend indexing for different languages and scripts on AndBible AndBible/and-bible#3273

Open

JJK96 added 6 commits August 19, 2024 21:17

Removed AbstractBookAnalyzer

d830c48

Also removed LuceneAnalyzer and moved its functionality into AnalyzerFactory. AnalyzerFactory now returns a real subclass of Analyzer instead of a wrapper. For all languages, language-specific analyzers are used instead of Snowball analyzers.

All tests compiling, but not completely working yet

25ceeb8

Update test, stemming has been implemented now

c588094

Make stopwording optional but disabled by default

069667a

Removed EnglishAnalyzer test in AnalyzerFactoryTest

Make code cleaner

dd6c939

Restructured

4aaf655

JJK96 added 3 commits October 14, 2024 21:08

Remove print

89d6f45

Apply range query to regex queries as well. Fixes bug where regex que…

9e6da6d

…ries would always search the whole bible.

Invalidate old Lucene indices

c2a7da0

Added check for index version when getting index status. This ensures that the status correctly represents if the index is invalid.
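A minimal sketch of what such an index version check could look like, in plain Java. This is an assumption-laden illustration, not the code from this commit: the `IndexStatusSketch` class, the `Status` values, the `index.version` property key, and the version number are all hypothetical.

```java
import java.util.Properties;

/** Hypothetical status check; names and values are not taken from the PR. */
class IndexStatusSketch {
    enum Status { DONE, INVALID, UNDONE }

    // Assumed version written by the current indexing code.
    static final int CURRENT_INDEX_VERSION = 2;

    /** A null argument stands for "no index metadata found on disk". */
    static Status statusOf(Properties metadata) {
        if (metadata == null) {
            return Status.UNDONE;
        }
        String stored = metadata.getProperty("index.version");
        if (stored == null) {
            return Status.INVALID; // index exists but predates version tracking
        }
        try {
            int version = Integer.parseInt(stored.trim());
            // Any mismatch is treated as invalid so the index gets rebuilt.
            return version == CURRENT_INDEX_VERSION ? Status.DONE : Status.INVALID;
        } catch (NumberFormatException e) {
            return Status.INVALID; // unreadable version, safest to rebuild
        }
    }
}
```

Treating every mismatch as invalid is the conservative choice here, since an index written by a different major Lucene version generally cannot be read reliably.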
