Add better support for Brazilian Portuguese #4302

insinfo · 2024-08-21T16:42:42Z

I tested OCR on scanned documents in Brazilian Portuguese and found that Tesseract makes many mistakes on them.

Current Behavior

Result from https://huggingface.co/spaces/kneelesh48/Tesseract-OCR:

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

“Chast vO

Precesse: 18457 J 2003 Data: 03/09/2003 Hora: 10:53:56
Requerente: COSCARELLI E CIALTDA ME 2 ;
* Sec.Destino: Secretaria Municipal de Fazend we
Dept.Destine: Dept? de Tributes @ Fiscalizagao

4
Assunto: ALVARA o Lh 3. )40

Expected Behavior

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Processo: 18457 / 2003
Data: 03/09/2003
Hora: 10:53:56
Requerente: COSCARELLI E CIA LTDA ME
Sec. Destino: Secretaria Municipal de Fazenda
Dept. Destino: Depto. de Tributos e Fiscalização
Assunto: ALVARÁ

Current Behavior

Result from https://huggingface.co/spaces/kneelesh48/Tesseract-OCR:

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Frocesso 153

14 ¢ 2003 data 2540712003 Hora: 16:48:28

COLOMIA DE PESCADOPES 2.00

a oe

pcos

Expected Behavior

The correct output would be:

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Processo: 15314 / 2003
Data: 25/07/2003
Hora: 16:18:28

Requerente: COLÔNIA DE PESCADORES Z-22
Sec. Destino: Sec. Mun. Urbanismo Obras e S. Pub.
Dept. Destino: 0
Assunto: AGRADECIMENTO / FAZ

Windows 11

https://huggingface.co/spaces/kneelesh48/Tesseract-OCR
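
To put a rough number on "a lot of mistakes", the OCR output can be compared against the expected text. The following is only a sketch using Python's standard difflib; the helper name `similarity` is made up for illustration, and the ratio it returns is a crude similarity score, not a standard OCR metric such as character error rate.

```python
import difflib


def similarity(ocr_text: str, expected_text: str) -> float:
    """Return a similarity ratio in [0, 1] between OCR output and expected text."""
    # Collapse all whitespace so differing line breaks do not dominate the score.
    a = " ".join(ocr_text.split())
    b = " ".join(expected_text.split())
    return difflib.SequenceMatcher(None, a, b).ratio()


# One line taken from the first example above.
ocr_line = "Dept.Destine: Dept? de Tributes @ Fiscalizagao"
expected_line = "Dept. Destino: Depto. de Tributos e Fiscalização"

print(f"similarity: {similarity(ocr_line, expected_line):.2f}")
```

A score of 1.0 means the two strings are identical after whitespace normalization; lower values mean more differences.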


stweil · 2024-08-21T16:57:22Z

Latest Tesseract with the model script/Latin gives a better result for the first image:

ESTADO DO RIO DE JANEIRO

Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Cent EO
Processo: 18457 / 2003 Data: 03/09/2003 Hora: 10:53:56
Requerente: COSCARELLI E CIA LTDA ME 2, '
` Sec Destino: Secretaria Municipal de rarako OS
Dept.Destino: Dept? de Tributos è Fiscalização

Assunto: ALVARA A i L J: j4 0

ES

filipe-smartins · 2024-09-07T14:16:53Z

@stweil

What config gives this result for Portuguese? Is it "-l lat+script/Latin" or "-l por+script/Latin"?

config_tesseract = fr'--tessdata-dir "{TESSDATA_PREFIX}" -l lat+script/Latin --oem 3 --psm 6'

stweil · 2024-09-07T14:38:22Z

It's simply -l script/Latin (or -l Latin, depending on your Linux distribution or local installation). The script/Latin model covers all Western European languages that use the Latin script (as opposed to Greek or Cyrillic).

stweil · 2024-09-07T14:42:21Z

Note also that a correct installation of Tesseract does not need --tessdata-dir or TESSDATA_PREFIX, so avoid both (unless you have very special needs).
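
Combining the two replies above, a pytesseract call could look like the sketch below. It assumes pytesseract and Pillow are installed and that the script/Latin model is available to the local Tesseract installation; the helper names `build_config` and `ocr_image` and the file name `protocolo.png` are hypothetical. The `--oem 3 --psm 6` values are carried over from the question, not a recommendation from the maintainers.

```python
def build_config(lang: str = "script/Latin", psm: int = 6, oem: int = 3) -> str:
    """Build a Tesseract config string without --tessdata-dir."""
    # Use "Latin" instead of "script/Latin" if your installation names it that way.
    return f"-l {lang} --oem {oem} --psm {psm}"


def ocr_image(path: str, lang: str = "script/Latin") -> str:
    """Run OCR on one image file and return the recognized text."""
    # Imported here so build_config() can be used without the OCR dependencies.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), config=build_config(lang))


if __name__ == "__main__":
    print(build_config())
    # text = ocr_image("protocolo.png")  # hypothetical scanned page
```

Neither `--tessdata-dir` nor `TESSDATA_PREFIX` appears here, in line with the note above.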

amitdo added the traineddata label Aug 22, 2024
