diff --git a/README.md b/README.md
index 2930376..eb426b0 100644
--- a/README.md
+++ b/README.md
@@ -12,15 +12,15 @@ Ground Truth dataset for French 20th typewritten OCR
 
 | # | name | nb of images | GT for segmenter? | GT for recognizer? | description |
 | --- | :---- | :---: | :---: | :---: | :---: |
-| 1 | seg_data | (48) | y | n | Manual segmentation on pages with straight and regular lines |
-| 2 | alto | (84) | y | y | Manual segmentation and complete transcription of letters. The letters sometimes differ in quality, writing colors, etc. |
-| 3 | newdata | (59) | y | y | Long letters, many lines per page, mostly straigt lines but also narrow tight lines. Approximatively ten letters with handwritten texts. |
-| 4 | extra_data | (97) | n | y | Segmentation and transcription of chunks of texts or unique words to help recognition form specificities: capital letters, numbers, titles, recurring elements, handwritten elements, narrow tight parts of texts |
-| 5 | data | (258) | y | y | Long letters, many lines per page, mostly straight lines but also narrow tight lines. Several pages contain lists, tables and many capital letters words. |
+| 0 | batch-00 | (48) | y | n | Manual segmentation on pages with straight and regular lines |
+| 1 | batch-01 | (258) | y | y | Long letters, many lines per page, mostly straight lines but also narrow tight lines. Several pages contain lists, tables and many capital letters words. |
+| 2 | batch-02 | (59) | y | y | Long letters, many lines per page, mostly straigt lines but also narrow tight lines. Approximatively ten letters with handwritten texts. |
+| 3 | batch-03 | (84) | y | y | Manual segmentation and complete transcription of letters. The letters sometimes differ in quality, writing colors, etc. |
+| 4 | batch-04 | (97) | n | y | Segmentation and transcription of chunks of texts or unique words to help recognition form specificities: capital letters, numbers, titles, recurring elements, handwritten elements, narrow tight parts of texts |
 
-*As there are only made for segmentation, some images/transcriptions from the seg\_data corpus are common with some elements founds in other corpus, but the `CONTENT` of each tag should be empty for the ALTO/PAGE XML of the seg\_data corpus.*
+*As there are only made for segmentation, some images/transcriptions from the "batch-00" are common with some elements founds in other corpus, but the `CONTENT` of each tag should be empty for the ALTO/PAGE XML of the "batch-00".*
 
-*As it is made to train the transcription model on peculiar characters rendition, some images/transcriptions from the extra\_data corpus are common with the other corpus, but the content of the XML files will differ because one will only transcribe special parts while the other will have the whole text transcribed.*
+*As it is made to train the transcription model on peculiar characters rendition, some images/transcriptions from the "batch-04" corpus are common with the other corpus, but the content of the XML files will differ because one will only transcribe special parts while the other will have the whole text transcribed.*
 
 ## Images
 The training has been done with images digitized by the Archives départementales de la Sarthe (where the collection is kept), and then uploaded in NAKALA, which is the IIIF server used for the project that uses this corpus.
diff --git a/data/batch-00/README.md b/data/batch-00/README.md
index 8b39d88..aac1acd 100644
--- a/data/batch-00/README.md
+++ b/data/batch-00/README.md
@@ -1,4 +1,4 @@
-# Images from the folder "Segdata"
+# Images from the folder "batch-00"
 
 | Page | Link NAKALA |
 | - | - |
diff --git a/data/batch-01/README.md b/data/batch-01/README.md
index 3e1ec1b..9295a64 100644
--- a/data/batch-01/README.md
+++ b/data/batch-01/README.md
@@ -1,4 +1,4 @@
-# Images from the folder "Data"
+# Images from the folder "batch-01"
 
 | Letter | Number of pages | Link NAKALA |
 | - | - | - |
diff --git a/data/batch-02/README.md b/data/batch-02/README.md
index ba7c975..8a4ba68 100644
--- a/data/batch-02/README.md
+++ b/data/batch-02/README.md
@@ -1,4 +1,4 @@
-# Images from the folder "Newdata"
+# Images from the folder "batch-02"
 
 | Letter | Number of pages | Link NAKALA |
 | - | - | - |
diff --git a/data/batch-03/README.md b/data/batch-03/README.md
index f275a3b..e45cec1 100644
--- a/data/batch-03/README.md
+++ b/data/batch-03/README.md
@@ -1,4 +1,4 @@
-# Images from the folder "Alto"
+# Images from the folder "batch-03"
 
 | Letter | Number of pages | Link NAKALA |
 | - | - | - |
diff --git a/data/batch-04/README.md b/data/batch-04/README.md
index 9e41325..6924f0c 100644
--- a/data/batch-04/README.md
+++ b/data/batch-04/README.md
@@ -1,4 +1,4 @@
-# Images from the folder "Extra_data"
+# Images from the folder "batch-04"
 
 | Page | Link NAKALA |
 | - | - |