kermitt2 · lfoppiano · Dec 25, 2024
diff --git a/.gitignore b/.gitignore
@@ -27,3 +27,5 @@ traingenerator
 jeu/*
 build/
 install/
+.ninja_*
+build.ninja
diff --git a/.gitmodules b/.gitmodules
@@ -1,3 +1,3 @@
-[submodule "xpdf-4.03"]
-	path = xpdf-4.03
-	url = git@github.com:kermitt2/xpdf-4.03.git
+[submodule "xpdf-4.05"]
+	path = xpdf-4.05
+	url = https://github.com/lfoppiano/xpdf-4.05.git
diff --git a/.travis.yml b/.travis.yml
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,65 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
+
+## [0.5] - TBD
+- update to xpdf-4.05
+
+## [0.4]
+
+- support for the xpdf language support packages for language-specific fonts (Arabic, Simplified Chinese, Japanese, etc.); they are pre-installed locally and portable
+
+- refine line number detection and fix a bug that could result in randomly missing numbers in the ALTO output
+
+- update to xpdf-4.03
+
+- fix issue with character spacing due to invalid rotation condition
+
+- update dependencies and dependency install script
+
+## [0.3]
+
+
+- line number detection: line numbers (typically added for review in manuscripts/preprints) are specifically identified and no longer mixed with the rest of the text content; they are grouped in a separate block or, optionally, omitted from the ALTO file (`noLineNumbers` option)
+
+- removal of the `-blocks` option: block information is always returned to ensure ALTO validation (`<TextBlock>` element)
+
+- fix bugs in reading order
+
+- fix XMax and YMax block coordinates that could be incorrectly set to 0 for blocks containing only one line
+
+## [0.2]
+
+
+- support Unicode composition of characters
+
+- generalize reading order to all blocks (it was limited to the blocks of the first page)
+
+- detect subscript/superscript text font style attribute
+
+- use SVG as a format for vectorial images
+
+- propagate unresolved character Unicode values (free Unicode range for embedded fonts) as encoded special characters in ALTO (so-called "placeholder" approach)
+
+- generate metadata information in a separate XML file (as the ALTO schema does not support it)
+
+- use the latest version of xpdf, version 4.00
+
+- add cmake
+
+- [ALTO](https://github.com/altoxml/documentation/wiki) output replaces the custom Xerox XML format
+
+- Note: this released version was used for Grobid release 0.5.6
+
+## [0.1]
+
+- encode URI (using `xmlURIEscape` from libxml2) for the @href attribute content to avoid blocking XML well-formedness issues. From our experiments, this problem occurs on average in 2-3 scholarly PDFs out of one thousand.
+- output coordinate attributes for the BLOCK elements when the `-block` option is selected.
+- add a parameter `-readingOrder` which re-orders the blocks following the reading order when the `-block` option is selected. By default in pdf2xml, the elements followed the PDF content stream (the so-called _raw order_). In xpdf, several text flow orders are available, including the raw order and the reading order. Note that, with this modification and this new option, only the blocks are re-ordered.
+  From our experiments, the raw order can diverge quite significantly from the order of elements according to the visual/reading layout in 2-4% of scholarly PDFs (e.g. the title element is introduced at the end of the page element, while visually present at the top of the page), and minor changes can be present in up to 100% of PDFs for some scientific publishers (e.g. a headnote introduced at the end of the page content). This additional mode can thus be quite useful for information/structure extraction applications exploiting pdfalto output.
+
+- use the latest version of xpdf, version 3.04.
+
+<!-- markdownlint-disable-file MD024 MD033 -->
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -1,10 +1,13 @@
-cmake_minimum_required(VERSION 3.5+)
+cmake_minimum_required(VERSION 3.10)
 project(pdfalto)
 
 set(CMAKE_CXX_STANDARD 11)
 set(CMAKE_EXE_LINKER_FLAGS "-no-pie")
 set(CMAKE_BUILD_TYPE "Release")
 
+# Set the macOS SDK path (Command Line Tools location)
+set(CMAKE_OSX_SYSROOT /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk)
+
 #--- look for fontconfig
 if (NOT NO_FONTCONFIG)
   find_library(FONTCONFIG_LIBRARY
@@ -24,7 +27,7 @@ else ()
 endif ()
 
 #build xpdf
-set ( XPDF_SUBDIR ${CMAKE_CURRENT_SOURCE_DIR}/xpdf-4.03)
+set ( XPDF_SUBDIR ${CMAKE_CURRENT_SOURCE_DIR}/xpdf-4.05)
 
 set ( IMAGE_SUBDIR ${CMAKE_CURRENT_SOURCE_DIR}/libs/image)
 

diff --git a/Readme.md b/Readme.md
@@ -115,62 +115,7 @@ languages  pdfalto  xpdfrc
 
 # Changes
 
-New in version 0.4 (apart various bug fixes):
-
-- support for xpdf language support package for language-specific fonts like Arabic, Chinese-simplified, Japanese, etc. they are pre-installed locally and portable 
-
-- refined line number detection and fixing a bug which could result in random missing numbers in the ALTO output
-
-- update to xpdf-4.03
-
-- fix issue with character spacing due to invalid rotation condition
-
-- update dependencies and dependency install script
-
-New in version 0.3 (apart various bug fixes):
-
-- line number detection: line numbers (typically added for review in manuscripts/preprints) are specifically identified and not anymore mixed with the rest of text content, they will be grouped in a separate block or, optionally, not outputted in the ALTO file (`noLineNumbers` option)
-
-- removal of `-blocks` option, the block information are always returned for ensuring ALTO validation (`<TextBlock>` element)
-
-- bug fixing on reading order
-
-- fix possible incorrect XMax and YMax values at 0 on block coordinates having only one line
-
-New in version 0.2 (apart various bug fixes):
-
-- support Unicode composition of characters
-
-- generalize reading order to all blocks (it was limited to the blocks of the first page)
-
-- detect subscript/superscript text font style attribute
-
-- use SVG as a format for vectorial images
-
-- propagate unsolved character Unicode value (free Unicode range for embedded fonts) as encoded special character in ALTO (so-called "placeholder" approach)
-
-- generate metadata information in a separate XML file (as ALTO schema does not support that)
-
-- use the latest version of xpdf, version 4.00
-
-- add cmake
-
-- [ALTO](https://github.com/altoxml/documentation/wiki) output is replacing custom Xerox XML format
-
-- Note: this released version was used for Grobid release 0.5.6
-
-New in version 0.1 (apart various bug fixes): 
-
-- encode URI (using `xmlURIEscape` from libxml2) for the @href attribute content to avoid blocking XML wellformedness issues. From our experiments, this problem happens in average for 2-3 scholar PDF out of one thousand.
-
-- output coordinates attributes for the BLOCK elements when the `-block` option is selected,
-
-- add a parameter `-readingOrder` which re-order the blocks following the reading order when the -block option is selected. By default in pdf2xml, the elements followed the PDF content stream (the so-called _raw order_). In xpdf, several text flow orders are available including the raw order and the reading order. Note that, with this modification and this new option, only the blocks are re-ordered.
-
-  From our experiments, the raw order can diverge quite significantly from the order of elements according to the visual/reading layout in 2-4% of scholar PDF (e.g. title element is introduced at the end of the page element, while visually present at the top of the page), and minor changes can be present in up to 100% of PDF for some scientific publishers (e.g. headnote introduced at the end of the page content). This additional mode can be thus quite useful for information/structure extraction applications exploiting pdfalto output. 
-
-- use the latest version of xpdf, version 3.04.
-
+All changes are listed in [CHANGELOG.md](CHANGELOG.md).
 
 # Contributors