Merge branch 'develop' for MeTA v3.0.0
Chase Geigle committed Feb 13, 2017
2 parents e09ac0e + 484b1b9 commit 239805f
Showing 143 changed files with 5,374 additions and 1,141 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -11,6 +11,7 @@ doc/
data/ceeaus
data/breast-cancer
data/housing
data/cranfield
biicode.conf
bii/
bin/
42 changes: 28 additions & 14 deletions .travis.yml
@@ -65,36 +65,50 @@ matrix:
- gcc-6
- g++-6

# Linux/Clang 3.6
# Linux/Clang 3.8
- os: linux
env: COMPILER=clang CLANG_VERSION=3.6
env: COMPILER=clang CLANG_VERSION=3.8
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- llvm-toolchain-precise-3.6
- llvm-toolchain-precise-3.8
packages:
- *default-packages
- clang-3.6
- llvm-3.6-dev
- clang-3.8

# OS X 10.9 + Xcode 6.1
- os: osx
env: COMPILER=clang
# Linux/Clang 3.8 + libc++-3.9
# (I want this to be 3.9 across the board, but the apt source is not
# yet whitelisted for llvm 3.9)
- os: linux
env:
- COMPILER=clang
- CLANG_VERSION=3.8
- LLVM_TAG=RELEASE_390
- LIBCXX_EXTRA_CMAKE_FLAGS=-DLIBCXX_INSTALL_EXPERIMENTAL_LIBRARY=On
- CMAKE_VERSION=3.4.3
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- llvm-toolchain-precise-3.8
packages:
- *default-packages
- clang-3.8

# OS X 10.10 + Xcode 6.4
# OS X 10.10 + Xcode 7.1.1
- os: osx
osx_image: xcode6.4
osx_image: xcode7.1
env: COMPILER=clang

# OS X 10.10 + Xcode 7.1.1
# OS X 10.11 + Xcode 7.3
- os: osx
osx_image: xcode7.1
osx_image: xcode7.3
env: COMPILER=clang

# OS X 10.11 + Xcode 7.2
# OS X 10.11 + Xcode 8
- os: osx
osx_image: xcode7.2
osx_image: xcode8
env: COMPILER=clang

# OS X/GCC 6
119 changes: 118 additions & 1 deletion CHANGELOG.md
@@ -1,3 +1,119 @@
# [v3.0.0][3.0.0]
## New features
- Add an `embedding_analyzer` that represents documents with their averaged word
vectors.
- Add a `parallel::reduction` algorithm designed for parallelizing complex
accumulation operations (like an E step in an EM algorithm)
- Parallelize feature counting in feature selector using the new
`parallel::reduction`
- Add a `parallel::for_each_block` algorithm to run functions on
(relatively) equal sub-ranges of an iterator range in parallel
- Add a parallel merge sort as `parallel::sort`
- Add a `util/traits.h` header for generally useful traits
- Add a Markov model implementation in `sequence::markov_model`
- Add a generic unsupervised HMM implementation. This implementation
supports HMMs with discrete observations (what is used most often) and
sequence observations (useful for log mining applications). The
forward-backward algorithm is implemented using both the scaling method
and the log-space method. The scaling method is used by default, but the
log-space method is useful for HMMs with sequence observations to avoid
underflow issues when the output probabilities themselves are very small.
- Add the KL-divergence retrieval function using pseudo-relevance feedback
with the two-component mixture-model approach of Zhai and Lafferty,
called `kl_divergence_prf`. This ranker can internally use any
`language_model_ranker` subclass like `dirichlet_prior` or
`jelinek_mercer` to perform the ranking of the feedback set and the
result documents with respect to the modified query.

The EM algorithm used for the two-component mixture model is provided as
the `index::feedback::unigram_mixture` free function and returns the
feedback model.
- Add the Rocchio algorithm (`rocchio`) for pseudo-relevance feedback in
the vector space model.
- **Breaking Change.** To facilitate the above two changes, we have also
broken the `ranker` hierarchy into one more level. At the top we have
`ranker`, which has a pure virtual function `rank()` that can be
overridden to provide entirely custom ranking behavior. This is the class
the KL-divergence and Rocchio methods derive from, as we need to
re-define what it means to rank documents (first retrieving a feedback
set, then ranking documents with respect to an updated query).

Most of the time, however, you will want to derive from the second level
`ranking_function`, which is what was called `ranker` before. This class
provides a definition of `rank()` to perform document-at-a-time ranking,
and expects deriving classes to instead provide `initial_score()` and
`score_one()` implementations to define the scoring function used for
each document. **Existing code that derived from `ranker` prior to this
version of MeTA likely needs to be changed to instead derive from
`ranking_function`.**
- Add the `util::transform_iterator` class and `util::make_transform_iterator`
function for providing iterators that transform their output according to
a unary function.
- **Breaking Change.** `whitespace_tokenizer` now emits *only* word tokens
by default, suppressing all whitespace tokens. The old default was to
emit tokens containing whitespace in addition to actual word tokens. The
old behavior can be obtained by passing `false` to its constructor, or
setting `suppress-whitespace = false` in its configuration group in
`config.toml`. (Note that whitespace tokens are still needed if using a
`sentence_boundary` filter, but in nearly all circumstances
`icu_tokenizer` should be preferred.)
- **Breaking Change.** Co-occurrence counting for embeddings now uses
history that crosses sentence boundaries by default. The old behavior
(clearing the history when starting a new sentence) can be obtained by
ensuring that a tokenizer is being used that emits sentence boundary tags
and by setting `break-on-tags = true` in the `[embeddings]` table of
`config.toml`.
- **Breaking Change.** All references in the embeddings library to "coocur"
have been changed to "cooccur". This means that some files and binaries
have been renamed. Much of the co-occurrence counting part of the
embeddings library has also been moved to the public API.
- Co-occurrence counting is now performed in parallel. The behavior of its
merge strategy can be configured with the new `[embeddings]` config
parameter `merge-fanout = n`, which specifies the maximum number of
on-disk chunks to allow before kicking off a multi-way merge (default 8).
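
The `parallel::reduction` bullet above describes a fold-then-combine pattern: each thread accumulates into a private local value over its sub-range, and the per-thread partials are merged serially at the end. A minimal, self-contained sketch of that pattern (illustrative only — `parallel_reduce` and its signature here are hypothetical, not MeTA's actual API):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Fold-then-combine: each thread folds its chunk into a private
// accumulator; the accumulators are combined after all threads join.
template <class T, class LocalFold, class Combine>
T parallel_reduce(const std::vector<T>& data, T init, LocalFold fold,
                  Combine combine, std::size_t n_threads = 4)
{
    std::vector<T> partials(n_threads, init);
    std::vector<std::thread> pool;
    std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (std::size_t t = 0; t < n_threads; ++t)
    {
        pool.emplace_back([&partials, &data, &fold, t, chunk]() {
            std::size_t begin = std::min(data.size(), t * chunk);
            std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                partials[t] = fold(partials[t], data[i]);
        });
    }
    for (auto& th : pool)
        th.join();
    // serial combine of the per-thread accumulators
    T result = init;
    for (const auto& p : partials)
        result = combine(result, p);
    return result;
}
```

Because each thread touches only its own accumulator, no locking is needed during the fold, which is what makes this shape attractive for complex accumulations like an EM E step.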

## Enhancements
- Add additional `packed_write` and `packed_read` overloads: for
`std::pair`, `stats::dirichlet`, `stats::multinomial`,
`util::dense_matrix`, and `util::sparse_vector`
- Additional functions have been added to `ranker_factory` to allow
construction/loading of `language_model_ranker` subclasses (useful for the
`kl_divergence_prf` implementation)
- Add a `util::make_fixed_heap` helper function to simplify the declaration
of `util::fixed_heap` classes with lambda function comparators.
- Add regression tests for rankers' MAP and NDCG scores. This adds a new
dataset, `cranfield`, that contains non-binary relevance judgments to
facilitate these new tests.
- Bump bundled version of ICU to 58.2.
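
The point of a helper like `util::make_fixed_heap` is that a lambda comparator's type is unnameable, so a factory function lets template argument deduction do the work. A hypothetical stand-in showing the idea (this `fixed_heap_sketch` class and `make_fixed_heap` are illustrative sketches, not MeTA's implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// A bounded heap that retains only the "best" max_size elements
// under the supplied comparator.
template <class T, class Compare>
class fixed_heap_sketch
{
  public:
    fixed_heap_sketch(std::size_t max_size, Compare cmp)
        : max_size_{max_size}, heap_{cmp}
    {
    }

    // push, then evict the worst element if over capacity
    void push(const T& value)
    {
        heap_.push(value);
        if (heap_.size() > max_size_)
            heap_.pop();
    }

    // drain the heap, best element first
    std::vector<T> extract_sorted()
    {
        std::vector<T> out;
        while (!heap_.empty())
        {
            out.push_back(heap_.top());
            heap_.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

  private:
    std::size_t max_size_;
    std::priority_queue<T, std::vector<T>, Compare> heap_;
};

// The helper: Compare is deduced, so callers can pass a lambda directly
// without naming its type.
template <class T, class Compare>
fixed_heap_sketch<T, Compare> make_fixed_heap(std::size_t max_size, Compare cmp)
{
    return {max_size, cmp};
}
```

With a greater-than comparator the heap keeps the k largest values, a common shape for top-k retrieval results.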

## Bug Fixes
- Fix bug in NDCG calculation (ideal-DCG was computed using the wrong
sorting order for non-binary judgments)
- Fix bug where the final chunks to be merged in index creation were not
being deleted when merging completed
- Fix bug where GloVe training would allocate the embedding matrix before
starting the shuffling process, causing it to exceed the "max-ram"
config parameter.
- Fix bug with consuming MeTA from a build directory with `cmake` when
building a static ICU library. `meta-utf` is now forced to be a shared
library, which (1) should save on binary sizes and (2) ensures that the
statically built ICU is linked into the `libmeta-utf.so` library to avoid
undefined references to ICU functions.
- Fix bug with consuming Release-mode MeTA libraries from another project
being built in Debug mode. Before, `identifiers.h` would change behavior
based on the `NDEBUG` macro's setting. This behavior has been removed,
and opaque identifiers are always on.

## Deprecation
- `disk_index::doc_name` and `disk_index::doc_path` have been deprecated in
favor of the more general (and less confusing) `metadata()`. They will be
removed in a future major release.
- Support for 32-bit architectures is provided on a best-effort basis. MeTA
makes heavy use of memory mapping, which is best paired with a 64-bit
address space. Please move to a 64-bit platform for using MeTA if at all
possible (most consumer machines should support 64-bit if they were made
in the last 5 years or so).

# [v2.4.2][2.4.2]
## Bug Fixes
- Properly shuffle documents when doing an even-split classification test
@@ -493,7 +609,8 @@
# [v1.0][1.0]
- Initial release.

[unreleased]: https://github.com/meta-toolkit/meta/compare/v2.4.2...develop
[unreleased]: https://github.com/meta-toolkit/meta/compare/v3.0.0...develop
[3.0.0]: https://github.com/meta-toolkit/meta/compare/v2.4.2...v3.0.0
[2.4.2]: https://github.com/meta-toolkit/meta/compare/v2.4.1...v2.4.2
[2.4.1]: https://github.com/meta-toolkit/meta/compare/v2.4.0...v2.4.1
[2.4.0]: https://github.com/meta-toolkit/meta/compare/v2.3.0...v2.4.0
15 changes: 11 additions & 4 deletions CMakeLists.txt
@@ -47,11 +47,10 @@ endif()
list(APPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/deps/findicu)
list(APPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/deps/meta-cmake/)

# We require Unicode 8 for the unit tests, which was added in ICU 56.1
FindOrBuildICU(
VERSION 57.1
URL http://download.icu-project.org/files/icu4c/57.1/icu4c-57_1-src.tgz
URL_HASH MD5=976734806026a4ef8bdd17937c8898b9
VERSION 58.2
URL http://download.icu-project.org/files/icu4c/58.2/icu4c-58_2-src.tgz
URL_HASH MD5=fac212b32b7ec7ab007a12dff1f3aea1
)

add_library(meta-definitions INTERFACE)
@@ -143,6 +142,14 @@ add_subdirectory(src)
add_subdirectory(tests)
add_subdirectory(deps/cpptoml EXCLUDE_FROM_ALL)

# Warn users that are using a 32-bit system
if (CMAKE_SIZEOF_VOID_P LESS 8)
message(WARNING "You appear to be running on a 32-bit system. Support \
for 32-bit systems is provided on a best-effort basis; if at all \
possible, we strongly recommend that you use MeTA on a 64-bit \
platform.")
endif()

# install our targets defined in this file
install(TARGETS meta-definitions
EXPORT meta-exports
5 changes: 4 additions & 1 deletion README.md
@@ -622,9 +622,12 @@ you should run the following commands to download dependencies and related
software needed for building:

```bash
pacman -Syu git make mingw-w64-x86_64-{gcc,cmake,icu,jemalloc,zlib}
pacman -Syu git make mingw-w64-x86_64-{gcc,cmake,icu,jemalloc,zlib} --force
```

(The `--force` flag is needed to work around a bug in the latest MSYS2
installer as of the time of writing.)

Then, exit the shell and launch the "MinGW-w64 Win64" shell. You can obtain
the toolkit and get started with:

2 changes: 1 addition & 1 deletion config.toml
@@ -96,7 +96,7 @@ test-sections = [23, 23]

[embeddings]
prefix = "word-embeddings"
filter = [{type = "icu-tokenizer"}, {type = "lowercase"}]
filter = [{type = "icu-tokenizer", suppress-tags = true}, {type = "lowercase"}]
vector-size = 50
[embeddings.vocab]
min-count = 10
2 changes: 1 addition & 1 deletion deps/cpptoml
2 changes: 1 addition & 1 deletion deps/meta-cmake
Submodule meta-cmake updated 1 file
+34 −28 FindOrBuildICU.cmake
21 changes: 18 additions & 3 deletions include/meta/analyzers/tokenizers/whitespace_tokenizer.h
@@ -9,6 +9,7 @@
#ifndef META_WHITESPACE_TOKENIZER_H_
#define META_WHITESPACE_TOKENIZER_H_

#include "meta/analyzers/filter_factory.h"
#include "meta/analyzers/token_stream.h"
#include "meta/util/clonable.h"
#include "meta/util/string_view.h"
@@ -39,8 +40,10 @@ class whitespace_tokenizer : public util::clonable<token_stream,
public:
/**
* Creates a whitespace_tokenizer.
* @param suppress_whitespace Whether to suppress whitespace tokens
* themselves or not.
*/
whitespace_tokenizer();
whitespace_tokenizer(bool suppress_whitespace = true);

/**
* Sets the content for the tokenizer to parse.
@@ -64,12 +67,24 @@
const static util::string_view id;

private:
void consume_adjacent_whitespace();

/// Buffered string content for this tokenizer
std::string content_;

/// Character index into the current buffer
uint64_t idx_;
/// Whether or not to output whitespace tokens
const bool suppress_whitespace_;

/// Character iterator into the current buffer
std::string::const_iterator it_;
};

/**
* Specialization of the factory method use to create whitespace_tokenizers.
*/
template <>
std::unique_ptr<token_stream>
make_tokenizer<whitespace_tokenizer>(const cpptoml::table& config);
}
}
}
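
To illustrate the `suppress_whitespace` flag introduced in the header above, here is a standalone sketch of the default versus legacy tokenization behavior (this free function is an illustration only, not MeTA code):

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split text into maximal runs of whitespace / non-whitespace characters.
// With suppress_whitespace == true (the new default), only the word runs
// are emitted; with false (the old behavior), whitespace runs are emitted
// as tokens too.
std::vector<std::string> whitespace_tokenize(const std::string& text,
                                             bool suppress_whitespace = true)
{
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size())
    {
        bool ws = std::isspace(static_cast<unsigned char>(text[i])) != 0;
        std::size_t start = i;
        // consume the whole run of same-class characters
        while (i < text.size()
               && (std::isspace(static_cast<unsigned char>(text[i])) != 0) == ws)
            ++i;
        if (!ws || !suppress_whitespace)
            tokens.push_back(text.substr(start, i - start));
    }
    return tokens;
}
```

For `"hello  world"` this yields two tokens by default, and three (including the `"  "` run) when suppression is disabled, which is what a downstream `sentence_boundary` filter would need.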
1 change: 1 addition & 0 deletions include/meta/classify/classifier/classifier.h
@@ -109,6 +109,7 @@ confusion_matrix cross_validate(Creator&& creator,
docs, docs.begin(),
docs.begin() + static_cast<diff_type>(step_size)};
auto m = cls->test(test_view);
matrix.add_fold_accuracy(m.accuracy());
matrix += m;
docs.rotate(step_size);
}
13 changes: 13 additions & 0 deletions include/meta/classify/confusion_matrix.h
@@ -41,6 +41,16 @@ class confusion_matrix
void add(const predicted_label& predicted, const class_label& actual,
size_t times = 1);

/**
* @param acc Accuracy to add
*/
void add_fold_accuracy(double acc);

/**
* @return the list of added accuracies
*/
std::vector<double> fold_accuracy() const;

/**
* Prints this matrix's statistics to out.
*
@@ -160,6 +170,9 @@

/// Total number of classification attempts
size_t total_;

/// Keeps track of accuracies between folds
std::vector<double> fold_acc_;
};
}
}
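
Once cross-validation records each fold's accuracy via `add_fold_accuracy`, the vector returned by `fold_accuracy()` makes cross-fold summary statistics straightforward. A hedged sketch of what a caller might compute from it (these helper functions are hypothetical, not part of MeTA's API):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Mean accuracy across cross-validation folds.
double mean_accuracy(const std::vector<double>& fold_acc)
{
    return std::accumulate(fold_acc.begin(), fold_acc.end(), 0.0)
           / fold_acc.size();
}

// Population standard deviation of the per-fold accuracies, a rough
// measure of how stable the classifier is across folds.
double stddev_accuracy(const std::vector<double>& fold_acc)
{
    double mu = mean_accuracy(fold_acc);
    double ss = 0.0;
    for (double a : fold_acc)
        ss += (a - mu) * (a - mu);
    return std::sqrt(ss / fold_acc.size());
}
```

Reporting a mean with a spread is generally more informative than the pooled accuracy alone, which is presumably why the per-fold values are now retained.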