Merge branch 'develop' for MeTA v3.0.0
Chase Geigle committed Feb 13, 2017
2 parents e09ac0e + 484b1b9 commit 239805f
Showing 143 changed files with 5,374 additions and 1,141 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -11,6 +11,7 @@ doc/
data/ceeaus
data/breast-cancer
data/housing
data/cranfield
biicode.conf
bii/
bin/
42 changes: 28 additions & 14 deletions .travis.yml
@@ -65,36 +65,50 @@ matrix:
- gcc-6
- g++-6

# Linux/Clang 3.6
# Linux/Clang 3.8
- os: linux
env: COMPILER=clang CLANG_VERSION=3.6
env: COMPILER=clang CLANG_VERSION=3.8
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- llvm-toolchain-precise-3.6
- llvm-toolchain-precise-3.8
packages:
- *default-packages
- clang-3.6
- llvm-3.6-dev
- clang-3.8

# OS X 10.9 + Xcode 6.1
- os: osx
env: COMPILER=clang
# Linux/Clang 3.8 + libc++-3.9
# (I want this to be 3.9 across the board, but the apt source is not
# yet whitelisted for llvm 3.9)
- os: linux
env:
- COMPILER=clang
- CLANG_VERSION=3.8
- LLVM_TAG=RELEASE_390
- LIBCXX_EXTRA_CMAKE_FLAGS=-DLIBCXX_INSTALL_EXPERIMENTAL_LIBRARY=On
- CMAKE_VERSION=3.4.3
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- llvm-toolchain-precise-3.8
packages:
- *default-packages
- clang-3.8

# OS X 10.10 + Xcode 6.4
# OS X 10.10 + Xcode 7.1.1
- os: osx
osx_image: xcode6.4
osx_image: xcode7.1
env: COMPILER=clang

# OS X 10.10 + Xcode 7.1.1
# OS X 10.11 + Xcode 7.3
- os: osx
osx_image: xcode7.1
osx_image: xcode7.3
env: COMPILER=clang

# OS X 10.11 + Xcode 7.2
# OS X 10.11 + Xcode 8
- os: osx
osx_image: xcode7.2
osx_image: xcode8
env: COMPILER=clang

# OS X/GCC 6
119 changes: 118 additions & 1 deletion CHANGELOG.md
@@ -1,3 +1,119 @@
# [v3.0.0][3.0.0]
## New features
- Add an `embedding_analyzer` that represents documents with their averaged word
vectors.
- Add a `parallel::reduction` algorithm designed for parallelizing complex
accumulation operations (like an E step in an EM algorithm)
- Parallelize feature counting in feature selector using the new
`parallel::reduction`
- Add a `parallel::for_each_block` algorithm to run functions on
(relatively) equal sub-ranges of an iterator range in parallel
- Add a parallel merge sort as `parallel::sort`
- Add a `util/traits.h` header for generally useful traits
- Add a Markov model implementation in `sequence::markov_model`
- Add a generic unsupervised HMM implementation. This implementation
supports HMMs with discrete observations (what is used most often) and
sequence observations (useful for log mining applications). The
forward-backward algorithm is implemented using both the scaling method
and the log-space method. The scaling method is used by default, but the
log-space method is useful for HMMs with sequence observations to avoid
underflow issues when the output probabilities themselves are very small.
- Add the KL-divergence retrieval function using pseudo-relevance feedback
with the two-component mixture-model approach of Zhai and Lafferty,
called `kl_divergence_prf`. This ranker can internally use any
`language_model_ranker` subclass like `dirichlet_prior` or
`jelinek_mercer` to perform the ranking of the feedback set and the
result documents with respect to the modified query.

The EM algorithm used for the two-component mixture model is provided as
the `index::feedback::unigram_mixture` free function and returns the
feedback model.
- Add the Rocchio algorithm (`rocchio`) for pseudo-relevance feedback in
the vector space model.
- **Breaking Change.** To facilitate the above two changes, we have also
broken the `ranker` hierarchy into one more level. At the top we have
`ranker`, which has a pure virtual function `rank()` that can be
overridden to provide entirely custom ranking behavior. This is the class
the KL-divergence and Rocchio methods derive from, as we need to
re-define what it means to rank documents (first retrieving a feedback
set, then ranking documents with respect to an updated query).

Most of the time, however, you will want to derive from the second level
`ranking_function`, which is what was called `ranker` before. This class
provides a definition of `rank()` to perform document-at-a-time ranking,
and expects deriving classes to instead provide `initial_score()` and
`score_one()` implementations to define the scoring function used for
each document. **Existing code that derived from `ranker` prior to this
version of MeTA likely needs to be changed to instead derive from
`ranking_function`.**
- Add the `util::transform_iterator` class and `util::make_transform_iterator`
function for providing iterators that transform their output according to
a unary function.
- **Breaking Change.** `whitespace_tokenizer` now emits *only* word tokens
by default, suppressing all whitespace tokens. The old default was to
emit tokens containing whitespace in addition to actual word tokens. The
old behavior can be obtained by passing `false` to its constructor, or
setting `suppress-whitespace = false` in its configuration group in
`config.toml`. (Note that whitespace tokens are still needed if using a
`sentence_boundary` filter, but in nearly all circumstances
`icu_tokenizer` should be preferred.)
- **Breaking Change.** Co-occurrence counting for embeddings now uses
history that crosses sentence boundaries by default. The old behavior
(clearing the history when starting a new sentence) can be obtained by
ensuring that a tokenizer is being used that emits sentence boundary tags
and by setting `break-on-tags = true` in the `[embeddings]` table of
`config.toml`.
- **Breaking Change.** All references in the embeddings library to "coocur"
have been changed to "cooccur". This means that some files and binaries
have been renamed. Much of the co-occurrence counting part of the
embeddings library has also been moved to the public API.
- Co-occurrence counting is now performed in parallel. The behavior of its
merge strategy can be configured with the new `[embeddings]` config
parameter `merge-fanout = n`, which specifies the maximum number of
on-disk chunks to allow before kicking off a multi-way merge (default 8).
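
The `parallel::reduction` bullet above describes a fold-then-combine pattern: each thread accumulates into a private local value over its sub-range, and the per-thread partials are merged serially at the end. A minimal, self-contained sketch of that pattern (illustrative only — `parallel_reduce` and its signature here are hypothetical, not MeTA's actual API):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Fold-then-combine: each thread folds its chunk into a private
// accumulator; the accumulators are combined after all threads join.
template <class T, class LocalFold, class Combine>
T parallel_reduce(const std::vector<T>& data, T init, LocalFold fold,
                  Combine combine, std::size_t n_threads = 4)
{
    std::vector<T> partials(n_threads, init);
    std::vector<std::thread> pool;
    std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (std::size_t t = 0; t < n_threads; ++t)
    {
        pool.emplace_back([&partials, &data, &fold, t, chunk]() {
            std::size_t begin = std::min(data.size(), t * chunk);
            std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                partials[t] = fold(partials[t], data[i]);
        });
    }
    for (auto& th : pool)
        th.join();
    // serial combine of the per-thread accumulators
    T result = init;
    for (const auto& p : partials)
        result = combine(result, p);
    return result;
}
```

Because each thread touches only its own accumulator, no locking is needed during the fold, which is what makes this shape attractive for complex accumulations like an EM E step.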

## Enhancements
- Add additional `packed_write` and `packed_read` overloads: for
`std::pair`, `stats::dirichlet`, `stats::multinomial`,
`util::dense_matrix`, and `util::sparse_vector`
- Additional functions have been added to `ranker_factory` to allow
construction/loading of `language_model_ranker` subclasses (useful for the
`kl_divergence_prf` implementation)
- Add a `util::make_fixed_heap` helper function to simplify the declaration
of `util::fixed_heap` classes with lambda function comparators.
- Add regression tests for rankers' MAP and NDCG scores. This adds a new
dataset, `cranfield`, that contains non-binary relevance judgments to
facilitate these new tests.
- Bump bundled version of ICU to 58.2.
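
The point of a helper like `util::make_fixed_heap` is that a lambda comparator's type is unnameable, so a factory function lets template argument deduction do the work. A hypothetical stand-in showing the idea (this `fixed_heap_sketch` class and `make_fixed_heap` are illustrative sketches, not MeTA's implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// A bounded heap that retains only the "best" max_size elements
// under the supplied comparator.
template <class T, class Compare>
class fixed_heap_sketch
{
  public:
    fixed_heap_sketch(std::size_t max_size, Compare cmp)
        : max_size_{max_size}, heap_{cmp}
    {
    }

    // push, then evict the worst element if over capacity
    void push(const T& value)
    {
        heap_.push(value);
        if (heap_.size() > max_size_)
            heap_.pop();
    }

    // drain the heap, best element first
    std::vector<T> extract_sorted()
    {
        std::vector<T> out;
        while (!heap_.empty())
        {
            out.push_back(heap_.top());
            heap_.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

  private:
    std::size_t max_size_;
    std::priority_queue<T, std::vector<T>, Compare> heap_;
};

// The helper: Compare is deduced, so callers can pass a lambda directly
// without naming its type.
template <class T, class Compare>
fixed_heap_sketch<T, Compare> make_fixed_heap(std::size_t max_size, Compare cmp)
{
    return {max_size, cmp};
}
```

With a greater-than comparator the heap keeps the k largest values, a common shape for top-k retrieval results.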

## Bug Fixes
- Fix bug in NDCG calculation (ideal-DCG was computed using the wrong
sorting order for non-binary judgments)
- Fix bug where the final chunks to be merged in index creation were not
being deleted when merging completed
- Fix bug where GloVe training would allocate the embedding matrix before
starting the shuffling process, causing it to exceed the "max-ram"
config parameter.
- Fix bug with consuming MeTA from a build directory with `cmake` when
building a static ICU library. `meta-utf` is now forced to be a shared
library, which (1) should save on binary sizes and (2) ensures that the
statically built ICU is linked into the `libmeta-utf.so` library to avoid
undefined references to ICU functions.
- Fix bug with consuming Release-mode MeTA libraries from another project
being built in Debug mode. Before, `identifiers.h` would change behavior
based on the `NDEBUG` macro's setting. This behavior has been removed,
and opaque identifiers are always on.

## Deprecation
- `disk_index::doc_name` and `disk_index::doc_path` have been deprecated in
favor of the more general (and less confusing) `metadata()`. They will be
removed in a future major release.
- Support for 32-bit architectures is provided on a best-effort basis. MeTA
makes heavy use of memory mapping, which is best paired with a 64-bit
address space. Please move to a 64-bit platform for using MeTA if at all
possible (most consumer machines should support 64-bit if they were made
in the last 5 years or so).

# [v2.4.2][2.4.2]
## Bug Fixes
- Properly shuffle documents when doing an even-split classification test
@@ -493,7 +609,8 @@
# [v1.0][1.0]
- Initial release.

[unreleased]: https://github.com/meta-toolkit/meta/compare/v2.4.2...develop
[unreleased]: https://github.com/meta-toolkit/meta/compare/v3.0.0...develop
[3.0.0]: https://github.com/meta-toolkit/meta/compare/v2.4.2...v3.0.0
[2.4.2]: https://github.com/meta-toolkit/meta/compare/v2.4.1...v2.4.2
[2.4.1]: https://github.com/meta-toolkit/meta/compare/v2.4.0...v2.4.1
[2.4.0]: https://github.com/meta-toolkit/meta/compare/v2.3.0...v2.4.0
15 changes: 11 additions & 4 deletions CMakeLists.txt
@@ -47,11 +47,10 @@ endif()
list(APPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/deps/findicu)
list(APPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/deps/meta-cmake/)

# We require Unicode 8 for the unit tests, which was added in ICU 56.1
FindOrBuildICU(
VERSION 57.1
URL http://download.icu-project.org/files/icu4c/57.1/icu4c-57_1-src.tgz
URL_HASH MD5=976734806026a4ef8bdd17937c8898b9
VERSION 58.2
URL http://download.icu-project.org/files/icu4c/58.2/icu4c-58_2-src.tgz
URL_HASH MD5=fac212b32b7ec7ab007a12dff1f3aea1
)

add_library(meta-definitions INTERFACE)
@@ -143,6 +142,14 @@ add_subdirectory(src)
add_subdirectory(tests)
add_subdirectory(deps/cpptoml EXCLUDE_FROM_ALL)

# Warn users that are using a 32-bit system
if (CMAKE_SIZEOF_VOID_P LESS 8)
message(WARNING "You appear to be running on a 32-bit system. Support \
for 32-bit systems is provided on a best-effort basis; if at all \
possible, we strongly recommend that you use MeTA on a 64-bit \
platform.")
endif()

# install our targets defined in this file
install(TARGETS meta-definitions
EXPORT meta-exports
5 changes: 4 additions & 1 deletion README.md
@@ -622,9 +622,12 @@ you should run the following commands to download dependencies and related
software needed for building:

```bash
pacman -Syu git make mingw-w64-x86_64-{gcc,cmake,icu,jemalloc,zlib}
pacman -Syu git make mingw-w64-x86_64-{gcc,cmake,icu,jemalloc,zlib} --force
```

(The `--force` flag is needed to work around a bug in the latest MSYS2
installer as of the time of writing.)

Then, exit the shell and launch the "MinGW-w64 Win64" shell. You can obtain
the toolkit and get started with:

2 changes: 1 addition & 1 deletion config.toml
@@ -96,7 +96,7 @@ test-sections = [23, 23]

[embeddings]
prefix = "word-embeddings"
filter = [{type = "icu-tokenizer"}, {type = "lowercase"}]
filter = [{type = "icu-tokenizer", suppress-tags = true}, {type = "lowercase"}]
vector-size = 50
[embeddings.vocab]
min-count = 10
2 changes: 1 addition & 1 deletion deps/cpptoml
2 changes: 1 addition & 1 deletion deps/meta-cmake
Submodule meta-cmake updated 1 file
+34 −28 FindOrBuildICU.cmake
21 changes: 18 additions & 3 deletions include/meta/analyzers/tokenizers/whitespace_tokenizer.h
@@ -9,6 +9,7 @@
#ifndef META_WHITESPACE_TOKENIZER_H_
#define META_WHITESPACE_TOKENIZER_H_

#include "meta/analyzers/filter_factory.h"
#include "meta/analyzers/token_stream.h"
#include "meta/util/clonable.h"
#include "meta/util/string_view.h"
@@ -39,8 +40,10 @@ class whitespace_tokenizer : public util::clonable<token_stream,
public:
/**
* Creates a whitespace_tokenizer.
* @param suppress_whitespace Whether to suppress whitespace tokens
* themselves or not.
*/
whitespace_tokenizer();
whitespace_tokenizer(bool suppress_whitespace = true);

/**
* Sets the content for the tokenizer to parse.
@@ -64,12 +67,24 @@
const static util::string_view id;

private:
void consume_adjacent_whitespace();

/// Buffered string content for this tokenizer
std::string content_;

/// Character index into the current buffer
uint64_t idx_;
/// Whether or not to output whitespace tokens
const bool suppress_whitespace_;

/// Character iterator into the current buffer
std::string::const_iterator it_;
};

/**
* Specialization of the factory method use to create whitespace_tokenizers.
*/
template <>
std::unique_ptr<token_stream>
make_tokenizer<whitespace_tokenizer>(const cpptoml::table& config);
}
}
}
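
To illustrate the `suppress_whitespace` flag introduced in the header above, here is a standalone sketch of the default versus legacy tokenization behavior (this free function is an illustration only, not MeTA code):

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split text into maximal runs of whitespace / non-whitespace characters.
// With suppress_whitespace == true (the new default), only the word runs
// are emitted; with false (the old behavior), whitespace runs are emitted
// as tokens too.
std::vector<std::string> whitespace_tokenize(const std::string& text,
                                             bool suppress_whitespace = true)
{
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size())
    {
        bool ws = std::isspace(static_cast<unsigned char>(text[i])) != 0;
        std::size_t start = i;
        // consume the whole run of same-class characters
        while (i < text.size()
               && (std::isspace(static_cast<unsigned char>(text[i])) != 0) == ws)
            ++i;
        if (!ws || !suppress_whitespace)
            tokens.push_back(text.substr(start, i - start));
    }
    return tokens;
}
```

For `"hello  world"` this yields two tokens by default, and three (including the `"  "` run) when suppression is disabled, which is what a downstream `sentence_boundary` filter would need.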
1 change: 1 addition & 0 deletions include/meta/classify/classifier/classifier.h
@@ -109,6 +109,7 @@ confusion_matrix cross_validate(Creator&& creator,
docs, docs.begin(),
docs.begin() + static_cast<diff_type>(step_size)};
auto m = cls->test(test_view);
matrix.add_fold_accuracy(m.accuracy());
matrix += m;
docs.rotate(step_size);
}
13 changes: 13 additions & 0 deletions include/meta/classify/confusion_matrix.h
@@ -41,6 +41,16 @@ class confusion_matrix
void add(const predicted_label& predicted, const class_label& actual,
size_t times = 1);

/**
* @param acc Accuracy to add
*/
void add_fold_accuracy(double acc);

/**
* @return the list of added accuracies
*/
std::vector<double> fold_accuracy() const;

/**
* Prints this matrix's statistics to out.
*
@@ -160,6 +170,9 @@

/// Total number of classification attempts
size_t total_;

/// Keeps track of accuracies between folds
std::vector<double> fold_acc_;
};
}
}
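
Once cross-validation records each fold's accuracy via `add_fold_accuracy`, the vector returned by `fold_accuracy()` makes cross-fold summary statistics straightforward. A hedged sketch of what a caller might compute from it (these helper functions are hypothetical, not part of MeTA's API):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Mean accuracy across cross-validation folds.
double mean_accuracy(const std::vector<double>& fold_acc)
{
    return std::accumulate(fold_acc.begin(), fold_acc.end(), 0.0)
           / fold_acc.size();
}

// Population standard deviation of the per-fold accuracies, a rough
// measure of how stable the classifier is across folds.
double stddev_accuracy(const std::vector<double>& fold_acc)
{
    double mu = mean_accuracy(fold_acc);
    double ss = 0.0;
    for (double a : fold_acc)
        ss += (a - mu) * (a - mu);
    return std::sqrt(ss / fold_acc.size());
}
```

Reporting a mean with a spread is generally more informative than the pooled accuracy alone, which is presumably why the per-fold values are now retained.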