Releases · meta-toolkit/meta

30 Jan 01:42

skystrife

v2.0.0

6c062bf

MeTA v2.0.0

New features and major changes

Indexing

Index format rewrite: both inverted and forward indices now use the same
compressed postings format, and intermediate chunks are now also
compressed on-the-fly. There is now a built-in tool to dump any forward
index to libsvm format (as this is not the on-disk format for that type
of index anymore).
Metadata support: indices can now store arbitrary metadata associated
with individual documents with string, integer, unsigned integer, and
floating point values
Corpus configuration is now stored within the corpus directory itself,
allowing for corpora to be distributed with their proper configurations
rather than having to bake this into the main configuration file
RAM limits can be set for the indexing process via the configuration
file. These are approximate and based on heuristics, so you should
always set these to lower than available RAM.
Forward indices can now be created directly instead of forcing the
creation of an inverted index first
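
For readers unfamiliar with the libsvm text format mentioned above: it stores one instance per line as a label followed by `index:value` pairs in ascending index order, with indices conventionally starting at 1. A minimal standalone sketch of writing such a line follows. The function name `to_libsvm_line` and the pair-based feature representation are hypothetical and are not MeTA's API or the bundled dump tool.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of MeTA): formats one instance as a
// libsvm line. `features` holds (zero-based feature id, value) pairs
// that are assumed to be sorted by feature id already.
std::string to_libsvm_line(
    const std::string& label,
    const std::vector<std::pair<uint64_t, double>>& features)
{
    std::ostringstream out;
    out << label;
    for (const auto& f : features)
    {
        // libsvm feature indices are conventionally 1-based
        out << ' ' << (f.first + 1) << ':' << f.second;
    }
    return out.str();
}
```

Zero-valued features are simply omitted from the line, which is what makes the format suitable for sparse document vectors.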

Tokenization and Analysis

ICU will be built and statically linked if the system-provided library is
too old on both OS X and Linux platforms. MeTA now specifies an
exact version of ICU that should be used per release for consistency.
That version is 56.1 as of this release.
Analyzers have been modified to support both integral and floating point
values via the use of the featurizer object passed to tokenize()
Documents no longer store any count information during the analysis
process

Ranking

Postings lists can now be read in a streaming fashion rather than all at
once via postings_stream
Ranking is now performed using a document-at-a-time scheme
Ranking functions now use fast approximate math from
fastapprox
Rank correlation measures have been added to the evaluation library
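
Document-at-a-time ranking walks all query-term postings lists in parallel, fully scores one document before moving on to the next, and only needs to remember the current top k results. A minimal standalone sketch follows, assuming in-memory lists sorted by document id and a precomputed weight per posting. `posting`, `daat_top_k`, and the summed-weight scoring are illustrative simplifications, not MeTA's ranker or postings_stream API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// One (document id, term weight) entry; each list is sorted by doc_id.
struct posting
{
    uint64_t doc_id;
    double weight;
};

// Hypothetical sketch of document-at-a-time top-k scoring.
// Returns (score, doc_id) pairs, best first.
std::vector<std::pair<double, uint64_t>>
daat_top_k(const std::vector<std::vector<posting>>& lists, std::size_t k)
{
    using scored = std::pair<double, uint64_t>;
    // min-heap: the weakest of the current top k sits on top
    std::priority_queue<scored, std::vector<scored>, std::greater<scored>> heap;
    std::vector<std::size_t> cursor(lists.size(), 0);

    while (true)
    {
        // find the smallest unprocessed document id across all cursors
        uint64_t next = std::numeric_limits<uint64_t>::max();
        bool any = false;
        for (std::size_t i = 0; i < lists.size(); ++i)
        {
            if (cursor[i] < lists[i].size())
            {
                next = std::min(next, lists[i][cursor[i]].doc_id);
                any = true;
            }
        }
        if (!any)
            break;

        // fully score that one document, advancing each matching cursor
        double score = 0.0;
        for (std::size_t i = 0; i < lists.size(); ++i)
        {
            if (cursor[i] < lists[i].size()
                && lists[i][cursor[i]].doc_id == next)
                score += lists[i][cursor[i]++].weight;
        }

        heap.emplace(score, next);
        if (heap.size() > k)
            heap.pop(); // never hold more than k candidates
    }

    std::vector<scored> results;
    while (!heap.empty())
    {
        results.push_back(heap.top());
        heap.pop();
    }
    std::reverse(results.begin(), results.end());
    return results;
}
```

Because each list is only ever read front to back, the same loop works when postings are decoded lazily from disk rather than held in memory.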

Language Model

Rewrite of the language model library which can load models from the
.arpa format
SyntacticDiff implementation for comparative text mining, which may
include grammatical error correction, summarization, or feature generation

Machine Learning

A feature selection library for selecting features for machine learning
using chi square, information gain, correlation coefficient, and odds
ratio has been added
The API for the machine learning algorithms has been changed to use
dataset classes; these are separate from the index classes and
represent data that is memory-resident
Support for regression has been added (currently only via SGD)
The SGD algorithm has been improved to use a normalized adaptive gradient
method which should make it less sensitive to feature scaling
The SGD algorithm now supports (approximate) L1 regularization via a
cumulative penalty approach
The libsvm modules are now also built using CMake
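
To illustrate the cumulative penalty idea for L1 regularization: instead of shrinking every weight on every update, the learner tracks the total penalty any weight could have received so far and lets each weight "catch up" when it is next touched, clipping at zero. The sketch below is one well-known formulation written as standalone C++. The type `l1_sgd_state`, its member names, and the plain (non-adaptive) gradient step are assumptions for illustration and do not reflect MeTA's SGD classes.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of SGD with L1 regularization via a cumulative
// penalty; not MeTA's implementation.
struct l1_sgd_state
{
    std::vector<double> weights;
    std::vector<double> applied; // net penalty actually applied per weight
    double total = 0.0;          // penalty each weight could have received

    explicit l1_sgd_state(std::size_t n) : weights(n, 0.0), applied(n, 0.0)
    {
    }

    // One update on a sparse gradient given as (index, gradient) pairs.
    void step(const std::vector<std::pair<std::size_t, double>>& grad,
              double learning_rate, double l1_strength)
    {
        total += learning_rate * l1_strength;
        for (const auto& g : grad)
        {
            double& w = weights[g.first];
            w -= learning_rate * g.second; // plain gradient step
            const double before = w;
            // catch up on the penalty owed, clipping at zero so the
            // penalty alone never flips the sign of the weight
            if (w > 0)
                w = std::max(0.0, w - (total + applied[g.first]));
            else if (w < 0)
                w = std::min(0.0, w + (total - applied[g.first]));
            applied[g.first] += w - before;
        }
    }
};
```

Only the weights that appear in the current gradient are touched, so the cost of an update stays proportional to the number of non-zero features in the example.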

Miscellaneous

Packed binary I/O functions allow for writing integers/floating point
values in a compressed format that can be efficiently decoded. This
should be used for most binary I/O that needs to be performed in the
toolkit unless there is a specific reason not to.
An interactive demo application has been added for the shift-reduce
constituency parser
A string_view class is provided in the meta::util namespace to be
used for non-owning references to strings. This will use
std::experimental::string_view if available and our own
implementation if not
meta::util::optional will resolve to std::experimental::optional if
it is available
Support for jemalloc has been added to the build system. We strongly
recommend installing and linking against jemalloc for improved indexing
performance.
A tool has been added to print out the top k terms in a corpus
A new library for hashing has been added in namespace meta::hashing.
This includes a generic framework for writing hash functions that are
randomly keyed as well as (insertion only) probing-based hash sets/maps
with configurable resizing and probing strategies
A utility class fixed_heap has been added for places where a fixed size
set of maximal/minimal values should be maintained in constant space
The filesystem management routines have been converted to use STLsoft in
the event that the filesystem library in
std::experimental::filesystem is not available
Building MeTA on Windows is now officially supported via MSYS2 and
MinGW-w64, and continuous integration now builds it on every commit in
this environment
A small support library for things related to random number generation
has been added in meta::random
Sparse vectors now support operator+ and operator-
An STL container compatible allocator aligned_allocator<T, Alignment>
has been added that can over-align data (useful for performance in some
situations)
Bandit is now used for the unit tests, and these have been substantially
improved upon
io::parser has been deprecated and removed; most uses were simply
converted to std::fstream
binary_file_{reader,writer} have been deprecated and removed;
io::packed or io::{read,write}_binary should be used instead
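
As background for the packed binary I/O item above, one common way to write integers compactly is a variable-byte encoding: seven payload bits per byte, with the high bit marking that more bytes follow. The sketch below shows that general technique in standalone C++. Whether io::packed uses exactly this byte layout is an assumption, and `varint_write` / `varint_read` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sketch: appends `value` using 7 payload bits per byte,
// least significant group first; a set high bit means "more follows".
void varint_write(uint64_t value, std::vector<uint8_t>& out)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Reads one value starting at `pos` and advances `pos` past it.
// Throws std::out_of_range on truncated input (via at()) and
// std::runtime_error if the encoding is longer than 64 bits allow.
uint64_t varint_read(const std::vector<uint8_t>& in, std::size_t& pos)
{
    uint64_t value = 0;
    int shift = 0;
    while (true)
    {
        if (shift >= 64)
            throw std::runtime_error("varint too long");
        const uint8_t byte = in.at(pos++);
        value |= static_cast<uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0)
            break;
        shift += 7;
    }
    return value;
}
```

Small values, which dominate in gap-encoded postings and counts, take a single byte, while the full 64-bit range remains representable.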

Bug fixes

knn classifier now only requests the top k when performing classification
Fixed an issue where uncompressed model files would not be found when
using a zlib-enabled build (#101)

Enhancements

Travis CI integration has been switched to their container
infrastructure, and it now builds on OS X with Clang in addition to
Linux with Clang and GCC
AppVeyor CI is now used for Windows builds alongside Travis
Indexing speeds are dramatically faster (thanks to many changes both in
the in-memory posting chunks as well as optimizations in the
tokenization process)
If no build type is specified, MeTA will be built in Release mode
The cpptoml dependency version has been bumped, allowing the use of
things like value_or for cleaner code
The identifiers library has been dramatically simplified

01 Sep 00:20

smassung

v1.3.8

935af7d

release MeTA v1.3.8

Bug fixes

Fix issue with confusion_matrix where precision and recall values were
swapped. Thanks to @HusseinHazimeh for finding this!

Enhancements

Better unit tests for confusion_matrix
Add functions to confusion_matrix to directly access precision, recall, and
F1 score
Create a predicted_label opaque identifier to emphasize class_labels that
are output from some model (and thus shouldn't be interchangeable)
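
Since the fix above concerns precision and recall being swapped, a small reference sketch of the two definitions may be useful: precision divides the true positives for a label by everything predicted as that label, while recall divides by everything that actually has that label. The class `simple_confusion_matrix` below is hypothetical and uses plain strings for labels; it is not MeTA's confusion_matrix and does not model the predicted_label identifier.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch; counts are keyed by (predicted, actual) labels.
class simple_confusion_matrix
{
  public:
    void add(const std::string& predicted, const std::string& actual)
    {
        ++counts_[std::make_pair(predicted, actual)];
    }

    // of everything predicted as `label`, the fraction that was correct
    double precision(const std::string& label) const
    {
        return ratio(label, true);
    }

    // of everything that truly is `label`, the fraction that was found
    double recall(const std::string& label) const
    {
        return ratio(label, false);
    }

    // harmonic mean of precision and recall
    double f1(const std::string& label) const
    {
        const double p = precision(label);
        const double r = recall(label);
        return (p + r) > 0.0 ? 2.0 * p * r / (p + r) : 0.0;
    }

  private:
    // true positives divided by the predicted total (precision) or the
    // actual total (recall) for `label`; 0 when the total is empty
    double ratio(const std::string& label, bool by_predicted) const
    {
        std::size_t hits = 0;
        std::size_t total = 0;
        for (const auto& kv : counts_)
        {
            const std::string& pred = kv.first.first;
            const std::string& act = kv.first.second;
            if ((by_predicted ? pred : act) != label)
                continue;
            total += kv.second;
            if (pred == act)
                hits += kv.second;
        }
        return total > 0 ? static_cast<double>(hits) / total : 0.0;
    }

    std::map<std::pair<std::string, std::string>, std::size_t> counts_;
};
```

Keeping predicted and actual labels as distinct types, as the predicted_label identifier does, lets the compiler reject calls that pass them in the wrong order.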

13 Jun 22:39

skystrife

v1.3.7

624f8c5

release MeTA v1.3.7

Bug fixes

Fix inconsistent behavior of utf::segmenter (and thus icu_tokenizer) for
different locales. Thanks @CanoeFZH and @tng-konrad for helping debug
this!

Enhancements

Allow for specifying the language and country for locale generation in
setting up utf::segmenter (and thus icu_tokenizer)
Allow for suppression of <s> and </s> tags within icu_tokenizer,
mostly useful for information retrieval experiments with unigram words.
Thanks @HusseinHazimeh for the suggestion!
Add a default-unigram-chain filter chain preset which is suitable for
information retrieval experiments using unigram words. Thanks
@HusseinHazimeh for the suggestion!

03 Jun 02:01

skystrife

v1.3.6

978f519

release MeTA v1.3.6

Bug fixes

Fix potential off-by-one when calculating the number of documents in a
line_corpus when its files do not end in a newline

Enhancements

Change score_data to support floating-point weights on query terms

10 Apr 10:23

skystrife

v1.3.5

9a732ac

release MeTA v1.3.5

Bug fixes

Fix missing support for sequence/parser analyzers in classify tools

07 Apr 01:48

smassung

v1.3.4

1b07f39

release MeTA v1.3.4

New features

Support building with biicode
Add Vagrantfile for virtual machine configuration
Add Dockerfile for Docker support

Enhancements

Improve ir_eval unit tests

Bug fixes

Fix ir_eval::ndcg incorrect log base and addition instead of subtraction in
IDCG calculation
Fix ir_eval::avg_p incorrect early termination
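
For reference on the NDCG fix above, a widely used formulation of discounted cumulative gain takes a gain of 2^rel - 1 at each rank and discounts it by log2(rank + 1); NDCG divides that by the DCG of the ideal ordering. The sketch below implements this formulation in standalone C++. It is an assumption that ir_eval::ndcg uses these exact gain and discount terms, and for brevity the ideal ordering here is built only from the judgments in the ranked list itself.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// DCG over graded relevance values given in ranked order (rank 1 first),
// using gain 2^rel - 1 and discount log2(rank + 1).
double dcg(const std::vector<int>& rels)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < rels.size(); ++i)
    {
        // i is zero-based, so rank + 1 == i + 2
        sum += (std::pow(2.0, rels[i]) - 1.0) / std::log2(i + 2.0);
    }
    return sum;
}

// Hypothetical sketch: normalizes by the DCG of the same judgments
// sorted from most to least relevant.
double ndcg(std::vector<int> rels)
{
    const double actual = dcg(rels);
    std::sort(rels.begin(), rels.end(), std::greater<int>());
    const double ideal = dcg(rels);
    return ideal > 0.0 ? actual / ideal : 0.0;
}
```

Using a different log base or writing the gain as 2^rel + 1 changes the scores that come out, which is why both details matter in the ideal DCG as much as in the actual one.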

24 Mar 22:48

skystrife

v1.3.3

cbdf237

release MeTA v1.3.3

Bug fixes

Fix issues with system-defined integer widths in binary model files
(mainly impacted the greedy tagger and parser); please re-download any
parser model files you may have had before
Fix bug where parser model directory is not created if a non-standard
prefix is used (anything other than "parser")

Enhancements

Silence inconsistent missing overrides warning on clang >= 3.6

18 Mar 00:05

skystrife

v1.3.2

140f17f

release MeTA v1.3.2

Bug fixes

fix potentially incorrect generation of vocabulary map files on 32-bit
systems (this appears to have only impacted non-default block sizes)

05 Mar 03:43

smassung

v1.3.1

dcae17f

release MeTA v1.3.1

Bug fixes:

fix calculation of average precision in ir_eval (the denominator was incorrect)
specify that labels are required for the file_corpus document list; this allows spaces in the path to each document
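
As a reminder of why the denominator matters in average precision: under the standard definition, the precision values at each relevant hit are summed and then divided by the total number of relevant documents for the query, not by the number of hits or the length of the result list. A minimal sketch of that standard definition follows. The function name and the boolean-vector input are hypothetical, and the release note does not say which denominator ir_eval used before or after the fix.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of average precision for one query.
// relevant_at_rank[i] is true when the result at rank i + 1 is relevant;
// total_relevant counts every relevant document for the query, whether
// or not it was retrieved.
double average_precision(const std::vector<bool>& relevant_at_rank,
                         std::size_t total_relevant)
{
    if (total_relevant == 0)
        return 0.0;

    double sum = 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < relevant_at_rank.size(); ++i)
    {
        if (relevant_at_rank[i])
        {
            ++hits;
            // precision at this rank: relevant so far / results so far
            sum += static_cast<double>(hits) / static_cast<double>(i + 1);
        }
    }
    return sum / static_cast<double>(total_relevant);
}
```

Dividing by the number of hits instead would reward a system that retrieves a single relevant document at rank 1 and misses all the others.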

04 Mar 06:12

skystrife

v1.3

a57a814

release MeTA v1.3

New features:

additions to the graph library:
- myopic search
- BFS
- preferential attachment graph generation model (supports node
  attractiveness from different distributions)
- betweenness centrality
- eigenvector centrality
added a new natural language parsing library:
- parse tree library (visitor-based)
- shift-reduce constituency parser for generating phrase structure
  trees
- reimplementation of evalb metrics for evaluating parsers
- new filter for Penn Treebank-style normalization
added a greedy averaged Perceptron-based tagger
demo application for various basic text processing (profile)
basic iostreams that support gzip compression (if compiled with ZLib
support)
added iteration method for stats::multinomial seen events
added expected value and entropy functions to stats namespace
added linear_model: a generic multiclass classifier storage class
added gz_corpus: a compressed version of line_corpus
added macros for generating type safe identifiers with user defined
literal suffixes
added a persistent stack data structure to meta::util

Enhancements:

added operator== for util::optional<T>
better CMake support for building the libsvm modules
better CMake support for downloading unit-test data
improved setup guide in README (for OS X, Ubuntu, Arch, and EWS/ENGRIT)
tree analyzers refactored to use the new parser library (removes
dependency on outside toolkits for generating tree files)
analyzers that are not part of the "core" have been moved into their
respective folders (so ngram_pos_analyzer is in src/sequence,
tree_analyzer is in src/parser)
make_index now checks if the files exist before loading an index, and
if they are missing creates a new one (as opposed to just throwing an
exception on a nonexistent file)
cpptoml upgraded to support TOML v0.4.0
enable extra warnings (-Wextra) for clang++ and g++

Bug fixes:

fix sequence_analyzer::analyze() const when applied to untagged
sequences (was throwing when it shouldn't)
ensure that the inverted index object is destroyed first before
uninverting occurs in the creation of a forward_index
fix bug where icu_tokenizer would output spaces as tokens
fix bugs where index objects were not destroyed before trying to delete
their files in the unit tests
fix bug in sparse_vector::find() where it would return a non-end
iterator when asked to find an element that does not exist
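
A lookup bug like the sparse_vector::find() one above can arise when a binary search result is returned without checking for an exact match. The sketch below shows the checked version, assuming a sparse vector stored as (index, value) pairs sorted by index. That storage layout, the `entry` alias, and `find_entry` are assumptions for illustration; the release note does not describe the actual cause in MeTA.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using entry = std::pair<uint64_t, double>;

// Hypothetical sketch: lookup in a vector of (index, value) pairs that
// is kept sorted by index. Returns storage.end() when `index` is absent.
std::vector<entry>::const_iterator
find_entry(const std::vector<entry>& storage, uint64_t index)
{
    auto it = std::lower_bound(
        storage.begin(), storage.end(), index,
        [](const entry& e, uint64_t idx) { return e.first < idx; });
    // lower_bound yields the first entry with key >= index, which is
    // only a hit when the key matches exactly
    if (it != storage.end() && it->first == index)
        return it;
    return storage.end();
}
```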
