Commit

Merge branch 'master' into int_master
thammegowda committed Jan 28, 2022
2 parents 612601d + df2b9af commit e9866e9
Showing 10 changed files with 434 additions and 351 deletions.
15 changes: 15 additions & 0 deletions ACKNOWLEDGMENT
@@ -0,0 +1,15 @@

The research is based upon work supported by the Office of the Director of
National Intelligence (ODNI), Intelligence Advanced Research Projects
Activity (IARPA), via AFRL Contract #FA8650-17-C-9116.
The views and conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of the ODNI, IARPA, or the
U.S. Government. The U.S. Government is authorized to reproduce and
distribute reprints for Governmental purposes notwithstanding any
copyright annotation thereon.


This material is based on research sponsored by Air Force Research Laboratory (AFRL)
under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and
distribute reprints for Government purposes notwithstanding any copyright notation therein.
27 changes: 20 additions & 7 deletions README.md
@@ -4,18 +4,31 @@

Reader-Translator-Generator (RTG) is a Neural Machine Translation toolkit based on PyTorch.

Docs: available under [docs/](docs/index.adoc) directory, which are also rendered in a prettier HTML format at https://isi-nlp.github.io/rtg/
Documentation: https://isi-nlp.github.io/rtg/
> To edit or improve the docs, go to the [docs/](docs/index.adoc) directory.
---------
### Authors:
[See Here](https://github.com/isi-nlp/rtg-xt/graphs/contributors)

### Questions or Issues

Please use GitHub issues to ask a question or report an issue:
1. https://github.com/isi-nlp/rtg/issues (public/ external repo)
2. https://github.com/isi-nlp/rtg-in/issues (a fork of rtg internal to ISI NLP)
2. https://github.com/isi-nlp/rtg-in/issues (an internal fork, for ISI NLP)

### Credits / Thanks
+ OpenNMT and the Harvard NLP team for [Annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html), I learned a lot from their work
+ [My team at USC ISI](https://www.isi.edu/research_groups/nlg/people) for everything else

### ACKNOWLEDGEMENTS

* The research is based upon work supported by the Office of the Director of
National Intelligence (ODNI), Intelligence Advanced Research Projects
Activity (IARPA), via AFRL Contract #FA8650-17-C-9116.
The views and conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of the ODNI, IARPA, or the
U.S. Government. The U.S. Government is authorized to reproduce and
distribute reprints for Governmental purposes notwithstanding any
copyright annotation thereon.

* This material is based on research sponsored by
Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000.
The U.S. Government is authorized to reproduce and distribute reprints for
Government purposes notwithstanding any copyright notation therein.
16 changes: 7 additions & 9 deletions docs/10-conf.yml.adoc
@@ -369,7 +369,7 @@ prep:
----

[#conf-vocab]
== Vocabulary Preprocessing
=== Vocabulary Preprocessing

link:https://github.com/google/sentencepiece[Google's sentencepiece] is an awesome lib for
preprocessing the text datasets.
@@ -387,7 +387,7 @@ prep:
codec_lib: nlcodec # default is sentpiece
----

=== Vocabulary Types
==== Vocabulary Types
Both `sentpiece` and `nlcodec` support `pieces=` `bpe`, `char`, `word`.

[source, yaml]
@@ -398,8 +398,9 @@ prep:
pieces: bpe # other options: char, word
----
As of now, only `sentpiece` supports `pieces=unigram`.
For classification experiments, `nlcodec` supports `pieces=class`

=== Character coverage
==== Character coverage

For `bpe` and `char` vocabulary types, a useful trick is to exclude low-frequency characters and mark them as `UNK`.
The cutoff is usually expressed as the percentage of characters in the training corpus that the vocabulary covers.
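
The coverage idea can be illustrated with a minimal Python sketch. This is not RTG's or any codec library's implementation: the function names `covered_chars` and `mask_rare`, the `<unk>` marker, and the default threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def covered_chars(corpus, coverage=0.9995):
    """Return the most frequent characters whose combined frequency
    reaches `coverage` (a fraction of all characters in `corpus`)."""
    counts = Counter(ch for line in corpus for ch in line)
    total = sum(counts.values())
    keep, running = set(), 0
    for ch, freq in counts.most_common():
        if running >= coverage * total:
            break  # the target coverage is already reached
        keep.add(ch)
        running += freq
    return keep


def mask_rare(text, keep, unk="<unk>"):
    """Replace every character outside `keep` with the unknown marker."""
    return "".join(ch if ch in keep else unk for ch in text)
```

Lowering `coverage` drops more rare characters (stray symbols, characters from other scripts), which keeps the vocabulary small at the cost of mapping those characters to the unknown marker.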
@@ -416,12 +417,9 @@ prep:

=== Sub-Word Regularization

When using `codec_lib: nlcodec` and `pieces: bpe`, you have the option to add
sub-word regularization to your training.
Normally, text is split into the fewest tokens necessary to represent
the sequence (greedy split).
By occasionally splitting some tokens into their constituents (suboptimal split),
we can represent the same sequence many different ways.
When using `codec_lib: nlcodec` and `pieces: bpe`, you have the option to add sub-word regularization to your training.
Normally, text is split into the fewest tokens necessary to represent the sequence (greedy split).
By occasionally splitting some tokens into their constituents (suboptimal split), we can represent the same sequence in many ways.
This lets the model make more effective use of a limited amount of training data.
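
The splitting step described above can be sketched in Python. This is a simplified illustration, not the `nlcodec` implementation: the function `regularize`, the `merges` table (token mapped to the two pieces it was merged from), and the `split_prob` parameter are hypothetical names introduced here.

```python
import random


def regularize(tokens, merges, split_prob=0.1, rng=random):
    """Randomly replace some tokens with the two pieces they were merged from.

    `merges` maps a BPE token to its (left, right) constituents.
    Only one level of splitting is applied per token.
    """
    out = []
    for tok in tokens:
        if tok in merges and rng.random() < split_prob:
            out.extend(merges[tok])  # suboptimal split
        else:
            out.append(tok)          # keep the greedy token
    return out
```

Because each split only undoes a merge, joining the output pieces always gives back the original text, so the model sees different segmentations of the same sentence across training passes.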

[source, yaml]
2 changes: 1 addition & 1 deletion docs/howto-release.adoc
@@ -7,7 +7,7 @@
. Update the version: `\__version__` in `rtg/\__init__.py`
. Remove old builds (if any)

rm -r build dist *.egg-info`
rm -r build dist *.egg-info

. Build:

7 changes: 6 additions & 1 deletion docs/index.adoc
@@ -11,6 +11,7 @@ USC Information Sciences Institute Natural Language Group
:toc: left
//injects google analytics to <head>
:docinfo2:
:icons: font
:hide-uri-scheme:
:source-highlighter: rouge

@@ -31,6 +32,10 @@ include::45-scaling.adoc[]

include::50-serve.adoc[]

include::60-develop.adoc[]

:!sectnums:

include::60-develop.adoc[]
== Acknowledgements

include::../ACKNOWLEDGMENT[]