Merge branch 'master' into int_master

isi-nlp · Jan 28, 2022 · e9866e9
2 parents 612601d + df2b9af

Showing 10 changed files with 434 additions and 351 deletions.
diff --git a/ACKNOWLEDGMENT b/ACKNOWLEDGMENT
@@ -0,0 +1,15 @@
+
+The research is based upon work supported by the Office of the Director of
+National Intelligence (ODNI), Intelligence Advanced Research Projects
+Activity (IARPA), via AFRL Contract #FA8650-17-C-9116.
+The views and conclusions contained herein are those of the authors and
+should not be interpreted as necessarily representing the official policies or
+endorsements, either expressed or implied, of the ODNI, IARPA, or the
+U.S. Government. The U.S. Government is authorized to reproduce and
+distribute reprints for Governmental purposes notwithstanding any
+copyright annotation thereon.
+
+
+This material is based on research sponsored by Air Force Research Laboratory (AFRL)
+under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and
+distribute reprints for Government purposes notwithstanding any copyright notation therein.
diff --git a/README.md b/README.md
@@ -4,18 +4,31 @@
 
 Reader-Translator-Generator (RTG) is a Neural Machine Translation toolkit based on PyTorch.
 
-Docs: available under [docs/](docs/index.adoc) directory, which are also rendered in a prettier HTML format at https://isi-nlp.github.io/rtg/    
+Documentation: https://isi-nlp.github.io/rtg/
+> To edit or improve the docs, go to the [docs/](docs/index.adoc) directory.
 
 ---------
-### Authors:
-[See Here](https://github.com/isi-nlp/rtg-xt/graphs/contributors)
 
 ### Questions or Issues 
 
 Please use GitHub issues to ask a question or report an issue:
 1. https://github.com/isi-nlp/rtg/issues   (public/external repo)
-2. https://github.com/isi-nlp/rtg-in/issues (a fork of rtg internal to ISI NLP)
+2. https://github.com/isi-nlp/rtg-in/issues (an internal fork, for ISI NLP)
 
-### Credits / Thanks
-+ OpenNMT and the Harvard NLP team for [Annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html), I learned a lot from their work
-+ [My team at USC ISI](https://www.isi.edu/research_groups/nlg/people) for everything else
+
+### ACKNOWLEDGEMENTS
+
+* The research is based upon work supported by the Office of the Director of
+National Intelligence (ODNI), Intelligence Advanced Research Projects
+Activity (IARPA), via AFRL Contract #FA8650-17-C-9116.
+The views and conclusions contained herein are those of the authors and
+should not be interpreted as necessarily representing the official policies or
+endorsements, either expressed or implied, of the ODNI, IARPA, or the
+U.S. Government. The U.S. Government is authorized to reproduce and
+distribute reprints for Governmental purposes notwithstanding any
+copyright annotation thereon.
+
+* This material is based on research sponsored by
+ Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000.
+The U.S. Government is authorized to reproduce and distribute reprints for
+Government purposes notwithstanding any copyright notation therein.
diff --git a/docs/10-conf.yml.adoc b/docs/10-conf.yml.adoc
@@ -369,7 +369,7 @@ prep:
 ----
 
 [#conf-vocab]
-== Vocabulary Preprocessing
+=== Vocabulary Preprocessing
 
 link:https://github.com/google/sentencepiece[Google's sentencepiece] is an awesome lib for
 preprocessing the text datasets.
@@ -387,7 +387,7 @@ prep:
   codec_lib: nlcodec  # default is sentpiece
 ----
 
-=== Vocabulary Types
+==== Vocabulary Types
 Both `sentpiece` and `nlcodec` support `pieces=bpe`, `pieces=char`, and `pieces=word`.
 
 [source, yaml]
@@ -398,8 +398,9 @@ prep:
   pieces: bpe         # other options: char, word
 ----
 As of now, only `sentpiece` supports `pieces=unigram`.
+For classification experiments, `nlcodec` supports `pieces=class`.
 
-=== Character coverage
+==== Character coverage
 
 For `bpe` and `char` vocabulary types, a useful trick is to exclude low-frequency characters and mark them as `UNK`.
 The cutoff is usually expressed as a percentage of character coverage in the training corpus.
@@ -416,12 +417,9 @@ prep:
 
 === Sub-Word Regularization
 
-When using `codec_lib: nlcodec` and `pieces: bpe`, you have the option to add
-sub-word regularization to your training.
-Normally, text is split into the fewest tokens necessary to represent
-the sequence (greedy split).
-By occasionally splitting some tokens into its constituents (suboptimal split),
-we can represent the same sequence many different ways.
+When using `codec_lib: nlcodec` and `pieces: bpe`, you have the option to add sub-word regularization to your training.
+Normally, text is split into the fewest tokens necessary to represent the sequence (greedy split).
+By occasionally splitting some tokens into their constituents (suboptimal split), we can represent the same sequence in many ways.
 This allows us to leverage less data more effectively.
 
 [source, yaml]

diff --git a/docs/howto-release.adoc b/docs/howto-release.adoc
@@ -7,7 +7,7 @@
 . Update the version: `\__version__` in `rtg/\__init__.py`
 . Remove old builds (if any)
 
-   rm -r build dist *.egg-info`
+   rm -r build dist *.egg-info
 
 . Build:
 

diff --git a/docs/index.adoc b/docs/index.adoc
@@ -11,6 +11,7 @@ USC Information Sciences Institute  Natural Language Group
 :toc: left
 //injects google analytics to <head>
 :docinfo2:
+:icons: font
 :hide-uri-scheme:
 :source-highlighter: rouge
 
@@ -31,6 +32,10 @@ include::45-scaling.adoc[]
 
 include::50-serve.adoc[]
 
+include::60-develop.adoc[]
 
+:!sectnums:
 
-include::60-develop.adoc[]
+== Acknowledgements
+
+include::../ACKNOWLEDGMENT[]