Update documentation

jbesomi · Jun 1, 2020 · c13cf75
1 parent a06ef61
commit c13cf75

Showing 36 changed files with 418 additions and 127 deletions.
diff --git a/docs/source/nlp.rst b/docs/source/nlp.rst
@@ -3,7 +3,6 @@
    .. autosummary::
       :toctree: api
 
-      dependency_parse
       named_entities
       noun_chunks
 

diff --git a/docs/source/preprocessing.rst b/docs/source/preprocessing.rst
@@ -18,10 +18,10 @@
       remove_square_brackets
       remove_stopwords
       remove_urls
+      replace_urls
       remove_whitespace
       replace_punctuation
       replace_stopwords
-      stem
       tokenize
 
 

diff --git a/website/docs/api-nlp.md b/website/docs/api-nlp.md
@@ -12,14 +12,11 @@ hide_title: false
 <col style="width: 90%"/>
 </colgroup>
 <tbody>
-<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.nlp.dependency_parse.html#texthero.nlp.dependency_parse" title="texthero.nlp.dependency_parse"><code class="xref py py-obj docutils literal notranslate"><span class="pre">dependency_parse</span></code></a>(s)</p></td>
-<td><p>Return the dependency parse</p></td>
-</tr>
-<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.nlp.named_entities.html#texthero.nlp.named_entities" title="texthero.nlp.named_entities"><code class="xref py py-obj docutils literal notranslate"><span class="pre">named_entities</span></code></a>(s[, package])</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.nlp.named_entities.html#texthero.nlp.named_entities" title="texthero.nlp.named_entities"><code class="xref py py-obj docutils literal notranslate"><span class="pre">named_entities</span></code></a>(s[, package])</p></td>
 <td><p>Return named-entities.</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.nlp.noun_chunks.html#texthero.nlp.noun_chunks" title="texthero.nlp.noun_chunks"><code class="xref py py-obj docutils literal notranslate"><span class="pre">noun_chunks</span></code></a>(s)</p></td>
-<td><p>Return noun_chunks, flat phrases that have a noun as their head.</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.nlp.noun_chunks.html#texthero.nlp.noun_chunks" title="texthero.nlp.noun_chunks"><code class="xref py py-obj docutils literal notranslate"><span class="pre">noun_chunks</span></code></a>(s)</p></td>
+<td><p>Return noun chunks, groups of consecutive words that belong together.</p></td>
 </tr>
 </tbody>
 </table>

diff --git a/website/docs/api-preprocessing.md b/website/docs/api-preprocessing.md
@@ -5,51 +5,45 @@ hide_title: false
 ---
 
 <div>
-<span class="target" id="module-texthero.preprocessing"></span><p>Preprocess text-based Pandas DataFrame</p>
-<div class="section" id="examples">
-<h1>Examples<a class="headerlink" href="#examples" title="Permalink to this headline">¶</a></h1>
-<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">texthero</span> <span class="k">as</span> <span class="nn">hero</span>
-<span class="gp">&gt;&gt;&gt; </span><span class="n">hero</span><span class="o">.</span><span class="n">pipeline</span><span class="p">()</span> <span class="o">...</span>
-</pre></div>
-</div>
+<span class="target" id="module-texthero.preprocessing"></span><p>The texthero.preprocessing module allows for efficient pre-processing of text-based Pandas Series and DataFrames.</p>
 <table class="longtable table">
 <colgroup>
 <col style="width: 10%"/>
 <col style="width: 90%"/>
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.clean.html#texthero.preprocessing.clean" title="texthero.preprocessing.clean"><code class="xref py py-obj docutils literal notranslate"><span class="pre">clean</span></code></a>(s[, pipeline])</p></td>
-<td><p>Clean pandas series by appling a preprocessing pipeline.</p></td>
+<td><p>Pre-process a text-based Pandas Series.</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.drop_no_content.html#texthero.preprocessing.drop_no_content" title="texthero.preprocessing.drop_no_content"><code class="xref py py-obj docutils literal notranslate"><span class="pre">drop_no_content</span></code></a>(s)</p></td>
-<td><p>Drop all rows where has_content is empty.</p></td>
+<td><p>Drop all rows without content.</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.get_default_pipeline.html#texthero.preprocessing.get_default_pipeline" title="texthero.preprocessing.get_default_pipeline"><code class="xref py py-obj docutils literal notranslate"><span class="pre">get_default_pipeline</span></code></a>()</p></td>
 <td><p>Return a list containing all the methods used in the default cleaning pipeline.</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.has_content.html#texthero.preprocessing.has_content" title="texthero.preprocessing.has_content"><code class="xref py py-obj docutils literal notranslate"><span class="pre">has_content</span></code></a>(s)</p></td>
-<td><p>For each row, check that there is content.</p></td>
+<td><p>Return a Boolean Pandas Series indicating whether each row has content.</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_angle_brackets.html#texthero.preprocessing.remove_angle_brackets" title="texthero.preprocessing.remove_angle_brackets"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_angle_brackets</span></code></a>(s)</p></td>
 <td><p>Remove content within angle brackets &lt;&gt; and the angle brackets.</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_brackets.html#texthero.preprocessing.remove_brackets" title="texthero.preprocessing.remove_brackets"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_brackets</span></code></a>(s)</p></td>
-<td><p>Remove content within brackets and the brackets.</p></td>
+<td><p>Remove content within brackets and the brackets themselves.</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_curly_brackets.html#texthero.preprocessing.remove_curly_brackets" title="texthero.preprocessing.remove_curly_brackets"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_curly_brackets</span></code></a>(s)</p></td>
 <td><p>Remove content within curly brackets {} and the curly brackets.</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_diacritics.html#texthero.preprocessing.remove_diacritics" title="texthero.preprocessing.remove_diacritics"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_diacritics</span></code></a>(input)</p></td>
-<td><p>Remove all diacritics.</p></td>
+<td><p>Remove all diacritics and accents.</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_digits.html#texthero.preprocessing.remove_digits" title="texthero.preprocessing.remove_digits"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_digits</span></code></a>(input[, only_blocks])</p></td>
-<td><p>Remove all digits from a series and replace it with an empty space.</p></td>
+<td><p>Remove all digits and replace them with a single space.</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_html_tags.html#texthero.preprocessing.remove_html_tags" title="texthero.preprocessing.remove_html_tags"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_html_tags</span></code></a>(s)</p></td>
 <td><p>Remove html tags from the given Pandas Series.</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_punctuation.html#texthero.preprocessing.remove_punctuation" title="texthero.preprocessing.remove_punctuation"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_punctuation</span></code></a>(input)</p></td>
-<td><p>Remove string.punctuation (!”#$%&amp;’()*+,-./:;&lt;=&gt;?@[]^_`{|}~).</p></td>
+<td><p>Replace all punctuation with a single space (" ").</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_round_brackets.html#texthero.preprocessing.remove_round_brackets" title="texthero.preprocessing.remove_round_brackets"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_round_brackets</span></code></a>(s)</p></td>
 <td><p>Remove content within parentheses () and parentheses.</p></td>
@@ -61,24 +55,23 @@ hide_title: false
 <td><p>Remove all instances of <cite>words</cite> and replace them with an empty space.</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_urls.html#texthero.preprocessing.remove_urls" title="texthero.preprocessing.remove_urls"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_urls</span></code></a>(s)</p></td>
-<td><p>Remove all urls from a given Series.</p></td>
+<td><p>Remove all urls from a given Pandas Series.</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_whitespace.html#texthero.preprocessing.remove_whitespace" title="texthero.preprocessing.remove_whitespace"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_whitespace</span></code></a>(input)</p></td>
-<td><p>Remove all extra white spaces between words.</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_urls.html#texthero.preprocessing.replace_urls" title="texthero.preprocessing.replace_urls"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_urls</span></code></a>(s, symbol)</p></td>
+<td><p>Replace all urls with the given symbol.</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_punctuation.html#texthero.preprocessing.replace_punctuation" title="texthero.preprocessing.replace_punctuation"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_punctuation</span></code></a>(input, symbol)</p></td>
-<td><p>Replace string.punctuation (!”#$%&amp;’()*+,-./:;&lt;=&gt;?@[]^_`{|}~) with symbol argument.</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_whitespace.html#texthero.preprocessing.remove_whitespace" title="texthero.preprocessing.remove_whitespace"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_whitespace</span></code></a>(input)</p></td>
+<td><p>Remove any extra white spaces.</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_stopwords.html#texthero.preprocessing.replace_stopwords" title="texthero.preprocessing.replace_stopwords"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_stopwords</span></code></a>(input, symbol, stopwords, …)</p></td>
-<td><p>Replace all stopwords with symbol.</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_punctuation.html#texthero.preprocessing.replace_punctuation" title="texthero.preprocessing.replace_punctuation"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_punctuation</span></code></a>(input, symbol)</p></td>
+<td><p>Replace all punctuation with a given symbol.</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.stem.html#texthero.preprocessing.stem" title="texthero.preprocessing.stem"><code class="xref py py-obj docutils literal notranslate"><span class="pre">stem</span></code></a>(input[, stem, language])</p></td>
-<td><p>Stem series using either ‘porter’ or ‘snowball’ NLTK stemmers.</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_stopwords.html#texthero.preprocessing.replace_stopwords" title="texthero.preprocessing.replace_stopwords"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_stopwords</span></code></a>(input, symbol, stopwords, …)</p></td>
+<td><p>Replace all stopwords with symbol.</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.tokenize.html#texthero.preprocessing.tokenize" title="texthero.preprocessing.tokenize"><code class="xref py py-obj docutils literal notranslate"><span class="pre">tokenize</span></code></a>(s)</p></td>
 <td><p>Tokenize each row of the given Series.</p></td>
 </tr>
 </tbody>
 </table>
-</div>
 </div>
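
Reviewer note on `replace_urls(s, symbol)`: the new table entry only says that URLs are replaced with the given symbol. A minimal sketch of that behaviour follows, using only the standard library and plain strings instead of a Pandas Series. The regular expression and the name `replace_urls_sketch` are assumptions made for illustration; they are not taken from the texthero source.

```python
import re

# Assumed pattern: anything starting with http:// or https:// up to the
# next whitespace. The pattern used by texthero itself may differ.
URL_PATTERN = re.compile(r"https?://\S+")


def replace_urls_sketch(texts, symbol):
    """Replace every URL in each string of `texts` with `symbol`."""
    return [URL_PATTERN.sub(symbol, text) for text in texts]


cleaned = replace_urls_sketch(["Go to https://example.com now"], "<URL>")
print(cleaned)
```

With a Pandas Series, the same substitution could probably be expressed through `Series.str.replace` with a regular expression, but that is not shown in this diff.
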
diff --git a/website/docs/api-representation.md b/website/docs/api-representation.md
@@ -25,16 +25,16 @@ hide_title: false
 <td><p>Perform non-negative matrix factorization.</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.representation.pca.html#texthero.representation.pca" title="texthero.representation.pca"><code class="xref py py-obj docutils literal notranslate"><span class="pre">pca</span></code></a>(s[, n_components])</p></td>
-<td><p>Perform PCA.</p></td>
+<td><p>Perform principal component analysis on the given Pandas Series.</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.representation.term_frequency.html#texthero.representation.term_frequency" title="texthero.representation.term_frequency"><code class="xref py py-obj docutils literal notranslate"><span class="pre">term_frequency</span></code></a>(s[, max_features, lowercase, …])</p></td>
-<td><p>Represent input on term frequency.</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.representation.term_frequency.html#texthero.representation.term_frequency" title="texthero.representation.term_frequency"><code class="xref py py-obj docutils literal notranslate"><span class="pre">term_frequency</span></code></a>(s, max_features, NoneType] = None)</p></td>
+<td><p>Represent a text-based Pandas Series using term_frequency.</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.representation.tfidf.html#texthero.representation.tfidf" title="texthero.representation.tfidf"><code class="xref py py-obj docutils literal notranslate"><span class="pre">tfidf</span></code></a>(s[, max_features, min_df, …])</p></td>
-<td><p>Represent input on a TF-IDF vector space.</p></td>
+<td><p>Represent a text-based Pandas Series using TF-IDF.</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="api/texthero.representation.tsne.html#texthero.representation.tsne" title="texthero.representation.tsne"><code class="xref py py-obj docutils literal notranslate"><span class="pre">tsne</span></code></a>(s[, n_components, perplexity, …])</p></td>
-<td><p>Perform TSNE.</p></td>
+<td><p>Perform TSNE on the given Pandas Series.</p></td>
 </tr>
 </tbody>
 </table>

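Reviewer note on `term_frequency`: the updated description says the function represents a text-based Series using term frequency. As a rough illustration of the idea, and not of the library's implementation, the sketch below counts terms per document over a shared vocabulary. Whitespace tokenization, lowercasing, and the name `term_frequency_sketch` are assumptions made for this example.

```python
from collections import Counter


def term_frequency_sketch(docs, max_features=None):
    """Return (vocabulary, rows), where each row holds per-document term counts."""
    # Assumption: lowercase and split on whitespace to obtain tokens.
    tokenized = [doc.lower().split() for doc in docs]
    # Count terms over the whole corpus to choose the vocabulary.
    totals = Counter(token for tokens in tokenized for token in tokens)
    # Keep the `max_features` most frequent terms (all terms when None),
    # then sort them so the column order is stable.
    vocabulary = sorted(term for term, _ in totals.most_common(max_features))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


vocabulary, rows = term_frequency_sketch(["Sentence one", "Sentence two"])
print(vocabulary)
print(rows)
```

Each row lines up with `vocabulary`, so a document is represented by how often each vocabulary term occurs in it.
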
diff --git a/website/docs/api/texthero.nlp.dependency_parse.md b/website/docs/api/texthero.nlp.dependency_parse.md
@@ -7,10 +7,5 @@ hide_title: true
 <div>
 <div class="section" id="texthero-nlp-dependency-parse">
 <h1>texthero.nlp.dependency_parse<a class="headerlink" href="#texthero-nlp-dependency-parse" title="Permalink to this headline">¶</a></h1>
-<dl class="py function">
-<dt id="texthero.nlp.dependency_parse">
-<code class="sig-name descname">dependency_parse</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">s</span></em><span class="sig-paren">)</span><a class="headerlink" href="#texthero.nlp.dependency_parse" title="Permalink to this definition">¶</a></dt>
-<dd><p>Return the dependency parse</p>
-</dd></dl>
 </div>
 </div>
diff --git a/website/docs/api/texthero.nlp.named_entities.md b/website/docs/api/texthero.nlp.named_entities.md
@@ -11,27 +11,40 @@ hide_title: true
 <dt id="texthero.nlp.named_entities">
 <code class="sig-name descname">named_entities</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">s</span></em>, <em class="sig-param"><span class="n">package</span><span class="o">=</span><span class="default_value">'spacy'</span></em><span class="sig-paren">)</span><a class="headerlink" href="#texthero.nlp.named_entities" title="Permalink to this definition">¶</a></dt>
 <dd><p>Return named-entities.</p>
-<p>Use Spacy named-entity-recognition.</p>
-<blockquote>
-<div><p>PERSON: People, including fictional.
-NORP: Nationalities or religious or political groups.
-FAC: Buildings, airports, highways, bridges, etc.
-ORG: Companies, agencies, institutions, etc.
-GPE: Countries, cities, states.
-LOC: Non-GPE locations, mountain ranges, bodies of water.
-PRODUCT: Objects, vehicles, foods, etc. (Not services.)
-EVENT: Named hurricanes, battles, wars, sports events, etc.
-WORK_OF_ART: Titles of books, songs, etc.
-LAW: Named documents made into laws.
-LANGUAGE: Any named language.
-DATE: Absolute or relative dates or periods.
-TIME: Times smaller than a day.
-PERCENT: Percentage, including ”%“.
-MONEY: Monetary values, including unit.
-QUANTITY: Measurements, as of weight or distance.
-ORDINAL: “first”, “second”, etc.
-CARDINAL:       Numerals that do not fall under another type.</p>
-</div></blockquote>
+<p>Return a Pandas Series where each row contains a list of tuples describing the named entities found in that row.</p>
+<p>Tuple: (<cite>entity name</cite>, <cite>entity label</cite>, <cite>starting character</cite>, <cite>ending character</cite>)</p>
+<p>Under the hood, <cite>named_entities</cite> makes use of spaCy named-entity recognition.</p>
+<dl class="simple">
+<dt>List of labels:</dt><dd><ul class="simple">
+<li><p><cite>PERSON</cite>: People, including fictional.</p></li>
+<li><p><cite>NORP</cite>: Nationalities or religious or political groups.</p></li>
+<li><p><cite>FAC</cite>: Buildings, airports, highways, bridges, etc.</p></li>
+<li><p><cite>ORG</cite>: Companies, agencies, institutions, etc.</p></li>
+<li><p><cite>GPE</cite>: Countries, cities, states.</p></li>
+<li><p><cite>LOC</cite>: Non-GPE locations, mountain ranges, bodies of water.</p></li>
+<li><p><cite>PRODUCT</cite>: Objects, vehicles, foods, etc. (Not services.)</p></li>
+<li><p><cite>EVENT</cite>: Named hurricanes, battles, wars, sports events, etc.</p></li>
+<li><p><cite>WORK_OF_ART</cite>: Titles of books, songs, etc.</p></li>
+<li><p><cite>LAW</cite>: Named documents made into laws.</p></li>
+<li><p><cite>LANGUAGE</cite>: Any named language.</p></li>
+<li><p><cite>DATE</cite>: Absolute or relative dates or periods.</p></li>
+<li><p><cite>TIME</cite>: Times smaller than a day.</p></li>
+<li><p><cite>PERCENT</cite>: Percentage, including "%".</p></li>
+<li><p><cite>MONEY</cite>: Monetary values, including unit.</p></li>
+<li><p><cite>QUANTITY</cite>: Measurements, as of weight or distance.</p></li>
+<li><p><cite>ORDINAL</cite>: “first”, “second”, etc.</p></li>
+<li><p><cite>CARDINAL</cite>: Numerals that do not fall under another type.</p></li>
+</ul>
+</dd>
+</dl>
+<p class="rubric">Examples</p>
+<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">texthero</span> <span class="k">as</span> <span class="nn">hero</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="s2">"Yesterday I was in NY with Bill de Blasio"</span><span class="p">)</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="n">hero</span><span class="o">.</span><span class="n">named_entities</span><span class="p">(</span><span class="n">s</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
+<span class="go">[('Yesterday', 'DATE', 0, 9), ('NY', 'GPE', 19, 21), ('Bill de Blasio', 'PERSON', 27, 41)]</span>
+</pre></div>
+</div>
 </dd></dl>
 </div>
 </div>
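
Reviewer note on the `named_entities` tuple format: the doctest output added in this commit suggests that the starting and ending characters behave like a half-open Python slice into the original string. The sketch below checks that reading against the values copied from the doctest. It does not call texthero or spaCy, and the slice interpretation is an inference from those three tuples, not a documented guarantee.

```python
# Input string and result copied from the doctest in this commit.
text = "Yesterday I was in NY with Bill de Blasio"
entities = [
    ("Yesterday", "DATE", 0, 9),
    ("NY", "GPE", 19, 21),
    ("Bill de Blasio", "PERSON", 27, 41),
]

# Slice the original text with each (start, end) pair.
slices = [text[start:end] for _, _, start, end in entities]

for (name, label, start, end), piece in zip(entities, slices):
    # Each slice should reproduce the entity name from the tuple.
    print(f"{label:<7} [{start}:{end}] -> {piece!r} (matches name: {piece == name})")
```
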