Update documentation
jbesomi committed Jun 1, 2020
1 parent a06ef61 commit c13cf75
Showing 36 changed files with 418 additions and 127 deletions.
1 change: 0 additions & 1 deletion docs/source/nlp.rst
@@ -3,7 +3,6 @@
.. autosummary::
:toctree: api

dependency_parse
named_entities
noun_chunks

2 changes: 1 addition & 1 deletion docs/source/preprocessing.rst
@@ -18,10 +18,10 @@
remove_square_brackets
remove_stopwords
remove_urls
replace_urls
remove_whitespace
replace_punctuation
replace_stopwords
stem
tokenize


9 changes: 3 additions & 6 deletions website/docs/api-nlp.md
@@ -12,14 +12,11 @@ hide_title: false
<col style="width: 90%"/>
</colgroup>
<tbody>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.nlp.dependency_parse.html#texthero.nlp.dependency_parse" title="texthero.nlp.dependency_parse"><code class="xref py py-obj docutils literal notranslate"><span class="pre">dependency_parse</span></code></a>(s)</p></td>
<td><p>Return the dependency parse</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.nlp.named_entities.html#texthero.nlp.named_entities" title="texthero.nlp.named_entities"><code class="xref py py-obj docutils literal notranslate"><span class="pre">named_entities</span></code></a>(s[, package])</p></td>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.nlp.named_entities.html#texthero.nlp.named_entities" title="texthero.nlp.named_entities"><code class="xref py py-obj docutils literal notranslate"><span class="pre">named_entities</span></code></a>(s[, package])</p></td>
<td><p>Return named-entities.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.nlp.noun_chunks.html#texthero.nlp.noun_chunks" title="texthero.nlp.noun_chunks"><code class="xref py py-obj docutils literal notranslate"><span class="pre">noun_chunks</span></code></a>(s)</p></td>
<td><p>Return noun_chunks, flat phrases that have a noun as their head.</p></td>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.nlp.noun_chunks.html#texthero.nlp.noun_chunks" title="texthero.nlp.noun_chunks"><code class="xref py py-obj docutils literal notranslate"><span class="pre">noun_chunks</span></code></a>(s)</p></td>
<td><p>Return noun_chunks, groups of consecutive words that belong together.</p></td>
</tr>
</tbody>
</table>
41 changes: 17 additions & 24 deletions website/docs/api-preprocessing.md
@@ -5,51 +5,45 @@ hide_title: false
---

<div>
<span class="target" id="module-texthero.preprocessing"></span><p>Preprocess text-based Pandas DataFrame</p>
<div class="section" id="examples">
<h1>Examples<a class="headerlink" href="#examples" title="Permalink to this headline">¶</a></h1>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">texthero</span> <span class="k">as</span> <span class="nn">hero</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">hero</span><span class="o">.</span><span class="n">pipeline</span><span class="p">()</span> <span class="o">...</span>
</pre></div>
</div>
<span class="target" id="module-texthero.preprocessing"></span><p>The texthero.preprocessing module allows for efficient pre-processing of text-based Pandas Series and DataFrames.</p>
<table class="longtable table">
<colgroup>
<col style="width: 10%"/>
<col style="width: 90%"/>
</colgroup>
<tbody>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.clean.html#texthero.preprocessing.clean" title="texthero.preprocessing.clean"><code class="xref py py-obj docutils literal notranslate"><span class="pre">clean</span></code></a>(s[, pipeline])</p></td>
<td><p>Clean pandas series by applying a preprocessing pipeline.</p></td>
<td><p>Pre-process a text-based Pandas Series.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.drop_no_content.html#texthero.preprocessing.drop_no_content" title="texthero.preprocessing.drop_no_content"><code class="xref py py-obj docutils literal notranslate"><span class="pre">drop_no_content</span></code></a>(s)</p></td>
<td><p>Drop all rows where has_content is empty.</p></td>
<td><p>Drop all rows without content.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.get_default_pipeline.html#texthero.preprocessing.get_default_pipeline" title="texthero.preprocessing.get_default_pipeline"><code class="xref py py-obj docutils literal notranslate"><span class="pre">get_default_pipeline</span></code></a>()</p></td>
<td><p>Return a list containing all the methods used in the default cleaning pipeline.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.has_content.html#texthero.preprocessing.has_content" title="texthero.preprocessing.has_content"><code class="xref py py-obj docutils literal notranslate"><span class="pre">has_content</span></code></a>(s)</p></td>
<td><p>For each row, check that there is content.</p></td>
<td><p>Return a Boolean Pandas Series indicating whether each row has content.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_angle_brackets.html#texthero.preprocessing.remove_angle_brackets" title="texthero.preprocessing.remove_angle_brackets"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_angle_brackets</span></code></a>(s)</p></td>
<td><p>Remove content within angle brackets &lt;&gt; and the angle brackets.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_brackets.html#texthero.preprocessing.remove_brackets" title="texthero.preprocessing.remove_brackets"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_brackets</span></code></a>(s)</p></td>
<td><p>Remove content within brackets and the brackets.</p></td>
<td><p>Remove content within brackets and the brackets themselves.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_curly_brackets.html#texthero.preprocessing.remove_curly_brackets" title="texthero.preprocessing.remove_curly_brackets"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_curly_brackets</span></code></a>(s)</p></td>
<td><p>Remove content within curly brackets {} and the curly brackets.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_diacritics.html#texthero.preprocessing.remove_diacritics" title="texthero.preprocessing.remove_diacritics"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_diacritics</span></code></a>(input)</p></td>
<td><p>Remove all diacritics.</p></td>
<td><p>Remove all diacritics and accents.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_digits.html#texthero.preprocessing.remove_digits" title="texthero.preprocessing.remove_digits"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_digits</span></code></a>(input[, only_blocks])</p></td>
<td><p>Remove all digits from a series and replace them with an empty space.</p></td>
<td><p>Remove all digits and replace them with a single space.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_html_tags.html#texthero.preprocessing.remove_html_tags" title="texthero.preprocessing.remove_html_tags"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_html_tags</span></code></a>(s)</p></td>
<td><p>Remove html tags from the given Pandas Series.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_punctuation.html#texthero.preprocessing.remove_punctuation" title="texthero.preprocessing.remove_punctuation"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_punctuation</span></code></a>(input)</p></td>
<td><p>Remove string.punctuation (!”#$%&amp;’()*+,-./:;&lt;=&gt;?@[]^_`{|}~).</p></td>
<td><p>Replace all punctuation with a single space (“ ”).</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_round_brackets.html#texthero.preprocessing.remove_round_brackets" title="texthero.preprocessing.remove_round_brackets"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_round_brackets</span></code></a>(s)</p></td>
<td><p>Remove content within parentheses () and parentheses.</p></td>
@@ -61,24 +55,23 @@ hide_title: false
<td><p>Remove all instances of <cite>words</cite> and replace them with an empty space.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_urls.html#texthero.preprocessing.remove_urls" title="texthero.preprocessing.remove_urls"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_urls</span></code></a>(s)</p></td>
<td><p>Remove all urls from a given Series.</p></td>
<td><p>Remove all urls from a given Pandas Series.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_whitespace.html#texthero.preprocessing.remove_whitespace" title="texthero.preprocessing.remove_whitespace"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_whitespace</span></code></a>(input)</p></td>
<td><p>Remove all extra white spaces between words.</p></td>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_urls.html#texthero.preprocessing.replace_urls" title="texthero.preprocessing.replace_urls"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_urls</span></code></a>(s, symbol)</p></td>
<td><p>Replace all urls with the given symbol.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_punctuation.html#texthero.preprocessing.replace_punctuation" title="texthero.preprocessing.replace_punctuation"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_punctuation</span></code></a>(input, symbol)</p></td>
<td><p>Replace string.punctuation (!”#$%&amp;’()*+,-./:;&lt;=&gt;?@[]^_`{|}~) with symbol argument.</p></td>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.remove_whitespace.html#texthero.preprocessing.remove_whitespace" title="texthero.preprocessing.remove_whitespace"><code class="xref py py-obj docutils literal notranslate"><span class="pre">remove_whitespace</span></code></a>(input)</p></td>
<td><p>Remove any extra white spaces.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_stopwords.html#texthero.preprocessing.replace_stopwords" title="texthero.preprocessing.replace_stopwords"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_stopwords</span></code></a>(input, symbol, stopwords, …)</p></td>
<td><p>Replace all stopwords with symbol.</p></td>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_punctuation.html#texthero.preprocessing.replace_punctuation" title="texthero.preprocessing.replace_punctuation"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_punctuation</span></code></a>(input, symbol)</p></td>
<td><p>Replace all punctuation with a given symbol.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.stem.html#texthero.preprocessing.stem" title="texthero.preprocessing.stem"><code class="xref py py-obj docutils literal notranslate"><span class="pre">stem</span></code></a>(input[, stem, language])</p></td>
<td><p>Stem series using either ‘porter’ or ‘snowball’ NLTK stemmers.</p></td>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.preprocessing.replace_stopwords.html#texthero.preprocessing.replace_stopwords" title="texthero.preprocessing.replace_stopwords"><code class="xref py py-obj docutils literal notranslate"><span class="pre">replace_stopwords</span></code></a>(input, symbol, stopwords, …)</p></td>
<td><p>Replace all stopwords with symbol.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.preprocessing.tokenize.html#texthero.preprocessing.tokenize" title="texthero.preprocessing.tokenize"><code class="xref py py-obj docutils literal notranslate"><span class="pre">tokenize</span></code></a>(s)</p></td>
<td><p>Tokenize each row of the given Series.</p></td>
</tr>
</tbody>
</table>
</div>
</div>
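
The `clean(s[, pipeline])` and `get_default_pipeline()` entries above describe a pipeline as a list of functions. The following is a minimal sketch of that idea only, assuming the steps are applied in order. It works on a plain Python string rather than a Pandas Series, and the names `strip_punctuation`, `collapse_whitespace`, `lowercase` and `run_pipeline` are hypothetical stand-ins, not texthero functions or texthero's implementation.

```python
import re
import string

# Hypothetical stand-ins for preprocessing steps; these are NOT texthero functions.
_PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]+")


def strip_punctuation(text: str) -> str:
    """Replace each run of punctuation characters with a single space."""
    return _PUNCTUATION.sub(" ", text)


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def lowercase(text: str) -> str:
    """Lowercase the text."""
    return text.lower()


def run_pipeline(text: str, pipeline) -> str:
    """Apply every step of the pipeline to the text, in list order."""
    for step in pipeline:
        text = step(text)
    return text


# An illustrative pipeline: order matters, since punctuation removal
# leaves extra spaces that the whitespace step then cleans up.
example_pipeline = [strip_punctuation, collapse_whitespace, lowercase]
cleaned = run_pipeline("Hello,   World!!", example_pipeline)
```

Keeping the pipeline as an ordinary list makes it easy to reorder, drop, or add steps without touching the function that runs them.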
10 changes: 5 additions & 5 deletions website/docs/api-representation.md
@@ -25,16 +25,16 @@ hide_title: false
<td><p>Perform non-negative matrix factorization.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.representation.pca.html#texthero.representation.pca" title="texthero.representation.pca"><code class="xref py py-obj docutils literal notranslate"><span class="pre">pca</span></code></a>(s[, n_components])</p></td>
<td><p>Perform PCA.</p></td>
<td><p>Perform principal component analysis on the given Pandas Series.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.representation.term_frequency.html#texthero.representation.term_frequency" title="texthero.representation.term_frequency"><code class="xref py py-obj docutils literal notranslate"><span class="pre">term_frequency</span></code></a>(s[, max_features, lowercase, …])</p></td>
<td><p>Represent input on term frequency.</p></td>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.representation.term_frequency.html#texthero.representation.term_frequency" title="texthero.representation.term_frequency"><code class="xref py py-obj docutils literal notranslate"><span class="pre">term_frequency</span></code></a>(s, max_features, NoneType] = None)</p></td>
<td><p>Represent a text-based Pandas Series using term_frequency.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="api/texthero.representation.tfidf.html#texthero.representation.tfidf" title="texthero.representation.tfidf"><code class="xref py py-obj docutils literal notranslate"><span class="pre">tfidf</span></code></a>(s[, max_features, min_df, …])</p></td>
<td><p>Represent input on a TF-IDF vector space.</p></td>
<td><p>Represent a text-based Pandas Series using TF-IDF.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="api/texthero.representation.tsne.html#texthero.representation.tsne" title="texthero.representation.tsne"><code class="xref py py-obj docutils literal notranslate"><span class="pre">tsne</span></code></a>(s[, n_components, perplexity, …])</p></td>
<td><p>Perform TSNE.</p></td>
<td><p>Perform t-SNE on the given Pandas Series.</p></td>
</tr>
</tbody>
</table>
5 changes: 0 additions & 5 deletions website/docs/api/texthero.nlp.dependency_parse.md
@@ -7,10 +7,5 @@ hide_title: true
<div>
<div class="section" id="texthero-nlp-dependency-parse">
<h1>texthero.nlp.dependency_parse<a class="headerlink" href="#texthero-nlp-dependency-parse" title="Permalink to this headline">¶</a></h1>
<dl class="py function">
<dt id="texthero.nlp.dependency_parse">
<code class="sig-name descname">dependency_parse</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">s</span></em><span class="sig-paren">)</span><a class="headerlink" href="#texthero.nlp.dependency_parse" title="Permalink to this definition">¶</a></dt>
<dd><p>Return the dependency parse</p>
</dd></dl>
</div>
</div>
55 changes: 34 additions & 21 deletions website/docs/api/texthero.nlp.named_entities.md
@@ -11,27 +11,40 @@ hide_title: true
<dt id="texthero.nlp.named_entities">
<code class="sig-name descname">named_entities</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">s</span></em>, <em class="sig-param"><span class="n">package</span><span class="o">=</span><span class="default_value">'spacy'</span></em><span class="sig-paren">)</span><a class="headerlink" href="#texthero.nlp.named_entities" title="Permalink to this definition">¶</a></dt>
<dd><p>Return named-entities.</p>
<p>Use Spacy named-entity-recognition.</p>
<blockquote>
<div><p>PERSON: People, including fictional.
NORP: Nationalities or religious or political groups.
FAC: Buildings, airports, highways, bridges, etc.
ORG: Companies, agencies, institutions, etc.
GPE: Countries, cities, states.
LOC: Non-GPE locations, mountain ranges, bodies of water.
PRODUCT: Objects, vehicles, foods, etc. (Not services.)
EVENT: Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW: Named documents made into laws.
LANGUAGE: Any named language.
DATE: Absolute or relative dates or periods.
TIME: Times smaller than a day.
PERCENT: Percentage, including “%”.
MONEY: Monetary values, including unit.
QUANTITY: Measurements, as of weight or distance.
ORDINAL: “first”, “second”, etc.
CARDINAL: Numerals that do not fall under another type.</p>
</div></blockquote>
<p>Return a Pandas Series where each row contains a list of tuples, one per named entity found in that row.</p>
<p>Tuple: (<cite>entity name</cite>, <cite>entity label</cite>, <cite>starting character</cite>, <cite>ending character</cite>)</p>
<p>Under the hood, <cite>named_entities</cite> makes use of spaCy named-entity recognition.</p>
<dl class="simple">
<dt>List of labels:</dt><dd><ul class="simple">
<li><p><cite>PERSON</cite>: People, including fictional.</p></li>
<li><p><cite>NORP</cite>: Nationalities or religious or political groups.</p></li>
<li><p><cite>FAC</cite>: Buildings, airports, highways, bridges, etc.</p></li>
<li><p><cite>ORG</cite>: Companies, agencies, institutions, etc.</p></li>
<li><p><cite>GPE</cite>: Countries, cities, states.</p></li>
<li><p><cite>LOC</cite>: Non-GPE locations, mountain ranges, bodies of water.</p></li>
<li><p><cite>PRODUCT</cite>: Objects, vehicles, foods, etc. (Not services.)</p></li>
<li><p><cite>EVENT</cite>: Named hurricanes, battles, wars, sports events, etc.</p></li>
<li><p><cite>WORK_OF_ART</cite>: Titles of books, songs, etc.</p></li>
<li><p><cite>LAW</cite>: Named documents made into laws.</p></li>
<li><p><cite>LANGUAGE</cite>: Any named language.</p></li>
<li><p><cite>DATE</cite>: Absolute or relative dates or periods.</p></li>
<li><p><cite>TIME</cite>: Times smaller than a day.</p></li>
<li><p><cite>PERCENT</cite>: Percentage, including “%”.</p></li>
<li><p><cite>MONEY</cite>: Monetary values, including unit.</p></li>
<li><p><cite>QUANTITY</cite>: Measurements, as of weight or distance.</p></li>
<li><p><cite>ORDINAL</cite>: “first”, “second”, etc.</p></li>
<li><p><cite>CARDINAL</cite>: Numerals that do not fall under another type.</p></li>
</ul>
</dd>
</dl>
<p class="rubric">Examples</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">texthero</span> <span class="k">as</span> <span class="nn">hero</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="s2">"Yesterday I was in NY with Bill de Blasio"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">hero</span><span class="o">.</span><span class="n">named_entities</span><span class="p">(</span><span class="n">s</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="go">[('Yesterday', 'DATE', 0, 9), ('NY', 'GPE', 19, 21), ('Bill de Blasio', 'PERSON', 27, 41)]</span>
</pre></div>
</div>
</dd></dl>
</div>
</div>
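
The tuple format documented above (entity name, entity label, starting character, ending character) can be consumed with plain Python. The sketch below does not call texthero; it starts from the output shown in the doctest for `named_entities` and groups entity names by label. The helper name `group_by_label` is hypothetical, and the sketch assumes the ending character offset is exclusive, which is consistent with the offsets in that doctest.

```python
from collections import defaultdict

# Input sentence and output copied from the named_entities doctest above.
text = "Yesterday I was in NY with Bill de Blasio"
entities = [
    ("Yesterday", "DATE", 0, 9),
    ("NY", "GPE", 19, 21),
    ("Bill de Blasio", "PERSON", 27, 41),
]


def group_by_label(entities):
    """Group entity names by label (hypothetical helper, not part of texthero)."""
    grouped = defaultdict(list)
    for name, label, _start, _end in entities:
        grouped[label].append(name)
    return dict(grouped)


grouped = group_by_label(entities)

# Slicing the original text with the offsets recovers each entity name,
# assuming the end offset is exclusive.
spans = [text[start:end] for _name, _label, start, end in entities]
```

Because the offsets index into the original string, they can also be used to highlight or mask entities in place rather than only listing them.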