Commit

CU-8693gd6c7 CU-86930zxq5 Move to medcat 1.10 (#19)
* CU-8693gd6c7: Move in-notebook medcat install versions to 1.10

* CU-86930zxq5: Add notes regarding loading config separately for newer CDBs

* CU-86930zxq5: Save and load config along with CDB along with notes where applicable

* CU-8693gd6c7: Undo unnecessary version-changes in notebooks

* CU-8693gd6c7: Remove use of deprecated multiprocessing methods and add notes regarding it
mart-r authored Jan 29, 2024
1 parent 6ea147c commit 53019eb
Showing 19 changed files with 110 additions and 65 deletions.
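
The save/load change described in the commit message can be sketched as a pair of small helpers. This is a minimal sketch and not part of the commit: the function names `save_cdb_with_config` and `load_cdb_with_config` are hypothetical, and only `cdb.save`, `cdb.config.save`, `CDB.load` and `cdb.load_config` are taken from the notebooks changed here. The file names follow the notebooks; skipping the config load when no `config.json` exists is an assumption meant to cover CDBs saved before medcat 1.10.

```python
import os


def save_cdb_with_config(cdb, data_dir):
    """Save the CDB and, separately, its config (needed from medcat 1.10 on)."""
    cdb_path = os.path.join(data_dir, "cdb.dat")
    config_path = os.path.join(data_dir, "config.json")
    cdb.save(cdb_path)
    # The config is no longer written along with the CDB, so save it explicitly.
    cdb.config.save(config_path)
    return cdb_path, config_path


def load_cdb_with_config(cdb_cls, data_dir):
    """Load a CDB, then load the separately saved config if one is present."""
    cdb = cdb_cls.load(os.path.join(data_dir, "cdb.dat"))
    config_path = os.path.join(data_dir, "config.json")
    # Assumption: an older CDB carries its own config, so a missing file is fine.
    if os.path.exists(config_path):
        cdb.load_config(config_path)
    return cdb
```

In a notebook this would presumably be called as `load_cdb_with_config(CDB, DATA_DIR)`, with `CDB` imported from `medcat.cdb`. Passing the class in rather than importing it keeps the helpers free of a hard dependency on a particular medcat version.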
Original file line number Diff line number Diff line change
Expand Up @@ -13095,7 +13095,7 @@ <h1 id="MedCAT-tutorial---logging-with-MedCAT">MedCAT tutorial - logging with Me
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Install medcat</span>
<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.9.3
<span class="o">!</span> pip install medcat~<span class="o">=</span><span class="m">1</span>.10.0
<span class="k">try</span><span class="p">:</span>
<span class="kn">from</span> <span class="nn">medcat.cat</span> <span class="kn">import</span> <span class="n">CAT</span>
<span class="k">except</span><span class="p">:</span>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
"outputs": [],
"source": [
"# Install medcat\n",
"! pip install medcat==1.9.3\n",
"! pip install medcat~=1.10.0\n",
"try:\n",
" from medcat.cat import CAT\n",
"except:\n",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13099,7 +13099,7 @@ <h3 id="First-we-need-to-install-MedCAT">First we need to install MedCAT<a class
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Install MedCAT</span>
<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.9.3
<span class="o">!</span> pip install medcat~<span class="o">=</span><span class="m">1</span>.10.0
<span class="c1"># Get the scispacy model</span>
<span class="o">!</span> python -m spacy download en_core_web_md
<span class="k">try</span><span class="p">:</span>
Expand Down Expand Up @@ -14264,6 +14264,9 @@ <h3 id="Save-the-Concept-Database-model">Save the Concept Database model<a class
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">cdb</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">DATA_DIR</span> <span class="o">+</span> <span class="s2">&quot;cdb.dat&quot;</span><span class="p">)</span>
<span class="c1"># NOTE: Starting from medcat 1.10 we&#39;re no longer saving the config</span>
<span class="c1"># along with the CDB. It&#39;ll need to be saved separately.</span>
<span class="n">cdb</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">DATA_DIR</span> <span class="o">+</span> <span class="s2">&quot;config.json&quot;</span><span class="p">)</span>
</pre></div>

</div>
Expand All @@ -14274,7 +14277,8 @@ <h3 id="Save-the-Concept-Database-model">Save the Concept Database model<a class
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Load-the-Concept-Database-model">Load the Concept Database model<a class="anchor-link" href="#Load-the-Concept-Database-model">&#182;</a></h3>
<h3 id="Load-the-Concept-Database-model">Load the Concept Database model<a class="anchor-link" href="#Load-the-Concept-Database-model">&#182;</a></h3><p>NOTE: For CDBs saved with <code>medcat</code> 1.10+, the <em>config</em> is saved separately and must therefore be loaded using the <code>cdb.load_config</code> method.</p>

</div>
</div>
</div>
Expand All @@ -14284,6 +14288,10 @@ <h3 id="Load-the-Concept-Database-model">Load the Concept Database model<a class
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">cdb</span> <span class="o">=</span> <span class="n">CDB</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">DATA_DIR</span> <span class="o">+</span> <span class="s2">&quot;cdb.dat&quot;</span><span class="p">)</span>
<span class="c1"># PS: If you&#39;ve saved the CDB in medcat 1.10+ you&#39;ll need to load the config separately</span>
<span class="c1"># If you&#39;re using an older CDB, it should load with the config and you can omit the</span>
<span class="c1"># next line</span>
<span class="n">cdb</span><span class="o">.</span><span class="n">load_config</span><span class="p">(</span><span class="n">DATA_DIR</span> <span class="o">+</span> <span class="s2">&quot;config.json&quot;</span><span class="p">)</span>
</pre></div>

</div>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -322,7 +322,7 @@
],
"source": [
"# Install MedCAT\n",
"! pip install medcat==1.9.3\n",
"! pip install medcat~=1.10.0\n",
"# Get the scispacy model\n",
"! python -m spacy download en_core_web_md\n",
"try:\n",
Expand Down Expand Up @@ -1136,7 +1136,10 @@
},
"outputs": [],
"source": [
"cdb.save(DATA_DIR + \"cdb.dat\")"
"cdb.save(DATA_DIR + \"cdb.dat\")\n",
"# NOTE: Starting from medcat 1.10 we're no longer saving the config\n",
"# along with the CDB. It'll need to be saved separately.\n",
"cdb.config.save(DATA_DIR + \"config.json\")"
]
},
{
Expand All @@ -1146,7 +1149,9 @@
"id": "97uiDwvAk7hc"
},
"source": [
"### Load the Concept Database model"
"### Load the Concept Database model\n",
"\n",
"NOTE: For CDBs saved with `medcat` 1.10+, the _config_ is saved separately and must therefore be loaded using the `cdb.load_config` method."
]
},
{
Expand All @@ -1169,7 +1174,11 @@
}
],
"source": [
"cdb = CDB.load(DATA_DIR + \"cdb.dat\")"
"cdb = CDB.load(DATA_DIR + \"cdb.dat\")\n",
"# PS: If you've saved the CDB in medcat 1.10+ you'll need to load the config separately\n",
"# If you're using an older CDB, it should load with the config and you can omit the\n",
"# next line\n",
"cdb.load_config(DATA_DIR + \"config.json\")"
]
},
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13092,7 +13092,7 @@ <h1 id="Now-let's-start-extracting-concepts-from-unstructured-text!">Now let's s
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Install medcat</span>
<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.9.3
<span class="o">!</span> pip install medcat~<span class="o">=</span><span class="m">1</span>.10.0
<span class="c1"># install seaborn</span>
<span class="o">!</span> pip install seaborn
<span class="k">try</span><span class="p">:</span>
Expand Down Expand Up @@ -14471,7 +14471,11 @@ <h2 id="Use-Multiprocessing">Use Multiprocessing<a class="anchor-link" href="#Us
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Let&#39;s test the multi processing function first</span>
<span class="n">in_data</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="s2">&quot;He was a diabetic patient&quot;</span><span class="p">)]</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="o">.</span><span class="n">multiprocessing</span><span class="p">(</span><span class="n">in_data</span><span class="p">,</span> <span class="n">nproc</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="c1"># NOTE: The method below has changed. The `multiprocessing` method has been</span>
<span class="c1"># deprecated and the following should be used instead. This is because</span>
<span class="c1"># there were multiple multiprocessing methods and there was no clear</span>
<span class="c1"># distinction between them.</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="o">.</span><span class="n">multiprocessing_batch_char_size</span><span class="p">(</span><span class="n">in_data</span><span class="p">,</span> <span class="n">nproc</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">results</span>
</pre></div>

Expand Down Expand Up @@ -14620,9 +14624,9 @@ <h2 id="Use-Multiprocessing">Use Multiprocessing<a class="anchor-link" href="#Us
<span class="n">batch_size_chars</span> <span class="o">=</span> <span class="mi">500000</span> <span class="c1"># Batch size (BS) in number of characters</span>

<span class="c1"># Run model</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="o">.</span><span class="n">multiprocessing</span><span class="p">(</span><span class="n">in_data</span><span class="p">,</span> <span class="c1"># Formatted data</span>
<span class="n">batch_size_chars</span> <span class="o">=</span> <span class="n">batch_size_chars</span><span class="p">,</span>
<span class="n">nproc</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="c1"># Number of processors</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="o">.</span><span class="n">multiprocessing_batch_char_size</span><span class="p">(</span><span class="n">in_data</span><span class="p">,</span> <span class="c1"># Formatted data</span>
<span class="n">batch_size_chars</span> <span class="o">=</span> <span class="n">batch_size_chars</span><span class="p">,</span>
<span class="n">nproc</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="c1"># Number of processors</span>
</pre></div>

</div>
Expand Down Expand Up @@ -14658,7 +14662,7 @@ <h2 id="Use-Multiprocessing">Use Multiprocessing<a class="anchor-link" href="#Us
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To batch on the number of documents, you can use <code>multiprocessing_pipe</code> alternatively, which also supports Windows platforms:</p>
<p>To batch on the number of documents, you can use <code>multiprocessing_batch_docs_size</code> alternatively, which also supports Windows platforms:</p>

</div>
</div>
Expand All @@ -14675,9 +14679,13 @@ <h2 id="Use-Multiprocessing">Use Multiprocessing<a class="anchor-link" href="#Us
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">&#39;__main__&#39;</span><span class="p">:</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="n">torch</span><span class="o">.</span><span class="n">multiprocessing</span><span class="o">.</span><span class="n">set_start_method</span><span class="p">(</span><span class="s1">&#39;spawn&#39;</span><span class="p">,</span> <span class="n">force</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="o">.</span><span class="n">multiprocessing_pipe</span><span class="p">(</span><span class="n">in_data</span><span class="p">,</span> <span class="c1"># Formatted data</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="n">batch_size</span><span class="p">,</span>
<span class="n">nproc</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># Increase it when having more cores available</span>
<span class="c1"># NOTE: The below method has changed. The `multiprocessing_pipe` method</span>
<span class="c1"># was deprecated due to being somewhat ambiguous as to when one</span>
<span class="c1"># should use it instead of the `multiprocessing` method. As such,</span>
<span class="c1"># the method defined below should be used</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="o">.</span><span class="n">multiprocessing_batch_docs_size</span><span class="p">(</span><span class="n">in_data</span><span class="p">,</span> <span class="c1"># Formatted data</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="n">batch_size</span><span class="p">,</span>
<span class="n">nproc</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># Increase it when having more cores available</span>
</pre></div>

</div>
Expand All @@ -14698,10 +14706,10 @@ <h2 id="Use-Multiprocessing">Use Multiprocessing<a class="anchor-link" href="#Us



<div id="8e896d85-31c1-4e32-bf26-3406878361c7"></div>
<div id="32d831b3-f19d-4057-9c5b-ec8b82685814"></div>
<div class="output_subarea output_widget_view ">
<script type="text/javascript">
var element = $('#8e896d85-31c1-4e32-bf26-3406878361c7');
var element = $('#32d831b3-f19d-4057-9c5b-ec8b82685814');
</script>
<script type="application/vnd.jupyter.widget-view+json">
{"model_id": "05b18c97da9d4d05b9280df006a5fb82", "version_major": 2, "version_minor": 0}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -275,7 +275,7 @@
],
"source": [
"# Install medcat\n",
"! pip install medcat==1.9.3\n",
"! pip install medcat~=1.10.0\n",
"# install seaborn\n",
"! pip install seaborn\n",
"try:\n",
Expand Down Expand Up @@ -1374,7 +1374,11 @@
"source": [
"# Let's test the multi processing function first\n",
"in_data = [(1, \"He was a diabetic patient\")]\n",
"results = cat.multiprocessing(in_data, nproc=2)\n",
"# NOTE: The method below has changed. The `multiprocessing` method has been\n",
"# deprecated and the following should be used instead. This is because\n",
"# there were multiple multiprocessing methods and there was no clear\n",
"# distinction between them.\n",
"results = cat.multiprocessing_batch_char_size(in_data, nproc=2)\n",
"results"
]
},
Expand Down Expand Up @@ -1501,9 +1505,9 @@
"batch_size_chars = 500000 # Batch size (BS) in number of characters\n",
"\n",
"# Run model\n",
"results = cat.multiprocessing(in_data, # Formatted data\n",
" batch_size_chars = batch_size_chars,\n",
" nproc=8) # Number of processors"
"results = cat.multiprocessing_batch_char_size(in_data, # Formatted data\n",
" batch_size_chars = batch_size_chars,\n",
" nproc=8) # Number of processors"
]
},
{
Expand All @@ -1513,7 +1517,7 @@
"id": "f-gRm2WKsq5n"
},
"source": [
"To batch on the number of documents, you can use `multiprocessing_pipe` alternatively, which also supports Windows platforms:"
"To batch on the number of documents, you can use `multiprocessing_batch_docs_size` alternatively, which also supports Windows platforms:"
]
},
{
Expand Down Expand Up @@ -1564,9 +1568,13 @@
"if __name__ == '__main__':\n",
" import torch\n",
" torch.multiprocessing.set_start_method('spawn', force=True)\n",
" results = cat.multiprocessing_pipe(in_data, # Formatted data\n",
" batch_size = batch_size,\n",
" nproc=2) # Increase it when having more cores available"
" # NOTE: The below method has changed. The `multiprocessing_pipe` method\n",
" # was deprecated due to being somewhat ambiguous as to when one\n",
" # should use it instead of the `multiprocessing` method. As such,\n",
" # the method defined below should be used\n",
" results = cat.multiprocessing_batch_docs_size(in_data, # Formatted data\n",
" batch_size = batch_size,\n",
" nproc=2) # Increase it when having more cores available"
]
},
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13092,7 +13092,7 @@ <h1 id="Now-let's-look-at-ways-to-optimise-the-model-for-our-specific-use-case">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Install medcat</span>
<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.9.3
<span class="o">!</span> pip install medcat~<span class="o">=</span><span class="m">1</span>.10.0
<span class="k">try</span><span class="p">:</span>
<span class="kn">from</span> <span class="nn">medcat.cat</span> <span class="kn">import</span> <span class="n">CAT</span>
<span class="k">except</span><span class="p">:</span>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -165,7 +165,7 @@
],
"source": [
"# Install medcat\n",
"! pip install medcat==1.9.3\n",
"! pip install medcat~=1.10.0\n",
"try:\n",
" from medcat.cat import CAT\n",
"except:\n",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13085,7 +13085,7 @@
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Install medcat</span>
<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.9.3
<span class="o">!</span> pip install medcat~<span class="o">=</span><span class="m">1</span>.10.0
<span class="k">try</span><span class="p">:</span>
<span class="kn">from</span> <span class="nn">medcat.cat</span> <span class="kn">import</span> <span class="n">CAT</span>
<span class="k">except</span><span class="p">:</span>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -265,7 +265,7 @@
],
"source": [
"# Install medcat\n",
"! pip install medcat==1.9.3\n",
"! pip install medcat~=1.10.0\n",
"try:\n",
" from medcat.cat import CAT\n",
"except:\n",
Expand Down
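
The multiprocessing renames in the hunks above (`multiprocessing` to `multiprocessing_batch_char_size`, and `multiprocessing_pipe` to `multiprocessing_batch_docs_size`) could be wrapped as follows for code that has to run against both older and newer installs. This is a sketch under stated assumptions: the helper `annotate_in_parallel` and its `by` parameter are hypothetical, and falling back to the old method names assumes that an older `CAT` object simply lacks the new ones.

```python
def annotate_in_parallel(cat, in_data, by="chars", nproc=2, **kwargs):
    """Call the renamed multiprocessing method, falling back to the old name.

    by="chars" batches on character count, by="docs" on document count.
    Extra keyword arguments (e.g. batch_size_chars or batch_size) are
    passed through unchanged.
    """
    if by == "chars":
        new_name, old_name = "multiprocessing_batch_char_size", "multiprocessing"
    elif by == "docs":
        new_name, old_name = "multiprocessing_batch_docs_size", "multiprocessing_pipe"
    else:
        raise ValueError(f"unknown batching mode: {by!r}")
    # Prefer the new name; the fallback to the deprecated one is an assumption.
    method = getattr(cat, new_name, None) or getattr(cat, old_name)
    return method(in_data, nproc=nproc, **kwargs)
```

With a loaded model this would be used as `annotate_in_parallel(cat, in_data, by="docs", batch_size=batch_size)`. As in the notebook, the document-batched call is best placed under an `if __name__ == '__main__':` guard with the `spawn` start method set.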
