bump version for 0.4.0 + update documentation
seanmacavaney committed Jun 4, 2021
1 parent 403f632 commit aeb0652
Showing 81 changed files with 34,861 additions and 254 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -153,6 +153,7 @@ Available datasets include:
- [TREC CAR](https://ir-datasets.com/car.html)
- [ClueWeb09](https://ir-datasets.com/clueweb09.html)
- [ClueWeb12](https://ir-datasets.com/clueweb12.html)
- [CLIRMatrix](https://ir-datasets.com/clirmatrix.html)
- [CodeSearchNet](https://ir-datasets.com/codesearchnet.html)
- [CORD-19](https://ir-datasets.com/cord19.html)
- [DPR Wiki100](https://ir-datasets.com/dpr-w100.html)
12 changes: 12 additions & 0 deletions docs/antique.html
@@ -25,6 +25,8 @@ <h1><code>ir_datasets</code>: ANTIQUE</h1>
<li><a href="#antique/train/split200-train"><kbd><span class="prefix">antique</span>/train/split200-train</kbd></a></li>
<li><a href="#antique/train/split200-valid"><kbd><span class="prefix">antique</span>/train/split200-valid</kbd></a></li>
</ol>
<div id="Downloads">
</div>
<hr />
<div class="dataset" id="antique">
<h3><kbd class="select"><span class="str">"antique"</span></kbd></h3>
@@ -446,6 +448,16 @@ <h3><kbd class="ds-name select"><span class="str">"antique/train/split200-valid"
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=antique'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>
17 changes: 17 additions & 0 deletions docs/aquaint.html
@@ -21,6 +21,13 @@ <h1><code>ir_datasets</code>: AQUAINT</h1>
<li><a href="#aquaint"><kbd>aquaint</kbd></a></li>
<li><a href="#aquaint/trec-robust-2005"><kbd><span class="prefix">aquaint</span>/trec-robust-2005</kbd></a></li>
</ol>
<div id="Downloads">
</div>

<div id="DataAccess">
<h3>Data Access Information</h3>
<p> To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium. The specific resource needed is <a href="https://catalog.ldc.upenn.edu/LDC2002T31">LDC2002T31</a>. </p> <p> Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details. </p> <p> The source file is: <kbd>aquaint_comp_LDC2002T31.tgz</kbd>. </p> <p> ir_datasets expects this file to be copied/linked in <kbd>~/.ir_datasets/aquaint/</kbd>. </p>
</div>
<hr />
<div class="dataset" id="aquaint">
<h3><kbd class="select"><span class="str">"aquaint"</span></kbd></h3>
@@ -150,6 +157,16 @@ <h3><kbd class="ds-name select"><span class="str">"aquaint/trec-robust-2005"</kd
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=aquaint'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>
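
The Data Access note added to docs/aquaint.html asks for the LDC archive to be copied or linked into `~/.ir_datasets/aquaint/`. A minimal sketch of the linking step follows. `link_source` is a hypothetical helper written for illustration and is not part of ir_datasets; the `home` parameter is likewise an assumption, added so the function can be tried against a scratch directory.

```python
from pathlib import Path


def link_source(source: Path, home: Path = Path.home()) -> Path:
    """Symlink a licensed source archive into <home>/.ir_datasets/aquaint/.

    Hypothetical helper for illustration; ir_datasets does not ship it.
    """
    target_dir = home / ".ir_datasets" / "aquaint"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    # Leave an existing file or link alone so repeated calls are harmless.
    if not (target.is_symlink() or target.exists()):
        target.symlink_to(source.resolve())
    return target
```

For example, `link_source(Path("/data/ldc/aquaint_comp_LDC2002T31.tgz"))` would create the link under the current user's home directory; the `/data/ldc` location is only a placeholder. Copying the file instead of linking it satisfies the same instruction.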
4,372 changes: 4,372 additions & 0 deletions docs/beir.html

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions docs/car.html
@@ -30,6 +30,8 @@ <h1><code>ir_datasets</code>: TREC CAR</h1>
<li><a href="#car/v1.5/trec-y1/auto"><kbd><span class="prefix">car</span>/v1.5/trec-y1/auto</kbd></a></li>
<li><a href="#car/v1.5/trec-y1/manual"><kbd><span class="prefix">car</span>/v1.5/trec-y1/manual</kbd></a></li>
</ol>
<div id="Downloads">
</div>
<hr />
<div class="dataset" id="car">
<h3><kbd class="select"><span class="str">"car"</span></kbd></h3>
@@ -743,6 +745,16 @@ <h3><kbd class="ds-name select"><span class="str">"car/v1.5/trec-y1/manual"</kdb
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=car'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>
42 changes: 42 additions & 0 deletions docs/clinicaltrials.html
@@ -24,7 +24,10 @@ <h1><code>ir_datasets</code>: Clinical Trials</h1>
<li><a href="#clinicaltrials/2017/trec-pm-2018"><kbd><span class="prefix">clinicaltrials</span>/2017/trec-pm-2018</kbd></a></li>
<li><a href="#clinicaltrials/2019"><kbd><span class="prefix">clinicaltrials</span>/2019</kbd></a></li>
<li><a href="#clinicaltrials/2019/trec-pm-2019"><kbd><span class="prefix">clinicaltrials</span>/2019/trec-pm-2019</kbd></a></li>
<li><a href="#clinicaltrials/2021"><kbd><span class="prefix">clinicaltrials</span>/2021</kbd></a></li>
</ol>
<div id="Downloads">
</div>
<hr />
<div class="dataset" id="clinicaltrials">
<h3><kbd class="select"><span class="str">"clinicaltrials"</span></kbd></h3>
@@ -354,6 +357,45 @@ <h3><kbd class="ds-name select"><span class="str">"clinicaltrials/2019/trec-pm-2
</div>
</div>

<hr />
<div class="dataset" id="clinicaltrials/2021" data-parent="clinicaltrials">
<h3><kbd class="ds-name select"><span class="str">"clinicaltrials/2021"</span></kbd></h3>

<div class="desc">
<p> A snapshot of <a href="https://clinicaltrials.gov/">ClinicalTrials.gov</a> from April 2021 for use with the <a href="http://www.trec-cds.org/2021.html">TREC Clinical Trials 2021 Track</a>. </p> <p> Queries for the TREC Clinical Trials 2021 Track will be released later. </p> <ul> <li><a href="http://www.trec-cds.org/2021.html#documents">Dataset information</a></li> </ul>
</div>
<div class="tabs">
<a class="tab" target="clinicaltrials/2021__docs">docs</a>
<div id="clinicaltrials/2021__docs" class="tab-content">
<p>Language: <span class="lang-code">en</span></p>
<div>Document type:</div>
<div class="type">
<div class="type-name">ClinicalTrialsDoc: (<span class="kwd">namedtuple</span>)</div>
<ol class="type-fields">
<li data-tuple-idx="0"><span class="">doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="1"><span class="">title</span>: <span class="kwd">str</span></li><li data-tuple-idx="2"><span class="">condition</span>: <span class="kwd">str</span></li><li data-tuple-idx="3"><span class="">summary</span>: <span class="kwd">str</span></li><li data-tuple-idx="4"><span class="">detailed_description</span>: <span class="kwd">str</span></li><li data-tuple-idx="5"><span class="">eligibility</span>: <span class="kwd">str</span></li>
</ol>
</div>
<p>Example</p>
<code class="example">
<div><span class="kwd">import</span> ir_datasets</div>
<div>dataset = ir_datasets.load(<span class="str">'clinicaltrials/2021'</span>)</div>
<div><span class="kwd">for</span> doc <span class="kwd">in</span> dataset.docs_iter():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;doc <span class="comment"># namedtuple&lt;doc_id, title, condition, summary, detailed_description, eligibility&gt;</span></div>
</code>
</div>
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=clinicaltrials'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>
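
The new `clinicaltrials/2021` entry documents its documents as a namedtuple with six string fields, each tagged with a `data-tuple-idx`. The sketch below shows what that layout implies for field access. `ClinicalTrialsDocSketch` is a local stand-in that only mirrors the documented field names and order; it is not the class ir_datasets returns, and the field values are made-up placeholders.

```python
from typing import NamedTuple


class ClinicalTrialsDocSketch(NamedTuple):
    """Stand-in mirroring the documented field order (not the library's class)."""
    doc_id: str
    title: str
    condition: str
    summary: str
    detailed_description: str
    eligibility: str


# Placeholder values, invented purely to demonstrate access patterns.
doc = ClinicalTrialsDocSketch(
    doc_id="placeholder-id",
    title="placeholder title",
    condition="placeholder condition",
    summary="placeholder summary",
    detailed_description="placeholder description",
    eligibility="placeholder eligibility",
)

# Fields can be read by name or by the tuple index listed in the docs.
assert doc.title == doc[1]
assert doc.eligibility == doc[5]
```

Access by name is usually the safer choice, since it keeps working if fields are ever appended to the tuple.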
67 changes: 67 additions & 0 deletions docs/clirmatrix.html
@@ -0,0 +1,67 @@
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="main.css" />
<script src="https://code.jquery.com/jquery-1.12.4.min.js" integrity="sha256-ZosEbRLbNQzLpnKIkEdrPv7lOy9C27hHQ+Xp8a4MxAQ=" crossorigin="anonymous"></script>
<script src="https://code.jquery.com/ui/1.12.1/jquery-ui.min.js" integrity="sha256-VazP97ZCwtekAsvgPBSUwPFKdrwD3unUfSGVYrahUqU=" crossorigin="anonymous"></script>
<script src="main.js"></script>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />

<title>CLIRMatrix - ir_datasets</title>
</head>
<body>
<div class="page">

<div style="position: absolute; top: 4px; left: 4px;"><a href="index.html">&larr; home</a></div>

<div style="position: absolute; top: 4px; right: 4px;">Github: <a href="https://github.com/allenai/ir_datasets/ir_datasets/datasets/clirmatrix.py">datasets/clirmatrix.py</a></div>
<h1><code>ir_datasets</code>: CLIRMatrix</h1>
<div style="font-weight: bold; font-size: 1.1em;">Index</div>
<ol class="index">
<li><a href="#clirmatrix"><kbd>clirmatrix</kbd></a></li>

</ol>
<div id="Downloads">
</div>
<hr />
<div class="dataset" id="clirmatrix">
<h3><kbd class="select"><span class="str">"clirmatrix"</span></kbd></h3>

<div class="desc">
<p> CLIRMatrix is a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. </p> <p> With 139 languages, there are 19,182 total language pairs. This is too many to list individually in the catalog, so patterns are instead used to match the dataset. </p> <p> <kbd class="str">"clirmatrix/{lang}"</kbd> (e.g., <kbd class="str">"clirmatrix/en"</kbd>): </p> <p> The document corpus for the given language. Documents are provided as <kbd class="kwd">GenericDoc</kbd>s. </p> <p> <kbd class="str">"clirmatrix/{doc_lang}/{bi139-base|bi139-full}/{query_lang}/{train|dev|test1|test2}"</kbd> (e.g., <kbd class="str">"clirmatrix/en/bi139-full/de/train"</kbd>): </p> <p> Documents are provided as <kbd class="kwd">GenericDoc</kbd>s, queries are provided as <kbd class="kwd">GenericQuery</kbd>s, and qrels are provided as <kbd class="kwd">TrecQrel</kbd>s. </p> <p> Supported languages are: af, als, am, an, ar, arz, ast, az, azb, ba, bar, be, bg, bn, bpy, br, bs, bug, ca, cdo, ce, ceb, ckb, cs, cv, cy, da, de, diq, el, eml, en, eo, es, et, eu, fa, fi, fo, fr, fy, ga, gd, gl, gu, he, hi, hr, hsb, ht, hu, hy, ia, id, ilo, io, is, it, ja, jv, ka, kk, kn, ko, ku, ky, la, lb, li, lmo, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, my, mzn, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pl, pms, pnb, ps, pt, qu, ro, ru, sa, sah, scn, sco, sd, sh, si, simple, sk, sl, sq, sr, su, sv, sw, szl, ta, te, tg, th, tl, tr, tt, uk, ur, uz, vec, vi, vo, wa, war, wuu, xmf, yi, yo, zh </p> <p> <kbd class="str">"clirmatrix/{doc_lang}/multi8/{query_lang}/{train|dev|test1|test2}"</kbd> (e.g., <kbd class="str">"clirmatrix/en/multi8/de/train"</kbd>): </p> <p> Documents are provided as <kbd class="kwd">GenericDoc</kbd>s, queries are provided as <kbd class="kwd">GenericQuery</kbd>s, and qrels are provided as <kbd class="kwd">TrecQrel</kbd>s. 
Supported languages are: ar, de, en, es, fr, ja, ru, zh </p> <ul> <li><a href="https://www.aclweb.org/anthology/2020.emnlp-main.340">Paper</a></li> <li><a href="http://www.cs.jhu.edu/~shuosun/clirmatrix/">Data Website</a></li> </ul>
</div>
<div class="tabs">
<a class="tab" target="clirmatrix__citation">Citation</a>
<div id="clirmatrix__citation" class="tab-content">
bibtex:
<cite class="select">@inproceedings{sun-duh-2020-clirmatrix,
title = "{CLIRM}atrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval",
author = "Sun, Shuo and
Duh, Kevin",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.340",
doi = "10.18653/v1/2020.emnlp-main.340",
pages = "4160--4170"
}
</cite>
</div>
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=clirmatrix'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>
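
Because docs/clirmatrix.html describes its datasets through ID patterns instead of listing them, a small builder can make the patterns concrete. This is a sketch under stated assumptions: `clirmatrix_id` is a hypothetical helper, not an ir_datasets function, and it only checks the subset, the split, and the `multi8` language restriction given in the description. It does not validate the 139 language codes accepted by the `bi139` subsets.

```python
BI139_SUBSETS = {"bi139-base", "bi139-full"}
MULTI8_LANGS = {"ar", "de", "en", "es", "fr", "ja", "ru", "zh"}
SPLITS = {"train", "dev", "test1", "test2"}


def clirmatrix_id(doc_lang: str, subset: str, query_lang: str, split: str) -> str:
    """Build an ID of the form clirmatrix/{doc_lang}/{subset}/{query_lang}/{split}.

    Hypothetical helper for illustration; not part of ir_datasets.
    """
    if subset not in BI139_SUBSETS | {"multi8"}:
        raise ValueError(f"unknown subset: {subset}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    if subset == "multi8":
        # multi8 is documented for eight languages only.
        for lang in (doc_lang, query_lang):
            if lang not in MULTI8_LANGS:
                raise ValueError(f"{lang} is not a multi8 language")
    return f"clirmatrix/{doc_lang}/{subset}/{query_lang}/{split}"


# 139 languages give 139 * 138 ordered pairs of distinct languages,
# which matches the 19,182 pairs quoted in the description.
assert 139 * 138 == 19_182
```

The resulting string, for instance `clirmatrix_id("en", "bi139-full", "de", "train")`, is the kind of ID the description expects to be passed to `ir_datasets.load`.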
17 changes: 17 additions & 0 deletions docs/clueweb09.html
@@ -40,6 +40,13 @@ <h1><code>ir_datasets</code>: ClueWeb09</h1>
<li><a href="#clueweb09/trec-mq-2009"><kbd><span class="prefix">clueweb09</span>/trec-mq-2009</kbd></a></li>
<li><a href="#clueweb09/zh"><kbd><span class="prefix">clueweb09</span>/zh</kbd></a></li>
</ol>
<div id="Downloads">
</div>

<div id="DataAccess">
<h3>Data Access Information</h3>
<p> To use this dataset, you need a copy of <a href="https://lemurproject.org/clueweb09/">ClueWeb 2009</a>, provided by CMU. </p> <p> Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational agreement" and pay a fee to CMU to get a copy. The data are provided as hard drives that are shipped to you. </p> <p> Once you have the data, ir_datasets will need the directories that look like the following: </p> <ul> <li><kbd>ClueWeb09_English_1</kbd></li> <li><kbd>ClueWeb09_English_2</kbd></li> <li><kbd>...</kbd></li> <li><kbd>ClueWeb09_Arabic_1</kbd></li> <li><kbd>...</kbd></li> </ul> <p> ir_datasets expects the above directories to be copied/linked under <kbd>~/.ir_datasets/clueweb09/corpus</kbd>. </p>
</div>
<hr />
<div class="dataset" id="clueweb09">
<h3><kbd class="select"><span class="str">"clueweb09"</span></kbd></h3>
@@ -1229,6 +1236,16 @@ <h3><kbd class="ds-name select"><span class="str">"clueweb09/zh"</kdb></h3>
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=clueweb09'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>
17 changes: 17 additions & 0 deletions docs/clueweb12.html
@@ -34,6 +34,13 @@ <h1><code>ir_datasets</code>: ClueWeb12</h1>
<li><a href="#clueweb12/trec-web-2013"><kbd><span class="prefix">clueweb12</span>/trec-web-2013</kbd></a></li>
<li><a href="#clueweb12/trec-web-2014"><kbd><span class="prefix">clueweb12</span>/trec-web-2014</kbd></a></li>
</ol>
<div id="Downloads">
</div>

<div id="DataAccess">
<h3>Data Access Information</h3>
<p> To use this dataset, you need a copy of <a href="https://lemurproject.org/clueweb12/">ClueWeb 2012</a>, provided by CMU. </p> <p> Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational agreement" and pay a fee to CMU to get a copy. The data are provided as hard drives that are shipped to you. </p> <p> Once you have the data, ir_datasets will need the directories that look like the following: </p> <ul> <li><kbd>ClueWeb12_00</kbd></li> <li><kbd>ClueWeb12_01</kbd></li> <li><kbd>...</kbd></li> </ul> <p> ir_datasets expects the above directories to be copied/linked under <kbd>~/.ir_datasets/clueweb12/corpus</kbd>. </p>
</div>
<hr />
<div class="dataset" id="clueweb12">
<h3><kbd class="select"><span class="str">"clueweb12"</span></kbd></h3>
@@ -1246,6 +1253,16 @@ <h3><kbd class="ds-name select"><span class="str">"clueweb12/trec-web-2014"</kdb
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=clueweb12'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>
12 changes: 12 additions & 0 deletions docs/codesearchnet.html
@@ -24,6 +24,8 @@ <h1><code>ir_datasets</code>: CodeSearchNet</h1>
<li><a href="#codesearchnet/train"><kbd><span class="prefix">codesearchnet</span>/train</kbd></a></li>
<li><a href="#codesearchnet/valid"><kbd><span class="prefix">codesearchnet</span>/valid</kbd></a></li>
</ol>
<div id="Downloads">
</div>
<hr />
<div class="dataset" id="codesearchnet">
<h3><kbd class="select"><span class="str">"codesearchnet"</span></kbd></h3>
@@ -360,6 +362,16 @@ <h3><kbd class="ds-name select"><span class="str">"codesearchnet/valid"</kdb></h
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=codesearchnet'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>
12 changes: 12 additions & 0 deletions docs/cord19.html
@@ -28,6 +28,8 @@ <h1><code>ir_datasets</code>: CORD-19</h1>
<li><a href="#cord19/trec-covid/round4"><kbd><span class="prefix">cord19</span>/trec-covid/round4</kbd></a></li>
<li><a href="#cord19/trec-covid/round5"><kbd><span class="prefix">cord19</span>/trec-covid/round5</kbd></a></li>
</ol>
<div id="Downloads">
</div>
<hr />
<div class="dataset" id="cord19">
<h3><kbd class="select"><span class="str">"cord19"</span></kbd></h3>
@@ -736,6 +738,16 @@ <h3><kbd class="ds-name select"><span class="str">"cord19/trec-covid/round5"</kd
</div>
</div>

<script type="text/javascript">
$(function () {
$.ajax({
'url': 'https://smac.pub/irdsdlc?ds=cord19'
}).done(function (data) {
$('#Downloads').append(generateDownloads('Downloadable content', data));
});
});
</script>

</div>
</body>
</html>