Commit
tweets2013-ia dataset with TREC microblog 2013-14
seanmacavaney committed Mar 1, 2021
1 parent 7ef592b commit 158a0e5
Showing 12 changed files with 848 additions and 8 deletions.
18 changes: 18 additions & 0 deletions .github/workflows/verify_downloads.yml
Original file line number Diff line number Diff line change
@@ -409,6 +409,24 @@ jobs:
      run: |
        python -m test.downloads --filter "^trec-spanish/"
  tweets2013-ia:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Test
      env:
        IR_DATASETS_DL_DISABLE_PBAR: 'true'
      run: |
        python -m test.downloads --filter "^tweets2013-ia/"
  vaswani:
    runs-on: ubuntu-latest
    steps:
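
The `--filter` value in the Test step above looks like an anchored regular expression. As a rough sketch, assuming the filter is applied as a regex search against slash-separated names (an assumption; the `test.downloads` implementation is not shown in this commit view), the pattern would select only names under the `tweets2013-ia/` prefix. The `select` helper and the sample names are hypothetical and exist only for illustration.

```python
import re

# Pattern copied from the workflow step above.
FILTER = r"^tweets2013-ia/"


def select(names, pattern=FILTER):
    """Return the names matched by the pattern, using re.search."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Hypothetical names; the real keys checked by test.downloads may differ.
names = [
    "tweets2013-ia/trec-mb-2013",
    "tweets2013-ia/trec-mb-2014",
    "trec-spanish/trec3",
    "vaswani/docs",
]
selected = select(names)
```

Because the pattern ends with a slash, a bare name such as `tweets2013-ia` would not match under this assumption; only names with something after the prefix would.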
1 change: 1 addition & 0 deletions README.md
@@ -240,6 +240,7 @@ Available datasets include:
- [TREC Mandarin](https://ir-datasets.com/trec-mandarin.html)
- [TREC Robust 2004](https://ir-datasets.com/trec-robust04.html)
- [TREC Spanish](https://ir-datasets.com/trec-spanish.html)
- [Tweets 2013 (Internet Archive)](https://ir-datasets.com/tweets2013-ia.html)
- [Vaswani](https://ir-datasets.com/vaswani.html)
- [WikIR](https://ir-datasets.com/wikir.html)

3 changes: 3 additions & 0 deletions docs/master/index.html
@@ -195,6 +195,9 @@ <h2>Dataset Index</h2>
</tbody><tbody><tr><td><a style="font-weight: bold;" href="trec-spanish.html"><kbd>trec-spanish</kbd></a></li></td><td class="center"><span style="cursor: help;" title="docs available from LDC">⚠️</span></td><td class="center"></td><td class="center"></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
<tr><td><a href="trec-spanish.html#trec-spanish/trec3"><kbd><span class="prefix">trec-spanish</span>/trec3</kbd></a></td><td class="center"><span style="cursor: help;" title="docs available from LDC">⚠️</span></td><td class="center"><span style="cursor: help;" title="queries available as automatic download"></span></td><td class="center"><span style="cursor: help;" title="qrels available as automatic download"></span></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
<tr><td><a href="trec-spanish.html#trec-spanish/trec4"><kbd><span class="prefix">trec-spanish</span>/trec4</kbd></a></td><td class="center"><span style="cursor: help;" title="docs available from LDC">⚠️</span></td><td class="center"><span style="cursor: help;" title="queries available as automatic download"></span></td><td class="center"><span style="cursor: help;" title="qrels available as automatic download"></span></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
</tbody><tbody><tr><td><a style="font-weight: bold;" href="tweets2013-ia.html"><kbd>tweets2013-ia</kbd></a></li></td><td class="center"><span style="cursor: help;" title="docs available as automatic download"></span></td><td class="center"></td><td class="center"></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
<tr><td><a href="tweets2013-ia.html#tweets2013-ia/trec-mb-2013"><kbd><span class="prefix">tweets2013-ia</span>/trec-mb-2013</kbd></a></td><td class="center"><span style="cursor: help;" title="docs available as automatic download"></span></td><td class="center"><span style="cursor: help;" title="queries available as automatic download"></span></td><td class="center"><span style="cursor: help;" title="qrels available as automatic download"></span></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
<tr><td><a href="tweets2013-ia.html#tweets2013-ia/trec-mb-2014"><kbd><span class="prefix">tweets2013-ia</span>/trec-mb-2014</kbd></a></td><td class="center"><span style="cursor: help;" title="docs available as automatic download"></span></td><td class="center"><span style="cursor: help;" title="queries available as automatic download"></span></td><td class="center"><span style="cursor: help;" title="qrels available as automatic download"></span></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
</tbody><tbody><tr><td><a style="font-weight: bold;" href="vaswani.html"><kbd>vaswani</kbd></a></li></td><td class="center"><span style="cursor: help;" title="docs available as automatic download"></span></td><td class="center"><span style="cursor: help;" title="queries available as automatic download"></span></td><td class="center"><span style="cursor: help;" title="qrels available as automatic download"></span></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
</tbody><tbody><tr><td><a style="font-weight: bold;" href="wikir.html"><kbd>wikir</kbd></a></li></td><td class="center"></td><td class="center"></td><td class="center"></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
<tr><td><a href="wikir.html#wikir/en1k"><kbd><span class="prefix">wikir</span>/en1k</kbd></a></td><td class="center"><span style="cursor: help;" title="docs available as automatic download"></span></td><td class="center"></td><td class="center"></td><td class="center screen-small-hide"></td><td class="center screen-small-hide"></td></tr>
244 changes: 244 additions & 0 deletions docs/master/tweets2013-ia.html
@@ -0,0 +1,244 @@
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="main.css" />
<script src="https://code.jquery.com/jquery-1.12.4.min.js" integrity="sha256-ZosEbRLbNQzLpnKIkEdrPv7lOy9C27hHQ+Xp8a4MxAQ=" crossorigin="anonymous"></script>
<script src="https://code.jquery.com/ui/1.12.1/jquery-ui.min.js" integrity="sha256-VazP97ZCwtekAsvgPBSUwPFKdrwD3unUfSGVYrahUqU=" crossorigin="anonymous"></script>
<script src="main.js"></script>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="robots" content="noindex,nofollow" />
<title>Tweets 2013 (Internet Archive) - ir_datasets</title>
</head>
<body>
<div class="page">

<div class="banner">This documentation is for <strong>master</strong>. See <a href="../tweets2013-ia.html">here</a> for documentation of the latest version on PyPI.</div>

<div style="position: absolute; top: 4px; left: 4px;"><a href="index.html">&larr; home</a></div>

<div style="position: absolute; top: 4px; right: 4px;">GitHub: <a href="https://github.com/allenai/ir_datasets/blob/master/ir_datasets/datasets/tweets2013_ia.py">datasets/tweets2013_ia.py</a></div>
<h1><code>ir_datasets</code>: Tweets 2013 (Internet Archive)</h1>
<div style="font-weight: bold; font-size: 1.1em;">Index</div>
<ol class="index">
<li><a href="#tweets2013-ia"><kbd>tweets2013-ia</kbd></a></li>
<li><a href="#tweets2013-ia/trec-mb-2013"><kbd><span class="prefix">tweets2013-ia</span>/trec-mb-2013</kbd></a></li>
<li><a href="#tweets2013-ia/trec-mb-2014"><kbd><span class="prefix">tweets2013-ia</span>/trec-mb-2014</kbd></a></li>
</ol>
<hr />
<div class="dataset" id="tweets2013-ia">
<h3><kbd class="select"><span class="str">"tweets2013-ia"</span></kbd></h3>

<div class="desc">
<p> A collection of tweets from a 2-month window archived by the Internet Archive. This collection can serve as a stand-in document collection for the TREC Microblog 2013-14 tasks. (Even though it is not exactly the same collection, <a href="https://cs.uwaterloo.ca/~jimmylin/publications/Sequiera_Lin_SIGIR2017.pdf">Sequiera and Lin</a> show that it is close enough.) </p> <p> This collection is automatically downloaded from the Internet Archive, though download speeds are often slow, so the download can take some time. ir_datasets constructs a new directory hierarchy during the download process to facilitate fast lookups and slices. </p> <ul> <li>Documents: Tweets</li> <li><a href="https://cs.uwaterloo.ca/~jimmylin/publications/Sequiera_Lin_SIGIR2017.pdf">Information about dataset (paper)</a></li> <li><a href="https://github.com/castorini/Tweets2013-IA">Information about dataset (repository)</a></li> </ul>
</div>
<div class="tabs">
<a class="tab" target="tweets2013-ia__docs">docs</a>
<div id="tweets2013-ia__docs" class="tab-content">
<p>Language: <em>multiple/other/unknown</em></p>
<div>Document type:</div>
<div class="type">
<div class="type-name">TweetDoc: (<span class="kwd">namedtuple</span>)</div>
<ol class="type-fields">
<li data-tuple-idx="0"><span class="">doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="1"><span class="">text</span>: <span class="kwd">str</span></li><li data-tuple-idx="2"><span class="">user_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="3"><span class="">created_at</span>: <span class="kwd">str</span></li><li data-tuple-idx="4"><span class="">lang</span>: <span class="kwd">str</span></li><li data-tuple-idx="5"><span class="">reply_doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="6"><span class="">retweet_doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="7"><span class="">source</span>: <span class="kwd">bytes</span></li><li data-tuple-idx="8"><span class="">source_content_type</span>: <span class="kwd">str</span></li>
</ol>
</div>
<p>Example</p>
<code class="example">
<div><span class="kwd">import</span> ir_datasets</div>
<div>dataset = ir_datasets.load(<span class="str">'tweets2013-ia'</span>)</div>
<div><span class="kwd">for</span> doc <span class="kwd">in</span> dataset.docs_iter():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;doc <span class="comment"># namedtuple&lt;doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type&gt;</span></div>
</code>
</div>

<a class="tab" target="tweets2013-ia__citation">Citation</a>
<div id="tweets2013-ia__citation" class="tab-content">
bibtex:
<cite class="select">@inproceedings{Sequiera2017Finally,
title={Finally, a Downloadable Test Collection of Tweets},
author={Royal Sequiera and Jimmy Lin},
booktitle={SIGIR},
year={2017}
}
</cite>
</div>
</div>
</div>

<hr />
<div class="dataset" id="tweets2013-ia/trec-mb-2013" data-parent="tweets2013-ia">
<h3><kbd class="ds-name select"><span class="str">"tweets2013-ia/trec-mb-2013"</span></kbd></h3>

<div class="desc">
<p> TREC Microblog 2013 test collection. </p> <ul> <li><a href="https://trec.nist.gov/pubs/trec22/papers/MB.OVERVIEW.pdf">Shared Task Paper</a></li> <li><a href="https://github.com/lintool/twitter-tools/wiki/TREC-2013-Track-Guidelines">Shared Task Site</a></li> </ul>
</div>
<div class="tabs">
<a class="tab" target="tweets2013-ia/trec-mb-2013__queries">queries</a>
<div id="tweets2013-ia/trec-mb-2013__queries" class="tab-content">
<p>Language: <span class="lang-code">en</span></p>
<div>Query type:</div>
<div class="type">
<div class="type-name">TrecMb13Query: (<span class="kwd">namedtuple</span>)</div>
<ol class="type-fields">
<li data-tuple-idx="0"><span class="">query_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="1"><span class="">query</span>: <span class="kwd">str</span></li><li data-tuple-idx="2"><span class="">time</span>: <span class="kwd">str</span></li><li data-tuple-idx="3"><span class="">tweet_time</span>: <span class="kwd">str</span></li>
</ol>
</div>
<p>Example</p>
<code class="example">
<div><span class="kwd">import</span> ir_datasets</div>
<div>dataset = ir_datasets.load(<span class="str">'tweets2013-ia/trec-mb-2013'</span>)</div>
<div><span class="kwd">for</span> query <span class="kwd">in</span> dataset.queries_iter():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;query <span class="comment"># namedtuple&lt;query_id, query, time, tweet_time&gt;</span></div>
</code>
</div>

<a class="tab" target="tweets2013-ia/trec-mb-2013__docs">docs</a>
<div id="tweets2013-ia/trec-mb-2013__docs" class="tab-content">
<p>Language: <em>multiple/other/unknown</em></p>
<div>Document type:</div>
<div class="type">
<div class="type-name">TweetDoc: (<span class="kwd">namedtuple</span>)</div>
<ol class="type-fields">
<li data-tuple-idx="0"><span class="">doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="1"><span class="">text</span>: <span class="kwd">str</span></li><li data-tuple-idx="2"><span class="">user_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="3"><span class="">created_at</span>: <span class="kwd">str</span></li><li data-tuple-idx="4"><span class="">lang</span>: <span class="kwd">str</span></li><li data-tuple-idx="5"><span class="">reply_doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="6"><span class="">retweet_doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="7"><span class="">source</span>: <span class="kwd">bytes</span></li><li data-tuple-idx="8"><span class="">source_content_type</span>: <span class="kwd">str</span></li>
</ol>
</div>
<p>Example</p>
<code class="example">
<div><span class="kwd">import</span> ir_datasets</div>
<div>dataset = ir_datasets.load(<span class="str">'tweets2013-ia/trec-mb-2013'</span>)</div>
<div><span class="kwd">for</span> doc <span class="kwd">in</span> dataset.docs_iter():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;doc <span class="comment"># namedtuple&lt;doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type&gt;</span></div>
</code>
</div>

<a class="tab" target="tweets2013-ia/trec-mb-2013__qrels">qrels</a>
<div id="tweets2013-ia/trec-mb-2013__qrels" class="tab-content">
<div>Query relevance judgment type:</div>
<div class="type">
<div class="type-name">TrecQrel: (<span class="kwd">namedtuple</span>)</div>
<ol class="type-fields">
<li data-tuple-idx="0"><span class="">query_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="1"><span class="">doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="2"><span class="">relevance</span>: <span class="kwd">int</span></li><li data-tuple-idx="3"><span class="">iteration</span>: <span class="kwd">str</span></li>
</ol>
</div>
<p>Relevance levels</p>

<table>
<tr><th>Rel.</th><th>Definition</th></tr>
<tr><td class="relScore">0</td><td>not relevant</td></tr>
<tr><td class="relScore">1</td><td>relevant</td></tr>
<tr><td class="relScore">2</td><td>highly relevant</td></tr>
</table>

<p>Example</p>
<code class="example">
<div><span class="kwd">import</span> ir_datasets</div>
<div>dataset = ir_datasets.load(<span class="str">'tweets2013-ia/trec-mb-2013'</span>)</div>
<div><span class="kwd">for</span> qrel <span class="kwd">in</span> dataset.qrels_iter():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;qrel <span class="comment"># namedtuple&lt;query_id, doc_id, relevance, iteration&gt;</span></div>
</code>
</div>

<a class="tab" target="tweets2013-ia/trec-mb-2013__citation">Citation</a>
<div id="tweets2013-ia/trec-mb-2013__citation" class="tab-content">
bibtex:
<cite class="select">@inproceedings{Lin2013Microblog,
title={Overview of the TREC-2013 Microblog Track},
author={Jimmy Lin and Miles Efron},
booktitle={TREC},
year={2013}
}
</cite>
</div>
</div>
</div>

<hr />
<div class="dataset" id="tweets2013-ia/trec-mb-2014" data-parent="tweets2013-ia">
<h3><kbd class="ds-name select"><span class="str">"tweets2013-ia/trec-mb-2014"</span></kbd></h3>

<div class="desc">
<p> TREC Microblog 2014 test collection. </p> <ul> <li><a href="https://trec.nist.gov/pubs/trec23/papers/overview-microblog.pdf">Shared Task Paper</a></li> <li><a href="https://github.com/lintool/twitter-tools/wiki/TREC-2014-Track-Guidelines">Shared Task Site</a></li> </ul>
</div>
<div class="tabs">
<a class="tab" target="tweets2013-ia/trec-mb-2014__queries">queries</a>
<div id="tweets2013-ia/trec-mb-2014__queries" class="tab-content">
<p>Language: <span class="lang-code">en</span></p>
<div>Query type:</div>
<div class="type">
<div class="type-name">TrecMb14Query: (<span class="kwd">namedtuple</span>)</div>
<ol class="type-fields">
<li data-tuple-idx="0"><span class="">query_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="1"><span class="">query</span>: <span class="kwd">str</span></li><li data-tuple-idx="2"><span class="">time</span>: <span class="kwd">str</span></li><li data-tuple-idx="3"><span class="">tweet_time</span>: <span class="kwd">str</span></li><li data-tuple-idx="4"><span class="">description</span>: <span class="kwd">str</span></li>
</ol>
</div>
<p>Example</p>
<code class="example">
<div><span class="kwd">import</span> ir_datasets</div>
<div>dataset = ir_datasets.load(<span class="str">'tweets2013-ia/trec-mb-2014'</span>)</div>
<div><span class="kwd">for</span> query <span class="kwd">in</span> dataset.queries_iter():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;query <span class="comment"># namedtuple&lt;query_id, query, time, tweet_time, description&gt;</span></div>
</code>
</div>

<a class="tab" target="tweets2013-ia/trec-mb-2014__docs">docs</a>
<div id="tweets2013-ia/trec-mb-2014__docs" class="tab-content">
<p>Language: <em>multiple/other/unknown</em></p>
<div>Document type:</div>
<div class="type">
<div class="type-name">TweetDoc: (<span class="kwd">namedtuple</span>)</div>
<ol class="type-fields">
<li data-tuple-idx="0"><span class="">doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="1"><span class="">text</span>: <span class="kwd">str</span></li><li data-tuple-idx="2"><span class="">user_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="3"><span class="">created_at</span>: <span class="kwd">str</span></li><li data-tuple-idx="4"><span class="">lang</span>: <span class="kwd">str</span></li><li data-tuple-idx="5"><span class="">reply_doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="6"><span class="">retweet_doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="7"><span class="">source</span>: <span class="kwd">bytes</span></li><li data-tuple-idx="8"><span class="">source_content_type</span>: <span class="kwd">str</span></li>
</ol>
</div>
<p>Example</p>
<code class="example">
<div><span class="kwd">import</span> ir_datasets</div>
<div>dataset = ir_datasets.load(<span class="str">'tweets2013-ia/trec-mb-2014'</span>)</div>
<div><span class="kwd">for</span> doc <span class="kwd">in</span> dataset.docs_iter():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;doc <span class="comment"># namedtuple&lt;doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type&gt;</span></div>
</code>
</div>

<a class="tab" target="tweets2013-ia/trec-mb-2014__qrels">qrels</a>
<div id="tweets2013-ia/trec-mb-2014__qrels" class="tab-content">
<div>Query relevance judgment type:</div>
<div class="type">
<div class="type-name">TrecQrel: (<span class="kwd">namedtuple</span>)</div>
<ol class="type-fields">
<li data-tuple-idx="0"><span class="">query_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="1"><span class="">doc_id</span>: <span class="kwd">str</span></li><li data-tuple-idx="2"><span class="">relevance</span>: <span class="kwd">int</span></li><li data-tuple-idx="3"><span class="">iteration</span>: <span class="kwd">str</span></li>
</ol>
</div>
<p>Relevance levels</p>

<table>
<tr><th>Rel.</th><th>Definition</th></tr>
<tr><td class="relScore">0</td><td>not relevant</td></tr>
<tr><td class="relScore">1</td><td>relevant</td></tr>
<tr><td class="relScore">2</td><td>highly relevant</td></tr>
</table>

<p>Example</p>
<code class="example">
<div><span class="kwd">import</span> ir_datasets</div>
<div>dataset = ir_datasets.load(<span class="str">'tweets2013-ia/trec-mb-2014'</span>)</div>
<div><span class="kwd">for</span> qrel <span class="kwd">in</span> dataset.qrels_iter():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;qrel <span class="comment"># namedtuple&lt;query_id, doc_id, relevance, iteration&gt;</span></div>
</code>
</div>

<a class="tab" target="tweets2013-ia/trec-mb-2014__citation">Citation</a>
<div id="tweets2013-ia/trec-mb-2014__citation" class="tab-content">
bibtex:
<cite class="select">@inproceedings{Lin2014Microblog,
title={Overview of the TREC-2014 Microblog Track},
author={Jimmy Lin and Miles Efron and Yulu Wang and Garrick Sherman},
booktitle={TREC},
year={2014}
}
</cite>
</div>
</div>
</div>

</div>
</body>
</html>
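
The documentation page above lists the `TrecQrel` fields and the relevance levels 0 (not relevant), 1 (relevant), and 2 (highly relevant). As a minimal sketch of how those judgments might be consumed, the snippet below groups judged tweet IDs by query at or above a chosen relevance level. The namedtuple is redefined locally so the example runs without downloading anything; the `relevant_docs_by_query` helper and the sample rows are hypothetical, not part of `ir_datasets`.

```python
from collections import defaultdict, namedtuple

# Field layout copied from the TrecQrel type listing in the page above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevant_docs_by_query(qrels, min_relevance=1):
    """Group doc_ids by query_id, keeping judgments >= min_relevance."""
    grouped = defaultdict(list)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id].append(qrel.doc_id)
    return dict(grouped)


# Hypothetical sample judgments, invented for illustration only.
sample_qrels = [
    TrecQrel("111", "1001", 2, "0"),
    TrecQrel("111", "1002", 0, "0"),
    TrecQrel("112", "1003", 1, "0"),
]

relevant = relevant_docs_by_query(sample_qrels)
highly_relevant = relevant_docs_by_query(sample_qrels, min_relevance=2)
```

With the library installed and the collection downloaded, the same helper could presumably be fed `ir_datasets.load('tweets2013-ia/trec-mb-2013').qrels_iter()`, as in the qrels example on the page.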
1 change: 1 addition & 0 deletions ir_datasets/datasets/__init__.py
@@ -21,5 +21,6 @@
from . import trec_arabic
from . import trec_mandarin
from . import trec_spanish
from . import tweets2013_ia
from . import vaswani
from . import wikir