---
layout: docs
---
<h1>
Where did this data come from?
</h1>
<p class="section-intro">
<em>Short answer</em>: we combine data from as many useful online sources
as you can tell us about, official and unofficial.
<em>Longer (technical) answer</em>: if you’re happy to read code, you can look in
the file <code>instructions.json</code> that EveryPolitician uses to retrieve
and collate the data — see <a href="#instructions-json">more about this below</a>.
</p>
<p>
We list the sources we’re using at the bottom of each term’s page
(for example, see “Main sources” right at the bottom of the data for the
<a href="http://everypolitician.org/australia/representatives/term-table/44.html">44th term</a>
of Australia’s House of Representatives).
</p>
<p>
Sometimes the data changes, of course. So we regularly rebuild the data from
those sources to keep EveryPolitician up to date.
</p>
<h2>About those sources</h2>
<p>
We aggregate from lots of different online sources, such as official
parliament sites and unofficial sites (including <a
href="https://www.wikidata.org">Wikidata</a>, which is the database on which
projects like Wikipedia are based).
<!-- FIXME: more about how this works-->
If you know of a good source that we’re not using:
<a href="/contribute.html">let us know</a>!
</p>
<p>
We merge our data from multiple sources because it's common for different
sources to provide different kinds of data (for example, one source might have
politicians' dates of birth, while another has their Twitter handles).
</p>
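<p>
As a rough illustration of that merging idea (not EveryPolitician's actual
code), the sketch below combines two hypothetical sources keyed by a shared
person ID. The record layout, the field names and the "earlier source wins"
rule are all assumptions made for the example.
</p>

```python
# Hypothetical records from two sources, keyed by a shared person ID.
births = {"p1": {"name": "Alice Example", "birth_date": "1970-01-01"}}
handles = {"p1": {"twitter": "alice_example"}, "p2": {"twitter": "bob_example"}}


def merge_sources(*sources):
    """Combine per-person records; earlier sources win on conflicting fields."""
    merged = {}
    for source in sources:
        for person_id, record in source.items():
            target = merged.setdefault(person_id, {})
            for field, value in record.items():
                # Keep a value already supplied by an earlier source.
                target.setdefault(field, value)
    return merged


people = merge_sources(births, handles)
```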
<p>
If any of the sources themselves have clear, consistent IDs, we try to
capture those (and include them in the <code>identifiers</code> field within
the JSON), because we know that sometimes it can be helpful to be able to map
back to the original data sets.
</p>
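<p>
If you want to map a person back to an original data set, something like the
following sketch may help. It assumes each person has an
<code>identifiers</code> list whose entries carry <code>scheme</code> and
<code>identifier</code> keys; check the JSON you download, because that layout
and the sample values here are assumptions.
</p>

```python
import json

# Hypothetical excerpt of a person record; the layout is an assumption.
SAMPLE_PERSON = json.loads("""
{
  "name": "Alice Example",
  "identifiers": [
    {"scheme": "wikidata", "identifier": "Q0000000"},
    {"scheme": "example_parliament", "identifier": "abc123"}
  ]
}
""")


def ids_by_scheme(person):
    """Return a {scheme: identifier} lookup for one person record."""
    return {
        entry["scheme"]: entry["identifier"]
        for entry in person.get("identifiers", [])
    }


lookup = ids_by_scheme(SAMPLE_PERSON)
```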
<p>
The data sources available vary immensely from country to country. And the
best people to ask for the best data sources are the locals: so if you know
of a good source that we’re not using in <em>your</em> country, let us know!
Just pointing out a source to us is helpful; you don’t have to do the hard
work of actually extracting the data.
</p>
<a name="instructions-json"></a>
<h2 id="the_technical_details">The technical details</h2>
<p>
You can see exactly where the data’s coming from by looking in the
<a href="https://github.com/everypolitician/everypolitician-data/">EveryPolitician
data repo</a>. Specifically, you want the <code>sources</code> directory for
the legislature you’re interested in. Look inside the
<code>instructions.json</code> there because that is the file EveryPolitician
uses to rebuild its data whenever something changes.
</p>
<p>
For example, the instructions (containing the explicit sources as well as
indications of how to process them) that EveryPolitician follows for putting
together its data for Australia’s House of Representatives are in this
<a href="https://github.com/everypolitician/everypolitician-data/blob/master/data/Australia/Representatives/sources/instructions.json"><code>instructions.json</code> file</a>.
</p>
<p>
Amongst other things, that file tells you the <em>type</em> of data it’s
getting as well as the URL of the <em>resource</em>. It’s common for the resource
itself to be the output of a process that is getting data from the “raw”
source — for example, in the case of Australia, two of the sources
(<a href="https://github.com/everypolitician/everypolitician-data/blob/720cf692889927aec962ff05b48bf2a9958df835/data/Australia/Representatives/sources/instructions.json#L78-L86">one</a>
for determining the terms available, and
<a href="https://github.com/everypolitician/everypolitician-data/blob/720cf692889927aec962ff05b48bf2a9958df835/data/Australia/Representatives/sources/instructions.json#L3-L12">another</a>
for the politicians’ names) are the output of a single webscraper running
here:
<code><a href="https://morph.io/tmtmtmtm/australia-openaustralia">morph.io/tmtmtmtm/australia-openaustralia</a></code>.
If you look at
<a href="https://github.com/tmtmtmtm/australia-openaustralia/blob/1184c4128321392b15c27a00bcb40862810646bc/scraper.rb#L102-L106">the scraper's source code</a>,
you can see that the scraper itself is getting data from
<a href="http://data.openaustralia.org/members/">OpenAustralia's data site</a>.
</p>
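<p>
To skim an <code>instructions.json</code> without reading it by hand, a small
script can list each source's type and resource. The sketch below runs against
a made-up sample: the top-level <code>sources</code> key, the
<code>type</code> and <code>resource</code> key names and the example URLs are
assumptions, so adjust them to match the real file.
</p>

```python
import json

# Made-up sample; key names and URLs are assumptions, not the real file.
SAMPLE_INSTRUCTIONS = """
{
  "sources": [
    {"type": "membership", "resource": "https://example.org/members.csv"},
    {"type": "term", "resource": "https://example.org/terms.csv"}
  ]
}
"""


def list_sources(text):
    """Return (type, resource) pairs for every source in the instructions."""
    document = json.loads(text)
    return [
        (source.get("type"), source.get("resource"))
        for source in document.get("sources", [])
    ]


for kind, url in list_sources(SAMPLE_INSTRUCTIONS):
    print(f"{kind}: {url}")
```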