################################################################################
# README NILCLUSTERING MODULE TAC/KBP 2014 #
################################################################################
Configuration
-------------
Before installing and running the code it is necessary to specify the paths to
the external data (e.g., ProPPR output). This has to be done in the first
section of the Makefile. The necessary files are:
- PROPPR_TRAIN: 'qid eid'. ProPPR results for the training data with query
column and eid column, e.g.,
------------------------------------
| edl14_eng_training_0001 nil |
| edl14_eng_training_0002 nil |
| edl14_eng_training_0006 e0731517 |
------------------------------------
- PROPPR_TEST: 'qid eid'. ProPPR results for the test data (if any) with
query column and eid column, e.g.,
------------------------------------
| edl14_eng_training_0005 nil |
| edl14_eng_training_0032 e0330409 |
| edl14_eng_training_0044 e0805223 |
------------------------------------
- SCORE_TRAIN: 'rank score eid'. ProPPR solutions (scores) for the training
data, e.g.,
-----------------------------------------------------------------------------
| # proved 1 answerQuery(d1020302,edl14_eng_training_0363,-1) 2107 msec |
| 1 0.9039872065556289 -1=c[nil] |
| 2 0.004607878107344008 -1=c[e0082408] |
| 3 0.004607878107344008 -1=c[e0491365] |
-----------------------------------------------------------------------------
- SCORE_TEST: 'rank score eid'. ProPPR solutions (scores) for the test data
(if any), e.g.,
-----------------------------------------------------------------------------
| # proved 1 answerQuery(d1020302,edl14_eng_training_0956,-1) 3319 msec |
| 1 0.9029592137998453 -1=c[nil] |
| 2 0.003451444907669729 -1=c[e0027798] |
| 3 0.003451444907669729 -1=c[e0658312] |
-----------------------------------------------------------------------------
- QNAME: 'queryName qid string'. File containing the qid and the query string
as well as a "queryName" dummy column, e.g.,
---------------------------------------------------------
| queryName edl14_eng_training_0001 xenophon |
| queryName edl14_eng_training_0002 richmond |
| queryName edl14_eng_training_0003 newark_teachers_union |
---------------------------------------------------------
- TOKEN: 'inDocument did token'. File containing the did, the tokens in that
document as well as an "inDocument" dummy column, e.g.,
--------------------------------------
| inDocument d1020302 abramovich |
| inDocument d1020302 africa |
| inDocument d1020302 agents |
--------------------------------------
- QSENT: 'querySentence qid sid'. File containing the qid and the sid as
well as a "querySentence" dummy column, e.g.,
---------------------------------------------------
| querySentence EDL14_ENG_TRAINING_0002 d484430s1 |
| querySentence EDL14_ENG_TRAINING_0003 d1886127s12 |
| querySentence EDL14_ENG_TRAINING_0006 d54485s10 |
---------------------------------------------------
- INSENT: 'inSentence sid token'. File containing the sid, the tokens in that
sentence as well as an "inSentence" dummy column, e.g.,
-------------------------------------
| inSentence d1020302s1 ballack |
| inSentence d1020302s1 champions |
| inSentence d1020302s1 chelsea |
-------------------------------------
- TACPR: 'did wp14 entype score begin end mention tacid name type gentype'.
File containing a TAC-aligned PR output file, e.g.,
----------------------------------------------------------
| AFP_ENG_20090802.0401 Tikka (food) prepared food \ |
| 0.252 66 71 tikka OTHER |
| AFP_ENG_20090802.0401 Spice mix null \ |
| 0.132 72 78 masala NULL |
| AFP_ENG_20090802.0401 Agence France-Presse company \ |
| 0.113 155 158 AFP E0689220 Agence France-Presse \ |
| ORG ORGANIZATION |
----------------------------------------------------------
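The SCORE_TRAIN and SCORE_TEST layout shown above can be read with a few lines
of Python. The following is only a minimal sketch, not the parser used by the
clustering scripts: the function name parse_scores and both regular expressions
are assumptions derived from the sample lines, and columns are assumed to be
separated by whitespace.

```python
import re
from collections import defaultdict

# Header shape assumed from the sample:
#   "# proved 1 answerQuery(did,qid,-1) N msec"
HEADER = re.compile(r"answerQuery\(([^,]+),([^,]+),")
# Solution shape assumed from the sample: "rank score -1=c[eid]"
SOLUTION = re.compile(r"^(\d+)\s+(\S+)\s+-1=c\[([^\]]+)\]$")


def parse_scores(lines):
    """Return {qid: [(rank, score, eid), ...]} for a ProPPR solutions file."""
    scores = defaultdict(list)
    qid = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("#"):
            # A header line opens the block of solutions for one query.
            match = HEADER.search(line)
            qid = match.group(2) if match else None
            continue
        match = SOLUTION.match(line)
        if match and qid is not None:
            rank, score, eid = match.groups()
            scores[qid].append((int(rank), float(score), eid))
    return dict(scores)


sample = [
    "# proved 1 answerQuery(d1020302,edl14_eng_training_0363,-1) 2107 msec",
    "1 0.9039872065556289 -1=c[nil]",
    "2 0.004607878107344008 -1=c[e0082408]",
    "3 0.004607878107344008 -1=c[e0491365]",
]
parsed = parse_scores(sample)
best_rank, best_score, best_eid = parsed["edl14_eng_training_0363"][0]
print(best_eid, best_score)
```

Solution lines that appear before any header line are skipped, since they
cannot be attributed to a query.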
It is possible to specify the amount of padding (e.g., the number of "X" in
"NILXXX") by changing the FORMATTING_FLAGS.
Installation
------------
Run "make" inside the main directory. This will set up a virtual environment
with the necessary Python dependencies and then run the different clustering
scripts. If you only want to install the virtual environment without running
the clustering scripts, use "make venv" instead.
Running the code
----------------
The makefile provides several targets corresponding to the different steps of
the clustering process:
- "make raw" generates all the input needed for the different clustering
versions.
- "make baseline" performs baseline clustering. At the moment, there are the
following baseline versions:
-- baseline0: String only. All queries with the same query string are
assigned the same nid.
-- baseline1: String and did. All queries with the same query string and
the same did are assigned the same nid.
-- baseline2: String and document distance. Queries that have the same
query string are assigned to different clusters based on
the distance between the documents they appear in. This
distance is calculated based on the tokens that appear
in the documents.
Agglomerative clustering is used. The clustering parameters
(pairwise distance measure between documents, distance
measure between clusters, clustering method and threshold
for flattening) can be specified in the makefile. Detailed
information regarding the clustering algorithm and its
parameters can be found in the scipy documentation.
-- baseline3: String and document distance. Same as baseline2, but
instead of agglomerative clustering exploratory clustering
is used. Detailed information regarding the clustering
algorithm and its parameters can be found in the ExploreEM
documentation.
-- baseline4: String and sentence distance. Queries that have the same
query string are assigned to different clusters based on
the distance between the sentences they appear in. This
distance is calculated based on the tokens that appear
in the sentences.
Agglomerative clustering is used. The clustering parameters
(pairwise distance measure between sentences, distance
measure between clusters, clustering method and threshold
for flattening) can be specified in the makefile. Detailed
information regarding the clustering algorithm and its
parameters can be found in the scipy documentation.
-- baseline5: String and sentence distance. Same as baseline4, but
instead of agglomerative clustering exploratory clustering
is used. Detailed information regarding the clustering
algorithm and its parameters can be found in the ExploreEM
documentation.
-- baseline6: String and sentence distance with string and document
distance as a fallback. In a first step, queries for which
sentence information is available are clustered using
baseline4. In a second step, the remaining queries are
clustered using baseline2.
-- baseline7: String and sentence distance with string and document
distance as a fallback. In a first step, queries for which
sentence information is available are clustered using
baseline5. In a second step, the remaining queries are
clustered using baseline3.
- "make explore" performs exploreEM clustering. At the moment, there are the
following exploreEM versions:
-- unsupervised0: String, did, document tokens, and score. For each query,
a feature vector with the query string, did, the tokens
of the document and the ProPPR scores is generated. This
is a global context only unsupervised version of
exploreEM.
-- unsupervised1: String, sid, sentence tokens, and score. For each query,
a feature vector with the query string, sid, the tokens
of the sentence and the ProPPR scores is generated. This
is a local context only unsupervised version of
exploreEM.
-- unsupervised2: String, did, document tokens, sid, sentence tokens, and
score. For each query, a feature vector with the query
string, did, the tokens of the document, sid, sentence
tokens and the ProPPR scores is generated. This is a
combined local and global context unsupervised version
of exploreEM.
-- semi_supervised0: String, did, document tokens, and score. For each query,
a feature vector with the query string, did, the tokens
of the document and the ProPPR scores is generated.
Additionally, PageReactor output is used to generate
training data (seeds) for a number of queries.
This is a global context only semi-supervised version of
exploreEM.
-- semi_supervised1: String, sid, sentence tokens, and score. For each query,
a feature vector with the query string, sid, the tokens
of the sentence and the ProPPR scores is generated.
Additionally, PageReactor output is used to generate
training data (seeds) for a number of queries.
This is a local context only semi-supervised version of
exploreEM.
-- semi_supervised2: String, did, document tokens, sid, sentence tokens, and
score. For each query, a feature vector with the query
string, did, the tokens of the document, sid, sentence
tokens and the ProPPR scores is generated. Additionally,
PageReactor output is used to generate training data
(seeds) for a number of queries. This is a combined
global and local context semi-supervised version of
exploreEM.
Detailed information regarding the clustering algorithm and its parameters can
be found in the ExploreEM documentation.
- "make pagereactor" performs clustering based on the pagereactor output. At the
moment, there is only one version:
-- pagereactor0: String only. The assignment is performed as a two-step
process. In a first step, pagereactor entities whose
tacid is NULL and whose generic type is neither NULL nor
OTHER are grouped by their wp14 name and assigned a nid.
In a second step, those pagereactor entities that have
either an eid or a nid are matched with their qid. The
matching is performed based on string identity and
order of appearance, e.g., if there are three TAC queries
with the string "abraham_lincoln" and five pagereactor
queries with the string "abraham_lincoln", then the first
three of those pagereactor queries will be matched with
the qids from the three TAC queries.
- "make" or "make all" is shorthand for calling "make raw", "make baseline",
"make explore", and "make pagereactor".
Cleaning up
-----------
- "make clean" removes all the input generated by "make raw" (i.e., the data
directory) and the output generated by "make baseline", "make explore", and
"make pagereactor" (i.e., the output directory).
- "make cleandist" removes everything that "make clean" removes and also
deletes the virtual environment. In general, "make clean" should suffice.