README

################################################################################
#                   README NILCLUSTERING MODULE TAC/KBP 2014                   #
################################################################################

Configuration
-------------

Before installing and running the code it is necessary to specify the paths to
the external data (e.g., ProPPR output). This has to be done in the first
section of the Makefile. The necessary files are:
- PROPPR_TRAIN: 'qid eid'. ProPPR results for the training data with query 
                column and eid column, e.g.,
                 ------------------------------------
                | edl14_eng_training_0001   nil      |
                | edl14_eng_training_0002   nil      |
                | edl14_eng_training_0006   e0731517 |
                 ------------------------------------
- PROPPR_TEST:  'qid eid'. ProPPR results for the test data (if any) with
                query column and eid column, e.g.,
                 ------------------------------------
                | edl14_eng_training_0005   nil      |
                | edl14_eng_training_0032   e0330409 |
                | edl14_eng_training_0044   e0805223 |
                 ------------------------------------
- SCORE_TRAIN:  'rank score eid'. ProPPR solutions (scores) for the training
                data, e.g.,
  -----------------------------------------------------------------------------
 | # proved 1    answerQuery(d1020302,edl14_eng_training_0363,-1)    2107 msec |
 | 1 0.9039872065556289      -1=c[nil]                                         |
 | 2 0.004607878107344008    -1=c[e0082408]                                    |
 | 3 0.004607878107344008    -1=c[e0491365]                                    |
  -----------------------------------------------------------------------------
- SCORE_TEST:   'rank score eid'. ProPPR solutions (scores) for the test data
                (if any), e.g.,
  -----------------------------------------------------------------------------
 | # proved 1    answerQuery(d1020302,edl14_eng_training_0956,-1)    3319 msec |
 | 1 0.9029592137998453      -1=c[nil]                                         |
 | 2 0.003451444907669729    -1=c[e0027798]                                    |
 | 3 0.003451444907669729    -1=c[e0658312]                                    |
 -----------------------------------------------------------------------------
- QNAME:    'queryName qid string'. File containing the qid and the query string
            as well as a "queryName" dummy column, e.g.,
             ---------------------------------------------------------
            | queryName edl14_eng_training_0001 xenophon              |
            | queryName edl14_eng_training_0002 richmond              |
            | queryName edl14_eng_training_0003 newark_teachers_union |
             ---------------------------------------------------------
- TOKEN:    'inDocument did token'. File containing the did, the tokens in that
            document as well as an "inDocument" dummy column, e.g.,
             --------------------------------------
            | inDocument    d1020302    abramovich |
            | inDocument    d1020302    africa     |
            | inDocument    d1020302    agents     |
             --------------------------------------
- QSENT:    'querySentence qid sid'. File containing the qid and the sid as
            well as a "querySentence" dummy column, e.g.,
             ---------------------------------------------------
            | querySentence EDL14_ENG_TRAINING_0002 d484430s1   |
            | querySentence EDL14_ENG_TRAINING_0003 d1886127s12 |
            | querySentence EDL14_ENG_TRAINING_0006 d54485s10   |
             ---------------------------------------------------
- INSENT:   'inSentence sid token'. File containing the sid, the tokens in that
            sentence as well as an "inSentence" dummy column, e.g.,
             -------------------------------------
            | inSentence    d1020302s1  ballack   |
            | inSentence    d1020302s1  champions |
            | inSentence    d1020302s1  chelsea   |
             -------------------------------------
- TACPR:    'did wp14 entype score begin end mention tacid name type gentype'.
            File containing a TAC-aligned PR output file, e.g.,
             ----------------------------------------------------------
            | AFP_ENG_20090802.0401	Tikka (food)	prepared food \    |
            |   0.252	66	71	tikka				OTHER              |
            | AFP_ENG_20090802.0401	Spice mix	null \                 |
            |    0.132	72	78	masala				NULL               |
            | AFP_ENG_20090802.0401	Agence France-Presse	company	\  |
            |    0.113	155	158	AFP	E0689220	Agence France-Presse \ |
            |    ORG	ORGANIZATION                                   |
             ----------------------------------------------------------

It is possible to specify the amount of padding (e.g., the number of "X" in 
"NILXXX") by changing the FORMATTING_FLAGS.

Installation
------------

Run "make" inside the main directory. This will set up a virtual environment
with the necessary python dependencies and then run the different clustering
scripts. If you only want to install the virtual environment without running 
the clustering scripts, use "make venv" instead. 

Running the code
----------------

The makefile provides several targets corresponding to the different steps of
the clustering process:

- "make raw" generates all the input needed for the different clustering 
versions.

- "make baseline" performs baseline clustering. At the moment, there are the 
following baseline versions:
    -- baseline0:   String only. All queries with the same query string are
                    assigned the same nid.
    -- baseline1:   String and did. All queries with the same query string and
                    the same did are assigned the same nid.
    -- baseline2:   String and document distance. Queries that have the same 
                    query string are assigned to different clusters based on
                    the distance between the documents they appear in. This
                    distance is calculated based on the tokens that appear
                    in the documents.
                    Agglomerative clustering is used. The clustering parameters
                    (pairwise distance measure between documents, distance 
                    measure between clusters, clustering method and threshold 
                    for flattening) can be specified in the makefile. Detailed 
                    information regarding the clustering algorithm and its 
                    parameters can be found in the scipy documentation. 
    -- baseline3:   String and document distance. Same as baseline2, but 
                    instead of agglomerative clustering exploratory clustering
                    is used. Detailed information regarding the clustering
                    algorithm and its parameters can be found in the ExploreEM
                    documentation.
    -- baseline4:   String and sentence distance. Queries that have the same 
                    query string are assigned to different clusters based on
                    the distance between the sentences they appear in. The 
                    distance between the documents they appear in. This 
                    distance is calculated based on the tokens that appear
                    in the documents.
                    Agglomerative clustering is used. The clustering parameters
                    (pairwise distance measure between documents, distance 
                    measure between clusters, clustering method and threshold 
                    for flattening) can be specified in the makefile. Detailed 
                    information regarding the clustering algorithm and its 
                    parameters can be found in the scipy documentation. 
    -- baseline5:   String and sentence distance. Same as baseline4, but 
                    instead of agglomerative clustering exploratory clustering
                    is used. Detailed information regarding the clustering
                    algorithm and its parameters can be found in the ExploreEM
                    documentation.
    -- baseline6:   String and sentence distance with string and document
                    distance as a fallback. In a first step, queries for which
                    sentence information is available are clustered using
                    baseline4. In a second step, the remaining queries are
                    clustered using baseline2.
    -- baseline7:   String and sentence distance with string and document
                    distance as a fallback. In a first step, queries for which
                    sentence information is available are clustered using
                    baseline5. In a second step, the remaining queries are
                    clustered using baseline3.

- "make explore" performs exploreEM clustering. At the moment, there are the 
following exploreEM versions:
    -- unsupervised0:   String, did, document tokens, and score. For each query, 
                        a feature vector with the query string, did, the tokens 
                        of the document and the ProPPR scores is generated. This 
                        is a global context only unsupervised version of 
                        exploreEM.
    -- unsupervised1:   String, sid, sentence tokens, and score. For each query, 
                        a feature vector with the query string, did, the tokens 
                        of the sentence and the ProPPR scores is generated. This 
                        is a local context only unsupervised version of 
                        exploreEM.
    -- unsupervised2:   String, did, document tokens, sid, sentence tokens, and 
                        score. For each query, a feature vector with the query 
                        string, did, the tokens of the document, sid, sentence 
                        tokens and the ProPPR scores is generated. This is a 
                        combined local and global context unsupervised version 
                        of exploreEM.
    -- semi_supervised0: String, did, document tokens, and score. For each query, 
                        a feature vector with the query string, did, the tokens 
                        of the document and the ProPPR scores is generated. 
                        Additionally, PageReactor output is used to generate
                        training data (seeds) for a number of queries.
                        This is a global context only semi-supervised version of 
                        exploreEM.
    -- semi_supervised1: String, sid, sentence tokens, and score. For each query, 
                        a feature vector with the query string, sid, the tokens 
                        of the sentence and the ProPPR scores is generated. 
                        Additionally, PageReactor output is used to generate
                        training data (seeds) for a number of queries.
                        This is a local context only semi-supervised version of 
                        exploreEM.
    -- semi_supervised2: String, did, document tokens, sid, sentence tokens, and 
                        score. For each query, a feature vector with the query 
                        string, sid, the tokens of the sentence and the ProPPR 
                        scores is generated. Additionally, PageReactor output is 
                        used to generate training data (seeds) for a number of 
                        queries. This is a combined global and local context 
                        semi-supervised version of exploreEM.
Detailed information regarding the clustering algorithm and its parameters can 
be found in the ExploreEM documentation.

- "make pagereactor" performs clustering based on the pagereactor output. At the 
moment, there is only one version:
    -- pagereactor0:    String only. The assignment is performed as a two step
                        process. In a first step, pagereactor entities whose 
                        tacid is NULL and whose generic type is not NULL and not 
                        OTHER are grouped by their wp14 name and assigned a nid.
                        In a second step, those pagereactor entitites that have 
                        either an eid or a nid are matched with their qid. The
                        matching is performed based on string identity and
                        appearance, e.g. if there are three tac queries with the 
                        string "abraham_lincoln" and five pagereactor queries 
                        with the string "abraham_lincoln" then the first three 
                        of those pagereactor queries will be matched with the 
                        qid's from the three tac queries.

- "make" or "make all" is shorthand for calling "make raw", "make baseline", 
"make explore", and "make pagereactor".

Cleaning up
-----------

- "make clean" removes all the input generated by "make raw" (i.e., the data 
directory) and the output generated by "make baseline", "make explore", and
"make pagereactor" (i.e., the output directory).

- "make cleandist" removes the same targets as "make clean" and, in addition
to that, also removes the virtual environment. Note that in general "make clean"
should suffice.