\documentclass[a4paper]{report}
% do not ignore non-ascii chars
\usepackage{fontspec}
% use harvard-style referencing
\usepackage[round]{natbib}
\renewcommand{\bibname}{References}
\bibliographystyle{abbrvnat}
% needed for URLs
\usepackage[hidelinks]{hyperref}
% beautiful tables
\usepackage{booktabs}
% set the font size of captions
\usepackage[font=small]{caption}
% package for drawing graphs
% source: http://www.texample.net/tikz/examples/simple-flow-chart/
\usepackage{tikz}
\usetikzlibrary{shapes,arrows}
\tikzstyle{block} = [rectangle, draw, fill=teal!20, text centered, rounded corners, minimum height=0.9cm, node distance=1.6cm]
\tikzstyle{line} = [draw, -latex']
\begin{document}
\title{IPA alignment using vector representations}
\author{
Pavel Sofroniev \\ \normalsize{author} \\ \normalsize\url{[email protected]}\medskip
\and
Çağri Çöltekin \\ \normalsize{advisor} \\ \normalsize\url{[email protected]}\bigskip
}
\date{March 2018}
\maketitle
\begin{abstract}
This paper explores an alternative approach to pairwise alignment of IPA-encoded sound sequences in the context of computational historical linguistics:
employing the cosine distance between representations of IPA segments in vector space as the scoring function in the standard sequence alignment algorithm.
We implement and evaluate several different methods for obtaining such IPA embeddings, including three data-driven methods.
Our experiments suggest that data-driven embeddings perform better than their counterparts grounded in general linguistics theory;
and, for data-driven methods, that modelling the directionality of context yields better-performing vector representations.
\end{abstract}
\chapter{Introduction}
Most of the computational methods developed in the field of historical linguistics involve the task of aligning sound sequences,
either on its own or as a necessary step in a larger application.
In its essence, sequence alignment is a way to arrange two or more sequences together
in order to identify sub-sequences which are similar according to certain pre-defined criteria.
Usually in the context of historical linguistics these are sequences of phonological or phonetic segments comprising transcriptions of words;
and the similarity of given segments is subsequently used to infer either the cognacy of these words or rules of sound correspondences.
The current standard algorithm for pairwise alignment (i.e. aligning two sequences at a time)
was originally developed by \citet{1970_Needleman_Wunsch} for aligning amino acid sequences.
Given a function that assigns scores to pairs of sequence elements,
the Needleman-Wunsch algorithm is guaranteed to find the alignment(s) that minimises/maximises the overall score.
There exist a number of variants of the basic algorithm, e.g. \citet{1981_Smith_Waterman},
and many of these have been adapted for the purposes of computational linguistics;
however, the output always depends to a great extent on the scoring function.
Some of these variants have been developed for aligning more than two sequences at once,
e.g. the T-Coffee algorithm by \citet{2000_Notredame_al}, but in this paper we focus on pairwise alignment only.
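For reference, the dynamic programming core of the algorithm, in the cost-minimising formulation assumed in the remainder of this paper,
can be sketched as follows (a simplified illustration rather than our actual implementation; the traceback that recovers the alignment itself is omitted):
\begin{verbatim}
import numpy as np

def needleman_wunsch_cost(seq_a, seq_b, score, gap=1.0):
    """Cost of the best global alignment of two sequences, where `score`
    gives the cost of pairing two elements and `gap` the cost of a gap."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = np.zeros((rows, cols))
    table[:, 0] = np.arange(rows) * gap
    table[0, :] = np.arange(cols) * gap
    for i in range(1, rows):
        for j in range(1, cols):
            table[i, j] = min(
                table[i-1, j-1] + score(seq_a[i-1], seq_b[j-1]),  # pairing
                table[i-1, j] + gap,                              # gap in seq_b
                table[i, j-1] + gap)                              # gap in seq_a
    return table[-1, -1]
\end{verbatim}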
Unlike in bioinformatics, which deals primarily with sequences composed of small sets of well-defined units,
in historical linguistics the encoding of the sequences to be aligned is not a solved problem.
Even though in general the standard way to encode sound sequences is IPA, the International Phonetic Alphabet,
for the purposes of alignment the overwhelming diversity of sound segments impedes the creation of a useful scoring function.
That is why modern sound sequence alignment methods tend to operate on small sets of segments;
IPA-encoded data is handled by translating it into the respective reduced alphabet.
While this constitutes a valid approach to tackle the problem of deriving a scoring function for the rich set of IPA segments,
in this paper we experiment with an alternative approach: vectorisation.
The goal is to obtain representations of IPA segments in a vector space
in which the cosine similarity between segments' vectors would reflect the segments' similarity and, hence, probability of shared ancestry.
Our inspiration for experimenting with vector representations (also known as embeddings)
is drawn from the success of such methods in other fields of computational linguistics:
e.g. \citet{2013_Mikolov_al} affirm that vector representations of words exhibit useful semantic and syntactic properties;
and \citet{2014_Chen_Manning} use word embeddings for transition-based parsing.
More closely related to our topic,
\citet{2018_Silfverberg_al} obtain vector representations of orthographic segments for Finnish, Spanish and Turkish
from recurrent neural networks trained to perform an inflection task.
\section{Background}
This section provides a concise overview of sound sequence alignment methods reported in the computational historical linguistics literature.
\subsection{Early methods}
An early method for sound sequence alignment reported in the literature is the one proposed by \citet{1996_Covington},
which was subsequently improved and extended in \citet{1998_Covington}.
Although the method does not take advantage of any standard sequence alignment algorithm and instead performs branch-and-bound search,
it does use a scoring function.
This function assigns hand-crafted values depending on whether the segments are consonants, vowels, or glides.
Due to this simple classification, the method could be adapted to work with IPA sequences (the author uses an undocumented encoding);
however, its poor performance in comparison to more complex methods demonstrates that such a coarse three-way classification is not sufficient.
ALINE, first introduced by \citet{2000_Kondrak} and further elaborated in \citet{2002_Kondrak_Hirst} and \citet{2003_Kondrak},
employs a complex scoring function that is built around multi-valued (e.g. manner of articulation) and binary (e.g. nasality) phonetic features,
each with its relative weight, as well as some additional parameters (e.g. relative weight of vowels).
While some of these numerical values have been determined based on observations from the field of articulatory phonetics,
others have been established by trial and error.
\subsection{ASJP}
The Automated Similarity Judgment Program (ASJP) is an actively developed database aiming to provide
the translations of a set of 40 basic concepts into all the world's languages.
At the time of writing the database covers 294\,548 words from 7221 languages \citep{2016_Wichmann_al}.
The words are encoded using a uniform transcription consisting of 34 consonants and 7 vowels,
introduced by \citet{2008_Brown_al} and further specified by \citet{2013_Brown_al} as ASJPcode, but more commonly referred to as ASJP.
This restricted set of encoding symbols results in much of the phonetic information being lost,
but the authors argue that such information might not be that valuable for certain applications.
The simplified encoding also allows for a greater reach of the database, as many of the world's languages lack a sufficiently detailed phonetic record.
\citet{2013_Jäger} uses the ASJP database to calculate pointwise mutual information (PMI) scores\,---\,a logarithmic measure
of sound similarity based on the occurrence and co-occurrence of sound segments in data\,---\,for each pair of ASJP segments.
These PMI scores are successfully employed for inferring phylogenetic trees through determining language distances from PMI-based sequence alignment.
Unlike the aforementioned methods, this one obtains its scoring function in a data-driven manner, an approach also embraced in our paper.
PMI-based scoring has been used on IPA-encoded datasets as well, by converting the sequences to ASJP as a pre-processing step.
As an example, \citet{2016_Jäger_Sofroniev} use this approach to train a support vector machine (SVM) for cognate classification.
\subsection{Other methods}
The sound-class-based phonetic alignment (SCA) method developed by \citet{2012_List} employs a set of 28 sound classes.
It operates on IPA sequences by converting the segments into their respective sound classes, aligning the sound class tokens, and then converting these back into IPA.
The scoring function is hand-crafted to reflect the perceived probabilities of sound change transforming a segment of one class into a segment of another.
\chapter{Methods}
We consider 5 different methods for obtaining vector representations of IPA segments.
With all methods, the representations are employed for computing the cosine distance between IPA segments (cf.~Figure~\ref{fig:cosine}),
which comprises the scoring function in an otherwise standard application of the Needleman-Wunsch algorithm.
\begin{figure}[h]
\begin{displaymath}
f(u, v) = 1-\cos(\alpha) = 1-\frac{\displaystyle\sum_{i=1}^n u_i v_i}{\sqrt{\displaystyle\sum_{i=1}^n u_i^2}\sqrt{\displaystyle\sum_{i=1}^n v_i^2}}
\end{displaymath}
\caption{The scoring function; $u$ and $v$ are vector representations of two IPA segments; $\alpha$ is the angle between $u$ and $v$}
\label{fig:cosine}
\end{figure}
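In code, this scoring function amounts to a thin wrapper around SciPy's cosine distance;
the sketch below is illustrative only, and the \texttt{embeddings} mapping from IPA segments to NumPy vectors is assumed to have been produced by one of the methods described in the following sections.
\begin{verbatim}
from scipy.spatial.distance import cosine

def score(segment_a, segment_b, embeddings):
    """Cosine distance between the vectors of two IPA segments;
    ranges from 0 (same direction) to 2 (opposite directions)."""
    return cosine(embeddings[segment_a], embeddings[segment_b])
\end{verbatim}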
The full implementation of the methods, as well as of the experiments described in the next chapter,
is open sourced under the GNU GPLv3 license and publicly available online.\footnote{\url{https://github.com/pavelsof/ipavec}}
It is written using the Python programming language and a set of common dependencies,
including NumPy \citep{2011_Walt_al},
SciPy \citep{2001_Jones_al},
scikit-learn \citep{2011_Pedregosa_al},
Gensim \citep{2010_Řehůřek_Sojka},
and Keras \citep{2015_Chollet_al}.
\section{One-hot encoding}
Under one-hot encoding, given a vocabulary of $N$ distinct segments, each segment would be represented as a distinct binary vector of size $N$,
such that exactly one of its dimensions has value 1 and all its other dimensions have value 0.
In such a vector space, all vectors have length 1 and any two non-identical vectors are orthogonal.
One-hot encoding is a simple method, both conceptually and computationally, but it cannot be very useful for producing distance measures
because under its model each segment is equidistant from all the others;
i.e. the method is equivalent to treating each IPA segment as a basic symbol with no connection to any other IPA segment.
For the purposes of our study, one-hot vector representations are used as a baseline against which the other methods are compared.
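As a minimal illustration (not our actual implementation), such vectors can be read directly off an identity matrix:
\begin{verbatim}
import numpy as np

def one_hot_embeddings(segments):
    """Map each distinct IPA segment to a distinct binary unit vector."""
    identity = np.eye(len(segments))
    return {segment: identity[index]
            for index, segment in enumerate(segments)}
\end{verbatim}
Any two distinct vectors produced in this way have a cosine distance of exactly 1,
so the scoring function cannot rate, say, [p] and [b] as any more similar than [p] and [a].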
\section{PHOIBLE feature vectors}
PHOIBLE Online is an ongoing project aiming to compile a comprehensive tertiary database of the phonological inventories of the world's languages.
At the time of writing the database contains phonological inventories for 1672 distinct languages making use of 2160 distinct IPA segments \citep{2014_Moran_al}.
As part of the project the developers are also maintaining a table of phonological features,
effectively mapping each segment encountered in the database to a unique ternary feature vector
(values indicate either the presence, absence, or non-applicability of the respective feature).
The feature table includes 39 distinct features and is based on research by \citet{2009_Bruce} and \citet{2011_Moisik_al}.
Table~\ref{tab:phoible-features} lists these features, grouped by value, for an example segment.
\begin{table}[h]
\centering\small
\begin{tabular}{l *{3}{p{3.3cm}}}
\toprule
& present & absent & not applicable \\
\midrule
ə &
syllabic, sonorant, continuant, approximant, dorsal, periodic glottal source &
stress, short, long, consonantal, tap, trill, nasal, lateral, labial, coronal, high, low, front, back, tense,
retracted tongue root, advanced tongue root, epilaryngeal source, spread glottis, constricted glottis, raised larynx ejective, lowered larynx implosive &
tone, delayed release, round, labiodental, anterior, distributed, strident, fortis, click \\
\bottomrule
\end{tabular}
\caption{The PHOIBLE features grouped by value for the segment ə}
\label{tab:phoible-features}
\end{table}
The PHOIBLE feature vectors comprise segment representations grounded in phonology and linguistics theory
that could be readily used for a variety of computational endeavours.
Nevertheless, cosine distances in the vector space defined by the PHOIBLE features are equally affected by common and rare features/dimensions;
for example, the presence of a rare feature such as advanced tongue root in only one of a pair of vectors
could result in a distance larger than is warranted for the purposes of alignment.
That is why we have also experimented with reducing the number of dimensions by applying principal component analysis (PCA),
expecting that this would decrease the influence of rare features; we refer to this method as PHOIBLE-PC.
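A minimal sketch of this reduction using scikit-learn, assuming the ternary feature values have already been mapped to numbers (e.g. $+1$, $-1$ and $0$) in a matrix with one row per segment
(the variable names are illustrative and do not mirror our actual implementation):
\begin{verbatim}
from sklearn.decomposition import PCA

def reduce_features(feature_matrix, num_components=29):
    """Project the PHOIBLE feature vectors onto their first
    principal components; each row is one IPA segment."""
    return PCA(n_components=num_components).fit_transform(feature_matrix)
\end{verbatim}
The default of 29 components corresponds to the best-performing setting reported in the discussion section.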
\section{phon2vec embeddings}
The first data-driven method we consider is word2vec, developed by \citet{2013_Mikolov_al}.
It comprises two closely related model architectures for computing vector representations of words:
the continuous bag-of-words model (CBOW) which tries to predict a word based on its context of neighbouring words,
and the continuous skip-gram model which, given a word, tries to predict words from its context.
The resulting embeddings also depend on a few other hyperparameters,
e.g. the number of neighbouring words that constitute a word's context, or the number of negative samples for each positive sample.
As our study concerns IPA segments rather than words, we refer to the method as phon2vec.
Our embeddings are trained on a dataset of tokenised IPA sequences,
with the model effectively treating the IPA segments as words and the sequences as sentences.
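A minimal sketch of this training setup with Gensim, using for concreteness the best-performing hyperparameter values of Table~\ref{tab:phon2vec}
(parameter names follow Gensim 4; older releases use \texttt{size} and \texttt{iter} instead of \texttt{vector\_size} and \texttt{epochs}):
\begin{verbatim}
from gensim.models import Word2Vec

def train_phon2vec(tokenised_transcriptions):
    """tokenised_transcriptions: a list of lists of IPA segments."""
    model = Word2Vec(
        sentences=tokenised_transcriptions,
        sg=0,            # CBOW rather than skip-gram
        vector_size=15,  # embedding size
        window=1,        # one segment in each direction
        negative=1,      # negative samples per positive sample
        epochs=5,
        min_count=1)     # keep even rare segments
    return {segment: model.wv[segment]
            for segment in model.wv.index_to_key}
\end{verbatim}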
\section{NN embeddings}
Our fourth method consists of using the embeddings of a neural network (NN) which is trained to predict an IPA segment's immediate neighbours.
As is the case with the phon2vec model, such a network requires a training dataset of IPA sequences.
During training, each IPA segment of each sequence is embedded, processed by a regular dense layer,
and used to predict the segment's one-hot encoded preceding and following segments through two softmax layers.
The network's architecture is depicted in Figure~\ref{fig:nn}.
\begin{figure}[h]
\centering\small
\begin{tikzpicture}
\node [block] (input) {input segment};
\node [block, below of=input] (embedding) {embedding layer};
\node [block, below of=embedding] (dense) {dense layer};
\node [block, below of=dense, left of=dense] (prev) {preceding segment};
\node [block, below of=dense, right of=dense] (next) {following segment};
\path [line] (input) -- (embedding);
\path [line] (embedding) -- (dense);
\path [line] (dense) -- (prev);
\path [line] (dense) -- (next);
\end{tikzpicture}
\caption{NN: layers}
\label{fig:nn}
\end{figure}
The network's hyperparameters include the size of the embedding layer, the size of the dense layer,
the number of training epochs, the batch size, the loss and optimisation functions.
We use sparse categorical cross-entropy as the loss function and we experiment with common values of the other hyperparameters.
We have also experimented with predicting a larger context by extending the number of predicted neighbours to two in each direction,
but this fails to improve the resulting embeddings.
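A minimal Keras sketch of this architecture, shown with the best-performing hyperparameter values of Table~\ref{tab:networks}
(illustrative only; data preparation is omitted, and the hidden activation, which is not essential to the method, is an assumption):
\begin{verbatim}
from keras.layers import Dense, Embedding, Flatten, Input
from keras.models import Model

def build_nn(num_segments, embedding_size=64, dense_size=128):
    """Predict a segment's two immediate neighbours from its embedding."""
    segment = Input(shape=(1,))
    embedded = Flatten()(Embedding(num_segments, embedding_size)(segment))
    hidden = Dense(dense_size, activation='relu')(embedded)  # assumed activation
    preceding = Dense(num_segments, activation='softmax', name='prev')(hidden)
    following = Dense(num_segments, activation='softmax', name='next')(hidden)
    model = Model(inputs=segment, outputs=[preceding, following])
    model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd')
    return model
\end{verbatim}
After training, the weight matrix of the embedding layer provides one vector per IPA segment.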
A major difference between phon2vec and the neural network, which is also one of the reasons to propose the latter,
is that phon2vec does not model the order of IPA segments, effectively treating the preceding and succeeding neighbours equivalently.
That is why we expect the NN embeddings to perform better than their phon2vec counterparts.
\section{RNN embeddings}
Our last method consists of using the embeddings of a recurrent neural network (RNN).
Given a pair of sequences, the network encodes the first sequence into a vector, which is then decoded into an output sequence.
The mismatches between the output sequence and the pair's second sequence,
represented as a sequence of one-hot vectors, are backpropagated through the whole network.
Introduced by \citet{2014_Cho_al}, networks of such architecture are typically employed for machine translation,
but our network is trained on pairs of IPA sequences comprising bilingual pairs of words linked to the same meaning.
As both encoded and decoded data are of the same form, the embedding layer is shared.
The network's architecture is depicted in Figure~\ref{fig:rnn}.
\begin{figure}[h]
\centering\small
\begin{tikzpicture}
\node [block] (encoder) {encoder};
\node [block, right of=encoder, node distance=3cm] (state) {internal state};
\node [block, right of=state, node distance=3cm] (decoder) {decoder};
\node [block, right of=decoder, node distance=2cm] (output) {output};
\node [block, below of=state] (embedding) {embedding layer};
\node [block, below of=embedding, left of=embedding] (encoder-input) {encoder input};
\node [block, below of=embedding, right of=embedding] (decoder-input) {decoder input};
\path [line] (encoder-input) -- (embedding);
\path [line] (decoder-input) -- (embedding);
\path [line] (embedding) -- (encoder);
\path [line] (embedding) -- (decoder);
\path [line] (encoder) -- (state);
\path [line] (state) -- (decoder);
\path [line] (decoder) -- (output);
\end{tikzpicture}
\caption{RNN: layers}
\label{fig:rnn}
\end{figure}
The network's hyperparameters include the size of the embedding layer, the size of the encoder and decoder recurrent layers,
the number of training epochs, the batch size, the loss and optimisation functions.
We use categorical cross-entropy as the loss function and we experiment with common values of the other hyperparameters.
We have also experimented with using the NN embeddings as initial weights of the RNN embedding layer;
this consistently improves the resulting vector representations\,---\,which we refer to as NN\,+\,RNN embeddings.
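A minimal Keras sketch of the encoder--decoder architecture with a shared embedding layer, again using the best-performing hyperparameter values of Table~\ref{tab:networks}
(illustrative only; the choice of a GRU as the recurrent unit is an assumption made for the sake of the example):
\begin{verbatim}
from keras.layers import Dense, Embedding, GRU, Input
from keras.models import Model

def build_rnn(num_segments, embedding_size=64, rnn_size=128):
    """Encode one IPA sequence and decode its counterpart."""
    embed = Embedding(num_segments, embedding_size)  # shared embedding layer
    encoder_input = Input(shape=(None,))
    decoder_input = Input(shape=(None,))
    _, state = GRU(rnn_size, return_state=True)(embed(encoder_input))
    decoded = GRU(rnn_size, return_sequences=True)(
        embed(decoder_input), initial_state=state)
    output = Dense(num_segments, activation='softmax')(decoded)
    model = Model(inputs=[encoder_input, decoder_input], outputs=output)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    return model
\end{verbatim}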
\chapter{Experiments}
In order to evaluate the performance of the methods put forward in the last chapter,
we use the Benchmark Database for Phonetic Alignments (BDPA) compiled by \citet{2014_List_Prokić}.
The database contains 7198 aligned pairs of IPA sequences collected from 12 source datasets,
covering languages and dialects from 6 language families (cf.~Table~\ref{tab:bdpa}).
The database also features the small set of 82 selected pairs used by \citet{1996_Covington} to evaluate his method, encoded in IPA.
\begin{table}[h]
\centering\small
\begin{tabular}{l r r l l}
\toprule
& pairs & langs & family & source \\
\midrule
Andean & 619 & 20 & Aymaran, Quechuan & \citet{2006_Heggarty} \\
Bai & 889 & 17 & Sino-Tibetan & \citet{2006_Wang}, \citet{2007_Allen} \\
Bulgarian & 1519 & 196 & Indo-European & \citet{2009_Prokić_al} \\
Dutch & 500 & 62 & Indo-European & \citet{2005_Schutter_al} \\
French & 712 & 62 & Indo-European & \citet{1925_Gauchat_al} \\
Germanic & 1110 & 45 & Indo-European & \citet{2009_Renfrew_Heggarty} \\
Japanese & 219 & 10 & Japonic & \citet{1973_Shirō} \\
Norwegian & 501 & 51 & Indo-European & \citet{2011_Almberg_Skarbø} \\
Ob-Ugrian & 444 & 21 & Uralic & \citet{2011_Zhivlov} \\
Romance & 297 & 8 & Indo-European & \citet{2009_Renfrew_Heggarty} \\
Sinitic & 200 & 38 & Sino-Tibetan & \citet{2004_Hóu} \\
Slavic & 188 & 5 & Indo-European & \citet{2008_Derksen} \\
\bottomrule
\end{tabular}
\caption{The BDPA datasets: numbers of pairs and languages, family affiliations as per \citet{2018_Hammarström_al}, and sources}
\label{tab:bdpa}
\end{table}
Our methods are also compared against SCA, the current state-of-the-art method for aligning IPA sequences developed by \citet{2012_List},
which is briefly described in the introductory chapter.
In order to quantify the methods' performance, we employ an intuitive evaluation scheme similar to the one used in \citet{2002_Kondrak_Hirst}:
if, for a given word pair, $m$ is the number of alternative gold-standard alignments
and $n$ is the number of correctly predicted alignments, the score for that pair would be $\frac{n}{m}$.
In the common case of a single gold-standard alignment and a single predicted alignment,
the latter yields 1 point if it is correct and 0 points otherwise;
partially correct alignments do not yield points.
The percentage scores are obtained by dividing the points by the total number of pairs.
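In code, the per-pair score amounts to the following (a simplified sketch; alignments are assumed to be represented in a form that can be compared for equality):
\begin{verbatim}
def score_pair(gold_alignments, predicted_alignments):
    """Return n/m: correctly predicted alignments over gold alternatives."""
    correct = sum(1 for alignment in predicted_alignments
                  if alignment in gold_alignments)
    return correct / len(gold_alignments)
\end{verbatim}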
\section{Training}
Obtaining vector representations with the phon2vec and neural network methods involves
setting the models' hyperparameters and training on a dataset of IPA sequences (or pairs thereof).
Our training data is sourced from NorthEuraLex, a comprehensive lexicostatistical database
that provides IPA-encoded lexical data for languages of, primarily but not exclusively, Northern Eurasia \citep{2017_Dellert_Jäger}.
At the time of writing the database covers 1016 concepts from 107 languages, resulting in 121\,614 IPA transcriptions.
The latter are tokenised using ipatok, an open source Python package for tokenising IPA strings developed by us.\footnote{\url{https://pypi.python.org/pypi/ipatok}}
The phon2vec and NN embeddings are trained on the set of all tokenised transcriptions.
As NorthEuraLex does not include cognacy information,
the RNN embeddings are trained on the set of tokenised transcriptions of the word pairs constituting probable cognates\,---\,pairs
in which the words belong to different languages, are linked to the same concept, and have normalised Levenshtein distance lower than 0.5.
Admittedly, this threshold value is chosen somewhat arbitrarily;
we have also experimented with thresholds of 0.4 and 0.6, but setting the cutoff at 0.5 yields better-performing embeddings.
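A sketch of this filtering criterion (simplified; in practice it operates on the tokenised transcriptions together with the language and concept metadata from NorthEuraLex):
\begin{verbatim}
def normalised_levenshtein(seq_a, seq_b):
    """Levenshtein distance divided by the length of the longer sequence."""
    if not seq_a and not seq_b:
        return 0.0
    previous = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, 1):
        current = [i]
        for j, b in enumerate(seq_b, 1):
            current.append(min(previous[j] + 1,       # deletion
                               current[j-1] + 1,      # insertion
                               previous[j-1] + (a != b)))  # substitution
        previous = current
    return previous[-1] / max(len(seq_a), len(seq_b))

def is_probable_cognate(word_a, word_b, threshold=0.5):
    """Words are tokenised IPA transcriptions linked to the same concept."""
    return normalised_levenshtein(word_a, word_b) < threshold
\end{verbatim}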
For each of the methods, we have run the respective model with the Cartesian product of common values for each hyperparameter,
practically performing a crude search of the hyperparameter space.
The values we have experimented with, as well as the best-performing combinations thereof, are summarised in Tables \ref{tab:phon2vec} and \ref{tab:networks}.
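Concretely, such a search amounts to looping over the Cartesian product of the candidate values;
in the sketch below, \texttt{train\_and\_evaluate} is a hypothetical stand-in for the respective training and evaluation pipeline:
\begin{verbatim}
from itertools import product

def grid_search(param_grid, train_and_evaluate):
    """param_grid: dict mapping hyperparameter names to candidate values."""
    names = sorted(param_grid)
    best_score, best_params = float('-inf'), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
\end{verbatim}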
\begin{table}[h]
\centering\small
\begin{tabular}{*{3}{l}}
\toprule
& phon2vec & experimental values \\
\midrule
model architecture & CBOW & CBOW, skip-gram \\
embedding size & 15 & 5, 15, 30 \\
context size & 2 (one segment in each direction) & 0, 2, 4 \\
negative samples & 1 per positive sample & 0, 1, 2 \\
epochs & 5 & 5, 10, 20 \\
\bottomrule
\end{tabular}
\caption{Hyperparameter values, phon2vec}
\label{tab:phon2vec}
\end{table}
\begin{table}[h]
\centering\small
\begin{tabular}{*{4}{l}}
\toprule
& nn & rnn & experimental values \\
\midrule
embedding size & 64 & 64 & 32, 64 \\
dense layer size & 128 & - & 64, 128 \\
rnn layer size & - & 128 & 64, 128 \\
epochs & 10 & 5 & 5, 10, 20 \\
batch size & 128 & 128 & 32, 64, 128 \\
optimisation & sgd & rmsprop & cf.~Keras' optimisation functions \\
\bottomrule
\end{tabular}
\caption{Hyperparameter values, neural networks}
\label{tab:networks}
\end{table}
\section{Results}
The percentage scores obtained by running our proposed methods and SCA on the BDPA datasets are summarised in Table~\ref{tab:results}.
\begin{table}[h]
\centering\small
\begin{tabular}{l *{6}{c}}
\toprule
& one-hot & phoible & phon2vec & nn & nn\,+\,rnn & sca \\
\midrule
Andean & 85.66 & 87.31 & 97.25 & 99.34 & 99.50 & 99.67 \\
Bai & 52.55 & 62.77 & 61.25 & 74.72 & 75.52 & 83.45 \\
Bulgarian & 60.54 & 80.54 & 77.98 & 82.55 & 86.70 & 89.34 \\
Dutch & 14.16 & 25.65 & 26.00 & 32.50 & 32.50 & 42.20 \\
French & 42.94 & 62.92 & 68.94 & 74.30 & 77.04 & 80.90 \\
Germanic & 39.93 & 51.78 & 54.59 & 71.83 & 72.55 & 83.48 \\
Japanese & 53.56 & 65.04 & 73.74 & 62.71 & 71.08 & 82.19 \\
Norwegian & 59.39 & 78.87 & 73.69 & 83.53 & 89.06 & 91.77 \\
Ob-Ugrian & 59.58 & 77.87 & 73.35 & 78.04 & 82.55 & 86.04 \\
Romance & 40.48 & 71.28 & 63.16 & 76.37 & 77.55 & 95.62 \\
Sinitic & 27.34 & 28.57 & 30.75 & 72.46 & 74.04 & 98.95 \\
Slavic & 76.96 & 90.73 & 84.22 & 89.89 & 96.81 & 94.15 \\
\addlinespace
Global & 51.83 & 66.64 & 66.99 & 75.88 & 78.45 & 84.84 \\
Covington & 60.61 & 82.42 & 80.18 & 82.52 & 82.52 & 90.24 \\
\bottomrule
\end{tabular}
\caption{Scores, as percentage of total alignment pairs}
\label{tab:results}
\end{table}
\section{Discussion}
The first point we would like to draw attention to is that the one-hot encoding scores are consistently lower than those in the other columns.
This is expected because, unlike the other methods, one-hot encoding cannot represent variability in the degree of phonetic similarity between IPA segments.
Viewing the one-hot encoding scores as a baseline, we conclude that the other methods' distance measures do indeed contribute to the task of sequence alignment.
The PHOIBLE feature vectors are roughly on par with the phon2vec embeddings,
yield better results than the NN embeddings on two of the datasets (Japanese and Slavic), and are outperformed by the NN\,+\,RNN embeddings.
The better scores achieved by SCA and the data-driven embeddings can be explained by the discrete ternary nature of PHOIBLE's vectors and their smaller number of features.
Furthermore, PHOIBLE does not provide feature vectors for all IPA segments encountered in the BDPA datasets.
Not included in Table~\ref{tab:results} are the scores of the PHOIBLE-PC method,
which are instead contrasted with the scores of the base PHOIBLE method in Table~\ref{tab:results-phoible}.
PCA-reduced vectors of size 29 yield the highest scores
but still perform consistently worse than their non-reduced counterparts, albeit with a relatively small difference for most datasets.
Thus our hypothesis that applying PCA to the PHOIBLE vector space would improve the results is proven false.
\begin{table}[h]
\centering\small
\begin{tabular}{l *{2}{c}}
\toprule
& phoible & phoible-pc \\
\midrule
Andean & 87.31 & 82.31 \\
Bai & 62.77 & 00.23 \\
Bulgarian & 80.54 & 78.79 \\
Dutch & 25.65 & 24.53 \\
French & 62.92 & 61.31 \\
Germanic & 51.78 & 45.89 \\
Japanese & 65.04 & 62.56 \\
Norwegian & 78.87 & 75.80 \\
Ob-Ugrian & 77.87 & 76.20 \\
Romance & 71.28 & 71.04 \\
Sinitic & 28.57 & 00.00 \\
Slavic & 90.73 & 86.17 \\
\bottomrule
\end{tabular}
\caption{Scores: PHOIBLE vs. PHOIBLE-PC}
\label{tab:results-phoible}
\end{table}
Of the data-driven methods, phon2vec yields the lowest scores, being outperformed by both neural network models in all datasets except Japanese.
Given that both the phon2vec and the NN embeddings are trained on the same data,
we believe that the consistent difference is due to the fact that the phon2vec model ignores the order of IPA segments.
The NN\,+\,RNN model yields higher scores than the NN model that it builds upon,
indicating that the embeddings of the former capture useful information that the embeddings of the feedforward network do not\,---\,the
difference between the two models ranges from none for the Dutch dataset to almost 7 percentage points for the Slavic dataset.
For all but the Slavic dataset, SCA yields higher scores than our best-performing NN\,+\,RNN embeddings.
The score differences exhibit considerable variance\,---\,from less than 1 percentage point for the Andean dataset up to nearly 25 percentage points for the Sinitic dataset.
A possible explanation for this variance is the fact that not all IPA segments found in the benchmark datasets are found in the training data.
For example, NorthEuraLex includes a single tonal language, Mandarin Chinese,
and the NN model cannot produce meaningful embeddings for most of the tones encountered in the Sinitic and Bai datasets.
Arguably, a larger training dataset featuring a richer set of IPA segments would produce better-performing embeddings.
\chapter{Conclusion}
In this paper we have proposed, implemented, and evaluated a small set of methods for obtaining vector representations of IPA segments
for the purposes of pairwise IPA sequence alignment.
With the exception of a single testing dataset, our vector representations fail to outperform the current state-of-the-art IPA alignment method.
Nevertheless, we consider the results of the data-driven methods not too far off the mark,
and we believe that they could be significantly improved by using larger and more diverse training data.
This constitutes one direction for future experiments;
another possibility is to train and use embeddings specific to a particular language family or macro-area.
Further investigation is also needed with respect to comparing and evaluating the methods,
especially in the context of a larger application, such as cognacy identification or phylogenetic inference.
\bibliography{references}
\end{document}