Commit a113b49 (0 parents)
Showing 200 changed files with 1,646 additions and 0 deletions.
@@ -0,0 +1,16 @@
{
    "abstract": "The fields of machine learning and mathematical\nprogramming are increasingly intertwined. Optimization problems\nlie at the heart of most machine learning approaches. The Special\nTopic on Machine Learning and Large Scale Optimization examines\nthis interplay. Machine learning researchers have embraced the\nadvances in mathematical programming allowing new types of models\nto be pursued. The special topic includes models using quadratic,\nlinear, second-order cone, semi-definite, and semi-infinite\nprograms. We observe that the qualities of good optimization\nalgorithms from the machine learning and optimization perspectives\ncan be quite different. Mathematical programming puts a premium on\naccuracy, speed, and robustness. Since generalization is the\nbottom line in machine learning and training is normally done\noff-line, accuracy and small speed improvements are of little\nconcern in machine learning. Machine learning prefers simpler\nalgorithms that work in reasonable computational time for\nspecific classes of problems. Reducing machine learning problems\nto well-explored mathematical programming classes with robust\ngeneral purpose optimization codes allows machine learning\nresearchers to rapidly develop new techniques. In turn, machine\nlearning presents new challenges to mathematical programming. The\nspecial issue includes papers from two primary themes: novel\nmachine learning models and novel optimization approaches for\nexisting models. Many papers blend both themes, making small\nchanges in the underlying core mathematical program that enable\nthe development of effective new algorithms.",
    "authors": [
        "Kristin P. Bennett",
        "Emilio Parrado-Hern{{\\'a}}ndez"
    ],
    "id": "MLOPT-intro06a",
    "issue": 45,
    "pages": [
        1265,
        1281
    ],
    "title": "The Interplay of Optimization and Machine Learning Research",
    "volume": "7",
    "year": "2006"
}
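Each file added in this commit is a single JSON object describing one JMLR Volume 7 paper, with the fields shown above (abstract, authors, id, issue, pages, title, volume, year). Below is a minimal sketch of how such records could be loaded and rendered as citations, assuming one object per .json file; the function names and directory layout are hypothetical illustrations, not part of this repository:

```python
import json
from pathlib import Path

# Minimal sketch for working with the records added in this commit.
# The field names (abstract, authors, id, issue, pages, title, volume,
# year) come from the JSON shown above; the directory layout and both
# function names are assumptions for illustration.

def load_records(directory):
    """Parse every .json file in `directory` into a list of record dicts."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            records.append(json.load(f))
    return records

def format_citation(rec):
    """Render one record as a single-line JMLR-style citation."""
    authors = ", ".join(rec["authors"])
    first, last = rec["pages"]  # pages is a two-element [first, last] array
    return (f"{authors}. {rec['title']}. "
            f"Journal of Machine Learning Research {rec['volume']}"
            f"({rec['issue']}):{first}-{last}, {rec['year']}.")

if __name__ == "__main__":
    # Assumes the records sit in the current directory.
    for rec in load_records("."):
        print(format_citation(rec))
```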
@@ -0,0 +1,16 @@
{
    "abstract": "The prevalent use of computers and the internet has enhanced the quality\nof life for many people, but it has also attracted undesired attempts\nto undermine these systems. This special topic contains several\nresearch studies on how machine learning algorithms can help improve\nthe security of computer systems.",
    "authors": [
        "Philip K. Chan",
        "Richard P. Lippmann"
    ],
    "id": "MLSEC-intro06a",
    "issue": 95,
    "pages": [
        2669,
        2672
    ],
    "title": "Machine Learning for Computer Security",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,17 @@
{
    "abstract": "We study the computational and sample complexity of parameter and\nstructure learning in graphical models. Our main result shows that\nthe class of factor graphs with bounded degree can be learned in\npolynomial time and from a polynomial number of training examples,\nassuming that the data is generated by a network in this class. This\nresult covers both parameter estimation for a known network structure\nand structure learning. It implies as a corollary that we can learn\nfactor graphs for both Bayesian networks and Markov networks of\nbounded degree, in polynomial time and sample complexity. Importantly,\nunlike standard maximum likelihood estimation algorithms, our method\ndoes not require inference in the underlying network, and so applies\nto networks where inference is intractable. We also show that the\nerror of our learned model degrades gracefully when the generating\ndistribution is not a member of the target class of networks. In\naddition to our main result, we show that the sample complexity of\nparameter learning in graphical models has an <i>O</i>(1) dependence\non the number of variables in the model when using the KL-divergence\nnormalized by the number of variables as the performance criterion.",
    "authors": [
        "Pieter Abbeel",
        "Daphne Koller",
        "Andrew Y. Ng"
    ],
    "id": "abbeel06a",
    "issue": 63,
    "pages": [
        1743,
        1788
    ],
    "title": "Learning Factor Graphs in Polynomial Time and Sample Complexity",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,16 @@
{
    "abstract": "We consider the problem of learning a hypergraph using edge-detecting queries.\nIn this model, the learner may query whether a set of vertices induces an \nedge of the hidden hypergraph or not.\nWe show that an <i>r</i>-uniform hypergraph with <i>m</i> edges and <i>n</i> \nvertices is learnable with <i>O</i>(2<sup>4<i>r</i></sup><i>m</i> · \n<i>poly</i>(<i>r</i>,log<i>n</i>)) queries with high probability.\nThe queries can be made in <i>O</i>(min(2<sup><i>r</i></sup> \n(log <i>m+r</i>)<sup>2</sup>, (log <i>m+r</i>)<sup>3</sup>)) rounds.\nWe also give an algorithm that learns an almost uniform hypergraph of \ndimension <i>r</i> using <i>O</i>(2<sup><i>O</i>((1+Δ/2)r)</sup> · \n<i>m</i><sup>1+Δ/2</sup> · <i>poly</i>(log <i>n</i>)) \nqueries with high probability,\nwhere Δ is the difference between the maximum and the minimum edge \nsizes. This upper bound matches our lower bound of \nΩ((<i>m</i>/(1+Δ/2))<sup>1+Δ/2</sup>) for this \nclass of hypergraphs in terms of dependence on <i>m</i>.\nThe queries can also be made in \n<i>O</i>((1+Δ) · min(2<sup><i>r</i></sup> (log <i>m+r</i>)<sup>2</sup>, \n(log <i>m+r</i>)<sup>3</sup>)) rounds.",
    "authors": [
        "Dana Angluin",
        "Jiang Chen"
    ],
    "id": "angluin06a",
    "issue": 78,
    "pages": [
        2215,
        2236
    ],
    "title": "Learning a Hidden Hypergraph",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,17 @@
{
    "abstract": "Receiver Operating Characteristic (ROC) curves are a standard way to\ndisplay the performance of a set of binary classifiers for all\nfeasible ratios of the costs associated with false positives and\nfalse negatives. For linear classifiers, the set of classifiers is\ntypically obtained by training once, holding constant the estimated\nslope and then varying the intercept to obtain a parameterized set\nof classifiers whose performances can be plotted in the ROC plane.\nWe consider the alternative of varying the asymmetry of the cost\nfunction used for training. We show that the ROC curve obtained by\nvarying both the intercept and the asymmetry, and hence the slope,\nalways outperforms the ROC curve obtained by varying only the\nintercept. In addition, we present a path-following algorithm for\nthe support vector machine (SVM) that can compute efficiently the\nentire ROC curve, and that has the same computational complexity as\ntraining a single classifier. Finally, we provide a theoretical\nanalysis of the relationship between the asymmetric cost model\nassumed when training a classifier and the cost model assumed in\napplying the classifier. In particular, we show that the mismatch\nbetween the step function used for testing and its convex upper\nbounds, usually used for training, leads to a provable and\nquantifiable difference around extreme asymmetries.",
    "authors": [
        "Francis R. Bach",
        "David Heckerman",
        "Eric Horvitz"
    ],
    "id": "bach06a",
    "issue": 62,
    "pages": [
        1713,
        1741
    ],
    "title": "Considering Cost Asymmetry in Learning Classifiers",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,16 @@
{
    "abstract": "Spectral clustering refers to a class of techniques which rely on\nthe eigenstructure of a similarity matrix to partition points into\ndisjoint clusters, with points in the same cluster having high\nsimilarity and points in different clusters having low similarity.\nIn this paper, we derive new cost functions for spectral\nclustering based on measures of error between a given partition\nand a solution of the spectral relaxation of a minimum normalized\ncut problem. Minimizing these cost functions with respect to the\npartition leads to new spectral clustering algorithms. Minimizing\nwith respect to the similarity matrix leads to algorithms for\nlearning the similarity matrix from fully labelled data sets. We\napply our learning algorithm to the blind one-microphone speech\nseparation problem, casting the problem as one of segmentation\nof the spectrogram.",
    "authors": [
        "Francis R. Bach",
        "Michael I. Jordan"
    ],
    "id": "bach06b",
    "issue": 70,
    "pages": [
        1963,
        2001
    ],
    "title": "Learning Spectral Clustering, With Application To Speech Separation",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,15 @@
{
    "abstract": "We introduce a method for approximate smoothed inference in a class\nof switching linear dynamical systems, based on a novel form of\nGaussian Sum smoother. This class includes the switching Kalman\n'Filter' and the more general case of switch transitions dependent\non the continuous latent state. The method improves on the standard\nKim smoothing approach by dispensing with one of the key\napproximations, thus making fuller use of the available future\ninformation.\nWhilst the central assumption required is projection to a mixture of\nGaussians, we show that an additional conditional independence\nassumption results in a simpler but accurate alternative. Our method\nconsists of a single Forward and Backward Pass and is reminiscent of\nthe standard smoothing 'correction' recursions in the simpler linear\ndynamical system. The method is numerically stable and compares\nfavourably against alternative approximations, both in cases where a\nsingle mixture component provides a good posterior approximation,\nand where a multimodal approximation is required.",
    "authors": [
        "David Barber"
    ],
    "id": "barber06a",
    "issue": 88,
    "pages": [
        2515,
        2540
    ],
    "title": "Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,16 @@
{
    "abstract": "We present worst case bounds for the learning\nrate of a known prediction method that is based on hierarchical\napplications of binary context tree weighting (CTW) predictors. A\nheuristic application of this approach that relies on Huffman's alphabet\ndecomposition is known to achieve state-of-the-art performance\nin prediction and lossless compression benchmarks. We show that our\nnew bound for this heuristic is tighter than the best known\nperformance guarantees for prediction and lossless compression\nalgorithms in various settings. This result\nsubstantiates the efficiency of this hierarchical method and provides a compelling\nexplanation for its practical success.\nIn addition, we present the results of a few experiments that\nexamine other possibilities for improving the multi-alphabet\nprediction performance of CTW-based algorithms.",
    "authors": [
        "Ron Begleiter",
        "Ran El-Yaniv"
    ],
    "id": "begleiter06a",
    "issue": 12,
    "pages": [
        379,
        411
    ],
    "title": "Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,17 @@
{
    "abstract": "We propose a family of learning algorithms based on a new form of\nregularization that allows us to exploit the geometry of the marginal\ndistribution. We focus on a semi-supervised framework that\nincorporates labeled and unlabeled data in a general-purpose learner.\nSome transductive graph learning algorithms and standard methods\nincluding support vector machines and regularized least squares can be\nobtained as special cases. We use properties of reproducing kernel\nHilbert spaces to prove new Representer theorems that provide\ntheoretical basis for the algorithms. As a result (in contrast to\npurely graph-based approaches) we obtain a natural out-of-sample\nextension to novel examples and so are able to handle both\ntransductive and truly semi-supervised settings. We present\nexperimental evidence suggesting that our semi-supervised algorithms\nare able to use unlabeled data effectively. Finally we have a brief\ndiscussion of unsupervised and fully supervised learning within our\ngeneral framework.",
    "authors": [
        "Mikhail Belkin",
        "Partha Niyogi",
        "Vikas Sindhwani"
    ],
    "id": "belkin06a",
    "issue": 84,
    "pages": [
        2399,
        2434
    ],
    "title": "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,17 @@
{
    "abstract": "We consider an optimization problem in probabilistic inference: Given\n<i>n</i> hypotheses <i>H<sub>j</sub></i>, <i>m</i> possible \nobservations <i>O<sub>k</sub></i>, their\nconditional probabilities <i>p<sub>kj</sub></i>, and a particular \n<i>O<sub>k</sub></i>, select a\npossibly small subset of hypotheses excluding the true target only\nwith some error probability ε. After specifying the\noptimization goal we show that this problem can be solved through a\nlinear program in <i>mn</i> variables that indicate the probabilities to\ndiscard a hypothesis given an observation. Moreover, we can compute\noptimal strategies where only <i>O(m+n)</i> of these variables get\nfractional values. The manageable size of the linear programs and the\nmostly deterministic shape of optimal strategies make the method\npracticable. We interpret the dual variables as worst-case\ndistributions of hypotheses, and we point out some counterintuitive\nnonmonotonic behaviour of the variables as a function of the error\nbound ε. One of the open problems is the existence of a\npurely combinatorial algorithm that is faster than generic linear\nprogramming.",
    "authors": [
        "Anders Bergkvist",
        "Peter Damaschke",
        "Marcel L{{\\\"u}}thi"
    ],
    "id": "bergkvist06a",
    "issue": 48,
    "pages": [
        1339,
        1355
    ],
    "title": "Linear Programs for Hypotheses Selection in Probabilistic Inference Models",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,17 @@
{
    "abstract": "We study the problem of long-run average cost control of Markov chains\nconditioned on a rare event. In a related recent work, a simulation\nbased algorithm for estimating performance measures associated with a\nMarkov chain conditioned on a rare event has been developed. We extend\nideas from this work and develop an adaptive algorithm for obtaining,\nonline, optimal control policies conditioned on a rare event. Our\nalgorithm uses three timescales or step-size schedules. On the slowest\ntimescale, a gradient search algorithm for policy updates that is\nbased on one-simulation simultaneous perturbation stochastic\napproximation (SPSA) type estimates is used. Deterministic\nperturbation sequences obtained from appropriate normalized Hadamard\nmatrices are used here. The fast timescale recursions compute the\nconditional transition probabilities of an associated chain by\nobtaining solutions to the multiplicative Poisson equation (for a\ngiven policy estimate). Further, the risk parameter associated with\nthe value function for a given policy estimate is updated on a\ntimescale that lies in between the two scales above. We briefly sketch\nthe convergence analysis of our algorithm and present a numerical\napplication in the setting of routing multiple flows in communication\nnetworks.",
    "authors": [
        "Shalabh Bhatnagar",
        "Vivek S. Borkar",
        "Madhukar Akarapu"
    ],
    "id": "bhatnagar06a",
    "issue": 69,
    "pages": [
        1937,
        1962
    ],
    "title": "A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,17 @@
{
    "abstract": "We give a review of various aspects of boosting, clarifying the\nissues through a few simple results, and relate our work and that of\nothers to the minimax paradigm of statistics. We consider the\npopulation version of the boosting algorithm and prove its\nconvergence to the Bayes classifier as a corollary of a general\nresult about Gauss-Southwell optimization in Hilbert space. We then\ninvestigate the algorithmic convergence of the sample version, and\ngive bounds to the time until perfect separation of the sample. We\nconclude with some results on the statistical optimality of the <i>L<sub>2</sub></i>\nboosting.",
    "authors": [
        "Peter J. Bickel",
        "Ya'acov Ritov",
        "Alon Zakai"
    ],
    "id": "bickel06a",
    "issue": 24,
    "pages": [
        705,
        732
    ],
    "title": "Some Theory for Generalized Boosting Algorithms",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,19 @@
{
    "abstract": "Finding non-Gaussian components of high-dimensional data is an\nimportant preprocessing step for efficient information processing.\nThis article proposes a new <i>linear</i> method to identify the\n\"non-Gaussian subspace\" within a very general semi-parametric\nframework. Our proposed method, called NGCA (non-Gaussian component\nanalysis), is based on a linear operator which, to any arbitrary\nnonlinear (smooth) function, associates a vector belonging to the\nlow dimensional non-Gaussian target subspace, up to an estimation\nerror. By applying this operator to a family of different nonlinear\nfunctions, one obtains a family of different vectors lying in a\nvicinity of the target space. As a final step, the target space\nitself is estimated by applying PCA to this family of vectors. We\nshow that this procedure is consistent in the sense that the\nestimation error tends to zero at a parametric rate, uniformly over\nthe family. Numerical examples demonstrate the usefulness of our\nmethod.",
    "authors": [
        "Gilles Blanchard",
        "Motoaki Kawanabe",
        "Masashi Sugiyama",
        "Vladimir Spokoiny",
        "Klaus-Robert M{{\\\"u}}ller"
    ],
    "id": "blanchard06a",
    "issue": 8,
    "pages": [
        247,
        282
    ],
    "title": "In Search of Non-Gaussian Components of a High-Dimensional Distribution",
    "volume": "7",
    "year": "2006"
}
@@ -0,0 +1,19 @@
{
    "abstract": "Spam filtering poses a special problem in text categorization, of\nwhich the defining characteristic is that filters face an active\nadversary, which constantly attempts to evade filtering. Since spam\nevolves continuously and most practical applications are based on\nonline user feedback, the task calls for fast, incremental and robust\nlearning algorithms. In this paper, we investigate a novel approach to\nspam filtering based on adaptive statistical data compression\nmodels. The nature of these models allows them to be employed as\nprobabilistic text classifiers based on character-level or binary\nsequences. By modeling messages as sequences, tokenization and other\nerror-prone preprocessing steps are omitted altogether, resulting in a\nmethod that is very robust. The models are also fast to construct and\nincrementally updateable. We evaluate the filtering performance of two\ndifferent compression algorithms: dynamic Markov compression and\nprediction by partial matching. The results of our empirical\nevaluation indicate that compression models outperform currently\nestablished spam filters, as well as a number of methods proposed in\nprevious studies.",
    "authors": [
        "Andrej Bratko",
        "Gordon V. Cormack",
        "Bogdan Filipič",
        "Thomas R. Lynam",
        "Bla{\\v{z}} Zupan"
    ],
    "id": "bratko06a",
    "issue": 96,
    "pages": [
        2673,
        2698
    ],
    "title": "Spam Filtering Using Statistical Data Compression Models",
    "volume": "7",
    "year": "2006"
}