pairmatch.numeric enhancement following Greifer, Savje #243

ngreifer · 2025-01-10T17:26:24Z

Hi optmatch team,

I'm always working on improving MatchIt and found an optimization that might be of interest for you. It is based on the algorithm described in Sävje (2020, Sec 2.3). It relies on the idea that for a 1:1 optimal pair match on a vector score, among the set of optimal solutions that have the same objective value, there will be at least one that has no "crossings", which are described in the paper. By preventing crossing prior to matching, we can dramatically reduce the problem size and speed up the matching, while still yielding a solution as optimal as one that would have been found without this restriction. In that sense, the restriction comes almost for free, similar to setting a caliper value larger than any pair distance in the final solution.

I coded the restriction in Rcpp and in preliminary testing have found it helps while ensuring a truly optimal solution is found. However, due to imprecision based on the numerical tolerance, it might be slightly better or worse than the solution found without it, though that difference goes away with small values of tol.

It works by processing the score vector and treatment status and produces a vector of indices that correspond to the indices in the distance matrix that should be retained for the matching (i.e., not pruned). This makes it work well with the InfinitySparseMatrix class because it corresponds to the subset of the matrix distance matrix that should have non-Inf values, i.e., that would be supplied to discardOthers(). I think an ideal implementation in optmatch would involve running this algorithm when a call to pairmatch() is called with ratio = 1 and the supplied first argument is a numeric vector (or one of the argument types that resolves to a vector, like a glm object). This code could be run prior to match_on() to prevent the calculation of the full distance matrix and instead immediately produce a sparse distance matrix, similar to using the within argument. You might consider adding a preprocess argument to allow the user to select whether they want to implement this (I don't see why they wouldn't except for backward compatibility for reproducing prior results). I believe it would need to be run separately for each subproblem, so perhaps it would only be called after splitting or only when there is no division into subproblems.

The code is below. I provide it with no license, so feel free to use and modify it with no restrictions and no need for attribution. Sorry for not writing comments; let me know if you want some help interpreting it.

using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

//Preprocess by pruning unnecessary edges as in Savje (2020) https://doi.org/10.1214/19-STS699.
//Returns a vector of matrix indices for n1xn0 distance matrix.

// [[Rcpp::export]]
IntegerVector preprocess_matchC(IntegerVector t,
                                NumericVector p) {
  R_xlen_t n = t.size();
  int n1 = sum(t == 1);
  int n0 = n - n1;

  int i, j;

  Function ord("order");

  IntegerVector o = ord(p);
  o = o - 1; //location of each unit after sorting

  IntegerVector im(n);
  int i0 = 0;
  int i1 = 0;
  for (j = 0; j < n; j++) {
    if (t[j] == 0) {
      im[j] = i0;
      i0++;
    }
    else {
      im[j] = i1;
      i1++;
    }
  }

  IntegerVector a(n0), b(n0);

  std::vector<int> queue;
  queue.reserve(n1);

  int k0 = 0;
  int ci = 0;

  for (i = 0; i < n; i++) {
    if (t[o[i]] == 1) {
      queue.push_back(i);
      continue;
    }

    if (queue.size() > k0) {
      a[ci] = queue[k0];
      k0++;
    }
    else {
      a[ci] = i;
    }

    ci++;
  }

  k0 = 0;
  ci = n0 - 1;

  queue.clear();

  for (i = n - 1; i >= 0; i--) {
    if (t[o[i]] == 1) {
      queue.push_back(i);
      continue;
    }

    if (queue.size() > k0) {
      b[ci] = queue[k0];
      k0++;
    }
    else {
      b[ci] = i;
    }

    ci--;
  }

  std::vector<int> keep;
  keep.reserve(n1 * n0);

  ci = 0;

  for (i = 0; i < n; i++) {
    if (t[o[i]] == 1) {
      continue;
    }

    for (j = a[ci]; j <= b[ci]; j++) {
      if (t[o[j]] == 0) {
        continue;
      }

      keep.push_back(im[o[j]] + n1 * im[o[i]]);
    }

    ci++;
  }

  IntegerVector out(keep.size());
  for (i = 0; i < keep.size(); i++) {
    out[i] = keep[i] + 1;
  }

  return out;
}

Noah

The text was updated successfully, but these errors were encountered:

benthestatistician · 2025-01-11T23:00:58Z

Thanks, Noah! I share your appreciation for @fsavje's nice idea, and complement you on this implementation.

A few thoughts, noncommittally in advance of speaking with collaborators:

The most direct path to updating our internal API for this functionality would be to make it an an option to match_on.numeric() and to pairmatch.numeric().
Quite likely we'd want to stop there, as the next steps would call not only for adding a similar option to match_on.glm() (which would pass it down to match_on.numeric()) but also for creating a pairmatch.glm() and attending logic.
To think about: Does Fredrik's no-crossings concept generalize to apply to 1:k matching, k>1?
Can the procedure make use of information on whether given potential pairings have already been excluded for other reasons, i.e. information from a within argument?

benthestatistician added the performance label Jan 12, 2025

benthestatistician changed the title ~~Suggestion for performance improvement for 1:1 pair matching on vector~~ pairmatch.numeric enhancement following Greifer, Savje Jan 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pairmatch.numeric enhancement following Greifer, Savje #243

pairmatch.numeric enhancement following Greifer, Savje #243

ngreifer commented Jan 10, 2025

benthestatistician commented Jan 11, 2025 •

edited

Loading

pairmatch.numeric enhancement following Greifer, Savje #243

pairmatch.numeric enhancement following Greifer, Savje #243

Comments

ngreifer commented Jan 10, 2025

benthestatistician commented Jan 11, 2025 • edited Loading

benthestatistician commented Jan 11, 2025 •

edited

Loading