Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pairmatch.numeric enhancement following Greifer, Savje #243

Open
ngreifer opened this issue Jan 10, 2025 · 1 comment
Open

pairmatch.numeric enhancement following Greifer, Savje #243

ngreifer opened this issue Jan 10, 2025 · 1 comment

Comments

@ngreifer
Copy link

Hi optmatch team,

I'm always working on improving MatchIt and found an optimization that might be of interest for you. It is based on the algorithm described in Sävje (2020, Sec 2.3). It relies on the idea that for a 1:1 optimal pair match on a vector score, among the set of optimal solutions that have the same objective value, there will be at least one that has no "crossings", which are described in the paper. By preventing crossing prior to matching, we can dramatically reduce the problem size and speed up the matching, while still yielding a solution as optimal as one that would have been found without this restriction. In that sense, the restriction comes almost for free, similar to setting a caliper value larger than any pair distance in the final solution.

I coded the restriction in Rcpp and in preliminary testing have found it helps while ensuring a truly optimal solution is found. However, due to imprecision based on the numerical tolerance, it might be slightly better or worse than the solution found without it, though that difference goes away with small values of tol.

It works by processing the score vector and treatment status and produces a vector of indices that correspond to the indices in the distance matrix that should be retained for the matching (i.e., not pruned). This makes it work well with the InfinitySparseMatrix class because it corresponds to the subset of the matrix distance matrix that should have non-Inf values, i.e., that would be supplied to discardOthers(). I think an ideal implementation in optmatch would involve running this algorithm when a call to pairmatch() is called with ratio = 1 and the supplied first argument is a numeric vector (or one of the argument types that resolves to a vector, like a glm object). This code could be run prior to match_on() to prevent the calculation of the full distance matrix and instead immediately produce a sparse distance matrix, similar to using the within argument. You might consider adding a preprocess argument to allow the user to select whether they want to implement this (I don't see why they wouldn't except for backward compatibility for reproducing prior results). I believe it would need to be run separately for each subproblem, so perhaps it would only be called after splitting or only when there is no division into subproblems.

The code is below. I provide it with no license, so feel free to use and modify it with no restrictions and no need for attribution. Sorry for not writing comments; let me know if you want some help interpreting it.

using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

//Preprocess by pruning unnecessary edges as in Savje (2020) https://doi.org/10.1214/19-STS699.
//Returns a vector of matrix indices for n1xn0 distance matrix.

// [[Rcpp::export]]
IntegerVector preprocess_matchC(IntegerVector t,
                                NumericVector p) {
  R_xlen_t n = t.size();
  int n1 = sum(t == 1);
  int n0 = n - n1;

  int i, j;

  Function ord("order");

  IntegerVector o = ord(p);
  o = o - 1; //location of each unit after sorting

  IntegerVector im(n);
  int i0 = 0;
  int i1 = 0;
  for (j = 0; j < n; j++) {
    if (t[j] == 0) {
      im[j] = i0;
      i0++;
    }
    else {
      im[j] = i1;
      i1++;
    }
  }

  IntegerVector a(n0), b(n0);

  std::vector<int> queue;
  queue.reserve(n1);

  int k0 = 0;
  int ci = 0;

  for (i = 0; i < n; i++) {
    if (t[o[i]] == 1) {
      queue.push_back(i);
      continue;
    }

    if (queue.size() > k0) {
      a[ci] = queue[k0];
      k0++;
    }
    else {
      a[ci] = i;
    }

    ci++;
  }

  k0 = 0;
  ci = n0 - 1;

  queue.clear();

  for (i = n - 1; i >= 0; i--) {
    if (t[o[i]] == 1) {
      queue.push_back(i);
      continue;
    }

    if (queue.size() > k0) {
      b[ci] = queue[k0];
      k0++;
    }
    else {
      b[ci] = i;
    }

    ci--;
  }

  std::vector<int> keep;
  keep.reserve(n1 * n0);

  ci = 0;

  for (i = 0; i < n; i++) {
    if (t[o[i]] == 1) {
      continue;
    }

    for (j = a[ci]; j <= b[ci]; j++) {
      if (t[o[j]] == 0) {
        continue;
      }

      keep.push_back(im[o[j]] + n1 * im[o[i]]);
    }

    ci++;
  }

  IntegerVector out(keep.size());
  for (i = 0; i < keep.size(); i++) {
    out[i] = keep[i] + 1;
  }

  return out;
}

Noah

@benthestatistician
Copy link
Collaborator

benthestatistician commented Jan 11, 2025

Thanks, Noah! I share your appreciation for @fsavje's nice idea, and complement you on this implementation.

A few thoughts, noncommittally in advance of speaking with collaborators:

  1. The most direct path to updating our internal API for this functionality would be to make it an an option to match_on.numeric() and to pairmatch.numeric().
  2. Quite likely we'd want to stop there, as the next steps would call not only for adding a similar option to match_on.glm() (which would pass it down to match_on.numeric()) but also for creating a pairmatch.glm() and attending logic.
  3. To think about: Does Fredrik's no-crossings concept generalize to apply to 1:k matching, k>1?
  4. Can the procedure make use of information on whether given potential pairings have already been excluded for other reasons, i.e. information from a within argument?

@benthestatistician benthestatistician changed the title Suggestion for performance improvement for 1:1 pair matching on vector pairmatch.numeric enhancement following Greifer, Savje Jan 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants