You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'm always working on improving MatchIt and found an optimization that might be of interest for you. It is based on the algorithm described in Sävje (2020, Sec 2.3). It relies on the idea that for a 1:1 optimal pair match on a vector score, among the set of optimal solutions that have the same objective value, there will be at least one that has no "crossings", which are described in the paper. By preventing crossing prior to matching, we can dramatically reduce the problem size and speed up the matching, while still yielding a solution as optimal as one that would have been found without this restriction. In that sense, the restriction comes almost for free, similar to setting a caliper value larger than any pair distance in the final solution.
I coded the restriction in Rcpp and in preliminary testing have found it helps while ensuring a truly optimal solution is found. However, due to imprecision based on the numerical tolerance, it might be slightly better or worse than the solution found without it, though that difference goes away with small values of tol.
It works by processing the score vector and treatment status and produces a vector of indices that correspond to the indices in the distance matrix that should be retained for the matching (i.e., not pruned). This makes it work well with the InfinitySparseMatrix class because it corresponds to the subset of the matrix distance matrix that should have non-Inf values, i.e., that would be supplied to discardOthers(). I think an ideal implementation in optmatch would involve running this algorithm when a call to pairmatch() is called with ratio = 1 and the supplied first argument is a numeric vector (or one of the argument types that resolves to a vector, like a glm object). This code could be run prior to match_on() to prevent the calculation of the full distance matrix and instead immediately produce a sparse distance matrix, similar to using the within argument. You might consider adding a preprocess argument to allow the user to select whether they want to implement this (I don't see why they wouldn't except for backward compatibility for reproducing prior results). I believe it would need to be run separately for each subproblem, so perhaps it would only be called after splitting or only when there is no division into subproblems.
The code is below. I provide it with no license, so feel free to use and modify it with no restrictions and no need for attribution. Sorry for not writing comments; let me know if you want some help interpreting it.
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
//Preprocess by pruning unnecessary edges as in Savje (2020) https://doi.org/10.1214/19-STS699.
//Returns a vector of matrix indices for n1xn0 distance matrix.
// [[Rcpp::export]]
IntegerVector preprocess_matchC(IntegerVector t,
NumericVector p) {
R_xlen_t n = t.size();
int n1 = sum(t == 1);
int n0 = n - n1;
int i, j;
Function ord("order");
IntegerVector o = ord(p);
o = o - 1; //location of each unit after sorting
IntegerVector im(n);
int i0 = 0;
int i1 = 0;
for (j = 0; j < n; j++) {
if (t[j] == 0) {
im[j] = i0;
i0++;
}
else {
im[j] = i1;
i1++;
}
}
IntegerVector a(n0), b(n0);
std::vector<int> queue;
queue.reserve(n1);
int k0 = 0;
int ci = 0;
for (i = 0; i < n; i++) {
if (t[o[i]] == 1) {
queue.push_back(i);
continue;
}
if (queue.size() > k0) {
a[ci] = queue[k0];
k0++;
}
else {
a[ci] = i;
}
ci++;
}
k0 = 0;
ci = n0 - 1;
queue.clear();
for (i = n - 1; i >= 0; i--) {
if (t[o[i]] == 1) {
queue.push_back(i);
continue;
}
if (queue.size() > k0) {
b[ci] = queue[k0];
k0++;
}
else {
b[ci] = i;
}
ci--;
}
std::vector<int> keep;
keep.reserve(n1 * n0);
ci = 0;
for (i = 0; i < n; i++) {
if (t[o[i]] == 1) {
continue;
}
for (j = a[ci]; j <= b[ci]; j++) {
if (t[o[j]] == 0) {
continue;
}
keep.push_back(im[o[j]] + n1 * im[o[i]]);
}
ci++;
}
IntegerVector out(keep.size());
for (i = 0; i < keep.size(); i++) {
out[i] = keep[i] + 1;
}
return out;
}
Noah
The text was updated successfully, but these errors were encountered:
Thanks, Noah! I share your appreciation for @fsavje's nice idea, and complement you on this implementation.
A few thoughts, noncommittally in advance of speaking with collaborators:
The most direct path to updating our internal API for this functionality would be to make it an an option to match_on.numeric() and to pairmatch.numeric().
Quite likely we'd want to stop there, as the next steps would call not only for adding a similar option to match_on.glm() (which would pass it down to match_on.numeric()) but also for creating a pairmatch.glm() and attending logic.
To think about: Does Fredrik's no-crossings concept generalize to apply to 1:k matching, k>1?
Can the procedure make use of information on whether given potential pairings have already been excluded for other reasons, i.e. information from a within argument?
benthestatistician
changed the title
Suggestion for performance improvement for 1:1 pair matching on vector
pairmatch.numeric enhancement following Greifer, Savje
Jan 12, 2025
Hi
optmatch
team,I'm always working on improving
MatchIt
and found an optimization that might be of interest for you. It is based on the algorithm described in Sävje (2020, Sec 2.3). It relies on the idea that for a 1:1 optimal pair match on a vector score, among the set of optimal solutions that have the same objective value, there will be at least one that has no "crossings", which are described in the paper. By preventing crossing prior to matching, we can dramatically reduce the problem size and speed up the matching, while still yielding a solution as optimal as one that would have been found without this restriction. In that sense, the restriction comes almost for free, similar to setting a caliper value larger than any pair distance in the final solution.I coded the restriction in
Rcpp
and in preliminary testing have found it helps while ensuring a truly optimal solution is found. However, due to imprecision based on the numerical tolerance, it might be slightly better or worse than the solution found without it, though that difference goes away with small values oftol
.It works by processing the score vector and treatment status and produces a vector of indices that correspond to the indices in the distance matrix that should be retained for the matching (i.e., not pruned). This makes it work well with the
InfinitySparseMatrix
class because it corresponds to the subset of the matrix distance matrix that should have non-Inf
values, i.e., that would be supplied todiscardOthers()
. I think an ideal implementation inoptmatch
would involve running this algorithm when a call topairmatch()
is called withratio = 1
and the supplied first argument is a numeric vector (or one of the argument types that resolves to a vector, like aglm
object). This code could be run prior tomatch_on()
to prevent the calculation of the full distance matrix and instead immediately produce a sparse distance matrix, similar to using thewithin
argument. You might consider adding apreprocess
argument to allow the user to select whether they want to implement this (I don't see why they wouldn't except for backward compatibility for reproducing prior results). I believe it would need to be run separately for each subproblem, so perhaps it would only be called after splitting or only when there is no division into subproblems.The code is below. I provide it with no license, so feel free to use and modify it with no restrictions and no need for attribution. Sorry for not writing comments; let me know if you want some help interpreting it.
Noah
The text was updated successfully, but these errors were encountered: