regrake implements regularized raking,
following Barratt
et al. (2021). It formulates weighting as an optimization
problem:
\[\text{minimize} \quad \ell(Fw, f^{\text{target}}) + \lambda \, r(w)\]
where \(w\) are the weights, \(F\) is a design matrix mapping weights to weighted statistics, \(\ell\) is a loss measuring how well we match population targets, and \(r\) is a regularizer that keeps weights from becoming extreme.
The key payoff: unlike traditional raking (which requires exact matching on every constraint), this framework enables a greater variety of constraint types. You can, for example, exactly match population marginals while least-squares matching joint distributions or leverage continuous targets. When combined with sensible regularization, this expanded toolbox for creating weights allows for solutions that can be both more expressive and statistically efficient.
For more on the (very fun, elegant) optimization math here, see the original paper or my NYOSP talk slides, which motivates why this is particularly helpful for survey weighting these days.
We need two things: population targets and a (biased) survey sample.
pop_targets <- data.frame(
variable = c("age_group", "age_group", "age_group",
"race", "race", "race", "race",
"income"),
level = c("young", "middle", "old",
"white", "black", "hispanic", "asian",
"mean"),
target = c(0.35, 0.40, 0.25,
0.60, 0.13, 0.18, 0.09,
71092)
)
pop_targets
#> variable level target
#> 1 age_group young 0.35
#> 2 age_group middle 0.40
#> 3 age_group old 0.25
#> 4 race white 0.60
#> 5 race black 0.13
#> 6 race hispanic 0.18
#> 7 race asian 0.09
#> 8 income mean 71092.00This is the "proportions" format — a data frame with
variable, level, and target
columns (the format used by the autumn package). The
package also accepts other common formats from packages like
anesrake and survey; see the feature reference
below.
Now a biased sample that over-represents older and white respondents, with income skewed above the population mean:
The simplest use of regrake is functionally equivalent
to traditional raking: exact matching on categorical variables with an
entropy regularizer.
result_exact <- regrake(
data = sample_data,
formula = ~ rr_exact(age_group + race),
population_data = pop_targets,
regularizer = "entropy", # default behavior, but showing for clarity
pop_type = "proportions"
)
result_exact
#> Regrake result
#> Sample size: 800
#> Constraints: 7 (7 exact)
#> Regularizer: entropy (lambda = 1)
#> Converged: Yes (133 iterations)
#> Weight range: [0.506, 2.794]
#> Kish ESS: 639 / 800 (deff = 1.25)
#> Max margin diff: 2.04e-06Moving beyond an equivalent solution to regular raking, suppose we also want to:
The interaction probably isn’t too much to ask of regular raking in
this case, but as weighting problems grow larger, raking with many
interactions may struggle to converge, and likely will have high
variance costs. This motivates the notion of mixing constraints: we’ll
use rr_exact() for the age/race marginals (combining them
in one call with +) and rr_l2() for the joint
distribution.
For the interaction, we need joint distribution targets. These reflect modest departures from independence — younger respondents are somewhat more Asian/Hispanic and less White than the marginals alone would imply.
interaction_targets <- data.frame(
variable = "age_group:race",
level = c("young:asian", "young:black", "young:hispanic", "young:white",
"middle:asian", "middle:black", "middle:hispanic", "middle:white",
"old:asian", "old:black", "old:hispanic", "old:white"),
target = c(0.036, 0.048, 0.068, 0.198,
0.036, 0.052, 0.071, 0.241,
0.019, 0.030, 0.041, 0.161)
)
pop_targets_full <- rbind(pop_targets, interaction_targets)Now we can mix exact and soft constraints:
result_mixed <- regrake(
data = sample_data,
formula = ~ rr_exact(age_group + race) +
rr_mean(income) + rr_l2(age_group:race),
population_data = pop_targets_full,
pop_type = "proportions"
)
#> Warning: Variables in rr_l2(age_group:race) also appear as main effects. Using
#> exact constraints for main effects and rr_l2 constraint for the interaction
#> term
result_mixed
#> Regrake result
#> Sample size: 800
#> Constraints: 20 (8 exact, 12 l2)
#> Regularizer: entropy (lambda = 1)
#> Converged: Yes (158 iterations)
#> Weight range: [0.443, 2.945]
#> Kish ESS: 638 / 800 (deff = 1.25)
#> Max margin diff: 1.25e-02This works smoothly because the L2 loss on the interaction doesn’t demand perfect matching on every joint cell — it just pushes the weights in the right direction. The optimizer finds the best trade-off between matching all the constraints and keeping the weights well-behaved in a variance sense.
With mixed constraints, convergence doesn’t mean perfect matching. This is an important difference from traditional raking — the mental shift is thinking about how the weights converged, not just whether they converged.
The package provides a balance data frame in results to
examine this:
bal <- result_mixed$balance
bal
#> constraint type variable level achieved
#> 1 exact_age_group exact age_group middle 4.000010e-01
#> 2 exact_age_group exact age_group old 2.499983e-01
#> 3 exact_age_group exact age_group young 3.500006e-01
#> 4 exact_race exact race asian 9.000020e-02
#> 5 exact_race exact race black 1.300000e-01
#> 6 exact_race exact race hispanic 1.800000e-01
#> 7 exact_race exact race white 5.999997e-01
#> 8 exact_income exact income mean 7.109199e+04
#> 9 l2_age_group:race l2 age_group:race middle:asian 3.974373e-02
#> 10 l2_age_group:race l2 age_group:race old:asian 2.554885e-02
#> 11 l2_age_group:race l2 age_group:race young:asian 2.470761e-02
#> 12 l2_age_group:race l2 age_group:race middle:black 5.329382e-02
#> 13 l2_age_group:race l2 age_group:race old:black 3.012934e-02
#> 14 l2_age_group:race l2 age_group:race young:black 4.657687e-02
#> 15 l2_age_group:race l2 age_group:race middle:hispanic 6.878663e-02
#> 16 l2_age_group:race l2 age_group:race old:hispanic 4.276139e-02
#> 17 l2_age_group:race l2 age_group:race young:hispanic 6.845202e-02
#> 18 l2_age_group:race l2 age_group:race middle:white 2.381768e-01
#> 19 l2_age_group:race l2 age_group:race old:white 1.515588e-01
#> 20 l2_age_group:race l2 age_group:race young:white 2.102641e-01
#> target residual
#> 1 4.000000e-01 1.019591e-06
#> 2 2.500000e-01 -1.659420e-06
#> 3 3.500000e-01 6.398289e-07
#> 4 9.000000e-02 1.962003e-07
#> 5 1.300000e-01 3.521587e-08
#> 6 1.800000e-01 4.222431e-08
#> 7 6.000000e-01 -2.736405e-07
#> 8 7.109200e+04 -9.766847e-03
#> 9 3.596404e-02 3.779694e-03
#> 10 1.898102e-02 6.567833e-03
#> 11 3.596404e-02 -1.125642e-02
#> 12 5.194805e-02 1.345770e-03
#> 13 2.997003e-02 1.593137e-04
#> 14 4.795205e-02 -1.375178e-03
#> 15 7.092907e-02 -2.142446e-03
#> 16 4.095904e-02 1.802353e-03
#> 17 6.793207e-02 5.199556e-04
#> 18 2.407592e-01 -2.582398e-03
#> 19 1.608392e-01 -9.280409e-03
#> 20 1.978022e-01 1.246193e-02Exact constraints should have near-zero residuals. L2 and mean constraints will have small residuals reflecting the optimization trade-off.
You can visualize this with a simple residual plot:
# Color by constraint type
cols <- ifelse(bal$type == "exact", "#2166AC", "#B2182B")
pch <- ifelse(bal$type == "exact", 16, 17)
plot(
bal$target, bal$residual,
col = cols, pch = pch,
xlab = "Target value",
ylab = "Residual (achieved - target)",
main = "Balance: exact constraints vs soft constraints"
)
abline(h = 0, lty = 2, col = "gray50")
legend(
"topright",
legend = c("Exact (must match)", "L2 / mean (soft)"),
col = c("#2166AC", "#B2182B"),
pch = c(16, 17),
cex = 0.8
)Exact constraints (blue circles) cluster tightly at zero. Soft constraints (red triangles) have small residuals — the optimizer matched them as closely as it could while satisfying the exact constraints and keeping weights reasonable.
The diagnostics field has additional information about
weight quality:
If your weighting problem only needs exact matching on a few
categorical marginals to be representative, packages like
autumn, anesrake, or survey::rake
will likely be all you need. Breaking out regrake becomes
increasingly sensible as you have more work to do with your weighting,
especially if you’re sensitive to the variance price you’re paying.
On the other end of weighting tool complexity, there are a broad family of approaches like (Bayesian) multilevel regression and postratification or more recent improvements, which bring even more modeling flexibility to bear on the weighting problem. These are powerful tools, but my sense is that often a bit of flexibility and regularization goes a long way — regularized raking can get you a lot of the flexibility here without all the complexity and computation fancier methods require.
This vignette covered the core workflow. Here’s a quick index of what
else regrake supports:
Constraint types:
rr_exact() — equality constraints (equivalent to
traditional raking)rr_l2() — least-squares soft matchingrr_kl() — KL divergence soft matching to a reference
distributionrr_mean(), rr_var(),
rr_quantile() — continuous variable targetsrr_range(age, 40, 50) — constrain weighted mean to an
intervalWeight control:
bounds = c(0.3, 3) — limit weights to 0.3x–3x the
averagebounds_method = "hard" — strict enforcement (default is
"soft")regularizer — "entropy" (default),
"sum_squares", or "kl"Population data:
"raw" — unit-level population data (one row per
person); regrake computes proportions and means internally"weighted" — unit-level data with a weight column
(e.g., from ACS microdata)Interoperability with other R packages:
"proportions" — the autumn format used in
this vignette"anesrake" — named list of target vectors (as used by
anesrake::anesrake)"survey" / "survey_design" — margin tables
or design objects from the survey packageSolver tuning:
control = list(margin_tol = 0.001) — adjust convergence
toleranceSee ?regrake for full documentation.