Introduction to regrake

What is regrake?

regrake implements regularized raking, following Barratt et al. (2021). It formulates weighting as an optimization problem:

\[\text{minimize} \quad \ell(Fw, f^{\text{target}}) + \lambda \, r(w)\]

where \(w\) are the weights, \(F\) is a design matrix mapping weights to weighted statistics, \(\ell\) is a loss measuring how well we match population targets, and \(r\) is a regularizer that keeps weights from becoming extreme.

The key payoff: unlike traditional raking (which requires exact matching on every constraint), this framework enables a greater variety of constraint types. You can, for example, exactly match population marginals while least-squares matching joint distributions or leverage continuous targets. When combined with sensible regularization, this expanded toolbox for creating weights allows for solutions that can be both more expressive and statistically efficient.

For more on the (very fun, elegant) optimization math here, see the original paper or my NYOSP talk slides, which motivates why this is particularly helpful for survey weighting these days.

Setup

library(regrake)
set.seed(604)

We need two things: population targets and a (biased) survey sample.

pop_targets <- data.frame(
  variable = c("age_group", "age_group", "age_group",
                "race", "race", "race", "race",
                "income"),
  level = c("young", "middle", "old",
            "white", "black", "hispanic", "asian",
            "mean"),
  target = c(0.35, 0.40, 0.25,
             0.60, 0.13, 0.18, 0.09,
             71092)
)
pop_targets
#>    variable    level   target
#> 1 age_group    young     0.35
#> 2 age_group   middle     0.40
#> 3 age_group      old     0.25
#> 4      race    white     0.60
#> 5      race    black     0.13
#> 6      race hispanic     0.18
#> 7      race    asian     0.09
#> 8    income     mean 71092.00

This is the "proportions" format — a data frame with variable, level, and target columns (the format used by the autumn package). The package also accepts other common formats from packages like anesrake and survey; see the feature reference below.

Now a biased sample that over-represents older and white respondents, with income skewed above the population mean:

n <- 800
sample_data <- data.frame(
  age_group = sample(c("young", "middle", "old"), n, replace = TRUE,
                     prob = c(0.20, 0.35, 0.45)),
  race = sample(c("asian", "black", "hispanic", "white"), n, replace = TRUE,
                prob = c(0.06, 0.10, 0.14, 0.70)),
  income = rnorm(n, mean = 72000, sd = 15000)
)

Basic raking with exact constraints

The simplest use of regrake is functionally equivalent to traditional raking: exact matching on categorical variables with an entropy regularizer.

result_exact <- regrake(
  data = sample_data,
  formula = ~ rr_exact(age_group + race),
  population_data = pop_targets,
  regularizer = "entropy", # default behavior, but showing for clarity
  pop_type = "proportions"
)
result_exact
#> Regrake result
#>   Sample size:      800
#>   Constraints:      7 (7 exact)
#>   Regularizer:      entropy (lambda = 1)
#>   Converged:        Yes (133 iterations)
#>   Weight range:     [0.506, 2.794]
#>   Kish ESS:         639 / 800 (deff = 1.25)
#>   Max margin diff:  2.04e-06

Adding complexity: mixed constraints

Moving beyond an equivalent solution to regular raking, suppose we also want to:

  • Match the mean income (a continuous target)
  • Softly match the age group-by-race interaction (L2 loss on joint cells)

The interaction probably isn’t too much to ask of regular raking in this case, but as weighting problems grow larger, raking with many interactions may struggle to converge, and likely will have high variance costs. This motivates the notion of mixing constraints: we’ll use rr_exact() for the age/race marginals (combining them in one call with +) and rr_l2() for the joint distribution.

For the interaction, we need joint distribution targets. These reflect modest departures from independence — younger respondents are somewhat more Asian/Hispanic and less White than the marginals alone would imply.

interaction_targets <- data.frame(
  variable = "age_group:race",
  level = c("young:asian", "young:black", "young:hispanic", "young:white",
            "middle:asian", "middle:black", "middle:hispanic", "middle:white",
            "old:asian", "old:black", "old:hispanic", "old:white"),
  target = c(0.036, 0.048, 0.068, 0.198,
             0.036, 0.052, 0.071, 0.241,
             0.019, 0.030, 0.041, 0.161)
)

pop_targets_full <- rbind(pop_targets, interaction_targets)

Now we can mix exact and soft constraints:

result_mixed <- regrake(
  data = sample_data,
  formula = ~ rr_exact(age_group + race) +
              rr_mean(income) + rr_l2(age_group:race),
  population_data = pop_targets_full,
  pop_type = "proportions"
)
#> Warning: Variables in rr_l2(age_group:race) also appear as main effects. Using
#> exact constraints for main effects and rr_l2 constraint for the interaction
#> term
result_mixed
#> Regrake result
#>   Sample size:      800
#>   Constraints:      20 (8 exact, 12 l2)
#>   Regularizer:      entropy (lambda = 1)
#>   Converged:        Yes (158 iterations)
#>   Weight range:     [0.443, 2.945]
#>   Kish ESS:         638 / 800 (deff = 1.25)
#>   Max margin diff:  1.25e-02

This works smoothly because the L2 loss on the interaction doesn’t demand perfect matching on every joint cell — it just pushes the weights in the right direction. The optimizer finds the best trade-off between matching all the constraints and keeping the weights well-behaved in a variance sense.

Checking balance

With mixed constraints, convergence doesn’t mean perfect matching. This is an important difference from traditional raking — the mental shift is thinking about how the weights converged, not just whether they converged.

The package provides a balance data frame in results to examine this:

bal <- result_mixed$balance
bal
#>           constraint  type       variable           level     achieved
#> 1    exact_age_group exact      age_group          middle 4.000010e-01
#> 2    exact_age_group exact      age_group             old 2.499983e-01
#> 3    exact_age_group exact      age_group           young 3.500006e-01
#> 4         exact_race exact           race           asian 9.000020e-02
#> 5         exact_race exact           race           black 1.300000e-01
#> 6         exact_race exact           race        hispanic 1.800000e-01
#> 7         exact_race exact           race           white 5.999997e-01
#> 8       exact_income exact         income            mean 7.109199e+04
#> 9  l2_age_group:race    l2 age_group:race    middle:asian 3.974373e-02
#> 10 l2_age_group:race    l2 age_group:race       old:asian 2.554885e-02
#> 11 l2_age_group:race    l2 age_group:race     young:asian 2.470761e-02
#> 12 l2_age_group:race    l2 age_group:race    middle:black 5.329382e-02
#> 13 l2_age_group:race    l2 age_group:race       old:black 3.012934e-02
#> 14 l2_age_group:race    l2 age_group:race     young:black 4.657687e-02
#> 15 l2_age_group:race    l2 age_group:race middle:hispanic 6.878663e-02
#> 16 l2_age_group:race    l2 age_group:race    old:hispanic 4.276139e-02
#> 17 l2_age_group:race    l2 age_group:race  young:hispanic 6.845202e-02
#> 18 l2_age_group:race    l2 age_group:race    middle:white 2.381768e-01
#> 19 l2_age_group:race    l2 age_group:race       old:white 1.515588e-01
#> 20 l2_age_group:race    l2 age_group:race     young:white 2.102641e-01
#>          target      residual
#> 1  4.000000e-01  1.019591e-06
#> 2  2.500000e-01 -1.659420e-06
#> 3  3.500000e-01  6.398289e-07
#> 4  9.000000e-02  1.962003e-07
#> 5  1.300000e-01  3.521587e-08
#> 6  1.800000e-01  4.222431e-08
#> 7  6.000000e-01 -2.736405e-07
#> 8  7.109200e+04 -9.766847e-03
#> 9  3.596404e-02  3.779694e-03
#> 10 1.898102e-02  6.567833e-03
#> 11 3.596404e-02 -1.125642e-02
#> 12 5.194805e-02  1.345770e-03
#> 13 2.997003e-02  1.593137e-04
#> 14 4.795205e-02 -1.375178e-03
#> 15 7.092907e-02 -2.142446e-03
#> 16 4.095904e-02  1.802353e-03
#> 17 6.793207e-02  5.199556e-04
#> 18 2.407592e-01 -2.582398e-03
#> 19 1.608392e-01 -9.280409e-03
#> 20 1.978022e-01  1.246193e-02

Exact constraints should have near-zero residuals. L2 and mean constraints will have small residuals reflecting the optimization trade-off.

You can visualize this with a simple residual plot:

# Color by constraint type
cols <- ifelse(bal$type == "exact", "#2166AC", "#B2182B")
pch <- ifelse(bal$type == "exact", 16, 17)

plot(
  bal$target, bal$residual,
  col = cols, pch = pch,
  xlab = "Target value",
  ylab = "Residual (achieved - target)",
  main = "Balance: exact constraints vs soft constraints"
)
abline(h = 0, lty = 2, col = "gray50")
legend(
  "topright",
  legend = c("Exact (must match)", "L2 / mean (soft)"),
  col = c("#2166AC", "#B2182B"),
  pch = c(16, 17),
  cex = 0.8
)

Exact constraints (blue circles) cluster tightly at zero. Soft constraints (red triangles) have small residuals — the optimizer matched them as closely as it could while satisfying the exact constraints and keeping weights reasonable.

The diagnostics field has additional information about weight quality:

result_mixed$diagnostics[c("weight_range", "kish_deff", "kish_ess", "converged")]
#> $weight_range
#> [1] 0.4425191 2.9445955
#> 
#> $kish_deff
#> [1] 1.254387
#> 
#> $kish_ess
#> [1] 637.7619
#> 
#> $converged
#> [1] TRUE

When is regrake worth the complexity?

If your weighting problem only needs exact matching on a few categorical marginals to be representative, packages like autumn, anesrake, or survey::rake will likely be all you need. Breaking out regrake becomes increasingly sensible as you have more work to do with your weighting, especially if you’re sensitive to the variance price you’re paying.

On the other end of weighting tool complexity, there are a broad family of approaches like (Bayesian) multilevel regression and postratification or more recent improvements, which bring even more modeling flexibility to bear on the weighting problem. These are powerful tools, but my sense is that often a bit of flexibility and regularization goes a long way — regularized raking can get you a lot of the flexibility here without all the complexity and computation fancier methods require.

Feature reference

This vignette covered the core workflow. Here’s a quick index of what else regrake supports:

Constraint types:

  • rr_exact() — equality constraints (equivalent to traditional raking)
  • rr_l2() — least-squares soft matching
  • rr_kl() — KL divergence soft matching to a reference distribution
  • rr_mean(), rr_var(), rr_quantile() — continuous variable targets
  • rr_range(age, 40, 50) — constrain weighted mean to an interval

Weight control:

  • bounds = c(0.3, 3) — limit weights to 0.3x–3x the average
  • bounds_method = "hard" — strict enforcement (default is "soft")
  • regularizer"entropy" (default), "sum_squares", or "kl"

Population data:

  • "raw" — unit-level population data (one row per person); regrake computes proportions and means internally
  • "weighted" — unit-level data with a weight column (e.g., from ACS microdata)

Interoperability with other R packages:

  • "proportions" — the autumn format used in this vignette
  • "anesrake" — named list of target vectors (as used by anesrake::anesrake)
  • "survey" / "survey_design" — margin tables or design objects from the survey package

Solver tuning:

  • control = list(margin_tol = 0.001) — adjust convergence tolerance

See ?regrake for full documentation.