--- title: "Introduction to regrake" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to regrake} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # Limit BLAS threads to comply with CRAN's 2-core limit. # RhpcBLASctl makes a direct runtime API call to BLAS (more reliable than # env vars, which OpenBLAS may read only at library load time). if (requireNamespace("RhpcBLASctl", quietly = TRUE)) { RhpcBLASctl::blas_set_num_threads(2L) } ``` ## What is regrake? `regrake` implements **regularized raking**, following [Barratt et al. (2021)](https://web.stanford.edu/~boyd/papers/pdf/optimal_representative_sampling.pdf). It formulates weighting as an optimization problem: $$\text{minimize} \quad \ell(Fw, f^{\text{target}}) + \lambda \, r(w)$$ where $w$ are the weights, $F$ is a design matrix mapping weights to weighted statistics, $\ell$ is a loss measuring how well we match population targets, and $r$ is a regularizer that keeps weights from becoming extreme. The key payoff: unlike traditional raking (which requires *exact* matching on every constraint), this framework enables a greater variety of constraint types. You can, for example, exactly match population marginals while *least-squares* matching joint distributions or leverage continuous targets. When combined with sensible regularization, this expanded toolbox for creating weights allows for solutions that can be both more expressive and statistically efficient. For more on the (very fun, elegant) optimization math here, see the [original paper](https://web.stanford.edu/~boyd/papers/pdf/optimal_representative_sampling.pdf) or my [NYOSP talk slides](https://github.com/andytimm/nyospm_regrake_public), which motivates why this is particularly helpful for survey weighting these days. ## Setup ```{r} library(regrake) set.seed(604) ``` We need two things: population targets and a (biased) survey sample. ```{r} pop_targets <- data.frame( variable = c("age_group", "age_group", "age_group", "race", "race", "race", "race", "income"), level = c("young", "middle", "old", "white", "black", "hispanic", "asian", "mean"), target = c(0.35, 0.40, 0.25, 0.60, 0.13, 0.18, 0.09, 71092) ) pop_targets ``` This is the `"proportions"` format — a data frame with `variable`, `level`, and `target` columns (the format used by the `autumn` package). The package also accepts other common formats from packages like `anesrake` and `survey`; see the feature reference below. Now a biased sample that over-represents older and white respondents, with income skewed above the population mean: ```{r} n <- 800 sample_data <- data.frame( age_group = sample(c("young", "middle", "old"), n, replace = TRUE, prob = c(0.20, 0.35, 0.45)), race = sample(c("asian", "black", "hispanic", "white"), n, replace = TRUE, prob = c(0.06, 0.10, 0.14, 0.70)), income = rnorm(n, mean = 72000, sd = 15000) ) ``` ## Basic raking with exact constraints The simplest use of `regrake` is functionally equivalent to traditional raking: exact matching on categorical variables with an entropy regularizer. ```{r} result_exact <- regrake( data = sample_data, formula = ~ rr_exact(age_group + race), population_data = pop_targets, regularizer = "entropy", # default behavior, but showing for clarity pop_type = "proportions" ) result_exact ``` ## Adding complexity: mixed constraints Moving beyond an equivalent solution to regular raking, suppose we also want to: - Match the mean income (a continuous target) - Softly match the age group-by-race interaction (L2 loss on joint cells) The interaction probably isn't too much to ask of regular raking in this case, but as weighting problems grow larger, raking with many interactions may struggle to converge, and likely will have high variance costs. This motivates the notion of mixing constraints: we'll use `rr_exact()` for the age/race marginals (combining them in one call with `+`) and `rr_l2()` for the joint distribution. For the interaction, we need joint distribution targets. These reflect modest departures from independence — younger respondents are somewhat more Asian/Hispanic and less White than the marginals alone would imply. ```{r} interaction_targets <- data.frame( variable = "age_group:race", level = c("young:asian", "young:black", "young:hispanic", "young:white", "middle:asian", "middle:black", "middle:hispanic", "middle:white", "old:asian", "old:black", "old:hispanic", "old:white"), target = c(0.036, 0.048, 0.068, 0.198, 0.036, 0.052, 0.071, 0.241, 0.019, 0.030, 0.041, 0.161) ) pop_targets_full <- rbind(pop_targets, interaction_targets) ``` Now we can mix exact and soft constraints: ```{r} result_mixed <- regrake( data = sample_data, formula = ~ rr_exact(age_group + race) + rr_mean(income) + rr_l2(age_group:race), population_data = pop_targets_full, pop_type = "proportions" ) result_mixed ``` This works smoothly because the L2 loss on the interaction doesn't demand *perfect* matching on every joint cell — it just pushes the weights in the right direction. The optimizer finds the best trade-off between matching all the constraints and keeping the weights well-behaved in a variance sense. ## Checking balance With mixed constraints, **convergence doesn't mean perfect matching**. This is an important difference from traditional raking — the mental shift is thinking about **how** the weights converged, not just whether they converged. The package provides a `balance` data frame in results to examine this: ```{r} bal <- result_mixed$balance bal ``` Exact constraints should have near-zero residuals. L2 and mean constraints will have small residuals reflecting the optimization trade-off. You can visualize this with a simple residual plot: ```{r, fig.width=6, fig.height=4} # Color by constraint type cols <- ifelse(bal$type == "exact", "#2166AC", "#B2182B") pch <- ifelse(bal$type == "exact", 16, 17) plot( bal$target, bal$residual, col = cols, pch = pch, xlab = "Target value", ylab = "Residual (achieved - target)", main = "Balance: exact constraints vs soft constraints" ) abline(h = 0, lty = 2, col = "gray50") legend( "topright", legend = c("Exact (must match)", "L2 / mean (soft)"), col = c("#2166AC", "#B2182B"), pch = c(16, 17), cex = 0.8 ) ``` Exact constraints (blue circles) cluster tightly at zero. Soft constraints (red triangles) have small residuals — the optimizer matched them as closely as it could while satisfying the exact constraints and keeping weights reasonable. The `diagnostics` field has additional information about weight quality: ```{r} result_mixed$diagnostics[c("weight_range", "kish_deff", "kish_ess", "converged")] ``` ## When is regrake worth the complexity? If your weighting problem only needs exact matching on a few categorical marginals to be representative, packages like `autumn`, `anesrake`, or `survey::rake` will likely be all you need. Breaking out `regrake` becomes increasingly sensible as you have more work to do with your weighting, especially if you're sensitive to the variance price you're paying. On the other end of weighting tool complexity, there are a broad family of approaches like (Bayesian) multilevel regression and postratification or more recent improvements, which bring even more modeling flexibility to bear on the weighting problem. These are powerful tools, but my sense is that often a bit of flexibility and regularization goes a long way — regularized raking can get you a lot of the flexibility here without all the complexity and computation fancier methods require. ## Feature reference This vignette covered the core workflow. Here's a quick index of what else `regrake` supports: **Constraint types:** - `rr_exact()` — equality constraints (equivalent to traditional raking) - `rr_l2()` — least-squares soft matching - `rr_kl()` — KL divergence soft matching to a reference distribution - `rr_mean()`, `rr_var()`, `rr_quantile()` — continuous variable targets - `rr_range(age, 40, 50)` — constrain weighted mean to an interval **Weight control:** - `bounds = c(0.3, 3)` — limit weights to 0.3x–3x the average - `bounds_method = "hard"` — strict enforcement (default is `"soft"`) - `regularizer` — `"entropy"` (default), `"sum_squares"`, or `"kl"` **Population data:** - `"raw"` — unit-level population data (one row per person); regrake computes proportions and means internally - `"weighted"` — unit-level data with a weight column (e.g., from ACS microdata) **Interoperability with other R packages:** - `"proportions"` — the `autumn` format used in this vignette - `"anesrake"` — named list of target vectors (as used by `anesrake::anesrake`) - `"survey"` / `"survey_design"` — margin tables or design objects from the `survey` package **Solver tuning:** - `control = list(margin_tol = 0.001)` — adjust convergence tolerance See `?regrake` for full documentation.