---
title: "Introduction to regrake"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to regrake}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# Limit BLAS threads to comply with CRAN's 2-core limit.
# RhpcBLASctl makes a direct runtime API call to BLAS (more reliable than
# env vars, which OpenBLAS may read only at library load time).
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_set_num_threads(2L)
}
```

## What is regrake?

`regrake` implements **regularized raking**, following
[Barratt et al. (2021)](https://web.stanford.edu/~boyd/papers/pdf/optimal_representative_sampling.pdf).
It formulates weighting as an optimization problem:

$$\text{minimize} \quad \ell(Fw, f^{\text{target}}) + \lambda \, r(w)$$

where $w$ are the weights, $F$ is a design matrix mapping weights to weighted
statistics, $\ell$ is a loss measuring how well we match population targets, and
$r$ is a regularizer that keeps weights from becoming extreme.

The key payoff: unlike traditional raking (which requires *exact* matching on
every constraint), this framework enables a greater variety of constraint types. You can, for example,
exactly match population marginals while *least-squares* matching joint distributions or leverage continuous targets. When combined with sensible regularization, this expanded toolbox for creating weights allows for solutions that can be both more expressive and statistically efficient.  

For more on the (very fun, elegant) optimization math here, see the
[original paper](https://web.stanford.edu/~boyd/papers/pdf/optimal_representative_sampling.pdf)
or my
[NYOSP talk slides](https://github.com/andytimm/nyospm_regrake_public), which motivates why this is particularly helpful for survey weighting these days.

## Setup

```{r}
library(regrake)
set.seed(604)
```

We need two things: population targets and a (biased) survey sample.

```{r}
pop_targets <- data.frame(
  variable = c("age_group", "age_group", "age_group",
                "race", "race", "race", "race",
                "income"),
  level = c("young", "middle", "old",
            "white", "black", "hispanic", "asian",
            "mean"),
  target = c(0.35, 0.40, 0.25,
             0.60, 0.13, 0.18, 0.09,
             71092)
)
pop_targets
```

This is the `"proportions"` format — a data frame with `variable`, `level`,
and `target` columns (the format used by the `autumn` package). The package also
accepts other common formats from packages like `anesrake` and `survey`; see the feature
reference below.

Now a biased sample that over-represents older and white respondents, with
income skewed above the population mean:

```{r}
n <- 800
sample_data <- data.frame(
  age_group = sample(c("young", "middle", "old"), n, replace = TRUE,
                     prob = c(0.20, 0.35, 0.45)),
  race = sample(c("asian", "black", "hispanic", "white"), n, replace = TRUE,
                prob = c(0.06, 0.10, 0.14, 0.70)),
  income = rnorm(n, mean = 72000, sd = 15000)
)
```

## Basic raking with exact constraints

The simplest use of `regrake` is functionally equivalent to traditional raking: exact matching on categorical variables with an entropy regularizer.

```{r}
result_exact <- regrake(
  data = sample_data,
  formula = ~ rr_exact(age_group + race),
  population_data = pop_targets,
  regularizer = "entropy", # default behavior, but showing for clarity
  pop_type = "proportions"
)
result_exact
```

## Adding complexity: mixed constraints

Moving beyond an equivalent solution to regular raking, suppose we also want to:

- Match the mean income (a continuous target)
- Softly match the age group-by-race interaction (L2 loss on joint cells)

The interaction probably isn't too much to ask of regular raking in this case, but as weighting problems grow larger, raking with many interactions may struggle to converge, and likely will have high variance costs.
This motivates the notion of mixing constraints: we'll use `rr_exact()` for the age/race marginals (combining them in one call with `+`) and `rr_l2()` for the joint distribution.

For the interaction, we need joint distribution targets. These reflect modest
departures from independence — younger respondents are somewhat more
Asian/Hispanic and less White than the marginals alone would imply.

```{r}
interaction_targets <- data.frame(
  variable = "age_group:race",
  level = c("young:asian", "young:black", "young:hispanic", "young:white",
            "middle:asian", "middle:black", "middle:hispanic", "middle:white",
            "old:asian", "old:black", "old:hispanic", "old:white"),
  target = c(0.036, 0.048, 0.068, 0.198,
             0.036, 0.052, 0.071, 0.241,
             0.019, 0.030, 0.041, 0.161)
)

pop_targets_full <- rbind(pop_targets, interaction_targets)
```

Now we can mix exact and soft constraints:

```{r}
result_mixed <- regrake(
  data = sample_data,
  formula = ~ rr_exact(age_group + race) +
              rr_mean(income) + rr_l2(age_group:race),
  population_data = pop_targets_full,
  pop_type = "proportions"
)
result_mixed
```

This works smoothly because the L2 loss on the interaction doesn't demand
*perfect* matching on every joint cell — it just pushes the weights in the
right direction. The optimizer finds the best trade-off between matching all
the constraints and keeping the weights well-behaved in a variance sense.

## Checking balance

With mixed constraints, **convergence doesn't mean perfect matching**. This is
an important difference from traditional raking — the mental shift is thinking
about **how** the weights converged, not just whether they converged.

The package provides a `balance` data frame in results to examine this:

```{r}
bal <- result_mixed$balance
bal
```

Exact constraints should have near-zero residuals. L2 and mean constraints will
have small residuals reflecting the optimization trade-off.

You can visualize this with a simple residual plot:

```{r, fig.width=6, fig.height=4}
# Color by constraint type
cols <- ifelse(bal$type == "exact", "#2166AC", "#B2182B")
pch <- ifelse(bal$type == "exact", 16, 17)

plot(
  bal$target, bal$residual,
  col = cols, pch = pch,
  xlab = "Target value",
  ylab = "Residual (achieved - target)",
  main = "Balance: exact constraints vs soft constraints"
)
abline(h = 0, lty = 2, col = "gray50")
legend(
  "topright",
  legend = c("Exact (must match)", "L2 / mean (soft)"),
  col = c("#2166AC", "#B2182B"),
  pch = c(16, 17),
  cex = 0.8
)
```

Exact constraints (blue circles) cluster tightly at zero. Soft constraints
(red triangles) have small residuals — the optimizer matched them as closely as
it could while satisfying the exact constraints and keeping weights reasonable.

The `diagnostics` field has additional information about weight quality:

```{r}
result_mixed$diagnostics[c("weight_range", "kish_deff", "kish_ess", "converged")]
```

## When is regrake worth the complexity?

If your weighting problem only needs exact matching on a few categorical marginals to be representative, packages like `autumn`, `anesrake`, or `survey::rake` will likely be all you need. Breaking out `regrake` becomes increasingly sensible as you have more work to do with your weighting, especially if you're sensitive to the variance price you're paying.

On the other end of weighting tool complexity, there are a broad family of approaches like (Bayesian) multilevel regression and postratification or more recent improvements, which bring even more modeling flexibility to bear on the weighting problem. These are powerful tools, but my sense is that often a bit of flexibility and regularization goes a long way — regularized raking can get you a lot of the flexibility here without all the complexity and computation fancier methods require.  

## Feature reference

This vignette covered the core workflow. Here's a quick index of what else
`regrake` supports:

**Constraint types:**

- `rr_exact()` — equality constraints (equivalent to traditional raking)
- `rr_l2()` — least-squares soft matching
- `rr_kl()` — KL divergence soft matching to a reference distribution
- `rr_mean()`, `rr_var()`, `rr_quantile()` — continuous variable targets
- `rr_range(age, 40, 50)` — constrain weighted mean to an interval

**Weight control:**

- `bounds = c(0.3, 3)` — limit weights to 0.3x–3x the average
- `bounds_method = "hard"` — strict enforcement (default is `"soft"`)
- `regularizer` — `"entropy"` (default), `"sum_squares"`, or `"kl"`

**Population data:**

- `"raw"` — unit-level population data (one row per person); regrake computes
  proportions and means internally
- `"weighted"` — unit-level data with a weight column (e.g., from ACS microdata)

**Interoperability with other R packages:**

- `"proportions"` — the `autumn` format used in this vignette
- `"anesrake"` — named list of target vectors (as used by `anesrake::anesrake`)
- `"survey"` / `"survey_design"` — margin tables or design objects from the
  `survey` package

**Solver tuning:**

- `control = list(margin_tol = 0.001)` — adjust convergence tolerance

See `?regrake` for full documentation.