Summarize PUMS data with target updates and confidence intervals

Summarizes disaggregated PUMS data, applying target updates and scaling weights to client zones, then computes totals and confidence intervals by group. Use for reporting and validation of control totals against survey results.

Usage

summarize_pums(
  pums_wide,
  group_col = NULL,
  puma_zone_group_xwalk,
  ci_level = 0.9,
  settings
)

Arguments

pums_wide

data.table. Disaggregated PUMS target data. Required columns:

puma_id : PUMA ID
hh_id : household ID
person_id : person ID
WGTP : household weight
PWGTP : person weight
h_total : household total
p_total : person total

Rows: one per person or household.
Modified by reference: no (returns a copy).

group_col

character vector. Columns to group by. Default: NULL.

puma_zone_group_xwalk

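data.table. Crosswalk between PUMAs and zone groups via client zones. Must include the columns named in group_col plus zone_group, zone_group_label, puma_id, prop_hh, and prop_per. A geometry column, if present, is dropped.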

ci_level

numeric. Confidence interval level (fraction, not percent). Default: 0.9.

settings

list. Project settings; must include target update definitions and weighting flags.

Value

data.table. Summarized target data by group, with columns:

Grouping columns (as specified)
Target variable columns
total, lower, upper for each target (estimate and confidence interval bounds at ci_level)

Row order: by group and target.
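
The relationship between ci_level and the lower and upper columns can be illustrated with a normal-approximation interval. This is only a sketch: the page does not say how summarize_data() builds its intervals, so the normal approximation is an assumption, and ci_bounds() is a hypothetical helper that is not part of the package.

```r
# Hypothetical helper (not part of the package): normal-approximation interval.
# Assumes the standard error of the total is already known.
ci_bounds <- function(total, se, ci_level = 0.9) {
  stopifnot(ci_level > 0, ci_level < 1)
  # Two-sided critical value, e.g. ci_level = 0.9 uses qnorm(0.95)
  z <- stats::qnorm(1 - (1 - ci_level) / 2)
  list(total = total, lower = total - z * se, upper = total + z * se)
}

ci <- ci_bounds(total = 1000, se = 50, ci_level = 0.9)
```

Because ci_level is a fraction, passing 90 instead of 0.9 fails the range check in this sketch rather than silently producing a meaningless interval.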

Details

Checks for required columns in PUMS data: puma_id, hh_id, person_id, WGTP, PWGTP, h_total, p_total.
Checks for required columns in crosswalk: the columns named in group_col, plus zone_group, zone_group_label, puma_id, prop_hh, prop_per.
Drops geometry from crosswalk to reduce memory usage.
Merges PUMS data with crosswalk, allowing cartesian join for multiple PUMA zones per client zone.
Applies update_targets() to harmonize PUMS targets with client zone definitions.
Scales weights to client zones:
- If force_balance_hh_weights is TRUE, derives household weights from person weights and scales them by a single proportion column (selected by study_unit).
- Otherwise, adjusts both person and household weights independently.
Ensures that weighted totals are unchanged after scaling (checksums).
Sets weights and totals to zero for outside-region rows (client_zone_id == -1).
Aggregates by group, client zone, and zone group label for households and persons.
Asserts that weights match after update if forced balancing is enabled.
Calls summarize_data() to aggregate by group and calculate confidence intervals for households and persons.
Checks that summarized totals match naive weighted sums (checksum validation).
Returns a data.table of summarized targets by group, with confidence intervals.
Error handling: stops if required columns are missing or weights do not match after update.
FIXME: Replicate weight calculation for standard errors is not implemented (see code comment and reference link).
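
For context on the FIXME, ACS PUMS files ship 80 replicate weights alongside each full-sample weight (WGTP1 to WGTP80 for households, PWGTP1 to PWGTP80 for persons), and the standard error of a weighted total is conventionally the square root of 4/80 times the sum of squared differences between each replicate total and the full-sample total. The sketch below shows that calculation only. It is not what summarize_pums() currently does, and replicate_se() is a hypothetical helper.

```r
# Hypothetical helper (not part of the package): SE of a weighted total from
# replicate weights. `x` is the variable to total, `w` the full-sample weight
# (WGTP or PWGTP), and `rep_w` a matrix with one column per replicate weight.
replicate_se <- function(x, w, rep_w) {
  rep_w <- as.matrix(rep_w)
  stopifnot(length(x) == length(w), nrow(rep_w) == length(x))
  full_total <- sum(w * x)
  rep_totals <- colSums(rep_w * x)  # one weighted total per replicate
  # Successive difference replication: 4 / R times the squared deviations,
  # where R is the number of replicates (80 in ACS PUMS)
  sqrt(4 / ncol(rep_w) * sum((rep_totals - full_total)^2))
}

# Toy data: 3 households with 80 made-up replicate weights
set.seed(1)
x <- c(1, 1, 1)
w <- c(10, 20, 30)
rep_w <- sapply(1:80, function(r) w * runif(3, 0.5, 1.5))
se <- replicate_se(x, w, rep_w)
```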

Settings

force_balance_hh_weights (direct): controls weight scaling logic.
study_unit (direct): selects proportion column for scaling.
Uses target update definitions from settings for harmonization.

Examples

## Not run:
summarize_pums(
  pums_wide,
  group_col = "zone_group",
  puma_zone_group_xwalk = puma_zone_group_xwalk,
  ci_level = 0.9,
  settings = settings
)
## End(Not run)
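
The merge, weight-scaling, and checksum steps described in Details can be sketched on toy data. This is a minimal illustration with made-up values, not the function's actual code. The column names puma_id, WGTP, prop_hh, and client_zone_id come from this page, while the object names (pums, xwalk, merged, zone_totals) and WGTP_scaled are hypothetical.

```r
library(data.table)

# Toy PUMS records: two households in PUMA 101, one in PUMA 102
pums <- data.table(
  puma_id = c(101, 101, 102),
  hh_id   = c(1, 2, 3),
  WGTP    = c(10, 20, 30)
)

# Toy crosswalk: PUMA 101 is split 60/40 across two client zones
xwalk <- data.table(
  puma_id        = c(101, 101, 102),
  client_zone_id = c(1, 2, 2),
  prop_hh        = c(0.6, 0.4, 1.0)
)

# Each PUMS row is repeated once per matching crosswalk row
merged <- merge(pums, xwalk, by = "puma_id", allow.cartesian = TRUE)

# Scale household weights by the share of the PUMA in each client zone
merged[, WGTP_scaled := WGTP * prop_hh]

# Checksum: proportions sum to 1 within each PUMA, so the total is preserved
stopifnot(isTRUE(all.equal(sum(merged$WGTP_scaled), sum(pums$WGTP))))

# Weighted household totals by client zone
zone_totals <- merged[, .(hh_total = sum(WGTP_scaled)), by = client_zone_id]
```

The checksum only holds when the proportion column sums to 1 within every PUMA, which is why a crosswalk with missing or partial coverage would trip this kind of check.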