6 Target Data Preparation

Two key Census data sources are used for weighting:

5-year ACS estimates: Provide household and population counts at a highly disaggregated, block group level, offering statistical reliability due to cumulative sampling over multiple years. These are essential for sample planning and constructing base weights at the same geographic scale as sampling.
1-year ACS Public Use Microdata Sample (PUMS): Provide detailed crosstabulations by household and person characteristics at the PUMA (Public Use Microdata Area) level, corresponding closely to the survey year and less affected by pandemic-era distortions. These are used to set demographic targets for weighting.

Key Concept: Reference Counts vs. Target Estimates

Reference counts: Number of households/persons recorded in the sample plan using 5-year ACS data, at the block group level. Used for initial base weight calculations.
Target estimates: Population and household totals from the 1-year ACS PUMS, at PUMA level, used to set demographic weighting targets.

This step translates 1-year ACS PUMS data into PopulationSim-ready control totals. The script imports 1-year PUMS microdata, removes group-quarters residents, reconciles household and person weights, and allocates both households and persons to the study’s weighting zones. These processed targets define the benchmark totals that PopulationSim matches during the weighting rounds.

Before weighting, the sample plan reference counts must be calibrated to represent the same population used in PopulationSim. The 5-year ACS estimates enable spatial disaggregation at the block group level, while 1-year PUMS data allow for detailed demographic targets at the PUMA level. The final step in this chapter is to align the sample plan’s reference counts (from 5-year ACS) to the target estimates (from 1-year PUMS) for households and persons.

Key Concept: Targets vs. Weights

Targets are fixed control totals from external data (ACS/PUMS). Weights are adjustments applied to survey data so that estimates align with the targets (i.e., population).
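
The distinction can be illustrated with a minimal sketch in base R. The category names and counts below are made-up assumptions, not PSRC values, and only a single target variable is shown; PopulationSim balances many targets at once, so this is an illustration of the idea rather than the actual weighting method.

```r
# Hypothetical control totals (targets) for one target variable: household size
targets = c(size_1 = 500, size_2 = 300, size_3plus = 200)

# Hypothetical unweighted survey respondents in each category
respondents = c(size_1 = 40, size_2 = 35, size_3plus = 25)

# With a single target variable, each respondent's weight is target / respondents
weights = targets / respondents

# Weighted survey totals now reproduce the targets
weighted_totals = weights * respondents
stopifnot(isTRUE(all.equal(weighted_totals, targets)))
print(round(weights, 2))
```

The targets never change in this sketch; only the weights are solved for, which is the relationship described above.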

6.1 Chapter Setup

This script begins by loading the R packages and configuration settings needed to construct the household and person-level weighting targets. Of key importance in this chapter is the pums_year, which specifies the vintage of the ACS PUMS data to be used. This should match the year of the ACS 5-year estimates used for the crosswalk denominators.

6.1.1 Load Packages and Settings

# Load hts.weighting Packages
devtools::load_all()

# Load Settings (Reload From Disk Without Printing)
settings = get_settings(reload_settings = TRUE, 
                        print = FALSE)

pums_year = get("pums_year", settings)
check_path = file.path(get("report_dir", settings), "020_check_counts.csv")

cli::cli_inform(check_path)

6.1.2 Load 1-year PUMS Dataset

The next step imports the 1-year PUMS data, which provides the individual- and household-level detail needed to specify target variables such as income, household size, and employment status. After loading, the script retains only the columns needed to construct the target variables (for example, household income, age, and sex), which streamlines later processing.

# PUMS Data Dictionary for Reference
pums_vars = read_pums_codebook(settings)
pums_0 = fetch_pums(settings)
pums_sf = get_puma_geom(settings)

# Record Initial Weights for QA
check_dt = record_checksum(
    fname = check_path,
    append = FALSE,
    pums_checksum("initial", pums_0, "person")
)

6.2 Target Data Preparation: Clean 1-Year ACS PUMS Data

The next step adjusts the ACS PUMS 1-year data used to construct detailed weighting targets. To do so, the following steps are performed:

Remove group quarters residents from the data.
Person-level studies only: Adjust households to persons.
Align the PUMS data to itself by adjusting the household-level weighted estimates to match the total weighted sum of persons in the household using person-level weighted estimates.
Separate the PUMS data into household- and person-level datasets for creating and tabulating weighting target variables.

6.2.1 Remove Group Quarters from PUMS Data

Once the data are loaded, the script filters the PUMS records to match the survey sampling frame. All targets must reflect the same population frame as the survey sample frame. Residents of group quarters, such as dormitories, prisons, and nursing homes, are removed from the data as these are not part of the core survey address-based sampling frame. This cleaning step removes those group-quarters cases and retains only household-based records.

type_var = str_subset(names(pums_0), "TYPE(_label|HUGQ_label)$")

# Check for Skew Between HH/PER Weights
check_weight_skew(pums_0, "initial unadjusted PUMS")

# Separate Group Quarters and Housing Units
pums_gq = pums_0[get(type_var) != "Housing unit"]
pums_hu = pums_0[get(type_var) == "Housing unit"]

# Confirm Split is Correct
stopifnot(sum(pums_gq$PWGTP) + sum(pums_hu$PWGTP) == sum(pums_0$PWGTP))
stopifnot(pums_hu[get(type_var) != "Housing unit", .N] == 0)

# Report Skew Between HH/PER Weights After Split
check_weight_skew(pums_hu, "after dropping GQ")

check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    pums_checksum("remove GQ", pums_hu, "person")
)

6.2.2 For Person-Level Studies – Adjust Households to Persons

Person-level studies require a transformation of the PUMS household data to ensure each adult is treated as an individual household unit (just as they were in a person-level HTS). This adjustment involves several steps:

Count the number of adults and children in each household to retain household composition information.
Re-label the number of vehicles to maintain consistency.
Assign each adult as their own household by modifying the SERIALNO and SPORDER identifiers, effectively creating a new household for each adult while dropping children from the dataset. This ensures that person-level targets are accurately represented, with each adult treated as a separate entity for weighting purposes.
Verify that the total weights remain consistent after this restructuring, ensuring that the overall population estimates are preserved.

if (get("study_unit", settings) == "person") {

    # Count kids and adults per household
    num_kids = pums_hu[AGEP < 18, .N, keyby = SERIALNO]
    pums_hu[num_kids, num_kids := i.N, on = "SERIALNO"]
    pums_hu[is.na(num_kids), num_kids := 0]

    # Find the number of adults in each household for precomputation
    num_adults = pums_hu[AGEP >= 18, .N, keyby = SERIALNO]
    pums_hu[num_adults, num_adults := i.N, on = "SERIALNO"]
    pums_hu[is.na(num_adults), num_adults := 0] # Should never happen but here as a safeguard

    # Set hh target to be carried over at person level
    pums_hu[, h_kids := num_kids]
    pums_hu[, h_adults := num_adults]

    # Re-label the number of vehicles in each household for precomputation
    pums_hu[, num_vehicles := VEH]

    # Assign each adult as their own household
    pums_hu[, `:=`(SPORDER_orig = SPORDER, SERIALNO_orig = SERIALNO)]
    pums_hu[, SERIALNO := paste0(SERIALNO, stringr::str_pad(SPORDER, 2, pad = "0"))]
    pums_hu[, NP_adj := 1]
    
    # Drop children
    pums_hu = pums_hu[AGEP >= 18]

    # Record checksum
    check_dt = record_checksum(
        fname = check_path,
        append = TRUE,
        pums_checksum("drop children", pums_hu, "person")
    )

    # Update SPORDER to start at 1. This ensures we can still use the SPORDER to filter on HHs
    pums_hu = pums_hu[order(SERIALNO, SPORDER_orig)]
    pums_hu[, SPORDER := rowid(SERIALNO)]

    # Find households with no SPORDER == 1
    ok_hh = pums_hu[SPORDER == 1, SERIALNO]
    stopifnot(all(pums_hu$SERIALNO %in% ok_hh))

    # Check that person and household weight totals agree after the transformation
    # (expected to be similar, not identical, so a 5% relative tolerance is allowed)
    stopifnot(
        all.equal(
            sum(pums_hu$PWGTP),
            sum(pums_hu$NP_adj * pums_hu$WGTP),
            tolerance = 0.05
        )
    )

    # Record checksum
    check_dt = record_checksum(
        fname = check_path,
        append = TRUE,
        pums_checksum("create person hh", pums_hu, "person")
    )
    
}
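
The restructuring above can be traced on a toy example. The sketch below uses base R and made-up records (the IDs and ages are illustrative assumptions) to show how household composition is counted first, how each person receives a unique household ID, and how children are then dropped.

```r
# Toy PUMS-like records: household A1 has two adults and one child
toy = data.frame(
    SERIALNO = c("A1", "A1", "A1", "B2"),
    SPORDER  = c(1L, 2L, 3L, 1L),
    AGEP     = c(41, 39, 7, 66),
    stringsAsFactors = FALSE
)

# Household composition, computed before any rows are dropped
toy$num_adults = ave(as.integer(toy$AGEP >= 18), toy$SERIALNO, FUN = sum)
toy$num_kids   = ave(as.integer(toy$AGEP < 18), toy$SERIALNO, FUN = sum)

# Give each person a unique household ID: SERIALNO + zero-padded SPORDER
toy$SERIALNO_orig = toy$SERIALNO
toy$SERIALNO = paste0(toy$SERIALNO, sprintf("%02d", toy$SPORDER))

# Drop children; each remaining adult is now a one-person "household"
adults = toy[toy$AGEP >= 18, ]
adults$SPORDER = 1L
print(adults[, c("SERIALNO_orig", "SERIALNO", "num_adults", "num_kids")])
```

Because the composition counts are attached before the filter, each adult still carries the number of adults and children from the original household.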

6.2.3 Reconcile Household and Person Weights

A key data management step in the target preparation process is reconciling household (WGTP) and person (PWGTP) weights. These two weighting schemes in PUMS are produced independently by the Census Bureau and therefore are not internally consistent. When force_balance_hh_weights is enabled in the settings, the script recalculates each household's weight so that the adjusted household weight multiplied by the number of household members matches the sum of its members' person weights. Person totals are therefore preserved, while household totals shift slightly. This alignment means that total household counts and total person counts are derived from the same underlying expansion basis, preventing logical conflicts when PopulationSim later calibrates both household and person targets simultaneously. The process also includes simple diagnostics to confirm that household and person totals remain in proportion after reconciliation.

# Balance Household and Person Weights if Requested
if (get("force_balance_hh_weights", settings)) {
    pums = force_balance_pums_weights(pums_hu)

    # Record checksum
    check_dt = record_checksum(
        fname = check_path,
        append = TRUE,
        pums_checksum("force balance hh weights", pums, "person")
    )

} else {
    pums = pums_hu
}

stopifnot(pums[is.na(WGTP), .N] == 0)

# Sort for Consistency (Cleaned PUMS Data Are Saved Later in the Chapter)
pums = pums[order(SERIALNO, SPORDER)]
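
The internals of force_balance_pums_weights() are not shown in this chapter. As a rough sketch of one rule that would satisfy the consistency check used earlier (person weight total versus household weight times household size), the base R example below sets each household's weight to the mean of its members' person weights. This rule, the WGTP_adj column name, and the toy numbers are assumptions for illustration, not the package's implementation.

```r
# Toy household-population records; WGTP repeats on every person row
toy = data.frame(
    SERIALNO = c("A", "A", "A", "B", "B"),
    PWGTP    = c(100, 110, 120, 50, 70),
    WGTP     = c(100, 100, 100, 55, 55)
)

# Household size, repeated on every person row
toy$NP = ave(toy$PWGTP, toy$SERIALNO, FUN = length)

# Assumed rule: adjusted household weight = mean person weight in the household
toy$WGTP_adj = ave(toy$PWGTP, toy$SERIALNO, FUN = mean)

# Keep one row per household for household-level totals
hh = toy[!duplicated(toy$SERIALNO), ]

# Person total is unchanged; household total shifts
person_total    = sum(toy$PWGTP)
hh_total_before = sum(hh$WGTP)
hh_total_after  = sum(hh$WGTP_adj)

# After balancing, sum(weight x size) over households equals the person total
stopifnot(isTRUE(all.equal(sum(hh$WGTP_adj * hh$NP), person_total)))
print(c(before = hh_total_before, after = hh_total_after, persons = person_total))
```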

6.2.4 Separate Households and Persons

In this step, RSG separates the cleaned PUMS data into distinct household and person datasets. It identifies which variables pertain to households (e.g., HINCP, NP) and which pertain to persons (e.g., AGEP, SEX). The script then splits the data accordingly, ensuring that each dataset contains only the relevant columns. This separation is crucial for subsequent target tabulation, as household-level targets (like household size and income) and person-level targets (like age and employment status) need to be processed independently. The script also retains common variables, such as PUMA and SERIALNO, in both datasets to facilitate later merging and analysis.

In other words, the step prepares a “household” and “person” dataset, mirroring the data structure of the HTS.

# Identify Household/Person Variables From Codebook/Settings
pums_hvars = get("pums_hvars", settings)
pums_hvars = c(pums_hvars, paste0(pums_hvars, "_label"))
pums_hvars = intersect(pums_hvars, names(pums))

pums_pvars = get("pums_pvars", settings)
pums_pvars = c(pums_pvars, paste0(pums_pvars, "_label"))
pums_pvars = intersect(pums_pvars, names(pums))

# Identify Common Variables Kept in Both Datasets (e.g., SERIALNO, PUMA)
pums_cvars = setdiff(names(pums), c(pums_hvars, pums_pvars))

# Split Into Households (One Row per Household) and Persons
pums_hh = pums[SPORDER == 1, c(pums_cvars, pums_hvars), with = FALSE]
pums_per = pums[, c(pums_cvars, pums_pvars), with = FALSE]

6.3 Configure Target Variables

Ahead of weighting, a set of target estimates representing population distributions across household- and person-level characteristics is created. These target estimates are constructed from ACS data and are chosen to match key variables in the survey (e.g., household size, income, age). In this step, the demographic target variables are created using the information specified in the project settings. Household and person attributes affect survey response rates, which introduces bias into unweighted survey data. For example, larger households may be less likely to respond due to the additional time needed to complete the survey questions and travel diaries for each member. To correct these types of biases, a variety of household- and person-level target categories are selected as weighting targets.

Key Concept: Target variables, categories, and estimates

Target variable: A household- or person-level characteristic used for weighting (e.g., household size, income).
Target category: A level or bin within a target variable (e.g., 1-person household, $30k-$50k income).
Target estimate (control total): The population count matching each target category, derived from ACS or another external source.

6.3.1 Create and Tabulate Target Variables

This step creates the target variables and categories as specified in the project settings. It then calculates the weighted target estimates for each target category. A small subsequent step sets household income to zero where it is negative. Negative income values in PUMS represent losses or debts, which can complicate target matching. By normalizing these values to zero, the script ensures that income-based targets reflect those values HTS respondents can report (negative income is not a survey response option).

This step relies on a key function, prepare_targets, which processes the separated household and person datasets to generate the final target tables. The function uses the variable definitions from the codebook to categorize and tabulate the data according to the specified target variables. It also incorporates any necessary adjustments based on the study settings, such as scaling factors or demographic groupings. The output is a set of tabulated targets that reflect the weighted counts of households and persons across various categories, ready for use in PopulationSim. The script also includes checks to ensure that the tabulated targets align with expected totals and distributions, providing confidence in the accuracy of the prepared data. For additional details, see the documentation for prepare_targets() and associated functions (e.g., prep_target_age, prep_target_income).

pums_hh[HINCP < 0, HINCP := 0]

pums_hh[, `:=`(
    hh_id = SERIALNO,
    puma_id = PUMA,
    h_puma = PUMA
    )]

pums_per[, `:=`(
    hh_id = SERIALNO,
    person_id = paste0(SERIALNO, SPORDER),
    puma_id = PUMA,
    p_puma = PUMA
    )]

# Add Dummy Columns for Total Counts
pums_hh[, h_total := 1]
pums_per[, p_total := 1]

# Tabulate Targets for Weighting
pums_tabbed = prepare_targets(
    households = pums_hh,
    persons = pums_per,
    codebook = pums_vars,
    settings = settings
)

check_group_sums(pums_tabbed, settings)
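
The tabulation performed by prepare_targets() is not reproduced here. As a hedged illustration of the general idea (bin a variable into target categories, create one indicator column per category, then sum the indicators times the weight), the base R sketch below uses made-up households and hypothetical h_income_* column names. The income breaks follow the categories reported in the review tables later in this chapter; everything else is an assumption.

```r
# Toy household records with hypothetical incomes and weights
hh = data.frame(
    HINCP = c(-5000, 12000, 60000, 150000, 250000),
    WGTP  = c(10, 20, 30, 25, 15)
)

# Negative incomes are set to zero, as in the step above
hh$HINCP[hh$HINCP < 0] = 0

# Bin income into target categories (left-closed intervals)
breaks = c(0, 25000, 50000, 75000, 100000, 200000, Inf)
labels = c("0_25k", "25_50k", "50_75k", "75_100k", "100_200k", "200k_plus")
hh$income_cat = cut(hh$HINCP, breaks = breaks, labels = labels, right = FALSE)

# One 0/1 indicator column per category (hypothetical h_income_* names)
indicators = sapply(labels, function(l) as.integer(hh$income_cat == l))
colnames(indicators) = paste0("h_income_", labels)

# Weighted target estimate per category = sum(indicator * household weight)
target_estimates = colSums(indicators * hh$WGTP)
print(target_estimates)
```

Because every household falls in exactly one category, the category estimates sum to the total household weight, which is the kind of consistency that check_group_sums() is described as verifying.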

6.3.2 Aggregate Targets to PUMA Level

This step aggregates the tabulated household and person target estimates to the PUMA level. It sums the weighted counts for each target variable, ensuring that the totals reflect the population estimates for each PUMA. The aggregation is done separately for household and person targets, using the appropriate weights (WGTP for households and PWGTP for persons). After aggregation, the script merges the household and person target tables into a single dataset, keyed by PUMA ID. This consolidated target table serves as the basis for further adjustments to align with the study’s geographic zones. The script also includes checks to verify that the aggregated totals are consistent with expectations, providing an additional layer of quality assurance.

# Identify the Person-Level and Household-Level Target Variables
p_cols = grep("^p_", names(pums_tabbed), value = TRUE)
h_cols = grep("^h_", names(pums_tabbed), value = TRUE)

# Aggregate the Person Targets Separately on puma_id
per_targs = pums_tabbed[, lapply(.SD, function(x) sum(x * PWGTP)), by = puma_id, .SDcols = p_cols]

# Keep Only 1 Record per Hh for Hh Columns
hh_targs = pums_tabbed[, .SD[1], by = .(hh_id, puma_id, WGTP), .SDcols = h_cols]
hh_targs = hh_targs[, lapply(.SD, function(x) sum(x * WGTP)), by = puma_id, .SDcols = h_cols]

# Merge Household and Person Targets
target_puma = merge(per_targs, hh_targs, by = "puma_id", all = TRUE)

# Record Checksum
check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    data.table(
        dataset = "PUMS",
        step = "calculate targets",
        sum_hh_wt = target_puma[, sum(h_total)],
        sum_per_wt = target_puma[, sum(p_total)],
        sum_hhwtXnp = NA,
        n_rows = target_puma[, .N],
        n_hh = NA,
        n_per = NA,
        unit = 'PUMA'
    )
)

6.3.3 Optional: Adjust PUMS Targets to Weighting Zones

This step translates the PUMA-based control totals to the study’s weighting zones if PUMAs are not used as the weighting zones. This is done using a spatial crosswalk to the custom weighting zones (often called “client zones”). The script reallocates household and person targets in proportion to each PUMA’s overlap with the study zones. This ensures that control totals reflect the population distribution within the modeled region rather than the full PUMA extent.

Where no custom weighting zones are defined, the function performs a 1:1 pass-through, trimming any PUMAs that extend beyond the study boundary. A check is performed at the end of the step to confirm that aggregated totals remain consistent, providing confidence that the allocation preserves the original universe.

PSRC Specifics

PSRC did not use custom weighting zones, so this step effectively behaves as a pass-through with sanity checks.

puma_zone_xwalk_sf = readRDS(file.path(
  get("working_dir", settings),
  "puma_czone_xwalk.rds"
))

# Adjust Targets to Client Zones Using the Crosswalk
target_czones = adjust_target_to_study_zones(
  puma_targets = target_puma,
  puma_crosswalk = puma_zone_xwalk_sf,
  settings
)

# Record Checksum
check_dt = record_checksum(
  fname = check_path,
  append = TRUE,
  data.table(
    dataset = "PUMS",
    step = "adjust to client zones adj target",
    sum_hh_wt = target_czones[, sum(h_total)],
    sum_per_wt = target_czones[, sum(p_total)],
    sum_hhwtXnp = NA,
    n_rows = target_czones[, .N],
    n_hh = NA,
    n_per = NA,
    unit = 'client zone'
  )
)
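
The allocation logic inside adjust_target_to_study_zones() is not shown. The base R sketch below illustrates proportional allocation under stated assumptions: a crosswalk with a hypothetical prop column giving the share of each PUMA that falls in each zone, and a client_zone_id of -1 marking area outside the study region (the same convention used for the sample plan later in this chapter). All numbers are made up.

```r
# Hypothetical PUMA-level targets
target_puma = data.frame(
    puma_id = c("P1", "P2"),
    h_total = c(1000, 2000),
    p_total = c(2500, 4800)
)

# Hypothetical crosswalk: share of each PUMA falling in each zone
xwalk = data.frame(
    puma_id        = c("P1", "P1", "P2", "P2"),
    client_zone_id = c(1, 2, 2, -1),
    prop           = c(0.6, 0.4, 0.9, 0.1)
)

# Allocate each PUMA's targets to zones in proportion to the overlap
alloc = merge(xwalk, target_puma, by = "puma_id")
alloc$h_total = alloc$h_total * alloc$prop
alloc$p_total = alloc$p_total * alloc$prop

# Drop area outside the study region, then sum to zones
alloc = alloc[alloc$client_zone_id != -1, ]
target_zone = aggregate(
    cbind(h_total, p_total) ~ client_zone_id,
    data = alloc,
    FUN = sum
)
print(target_zone)
```

In this toy case the zone totals are smaller than the PUMA totals because 10% of P2 lies outside the study region. With a 1:1 pass-through and no trimming, the totals would be unchanged.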

6.3.4 Create Regional Targets

After creating and tabulating target estimates for each weighting zone, targets are then tabulated at the region level. These targets help PopulationSim match regional population totals and other targets constructed at the regional level (e.g., transit targets). They can also serve as a fallback if the model fails to converge at the finer zone level.

target_region = target_czones[, !"client_zone_id", with = FALSE][, lapply(.SD, sum)]
target_region[, region := 1]

6.3.5 Optional: Create Transit Targets

Some clients choose to incorporate a transit target to help ensure that the model accurately reflects transit usage patterns. If specified in the settings, this step adds transit boardings as a target at the regional level. Since transit boardings are not available in the PUMS data, the script pulls this information from an external source, ideally broken down by transit agency to serve as a proxy for sub-regional distribution. However, this requires boardings data, which may not always be available, especially at the granular level required to construct “typical weekday” estimates. The script adds the total average weekday ridership or boardings to the regional targets, providing an additional control for PopulationSim to match during the weighting process.

PSRC Specifics

PSRC did not use a transit target.

if ("h_transit_trips" %in% names(settings$targets)) {
    h_transit_trips = prep_transit_target(settings)
    target_region[, names(h_transit_trips) := h_transit_trips]
}

# Check that Totals Add up Properly --------------------------------------------
check_group_sums(target_czones, settings)

6.3.6 Update the Targets with Day-of-Week Estimates

RSG’s weighting process enables clients to weight their data at the level of individual days of the week or groups of days of the week. For example, a client might choose to weight their data to match separate targets for weekdays (Monday-Friday) and weekends (Saturday-Sunday). In this scenario, the weighting process would ensure that the weighted survey data aligns with the specified targets for both weekdays and weekends. This requires minor restructuring of the target data for PopulationSim. In cases where day-of-week weights are not specified, this step will create a replicate of the total household and population variables labeled with the name specified for the grouped days in the settings.

PSRC Specifics

PSRC did not specify day-of-week targets.

day_groups_dt = get_day_groups(settings)
day_groups = unique(day_groups_dt$day_group)
dayregex = paste(day_groups, collapse = "|")
pattern = stringr::str_glue("^(p_|h_)((?!({dayregex})).)*$")

targets_ls = list(
    "puma" = copy(target_puma),
    "czone" = copy(target_czones),
    "region" = copy(target_region)
)

for (target_name in names(targets_ls)) {
    # Cols to update, non DOW columns
    targ_cols = stringr::str_subset(names(targets_ls[[target_name]]), pattern)

    # Assign DOW columns from h_total and p_total
    targets_ls[[target_name]][, paste0("h_dow_", day_groups) := h_total]
    targets_ls[[target_name]][, paste0("p_dow_", day_groups) := p_total]

    # Rescale everything by n days
    targets_ls[[target_name]][,
        (targ_cols) := lapply(.SD, function(x) x * length(day_groups)),
        .SDcols = targ_cols
    ]
}
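
The effect of this restructuring can be checked on a toy table. The sketch below re-creates the same steps in base R with made-up values and two hypothetical day groups; the only change from the code above is that base R needs perl = TRUE for the negative lookahead in the pattern.

```r
# Hypothetical day groups and a one-row target table
day_groups = c("weekday", "weekend")
targets = data.frame(h_total = 100, p_total = 250, h_size_1 = 40, h_size_2 = 60)

# Columns starting with h_ or p_ that do not mention a day group
dayregex = paste(day_groups, collapse = "|")
pattern = sprintf("^(p_|h_)((?!(%s)).)*$", dayregex)
targ_cols = grep(pattern, names(targets), value = TRUE, perl = TRUE)

# Copy the unscaled totals into one column per day group
for (g in day_groups) {
    targets[[paste0("h_dow_", g)]] = targets$h_total
    targets[[paste0("p_dow_", g)]] = targets$p_total
}

# Scale the non-day-of-week targets by the number of day groups
targets[targ_cols] = targets[targ_cols] * length(day_groups)
print(targets)
```

After rescaling, h_total equals the sum of the h_dow_* columns, so dividing the saved totals by the number of day groups recovers the original household and person counts.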

6.3.7 Optional: Aggregate Target Categories for Weighting

Sometimes certain targets and target categories can have large error margins around the Census target estimates, which can impact the results of weighting. One option is to aggregate target categories for certain weighting zone groups or the entire region. This step will optionally aggregate target categories as specified in the project settings file.

PSRC Specifics

Due to large error margins around the Census target estimates for the person-level commute mode categories bike, walk, and transit, these categories were aggregated into a single category for all weighting zone groups except King County - Seattle.

# Create a List of Zone Group and Regional Targets
targets = list("zone_group" = target_czones, "region" = target_region)

# Convert the PUMA to Client Zone Crosswalk to a `data.table`
puma_zone_xwalk = as.data.table(puma_zone_xwalk_sf)

# Create a Weighting Zone Group Crosswalk to Use in Updating Targets and Convert to a `data.table`
zone_group_crosswalk = prepare_zone_groups(
    seed = puma_zone_xwalk,
    targets = target_czones,
    settings = settings,
    show_plot = FALSE,
    replace = FALSE
)
zone_group_crosswalk = as.data.table(zone_group_crosswalk)

# Append the Zone Group to the Matching Weighting Zone ID in the Target Estimate Tables
target_czones[zone_group_crosswalk, zone_group := i.zone_group, on = "client_zone_id"]
target_region[, zone_group := "region"]

# Update the Targets as Specified in the Project Settings Configurations
updated_targets_czone = update_targets(
    target_czones,
    settings,
    geo_cross_walk = as.data.table(zone_group_crosswalk),
    geom = "zone_group"
)

updated_targets_region = update_targets(
    target_region,
    settings,
    geo_cross_walk = as.data.table(zone_group_crosswalk),
    geom = "zone_group"
)
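
How update_targets() applies the settings is not shown here. As a rough base R sketch of what collapsing target categories for selected zone groups could look like, the example below combines three commute mode columns into one for a single zone group and blanks the detailed columns there. The zone group names, column names, and numbers are all hypothetical.

```r
# Hypothetical zone-group targets for commute mode (made-up numbers)
targets = data.frame(
    zone_group     = c("urban_core", "suburb"),
    p_mode_transit = c(600, 90),
    p_mode_walk    = c(400, 60),
    p_mode_bike    = c(100, 20)
)

# Zone groups where the detailed categories are considered too noisy
collapse_groups = "suburb"
detail_cols = c("p_mode_transit", "p_mode_walk", "p_mode_bike")

# The combined category is filled only where the detail is collapsed
is_collapsed = targets$zone_group %in% collapse_groups
targets$p_mode_bike_transit_walk = NA_real_
targets$p_mode_bike_transit_walk[is_collapsed] =
    rowSums(targets[is_collapsed, detail_cols])

# Detailed categories are blanked out in the collapsed zone groups
targets[is_collapsed, detail_cols] = NA
print(targets)
```

The resulting pattern of NA values mirrors the layout of the commute mode rows in the person-level review table, where each zone group carries either the detailed categories or the combined one.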

6.3.8 Review Household- and Person-Level Target Estimates

6.3.8.1 Table: Total HH and Population Targets for each weighting zone group

Weighting Zone Group	Households	Persons
King County - Seattle	368,012.2	729,129
King County - Other	541,295.8	1,373,543
Kitsap County - Expanded	201,550.5	502,271
Pierce County	313,545.5	803,114
Snohomish County	320,948.9	835,625
Total	1,745,353.0	4,243,682

6.3.8.2 Table: Household-Level Target Variables by Weighting Zone Groups: ACS 1-year Target Estimates

Target Variable	Target Category	King County - Seattle	King County - Other	Kitsap County - Expanded	Pierce County	Snohomish County	Region
Household size	1	156,491	133,620	48,670	83,386	75,695	497,862
	2	127,110	187,137	78,355	105,182	108,169	605,952
	3+	84,412	220,539	74,526	124,978	137,085	641,539
Household income	$0-$24,999	43,569	49,399	21,427	32,804	31,423	178,621
	$25,000-$49,999	36,634	54,603	22,300	41,303	37,079	191,919
	$50,000-$74,999	39,487	59,248	35,266	44,094	44,539	222,633
	$75,000-$99,999	35,324	58,226	25,197	44,753	38,900	202,400
	$100,000-$199,999	104,632	159,080	69,075	109,689	105,446	547,922
	$200,000+	108,367	160,740	28,286	40,903	63,561	401,857
Number of workers	0	64,488	102,754	52,106	68,537	63,698	351,583
	1	162,434	197,222	71,467	115,362	118,051	664,536
	2+	141,090	241,320	77,977	129,646	139,200	729,233
Vehicle sufficiency	None	76,991	37,006	8,523	16,193	14,904	153,618
	Insufficient	85,179	123,106	36,222	51,313	65,267	361,086
	Sufficient	205,842	381,184	156,806	246,040	240,777	1,230,649
Number of children	0	303,802	371,352	143,820	214,796	216,633	1,250,404
Number of children	1+	64,210	169,944	57,731	98,749	104,316	494,949
Total: Households		368,012	541,296	201,551	313,546	320,949	1,745,353

6.3.8.3 Table: Total Households in PUMAs by Weighting Zone Group: ACS 1-year Target Estimates

Target Variable	Target Category	King County - Seattle	King County - Other	Kitsap County - Expanded	Pierce County	Snohomish County	Region
Total Households in PUMA	5323301	0	54,363	0	0	0	54,363
	5323302	0	43,261	0	0	0	43,261
	5323303	0	74,179	0	0	0	74,179
	5323304	0	64,104	0	0	0	64,104
	5323305	0	57,591	0	0	0	57,591
	5323306	0	45,916	0	0	0	45,916
	5323307	0	43,711	0	0	0	43,711
	5323308	0	0	51,302	0	0	51,302
	5323309	0	51,522	0	0	0	51,522
	5323310	0	54,637	0	0	0	54,637
	5323311	0	52,012	0	0	0	52,012
	5323312	47,709	0	0	0	0	47,709
	5323313	42,353	0	0	0	0	42,353
	5323314	63,202	0	0	0	0	63,202
	5323315	64,788	0	0	0	0	64,788
	5323316	49,403	0	0	0	0	49,403
	5323317	48,003	0	0	0	0	48,003
	5323318	52,555	0	0	0	0	52,555
	5323501	0	0	55,078	0	0	55,078
	5323502	0	0	56,560	0	0	56,560
	5325301	0	0	0	54,734	0	54,734
	5325302	0	0	0	40,243	0	40,243
	5325303	0	0	0	38,343	0	38,343
	5325304	0	0	0	45,812	0	45,812
	5325305	0	0	0	50,283	0	50,283
	5325306	0	0	0	38,135	0	38,135
	5325307	0	0	0	45,996	0	45,996
	5325308	0	0	38,612	0	0	38,612
	5326101	0	0	0	0	57,883	57,883
	5326102	0	0	0	0	54,040	54,040
	5326103	0	0	0	0	51,159	51,159
	5326104	0	0	0	0	51,840	51,840
	5326105	0	0	0	0	42,767	42,767
	5326106	0	0	0	0	63,260	63,260

6.3.8.4 Table: Person-Level Target Variables by Weighting Zone Groups: ACS 1-year Target Estimates

Target Variable	Target Category	King County - Seattle	King County - Other	Kitsap County - Expanded	Pierce County	Snohomish County	Region
Gender	Male	379,274	690,208	251,969	398,496	418,976	2,138,923
Gender	Female	349,855	683,335	250,302	404,618	416,649	2,104,759
Age	0-4	31,595	75,075	28,200	45,198	49,710	229,778
	5-15	65,113	191,843	63,704	117,889	111,272	549,821
	16-17	11,927	34,836	10,204	23,407	22,911	103,285
	18-24	63,082	90,062	40,801	61,301	58,179	313,425
	25-44	311,646	431,913	139,414	246,289	258,077	1,387,339
	45-64	151,301	348,426	121,095	192,144	209,417	1,022,383
	65+	94,465	201,388	98,853	116,886	126,059	637,651
Employment	Non worker	262,999	629,775	253,772	390,930	395,599	1,933,075
	Part-time	82,074	138,237	56,702	77,056	86,823	440,892
	Full-time	384,056	605,531	191,797	335,128	353,203	1,869,715
Commute mode	Bike/transit/walk	NA	47,722	10,747	13,541	16,337	NA
	Work from home	134,902	163,590	34,840	51,388	75,630	460,350
	Transit	62,077	NA	NA	NA	NA	115,386
	Walk	37,122	NA	NA	NA	NA	67,584
	Bike	12,791	NA	NA	NA	NA	17,367
	Other (includes auto)	211,314	515,703	199,157	337,264	337,981	1,601,419
	None	270,923	646,528	257,527	400,921	405,677	1,981,576
University student status	No	674,154	1,300,589	477,629	766,055	798,835	4,017,262
University student status	Yes	54,975	72,954	24,642	37,059	36,790	226,420
Educational attainment	No college	192,478	552,620	221,299	416,401	386,728	1,769,526
Educational attainment	Some college	536,651	820,923	280,972	386,713	448,897	2,474,156
Race	Asian or Pacific Islander	126,375	340,737	43,464	67,555	128,513	706,644
	Black or African American	46,566	78,842	28,661	61,461	32,174	247,704
	Other	114,087	243,238	95,642	182,011	148,631	783,609
	White	442,101	710,726	334,504	492,087	526,307	2,505,725
Ethnicity	Not Hispanic	666,368	1,214,119	441,688	694,024	732,639	3,748,838
Ethnicity	Hispanic	62,761	159,424	60,583	109,090	102,986	494,844
Total: Persons		729,129	1,373,543	502,271	803,114	835,625	4,243,682

6.4 Save Cleaned and Tabulated PUMS Datasets

Note: The targets saved here are not the updated targets. Target updates are applied later, in the scripts that prepare the PopulationSim control tables for Round 1 weighting. The updated outputs produced in this chapter are for review only, to confirm that the updates match what is specified under target_updates in the project settings before the weighting process starts.

# Write Out File ---------------------------------------------------------
saveRDS(pums, file = file.path(get("working_dir", settings), "pums_cleaned.rds"))
saveRDS(pums_tabbed, file = file.path(get("working_dir", settings), "pums_tabbed.rds"))
saveRDS(targets_ls[['puma']], file = get("target_puma_path", settings))
saveRDS(targets_ls[['czone']], file = get("target_czone_path", settings))
saveRDS(targets_ls[['region']], file =  get("target_region_path", settings))

6.5 Allocate PUMS Estimates to Sampled Block Groups

The last step of target data preparation is the allocation of PUMS households and persons to sample segments. This makes it possible to calculate the base weight at the same geographic scale at which sampling occurred.

The allocation process involves a spatial crosswalk between PUMA and block group geographies. The fraction of each PUMA that overlaps with each block group is calculated, and this fraction is used to proportionally allocate PUMS households and persons to each block group. Totals are then aggregated to sample segments, which match the geographic scale of sampling.

Note: Small discrepancies (e.g., rounding errors) may occur during allocation, but total household and person estimates should remain consistent.

6.5.1 Load Data and Record Initial Target and Reference Sums

This step loads the tabulated target dataset (from 1-year PUMS/ACS) and the sample plan (from 5-year ACS), then records initial counts for households, persons, and observations.

# Load Client Zone and Regional Target Datasets (from ACS/PUMS).
target_czone = readRDS(get("target_czone_path", settings))
target_region = readRDS(get("target_region_path", settings))

# Number of Days Being Weighted (for Day-of-Week Splits).
n_weight_days = length(settings$weight_dow_groups)

check_path = file.path(get("report_dir", settings), "040_check_counts.csv")

# Record Initial Totals for Client Zone Targets.
check_dt = record_checksum(
    fname = check_path,
    append = FALSE,
    data.table(
        dataset = "PUMS",
        step = "initial client zone targets",
        n_hh = target_czone[, sum(h_total) / n_weight_days],
        n_per = target_czone[, sum(p_total) / n_weight_days],
        n_rows = target_czone[, .N],
        unit = 'client zone'
    )
)

# Load Value Labels and Sample Plan Used for Survey Assignment.
value_labels = fetch_hts_table("value_labels", settings)
sample_plan = fetch_hts_table("sample_plan", settings) 

# Record Reference Counts in the Sample Plan.
check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    data.table(
        dataset = "sample_plan",
        step = "initial sample plan reference counts",
        n_hh = sample_plan[, sum(ref_count_hh)],
        n_per = sample_plan[, sum(ref_count_per)],
        n_rows = sample_plan[, .N],
        unit = 'bg_geoid'
    ) 
)

6.5.2 Assign Weighting Zones to Sample Plan and Record Reference Sums

Weighting zone IDs are assigned to each block group in the sample plan. For block groups that partially overlap multiple weighting zones, the reference counts for households and persons are proportionally allocated using calculated area fractions. Zones outside the study area are excluded.

# Assign Weighting Zones (Client Zones) to Sample Plan Block Groups.
# If Client Zones Are Not Used, the PUMA ID Is the Client Zone.
if (!"client_zone_id" %in% names(sample_plan)) {   
    bg_puma_czone_xwalk = readRDS(file.path(get("working_dir", settings), 'bg_puma_czone_xwalk.rds'))
    sample_plan = merge(
        sample_plan,
        bg_puma_czone_xwalk[, .(bg_geoid = GEOID, client_zone, client_zone_id, area_prop)],
        by = "bg_geoid",
        allow.cartesian = TRUE
    )

    # Adjust reference counts by area proportion for partial overlaps
    sample_plan[,
        `:=`(
            ref_count_hh = ref_count_hh * area_prop,
            ref_count_per = ref_count_per * area_prop,
            area_prop = NULL
        )
    ]

    # Remove zones outside the study area
    sample_plan = sample_plan[client_zone_id != -1]
}

# Record Checksum After Zone Assignment
check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    data.table(
        dataset = "sample_plan_outer",
        step = "assign weighting zones to sample plan",
        n_hh = sample_plan[, sum(ref_count_hh)],
        n_per = sample_plan[, sum(ref_count_per)],
        n_rows = sample_plan[, .N],
        unit = 'bg_geoid'
    ) 
)

6.5.3 Adjust Sample Plan Reference Counts to Match Target Data

Because block group and weighting zone boundaries do not align perfectly, and because the reference counts and target estimates come from different ACS products, the area-adjusted reference counts from the previous step do not sum exactly to the target estimates. To align the two, an allocation factor is calculated for each weighting zone (the PUMA, where no custom zones are used): the block group reference counts are summed to the weighting zone level, and the zone's target estimate is divided by that sum. The factor is then applied to the reference counts of every block group in the zone, so that the adjusted household and person reference counts sum to the target estimates.

# Calibrate the Sample Plan so Totals Align with Targets.
sample_plan_adj = adjust_reference_to_target(
    ref_counts = sample_plan,
    targets = target_czone,
    settings
)

# Record Checksum for the Adjusted Sample Plan
check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    data.table(
        dataset = "sample_plan_adj",
        step = "adjust reference counts to PUMS data",
        n_hh = sample_plan_adj[, sum(ref_count_hh)],
        n_per = sample_plan_adj[, sum(ref_count_per)],
        n_rows = sample_plan_adj[, .N],
        unit = 'bg_geoid'
    ) 
)
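
The allocation factor can be sketched in a few lines of base R. This is an illustration of the calculation described above, not the code inside adjust_reference_to_target(); the block groups, zones, and counts are made up, and the targets are assumed to be expressed for a single day group (that is, not yet multiplied by the number of day groups).

```r
# Hypothetical block-group reference counts (5-year ACS) with assigned zones
sample_plan = data.frame(
    bg_geoid       = c("bg1", "bg2", "bg3"),
    client_zone_id = c(1, 1, 2),
    ref_count_hh   = c(400, 560, 1000)
)

# Hypothetical zone-level target estimates (1-year PUMS)
target_czone = data.frame(client_zone_id = c(1, 2), h_total = c(1000, 1050))

# Sum reference counts to zones, then factor = target / reference sum
ref_sum = aggregate(ref_count_hh ~ client_zone_id, data = sample_plan, FUN = sum)
names(ref_sum)[2] = "ref_sum_hh"
factors = merge(ref_sum, target_czone, by = "client_zone_id")
factors$alloc_factor = factors$h_total / factors$ref_sum_hh

# Apply each zone's factor to its block groups
adj = merge(
    sample_plan,
    factors[, c("client_zone_id", "alloc_factor")],
    by = "client_zone_id"
)
adj$ref_count_hh = adj$ref_count_hh * adj$alloc_factor

# Adjusted reference counts now sum to the zone targets
adj_sum = aggregate(ref_count_hh ~ client_zone_id, data = adj, FUN = sum)
print(adj_sum)
```

The same calculation would be repeated for person reference counts using p_total. The relative distribution of block groups within a zone is preserved; only the zone total changes.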

6.6 Save Adjusted Sample Plan for Base Weight Calculation

# Write Out File ---------------------------------------------------------
saveRDS(sample_plan_adj, file = file.path(get("working_dir", settings), "sample_plan_adj.rds"))
saveRDS(sample_plan, file = file.path(get("working_dir", settings), "sample_plan_unadj.rds"))

6.7 Table: Total PUMS-estimated Hhs and Persons at each Step of Pre-Processing

PSRC Specifics

PSRC did not use custom weighting zones, so the allocation of PUMS to specified weighting zones should not show any change in estimates.

Step	Household estimate	Person estimate
Initial PUMS file	1,737,345	4,322,435
Remove Group Quarters Residents	1,737,345	4,243,682
Align PUMS to itself	1,745,353	4,243,682
Allocate PUMS to sample plan without adjustment	1,673,496	4,163,187
Allocate PUMS to sample plan with adjustment	1,745,353	4,243,682
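
As a quick arithmetic check on the table, the base R snippet below computes the region-wide scaling implied by the last two rows and the number of persons removed with group quarters. The actual adjustment is applied per weighting zone, so the regional ratio is only a summary of its overall size.

```r
# Regional totals copied from the table above
unadjusted = c(households = 1673496, persons = 4163187)
adjusted   = c(households = 1745353, persons = 4243682)

# Implied region-wide ratio of adjusted to unadjusted reference counts
implied_factor = adjusted / unadjusted
print(round(implied_factor, 4))

# Persons removed with group quarters (initial minus household population)
gq_persons = 4322435 - 4243682
print(gq_persons)

# Change in household total from aligning PUMS household and person weights
hh_shift = 1745353 - 1737345
print(hh_shift)
```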