6  Target Data Preparation

Two key Census data sources are used for weighting:

Note (Key Concept): Reference Counts vs. Target Estimates
  • Reference counts: Number of households/persons recorded in the sample plan using 5-year ACS data, at the block group level. Used for initial base weight calculations.
  • Target estimates: Population and household totals from the 1-year ACS PUMS, at PUMA level, used to set demographic weighting targets.

This step translates 1-year ACS PUMS data into PopulationSim-ready control totals. The script imports 1-year PUMS microdata, removes group-quarters residents, reconciles household and person weights, and allocates both households and persons to the study’s weighting zones. These processed targets define the benchmark totals that PopulationSim matches during the weighting rounds.

Before weighting, the sample plan reference counts must be calibrated to represent the same population used in PopulationSim. The 5-year ACS estimates enable spatial disaggregation at the block group level, while 1-year PUMS data allow for detailed demographic targets at the PUMA level. The final step in this chapter is to align the sample plan’s reference counts (from 5-year ACS) to the target estimates (from 1-year PUMS) for households and persons.

Note (Key Concept): Targets vs. Weights

Targets are fixed control totals from external data (ACS/PUMS). Weights are adjustments applied to survey data so that estimates align with the targets (i.e., population).
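The distinction can be illustrated with a minimal base R sketch. It assumes a single target variable and made-up numbers; the data frame, column names, and control totals are hypothetical. PopulationSim balances many targets at once, so this is only the one-variable special case.

```r
# Toy survey: three responding households with initial (base) weights.
survey = data.frame(
    hh_size = c("1", "1", "2+"),
    base_weight = c(100, 150, 200)
)

# Hypothetical control totals (targets) for each household-size category.
targets = c("1" = 300, "2+" = 180)

# Weighted survey totals per category under the base weights.
survey_totals = sapply(split(survey$base_weight, survey$hh_size), sum)

# Adjustment factor per category: target / current weighted total.
factors = targets[names(survey_totals)] / survey_totals

# Adjusted weights reproduce the targets exactly in this one-variable case.
survey$weight = survey$base_weight * unname(factors[as.character(survey$hh_size)])
weighted_totals = sapply(split(survey$weight, survey$hh_size), sum)
```

The targets never change; only the weights move until the weighted survey totals match them.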

6.1 Chapter Setup

This script begins by loading the R packages and configuration settings needed to construct the household- and person-level weighting targets. The key setting in this chapter is pums_year, which specifies the vintage of the ACS PUMS data to be used. It should match the year of the ACS 5-year estimates used for the crosswalk denominators.

6.1.1 Load Packages and Settings

# Load hts.weighting Packages
devtools::load_all()

# Load Settings; Pass python_env Explicitly for Quarto
settings = get_settings(reload_settings = TRUE, 
                        print = FALSE)

pums_year = get("pums_year", settings)
check_path = file.path(get("report_dir", settings), "020_check_counts.csv")

cli::cli_inform(check_path)

6.1.2 Load 1-year PUMS Dataset

The next step imports the 1-year PUMS data, which provide the person- and household-level detail needed to specify target variables such as income, household size, and employment status. After loading, the script retains only the columns needed to create target variables (such as household income, age, and sex), which streamlines later processing.

# PUMS Data Dictionary for Reference
pums_vars = read_pums_codebook(settings)

# 1-Year PUMS Microdata and PUMA Geometries
pums_0 = fetch_pums(settings)
pums_sf = get_puma_geom(settings)

# Record Initial Weights for QA
check_dt = record_checksum(
    fname = check_path,
    append = FALSE,
    pums_checksum("initial", pums_0, "person")
)

6.2 Target Data Preparation: Clean 1-Year ACS PUMS Data

The next step adjusts the ACS PUMS 1-year data used to construct detailed weighting targets. To do so, the following steps are performed:

  1. Remove group quarters residents from the data.
  2. For person-level studies only: restructure households so that each adult is treated as their own household.
  3. Align the PUMS data to itself by adjusting household weights so that household-level weighted estimates match the weighted sum of persons in each household, computed from the person weights.
  4. Separate the PUMS data into household- and person-level datasets for creating and tabulating weighting target variables.

6.2.1 Remove Group Quarters from PUMS Data

Once the data are loaded, the script filters the PUMS records so that the targets reflect the same population frame as the survey sample. Residents of group quarters, such as dormitories, prisons, and nursing homes, are not part of the survey's address-based sampling frame, so these records are removed and only household-based records are retained.

# Find the Housing Type Label Column (TYPE_label or TYPEHUGQ_label, Depending on Vintage)
type_var = stringr::str_subset(names(pums_0), "TYPE(_label|HUGQ_label)$")

# Check for Skew Between HH/PER Weights
check_weight_skew(pums_0, "initial unadjusted PUMS")

# Separate Group Quarters and Housing Units
pums_gq = pums_0[get(type_var) != "Housing unit"]
pums_hu = pums_0[get(type_var) == "Housing unit"]

# Confirm Split is Correct
stopifnot(sum(pums_gq$PWGTP) + sum(pums_hu$PWGTP) == sum(pums_0$PWGTP))
stopifnot(pums_hu[get(type_var) != "Housing unit", .N] == 0)

# Report Skew Between HH/PER Weights After Split
check_weight_skew(pums_hu, "after dropping GQ")

check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    pums_checksum("remove GQ", pums_hu, "person")
)

6.2.2 For Person-Level Studies – Adjust Households to Persons

Person-level studies require a transformation of the PUMS household data so that each adult is treated as an individual household unit (just as they are in a person-level HTS). This adjustment involves several steps:

  1. Count the number of adults and children in each household to retain household composition information.
  2. Re-label the number of vehicles to maintain consistency.
  3. Assign each adult as their own household by modifying the SERIALNO and SPORDER identifiers, effectively creating a new household for each adult while dropping children from the dataset. This ensures that person-level targets are accurately represented, with each adult treated as a separate entity for weighting purposes.
  4. Verify that the total weights remain consistent after this restructuring, ensuring that the overall population estimates are preserved.

if (get("study_unit", settings) == "person") {

    # Count kids and adults per household
    num_kids = pums_hu[AGEP < 18, .N, keyby = SERIALNO]
    pums_hu[num_kids, num_kids := i.N, on = "SERIALNO"]
    pums_hu[is.na(num_kids), num_kids := 0]

    # Find the number of adults in each household for precomputation
    num_adults = pums_hu[AGEP >= 18, .N, keyby = SERIALNO]
    pums_hu[num_adults, num_adults := i.N, on = "SERIALNO"]
    pums_hu[is.na(num_adults), num_adults := 0] # Should never happen but here as a safeguard

    # Set hh target to be carried over at person level
    pums_hu[, h_kids := num_kids]
    pums_hu[, h_adults := num_adults]

    # Re-label the number of vehicles in each household for precomputation
    pums_hu[, num_vehicles := VEH]

    # Assign each adult as their own household
    pums_hu[, `:=`(SPORDER_orig = SPORDER, SERIALNO_orig = SERIALNO)]
    pums_hu[, SERIALNO := paste0(SERIALNO, stringr::str_pad(SPORDER, 2, pad = "0"))]
    pums_hu[, NP_adj := 1]
    
    # Drop children
    pums_hu = pums_hu[AGEP >= 18]

    # Record checksum
    check_dt = record_checksum(
        fname = check_path,
        append = TRUE,
        pums_checksum("drop children", pums_hu, "person")
    )

    # Update SPORDER to start at 1. This ensures we can still use the SPORDER to filter on HHs
    pums_hu = pums_hu[order(SERIALNO, SPORDER_orig)]
    pums_hu[, SPORDER := rowid(SERIALNO)]

    # Confirm every household has a record with SPORDER == 1
    ok_hh = pums_hu[SPORDER == 1, SERIALNO]
    stopifnot(all(pums_hu$SERIALNO %in% ok_hh))

    # Check that person and household weight totals agree within 5% after the
    # transformation (they should be similar, but not exactly the same)
    stopifnot(
        isTRUE(all.equal(
            sum(pums_hu$PWGTP),
            sum(pums_hu$NP_adj * pums_hu$WGTP),
            tolerance = 0.05
        ))
    )

    # Record checksum
    check_dt = record_checksum(
        fname = check_path,
        append = TRUE,
        pums_checksum("create person hh", pums_hu, "person")
    )
    
}

6.2.3 Reconcile Household and Person Weights

A key data management step in the target preparation process is reconciling household (WGTP) and person (PWGTP) weights. In PUMS, these two sets of weights are not guaranteed to be internally consistent: the sum of a household’s person weights generally does not equal its household weight multiplied by its size. To ensure coherent expansion factors, the script recalculates each household’s weight so that the adjusted household weight, multiplied by household size, matches the sum of its members’ person weights. This alignment means that total household counts and total person counts are derived from the same underlying expansion basis, preventing logical conflicts when PopulationSim later calibrates household and person targets simultaneously. The process also includes simple diagnostics to confirm that household and person totals remain in proportion after reconciliation.

# Balance Household and Person Weights if Requested
if (get("force_balance_hh_weights", settings)) {
    pums = force_balance_pums_weights(pums_hu)

    # Record checksum
    check_dt = record_checksum(
        fname = check_path,
        append = TRUE,
        pums_checksum("force balance hh weights", pums, "person")
    )

} else {
    pums = pums_hu
}

stopifnot(pums[is.na(WGTP), .N] == 0)

# Sort for Consistency (the Cleaned PUMS Data are Saved in Section 6.4)
pums = pums[order(SERIALNO, SPORDER)]
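The reconciliation can be sketched on toy data in base R. The internals of force_balance_pums_weights() are not shown in this chapter, so the rule used here (adjusted household weight = mean of the members' person weights) is an assumption chosen for illustration; the toy data frame and the WGTP_adj column are hypothetical.

```r
# Toy PUMS extract: two households, one row per person.
pums_toy = data.frame(
    SERIALNO = c("A", "A", "A", "B", "B"),
    SPORDER  = c(1, 2, 3, 1, 2),
    WGTP     = c(100, 100, 100, 80, 80),  # household weight, repeated per person
    PWGTP    = c(100, 110, 120, 80, 90)   # person weight
)

# ASSUMPTION: balanced household weight = mean person weight in the household.
pums_toy$WGTP_adj = ave(pums_toy$PWGTP, pums_toy$SERIALNO, FUN = mean)

# One row per household (SPORDER == 1) for household totals.
hh_toy = pums_toy[pums_toy$SPORDER == 1, ]

sum_hh_before = sum(hh_toy$WGTP)        # household total before balancing
sum_hh_after  = sum(hh_toy$WGTP_adj)    # household total after balancing
sum_per       = sum(pums_toy$PWGTP)     # person total (unchanged)
sum_hhwtXnp   = sum(pums_toy$WGTP_adj)  # household weight x household size
```

Under this rule the person total is untouched while the household total shifts slightly, which is the same pattern visible in the "Align PUMS to itself" row of the table in Section 6.7.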

6.2.4 Separate Households and Persons

In this step, RSG separates the cleaned PUMS data into distinct household and person datasets. It identifies which variables pertain to households (e.g., HINCP, NP) and which pertain to persons (e.g., AGEP, SEX). The script then splits the data accordingly, ensuring that each dataset contains only the relevant columns. This separation is crucial for subsequent target tabulation, as household-level targets (like household size and income) and person-level targets (like age and employment status) need to be processed independently. The script also retains common variables, such as PUMA and SERIALNO, in both datasets to facilitate later merging and analysis.

In other words, the step prepares a “household” and “person” dataset, mirroring the data structure of the HTS.

# Identify Household/Person Variables From Codebook/Settings
pums_hvars = get("pums_hvars", settings)
pums_hvars = c(pums_hvars, paste0(pums_hvars, "_label"))
pums_hvars = intersect(pums_hvars, names(pums))

pums_pvars = get("pums_pvars", settings)
pums_pvars = c(pums_pvars, paste0(pums_pvars, "_label"))
pums_pvars = intersect(pums_pvars, names(pums))

# Identify Common Variables Kept in Both Datasets (e.g., PUMA, SERIALNO)
pums_cvars = setdiff(names(pums), c(pums_hvars, pums_pvars))

# Split Into Households and Persons
pums_hh = pums[SPORDER == 1, c(pums_cvars, pums_hvars), with = FALSE]
pums_per = pums[, c(pums_cvars, pums_pvars), with = FALSE]

6.3 Configure Target Variables

Ahead of weighting, a set of target estimates is created to represent population distributions across household- and person-level characteristics. These target estimates are constructed from ACS data and are chosen to match key variables in the survey (e.g., household size, income, age). In this step, the demographic target variables are created from the information specified in the project settings. Household and person attributes affect survey response, which introduces bias into unweighted survey data. For example, larger households may be less likely to respond because of the additional time needed to complete the survey questions and travel diaries for each member. To correct these types of bias, a variety of household- and person-level target categories are selected as weighting targets.

Note (Key Concept): Target variables, categories, and estimates
  • Target variable: A household- or person-level characteristic used for weighting (e.g., household size, income).
  • Target category: A level or bin within a target variable (e.g., 1-person household, $30k-$50k income).
  • Target estimate (control total): The population count matching each target category, derived from ACS or another external source.

6.3.1 Create and Tabulate Target Variables

This step creates the target variables and categories as specified in the project settings. It then calculates the weighted target estimates for each target category. Before tabulation, household income is set to zero where it is negative. Negative income values in PUMS represent losses or debts, which can complicate target matching. By flooring these values at zero, the script ensures that income-based targets reflect values that HTS respondents can report (negative income is not a survey response option).

This step relies on a key function, prepare_targets, which processes the separated household and person datasets to generate the final target tables. The function uses the variable definitions from the codebook to categorize and tabulate the data according to the specified target variables. It also incorporates any necessary adjustments based on the study settings, such as scaling factors or demographic groupings. The output is a set of tabulated targets that reflect the weighted counts of households and persons across various categories, ready for use in PopulationSim. The script also includes checks to ensure that the tabulated targets align with expected totals and distributions, providing confidence in the accuracy of the prepared data. For additional details, see the documentation for prepare_targets() and associated functions (e.g., prep_target_age, prep_target_income).

# Floor Negative Household Income at Zero
pums_hh[HINCP < 0, HINCP := 0]

pums_hh[, `:=`(
    hh_id = SERIALNO,
    puma_id = PUMA,
    h_puma = PUMA
    )]

pums_per[, `:=`(
    hh_id = SERIALNO,
    person_id = paste0(SERIALNO, SPORDER),
    puma_id = PUMA,
    p_puma = PUMA
    )]

# Add Dummy Columns for Total Counts
pums_hh[, h_total := 1]
pums_per[, p_total := 1]

# Tabulate Targets for Weighting
pums_tabbed = prepare_targets(
    households = pums_hh,
    persons = pums_per,
    codebook = pums_vars,
    settings = settings
)

check_group_sums(pums_tabbed, settings)
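The internals of prepare_targets() are not reproduced in this chapter. As a rough base R sketch of the general idea (target variable, then target categories, then weighted target estimates), the example below bins a toy household income column into indicator columns and sums them by PUMA. The income breaks and the h_income_* column names are hypothetical; the real categories come from the project settings.

```r
# Toy household file with household weights.
hh_toy = data.frame(
    hh_id   = c("A", "B", "C", "D"),
    puma_id = c("5323301", "5323301", "5323302", "5323302"),
    HINCP   = c(-5000, 30000, 60000, 250000),
    WGTP    = c(50, 40, 30, 20)
)

# Negative incomes are floored at zero, as in the chapter's script.
hh_toy$HINCP[hh_toy$HINCP < 0] = 0

# HYPOTHETICAL income bins and category names.
breaks = c(0, 25000, 50000, 75000, Inf)
labels = c("h_income_0_25", "h_income_25_50", "h_income_50_75", "h_income_75_plus")
hh_toy$income_cat = cut(hh_toy$HINCP, breaks = breaks, labels = labels, right = FALSE)

# One 0/1 indicator column per target category.
indicators = sapply(labels, function(l) as.numeric(hh_toy$income_cat == l))

# Target estimate per PUMA = weighted sum of each indicator.
target_toy = rowsum(indicators * hh_toy$WGTP, group = hh_toy$puma_id)
```

Because every household falls in exactly one category, the categories of a target variable sum to the household total in each PUMA, which is the kind of consistency a group-sum check can verify.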

6.3.2 Aggregate Targets to PUMA Level

This step aggregates the tabulated household and person target estimates to the PUMA level. It sums the weighted counts for each target variable, ensuring that the totals reflect the population estimates for each PUMA. The aggregation is done separately for household and person targets, using the appropriate weights (WGTP for households and PWGTP for persons). After aggregation, the script merges the household and person target tables into a single dataset, keyed by PUMA ID. This consolidated target table serves as the basis for further adjustments to align with the study’s geographic zones. The script also includes checks to verify that the aggregated totals are consistent with expectations, providing an additional layer of quality assurance.

# Identify the Person-Level and Household-Level Target Variables
p_cols = grep("^p_", names(pums_tabbed), value = TRUE)
h_cols = grep("^h_", names(pums_tabbed), value = TRUE)

# Aggregate the Person Targets Separately on puma_id
per_targs = pums_tabbed[, lapply(.SD, function(x) sum(x * PWGTP)), by = puma_id, .SDcols = p_cols]

# Keep Only 1 Record per Hh for Hh Columns
hh_targs = pums_tabbed[, .SD[1], by = .(hh_id, puma_id, WGTP), .SDcols = h_cols]
hh_targs = hh_targs[, lapply(.SD, function(x) sum(x * WGTP)), by = puma_id, .SDcols = h_cols]

# Merge Household and Person Targets
target_puma = merge(per_targs, hh_targs, by = "puma_id", all = TRUE)

# Record Checksum
check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    data.table(
        dataset = "PUMS",
        step = "calculate targets",
        sum_hh_wt = target_puma[, sum(h_total)],
        sum_per_wt = target_puma[, sum(p_total)],
        sum_hhwtXnp = NA,
        n_rows = target_puma[, .N],
        n_hh = NA,
        n_per = NA,
        unit = 'PUMA'
    )
)

6.3.3 Optional: Adjust PUMS Targets to Weighting Zones

This step translates the PUMA-based control totals to the study’s weighting zones if PUMAs are not used as the weighting zones. This is done using a spatial crosswalk to the custom weighting zones (often called “client zones”). The script reallocates household and person targets in proportion to each PUMA’s overlap with the study zones. This ensures that control totals reflect the population distribution within the modeled region rather than the full PUMA extent.

Where no custom weighting zones are defined, the function performs a 1:1 pass-through, trimming any PUMAs that extend beyond the study boundary. A check is performed at the end of the step to confirm that aggregated totals remain consistent, providing confidence that the allocation preserves the original universe.

Note: PSRC Specifics

PSRC did not use custom weighting zones, so this step effectively behaves as a pass-through with sanity checks.

puma_zone_xwalk_sf = readRDS(file.path(
  get("working_dir", settings),
  "puma_czone_xwalk.rds"
))

# Adjust Targets to Client Zones Using the Crosswalk
target_czones = adjust_target_to_study_zones(
  puma_targets = target_puma,
  puma_crosswalk = puma_zone_xwalk_sf,
  settings
)

# Record Checksum
check_dt = record_checksum(
  fname = check_path,
  append = TRUE,
  data.table(
    dataset = "PUMS",
    step = "adjust to client zones adj target",
    sum_hh_wt = target_czones[, sum(h_total)],
    sum_per_wt = target_czones[, sum(p_total)],
    sum_hhwtXnp = NA,
    n_rows = target_czones[, .N],
    n_hh = NA,
    n_per = NA,
    unit = 'client zone'
  )
)
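Proportional reallocation with a crosswalk can be sketched in base R as follows. This is an illustration under stated assumptions, not the logic of adjust_target_to_study_zones(): the crosswalk layout, the prop column, and all values are hypothetical, and a zone id of -1 is used for the out-of-region share by analogy with the sample plan code later in this chapter.

```r
# PUMA-level targets (toy values).
target_puma_toy = data.frame(
    puma_id = c("P1", "P2"),
    h_total = c(1000, 500),
    p_total = c(2500, 1200)
)

# HYPOTHETICAL crosswalk: share of each PUMA that falls in each zone.
# Zone id -1 marks the part of a PUMA outside the study area.
xwalk_toy = data.frame(
    puma_id        = c("P1", "P1", "P2", "P2"),
    client_zone_id = c(1, 2, 2, -1),
    prop           = c(0.6, 0.4, 0.9, 0.1)
)

# Split each PUMA's targets across zones in proportion to the overlap.
alloc = merge(xwalk_toy, target_puma_toy, by = "puma_id")
alloc$h_total = alloc$h_total * alloc$prop
alloc$p_total = alloc$p_total * alloc$prop

# Drop the out-of-region share, then sum to zones.
alloc = alloc[alloc$client_zone_id != -1, ]
target_zone_toy = aggregate(
    cbind(h_total, p_total) ~ client_zone_id,
    data = alloc,
    FUN = sum
)
```

In this toy case the zone totals sum to the PUMA totals minus the 10% of P2 that lies outside the study area. When PUMAs are themselves the weighting zones, every proportion is 1 and the step reduces to a pass-through.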

6.3.4 Create Regional Targets

After creating and tabulating target estimates for each weighting zone, targets are then tabulated at the region level. These targets help PopulationSim match regional population totals and other targets constructed at the regional level (e.g., transit targets). They can also serve as a fallback if the model fails to converge at the finer zone level.

target_region = target_czones[, !"client_zone_id", with = FALSE][, lapply(.SD, sum)]
target_region[, region := 1]

6.3.5 Optional: Create Transit Targets

Some clients choose to incorporate a transit target to help ensure that the model accurately reflects transit usage patterns. If specified in the settings, this step adds transit boardings as a target at the regional level. Since transit boardings are not available in the PUMS data, the script pulls this information from an external source, ideally broken down by transit agency to serve as a proxy for sub-regional distribution. However, this requires boardings data, which may not always be available, especially at the granular level required to construct “typical weekday” estimates. The script adds the total average weekday ridership or boardings to the regional targets, providing an additional control for PopulationSim to match during the weighting process.

Note: PSRC Specifics

PSRC did not use a transit target.

if ("h_transit_trips" %in% names(settings$targets)) {
    h_transit_trips = prep_transit_target(settings)
    target_region[, names(h_transit_trips) := h_transit_trips]
}

# Check that Totals Add up Properly --------------------------------------------
check_group_sums(target_czones, settings)

6.3.6 Update the Targets with Day-of-Week Estimates

RSG’s weighting process enables clients to weight their data by individual day of the week or by groups of days. For example, a client might weight to separate targets for weekdays (Monday-Friday) and weekends (Saturday-Sunday), so that the weighted survey data align with the targets for each group. This requires minor restructuring of the target data for PopulationSim. Where day-of-week targets are not specified, this step creates a copy of the total household and population variables, labeled with the name given to the grouped days in the settings.

Note: PSRC Specifics

PSRC did not specify day-of-week targets.

day_groups_dt = get_day_groups(settings)
day_groups = unique(day_groups_dt$day_group)
dayregex = paste(day_groups, collapse = "|")
pattern = stringr::str_glue("^(p_|h_)((?!({dayregex})).)*$")

targets_ls = list(
    "puma" = copy(target_puma),
    "czone" = copy(target_czones),
    "region" = copy(target_region)
)

for (target_name in names(targets_ls)) {
    # Cols to update, non DOW columns
    targ_cols = stringr::str_subset(names(targets_ls[[target_name]]), pattern)

    # Assign DOW columns from h_total and p_total
    targets_ls[[target_name]][, paste0("h_dow_", day_groups) := h_total]
    targets_ls[[target_name]][, paste0("p_dow_", day_groups) := p_total]

    # Rescale everything by the number of day groups
    targets_ls[[target_name]][,
        (targ_cols) := lapply(.SD, function(x) x * length(day_groups)),
        .SDcols = targ_cols
    ]
}
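The effect of the regular expression and the rescaling can be seen in a small base R sketch. The day-group names and target columns below are hypothetical; the pattern is built the same way as in the script, and perl = TRUE is needed because base R's default regex engine does not support the negative lookahead.

```r
day_groups = c("weekday", "weekend")   # HYPOTHETICAL day-group names
dayregex = paste(day_groups, collapse = "|")
pattern = paste0("^(p_|h_)((?!(", dayregex, ")).)*$")

# Which columns count as non-day-of-week targets?
cols = c("client_zone_id", "h_total", "p_total", "h_size_1", "p_age_65_plus",
         "h_dow_weekday", "p_dow_weekend")
targ_cols = grep(pattern, cols, value = TRUE, perl = TRUE)

# Toy target table for one zone.
target_toy = data.frame(client_zone_id = 1, h_total = 100, p_total = 250)

# Day-of-week columns are copies of the unscaled totals.
for (g in day_groups) {
    target_toy[[paste0("h_dow_", g)]] = target_toy$h_total
    target_toy[[paste0("p_dow_", g)]] = target_toy$p_total
}

# All other target columns are multiplied by the number of day groups.
scale_cols = grep(pattern, names(target_toy), value = TRUE, perl = TRUE)
target_toy[scale_cols] = target_toy[scale_cols] * length(day_groups)
```

After rescaling, h_total equals the sum of the h_dow_* columns, so the overall totals and the day-group totals describe the same stacked universe. Dividing by the number of day groups recovers the single-day totals.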

6.3.7 Optional: Aggregate Target Categories for Weighting

Sometimes certain targets and target categories can have large error margins around the Census target estimates, which can impact the results of weighting. One option is to aggregate target categories for certain weighting zone groups or the entire region. This step will optionally aggregate target categories as specified in the project settings file.

Note: PSRC Specifics

Due to large error margins around the Census target estimates for the person-level commute mode categories bike, walk, and transit, these categories were aggregated for all weighting zone groups except King County - Seattle.

# Create a List of Zone Group and Regional Targets
targets = list("zone_group" = target_czones, "region" = target_region)

# Convert the PUMA to Client Zone Crosswalk to a `data.table`
puma_zone_xwalk = as.data.table(puma_zone_xwalk_sf)

# Create a Weighting Zone Group Crosswalk to Use in Updating Targets and Convert to a `data.table`
zone_group_crosswalk = prepare_zone_groups(
    seed = puma_zone_xwalk,
    targets = target_czones,
    settings = settings,
    show_plot = FALSE,
    replace = FALSE
)
zone_group_crosswalk = as.data.table(zone_group_crosswalk)

# Append the Zone Group to the Matching Weighting Zone ID in the Target Estimate Tables
target_czones[zone_group_crosswalk, zone_group := i.zone_group, on = "client_zone_id"]
target_region[, zone_group := "region"]

# Update the Targets as Specified in the Project Settings Configurations
updated_targets_czone = update_targets(
    target_czones, settings,
    geo_cross_walk = as.data.table(zone_group_crosswalk), geom = "zone_group"
)

updated_targets_region = update_targets(
    target_region, settings,
    geo_cross_walk = as.data.table(zone_group_crosswalk), geom = "zone_group"
)
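Collapsing categories for selected zone groups can be sketched in base R as below. The column names, zone group labels, and values are hypothetical, and this is not the implementation of update_targets(); it only illustrates the resulting pattern, in which detailed categories are blank (NA) where the combined category is used, and vice versa.

```r
# Toy person-level commute targets for two zone groups.
targets_toy = data.frame(
    zone_group        = c("Seattle", "Other"),
    p_commute_transit = c(620, 300),
    p_commute_walk    = c(370, 120),
    p_commute_bike    = c(130, 60)
)

collapse_cols = c("p_commute_transit", "p_commute_walk", "p_commute_bike")
collapse_in   = "Other"   # zone groups where the categories are merged

is_collapsed = targets_toy$zone_group %in% collapse_in

# The combined category is populated only where detail is dropped...
targets_toy$p_commute_bike_transit_walk = ifelse(
    is_collapsed,
    rowSums(targets_toy[collapse_cols]),
    NA
)

# ...and the detailed categories are blanked out there.
targets_toy[is_collapsed, collapse_cols] = NA
```

The combined estimate pools the sampling error of three small categories into one larger, more stable control.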

6.3.8 Review Household- and Person-Level Target Estimates

6.3.8.1 Table: Total HH and Population Targets for each weighting zone group

| Weighting Zone Group | Households | Persons |
|---|---:|---:|
| King County - Seattle | 368,012.2 | 729,129 |
| King County - Other | 541,295.8 | 1,373,543 |
| Kitsap County - Expanded | 201,550.5 | 502,271 |
| Pierce County | 313,545.5 | 803,114 |
| Snohomish County | 320,948.9 | 835,625 |
| Total | 1,745,353.0 | 4,243,682 |

6.3.8.2 Table: Household-Level Target Variables by Weighting Zone Groups: ACS 1-year Target Estimates

| Target Variable | Target Category | King County - Seattle | King County - Other | Kitsap County - Expanded | Pierce County | Snohomish County | Region |
|---|---|---:|---:|---:|---:|---:|---:|
| Household size | 1 | 156,491 | 133,620 | 48,670 | 83,386 | 75,695 | 497,862 |
| | 2 | 127,110 | 187,137 | 78,355 | 105,182 | 108,169 | 605,952 |
| | 3+ | 84,412 | 220,539 | 74,526 | 124,978 | 137,085 | 641,539 |
| Household income | $0-$24,999 | 43,569 | 49,399 | 21,427 | 32,804 | 31,423 | 178,621 |
| | $25,000-$49,999 | 36,634 | 54,603 | 22,300 | 41,303 | 37,079 | 191,919 |
| | $50,000-$74,999 | 39,487 | 59,248 | 35,266 | 44,094 | 44,539 | 222,633 |
| | $75,000-$99,999 | 35,324 | 58,226 | 25,197 | 44,753 | 38,900 | 202,400 |
| | $100,000-$199,999 | 104,632 | 159,080 | 69,075 | 109,689 | 105,446 | 547,922 |
| | $200,000+ | 108,367 | 160,740 | 28,286 | 40,903 | 63,561 | 401,857 |
| Number of workers | 0 | 64,488 | 102,754 | 52,106 | 68,537 | 63,698 | 351,583 |
| | 1 | 162,434 | 197,222 | 71,467 | 115,362 | 118,051 | 664,536 |
| | 2+ | 141,090 | 241,320 | 77,977 | 129,646 | 139,200 | 729,233 |
| Vehicle sufficiency | None | 76,991 | 37,006 | 8,523 | 16,193 | 14,904 | 153,618 |
| | Insufficient | 85,179 | 123,106 | 36,222 | 51,313 | 65,267 | 361,086 |
| | Sufficient | 205,842 | 381,184 | 156,806 | 246,040 | 240,777 | 1,230,649 |
| Number of children | 0 | 303,802 | 371,352 | 143,820 | 214,796 | 216,633 | 1,250,404 |
| | 1+ | 64,210 | 169,944 | 57,731 | 98,749 | 104,316 | 494,949 |
| Total: Households | | 368,012 | 541,296 | 201,551 | 313,546 | 320,949 | 1,745,353 |

6.3.8.3 Table: Total Households in PUMAs by Weighting Zone Group: ACS 1-year Target Estimates

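| Target Variable | Target Category | King County - Seattle | King County - Other | Kitsap County - Expanded | Pierce County | Snohomish County | Region |
|---|---|---:|---:|---:|---:|---:|---:|
| Total Households in PUMA | 5323301 | 0 | 54,363 | 0 | 0 | 0 | 54,363 |
| | 5323302 | 0 | 43,261 | 0 | 0 | 0 | 43,261 |
| | 5323303 | 0 | 74,179 | 0 | 0 | 0 | 74,179 |
| | 5323304 | 0 | 64,104 | 0 | 0 | 0 | 64,104 |
| | 5323305 | 0 | 57,591 | 0 | 0 | 0 | 57,591 |
| | 5323306 | 0 | 45,916 | 0 | 0 | 0 | 45,916 |
| | 5323307 | 0 | 43,711 | 0 | 0 | 0 | 43,711 |
| | 5323308 | 0 | 0 | 51,302 | 0 | 0 | 51,302 |
| | 5323309 | 0 | 51,522 | 0 | 0 | 0 | 51,522 |
| | 5323310 | 0 | 54,637 | 0 | 0 | 0 | 54,637 |
| | 5323311 | 0 | 52,012 | 0 | 0 | 0 | 52,012 |
| | 5323312 | 47,709 | 0 | 0 | 0 | 0 | 47,709 |
| | 5323313 | 42,353 | 0 | 0 | 0 | 0 | 42,353 |
| | 5323314 | 63,202 | 0 | 0 | 0 | 0 | 63,202 |
| | 5323315 | 64,788 | 0 | 0 | 0 | 0 | 64,788 |
| | 5323316 | 49,403 | 0 | 0 | 0 | 0 | 49,403 |
| | 5323317 | 48,003 | 0 | 0 | 0 | 0 | 48,003 |
| | 5323318 | 52,555 | 0 | 0 | 0 | 0 | 52,555 |
| | 5323501 | 0 | 0 | 55,078 | 0 | 0 | 55,078 |
| | 5323502 | 0 | 0 | 56,560 | 0 | 0 | 56,560 |
| | 5325301 | 0 | 0 | 0 | 54,734 | 0 | 54,734 |
| | 5325302 | 0 | 0 | 0 | 40,243 | 0 | 40,243 |
| | 5325303 | 0 | 0 | 0 | 38,343 | 0 | 38,343 |
| | 5325304 | 0 | 0 | 0 | 45,812 | 0 | 45,812 |
| | 5325305 | 0 | 0 | 0 | 50,283 | 0 | 50,283 |
| | 5325306 | 0 | 0 | 0 | 38,135 | 0 | 38,135 |
| | 5325307 | 0 | 0 | 0 | 45,996 | 0 | 45,996 |
| | 5325308 | 0 | 0 | 38,612 | 0 | 0 | 38,612 |
| | 5326101 | 0 | 0 | 0 | 0 | 57,883 | 57,883 |
| | 5326102 | 0 | 0 | 0 | 0 | 54,040 | 54,040 |
| | 5326103 | 0 | 0 | 0 | 0 | 51,159 | 51,159 |
| | 5326104 | 0 | 0 | 0 | 0 | 51,840 | 51,840 |
| | 5326105 | 0 | 0 | 0 | 0 | 42,767 | 42,767 |
| | 5326106 | 0 | 0 | 0 | 0 | 63,260 | 63,260 |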

6.3.8.4 Table: Person-Level Target Variables by Weighting Zone Groups: ACS 1-year Target Estimates

| Target Variable | Target Category | King County - Seattle | King County - Other | Kitsap County - Expanded | Pierce County | Snohomish County | Region |
|---|---|---:|---:|---:|---:|---:|---:|
| Gender | Male | 379,274 | 690,208 | 251,969 | 398,496 | 418,976 | 2,138,923 |
| | Female | 349,855 | 683,335 | 250,302 | 404,618 | 416,649 | 2,104,759 |
| Age | 0-4 | 31,595 | 75,075 | 28,200 | 45,198 | 49,710 | 229,778 |
| | 5-15 | 65,113 | 191,843 | 63,704 | 117,889 | 111,272 | 549,821 |
| | 16-17 | 11,927 | 34,836 | 10,204 | 23,407 | 22,911 | 103,285 |
| | 18-24 | 63,082 | 90,062 | 40,801 | 61,301 | 58,179 | 313,425 |
| | 25-44 | 311,646 | 431,913 | 139,414 | 246,289 | 258,077 | 1,387,339 |
| | 45-64 | 151,301 | 348,426 | 121,095 | 192,144 | 209,417 | 1,022,383 |
| | 65+ | 94,465 | 201,388 | 98,853 | 116,886 | 126,059 | 637,651 |
| Employment | Non worker | 262,999 | 629,775 | 253,772 | 390,930 | 395,599 | 1,933,075 |
| | Part-time | 82,074 | 138,237 | 56,702 | 77,056 | 86,823 | 440,892 |
| | Full-time | 384,056 | 605,531 | 191,797 | 335,128 | 353,203 | 1,869,715 |
| Commute mode | Bike/transit/walk | NA | 47,722 | 10,747 | 13,541 | 16,337 | NA |
| | Work from home | 134,902 | 163,590 | 34,840 | 51,388 | 75,630 | 460,350 |
| | Transit | 62,077 | NA | NA | NA | NA | 115,386 |
| | Walk | 37,122 | NA | NA | NA | NA | 67,584 |
| | Bike | 12,791 | NA | NA | NA | NA | 17,367 |
| | Other (includes auto) | 211,314 | 515,703 | 199,157 | 337,264 | 337,981 | 1,601,419 |
| | None | 270,923 | 646,528 | 257,527 | 400,921 | 405,677 | 1,981,576 |
| University student status | No | 674,154 | 1,300,589 | 477,629 | 766,055 | 798,835 | 4,017,262 |
| | Yes | 54,975 | 72,954 | 24,642 | 37,059 | 36,790 | 226,420 |
| Educational attainment | No college | 192,478 | 552,620 | 221,299 | 416,401 | 386,728 | 1,769,526 |
| | Some college | 536,651 | 820,923 | 280,972 | 386,713 | 448,897 | 2,474,156 |
| Race | Asian or Pacific Islander | 126,375 | 340,737 | 43,464 | 67,555 | 128,513 | 706,644 |
| | Black or African American | 46,566 | 78,842 | 28,661 | 61,461 | 32,174 | 247,704 |
| | Other | 114,087 | 243,238 | 95,642 | 182,011 | 148,631 | 783,609 |
| | White | 442,101 | 710,726 | 334,504 | 492,087 | 526,307 | 2,505,725 |
| Ethnicity | Not Hispanic | 666,368 | 1,214,119 | 441,688 | 694,024 | 732,639 | 3,748,838 |
| | Hispanic | 62,761 | 159,424 | 60,583 | 109,090 | 102,986 | 494,844 |
| Total: Persons | | 729,129 | 1,373,543 | 502,271 | 803,114 | 835,625 | 4,243,682 |

6.4 Save Cleaned and Tabulated PUMS Datasets

Note: The targets saved here are not the updated targets, because the Round 1 weighting scripts apply the target updates themselves when they prepare the control tables for PopulationSim. The updated outputs produced in this chapter are for review only, to confirm before weighting begins that the updates match those specified under target_updates in the project settings.

# Write Out File ---------------------------------------------------------
saveRDS(pums, file = file.path(get("working_dir", settings), "pums_cleaned.rds"))
saveRDS(pums_tabbed, file = file.path(get("working_dir", settings), "pums_tabbed.rds"))
saveRDS(targets_ls[['puma']], file = get("target_puma_path", settings))
saveRDS(targets_ls[['czone']], file = get("target_czone_path", settings))
saveRDS(targets_ls[['region']], file =  get("target_region_path", settings))

6.5 Allocate PUMS Estimates to Sampled Block Groups

The last step of target data preparation is the allocation of PUMS households and persons to sample segments. This makes it possible to calculate the base weight at the same geographic scale at which sampling occurred.

The allocation process involves a spatial crosswalk between PUMA and block group geographies. The fraction of each PUMA that overlaps with each block group is calculated, and this fraction is used to proportionally allocate PUMS households and persons to each block group. Totals are then aggregated to sample segments, which match the geographic scale of sampling.

Note: Small discrepancies (e.g., rounding errors) may occur during allocation, but total household and person estimates should remain consistent.

6.5.1 Load Data and Record Initial Target and Reference Sums

This step loads the tabulated target dataset (from 1-year PUMS/ACS) and the sample plan (from 5-year ACS), then records initial counts for households, persons, and observations.

# Load Client Zone and Regional Target Datasets (from ACS/PUMS).
target_czone = readRDS(get("target_czone_path", settings))
target_region = readRDS(get("target_region_path", settings))

# Number of Days Being Weighted (for Day-of-Week Splits).
n_weight_days = length(settings$weight_dow_groups)

check_path = file.path(get("report_dir", settings), "040_check_counts.csv")

# Record Initial Totals for Client Zone Targets.
check_dt = record_checksum(
    fname = check_path,
    append = FALSE,
    data.table(
        dataset = "PUMS",
        step = "initial client zone targets",
        n_hh = target_czone[, sum(h_total) / n_weight_days],
        n_per = target_czone[, sum(p_total) / n_weight_days],
        n_rows = target_czone[, .N],
        unit = 'client zone'
    )
)

# Load Value Labels and Sample Plan Used for Survey Assignment.
value_labels = fetch_hts_table("value_labels", settings)
sample_plan = fetch_hts_table("sample_plan", settings) 

# Record Reference Counts in the Sample Plan.
check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    data.table(
        dataset = "sample_plan",
        step = "initial sample plan reference counts",
        n_hh = sample_plan[, sum(ref_count_hh)],
        n_per = sample_plan[, sum(ref_count_per)],
        n_rows = sample_plan[, .N],
        unit = 'bg_geoid'
    ) 
)

6.5.2 Assign Weighting Zones to Sample Plan and Record Reference Sums

Weighting zone IDs are assigned to each block group in the sample plan. For block groups that partially overlap multiple weighting zones, the reference counts for households and persons are proportionally allocated using calculated area fractions. Zones outside the study area are excluded.

# Assign Weighting Zones (Client Zones) to Sample Plan Block Groups.
# If not Using Client Zones, the PUMA ID is the Client Zone.
if (!"client_zone_id" %in% names(sample_plan)) {
    bg_puma_czone_xwalk = readRDS(file.path(get("working_dir", settings), 'bg_puma_czone_xwalk.rds'))
    sample_plan = merge(
        sample_plan,
        bg_puma_czone_xwalk[, .(bg_geoid = GEOID, client_zone, client_zone_id, area_prop)],
        by = "bg_geoid",
        allow.cartesian = TRUE
    )

    # Adjust reference counts by area proportion for partial overlaps
    sample_plan[,
        `:=`(
            ref_count_hh = ref_count_hh * area_prop,
            ref_count_per = ref_count_per * area_prop,
            area_prop = NULL
        )
    ]

    # Remove zones outside the study area
    sample_plan = sample_plan[client_zone_id != -1]
}

# Record Checksum After Zone Assignment
check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    data.table(
        dataset = "sample_plan_outer",
        step = "assign weighting zones to sample plan",
        n_hh = sample_plan[, sum(ref_count_hh)],
        n_per = sample_plan[, sum(ref_count_per)],
        n_rows = sample_plan[, .N],
        unit = 'bg_geoid'
    ) 
)

6.5.3 Adjust Sample Plan Reference Counts to Match Target Data

Because block group and weighting zone boundaries do not align perfectly, the previous step scaled each block group’s reference counts by the share of its area that falls within each weighting zone (the PUMA, when no custom zones are defined). The resulting sums still differ from the target estimates, so an allocation factor is calculated to align them. The block group reference counts are summed to the weighting zone level, and each zone’s target estimate is divided by that sum to give the zone’s allocation factor. The factor is then applied to the reference counts of every block group in the zone, so that the adjusted reference counts sum to the target estimates for households and persons.

# Calibrate the Sample Plan so Totals Align with Targets.
sample_plan_adj = adjust_reference_to_target(
    ref_counts = sample_plan,
    targets = target_czone,
    settings
)

# Record Checksum for the Adjusted Sample Plan
check_dt = record_checksum(
    fname = check_path,
    append = TRUE,
    data.table(
        dataset = "sample_plan_adj",
        step = "adjust reference counts to PUMS data",
        n_hh = sample_plan_adj[, sum(ref_count_hh)],
        n_per = sample_plan_adj[, sum(ref_count_per)],
        n_rows = sample_plan_adj[, .N],
        unit = 'bg_geoid'
    ) 
)
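The allocation factor can be sketched in base R on toy data. This is not the implementation of adjust_reference_to_target(); the table layouts, the ref_sum_hh and alloc_factor names, and the values are hypothetical, and the sketch assumes the targets are already on a single-day scale (that is, not multiplied by the number of day groups). Only households are shown; persons would follow the same pattern.

```r
# Reference counts after zone assignment (toy values, one row per block group).
ref_toy = data.frame(
    bg_geoid       = c("BG1", "BG2", "BG3"),
    client_zone_id = c(1, 1, 2),
    ref_count_hh   = c(300, 150, 400)
)

# Zone-level household targets (toy values).
targ_toy = data.frame(client_zone_id = c(1, 2), h_total = c(500, 380))

# Sum reference counts to the zone, then factor = target / reference sum.
ref_sum = aggregate(ref_count_hh ~ client_zone_id, data = ref_toy, FUN = sum)
names(ref_sum)[2] = "ref_sum_hh"
fac = merge(ref_sum, targ_toy, by = "client_zone_id")
fac$alloc_factor = fac$h_total / fac$ref_sum_hh

# Apply each zone's factor to all of its block groups.
ref_adj = merge(
    ref_toy,
    fac[, c("client_zone_id", "alloc_factor")],
    by = "client_zone_id"
)
ref_adj$ref_count_hh = ref_adj$ref_count_hh * ref_adj$alloc_factor

# Adjusted reference counts now sum to the targets within each zone.
adj_sum = aggregate(ref_count_hh ~ client_zone_id, data = ref_adj, FUN = sum)
```

Within a zone, every block group is scaled by the same factor, so the relative distribution of reference counts across block groups is preserved while the zone total moves to the target.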

6.6 Save Adjusted Sample Plan for Base Weight Calculation

# Write Out File ---------------------------------------------------------
saveRDS(sample_plan_adj, file = file.path(get("working_dir", settings), "sample_plan_adj.rds"))
saveRDS(sample_plan, file = file.path(get("working_dir", settings), "sample_plan_unadj.rds"))

6.7 Table: Total PUMS-estimated Households and Persons at each Step of Pre-Processing

Note: PSRC Specifics

PSRC did not use custom weighting zones, so the allocation of PUMS to specified weighting zones should not show any change in estimates.

| Step | Household estimate | Person estimate |
|---|---:|---:|
| Initial PUMS file | 1,737,345 | 4,322,435 |
| Remove Group Quarters Residents | 1,737,345 | 4,243,682 |
| Align PUMS to itself | 1,745,353 | 4,243,682 |
| Allocate PUMS to sample plan without adjustment | 1,673,496 | 4,163,187 |
| Allocate PUMS to sample plan with adjustment | 1,745,353 | 4,243,682 |