7 Survey Data Preparation

Tip

What You’ll Do in This Chapter

In this chapter, we prepare the HTS dataset for weighting by removing incomplete households, assigning households to relevant geographies, imputing key demographic variables, and formatting the data for PopulationSim.

8 Chapter Setup

8.1 Load Packages, Settings, and Directory paths

First, we set up our environment by loading necessary packages, settings, and directory paths.

# Load the Local Package for Development
devtools::load_all()


# Load Settings; Pass python_env Explicitly for Quarto
settings = get_settings(reload_settings = TRUE, 
                        print = FALSE)

# Extract Commonly Used Paths and Labels From Settings
code_root  <- get("code_root", settings)
work_dir   <- get("working_dir", settings)
inputs_dir <- get("inputs_dir", settings)
outputs_dir<- get("outputs_dir", settings)
report_dir <- get("report_dir", settings)
popsim_initial_label <- get("popsim_initial_label", settings)
popsim_daypat_label  <- get("popsim_daypat_label", settings)

8.2 Load Survey and PUMS Data

Next, we load the survey data tables, along with PUMS data that will be used later during missing data imputation.

sample_plan = fetch_hts_table("sample_plan", settings)
households = fetch_hts_table("household", settings)
persons = fetch_hts_table("person", settings)
days = fetch_hts_table("day", settings)
trips = fetch_hts_table("trip", settings)
value_labels = fetch_hts_table("value_labels", settings)
variable_list = fetch_hts_table("variable_list", settings)
pums_tabbed = readRDS(file.path(work_dir, 'pums_tabbed.rds'))
pums_cleaned = readRDS(file.path(work_dir, 'pums_cleaned.rds'))
pums_codebook = read_pums_codebook(settings)

#FIXME: In Raw Data, First Column Name Should be "variable."
setnames(variable_list, old = "V1", new = "variable", skip_absent = TRUE)

9 Remove Incomplete Households

RSG’s weighting process is configured to only weight households with complete data. These households are usually removed from the HTS dataset upstream of weighting, so this section serves as a safeguard.

Note

A household is considered complete and eligible for weighting if the entire household provided complete data on at least one eligible day of the week.

9.1 For Person-Level Studies - Limit to 1 Participant Per Household

The next step checks if the study is person-level. If so, it ensures that only one participant per household is included in the dataset. If multiple participants are found in a household, an error is raised, prompting the user to decide how to handle this case. Additionally, households with no participating persons are flagged with a warning. Finally, only the participating persons are retained in the dataset, and households without any participants are dropped.

Note

PSRC is a household-level study, so this code block will not execute.

if (get("study_unit", settings) == "person") {
    # If person level study then person #1 is the only person in the household
    # to be weighted.

    n_participants = persons[is_participant == 1, .N, by = hh_id]

    # Check if there are any households with more than one person that took the survey

    if (any(n_participants[, N > 1])) {
        # You will need to decide how to handle this case.
        # Likely split each person into their own household
        msg = stringr::str_glue(
            "Person level study but {sum(multiple_participants)} households have >1 participating persons"
        )
        stop(msg)
    }

    # Check if any households have no participating persons
    if (any(n_participants[, N < 1])) {
        msg = stringr::str_glue(
            "Person level study but {sum(zero_participants)} households have no participating persons"
        )
        warning(msg)
    }    

    # Keep only the person who took the survey
    persons = persons[is_participant == 1]

    # Drop households with no participating persons
    households = households[hh_id %in% persons$hh_id]

}

9.2 Drop Households with Incomplete/No Study Days

These households should not be in the dataset at this point but it can happen with “messy” completes (e.g., if completion status is manually set for some households). This code identifies weighted weekdays (day_ints) and keeps only households that have at least one complete study day on those weekdays.

day_ints = unlist(settings$weight_dow_groups, use.names = FALSE)

hh_with_study_days = days[
    is_complete == 1 &
    hh_day_complete == 1 &
    travel_dow %in% day_ints,
    unique(hh_id)
]
households = households[hh_id %in% hh_with_study_days]

9.3 Set Correct Block Group Field

Next, if use_reported_home is true in the settings, households are assigned to the block group corresponding to their reported home location (rather than their sampled home location). The appropriate block group field (2010 or 2020) is selected based on the ACS year specified in the settings. The code then creates a new column bg_geoid in the households data table, which contains the block group geoid used for sample segment assignment. Finally, the block group geoid is padded to ensure it is 12 digits long.

Note

Why use reported home location instead of sampled home location?

Across RSG’s studies, approximately 2–5% of households report a home address that differs from the sampled address to which the invitation was sent. In many cases, the reported home address is in a different weighting zone or block group from the sampled home address.

RSG tracks sampled home locations during fieldwork to monitor response rates and adjust sample plans. However, using the reported home location provides a more accurate representation of where the household actually resides. This is especially important for weighting and segment assignment, as it ensures households are categorized based on their true geographic context, which can influence demographic characteristics and travel behavior.

# Regex if 2020 or 2010 home_bg_201x or home_bg_202x
regex = ifelse(settings$acs_year < 2020, "^home_bg_201[0-9]", "^home_bg_202[0-9]")
home_bg_field = str_subset(names(households), regex)

# Get ABS Sample Segments Based on Block Group of Home Location
if (get("use_reported_home", settings)) {
  households[, home_bg := get(home_bg_field)]
  households[, bg_geoid := get(home_bg_field)]
} else {
  # Use sample home location if it is available
  households[, home_bg := get(home_bg_field)]
  households[, bg_geoid := fifelse(is.na(sample_home_bg), get(home_bg_field), sample_home_bg)]
}

# Assign Block Group to Persons Based on Household bg_geoid
persons[
    households, 
    `:=`(
        bg_geoid = i.bg_geoid,
        home_bg = i.home_bg
    ),
    on = "hh_id"
]
persons[is.na(bg_geoid), bg_geoid := home_bg]
persons[, bg_geoid := str_pad(bg_geoid, width = 12, pad = 0)]

# Drop Households That don't Have a BG in the Sample Plan
households[, bg_geoid := str_pad(bg_geoid, width = 12, pad = 0)]

9.4 Drop Households Not in Region From All Datasets

In this step, households that are not located within the defined study region (i.e., their block group geoid is not present in the sample plan) are removed from the dataset. Typically, upstream data processing removes these households before they reach this stage, but this code serves as a safeguard to ensure that only relevant households are retained for weighting.

cat("Dropping ", nrow(households[!bg_geoid %in% sample_plan$bg_geoid]), " households not in region")

Dropping  0  households not in region

# Drop Households Not in Region, Cascade to Persons, Day, Trips
households = households[bg_geoid %in% sample_plan$bg_geoid]
persons = persons[hh_id %in% households$hh_id]
days = days[hh_id %in% households$hh_id]
trips = trips[hh_id %in% households$hh_id]

9.5 Run Data Check

These checks ensure that all households are complete, located within the study region, and present in the sample plan. Additionally, they verify that the households and persons tables are symmetric, meaning that every household has corresponding persons and vice versa.

# Check for Any Incomplete Households
stopifnot(nrow(households[is_complete != 1]) == 0)

# Check for Households Outside the Region
stopifnot(nrow(households[home_in_region == 0]) == 0)

# Check for Households not in Sample Plan
stopifnot(nrow(households[!bg_geoid %in% sample_plan$bg_geoid]) == 0)

# Ensure the Tables are Symmetric 
stopifnot(nrow(persons[!hh_id %in% households$hh_id]) == 0)
stopifnot(nrow(households[!hh_id %in% persons$hh_id]) == 0)

10 Assign Sample Segment Using Reported Home Block Group

This section deals with sample segment assignment, preparing households for base weight estimation.

Sample segments are geographic strata used to structure survey sampling and track response rates. Assigning each household to the correct segment ensures that sampling effort and observed response rates are accurately reflected in the weighting process. The code below assigns sample segments to households based on their reported home block group geoid, using the provided sample plan. It also handles special cases for supplemental segments (e.g., non-probability segments) if specified in the settings.

Note

Why does segment assignment matter for weighting? Base weights in survey sampling adjust for the probability that each unit (person, household, etc.) was selected. If sampling effort is stratified by segment (e.g., urban/rural, age group), each segment may have a different chance of being sampled and a different response rate.

Constructing base weights using observed response rates within each segment ensures that survey results are representative of the target population. Segments with lower response rates get higher weights, compensating for underrepresentation. This corrects for sampling bias and allows valid population estimates.

First, persons are assigned to block groups based on their household’s block group geoid.

# Used to Keep Track of the Number of Households or Persons in the Final Seed as it Gets Processed
expected_hh = nrow(households)
expected_per = nrow(persons)

10.1 Join Segment Label to Households Table

We start with the original sample segment from the households table.

# Add _orig Suffix Because we are Going to Add New Cols From sample_plan
setnames(households, "sample_segment", "segment_num", skip_absent = TRUE)
value_labels[, value := as.integer(value)]
segments = value_labels[variable == "sample_segment", .(segment_num = value, label)]
households[segments, segment_name := i.label, on = "segment_num"]

10.2 Set Invitation Column

All households are presumed to be invited via address-based sampling (ABS), unless non-probability segments (NPS)are specified in the settings. If NPS segments are defined, households in those segments are marked as “NPS” invitations, while all others remain “ABS.”

PSRC does not have NPS segments, so all households will be marked as “ABS.”

Note

NPS households may require different treatment due to their non-probability sampling method, especially if they comprise a large (more than ~10%) share of the overall sample. For surveys with a large NPS segment, or if desired, RSG can “blend” ABS and NPS samples using the functions blend_initial_weights() and calc_alpha() during base weight estimation. These functions leverage modern composite weighting methods (See: Fahimi et al. 2024)[https://publichealth.jmir.org/2024/1/e48186/]. For surveys with small NPS segments, RSG typically treats them as part of the ABS sample for weighting purposes.

if ("nps_segments" %in% names(settings) && !is.null(settings$nps_segments)) {
    nps_segments = value_labels[
        variable == "sample_segment" &
        label %in% settings$nps_segments,
    ]

    nps_seg_num = unique(nps_segments$value)
    stopifnot(length(nps_seg_num) > 0)

    households[!segment_num %in% nps_seg_num, invitation := "ABS"]
    households[segment_num %in% nps_seg_num, invitation := "NPS"]
} else {
    households[, invitation := "ABS"]
}

stopifnot(households[is.na(invitation), .N] == 0)

10.3 Assign Segment by Sample Plan

Households are assigned to sample segments based on the reported home block group geoid (rather than sampled block group geoid) using the sample plan. This ensures that each household is correctly categorized according to its geographic segment for weighting purposes.

# Joining Households Table to Sample Plan for ABS Segments on bg_geoid
# Enforce 12 Digit Geoids
sample_plan[, bg_geoid := str_pad(bg_geoid, width = 12, pad = 0)]

# Assert That Each BG is Represented Once
stopifnot(!any(sample_plan[, .N > 1, keyby = bg_geoid]$V1))
# Even Though Segment Should be in there Already, we do this Because:
# - Confirms based on home location, not sampled address location.
# - Handles the case where the segment is missing.
# - Enforces consistent sample segment names.

households[, segment_name_orig := segment_name]

# Assign Segment Name from Sample Plan not what was Reported
households[
    sample_plan,
    segment_name := i.segment_name,
    on = .(bg_geoid)
]

Next, if NPS segments are defined in the settings, households in those segments are assigned a supplemental segment name based on their segment number.

# Add Supplemental Segment for NPS Segments
if ("nps_segments" %in% names(settings) && !is.null(settings$nps_segments)) {
    households[
        nps_segments,
        segment_name := i.label,
        # Do this if you want a single segment name for all NPS segments
        # segment_name := "supplemental",
        on = .(segment_num = value)
    ]
}

Checks are run to ensure that all households have been assigned a sample segment, that all household block group geoids are present in the sample plan, and that each block group belongs to only one segment.

# Make Sure All the Households Have a Sample Segment
stopifnot(households[is.na(segment_name), .N] == 0) 

#  Make Sure All the Household geoids are in the Sample Plan
stopifnot(
  all(households[invitation == "ABS", bg_geoid] %in% sample_plan$bg_geoid)
)

# Make Sure Each BG Belongs to only One Segment
stopifnot(
  all(
    households[invitation == "ABS", .N, keyby = .(bg_geoid, segment_name)][,
    .N, keyby = bg_geoid]$N == 1
))

Segment information is then added to the persons table based on their household’s segment assignment.

# Add Segment Information to Persons Table for University Sample
persons[
    households,
    `:=`(
        invitation = i.invitation,
        segment_name = i.segment_name
    ),
    on = "hh_id"
]

10.4 Table: Households by Segment - Before and After Re-Assignment

This table summarizes the number of households in each segment before and after reassignment based on block group geoid. It provides a comparison of household counts to illustrate how segment assignments have changed. Large changes may indicate issues with segment definitions or sample plan accuracy.

Segment	Before Re-assignment	After Re-assignment
King County-General	460	456
King County-Hard-to-survey	245	250
King County-Rural	79	76
King County-Walk/Bike/Transit	385	389
Kitsap County-General	78	81
Kitsap County-Hard-to-survey	14	14
Kitsap County-Rural	44	43
Pierce County, Incorporated-General	116	120
Pierce County, Incorporated-Hard-to-survey	45	48
Pierce County, Rural (All)	490	483
Pierce County, Unincorporated-General	495	498
Pierce County, Unincorporated-Hard-to-survey	121	116
Snohomish County-General	231	231
Snohomish County-Hard-to-survey	56	55
Snohomish County-Rural	51	50
Total	2,910	2,910

11 Optional: Restructure Households to Adjust for Unrelated Householders

In most RSG HTSs, unrelated householders (e.g., roommates, nonrelatives) are not included in trip diaries, so they must be excluded from household size and related calculations. This function identifies unrelated persons as those present in the person table but not in the day table, removes them from household size, and adjusts vehicle counts and other attributes to reflect the reduced household size.

This step is only performed if specified in the settings (i.e., unrelated_adjustment is set to “restructure”); note that this setting also controls the adjustment of PUMS data to exclude unrelated householders.

Warning

This process has been deprecated in favor of the person method of adjustment, wherein unrelated persons are assigned zero person weights and weights for remaining persons are adjusted upwards to preserve household totals. Household weights remain unchanged. Used for person-level weighting.

This method is preferred because it does not restructure the HTS nor PUMS data, avoiding unintended downstream consequences; it also preserves the full household composition for weighting targets that may require it.

if (get("unrelated_adjustment", settings) == "restructure") {

    days = fetch_hts_table("day", settings)

    message("Adjusting for unrelated householders")
    adjusted = adjust_unrelated_survey(households, persons, days, settings)
    households = adjusted$households
    persons = adjusted$persons

}

12 Assign Households to Weighting Zones (PUMAs)

RSG matches the geographic structure of the survey data to that of the weighting targets by assigning households and persons to Public Use Microdata Areas (PUMAs). This section assigns PUMA IDs to households and persons based on their block group geoid, which is essential for aligning the survey data with census-based weighting targets.

12.1 Append PUMA IDs to Households/Persons Data

To kick off weighting zone group assignment - which is based on PUMA geographies - households and persons are assigned PUMA IDs based on their block group geoid. If a household’s block group geoid is missing, the home block group is used as a fallback. The corresponding census tract ID is derived from the block group geoid, and a crosswalk is used to map tracts to PUMAs. Finally, the PUMA ID is added to both the households and persons tables.

households[is.na(bg_geoid), bg_geoid := home_bg]
households[, tract_id := str_sub(bg_geoid, 1, 11)]

tracts_to_pumas = get_tracts_puma_xwalk(
    settings,
    puma_year = settings$pums_year,
    tract_year = settings$acs_year
)

households[
  tracts_to_pumas,
  puma_id := i.PUMA,
  on = .(tract_id)
]

households[is.na(puma_id), puma_id := home_puma_2022]
persons[households, puma_id := i.puma_id, on = "hh_id"]

# h_puma ------------------------------
households[, h_puma := puma_id]

# p_puma ------------------------------
persons[, p_puma := puma_id]

13 Prepare Key Variables: Income, Household Size

This section serves as a safeguard to ensure that key variables required for weighting, specifically household income and household size, are correctly prepared and consistent with the requirements of RSG’s weighting process.

Caution

This code will likely be deprecated in the future as these variables are typically prepared upstream of weighting. A patch to hts.iochecker schema is the preferred solution to ensure these variables are correctly formatted before reaching this stage.

13.1 Assign Broad Income Categories

Household income is a key variable for weighting and analysis. This section assigns broad income categories to households based on their reported income. Helper function get_income_broad() is concatenates detailed income responses (income_detailed) with non-response followup responses.

# If income_followup is Missing, use income_broad (legacy)
if (!"income_followup" %in% value_labels$variable) {
    inc_broad_labs = copy(value_labels[variable == 'income_broad', ])
    inc_broad_labs[, variable := 'income_followup']
    value_labels = rbind(value_labels, inc_broad_labs)
}

if(!"income_followup" %in% names(households)){
  households[, income_followup := income_broad]
}

# Recode 995 to 999 so that Income Imputation Catches it
households[income_followup == 995, income_followup := 999L]

# Label Bins and Top Code if Necessary
households = get_income_broad(households, value_labels, settings)

# Assert no Missing Labels
stopifnot(all(!is.na(households$income_broad)))

13.2 Calculate Household Size

Household size is calculated upstream of the weighting process, but here we check that it is consistent with the number of persons in each household. If the study unit is household, we count the number of persons per household and compare it to the num_people field in the households table. If there are discrepancies (which may occur due to top-coding of household size), they are reported for review.

if (get("study_unit", settings) == "household") {
    hh_size = persons[, .N, keyby = hh_id]

    # Check that hh_size is consistent with the number of persons in the household
    households[, num_people_orig := num_people]
    households[hh_size, num_people := i.N, on = "hh_id"]
    
    inc_hhs <- nrow(households[num_people != num_people_orig])
    
    cat("Households with inconsistent household size: ", inc_hhs)
    

}

Households with inconsistent household size:  0

13.3 Run Checks

Here, we check that data preparation steps have preserved the expected number of households and persons. We calculate the total number of unique households and persons in the dataset and compare them to the expected counts stored earlier. If there are discrepancies, an error is raised to alert the user.

# Check Unique Household and Person Counts
households[, h_total := .N, by = hh_id]
persons[, p_total := .N, by = person_id]

stopifnot(households[h_total != 1, .N] == 0)
stopifnot(persons[p_total != 1, .N] == 0)

# Check that Totals Have Not Changed
stopifnot(nrow(households) == expected_hh)
stopifnot(nrow(persons) == expected_per)

13.4 Sort and Save Cleaned Household and Persons Data

households = households[order(hh_id)]
persons = persons[order(person_id)]

saveRDS(households, file.path(work_dir, 'households_survey_cleaned.rds'))
saveRDS(persons, file.path(work_dir, 'persons_survey_cleaned.rds'))

14 Survey Data Imputation

The income, gender, and ethnicity questions in the survey allowed participants to respond with “prefer not to answer.” Additionally, non-related household members (i.e., roommates) were not required to supply their gender, race, or ethnicity; and children were not required to provide their race or ethnicity. To facilitate data weighting, imputed missing values for these variables are created when a participant was not required to answer or selected “prefer not to answer.”

14.1 Income Imputation

This process imputed income using an ordinal regression model. Specifically, using a proportional odds logistic regression model (POLR), which is designed for predicting ordinal outcomes (appropriate for income brackets ). The model estimates the probability that an observation falls at or below a particular income bracket, based on the values of the predictor variables. Missing income was predicted based on a set of independent variables including:

Income distribution of the block group based on ACS 2022 5-year data, with the median income ($75-100K) used as a reference level.
Number of non-working adults in the household.
Number of children in the household.
Employment status of household members.
Educational attainment of household members.
Age of the primary survey respondent.
Whether the home is owned by the household.
Whether the household is a single-family home.

Note

The function impute_income_nonrelatives imputes household income for households with nonrelatives using PUMS and survey variables. This is implemented if impute_unrelated_income is TRUE in the settings. A user might wish to use this functionality if unrelated householders were specifically excluded from the household income question.

This function is not typically applied but is retained here for backwards compatibility.

if (any(stringr::str_detect(names(settings$targets), "income"))) {

    # Income PNTA ------------------------------------------------------------
    hh_income_pnta = impute_income_pnta(
        households = households,
        persons = persons,
        value_labels = value_labels,
        settings = settings
    )

    ## Nonrelatives' Income ----------------------------------------------------
    hh_income_imputed_nr = impute_income_nonrelatives(
        persons = persons,
        pums_cleaned = pums_cleaned,
        hh_income_pnta = hh_income_pnta,
        codebook = pums_codebook,
        value_labels = value_labels,
        settings = settings
    )

    ## Join imputed income back to households ---------------------------------
    households[hh_income_imputed_nr, `:=`(
        income_imputed = i.income_imputed,
        income_imputed_label = i.income_imputed_label
        ), on = "hh_id"]

    label_df = unique(households[, .(income_imputed_label)])
    label_df[
    , 
    income_imputed_value := get_income_bounds(income_imputed_label, direction = 'upper'),
    ]

    households[
    label_df,
    income_imputed_value := i.income_imputed_value,
    on = .(income_imputed_label)
    ]

}

14.1.1 Table: Income imputation model summary

Here, we show the summary of the income imputation model. Significant coefficients (low p-values) indicate reliable predictors (often things like education or homeownership) while others may vary by dataset or model run. Coefficient signs should align with expectations (e.g., higher education -> higher income). McFadden’s R² values around 0.1–0.3 generally indicate a good fit for behavioral or survey models, while values below 0.1 suggest weak fit and above 0.3 are unusually strong. Variable significance can shift between runs – interpret the overall direction and pattern of effects rather than any single estimate.

Parameter	Estimate	Std Error	T-Statistic	p.value
f_0_24999	-0.022	0.600	-0.04	0.97
f_25000_49999	0.884	0.650	1.36	0.173
f_50000_74999	0.806	0.656	1.23	0.219
f_100000_199999	1.700	0.543	3.13	0.002
f_200000_plus	3.987	0.496	8.03	<0.001
nonworking_adult_n	0.301	0.058	5.19	<0.001
child_n	0.013	0.054	0.24	0.807
ft_grad_n	2.243	0.104	21.48	<0.001
pt_grad_n	0.897	0.212	4.23	<0.001
ft_bachelors_n	1.808	0.096	18.76	<0.001
pt_bachelors_n	0.417	0.179	2.33	0.02
ft_no_college_n	1.088	0.088	12.31	<0.001
pt_no_college_n	0.054	0.142	0.38	0.703
head_under_35_n	-0.014	0.104	-0.13	0.894
head_65_plus_n	-0.124	0.104	-1.19	0.234
own_home	1.151	0.099	11.66	<0.001
single_family_home	0.351	0.095	3.68	<0.001
McFadden's rho-squared: 0.17

14.2 Gender Imputation

The process probabilistically assigned missing gender responses using a Monte Carlo procedure based on the sample data’s gender distribution within the respondent’s age category.

if (any(stringr::str_detect(names(settings$targets), "gender"))) {
    gender_dt = impute_gender(
    persons = persons,
    value_labels = value_labels,
    report_dir = report_dir,
    outputs_dir = outputs_dir,
    seed = get('rng_seed', settings),
    settings = settings
    )

    if ("gender" %in% names(persons)) {
        persons[, gender_orig := gender]
    }

    # Join imputed gender back to persons
    persons[gender_dt, gender_imputed := i.gender_imputed, on = .(person_id)]

}

14.2.1 Table: Gender Distribution: Original (Relabeled) vs Imputed

Gender distribution is summarized here before and after imputation, broken out by adults and children. This allows for comparison of counts to see how imputation affected the gender breakdown. Large changes may indicate issues with the imputation model or assumptions.

age	gender	num_before_imputation
Adult	Man/Boy (cisgender or transgender)	2,291
Adult	Non-binary/Something else fits better	65
Adult	Prefer not to answer	418
Adult	Woman/Girl (cisgender or transgender)	2,392
Child	Man/Boy (cisgender or transgender)	397
Child	Non-binary/Something else fits better	6
Child	Prefer not to answer	87
Child	Woman/Girl (cisgender or transgender)	381

age	gender	num_after_imputation
Adult	female	2,639
Adult	male	2,527
Child	female	423
Child	male	448

14.3 Race and Ethnicity Imputation

The imputation process probabilistically assigned missing race and ethnicity responses using a Monte Carlo procedure based on the ACS data’s race and ethnicity distribution within the respondent’s reported home block group.

# If Race is not in Targets, Skip Imputation
if (any(stringr::str_detect(names(settings$targets), "race"))) {
    race_dt = impute_race(
    persons = persons, 
    variable_list = variable_list, 
    settings = settings
    )

    if ("race" %in% names(persons)) {
        persons[, `:=`(race_orig = race, race = NULL)]
    }
    # Join imputed race back to persons
    persons[race_dt, `:=`(race = i.race, race_imputed = i.race_imputed), on = "person_id"]
}

# Impute Ethnicity if it is in Targets
if (any(stringr::str_detect(names(settings$targets), "ethnicity"))) {
    eth_dt = impute_ethnicity(
        persons = persons, 
        variable_list = variable_list, 
        settings = settings
    )
    if ("ethnicity" %in% names(persons)) {   
        persons[, `:=`(ethnicity_orig = ethnicity, ethnicity = NULL)]
    }
    persons[eth_dt, ethnicity_imputed := i.ethnicity_imputed, on = "person_id"]
}

14.3.1 Table: Race Distribution: Original (Relabeled) vs Imputed (Counts and Percentages)

adult	Race	Original (relabeled)	Imputed	Pct Original	Pct Imputed
Adult	afam	107	134	2.1%	2.6%
Adult	asian_pacific	488	555	9.4%	10.7%
Adult	missing	599	0	11.6%	0.0%
Adult	other	342	454	6.6%	8.8%
Adult	white	3,630	4,023	70.3%	77.9%
Child	afam	0	50	0.0%	5.7%
Child	asian_pacific	0	99	0.0%	11.4%
Child	missing	871	0	100.0%	0.0%
Child	other	0	175	0.0%	20.1%
Child	white	0	547	0.0%	62.8%

14.3.2 Table: Imputation Rates of Race and Ethnicity Across Weighting Zone Groups

Weighting Zone Group	Pct Imputed Race	Pct Imputed Ethnicity
King County - Seattle	18.1%	18.6%
King County - Other	27.1%	28.5%
Kitsap County - Expanded	23.7%	26.1%
Pierce County	26.8%	28.4%
Snohomish County	21.5%	22.8%

14.4 Save Imputed Data

saveRDS(households, file.path(work_dir, 'households_survey_imputed.rds'))
saveRDS(persons, file.path(work_dir, 'persons_survey_imputed.rds'))

15 Format Survey Data for PopulationSim

15.1 Configure Target Variables

The final step in survey data preparation is generating survey data tables in PopulationSim input format. Using the processed survey household and person files, this step prepares a PopulationSim “seed” or incidence table and tabulated summaries at the level of weighting zone groups.

# Load Data =====================================================================
sample_plan = fetch_hts_table("sample_plan", settings)
value_labels = fetch_hts_table("value_labels", settings)
days = fetch_hts_table("day", settings)
trips = fetch_hts_table("trip", settings)
households_raw = fetch_hts_table("household", settings)

households = readRDS(file.path(work_dir, 'households_survey_imputed.rds'))
persons = readRDS(file.path(work_dir, 'persons_survey_imputed.rds'))

target_puma = readRDS(get("target_puma_path", settings))
target_czones = readRDS(get("target_czone_path", settings))
target_region = readRDS(get("target_region_path", settings))


# Prepare & Tabulate Targets ===================================================
# This tabulates the prepared data into seed targets for the weighting process
# In other words, it converts the targets to one-hot encoding (wide format)

seed = prepare_targets(
    households,
    persons,
    value_labels,
    settings
)

seed[households, puma_id := i.puma_id, on = "hh_id"]
seed[, region := 1]
seed[, h_total := 1]

15.2 Append Sample Segments and Weighting Zones to Seed

This step appends segment names and client zones to a “seed” dataset by joining with household and sample plan data. It first assigns segment-related fields from households, ensuring every record has a segment name. Then, depending on the zone type in the settings, it adds weighting zone information either directly from the sample plan or by spatially joining household coordinates to weighting zones. If client zones are not used, it assigns PUMA IDs as the weighting zone IDs instead of specified client zones.

# Add the Segment ----------------------------------------------
seed[households, `:=`(
    bg_geoid = i.bg_geoid,
    segment_name = i.segment_name,
    invitation = i.invitation
    ),
    on = "hh_id"
]

stopifnot(seed[is.na(segment_name), .N] == 0)

# Add Client Zones ------------------------------------------------

# If Using Client Zones, Add Them to the Seed. Else Use PUMAS as Client Zones
if (
    get("zone_type", settings) == "client_zones"
) {
    
    # If client zone is defined in the sample plan, use that, otherwise use lat/lon
    if ("client_zone" %in% names(sample_plan)) {
        seed[
            sample_plan, `:=`(
                client_zone = i.client_zone,
                client_zone_id = i.client_zone_id
            ),
            on = "bg_geoid"
        ]
    } else {
        # Assign client zone by lat/lon point
        client_zones = fetch_hts_table("client_zones", settings)
        points_sf = st_as_sf(
            households[, .(hh_id, home_lat, home_lon)],
            coords = c("home_lon", "home_lat"),
            crs = 4326
            )
        client_zones_sf = client_zones %>% sf::st_transform(4326)
        seed_czones = sf::st_join(points_sf, client_zones_sf, join = st_nearest_feature)

        # Join client zones to seed on block group using the crosswalk
        seed[
            as.data.table(seed_czones),
            `:=`(
                client_zone = i.client_zone,
                client_zone_id = i.client_zone_id
            ),
            on = "hh_id"
        ]
    }
    
} else {
    seed[, client_zone_id := puma_id]
}

15.3 Format Variables for PopulationSim

Next, variables are prepared and flattened variables in the “seed” dataset for use in PopulationSim. It identifies household-level, person-level, and additional info columns. If the study unit is “household,” it merges household data with aggregated person-level variables for each household; otherwise, it uses the seed data as-is. Finally, it sets the column order for the flattened dataset, ensuring household and person variables are organized appropriately.

# Flatten Household/Person Variables ===========================================

hh_vars = str_subset(names(seed), "^h_")
per_vars = str_subset(names(seed), "^p_")
info_cols = str_subset(names(seed), "^(?!h_|p_|hh_id|person_id).*$")

if (get("study_unit", settings) == "household") {
    # Grab first person in household for household level data
    seed_flat = merge(
        seed[, .SD[1], by = hh_id, .SDcols = c(info_cols, hh_vars)],
        seed[, lapply(.SD, sum), by = hh_id, .SDcols = per_vars],
        by = "hh_id"
    )

} else {
    seed_flat = seed
}

setcolorder(
  seed_flat,
  c(
    'hh_id', 'puma_id', "client_zone_id", #'initial_weight',
    setdiff(hh_vars, 'h_total'),
    'h_total',
    setdiff(per_vars, 'p_total'),
    'p_total'
  )
)

15.4 Prepare Seed Data for Day Groups

This code prepares seed data for day groups by reshaping household-day records so each household and day group combination becomes a separate row. It assigns day group labels based on travel days, removes records without valid day groups, and pivots the data to wide format with indicators for each day group. The code then merges this with the flattened seed table, creates compound household-day IDs, and calculates person counts per day group. It includes checks for expected day group counts and completeness, stopping execution if missing or invalid day groups are detected.

# If Multiple Day Groups, Reshape Seed so Each Hh Day is a Separate Row
# Convert day_groups into a `data.table`
day_groups_dt = get_day_groups(settings)
day_groups = unique(day_groups_dt$day_group)

# Create a Unique List of Days for Each Household in the Seed Table
hh_day = unique(days[hh_id %in% seed$hh_id & hh_day_complete == 1, .(hh_id, travel_dow)])
hh_day[day_groups_dt, day_group := i.day_group, on = "travel_dow"]

# Drop NA Days Not in the day_groups_dt
hh_day = hh_day[!is.na(day_group)]

# Reshape the Data into a Wide Format
hh_day[, h_dow := paste0("h_dow_", day_group)]
hh_day_wide = dcast(
    hh_day,
    hh_id + day_group ~ h_dow,
    value.var = "h_dow",
    fun.aggregate = function(x) as.integer(length(x) > 0)
)

# Join the Seed Table with the Household-Day Table ---------------------------
# Intentionally Using an Inner Join to Drop HH's with Only Non-Study Days
seed_day = merge(seed_flat, hh_day_wide, by = "hh_id")

# Create New Compound hh_id
seed_day[, hh_id_dow := paste0(hh_id, "_", day_group)]

# Create p_dow as h_dow * p_total
seed_day[,
    (paste0("p_dow_", day_groups)) := lapply(.SD, function(x) x * p_total),
    .SDcols = paste0("h_dow_", day_groups)
]

# Check that There are as Many Day Groups as Expected and no NAs
if (
    any(is.na(seed_day$day_group)) |
    (seed_day[, uniqueN(day_group)] != length(day_groups))
) {
    stop("Warning: NAs in day_group. possible bad join.")
}

15.5 Optional: Create Transit Targets

This code optionally calculates household-level transit trip targets if transit trips are specified in the settings. It identifies transit modes, aggregates transit trips by household and day group, adjusts for surveyable persons, reshapes the data to wide format per day group, and merges the results into the seed dataset. It ensures households with no transit trips are properly filled and applies a scaling factor for boardings when required. If transit trips are not specified, it simply passes through the existing seed data.

PSRC Specifics

PSRC did not use a transit target.

if ("h_transit_trips" %in% names(settings$targets)) {

    # Transit trips are effectively the average transit trips per days for each household.
    transit_regex = paste(settings$transit_modes, collapse = "|")
    transit_modes = value_labels[
        variable == "mode_type" &
        tolower(label) %like% transit_regex,
        ]

    if (nrow(transit_modes) == 0) {
        stop("No transit modes found in value_labels that match ", paste(settings$transit_modes, collapse = ", "))
    }

    # Extract the DOWs we're weighting for
    weight_dow = unlist(get("weight_dow_groups", settings))
    day_groups = get_day_groups(settings)
    trips[day_groups, day_group := i.day_group, on = .(travel_dow)]

    transit_by_day = trips[
        mode_type %in% transit_modes$value &
        hh_day_complete == 1 &
        travel_dow %in% weight_dow, .N,
        by = day_group
    ]

    message("Transit modes: ", paste(transit_modes$label, collapse = ", "))
    message("Transit trips by day group:")
    print(transit_by_day)

    # Aggregate transit trips to households 
    agg_fun = ifelse(settings$transit_target_type == "boardings", '.N', 'uniqueN(linked_trip_id)')
    h_transit_trips = trips[
        travel_dow %in% weight_dow &
        mode_type %in% transit_modes$value &
        hh_day_complete == 1,
            .(total_transit_trips = eval(parse(text = agg_fun))),
            by = .(hh_id, day_group)
        ]

    # Number of complete study days
    num_complete_dow = calc_complete_hhdays(households, days, settings)
    num_complete_dow = num_complete_dow[day_group != "nonstudy_day",]

    # Calculate average trips per day group
    h_transit_trips[
        num_complete_dow,
        n_complete_dow := i.n_complete_hh_days,
        on = .(hh_id, day_group)
    ]
    h_transit_trips[is.na(n_complete_dow), n_complete_dow := 0]
    h_transit_trips[, h_transit_trips := total_transit_trips / n_complete_dow]

    # Get fraction of unsurveyable persons per household
    # We do this to counteract the unrelated person adjustment in the trip weights
    if (settings$unrelated_adjustment %in% c("person", "day")) {
        h_transit_trips[households_raw, f_surveyable := i.num_people / i.num_surveyable, on = "hh_id"]
        h_transit_trips[, h_transit_trips := h_transit_trips * f_surveyable]
    }

    # Reshape to wide on day_group
    h_transit_trips[, label := paste0("h_transit_trips_", day_group)]
    h_transit_trips_daygroup = dcast(
        h_transit_trips,
        hh_id + day_group ~ label,
        value.var = "h_transit_trips",
        fill = 0
    )

    # Assign to household seed
    seed_final = merge(
        seed_day,
        h_transit_trips_daygroup,
        by = c("hh_id", "day_group"),
        all.x = TRUE
    )

    # fill in households with 0 transit trips
    transit_cols = grep("^h_transit_trips_", names(seed_final), value = TRUE)
    seed_final[, (transit_cols) := lapply(.SD, function(x) {
            ifelse(!is.finite(x), 0, x)
        }),
        .SDcols = transit_cols
    ]

    # Stop if all 0 transit trips, bad join
    stopifnot(
        seed_final[,
            lapply(.SD, function(x) all(x == 0)), .SDcols = transit_cols
        ][, Reduce(`&`, .SD)] == FALSE
    )

    # Scale boardings per trip rate for rMove -----------------------------------
    # ONLY TARGET BOARDINGS IF ALL TRIPS ARE SUFFICIENTLY UNLINKED
    # Otherwise you should "link" the "unlinked" trips first and make linked trips the target.
    if (get("transit_target_type", settings) == "boardings") {
        bpt = get("transit_boardings_per_trip", settings)
        seed_final[
            diary_platform != 'rmove',
            (transit_cols) := lapply(.SD, function(x) {x * bpt}),
            .SDcols = transit_cols
        ]
    }

} else {
    seed_final = seed_day
}

15.6 Optional: Aggregate Target Categories for Weighting

Sometimes certain targets and target categories can have small sample sizes in the survey data, which can impact the results of weighting. One option is to aggregate target categories for certain weighting zone groups or the entire region. This step will optionally aggregate target categories as specified in the project settings file.

# Append the Zone Group to the Matching Weighting Zone ID in the Seed Table
seed_adjusted <- seed_final[zone_groups_dt, zone_group := i.zone_group_label, on = "puma_id"]
zone_groups_dt[, zone_group := zone_group_label]

# Update the Targets as Specified in the Project Settings Configurations
seed_adjusted <- update_targets(seed_adjusted, settings, geo_cross_walk = as.data.table(zone_groups_dt), geom = "zone_group")

15.7 Perform Seed Checks and Save Data

This code validates the final seed dataset by checking group totals and ensuring column names match the survey targets. It reorders columns for consistency, sorts the data by household ID, and saves the final seed data to disk as an RDS file.

# Check that Totals Add Up Properly --------------------------------------------
check_group_sums(seed_final, settings)

# Check that Column Names Match Survey Data -------------------------------
seed_names = str_subset(names(seed_final), "^(h|p)_")
target_names = str_subset(unique(c(names(target_czones), names(target_region))), "^(h|p)_")

not_in_targets = setdiff(seed_names, target_names)
not_in_seed = setdiff(target_names, seed_names)

stopifnot(length(c(not_in_targets, not_in_seed)) == 0)

seed_names = seed_names[seed_names %in% names(target_czones)]
setcolorder(target_puma, seed_names)
setcolorder(target_czones, seed_names)
setcolorder(target_region, seed_names)


# FINAL FULL OUTPUT FILES ------------------------------------------------------
seed_final = seed_final[order(hh_id)]

saveRDS(seed_final, get("seed_path", settings))

15.8 Review Household- and Person-Level Targets by Weighting Zone Group: Unweighted Survey Counts

15.8.1 Table: Household-Level Target Counts for Unweighted Survey Data

Target Variable	Target Category	King County - Seattle	King County - Other	Kitsap County - Expanded	Pierce County	Snohomish County	Total
Household size	1	340	155	98	302	108	1,003
	2	205	209	200	441	144	1,199
	3+	74	154	89	307	84	708
Household income	$0-$24,999	58	33	20	85	21	217
	$25,000-$49,999	77	51	51	180	46	405
	$50,000-$74,999	64	62	55	165	38	384
	$75,000-$99,999	78	54	60	148	53	393
	$100,000-$199,999	211	178	153	366	135	1,043
	$200,000+	131	140	48	106	43	468
Number of workers	0	155	173	197	438	127	1,090
	1	298	183	122	372	121	1,096
	2+	166	162	68	240	88	724
Vehicle sufficiency	None	158	32	6	24	11	231
	Insufficient	150	103	52	161	52	518
	Sufficient	311	383	329	865	273	2,161
Number of children	0	556	408	331	831	279	2,405
Number of children	1+	63	110	56	219	57	505
Total: Households		619	518	387	1,050	336	2,910
Total-Household DOW	Avg weekday	619	518	387	1,050	336	2,910

15.8.2 Table: Person-Level Target Counts for Unweighted Survey Data

Target Variable	Target Category	King County - Seattle	King County - Other	Kitsap County - Expanded	Pierce County	Snohomish County	Total
Gender	Male	515	568	392	1,153	347	2,975
Gender	Female	497	567	432	1,216	350	3,062
Age	0-4	32	43	25	85	28	213
	5-15	56	114	61	269	57	557
	16-17	6	22	12	52	9	101
	18-24	53	66	28	106	27	280
	25-44	489	346	172	582	211	1,800
	45-64	219	288	195	615	181	1,498
	65+	157	256	331	660	184	1,588
Employment	Non worker	319	549	507	1,379	347	3,101
	Part-time	76	80	62	144	54	416
	Full-time	617	506	255	846	296	2,520
Commute mode	Bike/transit/walk		68	18	45	38	169
	Work from home	255	194	89	254	117	909
	Transit	128					128
	Walk	61					61
	Bike	35					35
	Other (includes auto)	206	310	203	660	191	1,570
	None	327	563	514	1,410	351	3,165
University student status	No	944	1,077	783	2,275	667	5,746
University student status	Yes	68	58	41	94	30	291
Educational attainment	No college	167	323	243	1,033	211	1,977
Educational attainment	Some college	845	812	581	1,336	486	4,060
Race	Asian or Pacific Islander	151	248	37	126	92	654
	Black or African American	43	25	13	93	10	184
	Other	98	119	62	285	65	629
	White	720	743	712	1,865	530	4,570
Ethnicity	Not Hispanic	940	1,069	790	2,152	651	5,602
Ethnicity	Hispanic	72	66	34	217	46	435
Total: Persons		1,012	1,135	824	2,369	697	6,037
Total-Person DOW	Avg weekday	1,012	1,135	824	2,369	697	6,037