%%{init: {"theme":"base","flowchart":{"htmlLabels":true,"curve":"basis","nodeSpacing":36,"rankSpacing":52},"themeVariables":{"fontFamily":"Bai Jamjuree, Arial, sans-serif"}}}%%
flowchart TB
subgraph ALL["<b>All trips</b>"]
direction LR
subgraph GPS["<b>GPS-Tracked Trips</b>"]
direction TB
TRACE("<span style='font-size:1.02em; font-weight:700;'>Cleaned GPS trace</span><br/><span style='font-size:0.88em; font-weight:400;'>(cleaned location points)</span>"):::gpsNode
SUMHAV("<span style='font-size:1.02em; font-weight:700;'>Summed trace distance</span><br/><span style='font-size:0.88em; font-weight:400;'>point-to-point haversine distances</span>"):::gpsNode
TRACE --> SUMHAV
end
BEELINE("<span style='font-size:1.02em; font-weight:700;'>distance_beeline_m</span><br/><span style='font-size:0.88em; font-weight:400;'>= haversine(origin,<br/>destination)</span>"):::beelineNode
subgraph NONGPS["<b>Non-GPS Trips</b>"]
direction TB
OD("<span style='font-size:1.02em; font-weight:700;'>Origin + Destination</span><br/><span style='font-size:0.88em; font-weight:400;'>coordinates only</span>"):::manualNode
OSRM("<span style='font-size:1.02em; font-weight:700;'>OSRM shortest path</span><br/><span style='font-size:0.88em; font-weight:400;'>network route</span>"):::manualNode
OD --> OSRM
end
end
DM("<span style='font-size:1.05em; font-weight:700;'>distance_m</span><br/><span style='font-size:0.89em; font-weight:400;'>path-distance estimate</span>"):::outputNode
MILES("<span style='font-size:1.03em; font-weight:700;'>distance_miles</span><br/><span style='font-size:0.89em; font-weight:400;'>= distance_m / 1,609.34</span>"):::outputNode
SPEED("<span style='font-size:1.03em; font-weight:700;'>speed_mph</span><br/><span style='font-size:0.89em; font-weight:400;'>= distance_miles /<br/>(duration_s / 3,600)</span>"):::outputNode
SUMHAV --> DM
OSRM --> DM
DM --> MILES
MILES --> SPEED
classDef gpsNode fill:#FFF7E8,stroke:#F4A300,color:#232323,font-family:Bai Jamjuree,stroke-width:2.6px,font-weight:bold,fill-opacity:0.98
classDef manualNode fill:#FFF0EB,stroke:#E94B2E,color:#232323,font-family:Bai Jamjuree,stroke-width:2.6px,font-weight:bold,fill-opacity:0.98
classDef beelineNode fill:#F6F6F6,stroke:#9CA3AF,color:#232323,font-family:Bai Jamjuree,stroke-width:2.4px,font-weight:bold,fill-opacity:0.98
classDef outputNode fill:#F7F7F7,stroke:#9CA3AF,color:#232323,font-family:Bai Jamjuree,stroke-width:2.4px,font-weight:bold,fill-opacity:0.98
style ALL fill:#FCFCFB,stroke:#A8A29E,color:#232323,fill-opacity:0.42,stroke-width:2px
style GPS fill:#FFF7E8,stroke:#F4A300,color:#232323,fill-opacity:0.24,stroke-width:2.2px
style NONGPS fill:#FFF0EB,stroke:#E94B2E,color:#232323,fill-opacity:0.24,stroke-width:2.2px
linkStyle 0 stroke:#F4A300,stroke-width:4px,stroke-linecap:round
linkStyle 1 stroke:#E94B2E,stroke-width:4px,stroke-linecap:round
linkStyle 2 stroke:#F4A300,stroke-width:4px,stroke-linecap:round
linkStyle 3 stroke:#E94B2E,stroke-width:4px,stroke-linecap:round
linkStyle 4 stroke:#6B7280,stroke-width:4px,stroke-linecap:round
linkStyle 5 stroke:#6B7280,stroke-width:4px,stroke-linecap:round
Massachusetts Travel Study Data User Guide
A reusable reference for study design, survey documentation, delivered data structure, weighting, codebook metadata, and analyst workflows.
Choose one of the three buckets below for a guided path through the guide.
- Design: Start with the study overview, sample design, and survey instrument to understand how the data were collected.
- Data: Review processing, weighting, dataset overview, and codebook details to see how raw records become analysis-ready files.
- Analysis: Move into setup, joins, analytic units, variables, weighting, and examples for working with the data.
1 Overview
The Massachusetts Department of Transportation (MassDOT) contracted with RSG to conduct the 2024-2025 Massachusetts Travel Study (MTS), a statewide survey designed to collect demographically and geographically representative travel behavior data from 15,140 households across the Commonwealth. The survey exceeded this target and collected data from 18,122 households. Since the last Massachusetts Household Travel Survey was conducted in 2011, the transportation landscape, inclusive of infrastructure, services, and travel behaviors, has changed substantially. MassDOT conducted the MTS to gain a better understanding of changing travel patterns and mode choice, as well as to inform planning efforts and tools, including the statewide travel demand model.
1.1 Study Geography
The study sampled 0.6% of Massachusetts households and was geographically and demographically representative of the Commonwealth’s population.
To support adequate sample across the state, the survey team used Massachusetts’ 13 sample geographies for stratification. These geographies align with the Commonwealth’s MPO and regional planning areas and are summarized in Table 1.
| Sample Geography | MPO / Regional Planning Body | Regional Description | Sample-Frame Households |
|---|---|---|---|
| Berkshire | Berkshire Regional Planning Commission | Berkshire County and surrounding western Massachusetts communities. | 56,078 |
| Boston Region | Boston Region MPO / MAPC | Greater Boston core and inner suburban communities. | 1,315,052 |
| Cape Cod | Cape Cod Commission | Barnstable County and the Cape Cod region. | 99,969 |
| Central Massachusetts | Central Massachusetts Regional Planning Commission | Worcester-area communities in central Massachusetts. | 229,416 |
| Franklin | Franklin Regional Council of Governments | Franklin County and nearby western Massachusetts communities. | 31,234 |
| Martha's Vineyard | Martha's Vineyard Commission | The island region of Martha's Vineyard. | 6,899 |
| Merrimack Valley | Merrimack Valley Planning Commission | Northeastern Massachusetts communities centered on the Merrimack Valley. | 137,029 |
| Montachusett | Montachusett Regional Planning Commission | North-central Massachusetts communities in the Montachusett region. | 98,602 |
| Nantucket | Nantucket Planning and Economic Development Commission | The island region of Nantucket. | 4,659 |
| Northern Middlesex | Northern Middlesex Council of Governments | Lowell-area and nearby communities in northern Middlesex County. | 113,727 |
| Old Colony | Old Colony Planning Council | South Shore and Plymouth County communities in the Old Colony region. | 142,628 |
| Pioneer Valley | Pioneer Valley Planning Commission | Connecticut River Valley communities in western Massachusetts. | 244,794 |
| Southeastern Massachusetts | Southeastern Regional Planning and Economic Development District | South Coast and southeastern Massachusetts communities. | 260,908 |
| Household totals come from the MassDOT sampling plan and reflect the ABS sample frame, not achieved completes. | | | |
These geographies are shown in Figure 1 below.
Section 2 provides additional details about the sample design.
1.2 Study Timeline
Data collection for the 2024-2025 Massachusetts Travel Study started in May 2024 and continued through June 2025 over three fielding periods, detailed in Table 2.
| SURVEY TASK | TIMELINE |
|---|---|
| Survey Design (Sample planning, survey website programming, invitation development) | January 2024 - April 2024 |
| Data Collection - Spring 2024 (Sending invitations, data monitoring, and adjustments) | May 2024 - July 2024 |
| Data Collection - Fall 2024 (Sending invitations, data monitoring, and adjustments) | September 2024 - November 2024 |
| Data Collection - Winter/Spring 2025 (Sending invitations, data monitoring, and adjustments) | January 2025 - June 2025 |
| Data Preparation (Data cleaning and weighting, finalizing dashboard, final reporting) | June 2025 - June 2026 |
Table 3 displays the number of completed households by year and month of first travel date. Travel data collection was limited in the summer, when school was out of session, and in the winter during peak holiday travel periods.
Completed Households by Month and Year of First Travel Date
| Month | 2024 | 2025 |
|---|---|---|
| January | 0 | 1,390 |
| February | 0 | 276 |
| March | 0 | 473 |
| April | 0 | 1,980 |
| May | 1,246 | 522 |
| June | 2,461 | 0 |
| July | 0 | 0 |
| August | 0 | 0 |
| September | 2,719 | 0 |
| October | 4,188 | 0 |
| November | 386 | 0 |
| December | 0 | 0 |
| Incomplete households are excluded from this summary. | | |
1.3 Data Collection
Survey data were collected through a mixed-mode design that combined:
- Smartphone-based travel diary (rMove®): Participants recorded travel via smartphone app in real time for up to seven consecutive days.
- Web-based travel diary (rMove for Web): Participants reported travel via a web survey on one assigned weekday.
- Call center interviews: Participants reported travel to a call center interviewer for one assigned weekday, and their responses were recorded in the web-based travel diary (rMove for Web).
Each household first completed a recruit survey describing household composition, demographics, and vehicles, followed by a travel diary describing all trips made during the assigned day(s).
Table 4 shows a count of completed households by survey mode.
Completed Households by Survey Mode
| Survey Mode | Completed Households | Percentage |
|---|---|---|
| Web-based Diary (rMove for Web) | 8,660 | 55.4% |
| Smartphone App (rMove app) | 6,410 | 41.0% |
| Call Center Interview (rMove for Web) | 571 | 3.7% |
| Total | 15,641 | 100.0% |
Section 3 provides further details about the survey instrument and question content.
2 Sample Design
The Massachusetts Travel Study used a probability-based, geographically stratified sample of households across Massachusetts.
The primary sampling method was probability address-based sampling (ABS). Massachusetts was stratified by key demographic features at the census block group level, and households were selected at random within each resulting segment and invited to the study by mail.
2.1 Sampling Framework
Sampling Frame
The survey used a United States Postal Service (USPS) address-based sampling (ABS) frame that includes all Massachusetts residential addresses, excluding group quarters such as dormitories, prisons, or assisted living facilities. Each sampled address represented a single household eligible for recruitment. The ABS frame provided complete statewide coverage and supported stratification by geography, land-use density, and socioeconomic characteristics.
Primary Sampling Unit
The primary sampling unit was the household, selected through random sampling from the ABS frame. All household members reported person-level and trip-level details, but only one member (the “primary respondent”) completed the recruit survey on behalf of the household.
Though the primary sampling unit was the household, the data collected also represent the behavior of individual persons. For participants who reported through the smartphone app, data were collected across multiple days, capturing travel and daily activity over the full reporting period.
Surveyable Population
Not all household members were surveyable. Only persons related to Person 1 (the primary respondent) were considered surveyable. Non-surveyable members (e.g., guests, visitors, or unrelated roommates) did not have trip or day completion requirements and were excluded from household-level completeness determinations.
Non-surveyable members can be identified in the person table using the surveyable variable. These members do not contribute to the household’s completeness status. They count towards the total number of household members but are not weighted (see Section 5.4.2.2).
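As a minimal sketch of how an analyst might separate surveyable from non-surveyable members, the example below uses a few made-up person records. The `surveyable` variable is named in this guide, but its 0/1 coding, the `hh_id` and `person_id` column names, and the records themselves are assumptions for illustration; check the codebook for the delivered values.

```python
# Hypothetical person records. The 0/1 coding of `surveyable` is an assumption.
persons = [
    {"hh_id": 1, "person_id": 101, "surveyable": 1},
    {"hh_id": 1, "person_id": 102, "surveyable": 1},
    {"hh_id": 1, "person_id": 103, "surveyable": 0},  # e.g., an unrelated roommate
]


def household_size(rows, hh_id):
    """Count every household member, surveyable or not."""
    return sum(1 for r in rows if r["hh_id"] == hh_id)


def surveyable_persons(rows, hh_id):
    """Keep only members who count toward completeness and receive weights."""
    return [r for r in rows if r["hh_id"] == hh_id and r["surveyable"] == 1]


print(household_size(persons, 1))           # all members
print(len(surveyable_persons(persons, 1)))  # surveyable members only
```

Keeping the two counts separate mirrors the rule above: non-surveyable members still count toward household size but are left out of completeness checks and weighting.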
2.2 Target Completions
The study’s goal was 15,140 completed household surveys statewide, distributed across MPO areas. The study achieved 18,122 completed households, exceeding the statewide target while maintaining proportionality across MPO geographies.
Sampling by Season and Wave
Sampling was conducted in three main fielding periods: Spring 2024, Fall 2024, and Winter/Spring 2025. Each wave included households from across the study geography, with adjustments to invitation pacing and oversampling emphasis based on observed response patterns and representativeness in earlier periods. This adaptive approach helped the final sample maintain geographic balance and demographic coverage across the state.
2.3 Stratification and Oversampling
The sample design combined statewide address-based sampling with geographic stratification and targeted oversampling. Within each MPO geography, sampled block groups were assigned to one of four mutually exclusive strata so the study could improve representation of groups that are typically harder to recruit while also increasing sample for key policy questions.
Sample strata
- General population. Block groups that did not qualify for any of the targeted oversampling strata.
- Rural population. Block groups that did not qualify for other oversampling strata and had fewer than 150 people per square kilometer.
- Hard-to-reach oversample. Block groups with at least 30% of households earning less than $25,000 per year, at least 60% of households identified as Hispanic and/or BIPOC, or at least 15% of households speaking limited English.
- Walk/Bike/Transit oversample. Block groups within the Boston Region MPO with transit access density classified as CBD or Dense Urban.
The sample plan identified 270 block groups that qualified for both the hard-to-reach and walk/bike/transit strata. Those block groups were assigned to the hard-to-reach stratum, which the plan anticipated would have the lower response rate. That overlap rule made the final strata mutually exclusive for sample management and weighting. Table 5 summarizes the resulting block groups, households, and adults in each sample stratum.
| Sample Stratum | Number of BGs | Total Households | Total Adults | Adults per Household |
|---|---|---|---|---|
| Walk/Bike/Transit | 327 | 183,508 | 354,282 | 1.9 |
| Hard-to-reach | 1,330 | 670,542 | 1,337,712 | 2.0 |
| Rural | 475 | 250,685 | 517,967 | 2.1 |
| General | 2,923 | 1,636,260 | 3,380,998 | 2.1 |
| Total | 5,055 | 2,740,995 | 5,590,959 | 2.0 |
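The stratum definitions and the overlap rule described above can be read as a simple rule-based assignment. The sketch below is illustrative only: the thresholds come from the stratum definitions in this section, while the field names (`share_income_lt_25k`, `share_hispanic_bipoc`, `share_limited_english`, `transit_density`, `people_per_sq_km`, `mpo`) are hypothetical and do not correspond to delivered variables.

```python
def assign_stratum(bg):
    """Return one of four mutually exclusive strata for a block-group record."""
    hard_to_reach = (
        bg["share_income_lt_25k"] >= 0.30
        or bg["share_hispanic_bipoc"] >= 0.60
        or bg["share_limited_english"] >= 0.15
    )
    walk_bike_transit = (
        bg["mpo"] == "Boston Region"
        and bg["transit_density"] in {"CBD", "Dense Urban"}
    )
    # Overlap rule: block groups qualifying for both go to hard-to-reach.
    if hard_to_reach:
        return "Hard-to-reach"
    if walk_bike_transit:
        return "Walk/Bike/Transit"
    # Rural applies only when no oversampling stratum was met.
    if bg["people_per_sq_km"] < 150:
        return "Rural"
    return "General"


example = {
    "mpo": "Boston Region",
    "transit_density": "CBD",
    "share_income_lt_25k": 0.35,
    "share_hispanic_bipoc": 0.20,
    "share_limited_english": 0.05,
    "people_per_sq_km": 9000,
}
print(assign_stratum(example))  # qualifies for both oversamples
```

Checking the hard-to-reach condition first is what makes the strata mutually exclusive in this sketch.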
The hard-to-reach oversample increased representation for lower-income, BIPOC, and limited-English block groups, while the walk/bike/transit oversample increased representation for dense Boston-area block groups where multimodal travel was expected to be more common. Together, these design choices improved analytic coverage without changing the fact that the final weighted dataset represents the statewide household population. Table 6 summarizes the reference households, invitations sent, and invitation rates used across geographies and sample strata.
| Geography | Sample Stratum | Invitations Sent | Reference Households | Invitation Rate |
|---|---|---|---|---|
| Berkshire | General | 3,755 | 22,471 | 16.7% |
| Berkshire | Hard-to-reach | 4,826 | 10,343 | 46.7% |
| Berkshire | Rural | 6,793 | 23,264 | 29.2% |
| Boston Region | General | 146,945 | 754,832 | 19.5% |
| Boston Region | Hard-to-reach | 171,751 | 355,304 | 48.3% |
| Boston Region | Rural | 5,870 | 21,408 | 27.4% |
| Boston Region | Walk/Bike/Transit | 28,061 | 183,508 | 15.3% |
| Cape Cod | General | 31,328 | 79,927 | 39.2% |
| Cape Cod | Hard-to-reach | 4,927 | 7,617 | 64.7% |
| Cape Cod | Rural | 4,499 | 12,425 | 36.2% |
| Central Massachusetts | General | 26,344 | 133,520 | 19.7% |
| Central Massachusetts | Hard-to-reach | 35,015 | 57,589 | 60.8% |
| Central Massachusetts | Rural | 11,583 | 38,307 | 30.2% |
| Franklin | General | 1,477 | 10,124 | 14.6% |
| Franklin | Hard-to-reach | 1,230 | 4,258 | 28.9% |
| Franklin | Rural | 4,395 | 16,852 | 26.1% |
| Martha's Vineyard | General | 3,559 | 3,128 | 113.8% |
| Martha's Vineyard | Rural | 3,678 | 3,771 | 97.5% |
| Merrimack Valley | General | 22,319 | 86,361 | 25.8% |
| Merrimack Valley | Hard-to-reach | 40,407 | 42,244 | 95.7% |
| Merrimack Valley | Rural | 4,504 | 8,424 | 53.5% |
| Montachusett | General | 14,498 | 53,897 | 26.9% |
| Montachusett | Hard-to-reach | 7,321 | 12,125 | 60.4% |
| Montachusett | Rural | 11,911 | 32,580 | 36.6% |
| Nantucket | General | 5,474 | 3,032 | 180.5% |
| Nantucket | Rural | 2,856 | 1,627 | 175.5% |
| Northern Middlesex | General | 22,001 | 86,555 | 25.4% |
| Northern Middlesex | Hard-to-reach | 22,886 | 23,834 | 96.0% |
| Northern Middlesex | Rural | 1,356 | 3,338 | 40.6% |
| Old Colony | General | 37,417 | 103,891 | 36.0% |
| Old Colony | Hard-to-reach | 28,394 | 31,685 | 89.6% |
| Old Colony | Rural | 3,182 | 7,052 | 45.1% |
| Pioneer Valley | General | 22,925 | 129,544 | 17.7% |
| Pioneer Valley | Hard-to-reach | 48,752 | 71,373 | 68.3% |
| Pioneer Valley | Rural | 10,740 | 43,877 | 24.5% |
| Southeastern Massachusetts | General | 50,296 | 168,978 | 29.8% |
| Southeastern Massachusetts | Hard-to-reach | 36,651 | 54,170 | 67.7% |
| Southeastern Massachusetts | Rural | 13,365 | 37,760 | 35.4% |
Across all waves, the study sent 903,291 invitations. Relative to the reference household counts in each sample segment, the statewide invitation rate was 23.7% in the general stratum, 60.0% in the hard-to-reach stratum, 33.8% in the rural stratum, and 15.3% in the walk/bike/transit stratum. These realized invitation rates show that the field effort emphasized hard-to-reach households statewide while also maintaining targeted coverage of rural areas and dense multimodal areas in the Boston Region.
In a small number of segments, the cumulative invitation rate exceeded 100%. This reflects repeated fielding across waves relative to the segment’s reference household count and should be interpreted as the total invitation effort rather than as unique-household coverage of the sampling frame.
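To make the invitation-rate arithmetic concrete, the short example below recomputes the rate for three rows copied from Table 6 as invitations sent divided by reference households. The function name and the wording of the printed note are illustrative choices, not part of the study's tooling.

```python
# (geography, stratum, invitations sent, reference households) from Table 6
segments = [
    ("Berkshire", "General", 3755, 22471),
    ("Nantucket", "General", 5474, 3032),
    ("Boston Region", "Walk/Bike/Transit", 28061, 183508),
]


def invitation_rate(invitations, reference_households):
    """Cumulative invitations relative to the segment's reference households."""
    return invitations / reference_households


for geography, stratum, sent, reference in segments:
    rate = invitation_rate(sent, reference)
    # A rate above 1.0 reflects repeated fielding, not unique-household coverage.
    note = " (repeated fielding across waves)" if rate > 1 else ""
    print(f"{geography} / {stratum}: {rate:.1%}{note}")
```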
Recruitment Channels
Households were recruited primarily through mailed invitation letters that directed sampled addresses to the study website and provided their survey access information. Follow-up reminder postcards were sent to nonresponding households, and the project team also used website support, email, and phone follow-up where appropriate to help households complete the study.
Participation Modes
Eligible households could participate via:
- rMove smartphone app (seven-day diary),
- rMove for Web (one-day diary), or
- Call-center interview (one-day diary).
Mode assignment depended on household technology access and preference; all modes followed identical survey logic and data validation.
Incentives
The study used different incentive amounts by participation mode and sample stratum. rMove incentives were paid per adult; web and call center incentives were paid per household.
- General population — rMove: $25 per adult.
- General population — web or call center: $15 per household.
- Hard-to-reach population — rMove: $35 per adult.
- Hard-to-reach population — web or call center: $25 per household.
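The per-adult versus per-household distinction can be illustrated with a small lookup. This is a sketch under stated assumptions: the dollar amounts are the four listed above, the mode labels `rmove` and `web_or_call` are hypothetical, and `n_adults` is assumed to be the number of adults who qualified for payment.

```python
# Amounts from the incentive list above; keys and labels are illustrative.
INCENTIVES = {
    ("General", "rmove"): (25, "per_adult"),
    ("General", "web_or_call"): (15, "per_household"),
    ("Hard-to-reach", "rmove"): (35, "per_adult"),
    ("Hard-to-reach", "web_or_call"): (25, "per_household"),
}


def household_incentive(population, mode, n_adults):
    """Total incentive in dollars for one household."""
    amount, basis = INCENTIVES[(population, mode)]
    return amount * n_adults if basis == "per_adult" else amount


print(household_incentive("Hard-to-reach", "rmove", 2))
print(household_incentive("General", "web_or_call", 3))
```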
Monitoring and Response Tracking
RSG maintained a real-time survey monitoring dashboard accessible to the Massachusetts Travel Study team throughout data collection.
The dashboard provided:
- Response rates by segment and demographic subgroup,
- Comparison to American Community Survey (ACS) benchmarks, and
- Progress toward study targets.
This tool supported geographic balance and demographic representativeness through adaptive field management.
Differences in observed response rates across MPOs or demographic strata reflect design priorities, not data quality. The weighting process is designed to correct for these differences, so analysts should rely on weighted data for representativeness.
2.4 Representativeness and Nonresponse
Post-survey comparisons to ACS and model control totals indicated that the achieved sample closely reflected the statewide household population by:
- Income group,
- Household size,
- Vehicle availability, and
- Land-use context.
Residual differences were addressed through weighting adjustments (see Section 5).
3 Survey Instrument
The Massachusetts Travel Study collected detailed information about households, people, vehicles, and daily travel through a unified instrument designed for use across multiple reporting platforms. Each mode implemented the same core survey logic, ensuring results are directly comparable across participants and modes.
3.1 Survey Modes and Language Support
The survey instrument was administered through three participation modes:
- Smartphone-based travel diary (rMove): Participants recorded travel via smartphone app in real time for up to seven consecutive days.
- Web-based travel diary (rMove for Web): Participants reported travel via a web survey on one assigned weekday.
- Call center interviews: Participants reported travel to a call center interviewer for one assigned weekday, and their responses were recorded in the web-based travel diary (rMove for Web).
The survey instrument was available in English and Spanish. The call center also supported participation in Portuguese, Chinese, Haitian Creole, Vietnamese, and Russian.
3.2 Recruit Survey
The recruit survey established household eligibility and collected baseline information to assign travel days and tailor diary prompts. Key modules included:
- Household composition and member roster
- Demographics (age, gender, race/ethnicity, income, employment, student status)
- Housing characteristics (type, tenure, vehicles available)
- Technology access and preferred reporting mode
The recruit survey was completed by the primary respondent (Person 1) on behalf of all household members. The primary respondent was also responsible for ensuring that all eligible household members completed their assigned travel diary.
3.3 Travel Diary
The travel diary collected information about travel made on the assigned reporting day or days. In the smartphone app diary, participants reviewed passively collected travel and completed prompted trip surveys. In the rMove for Web and call center diaries, respondents reported travel directly through the prompted diary instrument.
Across modes, the travel diary collected:
- Day-begin and day-end location confirmation
- Trip destinations, purposes, travel modes, and timing
- Access, transfer, and egress details for transit trips
- Companion and escort activity (where applicable)
- Reasons for no travel on the assigned day (where applicable)
Adults could report their own travel, while proxy reporting was used for children (under 18 years) and other eligible household members.
Only participants who used the rMove smartphone app recorded travel for multiple days, including weekends. The standard weights represent Monday, Tuesday, Wednesday, and Thursday travel. For analyses that compare travel across Monday through Sunday, use the alternate day-of-week weights described in the weighting chapter and analyst handbook.
3.4 Daily Surveys
The daily survey collected additional context about each reporting day, including:
- Deliveries and pickups (e-commerce activity)
- Telecommuting activity
- Attitudinal questions
- School attendance and activities
Some questions were repeated across all travel days; others (e.g., attitudinal questions) were asked only once. Household-level questions were asked of the primary respondent, while person-level questions were directed to each individual respondent.
When respondents completed the travel diary via browser or call center, “daily” questions were consolidated into a single survey following the travel diary.
3.5 Travel Date Assignment
Households were assigned one of the study’s weighted travel weekdays (Monday, Tuesday, Wednesday, and Thursday) during the study period. Households participating via rMove were assigned a seven-day reporting period beginning with the assigned start date. Households participating via web or call center reported travel for one assigned day and completed the survey after that travel date.
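The assignment rule above can be sketched as a small helper that checks the start date against the Monday-through-Thursday set and expands it into the reporting period. This reflects one reading of the text, and the function name and mode labels (`rmove`, `web`, `call_center`) are hypothetical.

```python
from datetime import date, timedelta

# Python's date.weekday(): Monday is 0, so Monday-Thursday is 0-3.
WEIGHTED_WEEKDAYS = {0, 1, 2, 3}


def reporting_dates(start, mode):
    """Return the list of assigned travel dates for a household."""
    if start.weekday() not in WEIGHTED_WEEKDAYS:
        raise ValueError("assigned travel dates fall on Monday through Thursday")
    n_days = 7 if mode == "rmove" else 1  # web and call center report one day
    return [start + timedelta(days=i) for i in range(n_days)]


start = date(2024, 10, 1)  # a Tuesday
print([d.isoformat() for d in reporting_dates(start, "rmove")])
print([d.isoformat() for d in reporting_dates(start, "web")])
```

For rMove households the seven-day window necessarily includes Friday through Sunday, which is why weekend travel appears only in smartphone records.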
3.6 Questionnaire
The survey instrument covered the standard household travel survey modules needed for household, person, vehicle, travel day, trip, location, and tour delivery tables. It also included modules that were especially important for MassDOT’s analysis needs, including deliveries, telecommuting, school and work travel context, and household roster detail.
Question wording and skip logic were aligned across smartphone, web, and call center participation so that the delivered analysis variables remain comparable across modes. Table 7 summarizes the major topic areas covered by the instrument.
| Topic Area | What the Instrument Collected |
|---|---|
| Household | Household composition, vehicles, income, home context, and respondent assignment |
| Person | Demographics, employment, student status, technology access, and proxy reporting |
| Travel day | Assigned travel date, begin/end-of-day location, no-travel confirmation, and daily context |
| Trip | Destinations, purposes, modes, timing, transfers, and related trip details |
| School and work context | Commuting, telework, school attendance, and related routine travel context |
| Special topics | Deliveries and pickups, attitudinal questions, and study-specific follow-up items |
4 Data Processing
4.1 Overview
This section describes the procedures used to transform raw household travel survey data – collected from participants’ smartphones and survey responses – into clean, analysis-ready tables. The process was designed to preserve the integrity of participants’ reported travel while correcting errors, filling gaps, and enriching the data with geographic and analytical variables that support modeling and planning applications.
Data processing occurred in four phases:
- Automated Processing – Raw survey records were copied into a structured working environment, trips were routed on the road network, and an automated classifier flagged trips that required human attention.
- Analyst Review – Trained data analysts reviewed flagged trips using a web-based interface, correcting errors in trip start and end points, splitting trips that contained unreported stops, joining trip fragments that were incorrectly separated, and removing invalid trips.
- Post-Review Processing – After analyst review, the data underwent a second round of automated processing that cleaned remaining issues, assigned geographic identifiers, imputed trip purposes where necessary, and performed a second pass of transit trip unlinking.
- TICTOC Processing – The processed unlinked trip data underwent additional treatment through TICTOC (Trip Imputation, Coordination, and Tour Organization Compiler), which prepared household travel survey data for travel forecasting by imputing selected missing trips, coordinating joint household travel, organizing unlinked trips into linked trips and tours, and adding model-facing attributes to the household, person, day, trip, linked trip, and tour outputs.
Each phase is described in detail in the sections that follow.
4.2 Automated Processing
When a participant completed a travel day, the smartphone application transmitted a set of raw records to our survey platform. These records included household and person characteristics from the recruitment survey, GPS traces from the phone’s location sensors, and the participant’s own descriptions of their trips – where they went, how they traveled, and why.
Automated processing began by copying these raw records into a working environment where they could be modified without affecting the original data. Several operations were then performed in sequence.
Data Completion and Household Disposition
The system evaluated whether each participant had provided sufficient data for their assigned travel days. Households that had completed all assigned travel periods were marked as complete. Households with insufficient data – such as those that uninstalled the app before their travel period ended – were flagged and could be excluded from further processing depending on the study’s sample requirements.
Trip Classification
After routing, an automated classifier examined each trip to determine whether it required analyst review. The classifier evaluated a set of rules based on trip characteristics – for example, whether the trip had an unusually high speed for its reported mode, whether it appeared to duplicate another trip, or whether its start and end times overlapped with other trips by the same person.
Trips that passed all checks were considered clean and did not require review. Trips that failed one or more checks were flagged for analyst attention and assigned to the review queue.
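A rule-based classifier of this kind might look like the sketch below. It is not the production classifier: the speed thresholds, field names, and check names are hypothetical, only two of the example rules (implausible speed and overlapping times) are shown, and times are expressed as seconds for simplicity.

```python
# Hypothetical speed ceilings by reported mode, in miles per hour.
MAX_SPEED_MPH = {"walk": 10, "bike": 30, "drive": 90}


def flag_trips(trips):
    """Return {trip_id: [failed checks]} for trips that need analyst review."""
    flags = {}
    ordered = sorted(trips, key=lambda t: (t["person_id"], t["depart"]))
    for i, trip in enumerate(ordered):
        failed = []
        hours = (trip["arrive"] - trip["depart"]) / 3600
        if hours > 0 and trip["miles"] / hours > MAX_SPEED_MPH[trip["mode"]]:
            failed.append("high_speed")
        prev = ordered[i - 1] if i > 0 else None
        # Overlap: this trip starts before the same person's previous trip ends.
        if prev and prev["person_id"] == trip["person_id"] and trip["depart"] < prev["arrive"]:
            failed.append("time_overlap")
        if failed:
            flags[trip["trip_id"]] = failed
    return flags


trips = [
    {"trip_id": 1, "person_id": 1, "mode": "walk", "depart": 0, "arrive": 1800, "miles": 1.0},
    {"trip_id": 2, "person_id": 1, "mode": "walk", "depart": 1500, "arrive": 2100, "miles": 5.0},
    {"trip_id": 3, "person_id": 2, "mode": "drive", "depart": 0, "arrive": 3600, "miles": 30.0},
]
print(flag_trips(trips))
```

Trips absent from the returned dictionary passed every check in the sketch and would skip the review queue.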
Transit Unlinking
Trips reported by participants as a single transit journey were automatically separated into their component segments: the walk, bike, or drive to the transit stop (the access leg); the ride on the transit vehicle itself; and the walk, bike, or drive from the alighting stop to the final destination (the egress leg). This separation was based on routing data that identified where the mode of travel changed.
To do this, the system used the Google Routes API (Google API) to identify the most likely path between the trip’s origin and destination, then classified each segment of that path as walk, bike, drive, or transit based on the routing profile. This step produced separate trip records for each segment of a transit journey. Only rMove-recorded trips were subject to transit unlinking; manually added trips and trips recorded through the online survey instrument were not processed through this step. The Post-Review Processing step described later performs a second round of transit unlinking that applies to all trip types, including those not processed through this initial automated unlinking.
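The idea of separating a routed path into access, transit, and egress legs can be illustrated as follows. This sketch assumes the routed path has already been reduced to an ordered list of steps with a mode and a length; it does not call the Google Routes API, and the step structure, the `leg` labels, and the `transfer` label for non-transit segments between two transit rides are assumptions. Consecutive transit steps are merged here, which simplifies cases with back-to-back rides on different routes.

```python
from itertools import groupby


def unlink_transit(steps):
    """Collapse consecutive same-mode steps into segments and label each leg."""
    segments = []
    for mode, group in groupby(steps, key=lambda s: s["mode"]):
        segments.append({"mode": mode, "meters": sum(s["meters"] for s in group)})

    transit_idx = [i for i, s in enumerate(segments) if s["mode"] == "transit"]
    for i, seg in enumerate(segments):
        if seg["mode"] == "transit":
            seg["leg"] = "transit"
        elif not transit_idx:
            seg["leg"] = "non_transit"  # no transit ride found on this path
        elif i < transit_idx[0]:
            seg["leg"] = "access"
        elif i > transit_idx[-1]:
            seg["leg"] = "egress"
        else:
            seg["leg"] = "transfer"
    return segments


steps = [
    {"mode": "walk", "meters": 300},
    {"mode": "walk", "meters": 100},
    {"mode": "transit", "meters": 5000},
    {"mode": "walk", "meters": 200},
]
for segment in unlink_transit(steps):
    print(segment)
```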
4.3 Analyst Review
After automated processing, trained analysts reviewed all flagged trips using a web-based editing tool that displayed each trip on a map alongside its GPS trace and survey responses. The goal of analyst review was to help the final trip table accurately reflect the travel that actually occurred, correcting errors that automated processing could not resolve.
Analysts performed four primary types of edits:
Dropping invalid trips. Some GPS traces were recorded as trips by the application when no actual travel occurred – for example, when a phone drifted in a parking garage or when a brief walk to the mailbox was detected. Analysts removed these records.
Joining trip fragments. Occasionally a single trip was recorded as two or more separate trips due to GPS signal loss (for example, when a participant entered a tunnel or a large building). Analysts merged these fragments back into the trip they represented.
Splitting trips with unreported stops. When a participant made an intermediate stop during a trip – such as stopping for coffee on the way to work – but the application recorded it as a single trip, analysts split it into two trips with the correct stop location and times.
Reviewing transit trips. Analysts verified that transit trips had been correctly separated into access, transit, and egress segments, and adjusted the segment boundaries if the automated unlinking produced incorrect results.
After analyst review, households with no remaining flagged trips were marked as ready for post-review processing.
4.4 Post-Review Processing
Post-review processing transformed the analyst-reviewed trip records into the final tables that make up the delivered dataset. This phase involved extensive cleaning, quality checks, and enrichment of the data. Steps included table construction, trip cleaning, location processing, distance derivation, geographic enrichment, and purpose assignment and imputation. Each of these steps is described in detail below.
Table Construction
The survey platform stores data in a “normalized” (long-format) database structure optimized for data collection. Post-review processing reshaped these records into the “denormalized” (wide-format), analysis-ready tables familiar to data users: household, person, day, trip, and vehicle. During this step, variables were renamed to standard conventions and identifiers were standardized across tables.
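The long-to-wide reshaping can be pictured with a few records. In the sketch below, each long-format row holds one record identifier, one variable name, and one value; the variable names (`hh_size`, `num_vehicles`), the `hh_id` identifier, and the values are made up for illustration and are not taken from the codebook.

```python
# Hypothetical long-format rows: (record id, variable name, value)
long_rows = [
    (1, "hh_size", "3"),
    (1, "num_vehicles", "2"),
    (2, "hh_size", "1"),
]


def to_wide(rows, id_name="hh_id"):
    """Pivot long rows into one dictionary (wide record) per identifier."""
    wide = {}
    for record_id, variable, value in rows:
        # Create the wide record on first sight, then add one column per variable.
        wide.setdefault(record_id, {id_name: record_id})[variable] = value
    return list(wide.values())


for record in to_wide(long_rows):
    print(record)
```

A variable that was never answered simply has no key in the wide record here; a delivered table would instead carry an explicit missing-value code.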
Post-Review Trip Cleaning
Post-review cleaning addressed a range of data quality issues that remained after analyst review. The cleaning process proceeded through several sub-steps:
Missing coordinates. A small number of trips may have missing origin or destination coordinates. Where possible, these were filled from nearby GPS points in the trip’s trace.
Travel period enforcement. Trips that fell outside the participant’s assigned travel period were removed. The travel day boundary was set at 3:00 AM rather than midnight, so a trip departing at 1:00 AM was assigned to the previous calendar day’s travel.
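The 3:00 AM boundary can be illustrated with a small sketch. The function names and the representation of the assigned travel period as a first and last date are assumptions made for illustration; they are not taken from the processing code.

```python
from datetime import date, datetime, timedelta

# From the text: the travel day starts at 3:00 AM rather than midnight.
TRAVEL_DAY_START_HOUR = 3


def travel_date(depart_time: datetime) -> date:
    """Return the travel day a departure belongs to, using a 3:00 AM boundary."""
    # Shifting the clock back by the boundary moves early-morning departures
    # onto the previous calendar date and leaves all other departures unchanged.
    return (depart_time - timedelta(hours=TRAVEL_DAY_START_HOUR)).date()


def in_travel_period(depart_time: datetime, first_day: date, last_day: date) -> bool:
    """True when the departure's travel day falls inside the assigned period."""
    return first_day <= travel_date(depart_time) <= last_day
```

Under this sketch, a 1:00 AM departure on May 2 is assigned to the May 1 travel day, while a 3:00 AM departure stays on May 2.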
Zero or negative duration. Trips with nonsensical durations were removed.
Transit segment processing. Transit trips were separated into their component legs using the routing data produced during automated processing. Each segment received its own origin, destination, departure time, and arrival time.
Spatial gap cleaning. When the destination of one trip and the origin of the next trip were far apart but the trips themselves appeared to be near-duplicates (similar origins and destinations, similar times), one of the duplicate trips was removed. This addressed situations where overlapping device recordings or delayed survey submissions produced redundant trip records.
Overlapping trip resolution. Trips whose time windows overlapped were resolved through an iterative set of rules that favored completed surveys over incomplete ones, longer trips over shorter ones, and trips with reasonable speeds over those with extreme speeds.
TNC trip correction. In some cases, analysts split ride-hailing trips (such as Uber or Lyft) into separate segments during review. Because a ride-hailing trip is a single journey from the passenger’s perspective, these segments were automatically merged back together.
Loop trip splitting. A loop trip is one where the participant departs from and returns to the same location – for example, a jog around the neighborhood or a drive to run multiple errands that ends back at home. When the GPS trace revealed a clear outbound and return path, the loop was split at the point farthest from the origin. This produced two trips: one outbound and one return. This is important for modeling because the outbound and return portions of a loop trip often serve different purposes or pass through different areas.
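A minimal sketch of splitting a loop at the point farthest from its origin follows. It assumes the trace is an ordered list of (latitude, longitude) tuples and uses a haversine distance with a mean Earth radius; the function names are illustrative, and the actual processing may apply additional checks for a "clear" outbound and return path that are not shown here.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; an assumption of this sketch


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(h)))


def split_loop(trace):
    """Split a loop trace at the point farthest from its origin.

    Returns (outbound, return_leg). The turnaround point ends the outbound
    leg and starts the return leg.
    """
    if len(trace) < 3:
        raise ValueError("need at least three points to split a loop")
    origin = trace[0]
    # Search interior points only, so that neither leg is a single point.
    turn = max(range(1, len(trace) - 1),
               key=lambda i: haversine_m(origin, trace[i]))
    return trace[: turn + 1], trace[turn:]
```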
Dwell time calculation. The time spent at each destination (the interval between arriving at one location and departing for the next trip) was calculated and stored as dwell_mins and dwell_time_hr.
Proxy and Copied Trips
In a household travel survey, not every household member carries a smartphone or directly reports their own travel. Young children, for example, typically do not have their own devices. Instead, an adult household member, called a proxy, reports travel on the child’s behalf. In most cases, this means the child was traveling with the adult; the child’s trip record was therefore created as a copy of the adult’s trip.
These copied trips were created during automated processing (before analyst review) and were preserved through all subsequent processing steps. A copied trip has the same GPS trace, origin, destination, departure time, arrival time, and distance as the trip it was copied from – only the person identifier differs. The flag copied_from_proxy identifies these records.
Analysts should be aware that copied trips will produce identical geometries and travel times for multiple household members. This is expected and correct: it reflects the fact that those individuals were traveling together. Proxy-copied trips are distinct from TICTOC joint trip imputations, which are created later in processing based on a different set of rules (see Section 4.5).
Location Processing and Distance Derivation
Raw GPS traces can contain hundreds of individual location points for each trip. Location processing cleaned these traces and prepared them for distance and duration calculations.
The location-processing step included:
- Removing erroneous points, including points flagged as untrustworthy by rMove.
- Eliminating duplicate points from the GPS trace.
- Imputing start and end points where the GPS trace did not perfectly align with the trip’s reported departure and arrival.
After cleaning, the trace data was used to recalculate distance and duration measures for GPS-tracked trips.
- distance_m was calculated as the sum of straight-line distances between consecutive points along the cleaned GPS trace. This trace-based distance represents the approximate path the traveler followed. Units: meters.
- distance_beeline_m was calculated as the direct straight-line distance from origin to destination. Unlike distance_m, which depends on the available trip geometry, distance_beeline_m is calculated consistently from the trip origin and destination and is retained for comparison and quality assurance. Units: meters.
- distance_miles is derived from distance_m by converting meters to miles (1 mile = 1,609.34 meters). Units: miles.
- duration_s was recalculated from the cleaned GPS timestamps. Units: seconds.
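The two distance measures can be sketched as follows for a GPS-tracked trip. The sketch assumes the cleaned trace is an ordered list of (latitude, longitude) tuples and uses the haversine formula with a mean Earth radius of 6,371 km; the radius and function names are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; an assumption of this sketch


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(h)))


def trace_distance_m(trace):
    """distance_m for a GPS trip: sum of distances between consecutive points."""
    return sum(haversine_m(p, q) for p, q in zip(trace, trace[1:]))


def beeline_distance_m(trace):
    """distance_beeline_m: direct distance from the first to the last point."""
    return haversine_m(trace[0], trace[-1])
```

For an out-and-back trace the two measures diverge sharply: the summed trace distance covers the full path, while the beeline distance is close to zero. This is one reason to keep both fields for quality assurance.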
For trips without a usable GPS trace, including participant-added trips, analyst-added trips, and trips collected through the online survey instrument, distance_m could not be calculated from observed trace points. These trips were processed separately using origin-destination network routing, described below.
Origin-Destination Routing for Non-GPS Trips
Origin-destination routing was used for trips that had only a reported origin and destination and no full GPS trace. This included manually added trips and trips recorded through the online survey instrument.
To estimate a realistic path distance for these trips, origins and destinations were routed through the Open Source Routing Machine (OSRM), a routing engine built on OpenStreetMap data.
The routing process used:
- Origin and destination coordinates as the required inputs.
- Mode-specific routing profiles for automobile, bicycle, and pedestrian travel.
- Shortest feasible network paths between each trip’s endpoints.
The routing step produced distance_m, a network-based path distance.
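As a hedged illustration, the sketch below builds a request URL for the OSRM HTTP route service and reads the routed distance from a response payload. The server address, the profile names, and the mode labels are assumptions (profile names depend on how a given OSRM server was configured), and no network call is made. OSRM expects coordinates in longitude,latitude order and reports route distance in meters.

```python
# Hypothetical mapping from a trip's mode group to an OSRM profile name.
PROFILE_BY_MODE = {"auto": "car", "bicycle": "bike", "pedestrian": "foot"}


def route_url(base_url, mode, origin, destination):
    """Build an OSRM route request URL from (lat, lon) origin and destination."""
    profile = PROFILE_BY_MODE[mode]
    (o_lat, o_lon), (d_lat, d_lon) = origin, destination
    # OSRM coordinate order is longitude,latitude; pairs are separated by ';'.
    coords = f"{o_lon},{o_lat};{d_lon},{d_lat}"
    return f"{base_url}/route/v1/{profile}/{coords}?overview=false"


def distance_m_from_response(payload):
    """Return the first route's distance in meters, or None if no route was found."""
    if payload.get("code") != "Ok" or not payload.get("routes"):
        return None
    return float(payload["routes"][0]["distance"])
```

A caller would fetch the URL with any HTTP client, parse the JSON body, and pass the resulting dictionary to distance_m_from_response to obtain a candidate distance_m value.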
For GPS-tracked trips, distance_m values were derived from the cleaned GPS trace by summing point-to-point distances along the observed path. For trips without GPS traces, OSRM origin-destination routing provided an analogous path-based distance rather than a simple straight-line distance.
As a result, the delivered distance_m field represents the best available path-distance estimate for each trip:
- GPS-tracked trips: observed trace distance.
- Trips with only origin and destination coordinates: routed network distance.
The separate distance_beeline_m field provides a consistent straight-line origin-to-destination distance for comparison and quality assurance.
Table 8 summarizes the distance source used for each major trip type in the delivered data.
| Distance derivation by trip type | |||
| Trip Type | distance_m Source | distance_beeline_m | Notes |
|---|---|---|---|
| GPS-tracked trips | Cleaned GPS trace | Haversine O-D | Observed path distance from cleaned trace points |
| Manually added trips | OSRM origin-destination network route | Haversine O-D | Only origin and destination were available; no trace |
| Online survey trips | OSRM origin-destination network route | Haversine O-D | Only origin and destination were available; no trace |
| Split loop legs | Cleaned GPS trace, where trace geometry was available | Haversine O-D per leg | Each leg was processed independently |
| Unlinked transit legs | Cleaned GPS trace, where trace geometry was available | Haversine O-D per leg | May have been sparse if portions of the trip occurred underground |
| Synthetic access/egress | NA | NA | Zero-distance placeholder |
| distance_m was a path-distance measure: cleaned trace distance for GPS-tracked trips and OSRM origin-destination network distance for trips without usable trace geometry. distance_beeline_m was calculated consistently as the direct origin-to-destination distance and was provided for comparison and quality assurance. | |||
Geographic Enrichment
Trip origins and destinations, home locations, and habitual work and school locations were assigned to U.S. Census geographic units through spatial point-in-polygon joins. Table 9 summarizes the identifiers added during this step.
| Geographic variables added during spatial enrichment | ||
| Variable Pattern | Geography | Applied To |
|---|---|---|
| *_bg_2020 | Census Block Group (2020) | Home, work, school, trip O/D |
| *_puma_2022 | Public Use Microdata Area (2022) | Home, work, school, trip O/D |
| *_county | County (derived from block group) | Home, work, school, trip O/D |
| *_state | State (derived from block group) | Home, work, school, trip O/D |
| o_in_region, d_in_region | Study region boundary | Trip O/D |
These identifiers enable geographic analysis at multiple levels without requiring users to perform their own spatial operations.
Purpose Assignment and Imputation
Each trip in the dataset has a destination purpose (d_purpose) describing why the traveler went to that location, for example, going to work, shopping, or returning home. Respondents report the destination purpose in the trip survey. The origin purpose (o_purpose) is generally derived from the previous trip’s destination purpose, reflecting the activity the traveler was engaged in before departing.
Purpose assignment involves several steps:
Purpose cleaning. Purposes on split loop trips were corrected: the return leg was assigned the purpose of the location the traveler was returning to (typically the same as the purpose two trips prior). Unlinked transit segments were assigned a purpose of “change mode.”
Purpose categorization. Detailed purpose codes from the survey were grouped into broader purpose categories (e.g., “work,” “school,” “shopping,” “social/recreation”) to support aggregate analysis.
Open-ended purpose classification. When a participant selected “other” as the trip purpose and provided a free-text description, a language model could assign that description to one of the standard purpose categories used in the dataset. This step reduced the number of uncategorized “other” trips while preserving a consistent set of purpose categories for analysis. If no suitable standard category could be identified, the trip remained classified as “other.”
Habitual location matching. Trip endpoints were compared to known home, work, and school locations. If a trip’s destination fell within 100 meters of the participant’s home, it was classified as a home location; similar thresholds applied to work and school. When a trip was reported as “work” but the destination was far from any known work location, it was reclassified as “work-related” to distinguish between commute trips and trips to secondary work sites.
Purpose imputation. Respondents report the purpose of each trip destination, and the origin purpose is generally derived from the destination purpose of the previous trip. During processing, a rules-based algorithm identified trips whose reported purposes appeared inconsistent with their locations or with the surrounding trip sequence and corrected them where appropriate. For example, a trip ending at the participant’s home but reported as “shopping” would be reclassified. This processing could include location-based corrections, derived values for analyst-split trips, and broader imputations when reported purposes were missing or implausible. The imputation algorithm iterated across related trips to resolve chains of dependencies.
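The habitual location matching and the simplest location-based corrections described above might be sketched as follows. The 100-meter home threshold comes from the text; reusing it for work and school, the order in which locations are checked, the text purpose labels, and the function names are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; an assumption of this sketch
BUFFER_M = 100  # home threshold from the text; reused for work/school as an assumption


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(h)))


def location_type(dest, habitual, buffer_m=BUFFER_M):
    """Return 'home', 'work', 'school', or None for a trip destination.

    habitual maps each label to a (lat, lon) tuple, or None when unknown.
    """
    for label in ("home", "work", "school"):
        point = habitual.get(label)
        if point is not None and haversine_m(dest, point) <= buffer_m:
            return label
    return None


def corrected_purpose(reported, loc_type):
    """Apply two location-based corrections to a reported purpose."""
    if loc_type == "home":
        return "home"  # e.g., reported 'shopping' but the trip ended at home
    if reported == "work" and loc_type != "work":
        return "work-related"  # reported 'work' away from the known workplace
    return reported
```

The delivered data uses numeric purpose codes and an iterative algorithm over trip chains; this sketch covers only the single-trip rules named in the text.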
Table 10 summarizes the main purpose-assignment and imputation outcomes reflected in the delivered trip data.
| Purpose assignment and imputation outcomes | |
| Label | Description |
|---|---|
| Reported | Purpose as reported by participant, unchanged |
| Location-corrected | Reported purpose conflicted with proximity to a habitual location (e.g., reported 'work' but location is home) |
| AI-classified | Participant selected 'other' and provided text; the text was assigned to a standard purpose category using a language model when automated coding was used |
| Split loop | Return leg of a split loop trip; purpose set to match the location being returned to |
| Algorithm-imputed | Purpose assigned by the iterative imputation algorithm based on location type, dwell time, and trip sequence |
| Linked transit | Purpose set to 'change mode' during transit trip linking |
| Incomplete survey | Trip survey was not completed; purpose defaulted to 'other' |
| Browser/proxy | Trip was not processed through imputation (browser-move or non-participant copy) |
| These categories describe how trip purposes may remain as reported or be modified during processing. | |
Delivered purpose columns. After processing, the trip table includes both the originally reported purpose fields and the final delivered purpose fields for origins and destinations. The final detailed-purpose columns and final purpose-category columns are paired outputs of the same processing pipeline: they are delivered together and intended to remain consistent with one another. The reported fields preserve the pre-imputation values for comparison. Table 11 summarizes those delivered purpose fields.
| Delivered purpose columns on the trip table | |
| Column | Content |
|---|---|
| d_purpose / o_purpose | Final imputed detailed purpose code. Use these columns when detailed-purpose distinctions are needed. |
| d_purpose_category / o_purpose_category | Grouped category paired with the final imputed purpose. Derived from the same imputation as `*_purpose`, not from a separate downstream recode. |
| d_purpose_reported / o_purpose_reported | Originally reported detailed purpose before reclassification and imputation. Provided for comparison and quality assurance. |
| d_purpose_category_reported / o_purpose_category_reported | Grouped category corresponding to the originally reported purpose. |
| Use the final *_purpose or *_purpose_category columns for analysis, depending on the level of detail needed. These final columns are designed to stay in sync; the _reported columns preserve the pre-imputation values. | |
In most cases, the final and reported purposes are identical. They differ only when processing reclassified or imputed purpose values. The final detailed-purpose and purpose-category fields are intended to agree with each other, though open-ended “other” purposes may require additional analyst caution.
Origin purpose on first trips. Because o_purpose is derived from the previous trip’s destination purpose, the first trip of each person’s travel period has no preceding trip to draw from. In the post-review processed data, o_purpose for first trips will be missing (NA).
During TICTOC processing, origin purposes were recalculated after trip imputation. TICTOC set o_purpose to the previous trip’s d_purpose only when the trip’s origin was spatially consistent with the previous trip’s destination (i.e., within a configurable distance buffer). First trips of the day and trips with a spatial gap from the previous destination retained their existing o_purpose value and were not overwritten. Analysts filtering or tabulating on o_purpose should account for these missing values on first trips.
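The TICTOC origin-purpose rule can be sketched as below. The sketch assumes one person's trips for a single travel day in time order, with a precomputed gap_from_prev_m field holding the distance from the previous trip's destination; that field name and the 250-meter default buffer are hypothetical, since the text states only that the buffer is configurable.

```python
def recalc_o_purpose(trips, buffer_m=250.0):
    """Return copies of the trips with o_purpose recalculated.

    o_purpose is overwritten with the previous trip's d_purpose only when the
    origin lies within buffer_m of the previous destination. First trips and
    trips with a larger spatial gap keep their existing o_purpose.
    """
    updated = []
    prev = None
    for trip in trips:
        new_trip = dict(trip)  # copy so the input records are not modified
        gap = trip.get("gap_from_prev_m")
        if prev is not None and gap is not None and gap <= buffer_m:
            new_trip["o_purpose"] = prev["d_purpose"]
        updated.append(new_trip)
        prev = trip
    return updated
```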
Mode Type Assignment
The survey asked respondents to select all modes used on each trip from a checkbox list. Respondents could select as many modes as applied. The first four selections are preserved in the delivered unlinked trip table as mode_1, mode_2, mode_3, and mode_4. These columns are unordered: mode_1 is simply the first-reported mode, not a primary or dominant one. For most analyses, including mode share, use mode_type rather than the mode_n columns directly.
mode_type is derived by applying a priority hierarchy across all populated mode_n columns on each trip. When a respondent selected more than one mode, the mode with the highest priority value wins. For example, if a respondent selected both walk and transit, the trip is assigned mode_type = 13 (Transit), because transit outranks walk in the hierarchy. In the rare cases where a respondent selected more than four modes, the mode_type assignment may not correspond to the first four mode_n values.
mode_priority records the numeric priority of the winning mode, and is useful for confirming which mode was selected when a trip has multiple mode_n values populated.
Table 12 shows the full crosswalk of detailed survey mode codes to mode_type groups, in priority order.
| Detailed mode to mode_type crosswalk | ||
| mode_type | Detailed Mode Value | Detailed Mode |
|---|---|---|
| Walk | ||
| 1 | 1 | Walk (or jog/wheelchair) |
| 1 | 43 | Skateboard or rollerblade |
| Bike | ||
| 2 | 2 | Standard bicycle (my household's) |
| 2 | 3 | Borrowed bicycle (e.g., a friend's) |
| 2 | 4 | Other rented bicycle |
| 2 | 56 | Other personal bicycle (e.g., cargo, tandem, etc.) |
| 2 | 82 | Electric bicycle (my household's) |
| 2 | 103 | Bicycle or e-bicycle |
| 2 | 107 | Micromobility (e.g., scooter, moped, skateboard) |
| Bike Share | ||
| 3 | 69 | Bike-share - standard bicycle |
| 3 | 70 | Bike-share - electric bicycle |
| Scooter Share | ||
| 4 | 73 | Moped-share (e.g., Scoot) |
| 4 | 74 | Segway |
| 4 | 83 | Scooter-share (e.g., Bird, Lime) |
| Taxi | ||
| 5 | 36 | Regular taxi (e.g., Yellow Cab) |
| 5 | 60 | Other hired car service (e.g., black car, limo) |
| Tnc | ||
| 6 | 49 | Uber, Lyft, or other smartphone-app ride service |
| 6 | 106 | Uber/Lyft, taxi or car service |
| Other | ||
| 7 | 5 | Other |
| 7 | 27 | Paratransit/Dial-A-Ride (e.g., The RIDE) |
| 7 | 44 | Golf cart |
| 7 | 45 | ATV |
| 7 | 75 | Other |
| 7 | 77 | Personal scooter or moped (not shared) |
| 7 | 80 | Other boat (e.g., kayak) |
| 7 | 81 | Snowmobile |
| 7 | 104 | Other |
| Car | ||
| 8 | 6 | Household vehicle 1 |
| 8 | 7 | Household vehicle 2 |
| 8 | 8 | Household vehicle 3 |
| 8 | 9 | Household vehicle 4 |
| 8 | 10 | Household vehicle 5 |
| 8 | 11 | Household vehicle 6 |
| 8 | 12 | Household vehicle 7 |
| 8 | 13 | Household vehicle 8 |
| 8 | 14 | Household vehicle 9 |
| 8 | 15 | Household vehicle 10 |
| 8 | 16 | Other vehicle in household |
| 8 | 17 | Rental car |
| 8 | 22 | Other vehicle (not my household's) |
| 8 | 33 | Car from work |
| 8 | 34 | Friend/relative/colleague's car |
| 8 | 47 | Other motorcycle in household |
| 8 | 54 | Other motorcycle (not my household's) |
| 8 | 68 | Cable car or streetcar |
| 8 | 100 | Household vehicle (or motorcycle) |
| 8 | 101 | Other vehicle (e.g., friend's car, rental, carshare, work car) |
| Car Share | ||
| 9 | 18 | Carshare service (e.g., Zipcar) |
| 9 | 59 | Peer-to-peer car rental (e.g., Turo) |
| 9 | 76 | Carpool match (e.g., Waze Carpool) |
| School Bus | ||
| 10 | 24 | School bus |
| Shuttle Vanpool | ||
| 11 | 21 | Vanpool |
| 11 | 26 | Other private shuttle/bus (e.g., a hotel's, an airport's) |
| 11 | 38 | University/college shuttle/bus |
| 11 | 62 | Employer-provided shuttle/bus |
| Ferry | ||
| 12 | 78 | Other public ferry or water taxi |
| 12 | 79 | Vehicle ferry (took vehicle on board) |
| Transit | ||
| 13 | 23 | Local bus |
| 13 | 28 | Other bus |
| 13 | 30 | Subway |
| 13 | 39 | Light rail |
| 13 | 42 | Other rail |
| 13 | 55 | Express/commuter bus |
| 13 | 58 | Commuter rail |
| 13 | 61 | Rapid transit bus (BRT) |
| 13 | 102 | Bus, shuttle, or vanpool |
| 13 | 105 | Rail (e.g., train, subway) |
| Ld Passenger | ||
| 14 | 25 | Intercity bus (e.g., Greyhound) |
| 14 | 31 | Airplane/helicopter |
| 14 | 41 | Intercity rail (e.g., Amtrak) |
| Higher mode_type values take priority when multiple modes are reported on a single trip. mode_type on unlinked transit access and egress legs is assigned by the routing engine rather than from the participant's original survey response. | ||
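A minimal sketch of the hierarchy follows, using a small subset of the Table 12 crosswalk. It assumes that the mode_type value itself serves as the priority (consistent with the table note that higher values take priority) and that missing codes such as 995 or empty cells are skipped; the function name is illustrative.

```python
# Subset of the Table 12 crosswalk: detailed mode value -> mode_type.
MODE_TYPE_BY_CODE = {
    1: 1, 43: 1,              # Walk
    2: 2, 82: 2,              # Bike
    69: 3,                    # Bike Share
    36: 5,                    # Taxi
    49: 6,                    # TNC
    6: 8, 7: 8, 17: 8,        # Car
    18: 9,                    # Car Share
    24: 10,                   # School Bus
    21: 11,                   # Shuttle / Vanpool
    78: 12,                   # Ferry
    23: 13, 30: 13, 58: 13,   # Transit
    25: 14, 31: 14, 41: 14,   # Long-distance passenger
}
MISSING_CODE = 995  # missing / not-applicable value


def assign_mode_type(mode_codes):
    """Return the highest mode_type among the populated mode_n values, or None."""
    types = [
        MODE_TYPE_BY_CODE[code]
        for code in mode_codes
        if code is not None and code != MISSING_CODE and code in MODE_TYPE_BY_CODE
    ]
    return max(types) if types else None
```

With this subset, a trip reporting walk (1) and local bus (23) resolves to 13 (Transit), and a trip reporting a household vehicle (6) and a bicycle (2) resolves to 8 (Car).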
Secondary Transit Trip Unlinking
Initial pre-review processing used the Google Routes API to split rMove-recorded transit trips into their component legs. To split manually added, analyst-added, and online survey trips into access, transit, and egress legs, a secondary unlinking process was applied to the unlinked trip table after analyst review. This unlinking step identified transit trips without user-recorded or Google-derived leg splits and applied a set of rules to create “synthetic” access and egress legs where needed. This ensured that all transit trips had a consistent structure for downstream processing, even if the original survey response did not capture the full set of legs.
The unlinking algorithm applied a set of rules based on:
- Whether consecutive trips had a short dwell time between them (consistent with a transfer rather than a true stop)
- Whether the mode sequence suggested access/transit/egress
- Whether the destination purpose was “change mode” (indicating a transfer point)
Unlinking returned flags that identify each trip’s role in the linked journey: is_access, is_egress, is_transit_leg, and is_primary_leg (the highest-priority mode segment).
For transit trips that were missing an access or egress segment – for example, because the walk to the bus stop was too short to be detected – a synthetic zero-distance leg was created as a placeholder to maintain a consistent data structure. These synthetic legs were flagged with transit_quality_flag values of “SA” (synthetic access) or “SE” (synthetic egress).
Mode codes on unlinked transit legs. When a transit trip was separated into access, transit, and egress segments, the mode on each segment was reassigned to reflect the actual travel mode of that segment (e.g., walk for an access leg, bus for the transit leg) based on the routing engine’s classification. As part of this reassignment, the original survey-reported mode codes (mode_1 through mode_N) on affected segments were cleared and set to a missing/not-applicable value (995). Analysts querying the detailed mode columns on transit access or egress legs will find 995 rather than a reported mode; the mode_type column contains the correct grouped mode for each segment.
Derived Travel Variables
After all cleaning and linking was complete, the following derived variables were calculated:
- distance_miles – trip distance converted from meters to miles
- speed_mph – derived from distance and duration
- duration_minutes – duration in minutes
- depart_hour, depart_minute, arrive_hour, arrive_minute – time components for modeling software
- travel_dow – day of week
- mode_type – grouped mode category (e.g., car, transit, walk, bike). See Section 4.4.8.
- mode_priority – the highest-priority mode used on the trip
- speed_flag – indicator for trips with speeds exceeding plausible thresholds for their mode
- teleport – indicator for spatial discontinuities between consecutive trips (destination of one trip is far from origin of the next)
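The unit conversions behind distance_miles, duration_minutes, and speed_mph can be sketched as follows, using the 1,609.34 meters-per-mile factor stated earlier in this section. The function name and the choice to return no speed for non-positive durations are assumptions.

```python
METERS_PER_MILE = 1609.34  # conversion factor stated in the distance derivation


def derive_travel_variables(distance_m, duration_s):
    """Compute distance_miles, duration_minutes, and speed_mph for one trip."""
    distance_miles = distance_m / METERS_PER_MILE
    duration_minutes = duration_s / 60
    # Speed is miles divided by hours; it is left undefined when the duration
    # is not positive (an assumption of this sketch).
    speed_mph = distance_miles / (duration_s / 3600) if duration_s > 0 else None
    return {
        "distance_miles": distance_miles,
        "duration_minutes": duration_minutes,
        "speed_mph": speed_mph,
    }
```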
Driver status. The driver variable indicates whether the trip-maker was the driver or a passenger for automobile trips. During processing, this variable was adjusted in two cases. First, if a person reported as “driver” was under the minimum driving age or did not hold a driver’s license, they were reclassified as a passenger. Second, if an automobile trip included exactly one licensed household member, that person was imputed as the driver regardless of their original response. These corrections maintained consistency between the driver variable and the person table’s age and license fields. Analysts performing auto occupancy or driver/passenger analyses should be aware that some driver values reflect these corrections rather than the original survey response.
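The two driver corrections might look like the sketch below. It assumes a list of the household members on one automobile trip, a minimum driving age of 16 (the source does not state the value), and text labels for the driver variable; the second rule is interpreted here as requiring both a license and the minimum age.

```python
MIN_DRIVING_AGE = 16  # hypothetical threshold; not stated in the source


def correct_driver_status(travelers):
    """Return {person_id: 'driver' | 'passenger'} after both corrections.

    travelers: household members on one automobile trip, each a dict with
    person_id, age, has_license, and the reported driver value.
    """
    def eligible(person):
        return person["has_license"] and person["age"] >= MIN_DRIVING_AGE

    corrected = {}
    for person in travelers:
        status = person["driver"]
        # Rule 1: an ineligible reported driver becomes a passenger.
        if status == "driver" and not eligible(person):
            status = "passenger"
        corrected[person["person_id"]] = status

    # Rule 2: exactly one eligible traveler is imputed as the driver,
    # regardless of the original response.
    licensed = [p for p in travelers if eligible(p)]
    if len(licensed) == 1:
        corrected[licensed[0]["person_id"]] = "driver"
    return corrected
```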
Quality Assurance
A comprehensive suite of automated quality checks was applied to the final tables, producing an HTML diagnostic report and a CSV file of test results. Checks included verification of referential integrity across tables (every trip belongs to a valid person and day), consistency of trip counts, plausibility of speeds and distances, and completeness of required fields. A draft codebook documenting all variables and their value labels was also generated automatically.
4.5 TICTOC Processing
TICTOC (Trip Imputation, Coordination, and Tour Organization Compiler) prepared household travel survey data for travel forecasting by imputing selected missing trips, coordinating joint household travel, organizing unlinked trips into linked trips and tours, and adding model-facing attributes to the household, person, day, trip, linked trip, and tour outputs. Before organizing trips into linked trips and tours, TICTOC performed location-purpose correction, imputed missing joint trips for household members who traveled together but did not independently report the trip, and imputed selected child school trips when survey responses indicated school attendance but no corresponding trip was present. Each of these steps is described in detail in the sections that follow.
Joint Trip Detection and Imputation
Household travel surveys rely on individual participants to report their own travel. When household members travel together – for example, a parent driving children to school – each person’s trips should appear in the data. In practice, non-participant household members (particularly young children) often have missing or incomplete trip records.
TICTOC addresses this in two steps:
Detecting joint trips. For each household, the system examined all pairs of household members and identified trips that overlapped in both time and space. Trips were considered joint if they departed from and arrived at similar locations within similar time windows.
Imputing missing joint trips. When one household member had a reported trip that indicated joint travel with another member who did not have a corresponding trip, a new trip record was created for the missing member. The imputed trip was created from the host trip and populated using project-specific column-action rules that specified which attributes were copied directly from the host trip, which were filled with person-specific values (such as the target person’s demographics), and which received default or sampled values. Imputation occurred only when:
- The host trip indicated joint travel with the target person
- The imputed trip would not overlap with the target person’s existing trips
- The imputed trip would not create a spatial discontinuity (teleport) in the target person’s travel chain
- The host trip had not been dropped or flagged as invalid
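The four conditions above can be expressed as a single eligibility check. In this sketch the field names (joint_person_ids, dropped, depart, arrive) are hypothetical, times may be any comparable values such as datetimes or minutes since midnight, and the teleport test is assumed to have been evaluated elsewhere and passed in as a boolean.

```python
def overlaps(a, b):
    """True when two trips' time windows overlap."""
    return a["depart"] < b["arrive"] and b["depart"] < a["arrive"]


def can_impute_joint_trip(host, target_id, target_trips, creates_teleport):
    """Check whether a host trip may be copied to a target household member."""
    if host.get("dropped"):
        return False  # host trip was dropped or flagged as invalid
    if target_id not in host["joint_person_ids"]:
        return False  # host trip does not indicate travel with this person
    if any(overlaps(host, trip) for trip in target_trips):
        return False  # would overlap one of the target's existing trips
    return not creates_teleport  # must not break the target's travel chain
```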
School Trip Imputation
Children’s school trips are among the most commonly underreported trip types in household travel surveys, particularly for younger children who do not carry smartphones. TICTOC imputed school trips when survey responses indicated that a child attended school and no corresponding school trip was present in the data. Trips were not imputed when the available responses indicated that the child did not attend school, attended a different or unknown school location, was home-schooled, or was not a student.
Imputed school trips used the child’s reported home and school locations from the person table, the household’s reported usual school travel mode where available, and sampled departure times. Other trip attributes, such as occupancy, driver status, and mode details, were derived from project-specific rules rather than copied from another person’s trip. This distinguished school trip imputation from joint trip imputation, where most attributes came from a host trip.
Trip Linking
After trip imputation, TICTOC organized trips into linked trips. Each linked trip was constructed from one or more unlinked trips; where “change mode” purposes indicated that multiple segments belonged to a single journey, those segments were combined. The TICTOC-derived linked trip records were used downstream for tour organization, origin-destination analysis, and weighting. Each trip_linked record summarized one or more unlinked trip records into a single journey with the origin of the first segment, the destination of the last segment, and journey-level mode, distance, and duration.
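A simplified sketch of the linking rule follows: consecutive segments are chained while the destination purpose is “change mode,” and the chain closes at the first segment with any other purpose. The field names and the change_mode label are assumptions, journey-level mode assignment is omitted, and the input is assumed to be one person's unlinked trips in time order.

```python
def summarize(chain):
    """Collapse a chain of unlinked segments into one linked-trip record."""
    first, last = chain[0], chain[-1]
    return {
        "origin": first["origin"],            # origin of the first segment
        "destination": last["destination"],   # destination of the last segment
        "d_purpose": last["d_purpose"],
        "distance_m": sum(seg["distance_m"] for seg in chain),
        "n_segments": len(chain),
    }


def link_trips(unlinked):
    """Group time-ordered unlinked trips into linked trips."""
    linked, chain = [], []
    for trip in unlinked:
        chain.append(trip)
        if trip["d_purpose"] != "change_mode":
            linked.append(summarize(chain))
            chain = []
    if chain:
        # A trailing chain that never reached a final destination is still
        # emitted as its own linked trip (an assumption of this sketch).
        linked.append(summarize(chain))
    return linked
```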
Figure 3 illustrates this linked trip structure for downstream TICTOC processing.
%%{init: {"theme":"default","flowchart":{"htmlLabels":true,"curve":"basis"}}}%%
flowchart LR
subgraph UNLINKED["Unlinked Trips (trip table)"]
direction LR
T1["Trip 1: Walk
Home -> Bus Stop
purpose: change_mode
is_access: 1"]
T2["Trip 2: Bus
Stop A -> Stop B
purpose: change_mode
is_transit_leg: 1"]
T3["Trip 3: Walk
Bus Stop -> Work
purpose: work
is_egress: 1"]
T1 --> T2 --> T3
end
subgraph LINKED["Linked Trip (trip_linked table)"]
LT["Linked Trip
Mode: Transit (bus)
Home -> Work
Access: Walk | Egress: Walk
Distance: sum of all segments"]
end
UNLINKED --> LINKED
classDef linked_trip fill:#F68B1F,stroke:#C66916,color:#000000,font-family:Inter,stroke-width:2.5px,font-weight:bold,fill-opacity:0.92
classDef unlinked_trip fill:#E4572E,stroke:#BA3F21,color:#ffffff,font-family:Inter,stroke-width:2.5px,font-weight:bold,fill-opacity:0.92
style UNLINKED fill:#E4572E,stroke:#BA3F21,color:#ffffff,fill-opacity:0.14,stroke-width:1.75px
style LINKED fill:#F68B1F,stroke:#C66916,color:#000000,fill-opacity:0.16,stroke-width:1.75px
linkStyle default stroke:#475569,stroke-width:2.25px
class T1,T2,T3 unlinked_trip
class LT linked_trip
Multi-leg transit trips and intermediate transfer segments.
The example above shows a simple three-segment journey (access → transit → egress). Some linked trips include multiple transit vehicles — for example: bike to bus stop, bus leg, short transfer, second bus leg.
For these multi-leg linked trips:
- is_access is assigned only to the first leg, when it is non-transit and immediately precedes a transit leg.
- is_egress is assigned only to the last leg, when it is non-transit and immediately follows a transit leg.
- An intermediate non-transit segment between two transit legs, such as a walk or bike transfer between buses, will appear as its own row in trip_unlinked under the same linked_trip_id, but will carry none of these flags: it is not is_access, not is_egress, and not is_transit_leg.
These intermediate segments are present in the unlinked trip table, but they should be treated as an occasional byproduct of the Google Routes API unlinking process or unusually detailed participant recording of trips rather than a guaranteed analytical construct. Their detailed mode fields (mode_1 through mode_N) are typically set to 995 (not applicable); mode_type reflects the routing engine’s classification (usually walk or bike).
To identify intermediate transfer-like segments in trip_unlinked, filter to rows where all four conditions hold:
- is_transit == 1 (part of a transit linked trip)
- is_transit_leg == 0 (not the transit vehicle leg)
- is_access == 0
- is_egress == 0
Time and distance for these segments are included in the linked trip totals in trip_linked.
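The four-condition filter can be sketched in plain Python over a list of row dictionaries. The row representation and function name are assumptions; the same conditions translate directly to a filter in whichever data-frame or SQL tool is used to read trip_unlinked.

```python
def intermediate_transfer_segments(rows):
    """Return rows that are part of a transit linked trip but carry no role flag."""
    return [
        row for row in rows
        if row["is_transit"] == 1
        and row["is_transit_leg"] == 0
        and row["is_access"] == 0
        and row["is_egress"] == 0
    ]
```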
Linked Trip Mode Assignment
TICTOC assigned a single mode to each linked trip through a two-step process that operated on the raw survey mode values (mode_n columns) across all constituent unlinked trips — not on the mode_type variable derived during post-review processing.
Step 1: Group raw survey modes. TICTOC collected every populated mode_n value across all unlinked segments belonging to the linked trip and mapped each value to an intermediate mode group using the project-configurable crosswalk in Table 13. For example, a respondent who selected “Local bus” (value 23) and “Walk” (value 1) on separate segments would contribute both LOCAL and WALK to the mode group set. All unique mode groups present across all segments were collected into a single set for use in Step 2.
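Step 1 can be sketched as a lookup followed by a set union. The crosswalk below is a small subset chosen for illustration, the WALK entry follows the example in the paragraph above, and the treatment of empty cells and the 995 code as skipped values is an assumption.

```python
# Illustrative subset of the survey mode value -> mode group crosswalk.
MODE_GROUP_BY_CODE = {
    1: "WALK",
    2: "BIKE",
    6: "DRIVE",
    23: "LOCAL",
    30: "LOCAL",
    24: "SCHOOLBUS",
    25: "LONGDIST",
    49: "TNC",
    55: "REGIONAL",
}
MISSING_CODE = 995  # missing / not-applicable value


def mode_groups_for_linked_trip(segments):
    """Collect the unique mode groups across all segments of a linked trip.

    segments: one list of mode_n values per unlinked segment.
    """
    groups = set()
    for mode_codes in segments:
        for code in mode_codes:
            if code is None or code == MISSING_CODE:
                continue  # skip empty and not-applicable cells
            group = MODE_GROUP_BY_CODE.get(code)
            if group is not None:
                groups.add(group)
    return groups
```

For a walk segment followed by a local bus segment, the resulting set contains WALK and LOCAL, which is the input the hierarchy in the next step operates on.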
| Step 1: Survey mode value to mode group crosswalk | |
| Each raw mode_n value is mapped to an intermediate group before the hierarchy is applied | |
| Survey Mode Value | Survey Mode Label |
|---|---|
| SCHOOLBUS | |
| 24 | School bus |
| LONGDIST | |
| 25 | Intercity bus (e.g., Greyhound) |
| 31 | Airplane/helicopter |
| 41 | Intercity rail (e.g., Amtrak) |
| REGIONAL | |
| 55 | Express/commuter bus |
| 58 | Commuter rail |
| 78 | Other public ferry or water taxi |
| 79 | Vehicle ferry (took vehicle on board) |
| LOCAL | |
| 23 | Local bus |
| 26 | Other private shuttle/bus (e.g., a hotel's, an airport's) |
| 28 | Other bus |
| 30 | Subway |
| 38 | University/college shuttle/bus |
| 39 | Light rail/trolley |
| 42 | Other rail |
| 61 | Rapid transit bus (BRT) |
| 62 | Employer-provided shuttle/bus |
| 102 | Bus, shuttle, or vanpool |
| 105 | Rail (e.g., train, subway) |
| DRIVE | |
| 6 | Household vehicle 1 |
| 7 | Household vehicle 2 |
| 8 | Household vehicle 3 |
| 9 | Household vehicle 4 |
| 10 | Household vehicle 5 |
| 11 | Household vehicle 6 |
| 12 | Household vehicle 7 |
| 13 | Household vehicle 8 |
| 14 | Household vehicle 9 |
| 15 | Household vehicle 10 |
| 16 | Other vehicle in household |
| 17 | Rental car |
| 18 | Carshare service (e.g., Zipcar) |
| 21 | Vanpool |
| 22 | Other vehicle (not my household's) |
| 27 | Medical transportation service |
| 33 | Car from work |
| 34 | Friend/relative/colleague's car |
| 47 | Other motorcycle in household |
| 54 | Other motorcycle (not my household's) |
| 59 | Peer-to-peer car rental (e.g., Turo) |
| 76 | Carpool match (e.g., Waze Carpool) |
| 100 | Household vehicle (or motorcycle) |
| 101 | Other vehicle (e.g., friend's car, rental, carshare, work car) |
| BIKE | |
| 2 | Standard bicycle (my household's) |
| 3 | Borrowed bicycle (e.g., a friend's) |
| 4 | Other rented bicycle |
| 56 | Other personal bicycle (e.g., cargo, tandem, etc.) |
| 82 | Electric bicycle (my household's) |
| 103 | Bicycle or e-bicycle |
| PERSONAL MOBILITY | |
| 43 | Skateboard or rollerblade |
| 44 | Golf cart |
| 45 | ATV |
| 74 | Segway |
| 77 | Personal scooter or moped (not shared) |
| 80 | Other boat (e.g., kayak) |
| 81 | Snowmobile |
| 83 | Scooter-share (e.g., Bird, Lime) |
| 107 | Micromobility (e.g., scooter, moped, skateboard) |
| TNC | |
| 36 | Regular taxi (e.g., Yellow Cab) |
| 49 | Uber, Lyft, or other smartphone-app ride service |
| 60 | Other hired car service (e.g., black car, limo) |
| 106 | Uber/Lyft, taxi or car service |
| 200 | Paratransit/Dial-A-Ride (e.g., The RIDE) |
| SHARED | |
| 69 | Bike-share - standard bicycle |
| 70 | Bike-share - electric bicycle |
| WALK | |
| 1 | Walk (or jog/wheelchair) |
| OTHER | |
| 5 | Other |
| 75 | Other |
| 104 | Other |
| 995 | Missing Response |
| DRIVE is an intermediate group only and does not appear directly as a linked trip mode. It is further resolved into SOV, HOV2, or HOV3 in Step 2 based on vehicle occupancy. | |
Step 2: Apply the mode hierarchy. Given the set of mode groups collected in Step 1, TICTOC walked down the priority-ordered hierarchy in Table 14 and assigned the linked trip the first mode group present in the set. This meant that a more “significant” mode always took precedence: a trip that included any transit segment would be classified as LOCAL or REGIONAL transit regardless of how many walk segments accompanied it, and a long-distance trip would outrank a local transit trip.
The one exception to a simple group-wins rule is DRIVE. When DRIVE was present in the mode group set, the final linked_trip_mode (SOV, HOV2, or HOV3) was determined by the maximum number of travelers (num_travelers) reported across all constituent unlinked trips: a single traveler yields SOV, two travelers yield HOV2, and three or more yield HOV3.
If all mode_n values were missing across every segment of the linked trip, the linked_trip_mode was set to missing. This is the case for incomplete survey responses where no mode information is available. Analysts can filter out these records when analyzing mode share or apply imputation rules as needed; these records are unweighted and therefore automatically excluded from weighted analyses.
| Step 2: Linked trip mode hierarchy | |
| TICTOC assigns the first matching mode group present in the set | |
| Linked Trip Mode | Priority (1 = highest) |
|---|---|
| SCHOOLBUS | 1 |
| LONGDIST | 2 |
| REGIONAL | 3 |
| LOCAL | 4 |
| HOV3 (3 or more person occupancy vehicle) | 5 |
| HOV2 (2-person occupancy vehicle) | 6 |
| SOV (Single-occupancy vehicle) | 7 |
| BIKE | 8 |
| PERSONAL MOBILITY | 9 |
| TNC | 10 |
| SHARED | 11 |
| WALK | 12 |
| OTHER | 13 |
| HOV3, HOV2, and SOV all derive from the DRIVE mode group. The split is determined by the maximum num_travelers across all constituent unlinked trips: 1 = SOV, 2 = HOV2, 3+ = HOV3. The priority order is project-configurable. | |
Example. Consider a linked transit trip consisting of three unlinked segments: a walk to the bus stop (mode_n value 1 → WALK), a local bus ride (mode_n value 23 → LOCAL), and a walk from the stop to the destination (mode_n value 1 → WALK). The mode group set is {WALK, LOCAL}. TICTOC checks the hierarchy from the top: SCHOOLBUS — not present; LONGDIST — not present; REGIONAL — not present; LOCAL — present. The linked trip is assigned mode LOCAL.
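The two steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the TICTOC implementation: the crosswalk below is a small subset of Table 13, the segment layout (a "modes" list and a "num_travelers" value per segment) is hypothetical, and DRIVE is placed in the hierarchy slot that HOV3, HOV2, and SOV occupy in Table 14.

```python
# Partial crosswalk taken from Table 13 (subset only, for illustration).
MODE_GROUP = {
    1: "WALK", 2: "BIKE", 6: "DRIVE", 23: "LOCAL",
    24: "SCHOOLBUS", 41: "LONGDIST", 49: "TNC", 58: "REGIONAL",
}

# Table 14 order, with DRIVE standing in for the HOV3/HOV2/SOV rows.
HIERARCHY = [
    "SCHOOLBUS", "LONGDIST", "REGIONAL", "LOCAL", "DRIVE", "BIKE",
    "PERSONAL MOBILITY", "TNC", "SHARED", "WALK", "OTHER",
]


def linked_trip_mode(segments):
    """Assign one mode to a linked trip from its unlinked segments.

    segments: list of dicts with 'modes' (raw mode_n values, None when
    unpopulated) and 'num_travelers'. Both keys are assumed names.
    """
    # Step 1: collect the unique mode groups across all segments.
    groups = {MODE_GROUP[m] for seg in segments for m in seg["modes"] if m is not None}
    if not groups:
        return None  # every mode_n value was missing
    # Step 2: first group in priority order that is present in the set.
    winner = next(g for g in HIERARCHY if g in groups)
    if winner != "DRIVE":
        return winner
    # DRIVE is resolved by the maximum occupancy across segments.
    max_travelers = max(seg["num_travelers"] for seg in segments)
    if max_travelers >= 3:
        return "HOV3"
    return "HOV2" if max_travelers == 2 else "SOV"


walk_bus_walk = [
    {"modes": [1], "num_travelers": 1},
    {"modes": [23], "num_travelers": 1},
    {"modes": [1], "num_travelers": 1},
]
print(linked_trip_mode(walk_bus_walk))
```

Running the sketch on the walk, local bus, walk example reproduces the LOCAL assignment described above. A raw value outside the partial crosswalk raises a KeyError here; a full implementation would map every value in Table 13.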
Linked Trip Purpose Assignment
TICTOC assigned a single destination purpose to each linked trip from the final unlinked segment’s d_purpose. Because the intermediate segments of a transit journey carry d_purpose = "change mode" by convention, the last segment’s destination purpose reflects the true trip destination — where the traveler actually ended up and why.
Origin purpose on the linked trip was taken from the first segment’s o_purpose, following the same convention used for unlinked trips.
As a result, analysts working with trip_linked can use d_purpose and o_purpose directly for trip-purpose summaries without needing to filter out “change mode” records, which are an artifact of the unlinked segment structure and do not appear on linked trip records.
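The first-segment and last-segment rule can be sketched as follows. The depart_time field used for ordering is a hypothetical name, and the purpose labels are illustrative; only o_purpose and d_purpose come from the text.

```python
def linked_trip_purposes(segments):
    """Return (o_purpose, d_purpose) for a linked trip.

    Origin purpose comes from the first segment and destination purpose
    from the last, after ordering segments by an assumed depart_time field.
    """
    ordered = sorted(segments, key=lambda seg: seg["depart_time"])
    return ordered[0]["o_purpose"], ordered[-1]["d_purpose"]


segments = [
    {"depart_time": 2, "o_purpose": "change mode", "d_purpose": "change mode"},  # bus ride
    {"depart_time": 1, "o_purpose": "home", "d_purpose": "change mode"},         # walk to stop
    {"depart_time": 3, "o_purpose": "change mode", "d_purpose": "work"},         # walk from stop
]
o_purpose, d_purpose = linked_trip_purposes(segments)
print(o_purpose, d_purpose)
```

Neither returned value is "change mode", which is why the linked trip table can be summarized by purpose without extra filtering.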
Tour Organization
A tour is a sequence of trips that begins and ends at the same anchor location. Most tours are home-based – the traveler departs from home, makes one or more stops, and returns home. At-work subtours begin and end at the workplace (for example, leaving work for lunch and returning). Open-jawed tours occur when a travel day begins or ends away from home, so that the full home-to-home circuit is not completed within the observed day.
TICTOC organized all of a person’s daily trips into tours by:
- Identifying anchor locations (departures from and returns to home, or to work for subtours)
- Grouping intermediate trips into the tour they belong to
- Scoring candidate primary destinations based on activity duration, purpose, and trip characteristics using configurable scoring functions (see Section 4.5.4.1 below).
- Assigning a tour purpose based on the primary destination
- Identifying sub-tours within parent tours
- Classifying each person’s daily activity pattern (e.g., “mandatory” for days with work/school tours, “non-mandatory” for discretionary travel only, “home” for days with no travel)
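The first two steps in the list above (finding anchors and grouping trips) can be illustrated with a simplified sketch. It assumes each trip is an (origin, destination) pair of location labels in time order, closes a tour whenever a trip arrives at the anchor, and flags a tour as open-jawed when it does not both start and end there. Subtour detection, scoring, and day-pattern classification are not covered, and the function names are hypothetical.

```python
def split_into_tours(trips, anchor="home"):
    """Group an ordered list of (origin, destination) trips into tours."""
    tours, current = [], []
    for trip in trips:
        current.append(trip)
        if trip[1] == anchor:      # arriving at the anchor closes the tour
            tours.append(current)
            current = []
    if current:                    # day ended away from the anchor
        tours.append(current)
    return tours


def is_open_jawed(tour, anchor="home"):
    """A tour is open-jawed when it does not both start and end at the anchor."""
    return tour[0][0] != anchor or tour[-1][1] != anchor


closed_day = [("home", "work"), ("work", "lunch"), ("lunch", "work"), ("work", "home")]
open_day = [("work", "grocery"), ("grocery", "home")]

closed_tours = split_into_tours(closed_day)
open_tours = split_into_tours(open_day)
print(len(closed_tours), is_open_jawed(closed_tours[0]))
print(len(open_tours), is_open_jawed(open_tours[0]))
```

Calling the same function on the trips inside a home-based tour with anchor="work" would isolate a candidate at-work subtour, though the source does not state that TICTOC works this way.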
Figure 4 illustrates how TICTOC organized unlinked trips into tours and subtours for a closed and an open-jawed example.
Home-based work tour with at-work subtour:
%%{init: {"theme":"default","flowchart":{"htmlLabels":true,"curve":"basis"}}}%%
flowchart LR
HOME(("Home")) -- "Trip 1: Car" --> WORK["Work"]
WORK -- "Trip 2: Walk" --> LUNCH["Lunch"]
LUNCH -- "Trip 3: Walk" --> WORK
WORK -- "Trip 4: Car" --> HOME
classDef household fill:#024D5F,stroke:#013845,color:#ffffff,font-family:Inter,stroke-width:2.75px,font-weight:bold,fill-opacity:0.90
classDef person fill:#0D7993,stroke:#085C70,color:#ffffff,font-family:Inter,stroke-width:2.75px,font-weight:bold,fill-opacity:0.90
classDef tour fill:#FDD835,stroke:#C2A200,color:#000000,font-family:Inter,stroke-width:2.75px,font-weight:bold,fill-opacity:0.90
linkStyle default stroke:#475569,stroke-width:2.4px
class HOME household
class WORK person
class LUNCH tour
Open-jawed tour (day begins away from home):
%%{init: {"theme":"default","flowchart":{"htmlLabels":true,"curve":"basis"}}}%%
flowchart LR
WORK(("Work<br/>(day starts here)")) -- "Trip 1: Car" --> GROCERY["Grocery"]
GROCERY -- "Trip 2: Car" --> HOME(("Home"))
classDef household fill:#024D5F,stroke:#013845,color:#ffffff,font-family:Inter,stroke-width:2.75px,font-weight:bold,fill-opacity:0.90
classDef person fill:#0D7993,stroke:#085C70,color:#ffffff,font-family:Inter,stroke-width:2.75px,font-weight:bold,fill-opacity:0.90
classDef tour fill:#FDD835,stroke:#C2A200,color:#000000,font-family:Inter,stroke-width:2.75px,font-weight:bold,fill-opacity:0.90
linkStyle default stroke:#475569,stroke-width:2.4px
class HOME household
class WORK person
class GROCERY tour
Tour Purpose Scoring and Primary Destination
Tour purpose was determined by the primary destination — the most “important” stop on the tour.
Because a tour includes multiple trips, one destination was selected as the representative stop, and its purpose became the tour purpose.
For tours with mandatory travel (work or school), primary-destination selection was straightforward: the mandatory stop took precedence.
Complexity was higher for discretionary tours. For example, a tour might include two shopping stops (e.g., grocery store and auto shop), where purpose alone did not clearly identify which stop was primary.
TICTOC resolved this using a weighted penalty method:
- Each candidate destination was scored using a decay function of duration, stratified by purpose.
- The destination with the minimum penalty score was selected as primary.
- Duration was defined as:
- time spent at the destination (dwell time), plus
- duration of the preceding trip, plus
- duration of the subsequent trip.
This approach gave more favorable scores to destinations reached via longer trips or with longer dwell times, reflecting greater relative importance within the tour.
Purpose priority was built into scoring:
- Mandatory destinations (work, school) received more favorable base scores than discretionary activities, ensuring a work or school stop was selected whenever present.
- Discretionary destinations were differentiated by the duration-based decay function, with scoring functions stored in project-configurable files delivered with the data.
The assigned tour purpose was taken from the primary destination’s purpose category (e.g., “work,” “school,” “shop”).
At-work subtours — sequences that departed from and returned to the workplace within a parent tour — were identified separately and assigned their own tour purpose using the same scoring logic.
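The selection rule can be sketched as below. The actual scoring functions live in the project-configurable files delivered with the data, so everything numeric here is an assumption: the exponential decay form, the per-purpose scales, and the tier offset that keeps mandatory stops ahead of discretionary ones. Field names such as dwell_min are also hypothetical.

```python
import math

MANDATORY = {"work", "school"}
# Hypothetical decay scales in minutes, one per purpose category.
DECAY_SCALE_MIN = {"work": 120.0, "school": 120.0, "shop": 60.0, "meal": 45.0}


def penalty(stop):
    """Lower is better: longer total duration decays the penalty toward the tier floor."""
    duration = stop["dwell_min"] + stop["prev_trip_min"] + stop["next_trip_min"]
    decay = math.exp(-duration / DECAY_SCALE_MIN[stop["purpose"]])  # in (0, 1]
    tier = 0.0 if stop["purpose"] in MANDATORY else 2.0  # mandatory always scores lower
    return tier + decay


def primary_destination(stops):
    """Pick the candidate destination with the minimum penalty."""
    return min(stops, key=penalty)


shopping_tour = [
    {"name": "grocery", "purpose": "shop", "dwell_min": 20, "prev_trip_min": 10, "next_trip_min": 5},
    {"name": "auto shop", "purpose": "shop", "dwell_min": 60, "prev_trip_min": 15, "next_trip_min": 15},
]
print(primary_destination(shopping_tour)["name"])
```

In the two-stop shopping example, the auto shop has the longer combined duration and therefore the lower penalty. Adding any work stop to the list would make it primary regardless of duration, because the mandatory tier sits strictly below the discretionary tier in this sketch.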
Additional TICTOC Outputs
TICTOC appended model-facing fields to the household, person, day, and trip tables, and produced new linked trip and tour tables. Key additions include:
- Daily activity pattern (daily_activity_pattern) on the day table, classifying each person-day as mandatory, non-mandatory, or home
- Tour identifiers (tour_id, tour_num) on trips and the tour table
- Linked trip identifiers (linked_trip_id) connecting unlinked trip segments to their linked trip record
- Stop counts on tours, indicating the number of intermediate stops
- Joint travel indicators identifying which trips were taken with other household members
- Escorting attributes identifying trips where one household member accompanies another
- Imputation flags distinguishing reported trips from imputed joint and school trips (see Section 4.6 for details)
- Summary diagnostics documenting imputation rates, tour distributions, and data quality metrics
4.6 Reference: Flags and Classifications
The following tables provide reference definitions for flags and classification variables used throughout the delivered tables.
Trip Flag Reference
Table 15 summarizes the main trip-level flags included in the delivered trip table.
| Trip-level flags in the delivered trip table | |||
| Flag | Values | Meaning | Filtering Guidance |
|---|---|---|---|
| browser | 0/1 | Trip created via browser survey (not GPS) | Exclude for GPS-quality analysis |
| added_trip | 0/1 | Trip manually added by analyst or participant | No GPS trace; OD-routed distance |
| split_loop | 0/1 | Trip created by splitting a loop trip | Original loop no longer exists |
| unlinked_trip | 0/1 | Trip is a segment of a transit journey | Use trip_linked for O-D analysis |
| is_primary_leg | 0/1 | Highest-priority mode leg in a linked trip | Use to avoid double-counting linked trips |
| is_access / is_egress | 0/1/995 | Role in a linked transit trip (995 = not applicable) | -- |
| is_synthetic_transit_leg | 0/1 | Placeholder leg for missing access/egress | Distance and duration are NA |
| speed_flag | 0/1 | Speed exceeds plausible threshold for mode | Review or exclude |
| teleport | 0/1 | Gap >= 250m between destination and next trip's origin | May indicate missing trip |
| copied_from_proxy | 0/1 | Trip record copied from a proxy reporter | Same trace as reporter's trip |
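As an illustration of the teleport rule in Table 15, the sketch below computes the gap between one trip's destination and the next trip's origin and compares it with the 250 m threshold. It assumes the gap is a haversine (great-circle) distance on a sphere of mean Earth radius; the source does not state which distance measure or radius the processing used, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; an assumption, not from the source


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def teleport_flag(destination, next_origin, threshold_m=250.0):
    """Return 1 when the gap to the next trip's origin is at least threshold_m."""
    gap = haversine_m(destination[0], destination[1], next_origin[0], next_origin[1])
    return int(gap >= threshold_m)


# About 111 m apart (0.001 degrees of latitude): below the threshold.
print(teleport_flag((42.360, -71.060), (42.361, -71.060)))
# About 334 m apart (0.003 degrees of latitude): flagged.
print(teleport_flag((42.360, -71.060), (42.363, -71.060)))
```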
TICTOC Imputation Flags
Table 16 summarizes the additional TICTOC-specific fields used to identify imputed and coordinated records.
| TICTOC-specific flags and identifiers | |
| Field | Description |
|---|---|
| imputed_record_type | Indicates whether the trip is reported (0), imputed as a joint trip, or imputed as a school trip |
| imputed_host_trip | For joint trip imputations, the trip_id of the household member's trip that served as the basis for the imputed record |
| imputed_joint_trip | Flag indicating whether this trip was created through joint trip imputation |
| joint_trip_id | Identifier grouping household members who traveled together on the same trip |
| daily_activity_pattern | Person-day classification: mandatory, non-mandatory, or home |
| These fields allow analysts to distinguish reported travel from imputed travel and to identify joint-travel episodes. | |
Joint Travel Taxonomy
TICTOC classifies joint travel at both the trip and tour level:
- Non-joint: Trip or tour made by the person alone
- Partially joint: Some but not all segments of the tour include another household member
- Fully joint: All segments of the tour are shared with another household member
- Joint tour participants: Identifiers linking all household members sharing a tour
- Escorting: Trips where the primary purpose is to transport another household member (e.g., driving a child to school); further classified by whether the escort makes a dedicated round trip or chains the escort with other activities
4.7 Delivered Data Products
Table 17 summarizes the delivered tables and their units of observation.
| Delivered tables | |||
| Table | Records | Unit of Observation | Source |
|---|---|---|---|
| Household | One per household | Household | Survey + processing |
| Person | One per person | Person | Survey + processing |
| Day | One per person per travel day | Person-day | Survey + processing + TICTOC |
| Vehicle | One per household vehicle | Vehicle | Survey |
| Trip | One per unlinked trip (includes imputed joint and school trips) | Trip | Survey + processing + TICTOC |
| Trip Linked | One per linked journey | Linked trip | TICTOC |
| Location | GPS trace points per trip | Location point | Survey app + processing |
| Tour | One per tour | Tour | TICTOC |
| Joint Tour Participant | One per person per joint tour | Person-tour | TICTOC |
All tables are accompanied by a codebook listing every variable, its data type, and its value labels (see Section 7). Weighted versions of these tables are produced separately by the weighting process and documented in Section 5.
For questions about specific variables, processing decisions, or data quality metrics for your study, please contact the project team.
5 Weighting
This section summarizes the weighting and expansion procedures used in the Massachusetts Travel Study dataset. The goal of weighting is to expand the survey sample so that it represents the full resident population of Massachusetts.
The Massachusetts Travel Study leveraged two related weighting workflows. The standard weights represent travel behavior on a typical weekday, using Monday through Thursday travel. The day-of-week weights represent travel behavior on each day of the week, including Friday, Saturday, and Sunday. Both workflows used the same general approach, but they differed in the records included, the weighting geographies, and the way the final weights should be used in analysis.
To produce statistics that represent an entire population without surveying every household or individual, survey researchers assign weights to each completed observation. In household travel surveys, the survey weight indicates how many people, households, days, or trips in the population a given respondent or record is estimated to represent. By applying these weights, analysts can generate regional estimates even when the sample is only a small fraction of the full population.
5.1 Overview of Weighting Goals
The weighting process aligns weighted survey estimates with external population totals and distributions across key household, person, day, and trip characteristics. Weighting corrects for differential sampling, differences in survey completion across demographic groups, and systematic differences in trip reporting that arise from the method respondents used to report their travel, such as smartphone app, web diary, or call center.
For the Massachusetts Travel Study, this process produced two sets of final weights. The standard weighting process expanded the survey sample to represent travel on an average weekday across Monday, Tuesday, Wednesday, and Thursday. The day-of-week weighting process expanded the survey sample separately for each day across Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday. The day-of-week weights are most useful when an analysis explicitly depends on the day of week, such as comparing weekday and weekend travel or estimating travel totals for a specific day.
The day-of-week workflow built on the standard weighting workflow. Both workflows began with initial expansion, adjusted household weights to demographic targets, accounted for day-pattern reporting differences by diary platform, derived person, day, and trip weights, and then applied trip-level adjustments. The main difference was that the day-of-week workflow repeated the relevant weighting steps separately by day, so the weighted records represent Monday travel, Tuesday travel, Saturday travel, and so on, rather than one average weekday.
Table 18 summarizes the practical difference between the two weighting workflows.
| Dimension | Standard weighting | Day-of-week weighting |
|---|---|---|
| Travel days represented | Monday through Thursday as an average weekday | Monday through Sunday, weighted separately by day |
| Recommended use | Default weighted workflow for typical weekday summaries and most standard reporting | Day-specific, weekday/weekend, and weekend travel analysis |
| Geographic controls | Custom client-defined weighting zones developed for the project | Broader Boston / Not Boston weighting groups |
| Completion basis | Complete eligible weekday travel days | Complete travel data for the specific day being weighted |
| Day weights | Person weights are divided across complete eligible weekdays | Day weights are equal to the person weight for that day |
| Trip weights | Trip totals represent an average weekday | Trip totals represent a specific day of week |
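The "Day weights" row of Table 18 can be made concrete with a small sketch. It assumes that "divided across" means an equal split of the person weight over that person's complete eligible weekdays; the source does not spell out the split, and the function name is hypothetical.

```python
def standard_day_weights(person_weight, complete_weekdays):
    """Split a person weight evenly over complete Monday-Thursday travel days.

    Under the day-of-week workflow no split is needed: the day weight
    equals the person weight estimated for that specific day.
    """
    if not complete_weekdays:
        return {}  # no eligible day, so no positive standard day weight
    share = person_weight / len(complete_weekdays)
    return {day: share for day in complete_weekdays}


weights = standard_day_weights(120.0, ["Mon", "Tue", "Wed"])
print(weights)
```

Under this assumption the day weights for a person sum back to the person weight, which is what lets standard day-level totals represent one average weekday rather than several.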
5.2 What the Weights Represent
Across all steps, the weighting process produces final weights at multiple analytic levels. These weights allow the survey records to represent households, people, person-days, trips, linked trips, and tours in the study area.
- Household weight: expands each surveyed household to represent households in the study area.
- Person weight: expands each person to represent the population of persons.
- Day weight: expands each complete person-day to represent daily travel, with the interpretation depending on whether the standard or day-of-week weighting workflow is used.
- Unlinked trip weight: expands individual trip segments, including transit access, transfer, and egress legs.
- Linked trip weight: expands complete trips between an origin and destination, with intermediate transit transfers combined into a single trip.
- Tour weight: expands sequences of linked trips that begin and end at the same location.
For this study, the standard final weights represent a typical weekday based on Monday through Thursday travel. Standard day weights represent average weekday person-days, and standard trip, linked trip, and tour weights represent travel on an average weekday.
When the analytic question is explicitly about differences across Monday through Sunday, use the alternate day-of-week weighting workflow described in Section 16. The day-of-week weights produce a separate set of weights for each day of the week, so weighted estimates can represent Monday travel, Tuesday travel, weekend travel, or other day-specific comparisons.
The final dataset may contain weights equal to zero. When a weight is equal to zero, it means that the record is present in the delivered data, but was not eligible to receive that particular weight.
Records may receive zero weights for the following reasons:
- Partially complete records. For example, if a household participated for seven days but only provided three days of complete diary data, the incomplete days would be retained in the delivered data but would not receive positive day weights.
- For households with children, days without complete proxy-reported child travel. For the standard weights, households with children needed complete reported travel for children on the weighted travel day. Some additional household days involving children may therefore receive zero standard weights if they do not meet the standard completion rules. The day-of-week weights use a relaxed child-completion rule because children’s travel was proxy-reported for only one day.
- For standard weights, days outside of the standard “typical weekday” definition. The standard weights represent typical weekday travel based on Monday through Thursday. Friday, Saturday, and Sunday records are therefore not eligible to receive positive standard day, trip, linked trip, or tour weights.
- For day-of-week weights, day-specific eligibility. The day-of-week weights are assigned separately for each day of the week. A record with a zero standard weight may still receive a positive day-of-week weight if it meets the completion criteria for that specific day.
Analysts should treat zero weights as specific to the weight being used. A zero value does not necessarily mean that the record is invalid or unusable for all analyses.
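In practice this means choosing the weight first and then restricting to records with a positive value for that weight. The sketch below shows the pattern with hypothetical field names (day_weight for the standard weight, day_weight_dow for the day-of-week weight); the delivered codebook defines the real names.

```python
# Hypothetical day records; the two weight field names are assumptions.
days = [
    {"day_id": 1, "dow": "Tue", "day_weight": 150.0, "day_weight_dow": 310.0},
    {"day_id": 2, "dow": "Sat", "day_weight": 0.0, "day_weight_dow": 290.0},   # outside Mon-Thu
    {"day_id": 3, "dow": "Wed", "day_weight": 0.0, "day_weight_dow": 0.0},     # incomplete day
]


def weighted_count(records, weight_field):
    """Return (unweighted n, weighted total) over records eligible for this weight."""
    eligible = [r for r in records if r[weight_field] > 0]
    return len(eligible), sum(r[weight_field] for r in eligible)


print(weighted_count(days, "day_weight"))
print(weighted_count(days, "day_weight_dow"))
```

The Saturday record drops out of the standard summary but contributes to the day-of-week summary, which is the record-level behavior described above.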
5.3 Inputs to Weighting
The Massachusetts Travel Study used two primary inputs in the weighting process: survey data and population target data. These same general inputs supported both the standard weights and the day-of-week weights, but the eligible survey records, weighting geographies, and control totals differed between the two sets of weights.
Survey Data
The survey data consisted of cleaned household, person, day, and trip records that met the completion criteria for weighting. The records eligible for weighting depended on whether the standard weights or day-of-week weights were being created.
For the standard weights, households were included if they provided complete data for at least one Monday, Tuesday, Wednesday, or Thursday travel day. These records support estimates of travel on an average weekday.
For the day-of-week weights, survey records were evaluated separately for each day of the week. A household was included for a given day only when it provided complete data for that specific day. As a result, the set of records that receive positive day-of-week weights can differ across Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.
Before weighting, missing demographic values needed for weighting were imputed where possible. These included income, gender, race, and ethnicity. The imputed values were used to reduce missingness in the demographic control variables used during weighting.
Population Target Data
The target data provided the household and person control totals used to adjust the survey records to the study area population. Demographic weighting targets were developed from the 2023 ACS 1-year Public Use Microdata Sample (PUMS). Selected auxiliary inputs used in imputation, including block-group income distributions, were drawn from 2023 ACS 5-year data.
The target data provided total household and population counts for each weighting geography, as well as detailed demographic distributions used as control totals in weighting. These controls included household characteristics, person characteristics, and selected travel-related controls where appropriate.
Weighting Geographies
The standard weights and day-of-week weights used different weighting geographies.
The standard weights used custom weighting zones developed for this study (Figure 5). These zones were based on MPO geographies, with smaller areas grouped where needed to maintain enough sample for stable weighting. The weighting zones were designed to balance the need for geographic specificity with the need for stable estimation across a range of demographic and travel behavior targets.
The day-of-week weights used broader geographic groups because estimating weights separately for each day reduces the available sample size. For this reason, the day-of-week weights were developed using a simpler Boston / Not Boston geography (Figure 6). In the weighting memo, the “Boston” group is described as the four inner-most commuter rings; the remaining areas are included in the “Not Boston” group.
The practical implication is important for analysis. Standard weights are calibrated to the custom weighting zones used for the typical weekday workflow. Day-of-week weights are calibrated to broader Boston / Not Boston geographies by day. Estimates are generally more stable when summarized at or above the geography used in weighting. Estimates for smaller geographies, or geographies that cut across weighting zones, should be interpreted with additional caution and should be accompanied by checks of unweighted sample size and weight variability (e.g., standard errors, design effects, or effective sample size).
Targets
Targets are the specific demographic and household distributions that the weighting procedure seeks to align between the survey sample and the underlying population estimates. For the Massachusetts Travel Study, targets were defined using ACS PUMS data to promote statistical representativeness across the weighting geographies.
Each target represents a key dimension of the study area’s population and travel behavior that is important for accurate expansion of survey data to reflect the total population. Target variables span household characteristics, person-level attributes, and selected travel-related controls. At the highest level, the weighting process was constrained to match the total number of households and total number of persons in the study area and within the relevant weighting geographies.
- Total households in the study area: approximately 2,816,000
- Total persons in the study area: approximately 6,760,613
The standard and day-of-week workflows use similar target concepts, but some day-of-week categories are combined to improve stability after the sample is split by day. For example, the day-of-week process uses broader geographic controls and combines selected target levels where the day-specific sample is smaller.
Household-level targets
Table 19 summarizes the household-level target categories used in the two weighting workflows.
| Variable | Standard weighting categories | Day-of-week weighting categories |
|---|---|---|
| Household size | 1 person; 2 people; 3 people; 4 people; 5 people or more | Same as standard weighting |
| Income | Under $25,000; $25,000-$49,999; $50,000-$74,999; $75,000-$99,999; $100,000-$199,999; $200,000 or more | Same as standard weighting |
| Workers | 0 workers; 1 worker; 2 workers; 3 workers or more | 0 workers; 1 worker; 2 workers or more |
| Vehicles | No vehicles; at least one vehicle and fewer vehicles than drivers age 16 or older; vehicles greater than or equal to drivers | Same as standard weighting |
| Presence of children | 0 children; 1 or more children | Same as standard weighting |
| Total households | Total households by weighting geography | Total households by weighting geography and day |
Person-level targets
Table 20 summarizes the person-level target categories used in the two weighting workflows.
| Variable | Standard weighting categories | Day-of-week weighting categories |
|---|---|---|
| Gender | Male; female | Same as standard weighting |
| Age | Under 5; 5-15; 16-17; 18-24; 25-44; 45-64; 65 or older | Under 5; 5-17; 18-24; 25-44; 45-64; 65 or older |
| Worker status | Full-time worker; part-time worker; non-worker | Same as standard weighting |
| Commute mode | Work from home; walk; bike; transit; other (including auto); not applicable | Work from home; walk; bike; transit; other; not applicable |
| University student status | University student; not a university student | Same as standard weighting |
| Educational attainment | Some college education; no college education | Same as standard weighting |
| Race | African American; Asian Pacific; White; Other | Same as standard weighting |
| Ethnicity | Hispanic; Non-Hispanic | Same as standard weighting |
| Total persons | Total persons by weighting geography | Total persons by weighting geography and day |
Travel-related controls
The standard weights included a regional transit trip target to address overrepresentation of transit trips in the survey data. The day-of-week weights did not use the same transit-trip control target.
Combined Weighting Targets
The categories listed above summarize the household- and person-level controls used in weighting. Analysts can use them as a quick reference for the dimensions and levels at which the weighted survey was calibrated to known population totals.
In practice, the weighting controls do two things simultaneously:
- match the total households and total persons in each weighting zone group; and
- match the marginal distributions of these household and person characteristics within each weighting zone group.
The controls do not guarantee that every cross-classification of those characteristics is perfectly represented. For example, age and income may each match target distributions, while age by income may still reflect sampling variability.
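A toy example makes the distinction between marginal and joint agreement concrete. The sketch uses iterative proportional fitting (raking), a common way to match marginal targets; the source does not state which algorithm the weighting used, and the cell counts and targets below are invented for illustration.

```python
def rake(cells, row_targets, col_targets, iterations=100):
    """Scale cell totals so row and column sums match their targets."""
    w = dict(cells)
    for _ in range(iterations):
        for r, target in row_targets.items():
            total = sum(v for (ri, _), v in w.items() if ri == r)
            for key in w:
                if key[0] == r:
                    w[key] *= target / total
        for c, target in col_targets.items():
            total = sum(v for (_, ci), v in w.items() if ci == c)
            for key in w:
                if key[1] == c:
                    w[key] *= target / total
    return w


# Invented sample totals by age group and income group.
sample = {("young", "low"): 30.0, ("young", "high"): 10.0,
          ("old", "low"): 20.0, ("old", "high"): 40.0}
raked = rake(sample, {"young": 50.0, "old": 50.0}, {"low": 40.0, "high": 60.0})

young_total = raked[("young", "low")] + raked[("young", "high")]
low_total = raked[("young", "low")] + raked[("old", "low")]
odds_ratio = (raked[("young", "low")] * raked[("old", "high")]) / (
    raked[("young", "high")] * raked[("old", "low")])
print(round(young_total, 6), round(low_total, 6), round(odds_ratio, 6))
```

After raking, both sets of marginal totals hit their targets, but the association between age and income (the odds ratio) is still the one observed in the sample. If the population has a different age by income pattern, the raked cells will not reproduce it.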
Some target categories were simplified, combined, or selectively applied to maintain stable estimation in smaller geographies and to avoid over-constraining the weighting process. Analysts should therefore interpret these categories as the effective levels at which the survey was calibrated to known population totals.
Weighting targets define the population dimensions used to calibrate the survey. These controls make the marginal distributions (e.g., age, gender, income groups) in the weighted data match known population totals. However, there are important limitations:
1. Weighting improves representativeness only within defined categories.
Estimates are most reliable at the level of the weighting targets. More detailed breakdowns (e.g., finer income bins) were not controlled and may still reflect sampling variability or bias. In practice, targets define the finest level of safe aggregation.
2. Joint distributions are not guaranteed to match the population.
Weights align individual targets, not combinations of them. For example, age and race may each match population totals, but age x race may still be misrepresented. Be cautious with highly disaggregated cross-tabulations.
3. Non-targeted variables and small cells may be unstable.
Variables not included in weighting controls are not explicitly bias-corrected. Small or sparse groups remain unstable after weighting, especially when weights are large or variable. We recommend checking cell sizes and relative standard errors (RSEs) before interpreting results, especially when sample sizes are small.
4. Weighting does not correct measurement error.
Targets adjust who is represented, not what was reported. Misreporting or limitations in survey design (e.g., coarse mode categories) are not fixed through weighting.
Other useful diagnostics include the effective sample size, which reflects the equivalent number of equally weighted observations, and the design effect, which captures how weighting inflates variance (see Section 5.6.3 below).
Bottom line:
Weighting improves representativeness along specific dimensions, but it does not guarantee reliable estimates for all subgroups. Use targets as a guide to where estimates are most trustworthy.
5.4 Weighting Process
The flow chart below summarizes the full weighting workflow described in this section. The sections that follow explain each step in more detail.
%%{init: {"theme":"default","flowchart":{"htmlLabels":true,"curve":"basis"}}}%%
flowchart TD
Survey[("Survey Data<br/>(households, persons, days, trips)")]
Census[("Target Data<br/>(ACS PUMS and project controls)")]
Targets[["Household and Person Targets"]]
Base{"Base Weight Estimation"}
P1{"Round 1:<br/>Demographic Reweighting"}
DP{"Day-Pattern Modeling"}
DayTargets[["Day-Pattern Targets"]]
P2{"Round 2:<br/>Day-Pattern Reweighting"}
PersonDay{"Person and Day Weight Derivation"}
TripAdj{"Round 3:<br/>Trip Adjustment"}
FinalWeights[/"Final Household, Person,<br/>Day, Trip, Linked Trip,<br/>and Tour Weights"/]
Survey --> Base
Census --> Targets
Base --> P1
Targets --> P1
P1 --> DP
DP --> DayTargets
Targets --> P2
DayTargets --> P2
P2 --> PersonDay
PersonDay --> TripAdj
TripAdj --> FinalWeights
classDef source fill:#024D5F,stroke:#013845,color:#ffffff,font-family:Inter,stroke-width:2.5px,font-weight:bold,fill-opacity:0.92
classDef control fill:#1B9E77,stroke:#15785B,color:#ffffff,font-family:Inter,stroke-width:2.5px,font-weight:bold,fill-opacity:0.92
classDef process fill:#0D7993,stroke:#085C70,color:#ffffff,font-family:Inter,stroke-width:2.5px,font-weight:bold,fill-opacity:0.92
classDef weight fill:#695CB4,stroke:#4B4180,color:#ffffff,font-family:Inter,stroke-width:2.5px,font-weight:bold,fill-opacity:0.92
linkStyle default stroke:#475569,stroke-width:2.25px
class Survey,Census source
class Targets,DayTargets control
class Base,P1,DP,P2,PersonDay,TripAdj process
class FinalWeights weight
Base Weights
Weighting began with base weights, which reflected the probability that a household was included in the survey. For each sample segment, RSG calculated a base weight as the inverse of the probability of inclusion, which depended on both the probability of selection and the probability of response. For segment s with H_s total households and R_s responding households, the base weight can be understood as:
\[ w_s = \frac{H_s}{R_s} \]
Base weights provided the initial expansion from the sample to the population and served as the seed weights for subsequent rounds of weighting adjustments. For the day-of-week workflow, the same concept was applied separately by day. If a segment had a different number of complete records on Monday than on Tuesday, then the initial expansion could differ by day.
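The base-weight formula can be illustrated with a minimal Python sketch. The segment names and counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from the study.

```python
# Hypothetical segment counts: total households (H_s) and responding households (R_s).
segments = {
    "segment_a": {"total_households": 12000, "responding_households": 300},
    "segment_b": {"total_households": 5000, "responding_households": 250},
}

def base_weight(total_households, responding_households):
    """Return w_s = H_s / R_s: the number of households each respondent represents."""
    if responding_households <= 0:
        raise ValueError("a segment needs at least one responding household")
    return total_households / responding_households

base_weights = {
    name: base_weight(c["total_households"], c["responding_households"])
    for name, c in segments.items()
}

# Summing each segment's weight over its respondents recovers the segment totals.
expanded_total = sum(
    base_weights[name] * c["responding_households"] for name, c in segments.items()
)
```

In this toy example each respondent in segment_a stands for 40 households and each respondent in segment_b for 20, and the expanded total equals the 17,000 households across both segments.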
Round 1 Weighting: Adjusting for Demographic Bias
Round 1 weighting used PopulationSim to adjust base weights so that weighted survey estimates matched demographic control totals derived from ACS PUMS. PopulationSim performed constrained entropy maximization, adjusting household weights in the smallest way necessary to match a set of household- and person-level targets.
Entropy maximization is a statistical method used to adjust survey weights so that the weighted survey data matches known population totals, such as the number of households, adults, workers, or children in a region.
The key idea is simple: change the initial weights as little as possible while forcing the final weighted totals to match external control totals. Groups that were underrepresented in the sample receive higher weights, while groups that were overrepresented receive lower weights. This approach preserves the structure of the collected data while helping the survey reflect the population.
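To make the idea concrete, the sketch below uses iterative proportional fitting (raking), a classic procedure whose converged weights stay as close as possible, in the relative-entropy sense, to the seed weights while matching the marginal targets. It is a simplified stand-in for illustration only, not PopulationSim's algorithm, which balances household- and person-level controls together. The sample, the seed weights, and the control totals are all hypothetical.

```python
def rake(records, seed_weights, targets, iterations=50):
    """Adjust seed weights so weighted margins match targets.

    records: list of dicts mapping variable -> category.
    targets: {variable: {category: control total}}.
    """
    weights = list(seed_weights)
    for _ in range(iterations):
        for variable, category_totals in targets.items():
            # Current weighted total in each category of this variable.
            current = {category: 0.0 for category in category_totals}
            for record, weight in zip(records, weights):
                current[record[variable]] += weight
            # Scale each record so this variable's margins hit their targets.
            weights = [
                weight * category_totals[record[variable]] / current[record[variable]]
                for record, weight in zip(records, weights)
            ]
    return weights

# Hypothetical four-household sample with two control variables.
households = [
    {"size": "1", "vehicles": "0"},
    {"size": "1", "vehicles": "1+"},
    {"size": "2+", "vehicles": "0"},
    {"size": "2+", "vehicles": "1+"},
]
seed = [10.0, 10.0, 10.0, 10.0]
controls = {
    "size": {"1": 30.0, "2+": 70.0},
    "vehicles": {"0": 20.0, "1+": 80.0},
}
final_weights = rake(households, seed, controls)
```

Both margins are matched in the result, but nothing forces the size-by-vehicles cells to match any external cross-tabulation. This is the same limitation noted for joint distributions under the weighting targets.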
For standard weighting, this reweighting process was applied to the Monday-through-Thursday records together to represent an average weekday. For day-of-week weighting, weights were estimated to match the household- and person-level targets for each day of the week. In the end, a household could have different weights for different days, depending on which days of complete travel data were available and how those records fit the day-specific controls.
The output of Round 1 consisted of target-optimized household weights. These weights aligned the survey with demographic targets and served as inputs to the day-pattern adjustment described below.
Round 2 Weighting: Adjusting for Day-Pattern Bias
Survey trip rates differed across diary platforms, in part because smartphone app users tended to report more complete travel than online diary or call center respondents. To address this issue, RSG applied a day-pattern adjustment before finalizing household, person, and day weights.
RSG classified each person-day into one of three mutually exclusive day-pattern categories: made no trips, made mandatory trips, or made only non-mandatory trips. Mandatory trips are trips to work, work-related activities, school, or school-related activities. The day-pattern model estimated how likely each person-day was to fall into one of these categories after accounting for demographic characteristics and diary platform.
For standard weighting, the day-pattern adjustment represented the Monday-through-Thursday average weekday. For day-of-week weighting, the same general procedure was applied separately by day, with an additional day-of-week term in the model. The resulting day-pattern targets were added to a second PopulationSim run so that the final household weights accounted for both demographic targets and diary-platform reporting differences.
The table below shows the general direction of the day-pattern adjustment for the day-of-week workflow. For Monday through Thursday, the adjustment reduces the share of no-travel days for online diary and call center respondents and increases the share of days with reported travel. Online and call center diaries were collected Monday through Thursday, so no analogous adjustment is needed for Friday through Sunday trip records.
| Day | Day type | Call center before | Call center after | Online diary before | Online diary after | Smartphone |
|---|---|---|---|---|---|---|
| Mon | No trips | 26% | 20% | 22% | 13% | 12% |
| Mon | Made mandatory trips | 18% | 23% | 42% | 42% | 45% |
| Mon | Made only non-mandatory trips | 56% | 57% | 36% | 45% | 42% |
| Tue | No trips | 26% | 21% | 22% | 13% | 12% |
| Tue | Made mandatory trips | 18% | 23% | 44% | 43% | 48% |
| Tue | Made only non-mandatory trips | 55% | 56% | 35% | 44% | 40% |
| Wed | No trips | 25% | 20% | 21% | 12% | 11% |
| Wed | Made mandatory trips | 19% | 24% | 44% | 44% | 48% |
| Wed | Made only non-mandatory trips | 56% | 56% | 35% | 44% | 41% |
| Thu | No trips | 25% | 19% | 21% | 12% | 11% |
| Thu | Made mandatory trips | 21% | 27% | 43% | 43% | 47% |
| Thu | Made only non-mandatory trips | 54% | 54% | 36% | 45% | 41% |
Adjusting Person and Day Weights
After household weights were finalized, person and day weights were derived from the household weights. Person weights were created by assigning the household weight to each household member. Because the survey does not collect travel diaries from unrelated household members, unrelated persons received a person weight of zero and their weight was redistributed evenly among the remaining related household members.
Day weights were then assigned to complete person-days. This was one of the most important differences between the standard and day-of-week workflows. In standard weighting, a person with multiple complete eligible weekdays had their person weight divided across those complete days, so the resulting day records collectively represented that person’s average weekday contribution. In day-of-week weighting, the weights were already specific to a day of week, so the day weight was equal to the person weight for that day.
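The derivation can be sketched as follows. This is one reading of the text, assuming every member starts with the household weight and the weight of unrelated members is spread evenly over related members. The function names and example values are hypothetical.

```python
def person_weights(hh_weight, is_related):
    """Return one person weight per household member.

    is_related: one bool per member (False marks an unrelated member).
    """
    n_members = len(is_related)
    n_related = sum(is_related)
    if n_related == 0:
        return [0.0] * n_members
    # Each member starts at hh_weight; the weight of unrelated members is
    # redistributed evenly, so the household's person total is preserved.
    related_weight = hh_weight * n_members / n_related
    return [related_weight if related else 0.0 for related in is_related]

def standard_day_weights(person_weight, n_complete_days):
    """Standard workflow: split the person weight evenly across complete days."""
    return [person_weight / n_complete_days] * n_complete_days

# A four-person household with one unrelated member and a weight of 100.
weights = person_weights(100.0, [True, True, True, False])
# A related member with four complete weekdays.
day_weights = standard_day_weights(weights[0], 4)
```

Because the standard day weights for a person sum back to that person's weight, the weighted person and day totals should agree in the standard workflow. In the day-of-week workflow no division is applied, since the person weight is already specific to one day.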
Round 3 Weighting: Adjusting for Trip-Type Reporting Bias
The final weighting step corrected for under-reporting of specific trip types across diary platforms. Trip records were grouped into work, school, and other trip categories. For each trip type, RSG estimated a weighted model predicting the number of trips per person-day and used the model to calculate a trip adjustment factor.
The adjustment factor was applied to unlinked trip weights, using the final day weight as the starting point. The adjustment was designed to account for under-reporting of stops in the trip diary, such as a brief stop that a respondent forgot to record. For day-of-week weighting, the same adjustment concept was applied to the day-specific weights. Because online diary and call center records were only collected Monday through Thursday, the adjustment was relevant to those days and platforms.
The table below summarizes the trip adjustment factors used in the day-of-week workflow.
| Trip type | Online diary adjustment | Call center adjustment |
|---|---|---|
| Work trips | 1.62 | 1.51 |
| School trips | 1.87 | 1.00 |
| Other trips | 1.50 | 1.15 |
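
As an illustration of how such factors could be applied, the sketch below multiplies a day weight by the factor for the record's diary platform and trip type. The factors are copied from the table above. The platform and trip-type keys are hypothetical labels, and the default factor of 1.0 for smartphone records is an assumption, since the table lists no smartphone adjustment.

```python
# Trip adjustment factors from the day-of-week table above.
TRIP_ADJUSTMENT = {
    ("online", "work"): 1.62,
    ("online", "school"): 1.87,
    ("online", "other"): 1.50,
    ("call_center", "work"): 1.51,
    ("call_center", "school"): 1.00,
    ("call_center", "other"): 1.15,
}

def trip_weight(day_weight, platform, trip_type):
    """Start from the final day weight and apply the trip adjustment factor."""
    # Assumption: platforms not in the table (e.g., smartphone) get a factor of 1.0.
    factor = TRIP_ADJUSTMENT.get((platform, trip_type), 1.0)
    return day_weight * factor

online_work = trip_weight(100.0, "online", "work")
smartphone_work = trip_weight(100.0, "smartphone", "work")
```

With a day weight of 100, an online-diary work trip would carry a trip weight of about 162 in this sketch, while a smartphone trip would keep a weight of 100.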
After the unlinked trip weights were adjusted, linked trip weights and tour weights were calculated from the updated trip weights. Linked trip weights represent complete origin-to-destination trips, while tour weights represent sequences of linked trips that begin and end at the same location.
5.5 Weighted Totals
The final weights expand the survey to population totals at the household, person, day, trip, linked trip, and tour levels.
Table 23 shows the total weighted households, persons, person-days, trips, linked trips, and tours for the standard weekday workflow. These totals represent the overall scale of travel in the study area on an average weekday across Monday through Thursday.
Final Weighted Totals by Analysis Level
| Weight Level | Weighted Total |
|---|---|
| Household | 2,814,595.3 |
| Person | 6,759,611.8 |
| Day | 6,759,611.8 |
| Trip | 30,078,666.6 |
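
As a quick plausibility check, dividing the weighted trip total by the weighted day total gives the implied average trip rate for the standard weekday workflow. This assumes the Trip row refers to unlinked trips, which the table itself does not state.

\[ \frac{30{,}078{,}666.6}{6{,}759{,}611.8} \approx 4.45 \text{ trips per person-day} \]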
Table 24 summarizes the day-specific totals available for the alternate day-of-week workflow. These totals represent the scale of travel in the study area on each specific day of the week, with separate estimates for Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday travel.
Day-of-Week Weighted Totals
| Day | Weight Level | Weighted Total |
|---|---|---|
| Monday | Day | 6,751,877.4 |
| Monday | Trip | 30,625,904.0 |
| Tuesday | Day | 6,770,579.5 |
| Tuesday | Trip | 31,791,642.6 |
| Wednesday | Day | 6,765,954.5 |
| Wednesday | Trip | 32,235,056.1 |
| Thursday | Day | 6,764,262.0 |
| Thursday | Trip | 32,082,955.5 |
| Friday | Day | 6,745,710.1 |
| Friday | Trip | 33,560,235.3 |
| Saturday | Day | 6,739,056.6 |
| Saturday | Trip | 33,110,358.2 |
| Sunday | Day | 6,746,910.2 |
| Sunday | Trip | 27,489,425.3 |
5.6 Additional Guidance for Analysts
Choosing the Right Weight
Different analyses require different weight types. Analysts should select the weight that matches both the level of measurement and the weighting workflow.
For most typical weekday summaries, analysts should use the standard weights. Standard household, person, day, trip, linked trip, and tour weights are appropriate when the research question is about average weekday travel, especially for summaries that are not intended to distinguish Monday from Tuesday or weekday from weekend behavior.
Day-of-week weights should be used when the analytic question is explicitly about a specific day of week or about differences across days. Examples include comparing Saturday and Sunday travel, estimating Friday trip rates, or comparing weekday and weekend mode share. When using day-of-week weights, analysts should filter to the relevant day or group of days before applying the corresponding day-of-week weight. A Monday estimate should use Monday records and Monday weights; a Saturday estimate should use Saturday records and Saturday weights.
At each level, the same unit-matching principle applies:
- Household weights should be used when households are the unit of analysis or when studying household-level characteristics.
- Person weights should be used for demographic characteristics, person-level behaviors, and analyses where individuals, not days or trips, are the unit.
- Day weights should be used when analyzing person-day travel behavior, including day patterns, trip rates, and average daily travel.
- Trip weights should be used when analyzing trips.
- Linked trip weights should be used when analyzing complete origin-to-destination linked trips.
- Tour weights should be used when analyzing tours.
Using the wrong weight type can lead to biased estimates. For example, applying person weights to trip tables will underestimate total travel, while using day-of-week weights without filtering to the intended day can mix together records that represent different target days.
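The sketch below illustrates the unit-matching principle for a day-specific trip rate: filter both tables to the target day, sum trip weights over trip records, and divide by the sum of day weights over person-day records. The records and the field name travel_dow are hypothetical and are not the delivered column names.

```python
# Hypothetical person-day and trip records; "travel_dow" and the weight
# field names are illustrative, not the delivered column names.
days = [
    {"day_id": 1, "travel_dow": "Saturday", "day_weight": 120.0},
    {"day_id": 2, "travel_dow": "Saturday", "day_weight": 80.0},
    {"day_id": 3, "travel_dow": "Monday", "day_weight": 100.0},
]
trips = [
    {"day_id": 1, "travel_dow": "Saturday", "trip_weight": 120.0},
    {"day_id": 1, "travel_dow": "Saturday", "trip_weight": 120.0},
    {"day_id": 1, "travel_dow": "Saturday", "trip_weight": 120.0},
    {"day_id": 2, "travel_dow": "Saturday", "trip_weight": 80.0},
    {"day_id": 3, "travel_dow": "Monday", "trip_weight": 100.0},
]

def weighted_trip_rate(days, trips, dow):
    """Trips per person-day for one day of week: trip weights over day weights."""
    day_total = sum(d["day_weight"] for d in days if d["travel_dow"] == dow)
    trip_total = sum(t["trip_weight"] for t in trips if t["travel_dow"] == dow)
    return trip_total / day_total

saturday_rate = weighted_trip_rate(days, trips, "Saturday")
```

The denominator comes from the day table rather than the trip table so that person-days with no trips still count toward the rate.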
What the Weights Can and Cannot Correct
The weighting process corrects for several forms of bias:
- differences in sampling likelihood across geographies;
- differential response rates across demographic groups;
- reporting differences across diary platforms, including smartphone, online diary, and call center reporting; and
- under-reporting of specific trip types.
However, weighting cannot correct for every possible source of error. It cannot fully correct misreported or miscoded trip purposes, missing data not captured through imputation, recall errors unrelated to diary platform, GPS or routing errors, or sparse samples in very small geographies or rare population groups.
The day-of-week weights also have a specific limitation. Because the sample is split by day, each day-specific weighting run has fewer records than the standard Monday-through-Thursday workflow. The broader Boston / Not Boston geography and combined target categories improve stability, but they do not make all day-specific subgroup estimates equally reliable. Analysts should interpret highly granular day-of-week estimates with caution.
Design Effects and Effective Sample Size
Unequal weights reduce the statistical precision of estimates compared with a simple random sample of the same size. This reduction is summarized by the design effect (DEFF) and the effective sample size (ESS). DEFF reflects how much weight variability inflates variance; ESS reflects the size of an unweighted sample that would yield equivalent precision.
Because day-of-week weights are estimated separately by day, they may have larger variance than the standard weights, particularly for Friday, Saturday, Sunday, or small subgroups. As a rule of thumb, when DEFF exceeds 2.0, analysts should expect a noticeable loss of precision, especially when estimates are based on small subgroups where limited sample size and weight variability compound.
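A minimal sketch of these diagnostics, assuming Kish's approximation (DEFF = 1 + CV², ESS = n / DEFF) computed over positively weighted records. The reported values appear consistent with this form; for example, 1 + 1.18² ≈ 2.39 against a reported household DEFF of 2.38, a gap that rounding of the CV could plausibly explain. The exact formula behind the published diagnostics is not stated here, so treat this as an illustration.

```python
from statistics import mean, pstdev

def weight_diagnostics(weights):
    """Return CV, DEFF, and ESS for the positively weighted records."""
    positive = [w for w in weights if w > 0]
    n = len(positive)
    # Coefficient of variation of the weights (population standard deviation).
    cv = pstdev(positive) / mean(positive)
    # Kish's approximate design effect from unequal weighting.
    deff = 1.0 + cv ** 2
    # Effective sample size; equal to (sum of w)^2 / (sum of w^2).
    ess = n / deff
    return {"n": n, "cv": cv, "deff": deff, "ess": ess}

equal = weight_diagnostics([1.0, 1.0, 1.0, 1.0])
unequal = weight_diagnostics([0.0, 1.0, 3.0])
```

Equal weights give a DEFF of 1 and an ESS equal to the record count. In the second call, two positively weighted records with weights 1 and 3 behave like only 1.6 equally weighted records.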
Weight Quality Diagnostics by Analysis Level
| Weight Level | CV | DEFF | ESS |
|---|---|---|---|
| Household | 1.18 | 2.38 | 6,521.83 |
| Person | 1.22 | 2.48 | 11,917.58 |
| Day | 1.60 | 3.57 | 13,727.30 |
| Trip | 1.81 | 4.26 | 46,971.97 |
The standard diagnostics in Table 25 summarize the default weekday workflow. Table 26 shows the day-specific diagnostics for the alternate day-of-week weights. These values are especially helpful when comparing the relative stability of Monday-through-Thursday estimates with Friday, Saturday, or Sunday estimates.
Day-of-Week Weight Quality Diagnostics
| Day | Weight Level | CV | DEFF | ESS |
|---|---|---|---|---|
| Monday | Day | 1.36 | 2.85 | 4,461.10 |
| Monday | Trip | 1.44 | 3.08 | 17,038.78 |
| Tuesday | Day | 1.33 | 2.78 | 5,760.86 |
| Tuesday | Trip | 1.45 | 3.10 | 21,093.64 |
| Wednesday | Day | 1.39 | 2.93 | 5,444.14 |
| Wednesday | Trip | 1.56 | 3.42 | 19,229.54 |
| Thursday | Day | 1.37 | 2.88 | 5,253.48 |
| Thursday | Trip | 1.49 | 3.22 | 19,562.40 |
| Friday | Day | 1.37 | 2.88 | 3,557.98 |
| Friday | Trip | 1.44 | 3.08 | 16,416.02 |
| Saturday | Day | 1.36 | 2.86 | 3,574.00 |
| Saturday | Trip | 1.42 | 3.00 | 17,224.92 |
| Sunday | Day | 1.37 | 2.88 | 3,528.14 |
| Sunday | Trip | 1.44 | 3.06 | 13,759.76 |
Distribution of Weights
Weight variability differs by dataset level and by weighting workflow. Household and person weights reflect the demographic and geographic calibration process. Day weights reflect the way person weights are assigned to complete travel days. Trip weights inherit day-weight variability and incorporate trip-type adjustment factors.
In the day-of-week workflow, weight distributions also differ by day. Friday, Saturday, and Sunday generally have fewer records because online diary and call center travel data were collected Monday through Thursday, while Friday through Sunday records come from smartphone respondents. This smaller sample size contributes to larger day-of-week weights and greater uncertainty for some estimates. Figure 7 visualizes those differences across dataset levels.
Analysts should be cautious when conducting analyses in which a small number of high-weight observations dominate the estimates. This caution is especially important for day-specific estimates, small geographies, rare subgroups, and cross-tabulations with many categories.
Geographic Considerations and Small-Area Estimates
Because weighting was performed for specific weighting geographies, those are the geographies at which the weighted data are most internally consistent. For standard weighting, the relevant geography is the custom client-defined weighting zone structure. For day-of-week weighting, the relevant geography is the broader Boston / Not Boston structure.
This does not prevent analysts from summarizing the data to other geographies, but it does affect interpretation. Cities, towns, neighborhoods, corridors, and other analyst-defined geographies are not individually controlled unless they align with the weighting geographies. Weighted totals for those areas may not match external benchmarks, and estimates may be driven by a small number of high-weight records.
For fine-scale estimates, analysts should check the unweighted sample size, the number of households or persons contributing to the estimate, and the distribution of weights. In some cases, pooling multiple areas, pooling multiple days, or reporting estimates at a broader geography may be more appropriate.
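One simple check for a small-area or small-group estimate is the share of total weight carried by the few largest weights. The sketch below is a hypothetical helper, not part of the delivered tooling, and any threshold for concern is left to the analyst.

```python
def top_weight_share(weights, top_n=5):
    """Share of the total weight carried by the top_n largest positive weights."""
    positive = sorted((w for w in weights if w > 0), reverse=True)
    total = sum(positive)
    if total == 0:
        return 0.0
    return sum(positive[:top_n]) / total

# Hypothetical weights for records contributing to one small-area estimate.
area_weights = [10.0, 10.0, 10.0, 70.0]
share = top_weight_share(area_weights, top_n=1)
```

Here a single record carries 70 percent of the weight, so the estimate would largely reflect one respondent's travel.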
Small Population Groups
Rare population groups, rare travel behaviors, and highly specific day-by-geography combinations may have limited representation in the weighted data. This can include groups such as zero-vehicle households, transit commuters, active transportation users, university students, or weekend travelers in a small geography.
For these groups, analysts should consider pooling response categories, pooling across geographies or days where conceptually appropriate, reporting uncertainty measures, or using model-based estimation techniques. The day-of-week weights improve the ability to analyze daily variation in travel, but they do not eliminate the need to evaluate sample size and precision.
5.7 Summary
Weighting for the Massachusetts Travel Study dataset follows a structured and incremental process. Base weights correct for sample design. Round 1 adjustments correct for demographic nonresponse. Round 2 adjustments correct for day-pattern reporting bias. Round 3 adjustments correct for trip-type under-reporting. Together, these steps yield household, person, day, trip, linked trip, and tour weights for estimating population-level travel behavior.
The standard weights should be treated as the default workflow for typical weekday analysis. The day-of-week weights should be used when the research question depends on a specific day of week or on comparisons across days, including weekday/weekend analysis. In either workflow, analysts should match the weight to the analytic unit, use the geography and target structure as a guide to stable interpretation, and check sample size and weight variability before interpreting small or highly detailed estimates.
6 Dataset Overview
This section describes the prepared tables available for analysis and how they relate to one another. It uses the prepared hts object and the current settings.yml configuration.
6.1 Data Structure
Household travel survey data are hierarchical: although the primary sampling unit (see Section 2.1.2) is the household, the collected data also describe the behavior of individual persons. Participants who reported through the rMove smartphone app provided data across multiple days, so each of these persons can contribute several days of travel and daily activity data.
In the delivered dataset, this hierarchical structure shows up in the form of multiple tables that link to one another using stable identifiers (typically columns ending in _id). Figure 8 summarizes the relationship among the prepared tables.
6.2 Summary of Data Tables
Table 27 lists the prepared tables, their units of observation, and their primary identifiers.
| Table Name | Record Unit | Primary ID(s) | Weight Column | What's in the Table |
|---|---|---|---|---|
| Household (`hh`) | Household | `hh_id` | `hh_weight` | One record per household, with household-level attributes (e.g., sampling/strata fields and household characteristics) used for analysis and weighting. |
| Person (`person`) | Person | `person_id` | `person_weight` | One record per person in each household, including demographic attributes and person-level variables used for analysis and weighting. |
| Day (`day`) | Person-day | `day_id` | `day_weight` | One record per person-day (a single survey day for a person), used for day-based analysis such as trip rates and daily metrics. |
| Vehicle (`vehicle`) | Vehicle | `vehicle_id` | `hh_weight` | One record per household vehicle (when delivered), including vehicle identifiers and vehicle characteristics used in vehicle-based analysis and joins. |
| Location (`location`) | GPS point on a trip | `trip_id`, `collect_time` | NA | One record per place/location reference (when delivered), often used to store geocoded attributes or repeated location metadata linked to trips, tours, or activities. |
| Unlinked Trip (`trip_unlinked`) | Person-trip | `trip_id` | `trip_weight` | One record per unlinked trip segment (when delivered), typically representing each movement between stops; used for detailed mode/path and trip-chaining analysis. |
| Linked Trip (`trip_linked`) | Person-trip | `linked_trip_id` | `linked_trip_weight` | One record per linked trip (when delivered), typically aggregating unlinked segments into a single journey between primary origin and destination. |
| Tour (`tour`) | Person-tour | `tour_id` | `tour_weight` | One record per tour (when delivered), grouping trips into an out-and-back sequence anchored at home or a primary location; used for tour-based analysis. |
6.3 Trip Unit of Measure: Person-Trips
Understanding how travel events are represented in the dataset is essential for correctly interpreting trip and tour outputs. This section describes the structure of person-trip records, how shared travel is represented, and what that means for later analyses.
What is a Person-Trip?
Trips are represented at the person-trip level: each row is a single travel event made by a single person. If multiple household members traveled together, the shared movement appears as multiple records, one per participating person.
The guide’s default trip table is trip_unlinked. See the Analyst Handbook for examples that build from this table.
6.4 Record Counts
Table 29 summarizes the delivered record counts and the number of positively weighted records by table.
| Table | Records | Weight Column | Weighted Records | Percent Weighted |
|---|---|---|---|---|
| Household (`hh`) | 18,122 | `hh_weight` | 15,552 | 85.8% |
| Person (`person`) | 37,616 | `person_weight` | 29,560 | 78.6% |
| Day (`day`) | 134,187 | `day_weight` | 49,028 | 36.5% |
| Vehicle (`vehicle`) | 25,849 | `hh_weight` | 21,669 | 83.8% |
| Location (`location`) | 8,607,225 | NA | NA | |
| Unlinked Trip (`trip_unlinked`) | 468,018 | `trip_weight` | 200,120 | 42.8% |
| Linked Trip (`trip_linked`) | 419,469 | `linked_trip_weight` | 173,983 | 41.5% |
| Tour (`tour`) | 160,091 | `tour_weight` | 68,007 | 42.5% |
| Value Labels (`value_labels`) | 2,422 | NA | NA | |
| Variable List (`variable_list`) | 567 | NA | NA | |
6.5 Household Completion Status
The delivered MassDOT data include both complete and incomplete households. For most descriptive and inferential analyses, the recommended analytic universe is the set of households where hts$hh$is_complete == 1.
This household-level completeness rule is different from trip-level completion flags. In particular, trip_unlinked$is_complete describes trip-record completion or usability, not whether the household belongs in the complete-household analytic universe. When an analysis starts from person-, day-, trip-, vehicle-, or tour-level records, use the household table to define the complete-household universe and then carry that restriction down through hh_id.
Table 30 shows how many delivered records belong to complete versus incomplete households across the main prepared tables.
| Table | Household Completion Status | Records |
|---|---|---|
| hh | Complete household | 15,641 |
| hh | Incomplete household | 2,481 |
| person | Complete household | 31,255 |
| person | Incomplete household | 6,361 |
| day | Complete household | 96,370 |
| day | Incomplete household | 37,817 |
| trip_unlinked | Complete household | 411,573 |
| trip_unlinked | Incomplete household | 56,445 |
| trip_linked | Complete household | 366,186 |
| trip_linked | Incomplete household | 53,283 |
| tour | Complete household | 139,240 |
| tour | Incomplete household | 20,851 |
| vehicle | Complete household | 21,770 |
| vehicle | Incomplete household | 4,079 |
For lower-level analysis tables, the simplest workflow is:
- create `complete_hh_ids` from `hts$hh`
- filter households directly with `dplyr::filter(is_complete == 1)`
- filter lower-level tables with `dplyr::filter(hh_id %in% complete_hh_ids)`
- use trip-level `is_complete` only when the question is specifically about trip completion or trip usability
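
The same logic can be sketched in plain Python for readers who want to see the steps in order. The records below are toy stand-ins for the household and unlinked trip tables, and every value is made up.

```python
# Toy stand-ins for the household and unlinked trip tables; all values are made up.
hh = [
    {"hh_id": 1, "is_complete": 1},
    {"hh_id": 2, "is_complete": 0},
    {"hh_id": 3, "is_complete": 1},
]
trip_unlinked = [
    {"trip_id": 10, "hh_id": 1, "is_complete": 1},
    {"trip_id": 11, "hh_id": 2, "is_complete": 1},
    {"trip_id": 12, "hh_id": 3, "is_complete": 0},
]

# Step 1: define the complete-household universe from the household table.
complete_hh_ids = {row["hh_id"] for row in hh if row["is_complete"] == 1}

# Step 2: carry that restriction down to a lower-level table through hh_id.
trips_in_complete_hh = [t for t in trip_unlinked if t["hh_id"] in complete_hh_ids]
```

Trip 11 is dropped even though its own is_complete flag is 1, because its household is incomplete. Trip 12 is kept even though its own flag is 0, because the trip-level flag answers a different question.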
6.6 Data Types and Considerations
The dataset includes several kinds of variables that behave differently in analysis. Understanding these patterns helps avoid common mistakes in summaries and models.
Categorical Variables
Categorical variables store labels rather than magnitudes. They include binary fields, nominal fields (no inherent order), ordinal fields (with a natural order), and many count-like fields that are top-coded or otherwise treated as binned categories.
When you build a table, chart, or derived factor from a categorical variable, use codebook$value_labels as the source of truth for both labels and ordering.
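A minimal sketch of using value labels for both labelling and ordering. The rows, codes, and column names (variable, value, label, order) are hypothetical and may differ from the delivered value-label table.

```python
# Hypothetical value-label rows; codes and column names are assumptions.
value_labels = [
    {"variable": "diary_platform", "value": 3, "label": "Call center", "order": 3},
    {"variable": "diary_platform", "value": 1, "label": "Smartphone", "order": 1},
    {"variable": "diary_platform", "value": 2, "label": "Online diary", "order": 2},
    {"variable": "num_vehicles", "value": 0, "label": "0 vehicles", "order": 1},
]

def ordered_labels(variable, value_labels):
    """Return {stored value: label} for one variable, in codebook order."""
    rows = [r for r in value_labels if r["variable"] == variable]
    rows.sort(key=lambda r: r["order"])
    return {r["value"]: r["label"] for r in rows}

platform_labels = ordered_labels("diary_platform", value_labels)

# Apply the labels to stored codes; the dict's key order gives the display order.
codes = [2, 1, 1, 3]
labelled = [platform_labels[c] for c in codes]
```

Taking the order from the codebook rather than from sorting the labels alphabetically keeps ordinal categories in their intended sequence.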
Continuous Numeric Variables
Continuous numeric variables represent numeric measures where arithmetic operations are meaningful (for example, distances, durations, travel time, or speed). These fields can have wide ranges and may include extreme values.
Some variables that look numeric (for example age brackets or capped household sizes) should be treated as categorical in analysis when they represent binned values rather than true continuous measures.
The table below summarizes the configured outlier diagnostics used to review the tails of selected numeric variables.
Outlier diagnostics.
| Table | Variable | Min | Max | P01 | P99 | IQR | Lower bound | Upper bound | Outliers | % outliers | Worst gap | Severity | Suggested handling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| day | num_trips | 0 | 68 | 0 | 15 | 5 | −8 | 12 | 3,061 | 2.3% | 56 | Moderate | Consider trimming >= 99th pct. |
| person | num_trips | 0 | 310 | 0 | 70 | 17 | −24 | 44 | 2,560 | 6.8% | 266 | High | Trim or winsorize >= 95th pct. |
| trip | distance_meters | 0 | 19,872,573 | 93 | 111,517 | 8,568 | −11,723 | 22,549 | 49,331 | 10.8% | 19,850,024 | High | Trim or winsorize >= 95th pct. |
| trip | distance_miles | 0 | 12,348 | 0 | 69 | 5 | −7 | 14 | 49,331 | 10.8% | 12,334 | High | Trim or winsorize >= 95th pct. |
| trip | duration_minutes | 0 | 8,008 | 0 | 158 | 16 | −18 | 46 | 34,690 | 7.6% | 7,962 | High | Trim or winsorize >= 95th pct. |
| trip | duration_seconds | 1 | 480,464 | 1 | 9,472 | 979 | −1,102 | 2,814 | 34,281 | 7.5% | 477,650 | High | Trim or winsorize >= 95th pct. |
| trip | dwell_mins | 0 | 9,162 | 0 | 2,367 | 311 | −462 | 783 | 52,110 | 11.1% | 8,379 | High | Trim or winsorize >= 95th pct. |
| trip | speed_mph | 0 | 14,766,979 | 0 | 539 | 21 | −26 | 57 | 19,239 | 4.2% | 14,766,922 | High | Trim or winsorize >= 95th pct. |
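
The lower and upper bounds reported above appear consistent with Tukey fences (quartiles minus or plus 1.5 × IQR), and the worst gap with the distance between the maximum and the upper bound (for example, 68 − 12 = 56 for day-level num_trips). This is an inference from the reported values rather than a documented method. The sketch below shows that calculation under this assumption, using made-up values and one of several possible quantile definitions.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] and report the worst gap."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {
        "iqr": iqr,
        "lower": lower,
        "upper": upper,
        "outliers": outliers,
        # Distance from the maximum to the upper fence (0 if nothing exceeds it).
        "worst_gap": max(max(values) - upper, 0.0),
    }

# Made-up trip counts with one extreme value.
result = tukey_fences([0, 1, 2, 3, 4, 5, 6, 7, 8, 100])
```

A large worst gap relative to the IQR signals that a handful of extreme records could dominate a mean, which is why trimming or winsorizing is suggested for the high-severity rows.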
Missing Values
Table 31 summarizes the configured missing-value codes and their labels in the codebook.
| Code | Label(s) in codebook | # Variables |
|---|---|---|
| -1 | Not imputable; Missing | 13 |
| 995 | Missing Response; 995 | 310 |
| 996 | Never; None; None (I do not drive a vehicle) | 7 |
| 997 | Other; Other/prefer to self-describe; Other (e.g., boat, RV, van); Other vehicle | 12 |
| 998 | Don't know | 3 |
| 999 | Prefer not to answer | 8 |
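
A minimal sketch of recoding these codes before summarizing a coded variable. The code set is taken from the table above, but the helper names and the example responses are hypothetical. The sketch assumes the recode is applied only to coded fields, since a value such as 995 can be a legitimate measurement in a continuous variable like distance or duration.

```python
# Configured missing-value codes from the table above.
MISSING_CODES = {-1, 995, 996, 997, 998, 999}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace configured missing-value codes with None."""
    return [None if v in missing_codes else v for v in values]

def valid_share(values, missing_codes=MISSING_CODES):
    """Share of responses that are not configured missing-value codes."""
    recoded = recode_missing(values, missing_codes)
    return sum(v is not None for v in recoded) / len(recoded)

# Hypothetical stored codes for one categorical variable.
responses = [1, 2, 995, 3, 999, -1, 2, 1]
clean = recode_missing(responses)
share_valid = valid_share(responses)
```

Whether codes such as 996 (Never; None) or 997 (Other) should count as missing depends on the question being analyzed, so pass a narrower code set when those responses are substantive.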
7 Codebook
The codebook is the primary reference for understanding the variables, response categories, question and response logic, and data structures used in this dataset.
A clear codebook is essential for reproducible analysis. It helps analysts identify the meaning of each variable, understand where it appears in the dataset, distinguish categorical variables from numeric or top-coded fields, verify skip logic and valid values, and interpret coded values consistently across tables.
Start here whenever you need to answer any of these questions:
- What table contains this variable?
- What does this field mean?
- Who was asked this question, and under what conditions should I expect a value?
- Is this field categorical, numeric, top-coded, or part of a grouped response?
- What do the stored values mean?
- What order should categories appear in a plot or table?
- Is this field part of a “select-all-that-apply” group or controlled by survey logic?
7.1 What the Codebook Contains
Variable List
The variable list is the structural reference for the dataset. It describes each delivered data element and helps analysts understand how variables are organized across household, person, day, trip, tour, vehicle, location, or other study tables.
The variable list includes:
- variable name
- table membership
- delivered data type
- description of the variable’s meaning, units, or derivation
- survey question text, when available
- survey logic that governs whether a respondent was asked the question or should have a value
- checkbox or select-all-that-apply flags for multiple-response categorical variables
Value Labels
The value-label table is the categorical reference for the dataset. For coded categorical variables, it maps stored values to human-readable labels so analysts can reconstruct ordered factors, standard tabulations, and interpretable plots.
Depending on the study, value-label records may include:
- table name, when labels vary by table
- variable name
- stored value or code
- human-readable label
- category order
7.2 Variable List
The variable list below is searchable and downloadable. For display, when the raw delivered codebook stores table membership as separate binary columns (such as hh, person, and day), those flags are combined into a single table_membership field.
Codebook variable list.
7.3 Value Labels
The value-label table below lists the available value labels for categorical variables. Use it alongside the variable list to interpret coded values and preserve the intended category order in summaries, charts, and models.
Codebook value labels.
8 Frequency Tables
This chapter provides frequency summaries for variables in each prepared data table. Use the table of contents on the left to jump directly to each dataset table section.
8.1 Household
| Variable | Description | Summary columns | Logic |
|---|---|---|---|
| is_complete | Record is complete | Count, Percent | |
| num_trips | Number of trips | Value | |
| num_days_complete | Number of complete days | Count, Percent | |
| participation_group | Participation group | Count, Percent | |
| sample_segment | Sample segment | Count, Percent | |
| signup_platform | Signup platform | Count, Percent | |
| diary_platform | Diary platform | Count, Percent | |
| num_people | Number of household members | Count, Percent | |
| num_surveyable | Number of surveyable household members | Count, Percent | |
| num_participants | Number of participants | Count, Percent | |
| num_adults | Number of adults | Count, Percent | |
| num_kids | Number of children | Count, Percent | |
| num_workers | Number of workers | Count, Percent | |
| num_students | Number of students | Count, Percent | |
| num_vehicles | Number of vehicles | Count, Percent | |
| income_detailed | Last year’s household income (detailed categories) | Count, Percent | |
| income_followup | Last year’s household income (broad categories) | Count, Percent | if income_detailed = ‘Prefer not to answer’ |
| income_broad | Last year’s household income upcoded responses from income_detailed and income_broad | Count, Percent | |
| residence_rent_own | Current residence ownership | Count, Percent | |
| home_county | Home location – County | Count, Percent | |
| residence_type | Type of current residence | Count, Percent | |
8.2 Person
| Variable | Description | Summary columns | Logic |
|---|---|---|---|
| is_complete | Record is complete | Count, Percent | |
| num_trips | Number of trips | Value | |
| num_days_complete | Number of complete days | Count, Percent | |
| hh_is_complete | Household day completion status | Count, Percent | |
| is_participant | Active participant (age 18+ and surveyable) | Count, Percent | |
| num_bicycles | Number of bicycles | Count, Percent | if rMove or (rMove for Web and person 1) |
| bicycle_type | Type of bicycle | Selected, Percent, Missing | show if number of bicycles > 0 |
| second_home_in_region | Second home in study region | Count, Percent | if has second_home |
| second_home_state | Second home location – State | Count, Percent | if has second_home |
| second_home_county | Second home location – County | Count, Percent | if has second_home |
| is_proxy | Assigned proxy reporter | Count, Percent | |
| has_proxy | Has a proxy | Count, Percent | |
| has_phone | Participant has phone | Count, Percent | |
| phone_type | Participant phone type | Count, Percent | |
| relationship | Relationship to household person number 1 | Count, Percent | |
| age | Age of household member | Count, Percent | |
| gender | Gender | Count, Percent | if surveyable |
| race | Race | Selected, Percent, Missing | if surveyable |
| ethnicity | Ethnicity | Selected, Percent, Missing | if surveyable |
| employment | Employment status | Count, Percent | if 16 years or older |
| work_mode | Mode of transportation to work | Count, Percent | if job_type IS NOT “work only from home” or “drive/bike/travel for work” |
| job_type | Work location type | Count, Percent | if employed full/part/self/volunteer |
| num_jobs | Number of jobs | Count, Percent | if employed full/part/furloughed/self/volunteer |
| commute_subsidy | Commute benefits provided by employer | Selected, Percent, Missing | if employed full/part/furloughed/volunteer |
| commute_subsidy_use | Commute benefit used | Selected, Percent, Missing | if selected benefit in commute_subsidy |
| work_in_region | Work in study region | Count, Percent | if job_type is “only one work location” or “teleworks some days and travels to a work location some days” |
| work_state | Work location – State | Count, Percent | if job_type is “only one work location” or “teleworks some days and travels to a work location some days” |
| work_county | Work location – County | Count, Percent | if job_type is “only one work location” or “teleworks some days and travels to a work location some days” |
| education | Highest level of education completed | Count, Percent | if participant |
| student | Student status and location | Count, Percent | if surveyable |
| school_mode | Mode of transportation to school | Count, Percent | if adult student and school_freq is not never or child who attends school or daycare |
| school_type | Type of school attended | Count, Percent | if age 0-15 or adult student |
| school_freq | Frequency of travel to school | Count, Percent | if adult student or child who attends school or daycare |
| remote_class_freq | Frequency of remote schooling | Count, Percent | if adult student and school_freq is not 6-7 days or child who is not cared for at home or attending daycare and school_freq is not 6 or 7 days |
| school_in_region | School in study region | Count, Percent | if attends school in person |
| school_state | School location – State | Count, Percent | if attends school in person |
| school_county | School location – County | Count, Percent | if attends school in person |
| second_home | Regularly spends the night at a second home (e.g., another parent or grandparent’s house, partner or spouse’s home, or a vacation home) | Count, Percent | |
| can_drive | Household member drives | Count, Percent | if surveyable and 16 or over |
| vehicle | Vehicle driven most often | Count, Percent | if household has vehicle and person drives |
| transit_freq | Frequency of transit trips | Count, Percent | if participant |
| tnc_freq | Frequency of TNC trips | Count, Percent | if uses smartphone-app ride services |
| bike_freq | Frequency of bike trips | Count, Percent | if participant |
| vanpool_freq | Frequency of vanpool trips | Count, Percent | if uses vanpool |
| walk_freq | Frequency of walk trips | Count, Percent | if participant |
| micromobility_devices | Micromobility device used | Selected, Percent, Missing | if rMove or (rMove for Web and person 1) |
| transit_pass | Ownership/type of transit pass | Count, Percent | if participant |
| disability | Disability status | Count, Percent | if participant |
| participate | Willingness to participate in future studies | Count, Percent | if participant |
| barriers | Barrier to making trips | Selected, Percent, Missing | |
| bike_comfort_lane | Comfort level riding a bike on a major street with four lanes and a wide bike lane physically separated from traffic by a raised curb, planters, or parked cars | Count, Percent | if rMove or bMove person 1 |
| bike_comfort_local | Comfort level riding a bike on a quiet residential street with bicycle route markings, wide speed humps, and other things to discourage and slow down car traffic | Count, Percent | if rMove or bMove person 1 |
| bike_comfort_major | Comfort level riding a bike on a major street with four lanes and no bike lane | Count, Percent | if rMove or bMove person 1 |
| bike_comfort_minor | Comfort level riding a bike on a minor street with two lanes and no bike lane | Count, Percent | if rMove or bMove person 1 |
| bike_comfort_neighborhood | Comfort level riding a bike on a quiet residential street | Count, Percent | if rMove or bMove person 1 |
| bike_comfort_paths | Comfort level riding a bike on a path or trail separate from the street | Count, Percent | if rMove or bMove person 1 |
| bike_comfort_street | Comfort level riding a bike on a minor street with two lanes and a striped bike lane | Count, Percent | if rMove or bMove person 1 |
| bike_comfort_striped | Comfort level riding a bike on a major street with four lanes and a striped bike lane | Count, Percent | if rMove or bMove person 1 |
| bike_factors | Factor to increase biking frequency | Selected, Percent, Missing | |
| bike_purpose | Purpose used bicycle for in the past 30 days | Selected, Percent, Missing | if bike_freq > never or less than monthly |
| bike_safety | Safety concerns preventing bicycle use | Selected, Percent, Missing | if why_no_bike = safety concern |
| bike_store | Bicycle storage location | Selected, Percent, Missing | if household has at least one bike |
| commute_days | Day commuted to workplace last week | Selected, Percent, Missing | if employment = full/part/self/volunteer and job_type IS NOT “work only from home” or “drive/bike/travel for work” |
| ev_subsidies | Familiarity with rebates/subsidies for purchasing an electric vehicle | Count, Percent | if rMove or (rMove for Web and person 1) |
| ev_typical_charge | Electric vehicle charging location | Selected, Percent, Missing | if fuel type of primary vehicle driven is electric or PHEV |
| home_vehicle_park | Typical household vehicle parking location | Count, Percent | if household has vehicles and person drives; if rMove or (rMove for Web and person 1) |
| home_vehicle_park_pay | Pays to park vehicle at home | Count, Percent | if household has vehicles and person drives; if rMove or (rMove for Web and person 1) |
| home_vehicle_park_permit | Purchased a residential parking pass to park vehicle | Count, Percent | if household typically parks on-street; if rMove or (rMove for Web and person 1) |
| peerrent_freq | Peer-to-peer car rental use frequency | Count, Percent | if uses peer-to-peer car rental |
| telework_days | Day teleworked last week | Selected, Percent, Missing | if job_type IS NOT “work only from home” or “drive/bike/travel for work” |
| telework_freq_pre_covid | Days worked from home before March 2020 | Count, Percent | if job_type IS NOT “work only from home” or “drive/bike/travel for work” |
| transit_factors | Factor to increase transit use | Selected, Percent, Missing | |
| transit_purpose | Purpose for using transit in the past 30 days | Selected, Percent, Missing | if transit_freq is not never or less than monthly |
| walk_purpose | Reason for walking in the past 30 days | Selected, Percent, Missing | if walk_freq > less than monthly |
| why_no_bike | Reasons for not using a bicycle in the past 30 days | Selected, Percent, Missing | if bike_freq = never or less than monthly |
8.3 Day
| Variable | Description | Summary columns | Logic |
|---|---|---|---|
| is_complete | Record is complete | Count, Percent | |
| num_trips | Number of trips | Value | |
| hh_is_complete | Household day completion status | Count, Percent | |
| is_participant | Active participant (age 18+ and surveyable) | Count, Percent | |
| begin_day | Where participant began their day | Count, Percent | |
| end_day | Where participant ended their day | Count, Percent | |
| school_daily | Student traveled to school | Count, Percent | if attends school in person or daycare at least some of the time |
| attend_school | Traveled to school | Selected, Percent, Missing | if person attends in-person school or daycare at least some of the time AND school was not selected as a purpose on travel day |
| attend_school_no | Reason for not attending school | Selected, Percent, Missing | if did not attend school or daycare on travel day |
| telecommute_time | Time spent teleworking on travel day (minutes, where 600 = 10+ hours) | Count, Percent | if employment = full/part/self/volunteer |
| delivery | Type of delivery | Selected, Percent, Missing | if rMove or (rMove for Web and person 1) |
| made_travel | Made trips on travel day | Count, Percent | if using rMove and has zero trips for the day and did not say they went to school/daycare in attend_school and begin_day = end_day and begin_day is not other |
| no_travel | Reason for no travel on date | Selected, Percent, Missing | if made zero trips on day |
| congestion | Person adjusted travel time to account for congestion | Count, Percent | if employment = full/part/self/volunteer and job_type IS NOT “work only from home” or “drive/bike/travel for work” |
8.4 Vehicle
| Variable | Description | Summary columns |
|---|---|---|
| is_complete | Record is complete | Count, Percent |
| make | Vehicle make | Count, Percent |
| year | Vehicle year | Count, Percent |
| fuel_type | Vehicle fuel type | Count, Percent |
| vehicle_ownership | Vehicle ownership status | Count, Percent |
| transponder | Vehicle has a toll transponder | Count, Percent |
8.5 Location
8.6 Unlinked Trip
| Variable | Description | Summary columns | Logic |
|---|---|---|---|
| is_complete | Record is complete | Count, Percent | |
| hh_is_complete | Household day completion status | Count, Percent | |
| day_is_complete | Day survey completion status | Count, Percent | |
| o_state | Origin – State | Count, Percent | if state borders MA |
| d_state | Destination – State | Count, Percent | if state borders MA |
| mode_1 | Trip mode 1 | Count, Percent | |
| mode_2 | Trip mode 2 | Count, Percent | |
| mode_3 | Trip mode 3 | Count, Percent | |
| mode_4 | Trip mode 4 | Count, Percent | |
| transit_egress | Mode used to leave transit stop | Count, Percent | if mode = bus or rail |
| transit_access | Mode used to access transit stop | Count, Percent | if mode = bus or rail |
| ev_charge_station | Electric vehicle charging stations at stop | Count, Percent | if used household electric vehicle on trip |
| ev_charge_station_level | Charge station level | Selected, Percent, Missing | if EV charge stations were at destination |
| ev_charge_station_decision | Electric vehicle charging stations influenced decision to stop here | Count, Percent | if used EV charge station at destination and destination is not home/work/school location |
| o_purpose_category | Origin purpose category | Count, Percent | |
| o_purpose_category_reported | Reported origin purpose category | Count, Percent | |
| o_purpose | Origin purpose | Count, Percent | |
| d_purpose_category | Destination purpose category | Count, Percent | |
| d_purpose | Destination purpose | Count, Percent | |
| d_purpose_reported | Reported destination purpose | Count, Percent | |
| bike_park_loc | Bicycle parking location | Count, Percent | if mode or transit_access or transit_egress = bicycle |
| scooter_park_location | Scooter parking location | Count, Percent | if mode or transit_access or transit_egress = micromobility |
| park_cost | Amount paid to park | Value | |
| taxi_cost | Amount paid for taxi | Value | |
| taxi_pay | Knows amount paid for taxi | Count, Percent | if taxi_type = I paid, employer paid, split/shared |
| taxi_type | Type of taxi used on trip | Count, Percent | if mode or transit_access or transit_egress = taxi |
| tnc_type | Shared smartphone-app ride service | Count, Percent | if mode_taxi = Uber/Lyft |
| transit_type | Payment method for transit | Count, Percent | if mode = bus (except school bus) or rail |
| vehicle_park_pay | Knows amount paid to park vehicle | Count, Percent | if vehicle_park_type = Paid via cash, credit card, tickets or parking service |
8.7 Linked Trip
| Variable | Description | Summary columns |
|---|---|---|
| is_complete | Record is complete | Count, Percent |
| o_purpose_category | Origin purpose category | Count, Percent |
| o_purpose | Origin purpose | Count, Percent |
| d_purpose_category | Destination purpose category | Count, Percent |
| d_purpose | Destination purpose | Count, Percent |
| linked_trip_mode | Linked trip mode | Count, Percent |
| joint_status | Indicates whether tour is individual, partially joint, or fully joint | Count, Percent |
| escort_category | No escort, escorted drop-off, escorted pick-up, escorting drop-off, or escorting pick-up | Count, Percent |
8.8 Tour
| Variable | Description | Summary columns |
|---|---|---|
| is_complete | Record is complete | Count, Percent |
| joint_status | Indicates whether tour is individual, partially joint, or fully joint | Count, Percent |
| tour_category | Tour category (mandatory, non-mandatory) | Count, Percent |
| tour_mode | Tour mode | Count, Percent |
| tour_purpose | Tour purpose | Count, Percent |
9 How to Use This Guide
This handbook provides practical, end-to-end guidance for analysts working with the Massachusetts Travel Study dataset. It shows how to work from the prepared study tables and codebook tables, and how to join, filter, weight, and analyze the data using reproducible examples. It should serve as the primary resource for descriptive analysis, common metrics, and design-aware inference using these data.
This guide is written in R, but the same principles apply in other statistical software such as Python, Stata, or SAS. The examples below assume the prepared tables and codebook objects are already available in hts and codebook.
Use this handbook alongside the dataset overview in Section 6 and the codebook in Section 7. The dataset overview explains which tables are included, the codebook explains what variables mean, and the later sections of this handbook explain analytic units, weights, and common metrics.
10 Setup and Initial Exploration
10.1 System Requirements and Software
This guide focuses on using R for analysis. Many of the same ideas also apply in other software such as Python, Stata, or SAS.
To follow the examples in this guide, you will need:
- R (tested with version 4.4.3)
- An R development environment such as Positron or RStudio
The following packages are used throughout:
- data.table for large table workflows
- dplyr and tidyr for data manipulation
- srvyr for survey-weighted analysis
- ggplot2 and gt for figures and tables
- stringr for string handling
- lubridate for date/time processing
Install these packages if they are not already available in your environment.
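One way to check which of these packages are missing is sketched below. The install call is left commented out so the check has no side effects; it assumes a CRAN mirror is already configured in your environment.

```r
# Packages used throughout this guide
required_pkgs <- c(
  "data.table", "dplyr", "tidyr", "srvyr",
  "ggplot2", "gt", "stringr", "lubridate"
)

# requireNamespace() returns TRUE when a package can be loaded
is_installed <- vapply(
  required_pkgs,
  requireNamespace,
  logical(1),
  quietly = TRUE
)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install the missing packages:
  # install.packages(missing_pkgs)
}
```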
suppressPackageStartupMessages({
  library(data.table)
  library(dplyr)
  library(tidyr)
  library(srvyr)
  library(ggplot2)
  library(gt)
  library(stringr)
  library(lubridate)
})
10.2 Load Data
This code assumes you have manually unzipped the dataset to a local folder. Adjust the data_dir variable to point to your unzipped dataset location.
The list of .csv files should include:
hh.csv
person.csv
day.csv
vehicle.csv
trip_unlinked.csv
trip_linked.csv
tour.csv
location.csv
Additionally, you should have two .csv files for the codebook:
value_labels.csv
variable_list.csv
The code below reads all CSV files into a list-of-data.frames called hts (for household travel survey), plus a separate list codebook for the codebook tables. If your study also includes a standalone sample plan CSV, read that file separately rather than expecting it inside the delivered dataset ZIP.
We use data.table::fread() for efficient reading of large CSV files. It can be replaced with read.csv() or other readers, but only if you handle large integer IDs explicitly. Both base::read.csv() and readr::read_csv() can parse long numeric IDs as floating-point numbers, which can lead to duplicate IDs, particularly for linked trips. If using those functions, read ID columns as character (colClasses in read.csv(), col_types in readr::read_csv()) or use the bit64 package to handle 64-bit integers.
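As an aside on the ID issue, the sketch below shows how two distinct long IDs can collide when read as floating-point numbers, and how reading the column as character avoids it. The file, the `trip_id` column name, and the ID values are hypothetical and chosen only to exceed double precision.

```r
# Write a small temporary CSV with two distinct 16-digit IDs
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "trip_id,mode",
    "9007199254740992,walk",
    "9007199254740993,bike"
  ),
  csv_path
)

# Default parsing converts the ID column to double (numeric)
lossy <- read.csv(csv_path)

# Reading the ID column as character keeps every digit
safe <- read.csv(csv_path, colClasses = c(trip_id = "character"))

# Compare the number of distinct IDs under each approach
length(unique(lossy$trip_id))
length(unique(safe$trip_id))

unlink(csv_path)
```

The main loading code for this guide follows.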
# Folder where you manually unzipped the dataset
data_dir <- "data_cache"
csv_paths <- list.files(
  data_dir,
  pattern = "\\.csv$",
  full.names = TRUE,
  recursive = TRUE
)
object_names <- tolower(gsub("\\.csv$", "", basename(csv_paths)))
object_names <- make.names(object_names, unique = TRUE)
# Read all csvs
all_data <- setNames(
  lapply(csv_paths, data.table::fread),
  object_names
)
# Separate codebook tables
codebook <- list(
  value_labels = all_data$value_labels,
  variable_list = all_data$variable_list
)
# Separate core HTS tables
hts <- all_data[setdiff(names(all_data), c("value_labels", "variable_list"))]
# Optional: standalone weighting sample plan CSV (read only if the file exists)
sample_plan_path <- "path/to/sample_plan.csv"
if (file.exists(sample_plan_path)) {
  sample_plan <- data.table::fread(sample_plan_path)
}
rm(all_data)
For MassDOT, one of the first setup steps after loading the data should be defining the complete-household analytic universe. Most person-, day-, trip-, and vehicle-level analyses should be limited to households where hts$hh$is_complete == 1.
complete_hh_ids <- hts$hh %>%
  dplyr::filter(is_complete == 1) %>%
  dplyr::pull(hh_id)
Use complete_hh_ids when you need to restrict lower-level tables to complete households.
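As a minimal sketch of that restriction, the example below uses small made-up tables in place of hts$hh and hts$person and base R subsetting; with dplyr, the equivalent step would be dplyr::filter(hh_id %in% complete_hh_ids).

```r
# Toy stand-ins for hts$hh and hts$person (made-up values)
hh <- data.frame(
  hh_id = c("A", "B", "C"),
  is_complete = c(1, 0, 1)
)
person <- data.frame(
  person_id = c("A1", "A2", "B1", "C1"),
  hh_id = c("A", "A", "B", "C")
)

# IDs of complete households
complete_hh_ids <- hh$hh_id[hh$is_complete == 1]

# Keep only people who belong to complete households
person_complete <- person[person$hh_id %in% complete_hh_ids, ]
```

The same pattern applies to any lower-level table that carries hh_id.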
10.3 Inspect Tables
Once the files are loaded, inspect the tables before starting analysis.
Get List of Tables
Start by listing the prepared tables that are available in hts.
table_names <- data.frame(
  table = names(hts),
  stringsAsFactors = FALSE
)
Table 32 confirms which prepared HTS tables are loaded for analysis.
gt::gt(table_names) %>%
  gt::tab_header(title = "Loaded HTS Tables")
Loaded HTS Tables
| table |
|---|
| hh |
| person |
| day |
| vehicle |
| location |
| trip_unlinked |
| trip_linked |
| tour |
| value_labels |
| variable_list |
Glimpse Data
Each table includes a mix of identifiers, survey variables, and often one or more weight columns. A quick glance at the person table is usually a good starting point.
dplyr::glimpse(hts$person)
#> Rows: 37,616
#> Columns: 269
#> $ person_id <chr> "2400008901", "2400008902", "2400012201", "2…
#> $ person_num <int> 1, 2, 1, 1, 2, 3, 4, 1, 1, 2, 1, 1, 2, 1, 1,…
#> $ hh_id <chr> "24000089", "24000089", "24000122", "2400014…
#> $ surveyable <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ is_participant <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ is_proxy <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ has_proxy <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ has_phone <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ phone_type <int> 1, 1, 2, 995, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, …
#> $ hh_is_complete <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ is_complete <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ num_days_complete <int> 1, 1, 7, 7, 1, 1, 1, 1, 1, 1, 1, 7, 7, 7, 6,…
#> $ num_trips <int> 2, 5, 39, 45, 2, 4, 2, 0, 4, 3, 3, 36, 13, 1…
#> $ relationship <int> 0, 1, 0, 0, 2, 2, 1, 0, 0, 1, 0, 0, 1, 0, 0,…
#> $ age <int> 8, 9, 5, 9, 5, 4, 8, 7, 9, 9, 6, 10, 9, 10, …
#> $ gender <int> 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2,…
#> $ race_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ ethnicity_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ employment <int> 5, 5, 5, 1, 1, 2, 1, 1, 5, 2, 1, 5, 5, 5, 1,…
#> $ work_mode <int> 995, 995, 995, 104, 100, 100, 1, 100, 995, 1…
#> $ job_type <int> 995, 995, 995, 5, 1, 5, 1, 1, 995, 2, 1, 995…
#> $ num_jobs <int> 995, 995, 995, 1, 1, 2, 1, 1, 995, 1, 2, 995…
#> $ work_lon <dbl> NA, NA, NA, -71.05214, -71.79986, -72.67345,…
#> $ work_lat <dbl> NA, NA, NA, 42.35606, 42.26745, 41.76257, 42…
#> $ work_in_region <int> 995, 995, 995, 1, 1, 0, 1, 1, 995, 995, 1, 9…
#> $ work_state <chr> NA, NA, NA, "25", "25", "09", "25", "25", NA…
#> $ work_county <chr> NA, NA, NA, "25025", "25027", "09003", "2502…
#> $ work_bg_2010 <chr> NA, NA, NA, "250250701018", "250277317001", …
#> $ work_bg_2020 <chr> NA, NA, NA, "250250701042", "250277317002", …
#> $ work_puma_2012 <chr> NA, NA, NA, "03302", "00300", "00302", "0030…
#> $ work_puma_2022 <chr> NA, NA, NA, "00802", "00505", "20201", "0050…
#> $ education <int> 7, 7, 7, 7, 7, 6, 7, 7, 7, 7, 7, 6, 2, 3, 6,…
#> $ student <int> 2, 2, 0, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
#> $ school_mode <int> 995, 995, 1, 995, 995, 995, 995, 995, 995, 9…
#> $ school_type <int> 995, 995, 13, 995, 995, 13, 995, 995, 995, 9…
#> $ school_freq <int> 995, 995, 4, 995, 995, 995, 995, 995, 995, 9…
#> $ remote_class_freq <int> 995, 995, 996, 995, 995, 2, 995, 995, 995, 9…
#> $ school_in_region <int> 995, 995, 1, 995, 995, 995, 995, 995, 995, 9…
#> $ school_state <chr> NA, NA, "25", NA, NA, NA, NA, NA, NA, NA, NA…
#> $ school_county <chr> NA, NA, "25017", NA, NA, NA, NA, NA, NA, NA,…
#> $ school_puma_2012 <chr> NA, NA, "03400", NA, NA, NA, NA, NA, NA, NA,…
#> $ school_puma_2022 <chr> NA, NA, "00613", NA, NA, NA, NA, NA, NA, NA,…
#> $ school_bg_2010 <chr> NA, NA, "250173736002", NA, NA, NA, NA, NA, …
#> $ school_bg_2020 <chr> NA, NA, "250173736002", NA, NA, NA, NA, NA, …
#> $ school_lon <dbl> NA, NA, -71.16924, NA, NA, NA, NA, NA, NA, N…
#> $ school_lat <dbl> NA, NA, 42.33609, NA, NA, NA, NA, NA, NA, NA…
#> $ second_home <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ second_home_in_region <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ second_home_state <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ second_home_county <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ second_home_puma_2012 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ second_home_puma_2022 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ second_home_bg_2010 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ second_home_bg_2020 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ second_home_lon <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ second_home_lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ can_drive <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ vehicle <int> 6, 7, 995, 6, 6, 9, 7, 8, 6, 7, 6, 6, 6, 6, …
#> $ transit_freq <int> 4, 8, 5, 4, 9, 9, 9, 9, 8, 9, 8, 9, 9, 9, 8,…
#> $ tnc_freq <int> 7, 8, 8, 8, 995, 995, 995, 995, 995, 995, 8,…
#> $ bike_freq <int> 8, 996, 8, 996, 8, 996, 8, 996, 996, 996, 8,…
#> $ vanpool_freq <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bikeshare_freq <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ scootshare_freq <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ walk_freq <int> 4, 5, 1, 2, 5, 5, 1, 1, 2, 2, 1, 1, 8, 8, 4,…
#> $ transit_pass <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ disability <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
#> $ participate <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,…
#> $ barriers_1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
#> $ barriers_10 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_3 <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_8 <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,…
#> $ barriers_9 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_997 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_999 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ barriers_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bicycle_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bicycle_type_1 <int> 1, 995, 1, 995, 995, 995, 995, 1, 1, 995, 99…
#> $ bicycle_type_2 <int> 0, 995, 0, 995, 995, 995, 995, 0, 0, 995, 99…
#> $ bicycle_type_997 <int> 0, 995, 0, 995, 995, 995, 995, 0, 0, 995, 99…
#> $ bike_comfort_lane <int> 3, NA, 2, 1, NA, NA, NA, 2, 4, NA, 2, 4, 4, …
#> $ bike_comfort_local <int> 1, NA, 1, 2, NA, NA, NA, 1, 1, NA, 2, 3, 4, …
#> $ bike_comfort_major <int> 4, NA, 4, 1, NA, NA, NA, 4, 4, NA, 4, 4, 4, …
#> $ bike_comfort_minor <int> 3, NA, 4, 1, NA, NA, NA, 2, 3, NA, 3, 4, 4, …
#> $ bike_comfort_neighborhood <int> 2, NA, 2, 2, NA, NA, NA, 1, 1, NA, 1, 3, 4, …
#> $ bike_comfort_paths <int> 1, NA, 1, 2, NA, NA, NA, 1, 1, NA, 1, 1, 4, …
#> $ bike_comfort_street <int> 2, NA, 3, 2, NA, NA, NA, 1, 2, NA, 2, 4, 4, …
#> $ bike_comfort_striped <int> 3, NA, 3, 1, NA, NA, NA, 3, 4, NA, 3, 4, 4, …
#> $ bike_factors_1 <int> 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,…
#> $ bike_factors_10 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ bike_factors_11 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ bike_factors_12 <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1,…
#> $ bike_factors_2 <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ bike_factors_3 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ bike_factors_4 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ bike_factors_5 <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,…
#> $ bike_factors_6 <int> 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,…
#> $ bike_factors_7 <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ bike_factors_8 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ bike_factors_9 <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ bike_factors_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bike_purpose_1 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bike_purpose_2 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bike_purpose_3 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bike_purpose_4 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bike_purpose_5 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bike_purpose_6 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bike_purpose_7 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bike_purpose_8 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ bike_purpose_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bike_safety_1 <int> 0, 995, 1, 995, 995, 995, 995, 995, 1, 0, 0,…
#> $ bike_safety_2 <int> 1, 995, 1, 995, 995, 995, 995, 995, 0, 0, 1,…
#> $ bike_safety_3 <int> 0, 995, 1, 995, 995, 995, 995, 995, 0, 0, 1,…
#> $ bike_safety_4 <int> 0, 995, 1, 995, 995, 995, 995, 995, 0, 0, 1,…
#> $ bike_safety_5 <int> 1, 995, 0, 995, 995, 995, 995, 995, 0, 0, 0,…
#> $ bike_safety_6 <int> 0, 995, 0, 995, 995, 995, 995, 995, 0, 0, 0,…
#> $ bike_safety_7 <int> 0, 995, 0, 995, 995, 995, 995, 995, 0, 0, 0,…
#> $ bike_safety_8 <int> 0, 995, 0, 995, 995, 995, 995, 995, 0, 1, 0,…
#> $ bike_safety_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "At our …
#> $ bike_store_1 <int> 1, 995, 1, 995, 995, 995, 995, 1, 0, 995, 99…
#> $ bike_store_2 <int> 0, 995, 0, 995, 995, 995, 995, 0, 0, 995, 99…
#> $ bike_store_3 <int> 0, 995, 0, 995, 995, 995, 995, 0, 0, 995, 99…
#> $ bike_store_4 <int> 0, 995, 0, 995, 995, 995, 995, 0, 0, 995, 99…
#> $ bike_store_5 <int> 0, 995, 0, 995, 995, 995, 995, 0, 0, 995, 99…
#> $ bike_store_6 <int> 0, 995, 0, 995, 995, 995, 995, 0, 0, 995, 99…
#> $ bike_store_7 <int> 0, 995, 0, 995, 995, 995, 995, 0, 0, 995, 99…
#> $ bike_store_997 <int> 0, 995, 0, 995, 995, 995, 995, 0, 1, 995, 99…
#> $ carshare_freq <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ commute_days_1 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 1, 995…
#> $ commute_days_2 <int> 995, 995, 995, 1, 1, 0, 1, 0, 995, 0, 1, 995…
#> $ commute_days_3 <int> 995, 995, 995, 1, 1, 0, 1, 0, 995, 0, 1, 995…
#> $ commute_days_4 <int> 995, 995, 995, 1, 1, 0, 0, 0, 995, 0, 1, 995…
#> $ commute_days_5 <int> 995, 995, 995, 0, 1, 0, 0, 0, 995, 0, 1, 995…
#> $ commute_days_6 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_days_7 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_days_996 <int> 995, 995, 995, 0, 0, 1, 0, 1, 995, 1, 0, 995…
#> $ commute_subsidy_1 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 1, 995…
#> $ commute_subsidy_10 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_11 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 1, 995…
#> $ commute_subsidy_12 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_13 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_14 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_2 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_3 <int> 995, 995, 995, 0, 0, 0, 1, 1, 995, 0, 0, 995…
#> $ commute_subsidy_4 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 1, 995…
#> $ commute_subsidy_5 <int> 995, 995, 995, 1, 0, 0, 0, 1, 995, 0, 0, 995…
#> $ commute_subsidy_6 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_7 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_8 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_9 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_996 <int> 995, 995, 995, 0, 1, 1, 0, 0, 995, 1, 0, 995…
#> $ commute_subsidy_998 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ commute_subsidy_use_1 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_10 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_11 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_12 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_13 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_14 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_2 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_3 <int> 995, 995, 995, 0, 995, 995, 1, 1, 995, 995, …
#> $ commute_subsidy_use_4 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_5 <int> 995, 995, 995, 1, 995, 995, 0, 1, 995, 995, …
#> $ commute_subsidy_use_6 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_7 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_8 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_9 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ commute_subsidy_use_996 <int> 995, 995, 995, 0, 995, 995, 0, 0, 995, 995, …
#> $ ethnicity_1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ ethnicity_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ethnicity_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ethnicity_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ethnicity_997 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ethnicity_999 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ev_subsidies <int> 4, 995, 995, 5, 995, 995, 995, 5, 5, 995, 2,…
#> $ ev_typical_charge_1 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ ev_typical_charge_2 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ ev_typical_charge_3 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ ev_typical_charge_4 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ ev_typical_charge_5 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ ev_typical_charge_6 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ ev_typical_charge_997 <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ home_vehicle_park <int> 1, 995, 995, 3, 995, 995, 995, 1, 1, 995, 1,…
#> $ home_vehicle_park_pay <int> 0, 995, 995, 0, 995, 995, 995, 0, 0, 995, 0,…
#> $ home_vehicle_park_permit <int> 995, 995, 995, 1, 995, 995, 995, 995, 995, 9…
#> $ micromobility_devices_1 <int> 0, 995, 0, 0, 995, 995, 995, 0, 0, 995, 0, 0…
#> $ micromobility_devices_2 <int> 0, 995, 0, 0, 995, 995, 995, 0, 0, 995, 0, 0…
#> $ micromobility_devices_3 <int> 0, 995, 0, 0, 995, 995, 995, 0, 0, 995, 0, 0…
#> $ micromobility_devices_996 <int> 1, 995, 1, 1, 995, 995, 995, 1, 1, 995, 1, 1…
#> $ micromobility_devices_997 <int> 0, 995, 0, 0, 995, 995, 995, 0, 0, 995, 0, 0…
#> $ num_bicycles <int> 1, 995, 1, 0, 995, 995, 995, 2, 2, 995, 0, 0…
#> $ peerrent_freq <int> 995, 995, 995, 995, 995, 995, 995, 995, 995,…
#> $ race_1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ race_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ race_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
#> $ race_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ race_5 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,…
#> $ race_997 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ race_999 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ share_2 <int> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
#> $ share_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ share_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ share_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
#> $ share_6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ share_7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ share_996 <int> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,…
#> $ telework_days_1 <int> 995, 995, 995, 1, 0, 0, 1, 1, 995, 0, 0, 995…
#> $ telework_days_2 <int> 995, 995, 995, 0, 0, 0, 1, 1, 995, 0, 0, 995…
#> $ telework_days_3 <int> 995, 995, 995, 0, 0, 0, 1, 1, 995, 0, 0, 995…
#> $ telework_days_4 <int> 995, 995, 995, 0, 0, 0, 0, 1, 995, 0, 0, 995…
#> $ telework_days_5 <int> 995, 995, 995, 1, 0, 0, 1, 0, 995, 0, 0, 995…
#> $ telework_days_6 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ telework_days_7 <int> 995, 995, 995, 0, 0, 0, 0, 0, 995, 0, 0, 995…
#> $ telework_days_996 <int> 995, 995, 995, 0, 1, 1, 0, 0, 995, 1, 1, 995…
#> $ telework_freq_pre_covid <int> 995, 995, 3, 8, 996, 8, 8, 7, 995, 996, 996,…
#> $ transit_factors_1 <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
#> $ transit_factors_10 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
#> $ transit_factors_11 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ transit_factors_12 <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ transit_factors_2 <int> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,…
#> $ transit_factors_3 <int> 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,…
#> $ transit_factors_4 <int> 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
#> $ transit_factors_5 <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ transit_factors_6 <int> 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
#> $ transit_factors_7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
#> $ transit_factors_8 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ transit_factors_9 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ transit_factors_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ transit_purpose_1 <int> 0, 995, 1, 0, 995, 995, 995, 995, 995, 995, …
#> $ transit_purpose_2 <int> 1, 995, 1, 0, 995, 995, 995, 995, 995, 995, …
#> $ transit_purpose_3 <int> 0, 995, 0, 0, 995, 995, 995, 995, 995, 995, …
#> $ transit_purpose_4 <int> 0, 995, 1, 0, 995, 995, 995, 995, 995, 995, …
#> $ transit_purpose_5 <int> 0, 995, 0, 1, 995, 995, 995, 995, 995, 995, …
#> $ transit_purpose_6 <int> 0, 995, 0, 0, 995, 995, 995, 995, 995, 995, …
#> $ transit_purpose_7 <int> 1, 995, 0, 0, 995, 995, 995, 995, 995, 995, …
#> $ transit_purpose_other <chr> "cultural events", NA, NA, NA, NA, NA, NA, N…
#> $ walk_purpose_1 <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 995, 995…
#> $ walk_purpose_2 <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 995, 995…
#> $ walk_purpose_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 995, 995…
#> $ walk_purpose_4 <int> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 995, 995…
#> $ walk_purpose_5 <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 995, 995…
#> $ walk_purpose_6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 995, 995…
#> $ walk_purpose_7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 995, 995…
#> $ walk_purpose_8 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 995, 995…
#> $ walk_purpose_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ why_no_bike_1 <int> 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,…
#> $ why_no_bike_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ why_no_bike_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
#> $ why_no_bike_4 <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ why_no_bike_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
#> $ why_no_bike_6 <int> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,…
#> $ why_no_bike_7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ why_no_bike_8 <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
#> $ person_type <int> 4, 4, 3, 1, 1, 3, 1, 1, 4, 2, 1, 5, 4, 5, 1,…
#> $ person_weight <dbl> 157.42583, 157.42583, 103.34663, 188.21553, …
#> $ race_imputed <chr> "white", "white", "white", "white", "white",…
#> $ ethnicity_imputed <chr> "not_hispanic", "not_hispanic", "not_hispani…
#> $ gender_imputed <chr> "female", "male", "female", "female", "male"…
#> $ person_weight_tue <dbl> 187.88768, 187.88768, 59.87428, 458.44733, 2…
#> $ person_weight_fri <dbl> NA, NA, 75.95139, 655.71500, NA, NA, NA, NA,…
#> $ person_weight_mon <dbl> NA, NA, 65.09989, 546.06550, NA, NA, NA, NA,…
#> $ person_weight_sat <dbl> NA, NA, 76.29594, 651.46690, NA, NA, NA, NA,…
#> $ person_weight_sun <dbl> NA, NA, 74.33908, 640.43646, NA, NA, NA, NA,…
#> $ person_weight_thu <dbl> NA, NA, 64.49532, 489.71295, NA, NA, NA, NA,…
#> $ person_weight_wed <dbl> NA, NA, 58.57788, 463.14780, NA, NA, NA, NA,…
View Table Dimensions
Table sizes reflect the hierarchical structure of the dataset: households contain people, people contain travel days, and days may contain zero or more trips.
table_dimensions <- data.frame(
  table = character(),
  rows = integer(),
  columns = integer(),
  stringsAsFactors = FALSE
)
for (table_name in names(hts)) {
  table_dimensions <- rbind(
    table_dimensions,
    data.frame(
      table = table_name,
      rows = nrow(hts[[table_name]]),
      columns = ncol(hts[[table_name]]),
      stringsAsFactors = FALSE
    )
  )
}
Use Table 33 to confirm that the row counts follow the expected household-to-person-to-day-to-trip hierarchy.
gt::gt(table_dimensions) %>%
  gt::fmt_number(
    columns = c(rows, columns),
    decimals = 0,
    sep_mark = ","
  ) %>%
  gt::cols_label(
    table = "Table",
    rows = "Rows",
    columns = "Columns"
  ) %>%
  gt::tab_options(
    table.font.size = gt::px(13),
    data_row.padding = gt::px(4)
  )
| Table | Rows | Columns |
|---|---|---|
| hh | 18,122 | 52 |
| person | 37,616 | 269 |
| day | 134,187 | 65 |
| vehicle | 25,849 | 19 |
| location | 8,607,225 | 9 |
| trip_unlinked | 468,018 | 139 |
| trip_linked | 419,469 | 58 |
| tour | 160,091 | 57 |
| value_labels | 2,422 | 3 |
| variable_list | 567 | 19 |
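Row counts alone do not show that every lower-level record links to a parent record. A minimal sketch of that check, using small made-up tables in place of hts$hh and hts$person, might look like this:

```r
# Toy stand-ins for hts$hh and hts$person (made-up values)
hh <- data.frame(hh_id = c("A", "B"))
person <- data.frame(
  person_id = c("A1", "A2", "B1", "Z1"),
  hh_id = c("A", "A", "B", "Z")
)

# Person rows whose household is missing from the household table
orphan_persons <- person[!(person$hh_id %in% hh$hh_id), ]

# People per household, to confirm the one-to-many relationship
persons_per_hh <- table(person$hh_id)
```

An empty orphan_persons table is the expected result on delivered data; here the made-up row "Z1" is included deliberately so the check has something to find.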
For MassDOT, it is also useful to compare delivered row counts with the complete-household subset before calculating any substantive estimate.
complete_household_table_dimensions <- data.frame(
  table = character(),
  complete_household_rows = integer(),
  stringsAsFactors = FALSE
)
for (table_name in names(hts)) {
  if ("hh_id" %in% names(hts[[table_name]])) {
    complete_household_rows <- sum(hts[[table_name]]$hh_id %in% complete_hh_ids)
  } else {
    complete_household_rows <- NA_integer_
  }
  complete_household_table_dimensions <- rbind(
    complete_household_table_dimensions,
    data.frame(
      table = table_name,
      complete_household_rows = complete_household_rows,
      stringsAsFactors = FALSE
    )
  )
}
Table 34 shows the number of rows that belong to complete households in each table with a household identifier.
gt::gt(complete_household_table_dimensions) %>%
  gt::fmt_number(
    columns = complete_household_rows,
    decimals = 0,
    sep_mark = ","
  ) %>%
  gt::cols_label(
    table = "Table",
    complete_household_rows = "Complete-Household Rows"
  )
| Table | Complete-Household Rows |
|---|---|
| hh | 15,641 |
| person | 31,255 |
| day | 96,370 |
| vehicle | 21,770 |
| location | NA |
| trip_unlinked | 411,573 |
| trip_linked | 366,186 |
| tour | 139,240 |
| value_labels | NA |
| variable_list | NA |
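Dividing the complete-household rows in Table 34 by the delivered rows in Table 33 gives the share of each table that remains after the restriction. The sketch below types in the counts from those two tables for the tables that carry hh_id; the column name share_complete is an assumption for illustration.

```r
# Row counts copied from Tables 33 and 34 (tables with an hh_id column)
row_counts <- data.frame(
  table = c("hh", "person", "day", "vehicle",
            "trip_unlinked", "trip_linked", "tour"),
  rows = c(18122, 37616, 134187, 25849, 468018, 419469, 160091),
  complete_household_rows = c(15641, 31255, 96370, 21770,
                              411573, 366186, 139240)
)

# Share of delivered rows that belong to complete households
row_counts$share_complete <-
  row_counts$complete_household_rows / row_counts$rows

# Print shares rounded to three decimals
print(transform(row_counts, share_complete = round(share_complete, 3)))
```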
View Sample Records
Before summarizing a table, it is often useful to preview a few records and confirm that the key fields look as expected.
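Because the person table has 269 columns, a preview restricted to a few key fields can be easier to scan. The sketch below uses a small made-up table in place of hts$person; the choice of key_columns is an assumption for illustration.

```r
# Toy stand-in for hts$person (made-up values)
person <- data.frame(
  person_id = c("1", "2", "3"),
  hh_id = c("10", "10", "11"),
  age = c(8, 9, 5),
  employment = c(5, 5, 1),
  person_weight = c(157.4, 157.4, 103.3)
)

# Columns to inspect first (illustrative choice)
key_columns <- c("person_id", "hh_id", "age", "person_weight")

# Guard against typos in column names before subsetting
stopifnot(all(key_columns %in% names(person)))

# First two records, key columns only
person_key_preview <- head(person[, key_columns], n = 2)
```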
person_preview <- head(hts$person)
Table 35 shows the first few person records from the prepared data.
gt::gt(person_preview) %>%
  gt::tab_header(title = "Sample Person Records")
Sample Person Records
| person_id | person_num | hh_id | surveyable | is_participant | is_proxy | has_proxy | has_phone | phone_type | hh_is_complete | is_complete | num_days_complete | num_trips | relationship | age | gender | race_other | ethnicity_other | employment | work_mode | job_type | num_jobs | work_lon | work_lat | work_in_region | work_state | work_county | work_bg_2010 | work_bg_2020 | work_puma_2012 | work_puma_2022 | education | student | school_mode | school_type | school_freq | remote_class_freq | school_in_region | school_state | school_county | school_puma_2012 | school_puma_2022 | school_bg_2010 | school_bg_2020 | school_lon | school_lat | second_home | second_home_in_region | second_home_state | second_home_county | second_home_puma_2012 | second_home_puma_2022 | second_home_bg_2010 | second_home_bg_2020 | second_home_lon | second_home_lat | can_drive | vehicle | transit_freq | tnc_freq | bike_freq | vanpool_freq | bikeshare_freq | scootshare_freq | walk_freq | transit_pass | disability | participate | barriers_1 | barriers_10 | barriers_2 | barriers_3 | barriers_4 | barriers_5 | barriers_6 | barriers_7 | barriers_8 | barriers_9 | barriers_997 | barriers_999 | barriers_other | bicycle_other | bicycle_type_1 | bicycle_type_2 | bicycle_type_997 | bike_comfort_lane | bike_comfort_local | bike_comfort_major | bike_comfort_minor | bike_comfort_neighborhood | bike_comfort_paths | bike_comfort_street | bike_comfort_striped | bike_factors_1 | bike_factors_10 | bike_factors_11 | bike_factors_12 | bike_factors_2 | bike_factors_3 | bike_factors_4 | bike_factors_5 | bike_factors_6 | bike_factors_7 | bike_factors_8 | bike_factors_9 | bike_factors_other | bike_purpose_1 | bike_purpose_2 | bike_purpose_3 | bike_purpose_4 | bike_purpose_5 | bike_purpose_6 | bike_purpose_7 | bike_purpose_8 | bike_purpose_other | bike_safety_1 | bike_safety_2 | bike_safety_3 | bike_safety_4 | bike_safety_5 | bike_safety_6 | bike_safety_7 | bike_safety_8 | bike_safety_other | bike_store_1 | bike_store_2 | bike_store_3 | bike_store_4 | bike_store_5 | bike_store_6 | bike_store_7 | bike_store_997 | carshare_freq | commute_days_1 | commute_days_2 | commute_days_3 | commute_days_4 | commute_days_5 | commute_days_6 | commute_days_7 | commute_days_996 | commute_subsidy_1 | commute_subsidy_10 | commute_subsidy_11 | commute_subsidy_12 | commute_subsidy_13 | commute_subsidy_14 | commute_subsidy_2 | commute_subsidy_3 | commute_subsidy_4 | commute_subsidy_5 | commute_subsidy_6 | commute_subsidy_7 | commute_subsidy_8 | commute_subsidy_9 | commute_subsidy_996 | commute_subsidy_998 | commute_subsidy_use_1 | commute_subsidy_use_10 | commute_subsidy_use_11 | commute_subsidy_use_12 | commute_subsidy_use_13 | commute_subsidy_use_14 | commute_subsidy_use_2 | commute_subsidy_use_3 | commute_subsidy_use_4 | commute_subsidy_use_5 | commute_subsidy_use_6 | commute_subsidy_use_7 | commute_subsidy_use_8 | commute_subsidy_use_9 | commute_subsidy_use_996 | ethnicity_1 | ethnicity_2 | ethnicity_3 | ethnicity_4 | ethnicity_997 | ethnicity_999 | ev_subsidies | ev_typical_charge_1 | ev_typical_charge_2 | ev_typical_charge_3 | ev_typical_charge_4 | ev_typical_charge_5 | ev_typical_charge_6 | ev_typical_charge_997 | home_vehicle_park | home_vehicle_park_pay | home_vehicle_park_permit | micromobility_devices_1 | micromobility_devices_2 | micromobility_devices_3 | micromobility_devices_996 | micromobility_devices_997 | num_bicycles | peerrent_freq | race_1 | race_2 | race_3 | race_4 | race_5 | race_997 | race_999 | share_2 | share_3 | share_4 | share_5 | share_6 | share_7 | share_996 | telework_days_1 | telework_days_2 | telework_days_3 | telework_days_4 | telework_days_5 | telework_days_6 | telework_days_7 | telework_days_996 | telework_freq_pre_covid | transit_factors_1 | transit_factors_10 | transit_factors_11 | transit_factors_12 | transit_factors_2 | transit_factors_3 | transit_factors_4 | transit_factors_5 | transit_factors_6 | transit_factors_7 | transit_factors_8 | transit_factors_9 | transit_factors_other | transit_purpose_1 | transit_purpose_2 | transit_purpose_3 | transit_purpose_4 | transit_purpose_5 | transit_purpose_6 | transit_purpose_7 | transit_purpose_other | walk_purpose_1 | walk_purpose_2 | walk_purpose_3 | walk_purpose_4 | walk_purpose_5 | walk_purpose_6 | walk_purpose_7 | walk_purpose_8 | walk_purpose_other | why_no_bike_1 | why_no_bike_2 | why_no_bike_3 | why_no_bike_4 | why_no_bike_5 | why_no_bike_6 | why_no_bike_7 | why_no_bike_8 | person_type | person_weight | race_imputed | ethnicity_imputed | gender_imputed | person_weight_tue | person_weight_fri | person_weight_mon | person_weight_sat | person_weight_sun | person_weight_thu | person_weight_wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2400008901 | 1 | 24000089 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | 0 | 8 | 1 | NA | NA | 5 | 995 | 995 | 995 | NA | NA | 995 | NA | NA | NA | NA | NA | NA | 7 | 2 | 995 | 995 | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 6 | 4 | 7 | 8 | 995 | 995 | 995 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 1 | 0 | 0 | 3 | 1 | 4 | 3 | 2 | 1 | 2 | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 995 | 0 | 0 | 0 | 1 | 0 | 1 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | NA | 0 | 1 | 0 | 0 | 0 | 0 | 1 | cultural events | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | NA | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 157.42583 | white | not_hispanic | female | 187.88768 | NA | NA | NA | NA | NA | NA |
| 2400008902 | 2 | 24000089 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 5 | 1 | 9 | 2 | NA | NA | 5 | 995 | 995 | 995 | NA | NA | 995 | NA | NA | NA | NA | NA | NA | 7 | 2 | 995 | 995 | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 7 | 8 | 8 | 996 | 995 | 995 | 995 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 157.42583 | white | not_hispanic | male | 187.88768 | NA | NA | NA | NA | NA | NA |
| 2400012201 | 1 | 24000122 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 7 | 39 | 0 | 5 | 1 | NA | NA | 5 | 995 | 995 | 995 | NA | NA | 995 | NA | NA | NA | NA | NA | NA | 7 | 0 | 1 | 13 | 4 | 996 | 1 | 25 | 25017 | 03400 | 00613 | 250173736002 | 250173736002 | -71.16924 | 42.33609 | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | 5 | 8 | 8 | 995 | 995 | 995 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 1 | 0 | 0 | 2 | 1 | 4 | 4 | 2 | 1 | 3 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 1 | 0 | 1 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NA | 1 | 1 | 0 | 1 | 0 | 0 | 0 | NA | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | NA | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 3 | 103.34663 | white | not_hispanic | female | 59.87428 | 75.95139 | 65.09989 | 76.29594 | 74.33908 | 64.49532 | 58.57788 |
| 2400014001 | 1 | 24000140 | 1 | 1 | 0 | 0 | 1 | 995 | 1 | 1 | 7 | 45 | 0 | 9 | 1 | NA | NA | 1 | 104 | 5 | 1 | -71.05214 | 42.35606 | 1 | 25 | 25025 | 250250701018 | 250250701042 | 03302 | 00802 | 7 | 2 | 995 | 995 | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 6 | 4 | 8 | 996 | 995 | 995 | 995 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 995 | 995 | 995 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | 0 | 0 | 1 | 0 | 0 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 188.21553 | white | not_hispanic | female | 458.44733 | 655.71500 | 546.06550 | 651.46690 | 640.43646 | 489.71295 | 463.14780 |
| 2400015802 | 2 | 24000158 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 5 | 2 | NA | NA | 1 | 100 | 1 | 1 | -71.79986 | 42.26745 | 1 | 25 | 25027 | 250277317001 | 250277317002 | 00300 | 00505 | 7 | 2 | 995 | 995 | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 6 | 9 | 995 | 8 | 995 | 995 | 995 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 996 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 50.69754 | white | not_hispanic | male | 216.88728 | NA | NA | NA | NA | NA | NA |
| 2400015803 | 3 | 24000158 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 4 | 2 | 4 | 2 | NA | NA | 2 | 100 | 5 | 2 | -72.67345 | 41.76257 | 0 | 09 | 09003 | 090035021002 | 090035021001 | 00302 | 20201 | 6 | 3 | 995 | 13 | 995 | 2 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 9 | 9 | 995 | 996 | 995 | 995 | 995 | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 50.69754 | white | not_hispanic | male | 216.88728 | NA | NA | NA | NA | NA | NA |
Inspect Weight Columns
Weights are typically included in the household, person, day, trip, and tour tables when weighted estimates are supported.
weight_summaries <- data.frame(
table = character(),
weight_column = character(),
min = numeric(),
median = numeric(),
mean = numeric(),
max = numeric(),
zero_count = integer(),
stringsAsFactors = FALSE
)
for (table_name in names(hts)) {
weight_columns <- grep("_weight$", names(hts[[table_name]]), value = TRUE)
if (length(weight_columns) == 0L) {
next
}
weight_col <- weight_columns[[1]]
weight_vec <- hts[[table_name]][[weight_col]]
weight_summaries <- rbind(
weight_summaries,
data.frame(
table = table_name,
weight_column = weight_col,
min = min(weight_vec, na.rm = TRUE),
median = stats::median(weight_vec, na.rm = TRUE),
mean = mean(weight_vec, na.rm = TRUE),
max = max(weight_vec, na.rm = TRUE),
zero_count = sum(weight_vec == 0, na.rm = TRUE),
stringsAsFactors = FALSE
)
)
}
Table 36 is a quick way to check whether the main weight columns are present and populated.
gt::gt(weight_summaries) %>%
gt::fmt_number(
columns = c(min, median, mean, max),
decimals = 3
) %>%
gt::fmt_number(
columns = zero_count,
decimals = 0,
sep_mark = ","
) %>%
gt::cols_label(
table = "Table",
weight_column = "Weight Column",
min = "Min",
median = "Median",
mean = "Mean",
max = "Max",
zero_count = "Zero Count"
) %>%
gt::tab_options(
table.font.size = gt::px(13),
data_row.padding = gt::px(4)
)
| Table | Weight Column | Min | Median | Mean | Max | Zero Count |
|---|---|---|---|---|---|---|
| hh | hh_weight | 15.780 | 106.787 | 180.980 | 1,129.933 | 0 |
| person | person_weight | 0.000 | 112.787 | 217.239 | 4,230.993 | 1,556 |
| day | day_weight | 0.000 | 30.002 | 108.071 | 3,570.328 | 13,520 |
| vehicle | hh_weight | 15.780 | 116.151 | 204.342 | 1,129.933 | 0 |
| trip_unlinked | trip_weight | 0.000 | 28.591 | 117.434 | 5,604.895 | 56,013 |
| trip_linked | linked_trip_weight | 0.000 | 29.896 | 124.146 | 5,604.895 | 51,963 |
| tour | tour_weight | 0.000 | 30.100 | 128.477 | 5,373.668 | 21,243 |
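The zero counts in Table 36 matter for estimation, because a record with a weight of zero contributes nothing to a weighted total. The following is a minimal sketch in base R, using a made-up `toy_person` data frame (the values are invented for illustration and are not taken from the survey), of how the share of zero-weight rows and a weighted mean might be computed:

```r
# Hypothetical toy data: column names mirror the person table, values are invented.
toy_person <- data.frame(
  person_id = 1:5,
  num_trips = c(2, 5, 0, 4, 3),
  person_weight = c(100, 0, 50, 150, 0)
)

# Share of records that carry a zero weight.
zero_share <- mean(toy_person$person_weight == 0)

# Weighted mean of trips per person; zero-weight rows add nothing to
# either the numerator or the denominator.
weighted_trips <- stats::weighted.mean(
  toy_person$num_trips,
  w = toy_person$person_weight
)

zero_share
weighted_trips
```

The same pattern would apply to the real tables by swapping in, for example, `hts$person` and `person_weight`, assuming those objects are loaded as in the rest of this section.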
The codebook object can also be inspected immediately after loading.
variable_list_preview <- head(codebook$variable_list)
value_labels_preview <- head(codebook$value_labels)
Review Table 37 and Table 38 before analysis so you can confirm variable definitions and labeled response values.
gt::gt(variable_list_preview) %>%
gt::tab_header(title = "Variable List Preview")
Variable List Preview
| order | source | variable | is_checkbox | hh | person | day | vehicle | location | unlinked_trip | linked_trip | tour | logic | description | data_type | write_to_export | exclude_from_frequencies | category | exclude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | pipeline | hh_id | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | NA | Household ID | integer | TRUE | TRUE | NA | NA |
| 2 | pipeline | is_complete | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | NA | Record is complete | integer/categorical | TRUE | FALSE | NA | NA |
| 3 | pipeline | num_trips | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | NA | Number of trips | integer | TRUE | FALSE | NA | NA |
| 4 | pipeline | num_days_complete | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | Number of complete days | integer/categorical | TRUE | FALSE | NA | NA |
| 5 | pipeline | first_travel_date | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | NA | First travel date | date | TRUE | TRUE | NA | NA |
| 6 | pipeline | last_travel_date | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | NA | Last travel date | date | TRUE | TRUE | NA | NA |
gt::gt(value_labels_preview) %>%
gt::tab_header(title = "Value Labels Preview")
Value Labels Preview
| variable | value | label |
|---|---|---|
| added_trip | 0 | No |
| added_trip | 1 | Yes |
| age | 1 | Age under 5 |
| age | 10 | Age 75-84 |
| age | 11 | Age 85 up |
| age | 2 | Age 5-15 |
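The value labels can be attached to a coded column with a simple lookup. The sketch below uses hypothetical stand-ins (`toy_value_labels`, `toy_person`) rather than the real `codebook$value_labels` object, and it assumes that the `value` column is stored as character, which the sort order in the preview (1, 10, 11, 2) suggests but does not prove:

```r
# Hypothetical stand-ins for codebook$value_labels and a coded person column.
toy_value_labels <- data.frame(
  variable = c("age", "age", "age"),
  value = c("1", "2", "10"),
  label = c("Age under 5", "Age 5-15", "Age 75-84"),
  stringsAsFactors = FALSE
)
toy_person <- data.frame(person_id = 1:3, age = c(10, 1, 2))

# Keep only the labels for the variable of interest.
age_labels <- toy_value_labels[toy_value_labels$variable == "age", ]

# Match on character values so that the number 10 lines up with the string "10".
toy_person$age_label <- age_labels$label[
  match(as.character(toy_person$age), age_labels$value)
]

toy_person
```

Codes without a matching label come back as `NA`, which makes unlabeled values easy to spot before analysis.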
11 Data Structure and Joins
The dataset is organized as a set of related tables. Each table represents a different unit of observation, such as households, people, travel days, vehicles, trips, locations, or tours. Most analyses require using more than one table, so it is important to understand how records are linked before joining tables. The examples below use the same dplyr join pattern used throughout the analyst handbook.
For MassDOT, the main linkage pattern is:
- Households are the primary sampling unit and are identified by hh_id.
- Persons are nested within households and linked by hh_id.
- Each person can have multiple travel days, linked by person_id and day_id.
- Each day can record zero or more trips, linked from the day and person tables by day_id and person_id.
- Trips can be analyzed at the linked or unlinked level depending on the research question.
- Location records provide point-level context along trips and are linked by trip and day identifiers.
- Tours summarize sequences of linked trips that begin and end at the same anchor location.
Figure 9 repeats the dataset overview figure so the table hierarchy is visible while working through joins.
The first step in any join workflow is to inspect the identifier columns that connect the tables.
id_columns_summary <- data.frame(
table = character(),
id_columns = character(),
stringsAsFactors = FALSE
)
for (table_name in names(hts)) {
id_columns <- grep("_id$", names(hts[[table_name]]), value = TRUE)
if (length(id_columns) == 0L) {
next
}
id_columns_summary <- rbind(
id_columns_summary,
data.frame(
table = table_name,
id_columns = paste(id_columns, collapse = ", "),
stringsAsFactors = FALSE
)
)
}
Use Table 39 to confirm which keys are available before you start joining tables.
gt::gt(id_columns_summary) %>%
gt::cols_label(
table = "Table",
id_columns = "Identifier Columns"
) %>%
gt::tab_options(
table.font.size = gt::px(13),
data_row.padding = gt::px(4)
)
| Table | Identifier Columns |
|---|---|
| hh | hh_id |
| person | person_id, hh_id |
| day | day_id, person_id, hh_id |
| vehicle | hh_id, vehicle_id |
| location | trip_id |
| trip_unlinked | trip_id, day_id, hh_id, person_id, linked_trip_id, joint_trip_id, tour_id |
| trip_linked | linked_trip_id, hh_id, person_id, day_id, joint_trip_id, tour_id |
| tour | tour_id, hh_id, person_id, day_id, out_chauffeur_id, inb_chauffeur_id, out_chauffeur_tour_id, inb_chauffeur_tour_id, parent_tour_id, joint_tour_id |
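Knowing which identifier columns exist is only part of the check; it also helps to confirm that each key behaves as expected before joining. The sketch below uses hypothetical toy tables (`toy_hh`, `toy_person`) in place of `hts$hh` and `hts$person`. It tests that the parent key is unique and lists child rows whose key has no match in the parent table:

```r
# Hypothetical toy tables; the real inputs would be hts$hh and hts$person.
toy_hh <- data.frame(hh_id = c(1, 2, 3))
toy_person <- data.frame(
  person_id = c(101, 102, 201, 401),
  hh_id = c(1, 1, 2, 4)
)

# Primary key check: each hh_id should appear exactly once in the household table.
hh_key_is_unique <- anyDuplicated(toy_hh$hh_id) == 0L

# Foreign key check: person rows whose hh_id is missing from the household table.
orphan_persons <- dplyr::anti_join(toy_person, toy_hh, by = "hh_id")

hh_key_is_unique
orphan_persons
```

In this toy example, person 401 points at a household that does not exist, so it is returned by the `anti_join()`. An empty result on the real tables would indicate that every person record has a parent household.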
11.1 Common Join Patterns
The most common joins follow the hierarchy from households to lower-level records.
For MassDOT, many substantive analyses should also carry the household completion rule through the join process. A common pattern is to join hh$is_complete or filter lower-level tables with hh_id %in% complete_hh_ids before summarizing.
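The join examples in this section refer to `complete_hh_ids`, whose definition is not shown here. One plausible construction, assuming that `is_complete == 1` marks a complete household (the same condition used in the household filter in the join example), is sketched below with hypothetical toy tables:

```r
# Hypothetical toy tables; the real inputs would be hts$hh and a lower-level table.
toy_hh <- data.frame(hh_id = c(1, 2, 3), is_complete = c(1, 0, 1))
toy_trip <- data.frame(trip_id = 1:4, hh_id = c(1, 2, 2, 3))

# Assumed definition: IDs of households flagged as complete.
complete_hh_ids <- toy_hh$hh_id[toy_hh$is_complete == 1]

# Filter a lower-level table to complete households before summarizing.
toy_trip_complete <- toy_trip[toy_trip$hh_id %in% complete_hh_ids, ]

toy_trip_complete
```

Filtering with `%in%` leaves the lower-level table's columns untouched, which can be simpler than joining the completion flag when the flag itself is not needed downstream.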
For example, to join household characteristics to people, first select the household fields that belong in the person-level analysis file.
household_join_fields <- hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::mutate(hh_is_complete = is_complete)
if ("income_detailed" %in% names(hts$hh)) {
household_join_fields <- household_join_fields %>%
dplyr::mutate(household_income = income_detailed)
}
if ("num_vehicles" %in% names(hts$hh)) {
household_join_fields <- household_join_fields %>%
dplyr::mutate(household_vehicles = num_vehicles)
}
household_join_fields <- household_join_fields %>%
dplyr::select(
hh_id,
hh_is_complete,
dplyr::any_of(c("household_income", "household_vehicles"))
)
person_with_household <- hts$person %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::left_join(
household_join_fields,
by = "hh_id"
)
Table 40 shows the person table after household variables have been joined in.
gt::gt(head(person_with_household)) %>%
gt::tab_header(title = "Person Records Joined to Household Variables")
Person Records Joined to Household Variables
| person_id | person_num | hh_id | surveyable | is_participant | is_proxy | has_proxy | has_phone | phone_type | hh_is_complete.x | is_complete | num_days_complete | num_trips | relationship | age | gender | race_other | ethnicity_other | employment | work_mode | job_type | num_jobs | work_lon | work_lat | work_in_region | work_state | work_county | work_bg_2010 | work_bg_2020 | work_puma_2012 | work_puma_2022 | education | student | school_mode | school_type | school_freq | remote_class_freq | school_in_region | school_state | school_county | school_puma_2012 | school_puma_2022 | school_bg_2010 | school_bg_2020 | school_lon | school_lat | second_home | second_home_in_region | second_home_state | second_home_county | second_home_puma_2012 | second_home_puma_2022 | second_home_bg_2010 | second_home_bg_2020 | second_home_lon | second_home_lat | can_drive | vehicle | transit_freq | tnc_freq | bike_freq | vanpool_freq | bikeshare_freq | scootshare_freq | walk_freq | transit_pass | disability | participate | barriers_1 | barriers_10 | barriers_2 | barriers_3 | barriers_4 | barriers_5 | barriers_6 | barriers_7 | barriers_8 | barriers_9 | barriers_997 | barriers_999 | barriers_other | bicycle_other | bicycle_type_1 | bicycle_type_2 | bicycle_type_997 | bike_comfort_lane | bike_comfort_local | bike_comfort_major | bike_comfort_minor | bike_comfort_neighborhood | bike_comfort_paths | bike_comfort_street | bike_comfort_striped | bike_factors_1 | bike_factors_10 | bike_factors_11 | bike_factors_12 | bike_factors_2 | bike_factors_3 | bike_factors_4 | bike_factors_5 | bike_factors_6 | bike_factors_7 | bike_factors_8 | bike_factors_9 | bike_factors_other | bike_purpose_1 | bike_purpose_2 | bike_purpose_3 | bike_purpose_4 | bike_purpose_5 | bike_purpose_6 | bike_purpose_7 | bike_purpose_8 | bike_purpose_other | bike_safety_1 | bike_safety_2 | bike_safety_3 | bike_safety_4 | bike_safety_5 | bike_safety_6 | bike_safety_7 | bike_safety_8 | bike_safety_other | bike_store_1 | 
bike_store_2 | bike_store_3 | bike_store_4 | bike_store_5 | bike_store_6 | bike_store_7 | bike_store_997 | carshare_freq | commute_days_1 | commute_days_2 | commute_days_3 | commute_days_4 | commute_days_5 | commute_days_6 | commute_days_7 | commute_days_996 | commute_subsidy_1 | commute_subsidy_10 | commute_subsidy_11 | commute_subsidy_12 | commute_subsidy_13 | commute_subsidy_14 | commute_subsidy_2 | commute_subsidy_3 | commute_subsidy_4 | commute_subsidy_5 | commute_subsidy_6 | commute_subsidy_7 | commute_subsidy_8 | commute_subsidy_9 | commute_subsidy_996 | commute_subsidy_998 | commute_subsidy_use_1 | commute_subsidy_use_10 | commute_subsidy_use_11 | commute_subsidy_use_12 | commute_subsidy_use_13 | commute_subsidy_use_14 | commute_subsidy_use_2 | commute_subsidy_use_3 | commute_subsidy_use_4 | commute_subsidy_use_5 | commute_subsidy_use_6 | commute_subsidy_use_7 | commute_subsidy_use_8 | commute_subsidy_use_9 | commute_subsidy_use_996 | ethnicity_1 | ethnicity_2 | ethnicity_3 | ethnicity_4 | ethnicity_997 | ethnicity_999 | ev_subsidies | ev_typical_charge_1 | ev_typical_charge_2 | ev_typical_charge_3 | ev_typical_charge_4 | ev_typical_charge_5 | ev_typical_charge_6 | ev_typical_charge_997 | home_vehicle_park | home_vehicle_park_pay | home_vehicle_park_permit | micromobility_devices_1 | micromobility_devices_2 | micromobility_devices_3 | micromobility_devices_996 | micromobility_devices_997 | num_bicycles | peerrent_freq | race_1 | race_2 | race_3 | race_4 | race_5 | race_997 | race_999 | share_2 | share_3 | share_4 | share_5 | share_6 | share_7 | share_996 | telework_days_1 | telework_days_2 | telework_days_3 | telework_days_4 | telework_days_5 | telework_days_6 | telework_days_7 | telework_days_996 | telework_freq_pre_covid | transit_factors_1 | transit_factors_10 | transit_factors_11 | transit_factors_12 | transit_factors_2 | transit_factors_3 | transit_factors_4 | transit_factors_5 | transit_factors_6 | transit_factors_7 | transit_factors_8 | 
transit_factors_9 | transit_factors_other | transit_purpose_1 | transit_purpose_2 | transit_purpose_3 | transit_purpose_4 | transit_purpose_5 | transit_purpose_6 | transit_purpose_7 | transit_purpose_other | walk_purpose_1 | walk_purpose_2 | walk_purpose_3 | walk_purpose_4 | walk_purpose_5 | walk_purpose_6 | walk_purpose_7 | walk_purpose_8 | walk_purpose_other | why_no_bike_1 | why_no_bike_2 | why_no_bike_3 | why_no_bike_4 | why_no_bike_5 | why_no_bike_6 | why_no_bike_7 | why_no_bike_8 | person_type | person_weight | race_imputed | ethnicity_imputed | gender_imputed | person_weight_tue | person_weight_fri | person_weight_mon | person_weight_sat | person_weight_sun | person_weight_thu | person_weight_wed | hh_is_complete.y | household_income | household_vehicles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2400008901 | 1 | 24000089 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | 0 | 8 | 1 | NA | NA | 5 | 995 | 995 | 995 | NA | NA | 995 | NA | NA | NA | NA | NA | NA | 7 | 2 | 995 | 995 | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 6 | 4 | 7 | 8 | 995 | 995 | 995 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 1 | 0 | 0 | 3 | 1 | 4 | 3 | 2 | 1 | 2 | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 995 | 0 | 0 | 0 | 1 | 0 | 1 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | NA | 0 | 1 | 0 | 0 | 0 | 0 | 1 | cultural events | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | NA | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 157.42583 | white | not_hispanic | female | 187.88768 | NA | NA | NA | NA | NA | NA | 1 | 999 | 2 |
| 2400008902 | 2 | 24000089 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 5 | 1 | 9 | 2 | NA | NA | 5 | 995 | 995 | 995 | NA | NA | 995 | NA | NA | NA | NA | NA | NA | 7 | 2 | 995 | 995 | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 7 | 8 | 8 | 996 | 995 | 995 | 995 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 157.42583 | white | not_hispanic | male | 187.88768 | NA | NA | NA | NA | NA | NA | 1 | 999 | 2 |
| 2400012201 | 1 | 24000122 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 7 | 39 | 0 | 5 | 1 | NA | NA | 5 | 995 | 995 | 995 | NA | NA | 995 | NA | NA | NA | NA | NA | NA | 7 | 0 | 1 | 13 | 4 | 996 | 1 | 25 | 25017 | 03400 | 00613 | 250173736002 | 250173736002 | -71.16924 | 42.33609 | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | 5 | 8 | 8 | 995 | 995 | 995 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 1 | 0 | 0 | 2 | 1 | 4 | 4 | 2 | 1 | 3 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 1 | 0 | 1 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NA | 1 | 1 | 0 | 1 | 0 | 0 | 0 | NA | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | NA | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 3 | 103.34663 | white | not_hispanic | female | 59.87428 | 75.95139 | 65.09989 | 76.29594 | 74.33908 | 64.49532 | 58.57788 | 1 | 4 | 0 |
| 2400014001 | 1 | 24000140 | 1 | 1 | 0 | 0 | 1 | 995 | 1 | 1 | 7 | 45 | 0 | 9 | 1 | NA | NA | 1 | 104 | 5 | 1 | -71.05214 | 42.35606 | 1 | 25 | 25025 | 250250701018 | 250250701042 | 03302 | 00802 | 7 | 2 | 995 | 995 | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 6 | 4 | 8 | 996 | 995 | 995 | 995 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 995 | 995 | 995 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | 0 | 0 | 1 | 0 | 0 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 188.21553 | white | not_hispanic | female | 458.44733 | 655.71500 | 546.06550 | 651.46690 | 640.43646 | 489.71295 | 463.14780 | 1 | 9 | 1 |
| 2400015802 | 2 | 24000158 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 5 | 2 | NA | NA | 1 | 100 | 1 | 1 | -71.79986 | 42.26745 | 1 | 25 | 25027 | 250277317001 | 250277317002 | 00300 | 00505 | 7 | 2 | 995 | 995 | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 6 | 9 | 995 | 8 | 995 | 995 | 995 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 996 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 50.69754 | white | not_hispanic | male | 216.88728 | NA | NA | NA | NA | NA | NA | 1 | 999 | 4 |
| 2400015803 | 3 | 24000158 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 4 | 2 | 4 | 2 | NA | NA | 2 | 100 | 5 | 2 | -72.67345 | 41.76257 | 0 | 09 | 09003 | 090035021002 | 090035021001 | 00302 | 20201 | 6 | 3 | 995 | 13 | 995 | 2 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 9 | 9 | 995 | 996 | 995 | 995 | 995 | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | NA | 995 | 995 | 995 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 0 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NA | 995 | 995 | 995 | 995 | 995 | 995 | 995 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 50.69754 | white | not_hispanic | male | 216.88728 | NA | NA | NA | NA | NA | NA | 1 | 999 | 4 |
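The hh_is_complete.x and hh_is_complete.y columns in Table 40 arise because both inputs to the join carry a column with that name and it is not part of `by`, so dplyr adds suffixes to tell the two apart. One way to avoid the suffixes, sketched here with hypothetical toy tables (`toy_person`, `toy_hh_fields`), is to drop non-key columns that already exist in the left table before joining, and then confirm that the join did not change the row count:

```r
# Hypothetical toy tables; values are invented for illustration.
toy_person <- data.frame(
  person_id = c(101, 102, 201),
  hh_id = c(1, 1, 2),
  hh_is_complete = c(1, 1, 1)
)
toy_hh_fields <- data.frame(
  hh_id = c(1, 2),
  hh_is_complete = c(1, 1),
  household_vehicles = c(2, 0)
)

# Non-key columns present in both tables would otherwise receive .x/.y suffixes.
overlap <- setdiff(intersect(names(toy_person), names(toy_hh_fields)), "hh_id")

# Drop the overlapping columns from the lookup, then join on the key.
toy_joined <- dplyr::left_join(
  toy_person,
  dplyr::select(toy_hh_fields, -dplyr::all_of(overlap)),
  by = "hh_id"
)

# A left join onto a unique key should keep one row per person.
stopifnot(nrow(toy_joined) == nrow(toy_person))

names(toy_joined)
```

The same idea applies to joins on person_id where both tables also carry hh_id: either drop hh_id from the lookup or include it in `by`.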
To join person characteristics to trips, build a person-level lookup first and then join it to the trip table.
person_join_fields <- hts$person %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::transmute(
person_id,
hh_id,
person_age = age,
person_gender = gender,
person_employment = employment
)
trip_with_person <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::left_join(
person_join_fields,
by = "person_id"
)
Table 41 shows the trip table after person-level fields have been added.
gt::gt(head(trip_with_person)) %>%
gt::tab_header(title = "Trip Records Joined to Person Variables")
Trip Records Joined to Person Variables
| trip_id | day_id | trip_num | hh_id.x | first_travel_date | last_travel_date | person_id | travel_date | travel_dow | day_num | hh_day_complete | hh_is_complete | day_is_complete | trip_survey_complete | depart_time | depart_date | depart_dow | depart_hour | depart_minute | depart_seconds | arrive_time | arrive_date | arrive_dow | arrive_hour | arrive_minute | arrive_second | distance_meters | distance_miles | duration_seconds | duration_minutes | dwell_mins | speed_mph | speed_flag | o_in_region | o_state | o_county | o_puma_2012 | o_puma_2022 | o_bg_2010 | o_bg_2020 | o_lon | o_lat | d_in_region | d_state | d_county | d_puma_2012 | d_puma_2022 | d_bg_2010 | d_bg_2020 | d_lon | d_lat | mode_type | mode_1 | mode_2 | mode_3 | mode_4 | mode_other_specify | transit_egress | transit_access | num_travelers | num_hh_travelers | num_non_hh_travelers | hh_member_1 | hh_member_2 | hh_member_3 | hh_member_4 | hh_member_5 | hh_member_6 | hh_member_7 | hh_member_8 | driver | ev_charge_station | ev_charge_station_decision | o_purpose_category_reported | o_purpose_category | o_purpose_reported | o_purpose | d_purpose_category_reported | d_purpose_reported | d_purpose_category | d_purpose | d_purpose_other | park_location | park_type | bike_park_loc | scooter_park_location | park_cost | taxi_cost | taxi_pay | taxi_type | tnc_type | transit_type | user_merged | user_split | user_deleted | added_type | copied_from_proxy | unlinked_split | split_loop | days_first_trip | days_last_trip | is_transit_leg | linked_trip_id | is_transit | is_access | is_egress | has_access | has_egress | transit_quality_flag | has_synthetic_access | has_synthetic_egress | added_trip | person_num | ev_charge_station_level_1 | ev_charge_station_level_2 | ev_charge_station_level_3 | ev_charge_station_level_998 | other_bicycle | vehicle_park_pay | distance_beeline | joint_trip_num | joint_trip_id | corrected_hh_members | imputed_record_type | imputed_host_trip | imputed_joint_trip | home_distance | 
linked_trip_num | tour_num | tour_id | trip_weight | trip_weight_tue | trip_weight_fri | trip_weight_mon | trip_weight_sat | trip_weight_sun | trip_weight_thu | trip_weight_wed | is_complete | hh_id.y | person_age | person_gender | person_employment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2400008901001 | 240000890101 | 1 | 24000089 | 2024-06-11 | 2024-06-11 | 2400008901 | 2024-06-11 | 2 | 1 | 1 | 1 | 1 | 1 | 2024-06-11 18:20:00 | 2024-06-11 | 2 | 14 | 20 | 0 | 2024-06-11 18:27:00 | 2024-06-11 | 2 | 14 | 27 | 0 | 4710 | 2.9266656 | 420 | 7 | 95 | 25.085705 | 0 | 1 | 25 | 25009 | 00703 | 00706 | 250092174003 | 250092174023 | -70.87968 | 42.54132 | 1 | 25 | 25009 | 00703 | 00706 | 250092176004 | 250092176012 | -70.85492 | 42.57236 | 8 | 6 | 995 | 995 | 995 | None | 995 | 995 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | NA | 1 | NA | 1 | 9 | 54 | 9 | 54 | None | 1 | 995 | 995 | 995 | NA | NA | 995 | 995 | 995 | 995 | 995 | 995 | 0 | NA | 0 | 0 | 0 | 1 | 0 | 0 | 2400008901010101 | 0 | 995 | 995 | 995 | 995 | None | 995 | 995 | 1 | 1 | 995 | 995 | 995 | 995 | None | 995 | 4007.6875 | -1 | -1 | 0 | 0 | -1 | 0 | 2.9798525 | 1 | 1 | 24000089010101 | 247.1357 | 316.6946 | NA | NA | NA | NA | NA | NA | 1 | 24000089 | 8 | 1 | 5 |
| 2400008901002 | 240000890101 | 2 | 24000089 | 2024-06-11 | 2024-06-11 | 2400008901 | 2024-06-11 | 2 | 1 | 1 | 1 | 1 | 1 | 2024-06-11 20:02:00 | 2024-06-11 | 2 | 16 | 2 | 0 | 2024-06-11 20:15:00 | 2024-06-11 | 2 | 16 | 15 | 0 | 4730 | 2.9390930 | 780 | 13 | 645 | 13.565045 | 0 | 1 | 25 | 25009 | 00703 | 00706 | 250092176004 | 250092176012 | -70.85492 | 42.57236 | 1 | 25 | 25009 | 00703 | 00706 | 250092174003 | 250092174023 | -70.87968 | 42.54132 | 8 | 6 | 995 | 995 | 995 | None | 995 | 995 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 9 | 9 | 54 | 54 | 1 | 1 | 1 | 1 | None | 1 | 995 | 995 | 995 | NA | NA | 995 | 995 | 995 | 995 | 995 | 995 | 0 | NA | 0 | 0 | 0 | 0 | 1 | 0 | 2400008901010102 | 0 | 995 | 995 | 995 | 995 | None | 995 | 995 | 1 | 1 | 995 | 995 | 995 | 995 | None | 995 | 4007.6875 | -1 | -1 | 0 | 0 | -1 | 0 | 0.0000000 | 2 | 1 | 24000089010101 | 247.1357 | 316.6946 | NA | NA | NA | NA | NA | NA | 1 | 24000089 | 8 | 1 | 5 |
| 2400008902001 | 240000890201 | 1 | 24000089 | 2024-06-11 | 2024-06-11 | 2400008902 | 2024-06-11 | 2 | 1 | 1 | 1 | 1 | 1 | 2024-06-11 12:00:00 | 2024-06-11 | 2 | 8 | 0 | 0 | 2024-06-11 12:05:00 | 2024-06-11 | 2 | 8 | 5 | 0 | 666 | 0.4138342 | 300 | 5 | 75 | 4.966011 | 0 | 1 | 25 | 25009 | 00703 | 00706 | 250092174003 | 250092174023 | -70.87968 | 42.54132 | 1 | 25 | 25009 | 00703 | 00706 | 250092174004 | 250092174022 | -70.88607 | 42.54144 | 8 | 7 | 995 | 995 | 995 | None | 995 | 995 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | NA | 1 | NA | 1 | 8 | 50 | 8 | 50 | None | 4 | 1 | 995 | 995 | NA | NA | 995 | 995 | 995 | 995 | 995 | 995 | 0 | NA | 0 | 0 | 0 | 1 | 0 | 0 | 2400008902010101 | 0 | 995 | 995 | 995 | 995 | None | 995 | 995 | 1 | 2 | 995 | 995 | 995 | 995 | None | 995 | 524.2716 | -1 | -1 | 0 | 0 | -1 | 0 | 0.7113409 | 1 | 1 | 24000089020101 | 247.1357 | 316.6946 | NA | NA | NA | NA | NA | NA | 1 | 24000089 | 9 | 2 | 5 |
| 2400008902002 | 240000890201 | 2 | 24000089 | 2024-06-11 | 2024-06-11 | 2400008902 | 2024-06-11 | 2 | 1 | 1 | 1 | 1 | 1 | 2024-06-11 13:20:00 | 2024-06-11 | 2 | 9 | 20 | 0 | 2024-06-11 13:25:00 | 2024-06-11 | 2 | 9 | 25 | 0 | 1025 | 0.6369071 | 300 | 5 | 15 | 7.642885 | 0 | 1 | 25 | 25009 | 00703 | 00706 | 250092174004 | 250092174022 | -70.88607 | 42.54144 | 1 | 25 | 25009 | 00703 | 00706 | 250092174002 | 250092174013 | -70.88079 | 42.54626 | 8 | 7 | 995 | 995 | 995 | None | 995 | 995 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 8 | 8 | 50 | 50 | 10 | 33 | 10 | 33 | None | 4 | 1 | 995 | 995 | NA | NA | 995 | 995 | 995 | 995 | 995 | 995 | 0 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 2400008902010102 | 0 | 995 | 995 | 995 | 995 | None | 995 | 995 | 1 | 2 | 995 | 995 | 995 | 995 | None | 995 | 689.5092 | -1 | -1 | 0 | 0 | -1 | 0 | 0.2184301 | 2 | 1 | 24000089020101 | 247.1357 | 316.6946 | NA | NA | NA | NA | NA | NA | 1 | 24000089 | 9 | 2 | 5 |
| 2400008902003 | 240000890201 | 3 | 24000089 | 2024-06-11 | 2024-06-11 | 2400008902 | 2024-06-11 | 2 | 1 | 1 | 1 | 1 | 1 | 2024-06-11 13:40:00 | 2024-06-11 | 2 | 9 | 40 | 0 | 2024-06-11 13:45:00 | 2024-06-11 | 2 | 9 | 45 | 0 | 1084 | 0.6735680 | 300 | 5 | 95 | 8.082817 | 0 | 1 | 25 | 25009 | 00703 | 00706 | 250092174002 | 250092174013 | -70.88079 | 42.54626 | 1 | 25 | 25009 | 00703 | 00706 | 250092174003 | 250092174023 | -70.87968 | 42.54132 | 8 | 7 | 995 | 995 | 995 | None | 995 | 995 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 10 | 10 | 33 | 33 | 1 | 1 | 1 | 1 | None | 1 | 995 | 995 | 995 | NA | NA | 995 | 995 | 995 | 995 | 995 | 995 | 0 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 2400008902010103 | 0 | 995 | 995 | 995 | 995 | None | 995 | 995 | 1 | 2 | 995 | 995 | 995 | 995 | None | 995 | 557.4029 | -1 | -1 | 0 | 0 | -1 | 0 | 0.0000000 | 3 | 1 | 24000089020101 | 247.1357 | 316.6946 | NA | NA | NA | NA | NA | NA | 1 | 24000089 | 9 | 2 | 5 |
| 2400008902004 | 240000890201 | 4 | 24000089 | 2024-06-11 | 2024-06-11 | 2400008902 | 2024-06-11 | 2 | 1 | 1 | 1 | 1 | 1 | 2024-06-11 15:20:00 | 2024-06-11 | 2 | 11 | 20 | 0 | 2024-06-11 15:30:00 | 2024-06-11 | 2 | 11 | 30 | 0 | 1782 | 1.1072862 | 600 | 10 | 35 | 6.643717 | 0 | 1 | 25 | 25009 | 00703 | 00706 | 250092174003 | 250092174023 | -70.87968 | 42.54132 | 1 | 25 | 25009 | 00703 | 00706 | 250092173003 | 250092173003 | -70.87901 | 42.55242 | 8 | 7 | 995 | 995 | 995 | None | 995 | 995 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 1 | 1 | 1 | 1 | 10 | 37 | 10 | 37 | None | 4 | 1 | 995 | 995 | NA | NA | 995 | 995 | 995 | 995 | 995 | 995 | 0 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 2400008902010201 | 0 | 995 | 995 | 995 | 995 | None | 995 | 995 | 1 | 2 | 995 | 995 | 995 | 995 | None | 995 | 1236.8675 | -1 | -1 | 0 | 0 | -1 | 0 | 0.4115587 | 1 | 2 | 24000089020102 | 247.1357 | 316.6946 | NA | NA | NA | NA | NA | NA | 1 | 24000089 | 9 | 2 | 5 |
If the dataset includes separate linked and unlinked trip tables, choose the trip table that matches the analysis question before joining. Linked trips are usually the better starting point for origin-destination, purpose, and whole-trip analyses. Unlinked trips are more appropriate when the question depends on leg detail, transfer behavior, or segment-level mode information.
If tours are included, they summarize groups of trips into larger travel patterns. Tour joins should be approached cautiously because the same household, person, day, or trip can appear in multiple downstream analytic datasets depending on the question being asked.
11.2 Join Cautions
Always choose the analytic unit before joining tables. A join can change the number of rows in an analysis dataset if the relationship between tables is one-to-many or many-to-many.
For example:
- Joining household data to person data creates one row per person, not one row per household.
- Joining person data to trip data creates one row per trip, not one row per person.
- Joining trip data to location data may duplicate trip records if multiple locations are associated with one trip or household.
- Joining lower-level tables back to higher-level tables can change the interpretation of counts and rates.
- Shared travel fields can create the appearance of duplicated movements because one physical trip may be represented on multiple person-trip records.
Before calculating summaries, check that the resulting row count still matches the intended analytic unit.
trip_row_count_before <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
nrow()
person_join_fields <- hts$person %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::select(person_id, hh_id)
trip_with_person <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
# Join on both shared keys so hh_id is not duplicated as hh_id.x / hh_id.y
dplyr::left_join(
person_join_fields,
by = c("hh_id", "person_id")
)
trip_row_count_after <- nrow(trip_with_person)
row_count_check <- data.frame(
metric = c("Before join", "After join", "Difference"),
n = c(
trip_row_count_before,
trip_row_count_after,
trip_row_count_after - trip_row_count_before
),
stringsAsFactors = FALSE
)
Table 42 helps confirm that the join preserved the trip-level analytic unit.
Code
gt::gt(row_count_check) %>%
gt::fmt_number(columns = n, decimals = 0, sep_mark = ",") %>%
gt::cols_label(
metric = "Metric",
n = "Rows"
)
| Metric | Rows |
|---|---|
| Before join | 411,573 |
| After join | 411,573 |
| Difference | 0 |
If the row count changes unexpectedly, review the join keys and confirm that the joined table has the intended unit of observation. A row-count check is often the quickest way to catch a mistaken join before it affects a summary table or chart.
For MassDOT, it is often worth checking both row preservation and universe preservation: after the join, confirm that the analysis file still contains only records from complete households unless the question intentionally includes incomplete households.
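The two checks described above can be bundled into one step. The sketch below is illustrative only: it uses small invented tables in place of the prepared data, base R `merge()` in place of `dplyr::left_join()` so that it runs without extra packages, and a hypothetical helper named `check_join()` that is not part of the delivered code.

```r
# Toy tables standing in for the prepared data; every value here is invented.
hh <- data.frame(hh_id = c(1, 2, 3), is_complete = c(1, 1, 0))
person <- data.frame(
  person_id = c(101, 102, 201, 301),
  hh_id = c(1, 1, 2, 3),
  age_group = c("35-44", "35-44", "65+", "18-24"),
  stringsAsFactors = FALSE
)
trip <- data.frame(
  trip_id = 1:6,
  person_id = c(101, 101, 102, 201, 301, 301),
  hh_id = c(1, 1, 1, 2, 3, 3)
)

complete_hh_ids <- hh$hh_id[hh$is_complete == 1]

# check_join() is a hypothetical helper: it left-joins `right` onto `left`
# and reports whether the row count and the household universe survived.
check_join <- function(left, right, by, universe_ids) {
  joined <- merge(left, right, by = by, all.x = TRUE)
  list(
    joined = joined,
    # Duplicated keys on the right-hand side would multiply left-hand rows.
    right_keys_unique = anyDuplicated(right[by]) == 0,
    rows_preserved = nrow(joined) == nrow(left),
    universe_preserved = all(joined$hh_id %in% universe_ids)
  )
}

# Trip-level analysis: filter to complete households, then attach person fields.
trip_complete <- trip[trip$hh_id %in% complete_hh_ids, ]
good <- check_join(trip_complete, person,
                   by = c("hh_id", "person_id"),
                   universe_ids = complete_hh_ids)

# Counter-example: starting from persons and attaching trips multiplies rows,
# and skipping the household filter lets an incomplete household through.
bad <- check_join(person, trip,
                  by = c("hh_id", "person_id"),
                  universe_ids = complete_hh_ids)

good[c("right_keys_unique", "rows_preserved", "universe_preserved")]
bad[c("right_keys_unique", "rows_preserved", "universe_preserved")]
```

In the first call all three checks pass. In the second, the person table grows from four rows to six because one person can have several trips, which is the one-to-many pattern described in the cautions above.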
11.3 Joining Trip and Vehicle Tables
When the selected DUG trip table preserves household vehicle numbering in the detailed mode fields, reshape those mode fields to long format, extract the household vehicle number, and then join to the vehicle table.
mode_value_labels <- codebook$value_labels %>%
dplyr::filter(
grepl("^mode_[0-9]+$", variable),
variable %in% trip_vehicle_mode_columns
) %>%
dplyr::transmute(
mode_num = variable,
mode_value = as.character(value),
mode_label = label
)
vehicle_trips_long <- hts$trip_unlinked %>%
dplyr::select(
hh_id,
person_id,
day_id,
trip_id,
dplyr::all_of(trip_vehicle_mode_columns)
) %>%
tidyr::pivot_longer(
cols = dplyr::all_of(trip_vehicle_mode_columns),
names_to = "mode_num",
values_to = "mode_value"
) %>%
dplyr::mutate(mode_value = as.character(mode_value)) %>%
dplyr::left_join(
mode_value_labels,
by = c("mode_num", "mode_value")
) %>%
dplyr::mutate(
vehicle_num = ifelse(
grepl("Household vehicle [0-9]+", mode_label),
as.integer(stringr::str_extract(mode_label, "[0-9]+")),
NA_integer_
)
) %>%
dplyr::left_join(
hts$vehicle %>%
dplyr::select(hh_id, vehicle_num, vehicle_id),
by = c("hh_id", "vehicle_num")
) %>%
dplyr::filter(!is.na(vehicle_id))
Table 43 shows the trip and vehicle records after the long-format reshape and join steps are complete.
Code
gt::gt(head(vehicle_trips_long)) %>%
gt::tab_header(title = "Trip-to-Vehicle Linkage Preview")
Trip-to-Vehicle Linkage Preview
| hh_id | person_id | day_id | trip_id | mode_num | mode_value | mode_label | vehicle_num | vehicle_id |
|---|---|---|---|---|---|---|---|---|
| 24000089 | 2400008901 | 240000890101 | 2400008901001 | mode_1 | 6 | Household vehicle 1 | 1 | 2400008901 |
| 24000089 | 2400008901 | 240000890101 | 2400008901002 | mode_1 | 6 | Household vehicle 1 | 1 | 2400008901 |
| 24000089 | 2400008902 | 240000890201 | 2400008902001 | mode_1 | 7 | Household vehicle 2 | 2 | 2400008902 |
| 24000089 | 2400008902 | 240000890201 | 2400008902002 | mode_1 | 7 | Household vehicle 2 | 2 | 2400008902 |
| 24000089 | 2400008902 | 240000890201 | 2400008902003 | mode_1 | 7 | Household vehicle 2 | 2 | 2400008902 |
| 24000089 | 2400008902 | 240000890201 | 2400008902004 | mode_1 | 7 | Household vehicle 2 | 2 | 2400008902 |
11.4 Recommended Practice
For most analyses:
- Start with the table that already represents the desired analytic unit.
- Join only the fields needed from other tables.
- Check row counts before and after joins.
- Match the weight to the analytic unit after the joins are complete.
- Use the codebook to confirm variable definitions and value labels.
12 Choosing the Right Analytic Unit
Section 6 describes the structure of households, persons, days, trips, linked trips, tours, and vehicles. This section shifts from structure to practice: how to choose the correct analytic unit for the question you want to answer. Many HTS analysis errors come not from the weights themselves but from starting with the wrong table. These examples assume the prepared tables are already available in hts.
Choosing the analytic unit is the first design decision in any analysis. The correct unit aligns with three things:
- Who or what is being measured (a household, a person, a person-day, a trip, a tour, or a vehicle)
- What the variable conceptually describes (a household attribute, a person characteristic, a daily behavior, a movement, or a chain of movements)
- The level at which the sampling weights represent the population
Table 44 connects each analytic unit to its best use cases, without repeating definitions already given in the Dataset Overview.
| Analytic Unit | Starting Table | Typical Use |
|---|---|---|
| Household | hh | Household characteristics and household-level summaries |
| Person | person | Demographics, employment, student status, and person-level summaries |
| Person-day | day | Daily behavior, zero-trip days, deliveries, and trip-rate denominators |
| Linked trip | trip_linked | Origin-destination, purpose, and whole-trip summaries |
| Unlinked trip | trip_unlinked | Leg-level mode detail, transfers, and segment-level summaries |
| Tour | tour | Tour-pattern analysis across linked travel chains |
| Vehicle | vehicle | Household fleet summaries and vehicle characteristics |
12.1 Household-Level Analyses
Use households as the analytic unit when the phenomenon is shared or decided collectively: income, vehicle fleet, home location, delivery behavior, household makeup, or whether a household has zero vehicles. Even if a household variable is influenced by individual people, the household is still the right level because sampling occurred at the household level.
Example question: “What is the average household income in the study area?”
12.2 Person-Level Analyses
Analyses about people, such as demographics, employment status, student status, or attitudinal questions, belong at the person level. Each person’s weight represents them in the population. Use day or trip tables only when the metric you want to measure exists at those levels.
Example question: “What percentage of people in the study area are employed?”
12.3 Day-Level Analyses
Use person-days when studying daily behavior: trip rates, telework frequency, deliveries, or analyses that depend on people who made zero trips. The day table keeps all sampled days in scope, not only days with trips.
Example question: “What is the average number of trips per person-day?”
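A minimal sketch of why the day table supplies the denominator for this question. The tables and the weight column name `day_weight` are invented for illustration, and weighting the per-day trip counts by a day weight is only one simple approach; the weighting section of this guide governs which weight to use in practice.

```r
# Toy person-days: day 22 has no trips but is still a sampled day.
day <- data.frame(
  day_id = c(11, 12, 21, 22),
  person_id = c(1, 1, 2, 2),
  day_weight = c(100, 100, 250, 250)  # hypothetical weight column
)
trip <- data.frame(
  trip_id = 1:5,
  day_id = c(11, 11, 11, 12, 21)
)

# Count trips per day from the trip table.
trip$n_trips <- 1
trip_counts <- aggregate(n_trips ~ day_id, data = trip, FUN = sum)

# Attach the counts to the day table so zero-trip days stay in scope.
day_trips <- merge(day, trip_counts, by = "day_id", all.x = TRUE)
day_trips$n_trips[is.na(day_trips$n_trips)] <- 0

unweighted_rate <- mean(day_trips$n_trips)
weighted_rate <- stats::weighted.mean(day_trips$n_trips, day_trips$day_weight)

# What the rate would look like if only days with trips were counted.
rate_trips_only <- mean(trip_counts$n_trips)

c(unweighted = unweighted_rate,
  weighted = weighted_rate,
  trips_only = rate_trips_only)
```

Starting from the trip table alone drops day 22 from the denominator and overstates the rate, which is why trip rates are built on person-days rather than on trips.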
12.4 Trip-Level Analyses
Most movement-based analyses start with trips. Linked trips are usually the better starting point for origin-destination, purpose, and whole-trip summaries. Unlinked trips are appropriate for leg-level mode detail, transfer behavior, or segment-specific metrics.
Example question: “What is the average trip distance for work trips?”
12.5 Tour-Level Analyses
Use tours when the analysis focuses on full activity patterns or concepts aligned with activity-based modeling: stop-making, work subtours, escorting, home-based versus non-home-based travel, or mode hierarchy across a chain of trips.
Example question: “What percentage of tours include a stop at a school?”
12.6 Vehicle-Level Analyses
The vehicle table is the correct unit for vehicle fleet summaries, EV prevalence, fuel-type distributions, household fleet size, or daily mileage when paired with trip data. Vehicles belong to households, so vehicle analyses usually rely on household weights.
Example question: “What is the average daily mileage for electric vehicles?”
13 Working with Variables
13.1 Categorical Response Data
The majority of data collected in an HTS are categorical variables, where respondents select from a predefined list of options. These appear as:
- Single-response categorical variables (SRCVs) where respondents select one option from a predefined list
- Multiple-response categorical variables (MRCVs) where respondents can select multiple options
- Grouped categorical variables, sometimes called “question batteries,” stored as sets of related indicator columns
- Count variables with top-coding where the highest category is open ended (e.g., “5 or more vehicles”)
General Considerations
When working with categorical response data, keep the following best practices in mind:
- Start with the codebook. Confirm variable definitions, valid values, table membership, and skip logic there before opening the questionnaire for extra context.
- Do not mistake missing for no. Only recode missing to zero when the respondent was logically not asked the question. For example, if a respondent was not asked about how long they teleworked on a given day because they are not employed, then it is appropriate to recode missing telework duration to zero. However, if a respondent was asked about telework duration but did not answer, then the missing value should be retained as missing rather than recoded to zero.
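The telework rule above can be sketched as follows. The column names `is_employed` and `telework_minutes` are hypothetical stand-ins, and missing values are shown as `NA` for simplicity; in the delivered data, confirm the actual field names and missing-value codes in the codebook first.

```r
# Toy person-days with invented values.
day <- data.frame(
  day_id = 1:5,
  is_employed = c(TRUE, TRUE, FALSE, FALSE, TRUE),  # hypothetical field
  telework_minutes = c(120, NA, NA, NA, 0)          # hypothetical field
)

# Recode missing to zero only where the question was logically not asked.
not_asked <- !day$is_employed
day$telework_minutes_clean <- ifelse(
  not_asked & is.na(day$telework_minutes),
  0,
  day$telework_minutes
)

# Day 2 was asked but did not answer, so it stays missing.
mean_careful <- mean(day$telework_minutes_clean, na.rm = TRUE)

# For comparison: blanket recoding treats nonresponse as "no telework".
telework_naive <- ifelse(is.na(day$telework_minutes), 0, day$telework_minutes)
mean_naive <- mean(telework_naive)

c(careful = mean_careful, naive = mean_naive)
```

The blanket recode lowers the mean because it counts a nonrespondent as a confirmed zero.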
Single-Response Categorical Variables (SRCVs)
Single-response categorical variables are variables where respondents select one option from a predefined list. Examples include gender, employment status, or broad income category.
For example, the household income variable income_broad can be labeled using the codebook before it is summarized.
income_value_labels <- codebook$value_labels %>%
dplyr::filter(variable == "income_broad") %>%
dplyr::transmute(
income_code = value,
income_key = as.character(value),
income_broad_label = label
)
hh_income <- hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::mutate(income_key = as.character(income_broad)) %>%
dplyr::left_join(
income_value_labels %>%
dplyr::select(income_key, income_broad_label),
by = "income_key"
) %>%
dplyr::mutate(
income_broad_label = factor(
income_broad_label,
levels = income_value_labels$income_broad_label
)
)
hh_income_counts <- hh_income %>%
dplyr::group_by(income_broad_label) %>%
dplyr::summarize(n = dplyr::n(), .groups = "drop")
After joining the value labels, group the labeled variable to produce the counts used in the final table. Table 45 shows the resulting counts of households by broad income category.
Code
gt::gt(hh_income_counts) %>%
gt::fmt_number(columns = n, decimals = 0, sep_mark = ",") %>%
gt::cols_label(
income_broad_label = "Household Income",
n = "Households"
) %>%
gt::tab_header(title = "Household Counts by Income Category")
Household Counts by Income Category
| Household Income | Households |
|---|---|
| Under $25,000 | 1,855 |
| $25,000-$49,999 | 1,892 |
| $50,000-$74,999 | 1,963 |
| $75,000-$99,999 | 2,028 |
| $100,000-$199,999 | 4,210 |
| $200,000 or more | 2,170 |
| Prefer not to answer | 1,523 |
Multiple-Response Categorical Variables (MRCVs)
Multiple-response variables are often delivered as groups of checkbox columns. When checkbox-style variables are present, reshaping them to long format is usually the clearest way to count selections and label the results.
delivery_checkbox_regex <- "^delivery_"
delivery_variable_list <- codebook$variable_list %>%
dplyr::filter(
day == 1,
is_checkbox == 1,
stringr::str_detect(variable, delivery_checkbox_regex)
) %>%
dplyr::select(variable, description)
delivery_none_of_above <- "delivery_996"
delivery_variables <- delivery_variable_list %>%
dplyr::filter(variable != delivery_none_of_above) %>%
dplyr::pull(variable)
delivery_descriptions <- delivery_variable_list %>%
dplyr::filter(variable %in% delivery_variables)
delivery_long <- hts$day %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::select(day_id, dplyr::all_of(delivery_variables)) %>%
tidyr::pivot_longer(
cols = dplyr::all_of(delivery_variables),
names_to = "variable",
values_to = "selected"
) %>%
dplyr::filter(selected == 1) %>%
dplyr::left_join(
delivery_descriptions,
by = "variable"
) %>%
dplyr::count(description, name = "n_days") %>%
dplyr::arrange(dplyr::desc(n_days))
In MassDOT, these delivery questions are stored on the day table as checkbox columns named delivery_*. This example excludes delivery_996, which is the codebook field for “None of the above,” so the table focuses on positive delivery types.
Table 46 makes it easier to review how often each delivery type was selected across reported person-days.
Code
gt::gt(delivery_long) %>%
gt::fmt_number(columns = n_days, decimals = 0, sep_mark = ",") %>%
gt::cols_label(
description = "Delivery Variable",
n_days = "Days"
) %>%
gt::tab_header(title = "Delivery Checkboxes Selected Across Days")
Delivery Checkboxes Selected Across Days
| Delivery Variable | Days |
|---|---|
| Type of delivery: Received packages at home (e.g., USPS, FedEx, UPS) | 20,759 |
| Type of delivery: Take-out/prepared food delivered to home | 3,351 |
| Type of delivery: Someone came to do work at home (e.g., babysitter, housecleaning, lawn) | 1,995 |
| Type of delivery: Groceries delivered to home | 1,588 |
| Type of delivery: Received packages at another location (e.g., Amazon Locker, package pick-up point) | 1,012 |
| Type of delivery: Other item delivered to home (e.g., appliance) | 316 |
| Type of delivery: Received personal packages at work | 248 |
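Because a respondent can tick several boxes on the same day, selection counts from a multiple-response variable are not mutually exclusive and should not be summed to get "days with any delivery." The sketch below illustrates this with invented data; `delivery_1` and `delivery_2` are hypothetical column names, and only `delivery_996` is taken from this guide.

```r
# Toy day table with checkbox columns (1 = selected, 0 = not selected).
day <- data.frame(
  day_id = 1:5,
  delivery_1 = c(1, 1, 0, 0, 1),    # hypothetical option column
  delivery_2 = c(1, 0, 0, 0, 0),    # hypothetical option column
  delivery_996 = c(0, 0, 1, 1, 0)   # "None of the above"
)

option_cols <- c("delivery_1", "delivery_2")
n_days <- nrow(day)

# Share of days on which each option was selected; the denominator is days.
selection_counts <- colSums(day[option_cols] == 1, na.rm = TRUE)
delivery_shares <- data.frame(
  variable = names(selection_counts),
  n_days_selected = as.numeric(selection_counts),
  share_of_days = as.numeric(selection_counts) / n_days
)

# Days with at least one delivery are counted once, however many boxes are ticked.
n_days_any_delivery <- sum(rowSums(day[option_cols] == 1, na.rm = TRUE) > 0)

delivery_shares
n_days_any_delivery
```

Here the two options are selected four times in total, but only three days have any delivery, because day 1 ticked both boxes.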
Missing Categorical Data
Missing categorical data should be handled explicitly rather than silently dropped. The codebook labels are often the quickest way to confirm whether a value represents a valid response category, an inapplicable record, or a nonresponse code.
gender_value_labels <- codebook$value_labels %>%
dplyr::filter(variable == "gender") %>%
dplyr::transmute(
gender_key = as.character(value),
label
)
gender_counts <- hts$person %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::count(gender, name = "n") %>%
dplyr::mutate(gender_key = as.character(gender)) %>%
dplyr::left_join(
gender_value_labels,
by = "gender_key"
) %>%
dplyr::select(-gender_key) %>%
dplyr::arrange(gender)
Table 47 helps distinguish valid response codes from missing or special-case values.
Code
gt::gt(gender_counts) %>%
gt::fmt_number(columns = n, decimals = 0, sep_mark = ",") %>%
gt::cols_label(
gender = "Gender Code",
n = "Persons",
label = "Gender Label"
) %>%
gt::tab_header(title = "Gender Counts with Codebook Labels")
Gender Counts with Codebook Labels
| Gender Code | Persons | Gender Label |
|---|---|---|
| 1 | 15,253 | Female |
| 2 | 13,165 | Male |
| 4 | 273 | Non-binary |
| 995 | 1,552 | Missing Response |
| 997 | 71 | Other/prefer to self-describe |
| 999 | 941 | Prefer not to answer |
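How the special codes are treated changes the reported shares. The sketch below re-enters the unweighted counts from Table 47 and computes shares two ways. Treating codes 995 and 999 as nonresponse, and keeping 997 as a valid response, is an analyst assumption made here for illustration; confirm the intended handling in the codebook.

```r
# Unweighted counts copied from Table 47.
gender_counts <- data.frame(
  gender = c(1, 2, 4, 995, 997, 999),
  n = c(15253, 13165, 273, 1552, 71, 941),
  label = c("Female", "Male", "Non-binary", "Missing Response",
            "Other/prefer to self-describe", "Prefer not to answer"),
  stringsAsFactors = FALSE
)

# Assumption for this sketch: these two codes are nonresponse.
nonresponse_codes <- c(995, 999)
gender_counts$is_nonresponse <- gender_counts$gender %in% nonresponse_codes

# Option 1: keep nonresponse as its own visible category.
gender_counts$share_all <- gender_counts$n / sum(gender_counts$n)

# Option 2: report shares among valid responses only, and say so in the table note.
valid_total <- sum(gender_counts$n[!gender_counts$is_nonresponse])
gender_counts$share_valid <- ifelse(
  gender_counts$is_nonresponse,
  NA_real_,
  gender_counts$n / valid_total
)

gender_counts[c("label", "n", "share_all", "share_valid")]
```

Either presentation can be defensible; the point is that the choice is made explicitly and the denominator is stated, rather than letting a filter drop the special codes silently.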
Count Variables with Top-Coding
Variables such as num_vehicles are often stored as integer-coded categories rather than unconstrained numeric counts. Treat them as categories unless the codebook clearly indicates that they are true numeric measures.
vehicle_count_labels <- codebook$value_labels %>%
dplyr::filter(variable == "num_vehicles") %>%
dplyr::transmute(
vehicle_count_code = value,
vehicle_count_key = as.character(value),
num_vehicles_label = label
)
vehicle_count_distribution <- hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::transmute(vehicle_count_code = num_vehicles) %>%
dplyr::count(vehicle_count_code, name = "n_households") %>%
dplyr::mutate(vehicle_count_key = as.character(vehicle_count_code)) %>%
dplyr::left_join(
vehicle_count_labels %>%
dplyr::select(vehicle_count_key, num_vehicles_label),
by = "vehicle_count_key"
) %>%
dplyr::select(-vehicle_count_key) %>%
dplyr::arrange(vehicle_count_code)
Table 48 treats the coded vehicle-count variable as a set of labeled categories rather than a continuous measure.
Code
gt::gt(vehicle_count_distribution) %>%
gt::fmt_number(columns = n_households, decimals = 0, sep_mark = ",") %>%
gt::cols_label(
vehicle_count_code = "Vehicles",
n_households = "Households",
num_vehicles_label = "Vehicle Count Label"
) %>%
gt::tab_header(title = "Household Vehicle Count Distribution")
Household Vehicle Count Distribution
| Vehicles | Households | Vehicle Count Label |
|---|---|---|
| 0 | 2,161 | 0 (no vehicles in my household) |
| 1 | 7,122 | 1 vehicle |
| 2 | 4,925 | 2 vehicles |
| 3 | 1,066 | 3 vehicles |
| 4 | 274 | 4 vehicles |
| 5 | 68 | 5 vehicles |
| 6 | 16 | 6 vehicles |
| 7 | 4 | 7 vehicles |
| 8 | 5 | 8 or more vehicles |
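One practical consequence of top-coding: a mean computed directly from the codes is only a lower bound, because "8 or more" is counted as exactly 8. The sketch below re-enters the unweighted counts from Table 48 to show a reporting-friendly grouping and the lower-bound calculation. The "3 or more" grouping is an illustrative choice, not a convention defined by this guide.

```r
# Unweighted household counts copied from Table 48.
vehicle_count_distribution <- data.frame(
  vehicle_count_code = 0:8,
  n_households = c(2161, 7122, 4925, 1066, 274, 68, 16, 4, 5)
)
top_code <- 8  # "8 or more vehicles"

# Collapse sparse upper categories into a group chosen for reporting.
vehicle_count_distribution$vehicle_group <- cut(
  vehicle_count_distribution$vehicle_count_code,
  breaks = c(-Inf, 0, 1, 2, Inf),
  labels = c("0", "1", "2", "3 or more")
)
grouped <- aggregate(
  n_households ~ vehicle_group,
  data = vehicle_count_distribution,
  FUN = sum
)

# If a mean is needed, label it as a lower bound: top-coded households
# contribute exactly `top_code` vehicles even though some own more.
lower_bound_mean <- with(
  vehicle_count_distribution,
  sum(vehicle_count_code * n_households) / sum(n_households)
)
n_top_coded <- with(
  vehicle_count_distribution,
  n_households[vehicle_count_code == top_code]
)

grouped
lower_bound_mean
n_top_coded
```

With only a handful of households in the top category, the bias in the mean is small here, but the same calculation on a variable with a heavily populated top code could be badly understated.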
13.2 Numeric Variables
The HTS dataset contains several numeric variables, such as trip distances, durations, and speeds. These variables can include extreme values or outliers that affect analysis results, so it is good practice to inspect definitions and distributions before calculating summaries.
For most analytic examples in this chapter, the code filters to complete households first. If your goal is delivered-data quality assurance rather than population analysis, you may choose a broader universe.
Consult the Codebook First
Before analyzing any numeric variable, verify its meaning and units from the codebook. Many errors stem from assuming what a variable represents rather than confirming it.
distance_variables <- codebook$variable_list %>%
dplyr::filter(
stringr::str_detect(variable, "distance"),
data_type %in% c("integer", "numeric")
) %>%
dplyr::select(variable, description)
Review Table 49 before choosing which distance field to summarize.
Code
gt::gt(distance_variables) %>%
gt::tab_header(title = "Distance Variables in the Codebook")
Distance Variables in the Codebook
| variable | description |
|---|---|
| distance_meters | Distance (meters) |
| distance_beeline | Beeline distance (meters) |
| distance_miles | Distance (miles) |
| home_distance | Trip distance from home (miles) |
Inspect the Data
Before calculating any metric:
- check for missing values
- generate a quick summary()
- inspect the distribution with a histogram or boxplot
- review the minimum and maximum for plausibility
Start with a direct summary of the trip-distance field to understand its central tendency and range.
summary(
hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::pull(distance_miles)
)
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
#>     0.000     0.680     2.225     8.577     6.004 12348.275     11747
Then visualize the distribution in Figure 10 so you can see the shape of the variable and the upper tail.
Code
trip_distance_plot_data <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::filter(!is.na(distance_miles), distance_miles > 0)
trip_distance_histogram <- ggplot2::ggplot(
trip_distance_plot_data,
ggplot2::aes(x = distance_miles)
) +
ggplot2::geom_histogram(bins = 40, fill = "#17384e", color = "white") +
ggplot2::scale_x_log10(
labels = scales::label_number(big.mark = ",")
) +
ggplot2::labs(
title = "Distribution of Trip Distance",
subtitle = "Trip-distance histogram with a log-scaled distance axis",
x = "Distance (miles, log scale)",
y = "Trips"
) +
ggplot2::theme_minimal(base_size = 12) +
ggplot2::theme(
plot.title = ggplot2::element_text(face = "bold"),
panel.grid.minor = ggplot2::element_blank()
)
trip_distance_histogram
Handle Outliers
Outlier handling depends on the study context and the variable being analyzed. In many cases, the first step is simply to identify the upper tail before deciding whether trimming, filtering, or a different summary statistic is appropriate.
trip_distance_quantiles <- stats::quantile(
hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::pull(distance_miles),
probs = c(0.5, 0.9, 0.95, 0.99),
na.rm = TRUE
)
trip_distance_summary <- data.frame(
statistic = names(trip_distance_quantiles),
miles = as.numeric(trip_distance_quantiles),
stringsAsFactors = FALSE
)
long_trips <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
# Select the 99th percentile by name so the filter survives changes to `probs`
dplyr::filter(distance_miles > trip_distance_quantiles[["99%"]]) %>%
dplyr::transmute(
trip_id,
household_id = hh_id,
person_id,
distance_miles,
duration_minutes
)
Start by reviewing Table 50 to locate the upper tail of the distance distribution.
Code
gt::gt(trip_distance_summary) %>%
gt::fmt_number(columns = miles, decimals = 2) %>%
gt::cols_label(
statistic = "Statistic",
miles = "Miles"
) %>%
gt::tab_header(title = "Trip Distance Quantiles")
Trip Distance Quantiles
| Statistic | Miles |
|---|---|
| 50% | 2.22 |
| 90% | 15.01 |
| 95% | 24.69 |
| 99% | 69.37 |
Then inspect Table 51 for a small sample of trips above the 99th percentile to see which records deserve follow-up.
Code
gt::gt(dplyr::slice_head(long_trips, n = 10)) %>%
gt::fmt_number(
columns = c(distance_miles, duration_minutes),
decimals = 2
) %>%
gt::tab_header(title = "Trips Above the 99th Percentile of Distance")
Trips Above the 99th Percentile of Distance
| trip_id | household_id | person_id | distance_miles | duration_minutes |
|---|---|---|---|---|
| 2400322201024 | 24003222 | 2400322201 | 87.77 | 216.00 |
| 2400322201026 | 24003222 | 2400322201 | 90.21 | 81.00 |
| 2400322202009 | 24003222 | 2400322202 | 90.21 | 81.00 |
| 2400346501001 | 24003465 | 2400346501 | 106.80 | 140.00 |
| 2400361001075 | 24003610 | 2400361001 | 73.16 | 79.00 |
| 2400453801032 | 24004538 | 2400453801 | 74.66 | 89.00 |
| 2400478201047 | 24004782 | 2400478201 | 1,189.56 | 243.00 |
| 2400478201048 | 24004782 | 2400478201 | 84.40 | 83.00 |
| 2400478201053 | 24004782 | 2400478201 | 1,129.97 | 0.00 |
| 2400478202036 | 24004782 | 2400478202 | 1,189.56 | 243.00 |
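Once the upper tail has been located, it helps to see how much each handling choice moves the summary. The sketch below compares four options on a small invented distance vector. The 300-mile plausibility cap is purely hypothetical; an appropriate threshold depends on the study area and the question.

```r
# Toy distances in miles; the last value mimics an implausible record.
distance_miles <- c(0.4, 0.7, 1.1, 2.2, 3.5, 6.0, 15.0, 24.7, 69.4, 1189.6)

plausibility_cap <- 300  # hypothetical threshold, chosen for illustration only

# Option A: keep everything and report the mean.
mean_all <- mean(distance_miles)

# Option B: report the median, which is insensitive to the tail.
median_all <- stats::median(distance_miles)

# Option C: filter out records above the cap.
mean_filtered <- mean(distance_miles[distance_miles <= plausibility_cap])

# Option D: cap extreme values at the threshold instead of dropping them.
mean_capped <- mean(pmin(distance_miles, plausibility_cap))

outlier_comparison <- data.frame(
  approach = c("Mean, all records", "Median, all records",
               "Mean, filtered at cap", "Mean, capped at cap"),
  miles = c(mean_all, median_all, mean_filtered, mean_capped),
  stringsAsFactors = FALSE
)
outlier_comparison
```

A single extreme record dominates the unfiltered mean, while the median barely notices it. Whichever approach is used, reporting the rule alongside the result keeps the estimate reproducible.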
Example: Trip Speeds by Mode
Speed summaries are a good example of why both labels and outlier checks matter.
mode_value_labels <- codebook$value_labels %>%
dplyr::filter(variable == "mode") %>%
dplyr::transmute(
mode_code = value,
mode_key = as.character(value),
mode_label = label
)
speed_by_mode <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::filter(!is.na(speed_mph), speed_mph >= 0) %>%
dplyr::mutate(mode_key = as.character(mode_type)) %>%
dplyr::left_join(
mode_value_labels %>%
dplyr::select(mode_key, mode_label),
by = "mode_key"
) %>%
dplyr::group_by(mode_label) %>%
dplyr::summarize(
mean_speed_mph = mean(speed_mph, na.rm = TRUE),
median_speed_mph = stats::median(speed_mph, na.rm = TRUE),
n_trips = dplyr::n(),
.groups = "drop"
) %>%
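dplyr::arrange(dplyr::desc(mean_speed_mph))
Table 52 combines codebook labels with simple speed summaries so that mode-level differences are easy to review. The means are dominated by a small number of implausible records (for example, a mean of 92,224.8 mph against a median of 260.0 mph for Long Distance Passenger), so the medians are the more reliable comparison until outliers have been handled.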
Code
gt::gt(speed_by_mode) %>%
gt::fmt_number(
columns = c(mean_speed_mph, median_speed_mph),
decimals = 1
) %>%
gt::fmt_number(columns = n_trips, decimals = 0, sep_mark = ",") %>%
gt::cols_label(
mode_label = "Mode",
mean_speed_mph = "Mean Speed (mph)",
median_speed_mph = "Median Speed (mph)",
n_trips = "Trips"
) %>%
gt::tab_header(title = "Trip Speeds by Mode")
Trip Speeds by Mode
| Mode | Mean Speed (mph) | Median Speed (mph) | Trips |
|---|---|---|---|
| Long Distance Passenger | 92,224.8 | 260.0 | 926 |
| Shuttle / Vanpool | 2,335.7 | 9.6 | 939 |
| Car share | 1,325.1 | 13.5 | 372 |
| Missing Response | 694.8 | 11.9 | 15,716 |
| Ferry | 649.5 | 12.3 | 265 |
| Walk | 385.4 | 2.5 | 84,494 |
| Other | 175.1 | 9.7 | 2,739 |
| Car | 130.8 | 19.7 | 264,558 |
| TNC | 111.9 | 13.5 | 3,337 |
| Bike share | 101.6 | 6.1 | 785 |
| Transit | 97.0 | 8.9 | 17,443 |
| Bike | 66.2 | 8.4 | 6,602 |
| Taxi | 40.7 | 12.6 | 325 |
| School bus | 11.6 | 7.9 | 1,319 |
| Scooter share | 6.2 | 6.2 | 6 |
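Extreme mean speeds like those in Table 52 usually come from records with very short or zero durations. The sketch below recomputes speed from distance and duration on invented trips and flags implausible values before summarizing. The per-mode speed ceilings are hypothetical and chosen only for illustration; they are not thresholds defined by this guide or by the delivered `speed_flag` field.

```r
# Toy trips; one walk trip has a suspiciously short duration and one car
# trip has a zero duration, which produces an infinite speed.
trips <- data.frame(
  trip_id = 1:6,
  mode_label = c("Walk", "Walk", "Walk", "Car", "Car", "Car"),
  distance_miles = c(0.5, 0.25, 30, 10, 20, 1129.97),
  duration_seconds = c(600, 300, 60, 1200, 1800, 0),
  stringsAsFactors = FALSE
)

# Speed in mph = miles / hours.
trips$speed_mph <- trips$distance_miles / (trips$duration_seconds / 3600)

# Hypothetical plausibility ceilings by mode (assumptions for this sketch).
speed_ceiling_mph <- c(Walk = 15, Car = 100)

trips$implausible <- !is.finite(trips$speed_mph) |
  trips$speed_mph > speed_ceiling_mph[as.character(trips$mode_label)]

# Summarize only the plausible records, and keep the excluded count visible.
plausible <- trips[!trips$implausible, ]
mean_speed <- tapply(plausible$speed_mph, plausible$mode_label, mean)
n_excluded <- tapply(trips$implausible, trips$mode_label, sum)

mean_speed
n_excluded
```

Reporting the number of excluded records next to each mean makes it clear how much of the sample the plausibility rule removed.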
13.3 Date and Time Data
Trip departure and arrival times are stored as a set of split date and time components rather than a single datetime object. This structure avoids timezone conversion issues when importing data across different software environments and makes it straightforward to work with time-of-day components (e.g., filtering by hour) without parsing a full timestamp.
Time Variables on the Trip Table
The following fields together define when each trip departed and arrived:
| Variable | Type | Description |
|---|---|---|
| travel_date | Date (YYYY-MM-DD) | Diary-day date carried on the trip record. |
| depart_date | Date (YYYY-MM-DD) | Calendar date of departure. |
| depart_hour | Integer (0-23) | Hour of departure in 24-hour time, local to the study area. |
| depart_minute | Integer (0-59) | Minute of departure. |
| depart_seconds | Integer (0-59) | Second of departure. |
| arrive_date | Date (YYYY-MM-DD) | Calendar date of arrival. Will differ from depart_date for trips crossing midnight. |
| arrive_hour | Integer (0-23) | Hour of arrival in 24-hour time. |
| arrive_minute | Integer (0-59) | Minute of arrival. |
| arrive_second | Integer (0-59) | Second of arrival. |
Timezone. When reconstructing timestamps in this guide, use the study timezone from settings.yml, which is America/New_York for this MassDOT delivery.
Travel day boundary. In the prepared MassDOT trip table, trips departing before 3:00 AM are attached to the previous diary day. For those records, travel_date reflects the diary day while depart_date reflects the wall-clock calendar date.
Cross-midnight trips. For trips that depart before midnight and arrive after midnight, arrive_date will be one calendar day later than depart_date. When reconstructing durations, always use the full datetime (date + time) rather than differencing hours alone.
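The travel-day boundary can be illustrated with a few invented departure times. This is a sketch of the rule as described above, not the code used to prepare the delivered `travel_date` field. It shifts each departure back by three hours of elapsed time before taking the calendar date, which is a simplification that could misassign trips in the hours around a daylight saving time change.

```r
study_timezone <- "America/New_York"

# Toy departures: late evening, after midnight, and exactly at the boundary.
depart_datetime <- as.POSIXct(
  c("2024-06-11 23:50:00", "2024-06-12 01:30:00", "2024-06-12 03:00:00"),
  tz = study_timezone
)

# Wall-clock calendar date of departure.
depart_date <- as.Date(format(depart_datetime, "%Y-%m-%d"))

# Diary day under a 3:00 AM boundary: departures before 3:00 AM fall on the
# previous diary day; a departure at exactly 3:00 AM starts the new one.
diary_date <- as.Date(format(depart_datetime - 3 * 3600, "%Y-%m-%d"))

data.frame(depart_datetime, depart_date, diary_date)
```

The 1:30 AM departure has a calendar date of June 12 but a diary day of June 11, which is the situation in which travel_date and depart_date disagree.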
Reconstructing Timestamps in R
When you need a full POSIXct timestamp, for example to calculate duration, filter by time window, or plot a time series, recombine the split fields explicitly using lubridate:
trip_with_datetime <- hts$trip_unlinked %>%
dplyr::mutate(
depart_datetime = lubridate::ymd_hms(
paste(
depart_date,
sprintf("%02d:%02d:%02d", depart_hour, depart_minute, depart_seconds)
),
tz = study_timezone
),
arrive_datetime = lubridate::ymd_hms(
paste(
arrive_date,
sprintf("%02d:%02d:%02d", arrive_hour, arrive_minute, arrive_second)
),
tz = study_timezone
)
)
The preview below shows the reconstructed departure and arrival timestamps for the first few trip records.
gt::gt(
trip_with_datetime %>%
dplyr::select(trip_id, depart_datetime, arrive_datetime) %>%
utils::head()
) %>%
gt::tab_header(title = "Reconstructed Trip Timestamps")
Common Pitfalls
- Using depart_hour alone for peak-period analysis is fine for most purposes, but for trips near the hour boundary (e.g., 7:58 AM), use the full datetime if precision matters.
- Differencing arrive_hour - depart_hour will produce incorrect (negative) durations for cross-midnight trips. Always difference full datetimes or use the pre-calculated duration_minutes field instead.
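The cross-midnight pitfall is easy to reproduce. The following base R sketch uses a single made-up trip and a hypothetical helper, build_datetime(), to contrast hour differencing with full-datetime differencing. The timezone default mirrors the study timezone named above.

```r
# Hypothetical trip that departs at 23:50 and arrives at 00:20 the next day.
toy_trip <- data.frame(
  depart_date = "2024-06-11", depart_hour = 23L, depart_minute = 50L, depart_seconds = 0L,
  arrive_date = "2024-06-12", arrive_hour = 0L, arrive_minute = 20L, arrive_second = 0L,
  stringsAsFactors = FALSE
)

# Naive approach: differencing the hour fields gives a negative value.
naive_hours <- toy_trip$arrive_hour - toy_trip$depart_hour

# Hypothetical helper: combine the split fields into a POSIXct timestamp.
build_datetime <- function(date, hour, minute, second, tz = "America/New_York") {
  as.POSIXct(
    paste(date, sprintf("%02d:%02d:%02d", hour, minute, second)),
    format = "%Y-%m-%d %H:%M:%S",
    tz = tz
  )
}

depart_dt <- build_datetime(
  toy_trip$depart_date, toy_trip$depart_hour,
  toy_trip$depart_minute, toy_trip$depart_seconds
)
arrive_dt <- build_datetime(
  toy_trip$arrive_date, toy_trip$arrive_hour,
  toy_trip$arrive_minute, toy_trip$arrive_second
)

# Differencing full datetimes gives the true 30-minute duration.
duration_min <- as.numeric(difftime(arrive_dt, depart_dt, units = "mins"))
```

Here naive_hours is -23 while duration_min is 30, which is why the date fields have to be part of any duration calculation.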
14 Weights and Inference
14.1 Getting Started
Weights exist so that the sample can represent the target population. In household travel surveys, analysts usually work with household, person, day, trip, and sometimes tour weights. The correct weight depends on the final analytic unit, not just the first table that was opened.
For analyses that compare travel behavior across specific weekdays, use the day-of-week workflow described in Section 16. The standard household, person, day, and trip weights remain the default for overall average-day estimates.
For MassDOT, most weighted estimates should also be restricted to complete households. The examples below define that universe from hts$hh$is_complete and then carry it into lower-level files through hh_id.
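Later examples in this chapter filter on an object named complete_hh_ids, whose definition is not shown in this section. One plausible definition, offered as an assumption rather than the handbook's confirmed setup code, is sketched below using a toy stand-in for the hts list (hts_toy is hypothetical).

```r
# Toy stand-in for the prepared data list; real tables have many more columns.
hts_toy <- list(
  hh = data.frame(
    hh_id = c(101, 102, 103),
    is_complete = c(1, 0, 1)
  ),
  person = data.frame(
    person_id = c(10101, 10201, 10301, 10302),
    hh_id = c(101, 102, 103, 103)
  )
)

# Define the complete-household universe once, at the household level.
# which() drops any households whose is_complete value is missing.
complete_hh_ids <- hts_toy$hh$hh_id[which(hts_toy$hh$is_complete == 1)]

# Carry the universe into a lower-level table through hh_id.
person_complete <- hts_toy$person[hts_toy$person$hh_id %in% complete_hh_ids, ]
```

Defining the universe once and reusing it keeps household, person, day, and trip estimates on the same set of households.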
Choosing the Right Weight
Begin by matching each analytic unit to the starting table and weight column used in the prepared data.
weight_lookup <- data.frame(
analytic_unit = character(),
starting_table = character(),
weight_variable = character(),
stringsAsFactors = FALSE
)
weight_lookup <- rbind(
weight_lookup,
data.frame(
analytic_unit = "Household",
starting_table = "hh",
weight_variable = "hh_weight"
)
)
weight_lookup <- rbind(
weight_lookup,
data.frame(
analytic_unit = "Person",
starting_table = "person",
weight_variable = "person_weight"
)
)
weight_lookup <- rbind(
weight_lookup,
data.frame(
analytic_unit = "Person-day",
starting_table = "day",
weight_variable = "day_weight"
)
)
if ("trip_linked" %in% names(hts)) {
weight_lookup <- rbind(
weight_lookup,
data.frame(
analytic_unit = "Linked trip",
starting_table = "trip_linked",
weight_variable = "trip_weight"
)
)
}
if ("trip_unlinked" %in% names(hts)) {
weight_lookup <- rbind(
weight_lookup,
data.frame(
analytic_unit = "Unlinked trip",
starting_table = "trip_unlinked",
weight_variable = "trip_weight"
)
)
}
if ("tour" %in% names(hts) && "tour_weight" %in% names(hts$tour)) {
weight_lookup <- rbind(
weight_lookup,
data.frame(
analytic_unit = "Tour",
starting_table = "tour",
weight_variable = "tour_weight"
)
)
}
Use Table 54 to confirm the correct weight before building any weighted estimate.
gt::gt(weight_lookup) %>%
gt::cols_label(
analytic_unit = "Analytic Unit",
starting_table = "Starting Table",
weight_variable = "Weight Variable"
) %>%
gt::tab_header(title = "Recommended Weight by Analytic Unit")
Recommended Weight by Analytic Unit
| Analytic Unit | Starting Table | Weight Variable |
|---|---|---|
| Household | hh | hh_weight |
| Person | person | person_weight |
| Person-day | day | day_weight |
| Linked trip | trip_linked | trip_weight |
| Unlinked trip | trip_unlinked | trip_weight |
| Tour | tour | tour_weight |
Calculating Simple Weighted Estimates
Before moving to design-aware inference, it is useful to confirm the weighted numerator and denominator directly.
zero_vehicle_share <- hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::summarize(
weighted_zero_vehicle_households = sum(
hh_weight * (num_vehicles == 0),
na.rm = TRUE
),
weighted_households = sum(hh_weight, na.rm = TRUE),
share_zero_vehicle = weighted_zero_vehicle_households / weighted_households
)
Table 55 shows the numerator, denominator, and resulting share in one table.
gt::gt(zero_vehicle_share) %>%
gt::fmt_number(
columns = c(weighted_zero_vehicle_households, weighted_households),
decimals = 0,
sep_mark = ","
) %>%
gt::fmt_percent(columns = share_zero_vehicle, decimals = 1) %>%
gt::tab_header(title = "Simple Weighted Share of Zero-Vehicle Households")
Simple Weighted Share of Zero-Vehicle Households
| weighted_zero_vehicle_households | weighted_households | share_zero_vehicle |
|---|---|---|
| 325,071 | 2,814,595 | 11.5% |
14.2 Survey-Aware Methods for Inference
Simple weighted proportions are often enough for descriptive summaries, but they are not enough when you need valid standard errors, confidence intervals, or hypothesis tests. Those cases require a survey design object that respects clustering, stratification, and weights.
When Do You Need Survey-Aware Methods?
Use survey-aware methods when:
- reporting standard errors or confidence intervals
- comparing estimates across groups
- fitting regression models
- working with small subgroups where design effects matter
- estimating totals, means, or proportions for publication or external reporting
Specifying the Survey Design
In the Massachusetts Travel Study, the household is the primary sampling unit (PSU; see Section 2.1.2). Even when analyzing person-, day-, or trip-level records, observations remain clustered within households.
The examples below use:
- hh_id as the PSU
- the weight that matches the analytic unit
- sample_segment as the design strata
Start by defining a household-level survey design object with the fields needed for the analysis.
hh_design <- hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::transmute(
hh_id,
sample_segment,
analysis_weight = hh_weight,
vehicle_count = num_vehicles
) %>%
dplyr::filter(
!is.na(sample_segment),
!is.na(analysis_weight),
analysis_weight > 0
) %>%
srvyr::as_survey_design(
ids = hh_id,
strata = sample_segment,
weights = analysis_weight
)
For trip- or day-level analysis, join the design fields needed for clustering, stratification, and weights before building the survey object.
trip_design <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::left_join(
hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::select(hh_id, sample_segment),
by = "hh_id"
) %>%
dplyr::mutate(analysis_weight = trip_weight) %>%
dplyr::filter(
!is.na(sample_segment),
!is.na(analysis_weight),
analysis_weight > 0
) %>%
srvyr::as_survey_design(
ids = hh_id,
strata = sample_segment,
weights = analysis_weight
)
Set the lonely-PSU handling before running summaries that request standard errors or confidence intervals.
options(
survey.lonely.psu = "adjust",
srvyr.lonely.psu = "adjust"
)
Using the Survey Design for Weighted Estimates
Once the design object is defined, use survey_mean(), survey_total(), or survey_prop() instead of manually calculating standard errors.
hh_vehicle_summary <- hh_design %>%
dplyr::group_by(vehicle_count) %>%
dplyr::summarize(
weighted_households = srvyr::survey_total(vartype = c("se", "ci")),
weighted_share = srvyr::survey_prop(vartype = c("se", "ci")),
.groups = "drop"
)
Table 56 shows the weighted totals and shares with their uncertainty measures.
gt::gt(hh_vehicle_summary) %>%
gt::fmt_number(
columns = c(weighted_households, weighted_households_se, weighted_households_low, weighted_households_upp),
decimals = 1,
sep_mark = ","
) %>%
gt::fmt_percent(
columns = c(weighted_share, weighted_share_se, weighted_share_low, weighted_share_upp),
decimals = 1
) %>%
gt::tab_header(title = "Weighted Household Vehicle Summary")
Weighted Household Vehicle Summary
| vehicle_count | weighted_households | weighted_households_se | weighted_households_low | weighted_households_upp | weighted_share | weighted_share_se | weighted_share_low | weighted_share_upp |
|---|---|---|---|---|---|---|---|---|
| 0 | 325,070.5 | 10,284.6 | 304,911.6 | 345,229.5 | 11.5% | 0.4% | 10.9% | 12.3% |
| 1 | 1,096,480.0 | 17,548.7 | 1,062,082.5 | 1,130,877.4 | 39.0% | 0.6% | 37.8% | 40.1% |
| 2 | 1,002,864.0 | 20,037.9 | 963,587.3 | 1,042,140.6 | 35.6% | 0.6% | 34.5% | 36.8% |
| 3 | 275,203.7 | 12,361.9 | 250,972.9 | 299,434.5 | 9.8% | 0.4% | 9.0% | 10.6% |
| 4 | 84,670.2 | 7,340.9 | 70,281.2 | 99,059.3 | 3.0% | 0.3% | 2.5% | 3.6% |
| 5 | 22,895.8 | 3,708.6 | 15,626.5 | 30,165.1 | 0.8% | 0.1% | 0.6% | 1.1% |
| 6 | 5,609.9 | 2,114.4 | 1,465.5 | 9,754.3 | 0.2% | 0.1% | 0.1% | 0.4% |
| 7 | 1,166.8 | 848.5 | −496.3 | 2,829.8 | 0.0% | 0.0% | 0.0% | 0.2% |
| 8 | 634.5 | 343.3 | −38.5 | 1,307.5 | 0.0% | 0.0% | 0.0% | 0.1% |
Filtering Data vs. Filtering the Survey Design
Filtering records before defining the survey design is not always the same as defining the survey design first and then subsetting it. The difference matters most when a subgroup should remain part of the original design rather than being treated as a new standalone sample.
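One way to see why the order matters is to count PSUs per stratum before and after a record-level filter. The base R sketch below uses made-up person records; toy_persons, is_adult, and count_psus() are hypothetical names. It does not compute variances; it only shows how pre-filtering can change the design that the variance estimator sees.

```r
# Hypothetical person records: two strata with two household PSUs each.
toy_persons <- data.frame(
  hh_id = c(1, 1, 2, 3, 3, 4),
  sample_segment = c("A", "A", "A", "B", "B", "B"),
  is_adult = c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Count distinct household PSUs within each stratum.
count_psus <- function(df) {
  vapply(
    split(df$hh_id, df$sample_segment),
    function(x) length(unique(x)),
    integer(1)
  )
}

psus_full_design <- count_psus(toy_persons)
psus_after_prefilter <- count_psus(toy_persons[toy_persons$is_adult, ])

print(psus_full_design)      # stratum A: 2 PSUs, stratum B: 2 PSUs
print(psus_after_prefilter)  # stratum A: 1 PSU,  stratum B: 2 PSUs
```

In this toy case, filtering to adults first leaves stratum A with a single PSU, so the design used for variance estimation no longer matches the one that was sampled. Building the design on all records and then subsetting the survey object (srvyr provides a filter() method for survey designs) keeps the full PSU and stratum structure available.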
For example, an adult-only trip analysis should define adulthood from the person table and then carry that flag into the trip-level survey design.
adult_age_labels <- codebook$value_labels %>%
dplyr::filter(variable == "age") %>%
dplyr::transmute(
age_key = as.character(value),
age_label = sub("^Age\\s+", "", label)
) %>%
dplyr::mutate(
age_label = ifelse(age_label == "85 up", "85 or older", age_label)
)
adult_trip_design <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::left_join(
hts$person %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::transmute(
person_id,
age_key = as.character(age)
) %>%
dplyr::left_join(
adult_age_labels,
by = "age_key"
),
by = "person_id"
) %>%
dplyr::filter(
age_label %in% c(
"18-24",
"25-34",
"35-44",
"45-54",
"55-64",
"65-74",
"75-84",
"85 or older"
),
!is.na(trip_weight),
trip_weight > 0
) %>%
dplyr::left_join(
hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::select(hh_id, sample_segment),
by = "hh_id"
) %>%
srvyr::as_survey_design(
ids = hh_id,
strata = sample_segment,
weights = trip_weight
)
Calculating Estimate Reliability (RSE)
One simple reliability check is to compare the estimate to its own standard error using a relative standard error (RSE).
trip_mode_rse <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::left_join(
hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::select(hh_id, sample_segment),
by = "hh_id"
) %>%
dplyr::mutate(mode_type = as.character(mode_type)) %>%
dplyr::left_join(
codebook$value_labels %>%
dplyr::filter(variable == "mode") %>%
dplyr::transmute(
analysis_mode_key = as.character(value),
analysis_mode = label
),
by = c("mode_type" = "analysis_mode_key")
) %>%
dplyr::mutate(
analysis_weight = trip_weight
) %>%
dplyr::filter(
!is.na(sample_segment),
!is.na(analysis_weight),
!is.na(analysis_mode),
analysis_weight > 0
) %>%
srvyr::as_survey_design(
ids = hh_id,
strata = sample_segment,
weights = analysis_weight
) %>%
dplyr::group_by(analysis_mode) %>%
dplyr::summarize(
share = srvyr::survey_prop(vartype = "se"),
.groups = "drop"
) %>%
dplyr::mutate(
rse = ifelse(share > 0, share_se / share, NA_real_)
)
Table 57 reports each estimate beside its standard error and relative standard error.
gt::gt(trip_mode_rse) %>%
gt::fmt_percent(columns = c(share, share_se, rse), decimals = 1) %>%
gt::cols_label(
analysis_mode = "Trip Mode",
share = "Share",
share_se = "Std. Error",
rse = "RSE"
) %>%
gt::tab_header(title = "Relative Standard Errors for Trip Mode Shares")
Relative Standard Errors for Trip Mode Shares
| Trip Mode | Share | Std. Error | RSE |
|---|---|---|---|
| Bike | 1.2% | 0.1% | 8.7% |
| Bike share | 0.1% | 0.0% | 31.7% |
| Car | 73.3% | 0.5% | 0.7% |
| Car share | 0.1% | 0.0% | 40.3% |
| Ferry | 0.1% | 0.0% | 26.8% |
| Long Distance Passenger | 0.2% | 0.0% | 12.7% |
| Missing Response | 2.3% | 0.1% | 4.8% |
| Other | 0.9% | 0.1% | 9.4% |
| School bus | 1.3% | 0.1% | 7.1% |
| Scooter share | 0.0% | 0.0% | 49.7% |
| Shuttle / Vanpool | 0.3% | 0.0% | 15.0% |
| TNC | 0.9% | 0.1% | 10.4% |
| Taxi | 0.2% | 0.0% | 19.0% |
| Transit | 2.8% | 0.1% | 3.9% |
| Walk | 16.4% | 0.4% | 2.4% |
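RSE values are easier to act on when they are grouped into reliability bands. The sketch below shows one way to do that in base R. The input values are made up, and the 15% and 30% cut points are illustrative assumptions, not a MassDOT standard; substitute the thresholds your agency or publication uses.

```r
# Hypothetical shares and standard errors (not the values from the table above).
toy_rse <- data.frame(
  analysis_mode = c("Car", "Bike", "Scooter share"),
  share = c(0.733, 0.012, 0.0004),
  share_se = c(0.005, 0.0024, 0.0002),
  stringsAsFactors = FALSE
)

# RSE is the standard error divided by the estimate.
toy_rse$rse <- ifelse(toy_rse$share > 0, toy_rse$share_se / toy_rse$share, NA_real_)

# Assumed reliability bands: up to 15%, 15% to 30%, and above 30%.
toy_rse$reliability <- cut(
  toy_rse$rse,
  breaks = c(0, 0.15, 0.30, Inf),
  labels = c("Reliable", "Use with caution", "Unreliable"),
  include.lowest = TRUE
)

print(toy_rse)
```

Estimates that land in the highest band are candidates for the remedies listed in the next subsection, such as collapsing categories or broadening the reporting domain.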
Working with Small Sample Sizes
When estimates are unstable:
- Broaden the reporting domain.
- Collapse sparse categories where it is substantively reasonable.
- Keep zero-valued days or households in the denominator when they are part of the analytic universe.
- Consider model-based approaches rather than repeated subgroup slicing.
- Report uncertainty clearly instead of presenting small-cell estimates as precise.
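As an illustration of collapsing sparse categories, the base R sketch below combines three low-volume modes into one group before computing shares. The weighted trip counts and the grouping itself are hypothetical; whether a given combination is substantively reasonable is an analyst judgment.

```r
# Hypothetical weighted trip totals by mode.
toy_modes <- data.frame(
  analysis_mode = c("Car", "Walk", "Taxi", "TNC", "Car share"),
  weighted_trips = c(7300, 1600, 20, 90, 10),
  stringsAsFactors = FALSE
)

# Assumed grouping: treat sparse for-hire and shared-car modes as one category.
sparse_modes <- c("Taxi", "TNC", "Car share")
toy_modes$mode_group <- ifelse(
  toy_modes$analysis_mode %in% sparse_modes,
  "Taxi / TNC / Car share",
  toy_modes$analysis_mode
)

# Sum within the collapsed groups, then compute shares over all trips.
collapsed <- aggregate(weighted_trips ~ mode_group, data = toy_modes, FUN = sum)
collapsed$share <- collapsed$weighted_trips / sum(collapsed$weighted_trips)

print(collapsed)
```

In a survey-design workflow, the collapsed variable would be created before the design object is summarized, so that standard errors are estimated for the combined category rather than pieced together from the sparse ones.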
15 Common Travel Metrics
15.2 Trip Rates
Understanding trip rates requires aligning the unit of analysis with the survey’s hierarchical structure and weighting design. This section introduces the recommended analytic units for trips and person-days, outlines how to calculate weighted trip rates correctly, and highlights key pitfalls to avoid.
For analyses that compare trip-making across specific weekdays, use the day-of-week workflow described in Section 16. The standard day and trip weights remain the default for overall average-day trip-rate estimates.
For MassDOT, these examples use complete households as the default analytic universe.
To calculate a weighted trip rate, divide the weighted count of trips by the weighted count of person-days. This approach keeps both travelers and non-travelers represented correctly.
When both linked and unlinked trip tables are available, linked trips are usually the better numerator for whole-trip rates. Unlinked trips are more appropriate when the analysis is explicitly about trip legs or segments.
Start by calculating the weighted numerator and denominator directly.
weighted_trip_rate <- sum(
hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::pull(trip_weight),
na.rm = TRUE
) / sum(
hts$day %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::pull(day_weight),
na.rm = TRUE
)
Table 59 reports the resulting weighted trip rate.
gt::gt(data.frame(weighted_trip_rate = weighted_trip_rate)) %>%
gt::fmt_number(columns = weighted_trip_rate, decimals = 2) %>%
gt::tab_header(title = "Weighted Trip Rate")
Weighted Trip Rate
| weighted_trip_rate |
|---|
| 4.45 |
Why the Denominator (Household, Person, Day) Weights Matter
Trip rates depend on both the number of trips recorded and the number of diary days those trips came from. Without day weights, respondents who provide more usable diary days can exert disproportionate influence, and zero-trip days can drop out of the denominator.
Why Trip Weights Matter
Trip weights expand recorded trips to population-level trip totals. A correct trip-rate calculation therefore uses:
- trip weights in the numerator
- household-, person-, or day-level weights in the denominator, depending on the metric
Why Zero-Travel Days Matter
Even after correcting for nonresponse and trip underreporting, people who did not travel on a given day remain part of the analytic universe. Excluding zero-trip days overstates trip rates because the denominator omits valid person-days with no travel.
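A small worked example shows the size of the effect. The base R sketch below uses three made-up person-days and, as a simplifying assumption, gives every trip the same weight as its person-day, so the weighted trip total for a day is day_weight times num_trips.

```r
# Hypothetical person-days; the second day has no travel.
toy_days <- data.frame(
  day_id = 1:3,
  day_weight = c(100, 100, 200),
  num_trips = c(4, 0, 2)
)

# Simplifying assumption: each trip carries its person-day weight.
toy_days$weighted_trips <- toy_days$day_weight * toy_days$num_trips

# Correct rate: all person-days stay in the denominator.
rate_all_days <- sum(toy_days$weighted_trips) / sum(toy_days$day_weight)

# Overstated rate: zero-trip days are dropped from the denominator.
travel_days <- toy_days[toy_days$num_trips > 0, ]
rate_travel_days_only <- sum(travel_days$weighted_trips) / sum(travel_days$day_weight)
```

With these numbers the rate is 2.00 trips per person-day when all days are kept and about 2.67 when the zero-trip day is dropped, an overstatement of one third.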
Constructing a Person-Day Trip Rate Dataset
A typical workflow begins by aggregating trips to the day level, joining that summary back to the day table, and filling in zeros for days without travel.
weighted_trips <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::group_by(day_id) %>%
dplyr::summarize(
weighted_trips = sum(trip_weight, na.rm = TRUE),
.groups = "drop"
)
day_trip_rates <- hts$day %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::left_join(
weighted_trips,
by = "day_id"
) %>%
dplyr::mutate(
weighted_trips = ifelse(is.na(weighted_trips), 0, weighted_trips),
weighted_trips_per_day = weighted_trips / day_weight
)
Table 60 shows the day-level dataset after weighted trips have been joined back to the person-day denominator.
gt::gt(dplyr::slice_head(day_trip_rates, n = 10)) %>%
gt::fmt_number(
columns = c(day_weight, weighted_trips, weighted_trips_per_day),
decimals = 2
) %>%
gt::tab_header(title = "Day-Level Weighted Trip Rates")
Day-Level Weighted Trip Rates
| day_id | person_id | travel_date | day_num | travel_dow | person_num | surveyable | is_participant | hh_id | travel_day | hh_day_complete | num_complete_trip_surveys | num_trips | is_complete | hh_is_complete | proxy_complete | begin_day | end_day | school_daily | telecommute_time | made_travel | num_reasons_no_travel | attend_school_1 | attend_school_2 | attend_school_3 | attend_school_998 | attend_school_999 | attend_school_no_1 | attend_school_no_2 | attend_school_no_4 | attend_school_no_5 | attend_school_no_997 | attend_school_no_998 | attend_school_no_999 | congestion | delivery_2 | delivery_3 | delivery_4 | delivery_5 | delivery_6 | delivery_7 | delivery_8 | delivery_996 | no_travel_1 | no_travel_11 | no_travel_12 | no_travel_2 | no_travel_3 | no_travel_4 | no_travel_5 | no_travel_6 | no_travel_7 | no_travel_8 | no_travel_9 | no_travel_99 | attend_school_no_3 | daily_activity_pattern | day_weight | day_weight_tue | day_weight_fri | day_weight_mon | day_weight_sat | day_weight_sun | day_weight_thu | day_weight_wed | weighted_trips | weighted_trips_per_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 240000890101 | 2400008901 | 2024-06-11 | 1 | 2 | 1 | 1 | 1 | 24000089 | 1 | 1 | 2 | 2 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 2 | 157.43 | 187.88768 | NA | NA | NA | NA | NA | NA | 494.27 | 3.14 |
| 240000890201 | 2400008902 | 2024-06-11 | 1 | 2 | 2 | 1 | 1 | 24000089 | 1 | 1 | 5 | 5 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 2 | 157.43 | 187.88768 | NA | NA | NA | NA | NA | NA | 1,235.68 | 7.85 |
| 240001220101 | 2400012201 | 2024-06-11 | 1 | 2 | 1 | 1 | 1 | 24000122 | 1 | 1 | 5 | 7 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 25.84 | 59.87428 | NA | NA | NA | NA | NA | NA | 180.86 | 7.00 |
| 240001220102 | 2400012201 | 2024-06-12 | 2 | 3 | 1 | 1 | 1 | 24000122 | 1 | 1 | 5 | 5 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 1 | 25.84 | NA | NA | NA | NA | NA | NA | 58.57788 | 129.18 | 5.00 |
| 240001220103 | 2400012201 | 2024-06-13 | 3 | 4 | 1 | 1 | 1 | 24000122 | 1 | 1 | 5 | 5 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 2 | 25.84 | NA | NA | NA | NA | NA | 64.49532 | NA | 129.18 | 5.00 |
| 240001220104 | 2400012201 | 2024-06-14 | 4 | 5 | 1 | 1 | 1 | 24000122 | 1 | 1 | 2 | 2 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 3 | NA | NA | 75.95139 | NA | NA | NA | NA | NA | 0.00 | NA |
| 240001220105 | 2400012201 | 2024-06-15 | 5 | 6 | 1 | 1 | 1 | 24000122 | 1 | 1 | 8 | 10 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 2 | NA | NA | NA | NA | 76.29594 | NA | NA | NA | 0.00 | NA |
| 240001220106 | 2400012201 | 2024-06-16 | 6 | 7 | 1 | 1 | 1 | 24000122 | 1 | 1 | 6 | 6 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 2 | NA | NA | NA | NA | NA | 74.33908 | NA | NA | 0.00 | NA |
| 240001220107 | 2400012201 | 2024-06-17 | 7 | 1 | 1 | 1 | 1 | 24000122 | 1 | 1 | 4 | 4 | 1 | 1 | 995 | 1 | 1 | 995 | NA | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 2 | 25.84 | NA | NA | 65.09989 | NA | NA | NA | NA | 103.35 | 4.00 |
| 240001400101 | 2400014001 | 2024-06-14 | 1 | 5 | 1 | 1 | 1 | 24000140 | 1 | 1 | 2 | 2 | 1 | 1 | 995 | 1 | 1 | 995 | 390 | 995 | 0 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 995 | 2 | NA | NA | 655.71500 | NA | NA | NA | NA | NA | 0.00 | NA |
15.3 Person-Miles Traveled (PMT) and Vehicle-Miles Traveled (VMT)
Analysis of person-miles and vehicle-miles traveled proceeds similarly to the analysis of trip rates, with some additional considerations for occupancy and drive-mode identification.
Calculating PMT
Because the trip table is a person-trip table, total person-miles traveled can be calculated by summing the product of trip distance and trip weight.
total_pmt <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::summarize(
total_pmt = sum(
distance_miles * trip_weight,
na.rm = TRUE
)
)
Table 61 reports the weighted person-miles represented by the trip table.
gt::gt(total_pmt) %>%
gt::fmt_number(columns = total_pmt, decimals = 2) %>%
gt::tab_header(title = "Total PMT")
Total PMT
| total_pmt |
|---|
| 294,281,801.73 |
Calculating VMT
Vehicle-miles traveled require an occupancy adjustment so that a shared vehicle trip is counted once rather than once per occupant. In the prepared MassDOT files, num_travelers is the most common starting point.
trip_mode_labels <- codebook$value_labels %>%
dplyr::filter(variable == "mode") %>%
dplyr::transmute(
mode_code = value,
mode_key = as.character(value),
mode_label = label
)
total_vmt <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::mutate(
mode_key = as.character(mode_type),
occupancy = num_travelers
) %>%
dplyr::left_join(
trip_mode_labels %>%
dplyr::select(mode_key, mode_label),
by = "mode_key"
) %>%
dplyr::filter(
!is.na(occupancy),
occupancy > 0,
occupancy != 995,
stringr::str_detect(
mode_label,
"Drive|Car|SOV|Hov|Motorcycle"
)
) %>%
dplyr::mutate(vmt = distance_miles / occupancy) %>%
dplyr::summarize(
total_vmt = sum(vmt * trip_weight, na.rm = TRUE)
)
Table 62 reports the weighted VMT after filtering to drive-mode records and adjusting each trip by occupancy.
gt::gt(total_vmt) %>%
gt::fmt_number(columns = total_vmt, decimals = 2) %>%
gt::tab_header(title = "Total VMT")
Total VMT
| total_vmt |
|---|
| 151,301,985.08 |
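The occupancy adjustment is easiest to see with a toy case. In the base R sketch below, persons 1 and 2 ride together in one car for 10 miles and person 3 drives alone for 6 miles. All values, including the trip weights, are made up, and all three records are assumed to be drive-mode trips.

```r
# Hypothetical person-trip records for drive-mode trips.
toy_trips <- data.frame(
  person_id = c(1, 2, 3),
  distance_miles = c(10, 10, 6),
  num_travelers = c(2, 2, 1),
  trip_weight = c(50, 50, 80)
)

# PMT counts every person's miles.
toy_pmt <- sum(toy_trips$distance_miles * toy_trips$trip_weight)

# VMT divides each person-trip by occupancy so a shared car is counted once.
toy_vmt <- sum(
  toy_trips$distance_miles / toy_trips$num_travelers * toy_trips$trip_weight
)
```

Here toy_pmt is 1,480 and toy_vmt is 980: the two shared-ride records together contribute 500 weighted vehicle-miles instead of 1,000.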
Disaggregating PMT and VMT by Population Subgroups
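To disaggregate PMT or VMT by population subgroup, aggregate the trip data to the day level first and then join the resulting day-level totals back to the day or household table. The example that follows applies the occupancy adjustment to every trip record with a valid num_travelers value and does not repeat the drive-mode filter used for total VMT, so add that filter before aggregating when the subgroup results need to match the VMT definition above.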
day_trip_vmt <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::mutate(occupancy = num_travelers) %>%
dplyr::filter(!is.na(occupancy), occupancy > 0, occupancy != 995) %>%
dplyr::mutate(vmt = distance_miles / occupancy) %>%
dplyr::group_by(day_id) %>%
dplyr::summarize(
total_wtd_vmt_on_day = sum(
vmt * trip_weight,
na.rm = TRUE
),
.groups = "drop"
)
day_trip_vmt <- hts$day %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::transmute(
day_id,
day_weight,
household_id = hh_id
) %>%
dplyr::left_join(
day_trip_vmt,
by = "day_id"
) %>%
dplyr::mutate(
total_wtd_vmt_on_day = ifelse(is.na(total_wtd_vmt_on_day), 0, total_wtd_vmt_on_day),
wtd_vmt_per_day = total_wtd_vmt_on_day / day_weight
)
Table 63 shows the day-level file that can be joined to other denominator tables for subgroup analysis.
gt::gt(dplyr::slice_head(day_trip_vmt, n = 10)) %>%
gt::fmt_number(
columns = c(day_weight, total_wtd_vmt_on_day, wtd_vmt_per_day),
decimals = 2
) %>%
gt::tab_header(title = "Day-Level VMT Summary")
Day-Level VMT Summary
| day_id | day_weight | household_id | total_wtd_vmt_on_day | wtd_vmt_per_day |
|---|---|---|---|---|
| 240000890101 | 157.43 | 24000089 | 1,449.64 | 9.21 |
| 240000890201 | 157.43 | 24000089 | 973.44 | 6.18 |
| 240001220101 | 25.84 | 24000122 | 182.02 | 7.05 |
| 240001220102 | 25.84 | 24000122 | 90.33 | 3.50 |
| 240001220103 | 25.84 | 24000122 | 95.57 | 3.70 |
| 240001220104 | NA | 24000122 | 0.00 | NA |
| 240001220105 | NA | 24000122 | 0.00 | NA |
| 240001220106 | NA | 24000122 | 0.00 | NA |
| 240001220107 | 25.84 | 24000122 | 125.95 | 4.87 |
| 240001400101 | NA | 24000140 | 0.00 | NA |
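Once the day-level file exists, a subgroup rate is a ratio of sums within each group. The base R sketch below joins a made-up day-level file to a made-up household attribute; vehicle_group is a hypothetical subgroup label, and the numbers are not taken from the table above.

```r
# Hypothetical day-level file in the same shape as the summary above.
toy_day_vmt <- data.frame(
  day_id = 1:4,
  household_id = c(1, 1, 2, 3),
  day_weight = c(100, 100, 50, 200),
  total_wtd_vmt_on_day = c(900, 0, 50, 1000)
)

# Hypothetical household attribute used to define subgroups.
toy_hh <- data.frame(
  household_id = c(1, 2, 3),
  vehicle_group = c("1+ vehicles", "Zero vehicles", "1+ vehicles"),
  stringsAsFactors = FALSE
)

# Join the subgroup label onto each person-day.
joined <- merge(toy_day_vmt, toy_hh, by = "household_id", all.x = TRUE)

# Sum weighted VMT and day weights within each subgroup, then take the ratio.
# The formula interface drops rows with missing values in either column.
by_group <- aggregate(
  cbind(total_wtd_vmt_on_day, day_weight) ~ vehicle_group,
  data = joined,
  FUN = sum
)
by_group$vmt_per_person_day <- by_group$total_wtd_vmt_on_day / by_group$day_weight

print(by_group)
```

Taking the ratio of summed numerators to summed day weights keeps zero-VMT days in the denominator. Rows with a missing day_weight, like several in the table above, fall out of both sums.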
15.4 Identifying Work Days and Telework Status
MassDOT does not include a separate day-level work_time field, but it does include a day-level telework duration field. In the codebook, telecommute_time is defined as time spent teleworking on the travel day, so for this study it can be used directly as a diary-day telework measure rather than as a prior-week proxy.
To identify work days, join the day table to person employment status and then flag days with at least one work-purpose trip. For MassDOT, d_purpose_category == 2 is Work and d_purpose_category == 3 is Work related, so it is usually best to use code 2 for travel to a work location and keep Work related separate unless the analysis specifically needs broader job-related travel.
The preparation steps below build the worker-day file in a linear sequence so each analytic decision stays visible. This example focuses on respondents coded as employed full-time, employed part-time, or self-employed (employment values 1, 2, and 3).
Start by selecting the day-level telework field and joining employment status from the person table.
worker_day_status <- hts$day %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::select(day_id, hh_id, person_id, telecommute_time) %>%
dplyr::left_join(
hts$person %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::select(person_id, employment),
by = "person_id"
)
Then restrict the file to workers and add day-level work-trip flags from the default trip table.
worker_day_status <- worker_day_status %>%
dplyr::filter(employment %in% c(1, 2, 3)) %>%
dplyr::left_join(
hts[[default_trip_table_name]] %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::group_by(day_id) %>%
dplyr::summarize(
has_work_trip = any(d_purpose_category == 2, na.rm = TRUE),
has_work_related_trip = any(d_purpose_category == 3, na.rm = TRUE),
.groups = "drop"
),
by = "day_id"
)
Next, derive the telework and work-day flags used in the classification table.
worker_day_status <- worker_day_status %>%
dplyr::mutate(
has_work_trip = ifelse(is.na(has_work_trip), FALSE, has_work_trip),
has_work_related_trip = ifelse(is.na(has_work_related_trip), FALSE, has_work_related_trip),
telework_min = telecommute_time,
teleworked_any = dplyr::if_else(
is.na(telework_min),
NA,
telework_min > 0
),
telework_flag = dplyr::case_when(
is.na(teleworked_any) ~ "Missing telework response",
teleworked_any ~ "telecommute_time > 0",
TRUE ~ "telecommute_time == 0"
),
work_trip_flag = dplyr::if_else(
has_work_trip,
"Work trip present",
"No work trip"
),
work_day_type = dplyr::case_when(
is.na(teleworked_any) & has_work_trip ~ "Missing telework response / work trip present",
is.na(teleworked_any) & !has_work_trip ~ "Missing telework response / no work trip",
teleworked_any & !has_work_trip ~ "Telework only",
teleworked_any & has_work_trip ~ "Hybrid",
!teleworked_any & has_work_trip ~ "In-person only",
!teleworked_any & !has_work_trip ~ "Non-work day"
)
)
Finally, collapse the worker-day records into the cross-tab used in the handbook table.
worker_day_telework_crosstab <- worker_day_status %>%
dplyr::count(telework_flag, work_trip_flag, name = "n_worker_days") %>%
tidyr::pivot_wider(
names_from = work_trip_flag,
values_from = n_worker_days,
values_fill = 0
) %>%
dplyr::mutate(
total = `No work trip` + `Work trip present`
)
Table 64 shows the observed cross-tab for complete-household worker days in the prepared MassDOT data. These are unweighted record counts, so use them to understand coding patterns rather than as population estimates.
gt::gt(worker_day_telework_crosstab) %>%
gt::fmt_number(
columns = c(`No work trip`, `Work trip present`, total),
decimals = 0,
sep_mark = ","
) %>%
gt::cols_label(
telework_flag = "Telework Status",
`No work trip` = "No Work Trip",
`Work trip present` = "Work Trip Present",
total = "Total"
) %>%
gt::tab_header(title = "Worker-Day Telework and Work-Trip Cross-Tab")
Worker-Day Telework and Work-Trip Cross-Tab
| Telework Status | No Work Trip | Work Trip Present | Total |
|---|---|---|---|
| Missing telework response | 1,088 | 199 | 1,287 |
| telecommute_time == 0 | 20,225 | 13,171 | 33,396 |
| telecommute_time > 0 | 16,434 | 5,690 | 22,124 |
This cross-tab maps cleanly to the common diary-day categories:
- Telework only: telecommute_time > 0 and no work trip
- Hybrid: telecommute_time > 0 and a work trip is present
- In-person only: telecommute_time == 0 and a work trip is present
- Non-work day: telecommute_time == 0 and no work trip
Keep missing telework responses separate rather than forcing them into 0. In the prepared MassDOT files, missing telecommute_time values are already stored as NA.
16 Day-of-Week Analysis
Use the alternate day-of-week weights when the question is explicitly about differences across Monday through Sunday. The standard weights remain the default for overall household, person, and average-day reporting; the alternate day-of-week weights are for day-specific person-day and trip analysis.
16.1 Trip Rates by Day of Week
The key pattern is:
- reshape the weekday-specific day weights to long form
- reshape the matching weekday-specific trip weights to long form
- aggregate weighted trips to the day level
- join those weighted trips back to the weighted person-day denominator
- estimate weekday means with a survey design that uses the weekday-specific day weights
Use the prep step below to build the weekday-specific person-day analysis file before calculating the final estimates.
day_weight_lookup <- day_of_week_day_weight_columns %>%
dplyr::select(weekday, weight_column)
trip_weight_lookup <- day_of_week_trip_weight_columns %>%
dplyr::select(weekday, weight_column)
day_long <- hts$day %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::left_join(
hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::select(hh_id, sample_segment),
by = "hh_id"
) %>%
tidyr::pivot_longer(
cols = day_weight_lookup$weight_column,
names_to = "weight_column",
values_to = "day_weight_dow"
) %>%
dplyr::left_join(
day_weight_lookup,
by = "weight_column"
) %>%
dplyr::mutate(
weekday = factor(weekday, levels = day_of_week_weekday_order)
) %>%
dplyr::filter(
!is.na(sample_segment),
!is.na(day_weight_dow),
day_weight_dow > 0
)
trip_long <- hts[[day_of_week_trip_table_name]] %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
tidyr::pivot_longer(
cols = trip_weight_lookup$weight_column,
names_to = "weight_column",
values_to = "trip_weight_dow"
) %>%
dplyr::left_join(
trip_weight_lookup,
by = "weight_column"
) %>%
dplyr::mutate(
weekday = factor(weekday, levels = day_of_week_weekday_order)
) %>%
dplyr::filter(
!is.na(trip_weight_dow),
trip_weight_dow > 0
)
weighted_trips_by_day <- trip_long %>%
dplyr::group_by(day_id, weekday) %>%
dplyr::summarize(
weighted_trips = sum(trip_weight_dow, na.rm = TRUE),
.groups = "drop"
)
day_trip_rates_dow <- day_long %>%
dplyr::left_join(
weighted_trips_by_day,
by = c("day_id", "weekday")
) %>%
dplyr::mutate(
weighted_trips = ifelse(is.na(weighted_trips), 0, weighted_trips),
wtd_trips_on_day = weighted_trips / day_weight_dow
)
trip_rate_by_weekday <- day_trip_rates_dow %>%
dplyr::filter(!is.na(wtd_trips_on_day)) %>%
srvyr::as_survey_design(
ids = hh_id,
strata = sample_segment,
weights = day_weight_dow
) %>%
dplyr::group_by(weekday) %>%
dplyr::summarize(
trip_rate = srvyr::survey_mean(wtd_trips_on_day, vartype = "ci"),
.groups = "drop"
)
Table 65 shows the weekday-specific trip-rate estimates after the long-format weights and day-level totals have been assembled.
gt::gt(trip_rate_by_weekday) %>%
gt::fmt_number(
columns = c(trip_rate, trip_rate_low, trip_rate_upp),
decimals = 2
) %>%
gt::cols_label(
weekday = "Weekday",
trip_rate = "Trip Rate",
trip_rate_low = "CI Low",
trip_rate_upp = "CI High"
) %>%
gt::tab_header(title = "Trip Rates by Day of Week")

Trip Rates by Day of Week

| Weekday | Trip Rate | CI Low | CI High |
|---|---|---|---|
| Monday | 4.54 | 4.40 | 4.68 |
| Tuesday | 4.70 | 4.57 | 4.82 |
| Wednesday | 4.76 | 4.64 | 4.89 |
| Thursday | 4.74 | 4.62 | 4.87 |
| Friday | 4.98 | 4.82 | 5.13 |
| Saturday | 4.91 | 4.74 | 5.09 |
| Sunday | 4.07 | 3.92 | 4.23 |
This workflow keeps zero-trip days in the denominator, which is critical for valid person-day trip rates.
It also keeps the day-of-week estimates inside the complete-household analytic universe used for most MassDOT reporting.
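A small worked example may help show why the zero-trip days matter. The sketch below uses made-up weights and trips (three person-days, one with no travel) and plain `weighted.mean()` rather than a survey design, so it illustrates the arithmetic only, not the variance estimation.

```r
library(dplyr)

# Toy person-days (hypothetical values): day 3 has a positive weight but no trips.
days <- tibble::tibble(
  day_id = c(1, 2, 3),
  day_weight_dow = c(120, 150, 80)
)
trips <- tibble::tibble(
  day_id = c(1, 1, 2),
  trip_weight_dow = c(120, 120, 150)
)

trips_by_day <- trips %>%
  group_by(day_id) %>%
  summarize(weighted_trips = sum(trip_weight_dow), .groups = "drop")

# The left join keeps day 3; an inner join would silently drop it.
rates <- days %>%
  left_join(trips_by_day, by = "day_id") %>%
  mutate(
    weighted_trips = coalesce(weighted_trips, 0),
    wtd_trips_on_day = weighted_trips / day_weight_dow
  )

# Rate over all person-days: 390 / 350, about 1.11 trips per day.
rate_with_zero_days <- weighted.mean(rates$wtd_trips_on_day, rates$day_weight_dow)

# Rate over travel days only: 390 / 270, about 1.44 trips per day.
travel_days <- dplyr::filter(rates, weighted_trips > 0)
rate_travel_days_only <- weighted.mean(
  travel_days$wtd_trips_on_day,
  travel_days$day_weight_dow
)
```

In this toy case, dropping the single zero-trip day inflates the rate by roughly 30 percent, because the numerator is unchanged while the weighted denominator shrinks.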
16.2 Telework Rates by Day of Week
Use the same weekday-specific person-day design for telework participation. In the prepared MassDOT files, missing telecommute_time values are already stored as NA, and positive minutes indicate that some telework occurred on that day.
telework_rate_by_weekday <- day_long %>%
dplyr::mutate(
telework_min = telecommute_time,
teleworked_any = dplyr::if_else(
is.na(telework_min),
NA,
telework_min > 0
)
) %>%
dplyr::filter(!is.na(teleworked_any)) %>%
srvyr::as_survey_design(
ids = hh_id,
strata = sample_segment,
weights = day_weight_dow
) %>%
dplyr::group_by(weekday) %>%
dplyr::summarize(
telework_rate = srvyr::survey_mean(teleworked_any, vartype = "ci"),
.groups = "drop"
)
Table 66 reports the weekday-specific telework participation rates from the same day-level survey design.
gt::gt(telework_rate_by_weekday) %>%
gt::fmt_percent(
columns = c(telework_rate, telework_rate_low, telework_rate_upp),
decimals = 1
) %>%
gt::cols_label(
weekday = "Weekday",
telework_rate = "Telework Rate",
telework_rate_low = "CI Low",
telework_rate_upp = "CI High"
) %>%
gt::tab_header(title = "Telework Rates by Day of Week")

Telework Rates by Day of Week

| Weekday | Telework Rate | CI Low | CI High |
|---|---|---|---|
| Monday | 43.5% | 41.4% | 45.6% |
| Tuesday | 44.1% | 42.2% | 46.1% |
| Wednesday | 43.9% | 42.1% | 45.8% |
| Thursday | 43.2% | 41.3% | 45.1% |
| Friday | 45.4% | 43.1% | 47.6% |
| Saturday | 14.5% | 12.7% | 16.4% |
| Sunday | 12.9% | 11.1% | 14.6% |
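The handling of missing telework minutes can be checked on a toy vector. The values below are invented and unweighted; the point is only to show how excluding `NA` days differs from treating them as zero-telework days.

```r
# Hypothetical telework minutes for five person-days; one is missing.
telecommute_time <- c(0, 45, NA, 480, 0)

# The comparison propagates NA, so a missing day stays missing.
teleworked_any <- telecommute_time > 0

# Excluding missing days (the approach used for Table 66): 2 of 4 days.
rate_excluding_missing <- mean(teleworked_any, na.rm = TRUE)

# Treating missing as "no telework" instead: 2 of 5 days.
rate_missing_as_zero <- mean(ifelse(is.na(teleworked_any), FALSE, teleworked_any))
```

Filtering on `!is.na(teleworked_any)` before building the design corresponds to the first calculation, so days with unknown telework status affect neither the numerator nor the denominator.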
When the goal is a single overall estimate for the study area, return to the standard average-day workflow in Section 14 and Section 15. Use the alternate day-of-week weights only when the day of the week itself is part of the analytic question.
17 Advanced Analysis
17.1 From Description to Inference: Using Weighted Models
Simple weighted proportions, with accompanying standard errors or confidence intervals, are an excellent first tool for describing population patterns. However, there are many situations where weighted proportions alone are not sufficient for reliable inference. When subgroup sample sizes are small or design effects are large, analysts should use weighted multivariate models rather than relying solely on repeated subgroup tabulations.
Weighted models keep the full sample intact, improve statistical precision, and allow analysts to estimate the unique contribution of each factor while holding others constant. This approach avoids the instability that arises from slicing the data into many small subpopulations.
Most analysts will work in R, Stata, SPSS, or SAS. Each platform provides dedicated tools for fitting regression models that correctly incorporate survey weights, clustering, and stratification. Across platforms, the key principle is the same: define the survey design once, then fit models using functions that respect the sampling structure to obtain valid, population-representative inferences.
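In R, the "define once, then reuse" principle might look like the sketch below. It uses the `apistrat` example data that ships with the `survey` package rather than the MassDOT files, so the variables (`api00`, `ell`, `meals`, `stype`, `pw`, `fpc`) are stand-ins chosen only because they are publicly available; that design is stratified but has no clustering stage.

```r
library(survey)

# Example data shipped with the survey package (not household travel data).
data(api, package = "survey")

# Define the design once: strata, weights, and finite population correction.
api_design <- svydesign(
  ids = ~1,
  strata = ~stype,
  weights = ~pw,
  fpc = ~fpc,
  data = apistrat
)

# Reuse the same design object for a descriptive estimate...
api_mean <- svymean(~api00, design = api_design)

# ...and for a design-based regression.
api_model <- svyglm(api00 ~ ell + meals, design = api_design)

api_mean
summary(api_model)
```

Because both calls receive the same design object, the descriptive estimate and the model coefficients are computed under one consistent set of weights and strata.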
Does Telework Reduce VMT?
One common example is a survey-weighted regression that estimates daily VMT as a function of telework status while controlling for household and person characteristics.
For MassDOT, the model example below begins from complete households so the day-level outcome and predictors reflect the same household-complete analytic universe used elsewhere in the guide.
Start by aggregating trip-level VMT to the diary-day level so the outcome matches the day-level telework measure.
day_trip_vmt <- hts$trip_unlinked %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::mutate(occupancy = num_travelers) %>%
dplyr::filter(!is.na(occupancy), occupancy > 0, occupancy != 995) %>%
dplyr::mutate(vmt = distance_miles / occupancy) %>%
dplyr::group_by(day_id) %>%
dplyr::summarize(
total_wtd_vmt_on_day = sum(
vmt * trip_weight,
na.rm = TRUE
),
.groups = "drop"
)
day_trip_vmt <- hts$day %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::transmute(
day_id,
day_weight
) %>%
dplyr::left_join(
day_trip_vmt,
by = "day_id"
) %>%
dplyr::mutate(
total_wtd_vmt_on_day = ifelse(is.na(total_wtd_vmt_on_day), 0, total_wtd_vmt_on_day),
wtd_vmt_per_day = total_wtd_vmt_on_day / day_weight
)
Next, assemble the model dataset by joining the day-level outcome to the household and person predictors used in the regression.
model_data <- hts$day %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::select(person_id, hh_id, day_id, day_weight, telecommute_time) %>%
dplyr::left_join(
day_trip_vmt %>%
dplyr::select(day_id, wtd_vmt_per_day),
by = "day_id"
) %>%
dplyr::left_join(
hts$hh %>%
dplyr::filter(is_complete == 1) %>%
dplyr::transmute(
hh_id,
sample_segment,
num_vehicles,
income_broad,
income_key = as.character(income_broad)
),
by = "hh_id"
) %>%
dplyr::left_join(
income_value_labels,
by = "income_key"
) %>%
dplyr::left_join(
hts$person %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::select(person_id),
by = "person_id"
) %>%
dplyr::filter(day_weight > 0) %>%
dplyr::mutate(
telework_min = telecommute_time,
telework_group = dplyr::case_when(
is.na(telework_min) ~ "Missing",
telework_min == 0 ~ "0 min",
telework_min <= 120 ~ "1-120 min",
telework_min <= 240 ~ "121-240 min",
telework_min > 240 ~ "240+ min"
),
telework_group = factor(
telework_group,
levels = c("0 min", "1-120 min", "121-240 min", "240+ min", "Missing")
),
num_vehicles_group = factor(
dplyr::case_when(
num_vehicles %in% c(995, 999) ~ "Missing",
num_vehicles >= 4 ~ "4+",
TRUE ~ as.character(num_vehicles)
),
levels = c("0", "1", "2", "3", "4+", "Missing")
),
income_broad_label = factor(
income_broad_label,
levels = income_value_labels$income_broad_label
)
)
model_data <- model_data %>%
dplyr::mutate(
telework_group = droplevels(telework_group),
num_vehicles_group = droplevels(num_vehicles_group),
income_broad_label = droplevels(income_broad_label)
)
If the analytic question also depends on age, a useful extension is to collapse the delivered age categories into broader groups before fitting the regression.
age_value_labels <- codebook$value_labels %>%
dplyr::filter(variable == "age") %>%
dplyr::transmute(
age_key = as.character(value),
age_label = sub("^Age\\s+", "", label)
) %>%
dplyr::mutate(
age_label = ifelse(age_label == "85 up", "85 or older", age_label)
)
model_data <- model_data %>%
dplyr::left_join(
hts$person %>%
dplyr::filter(hh_id %in% complete_hh_ids) %>%
dplyr::transmute(
person_id,
age_key = as.character(age)
) %>%
dplyr::left_join(
age_value_labels,
by = "age_key"
),
by = "person_id"
) %>%
dplyr::mutate(
age_group = dplyr::case_when(
age_label %in% c("18-24", "25-34") ~ "18-34",
age_label %in% c("35-44", "45-54") ~ "35-54",
age_label %in% c("55-64", "65-74", "75-84", "85 or older") ~ "55+",
TRUE ~ "Missing"
),
age_group = factor(age_group, levels = c("18-34", "35-54", "55+", "Missing"))
) %>%
dplyr::mutate(age_group = droplevels(age_group))
vmt_model_formula <- wtd_vmt_per_day ~ telework_group + num_vehicles_group + income_broad_label + age_group
Finally, define the survey design, fit the weighted model, and tidy the coefficients for display.
vmt_design <- model_data %>%
srvyr::as_survey_design(
ids = hh_id,
strata = sample_segment,
weights = day_weight
)
vmt_model_formula <- wtd_vmt_per_day ~ telework_group + num_vehicles_group + income_broad_label
vmt_model <- survey::svyglm(
vmt_model_formula,
design = vmt_design
)
model_tbl <- broom::tidy(vmt_model, conf.int = TRUE) %>%
dplyr::mutate(
term_clean = dplyr::case_when(
term == "(Intercept)" ~ "Intercept",
stringr::str_detect(term, "^telework_group") ~ stringr::str_replace(term, "telework_group", "Telework: "),
stringr::str_detect(term, "^num_vehicles_group") ~ stringr::str_replace(term, "num_vehicles_group", "Vehicles: "),
stringr::str_detect(term, "^income_broad_label") ~ stringr::str_replace(term, "income_broad_label", "Income: "),
TRUE ~ term
),
stars = dplyr::case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
TRUE ~ ""
)
) %>%
dplyr::select(term_clean, estimate, std.error, statistic, p.value, stars, conf.low, conf.high)
Table 67 presents the base weighted model with standard errors and confidence intervals.
gt::gt(model_tbl) %>%
gt::fmt_number(
columns = c(estimate, std.error, statistic, conf.low, conf.high),
decimals = 2
) %>%
gt::fmt_number(columns = p.value, decimals = 3) %>%
gt::cols_label(
term_clean = "Term",
estimate = "Estimate",
std.error = "Std. Error",
statistic = "t-value",
p.value = "p-value",
stars = "",
conf.low = "CI Low",
conf.high = "CI High"
) %>%
gt::tab_header(
title = "Base Survey-Weighted Linear Model of Daily VMT",
subtitle = "Outcome: Weighted Vehicle-Miles Traveled per Diary Day"
) %>%
gt::tab_options(
table.font.size = gt::px(13),
data_row.padding = gt::px(4)
)

Base Survey-Weighted Linear Model of Daily VMT
Outcome: Weighted Vehicle-Miles Traveled per Diary Day

| Term | Estimate | Std. Error | t-value | p-value | | CI Low | CI High |
|---|---|---|---|---|---|---|---|
| Intercept | 25.45 | 5.58 | 4.56 | 0.000 | *** | 14.51 | 36.40 |
| Telework: 1-120 min | 2.98 | 7.18 | 0.41 | 0.679 | | −11.11 | 17.06 |
| Telework: 121-240 min | −0.13 | 6.43 | −0.02 | 0.983 | | −12.74 | 12.47 |
| Telework: 240+ min | −17.63 | 5.75 | −3.07 | 0.002 | ** | −28.89 | −6.37 |
| Telework: Missing | −22.98 | 6.27 | −3.66 | 0.000 | *** | −35.27 | −10.69 |
| Vehicles: 1 | 6.08 | 3.35 | 1.82 | 0.069 | | −0.48 | 12.65 |
| Vehicles: 2 | 17.76 | 5.69 | 3.12 | 0.002 | ** | 6.61 | 28.90 |
| Vehicles: 3 | 17.77 | 9.62 | 1.85 | 0.065 | | −1.10 | 36.64 |
| Vehicles: 4+ | 21.04 | 9.44 | 2.23 | 0.026 | * | 2.54 | 39.54 |
| Income: $25,000-$49,999 | 11.98 | 9.45 | 1.27 | 0.205 | | −6.54 | 30.51 |
| Income: $50,000-$74,999 | 3.48 | 3.84 | 0.91 | 0.364 | | −4.04 | 11.01 |
| Income: $75,000-$99,999 | −0.31 | 3.04 | −0.10 | 0.920 | | −6.27 | 5.66 |
| Income: $100,000-$199,999 | 13.84 | 7.55 | 1.83 | 0.067 | | −0.96 | 28.65 |
| Income: $200,000 or more | 2.76 | 4.12 | 0.67 | 0.503 | | −5.31 | 10.83 |
| Income: Prefer not to answer | 1.52 | 4.53 | 0.34 | 0.737 | | −7.36 | 10.40 |
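One way to read Table 67 is to add coefficients for a few profiles. The sketch below copies three rounded estimates from the table. It assumes the reference profile is the set of omitted factor levels: 0 telework minutes, 0 vehicles, and the income category that does not appear in the table (presumably the lowest bracket).

```r
# Rounded coefficients copied from Table 67.
coefs <- c(
  intercept = 25.45,
  telework_240_plus = -17.63,
  vehicles_2 = 17.76
)

# Reference profile (assumed: 0 telework minutes, 0 vehicles, omitted income level).
vmt_reference <- coefs[["intercept"]]

# Same profile, but in the "240+ min" telework group on the diary day.
vmt_heavy_telework <- coefs[["intercept"]] + coefs[["telework_240_plus"]]

# Two-vehicle household, without and with heavy telework.
vmt_two_veh <- coefs[["intercept"]] + coefs[["vehicles_2"]]
vmt_two_veh_heavy_telework <- vmt_two_veh + coefs[["telework_240_plus"]]

round(
  c(
    reference = vmt_reference,
    heavy_telework = vmt_heavy_telework,
    two_vehicles = vmt_two_veh,
    two_vehicles_heavy_telework = vmt_two_veh_heavy_telework
  ),
  2
)
# reference 25.45, heavy_telework 7.82, two_vehicles 43.21,
# two_vehicles_heavy_telework 25.58
```

Because the model has no interaction terms, the difference associated with the "240+ min" group is the same 17.63 miles in every profile. The wide confidence intervals in the table apply to each of these sums as well.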
Why Use Weighted Models?
Weighted models become especially useful when analysts need to compare groups, adjust for multiple factors at once, or stabilize estimates for small subgroups. They do not replace descriptive tables, but they provide a more reliable route to inference when the question extends beyond simple description.
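As a sketch of a group comparison that adjusts for other factors, the example below again uses the `apistrat` data from the `survey` package as a stand-in for travel survey data. `stype` plays the role of the grouping factor, and `ell` and `meals` are adjustment covariates. `regTermTest()` then tests whether the group coefficients are jointly zero.

```r
library(survey)

# Example data shipped with the survey package (not household travel data).
data(api, package = "survey")

api_design <- svydesign(
  ids = ~1,
  strata = ~stype,
  weights = ~pw,
  fpc = ~fpc,
  data = apistrat
)

# Group factor (stype) plus two adjustment covariates.
group_model <- svyglm(api00 ~ stype + ell + meals, design = api_design)

# Design-based Wald test of the stype terms, holding ell and meals constant.
stype_test <- regTermTest(group_model, ~stype)
stype_test
```

A single joint test on the full sample replaces a series of pairwise subgroup tabulations, each of which would rest on a smaller sample.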