
Scope & Principles

  • This repo (RSGInc/hts_weighting) is the general toolkit. It should contain reusable functions (R/), generic runner scripts (scripts/), and documentation (Quarto workbook). Avoid hard-coding project quirks.

  • Project-specific code belongs in a project fork. Keep local recodes, one-off joins, and special rules in a fork (or a project branch in your org fork). This keeps the core clean and lets us evolve it safely.

  • Assume you’ll upstream useful changes. Even if you start in a fork, design your changes as if you’ll open a PR back to RSGInc/hts_weighting. That mindset prevents irreconcilable drift if we need to “re-weight” later using updates in the core toolkit.

  • Keep forks current with upstream. Regularly sync your fork’s main with RSGInc/hts_weighting:main to minimize merge conflicts and ensure you’re building on the latest stable base.

  • Open draft PRs early. Don’t wait until your feature is “perfect” to open a PR. Early drafts help surface conflicts, get feedback, and validate your approach.

  • Keep changes small and focused. Large, sweeping changes are hard to review and merge. Break big features into smaller, manageable PRs.

  • Test everything. Add project-specific tests in your fork, and keep the core tests here passing. Use the shared test database (hts_weighting_testing) to validate end-to-end runs.

  • Should test data change? Almost always no. Your results should match existing test outputs unless there’s a good reason to deviate. If your change causes test data to change (e.g., new expected outputs), this is an all-hands discussion, and the proposed new outputs must be vetted and approved.


Preferred Workflow (Fork → Branch → PR)

We use an “Upstream-First” fork-based workflow. The diagram below illustrates how to manage feature branches, keep forks in sync with upstream changes, and handle multiple project forks.

Core Tenets:

  • Keep feature branches small and focused.
```mermaid
flowchart LR
    subgraph Main["*Main 'Upstream' Repo (RSGInc/hts_weighting)*"]
        A("Main")
        Merge1@{ shape: procs, label: "Squash &<br>Merge PR" }
        Update(["Main Updated"])
        Merge3@{ shape: procs, label: "Squash &<br>Merge PR" }
        B(["Main Updated, again"])
    end

    subgraph Proj1["*Project Fork*"]
        B1(["Create Feature Branch<br>feature/my-change"])
        C1(["Develop &<br> Commit"])
        PR1{{"Open PR"}}
        %% D["Push Branch to Fork"]
        Sync1@{ shape: curv-trap, label: "Sync Fork" }
        Sync3@{ shape: curv-trap, label: "Sync Fork" }
    end

    subgraph Proj2["*Other Project Fork*"]
        B2(["Create Feature Branch<br>feature/my-change"])
        C2(["Develop &<br> Commit"])
        Sync2@{ shape: curv-trap, label: "Sync Fork" }
        Merge2@{ shape: procs, label: "Merge &<br>Reconcile" }
        C3(["Continue Developing &<br> Commit"])
        PR2{{"Open PR"}}
        Sync4@{ shape: curv-trap, label: "Sync Fork" }
    end

    A -->|Developer forks| Proj1
    A -->|Developer forks| Proj2
    B1 --> C1 --> PR1

    PR1 -.->|Revise| Merge1
    Merge1 -.->|Review| PR1
    Merge1 --> Update
    Update --> Sync1

    %% Update --> Merge3
    Merge3 --> B
    B --> Sync3
    B --> Sync4

    B2 --> C2
    C2 --> Sync2
    Update --> Sync2
    Sync2 --> Merge2
    Merge2 --> C3
    C3 --> PR2

    PR2 -.->|Revise| Merge3
    Merge3 -.->|Review| PR2

    style Main fill:#007acc,stroke:#004f8a,color:#fff
    style Proj1 fill:#b3e6ff,stroke:#004f8a,color:#004f8a
    style Proj2 fill:#b3e6ff,stroke:#004f8a,color:#004f8a
    style A fill:#b3e0ff,stroke:#004f8a
    style Merge1 fill:#66cc99,stroke:#26734d,color:#fff
    style Update fill:#99ffcc,stroke:#26734d
    style B1 fill:#ffe699,stroke:#b38600
    style C1 fill:#ffcc66,stroke:#b38600
    style B2 fill:#ffe699,stroke:#b38600
    style C2 fill:#ffcc66,stroke:#b38600
    style PR1 fill:#ffd699,stroke:#b38600
    style Sync1 fill:#cce6ff,stroke:#004f8a
    style Sync2 fill:#cce6ff,stroke:#004f8a
```
  1. Fork the repo (keep it in the RSGInc org).

  2. Clone your fork (naming convention: RSGInc/hts_weighting-<client>_<year>):

    git clone git@github.com:RSGInc/<your_fork>.git
    cd <your_fork>
    git remote add upstream git@github.com:RSGInc/hts_weighting.git
  3. Create a feature branch (don’t work on main):

    git checkout -b feature/<short-slug>
  4. Make and commit changes (small, focused commits; clear messages).

  5. Keep your fork current (early and often):

    git fetch upstream
    git checkout main
    git merge --ff-only upstream/main   # or: git rebase upstream/main
    git push origin main
    # update your feature branch:
    git checkout feature/<short-slug>
    git rebase main

    You can also use GitHub’s “Sync fork” button. Before delivery, always sync with upstream main.

  6. Open a draft PR to RSGInc/hts_weighting:main. Draft early to surface merge conflicts, CI results, and design feedback. Use “Compare across forks” if needed.

  7. Iterate → request review → merge. When approved, squash/merge or rebase/merge per repo conventions.


What Belongs Where

  • Core repo (here)

    • Generalizable functions, validations, and helpers
    • Script improvements that work across projects (config-driven)
    • Documentation, examples, QA/QC dashboards
  • Project fork

    • Local recodes, one-off joins, and special rules
    • Client-specific configs and overrides

Coding Guidelines (R & Quarto)

R

  • Prefer pure functions; accept settings and paths as arguments (avoid global state).
  • Validate inputs early; fail with clear, informative messages.
  • Keep the public interface small; document parameters and return types using roxygen.
  • Use data.table syntax consistently.
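
As a minimal sketch of these guidelines together (the function name and columns are hypothetical, not part of the toolkit):

```r
#' Count trips per household (hypothetical example)
#'
#' @param trips A data.table with columns `hh_id` and `trip_id`.
#' @return A data.table with one row per `hh_id` and an `n_trips` count.
count_trips = function(trips) {
  # Validate inputs early, with clear messages
  if (!data.table::is.data.table(trips)) {
    stop("`trips` must be a data.table")
  }
  missing_cols = setdiff(c("hh_id", "trip_id"), names(trips))
  if (length(missing_cols) > 0) {
    stop("`trips` is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Pure: the result depends only on the argument, no global state
  trips[, .(n_trips = .N), by = hh_id]
}
```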

Quarto

  • Chapters must render in a clean session (no interactive assumptions).
  • Use chunk options cache: true and document-level freeze: auto; avoid writing outside working_dir.
  • No absolute local paths; everything should resolve via settings.
  • Keep callouts for “Settings used” and troubleshooting up to date.
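
A minimal sketch of the execution options above in Quarto YAML (file placement and surrounding keys are assumptions about your project layout):

```yaml
# _quarto.yml (or chapter front matter)
execute:
  cache: true
  freeze: auto
```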

Data, Secrets, and Configuration

  • Do not commit data or client secrets. Keep large or temporary artifacts in working/ (gitignored). Deliverables belong in outputs/ or report/.

  • Secrets (e.g., GitHub PAT) should be stored in user ~/.Renviron or repository Secrets (under GitHub Actions).

  • Configs (configs/<client>_<year>.yaml) should be minimal and documented. Prefer toggles over code forks.
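
As an illustration, the ~/.Renviron entries might look like this (POPS_USER and POPS_PASSWORD match the CI secret names; GITHUB_PAT follows the usual R tooling convention and is an assumption here):

```
# ~/.Renviron -- keep out of version control; restart R after editing
GITHUB_PAT=<your personal access token>
POPS_USER=<database username>
POPS_PASSWORD=<database password>
```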


Configuration and Schema Validation

1. Configuration-driven design

This repository is configuration-first: all project behavior, inputs, and output mappings are defined in YAML files under configs/.
Code should read from settings (via get_settings()) and never hard-code constants.

When adding a new toggle or parameter:

  1. Add it to the appropriate YAML (e.g. configs/example.yaml).
  2. Update the JSON schema (configs/settings.schema.json) with:
     • a title, description, and type,
     • allowed enum values if applicable,
     • defaults if appropriate.
  3. Validate locally (see below) and commit both files together.
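
For illustration, code should branch on a settings value rather than a constant; the toggle name and helper below are hypothetical:

```r
settings = get_settings("configs/example.yaml")

# Read behavior from settings instead of hard-coding it;
# `use_linked_trips` is a hypothetical toggle for illustration
if (isTRUE(settings$use_linked_trips)) {
  trips = link_trips(trips)  # hypothetical helper
}
```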

2. Schema validation before commits

You can validate YAMLs locally in two ways:

A. In Positron (recommended)

  • Install the YAML (Red Hat) extension.
  • Add to user/workspace settings:

    ```json
    "yaml.schemas": {
      "./configs/settings.schema.json": ["configs/*.yaml"]
    }
    ```

This enables instant validation, autocomplete, and hover help.

B. From R

```r
devtools::load_all()
check_settings("configs/<your_config>.yaml")
```

This uses the internal schema validator to check type safety and key names.

3. When editing the schema

  • Keep backward compatibility when possible; avoid breaking older project configs.
  • Add a clear description and default for every new property.
  • If deprecating keys, mark them with "deprecated": true and remove them only in a major release.
  • Document notable changes in NEWS.md under a “Configuration” subsection.
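
A sketch of what a new schema property might look like (property names are hypothetical; JSON Schema supports a boolean `deprecated` keyword):

```json
{
  "properties": {
    "use_linked_trips": {
      "title": "Use linked trips",
      "description": "Whether to build linked trips before weighting.",
      "type": "boolean",
      "default": false
    },
    "link_trips": {
      "description": "Deprecated: superseded by use_linked_trips.",
      "deprecated": true
    }
  }
}
```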

4. CI / Pull request validation

Schema consistency is enforced automatically in CI:

  • All YAML files under configs/ are validated against configs/settings.schema.json.
  • CI fails if any are invalid, missing defaults, or contain unrecognized keys.

Run it locally before pushing:

```r
check_all_settings('configs')
```


Keeping forks healthy (avoid drift)

    git fetch upstream
    git checkout main && git merge --ff-only upstream/main && git push origin main
    git checkout feature/<short-slug> && git rebase main

  • Resolve conflicts in your feature branch, not during the final PR.
  • If upstream behavior changes in a way that affects you, open a discussion or issue early.

Adding a New Project to the Test Database

Each project (for example, massdot_2024) gets its own schema inside the shared test database (hts_weighting_testing). The schema is populated with tables copied directly from the live POPS database.

1. Prepare the YAML Config

Create a new file under configs/examples/, for example:

configs/examples/myproject_2025.yaml

It should define the schema, database, table mappings, and file paths:

dbname: "hts_weighting_testing"
schema: "myproject_hts_2025"
working_dir: "working"
outputs_dir: "outputs"
report_dir: "reports"
hts_table_map:
  household: "household"
  person: "person"
  day: "day"
  trip: "trip"
  value_labels: "value_labels"
  sample_plan: "sample_plan"

Then create a smaller test configuration, e.g.:

configs/examples/myproject_2025_dow.yaml

This may override paths or parameters for a lightweight test run.
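
For example, the override might look like this minimal sketch (keys are illustrative and must conform to configs/settings.schema.json):

```yaml
# configs/examples/myproject_2025_dow.yaml -- hypothetical lightweight override
dbname: "hts_weighting_testing"
schema: "myproject_hts_2025"
working_dir: "working_dow"
outputs_dir: "outputs_dow"
report_dir: "reports_dow"
```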

2. Copy Source Data into the Test Database

Edit and run:

# inst/copy_db_to_test.R

Inside, set:

target_db     = "hts_weighting_testing"
target_schema = "myproject_hts_2025"
source_db     = "myproject"
source_schema = "combined"
settings      = get_test_settings("myproject_2025_dow.yaml")

Then run:

source("inst/copy_db_to_test.R")

This will:

  • Create schema myproject_hts_2025 in hts_weighting_testing
  • Copy all tables listed in settings$hts_table_map
  • Use pg_dump/pg_restore per table
  • Drop the temporary staging schema afterward

3. Verify Tables in the Test Schema

Check the test database (e.g., via psql):

\c hts_weighting_testing
\dn
\dt myproject_hts_2025.*

Confirm that all tables listed in your YAML exist.
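
If you prefer to check from R rather than psql, here is a sketch using DBI/RPostgres (assumes credentials are available, e.g., via ~/.Renviron):

```r
con = DBI::dbConnect(
  RPostgres::Postgres(),
  dbname   = "hts_weighting_testing",
  user     = Sys.getenv("POPS_USER"),
  password = Sys.getenv("POPS_PASSWORD")
)
# List tables in the project schema and compare against settings$hts_table_map
DBI::dbGetQuery(con, "
  SELECT table_name
  FROM information_schema.tables
  WHERE table_schema = 'myproject_hts_2025'
")
DBI::dbDisconnect(con)
```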


Running Tests

There are two kinds of tests in this repo:

  • Unit tests for individual functions (in tests/testthat/)
  • End-to-end tests that run the full weighting scripts for a project (also in tests/testthat/)

These get run automatically in GitHub Actions CI, but you can also run them locally to test and debug.

When to Create a New Test?

In principle, always; in practice, full coverage is often a luxury. Prefer adding a unit test that covers the specific function you’re changing. End-to-end tests are also important to ensure the whole pipeline works, but they are slow and difficult to maintain, so we limit them to key configurations.

The current end-to-end tests cover fundamental configurations for existing projects:

  • With/without linked trips (SWW and MetCouncil)
  • Custom weighting geographies (i.e., “client zones”) (MetCouncil)
  • Day-of-week weighting (MassDOT)
  • Person-level weighting (NYCDOT)

As a general rule, end-to-end tests cover either structurally different configurations (e.g., client zones or multiple states) or different weighting approaches (e.g., day-of-week or person-level weighting).

4. Create a Project-Specific Test File

Add to tests/testthat/, e.g.:

tests/testthat/test_myproject_2025.R

Example:

# testthat::test_dir("tests/testthat", filter = "myproject_2025$")

# Sometimes it can be useful to have the test prepared but skipped in CI runs
testthat::skip("Skipping standard MyProject weighting test")

settings = get_test_settings("myproject_2025.yaml")

test_state = new.env()
test_state$script_test_passed = FALSE

testthat::test_that("Testing scripts for myproject_2025", {
  script_path = file.path(settings$code_root, '000_run_weight_scripts.R')
  testthat::expect_true(file.exists(script_path))
  
  testthat::expect_no_error(
    tryCatch({
      source(script_path, local = TRUE)
      test_state$script_test_passed = TRUE
    }, error = function(e) {
      testthat::fail(paste("Error while sourcing script:", e$message))
    })
  )
})

testthat::test_that("Testing results for myproject_2025", {
  testthat::skip_if_not(test_state$script_test_passed, "Script run failed, skipping result tests.")
  test_results(settings)
})

5. Run the Tests

Run all tests:

devtools::test()

Or only your project’s tests:

testthat::test_dir("tests/testthat", filter = "myproject_2025$")

6. Inspect Output

  • Results are written to outputs/ and reports/
  • Check reports/*_check_counts.csv for intermediate summaries
  • If 000_run_weight_scripts.R fails, review its console output

7. Clean Up (Optional)

To reset the schema, set:

purge_target_schema = TRUE

in inst/copy_db_to_test.R before re-running.


Summary

| Step | Action | File / Command |
|------|--------|----------------|
| 1 | Create settings YAMLs | configs/examples/myproject_2025.yaml |
| 2 | Copy DB tables to test DB | source("inst/copy_db_to_test.R") |
| 3 | Verify schema contents | \dt myproject_hts_2025.* |
| 4 | Add test file | tests/testthat/test_myproject_2025.R |
| 5 | Run tests | devtools::test() |
| 6 | Inspect outputs | reports/, outputs/ |
| 7 | Reset schema if needed | purge_target_schema = TRUE |

Automated Checks (GitHub Actions CI)

All pushes and pull requests to main automatically trigger continuous integration (CI) via .github/workflows/code_check.yaml. This ensures your code is linted, tested, and passes R CMD check before merge.

Workflow Overview

| Job | Purpose | Runner |
|-----|---------|--------|
| linting | Runs lintr::lint_package() for consistent R style. | Ubuntu |
| discover-and-setup | Finds all test files and prepares a job matrix for parallel testing. | Self-hosted |
| Tests | Runs testthat tests per file, in parallel, using the test database. | Self-hosted |
| R-CMD-check | Runs R CMD check to verify package build integrity. | Ubuntu |

How the CI Test Suite Works

  1. Test Discovery. All tests/testthat/test-*.R files are detected and distributed across runners for parallel execution.

  2. Database Setup. Each test file references a settings YAML (e.g. massdot_2024.yaml) that points to the test database (hts_weighting_testing). Schemas (e.g. massdot_hts_2024) must already exist; they are typically created with inst/copy_db_to_test.R.

  3. Secrets and Credentials. Required GitHub Secrets (set under Settings → Secrets and variables → Actions):

    • PAT (GitHub token with repo and read:packages scopes)
    • POPS_USER
    • POPS_PASSWORD

    These are securely masked in logs.
  4. Python Environment Setup. Installs dependencies via uv sync and validates populationsim. Exposes PYTHON_VENV_PATH for R integration.

  5. R Environment Setup. Uses r-lib/actions/setup-r@v2 and setup-renv@v2 to install R 4.4.3 and restore dependencies. Installs geospatial system libraries (libproj-dev, libpq-dev, etc.).

  6. Running Tests. Each test file runs individually:

    testthat::with_reporter(testthat::JunitReporter$new(), {
      testthat::test_file(testfile)
    })

    Failures stop the job (testthat.stop_on_failure = TRUE).

  7. Artifacts and Logs. JUnit XML logs are created per test and shown in GitHub’s “Checks” tab. Output files (reports/, outputs/) are not committed but may be inspected locally.

  8. R CMD Check Validation. Finally, the workflow runs:

    uses: r-lib/actions/check-r-package@v2
    with:
      args: 'c("--no-tests", "--no-manual")'

    This ensures package metadata, dependencies, and documentation are valid.


Before Pushing or Opening a PR

Run the same checks locally:

lintr::lint_package()
devtools::test()
devtools::check()

You can filter tests by project:

testthat::test_dir("tests/testthat", filter = "massdot_2024$")

If all checks pass locally, your CI pipeline should succeed once you push.


When in Doubt—Open a Draft PR

Draft PRs are encouraged:

  • Early visibility of your approach
  • Automatic CI and style checks
  • Feedback before you finalize the API

Include context in your PR: the problem statement, constraints, alternatives considered, and how the change was made reusable.


TL;DR

  • General code → this repo
  • Project-specific code → forks
  • Keep your fork synced with upstream
  • Test locally before pushing
  • CI will lint, test, and check your branch automatically if your schema and YAML are aligned