
Scope & Principles

  • This repo (RSGInc/hts_weighting) is the general toolkit. It should contain reusable functions (R/), generic runner scripts (scripts/), and documentation (Quarto workbook). Avoid hard-coding project quirks.

  • Project-specific code belongs in a project fork. Keep local recodes, one-off joins, and special rules in a fork (or a project branch in your org fork). This keeps the core clean and lets us evolve it safely.

  • Assume you’ll upstream useful changes. Even if you start in a fork, design your changes as if you’ll open a PR back to RSGInc/hts_weighting. That mindset prevents irreconcilable drift if we need to “re-weight” later using updates in the core toolkit.

  • Keep forks current with upstream. Regularly sync your fork’s main with RSGInc/hts_weighting:main to minimize merge conflicts and ensure you’re building on the latest stable base.

  • Open draft PRs early. Don’t wait until your feature is “perfect” to open a PR. Early drafts help surface conflicts, get feedback, and validate your approach.

  • Keep changes small and focused. Large, sweeping changes are hard to review and merge. Break big features into smaller, manageable PRs.

  • Test everything. Add project-specific tests in your fork, and keep the core tests here passing. Use the shared test database (hts_weighting_testing) to validate end-to-end runs.

  • Should test data change? Almost always no. Your results should match existing test outputs unless there’s a good reason to deviate. If your change causes test data to change (e.g., new expected outputs), this is an all-hands discussion, and the proposed new outputs must be vetted and approved.


Preferred Workflow (Fork → Branch → PR)

We use an “Upstream-First” fork-based workflow. The diagram below illustrates how to manage feature branches, keep forks in sync with upstream changes, and handle multiple project forks.

Core Tenets:

  • Keep feature branches small and focused.
```mermaid
flowchart LR
    subgraph Main["*Main 'Upstream' Repo (RSGInc/hts_weighting)*"]
        A("Main")
        Merge1@{ shape: procs, label: "Squash &<br>Merge PR" }
        Update(["Main Updated"])
        Merge3@{ shape: procs, label: "Squash &<br>Merge PR" }
        B(["Main Updated, again"])
    end

    subgraph Proj1["*Project Fork*"]
        B1(["Create Feature Branch<br>feature/my-change"])
        C1(["Develop &<br> Commit"])
        PR1{{"Open PR"}}
        %% D["Push Branch to Fork"]
        Sync1@{ shape: curv-trap, label: "Sync Fork" }
        Sync3@{ shape: curv-trap, label: "Sync Fork" }
    end

    subgraph Proj2["*Other Project Fork*"]
        B2(["Create Feature Branch<br>feature/my-change"])
        C2(["Develop &<br> Commit"])
        Sync2@{ shape: curv-trap, label: "Sync Fork" }
        Merge2@{ shape: procs, label: "Merge &<br>Reconcile" }
        C3(["Continue Developing &<br> Commit"])
        PR2{{"Open PR"}}
        Sync4@{ shape: curv-trap, label: "Sync Fork" }
    end

    A -->|Developer forks| Proj1
    A -->|Developer forks| Proj2
    B1 --> C1 --> PR1

    PR1 -.->|Revise| Merge1
    Merge1 -.->|Review| PR1
    Merge1 --> Update
    Update --> Sync1

    %% Update --> Merge3
    Merge3 --> B
    B --> Sync3
    B --> Sync4

    B2 --> C2
    C2 --> Sync2
    Update --> Sync2
    Sync2 --> Merge2
    Merge2 --> C3
    C3 --> PR2

    PR2 -.->|Revise| Merge3
    Merge3 -.->|Review| PR2

    style Main fill:#007acc,stroke:#004f8a,color:#fff
    style Proj1 fill:#b3e6ff,stroke:#004f8a,color:#004f8a
    style Proj2 fill:#b3e6ff,stroke:#004f8a,color:#004f8a
    style A fill:#b3e0ff,stroke:#004f8a
    style Merge1 fill:#66cc99,stroke:#26734d,color:#fff
    style Update fill:#99ffcc,stroke:#26734d
    style B1 fill:#ffe699,stroke:#b38600
    style C1 fill:#ffcc66,stroke:#b38600
    style B2 fill:#ffe699,stroke:#b38600
    style C2 fill:#ffcc66,stroke:#b38600
    style PR1 fill:#ffd699,stroke:#b38600
    style Sync1 fill:#cce6ff,stroke:#004f8a
    style Sync2 fill:#cce6ff,stroke:#004f8a
```
  1. Fork the repo (keep it in the RSGInc org).

  2. Clone your fork (naming convention: RSGInc/hts_weighting-<client>_<year>):

    git clone git@github.com:RSGInc/<your_fork>.git
    cd <your_fork>
    git remote add upstream git@github.com:RSGInc/hts_weighting.git
  3. Create a feature branch (don’t work on main):

    git checkout -b feature/<short-slug>
  4. Make and commit changes (small, focused commits; clear messages).

  5. Keep your fork current (early and often):

    git fetch upstream
    git checkout main
    git merge --ff-only upstream/main   # or: git rebase upstream/main
    git push origin main
    # update your feature branch:
    git checkout feature/<short-slug>
    git rebase main

    You can also use GitHub’s “Sync fork” button. Before delivery, always sync with upstream main.

  6. Open a draft PR to RSGInc/hts_weighting:main. Draft early to surface merge conflicts, CI results, and design feedback. Use “Compare across forks” if needed.

  7. Iterate → request review → merge. When approved, squash/merge or rebase/merge per repo conventions.


What Belongs Where

  • Core repo (here)

    • Generalizable functions, validations, and helpers
    • Script improvements that work across projects (config-driven)
    • Documentation, examples, QA/QC dashboards
  • Project fork

    • Local recodes, one-off joins, and special rules
    • Client-specific configs and overrides

Coding Guidelines (R & Quarto)

R

  • Prefer pure functions; accept settings and paths as arguments (avoid global state).
  • Validate inputs early; fail with clear, informative messages.
  • Keep the public interface small; document parameters and return types using roxygen.
  • Use data.table syntax consistently.
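
As a minimal sketch of these guidelines together (the function name and columns are hypothetical, not part of the toolkit):

```r
#' Count trips per household (hypothetical example)
#'
#' @param trips A data.table with columns `hh_id` and `trip_id`.
#' @return A data.table with one row per `hh_id` and an `n_trips` count.
count_trips = function(trips) {
  # Validate inputs early, with clear messages
  if (!data.table::is.data.table(trips)) {
    stop("`trips` must be a data.table")
  }
  missing_cols = setdiff(c("hh_id", "trip_id"), names(trips))
  if (length(missing_cols) > 0) {
    stop("`trips` is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Pure: the result depends only on the argument, no global state
  trips[, .(n_trips = .N), by = hh_id]
}
```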

Quarto

  • Chapters must render in a clean session (no interactive assumptions).
  • Use chunk options cache: true and document-level freeze: auto; avoid writing outside working_dir.
  • No absolute local paths; everything should resolve via settings.
  • Keep callouts for “Settings used” and troubleshooting up to date.
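
A minimal sketch of the execution options above in Quarto YAML (file placement and surrounding keys are assumptions about your project layout):

```yaml
# _quarto.yml (or chapter front matter)
execute:
  cache: true
  freeze: auto
```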

Data, Secrets, and Configuration

  • Do not commit data or client secrets. Keep large or temporary artifacts in working/ (gitignored). Deliverables belong in outputs/ or report/.

  • Secrets (e.g., GitHub PAT) should be stored in user ~/.Renviron or repository Secrets (under GitHub Actions).

  • Configs (configs/<client>_<year>.yaml) should be minimal and documented. Prefer toggles over code forks.
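
As an illustration, the ~/.Renviron entries might look like this (POPS_USER and POPS_PASSWORD match the CI secret names; GITHUB_PAT follows the usual R tooling convention and is an assumption here):

```
# ~/.Renviron -- keep out of version control; restart R after editing
GITHUB_PAT=<your personal access token>
POPS_USER=<database username>
POPS_PASSWORD=<database password>
```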


Configuration and Schema Validation

1. Configuration-driven design

This repository is configuration-first: all project behavior, inputs, and output mappings are defined in YAML files under configs/.
Code should read from settings (via get_settings()) and never hard-code constants.

When adding a new toggle or parameter:

  1. Add it to the appropriate YAML (e.g. configs/example.yaml).
  2. Update the JSON schema (configs/settings.schema.json) with:
     • a title, description, and type,
     • allowed enum values if applicable,
     • defaults if appropriate.
  3. Validate locally (see below) and commit both files together.
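
For illustration, code should branch on a settings value rather than a constant; the toggle name and helper below are hypothetical:

```r
settings = get_settings("configs/example.yaml")

# Read behavior from settings instead of hard-coding it;
# `use_linked_trips` is a hypothetical toggle for illustration
if (isTRUE(settings$use_linked_trips)) {
  trips = link_trips(trips)  # hypothetical helper
}
```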

2. Schema validation before commits

You can validate YAMLs locally in two ways:

A. In Positron (recommended)

  • Install the YAML (Red Hat) extension.
  • Add to user/workspace settings:

    ```json
    "yaml.schemas": {
      "./configs/settings.schema.json": ["configs/*.yaml"]
    }
    ```

This enables instant validation, autocomplete, and hover help.

B. From R

```r
devtools::load_all()
check_settings("configs/<your_config>.yaml")
```

This uses the internal schema validator to check type safety and key names.

3. When editing the schema

  • Keep backward compatibility when possible; avoid breaking older project configs.
  • Add a clear description and default for every new property.
  • If deprecating keys, mark them with "deprecated": true and remove them only in a major release.
  • Document notable changes in NEWS.md under a “Configuration” subsection.
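
A sketch of what a new schema property might look like (property names are hypothetical; JSON Schema supports a boolean `deprecated` keyword):

```json
{
  "properties": {
    "use_linked_trips": {
      "title": "Use linked trips",
      "description": "Whether to build linked trips before weighting.",
      "type": "boolean",
      "default": false
    },
    "link_trips": {
      "description": "Deprecated: superseded by use_linked_trips.",
      "deprecated": true
    }
  }
}
```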

4. CI / Pull request validation

Schema consistency is enforced automatically in CI:

  • All YAML files under configs/ are validated against configs/settings.schema.json.
  • CI fails if any are invalid, missing defaults, or contain unrecognized keys.

Run it locally before pushing:

```r
check_all_settings('configs')
```


Keeping forks healthy (avoid drift)

    git fetch upstream
    git checkout main && git merge --ff-only upstream/main && git push origin main
    git checkout feature/<short-slug> && git rebase main

  • Resolve conflicts in your feature branch, not during the final PR.
  • If upstream behavior changes in a way that affects you, open a discussion or issue early.

Adding a New Project to the Test Database

Each project (for example, massdot_2024) gets its own schema inside the shared test database (hts_weighting_testing). The schema is populated with tables copied directly from the live POPS database.

1. Prepare the YAML Config

Create a new file under configs/examples/, for example:

configs/examples/myproject_2025.yaml

It should define the schema, database, table mappings, and file paths:

dbname: "hts_weighting_testing"
schema: "myproject_hts_2025"
working_dir: "working"
outputs_dir: "outputs"
report_dir: "reports"
hts_table_map:
  household: "household"
  person: "person"
  day: "day"
  trip: "trip"
  value_labels: "value_labels"
  sample_plan: "sample_plan"

Then create a smaller test configuration, e.g.:

configs/examples/myproject_2025_dow.yaml

This may override paths or parameters for a lightweight test run.
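
For example, the override might look like this minimal sketch (keys are illustrative and must conform to configs/settings.schema.json):

```yaml
# configs/examples/myproject_2025_dow.yaml -- hypothetical lightweight override
dbname: "hts_weighting_testing"
schema: "myproject_hts_2025"
working_dir: "working_dow"
outputs_dir: "outputs_dow"
report_dir: "reports_dow"
```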

2. Copy Source Data into the Test Database

Edit and run:

# inst/copy_db_to_test.R

Inside, set:

target_db     = "hts_weighting_testing"
target_schema = "myproject_hts_2025"
source_db     = "myproject"
source_schema = "combined"
settings      = get_test_settings("myproject_2025_dow.yaml")

Then run:

source("inst/copy_db_to_test.R")

This will:

  • Create schema myproject_hts_2025 in hts_weighting_testing
  • Copy all tables listed in settings$hts_table_map
  • Use pg_dump/pg_restore per table
  • Drop the temporary staging schema afterward

3. Verify Tables in the Test Schema

Check the test database (e.g., via psql):

\c hts_weighting_testing
\dn
\dt myproject_hts_2025.*

Confirm that all tables listed in your YAML exist.
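
If you prefer to check from R rather than psql, here is a sketch using DBI/RPostgres (assumes credentials are available, e.g., via ~/.Renviron):

```r
con = DBI::dbConnect(
  RPostgres::Postgres(),
  dbname   = "hts_weighting_testing",
  user     = Sys.getenv("POPS_USER"),
  password = Sys.getenv("POPS_PASSWORD")
)
# List tables in the project schema and compare against settings$hts_table_map
DBI::dbGetQuery(con, "
  SELECT table_name
  FROM information_schema.tables
  WHERE table_schema = 'myproject_hts_2025'
")
DBI::dbDisconnect(con)
```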


Running Tests

There are two kinds of tests in this repo:

  • Unit tests for individual functions (in tests/testthat/)
  • End-to-end tests that run the full weighting scripts for a project (also in tests/testthat/)

These get run automatically in GitHub Actions CI, but you can also run them locally to test and debug.

When to Create a New Test?

In principle, always; in practice, full coverage is often a luxury. Prefer adding a unit test that covers the specific function you’re changing. End-to-end tests are also important to ensure the whole pipeline works, but they are slow and difficult to maintain, so we limit them to key configurations.

The current end-to-end tests cover fundamental configurations for existing projects:

  • With/without linked trips (SWW and MetCouncil)
  • Custom weighting geographies (i.e., “client zones”) (MetCouncil)
  • Day-of-week weighting (MassDOT)
  • Person-level weighting (NYCDOT)

As a general rule, end-to-end tests cover either structurally different configurations (e.g., client zones or multiple states) or different weighting approaches (e.g., day-of-week or person-level weighting).

4. Create a Project-Specific Test File

Add to tests/testthat/, e.g.:

tests/testthat/test_myproject_2025.R

Example:

# testthat::test_dir("tests/testthat", filter = "myproject_2025$")

# Sometimes it can be useful to have the test prepared but skipped in CI runs
testthat::skip("Skipping standard MyProject weighting test")

settings = get_test_settings("myproject_2025.yaml")

test_state = new.env()
test_state$script_test_passed = FALSE

testthat::test_that("Testing scripts for myproject_2025", {
  script_path = file.path(settings$code_root, '000_run_weight_scripts.R')
  testthat::expect_true(file.exists(script_path))
  
  testthat::expect_no_error(
    tryCatch({
      source(script_path, local = TRUE)
      test_state$script_test_passed = TRUE
    }, error = function(e) {
      testthat::fail(paste("Error while sourcing script:", e$message))
    })
  )
})

testthat::test_that("Testing results for myproject_2025", {
  testthat::skip_if_not(test_state$script_test_passed, "Script run failed, skipping result tests.")
  test_results(settings)
})

5. Run the Tests

Run all tests:

devtools::test()

Or only your project’s tests:

testthat::test_dir("tests/testthat", filter = "myproject_2025$")

6. Inspect Output

  • Results are written to outputs/ and reports/
  • Check reports/*_check_counts.csv for intermediate summaries
  • If 000_run_weight_scripts.R fails, review its console output

7. Clean Up (Optional)

To reset the schema, set:

purge_target_schema = TRUE

in inst/copy_db_to_test.R before re-running.


Summary

| Step | Action | File / Command |
|------|--------|----------------|
| 1 | Create settings YAMLs | configs/examples/myproject_2025.yaml |
| 2 | Copy DB tables to test DB | source("inst/copy_db_to_test.R") |
| 3 | Verify schema contents | \dt myproject_hts_2025.* |
| 4 | Add test file | tests/testthat/test_myproject_2025.R |
| 5 | Run tests | devtools::test() |
| 6 | Inspect outputs | reports/, outputs/ |
| 7 | Reset schema if needed | purge_target_schema = TRUE |

Automated Checks (GitHub Actions CI)

All pushes and pull requests to main automatically trigger continuous integration (CI) via .github/workflows/code_check.yaml. This ensures your code is linted, tested, and passes R CMD check before merge.

Workflow Overview

| Job | Purpose | Runner |
|-----|---------|--------|
| linting | Runs lintr::lint_package() for consistent R style. | Ubuntu |
| discover-and-setup | Finds all test files and prepares a job matrix for parallel testing. | Self-hosted |
| Tests | Runs testthat tests per file, in parallel, using the test database. | Self-hosted |
| R-CMD-check | Runs R CMD check to verify package build integrity. | Ubuntu |

How the CI Test Suite Works

  1. Test Discovery. All tests/testthat/test-*.R files are detected and distributed across runners for parallel execution.

  2. Database Setup. Each test file references a settings YAML (e.g. massdot_2024.yaml) that points to the test database (hts_weighting_testing). Schemas (e.g. massdot_hts_2024) must already exist; they are typically created with inst/copy_db_to_test.R.

  3. Secrets and Credentials. Required GitHub Secrets (set under Settings → Secrets and variables → Actions):

    • PAT (GitHub token with repo and read:packages scopes)
    • POPS_USER
    • POPS_PASSWORD

    These are securely masked in logs.
  4. Python Environment Setup. Installs dependencies via uv sync and validates populationsim. Exposes PYTHON_VENV_PATH for R integration.

  5. R Environment Setup. Uses r-lib/actions/setup-r@v2 and setup-renv@v2 to install R 4.4.3 and restore dependencies. Installs geospatial system libraries (libproj-dev, libpq-dev, etc.).

  6. Running Tests. Each test file runs individually:

    testthat::with_reporter(testthat::JunitReporter$new(), {
      testthat::test_file(testfile)
    })

    Failures stop the job (testthat.stop_on_failure = TRUE).

  7. Artifacts and Logs. JUnit XML logs are created per test and shown in GitHub’s “Checks” tab. Output files (reports/, outputs/) are not committed but may be inspected locally.

  8. R CMD Check Validation. Finally, the workflow runs:

    uses: r-lib/actions/check-r-package@v2
    with:
      args: 'c("--no-tests", "--no-manual")'

    This ensures package metadata, dependencies, and documentation are valid.


Before Pushing or Opening a PR

Run the same checks locally:

lintr::lint_package()
devtools::test()
devtools::check()

You can filter tests by project:

testthat::test_dir("tests/testthat", filter = "massdot_2024$")

If all checks pass locally, your CI pipeline should succeed once you push.


When in Doubt—Open a Draft PR

Draft PRs are encouraged:

  • Early visibility of your approach
  • Automatic CI and style checks
  • Feedback before you finalize the API

Include context in your PR: the problem statement, constraints, alternatives considered, and how the change was made reusable.


TL;DR

  • General code → this repo
  • Project-specific code → forks
  • Keep your fork synced with upstream
  • Test locally before pushing
  • CI will lint, test, and check your branch automatically if your schema and YAML are aligned