Updating EJScreen Datasets Annually (via the Pipeline)
Source: vignettes/dev-update-ejscreen-datasets-yearly.Rmd
This document describes the annual staged pipeline for updating the
blockgroup-level EJScreen/EJAM datasets, especially the objects
historically called blockgroupstats, bgej,
usastats, and statestats.
For general dataset maintenance outside this pipeline, such as FRS-related tables, NAICS/SIC tables, block-level files, Arrow releases, and package-data publication mechanics, see Updating and Managing the Datasets Used by EJAM.
Scope
The pipeline is designed to make each major data step explicit,
saved, and rerunnable. The main entry point is
calc_ejscreen_dataset(). The recommended annual runner
is:
source("data-raw/run_ejscreen_pipeline_annual.R")
That wrapper builds a validated annual pipeline config and then
delegates to the package pipeline runner,
EJAM:::run_ejscreen_pipeline(). The repository also
provides smaller runner wrappers for common rerun modes:
# Recheck saved outputs and prior-version comparisons without rebuilding stages.
source("data-raw/run_ejscreen_pipeline_validation_only.R")
# Recreate EJScreen-facing export files from saved pipeline stages.
source("data-raw/run_ejscreen_pipeline_exports_only.R")
# Run a release-oriented recipe after outputs and settings have been reviewed.
source("data-raw/run_ejscreen_pipeline_release.R")
These wrappers build a validated pipeline config first and then call
the same underlying package runner. They are meant for review and
release-maintenance passes after the main annual run has already created
the needed intermediate stages. The long-standing compatibility script,
data-raw/run_ejscreen_dataset_pipeline.R, remains available
for older interactive workflows that set environment variables and then
call source(). The release wrapper also keeps package-data
replacement opt-in; set EJAM_REPLACE_PACKAGE_DATA = "TRUE"
explicitly only when reviewed outputs should replace package
.rda objects.
The runner writes pipeline checkpoints, by default as CSV files. It
can also write secondary formats such as .rda files for the
same table stages when EJAM_STAGE_FORMATS includes those
formats. It does not, by itself, replace every installed package dataset
in data/*.rda. Replacing package data objects with EJAM
metadata helpers is a separate release step after the pipeline outputs
have been reviewed. bgej is the main exception: the
accepted annual bgej table should be republished as
bgej.arrow in the ejamdata release tag
recorded in DESCRIPTION as ejamdata_required_tag, rather
than saved as data/bgej.rda.
Pipeline Stages
The annual workflow creates or reads these stages:
- bg_acs_raw: raw ACS table-based summary file data downloaded from the Census Bureau. This is saved before EJAM renaming or formula calculations. The default raw ACS storage is a folder of per-table files, with a manifest.
- bg_islandareas_raw: optional raw 2020 Island Areas Census DHC tables for American Samoa, Guam, the Commonwealth of the Northern Mariana Islands, and the U.S. Virgin Islands. The default EJScreen-compatible release path does not need this stage, because AS/GU/MP/VI row IDs, area fields, and available environmental fields come from the archived EPA EJScreen ACS2022 reference named by EJAM_ISLANDAREAS_REFERENCE_PATH. Puerto Rico is not part of this stage because it is already covered by ACS.
- bg_islandareas_demographics: optional transformed 2020 Island Areas Census DHC demographics. This is saved as a separate checkpoint for review and possible supplemental use. The Island Areas source is not ACS 5-year data, and the legacy EPA/EJScreen file with AS/GU/MP/VI rows had no usable ACS demographic values for those rows. Therefore the default EJScreen-compatible pipeline does not use these DHC demographics in bg_acsdata or downstream outputs. Set EJAM_USE_ISLANDAREAS_DEMOGRAPHICS = "TRUE" only when intentionally creating a mixed-source supplemental dataset.
- bg_acsdata: ACS-derived blockgroup indicators calculated from bg_acs_raw, including demographic indicators and pctpre1960. When the Island Areas stage is enabled, AS/GU/MP/VI rows are appended from the archived EPA reference with no DHC-derived demographic values by default. The separate bg_islandareas_demographics file keeps the available Island Areas Census values available for review without changing the EJScreen-compatible demographic calculations.
- bg_envirodata: blockgroup environmental indicators used for EJ indexes. This normally comes from a separate environmental-data workflow. For draft builds, it can be provisionally reused from the current package data.
- bg_extra_indicators: other blockgroup indicators that are not ACS and not the main EJ environmental indicators, such as health, life expectancy, and related context variables. These can also be provisionally reused from the current package data.
- bg_geodata: Census/TIGER block group geography attributes. This stage stores bgfips, square-meter arealand and areawater, optional internal point fields intptlat and intptlon, and a compatibility-only area column. The pipeline uses Census TIGER/Line block group shapefiles by default because their ALAND and AWATER values best match the legacy EJScreen tables; Census TIGERweb remains available as a lighter fallback source. TIGER/Line zip files are cached locally using EJAM_TIGER_BG_CACHE_DIR, or the EJAM user cache folder when that variable is unset, so later reruns can reuse the state files. By default, the pipeline requests geography only for blockgroups found in the ACS tabulated rows for that vintage.
- blockgroupstats: combined blockgroup table with ACS indicators, environmental indicators, extra indicators, and geography fields.
- usastats_acs, statestats_acs, usastats_envirodata, and statestats_envirodata: percentile lookup tables for ACS and environmental inputs.
- bgej: blockgroup EJ index values calculated from demographic indexes and environmental percentiles.
- usastats_ej and statestats_ej: percentile lookup tables for EJ index columns.
- usastats and statestats: combined lookup tables used by EJAM.
- ejscreen_export: EJScreen-ready export combining blockgroupstats and bgej, applying EJScreen-style names from map_headernames, and adding map helper fields where possible.
- ejscreen_export_statepct: EJScreen-ready export matching EPA’s StatePct convention, where state raw scores and state percentiles are written into the generic EJScreen field names.
- ejscreen_us_pctile_lookup and ejscreen_state_pctile_lookup: optional EJScreen-style percentile lookup CSVs created from usastats and statestats. These use EJScreen field names and add std rows to match EPA lookup tables such as EJScreen_2024_BG_National_Lookup.csv and EJScreen_2024_BG_State_Lookup.csv. The annual pipeline does not create these by default because the live EJScreen app maps from the blockgroup exports and reports are served through EJAM-API/EJAM.
- ejscreen_dataset_creator_input: optional smaller input table for EPA’s Python ejscreen-dataset-creator-2.3 workflow. Enable it with EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT = "TRUE".
The runner also writes pipeline_validation_summary.csv
and pipeline_run_manifest.csv. If prior-version validation
is requested, it writes prior_validation_summary.csv and
per-stage prior-validation details. If the EJScreen export is requested,
it writes ejscreen_export_schema_report.csv. For ACS
2018-2022 replication runs, the runner can also compare
ejscreen_export to the EPA-style
EJSCREEN_2024_BG_with_AS_CNMI_GU_VI.csv reference file and
write prior_validation_ejscreen_export_vs_epa_2024_acs2022*
reports.
The run manifest records the package version, Git branch and SHA, ACS vintage, pipeline location, primary stage format, selected run settings, and whether provisional environmental or extra-indicator inputs were reused.
Related Data Update Groups
EJAM data objects do not all change on the same schedule. For annual work, use these update groups to decide what must be rebuilt, validated, or only checked.
1. Facility Data Updates include FRS-related and facility-code datasets such as frs, frs_by_programid, frs_by_naics, frs_by_sic, frs_by_mact, frsprogramcodes, epa_programs, NAICS, SIC, and MACT lookup tables. These can be refreshed when EPA facility data are updated, independently of the annual EJScreen-style pipeline.
2. EJSCREEN Annual Data Update includes the main pipeline stages: bg_acs_raw, bg_acsdata, bg_envirodata, bg_extra_indicators, bg_geodata, blockgroupstats, bgej, usastats, statestats, and ejscreen_export. It also includes supporting objects that may need review or regeneration, such as map_headernames, names_*, namez, tables_ejscreen_acs, formulas_ejscreen_acs, formulas_ejscreen_acs_disability, formulas_ejscreen_demog_index, avg.in.us, high_pctiles_tied_with_min, and testoutput*. In practice, dataload_dynamic() uses the ejamdata_required_tag field in DESCRIPTION to find the required ejamdata release tag, such as v3, for Arrow assets. This tag usually matches the EJAM package version for an annual data release, but can differ for patch releases. bgej.arrow is part of this annual release bundle and is additionally checked against the installed blockgroupstats blockgroup universe.
3. Blockgroup Geography Updates include blockgroup-keyed geography files and crosswalks such as bgid2fips, blockwts, bgpts, and bg_cenpop2020. The annual pipeline always checks whether these are still compatible with the current blockgroup universe, but they only need to be regenerated when blockgroup FIPS, EJAM bgid, internal points, or blockgroup-to-block relationships change. bg_cenpop2020 requires special care because it is tied to Census 2020 geography. Related state/place geography objects such as states_shapefile, stateinfo2, and censusplaces should also be checked when their FIPS or boundaries change. For the v3 Island Area decision and live EJSCREEN layer availability notes, see Island Areas in EJAM v3.
4. Block Geography Updates include block-level geometry/index files such as blockpoints, quaddata, and blockid2fips. These change only when block FIPS or block internal-point geography changes.
The runner writes dynamic_geography_arrow_report.csv to
summarize whether category 3 and 4 Arrow files cover the current
blockgroupstats blockgroups and whether block-level IDs
line up across blockwts, blockpoints,
quaddata, and blockid2fips.
ACS Geography Universe
For a given ACS 5-year release, use the Census geography vintage
associated with the ACS end year. For example, ACS 2020-2024 should use
2024 Census/TIGER or TIGERweb geography attributes. The release pipeline
prefers the downloadable Census TIGER/Line block group shapefiles for
arealand and areawater, with TIGERweb as a
fallback. Those geography sources can occasionally include block groups
that are valid geography features but that are not present in the ACS
tabulated summary-file tables used by EJAM.
That mismatch is unusual but real. In the draft ACS 2020-2024 build,
the Census/TIGER block group geography source included 39 Suffolk
County, New York block groups that were not present in the relevant ACS
block group or tract tables downloaded for the pipeline. Including those
geography-only rows would expand blockgroupstats beyond the
ACS data universe and create rows with no ACS-derived indicators.
For that reason, the default pipeline setting is
EJAM_BLOCKGROUP_UNIVERSE_SOURCE = "acs". Under that
setting, bg_acsdata defines the final blockgroup universe,
bg_geodata is downloaded or subset to those
bgfips values, and environmental or extra-indicator inputs
cannot add extra rows to the final blockgroupstats. The
alternative setting
EJAM_BLOCKGROUP_UNIVERSE_SOURCE = "union" is kept only for
diagnostic or special-purpose runs where the maintainer intentionally
wants to retain blockgroups found only in other inputs.
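
As a rough illustration of the two universe settings, the sketch below uses small made-up tables. The helper name resolve_universe() and every FIPS value are hypothetical and are not part of EJAM; the real runner applies the setting inside the pipeline stages.

```r
# Toy inputs: all bgfips values are invented for illustration only.
bg_acsdata <- data.frame(bgfips = c("010010201001", "010010201002"))
bg_geodata <- data.frame(
  bgfips   = c("010010201001", "010010201002", "361039999001"),
  arealand = c(100, 200, 300)
)
bg_envirodata <- data.frame(bgfips = c("010010201001", "720010001001"))

# Hypothetical helper: choose the final blockgroup universe.
resolve_universe <- function(source = c("acs", "union")) {
  source <- match.arg(source)
  if (source == "acs") {
    # Only blockgroups tabulated by ACS survive.
    return(sort(unique(bg_acsdata$bgfips)))
  }
  # "union" also keeps blockgroups found only in the environmental input.
  sort(unique(c(bg_acsdata$bgfips, bg_envirodata$bgfips)))
}

universe_acs   <- resolve_universe("acs")
universe_union <- resolve_universe("union")

# Geography is subset to the chosen universe, dropping geography-only rows.
bg_geodata_acs <- bg_geodata[bg_geodata$bgfips %in% universe_acs, ]
```

Under "acs", the geography-only row is dropped and the environmental-only row never enters the universe, which mirrors the reason given above for making "acs" the default.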
The pipeline uses the packaged formulas_ejscreen_acs
object for ACS-derived indicator formulas and sorts formula rows by
dependency before calculating them. The old
data-raw/archived_datacreate_formulas_ejscreen_acs_notes.R
file is reference material only. It is not the current formula rebuild
workflow.
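
Sorting formulas by dependency can be pictured with a small topological sort. This is a minimal sketch under stated assumptions: the formula strings, the indicator names, and the helper sort_by_dependency() are invented for illustration, and the actual layout of formulas_ejscreen_acs and the package's sorting code may differ.

```r
# Toy formulas as strings; names and expressions are illustrative only.
formulas <- c(
  pctlowinc   = "lowinc / povknownratio",
  demog_index = "(pctlowinc + pctmin) / 2",
  pctmin      = "mins / pop"
)

# Variables each formula reads, found by parsing its expression.
deps <- lapply(formulas, function(f) all.vars(parse(text = f)))

# Repeatedly emit formulas whose inputs are raw columns or already emitted.
sort_by_dependency <- function(deps) {
  ordered <- character(0)
  remaining <- names(deps)
  while (length(remaining) > 0) {
    ready <- Filter(function(nm) {
      needs <- intersect(deps[[nm]], names(deps))  # ignore raw input columns
      all(needs %in% ordered)
    }, remaining)
    if (length(ready) == 0) stop("circular dependency among formulas")
    ordered <- c(ordered, ready)
    remaining <- setdiff(remaining, ready)
  }
  ordered
}

calc_order <- sort_by_dependency(deps)
```

Here demog_index is listed second but calculated last, because it reads two indicators that other formulas create.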
Storage
The default local pipeline folder is:
data-raw/pipeline_outputs/ejscreen_acs_2024
data-raw/pipeline_outputs/ is ignored by Git because
checkpoint files can be large. The repository also has build-ignore
rules for pipeline outputs and Arrow data files, so release artifacts
should normally be stored outside the package source tree.
The pipeline can also use AWS S3. S3 support uses the AWS CLI, so
aws must be installed and configured before running an
S3-backed pipeline. By default the runner uses
EJAM_STAGE_FORMAT = "csv" for loading/review and
EJAM_STAGE_FORMATS = "csv,rda" for saving major table
stages in both formats. Small summary and manifest files are written as
CSV.
Sys.setenv(
AWS_PROFILE = "ejam",
AWS_REGION = "us-east-1",
EJAM_PIPELINE_DIR = "s3://pedp-data-preserved/ejscreen-data-processing/pipeline/ejscreen_acs_2024",
EJAM_PIPELINE_STORAGE = "s3"
)
For local testing, use a local directory:
Sys.setenv(
EJAM_PIPELINE_DIR = file.path(
getwd(),
"data-raw",
"pipeline_outputs",
"ejscreen_acs_2024"
),
EJAM_PIPELINE_STORAGE = "local"
)
Key Settings
The preferred interface is a validated config object, usually built
by one of the recipe helpers such as
pipeline_config_annual() or
pipeline_config_release(). Environment variables remain
supported for RStudio/source-based workflows and GitHub Actions. The
main settings are:
| Variable | Purpose |
|---|---|
| EJAM_PIPELINE_YR | ACS 5-year end year, such as "2024" for ACS 2020-2024. |
| EJAM_PIPELINE_DIR | Local folder or s3://... pipeline location. |
| EJAM_PIPELINE_STORAGE | "auto", "local", or "s3". |
| EJAM_STAGE_FORMAT | Primary stage format used for loading/validation, usually "csv". |
| EJAM_STAGE_FORMATS | Comma-separated formats to save for major table stages, usually "csv,rda". |
| EJAM_BLOCKGROUP_UNIVERSE_SOURCE | "acs" uses the ACS tabulated blockgroup rows as the final universe. "union" also keeps rows found only in environmental or extra-indicator inputs. |
| EJAM_TRACT_WEIGHT_SOURCE | "decennial2020" uses 2020 Decennial Census population weights to apportion tract-only ACS tables to blockgroups, matching legacy EJSCREEN. "acs" uses same-vintage ACS blockgroup population weights. |
| EJAM_DECENNIAL_BGWTS_CACHE | Optional local .rds cache path for 2020 Decennial blockgroup-to-tract weights. If unset, EJAM uses a user cache folder. |
| EJAM_REFRESH_DECENNIAL_BGWTS | "TRUE" to ignore and overwrite the cached decennial blockgroup weights. |
| EJAM_TIGER_BG_CACHE_DIR | Optional local folder for downloaded Census TIGER/Line block group zip files. If unset, EJAM uses a durable user cache folder. |
| AWS_PROFILE, AWS_REGION | Used by the AWS CLI for S3-backed runs. |
| CENSUS_API_KEY | Used by ACS/Census download helpers where needed. |
| EJAM_FORCE_ACS | "TRUE" to redownload raw ACS and rebuild ACS stages. |
| EJAM_FORCE_BG_ACSDATA | "TRUE" to rebuild bg_acsdata from saved raw ACS. |
| EJAM_FORCE_BG_GEODATA | "TRUE" to redownload/rebuild the Census/TIGER bg_geodata stage. |
| EJAM_ACS_DOWNLOAD_TIMEOUT | Download timeout in seconds. Useful for large ACS tables. |
| EJAM_ACS_DOWNLOAD_RETRIES | Number of retry attempts for ACS downloads. |
| EJAM_INCLUDE_ISLANDAREAS_DATA | "TRUE" to append AS/GU/MP/VI rows. The annual/release runner defaults to "TRUE" unless explicitly set otherwise. |
| EJAM_ISLANDAREAS_REFERENCE_PATH | Archived EPA EJScreen ACS2022 reference CSV used for Island Areas row IDs, area fields, and available environmental fields. |
| EJAM_USE_ISLANDAREAS_DEMOGRAPHICS | "TRUE" only for an intentional mixed-source supplemental dataset that uses 2020 Island Areas Census DHC demographics in bg_acsdata. The default EJScreen-compatible path is "FALSE". |
| EJAM_USE_PROVISIONAL_BG_ENVIRODATA | "FALSE" to require a supplied bg_envirodata stage file, such as bg_envirodata.csv under the default stage format. |
| EJAM_INCLUDE_EJSCREEN_EXPORT | "TRUE" to create the ejscreen_export stage, such as ejscreen_export.csv under the default stage format. |
| EJAM_INCLUDE_EJSCREEN_EXPORT_STATEPCT | "TRUE" to create the ejscreen_export_statepct stage. |
| EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS | "TRUE" only when intentionally refreshing EJScreen-style lookup CSVs. These are not created by default. |
| EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT | "TRUE" to also create the smaller ejscreen_dataset_creator_input stage for EPA’s Python dataset-creator workflow. |
| EJAM_VALIDATE_VS_PRIOR | "TRUE" to write prior-version comparison files. |
| EJAM_PRIOR_PIPELINE_DIR | Prior pipeline folder/S3 prefix to compare against. |
| EJAM_PRIOR_PACKAGE_REF | Optional Git ref/tag/SHA for prior package data comparison. |
| EJAM_EJSCREEN_EXPORT_REFERENCE_PATH | Optional EPA-style EJScreen export CSV to compare against ejscreen_export. For S3-backed 2022 runs, the runner defaults to the preserved EJSCREEN_2024_BG_with_AS_CNMI_GU_VI.csv file, which uses ACS 2018-2022 data despite its 2024 filename. |
| EJAM_VALIDATE_EJSCREEN_EXPORT_REFERENCE | "TRUE" to write prior_validation_ejscreen_export_vs_epa_2024_acs2022.csv, *_summary.csv, and *_summary.txt when a reference export path is available. |
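
Environment variables always arrive in R as strings, so flags such as "TRUE" and lists such as "csv,rda" have to be parsed before use. The sketch below shows one way to do that. env_flag() and env_formats() are hypothetical helpers, not EJAM functions, and the set of strings treated as true is an assumption rather than the package's documented rule.

```r
# Hypothetical helpers, not EJAM functions: read pipeline settings from
# environment variables as typed R values.
env_flag <- function(name, default = FALSE) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) return(default)
  # Assumption: these spellings count as true; anything else is false.
  toupper(trimws(value)) %in% c("TRUE", "T", "1", "YES")
}

env_formats <- function(name = "EJAM_STAGE_FORMATS", default = "csv,rda") {
  value <- Sys.getenv(name, unset = default)
  # Split on commas, then trim whitespace and drop empty pieces.
  formats <- tolower(trimws(strsplit(value, ",", fixed = TRUE)[[1]]))
  formats[nzchar(formats)]
}

Sys.setenv(EJAM_FORCE_ACS = "TRUE", EJAM_STAGE_FORMATS = "csv, rda")
force_acs     <- env_flag("EJAM_FORCE_ACS")
stage_formats <- env_formats()
```

An unset or empty flag falls back to its default, so a typo in a variable name silently leaves the default in place. That is one reason to review the resolved settings before a long run.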
To see what the runner will use:
Sys.getenv(c(
"EJAM_PIPELINE_YR",
"EJAM_PIPELINE_DIR",
"EJAM_PIPELINE_STORAGE",
"EJAM_STAGE_FORMAT",
"EJAM_STAGE_FORMATS",
"EJAM_BLOCKGROUP_UNIVERSE_SOURCE",
"EJAM_TRACT_WEIGHT_SOURCE",
"EJAM_DECENNIAL_BGWTS_CACHE",
"EJAM_REFRESH_DECENNIAL_BGWTS",
"EJAM_TIGER_BG_CACHE_DIR",
"AWS_PROFILE",
"AWS_REGION",
# "CENSUS_API_KEY",
"EJAM_FORCE_ACS",
"EJAM_FORCE_BG_ACSDATA",
"EJAM_FORCE_BG_GEODATA",
"EJAM_ACS_DOWNLOAD_TIMEOUT",
"EJAM_ACS_DOWNLOAD_RETRIES",
"EJAM_INCLUDE_ISLANDAREAS_DATA",
"EJAM_USE_ISLANDAREAS_DEMOGRAPHICS",
"EJAM_USE_PROVISIONAL_BG_ENVIRODATA",
"EJAM_INCLUDE_EJSCREEN_EXPORT",
"EJAM_INCLUDE_EJSCREEN_EXPORT_STATEPCT",
"EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS",
"EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT",
"EJAM_VALIDATE_VS_PRIOR",
"EJAM_PRIOR_PIPELINE_YR",
"EJAM_PRIOR_PIPELINE_DIR",
"EJAM_PRIOR_PACKAGE_REF",
"EJAM_EJSCREEN_EXPORT_REFERENCE_PATH",
"EJAM_VALIDATE_EJSCREEN_EXPORT_REFERENCE"
))
For ACS 2022 and later, Connecticut ACS tract FIPS use
planning-region county equivalents while 2020 Decennial blockgroup FIPS
use the older county equivalents. The pipeline detects that no
Connecticut tract FIPS overlap in the decennial weight table and uses
same-vintage ACS blockgroup population weights for Connecticut only. In
normal package use, the decennial weights are created from packaged
bg_cenpop2020 data. If that data is unavailable, EJAM falls
back to tidycensus::get_decennial() and caches the
downloaded weights locally.
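
The Connecticut check described above amounts to asking, state by state, whether any ACS tract FIPS appears in the decennial weight table. A minimal sketch with invented FIPS values follows. The helper name states_needing_acs_weights() and the column name tractfips are assumptions for illustration, not the package's internal names.

```r
# Toy tract FIPS, invented to mimic the county-equivalent change.
acs_tracts <- c("09110400100", "09110400200", "36103123456")
decennial_weights <- data.frame(
  tractfips = c("09001400100", "09001400200", "36103123456"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: states whose ACS tracts never match the decennial table.
states_needing_acs_weights <- function(acs_tracts, decennial_tracts) {
  state <- substr(acs_tracts, 1, 2)
  matched <- acs_tracts %in% decennial_tracts
  # TRUE for a state if at least one of its tracts overlaps.
  overlap_by_state <- tapply(matched, state, any)
  names(overlap_by_state)[!overlap_by_state]
}

fallback_states <- states_needing_acs_weights(
  acs_tracts,
  decennial_weights$tractfips
)
```

In this toy input only state "09" has zero overlap, so only that state would switch to same-vintage ACS weights while the other state keeps its decennial weights.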
New or renamed indicators
map_headernames
If any indicators are new relative to the prior version of the
datasets and package, map_headernames may need metadata
rows for those new indicators, including the variable name
(rname), longname, calculation type,
calculation weight, rounding information, EJScreen export names, and
varlist groups such as names_e and
names_d. The editable source for release work is
data-raw/map_headernames.csv. Edit that CSV directly, then
source data-raw/datacreate_map_headernames.R to validate
and save data/map_headernames.rda. Older spreadsheet
workflows are obsolete and should not be used to regenerate this
object.
Run a Fresh ACS Update
Start from a clean branch. For ACS 2020-2024 using local checkpoints:
yr <- 2024
cfg <- EJAM:::pipeline_config_annual(
yr = yr,
pipeline_dir = file.path(
getwd(),
"data-raw",
"pipeline_outputs",
paste0("ejscreen_acs_", yr)
),
pipeline_storage = "local",
force_acs = TRUE,
force_bg_acsdata = TRUE,
force_bg_geodata = TRUE
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
For the S3-backed pipeline:
yr <- 2024
cfg <- EJAM:::pipeline_config_annual(
yr = yr,
pipeline_root = "s3://pedp-data-preserved/ejscreen-data-processing/pipeline",
pipeline_storage = "s3",
force_acs = TRUE,
force_bg_acsdata = TRUE,
force_bg_geodata = TRUE,
aws_profile = "ejam",
aws_region = "us-east-1"
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
The runner prints the resolved settings before it starts the pipeline. Review those settings carefully, especially the year, storage backend, stage formats, force flags, Island Areas settings, provisional-input flags, and prior validation target.
For ACS2024/v3 and later annual/release runs, AS/GU/MP/VI rows are
added by default at the blockgroup dataset, EJSCREEN export, and
map-data visibility level, in the same general style as the legacy
EPA/EJScreen export. The default source for Island Areas row IDs, area
fields, and available environmental fields is the archived EPA EJScreen
ACS2022 reference named by EJAM_ISLANDAREAS_REFERENCE_PATH.
Keep the DHC demographics out of bg_acsdata unless you are
intentionally creating a mixed-source supplemental dataset. To make the
default explicit:
cfg <- EJAM:::pipeline_config_annual(
yr = 2024,
include_islandareas_data = TRUE,
use_islandareas_demographics = FALSE
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
Set EJAM_USE_ISLANDAREAS_DEMOGRAPHICS = "TRUE" only for
an intentional mixed-source supplemental dataset. That uses 2020 Island
Areas Census DHC demographic values in bg_acsdata, which is
useful for review but is not the default EJScreen replication path. Set
EJAM_INCLUDE_ISLANDAREAS_DATA = "FALSE" only when
deliberately creating a States/DC/PR-only run for comparison or
debugging.
Rerun From Saved ACS Data
If raw ACS has already been downloaded, rerun downstream ACS calculations without redownloading:
cfg <- EJAM:::pipeline_config_annual(
yr = 2024,
force_acs = FALSE,
force_bg_acsdata = TRUE
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
If both raw ACS and bg_acsdata should be reused, leave
both force flags false:
cfg <- EJAM:::pipeline_config_annual(
yr = 2024,
force_acs = FALSE,
force_bg_acsdata = FALSE
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
If bg_geodata has already been created for the same
ACS/TIGER vintage and same blockgroup universe, leave
EJAM_FORCE_BG_GEODATA false. Set it to "TRUE"
when changing vintages or when you want to refresh the Census TIGER/Line
area and internal-point attributes. Even with
EJAM_FORCE_BG_GEODATA = "TRUE", already-downloaded
TIGER/Line state zip files are reused from
EJAM_TIGER_BG_CACHE_DIR when present and valid.
cfg <- EJAM:::pipeline_config_annual(
yr = 2024,
force_acs = FALSE,
force_bg_acsdata = FALSE,
force_bg_geodata = TRUE
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
Supplying Updated Environmental Data
The environmental stage is intentionally separate from the ACS stage.
When updated environmental indicators are available, save them in the
pipeline folder as the stage file for bg_envirodata. With
the default stage format, that file is
bg_envirodata.csv.
For a local pipeline:
file.path(pipeline_dir, "bg_envirodata.csv")
For an S3 pipeline, use the same file name under the s3:// prefix set in EJAM_PIPELINE_DIR.
The file must include bgfips and the environmental
indicators used for EJ indexes. It should also include
pctpre1960. The environmental-data workflow may create
pctpre1960 by reading the saved bg_acsdata
stage.
Environmental indicator missing values should be preserved as missing
values. Do not convert NA values to zero unless the source
explicitly reports a valid zero score. This is especially important for
the drinking-water non-compliance indicator. EJAM versions through
v2.32.8.001 converted missing EPA DWATER values to
drinking = 0 in blockgroupstats; later EJAM
releases should preserve the distinction between missing/no valid score
and a valid zero score.
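
The difference between a missing score and a valid zero shows up as soon as the values are summarized. The numbers below are invented, and the sketch only illustrates why zero-filling distorts results; it is not EJAM's percentile code.

```r
# Invented scores: NA means "no valid score", 0 means a reported zero.
drinking <- c(0, 2.5, NA, 4.0, NA)

# Legacy-style handling that the text warns against: NA becomes 0.
drinking_zero_filled <- ifelse(is.na(drinking), 0, drinking)

# Preserving NA averages only the three valid scores.
mean_preserving_na <- mean(drinking, na.rm = TRUE)
# Zero-filling averages five values and pulls the mean down.
mean_zero_filled <- mean(drinking_zero_filled)

# Counts that stay distinguishable only when NA is preserved.
n_valid_zero <- sum(drinking == 0, na.rm = TRUE)
n_missing    <- sum(is.na(drinking))
```

After zero-filling, the one reported zero and the two missing scores can no longer be told apart, and any percentile or mean built on the column shifts accordingly.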
To force the runner to stop unless bg_envirodata.csv has
been supplied:
cfg <- EJAM:::pipeline_config_annual(
yr = 2024,
use_provisional_bg_envirodata = FALSE
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
After replacing bg_envirodata.csv, rerun without forcing
ACS:
cfg <- EJAM:::pipeline_config_annual(
yr = 2024,
force_acs = FALSE,
force_bg_acsdata = FALSE,
force_bg_geodata = FALSE,
use_provisional_bg_envirodata = FALSE
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
This reuses the saved ACS stages and regenerates downstream
blockgroupstats, bgej, usastats,
statestats, ejscreen_export, and
ejscreen_export_statepct. EJScreen-style lookup exports are
created only when
include_ejscreen_pctile_lookup_exports = TRUE in the
config, or
EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS = "TRUE" in
compatibility-runner workflows.
Extra Indicators
Some blockgroupstats columns are not ACS indicators and
are not the main environmental indicators. Examples include life
expectancy, health indicators, facility-count context variables,
climate-related fields, and other columns grouped in
map_headernames$varlist.
The pipeline makes these explicit in the
bg_extra_indicators stage, usually
bg_extra_indicators.csv under the default stage format. If
an updated table is not supplied, the runner currently creates a
provisional version from the packaged
EJAM::blockgroupstats. That is useful for testing the ACS
update, but final release review should document clearly any reuse of
older non-ACS data.
Provisional Draft Builds
For early pipeline testing, it is acceptable to reuse existing environmental and extra indicators:
cfg <- EJAM:::pipeline_config_annual(
yr = 2024,
use_provisional_bg_envirodata = TRUE
)
pipeline_run <- EJAM:::run_ejscreen_pipeline(cfg)
The runner writes source-note text files next to provisional stages,
such as bg_envirodata_SOURCE.txt and
bg_extra_indicators_SOURCE.txt. Final release review should
confirm whether any provisional stage remains.
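
One quick way to see whether provisional stages remain is to list the source-note files in the pipeline folder. The sketch below builds a throwaway folder so that it runs on its own. It assumes the *_SOURCE.txt notes are written only for provisional stages, which the text above suggests but does not state outright.

```r
# Self-contained demo in a temporary folder; in practice pipeline_dir would be
# the real pipeline folder.
pipeline_dir <- file.path(tempdir(), "ejscreen_acs_demo")
dir.create(pipeline_dir, showWarnings = FALSE)
writeLines(
  "reused from packaged data",
  file.path(pipeline_dir, "bg_envirodata_SOURCE.txt")
)
writeLines("bgfips", file.path(pipeline_dir, "bg_acsdata.csv"))

# Assumption: a *_SOURCE.txt file marks a stage that was provisionally reused.
source_notes <- list.files(pipeline_dir, pattern = "_SOURCE\\.txt$")
provisional_stages <- sub("_SOURCE\\.txt$", "", source_notes)
```

The run manifest's used_provisional_* entries record the same information and are the better source when the two disagree.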
Reviewing Outputs
Start with the validation summary:
library(data.table)
pipeline_dir <- "data-raw/pipeline_outputs/ejscreen_acs_2024"
validation <- fread(file.path(pipeline_dir, "pipeline_validation_summary.csv"))
validation[, .(stage, rows, columns, errors, warnings)]
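# fread() reads an all-empty column as NA, and nzchar(NA) is TRUE, so test for NA first.
validation[!is.na(errors) & nzchar(errors)]
validation[!is.na(warnings) & nzchar(warnings)]
There should be no validation errors. Warnings should be understood and either fixed or explicitly accepted for a draft build.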
Then inspect the main outputs. The example below assumes the default CSV stage format:
bg_acsdata <- fread(file.path(pipeline_dir, "bg_acsdata.csv"))
blockgroupstats <- fread(file.path(pipeline_dir, "blockgroupstats.csv"))
bgej <- fread(file.path(pipeline_dir, "bgej.csv"))
usastats <- fread(file.path(pipeline_dir, "usastats.csv"))
statestats <- fread(file.path(pipeline_dir, "statestats.csv"))
bg_geodata <- fread(file.path(pipeline_dir, "bg_geodata.csv"))
nrow(blockgroupstats)
nrow(bgej)
names(blockgroupstats)
names(bgej)
island_prefixes <- c("60", "66", "69", "78")
for (stage in list(bg_acsdata, blockgroupstats, bgej)) {
print(stage[substr(bgfips, 1, 2) %in% island_prefixes, .N, by = ST])
}
Useful checks include:
- expected FIPS/geography columns are present and typed as character;
- row counts are plausible for the ACS vintage and geography coverage;
- under the default EJAM_BLOCKGROUP_UNIVERSE_SOURCE = "acs", blockgroupstats, bgej, and bg_geodata have the same bgfips values as bg_acsdata;
- key ACS indicators are non-missing for most populated blockgroups;
- percentage/rate variables are in the expected range;
- blockgroupstats and bgej join cleanly by bgfips;
- bg_geodata has one row per bgfips and non-missing, nonnegative arealand and areawater;
- lookup tables include REGION, PCTILE, 0, 100, and mean;
- usastats has one region, "USA";
- statestats has expected state/territory regions;
- if optional EJScreen-style lookup exports were requested, they include PCTILE, REGION, 0, 100, mean, and std, with EJScreen field names rather than EJAM rname columns;
- for ACS2024/v3, Island Areas AS/GU/MP/VI are present in bg_acsdata, blockgroupstats, bgej, ejscreen_export, and ejscreen_export_statepct unless EJAM_INCLUDE_ISLANDAREAS_DATA = "FALSE".
For the default EJScreen-compatible path, Island Areas demographic
columns in bg_acsdata and downstream stages are expected to
be NA because the archived EPA/EJScreen reference file with
AS/GU/MP/VI rows had no usable ACS demographic values for those rows.
The separate bg_islandareas_demographics checkpoint is
optional and contains available 2020 Island Areas Census DHC values for
review. Those values are used in bg_acsdata only when
EJAM_USE_ISLANDAREAS_DEMOGRAPHICS = "TRUE". Available EPA
environmental and area fields for AS/GU/MP/VI are retained from the
reference where supplied. This visibility-level support does not mean
radius/buffer or block-weighted polygon analysis works in Island Areas.
Island Area blocks are not added to blockwts,
blockpoints, bgid2fips,
blockid2fips, quaddata, or related helper
files for this release path, so analyses there should return no-data
results rather than block-weighted estimates.
For example:
stopifnot("0" %in% as.character(usastats$PCTILE))
stopifnot("100" %in% as.character(usastats$PCTILE))
stopifnot("mean" %in% as.character(usastats$PCTILE))
stopifnot("0" %in% as.character(statestats$PCTILE))
stopifnot("100" %in% as.character(statestats$PCTILE))
stopifnot("mean" %in% as.character(statestats$PCTILE))
if (file.exists(file.path(pipeline_dir, "ejscreen_us_pctile_lookup.csv"))) {
ejscreen_us_pctile_lookup <- fread(file.path(pipeline_dir, "ejscreen_us_pctile_lookup.csv"))
stopifnot("std" %in% as.character(ejscreen_us_pctile_lookup$PCTILE))
stopifnot(all(c("PCTILE", "REGION", "DEMOGIDX_2", "LOWINCPCT", "D2_PM25") %in%
names(ejscreen_us_pctile_lookup)))
}
if (file.exists(file.path(pipeline_dir, "ejscreen_state_pctile_lookup.csv"))) {
ejscreen_state_pctile_lookup <- fread(file.path(pipeline_dir, "ejscreen_state_pctile_lookup.csv"))
stopifnot("std" %in% as.character(ejscreen_state_pctile_lookup$PCTILE))
}
stopifnot(!anyDuplicated(blockgroupstats$bgfips))
stopifnot(!anyDuplicated(bgej$bgfips))
stopifnot(!anyDuplicated(bg_geodata$bgfips))
stopifnot(setequal(blockgroupstats$bgfips, bg_acsdata$bgfips))
stopifnot(setequal(bgej$bgfips, bg_acsdata$bgfips))
stopifnot(setequal(bg_geodata$bgfips, bg_acsdata$bgfips))
stopifnot(all(bg_geodata$arealand >= 0, na.rm = TRUE))
stopifnot(all(bg_geodata$areawater >= 0, na.rm = TRUE))
Also review the run manifest:
manifest <- fread(file.path(pipeline_dir, "pipeline_run_manifest.csv"))
manifest[key %in% c(
"package_version",
"git_sha",
"git_branch",
"git_dirty",
"acs_version",
"stage_format",
"setting_EJAM_STAGE_FORMATS",
"used_provisional_bg_envirodata",
"used_provisional_bg_extra_indicators"
)]
For S3-backed runs, read the same files from S3 using EJAM’s pipeline input helpers or the AWS CLI.
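
Two of the checks listed under Useful checks, character-typed FIPS and percentage ranges, depend on how the CSV is read. The sketch below uses base read.csv() on a tiny invented file; fread() accepts a similar colClasses argument. The column pctlowinc and the 0 to 1 scale are assumptions for illustration only.

```r
# Write a tiny CSV so the example is self-contained; values are invented.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("bgfips,pctlowinc", "010010201001,0.25", "060371234001,0.90"),
  csv_path
)

# Default type detection reads bgfips as a number and drops the leading zero.
auto_typed <- read.csv(csv_path)
# Declaring the column as character keeps all 12 digits.
typed <- read.csv(csv_path, colClasses = c(bgfips = "character"))

fips_ok  <- is.character(typed$bgfips) && all(nchar(typed$bgfips) == 12)
# Assumption: percentages are stored as fractions between 0 and 1.
range_ok <- all(typed$pctlowinc >= 0 & typed$pctlowinc <= 1, na.rm = TRUE)
```

Reading FIPS columns as character from the start avoids silent mismatches when stage tables are later joined or compared by bgfips.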
Slow Stages
The longest stages are usually:
- fresh bg_acs_raw downloads, because several ACS table-based summary files are large;
- S3 writes of large raw and derived stage files, especially when saving both .csv and .rda;
- bg_acsdata, because it applies ACS formulas and tract-to-blockgroup calculations;
- bg_geodata, because it downloads and reads Census TIGER/Line block group shapefiles for every state, the District of Columbia, Puerto Rico, and optional Island Areas, with TIGERweb available as a fallback. Reusing EJAM_TIGER_BG_CACHE_DIR makes later bg_geodata rebuilds much faster because the state or territory zip files do not need to be downloaded again;
- final blockgroupstats/bgej/statistics/export calculations;
- prior-version validation when comparing large blockgroup tables.
When debugging pipeline speed, check the console timestamps and consider rerunning without forcing earlier stages once their saved outputs are known to be current.
Reviewing the EJScreen Export Schema
If EJAM_INCLUDE_EJSCREEN_EXPORT is true, the runner
writes the export stage and its schema report. With the default stage
format, those files are:
ejscreen_export.csv
ejscreen_export_schema_report.csv
Use the schema report as a field-by-field checklist:
schema <- fread(file.path(pipeline_dir, "ejscreen_export_schema_report.csv"))
schema[, .N, by = status]
schema[status == "missing_expected"]
schema[status == "missing_expected", .N, by = field_type][order(-N)]
schema[status == "unexpected_output"]
Each missing expected field should be classified as one of:
- a field EJScreen needs and the export must add;
- a metadata mapping issue in map_headernames;
- a deliberately deferred field that is not needed for the current export.
For release, the ideal schema report has no
missing_expected rows for the FeatureServer fields required
by the EJScreen app.
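The release condition above can be turned into a small scripted check. The following is a minimal sketch on toy data, not part of the pipeline: the helper name `schema_release_blockers()`, the `field` column name, and the `"ok"` status value are assumptions made for illustration. Only `status`, `field_type`, and the `missing_expected` and `unexpected_output` values come from the report as described here.

```r
# Toy stand-in for the schema report. In practice the table would come from
# fread(file.path(pipeline_dir, "ejscreen_export_schema_report.csv")).
# The `field` column and the "ok" status value are illustrative assumptions.
schema <- data.frame(
  field      = c("ID", "PM25", "EXTRA_COL"),
  field_type = c("id", "indicator", "indicator"),
  status     = c("ok", "missing_expected", "unexpected_output"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows that should block a release.
schema_release_blockers <- function(schema) {
  schema <- as.data.frame(schema)  # also accepts a data.table from fread()
  schema[schema$status == "missing_expected", , drop = FALSE]
}

blockers <- schema_release_blockers(schema)
if (nrow(blockers) > 0) {
  message("Schema report still has ", nrow(blockers), " missing_expected field(s).")
}
```

A real check would probably restrict the blockers to the FeatureServer fields the EJScreen app requires, since deliberately deferred fields may legitimately remain listed as missing.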
For 2022 replication checks, also review the EPA-reference comparison reports when they are present:
prior_validation_ejscreen_export_vs_epa_2024_acs2022.csv
prior_validation_ejscreen_export_vs_epa_2024_acs2022_summary.csv
prior_validation_ejscreen_export_vs_epa_2024_acs2022_summary.txt
Those reports use ID as a character field so leading
zeroes in block group FIPS are preserved. They are meant to explain
differences between EJAM’s current export and the EPA-style reference
export, not to force EJAM to replicate legacy behavior where the new
pipeline has intentionally corrected a formula or missing-value
rule.
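The following minimal sketch shows why ID has to be read as character. It uses base R `read.csv()` on an inline two-row table with made-up values; the pipeline reports themselves are not read here. When the column type is guessed, the 12-digit block group FIPS code becomes a number and loses its leading zero.

```r
# Inline toy CSV: two 12-digit block group FIPS codes, one starting with "0".
# The values are made up for illustration.
csv_text <- "ID,value\n010010201001,1.5\n720010001001,2.0"

# Default type guessing turns ID into a number and drops the leading zero.
guessed <- read.csv(text = csv_text)

# Forcing ID to character keeps all 12 digits.
kept <- read.csv(text = csv_text, colClasses = c(ID = "character"))

is.numeric(guessed$ID)  # TRUE: the FIPS code is no longer a string
nchar(kept$ID)          # both codes keep 12 characters
```

The same applies when reading the reports with `fread()`: set the column class for ID explicitly rather than relying on type detection.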
Replacing Package Data
Pipeline stage files are review artifacts. After they are accepted,
update the package data objects deliberately. The runner currently has
an interactive helper path for replacing blockgroupstats,
but it does not automatically replace every final package dataset.
A release update should explicitly replace at least:
- blockgroupstats;
- usastats;
- statestats;
- any related lookup or metadata objects that changed.
Use the established EJAM metadata helpers before saving package
.rda data. When reviewing CSV stages manually, a typical
pattern is:
blockgroupstats <- fread(file.path(pipeline_dir, "blockgroupstats.csv"))
usastats <- fread(file.path(pipeline_dir, "usastats.csv"))
statestats <- fread(file.path(pipeline_dir, "statestats.csv"))
EJAM:::metadata_add_and_use_this("blockgroupstats")
EJAM:::metadata_add_and_use_this("usastats")
EJAM:::metadata_add_and_use_this("statestats")
Confirm the exact metadata values before saving, especially ACS
vintage, Census vintage, EJSCREEN/EJAM version, data source notes, and
provisional reuse notes. bgej should be checked and saved
through the pipeline stage files, then published as
bgej.arrow in the ejamdata release tag
recorded in DESCRIPTION as ejamdata_required_tag, such as
v3.2024.0 for the matching EJAM v3 release. A local
data/bgej.arrow copy can be useful for testing from source,
but it is ignored for package builds and should not be treated as normal
package data.
After those key datasets are updated, rerun the scripts that create
and save testoutput_* files and datasets, especially
datacreate_testpoints_testoutputs.R and
datacreate_testoutput_ejamit_*.
Then run EJAM:::metadata_check() and
EJAM:::metadata_check_print() from the current source
package to confirm that package datasets with metadata-style attributes
have the expected EJAM version, ACS version, release dates, and save
dates. Atomic name-vector objects such as many names_*
datasets do not need metadata attributes.
After package data are updated, reinstall the package and rerun release-critical tests. Also regenerate any Arrow-format files used outside the package build, if needed.
Comparing Two Vintages Side by Side
While testing an annual update it is useful to run two
vintages/releases at once (for example the new ACS2024 /
v3.2024.0 build and the prior ACS2023 /
v3.2023.0 build) and compare the same sites in each. Each
EJAM release pins its own ejamdata_required_tag in
DESCRIPTION, so each pulls the matching vintage of the large Arrow
datasets (bgej, etc.) from its tagged ejamdata
release.
The cleanest approach is one git worktree per
vintage, so each checkout has its own data/ folder
and Arrow cache and there is no cross-vintage collision:
# from the main checkout, create a worktree per release branch/tag:
# (run in a terminal)
# git worktree add ../EJAM-2024 ACS2024
# git worktree add ../EJAM-2023 ACS2023
# then, in a SEPARATE R session per folder:
devtools::load_all(".") # first load downloads this vintage's bgej (~90 MB)
EJAM:::ejamdata_required_tag() # confirm: "v3.2024.0" (or v3.2023.0, ...)
attr(blockgroupstats, "acs_version") # confirm: "2020-2024" (or "2019-2023", ...)
out_new <- ejamit(testpoints_10, radius = 1) # run the SAME input in each session
# compare out_new$results_overall across the two sessions / vintages
If instead you stay in one checkout and switch
branches, the downloaded data/bgej.arrow is
gitignored and persists across git switch, while the
version marker data/ejamdata_version.txt is tracked and
changes with the branch. To be sure you are testing the right vintage’s
EJ indexes, remove the cached file so it is re-downloaded for the new
vintage:
# after: git switch ACS2023
file.remove("data/bgej.arrow") # force re-download of the v3.2023.0 bgej
devtools::load_all(".")
As a safeguard, dataload_dynamic_validate_bgej()
compares the loaded bgej to the package’s
blockgroupstats and drops it if they do not match, so a
stale bgej is rejected rather than silently used.
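Once each session has saved its overall results (for example with `saveRDS()`), the two vintages can be compared in a third session. The following is a minimal sketch on toy one-row tables, assuming that the overall results are a one-row table of mostly numeric indicator columns. The helper `compare_overall()` and the column names `pop`, `pctlowinc`, `note`, and `other` are hypothetical and are not EJAM functions or guaranteed EJAM column names.

```r
# Hypothetical helper: line up the numeric columns that two one-row result
# tables share and report the difference (new minus old) for each.
compare_overall <- function(new, old) {
  new <- as.data.frame(new)
  old <- as.data.frame(old)
  shared <- intersect(names(new), names(old))
  both_numeric <- vapply(
    shared,
    function(nm) is.numeric(new[[nm]]) && is.numeric(old[[nm]]),
    logical(1)
  )
  shared <- shared[both_numeric]
  new_vals <- vapply(shared, function(nm) as.numeric(new[[nm]][1]), numeric(1), USE.NAMES = FALSE)
  old_vals <- vapply(shared, function(nm) as.numeric(old[[nm]][1]), numeric(1), USE.NAMES = FALSE)
  data.frame(
    indicator = shared,
    new = new_vals,
    old = old_vals,
    diff = new_vals - old_vals,
    stringsAsFactors = FALSE
  )
}

# Toy stand-ins for the results saved from each session (made-up values).
overall_new <- data.frame(pop = 1000, pctlowinc = 0.30, note = "a")
overall_old <- data.frame(pop = 990, pctlowinc = 0.25, other = 1)

cmp <- compare_overall(overall_new, overall_old)
cmp[order(-abs(cmp$diff)), ]  # largest absolute changes first
```

Columns that exist in only one vintage are dropped by `intersect()`, so renamed or newly added indicators need a separate review.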
Release Checklist
Before releasing a new annual dataset build:
- Run the pipeline with the intended ACS year and storage backend.
- Confirm whether bg_envirodata and bg_extra_indicators are updated or provisional.
- Review pipeline_run_manifest.csv, including package version, Git SHA, ACS vintage, run settings, and provisional-input flags.
- Review pipeline_validation_summary.csv.
- Review row counts, missingness, ranges, and joins for final tables, including the bg_geodata area fields.
- Review ejscreen_export_schema_report.csv.
- Replace package .rda datasets only after the stage files are accepted.
- Update dataset metadata and documentation.
- Rebuild documentation and pkgdown.
- Reinstall the package and run focused tests plus package checks.
- Publish large artifacts through the chosen storage path, such as S3 or the data repository release process.
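The file-review items in this checklist can start with a quick check that the review files exist. The following is a minimal sketch using base R only. The helper `missing_review_files()` is hypothetical, and the demo directory is a temporary folder standing in for the real pipeline directory. The three file names are the ones listed above, assuming the default CSV stage format and, for the schema report, that the export stage was run.

```r
# Review files named in the checklist (default CSV stage format assumed).
review_files <- c(
  "pipeline_run_manifest.csv",
  "pipeline_validation_summary.csv",
  "ejscreen_export_schema_report.csv"
)

# Hypothetical helper: return the review files not found in pipeline_dir.
missing_review_files <- function(pipeline_dir, files = review_files) {
  files[!file.exists(file.path(pipeline_dir, files))]
}

# Demo on a temporary folder that contains only the manifest.
demo_dir <- tempfile("pipeline_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, "pipeline_run_manifest.csv"))

missing_files <- missing_review_files(demo_dir)
if (length(missing_files) > 0) {
  message("Not found: ", paste(missing_files, collapse = ", "))
}
```

A presence check only confirms that the files were written; their contents still need the manual review described in the checklist.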
The general release and large-data publication steps are covered in Updating and Managing the Datasets Used by EJAM and Updating the Package as a New Release.