Run calculations for the staged EJSCREEN/EJAM dataset update pipeline

Usage

calc_ejscreen_dataset(
  yr,
  bg_envirodata = NULL,
  bg_extra_indicators = NULL,
  bg_geodata = NULL,
  bg_acs_raw = NULL,
  bg_acsdata = NULL,
  blockgroupstats = NULL,
  pipeline_dir = NULL,
  pipeline_storage = c("auto", "local", "s3"),
  save_stages = FALSE,
  use_saved_stages = TRUE,
  stage_format = c("csv", "rds", "rda", "arrow"),
  raw_acs_storage = c("folder", "object"),
  raw_table_format = stage_format,
  overwrite = TRUE,
  validation_strict = TRUE,
  download_acs_raw = TRUE,
  acs_download_fun = ACSdownload::get_acs_new,
  return_intermediate = TRUE,
  include_ejscreen_export = FALSE,
  include_ejscreen_export_statepct = NULL,
  include_ejscreen_pctile_lookup_exports = NULL,
  include_ejscreen_dataset_creator_input = FALSE,
  ejscreen_export_path = NULL,
  ejscreen_export_statepct_path = NULL,
  ejscreen_us_pctile_lookup_path = NULL,
  ejscreen_state_pctile_lookup_path = NULL,
  ejscreen_dataset_creator_input_path = NULL,
  ejscreen_export_vars = NULL,
  ejscreen_export_required_names = NULL,
  ejscreen_export_rename_newtype = "ejscreen_indicator",
  ejscreen_export_feature_server_fields = NULL,
  ejscreen_pctile_lookup_output_fields = ejscreen_pctile_lookup_fields(),
  blockgroup_tables = setdiff(as.vector(EJAM::tables_ejscreen_acs), tract_tables),
  tract_tables = c("B18101", "C16001", "B27010"),
  tract_weight_source = c("decennial2020", "acs"),
  include_tract_data = TRUE,
  include_islandareas_data = FALSE,
  islandareas_raw = NULL,
  islandareas_demographics = NULL,
  islandareas_reference = NULL,
  islandareas_tables = islandareas_tables_for_bg_acsdata(),
  use_islandareas_demographics = FALSE,
  fiveorone = "5",
  download_timeout = 3600,
  download_retries = 2,
  download_bg_geodata = FALSE,
  blockgroup_universe_source = c("acs", "union"),
  formulas = EJAM::formulas_ejscreen_acs$formula,
  tract_formulas = NULL,
  dropMOE = TRUE,
  extra_indicator_vars = ejscreen_default_extra_indicator_vars(),
  reuse_existing_if_missing = FALSE,
  existing_blockgroupstats = NULL,
  acs_vars = NULL,
  enviro_vars = NULL,
  ej_indicator_vars = names_e,
  ej_indicator_pctile_vars = names_e_pctile,
  ej_indicator_state_pctile_vars = names_e_state_pctile,
  ej_index_vars = names_ej,
  ej_index_supp_vars = names_ej_supp,
  ej_index_state_vars = names_ej_state,
  ej_index_supp_state_vars = names_ej_supp_state,
  demog_index_var = "Demog.Index",
  demog_index_supp_var = "Demog.Index.Supp",
  demog_index_state_var = "Demog.Index.State",
  demog_index_supp_state_var = "Demog.Index.Supp.State"
)

Arguments

yr: end year of the ACS 5-year survey to use.
bg_envirodata: environmental blockgroup table. If NULL, the wrapper tries to read the saved bg_envirodata stage when use_saved_stages is TRUE.
bg_extra_indicators: non-ACS, non-enviro blockgroup indicators such as lowlifex, or NULL to read/reuse/create that stage.
bg_geodata: Census/TIGER blockgroup geography stage, with square-meter arealand and areawater fields.
bg_acs_raw: optional raw ACS pipeline object from download_bg_acs_raw().
bg_acsdata: optional ACS-derived blockgroup table from calc_bg_acsdata().
blockgroupstats: optional already-combined blockgroupstats-like table.
pipeline_dir: folder or s3://... URI for reading/writing pipeline stage files.
pipeline_storage: stage storage backend: "auto", "local", or "s3". "auto" uses S3 when pipeline_dir starts with s3:// and local file storage otherwise.
save_stages: logical, whether to save each stage as it is created.
use_saved_stages: logical, whether missing inputs may be read from existing files in pipeline_dir.
stage_format: file format for saved/read tabular stages: "csv", "rds", "rda", or "arrow". The wrapper defaults to CSV so every pipeline checkpoint is easy to inspect outside R.
raw_acs_storage: raw ACS checkpoint storage pattern. "folder" saves one ACS table per file plus a manifest. "object" saves the historical single bg_acs_raw list object.
raw_table_format: file format for per-table raw ACS files when raw_acs_storage = "folder".
overwrite: logical, whether to overwrite saved stage files.
validation_strict: logical passed to stage validators.
download_acs_raw: logical, whether to download raw ACS tables when neither bg_acsdata nor saved ACS stages are available.
acs_download_fun: ACSdownload-compatible function used by download_bg_acs_raw() when raw ACS tables need to be downloaded. The default is ACSdownload::get_acs_new(). Supply a wrapper if you need a legacy ACS source implementation.
return_intermediate: logical. If TRUE, return key interim stage objects in addition to final datasets.
include_ejscreen_export: logical. If TRUE, also create the ordinary EJSCREEN-ready national-percentile export using calc_ejscreen_export().
include_ejscreen_export_statepct: logical or NULL. If TRUE, also create an EPA StatePct-style export where state raw scores and state percentiles are written into the generic EPA field names. When NULL, this follows include_ejscreen_export.
include_ejscreen_pctile_lookup_exports: logical or NULL. If TRUE, create EJScreen-style national and state percentile lookup CSV stages, ejscreen_us_pctile_lookup and ejscreen_state_pctile_lookup, from usastats and statestats. These use EJScreen field names and append a std row for each region. When NULL, these stages are created only when an explicit lookup-export save path is supplied.
include_ejscreen_dataset_creator_input: logical. If TRUE, also create the smaller pre-index input table expected by EPA's ejscreen-dataset-creator-2.3 Python tool.
ejscreen_export_path: optional file path for the EJSCREEN national percentile export.
ejscreen_export_statepct_path: optional file path for the EJSCREEN state-percentile export.
ejscreen_us_pctile_lookup_path, ejscreen_state_pctile_lookup_path: optional file paths for the EJScreen-style national and state percentile lookup exports.
ejscreen_dataset_creator_input_path: optional file path for the EJScreen dataset-creator input table.
ejscreen_export_vars: optional EJAM rname columns to keep in the EJSCREEN export before renaming.
ejscreen_export_required_names: optional final EJSCREEN field names that must be present.
ejscreen_export_rename_newtype: naming column in map_headernames to use when renaming the EJSCREEN export.
ejscreen_export_feature_server_fields: optional final EJSCREEN FeatureServer field list. Defaults to the current EJSCREEN v2.32 block group FeatureServer schema when an EJSCREEN export is requested.
ejscreen_pctile_lookup_output_fields: optional EJScreen lookup-table field names. Defaults to the fields used by EPA's archived national/state lookup CSVs.
blockgroup_tables: ACS tables to download at blockgroup resolution.
tract_tables: ACS tables to download at tract resolution for later blockgroup apportionment.
tract_weight_source: source for tract-to-blockgroup apportionment weights. "decennial2020" matches the legacy EJSCREEN method by using 2020 Decennial Census blockgroup population weights. "acs" uses same-vintage ACS blockgroup population weights.
include_tract_data: logical, whether to download tract_tables.
include_islandareas_data: logical, whether to append Island Areas rows. Puerto Rico is not included here because it is already part of ACS.
islandareas_raw: optional raw Island Areas Census DHC object from download_bg_islandareas_raw().
islandareas_demographics: optional transformed Island Areas Census DHC demographics table from calc_bg_islandareasdata().
islandareas_reference: optional Island Areas rows from the archived EPA EJScreen ACS2022 reference file. When supplied and use_islandareas_demographics = FALSE, these rows define the Island Areas blockgroup IDs and labels used for placeholder rows.
islandareas_tables: Island Areas Census DHC tables to download if include_islandareas_data is TRUE and islandareas_raw is not supplied.
use_islandareas_demographics: logical. Defaults to FALSE so that the EJSCREEN-compatible pipeline appends Island Areas blockgroup rows without using mixed-source Island Areas Census demographics in bg_acsdata. Set TRUE only for a supplemental mixed-source dataset.
fiveorone: ACS sample length in years, given as a string. The default "5" selects the 5-year ACS.
download_timeout: timeout in seconds to use while downloading ACS table files. This is increased above R's usual 60 second default because some Census table-based summary files are hundreds of MB.
download_retries: number of times to retry a failed ACS table download after the initial attempt.
download_bg_geodata: logical, whether to download Census/TIGER blockgroup geography when bg_geodata is not supplied or saved.
blockgroup_universe_source: passed to calc_ejscreen_blockgroupstats(). The default "acs" uses the ACS table rows as the authoritative blockgroup universe for the requested ACS vintage.
formulas: formulas used for blockgroup-resolution ACS tables.
tract_formulas: formulas used for tract-resolution ACS indicators. Defaults to calc_blockgroupstats_from_tract_data() defaults.
dropMOE: logical, whether to drop ACS margin-of-error columns.
extra_indicator_vars: expected extra indicator columns.
reuse_existing_if_missing: logical, whether missing extra indicators should be copied from existing_blockgroupstats.
existing_blockgroupstats: optional blockgroupstats-like table to use when reuse_existing_if_missing is TRUE. Defaults to current package data.
acs_vars: variables to include in the ACS-only lookup stages. Defaults to current EJSCREEN/EJAM ACS indicators found in bgstats.
enviro_vars: variables to include in the environmental lookup stages. Defaults to current environmental, health, site, climate, and feature variables found in bgstats.
ej_indicator_vars: environmental indicators to use when calculating EJ indexes. Defaults to names_e, but can be replaced for custom indicators.
ej_indicator_pctile_vars, ej_indicator_state_pctile_vars: names for national/state environmental percentile columns used internally by calc_bgej().
ej_index_vars, ej_index_supp_vars, ej_index_state_vars, ej_index_supp_state_vars: names for the four EJ-index families created by calc_bgej().
demog_index_var, demog_index_supp_var, demog_index_state_var, demog_index_supp_state_var: demographic index column names used by calc_bgej().
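
The "auto" rule described for pipeline_storage can be illustrated with a few lines of base R. This is only a sketch of the documented behavior: resolve_pipeline_storage() is a hypothetical helper written for this example, not a function in the package, and the paths are made up.

```r
# Hypothetical helper (not part of EJAM): mimics the documented "auto" rule.
resolve_pipeline_storage <- function(pipeline_dir,
                                     pipeline_storage = c("auto", "local", "s3")) {
  pipeline_storage <- match.arg(pipeline_storage)
  # In this sketch, an explicit "local" or "s3" is returned unchanged.
  if (pipeline_storage != "auto") {
    return(pipeline_storage)
  }
  # "auto": S3 when the location starts with s3://, local file storage otherwise.
  if (!is.null(pipeline_dir) && startsWith(pipeline_dir, "s3://")) "s3" else "local"
}

resolve_pipeline_storage("s3://example-bucket/ejscreen")  # "s3"
resolve_pipeline_storage("~/ejscreen_pipeline")           # "local"
resolve_pipeline_storage("~/ejscreen_pipeline", "s3")     # "s3"
```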

Value

named list containing final datasets (blockgroupstats, bgej, usastats, and statestats) plus interim stages when return_intermediate is TRUE. Attributes record pipeline_dir, stage_format, and saved stage paths.
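
A call might look roughly like the sketch below. It assumes the package is loaded so that calc_ejscreen_dataset() is visible, and that an environmental table is available to pass as bg_envirodata. The wrapper name run_update_sketch() and the temporary folder are hypothetical, and the call sits inside a function so that nothing is downloaded until it is invoked.

```r
# Hypothetical wrapper: defining it does not run the pipeline or download anything.
run_update_sketch <- function(yr, bg_envirodata,
                              pipeline_dir = file.path(tempdir(), "ejscreen_pipeline")) {
  out <- calc_ejscreen_dataset(
    yr = yr,
    bg_envirodata = bg_envirodata,
    pipeline_dir = pipeline_dir,
    save_stages = TRUE,          # write each stage to pipeline_dir as it is created
    stage_format = "csv",
    return_intermediate = FALSE  # keep only the final datasets
  )
  # The four final datasets named in the Value section.
  out[c("blockgroupstats", "bgej", "usastats", "statestats")]
}
```

With return_intermediate = TRUE the returned list would also carry the interim stages, and attributes() on the result should show where the stages were saved.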

Details

For routine annual, validation-only, exports-only, and release runs, prefer the recipe scripts in data-raw/run_ejscreen_pipeline_*.R, or build a validated config with pipeline_config_annual() and pass it to run_ejscreen_pipeline(). The long-standing data-raw/run_ejscreen_dataset_pipeline.R file remains available as an environment-variable compatibility runner for older source() workflows.

calc_ejscreen_dataset() is a high-level wrapper around the staged annual update helpers. It is intentionally an orchestrator rather than a replacement for the individual stage functions. Each major input or output can be supplied as an R object, read from a saved stage in pipeline_dir, or created and saved by this function.
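
The "supplied object, else saved stage, else create and save" pattern can be sketched in base R as follows. This is an illustration under simplifying assumptions (CSV only, local files only), not the package's implementation: resolve_stage(), make_stage(), and the bg_demo stage name are all hypothetical.

```r
# Hypothetical helper showing the three ways a stage can be obtained.
resolve_stage <- function(name, object = NULL, pipeline_dir, create_fun,
                          use_saved_stages = TRUE, save_stages = FALSE) {
  path <- file.path(pipeline_dir, paste0(name, ".csv"))
  if (!is.null(object)) {
    return(object)  # 1. the caller supplied the R object directly
  }
  if (use_saved_stages && file.exists(path)) {
    return(utils::read.csv(path, stringsAsFactors = FALSE))  # 2. read the saved stage
  }
  object <- create_fun()  # 3. otherwise create the stage
  if (save_stages) {
    dir.create(pipeline_dir, recursive = TRUE, showWarnings = FALSE)
    utils::write.csv(object, path, row.names = FALSE)
  }
  object
}

demo_dir <- file.path(tempdir(), "pipeline_demo")
unlink(demo_dir, recursive = TRUE)  # start from an empty folder
make_stage <- function() {
  data.frame(id = c("bg1", "bg2"), value = c(1.5, 2.5), stringsAsFactors = FALSE)
}

# First call creates and saves the stage; second call reads it back from disk.
first <- resolve_stage("bg_demo", pipeline_dir = demo_dir,
                       create_fun = make_stage, save_stages = TRUE)
second <- resolve_stage("bg_demo", pipeline_dir = demo_dir,
                        create_fun = function() stop("should not be called"))
```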

The default stage order is:

download raw ACS tables of demographic data into bg_acs_raw
when enabled, append Island Areas AS/GU/MP/VI placeholder rows using the archived EPA EJScreen ACS2022 reference for row IDs, area fields, and available environmental fields; optional Island Areas Census DHC demographics can still be saved/reviewed separately but are not used downstream unless explicitly requested
calculate ACS-based demographic indicators (and lead paint indicator) as bg_acsdata
validate/save bg_envirodata (key environmental indicators)
validate/save bg_extra_indicators (e.g., % low life expectancy)
create or validate bg_geodata, the Census/TIGER blockgroup geography attributes used for arealand, areawater, and internal-point fields
calculate demographic indexes (using % low life expectancy, etc.)
combine the blockgroup demographic, environmental, extra, and geography indicators as blockgroupstats
create intermediate percentile lookup tables usastats_acs, statestats_acs, usastats_envirodata, statestats_envirodata
calculate EJ indexes (from environmental percentiles and demographic indexes) and save them as the bgej table
create intermediate percentile lookup tables usastats_ej, statestats_ej
combine those as usastats and statestats
optionally, create an EJScreen-ready export file and/or the EPA Python dataset-creator input file
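
As a rough illustration of the EJ-index step, the sketch below multiplies an environmental percentile by a demographic index on toy data. It assumes the multiplicative form EPA has published for EJSCREEN 2.x (environmental indicator percentile times demographic index) and uses a simple unweighted percentile. calc_bgej() may differ, for example by using population-weighted percentiles. The column names pm25, pctile.pm25, and EJ.pm25 and all values are made up.

```r
# Toy data: four blockgroups with one environmental indicator and a demographic index.
bg <- data.frame(
  bgfips = c("bg1", "bg2", "bg3", "bg4"),
  pm25 = c(5, 8, 8, 12),
  Demog.Index = c(0.1, 0.4, 0.2, 0.6)
)

# Simplified, unweighted percentile: percent of blockgroups at or below each value.
pctile_unweighted <- function(x) 100 * stats::ecdf(x)(x)

bg$pctile.pm25 <- pctile_unweighted(bg$pm25)     # 25, 75, 75, 100
bg$EJ.pm25 <- bg$pctile.pm25 * bg$Demog.Index    # assumed EJ-index form
bg
```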

bg_envirodata must include pctpre1960. That column may be produced by an upstream environmental-data step that reads the saved bg_acsdata stage.

For EJAM v3, Island Areas are supported at the blockgroup dataset, EJSCREEN export, and map-data visibility level when include_islandareas_data = TRUE. The default path keeps AS/GU/MP/VI demographic fields as NA, uses archived EPA EJScreen reference rows for row IDs and available environmental/area fields, and does not add AS/GU/MP/VI blocks to the block helper files. Radius/buffer analyses in those areas should therefore return no-data results rather than block-weighted estimates.

The annual pipeline creates the bgej stage, and the package-level dynamic Arrow loader obtains bgej.arrow from the ejamdata release tag recorded in DESCRIPTION as ejamdata_required_tag. For an EJAM v3.YYYY.0 release this is the matching v3.YYYY.0 tag, but the package version and required data tag can differ for patch releases. dataload_dynamic() and download_latest_arrow_data() do not use whichever data-repository release GitHub currently marks as latest.

Recipe config helpers are the preferred interface for maintainers. The compatibility runner can still use several settings stored as environment variables:

EJAM_PIPELINE_YR
EJAM_PIPELINE_DIR: override output folder.
EJAM_PIPELINE_STORAGE: auto, local, or s3. auto treats s3:// paths as S3.
EJAM_STAGE_FORMAT: primary stage format used for loading, usually csv.
EJAM_STAGE_FORMATS: comma-separated formats saved by the runner, usually csv,rda.
EJAM_BLOCKGROUP_UNIVERSE_SOURCE: acs or union. acs is recommended.
EJAM_TRACT_WEIGHT_SOURCE: decennial2020 or acs. decennial2020 matches legacy EJSCREEN tract-to-blockgroup apportionment.
AWS_PROFILE and AWS_REGION: used when pipeline_storage is s3.
CENSUS_API_KEY: used by functions that download ACS data (or that download boundaries/shapefiles for FIPS from some sources).
EJAM_FORCE_ACS: TRUE to redownload/recalculate raw ACS and bg_acsdata.
EJAM_FORCE_BG_ACSDATA: TRUE to rebuild bg_acsdata from saved raw ACS.
EJAM_FORCE_BG_GEODATA: TRUE to redownload/recalculate Census/TIGER blockgroup geodata.
EJAM_ACS_DOWNLOAD_TIMEOUT
EJAM_ACS_DOWNLOAD_RETRIES
EJAM_INCLUDE_ISLANDAREAS_DATA: TRUE to save AS/GU/MP/VI rows. The annual/release runner enables this by default unless explicitly set otherwise.
EJAM_ISLANDAREAS_REFERENCE_PATH: archived EPA EJScreen reference CSV used for Island Areas row IDs, area fields, and available environmental fields.
EJAM_USE_ISLANDAREAS_DEMOGRAPHICS: TRUE only for an intentional mixed-source supplemental dataset using 2020 Island Areas Census DHC demographics in bg_acsdata.
EJAM_USE_PROVISIONAL_BG_ENVIRODATA: FALSE to require bg_envirodata.csv.
EJAM_BG_ENVIRODATA_REFERENCE_PATH and EJAM_BG_ENVIRODATA_REFERENCE_VARS: optional runner-only settings for deliberately creating or repairing a bg_envirodata source stage from an EJSCREEN-style reference CSV. Normal annual and replication runs should use corrected bg_envirodata as-is. Missing reference values are preserved as NA, not converted to zero. This matters for drinking water: EJAM versions after v2.32.8.001 should not convert missing drinking-water scores to zero unless the source explicitly reports zero.
EJAM_INCLUDE_EJSCREEN_EXPORT: TRUE to create ejscreen_export.csv.
EJAM_INCLUDE_EJSCREEN_EXPORT_STATEPCT: TRUE to create ejscreen_export_statepct.csv.
EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS: TRUE to create ejscreen_us_pctile_lookup.csv and ejscreen_state_pctile_lookup.csv. This is off by default because current EJScreen maps use the blockgroup exports that already contain percentile, bin, and popup fields, while reports are served through EJAM-API/EJAM.
EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT: TRUE to create the smaller input table expected by EPA's Python dataset-creator workflow.
EJAM_VALIDATE_VS_PRIOR and related EJAM_PRIOR_* settings control prior-version comparisons.
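
For older source() workflows, a session might set a few of these variables before running the compatibility runner, roughly as sketched here. The year and folder are illustrative only, and the source() line is commented out because it requires the package source tree.

```r
# Illustrative values only; environment variables are always stored as strings.
Sys.setenv(
  EJAM_PIPELINE_YR = "2024",
  EJAM_PIPELINE_DIR = file.path(tempdir(), "ejscreen_pipeline"),
  EJAM_PIPELINE_STORAGE = "auto",
  EJAM_STAGE_FORMAT = "csv",
  EJAM_STAGE_FORMATS = "csv,rda",
  EJAM_BLOCKGROUP_UNIVERSE_SOURCE = "acs",
  EJAM_TRACT_WEIGHT_SOURCE = "decennial2020",
  EJAM_INCLUDE_EJSCREEN_EXPORT = "TRUE"
)

# From the package source root, the runner would then be started with:
# source("data-raw/run_ejscreen_dataset_pipeline.R")

# A comma-separated setting can be split back into a character vector.
stage_formats <- strsplit(Sys.getenv("EJAM_STAGE_FORMATS"), ",", fixed = TRUE)[[1]]
stage_formats  # "csv" "rda"
```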

The annual runner also writes pipeline_run_manifest.csv, which records the package version, Git branch/SHA, ACS vintage, run settings, and whether provisional environmental or extra-indicator inputs were reused.

Census/TIGER geography can occasionally include valid blockgroup features that are not present in the ACS summary-file tables for the same ACS vintage. For example, a draft ACS 2020-2024 build found 39 Suffolk County, New York blockgroups in TIGER geography but not in the relevant ACS blockgroup or tract tables. The default blockgroup_universe_source = "acs" therefore treats bg_acsdata as the authoritative final blockgroup universe and uses bg_geodata only to annotate those rows.
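
The difference between the two universe choices can be pictured with base R merge() on toy tables. This is a sketch of the idea only: the IDs and values are made up, the bgfips and pop column names are assumed for illustration, and "union" is assumed here to mean keeping blockgroups found in either source.

```r
# Toy inputs: geography has one blockgroup that the ACS tables lack.
bg_acsdata <- data.frame(
  bgfips = c("000000000001", "000000000002"),
  pop = c(1200, 800),
  stringsAsFactors = FALSE
)
bg_geodata <- data.frame(
  bgfips = c("000000000001", "000000000002", "000000000003"),
  arealand = c(50000, 72000, 31000),
  stringsAsFactors = FALSE
)

# "acs"-style universe: ACS rows are authoritative, geography only annotates them.
acs_universe <- merge(bg_acsdata, bg_geodata, by = "bgfips", all.x = TRUE)

# "union"-style universe (assumed meaning): keep blockgroups from either source.
union_universe <- merge(bg_acsdata, bg_geodata, by = "bgfips", all = TRUE)

nrow(acs_universe)    # 2
nrow(union_universe)  # 3, with NA pop for the geography-only blockgroup
```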

To check the current values of the environment variables listed above:

print(
cbind(current_setting = Sys.getenv(c(
  "EJAM_PIPELINE_YR",
  "EJAM_PIPELINE_DIR", "EJAM_PIPELINE_STORAGE",
  "EJAM_STAGE_FORMAT", "EJAM_STAGE_FORMATS",
  "EJAM_BLOCKGROUP_UNIVERSE_SOURCE", "EJAM_TRACT_WEIGHT_SOURCE",
  "AWS_PROFILE", "AWS_REGION",
  "CENSUS_API_KEY",
  "EJAM_FORCE_ACS", "EJAM_FORCE_BG_ACSDATA", "EJAM_FORCE_BG_GEODATA",
  "EJAM_ACS_DOWNLOAD_TIMEOUT", "EJAM_ACS_DOWNLOAD_RETRIES",
  "EJAM_INCLUDE_ISLANDAREAS_DATA", "EJAM_ISLANDAREAS_REFERENCE_PATH",
  "EJAM_USE_ISLANDAREAS_DEMOGRAPHICS",
  "EJAM_BG_ENVIRODATA_REFERENCE_PATH", "EJAM_BG_ENVIRODATA_REFERENCE_VARS",
  "EJAM_USE_PROVISIONAL_BG_ENVIRODATA",
  "EJAM_INCLUDE_EJSCREEN_EXPORT", "EJAM_INCLUDE_EJSCREEN_EXPORT_STATEPCT",
  "EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS",
  "EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT",
  "EJAM_VALIDATE_VS_PRIOR", "EJAM_PRIOR_PIPELINE_YR",
  "EJAM_PRIOR_PIPELINE_DIR", "EJAM_PRIOR_PACKAGE_REF"
)))
)