
Run calculations for the staged EJSCREEN/EJAM dataset update pipeline

Usage

calc_ejscreen_dataset(
  yr,
  bg_envirodata = NULL,
  bg_extra_indicators = NULL,
  bg_geodata = NULL,
  bg_acs_raw = NULL,
  bg_acsdata = NULL,
  blockgroupstats = NULL,
  pipeline_dir = NULL,
  pipeline_storage = c("auto", "local", "s3"),
  save_stages = FALSE,
  use_saved_stages = TRUE,
  stage_format = c("csv", "rds", "rda", "arrow"),
  raw_acs_storage = c("folder", "object"),
  raw_table_format = stage_format,
  overwrite = TRUE,
  validation_strict = TRUE,
  download_acs_raw = TRUE,
  acs_download_fun = ACSdownload::get_acs_new,
  return_intermediate = TRUE,
  include_ejscreen_export = FALSE,
  include_ejscreen_export_statepct = NULL,
  include_ejscreen_pctile_lookup_exports = NULL,
  include_ejscreen_dataset_creator_input = FALSE,
  ejscreen_export_path = NULL,
  ejscreen_export_statepct_path = NULL,
  ejscreen_us_pctile_lookup_path = NULL,
  ejscreen_state_pctile_lookup_path = NULL,
  ejscreen_dataset_creator_input_path = NULL,
  ejscreen_export_vars = NULL,
  ejscreen_export_required_names = NULL,
  ejscreen_export_rename_newtype = "ejscreen_indicator",
  ejscreen_export_feature_server_fields = NULL,
  ejscreen_pctile_lookup_output_fields = ejscreen_pctile_lookup_fields(),
  blockgroup_tables = setdiff(as.vector(EJAM::tables_ejscreen_acs), tract_tables),
  tract_tables = c("B18101", "C16001", "B27010"),
  tract_weight_source = c("decennial2020", "acs"),
  include_tract_data = TRUE,
  include_islandareas_data = FALSE,
  islandareas_raw = NULL,
  islandareas_demographics = NULL,
  islandareas_reference = NULL,
  islandareas_tables = islandareas_tables_for_bg_acsdata(),
  use_islandareas_demographics = FALSE,
  fiveorone = "5",
  download_timeout = 3600,
  download_retries = 2,
  download_bg_geodata = FALSE,
  blockgroup_universe_source = c("acs", "union"),
  formulas = EJAM::formulas_ejscreen_acs$formula,
  tract_formulas = NULL,
  dropMOE = TRUE,
  extra_indicator_vars = ejscreen_default_extra_indicator_vars(),
  reuse_existing_if_missing = FALSE,
  existing_blockgroupstats = NULL,
  acs_vars = NULL,
  enviro_vars = NULL,
  ej_indicator_vars = names_e,
  ej_indicator_pctile_vars = names_e_pctile,
  ej_indicator_state_pctile_vars = names_e_state_pctile,
  ej_index_vars = names_ej,
  ej_index_supp_vars = names_ej_supp,
  ej_index_state_vars = names_ej_state,
  ej_index_supp_state_vars = names_ej_supp_state,
  demog_index_var = "Demog.Index",
  demog_index_supp_var = "Demog.Index.Supp",
  demog_index_state_var = "Demog.Index.State",
  demog_index_supp_state_var = "Demog.Index.Supp.State"
)

Arguments

yr

end year of the ACS 5-year survey to use.

bg_envirodata

environmental blockgroup table. If NULL, the wrapper tries to read the saved bg_envirodata stage when use_saved_stages is TRUE.

bg_extra_indicators

non-ACS, non-enviro blockgroup indicators such as lowlifex, or NULL to read/reuse/create that stage.

bg_geodata

Census/TIGER blockgroup geography stage, with square-meter arealand and areawater fields.

bg_acs_raw

optional raw ACS pipeline object from download_bg_acs_raw().

bg_acsdata

optional ACS-derived blockgroup table from calc_bg_acsdata().

blockgroupstats

optional already-combined blockgroupstats-like table.

pipeline_dir

folder or s3://... URI for reading/writing pipeline stage files.

pipeline_storage

stage storage backend: "auto", "local", or "s3". "auto" uses S3 when pipeline_dir starts with s3:// and local file storage otherwise.

save_stages

logical, whether to save each stage as it is created.

use_saved_stages

logical, whether missing inputs may be read from existing files in pipeline_dir.

stage_format

file format for saved/read tabular stages: "csv", "rds", "rda", or "arrow". The wrapper defaults to CSV so every pipeline checkpoint is easy to inspect outside R.

raw_acs_storage

raw ACS checkpoint storage pattern. "folder" saves one ACS table per file plus a manifest. "object" saves the historical single bg_acs_raw list object.

raw_table_format

file format for per-table raw ACS files when raw_acs_storage = "folder".

overwrite

logical, whether to overwrite saved stage files.

validation_strict

logical passed to stage validators.

download_acs_raw

logical, whether to download raw ACS tables when neither bg_acsdata nor saved ACS stages are available.

acs_download_fun

ACSdownload-compatible function used by download_bg_acs_raw() when raw ACS tables need to be downloaded. The default is ACSdownload::get_acs_new(). Supply a wrapper if you need a legacy ACS source implementation.

return_intermediate

logical. If TRUE, return key interim stage objects in addition to final datasets.

include_ejscreen_export

logical. If TRUE, also create the ordinary EJSCREEN-ready national-percentile export using calc_ejscreen_export().

include_ejscreen_export_statepct

logical or NULL. If TRUE, also create an EPA StatePct-style export where state raw scores and state percentiles are written into the generic EPA field names. When NULL, this follows include_ejscreen_export.

include_ejscreen_pctile_lookup_exports

logical or NULL. If TRUE, create EJScreen-style national and state percentile lookup CSV stages, ejscreen_us_pctile_lookup and ejscreen_state_pctile_lookup, from usastats and statestats. These use EJScreen field names and append a std row for each region. When NULL, these stages are created only when an explicit lookup-export save path is supplied.

include_ejscreen_dataset_creator_input

logical. If TRUE, also create the smaller pre-index input table expected by EPA's ejscreen-dataset-creator-2.3 Python tool.

ejscreen_export_path

optional file path for the EJSCREEN national percentile export.

ejscreen_export_statepct_path

optional file path for the EJSCREEN state-percentile export.

ejscreen_us_pctile_lookup_path, ejscreen_state_pctile_lookup_path

optional file paths for the EJScreen-style national and state percentile lookup exports.

ejscreen_dataset_creator_input_path

optional file path for the EJScreen dataset-creator input table.

ejscreen_export_vars

optional EJAM rname columns to keep in the EJSCREEN export before renaming.

ejscreen_export_required_names

optional final EJSCREEN field names that must be present.

ejscreen_export_rename_newtype

naming column in map_headernames to use when renaming the EJSCREEN export.

ejscreen_export_feature_server_fields

optional final EJSCREEN FeatureServer field list. Defaults to the current EJSCREEN v2.32 block group FeatureServer schema when an EJSCREEN export is requested.

ejscreen_pctile_lookup_output_fields

optional EJScreen lookup-table field names. Defaults to the fields used by EPA's archived national/state lookup CSVs.

blockgroup_tables

ACS tables to download at blockgroup resolution.

tract_tables

ACS tables to download at tract resolution for later blockgroup apportionment.

tract_weight_source

source for tract-to-blockgroup apportionment weights. "decennial2020" matches the legacy EJSCREEN method by using 2020 Decennial Census blockgroup population weights. "acs" uses same-vintage ACS blockgroup population weights.

include_tract_data

logical, whether to download tract_tables.

include_islandareas_data

logical, whether to append Island Areas rows. Puerto Rico is not included here because it is already part of ACS.

islandareas_raw

optional raw Island Areas Census DHC object from download_bg_islandareas_raw().

islandareas_demographics

optional transformed Island Areas Census DHC demographics table from calc_bg_islandareasdata().

islandareas_reference

optional Island Areas rows from the archived EPA EJScreen ACS2022 reference file. When supplied and use_islandareas_demographics = FALSE, these rows define the Island Areas blockgroup IDs and labels used for placeholder rows.

islandareas_tables

Island Areas Census DHC tables to download if include_islandareas_data is TRUE and islandareas_raw is not supplied.

use_islandareas_demographics

logical. Defaults to FALSE so that the EJSCREEN-compatible pipeline appends Island Areas blockgroup rows without using mixed-source Island Areas Census demographics in bg_acsdata. Set TRUE only for a supplemental mixed-source dataset.

fiveorone

ACS sample length, "5" by default.

download_timeout

timeout in seconds to use while downloading ACS table files. This is increased above R's usual 60 second default because some Census table-based summary files are hundreds of MB.

download_retries

number of times to retry a failed ACS table download after the initial attempt.

download_bg_geodata

logical, whether to download Census/TIGER blockgroup geography when bg_geodata is not supplied or saved.

blockgroup_universe_source

passed to calc_ejscreen_blockgroupstats(). The default "acs" uses the ACS table rows as the authoritative blockgroup universe for the requested ACS vintage.

formulas

formulas used for blockgroup-resolution ACS tables.

tract_formulas

formulas used for tract-resolution ACS indicators. Defaults to calc_blockgroupstats_from_tract_data() defaults.

dropMOE

logical, whether to drop ACS margin-of-error columns.

extra_indicator_vars

expected extra indicator columns.

reuse_existing_if_missing

logical, whether missing extra indicators should be copied from existing_blockgroupstats.

existing_blockgroupstats

optional blockgroupstats-like table to use when reuse_existing_if_missing is TRUE. Defaults to current package data.

acs_vars

variables to include in the ACS-only lookup stages. Defaults to current EJSCREEN/EJAM ACS indicators found in bgstats.

enviro_vars

variables to include in the environmental lookup stages. Defaults to current environmental, health, site, climate, and feature variables found in bgstats.

ej_indicator_vars

environmental indicators to use when calculating EJ indexes. Defaults to names_e, but can be replaced for custom indicators.

ej_indicator_pctile_vars, ej_indicator_state_pctile_vars

names for national/state environmental percentile columns used internally by calc_bgej().

ej_index_vars, ej_index_supp_vars, ej_index_state_vars, ej_index_supp_state_vars

names for the four EJ-index families created by calc_bgej().

demog_index_var, demog_index_supp_var, demog_index_state_var, demog_index_supp_state_var

demographic index column names used by calc_bgej().
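
To illustrate what tract_weight_source controls, here is a minimal sketch of population-weighted apportionment. It assumes a tract-level count is split across blockgroups in proportion to each blockgroup's share of its tract's population. The data, the column names, and the apportion_tract_to_bg() helper are hypothetical and are not the EJAM implementation.

```r
# Hypothetical illustration only; not the EJAM implementation.
# Tract-level counts (e.g., from a tract-resolution ACS table).
tracts <- data.frame(
  tractfips = c("01001020100", "01001020200"),
  count     = c(100, 60)
)

# Blockgroup populations used as weights (e.g., 2020 Decennial or ACS).
bgs <- data.frame(
  bgfips = c("010010201001", "010010201002", "010010202001"),
  pop    = c(300, 100, 500)
)

apportion_tract_to_bg <- function(tracts, bgs) {
  # The first 11 digits of a blockgroup FIPS identify its tract.
  bgs$tractfips <- substr(bgs$bgfips, 1, 11)
  # Weight = blockgroup share of its tract's total population.
  tract_pop <- ave(bgs$pop, bgs$tractfips, FUN = sum)
  bgs$weight <- ifelse(tract_pop > 0, bgs$pop / tract_pop, NA_real_)
  out <- merge(bgs, tracts, by = "tractfips", all.x = TRUE)
  out$bg_count <- out$count * out$weight
  out[order(out$bgfips), c("bgfips", "weight", "bg_count")]
}

bg_est <- apportion_tract_to_bg(tracts, bgs)
print(bg_est)
```

Switching the weight source only changes which population column feeds the weights; the apportionment arithmetic in this sketch stays the same.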

Value

named list containing final datasets (blockgroupstats, bgej, usastats, and statestats) plus interim stages when return_intermediate is TRUE. Attributes record pipeline_dir, stage_format, and saved stage paths.

Details

For routine annual, validation-only, exports-only, and release runs, prefer the recipe scripts in data-raw/run_ejscreen_pipeline_*.R, or build a validated config with pipeline_config_annual() and pass it to run_ejscreen_pipeline(). The long-standing data-raw/run_ejscreen_dataset_pipeline.R file remains available as an environment-variable compatibility runner for older source() workflows.

calc_ejscreen_dataset() is a high-level wrapper around the staged annual update helpers. It is intentionally an orchestrator rather than a replacement for the individual stage functions. Each major input or output can be supplied as an R object, read from a saved stage in pipeline_dir, or created and saved by this function.
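
The "supplied, read, or created" pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: resolve_stage() is a hypothetical helper, stages are assumed to be CSV files named after the stage, and bgfips is assumed to be the blockgroup ID column.

```r
# Hypothetical helper, not an EJAM function: resolve one pipeline stage.
resolve_stage <- function(stage_name, object = NULL, pipeline_dir = NULL,
                          use_saved_stages = TRUE, save_stages = FALSE,
                          create_fun = NULL) {
  stage_file <- if (!is.null(pipeline_dir)) {
    file.path(pipeline_dir, paste0(stage_name, ".csv"))
  }
  # 1. An object supplied by the caller always wins.
  if (!is.null(object)) return(object)
  # 2. Otherwise fall back to a saved checkpoint, if allowed and present.
  if (use_saved_stages && !is.null(stage_file) && file.exists(stage_file)) {
    # Read FIPS codes as text so leading zeros are preserved.
    return(read.csv(stage_file, colClasses = c(bgfips = "character")))
  }
  # 3. Otherwise create the stage, optionally saving it.
  if (is.null(create_fun)) stop("No way to obtain stage: ", stage_name)
  object <- create_fun()
  if (save_stages && !is.null(stage_file)) {
    write.csv(object, stage_file, row.names = FALSE)
  }
  object
}

demo_dir <- file.path(tempdir(), "pipeline_demo")
dir.create(demo_dir, showWarnings = FALSE)
make_stage <- function() data.frame(bgfips = "010010201001", pctpre1960 = 0.25)

# The first call creates and saves the stage unless a checkpoint already exists.
s1 <- resolve_stage("bg_envirodata", pipeline_dir = demo_dir,
                    save_stages = TRUE, create_fun = make_stage)
# The second call finds the saved file and reads it instead of recreating it.
s2 <- resolve_stage("bg_envirodata", pipeline_dir = demo_dir)
```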

The default stage order is:

  1. download raw ACS tables of demographic data into bg_acs_raw

  2. when enabled, append Island Areas AS/GU/MP/VI placeholder rows using the archived EPA EJScreen ACS2022 reference for row IDs, area fields, and available environmental fields; optional Island Areas Census DHC demographics can still be saved/reviewed separately but are not used downstream unless explicitly requested

  3. calculate ACS-based demographic indicators (and lead paint indicator) as bg_acsdata

  4. validate/save bg_envirodata (key environmental indicators)

  5. validate/save bg_extra_indicators (e.g., % low life expectancy)

  6. create or validate bg_geodata, the Census/TIGER blockgroup geography attributes used for arealand, areawater, and internal-point fields

  7. calculate demographic indexes (using % low life expectancy, etc.)

  8. combine the blockgroup demographic, environmental, extra, and geography indicators as blockgroupstats

  9. create intermediate percentile lookup tables usastats_acs, statestats_acs, usastats_envirodata, statestats_envirodata

  10. calculate EJ indexes (from environmental percentiles and demographic indexes) and save them as the bgej table

  11. create intermediate percentile lookup tables usastats_ej, statestats_ej

  12. combine those as usastats and statestats

  13. optionally, create an EJScreen-ready export file and/or the EPA Python dataset-creator input file
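
As a rough illustration of step 10, the sketch below multiplies a demographic index by an environmental indicator's percentile, which is how published EJSCREEN 2.x documentation defines an EJ index. Whether calc_bgej() does exactly this is an assumption here, and the data and the pctile.pm and EJ.pm column names are hypothetical.

```r
# Hypothetical illustration of step 10; calc_bgej() may differ in detail.
bg <- data.frame(
  bgfips      = c("010010201001", "010010201002"),
  Demog.Index = c(0.40, 0.10),  # demographic index (a fraction)
  pctile.pm   = c(90, 20)       # national percentile of one environmental indicator
)

# Assumed definition: EJ index = demographic index x environmental percentile.
bg$EJ.pm <- bg$Demog.Index * bg$pctile.pm
print(bg)
```

Under the same assumption, the supplemental and state families would swap in the corresponding demographic index and percentile columns.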

bg_envirodata must include pctpre1960. That column may be produced by an upstream environmental-data step that reads the saved bg_acsdata stage.

For EJAM v3, Island Areas are supported at the blockgroup dataset, EJSCREEN export, and map-data visibility level when include_islandareas_data = TRUE. The default path keeps AS/GU/MP/VI demographic fields as NA, uses archived EPA EJScreen reference rows for row IDs and available environmental/area fields, and does not add AS/GU/MP/VI blocks to the block helper files. Radius/buffer analyses in those areas should therefore return no-data results rather than block-weighted estimates.

The annual pipeline creates the bgej stage, and the package-level dynamic Arrow loader obtains bgej.arrow from the ejamdata release tag recorded in DESCRIPTION as ejamdata_required_tag. For an EJAM v3.YYYY.0 release this is the matching v3.YYYY.0 tag, but the package version and required data tag can differ for patch releases. dataload_dynamic() and download_latest_arrow_data() do not use whichever data-repository release GitHub currently marks as latest.
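
To see how a package version and its required data tag can differ, the sketch below reads both fields from a DESCRIPTION file with base R's read.dcf(). The file is a throwaway written for the example and its values are made up; only the ejamdata_required_tag field name comes from the text above.

```r
# Write a throwaway DESCRIPTION file for illustration (values are made up).
desc_file <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: EJAM",
  "Version: 3.2025.1",
  "ejamdata_required_tag: v3.2025.0"
), desc_file)

# read.dcf() returns a character matrix with one column per requested field.
desc <- read.dcf(desc_file, fields = c("Version", "ejamdata_required_tag"))
pkg_version <- desc[1, "Version"]
data_tag    <- desc[1, "ejamdata_required_tag"]

# A patch release can keep pointing at the earlier data tag.
cat("package version:", pkg_version, "| required data tag:", data_tag, "\n")
```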

Recipe config helpers are the preferred interface for maintainers. The compatibility runner can still use several settings stored as environment variables:

  • EJAM_PIPELINE_YR

  • EJAM_PIPELINE_DIR: override output folder.

  • EJAM_PIPELINE_STORAGE: auto, local, or s3. auto treats s3:// paths as S3.

  • EJAM_STAGE_FORMAT: primary stage format used for loading, usually csv.

  • EJAM_STAGE_FORMATS: comma-separated formats saved by the runner, usually csv,rda.

  • EJAM_BLOCKGROUP_UNIVERSE_SOURCE: acs or union. acs is recommended.

  • EJAM_TRACT_WEIGHT_SOURCE: decennial2020 or acs. decennial2020 matches legacy EJSCREEN tract-to-blockgroup apportionment.

  • AWS_PROFILE and AWS_REGION: used when pipeline_storage is s3.

  • CENSUS_API_KEY: used by functions that download ACS data (or that download boundaries/shapefiles for FIPS from some sources).

  • EJAM_FORCE_ACS: TRUE to redownload/recalculate raw ACS and bg_acsdata.

  • EJAM_FORCE_BG_ACSDATA: TRUE to rebuild bg_acsdata from saved raw ACS.

  • EJAM_FORCE_BG_GEODATA: TRUE to redownload/recalculate Census/TIGER blockgroup geodata.

  • EJAM_ACS_DOWNLOAD_TIMEOUT

  • EJAM_ACS_DOWNLOAD_RETRIES

  • EJAM_INCLUDE_ISLANDAREAS_DATA: TRUE to save AS/GU/MP/VI rows. The annual/release runner enables this by default unless explicitly set otherwise.

  • EJAM_ISLANDAREAS_REFERENCE_PATH: archived EPA EJScreen reference CSV used for Island Areas row IDs, area fields, and available environmental fields.

  • EJAM_USE_ISLANDAREAS_DEMOGRAPHICS: TRUE only for an intentional mixed-source supplemental dataset using 2020 Island Areas Census DHC demographics in bg_acsdata.

  • EJAM_USE_PROVISIONAL_BG_ENVIRODATA: FALSE to require bg_envirodata.csv.

  • EJAM_BG_ENVIRODATA_REFERENCE_PATH and EJAM_BG_ENVIRODATA_REFERENCE_VARS: optional runner-only settings for deliberately creating or repairing a bg_envirodata source stage from an EJSCREEN-style reference CSV. Normal annual and replication runs should use corrected bg_envirodata as-is. Missing reference values are preserved as NA, not converted to zero. This matters for drinking water: EJAM versions after v2.32.8.001 should not convert missing drinking-water scores to zero unless the source explicitly reports zero.

  • EJAM_INCLUDE_EJSCREEN_EXPORT: TRUE to create ejscreen_export.csv.

  • EJAM_INCLUDE_EJSCREEN_EXPORT_STATEPCT: TRUE to create ejscreen_export_statepct.csv.

  • EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS: TRUE to create ejscreen_us_pctile_lookup.csv and ejscreen_state_pctile_lookup.csv. This is off by default because current EJScreen maps use the blockgroup exports that already contain percentile, bin, and popup fields, while reports are served through EJAM-API/EJAM.

  • EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT: TRUE to create the smaller input table expected by EPA's Python dataset-creator workflow.

  • EJAM_VALIDATE_VS_PRIOR and related EJAM_PRIOR_* settings control prior-version comparisons.
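
Settings like these arrive as strings, so a runner has to parse them. The sketch below shows one way to read a TRUE/FALSE flag and a comma-separated list with a fallback default. The env_flag() and env_list() helpers are hypothetical, and the set of values treated as TRUE is an assumption, not the compatibility runner's documented behavior.

```r
# Hypothetical helpers for reading runner-style settings; not EJAM functions.
env_flag <- function(name, default = FALSE) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) return(default)
  # Assumption: these spellings count as TRUE; anything else is FALSE.
  toupper(value) %in% c("TRUE", "T", "1", "YES")
}

env_list <- function(name, default) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) return(default)
  # Split on commas and drop surrounding whitespace.
  trimws(strsplit(value, ",", fixed = TRUE)[[1]])
}

# Demo settings for this session only.
Sys.setenv(EJAM_STAGE_FORMATS = "csv, rda",
           EJAM_INCLUDE_EJSCREEN_EXPORT = "TRUE")
Sys.unsetenv("EJAM_FORCE_ACS")

stage_formats  <- env_list("EJAM_STAGE_FORMATS", default = "csv")
include_export <- env_flag("EJAM_INCLUDE_EJSCREEN_EXPORT")
force_acs      <- env_flag("EJAM_FORCE_ACS")  # unset, so the default applies
```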

The annual runner also writes pipeline_run_manifest.csv, which records the package version, Git branch/SHA, ACS vintage, run settings, and whether provisional environmental or extra-indicator inputs were reused.

Census/TIGER geography can occasionally include valid blockgroup features that are not present in the ACS summary-file tables for the same ACS vintage. For example, a draft ACS 2020-2024 build found 39 Suffolk County, New York blockgroups in TIGER geography but not in the relevant ACS blockgroup or tract tables. The default blockgroup_universe_source = "acs" therefore treats bg_acsdata as the authoritative final blockgroup universe and uses bg_geodata only to annotate those rows.
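
The difference between the two blockgroup_universe_source settings can be pictured as a left join versus a full join. The sketch below uses made-up rows and assumes "union" means keeping blockgroups found in either source; it is not the code used by calc_ejscreen_blockgroupstats().

```r
# Hypothetical data; IDs and values are made up.
bg_acsdata <- data.frame(
  bgfips = c("360470001001", "360470001002"),
  pop    = c(1200, 800)
)
bg_geodata <- data.frame(
  bgfips   = c("360470001001", "360470001002", "361031234001"),
  arealand = c(5e5, 7e5, 9e5)
)

# "acs": keep only ACS rows; geodata only annotates them (left join).
bg_acs_universe <- merge(bg_acsdata, bg_geodata, by = "bgfips", all.x = TRUE)

# "union" (assumed): keep rows found in either source (full join), so the
# TIGER-only blockgroup appears with NA demographics.
bg_union_universe <- merge(bg_acsdata, bg_geodata, by = "bgfips", all = TRUE)

print(bg_acs_universe)
print(bg_union_universe)
```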

To check the current values of the environment variables listed above:

# Show the current value of each setting; unset variables appear as "".
print(
  cbind(current_setting = Sys.getenv(c(
    "EJAM_PIPELINE_YR",
    "EJAM_PIPELINE_DIR", "EJAM_PIPELINE_STORAGE",
    "EJAM_STAGE_FORMAT", "EJAM_STAGE_FORMATS",
    "EJAM_BLOCKGROUP_UNIVERSE_SOURCE",
    "EJAM_TRACT_WEIGHT_SOURCE",
    "AWS_PROFILE", "AWS_REGION",
    "CENSUS_API_KEY",
    "EJAM_FORCE_ACS", "EJAM_FORCE_BG_ACSDATA", "EJAM_FORCE_BG_GEODATA",
    "EJAM_ACS_DOWNLOAD_TIMEOUT", "EJAM_ACS_DOWNLOAD_RETRIES",
    "EJAM_INCLUDE_ISLANDAREAS_DATA", "EJAM_ISLANDAREAS_REFERENCE_PATH",
    "EJAM_USE_ISLANDAREAS_DEMOGRAPHICS",
    "EJAM_BG_ENVIRODATA_REFERENCE_PATH", "EJAM_BG_ENVIRODATA_REFERENCE_VARS",
    "EJAM_USE_PROVISIONAL_BG_ENVIRODATA",
    "EJAM_INCLUDE_EJSCREEN_EXPORT", "EJAM_INCLUDE_EJSCREEN_EXPORT_STATEPCT",
    "EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS",
    "EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT",
    "EJAM_VALIDATE_VS_PRIOR", "EJAM_PRIOR_PIPELINE_YR",
    "EJAM_PRIOR_PIPELINE_DIR", "EJAM_PRIOR_PACKAGE_REF"
  )))
)