Run calculations for the staged EJSCREEN/EJAM dataset update pipeline
Source: R/calc_ejscreen_dataset.R
Usage
calc_ejscreen_dataset(
yr,
bg_envirodata = NULL,
bg_extra_indicators = NULL,
bg_geodata = NULL,
bg_acs_raw = NULL,
bg_acsdata = NULL,
blockgroupstats = NULL,
pipeline_dir = NULL,
pipeline_storage = c("auto", "local", "s3"),
save_stages = FALSE,
use_saved_stages = TRUE,
stage_format = c("csv", "rds", "rda", "arrow"),
raw_acs_storage = c("folder", "object"),
raw_table_format = stage_format,
overwrite = TRUE,
validation_strict = TRUE,
download_acs_raw = TRUE,
acs_download_fun = ACSdownload::get_acs_new,
return_intermediate = TRUE,
include_ejscreen_export = FALSE,
include_ejscreen_export_statepct = NULL,
include_ejscreen_pctile_lookup_exports = NULL,
include_ejscreen_dataset_creator_input = FALSE,
ejscreen_export_path = NULL,
ejscreen_export_statepct_path = NULL,
ejscreen_us_pctile_lookup_path = NULL,
ejscreen_state_pctile_lookup_path = NULL,
ejscreen_dataset_creator_input_path = NULL,
ejscreen_export_vars = NULL,
ejscreen_export_required_names = NULL,
ejscreen_export_rename_newtype = "ejscreen_indicator",
ejscreen_export_feature_server_fields = NULL,
ejscreen_pctile_lookup_output_fields = ejscreen_pctile_lookup_fields(),
blockgroup_tables = setdiff(as.vector(EJAM::tables_ejscreen_acs), tract_tables),
tract_tables = c("B18101", "C16001", "B27010"),
tract_weight_source = c("decennial2020", "acs"),
include_tract_data = TRUE,
include_islandareas_data = FALSE,
islandareas_raw = NULL,
islandareas_demographics = NULL,
islandareas_reference = NULL,
islandareas_tables = islandareas_tables_for_bg_acsdata(),
use_islandareas_demographics = FALSE,
fiveorone = "5",
download_timeout = 3600,
download_retries = 2,
download_bg_geodata = FALSE,
blockgroup_universe_source = c("acs", "union"),
formulas = EJAM::formulas_ejscreen_acs$formula,
tract_formulas = NULL,
dropMOE = TRUE,
extra_indicator_vars = ejscreen_default_extra_indicator_vars(),
reuse_existing_if_missing = FALSE,
existing_blockgroupstats = NULL,
acs_vars = NULL,
enviro_vars = NULL,
ej_indicator_vars = names_e,
ej_indicator_pctile_vars = names_e_pctile,
ej_indicator_state_pctile_vars = names_e_state_pctile,
ej_index_vars = names_ej,
ej_index_supp_vars = names_ej_supp,
ej_index_state_vars = names_ej_state,
ej_index_supp_state_vars = names_ej_supp_state,
demog_index_var = "Demog.Index",
demog_index_supp_var = "Demog.Index.Supp",
demog_index_state_var = "Demog.Index.State",
demog_index_supp_state_var = "Demog.Index.Supp.State"
)
Arguments
- yr
end year of the ACS 5-year survey to use.
- bg_envirodata
environmental blockgroup table. If NULL, the wrapper tries to read the saved bg_envirodata stage when use_saved_stages is TRUE.
- bg_extra_indicators
non-ACS, non-enviro blockgroup indicators such as lowlifex, or NULL to read/reuse/create that stage.
- bg_geodata
Census/TIGER blockgroup geography stage, with square-meter arealand and areawater fields.
- bg_acs_raw
optional raw ACS pipeline object from download_bg_acs_raw().
- bg_acsdata
optional ACS-derived blockgroup table from calc_bg_acsdata().
- blockgroupstats
optional already-combined blockgroupstats-like table.
- pipeline_dir
folder or s3://... URI for reading/writing pipeline stage files.
- pipeline_storage
stage storage backend: "auto", "local", or "s3". "auto" uses S3 when pipeline_dir starts with s3:// and local file storage otherwise.
- save_stages
logical, whether to save each stage as it is created.
- use_saved_stages
logical, whether missing inputs may be read from existing files in pipeline_dir.
- stage_format
file format for saved/read tabular stages: "csv", "rds", "rda", or "arrow". The wrapper defaults to CSV so every pipeline checkpoint is easy to inspect outside R.
- raw_acs_storage
raw ACS checkpoint storage pattern. "folder" saves one ACS table per file plus a manifest. "object" saves the historical single bg_acs_raw list object.
- raw_table_format
file format for per-table raw ACS files when raw_acs_storage = "folder".
- overwrite
logical, whether to overwrite saved stage files.
- validation_strict
logical passed to stage validators.
- download_acs_raw
logical, whether to download raw ACS tables when neither bg_acsdata nor saved ACS stages are available.
- acs_download_fun
ACSdownload-compatible function used by download_bg_acs_raw() when raw ACS tables need to be downloaded. The default is ACSdownload::get_acs_new(). Supply a wrapper if you need a legacy ACS source implementation.
- return_intermediate
logical. If TRUE, return key interim stage objects in addition to final datasets.
- include_ejscreen_export
logical. If TRUE, also create the ordinary EJSCREEN-ready national-percentile export using calc_ejscreen_export().
- include_ejscreen_export_statepct
logical or NULL. If TRUE, also create an EPA StatePct-style export where state raw scores and state percentiles are written into the generic EPA field names. When NULL, this follows include_ejscreen_export.
- include_ejscreen_pctile_lookup_exports
logical or NULL. If TRUE, create EJScreen-style national and state percentile lookup CSV stages, ejscreen_us_pctile_lookup and ejscreen_state_pctile_lookup, from usastats and statestats. These use EJScreen field names and append a std row for each region. When NULL, these stages are created only when an explicit lookup-export save path is supplied.
- include_ejscreen_dataset_creator_input
logical. If TRUE, also create the smaller pre-index input table expected by EPA's ejscreen-dataset-creator-2.3 Python tool.
- ejscreen_export_path
optional file path for the EJSCREEN national percentile export.
- ejscreen_export_statepct_path
optional file path for the EJSCREEN state-percentile export.
- ejscreen_us_pctile_lookup_path, ejscreen_state_pctile_lookup_path
optional file paths for the EJScreen-style national and state percentile lookup exports.
- ejscreen_dataset_creator_input_path
optional file path for the EJScreen dataset-creator input table.
- ejscreen_export_vars
optional EJAM rname columns to keep in the EJSCREEN export before renaming.
- ejscreen_export_required_names
optional final EJSCREEN field names that must be present.
- ejscreen_export_rename_newtype
naming column in map_headernames to use when renaming the EJSCREEN export.
- ejscreen_export_feature_server_fields
optional final EJSCREEN FeatureServer field list. Defaults to the current EJSCREEN v2.32 block group FeatureServer schema when an EJSCREEN export is requested.
- ejscreen_pctile_lookup_output_fields
optional EJScreen lookup-table field names. Defaults to the fields used by EPA's archived national/state lookup CSVs.
- blockgroup_tables
ACS tables to download at blockgroup resolution.
- tract_tables
ACS tables to download at tract resolution for later blockgroup apportionment.
- tract_weight_source
source for tract-to-blockgroup apportionment weights. "decennial2020" matches the legacy EJSCREEN method by using 2020 Decennial Census blockgroup population weights. "acs" uses same-vintage ACS blockgroup population weights.
- include_tract_data
logical, whether to download tract_tables.
- include_islandareas_data
logical, whether to append Island Areas rows. Puerto Rico is not included here because it is already part of ACS.
- islandareas_raw
optional raw Island Areas Census DHC object from download_bg_islandareas_raw().
- islandareas_demographics
optional transformed Island Areas Census DHC demographics table from calc_bg_islandareasdata().
- islandareas_reference
optional Island Areas rows from the archived EPA EJScreen ACS2022 reference file. When supplied and use_islandareas_demographics = FALSE, these rows define the Island Areas blockgroup IDs and labels used for placeholder rows.
- islandareas_tables
Island Areas Census DHC tables to download if include_islandareas_data is TRUE and islandareas_raw is not supplied.
- use_islandareas_demographics
logical. Defaults to FALSE so that the EJSCREEN-compatible pipeline appends Island Areas blockgroup rows without using mixed-source Island Areas Census demographics in bg_acsdata. Set TRUE only for a supplemental mixed-source dataset.
- fiveorone
ACS sample length, "5" by default.
- download_timeout
timeout in seconds to use while downloading ACS table files. This is increased above R's usual 60 second default because some Census table-based summary files are hundreds of MB.
- download_retries
number of times to retry a failed ACS table download after the initial attempt.
- download_bg_geodata
logical, whether to download Census/TIGER blockgroup geography when bg_geodata is not supplied or saved.
- blockgroup_universe_source
passed to calc_ejscreen_blockgroupstats(). The default "acs" uses the ACS table rows as the authoritative blockgroup universe for the requested ACS vintage.
- formulas
formulas used for blockgroup-resolution ACS tables.
- tract_formulas
formulas used for tract-resolution ACS indicators. Defaults to the calc_blockgroupstats_from_tract_data() defaults.
- dropMOE
logical, whether to drop ACS margin-of-error columns.
- extra_indicator_vars
expected extra indicator columns.
- reuse_existing_if_missing
logical, whether missing extra indicators should be copied from existing_blockgroupstats.
- existing_blockgroupstats
optional blockgroupstats-like table to use when reuse_existing_if_missing is TRUE. Defaults to current package data.
- acs_vars
variables to include in the ACS-only lookup stages. Defaults to current EJSCREEN/EJAM ACS indicators found in bgstats.
- enviro_vars
variables to include in the environmental lookup stages. Defaults to current environmental, health, site, climate, and feature variables found in bgstats.
- ej_indicator_vars
environmental indicators to use when calculating EJ indexes. Defaults to names_e, but can be replaced for custom indicators.
- ej_indicator_pctile_vars, ej_indicator_state_pctile_vars
names for national/state environmental percentile columns used internally by calc_bgej().
- ej_index_vars, ej_index_supp_vars, ej_index_state_vars, ej_index_supp_state_vars
names for the four EJ-index families created by calc_bgej().
- demog_index_var, demog_index_supp_var, demog_index_state_var, demog_index_supp_state_var
demographic index column names used by calc_bgej().
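As an illustration of the pipeline_storage = "auto" rule described in the arguments, here is a minimal base-R sketch. resolve_pipeline_storage() is a hypothetical helper written for this example, not an EJAM function, and the bucket and folder names are made up. It also assumes that an explicit "local" or "s3" choice is used as given.

```r
# Hypothetical helper (not part of EJAM): mirrors the documented rule that
# "auto" means S3 for s3:// URIs and local file storage otherwise.
resolve_pipeline_storage <- function(pipeline_dir,
                                     pipeline_storage = c("auto", "local", "s3")) {
  pipeline_storage <- match.arg(pipeline_storage)
  # Assumption: an explicit "local" or "s3" choice is honored as-is.
  if (pipeline_storage != "auto") {
    return(pipeline_storage)
  }
  if (!is.null(pipeline_dir) && startsWith(pipeline_dir, "s3://")) "s3" else "local"
}

# Example paths are invented for illustration.
storage_s3 <- resolve_pipeline_storage("s3://example-bucket/ejscreen/2024")
storage_local <- resolve_pipeline_storage("~/ejscreen_pipeline/2024")
storage_forced <- resolve_pipeline_storage("s3://example-bucket/x", "local")
```

The actual wrapper may validate or normalize paths differently; the sketch only shows the prefix test stated in the argument description.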
Value
named list containing final datasets (blockgroupstats, bgej,
usastats, and statestats) plus interim stages when
return_intermediate is TRUE. Attributes record pipeline_dir,
stage_format, and saved stage paths.
Details
For routine annual, validation-only, exports-only, and release runs, prefer
the recipe scripts in data-raw/run_ejscreen_pipeline_*.R, or build a
validated config with pipeline_config_annual() and pass it to
run_ejscreen_pipeline(). The long-standing
data-raw/run_ejscreen_dataset_pipeline.R file remains available as an
environment-variable compatibility runner for older source() workflows.
calc_ejscreen_dataset() is a high-level wrapper around the staged
annual update helpers. It is intentionally an orchestrator rather than a
replacement for the individual stage functions. Each major input or output can
be supplied as an R object, read from a saved stage in pipeline_dir, or
created and saved by this function.
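The "supplied object, saved stage, or create and save" pattern can be pictured with a small base-R sketch. resolve_stage() is a hypothetical helper written for this illustration, not an EJAM function. The file layout (one CSV per stage named after the stage) and the precedence order are assumptions based on the argument descriptions.

```r
# Hypothetical helper (not part of EJAM): resolve one stage by trying, in order,
# a supplied object, a saved stage file, and finally a function that creates it.
resolve_stage <- function(object = NULL, stage_name, pipeline_dir,
                          create_fun, use_saved_stages = TRUE,
                          save_stages = FALSE) {
  if (!is.null(object)) {
    return(object)
  }
  # Assumed file layout: <pipeline_dir>/<stage_name>.csv
  path <- file.path(pipeline_dir, paste0(stage_name, ".csv"))
  if (use_saved_stages && file.exists(path)) {
    # Read the ID column as text so leading zeros in FIPS codes survive.
    return(utils::read.csv(path, colClasses = c(bgfips = "character")))
  }
  out <- create_fun()
  if (save_stages) {
    utils::write.csv(out, path, row.names = FALSE)
  }
  out
}

# Toy usage in a temporary folder; the values are made up.
demo_dir <- tempfile("pipeline_")
dir.create(demo_dir)
make_geodata <- function() {
  data.frame(bgfips = "010010201001", arealand = 1000, areawater = 0)
}
# First call creates and saves the stage; second call reads the saved file.
first <- resolve_stage(NULL, "bg_geodata", demo_dir, make_geodata, save_stages = TRUE)
second <- resolve_stage(NULL, "bg_geodata", demo_dir, function() stop("not called"))
```

Reading the ID column as character is a design choice worth keeping in any CSV-based checkpoint, since blockgroup FIPS codes can start with zero.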
The default stage order is:
- download raw ACS tables of demographic data into bg_acs_raw
- when enabled, append Island Areas AS/GU/MP/VI placeholder rows using the archived EPA EJScreen ACS2022 reference for row IDs, area fields, and available environmental fields; optional Island Areas Census DHC demographics can still be saved/reviewed separately but are not used downstream unless explicitly requested
- calculate ACS-based demographic indicators (and lead paint indicator) as bg_acsdata
- validate/save bg_envirodata (key environmental indicators)
- validate/save bg_extra_indicators (e.g., % low life expectancy)
- create or validate bg_geodata, the Census/TIGER blockgroup geography attributes used for arealand, areawater, and internal-point fields
- calculate demographic indexes (using % low life expectancy, etc.)
- combine those blockgroup demog., envt., extra, and geography indicators as blockgroupstats
- create intermediate percentile lookup tables usastats_acs, statestats_acs, usastats_envirodata, statestats_envirodata
- calculate EJ indexes (from envt. percentiles and demog. indexes) and save as bgej table
- create intermediate percentile lookup tables usastats_ej, statestats_ej
- combine those as usastats and statestats
- create an EJScreen-ready export file and/or EPA Python dataset-creator input file (optionally)
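To make the percentile and EJ-index steps concrete, here is a toy base-R sketch. It assumes the commonly described form in which an EJ index is an environmental percentile multiplied by a demographic index, and it uses a simple empirical percentile. The column names and values are invented, and calc_bgej() and the package's lookup tables may define percentiles and indexes differently.

```r
# Toy blockgroup table; all IDs and values are made up for illustration.
bg <- data.frame(
  bgfips = c("A", "B", "C", "D"),
  pm25 = c(6, 8, 10, 12),
  demog_index = c(0.10, 0.40, 0.20, 0.80)
)

# Simple empirical percentile: percent of blockgroups with a value
# less than or equal to this one (an assumption, not the package method).
pctile <- function(x) 100 * stats::ecdf(x)(x)

bg$pctile_pm25 <- pctile(bg$pm25)

# Assumed EJ index form: environmental percentile times demographic index.
bg$ej_pm25 <- bg$pctile_pm25 * bg$demog_index

print(bg)
```

In the real pipeline the percentile lookups are built nationally and by state, which is why both usastats and statestats variants of each table appear in the stage list.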
bg_envirodata must include pctpre1960. That column may be produced by an
upstream environmental-data step that reads the saved bg_acsdata stage.
For EJAM v3, Island Areas are supported at the blockgroup dataset,
EJSCREEN export, and map-data visibility level when
include_islandareas_data = TRUE. The default path keeps AS/GU/MP/VI
demographic fields as NA, uses archived EPA EJScreen reference rows for
row IDs and available environmental/area fields, and does not add AS/GU/MP/VI
blocks to the block helper files. Radius/buffer analyses in those areas
should therefore return no-data results rather than block-weighted estimates.
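A small base-R sketch of what "placeholder rows with NA demographics" looks like in practice. The blockgroup IDs, column names, and values below are invented for illustration, and the way the package actually builds these rows may differ.

```r
# Toy ACS-derived rows (values are made up).
bg_acsdata <- data.frame(
  bgfips = c("010010201001", "010010201002"),
  pop = c(1200, 800),
  pctlowinc = c(0.25, 0.40),
  stringsAsFactors = FALSE
)

# Hypothetical Island Areas reference rows: IDs only, no demographics.
islandareas_reference <- data.frame(
  bgfips = c("600109501001", "660109501001"),
  stringsAsFactors = FALSE
)

# Placeholder rows carry the reference IDs and NA for every demographic field.
placeholder <- data.frame(
  bgfips = islandareas_reference$bgfips,
  pop = NA_real_,
  pctlowinc = NA_real_,
  stringsAsFactors = FALSE
)

combined <- rbind(bg_acsdata, placeholder)

# Summaries must handle the NA rows explicitly.
mean_pctlowinc <- mean(combined$pctlowinc, na.rm = TRUE)
```

Keeping the fields as NA rather than zero lets downstream code distinguish "no data" from a true zero.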
The annual pipeline creates the bgej stage, and the package-level dynamic
Arrow loader obtains bgej.arrow from the ejamdata release tag recorded in
DESCRIPTION as ejamdata_required_tag. For an EJAM v3.YYYY.0 release this is
the matching v3.YYYY.0 tag, but the package version and required data tag can differ for patch
releases. dataload_dynamic() and download_latest_arrow_data() do not use
whichever data-repository release GitHub currently marks as latest.
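The relationship between package version and required data tag can be checked by reading the DESCRIPTION field. The sketch below writes a toy DESCRIPTION file so it runs without EJAM installed. The version and tag values are made up, chosen to show a patch release whose data tag stays at the .0 release.

```r
# Toy DESCRIPTION; the version and tag values are invented for illustration.
desc_path <- tempfile("DESCRIPTION_")
writeLines(c(
  "Package: EJAM",
  "Version: 3.2024.1",
  "ejamdata_required_tag: v3.2024.0"
), desc_path)

# read.dcf() returns a character matrix with one column per requested field.
desc <- read.dcf(desc_path, fields = c("Version", "ejamdata_required_tag"))
pkg_version <- desc[1, "Version"]
data_tag <- desc[1, "ejamdata_required_tag"]

message("Package version ", pkg_version, " expects data release ", data_tag)
```

For an installed copy of the package, utils::packageDescription("EJAM") should expose the same field, assuming it is stored as an ordinary DESCRIPTION entry as the text above indicates.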
Recipe config helpers are the preferred interface for maintainers. The compatibility runner can still use several settings stored as environment variables:
EJAM_PIPELINE_YR
EJAM_PIPELINE_DIR: override output folder.
EJAM_PIPELINE_STORAGE: auto, local, or s3. auto treats s3:// paths as S3.
EJAM_STAGE_FORMAT: primary stage format used for loading, usually csv.
EJAM_STAGE_FORMATS: comma-separated formats saved by the runner, usually csv,rda.
EJAM_BLOCKGROUP_UNIVERSE_SOURCE: acs or union. acs is recommended.
EJAM_TRACT_WEIGHT_SOURCE: decennial2020 or acs. decennial2020 matches legacy EJSCREEN tract-to-blockgroup apportionment.
AWS_PROFILE and AWS_REGION: used when pipeline_storage is s3
CENSUS_API_KEY: used by functions that download ACS data (or that download boundaries/shapefiles for FIPS from some sources)
EJAM_FORCE_ACS: TRUE to redownload/recalculate raw ACS and bg_acsdata.
EJAM_FORCE_BG_ACSDATA: TRUE to rebuild bg_acsdata from saved raw ACS.
EJAM_FORCE_BG_GEODATA: TRUE to redownload/recalculate Census/TIGER blockgroup geodata.
EJAM_ACS_DOWNLOAD_TIMEOUT
EJAM_ACS_DOWNLOAD_RETRIES
EJAM_INCLUDE_ISLANDAREAS_DATA: TRUE to save AS/GU/MP/VI rows. The annual/release runner enables this by default unless explicitly set otherwise.
EJAM_ISLANDAREAS_REFERENCE_PATH: archived EPA EJScreen reference CSV used for Island Areas row IDs, area fields, and available environmental fields.
EJAM_USE_ISLANDAREAS_DEMOGRAPHICS: TRUE only for an intentional mixed-source supplemental dataset using 2020 Island Areas Census DHC demographics in bg_acsdata.
EJAM_USE_PROVISIONAL_BG_ENVIRODATA: FALSE to require bg_envirodata.csv.
EJAM_BG_ENVIRODATA_REFERENCE_PATH and EJAM_BG_ENVIRODATA_REFERENCE_VARS: optional runner-only settings for deliberately creating or repairing a bg_envirodata source stage from an EJSCREEN-style reference CSV. Normal annual and replication runs should use corrected bg_envirodata as-is. Missing reference values are preserved as NA, not converted to zero. This matters for drinking water: EJAM versions after v2.32.8.001 should not convert missing drinking-water scores to zero unless the source explicitly reports zero.
EJAM_INCLUDE_EJSCREEN_EXPORT: TRUE to create ejscreen_export.csv.
EJAM_INCLUDE_EJSCREEN_EXPORT_STATEPCT: TRUE to create ejscreen_export_statepct.csv.
EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS: TRUE to create ejscreen_us_pctile_lookup.csv and ejscreen_state_pctile_lookup.csv. This is off by default because current EJScreen maps use the blockgroup exports that already contain percentile, bin, and popup fields, while reports are served through EJAM-API/EJAM.
EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT: TRUE to create the smaller input table expected by EPA's Python dataset-creator workflow.
EJAM_VALIDATE_VS_PRIOR and related EJAM_PRIOR_* settings control prior-version comparisons.
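A minimal sketch of setting a few of these variables for the compatibility runner in the current R session. The year and folder are example values only, and the source() call is left commented out because it needs the package source tree.

```r
# Example settings for the compatibility runner; year and folder are made up.
Sys.setenv(
  EJAM_PIPELINE_YR = "2024",
  EJAM_PIPELINE_DIR = file.path(tempdir(), "ejscreen_pipeline"),
  EJAM_PIPELINE_STORAGE = "auto",
  EJAM_STAGE_FORMAT = "csv",
  EJAM_BLOCKGROUP_UNIVERSE_SOURCE = "acs",
  EJAM_TRACT_WEIGHT_SOURCE = "decennial2020"
)

# Environment variables are always strings, so convert before numeric use.
yr <- as.integer(Sys.getenv("EJAM_PIPELINE_YR"))

# Then, from the package source folder:
# source("data-raw/run_ejscreen_dataset_pipeline.R")
```

Values set with Sys.setenv() last only for the current R session; settings that should persist can go in a .Renviron file instead.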
The annual runner also writes pipeline_run_manifest.csv, which records the
package version, Git branch/SHA, ACS vintage, run settings, and whether
provisional environmental or extra-indicator inputs were reused.
Census/TIGER geography can occasionally include valid blockgroup features
that are not present in the ACS summary-file tables for the same ACS vintage.
For example, a draft ACS 2020-2024 build found 39 Suffolk County, New York
blockgroups in TIGER geography but not in the relevant ACS blockgroup or
tract tables. The default blockgroup_universe_source = "acs" therefore
treats bg_acsdata as the authoritative final blockgroup universe and uses
bg_geodata only to annotate those rows.
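The difference between the two blockgroup_universe_source settings can be pictured as a left join versus a full join. This is a toy base-R sketch with invented IDs and values. It assumes "union" means every blockgroup found in either source, and the package's own implementation in calc_ejscreen_blockgroupstats() may differ in detail.

```r
# Toy inputs: blockgroup "G9" exists in the geography but not in the ACS tables.
bg_acsdata <- data.frame(bgfips = c("A1", "A2"), pop = c(1200, 800))
bg_geodata <- data.frame(
  bgfips = c("A1", "A2", "G9"),
  arealand = c(5e5, 7e5, 3e5)
)

# "acs": keep only ACS rows; geography just annotates them.
universe_acs <- merge(bg_acsdata, bg_geodata, by = "bgfips", all.x = TRUE)

# "union" (assumed meaning): keep every blockgroup found in either source.
universe_union <- merge(bg_acsdata, bg_geodata, by = "bgfips", all = TRUE)

# Geography-only blockgroups show up with missing ACS fields.
geo_only <- universe_union[is.na(universe_union$pop), "bgfips"]
```

With the "acs" setting, rows like "G9" never enter the final table, so no blockgroup ends up with area fields but no demographic data.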
To check the current values of the environment variables listed above:
print(
  cbind(current_setting = Sys.getenv(c(
    "EJAM_PIPELINE_YR",
    "EJAM_PIPELINE_DIR", "EJAM_PIPELINE_STORAGE",
    "EJAM_STAGE_FORMAT", "EJAM_STAGE_FORMATS",
    "EJAM_BLOCKGROUP_UNIVERSE_SOURCE",
    "AWS_PROFILE", "AWS_REGION",
    "CENSUS_API_KEY",
    "EJAM_FORCE_ACS", "EJAM_FORCE_BG_ACSDATA", "EJAM_FORCE_BG_GEODATA",
    "EJAM_ACS_DOWNLOAD_TIMEOUT", "EJAM_ACS_DOWNLOAD_RETRIES",
    "EJAM_INCLUDE_ISLANDAREAS_DATA", "EJAM_ISLANDAREAS_REFERENCE_PATH",
    "EJAM_USE_ISLANDAREAS_DEMOGRAPHICS",
    "EJAM_BG_ENVIRODATA_REFERENCE_PATH", "EJAM_BG_ENVIRODATA_REFERENCE_VARS",
    "EJAM_USE_PROVISIONAL_BG_ENVIRODATA",
    "EJAM_INCLUDE_EJSCREEN_EXPORT", "EJAM_INCLUDE_EJSCREEN_EXPORT_STATEPCT",
    "EJAM_INCLUDE_EJSCREEN_PCTILE_LOOKUP_EXPORTS",
    "EJAM_INCLUDE_EJSCREEN_DATASET_CREATOR_INPUT",
    "EJAM_VALIDATE_VS_PRIOR", "EJAM_PRIOR_PIPELINE_YR",
    "EJAM_PRIOR_PIPELINE_DIR", "EJAM_PRIOR_PACKAGE_REF"
  )))
)