Updating and Managing the Datasets Used by EJAM

The EJAM package and Shiny app use many data objects, including numerous datasets stored in the package’s /data/ folder and several large tables stored in a separate data repository. Those large tables contain information on Census block groups, Census block internal points, Census block population weights, and EPA FRS facilities.

How to Update Datasets in EJAM

The process begins from within the EJAM code repo. Historically, most data updates were coordinated from the overarching notes and script file data-raw/datacreate_0_UPDATE_ALL_DATASETS.R. That file is still useful as an index of older maintainer workflows, but it is no longer the primary path for the annual EJScreen-style blockgroup update. For blockgroupstats, usastats, statestats, bgej, and related annual pipeline checkpoints, use the staged pipeline documented in Updating EJScreen Datasets Annually (via the Pipeline). For other datasets, the focused datacreate_* scripts in data-raw/ remain the usual starting point. Documentation of datasets via /R/data_*.R files is generally handled by those same scripts while creating/updating the datasets.

The datacreate_0_UPDATE_ALL_DATASETS.R file covers not only the large Arrow datasets that are stored in a separate repository, but also many smaller data objects that are installed along with the package in the /data/ folder. Updating all the package’s data objects can be complicated because there are many of them, in different types, formats, and locations.

The data objects need to be updated at different frequencies – some only yearly, some as part of the broader EJSCREEN Annual Data Update of demographic, environmental, and other indicators, and others whenever facility IDs and locations change (ideally each time EPA’s FRS is updated). Some need to be updated only when the package’s features or code change, such as the important data object called map_headernames (which in turn is used to update objects such as names_e).

See the draft utility EJAM:::pkg_data() for a dataset inventory:

x <- EJAM:::pkg_data()

## Get more info with pkg_data(simple = FALSE)
## 
## ignoring sortbysize because simple=TRUE

x$Item[!grepl("names_|^test", x$Item)]

##  [1] "NAICS"                            "SIC"                             
##  [3] "avg.in.us"                        "bg_cenpop2020"                   
##  [5] "bgpts"                            "blockgroupstats"                 
##  [7] "censusplaces"                     "custom"                          
##  [9] "ejamdata_version"                 "ejampackages"                    
## [11] "ejscreen_arcgis_service_field"    "ejscreen_schema_extra"           
## [13] "epa_programs"                     "epa_programs_defined"            
## [15] "formulas_ejscreen_acs"            "formulas_ejscreen_acs_disability"
## [17] "formulas_ejscreen_demog_index"    "frsprogramcodes"                 
## [19] "high_pctiles_tied_with_min"       "islandareas"                     
## [21] "lat_alias"                        "lon_alias"                       
## [23] "mact_table"                       "map_headernames"                 
## [25] "meters_per_mile"                  "modelDoaggregate"                
## [27] "modelEjamit"                      "modelEjamitByAnalysisType"       
## [29] "naics_counts"                     "naicstable"                      
## [31] "namez"                            "sictable"                        
## [33] "stateinfo"                        "stateinfo2"                      
## [35] "states_shapefile"                 "statestats"                      
## [37] "tables_ejscreen_acs"              "usastats"                        
## [39] "x_anyother"

Where the datasets are stored

EJAM relies on datasets mostly stored in the package itself or in a separate, data-related repository:

Datasets stored within the EJAM package (.rda files): Documentation and access to package data files
Datasets used by EJAM but stored separately (large .arrow files): Documentation and access to the large data files as GitHub release assets

Why the large datasets are put into the data repository using piggyback instead of committed using Git

As explained in the documentation for the piggyback R package:

“Because larger (> 50 MB) data files cannot easily be committed to git, a different approach is required to manage data associated with an analysis in a GitHub repository. This package provides a simple work-around by allowing larger (up to 2 GB) data files to piggyback on a repository as assets attached to individual GitHub releases. These files are not handled by git in any way, but instead are uploaded, downloaded, or edited directly by calls through the GitHub API. These data files can be versioned manually by creating different releases. This approach works equally well with public or private repositories. Data can be uploaded and downloaded programmatically from scripts. No authentication is required to download data from public repositories.”

Key datasets

Some notable data files, code details, and other objects that may need to be changed ANNUALLY or more often:

Blockgroup Datasets (Demographic and Environmental Data): These include the datasets ?blockgroupstats, ?usastats, ?statestats, and ?bgej. The annual staged workflow for updating these ACS/EJScreen-style blockgroup datasets is now documented separately in Updating EJScreen Datasets Annually (via the Pipeline). That pipeline covers bg_acs_raw, bg_acsdata, optional Island Areas checkpoints, bg_envirodata, bg_extra_indicators, bg_geodata, blockgroupstats, bgej, usastats, statestats, ejscreen_export, ejscreen_export_statepct, optional EJScreen-style lookup exports (ejscreen_us_pctile_lookup and ejscreen_state_pctile_lookup), and the optional ejscreen_dataset_creator_input stage. This more general vignette focuses on the other datasets and storage/release mechanics used by EJAM. For EJAM v3, AS/GU/MP/VI are included at the blockgroup dataset, EJSCREEN export, and map-data visibility level with demographic fields kept as NA and partial EPA environmental fields where available. See Island Areas in EJAM v3 for the user-facing coverage notes and live EJSCREEN layer inventory observed in May 2026.
Block Datasets: The block (not blockgroup) tables might be updated less often, but Census FIPS codes do change yearly so the ?blockwts, ?blockpoints, ?quaddata, ?blockid2fips, and related additional data tables should be updated as needed. This is also done from within /data-raw/datacreate_0_UPDATE_ALL_DATASETS.R. See the census2020download package on GitHub for the function census2020_get_data() that may be useful.
Facilities Datasets for creating updated proximity scores each year: Facility (and roadway) locations for key types of sites were used once a year to update several environmental indicators that are proximity scores in EJSCREEN. The resulting environmental indicators are stored with EJAM, but these facility location datasets are not stored in EJAM. EJSCREEN obtains their locations for mapping purposes, via an API accessing hosted datasets with facility locations. In general, scripts for updating environmental indicators (including documentation of sources of facility location data, etc.) were stored by EPA. After 2025, new code for updating indicators may be found in this package’s data-raw/ folder or in related non-EPA source repositories. Proximity scores in EJSCREEN as of 2024-2026 were calculated based on the locations of these types of sites:
Major roadways (traffic)
Superfund NPL sites
Facilities with hazardous waste (TSDF)
Water bodies downstream of wastewater discharges
Risk management plan (RMP) facilities
Underground storage tanks (UST) (for a facility density indicator, similar to a proximity indicator)
Facilities Datasets for a user to specify places to analyze/report on:

Facility locations and categories are used in EJAM to help a user specify sets of EPA-regulated facilities or other types of sites to analyze and report on in EJSCREEN reports, using their NAICS/SIC/MACT/program information and coordinates. All of that information may need frequent updates because facilities open, close, relocate, or have their information corrected or otherwise updated. EPA’s FRS is the source for much of this information; it is updated frequently by EPA and is available via an API. Through at least v2.32.8, EJAM (and therefore the community reports in EJSCREEN) used a snapshot of the EPA FRS data rather than using an API to obtain the latest info on demand – that is something that could be changed in a future version. Facility-related info is stored in tables EJAM uses, such as these: ?frs, ?frs_by_programid, ?frs_by_naics, ?frs_by_sic, ?frs_by_mact, ?NAICS, ?SIC, ?naics_counts, ?naicstable, ?sictable, ?mact_table, ?epa_programs, and ?frsprogramcodes. These FRS, MACT, and Program info tables of EPA-relevant data have been updated in the EJAM package from scripts within /data-raw/datacreate_0_UPDATE_ALL_DATASETS.R. The ?NAICS, ?naicstable, and ?sictable objects (viewable using the naics_categories() and sic_categories() utilities) have no EPA-specific data, so they do not need frequent updates. The NAICS data object stores just the name of each NAICS code number, and new codes/names are published every five years, such as in 2017 and 2022, so a new version would typically be expected in 2027. The tables called ?SIC (unlike the NAICS table) and ?naics_counts (which has no analogous SIC version), however, contain counts of EPA FRS facilities, so they need updates when FRS data are updated.
The inconsistency in how NAICS vs SIC tables and the naics_counts table were named and defined was by historical accident, not intentional, so it would be OK if refactoring later made them consistent or even switched entirely to more frequent automated updates or even reliance on the FRS API.

?map_headernames stores critical metadata. This needs to be updated especially if indicator names change or are added. ?map_headernames holds most of the useful metadata about each variable (each indicator, like % low income) – e.g., how many digits to use in rounding, units, long and short indicator names, EJAM and EJScreen field names, the type or category of indicator, sort order to use in reports, and the method of calculating aggregations of the indicator over blockgroups. The editable source is now data-raw/map_headernames.csv. If metadata rows or values need to change, edit that CSV directly, then source data-raw/datacreate_map_headernames.R to validate the CSV and save data/map_headernames.rda. Older .xlsx workflows are obsolete and should not be used to regenerate this object.
Test data (inputs) and examples of outputs may have to be updated whenever parameters change or the returned outputs change. Those are generated by scripts/functions referred to from /data-raw/datacreate_0_UPDATE_ALL_DATASETS.R.
A default year is used in various functions, such as for the last year of the 5-year ACS dataset. These defaults like yr or year should be updated via global searches where relevant.
Metadata about vintage/version is stored in attributes of many datasets. That metadata is updated via scripts/functions that call helpers such as metadata_add(), metadata_add_and_use_this(), and metadata_check(), together with the file metadata_mapping.R. For staged EJScreen annual outputs, the pipeline save helpers add the relevant metadata based on the requested pipeline year. After package data are replaced, run EJAM:::metadata_check() and EJAM:::metadata_check_print() to find stale attributes. Atomic name-vector objects such as many names_* datasets do not need metadata attributes.
Version numbering is recorded primarily in the DESCRIPTION file, release tags, and the NEWS file. The ejamdata_required_tag field in DESCRIPTION records which ejamdata release EJAM should use. The ejamdata_version.txt marker records which ejamdata release tag is actually saved in the local data folder.
Updating documentation - updates may be needed for the README, vignettes, and possibly examples in some functions in case updates to datasets alter how the examples would work.
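The idea of storing vintage/version metadata in attributes, and then checking for stale values, can be illustrated in base R. This is only a sketch: the helper names add_version_attr() and is_stale() are hypothetical, the data are made up, and only the attribute name ejam_package_version is taken from this document. The real helpers (metadata_add(), metadata_check()) may record more fields and behave differently.

```r
# Hypothetical helper: stamp a dataset with the package version it was built for.
add_version_attr <- function(x, version) {
  attr(x, "ejam_package_version") <- version
  x
}

# Hypothetical helper: TRUE when the attribute is missing or differs
# from the version that is currently expected.
is_stale <- function(x, current_version) {
  v <- attr(x, "ejam_package_version", exact = TRUE)
  is.null(v) || !identical(v, current_version)
}

# Made-up toy dataset standing in for a package data object.
toy <- data.frame(bgid = 1:3, pctlowinc = c(0.2, 0.4, 0.1))
toy <- add_version_attr(toy, "3.2022.0")

is_stale(toy, "3.2022.0")  # FALSE: the attribute matches
is_stale(toy, "3.2022.1")  # TRUE: the attribute is from an older version
```

A check like this is cheap to run over every object in /data/ after a data refresh, which is the role the document assigns to EJAM:::metadata_check().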

Again, for non-pipeline datasets it is useful to understand data-raw/datacreate_0_UPDATE_ALL_DATASETS.R, because that script still points to many older focused data-creation scripts. For annual EJScreen-style blockgroup outputs, use the pipeline vignette and runner script as the current maintainer workflow.

The information below focuses on the other type of data objects – the set of large arrow files that are stored outside the package code repository.

Repository that stores the large arrow file release assets

Several large data.table files are not installed as part of the R package in the typical /data/ folder that contains .rda files lazy-loaded by the package. Instead, they are kept as release assets in a separate GitHub repository that we refer to here as the data repository. The release assets are the authoritative copies used by installed EJAM packages; committed files in a repository data/ folder should not be treated as the source used by EJAM installs.

IMPORTANT: The name of the data repository (as distinct from the package code repository) must be recorded/updated in the EJAM package DESCRIPTION file, so that the package knows where to look for the data files if the datasets are moved to a new repository. The current data repository for the installed or loaded source version is https://github.com/Public-Environmental-Data-Partners/ejamdata, which can be checked with url_package(type = "data", get_full_url = TRUE).
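As a rough illustration of how a repository name stored in DESCRIPTION can be turned into a URL, the base R sketch below reads a custom field with read.dcf(). It assumes the field is called ejam_data_repo (the name used in the Development/Setup section of this document) and holds an owner/repository string; the toy DESCRIPTION file and the version in it are made up, and this is not necessarily how url_package() is implemented.

```r
# Write a toy DESCRIPTION file so the example is self-contained.
desc_path <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: EJAM",
  "Version: 3.2022.0",
  "ejam_data_repo: Public-Environmental-Data-Partners/ejamdata"
), desc_path)

# read.dcf() returns a character matrix with one column per requested field.
data_repo <- read.dcf(desc_path, fields = "ejam_data_repo")[1, "ejam_data_repo"]

# Build the full GitHub URL from the owner/repository string.
data_repo_url <- paste0("https://github.com/", unname(data_repo))
data_repo_url
```

If the datasets move to a new repository, only the DESCRIPTION field would need to change under this approach.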

Arrow Package and Arrow File Format

To store the large files needed by the EJAM package, we use the Apache arrow file format through the arrow R package, with file extension .arrow. This allows us to work with larger-than-memory data and store it outside of the EJAM package itself.

Earlier versions of EJAM used the .arrow filename more loosely. Current dynamic datasets should be real Arrow IPC files. For example, the object called frs_arrow is the Arrow-backed version of what had been called the ?frs dataset.

The names of these tables should be listed in R/arrow_ds_names.R and in the global variable called .arrow_ds_names, which is used by functions like dataload_dynamic() and dataload_from_local().

The Arrow files used by EJAM are grouped below by update schedule and then listed individually.

Arrow file update groups

Arrow files do not all change on the same schedule. Use these groups when planning updates:

Facility Data Updates include frs, frs_by_programid, frs_by_naics, frs_by_sic, and frs_by_mact. These may be refreshed when EPA FRS/facility data are updated.
EJSCREEN Annual Data Update currently includes bgej.arrow. It is calculated from the annual EJScreen/EJAM demographic and environmental pipeline and must match the installed package’s blockgroupstats, usastats, and statestats.
Blockgroup Geography Updates include bgid2fips and blockwts, and related .rda objects such as bgpts and bg_cenpop2020. These need review during each annual update and regeneration when blockgroup FIPS, EJAM bgid, internal points, or blockgroup-to-block relationships change.
Block Geography Updates include blockpoints, quaddata, and blockid2fips. These need regeneration only when block-level FIPS or block internal-point geography changes.

For EJAM v3, the block and blockgroup helper files are intentionally carried forward without Island Area blocks. They should not be used to promise point-buffer/radius or block-weighted polygon analysis for AS/GU/MP/VI; those analyses should return no-data results rather than block-weighted estimates.

Use EJAM:::dynamic_geography_arrow_report() to check whether the current blockgroup and block geography Arrow files are compatible with the installed blockgroupstats blockgroup universe.

Blockgroup and block-level arrow files

?bgid2fips.arrow: crosswalk of EJAM blockgroup IDs (1-n) with 12-digit blockgroup FIPS codes
?blockid2fips.arrow: crosswalk of EJAM block IDs (1-n) with 15-digit block FIPS codes
?blockpoints.arrow: Census block internal points lat-lon coordinates, EJAM block ID
?blockwts.arrow: Census block population weight as share of blockgroup population, EJAM block and blockgroup ID
?bgej.arrow: blockgroup-level statistics of EJ variables. This is part of the EJSCREEN Annual Data Update group and must match the package’s blockgroupstats
?quaddata.arrow: 3D spherical coordinates of Census block internal points, with EJAM block ID

?frs.arrow: data.table of EPA Facility Registry Service (FRS) regulated sites
?frs_by_naics.arrow: data.table of NAICS industry code(s) for each EPA-regulated site in Facility Registry Service
?frs_by_sic.arrow: data.table of SIC industry code(s) for each EPA-regulated site in Facility Registry Service
?frs_by_programid.arrow: data.table of Program System ID code(s) for each EPA-regulated site in the Facility Registry Service
?frs_by_mact.arrow: data.table of MACT NESHAP codes for sites, indicating the subpart(s) that categorize relevant EPA-regulated sites
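Two of the tables listed above hold derived values that are easy to illustrate. The base R sketch below shows, on made-up data, how a block population weight (a block's share of its blockgroup population) and 3D coordinates on a sphere could be computed. The column names (blockid, bgid, pop, blockwt), the unit radius, and the helper latlon_to_xyz() are assumptions for illustration, not the actual EJAM data-creation code.

```r
# Toy blocks: made-up populations and coordinates for five blocks
# in two blockgroups.
blocks <- data.frame(
  blockid = 1:5,
  bgid    = c(1, 1, 1, 2, 2),
  pop     = c(50, 30, 20, 10, 30),
  lat     = c(38.90, 38.91, 38.92, 40.00, 40.01),
  lon     = c(-77.03, -77.02, -77.01, -75.00, -75.01)
)

# Block weight = block population / total population of its blockgroup.
blocks$blockwt <- blocks$pop / ave(blocks$pop, blocks$bgid, FUN = sum)

# Hypothetical helper: convert lat/lon in degrees to 3D Cartesian
# coordinates on a sphere of the given radius.
latlon_to_xyz <- function(lat, lon, radius = 1) {
  lat_rad <- lat * pi / 180
  lon_rad <- lon * pi / 180
  data.frame(
    x = radius * cos(lat_rad) * cos(lon_rad),
    y = radius * cos(lat_rad) * sin(lon_rad),
    z = radius * sin(lat_rad)
  )
}

xyz <- latlon_to_xyz(blocks$lat, blocks$lon)

# The weights sum to 1 within each blockgroup.
tapply(blocks$blockwt, blocks$bgid, sum)
```

Storing x, y, z rather than lat/lon lets a spatial index compare straight-line distances between points, which is presumably why a table like quaddata exists alongside blockpoints.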

Development/Setup

The Arrow files are stored as release assets in a separate public GitHub repository (referred to here as ejamdata). The owner/repository name must be recorded/updated in the DESCRIPTION field called ejam_data_repo, which can be checked with url_package(type = "data", get_full_url = TRUE). EJAM uses that information to find the dynamic data files.
Any time the Arrow datasets are updated, create or update an ejamdata release and upload the .arrow files as release assets. Use the maintainer helper described below rather than relying on an automatic GitHub Actions workflow.
EJAM’s download_latest_arrow_data() function does the following:

Resolves the package-compatible ejamdata release tag from the DESCRIPTION field ejamdata_required_tag, unless a maintainer explicitly passes a different piggybacktag. This lets a patch release of EJAM keep using an earlier compatible ejamdata release if the Arrow files have not changed.
Checks the user’s locally installed Arrow data release tag, which is stored in data/ejamdata_version.txt.
If the data/ejamdata_version.txt file doesn’t exist, for example on the first EJAM install, it will be created at the end of the script.
If the versions are different, downloads Arrow files from the matching ejamdata release with piggyback::pb_download().
When dataload_dynamic("bgej") loads bgej, the local bgej.arrow must also match the installed package’s blockgroupstats; if it does not, EJAM tries to replace it from the package-compatible ejamdata release tag. See how this function works for details:

download_latest_arrow_data()

EJAM calls this logic from the attach/startup path through dataload_dynamic() so the needed Arrow files are available when a user loads EJAM or starts the app.
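The tag comparison described in the steps above can be sketched in a few lines of base R. This is a simplified illustration, not the source of download_latest_arrow_data(): the function names read_local_tag() and sync_arrow_files() are hypothetical, and the download step is passed in as a function so the example can run with a stub instead of contacting GitHub.

```r
# Hypothetical helper: read the locally recorded release tag, or NA if
# the marker file is missing or empty (for example on a first install).
read_local_tag <- function(marker_path) {
  if (!file.exists(marker_path)) return(NA_character_)
  x <- readLines(marker_path, n = 1, warn = FALSE)
  if (length(x) == 0) NA_character_ else trimws(x)
}

# Hypothetical sketch of the sync decision: download only when the local
# tag differs from the required tag, then record the new tag.
sync_arrow_files <- function(required_tag, marker_path, download_fun) {
  local_tag <- read_local_tag(marker_path)
  if (!is.na(local_tag) && identical(local_tag, required_tag)) {
    return(invisible(FALSE))  # local files already match; nothing to do
  }
  download_fun(required_tag)             # fetch assets from the matching release
  writeLines(required_tag, marker_path)  # record which tag is now stored locally
  invisible(TRUE)
}

# A real download step could wrap piggyback::pb_download(), for example:
# download_fun <- function(tag) {
#   piggyback::pb_download(
#     repo = "Public-Environmental-Data-Partners/ejamdata",
#     tag  = tag,
#     dest = "path/to/local/data/folder"  # placeholder path
#   )
# }

# Demonstration with a stub in place of a real download.
marker <- tempfile(fileext = ".txt")
downloaded <- character(0)
stub <- function(tag) downloaded <<- c(downloaded, tag)

first  <- sync_arrow_files("v3.2022.0", marker, stub)  # TRUE: no marker yet
second <- sync_arrow_files("v3.2022.0", marker, stub)  # FALSE: tags match
```

Writing the marker only after the download step returns means a failed download leaves the old tag in place, so the next attempt will try again.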

How it Works for the User

User installs EJAM

pak::pkg_install("Public-Environmental-Data-Partners/EJAM") (or as adjusted depending on the actual repository owner and name)

User loads EJAM as usual

library(EJAM). This triggers the dynamic-data checks needed for startup.

User runs EJAM as usual

The dataload_dynamic() function will work as usual because the needed .arrow files are cached locally after they are downloaded.

How New Versions of Arrow Datasets Are Republished / Released

First, create the key Arrow files locally or from the relevant pipeline output, as explained above.

In future updates, these files may be published by the update pipeline or a related script such as run_arrow_publish_v2.5.0.R. The information below describes the functions to use if an update is done manually.

As mentioned above, we use the piggyback package to place large datasets in the assets of a new release on the https://github.com/Public-Environmental-Data-Partners/ejamdata repository, rather than committing them with Git. The current maintainer path is to call datasets_arrow_publish() with explicit local .arrow file paths.

The helper is intentionally conservative. It defaults to dry_run = TRUE, overwrite = FALSE, and mark_latest = FALSE. The default release note is "Updated datasets for EJScreen/EJAM updated as of " plus the release_date parameter.

Make sure the intended new data objects are available as .arrow files. For an annual EJSCREEN data release, bgej.arrow is the critical package-coupled asset and must match the package version/release tag. Facility and geography Arrow files may be carried forward unchanged if they are still compatible. If block or blockgroup helper Arrow files such as blockwts.arrow, blockpoints.arrow, blockid2fips.arrow, bgid2fips.arrow, or quaddata.arrow are intentionally regenerated in a future geography update, publish those files with the same helper after a dry-run review. For v3, Island Areas are handled only at the blockgroup dataset/export/map-data level; Island Area blocks are not added to the block-helper universe for this release path.

Example dry-run for a manual publish: (also see run_arrow_publish_v2.5.0.R to publish all .arrow files in 1 step)

release_number <- EJAM:::ejamdata_required_tag()
new_datasets_folder <- "path/to/folder/of/new/arrow/files"
filepaths_arrow <- file.path(new_datasets_folder, "bgej.arrow")

EJAM:::datasets_arrow_publish(
  files = filepaths_arrow,
  tag = release_number,
  release_date = Sys.Date(),
  dry_run = TRUE,
  overwrite = FALSE,
  mark_latest = FALSE
)

After reviewing the dry-run output and the intended release tag, rerun with dry_run = FALSE only when ready to create/update the release assets. Use overwrite = TRUE only after confirming existing assets with the same names should be replaced. Use mark_latest = TRUE only when this release should be shown by GitHub as the latest release.

Open a browser to confirm that the release assets are there.

browseURL(paste0(EJAM:::url_package("data", get_full_url = TRUE), "/releases"))

Reload EJAM so it can get the updates. It should detect that new versions are available and cache them for the installed package.

rm(list=ls())
require(EJAM)

# Confirm they can all be opened, either as Arrow-backed objects:
dataload_dynamic("all", return_data_table = FALSE)
# or read into memory as data.table/data.frame objects:
dataload_dynamic("all", return_data_table = TRUE)

Every release must contain all 11 Arrow files

A release’s assets are self-contained: dataload_dynamic() (via piggyback::pb_download()) pulls each Arrow file from the single release tagged in ejamdata_required_tag, so a missing asset breaks loading. Upload all 11 files to every v3.YYYY.0 release, even when most are unchanged:

vintage-specific: bgej.arrow;
geography (Census 2020, unchanged between ACS vintages): blockwts.arrow, blockpoints.arrow, quaddata.arrow, bgid2fips.arrow, blockid2fips.arrow;
facilities (FRS): frs.arrow, frs_by_programid.arrow, frs_by_naics.arrow, frs_by_sic.arrow, frs_by_mact.arrow.

FRS is intentionally not refreshed for an ACS-vintage release; its files carry over unchanged (only their ejam_package_version metadata is bumped) and are re-published under the new tag. Confirm the file names match paste0(EJAM:::.arrow_ds_names, ".arrow") before publishing.

Three of these carry the bgid join key — bgej, blockwts, and bgid2fips — which must be stored as double (see Annual EJScreen/ACS dataset updates, “the bgid type must be double”); the other eight have no bgid.
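The file-name check described above can be done before publishing with a short base R script. In this sketch the 11 names are typed out from the list in this section (in EJAM they would come from EJAM:::.arrow_ds_names), and check_release_folder() is a hypothetical helper; the demonstration folder is temporary and holds empty placeholder files.

```r
# The 11 dataset names listed in this section.
expected_names <- c(
  "bgej",
  "blockwts", "blockpoints", "quaddata", "bgid2fips", "blockid2fips",
  "frs", "frs_by_programid", "frs_by_naics", "frs_by_sic", "frs_by_mact"
)
expected_files <- paste0(expected_names, ".arrow")

# Hypothetical helper: report missing and unexpected .arrow files in a folder.
check_release_folder <- function(folder, expected = expected_files) {
  found <- list.files(folder, pattern = "\\.arrow$")
  list(
    missing    = setdiff(expected, found),
    unexpected = setdiff(found, expected)
  )
}

# Demonstration: a temporary folder with one file deliberately left out.
demo_folder <- tempfile("arrow_release_demo")
dir.create(demo_folder)
invisible(file.create(
  file.path(demo_folder, setdiff(expected_files, "quaddata.arrow"))
))

result <- check_release_folder(demo_folder)
result$missing     # "quaddata.arrow"
result$unexpected  # character(0)
```

Running a check like this against the folder passed to datasets_arrow_publish() catches an incomplete release before any asset is uploaded.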

Keep the four identifiers in sync

After publishing, set the local version marker and confirm all four identifiers match (all v3.YYYY.0):

writeLines("v3.YYYY.0", "data/ejamdata_version.txt")   # tracked marker
EJAM:::ejamdata_required_tag()                          # from DESCRIPTION; must equal the marker

git release tag = ejamdata release tag = DESCRIPTION ejamdata_required_tag = data/ejamdata_version.txt. A code-only patch release (for example v3.2022.1) is the exception: it keeps ejamdata_required_tag and the marker at the existing v3.2022.0, because the data did not change and the patch reuses the already-published ejamdata release.
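The rule and its patch-release exception can be written as a small check. This is a sketch under stated assumptions: all_same() is a hypothetical helper, and the tag values are examples taken from this document rather than values read from a real repository.

```r
# Hypothetical helper: TRUE when every supplied identifier is the same string.
all_same <- function(...) length(unique(c(...))) == 1

# Data release: all four identifiers agree.
data_release_ok <- all_same(
  git_tag      = "v3.2022.0",
  ejamdata_tag = "v3.2022.0",
  required_tag = "v3.2022.0",
  marker       = "v3.2022.0"
)

# Code-only patch release: the git tag moves ahead, so the four-way
# check fails by design ...
patch_all_four <- all_same("v3.2022.1", "v3.2022.0", "v3.2022.0", "v3.2022.0")

# ... but the three data identifiers must still agree with each other.
patch_data_ok <- all_same(
  ejamdata_tag = "v3.2022.0",
  required_tag = "v3.2022.0",
  marker       = "v3.2022.0"
)

c(data_release_ok, patch_all_four, patch_data_ok)  # TRUE FALSE TRUE
```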

Publishing the Arrow files had previously been handled with a GitHub Actions workflow that tried to use Git LFS. That automatic workflow is no longer used and should not be restored.

Bump the package version number

The package version (Version: 3.YYYY.x in DESCRIPTION) is recorded verbatim in several other files. When cutting a release, bump all of them together so the version shown in the app, the docs site, and the citation agree:

DESCRIPTION — Version:, plus the human-readable release fields VersionDate:, ReleaseDateEJAM:, VersionEJSCREEN:, ReleaseDateEJSCREEN:. (ejamdata_required_tag: is not a code version — see above; a code-only patch keeps it at the existing v3.YYYY.0.)
_pkgdown.yml — the footer components: (datefooter: and versionmsg:), shown on every docs page.
CITATION.cff — version: and date-released:. (inst/CITATION reads Version/VersionDate from DESCRIPTION at build time, so it needs no edit.)
inst/golem-config.yml — golem_version:.
NEWS.md — when publishing, retitle the top # EJAM 3.YYYY.x (unreleased) heading to the dated release heading (for example # EJAM 3.2022.1 (July 2026)).

README.md carries no hard-coded version (only a lifecycle badge), so it needs no edit. The deployed-API repo (EJAM-API) selects which EJAM version to build via the EJAM_VERSION build arg in its Dockerfile (with a matching mention in its README.md); bump that to the new tag (for example v3.2022.1) when redeploying the API. EJScreen has no EJAM-version string of its own — it reaches EJAM through the API URL — so it needs no version edit for an EJAM release.
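A quick consistency check across the files listed above might look like the following base R sketch. It is an illustration only: get_field() is a hypothetical helper that does simple line matching rather than real YAML or DCF parsing, and the file contents (including the date) are made-up stand-ins for the real files, with one mismatch planted on purpose.

```r
# Hypothetical helper: return the value after "key:" on the first line
# that starts with that key (ignoring leading spaces and quotes).
get_field <- function(lines, key) {
  pattern <- paste0("^[[:space:]]*", key, ":[[:space:]]*")
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  gsub("[\"']", "", trimws(sub(pattern, "", hit[1])))
}

# Toy file contents standing in for the real files.
description <- c("Package: EJAM", "Version: 3.2022.1")
citation    <- c("cff-version: 1.2.0", "version: 3.2022.1",
                 "date-released: 2026-07-01")
golem       <- c("default:", "  golem_version: 3.2022.0")  # planted mismatch

versions <- c(
  "DESCRIPTION"           = get_field(description, "Version"),
  "CITATION.cff"          = get_field(citation, "version"),
  "inst/golem-config.yml" = get_field(golem, "golem_version")
)

# Files whose version differs from the one in DESCRIPTION.
mismatched <- names(versions)[versions != versions[["DESCRIPTION"]]]
mismatched  # "inst/golem-config.yml"
```

With real files, the character vectors would come from readLines() on each path; _pkgdown.yml and NEWS.md embed the version inside longer strings and would need their own patterns.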

Potential Improvements

Making More of the Code More Arrow-Friendly

Problem: loading the data as tibbles/data frames takes a long time.

Solution: We may be able to modify more of our code to be more Arrow-friendly. This essentially keeps the analysis code as a sort of query, and only loads the results into memory when requested (e.g., via dplyr::collect()). This dramatically reduces memory usage, which would speed up processing times and avoid potential crashes resulting from insufficient memory. However, updating the code in all the places that load these datasets would take substantial effort.

Pros: processing efficiency and significantly reduced memory usage.

Implementation: This has been enabled by the dataload_dynamic() function, which contains a return_data_table parameter. If FALSE, the Arrow file is opened as an Arrow-backed object rather than read fully into a data.table/data.frame.
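The query-then-collect pattern can be tried on a small made-up table with the arrow and dplyr packages. This sketch does not call dataload_dynamic(); it writes a toy Arrow IPC file, opens it as an Arrow Table with arrow::read_feather(as_data_frame = FALSE), and pulls only the filtered rows into R. The column names and values are invented, and the code is skipped if either package is not installed.

```r
if (requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {

  # Made-up stand-in for a large table.
  toy <- data.frame(
    bgid      = as.numeric(1:6),
    pctlowinc = c(0.10, 0.45, 0.30, 0.60, 0.05, 0.50)
  )
  path <- tempfile(fileext = ".arrow")
  arrow::write_feather(toy, path)  # Feather V2 is the Arrow IPC file format

  # Open as an Arrow Table rather than an R data.frame.
  tbl <- arrow::read_feather(path, as_data_frame = FALSE)

  # These verbs build up a query; no rows are converted to R objects yet.
  query <- dplyr::filter(tbl, pctlowinc > 0.4)
  query <- dplyr::select(query, bgid, pctlowinc)

  # collect() runs the query and returns only the matching rows to R.
  result <- dplyr::collect(query)
  print(result)
}
```

Only the three matching rows are materialized as an R data frame here; with tables of millions of blocks, that difference is where the memory savings would come from.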