Census
Overview
We store annual data from the national Scouting Census for sections and provide derived metrics for reporting.
Data is imported from JSON files (one per section/year) and persisted as SectionCensusReturn
records with database-generated fields for efficient querying.
Data model
SectionCensusReturn
- section: FK to
hierarchy.Section
- year: Positive integer (e.g. 2020)
- data_format_version: Small integer, defaults to V1
- data: JSON payload from the census export
- Computed fields (database-generated):
- annual_subs_cost: Decimal(6, 2). Extracted from
data["annual_cost"]
and cast to decimal - total_volunteers: Integer. Sum of volunteer counts by regex
- total_young_people: Integer. Sum of young people counts by regex
- ratio_young_people_to_volunteers: Decimal(6, 2). Rounded to 2dp
- annual_subs_cost: Decimal(6, 2). Extracted from
Ordering defaults to newest first (-year
) and there is an index on year
for filtering.
Expected JSON structure for import (V1)
Each file represents one section/year. Minimal required keys:
reg_no
: Section shortcode (matchesSection.shortcode
)year
: Year as string or integer (e.g. "2020")data
: Object containing numeric string values for counts and other fields- Optional
file
: Original source filename (not used for import logic)
Example:
{
"file": "447607-2020.html",
"reg_no": "S10042555",
"year": "2020",
"data": {
"annual_cost": "100",
"y_4_m": "7",
"y_4_f": "6",
"y_4_p": "1",
"y_4_s": "0",
"l_asl_m": "5",
"l_sl_f": "4",
"l_asl_p": "2",
"l_sl_s": "1"
}
}
Notes:
- The importer requires data
, reg_no
, and year
.
- data
must be a JSON object. Non-object types will be rejected.
Importing data
Use the management command to import a directory of JSON files:
poetry run ./manage.py import_census_returns /path/to/json_dir \
--dry-run # optional: report actions only
--fail-fast # optional: stop on first error
--format-version 1 # optional: defaults to 1
Behaviour:
- Files are processed in sorted order (glob *.json
).
- Upserts by (section, year)
using update_or_create
.
- --dry-run
prints CREATE/UPDATE without writing.
- --fail-fast
raises on first error; otherwise logs and continues.
Computed fields and regexes
Two PostgreSQL functions are installed via migration:
- j_sum_by_regex_key(j jsonb, regex text) -> int
- Sums values of keys matching regex
at the top level of j
.
- Only numeric string values are included (^[0-9]+$
); others are ignored.
- ratio(n int, d int) -> numeric
- Returns ROUND(n / NULLIF(d, 0), 2)
; yields 0
when d
is 0
.
Derived field definitions:
- total_young_people: ^y_[0-9]+_(m|f|p|s)$
- total_volunteers: ^l_[a-z]+_(m|f|p|s|xm|xf|xp|xs)$
- annual_subs_cost: data["annual_cost"]
cast to decimal
- ratio_young_people_to_volunteers: ratio(total_young_people, total_volunteers)
Error handling and validation
- Missing required keys (
data
,reg_no
,year
) cause a failure for that file. data
must be a dictionary; otherwise the file is rejected.- If
reg_no
does not match a uniqueSection.shortcode
, the file is rejected. - Non-numeric values for matching
y_*
orl_*
keys are ignored in sums (they do not break the import).
Database requirements
- PostgreSQL is required for the custom SQL functions and generated columns used by this feature.
- Running migrations installs the functions and generated fields.
Notes on JSON source
The JSON files are produced by a separate process that extracts census information. They must adhere to the structure described above. If additional keys or formats are introduced in the future, bump data_format_version
and extend the model/regexes accordingly.
If you have a repository or script that generates these files, document its output schema to ensure compatibility (e.g. how volunteer and young people counts are keyed, and where subscription cost is stored).
JSON source and generation (reference)
Locally, JSON files can be generated using the separate census-processing
workspace:
- Scripts:
- scrape_returns.py
: fetches raw census HTML/inputs
- parse_data.py
: parses and normalises into JSON files under data/processed-returns/
- Output naming: data/processed-returns/{reg_no}-{year}.json
- Each output contains the top-level keys: data
, file
, reg_no
, year
(confirmed via samples)
To import those locally generated files into Salute:
poetry run ./manage.py import_census_returns /Users/dan/Code/scouts/census-investigation/data/processed-returns --dry-run
This will validate structure and report which records would be created/updated. Remove --dry-run
to persist changes.
How HTML is processed into JSON
The JSON files are generated from the official census website’s HTML using a two-step process in the census-investigation
workspace.
1) Scraping approved returns
- Script: scrape_returns.py
- Logs in using CENSUS_USERNAME
and CENSUS_PASSWORD
.
- Discovers section "census IDs" from group index pages.
- Fetches the detailed section page for each year and only saves those with status Approved
to data/returns/{sectionId}-{year}.html
.
2) Parsing HTML into normalised JSON
- Script: parse_data.py
- Extracts year
and reg_no
(section shortcode) from the HTML header using a regex like (20\d{2}) - Reg no: (S\d+)
.
- Scans the page for elements with both data-key
and data-value
attributes.
- Filters out unwanted keys:
- By prefix: school_*
, approve_for_*
, approve*
, contact*
- By name: meetingplace
, records
, records_other
, section_type
, dow
- Validates there are no duplicate keys.
- Writes data/processed-returns/{reg_no}-{year}.json
with shape:
- file
: original HTML filename
- year
: e.g. "2023"
- reg_no
: e.g. "S10012345"
- data
: dict of key/value pairs (all values are strings in the source)
This is the directory you pass to the Salute import command.
Key naming and semantics
Counts for young people and volunteers follow a compact naming convention in the data
object. Values are numeric strings:
- Young people keys:
y_{age}_{suffix}
age
: numeric age, e.g.4
,5
,6
, …suffix
: one ofm
,f
,p
,s
m
: malef
: femalep
: prefer not to say / prefer to self-describes
: self-described / not specified
-
Examples:
y_4_m
,y_5_f
,y_7_p
,y_6_s
-
Volunteer keys:
l_{role}_{suffix}
role
: a short role identifier made of lowercase letters, e.g.asl
,sl
,dg
,yl
,sa
suffix
for volunteers includes extended forms for “prefer not to say/self-described”:- Basic:
m
,f
,p
,s
- Recorded elsewhere too:
xm
,xf
,xp
,xs
- Basic:
- Examples:
l_sl_m
,l_asl_f
,l_dg_p
,l_yl_xf
Salute computes derived fields from these keys using database functions:
- total_young_people
sums keys matching regex: ^y_[0-9]+_(m|f|p|s)$
- total_volunteers
sums keys matching regex: ^l_[a-z]+_(m|f|p|s|xm|xf|xp|xs)$
- Non-numeric values among matching keys are ignored for sums.
Other notable keys:
- annual_cost
: subscription amount for the year (string). Stored as decimal in Salute’s annual_subs_cost
generated column.
- Additional keys may exist for disabilities (dis_*
), relationships (rel_*
), etc., which are currently stored raw under data
and not aggregated.
If the naming scheme changes in future exports, adjust the regexes or bump data_format_version
to encode the new shape.