World Building Sizes
I am always excited when I see a post from Mark Litwintschik on Hacker News; I know that means I’m going to learn about a cool new GIS dataset. When I saw his post about The World’s 2.75B Buildings, I knew I was in for a treat.
The Dataset
Mark’s blog entry talks about the Global Building Atlas, a project from researchers at the Technical University of Munich to document every building on the surface of the earth. They developed an open-source pipeline to detect buildings from Planet Labs’ satellite imagery. I like this methodology because it has no regional bias; if this were a patchwork of national datasets, it might be simpler to create, but the data quality would vary depending on the country of origin.
Building sizes
Mark’s blog entry presented some summary statistics, which was nice, but I had my own questions I wanted to answer. The first one that came to mind was “how do building sizes vary across the world?” Since we know the footprint of each building (incredible!), we can estimate each building’s square footage.
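All of the processing below happens in DuckDB, and the unit handling is worth spelling out up front: ST_Area_Spheroid returns areas in square meters, and multiplying by 10.7639 converts to square feet. A minimal sketch of the idea with a made-up footprint polygon (the real queries below read the dataset’s geometries):
SELECT ST_Area_Spheroid(
    -- coordinates given latitude-first, which is what ST_Area_Spheroid expects (more on that later)
    ST_GeomFromText('POLYGON((44.98 -93.27, 44.9801 -93.27, 44.9801 -93.2699, 44.98 -93.2699, 44.98 -93.27))')
) * 10.7639 AS approx_sqft;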
Processing steps
I started by re-creating the density plot from the original article:
CREATE OR REPLACE TABLE step1 AS
SELECT H3_LATLNG_TO_CELL(
bbox.ymin,
bbox.xmin, 4) AS h3_4,
COUNT(*) num_buildings
FROM READ_PARQUET('*.parquet')
WHERE bbox.xmin BETWEEN -178.5 AND 178.5
GROUP BY 1;
COPY (
SELECT ST_ASWKB(H3_CELL_TO_BOUNDARY_WKT(h3_4)::geometry) geometry,
num_buildings
FROM step1
) TO 'stats/step1.parquet' (
FORMAT 'PARQUET',
CODEC 'ZSTD',
COMPRESSION_LEVEL 22,
ROW_GROUP_SIZE 15000);
A couple of cool things about this query jumped out at me:
- It was disk-bound, and nearly CPU-bound too. This took about 20 minutes to run, with the data stored on a new SSD I recently purchased. It pulled data off the drive at roughly 350MB/s (~3Gbps), which sounded pretty good for a cheap SATA SSD. The CPU (a Ryzen 5900X) was also running at ~90% load. Glad to see I got my money’s worth out of those two items!
- This is the first time I’ve used H3 binning for anything. H3 is pretty cool! Great way to segment the world into roughly-even chunks at a variety of scales (there’s a quick example after this list).
- Truly incredible how many buildings there are in India.
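For anyone following along, these queries assume DuckDB’s spatial extension and the community h3 extension are installed and loaded. Roughly this, plus a quick taste of what H3_LATLNG_TO_CELL gives you at two different resolutions:
INSTALL spatial; LOAD spatial;
INSTALL h3 FROM community; LOAD h3;
-- the same point lands in a coarse cell at resolution 4 and a much finer one at resolution 9
SELECT H3_LATLNG_TO_CELL(40.7128, -74.0060, 4) AS res_4,
       H3_LATLNG_TO_CELL(40.7128, -74.0060, 9) AS res_9;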
So then I started computing building footprint areas, using roughly the same method:
CREATE OR REPLACE TABLE step2 AS
SELECT H3_LATLNG_TO_CELL(
bbox.ymin,
bbox.xmin, 7) AS h3_7,
COUNT(*) num_buildings,
MEDIAN(ST_Area_Spheroid(ST_FlipCoordinates(geometry)))*10.7639 AS median_sqft
FROM READ_PARQUET('*.parquet')
WHERE bbox.xmin BETWEEN -178.5 AND 178.5
GROUP BY 1;
COPY (
SELECT ST_ASWKB(H3_CELL_TO_BOUNDARY_WKT(h3_7)::geometry) geometry,
num_buildings,
median_sqft
FROM step2
) TO 'stats/step2.parquet' (
FORMAT 'PARQUET',
CODEC 'ZSTD',
COMPRESSION_LEVEL 22,
ROW_GROUP_SIZE 15000);
This requires ST_FlipCoordinates because ST_Area_Spheroid expects a latitude/longitude axis order, while the dataset’s geometries are stored longitude-first. I knew there was an issue when I only got results up to 90 degrees west or east, and the results seemed to vary strongly with longitude. Once I added the flip I got my results. Through my debugging cycle, I spent a lot of time waiting for the query to run. This one was CPU-bound; I was spending a lot of time computing Haversines or however DuckDB computes true spheroid area.
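You can see the axis-order issue with a toy footprint (made up, roughly a building-sized square, stored longitude-first like the dataset):
WITH b AS (
    SELECT ST_GeomFromText(
        'POLYGON((77.2 28.6, 77.2001 28.6, 77.2001 28.6001, 77.2 28.6001, 77.2 28.6))'
    ) AS geometry
)
SELECT ST_Area_Spheroid(geometry) AS unflipped_m2,                    -- longitude read as latitude; wrong area
       ST_Area_Spheroid(ST_FlipCoordinates(geometry)) AS flipped_m2   -- latitude-first, as the function expects
FROM b;
The two areas come out very different, which lines up with the weird longitude dependence I was seeing before the flip.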
So, the next thing was to compute the areas once and then aggregate later:
COPY (
SELECT H3_LATLNG_TO_CELL(
bbox.ymin,
bbox.xmin, 15) AS h3,
ST_Area_Spheroid(ST_FlipCoordinates(geometry))*10.7639 AS area_sqft
FROM READ_PARQUET('*.parquet')
WHERE bbox.xmin BETWEEN -178.5 AND 178.5
) TO 'stats/h3_prework.parquet' (
FORMAT 'PARQUET',
CODEC 'ZSTD',
COMPRESSION_LEVEL 22,
ROW_GROUP_SIZE 15000);
Now I had a 22GB parquet I could scan through and aggregate at different H3 levels. So aggregate I did:
COPY (
SELECT H3_CELL_TO_PARENT(h3, 5) AS h3,
COUNT(*) num_buildings,
MEDIAN(area_sqft) AS median_sqft,
ST_ASWKB(H3_CELL_TO_BOUNDARY_WKT(any_value(H3_CELL_TO_PARENT(h3, 5)))::geometry) AS geometry
FROM READ_PARQUET('stats/h3_prework.parquet')
GROUP BY 1
) TO 'stats/h3_5_stats.parquet' (
FORMAT 'PARQUET',
CODEC 'ZSTD',
COMPRESSION_LEVEL 22,
ROW_GROUP_SIZE 15000);
I also combined the processing and saving steps, hoping to limit intermediate products, since I almost ran out of memory every time this code ran.
I looked at doing this processing with polars. polars would have let me easily process multiple H3 depths with one scan of the data from the disk, which seemed more efficient. polars had most of what I needed, but polars-h3 wouldn’t make geometry data out of the H3 cells, so I couldn’t use it for this task.
Now I had a very cool dataset to show off. But how was I going to send it to people? My friends generally don’t have QGIS on their phones to view the results.
Packaging for the web
My next goal was to package this dataset for the web. Here’s what I wanted:
- Easy access for anyone of any skill level.
- Self-hosted.
- Slippy map.
After some searching/LLMing, I decided that some kind of vector tile map made the most sense. The map is a bunch of simple polygons, so saving raster images seemed wasteful. I also wanted to represent coarser H3 polygons when viewing the map from lower zoom levels (“far away”) and show more spatial detail at higher zoom levels (“closer in”). QGIS has “Scale Dependent Visibility”, which kinda did what I wanted, but seemed to work very slowly. Finally, there was so much data that QGIS was choking on the whole file, and I assumed a web browser would be worse, so some sort of tile implementation where things are loaded ad-hoc seemed to make the most sense.
Then I had to figure out how vector tiles work! I eventually settled on the following pipeline for generating the tiles:
- Generate the H3 polygons for each zoom level in FlatGeobuf format:
COPY (
SELECT COUNT(*) num_buildings,
MEDIAN(area_sqft) AS median_sqft,
ST_ASWKB(H3_CELL_TO_BOUNDARY_WKT(any_value(H3_CELL_TO_PARENT(h3, 8)))::geometry) AS geometry
FROM READ_PARQUET('stats/h3_prework.parquet')
GROUP BY H3_CELL_TO_PARENT(h3, 8)
) TO 'stats/h3_8_stats.fgb' (
FORMAT gdal,
DRIVER 'FlatGeobuf',
SRS 'EPSG:4326');
- Use tippecanoe to generate a tileset for each layer individually, stored in mbtiles format. Note: I can’t just hand tippecanoe the finest layer and let it build the coarser zoom levels itself, because I’m plotting the median building area, not the average, and a median can’t be rolled up from smaller H3 cells into larger ones (see the small illustration after these commands). Could I have used average? Sure. I wanted to use median, so I was sticking with it :).
$ tippecanoe -z3 -o 3.mbtiles --coalesce-densest-as-needed --detect-longitude-wraparound h3_3_stats.fgb
$ tippecanoe -z4 -Z4 -o 4.mbtiles --coalesce-densest-as-needed h3_4_stats.fgb
$ tippecanoe -z5 -Z5 -o 5.mbtiles --coalesce-densest-as-needed h3_5_stats.fgb
$ tippecanoe -z6 -Z6 -o 6.mbtiles --coalesce-densest-as-needed h3_6_stats.fgb
$ tippecanoe -z7 -Z7 -o 7.mbtiles --coalesce-densest-as-needed h3_7_stats.fgb
$ tippecanoe -z8 -Z8 -o 8.mbtiles --coalesce-densest-as-needed h3_8_stats.fgb
- Then, combine the tiles together with tile-join:
tile-join -o sqft.mbtiles 3.mbtiles 4.mbtiles 5.mbtiles 6.mbtiles 7.mbtiles 8.mbtiles
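About that note on medians: here’s a tiny, made-up illustration of why they can’t be rolled up from child cells. Say one H3 cell holds buildings of 100, 110, and 120 sqft, and its neighbor holds a single 500 sqft building:
-- the true median over all four buildings
SELECT MEDIAN(sqft) AS true_median                -- 115
FROM (VALUES (100), (110), (120), (500)) AS buildings(sqft);
-- the median of the two cell medians (110 and 500) is a very different number
SELECT MEDIAN(cell_median) AS median_of_medians   -- 305
FROM (VALUES (110), (500)) AS cells(cell_median);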
This left me with a ~430MB mbtiles archive representing all the data. This gets hosted with mbtileserver and rendered on the page with deck.gl.
I used the viridis color map for the median building area. The darkest areas represent very small buildings, and the brightest represent a median footprint of 2,242 square feet. This does not account for the height (i.e. number of floors) of each building, nor does it account for basements, usable square footage, etc. This is just a satellite’s opinion of each building’s footprint area.
This is an interactive map, so scroll around:
Interesting tidbits
I found the difference in median building size between Mexico and the United States to be stark:
The median Mexican building is ~300-500 sqft, and the median US building is closer to ~1300-1500 sqft.
The difference between Shenzhen/Dongguan/Guangzhou and Jiangmen/Macao/Hong Kong is also interesting. I know nothing of the region, so I can’t comment on why this might be:
Conclusion
This was a fun exercise in exploring a new dataset and building maps for the web. I am once again very satisfied with DuckDB as a processing engine for geospatial work like this. I will keep a lookout for more datasets to investigate in the future! Please feel free to scroll around the map and look for other interesting patterns in building size.