Colab Pipeline: Spatial GNNs with city2graph and PyTorch Geometric for Location Intelligence

Spatial Graph Neural Networks with city2graph + PyTorch Geometric — a Colab-ready pipeline for location intelligence

TL;DR: A reproducible Colab pipeline converts OpenStreetMap POIs into PyTorch Geometric graphs using city2graph, trains a compact GraphSAGE classifier for POI function (food, retail, education, health), and visualizes embeddings on the map. Includes multiple proximity-graph constructions, a heterogeneous graph experiment, and practical notes for turning a prototype into production-ready location intelligence.

What you’ll get

A step-by-step pipeline from OSM → GeoDataFrame → PyG Data/HeteroData using city2graph and OSMnx.
Hands-on feature engineering: projected coordinates, local density, distance-to-street, and degree.
A two-layer GraphSAGE baseline and a heterogeneous GNN conversion with to_hetero.
Guidance on proximity graph choices (KNN, Delaunay, Gabriel, RNG, EMST, Waxman) and when to use them.
Practical deployment checklist and experiment ideas for business teams.

Why spatial graph neural networks matter for location intelligence

Cities are networks. Proximity, road access and clustering patterns tell stories about urban function that flat tabular features miss. Spatial graph neural networks let you encode those relationships directly: nodes are POIs, edges encode neighborhood wiring, and message passing reveals context-aware features useful for classification and embeddings. That makes GNNs especially valuable for retail site selection, service-desert detection, competitor analysis, and map enrichment where OSM tags are sparse or noisy.

High-level pipeline: what we build

Ingest POIs and the walkable street network via OSMnx (falling back to a synthetic clustered dataset when OSM is unavailable).
Engineer compact spatial features: projected coordinates (cx, cy), local_density (neighbors within 150 m), dist_street (distance to nearest street segment), and node degree.
Construct multiple proximity graphs (KNN, Delaunay, Gabriel, RNG, EMST, Waxman) and pick one for training (KNN k=8 in the baseline).
Convert GeoDataFrames to PyTorch Geometric Data (homogeneous) or HeteroData (split node types) using city2graph utilities.
Train a two-layer GraphSAGE classifier and inspect embeddings with PCA and map overlays.
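
The synthetic fallback in the first step can be approximated without any geospatial dependency. The sketch below is an assumption about what such a generator might look like, not the notebook's actual code: the function name make_synthetic_pois, the 120 m cluster spread, and the three-clusters-per-class layout are all hypothetical.

```python
import random

CLASSES = ["food", "retail", "education", "health"]

def make_synthetic_pois(n_points=700, radius_m=1100.0, spread_m=120.0, seed=42):
    """Return (x, y, label) tuples in a local metric frame centered on (0, 0).

    Hypothetical stand-in for the OSM fallback: each class gets three
    Gaussian clusters whose centers fall inside a square of half-width radius_m.
    """
    rng = random.Random(seed)
    centers = {
        label: [(rng.uniform(-radius_m, radius_m), rng.uniform(-radius_m, radius_m))
                for _ in range(3)]
        for label in CLASSES
    }
    pois = []
    for i in range(n_points):
        label = CLASSES[i % len(CLASSES)]  # cycle through classes for balance
        cx, cy = rng.choice(centers[label])
        pois.append((rng.gauss(cx, spread_m), rng.gauss(cy, spread_m), label))
    return pois

pois = make_synthetic_pois()
print(len(pois), "synthetic POIs; first label:", pois[0][2])
```

Because the generator owns its own random.Random(seed) instance, two calls with the same seed return identical points, which is what keeps the fallback reproducible.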

Key configuration / experiment defaults

Study area: Shibuya, Tokyo — CENTER = (35.6595, 139.7005), radius = 1,100 m
POI cap: 700 points (keeps the notebook Colab-friendly and reproducible)
Synthetic fallback seed: SEED = 42
Density radius: 150 m (for local_density)
Proximity graphs: KNN (k=8), Delaunay, Gabriel, RNG, EMST, Waxman (r0=150, beta=0.6)
Model: GraphSAGE — two SAGEConv layers, hidden size = 64, dropout = 0.3
Training: Adam (lr=0.01, weight_decay=5e-4), 200 epochs, train/val/test = 60/20/20 (seeded shuffle)
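
One way to keep these defaults in a single place is a small frozen dataclass. The values below are the ones listed above; the class name ExperimentConfig, its field names, and the seeded_split helper are hypothetical and only illustrate the 60/20/20 seeded shuffle.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    center: tuple = (35.6595, 139.7005)  # (lat, lon) of the Shibuya study area
    radius_m: float = 1100.0
    poi_cap: int = 700
    seed: int = 42
    density_radius_m: float = 150.0
    knn_k: int = 8
    waxman_r0: float = 150.0
    waxman_beta: float = 0.6
    hidden_dim: int = 64
    dropout: float = 0.3
    lr: float = 0.01
    weight_decay: float = 5e-4
    epochs: int = 200
    split_pct: tuple = (60, 20, 20)  # train / val / test percentages

    def split_sizes(self, n):
        """Node counts for train/val/test; any rounding remainder goes to test."""
        n_train = n * self.split_pct[0] // 100
        n_val = n * self.split_pct[1] // 100
        return n_train, n_val, n - n_train - n_val

def seeded_split(n, cfg):
    """Shuffle node indices with the config seed and cut them into three lists."""
    idx = list(range(n))
    random.Random(cfg.seed).shuffle(idx)
    n_train, n_val, _ = cfg.split_sizes(n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

cfg = ExperimentConfig()
train_idx, val_idx, test_idx = seeded_split(cfg.poi_cap, cfg)
print(len(train_idx), len(val_idx), len(test_idx))
```

Integer percentages avoid floating-point surprises when computing split sizes, and the frozen dataclass prevents a notebook cell from silently mutating a default mid-experiment.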

Definitions (plain language)

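GraphSAGE: a GNN that builds each node's embedding by aggregating its neighbors' features. Because it learns an aggregation function rather than a fixed embedding per node, it can embed nodes unseen during training (inductive tasks).
Delaunay: triangulates the points so that no point lies inside any triangle's circumcircle, which avoids skinny triangles; useful to mesh local neighborhoods.
Gabriel: connects two points only if the circle having them as its diameter contains no other point; a subgraph of the Delaunay triangulation that keeps only the more direct links.
RNG (Relative Neighborhood Graph): links two points only if no third point is closer to both of them than they are to each other; a subgraph of the Gabriel graph, so it favors even sparser, local connections.
EMST (Euclidean Minimum Spanning Tree): the tree that connects all points with the smallest total edge length (low density, backbone structure).
Waxman: a random graph in which the probability of an edge decays with distance, controlled by r0 (a distance scale) and beta (an overall edge-probability factor); adds longer-range edges stochastically.
to_hetero: PyG utility that converts a homogeneous PyG model into a heterogeneous one by creating per-node-type and per-edge-type parameters.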
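
The Gabriel and RNG definitions translate almost directly into code. The brute-force sketch below only makes the two emptiness tests concrete; it runs in O(n^3), uses hypothetical function names, and is not how city2graph builds these graphs.

```python
from itertools import combinations
from math import dist

def gabriel_edges(points):
    """Edges (i, j) whose diameter circle contains no other point."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        p, q = points[i], points[j]
        mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
        radius = dist(p, q) / 2
        others = (k for k in range(len(points)) if k not in (i, j))
        if all(dist(points[k], mid) >= radius for k in others):
            edges.append((i, j))
    return edges

def rng_edges(points):
    """Edges (i, j) with no third point closer to both i and j than d(i, j)."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = dist(points[i], points[j])
        others = (k for k in range(len(points)) if k not in (i, j))
        blocked = any(
            max(dist(points[i], points[k]), dist(points[j], points[k])) < d_ij
            for k in others
        )
        if not blocked:
            edges.append((i, j))
    return edges

# A triangle with a long base: Gabriel keeps the base, RNG drops it.
triangle = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
print("Gabriel:", gabriel_edges(triangle))
print("RNG:    ", rng_edges(triangle))
```

On this triangle the apex is closer to both base endpoints than they are to each other, so the RNG test rejects the base edge, while the apex lies outside the base's diameter circle, so the Gabriel test keeps it.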

Feature engineering that punches above its weight

Keeping features compact helps a prototype iterate faster and reduces overfitting on small city patches. The pipeline uses:

cx, cy: projected coordinates — more suitable than raw lat/lon because distances become Euclidean in meters.
local_density: number of neighboring POIs within 150 m, computed with scikit-learn’s NearestNeighbors.
dist_street: distance from the POI to the nearest walkable street segment (OSMnx gives street geometry).
degree: graph degree derived from the chosen proximity graph.

These continuous features are standardized with StandardScaler before training.
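
To make the feature definitions concrete, here is a dependency-free sketch of local_density, dist_street, and standardization. The pipeline itself uses scikit-learn's NearestNeighbors, OSMnx street geometries, and StandardScaler; the brute-force helpers below are hypothetical stand-ins that assume coordinates are already projected to meters and that streets are given as straight segments.

```python
from math import dist, hypot
from statistics import mean, pstdev

def local_density(points, radius_m=150.0):
    """Number of other points within radius_m of each point (brute force)."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius_m)
        for i, p in enumerate(points)
    ]

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b, all in projected meters."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def dist_street(points, segments):
    """Distance from each point to its nearest street segment."""
    return [min(point_segment_distance(p, a, b) for a, b in segments) for p in points]

def standardize(column):
    """Zero mean, unit variance, using the population standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in column]

points = [(0.0, 0.0), (100.0, 0.0), (400.0, 0.0)]
streets = [((0.0, 50.0), (500.0, 50.0))]
density = local_density(points)
print(density, dist_street(points, streets), standardize(density))
```

The brute-force density count is quadratic in the number of POIs, which is acceptable at 700 points but is the reason a spatial index is preferable on larger patches.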

Choosing a proximity graph — qualitative comparison

Different topologies encode different inductive biases. Rather than picking one blindly, experiment and reason about what relationships drive your labels.

Graph	Behavior	When to try
KNN (k=8)	Uniform local neighborhood; degree controlled by k. Robust baseline for classification.	Start here for balanced local context and connectivity.
Delaunay	Mesh-like connections, captures triangular neighborhood structure.	Good when local tessellation and face adjacency matter (e.g., urban blocks).
Gabriel / RNG	Sparser, conservative edges: two points are linked only when no third point lies between them.	Try when you want to avoid spurious long edges and emphasize closest neighbors.
EMST	Minimal backbone — keeps graph connected with few edges.	Useful for structural analyses; not typically enough alone for rich message passing.
Waxman	Stochastic long-range edges based on distance; can add weak global signals.	Use to add multi-scale connections if labels depend on broader context.
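
The contrast between a deterministic and a stochastic topology is easier to see side by side. The sketch below builds a symmetrized KNN edge list and a Waxman sample, then derives the degree feature. It is not city2graph's implementation: the function names are hypothetical, and the Waxman edge probability beta * exp(-d / r0) is an assumed form that should be checked against the library's documentation.

```python
import random
from math import dist, exp

def knn_edges(points, k=8):
    """Undirected k-nearest-neighbor edges as a sorted list of (i, j), i < j."""
    edges = set()
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: dist(p, points[j]))
        for j in order[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

def waxman_edges(points, r0=150.0, beta=0.6, seed=42):
    """Keep each pair with probability beta * exp(-d / r0) (assumed form)."""
    rng = random.Random(seed)
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if rng.random() < beta * exp(-dist(points[i], points[j]) / r0):
                edges.append((i, j))
    return edges

def degrees(n, edges):
    """Node degree from an undirected edge list."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

line = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
knn = knn_edges(line, k=1)
print(knn, degrees(len(line), knn))
print(waxman_edges(line, r0=5.0))
```

Note that symmetrizing KNN means a node's degree can exceed k when it is a popular neighbor, so degree still carries signal even though k is fixed.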

Training the GraphSAGE baseline

We build a two-layer GraphSAGE encoder with SAGEConv blocks and a small linear classifier. Key decisions that worked in the pipeline:

Hidden size 64 and dropout 0.3 strike a balance between expressivity and generalization for ~700 nodes.
Adam with lr=0.01 and weight_decay=5e-4; train for 200 epochs and checkpoint the state with best validation accuracy.
Evaluate with accuracy and macro-F1. For production you’ll also want per-class precision/recall and a confusion matrix to spot biases.
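
To show what the two SAGEConv layers compute, here is a NumPy-only forward pass with mean aggregation: each layer combines a node's own features with the mean of its neighbors' features. This is a sketch rather than the notebook's model. It omits dropout and training, uses random weights, and the helper names are hypothetical; in practice PyG's SAGEConv handles this.

```python
import numpy as np

def sage_mean_layer(x, edge_index, w_self, w_neigh, bias):
    """One GraphSAGE layer: x @ w_self + mean(neighbor features) @ w_neigh + bias.

    x is (N, F); edge_index is a (2, E) array of source -> target edges.
    """
    src, dst = edge_index
    agg = np.zeros_like(x)
    np.add.at(agg, dst, x[src])                     # sum incoming messages per target
    counts = np.bincount(dst, minlength=x.shape[0]).reshape(-1, 1)
    agg = agg / np.maximum(counts, 1)               # mean; isolated nodes stay at zero
    return x @ w_self + agg @ w_neigh + bias

def init_sage(rng, d_in, d_out):
    """Small random weights for one layer: (w_self, w_neigh, bias)."""
    return (rng.normal(scale=0.1, size=(d_in, d_out)),
            rng.normal(scale=0.1, size=(d_in, d_out)),
            np.zeros(d_out))

def graphsage_forward(x, edge_index, layer1, layer2, classifier):
    """Two SAGE layers with ReLU, then a linear classifier on the embeddings."""
    h = np.maximum(sage_mean_layer(x, edge_index, *layer1), 0.0)
    h = np.maximum(sage_mean_layer(h, edge_index, *layer2), 0.0)
    w_out, b_out = classifier
    return h, h @ w_out + b_out

rng = np.random.default_rng(42)
n_nodes, n_feats, hidden, n_classes = 5, 4, 64, 4
x = rng.normal(size=(n_nodes, n_feats))
# Undirected path 0-1-2-3-4, stored in both directions.
edge_index = np.array([[0, 1, 1, 2, 2, 3, 3, 4],
                       [1, 0, 2, 1, 3, 2, 4, 3]])
layer1 = init_sage(rng, n_feats, hidden)
layer2 = init_sage(rng, hidden, hidden)
classifier = (rng.normal(scale=0.1, size=(hidden, n_classes)), np.zeros(n_classes))
embeddings, logits = graphsage_forward(x, edge_index, layer1, layer2, classifier)
print(embeddings.shape, logits.shape)
```

With two layers, each node's embedding depends on features up to two hops away, which is why the choice of proximity graph directly shapes what context the classifier sees.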

Heterogeneous graphs: when and how

When POI category semantics matter, split nodes by type (food, retail, education, health) and bridge types with cross-type proximity edges (bridge_nodes using k=3 is a reasonable start). city2graph can output PyG HeteroData directly, and PyG’s to_hetero converts the homogeneous model into a hetero-aware one so each node type gets specialized message-passing parameters.
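
The cross-type bridging idea can be sketched without any graph library: split nodes by type, then link each node to its k nearest nodes of every other type. This is not city2graph's bridge_nodes implementation; the function names and the "is_nearby" relation label are hypothetical. The resulting dictionary is keyed by (source type, relation, target type) triples, which mirrors how heterogeneous edge types are commonly identified.

```python
from collections import defaultdict
from math import dist

def split_by_type(pois):
    """Group (x, y, label) tuples into {label: [(x, y), ...]}."""
    nodes = defaultdict(list)
    for x, y, label in pois:
        nodes[label].append((x, y))
    return dict(nodes)

def bridge_types(nodes, k=3, relation="is_nearby"):
    """Link every node to its k nearest nodes of each other type.

    Indices in the returned pairs are local to each node type.
    """
    edges = {}
    for src_type, src_pts in nodes.items():
        for dst_type, dst_pts in nodes.items():
            if src_type == dst_type:
                continue
            pairs = []
            for i, p in enumerate(src_pts):
                nearest = sorted(range(len(dst_pts)),
                                 key=lambda j: dist(p, dst_pts[j]))[:k]
                pairs.extend((i, j) for j in nearest)
            edges[(src_type, relation, dst_type)] = pairs
    return edges

pois = [(0.0, 0.0, "food"), (1.0, 0.0, "food"),
        (0.0, 5.0, "retail"), (10.0, 0.0, "retail")]
bridges = bridge_types(split_by_type(pois), k=1)
for edge_type, pairs in bridges.items():
    print(edge_type, pairs)
```

The two directions of a type pair are built separately, so food-to-retail and retail-to-food edges can differ; that asymmetry is what lets a heterogeneous model learn direction-specific message passing.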

Quick Colab experiment (4 steps)

Open the companion Colab notebook on the project’s GitHub. Set SEED=42 and POI cap=700.
Run data ingestion: fetch POIs and the street network for CENTER=(35.6595,139.7005), radius=1,100 m; the fallback generates synthetic clusters if OSM is unreachable.
Build the KNN graph (k=8), extract features, scale them, and train the two-layer GraphSAGE for 200 epochs.
Extract final node embeddings, reduce with PCA to 2D, and overlay colors on the Shibuya map — expect clustering by POI function (cafes close to main streets, schools clustered in neighborhoods).
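
The PCA step in the last item needs nothing beyond an SVD. The sketch below assumes the embeddings are available as an (N, D) NumPy array; random data stands in for real GraphSAGE embeddings, and the function name pca_2d is hypothetical.

```python
import numpy as np

def pca_2d(embeddings):
    """Project (N, D) embeddings onto their top two principal components.

    Returns the (N, 2) coordinates and the explained-variance ratio of each axis.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    explained = (s[:2] ** 2) / np.sum(s ** 2)
    return coords, explained

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(700, 64))  # placeholder for model output
coords, explained = pca_2d(fake_embeddings)
print(coords.shape, explained.round(3))
```

The explained-variance ratio is worth checking before reading too much into a map overlay: if the first two components capture only a small share of the variance, apparent clusters in 2D may not reflect the structure of the full embedding space.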

Suggested ablation experiments

Remove dist_street to see how much street proximity contributes.
Vary k in KNN (k=4, 8, 12) to evaluate sensitivity to neighborhood size.
Train on one city patch and test on another to estimate transferability (domain shift).
Try hetero vs homogeneous architectures and compare per-type F1 scores.
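
Comparing ablations comes down to per-class F1 and its unweighted mean, macro-F1. The helpers below are a plain-Python sketch with hypothetical names; scoring a class with no true or predicted samples as 0.0 is a convention chosen here, not something the pipeline prescribes.

```python
def per_class_f1(y_true, y_pred, classes):
    """F1 per class, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0  # absent class scores 0.0
    return scores

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)

def f1_delta(baseline, variant):
    """Per-class change in F1 between two runs (variant minus baseline)."""
    return {c: variant[c] - baseline[c] for c in baseline}

classes = ["food", "retail", "education", "health"]
y_true = ["food", "food", "retail", "health"]
baseline_pred = ["food", "retail", "retail", "health"]
ablated_pred = ["retail", "retail", "retail", "health"]

base = per_class_f1(y_true, baseline_pred, classes)
ablated = per_class_f1(y_true, ablated_pred, classes)
print(macro_f1(y_true, baseline_pred, classes))
print(f1_delta(base, ablated))
```

Reporting the per-class delta rather than a single accuracy number shows which POI functions actually depend on the feature or topology being ablated.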

Practical implications for product and policy teams

Prototypes like this are directly useful for business teams:

Retail: detect competitor clusters and estimate catchment areas where OSM tags are missing.
Urban planning: locate service deserts by combining POI classes with street-access features.
Location intelligence: enrich maps with inferred functions and feed downstream recommendation systems.

But prototypes must be vetted before production. Common operational concerns include noisy OSM labels, temporal churn in POIs (openings/closures), and scaling beyond small patches.

Scaling & production checklist

Data pipeline: scheduled pulls from OSM, ETL into GeoDataFrames, and versioned snapshots for reproducibility.
Graph construction: shard by region or use sampling-based GNN trainers (GraphSAINT, ClusterGCN) for large graphs.
Training automation: CI for retraining, validation pipelines for label drift, and model-monitoring for performance decay.
Privacy & ethics: aggregate or anonymize outputs, avoid exposing venue-level sensitive inferences, and document allowed uses.
Cost considerations: small Colab experiments (<=700 nodes) are cheap; full-city graphs (tens of thousands of POIs) require more RAM and GPU time and often benefit from mini-batching.

Limitations, failure modes, and mitigation

Key risks to plan for:

OSM completeness & bias: tags vary by city and by POI class; validate labels with ground truth where possible.
Spatial autocorrelation: naive random splits can leak spatial signal; consider spatial cross-validation (leave-block-out) to get realistic generalization estimates.
Topology sensitivity: different proximity graphs yield different signals—ensemble or cross-validate topologies rather than assuming one is universally best.
Rare classes: downsampling to 700 POIs may drop rare but important categories—adjust sampling strategy to preserve business-critical classes.
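
The leave-block-out idea mentioned under spatial autocorrelation can be sketched with a simple grid: assign each POI to a square block, then assign whole blocks to folds. The 250 m block size, the fold count, and the function names are hypothetical choices for illustration; coordinates are assumed to be projected meters.

```python
import random
from math import floor

def block_id(x, y, block_size_m=250.0):
    """Grid cell containing the point (x, y)."""
    return (floor(x / block_size_m), floor(y / block_size_m))

def leave_block_out_folds(points, n_folds=5, block_size_m=250.0, seed=42):
    """Assign whole grid blocks to folds so no block is split across folds.

    Returns a list of n_folds lists of point indices.
    """
    blocks = sorted({block_id(x, y, block_size_m) for x, y in points})
    random.Random(seed).shuffle(blocks)
    fold_of_block = {b: i % n_folds for i, b in enumerate(blocks)}
    folds = [[] for _ in range(n_folds)]
    for idx, (x, y) in enumerate(points):
        folds[fold_of_block[block_id(x, y, block_size_m)]].append(idx)
    return folds

# A 10 x 10 lattice with 100 m spacing falls into a 4 x 4 grid of 250 m blocks.
grid_points = [(gx * 100.0, gy * 100.0) for gx in range(10) for gy in range(10)]
folds = leave_block_out_folds(grid_points)
print([len(f) for f in folds])
```

Each fold then serves once as the held-out set. Graph edges that cross block boundaries can still carry information between folds through message passing, so dropping those edges or leaving a buffer between blocks may be needed for a stricter estimate.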

Key questions & concise answers

Can you run this without live OSM access?

Yes. The pipeline includes a synthetic clustered fallback (SEED = 42) so the notebook remains runnable and reproducible even when OSM is unreachable.

Which proximity graph should I try first?

Start with KNN (k=8) as a robust baseline. Then experiment with the sparser options (Gabriel, RNG) and the mesh-like Delaunay graph; add stochastic Waxman edges only if you have reason to believe multi-scale links matter.

When should you use heterogeneous graphs?

Use hetero modeling when node-type semantics matter (e.g., mixing restaurants and clinics) and you want the model to learn type-specific propagation rules; to_hetero makes conversion straightforward.

How do embeddings get validated?

Validate embeddings qualitatively by mapping PCA/UMAP outputs back to geography and quantitatively by cluster purity, downstream classification accuracy, and per-class metrics.
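
Cluster purity, one of the quantitative checks named above, is short enough to write by hand. The sketch assumes cluster assignments have already been produced by some clustering of the embeddings (for example k-means); the function name is hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, labels):
    """Fraction of nodes whose label matches the majority label of their cluster."""
    members = defaultdict(list)
    for c, y in zip(cluster_ids, labels):
        members[c].append(y)
    majority_total = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
    return majority_total / len(labels)

cluster_ids = [0, 0, 0, 1, 1]
labels = ["food", "food", "retail", "health", "health"]
print(cluster_purity(cluster_ids, labels))
```

Purity rises trivially as the number of clusters grows (one cluster per node is perfectly pure), so it is only comparable across runs that use the same cluster count.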

Next steps and recommended reading

To move from prototype to production, run the Colab experiment, iterate on feature ablations, and test spatial cross-validation. A production roadmap should then cover data ingestion, model retraining cadence, monitoring, and cost estimates, tailored to your dataset and scale.

Further reading: city2graph docs, PyTorch Geometric heterogeneous modeling guide, classic spatial graph literature on proximity graphs, and practical tutorials for GraphSAGE and spatial cross-validation.