Reproducible Scanpy Pipeline for PBMC 3k: From Counts to Cell Types (AnnData + UMAP + Leiden)
How do you turn noisy 10x PBMC counts into auditable cell‑type labels quickly and reproducibly? Use a compact Scanpy pipeline that walks raw counts through quality control, normalization, dimensionality reduction, clustering, marker discovery, and a first‑pass cell‑type annotation—all saved as an AnnData object and CSVs for downstream use.
Why this pipeline matters
Single‑cell RNA sequencing (scRNA‑seq) produces large, sparse matrices. The goal is simple but nontrivial: separate biological signal from technical noise and summarize thousands of cells into interpretable groups. Scanpy is a Python toolkit well suited for scripted, repeatable work. This pipeline on the PBMC 3k dataset typically retains about 2,600 cells and roughly 11,000 genes after conservative filtering, and it recovers around eight to ten major immune cell types—good for prototyping, figure generation, and handing results to wet‑lab collaborators for validation.
Data, tools and quick install
Dataset: 10x Genomics PBMC 3k (peripheral blood mononuclear cells).
Key libraries: Scanpy (core), AnnData (annotated data object that stores matrices plus metadata), leidenalg and igraph (community detection and graphs), harmonypy (optional batch correction), seaborn, matplotlib, numpy and pandas for plotting and data handling.
Quick install (one line): pip install scanpy anndata leidenalg igraph harmonypy seaborn numpy pandas matplotlib
Pipeline overview — what happens and why
- Compute per‑cell QC metrics. Measure total counts, number of detected genes, and mitochondrial percentage to flag low‑quality cells.
- Filter low‑quality cells and rarely expressed genes. This removes empty droplets, broken cells, and very rare genes that increase noise.
- Normalize and log transform counts. Scale total counts to a common library size and apply log1p to stabilize variance for downstream methods.
- Select highly variable genes (HVGs). Focuses analysis on genes with informative biological variation rather than technical noise.
- Regress out technical covariates and scale. Optionally remove effects of library size and mitochondrial content and clip extreme values for numeric stability.
- Run PCA. Linear dimensionality reduction to summarize major axes of variation.
- Build a nearest‑neighbor graph. Connect cells in reduced space as input for clustering and UMAP.
- Compute UMAP. Nonlinear visualization that preserves local neighborhood structure for human interpretation.
- Leiden clustering and marker discovery. Community detection to define clusters, followed by differential expression to find marker genes per cluster.
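The QC and filtering steps above can be sketched with plain NumPy on a toy count matrix; in a real run, `sc.pp.calculate_qc_metrics`, `sc.pp.filter_cells`, and `sc.pp.filter_genes` do this on an AnnData object. The matrix and the mitochondrial gene mask here are simulated for illustration; the thresholds are the PBMC defaults discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy cells-by-genes count matrix: 500 cells, 300 genes; the last 30 genes
# stand in for mitochondrial genes (in real data, names starting with "MT-").
counts = rng.poisson(2.0, size=(500, 300))
mito_mask = np.zeros(300, dtype=bool)
mito_mask[-30:] = True

total_counts = counts.sum(axis=1)                   # library size per cell
n_genes = (counts > 0).sum(axis=1)                  # detected genes per cell
pct_mito = 100 * counts[:, mito_mask].sum(axis=1) / np.maximum(total_counts, 1)

# Filter cells: within gene-count bounds and below the mitochondrial threshold.
keep_cells = (n_genes >= 200) & (n_genes <= 5000) & (pct_mito < 10)
# Filter genes: detected in at least 3 cells.
keep_genes = (counts > 0).sum(axis=0) >= 3
filtered = counts[np.ix_(keep_cells, keep_genes)]
```

The same boolean-mask logic underlies Scanpy's filtering helpers; working it out by hand once makes the QC plots much easier to interpret.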
Default parameter choices and the rationale
- Keep cells with at least 200 detected genes; exclude cells with more than 5,000 detected genes and cells with mitochondrial percent of 10% or higher. These thresholds are conservative for PBMCs and remove likely low‑quality or multiplet cells.
- Filter genes that appear in fewer than 3 cells to reduce noise and file size.
- Normalize by total counts to a target sum of 10,000 and apply log1p. This is a simple, robust approach compatible with many downstream methods.
- Select HVGs using Seurat flavor with min_mean = 0.0125, max_mean = 3, and min_disp = 0.5. This keeps genes with moderate expression and above‑expected dispersion.
- Regress out total counts and mitochondrial percent, then scale with max_value = 10 to clip outliers.
- Run PCA with svd_solver set to “arpack”; use 30 principal components as a default summary.
- Build neighbors with n_neighbors = 12 and the Euclidean metric in PCA space.
- UMAP with min_dist = 0.35 and spread = 1.0 balances local cluster tightness and global separation.
- Leiden clustering at resolution 0.6 produces a moderate number of clusters for PBMC 3k—raise the resolution for finer subclustering or lower it for broader groups.
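To make the HVG criterion concrete, here is a simplified NumPy sketch of the Seurat-flavor cutoffs on simulated log-normalized data. The real `sc.pp.highly_variable_genes` additionally bins genes by mean expression and z-scores dispersions within bins before applying min_disp, so results on real data will differ from this simplification.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy log-normalized expression: 200 cells x 50 genes with gene-specific rates.
X = np.log1p(rng.poisson(rng.gamma(2.0, 1.0, size=50), size=(200, 50)))

# The Seurat flavor works on expm1 of the log data:
# mean expression and dispersion = variance / mean, per gene.
E = np.expm1(X)
mean = E.mean(axis=0)
disp = E.var(axis=0) / np.maximum(mean, 1e-12)

# Simplified cutoffs (scanpy applies min_disp to bin-normalized dispersions).
hvg = (mean > 0.0125) & (mean < 3) & (disp > 0.5)
```

The min_mean bound discards near-silent genes, max_mean discards ubiquitous housekeeping-level genes, and min_disp keeps genes more variable than their expression level predicts.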
Compact how‑to snippets (conceptual)
Use these as one‑line reminders during scripting:
- Compute QC metrics: calculate mitochondrial percent, n_counts and n_genes for every cell and visualize distributions.
- Filter: remove cells with fewer than 200 genes, more than 5,000 genes, or mitochondrial percent of 10% or higher; drop genes in fewer than 3 cells.
- Normalize and log: scale to 10,000 counts per cell and apply log1p.
- HVGs: flavor = “seurat”, min_mean = 0.0125, max_mean = 3, min_disp = 0.5.
- PCA and neighbors: run PCA (n_pcs = 30) and compute neighbors with n_neighbors = 12.
- UMAP and Leiden: compute UMAP (min_dist = 0.35, spread = 1.0) and cluster using Leiden at resolution = 0.6.
Diagnostics, visualizations and expected outputs
Useful plots and outputs to include with the pipeline:
- QC plots: violin or scatter plots of n_counts vs n_genes, and histogram of mitochondrial percent.
- Dimensionality reduction: PCA scatter and UMAP colored by cluster or metadata.
- Marker visualizations: dotplot and matrixplot for canonical markers, plus violin plots for single genes.
- Differential expression tables: cluster_markers.csv and celltype_markers.csv with adjusted p‑values and log fold changes.
Save a processed AnnData object as scanpy_pbmc3k_outputs/processed.h5ad and CSV exports such as cluster_markers.csv, celltype_markers.csv, and cluster_score_matrix.csv for handoff or as supplementary material for publication.
Marker genes and a rule‑based annotation
Canonical marker genes help interpret clusters quickly. Typical markers used for PBMCs include:
- T cells: CD3D, TRAC, TRBC1, IL7R, CCR7
- NK cells: NKG7, GNLY, PRF1
- B cells: MS4A1, CD79A, CD79B
- Monocytes: LYZ, S100A8, FCGR3A, CST3, LGALS3
- Dendritic cells: FCER1A, CTSS
- Platelets: PPBP
Rule‑based scoring computes a score per cluster from averaged expression of these gene sets and assigns the most likely cell type. It is fast and interpretable, but has limits when clusters are mixed, markers are shared, or the tissue exhibits atypical expression.
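The scoring idea can be sketched with pandas alone: z-score each gene across clusters (so highly expressed genes do not dominate), average each marker set within each cluster, and take the argmax. The cluster means below are toy numbers for illustration; in practice you would compute per-cluster means from the AnnData object.

```python
import pandas as pd

MARKERS = {
    "T cell": ["CD3D", "IL7R"],
    "B cell": ["MS4A1", "CD79A"],
    "Monocyte": ["LYZ", "S100A8"],
}

# Toy mean log-expression: clusters (rows) x genes (columns).
cluster_means = pd.DataFrame(
    [[2.5, 2.0, 0.1, 0.0, 0.2, 0.1],
     [0.1, 0.3, 2.2, 1.9, 0.1, 0.0],
     [0.0, 0.2, 0.1, 0.1, 3.1, 2.8]],
    index=["0", "1", "2"],
    columns=["CD3D", "IL7R", "MS4A1", "CD79A", "LYZ", "S100A8"],
)

# z-score each gene across clusters, then average within each marker set.
z = (cluster_means - cluster_means.mean()) / cluster_means.std(ddof=0)
scores = pd.DataFrame({ct: z[genes].mean(axis=1) for ct, genes in MARKERS.items()})
labels = scores.idxmax(axis=1)   # best-scoring cell type per cluster
```

The `scores` table is exactly what cluster_score_matrix.csv would hold, which is why this approach stays auditable: every label traces back to a visible number.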
Annotation strategies — pros and cons
- Rule‑based scoring: transparent and easy to audit; useful for initial labeling and figure generation. Weakness: sensitive to marker selection and ambiguous in transitional states.
- Automated classifiers (CellTypist, scANVI, SingleR): can improve consistency across datasets and scale to many samples. Weakness: require good reference data and can be a black box if not validated.
- Hybrid approach: use automated tools for bulk labeling then manually curate or confirm with marker plots and DE tables.
Practical parameter sensitivity — what to tweak
- Leiden resolution: lower values merge clusters; higher values split them. Try 0.4, 0.6, 1.0 to see stability.
- Number of neighbors and principal components: fewer neighbors emphasize very local structure; more neighbors increase smoothing. Vary n_neighbors between 5 and 30 and n_pcs between 20 and 50 when data are complex.
- Mitochondrial threshold: ten percent is a common PBMC default. For tissues with naturally high mitochondrial reads, increase the threshold, but consider other QC metrics or doublet detection instead of blind thresholds.
- Regressing out covariates: removes technical variance but can also remove biological signal correlated with those covariates. Regress only when you understand the tradeoffs.
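To judge stability across resolutions quantitatively rather than by eye, the adjusted Rand index compares two cluster assignments of the same cells. A scikit-learn sketch on two hypothetical labelings:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical Leiden labels for the same 12 cells at two resolutions;
# at the higher resolution one cell has split into a new cluster.
labels_res_04 = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
labels_res_06 = [0, 0, 0, 0, 1, 1, 1, 3, 2, 2, 2, 2]

ari = adjusted_rand_score(labels_res_04, labels_res_06)
# ARI is 1.0 for identical partitions and near 0 for random agreement,
# so a high ARI across resolutions indicates stable cluster structure.
```

Sweeping resolution over 0.4, 0.6, and 1.0 and tabulating pairwise ARI gives a quick stability report to attach to the outputs.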
Reproducibility and productionization checklist
- Record package versions and Git commit hash. Save them in AnnData.uns or a provenance.json file.
- Provide an environment.yml or requirements.txt and a Dockerfile for CI and deployment.
- Store outputs in a structured folder (example: scanpy_pbmc3k_outputs/ with subfolders for figures, tables, and h5ad files).
- Automate a sanity check in CI: run the notebook on a small subset and assert that expected outputs exist and cluster counts are within a reasonable range.
- Link to the original raw data source and include metadata that maps sample IDs to experimental conditions.
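Recording versions can be as simple as writing a provenance.json next to the outputs. This sketch uses only the standard library; the git call assumes the analysis lives in a git checkout and degrades gracefully if it does not.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata

def collect_provenance(packages=("scanpy", "anndata", "leidenalg", "numpy", "pandas")):
    """Gather Python, package, and git state for a provenance.json record."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # not installed
    try:
        commit = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True,
                                check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None  # git missing or not a checkout
    return {
        "python": sys.version.split()[0],
        "packages": versions,
        "git_commit": commit,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

prov = collect_provenance()
# json.dump(prov, open("scanpy_pbmc3k_outputs/provenance.json", "w"), indent=2)
```

The same dictionary can be stashed in AnnData.uns before writing the h5ad, so the processed object carries its own provenance.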
Troubleshooting and common failure modes
- Too few cells after filtering: Relax thresholds or check for heavy ambient RNA; consider doublet detection tools such as Scrublet.
- Clusters look mixed or noisy: Increase HVG stringency, try more PCs, or run batch correction (Harmony) before PCA.
- Marker lists are inconsistent: Check multiple DE tests, require a minimum log fold change, and confirm markers visually on UMAP/violin plots.
- Unexpected runtime or memory use: PBMC 3k is small; if scaling up, profile memory and consider sparse representations or chunked workflows.
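On the memory point: scRNA-seq matrices are mostly zeros, so a SciPy CSR matrix is dramatically smaller than its dense equivalent. A quick check on simulated data with a sparsity level typical of scRNA-seq:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Simulate a count matrix that is ~95% zeros, as is common for scRNA-seq.
dense = rng.poisson(0.05, size=(2000, 1000)).astype(np.float32)
csr = sparse.csr_matrix(dense)

dense_mb = dense.nbytes / 1e6
sparse_mb = (csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6
# The CSR form stores only nonzero values plus index arrays,
# so sparse_mb is a small fraction of dense_mb here.
```

AnnData keeps 10x matrices sparse by default; the main thing to avoid when scaling up is any step that silently densifies them.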
Practical next steps and integrations
- Add batch correction with Harmony (harmonypy) when combining samples to avoid batch‑driven clusters.
- Swap rule‑based annotation for CellTypist or scANVI when labeling many datasets or when you need higher automation.
- Explore downstream analyses: pseudotime/trajectory inference, ligand‑receptor interactions, or integration with spatial transcriptomics.
Key takeaways and quick Q&A
- What QC filters are sensible for PBMC 3k?
Keep cells with at least 200 detected genes, exclude cells with more than 5,000 detected genes or mitochondrial percent of 10% or higher, and filter genes seen in fewer than 3 cells.
- How are counts normalized?
Total counts per cell are scaled to a target sum of 10,000 and then a log1p transform is applied; HVGs are selected using Seurat‑flavored thresholds to focus on informative genes.
- Which dimensionality reduction and clustering defaults work well?
PCA with 30 components, neighbors built with n_neighbors = 12, UMAP using min_dist = 0.35, and Leiden clustering at resolution 0.6 are robust defaults for PBMC 3k.
- Is rule‑based annotation reliable?
It is a fast, transparent first pass. For ambiguous clusters or large multi‑sample projects, complement it with automated classifiers and manual curation.
References and further reading
- Scanpy documentation
- AnnData documentation
- 10x Genomics PBMC 3k dataset
- CellTypist (automated cell annotation)
- scANVI / scvi-tools
- Harmony (harmonypy)
Ready to convert this pipeline into a one‑page production checklist, or a runnable Colab/Binder notebook with environment files and provenance? That practical deliverable makes it easier to hand this workflow to analysts, CI systems, or downstream biologists for validation and scaling.