ML-Driven Vulnerability Prioritization: Using Semantic Embeddings to Rank CVEs Beyond CVSS

TL;DR: Use sentence embeddings plus simple metadata and supervised models to reorder CVE backlogs by likely exploit risk instead of relying only on CVSS. Run the demo notebook, or backtest the approach against three months of historical exploit telemetry to see lift in precision@k.

Why change CVSS-based triage?

CVSS gives a consistent baseline, but it’s a blunt instrument. It ignores nuance in CVE text and evolving exploit trends. A SOC analyst with hundreds of new CVEs each week needs a ranking that reflects real-world exploitability and business impact—not just a static score.

Semantic embeddings capture meaning from natural-language descriptions. They turn phrases like “remote code execution” or “deserialization” into numerical signals the model can reason about. Combine those signals with simple metadata and you get an ML-driven prioritization that better aligns with exploit risk and operational urgency.
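
To make that concrete, here is a minimal sketch (not the demo's exact code) of turning one CVE description into an embedding with sentence-transformers:

```python
# Sketch: embed one CVE description with the all-MiniLM-L6-v2 sentence-transformer.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
desc = ("An issue in XYZ library allows remote code execution via crafted "
        "deserialization payloads when parsing untrusted input over the network.")
embedding = model.encode(desc)
print(embedding.shape)  # (384,) — a dense vector the downstream models consume
```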

“The pipeline builds an AI-assisted scanner that moves beyond static CVSS scoring to prioritize vulnerabilities using semantic understanding and ML.”

High-level pipeline

  • Ingest CVE records from the NVD API (defaults: last 30 days, up to 50 items; synthetic fallback if API access fails).
  • Convert each CVE description to a sentence embedding using the all-MiniLM-L6-v2 sentence-transformer.
  • Extract keyword flags and categorical metadata (attack vector, complexity, privileges, user interaction).
  • Combine the embedding and structured features into a single feature vector per CVE.
  • Train a Random Forest to predict severity class and a Gradient Boosting regressor to predict a CVSS-like numeric score.
  • Cluster CVEs with KMeans to surface recurring exploit themes and systemic risks.
  • Expose explainability (feature importance / SHAP) and visualize priority distribution in a dashboard.

Tools used in the demo: sentence-transformers (all-MiniLM-L6-v2), scikit-learn (RandomForestClassifier, GradientBoostingRegressor, KMeans), pandas, numpy, matplotlib/seaborn, and requests. Example code and a runnable notebook are available on the Marktechpost GitHub.
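
A minimal ingest sketch for the first pipeline step, assuming the NVD 2.0 REST endpoint and its pubStartDate/pubEndDate/resultsPerPage parameters (verify parameter and field names against the NVD API docs before relying on them):

```python
# Sketch: fetch recent CVEs from the NVD 2.0 API (last 30 days, up to 50 items,
# matching the demo's defaults), with a synthetic fallback for offline runs.
import datetime as dt
import requests

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def fetch_recent_cves(days=30, limit=50):
    end = dt.datetime.now(dt.timezone.utc)
    start = end - dt.timedelta(days=days)
    params = {
        "pubStartDate": start.strftime("%Y-%m-%dT%H:%M:%S.000"),
        "pubEndDate": end.strftime("%Y-%m-%dT%H:%M:%S.000"),
        "resultsPerPage": limit,
    }
    try:
        resp = requests.get(NVD_URL, params=params, timeout=30)
        resp.raise_for_status()
        records = []
        for v in resp.json().get("vulnerabilities", []):
            cve = v["cve"]
            desc = next((d["value"] for d in cve.get("descriptions", [])
                         if d.get("lang") == "en"), "")
            records.append({"id": cve["id"],
                            "description": desc,
                            "reference_count": len(cve.get("references", []))})
        return records
    except (requests.RequestException, KeyError):
        # Synthetic fallback so the demo still runs without API access.
        return [{"id": "CVE-0000-0001",
                 "description": "Remote code execution via crafted deserialization payload.",
                 "reference_count": 3}]
```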

Feature engineering (plain language)

For each CVE we build a single feature vector by combining (a sketch of the assembly follows the list):

  • Text embedding: a dense 384-d vector from all-MiniLM-L6-v2.
  • Keyword flags: execution, injection, authentication, overflow, exposure.
  • Numeric fields: reference_count, description length, word count.
  • Categorical metadata one-hot: attack vector (network/local), complexity, privileges required, user interaction.
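
A minimal sketch of that assembly step. Field names are illustrative rather than the demo's exact schema; complexity, privileges, and user interaction would be one-hot encoded the same way as attack vector:

```python
# Sketch: one combined feature vector per CVE (embedding + flags + numeric + one-hot).
import numpy as np
from sentence_transformers import SentenceTransformer

KEYWORDS = ["execution", "injection", "authentication", "overflow", "exposure"]
ATTACK_VECTORS = ["network", "local"]

model = SentenceTransformer("all-MiniLM-L6-v2")

def featurize(cve):
    text = cve["description"].lower()
    embedding = model.encode(cve["description"])               # dense 384-d vector
    flags = [float(kw in text) for kw in KEYWORDS]             # keyword flags
    numeric = [float(cve.get("reference_count", 0)),
               float(len(cve["description"])),                 # description length
               float(len(cve["description"].split()))]         # word count
    onehot = [float(cve.get("attack_vector") == av) for av in ATTACK_VECTORS]
    return np.concatenate([embedding, flags, numeric, onehot])
```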

Models and priority scoring

The demo trains two supervised models: a Random Forest for severity class (Low/Medium/High/Critical) and a Gradient Boosting regressor to estimate a CVSS-like score. The final priority score blends classification confidence with the normalized regressed score:

priority = 0.4 × severity_probability + 0.6 × normalized_regressed_score

Normalize the regressed score to [0,1] before blending. Adjust the 0.4/0.6 weights to match your risk tolerance—higher weight to the regressor tightens numeric ranking; higher weight to the classifier favors categorical certainty.
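
A minimal end-to-end sketch of training and blending on synthetic stand-in data, interpreting severity_probability as the classifier's confidence in its predicted class (hyperparameters are illustrative):

```python
# Sketch: train both models, then blend confidence and normalized score 0.4/0.6.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 394))          # e.g. 384-d embedding + structured features
y_class = rng.integers(0, 4, size=200)   # Low/Medium/High/Critical as 0..3
y_score = rng.uniform(0, 10, size=200)   # CVSS-like numeric targets

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y_class)
reg = GradientBoostingRegressor(random_state=0).fit(X, y_score)

def priority_scores(X, w_clf=0.4, w_reg=0.6):
    severity_prob = clf.predict_proba(X).max(axis=1)   # confidence in predicted class
    normalized = np.clip(reg.predict(X) / 10.0, 0, 1)  # CVSS-like scores live in [0, 10]
    return w_clf * severity_prob + w_reg * normalized

print(priority_scores(X[:5]))
```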

“Vulnerability descriptions are treated as meaningful linguistic content that can be embedded and used for prediction.”

Concrete example (one CVE run-through)

Input (CVE description): “An issue in XYZ library allows remote code execution via crafted deserialization payloads when parsing untrusted input over the network.”

Extracted features:

  • Embedding highlights: the description vector aligns strongly with “remote code execution” and “deserialization”.
  • Keyword flags: execution=1, injection=0, authentication=0, overflow=0, exposure=1.
  • Metadata: attack_vector=network, privileges_required=None, user_interaction=None, reference_count=3, word_count=20.

Model outputs:

  • Severity class probability (Random Forest): Critical 0.72, High 0.20, Medium 0.08.
  • Regressed CVSS-like score (Gradient Boosting): 8.6 → normalized 0.86.
  • Final priority: 0.4×0.72 + 0.6×0.86 = 0.804 (high priority).

Explainability snapshot: SHAP values indicate the largest contributors to priority were tokens associated with “remote code execution” (positive impact) and attack_vector=network (positive). Low reference_count modestly reduced priority.
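
Per-record explanations like the snapshot above can be produced with SHAP's TreeExplainer. A sketch on stand-in data follows; the output layout varies with the shap version (multiclass models yield one contribution set per class):

```python
# Sketch: explain one CVE's score with SHAP (stand-in data, not the demo's pipeline).
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # stand-in feature matrix
y = rng.integers(0, 4, size=200)
clf = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X[:1])  # per-feature contributions for one record
# In the real pipeline, large positive contributions on embedding dimensions tied
# to "remote code execution", or on attack_vector=network, push priority up.
```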

Evaluation: technical and business metrics

Standard model metrics are useful but not sufficient for SOC decisions. Use both technical and business-facing measures (a minimal precision@k helper is sketched after this list):

  • Classification: precision, recall, F1; per-class confusion matrix.
  • Regression: RMSE and calibration (Brier score or reliability diagram).
  • Business lift: precision@k (top-k prioritized CVEs), mean time to remediate (MTTR) for top-priority items, and cumulative true exploited CVEs captured in top-k.
  • Backtest: run the model on historical CVEs and compare where actual exploited CVEs fell in ML rank vs CVSS rank (compute lift and AUC).
  • Per-record explainability: SHAP to validate why a CVE scored high; flag records with high model uncertainty for manual review.
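
The precision@k helper referenced above, as a sketch (scores is any ranking score, exploited a 0/1 label vector aligned with it):

```python
# Sketch: precision@k — the share of the top-k prioritized CVEs actually exploited.
import numpy as np

def precision_at_k(scores, exploited, k):
    top_k = np.argsort(scores)[::-1][:k]               # indices of the k highest scores
    return float(np.asarray(exploited)[top_k].mean())

# Lift over CVSS on the same labels:
# precision_at_k(ml_priority, exploited, 50) / precision_at_k(cvss_scores, exploited, 50)
```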

Visualization and explainability

Effective dashboards help analysts trust ML-driven triage. Key panes to include:

  • Priority distribution histogram and top-k list.
  • Scatter: CVSS vs ML priority to highlight disagreements.
  • Severity distribution bar chart and attack vector pie chart.
  • Cluster summary: average CVSS per cluster and common keywords.
  • Feature importance and SHAP detail panel for per-CVE explanations.

Operationalizing ML-driven triage

Adopt a staged rollout:

  1. Backtest: Evaluate model against historical exploited CVEs over several months.
  2. Shadow mode: Run the model in parallel with existing triage for 4–8 weeks and gather analyst feedback.
  3. Integration: Push ML priorities into ticketing or SOAR tools, but gate any automatic patch actions behind analyst approval.
  4. Retraining: Establish a cadence (weekly or monthly) for incremental retraining; trigger retrain on drift alerts or major vulnerability events.
  5. Monitoring: Track precision@k, model confidence distribution, sudden shifts in feature importance, and unusual clustering patterns.
  6. Governance: Define who owns labels, approve thresholds for automation, and keep audit logs for every decision the model influences.

Suggested CI/CD pieces: automated data pipeline from the NVD API, unit tests for feature extraction, model validation jobs, and deployment scripts with rollout/rollback capabilities.
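
As one example of a feature-extraction unit test, a pytest-style sketch; extract_keyword_flags is a hypothetical helper, not the demo's actual API:

```python
# Sketch: a pytest-style unit test for keyword-flag extraction.
KEYWORDS = ["execution", "injection", "authentication", "overflow", "exposure"]

def extract_keyword_flags(description):
    text = description.lower()
    return {kw: int(kw in text) for kw in KEYWORDS}

def test_keyword_flags_detects_execution():
    flags = extract_keyword_flags("Allows remote code execution over the network.")
    assert flags["execution"] == 1
    assert flags["injection"] == 0
```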

Risks, limitations, and mitigations

  • Label quality: NVD entries can be noisy or incomplete. Mitigation: supplement with vendor advisories, exploit feeds, or analyst-labeled examples.
  • Data sparsity and synthetic fallbacks: Synthetic examples help demos but can bias models. Mitigation: prefer real historical data for training and test on realistic holdouts.
  • Adversarial disclosure: Attackers could craft language to evade or inflate priority. Mitigation: normalization, ensemble checks, and anomaly detectors on embedding space.
  • Model drift: Vulnerability types evolve. Mitigation: drift detection alerts and frequent retraining.
  • Overreliance and automation risk: Don’t automate high-impact actions without human approval; use ML to surface candidates and explain its reasoning.

How to try it in 30 minutes

  1. Clone the GitHub repo: https://github.com/marktechpost.
  2. pip install sentence-transformers scikit-learn pandas numpy matplotlib seaborn requests shap
  3. Run the notebook; default demo fetches recent CVEs or uses a synthetic fallback.
  4. Swap the data source to a 1–3 month historical CVE dump for a quick backtest.
  5. Inspect SHAP outputs for 5 examples and compare ML top-k vs CVSS top-k.
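
For step 5's ranking comparison, a sketch on synthetic stand-in data (swap in your backtest DataFrame; column names are illustrative):

```python
# Sketch: compare exploited-CVE capture in the ML top-k vs the CVSS top-k.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": [f"CVE-2024-{i:04d}" for i in range(100)],
    "cvss": rng.uniform(0, 10, 100),
    "ml_priority": rng.uniform(0, 1, 100),
    "exploited": rng.integers(0, 2, 100),   # 1 = known exploited
})

def topk_capture(df, score_col, k=25):
    top = df.nlargest(k, score_col)
    return int(top["exploited"].sum()), set(top["id"])

ml_hits, ml_ids = topk_capture(df, "ml_priority")
cvss_hits, cvss_ids = topk_capture(df, "cvss")
print(f"exploited CVEs in top-25: ML={ml_hits}, CVSS={cvss_hits}")
print("surfaced by ML but missed by CVSS:", sorted(ml_ids - cvss_ids))
```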

Key takeaways & questions

Can ML and NLP improve vulnerability prioritization beyond CVSS?

Yes. Converting descriptions into semantic embeddings and combining them with metadata lets supervised models surface exploit-relevant signals that static CVSS values miss, producing a ranking more aligned with operational risk.

Which features matter most?

Text embeddings capture linguistic cues; keyword flags (execution, injection, etc.), reference counts, description length, and attack-vector metadata provide complementary context. SHAP helps quantify each feature’s contribution.

What models work well for a practical pipeline?

Random Forest for severity classification, Gradient Boosting for numeric score regression, and KMeans for clustering offer a robust, explainable baseline. Consider fine-tuning transformer models if you have many labeled examples.

How should teams validate ML-driven priorities?

Backtest against historical exploitation telemetry, measure precision@k and lift relative to CVSS, run a shadow deployment to collect analyst feedback, and then phase into gated automation.

Glossary (quick)

  • CVE — Common Vulnerabilities and Exposures record.
  • NVD — National Vulnerability Database (NIST) and its API: NVD API docs.
  • CVSS — Common Vulnerability Scoring System (baseline numeric severity).
  • Semantic embeddings — Dense numerical vectors representing sentence meaning; model used: all-MiniLM-L6-v2.
  • SOAR — Security Orchestration, Automation and Response (integration target for prioritized tickets).
  • SHAP — Tool for model explainability: SHAP on GitHub.

Ready to see if ML-driven triage moves the needle for your team? Clone the demo repo, run a backtest against your historical CVEs, and pilot the model in shadow mode. Measure precision@k and analyst feedback before automating any remediation steps.

Further reading and resources: NVD API, all-MiniLM-L6-v2 model, scikit-learn docs, SHAP, and the demo code on GitHub.

“Combining embeddings, metadata, clustering, and explainability produces a prioritization that better mirrors real exploit risk and operational urgency.”