Model Overview

SecureModernBERT-NER is a cybersecurity-focused language model that pairs the ModernBERT architecture with a large, diverse corpus of CTI-labelled NER annotations.

Unlike conventional NER systems, SecureModernBERT-NER recognises 22 fine-grained, security-specific entity types covering the full spectrum of cyber-threat intelligence, from THREAT-ACTOR and MALWARE to CVE, IPV4, DOMAIN, and REGISTRY-KEYS.

Trained on more than half a million manually curated spans sourced from real-world threat reports, vulnerability advisories, and incident analyses, it achieves an exceptional balance of accuracy, generalisation, and contextual depth.

This model is designed to parse complex security narratives with high precision, extracting both contextual metadata (e.g., ORG, PRODUCT, PLATFORM) and highly technical indicators (e.g., MD5/SHA256 hashes, URLs, IP addresses) within a single unified framework.

SecureModernBERT-NER sets a new standard for automated CTI entity recognition, enabling the next wave of threat-intelligence automation, enrichment, and analytics.

Quick Start

from transformers import pipeline

model_id = "attack-vector/SecureModernBERT-NER"

# "first" aggregation merges sub-word tokens into whole-entity spans,
# labelling each span with its first token's prediction.
pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)

Sample output:

{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
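Because the model was trained with a maximum sequence length of 128 tokens, long CTI articles are best processed in overlapping windows whose predictions are rebased onto document offsets. The helpers below (`chunk_text`, `shift_spans`) are hypothetical names, a minimal sketch rather than part of the model repository:

```python
# Hypothetical helpers (not part of the model repo): split a long article
# into overlapping character windows, run the pipeline per window, and map
# chunk-local offsets back to document offsets.

def chunk_text(text, window=500, overlap=100):
    """Yield (chunk, start_offset) pairs covering the whole text."""
    step = window - overlap
    for start in range(0, max(len(text), 1), step):
        yield text[start:start + window], start
        if start + window >= len(text):
            break

def shift_spans(predictions, offset):
    """Rebase chunk-local 'start'/'end' offsets onto the full document."""
    return [{**p, "start": p["start"] + offset, "end": p["end"] + offset}
            for p in predictions]

# Usage with the pipeline above (sketch):
# entities = []
# for chunk, offset in chunk_text(long_article):
#     entities.extend(shift_spans(pipe(chunk), offset))
```

The overlap ensures entities straddling a window boundary are seen whole in at least one window; duplicate spans from the overlapped region should be deduplicated downstream.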

Intended Use & Limitations

  • Use cases: automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
  • Languages: English (model was trained and evaluated on English sources only).
  • Input format: free-form prose or long-form CTI articles; maximum sequence length 128 tokens during training.
  • Limitations: noisy or ambiguous extractions may occur, especially with rare entity types (IPV6, EMAIL) and obfuscated strings. The model does not normalise entities (e.g., deobfuscating hxxp) nor validate indicator authenticity. Always pair with downstream validation and human review.

Training Data

  • Size: 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
  • Label distribution (spans): ORG (approx. 198k), PRODUCT (approx. 79k), MALWARE (approx. 67k), PLATFORM (approx. 57k), THREAT-ACTOR (approx. 49k), SERVICE (approx. 46k), CVE (approx. 41k), LOC (approx. 38k), SECTOR (approx. 34k), TOOL (approx. 29k), plus indicator types such as URL, IPV4, SHA256, MD5, and REGISTRY-KEYS.
  • Pre-processing: JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload.
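The span-to-BIO conversion step can be sketched as follows. This is a simplified illustration assuming whitespace tokenisation and character-offset span annotations; the actual pre-processing pipeline is not published and may differ:

```python
def spans_to_bio(text, spans):
    """Convert character-offset span annotations to per-token BIO tags.

    `spans` is a list of (start, end, label) tuples; tokenisation here is
    plain whitespace splitting, a simplification of the real pipeline.
    """
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = ["O"] * len(tokens)
    for s, e, label in spans:
        # A token is inside the span if the character ranges overlap.
        inside = [i for i, (ts, te) in enumerate(offsets) if ts < e and te > s]
        for rank, i in enumerate(inside):
            tags[i] = ("B-" if rank == 0 else "I-") + label
    return tokens, tags
```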

Label Mapping

| Label | Description | Example mention |
| --- | --- | --- |
| URL | Web address or obfuscated link used in campaigns. | hxxp://185.222.202.55 |
| ORG | Organisations such as companies, CERTs, or research groups. | Microsoft Threat Intelligence |
| SERVICE | Online or cloud services referenced in attacks. | Google Ads |
| SECTOR | Industry sectors or verticals targeted. | critical infrastructure |
| FILEPATH | File system paths observed in malware samples. | C:\Windows\System32\svchost.exe |
| DOMAIN | Fully qualified domains or subdomains. | malicious-domain[.]com |
| PLATFORM | Operating systems or computing platforms. | Windows Server |
| THREAT-ACTOR | Named adversary groups or aliases. | LockBit |
| PRODUCT | Commercial or open-source software products. | VMware ESXi |
| MALWARE | Malware families, strains, or toolkits. | TrickBot |
| LOC | Countries, cities, or regions. | United States |
| CVE | CVE identifiers for vulnerabilities. | CVE-2023-23397 |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | Cobalt Strike |
| IPV4 | IPv4 addresses. | 185.222.202.55 |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | Credential Access |
| MD5 | MD5 cryptographic hashes. | d41d8cd98f00b204e9800998ecf8427e |
| CAMPAIGN | Named operations or campaigns. | Operation Cronos |
| SHA1 | SHA-1 hashes. | da39a3ee5e6b4b0d3255bfef95601890afd80709 |
| SHA256 | SHA-256 hashes. | 9e107d9d372bb6826bd81d3542a419d6... |
| EMAIL | Email addresses. | [email protected] |
| IPV6 | IPv6 addresses. | 2001:0db8:85a3:0000:0000:8a2e:0370:7334 |
| REGISTRY-KEYS | Windows registry keys or paths. | HKLM\Software\Microsoft\Windows\CurrentVersion\Run |

Training Procedure

  • Base model: answerdotai/ModernBERT-large.
  • Hardware: single Nvidia L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
  • Optimisation setup: mixed precision fp16, optimiser adamw_torch, cosine learning-rate scheduler, gradient accumulation 1.

| Parameter | Value |
| --- | --- |
| Mixed precision | fp16 |
| Batch size | 128 |
| Learning rate | 5e-5 |
| Optimiser | adamw_torch |
| Scheduler | cosine |
| Epochs | 5 |
| Gradient accumulation | 1 |
| Max sequence length | 128 |
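These hyperparameters map roughly onto a transformers `TrainingArguments` configuration. The exact AutoTrain setup is not published, so the sketch below is an approximate reconstruction, not the authors' actual training script:

```python
from transformers import TrainingArguments

# Approximate reconstruction of the hyperparameters above; the exact
# AutoTrain configuration is not published, so treat this as a sketch.
args = TrainingArguments(
    output_dir="securemodernbert-ner",
    per_device_train_batch_size=128,
    learning_rate=5e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    fp16=True,
    gradient_accumulation_steps=1,
)
```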

Evaluation

AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring):

| Metric | Score |
| --- | --- |
| Precision | 0.8468 |
| Recall | 0.8484 |
| F1 | 0.8476 |
| Accuracy | 0.9589 |

An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. These scores are macro-averaged across labels and therefore are not numerically comparable to the micro metrics above, but they provide insight into class balance and span quality.

| Label | Spans used | Accuracy |
| --- | --- | --- |
| CAMPAIGN | 1,817 | 0.7980 |
| CVE | 28,293 | 0.9995 |
| DOMAIN | 12,182 | 0.8878 |
| EMAIL | 731 | 0.8495 |
| FILEPATH | 13,889 | 0.7957 |
| IPV4 | 1,164 | 0.9631 |
| IPV6 | 563 | 0.7425 |
| LOC | 7,915 | 0.9557 |
| MALWARE | 10,405 | 0.9087 |
| MD5 | 389 | 0.9100 |
| MITRE-TACTIC | 2,181 | 0.7093 |
| ORG | 36,324 | 0.9301 |
| PLATFORM | 8,036 | 0.8977 |
| PRODUCT | 18,720 | 0.8432 |
| REGISTRY-KEYS | 1,589 | 0.8490 |
| SECTOR | 6,453 | 0.8309 |
| SERVICE | 8,533 | 0.8179 |
| SHA1 | 222 | 0.9189 |
| SHA256 | 2,146 | 0.9874 |
| THREAT-ACTOR | 9,532 | 0.9418 |
| TOOL | 4,874 | 0.7895 |
| URL | 7,470 | 0.9801 |

  • Macro accuracy: 0.8776

Because micro vs macro averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint.

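Part of that gap is a pure averaging effect: micro averaging pools entity counts, so frequent classes dominate, while macro averaging weights every class equally. The self-contained illustration below uses toy counts, not the model's actual confusion data:

```python
def micro_macro_f1(per_label):
    """per_label: {label: (tp, fp, fn)} entity-level counts."""
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0
    # Micro: pool the raw counts across labels, then score once.
    tp = sum(c[0] for c in per_label.values())
    fp = sum(c[1] for c in per_label.values())
    fn = sum(c[2] for c in per_label.values())
    micro = f1(tp, fp, fn)
    # Macro: score each label separately, then average the scores.
    macro = sum(f1(*c) for c in per_label.values()) / len(per_label)
    return micro, macro

# Toy counts: a frequent easy class and a rare hard class.
counts = {"ORG": (900, 50, 50), "IPV6": (5, 5, 5)}
micro, macro = micro_macro_f1(counts)
# micro is dominated by ORG; macro weights both classes equally.
```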

External Benchmarks

The following tables report detailed results on a shared CTI validation set. Do not compare the per-label values across models directly: each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution.
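Cross-model comparison requires aligning each checkpoint's labels onto a shared schema. A minimal sketch of such an alignment is shown below for SecureBERT-NER's labels; the mapping choices are assumptions inferred from the label names, not the exact remapping used to produce these tables:

```python
# Illustrative alignment of SecureBERT-NER labels onto this model's
# taxonomy; the actual remapping behind the benchmark tables is not
# published, so these choices are assumptions based on label names.
SECUREBERT_TO_SHARED = {
    "APT": "THREAT-ACTOR",
    "MAL": "MALWARE",
    "DOM": "DOMAIN",
    "IP": "IPV4",
    "OS": "PLATFORM",
    "SECTEAM": "ORG",
    "VULID": "CVE",
    "SHA2": "SHA256",
    "URL": "URL",
    "TOOL": "TOOL",
    "LOC": "LOC",
    "MD5": "MD5",
    "EMAIL": "EMAIL",
    "SHA1": "SHA1",
}

def remap(label: str) -> str:
    """Map a foreign label to the shared schema, defaulting to 'O'."""
    return SECUREBERT_TO_SHARED.get(label, "O")
```

Labels with no clear counterpart (e.g., ACT) collapse to O under this sketch, which is one reason collapsed-schema accuracies are not directly comparable across models.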

CyberPeace-Institute/SecureBERT-NER

| Label | Spans used | Accuracy |
| --- | --- | --- |
| ACT | 3,945 | 0.1706 |
| APT | 9,518 | 0.5331 |
| DOM | 10,694 | 0.0196 |
| EMAIL | 731 | 0.0000 |
| FILE | 31,864 | 0.0747 |
| IP | 1,251 | 0.0088 |
| LOC | 7,895 | 0.8711 |
| MAL | 10,341 | 0.6076 |
| MD5 | 354 | 0.8672 |
| O | 16,275 | 0.4700 |
| OS | 7,974 | 0.6598 |
| SECTEAM | 36,083 | 0.3509 |
| SHA1 | 191 | 0.0209 |
| SHA2 | 1,647 | 0.9709 |
| TOOL | 4,816 | 0.4043 |
| URL | 6,997 | 0.0795 |
| VULID | 27,586 | 0.3849 |

  • Macro accuracy: 0.3820

PranavaKailash/CyNER-2.0-DeBERTa-v3-base

| Label | Spans used | Accuracy |
| --- | --- | --- |
| Indicator | 35,936 | 0.7878 |
| Location | 7,895 | 0.0113 |
| Malware | 12,125 | 0.7800 |
| O | 2,896 | 0.7652 |
| Organization | 42,537 | 0.6556 |
| System | 35,063 | 0.7259 |
| TOOL | 4,820 | 0.0000 |
| Threat Group | 9,522 | 0.0000 |
| Vulnerability | 27,673 | 0.1876 |

  • Macro accuracy: 0.4348

cisco-ai/SecureBERT2.0-NER

| Label | Spans used | Accuracy |
| --- | --- | --- |
| Indicator | 35,789 | 0.8854 |
| Malware | 16,926 | 0.6204 |
| O | 10,786 | 0.6813 |
| Organization | 51,993 | 0.5579 |
| System | 34,955 | 0.6600 |
| Vulnerability | 27,525 | 0.2552 |

  • Macro accuracy: 0.6100

Responsible Use

  • Confirm entity detections before acting on indicators (e.g., automated blocking).
  • Combine with enrichment and scoring systems to filter false positives.
  • Monitor for drift if applying to new domains (e.g., non-English sources, informal channels).
  • Respect licensing and confidentiality of any proprietary CTI sources used for inference.
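The first two points can be operationalised as a post-filter over pipeline predictions: drop low-confidence spans and reject indicator values that fail a format check. The sketch below is illustrative; the regexes are deliberately simplified and real pipelines should use stricter validators:

```python
import re

# Simplified format checks for a few indicator labels; these regexes are
# illustrative only and not a substitute for proper IOC validation.
VALIDATORS = {
    "CVE": re.compile(r"CVE-\d{4}-\d{4,}"),
    "IPV4": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
    "MD5": re.compile(r"[0-9a-fA-F]{32}"),
    "SHA256": re.compile(r"[0-9a-fA-F]{64}"),
}

def keep_prediction(pred, min_score=0.80):
    """Drop low-confidence spans and malformed indicator values."""
    if pred["score"] < min_score:
        return False
    checker = VALIDATORS.get(pred["entity_group"])
    if checker is not None and not checker.fullmatch(pred["word"].strip()):
        return False
    return True
```

Note that aggregated pipeline output may carry leading whitespace in `word` (as in the sample output above), hence the `strip()` before matching.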

Citation

If you find this model useful, please cite the repository and the base model:

@software{securemodernbert_ner_2025,
  author = {Juan Manuel Cristóbal Moreno},
  title = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/attack-vector/SecureModernBERT-NER}
}

Contact

Questions or feedback? Open an issue on the Hugging Face model repository or reach out at @juanmcristobal.
