Model Overview
SecureModernBERT-NER represents a new generation of cybersecurity-focused language models — combining the state-of-the-art architecture of ModernBERT with one of the largest and most diverse CTI-labelled NER corpora ever built.
Unlike conventional NER systems, SecureModernBERT-NER recognises 22 fine-grained, security-specific entity types, covering the full spectrum of cyber-threat intelligence, from THREAT-ACTOR and MALWARE to CVE, IPV4, DOMAIN, and REGISTRY-KEYS.
Trained on more than half a million manually curated spans sourced from real-world threat reports, vulnerability advisories, and incident analyses, it achieves an exceptional balance of accuracy, generalisation, and contextual depth.
This model is designed to parse complex security narratives with human-level precision, extracting both contextual metadata (e.g., ORG, PRODUCT, PLATFORM) and highly technical indicators (e.g., HASHES, URLS, NETWORK ADDRESSES) — all within a single unified framework.
SecureModernBERT-NER sets a new standard for automated CTI entity recognition, enabling the next wave of threat-intelligence automation, enrichment, and analytics.
Quick Start
```python
from transformers import pipeline

model_id = "attack-vector/SecureModernBERT-NER"
pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)
```
Sample output:

```python
{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
```
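For downstream use it is often convenient to collect the raw pipeline output into per-label lists. The helper below is a minimal sketch (the `group_entities` name is ours, not part of the model's API); it only regroups the dictionaries the pipeline already returns.

```python
from collections import defaultdict

def group_entities(predictions):
    """Group pipeline output into {label: [surface strings]} for downstream use."""
    grouped = defaultdict(list)
    for pred in predictions:
        grouped[pred["entity_group"]].append(pred["word"].strip())
    return dict(grouped)

# The sample output above, with scores elided:
preds = [
    {"entity_group": "MALWARE", "word": "TrickBot", "start": 0, "end": 8},
    {"entity_group": "URL", "word": " hxxp://185.222.202.55", "start": 20, "end": 42},
    {"entity_group": "PLATFORM", "word": " Windows", "start": 66, "end": 74},
]
print(group_entities(preds))
# {'MALWARE': ['TrickBot'], 'URL': ['hxxp://185.222.202.55'], 'PLATFORM': ['Windows']}
```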
Intended Use & Limitations
- Use cases: automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
- Languages: English (model was trained and evaluated on English sources only).
- Input format: free-form prose or long-form CTI articles; maximum sequence length was 128 tokens during training, so longer inputs should be chunked before inference.
- Limitations: noisy or ambiguous extractions may occur, especially with rare entity types (IPV6, EMAIL) and obfuscated strings. The model does not normalise entities (e.g., deobfuscating hxxp) nor validate indicator authenticity. Always pair it with downstream validation and human review.
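Since the model leaves indicators defanged, a small normalisation step is typically needed before any validation. This is a heuristic sketch under our own assumptions (the `refang` helper is not part of the model); real CTI text uses many more obfuscation styles than the few handled here.

```python
import re

def refang(indicator: str) -> str:
    """Undo common defanging conventions (hxxp, [.]) in an extracted indicator.

    Heuristic sketch only; extend for the obfuscation styles in your sources.
    """
    out = indicator.strip()
    out = re.sub(r"^hxxp", "http", out, flags=re.IGNORECASE)  # hxxp:// -> http://
    out = out.replace("[.]", ".").replace("(.)", ".").replace("[:]", ":")
    return out

print(refang(" hxxp://185.222.202.55"))  # http://185.222.202.55
print(refang("malicious-domain[.]com"))  # malicious-domain.com
```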
Training Data
- Size: 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
- Label distribution (spans): ORG (approx. 198k), PRODUCT (approx. 79k), MALWARE (approx. 67k), PLATFORM (approx. 57k), THREAT-ACTOR (approx. 49k), SERVICE (approx. 46k), CVE (approx. 41k), LOC (approx. 38k), SECTOR (approx. 34k), TOOL (approx. 29k), plus indicator types such as URL, IPV4, SHA256, MD5, and REGISTRY-KEYS.
- Pre-processing: JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload.
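The span-to-BIO conversion mentioned above can be sketched as follows. This is our illustrative reconstruction, not the project's actual pre-processing code, and it assumes whitespace-level tokens with character offsets.

```python
def spans_to_bio(tokens_with_offsets, spans):
    """Convert character-level span annotations to BIO tags.

    tokens_with_offsets: list of (token, start, end) character offsets.
    spans: list of (start, end, label) annotations.
    A token opening a span gets B-<label>; tokens inside it get I-<label>.
    """
    tags = []
    for _, t_start, t_end in tokens_with_offsets:
        tag = "O"
        for s_start, s_end, label in spans:
            if t_start >= s_start and t_end <= s_end:
                tag = ("B-" if t_start == s_start else "I-") + label
                break
        tags.append(tag)
    return tags

tokens = [("TrickBot", 0, 8), ("connects", 9, 17)]
print(spans_to_bio(tokens, [(0, 8, "MALWARE")]))  # ['B-MALWARE', 'O']
```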
Label Mapping
| Label | Description | Example mention |
|---|---|---|
| URL | Web address or obfuscated link used in campaigns. | hxxp://185.222.202.55 |
| ORG | Organisations such as companies, CERTs, or research groups. | Microsoft Threat Intelligence |
| SERVICE | Online or cloud services referenced in attacks. | Google Ads |
| SECTOR | Industry sectors or verticals targeted. | critical infrastructure |
| FILEPATH | File system paths observed in malware samples. | C:\Windows\System32\svchost.exe |
| DOMAIN | Fully qualified domains or subdomains. | malicious-domain[.]com |
| PLATFORM | Operating systems or computing platforms. | Windows Server |
| THREAT-ACTOR | Named adversary groups or aliases. | LockBit |
| PRODUCT | Commercial or open-source software products. | VMware ESXi |
| MALWARE | Malware families, strains, or toolkits. | TrickBot |
| LOC | Countries, cities, or regions. | United States |
| CVE | CVE identifiers for vulnerabilities. | CVE-2023-23397 |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | Cobalt Strike |
| IPV4 | IPv4 addresses. | 185.222.202.55 |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | Credential Access |
| MD5 | MD5 cryptographic hashes. | d41d8cd98f00b204e9800998ecf8427e |
| CAMPAIGN | Named operations or campaigns. | Operation Cronos |
| SHA1 | SHA-1 hashes. | da39a3ee5e6b4b0d3255bfef95601890afd80709 |
| SHA256 | SHA-256 hashes. | 9e107d9d372bb6826bd81d3542a419d6... |
| EMAIL | Email addresses. | [email protected] |
| IPV6 | IPv6 addresses. | 2001:0db8:85a3:0000:0000:8a2e:0370:7334 |
| REGISTRY-KEYS | Windows registry keys or paths. | HKLM\Software\Microsoft\Windows\CurrentVersion\Run |
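Several of the labels above (CVE, IPV4, the hash types) have rigid surface forms, so extractions can get a cheap plausibility check before entering a pipeline. The patterns below are our own sketch, not shipped with the model; contextual labels such as ORG or MALWARE cannot be validated this way and pass through unchecked.

```python
import re

# Hypothetical sanity checks for the pattern-like labels in the table above.
PATTERNS = {
    "CVE": re.compile(r"CVE-\d{4}-\d{4,}"),
    "IPV4": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
    "MD5": re.compile(r"[a-fA-F0-9]{32}"),
    "SHA1": re.compile(r"[a-fA-F0-9]{40}"),
    "SHA256": re.compile(r"[a-fA-F0-9]{64}"),
}

def plausible(label: str, value: str) -> bool:
    """Return False only when a pattern-like label clearly fails its shape check."""
    pattern = PATTERNS.get(label)
    return True if pattern is None else bool(pattern.fullmatch(value))

print(plausible("CVE", "CVE-2023-23397"))                     # True
print(plausible("MD5", "d41d8cd98f00b204e9800998ecf8427e"))   # True
print(plausible("MD5", "not-a-hash"))                         # False
```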
Training Procedure
- Base model: answerdotai/ModernBERT-large.
- Hardware: single Nvidia L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
- Optimisation setup: mixed precision fp16, optimiser adamw_torch, cosine learning-rate scheduler, gradient accumulation 1.
- Key hyperparameters: learning rate 5e-5, batch size 128, epochs 5, maximum sequence length 128.
| Parameter | Value |
|---|---|
| Mixed precision | fp16 |
| Batch size | 128 |
| Learning rate | 5e-5 |
| Optimiser | adamw_torch |
| Scheduler | cosine |
| Epochs | 5 |
| Gradient accumulation | 1 |
| Max sequence length | 128 |
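For readers reproducing a similar run, the table above maps onto Hugging Face `TrainingArguments` roughly as sketched below. This is our approximation; the exact AutoTrain job configuration is not published with the model, and `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameter table; not the actual AutoTrain config.
args = TrainingArguments(
    output_dir="securemodernbert-ner",   # placeholder path
    per_device_train_batch_size=128,
    learning_rate=5e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    fp16=True,
    gradient_accumulation_steps=1,
)
```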
Evaluation
AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring):
| Metric | Score |
|---|---|
| Precision | 0.8468 |
| Recall | 0.8484 |
| F1 | 0.8476 |
| Accuracy | 0.9589 |
An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. These scores are macro-averaged across labels and therefore are not numerically comparable to the micro metrics above, but they provide insight into class balance and span quality.
| Label | Used | Accuracy |
|---|---|---|
| CAMPAIGN | 1,817 | 0.7980 |
| CVE | 28,293 | 0.9995 |
| DOMAIN | 12,182 | 0.8878 |
| EMAIL | 731 | 0.8495 |
| FILEPATH | 13,889 | 0.7957 |
| IPV4 | 1,164 | 0.9631 |
| IPV6 | 563 | 0.7425 |
| LOC | 7,915 | 0.9557 |
| MALWARE | 10,405 | 0.9087 |
| MD5 | 389 | 0.9100 |
| MITRE-TACTIC | 2,181 | 0.7093 |
| ORG | 36,324 | 0.9301 |
| PLATFORM | 8,036 | 0.8977 |
| PRODUCT | 18,720 | 0.8432 |
| REGISTRY-KEYS | 1,589 | 0.8490 |
| SECTOR | 6,453 | 0.8309 |
| SERVICE | 8,533 | 0.8179 |
| SHA1 | 222 | 0.9189 |
| SHA256 | 2,146 | 0.9874 |
| THREAT-ACTOR | 9,532 | 0.9418 |
| TOOL | 4,874 | 0.7895 |
| URL | 7,470 | 0.9801 |
- Macro accuracy: 0.8776
Because micro vs macro averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint.
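The micro/macro gap is easy to see with toy numbers (ours, not the model's): micro accuracy weights every span equally, so a large easy class dominates it, while macro accuracy weights every label equally.

```python
def micro_macro(per_label):
    """per_label: {label: (correct, total)}. Returns (micro, macro) accuracy."""
    correct = sum(c for c, _ in per_label.values())
    total = sum(t for _, t in per_label.values())
    micro = correct / total
    macro = sum(c / t for c, t in per_label.values()) / len(per_label)
    return micro, macro

# Toy counts: the large ORG class lifts micro far above macro.
micro, macro = micro_macro({"ORG": (900, 1000), "IPV6": (5, 10)})
print(round(micro, 3), round(macro, 3))  # 0.896 0.7
```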
External Benchmarks
The following tables report detailed results on a shared CTI validation set. Do not compare the per-label values across models directly: each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution.
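Cross-model comparison requires collapsing one taxonomy onto another before scoring. The mapping below is purely illustrative (our guesses, not the remapping actually used for these tables) and shows why results are schema-dependent: any label without a counterpart falls back to `O` and is scored as a miss.

```python
# Hypothetical remapping: collapse this model's fine-grained labels onto a
# coarser taxonomy such as CyNER's before computing accuracy.
REMAP = {
    "IPV4": "Indicator", "IPV6": "Indicator", "URL": "Indicator",
    "DOMAIN": "Indicator", "MD5": "Indicator", "SHA1": "Indicator",
    "SHA256": "Indicator",
    "MALWARE": "Malware", "THREAT-ACTOR": "Threat Group",
    "CVE": "Vulnerability", "ORG": "Organization",
    "PLATFORM": "System", "PRODUCT": "System",
}

def collapse(label: str) -> str:
    """Map a fine-grained label to the coarse schema; unmapped labels become O."""
    return REMAP.get(label, "O")

print(collapse("SHA256"))  # Indicator
print(collapse("SECTOR"))  # O (no counterpart: scored as a miss)
```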
CyberPeace-Institute/SecureBERT-NER
| Label | Used | Accuracy |
|---|---|---|
| ACT | 3,945 | 0.1706 |
| APT | 9,518 | 0.5331 |
| DOM | 10,694 | 0.0196 |
| EMAIL | 731 | 0.0000 |
| FILE | 31,864 | 0.0747 |
| IP | 1,251 | 0.0088 |
| LOC | 7,895 | 0.8711 |
| MAL | 10,341 | 0.6076 |
| MD5 | 354 | 0.8672 |
| O | 16,275 | 0.4700 |
| OS | 7,974 | 0.6598 |
| SECTEAM | 36,083 | 0.3509 |
| SHA1 | 191 | 0.0209 |
| SHA2 | 1,647 | 0.9709 |
| TOOL | 4,816 | 0.4043 |
| URL | 6,997 | 0.0795 |
| VULID | 27,586 | 0.3849 |
- Macro accuracy: 0.3820
PranavaKailash/CyNER-2.0-DeBERTa-v3-base
| Label | Used | Accuracy |
|---|---|---|
| Indicator | 35,936 | 0.7878 |
| Location | 7,895 | 0.0113 |
| Malware | 12,125 | 0.7800 |
| O | 2,896 | 0.7652 |
| Organization | 42,537 | 0.6556 |
| System | 35,063 | 0.7259 |
| TOOL | 4,820 | 0.0000 |
| Threat Group | 9,522 | 0.0000 |
| Vulnerability | 27,673 | 0.1876 |
- Macro accuracy: 0.4348
cisco-ai/SecureBERT2.0-NER
| Label | Used | Accuracy |
|---|---|---|
| Indicator | 35,789 | 0.8854 |
| Malware | 16,926 | 0.6204 |
| O | 10,786 | 0.6813 |
| Organization | 51,993 | 0.5579 |
| System | 34,955 | 0.6600 |
| Vulnerability | 27,525 | 0.2552 |
- Macro accuracy: 0.6100
Responsible Use
- Confirm entity detections before acting on indicators (e.g., automated blocking).
- Combine with enrichment and scoring systems to filter false positives.
- Monitor for drift if applying to new domains (e.g., non-English sources, informal channels).
- Respect licensing and confidentiality of any proprietary CTI sources used for inference.
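A minimal form of the first two recommendations is a confidence threshold on the pipeline's scores. The helper below is a sketch (names are ours); in practice the threshold should be tuned per label against a validation set rather than set globally.

```python
def filter_by_score(predictions, threshold=0.90):
    """Drop low-confidence detections before they reach automated actions."""
    return [p for p in predictions if p["score"] >= threshold]

preds = [
    {"entity_group": "MALWARE", "score": 0.96, "word": "TrickBot"},
    {"entity_group": "TOOL", "score": 0.61, "word": "certutil"},
]
print(filter_by_score(preds))  # only the high-confidence MALWARE hit survives
```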
Citation
If you find this model useful, please cite the repository and the base model:
```bibtex
@software{securemodernbert_ner_2025,
  author    = {Juan Manuel Cristóbal Moreno},
  title     = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/attack-vector/SecureModernBERT-NER}
}
```
Contact
Questions or feedback? Open an issue on the Hugging Face model repository or reach out at @juanmcristobal.