Model Overview

SecureModernBERT-NER is a cybersecurity-focused language model that pairs the ModernBERT architecture with a large, diverse corpus of CTI-labelled NER annotations.

Unlike conventional NER systems, SecureModernBERT-NER recognises 22 fine-grained, security-specific entity types covering the full spectrum of cyber-threat intelligence, from THREAT-ACTOR and MALWARE to CVE, IPV4, DOMAIN, and REGISTRY-KEYS.

Trained on more than half a million manually curated spans sourced from real-world threat reports, vulnerability advisories, and incident analyses, it achieves an exceptional balance of accuracy, generalisation, and contextual depth.

This model is designed to parse complex security narratives with high precision, extracting both contextual metadata (e.g., ORG, PRODUCT, PLATFORM) and highly technical indicators (e.g., MD5/SHA256 hashes, URLs, IP addresses) within a single unified framework.

SecureModernBERT-NER sets a new standard for automated CTI entity recognition, enabling the next wave of threat-intelligence automation, enrichment, and analytics.

Quick Start

from transformers import pipeline

model_id = "attack-vector/SecureModernBERT-NER"

# "first" aggregation merges sub-word tokens into whole-entity spans,
# labelling each span with its first token's prediction.
pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)

Sample output:

{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
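Because the model was trained with a maximum sequence length of 128 tokens, long CTI articles are best processed in overlapping windows whose predictions are rebased onto document offsets. The helpers below (`chunk_text`, `shift_spans`) are hypothetical names, a minimal sketch rather than part of the model repository:

```python
# Hypothetical helpers (not part of the model repo): split a long article
# into overlapping character windows, run the pipeline per window, and map
# chunk-local offsets back to document offsets.

def chunk_text(text, window=500, overlap=100):
    """Yield (chunk, start_offset) pairs covering the whole text."""
    step = window - overlap
    for start in range(0, max(len(text), 1), step):
        yield text[start:start + window], start
        if start + window >= len(text):
            break

def shift_spans(predictions, offset):
    """Rebase chunk-local 'start'/'end' offsets onto the full document."""
    return [{**p, "start": p["start"] + offset, "end": p["end"] + offset}
            for p in predictions]

# Usage with the pipeline above (sketch):
# entities = []
# for chunk, offset in chunk_text(long_article):
#     entities.extend(shift_spans(pipe(chunk), offset))
```

The overlap ensures entities straddling a window boundary are seen whole in at least one window; duplicate spans from the overlapped region should be deduplicated downstream.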

Intended Use & Limitations

  • Use cases: automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
  • Languages: English (model was trained and evaluated on English sources only).
  • Input format: free-form prose or long-form CTI articles; maximum sequence length 128 tokens during training.
  • Limitations: noisy or ambiguous extractions may occur, especially with rare entity types (IPV6, EMAIL) and obfuscated strings. The model does not normalise entities (e.g., deobfuscating hxxp) nor validate indicator authenticity. Always pair with downstream validation and human review.

Training Data

  • Size: 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
  • Label distribution (spans): ORG (approx. 198k), PRODUCT (approx. 79k), MALWARE (approx. 67k), PLATFORM (approx. 57k), THREAT-ACTOR (approx. 49k), SERVICE (approx. 46k), CVE (approx. 41k), LOC (approx. 38k), SECTOR (approx. 34k), TOOL (approx. 29k), plus indicator types such as URL, IPV4, SHA256, MD5, and REGISTRY-KEYS.
  • Pre-processing: JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload.
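The span-to-BIO conversion step can be sketched as follows. This is a simplified illustration assuming whitespace tokenisation and character-offset span annotations; the actual pre-processing pipeline is not published and may differ:

```python
def spans_to_bio(text, spans):
    """Convert character-offset span annotations to per-token BIO tags.

    `spans` is a list of (start, end, label) tuples; tokenisation here is
    plain whitespace splitting, a simplification of the real pipeline.
    """
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = ["O"] * len(tokens)
    for s, e, label in spans:
        # A token is inside the span if the character ranges overlap.
        inside = [i for i, (ts, te) in enumerate(offsets) if ts < e and te > s]
        for rank, i in enumerate(inside):
            tags[i] = ("B-" if rank == 0 else "I-") + label
    return tokens, tags
```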

Label Mapping

| Label | Description | Example mention |
| --- | --- | --- |
| URL | Web address or obfuscated link used in campaigns. | hxxp://185.222.202.55 |
| ORG | Organisations such as companies, CERTs, or research groups. | Microsoft Threat Intelligence |
| SERVICE | Online or cloud services referenced in attacks. | Google Ads |
| SECTOR | Industry sectors or verticals targeted. | critical infrastructure |
| FILEPATH | File system paths observed in malware samples. | C:\Windows\System32\svchost.exe |
| DOMAIN | Fully qualified domains or subdomains. | malicious-domain[.]com |
| PLATFORM | Operating systems or computing platforms. | Windows Server |
| THREAT-ACTOR | Named adversary groups or aliases. | LockBit |
| PRODUCT | Commercial or open-source software products. | VMware ESXi |
| MALWARE | Malware families, strains, or toolkits. | TrickBot |
| LOC | Countries, cities, or regions. | United States |
| CVE | CVE identifiers for vulnerabilities. | CVE-2023-23397 |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | Cobalt Strike |
| IPV4 | IPv4 addresses. | 185.222.202.55 |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | Credential Access |
| MD5 | MD5 cryptographic hashes. | d41d8cd98f00b204e9800998ecf8427e |
| CAMPAIGN | Named operations or campaigns. | Operation Cronos |
| SHA1 | SHA-1 hashes. | da39a3ee5e6b4b0d3255bfef95601890afd80709 |
| SHA256 | SHA-256 hashes. | 9e107d9d372bb6826bd81d3542a419d6... |
| EMAIL | Email addresses. | [email protected] |
| IPV6 | IPv6 addresses. | 2001:0db8:85a3:0000:0000:8a2e:0370:7334 |
| REGISTRY-KEYS | Windows registry keys or paths. | HKLM\Software\Microsoft\Windows\CurrentVersion\Run |

Training Procedure

  • Base model: answerdotai/ModernBERT-large.
  • Hardware: single Nvidia L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
  • Optimisation setup: mixed precision fp16, optimiser adamw_torch, cosine learning-rate scheduler, gradient accumulation 1.

| Parameter | Value |
| --- | --- |
| Mixed precision | fp16 |
| Batch size | 128 |
| Learning rate | 5e-5 |
| Optimiser | adamw_torch |
| Scheduler | cosine |
| Epochs | 5 |
| Gradient accumulation | 1 |
| Max sequence length | 128 |
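These hyperparameters map roughly onto a transformers `TrainingArguments` configuration. The exact AutoTrain setup is not published, so the sketch below is an approximate reconstruction, not the authors' actual training script:

```python
from transformers import TrainingArguments

# Approximate reconstruction of the hyperparameters above; the exact
# AutoTrain configuration is not published, so treat this as a sketch.
args = TrainingArguments(
    output_dir="securemodernbert-ner",
    per_device_train_batch_size=128,
    learning_rate=5e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    fp16=True,
    gradient_accumulation_steps=1,
)
```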

Evaluation

AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring):

| Metric | Score |
| --- | --- |
| Precision | 0.8468 |
| Recall | 0.8484 |
| F1 | 0.8476 |
| Accuracy | 0.9589 |

An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. These scores are macro-averaged across labels and therefore are not numerically comparable to the micro metrics above, but they provide insight into class balance and span quality.

| Label | Spans used | Accuracy |
| --- | --- | --- |
| CAMPAIGN | 1,817 | 0.7980 |
| CVE | 28,293 | 0.9995 |
| DOMAIN | 12,182 | 0.8878 |
| EMAIL | 731 | 0.8495 |
| FILEPATH | 13,889 | 0.7957 |
| IPV4 | 1,164 | 0.9631 |
| IPV6 | 563 | 0.7425 |
| LOC | 7,915 | 0.9557 |
| MALWARE | 10,405 | 0.9087 |
| MD5 | 389 | 0.9100 |
| MITRE-TACTIC | 2,181 | 0.7093 |
| ORG | 36,324 | 0.9301 |
| PLATFORM | 8,036 | 0.8977 |
| PRODUCT | 18,720 | 0.8432 |
| REGISTRY-KEYS | 1,589 | 0.8490 |
| SECTOR | 6,453 | 0.8309 |
| SERVICE | 8,533 | 0.8179 |
| SHA1 | 222 | 0.9189 |
| SHA256 | 2,146 | 0.9874 |
| THREAT-ACTOR | 9,532 | 0.9418 |
| TOOL | 4,874 | 0.7895 |
| URL | 7,470 | 0.9801 |

  • Macro accuracy: 0.8776

Because micro vs macro averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint.

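Part of that gap is a pure averaging effect: micro averaging pools entity counts, so frequent classes dominate, while macro averaging weights every class equally. The self-contained illustration below uses toy counts, not the model's actual confusion data:

```python
def micro_macro_f1(per_label):
    """per_label: {label: (tp, fp, fn)} entity-level counts."""
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0
    # Micro: pool the raw counts across labels, then score once.
    tp = sum(c[0] for c in per_label.values())
    fp = sum(c[1] for c in per_label.values())
    fn = sum(c[2] for c in per_label.values())
    micro = f1(tp, fp, fn)
    # Macro: score each label separately, then average the scores.
    macro = sum(f1(*c) for c in per_label.values()) / len(per_label)
    return micro, macro

# Toy counts: a frequent easy class and a rare hard class.
counts = {"ORG": (900, 50, 50), "IPV6": (5, 5, 5)}
micro, macro = micro_macro_f1(counts)
# micro is dominated by ORG; macro weights both classes equally.
```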

External Benchmarks

The following tables report detailed results on a shared CTI validation set. Do not compare the per-label values across models directly: each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution.
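Cross-model comparison requires aligning each checkpoint's labels onto a shared schema. A minimal sketch of such an alignment is shown below for SecureBERT-NER's labels; the mapping choices are assumptions inferred from the label names, not the exact remapping used to produce these tables:

```python
# Illustrative alignment of SecureBERT-NER labels onto this model's
# taxonomy; the actual remapping behind the benchmark tables is not
# published, so these choices are assumptions based on label names.
SECUREBERT_TO_SHARED = {
    "APT": "THREAT-ACTOR",
    "MAL": "MALWARE",
    "DOM": "DOMAIN",
    "IP": "IPV4",
    "OS": "PLATFORM",
    "SECTEAM": "ORG",
    "VULID": "CVE",
    "SHA2": "SHA256",
    "URL": "URL",
    "TOOL": "TOOL",
    "LOC": "LOC",
    "MD5": "MD5",
    "EMAIL": "EMAIL",
    "SHA1": "SHA1",
}

def remap(label: str) -> str:
    """Map a foreign label to the shared schema, defaulting to 'O'."""
    return SECUREBERT_TO_SHARED.get(label, "O")
```

Labels with no clear counterpart (e.g., ACT) collapse to O under this sketch, which is one reason collapsed-schema accuracies are not directly comparable across models.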

CyberPeace-Institute/SecureBERT-NER

| Label | Spans used | Accuracy |
| --- | --- | --- |
| ACT | 3,945 | 0.1706 |
| APT | 9,518 | 0.5331 |
| DOM | 10,694 | 0.0196 |
| EMAIL | 731 | 0.0000 |
| FILE | 31,864 | 0.0747 |
| IP | 1,251 | 0.0088 |
| LOC | 7,895 | 0.8711 |
| MAL | 10,341 | 0.6076 |
| MD5 | 354 | 0.8672 |
| O | 16,275 | 0.4700 |
| OS | 7,974 | 0.6598 |
| SECTEAM | 36,083 | 0.3509 |
| SHA1 | 191 | 0.0209 |
| SHA2 | 1,647 | 0.9709 |
| TOOL | 4,816 | 0.4043 |
| URL | 6,997 | 0.0795 |
| VULID | 27,586 | 0.3849 |

  • Macro accuracy: 0.3820

PranavaKailash/CyNER-2.0-DeBERTa-v3-base

| Label | Spans used | Accuracy |
| --- | --- | --- |
| Indicator | 35,936 | 0.7878 |
| Location | 7,895 | 0.0113 |
| Malware | 12,125 | 0.7800 |
| O | 2,896 | 0.7652 |
| Organization | 42,537 | 0.6556 |
| System | 35,063 | 0.7259 |
| TOOL | 4,820 | 0.0000 |
| Threat Group | 9,522 | 0.0000 |
| Vulnerability | 27,673 | 0.1876 |

  • Macro accuracy: 0.4348

cisco-ai/SecureBERT2.0-NER

| Label | Spans used | Accuracy |
| --- | --- | --- |
| Indicator | 35,789 | 0.8854 |
| Malware | 16,926 | 0.6204 |
| O | 10,786 | 0.6813 |
| Organization | 51,993 | 0.5579 |
| System | 34,955 | 0.6600 |
| Vulnerability | 27,525 | 0.2552 |

  • Macro accuracy: 0.6100

Responsible Use

  • Confirm entity detections before acting on indicators (e.g., automated blocking).
  • Combine with enrichment and scoring systems to filter false positives.
  • Monitor for drift if applying to new domains (e.g., non-English sources, informal channels).
  • Respect licensing and confidentiality of any proprietary CTI sources used for inference.
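The first two points can be operationalised as a post-filter over pipeline predictions: drop low-confidence spans and reject indicator values that fail a format check. The sketch below is illustrative; the regexes are deliberately simplified and real pipelines should use stricter validators:

```python
import re

# Simplified format checks for a few indicator labels; these regexes are
# illustrative only and not a substitute for proper IOC validation.
VALIDATORS = {
    "CVE": re.compile(r"CVE-\d{4}-\d{4,}"),
    "IPV4": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
    "MD5": re.compile(r"[0-9a-fA-F]{32}"),
    "SHA256": re.compile(r"[0-9a-fA-F]{64}"),
}

def keep_prediction(pred, min_score=0.80):
    """Drop low-confidence spans and malformed indicator values."""
    if pred["score"] < min_score:
        return False
    checker = VALIDATORS.get(pred["entity_group"])
    if checker is not None and not checker.fullmatch(pred["word"].strip()):
        return False
    return True
```

Note that aggregated pipeline output may carry leading whitespace in `word` (as in the sample output above), hence the `strip()` before matching.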

Citation

If you find this model useful, please cite the repository and the base model:

@software{securemodernbert_ner_2025,
  author = {Juan Manuel Cristóbal Moreno},
  title = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/attack-vector/SecureModernBERT-NER}
}

Contact

Questions or feedback? Open an issue on the Hugging Face model repository or reach out at @juanmcristobal.
