---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
tags:
- text-classification
- brazilian-portuguese
- nfe
pipeline_tag: text-classification
library_name: transformers
license: apache-2.0
---

# deepseek-r1-distill-llama-8B-finetuned-nfe-detection

> **Finetuned DeepSeek R1 Distill Llama 8B for detecting suppliers under federal sanctions (CGU/CEIS) in Brazilian NF‑e documents.**

[![Hugging Face](https://img.shields.io/badge/HF-Model-informational?logo=huggingface)](https://huggingface.co/CleitonOERocha/deepseek-r1-distill-llama-8B-finetuned-nfe-detection)
[![License: Apache‑2.0](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

---

## TL;DR

* **Task**  Binary text classification (`0` = ordinary purchase, `1` = purchase from sanctioned supplier)
* **Base model**  `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` (8 B params)
* **Training data**  40 000 NF‑e records (70 % train, 30 % test)
* **Best epoch**  5 / 5 (early‑stopping on validation loss)
* **Performance**  Accuracy 0.952 | F1 0.953 | ROC‑AUC 0.981
* **License**  Apache 2.0 (weights, code & dataset)

---

## Motivation

Brazil’s federal administration issued **1.76 M invoices in 2023**. Manually checking which suppliers have already been sanctioned by regulators is tedious and error‑prone. This model automates the first triage step by flagging suspicious transactions for auditors to review.

The work is part of the master’s dissertation *Detection of Potentially Untrustworthy Companies through Government Procurement Extracts* (UFBA, 2025).


---

## Quick start

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "CleitonOERocha/deepseek-r1-distill-llama-8B-finetuned-nfe-detection"

# Load the fine-tuned two-class classifier and its tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

clf = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,     # return the score of every label, not just the best one
    device="cuda"   # "cpu" also works
)

# Invoice fields are concatenated into a single string separated by [SEP]
print(clf("[CLS] Destinatario: XXX BATALHAO LOG [SEP] Municipio emitente: SÃO PAULO [SEP] Descricao do produto: MICROFONE LAPELA BY-M1 PRETO P2 [SEP] Qtd: 2 [SEP] Total: 649.78"))
```

Example output:

```json
[{"label": "LABEL_1", "score": 0.9752885699272156},
 {"label": "LABEL_0", "score": 0.024711500853300095}]
```

Here `LABEL_1` corresponds to class `1` (purchase from a sanctioned supplier) and `LABEL_0` to class `0` (ordinary purchase).

---

### Dataset creation

- Crawled NF‑e ZIPs from [Portal da Transparência](https://www.portaltransparencia.gov.br/)
- Merged with sanction list from CGU/CEIS
- Filtering → deduplication → text normalisation → label propagation

---

## Model details

| Item                | Value                        |
| ------------------- | ---------------------------- |
| Base                | DeepSeek R1 Distill Llama 8B |
| Parameters          | 8 B                          |
| Architecture        | Decoder‑only Transformer     |
| Max sequence length | 4 096                        |
| Fine‑tuned epochs   | 5                            |
| Learning rate       | 2 × 10⁻⁵                     |
| Optimizer           | AdamW                        |
| Loss                | Cross‑entropy                |

> Dataset: **40 000** NF‑e lines (28 000 train | 12 000 test)

### Evaluation metrics

| Metric    | Value      |
| --------- | ---------- |
| Accuracy  | **0.9519** |
| Precision | 0.9457     |
| Recall    | 0.9599     |
| F1‑score  | **0.9527** |
| ROC‑AUC   | 0.9812     |

### Confusion matrix (test set)

|            | Pred 0 | Pred 1 |
|------------|--------|--------|
| **True 0** | 5 603  | 334    |
| **True 1** | 243    | 5 820  |

---

## Limitations & Biases

- Relies only on free‑text invoice fields; numeric anomalies (e.g., price outliers) are out of scope.
- Trained on 2023 federal data; state/municipal or older invoices may need adaptation.
- False positives are expected; always corroborate with additional data.

---

## Ethical considerations

Use this model as an **assistant**, not a final verdict. Always corroborate predictions with official sanction registries.

---

## License

Released under the **Apache License 2.0**. This applies to weights, code and dataset scripts.

---

## Resources

* **GitHub** (source code, data‑processing notebooks and training logs): https://github.com/CleitonOERocha/Mestrado
* **Hugging Face** (model hub page): https://huggingface.co/CleitonOERocha/deepseek-r1-distill-llama-8B-finetuned-nfe-detection

---

## Citation

```bibtex
@mastersthesis{rocha2025nfe,
  author  = {Cleiton Otavio da Exaltação Rocha and Gecynalda Soares da Silva Gomes},
  title   = {Detection of Potentially Untrustworthy Companies through Government Procurement Extracts},
  school  = {Universidade Federal da Bahia},
  year    = {2025},
  address = {Salvador, Brasil}
}
```

---

### Contact

Open an issue on the [GitHub repo](https://github.com/CleitonOERocha/Mestrado) or tag **@CleitonOERocha** on the 🤗 Hub.

---
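
### Reproducing the reported metrics

The accuracy, precision, recall and F1 values in the evaluation table can be recomputed from the test-set confusion matrix alone. The snippet below is a minimal sketch using only the four counts published in this card; the variable names (`tn`, `fp`, `fn`, `tp`) are illustrative and not taken from the training code. ROC‑AUC is not included because it needs the per-example scores, which the confusion matrix does not carry.

```python
# Counts copied from the confusion matrix (test set) in this card.
tn, fp = 5603, 334   # true label 0: predicted 0 / predicted 1
fn, tp = 243, 5820   # true label 1: predicted 0 / predicted 1

total = tn + fp + fn + tp                  # number of test examples
accuracy = (tp + tn) / total               # share of correct predictions
precision = tp / (tp + fp)                 # flagged invoices that were truly class 1
recall = tp / (tp + fn)                    # class-1 invoices that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"n={total}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The counts add up to the 12 000 test records, and the recomputed values agree with the table to within rounding of the last digit (F1 evaluates to roughly 0.9528 against the reported 0.9527).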