---
library_name: transformers
datasets:
- stanfordnlp/imdb
metrics:
- accuracy
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
---

# Model Card for bert-imdb-sentiment

This is a fine-tuned `bert-base-uncased` model for **binary sentiment classification** on the IMDb movie reviews dataset. The model predicts whether a given movie review is **positive** or **negative**.

## Model Details

### Model Description

This model is a `BertForSequenceClassification` model fine-tuned with Hugging Face Transformers on the IMDb dataset (25,000 movie reviews). Training used the `Trainer` API with the following configuration:

- Tokenization with `BertTokenizer` (`bert-base-uncased`), maximum sequence length of 256.
- Fine-tuned for 3 epochs with a learning rate of `2e-5` and mixed-precision (fp16) training.
- Achieved **~91.54% accuracy** and an **F1 score of ~91.54%** on the test split.

- **Developed by:** Koushik Reddy
- **Model type:** Transformer-based sequence classifier (`BertForSequenceClassification`)
- **Language(s) (NLP):** English
- **Finetuned from model:** `bert-base-uncased` ([Hugging Face link](https://huggingface.co/bert-base-uncased))

### Model Sources

- **Repository:** [https://huggingface.co/koushik-25/bert-imdb-sentiment](https://huggingface.co/koushik-25/bert-imdb-sentiment)
- **Paper:** Original BERT paper: Devlin et al., 2018 ([https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805))
- **Demo:** You can test the model directly with the Inference Widget on the model page.

## Intended Uses & Limitations

- ✅ Intended for sentiment classification of English movie reviews.
- ⚠️ May not generalize well to other domains (e.g., tweets, product reviews) without additional fine-tuning.
- ⚠️ May reflect biases present in the IMDb dataset and the original BERT pre-training corpus.

### Direct Use

```python
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Load the model and tokenizer from the Hub
model = BertForSequenceClassification.from_pretrained("koushik-25/bert-imdb-sentiment")
tokenizer = BertTokenizer.from_pretrained("koushik-25/bert-imdb-sentiment")

# Inference
inputs = tokenizer("The movie was fantastic!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=1).item()
print(["NEGATIVE", "POSITIVE"][pred])
```

## Training Details

### Training Data

- **Dataset:** IMDb movie reviews (`datasets.load_dataset("imdb")`).
- **Size:** 25,000 training and 25,000 test samples.
- **Preprocessing:** Tokenization with `max_length=256`, chosen based on the review length histogram.

### Training Procedure

#### Preprocessing

- Text was lowercased automatically because `bert-base-uncased` is a lowercase model.
- Each example was tokenized with padding to `max_length=256` and truncated if longer.
- The dataset was split into train, validation, and test sets:
  - `train`: samples 0–20,000 of the official training split
  - `val`: samples 20,000–25,000 of the official training split
  - `test`: the official IMDb test split

#### Training Hyperparameters

- **Base model:** `bert-base-uncased`
- **Num labels:** 2 (binary classification)
- **Batch size:** 4 per device, with gradient accumulation over 16 steps (effective batch size = 64)
- **Learning rate:** 2e-5
- **Epochs:** 3
- **Optimizer:** AdamW (default in Transformers)
- **Mixed precision:** fp16 training enabled for faster training and reduced memory usage (`fp16=True` in `TrainingArguments`)
- **Scheduler:** Linear learning rate scheduler with warmup (default)
- **Seed:** 224
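The preprocessing and hyperparameters above correspond roughly to the following `Trainer` setup. This is a minimal sketch rather than the exact training script: variable names, the `compute_metrics` helper, and the use of `sklearn` for accuracy/F1 are illustrative.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

# Load IMDb and carve a validation set out of the official training split
raw = load_dataset("imdb")
train_ds = raw["train"].select(range(0, 20_000))
val_ds = raw["train"].select(range(20_000, 25_000))
test_ds = raw["test"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Pad or truncate every review to 256 tokens
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds, average="weighted")}

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-imdb-sentiment",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,  # effective batch size = 4 * 16 = 64
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,                       # mixed-precision training
    seed=224,
    # learning-rate schedule left at the Trainer default (linear)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate(test_ds)  # accuracy and weighted F1 on the held-out test split
```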
#### Speeds, Sizes, Times

- **Training time:** Varies by GPU; typically around 15–20 minutes on a T4 GPU.
- **Checkpoint size:** ~420 MB for `pytorch_model.bin` (BERT base plus the classification head).
- **Total parameters:** ~110 million.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- **Dataset:** IMDb test split (25,000 reviews), held out from training.
- **Preprocessing:** Same as training: lowercased, tokenized with `max_length=256`.

#### Factors

- The model was evaluated on the overall IMDb test set only; no subgroup or domain disaggregation was performed.
- The model is expected to generalize well to similar English movie-review sentiment, but may not be robust to domain shifts.

#### Metrics

- **Accuracy:** Fraction of correctly classified reviews.
- **F1 score:** Weighted-average F1 across classes, balancing precision and recall.

### Evaluation Results

| Metric   | Score  |
|----------|--------|
| Accuracy | 91.54% |
| F1 Score | 91.54% |

Evaluated on the IMDb test set.

## Summary

This is a fine-tuned BERT model (`bert-base-uncased`) for binary sentiment analysis on the IMDb movie reviews dataset. It classifies a given movie review as **positive** or **negative**, reaching **91.54% accuracy** and a **weighted F1 score of 91.54%** on the test set. The model was trained with the Hugging Face `transformers` library, using a maximum sequence length of 256 tokens to balance coverage and efficiency. It is intended for English movie reviews, but may generalize reasonably to similar sentiment analysis tasks on longer-form English text.
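For quick experimentation, the model can also be loaded through the `pipeline` API. This is a short sketch: the repository id is taken from the Model Sources section above, and the exact label strings returned depend on the `id2label` mapping stored with the checkpoint.

```python
from transformers import pipeline

# Text-classification pipeline backed by the fine-tuned checkpoint
classifier = pipeline("text-classification", model="koushik-25/bert-imdb-sentiment")

result = classifier("The movie was fantastic!")
print(result)  # [{'label': ..., 'score': ...}]; label names depend on the saved id2label mapping
```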