---
title: Doc2Page - Document to Webpage Converter
emoji: πŸ„
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.47.2
app_file: app.py
pinned: false
license: apache-2.0
short_description: Convert docs to webpages using PaddleOCR and ERNIE
---

# 📄➡️🌐 Doc2Page - Document to Webpage Converter

Convert your PDF documents or images into beautiful, responsive HTML webpages!

## ✨ Features

- 📖 **Smart OCR**: Extract text from PDFs and images using PaddleOCR
- 🤖 **AI Enhancement**: Transform content into well-structured HTML using ERNIE
- 🎨 **Beautiful Output**: Generate responsive, styled webpages with modern CSS
- 🚀 **Easy Deployment**: Optional one-click deployment to GitHub Pages
- 📱 **Mobile Friendly**: Responsive design that works on all devices

## 🔧 How It Works

1. **Upload**: Drop your PDF or image file
2. **Extract**: PaddleOCR extracts text and structure
3. **Transform**: ERNIE converts the extracted content into styled HTML
4. **Deploy**: Optionally publish to GitHub Pages
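
The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `run_pipeline` and the three stage callables are hypothetical names, not the app's actual functions, and the real PaddleOCR, ERNIE, and GitHub Pages calls are left to whatever is passed in.

```python
from typing import Callable, Optional

# Hypothetical stage signatures; the real app's functions may differ.
Extractor = Callable[[str], str]   # file path -> extracted text
Enhancer = Callable[[str], str]    # extracted text -> HTML
Publisher = Callable[[str], str]   # HTML -> published URL


def run_pipeline(
    path: str,
    extract: Extractor,
    to_html: Enhancer,
    publish: Optional[Publisher] = None,
) -> dict:
    """Run extract -> transform, then publish only if a publisher is given."""
    text = extract(path)       # step 2: OCR
    html = to_html(text)       # step 3: HTML generation
    result = {"html": html, "url": None}
    if publish is not None:    # step 4 is optional
        result["url"] = publish(html)
    return result
```

Passing the stages in as callables keeps the optional deploy step explicit and makes it easy to swap in a stub for testing.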

## 📁 Supported Formats

- **PDFs**: `.pdf`
- **Images**: `.png`, `.jpg`, `.jpeg`, `.bmp`, `.tiff`
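
A minimal sketch of how an upload could be checked against the formats listed above. The `SUPPORTED_EXTENSIONS` set and `is_supported` helper are hypothetical names for illustration; the app may validate files differently.

```python
from pathlib import Path

# Extensions taken from the list above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".bmp", ".tiff"}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is supported (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("scan.PDF")` is true, while `is_supported("notes.docx")` is false.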

## 🚀 Quick Start

1. Upload a document using the file picker
2. Click "Convert to Webpage" 
3. Preview your generated webpage
4. Download the HTML file
5. Optionally deploy to GitHub Pages

## ⚙️ Configuration

**Setup using .env file:**

1. Copy the example environment file:
```bash
cp .env.example .env
```

2. Edit the `.env` file with your credentials:
```bash
# Required API Configuration for PP-StructureV3
API_URL=your_pp_structurev3_api_url
API_TOKEN=your_api_token

# Optional ERNIE API Configuration for enhanced HTML generation
ERNIE_CLIENT_ID=your_client_id_here
ERNIE_CLIENT_SECRET=your_client_secret_here
```

**Note:** The `.env` file is loaded automatically when the application starts. If the ERNIE credentials are not set, the app falls back to its built-in HTML generator.
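
As a rough sketch of the required-versus-optional split described above, the settings could be read like this once the `.env` values are in the process environment. The `Settings` class and `load_settings` function are hypothetical, and treating a missing `API_URL` or `API_TOKEN` as an error is an assumption based on the "Required" comment in the example file.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_url: str
    api_token: str
    ernie_client_id: Optional[str]
    ernie_client_secret: Optional[str]

    @property
    def ernie_enabled(self) -> bool:
        # ERNIE is used only when both credentials are present.
        return bool(self.ernie_client_id and self.ernie_client_secret)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (defaults to the process environment)."""
    missing = [key for key in ("API_URL", "API_TOKEN") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return Settings(
        api_url=env["API_URL"],
        api_token=env["API_TOKEN"],
        ernie_client_id=env.get("ERNIE_CLIENT_ID") or None,
        ernie_client_secret=env.get("ERNIE_CLIENT_SECRET") or None,
    )
```

A caller could then branch on `settings.ernie_enabled` to choose between the ERNIE path and the fallback generator.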

## 🏗️ Technical Stack

- **Frontend**: Gradio for the web interface
- **OCR Engine**: PP-StructureV3 API (PaddlePaddle)
- **AI Processing**: ERNIE 4.5-X1.1-Preview (optional)
- **Image Processing**: Pillow

## 📝 Example Use Cases

- Convert research papers to web format
- Digitize scanned documents
- Create web-friendly versions of presentations
- Transform printed materials to responsive websites
- Archive documents in searchable HTML format

## 📄 License

This project is licensed under the Apache 2.0 License.