Commit ab599b4 · Parent: 6dd265b
Update with magic-pdf API implementation

Files changed:
- README.md (+41 −30)
- api.py (+186 −0)
- convert_pdf.py (+92 −0)
- deploy_to_hf.sh (+72 −0)
- download_models_hf.py (+74 −0)
- magic-pdf.json (+62 −0)
- start.sh (+12 −0)
- test_api.py (+75 −0)
README.md
CHANGED (+41 −30)

Deleted lines whose text did not survive extraction are shown as `-…`.

````diff
--- a/README.md
+++ b/README.md
@@ -1,17 +1,24 @@
 ---
-title: MinerU PDF
+title: MinerU PDF Processor
 emoji: 📄
 colorFrom: blue
 colorTo: indigo
 sdk: docker
-sdk_version: "latest"
-app_file: app.py
 pinned: false
+license: apache-2.0
+app_port: 7860
 ---
 
-# MinerU PDF
+# MinerU PDF API
 
-This Hugging Face Space provides a FastAPI-based service that uses `magic-pdf` t…
+A simple API for extracting text and tables from PDF documents using MinerU's magic-pdf library.
+
+## Features
+
+- Extract text from PDF documents
+- Identify and extract tables from PDFs
+- Works with both regular and scanned PDFs
+- Simple JSON response format
 
 ## API Endpoints
 
@@ -21,7 +28,7 @@
 GET /health
 ```
 
-Returns the service status and timestamp.
+Returns the current status of the service.
 
 ### Extract PDF Content
 
@@ -29,47 +36,51 @@
 POST /extract
 ```
 
-Upload a PDF file to extract its text…
+Upload a PDF file to extract its text and tables.
 
 #### Request
 
-…
-- Body: PDF file in the 'file' field
+- `file`: The PDF file to process (multipart/form-data)
 
 #### Response
 
 JSON object containing:
-…
-…
-- Tables in Markdown format
+- `filename`: Original filename
+- `pages`: Array of pages with text and tables
 
-## …
-
-…
-
-…
-…
-  -F "file=@your_document.pdf" \
-  --output result.json
-```
-
-…
-
-…
-…
-
-…
-files = {"file": open("your_document.pdf", "rb")}
-
-…
-data = response.json()
-
-…
-…
-…
+## Deployment
+
+This application is deployed as a Hugging Face Space using Docker.
+
+## Local Development
+
+To run this application locally:
+
+1. Install the requirements:
+```
+pip install -r requirements.txt
+```
+
+2. Run the application:
+```
+python app.py
+```
+
+3. Access the API at `http://localhost:7860`
+
+## Docker
+
+You can also build and run with Docker:
+
+```bash
+docker build -t mineru-pdf-api .
+docker run -p 7860:7860 mineru-pdf-api
 ```
 
+## About
+
+This API is built on top of MinerU and magic-pdf, a powerful PDF extraction tool.
+
 ## API Documentation
 
 Once deployed, you can access the auto-generated Swagger documentation at:
````
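The commit deletes the old README's usage snippets (a curl call with `-F "file=@your_document.pdf"` and a Python `requests` example), which survive here only as fragments. As a stdlib-only sketch of what such a client must send — the endpoint expects a single `file` field as multipart/form-data — the body can be built by hand (`build_multipart` is a hypothetical helper, not part of the Space):

```python
import uuid

def build_multipart(field: str, filename: str, payload: bytes) -> tuple:
    """Encode one file as a multipart/form-data body, as POST /extract expects.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"

body, content_type = build_multipart("file", "your_document.pdf", b"%PDF-1.4")
print(content_type)
```

The pair can then be sent with `urllib.request.Request("http://localhost:7860/extract", data=body, headers={"Content-Type": content_type}, method="POST")`, assuming a local deployment on port 7860.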
api.py
ADDED (+186 lines)

```python
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import JSONResponse, FileResponse
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
import tempfile
import os
import json
import traceback
from datetime import datetime
from typing import Dict, List, Any, Optional
import shutil
from convert_pdf import convert_pdf

# Create output directory if it doesn't exist
os.makedirs("output", exist_ok=True)
os.makedirs("output/images", exist_ok=True)

# Application metadata
app_description = """
# MinerU PDF Processor API

This API provides PDF processing capabilities using MinerU's magic-pdf library.
It extracts text content, tables, and generates markdown from PDF documents.

## Features:
- PDF text extraction
- Markdown conversion
- Layout analysis
"""

app = FastAPI(
    title="MinerU PDF API",
    description=app_description,
    version="1.0.0",
    contact={
        "name": "PDF Converter Service",
    },
)

# Add CORS middleware to allow cross-origin requests
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Allow all origins
    allow_credentials=True,
    allow_methods=["*"],  # Allow all methods
    allow_headers=["*"],  # Allow all headers
)

# Mount the output directory as static files
app.mount("/output", StaticFiles(directory="output"), name="output")

# Health check endpoint
@app.get("/health", tags=["Health"])
async def health_check() -> Dict[str, Any]:
    """
    Health check endpoint to verify the service is running.
    Returns the service status and current time.
    """
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "service": "mineru-pdf-processor"
    }

@app.post("/convert", tags=["PDF Processing"])
async def convert(file: UploadFile = File(...)) -> Dict[str, Any]:
    """
    Convert a PDF file to markdown using the magic-pdf library.

    Parameters:
        file: The PDF file to process

    Returns:
        A JSON object containing the conversion result and links to output files
    """
    if not file.filename or not file.filename.lower().endswith('.pdf'):
        raise HTTPException(status_code=400, detail="Invalid file. Please upload a PDF file.")

    content = await file.read()
    temp_pdf_path = None

    try:
        # Save the uploaded PDF to a temporary file
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp_pdf:
            temp_pdf.write(content)
            temp_pdf_path = temp_pdf.name

        # Clear previous output files
        for item in os.listdir("output/images"):
            os.remove(os.path.join("output/images", item))
        for item in os.listdir("output"):
            if os.path.isfile(os.path.join("output", item)):
                os.remove(os.path.join("output", item))

        # Process the PDF using convert_pdf function
        md_content = convert_pdf(temp_pdf_path)

        # Get the base name of the processed file
        filename_without_ext = os.path.splitext(os.path.basename(temp_pdf_path))[0]

        # Gather the output files
        output_files = {}

        # Markdown file
        md_path = os.path.join("output", f"{filename_without_ext}.md")
        if os.path.exists(md_path):
            output_files["markdown"] = f"/output/{filename_without_ext}.md"

        # Layout PDF
        layout_path = os.path.join("output", f"{filename_without_ext}_layout.pdf")
        if os.path.exists(layout_path):
            output_files["layout"] = f"/output/{filename_without_ext}_layout.pdf"

        # Spans PDF
        spans_path = os.path.join("output", f"{filename_without_ext}_spans.pdf")
        if os.path.exists(spans_path):
            output_files["spans"] = f"/output/{filename_without_ext}_spans.pdf"

        # Model PDF
        model_path = os.path.join("output", f"{filename_without_ext}_model.pdf")
        if os.path.exists(model_path):
            output_files["model"] = f"/output/{filename_without_ext}_model.pdf"

        # Content list JSON
        content_list_path = os.path.join("output", f"{filename_without_ext}_content_list.json")
        if os.path.exists(content_list_path):
            output_files["content_list"] = f"/output/{filename_without_ext}_content_list.json"

        # Middle JSON
        middle_json_path = os.path.join("output", f"{filename_without_ext}_middle.json")
        if os.path.exists(middle_json_path):
            output_files["middle_json"] = f"/output/{filename_without_ext}_middle.json"

        return {
            "filename": file.filename,
            "status": "success",
            "markdown_content": md_content,
            "output_files": output_files
        }

    except Exception as e:
        error_detail = str(e)
        error_trace = traceback.format_exc()

        # Log the error
        print(f"Error processing PDF: {error_detail}")
        print(error_trace)

        return JSONResponse(
            status_code=500,
            content={
                "error": "Error processing PDF",
                "detail": error_detail,
                "filename": file.filename if file and hasattr(file, 'filename') else None
            }
        )

    finally:
        # Clean up the temporary file
        if temp_pdf_path and os.path.exists(temp_pdf_path):
            try:
                os.unlink(temp_pdf_path)
            except Exception:
                pass

@app.get("/files/{filename}", tags=["Files"])
async def get_file(filename: str):
    """
    Get a file from the output directory.

    Parameters:
        filename: The name of the file to retrieve

    Returns:
        The requested file
    """
    file_path = os.path.join("output", filename)

    if not os.path.exists(file_path):
        raise HTTPException(status_code=404, detail=f"File {filename} not found")

    return FileResponse(path=file_path)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run("api:app", host="0.0.0.0", port=7860, reload=False)
```
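The six near-identical existence checks in `convert` above could be driven by a single suffix table. A minimal sketch of that refactor (an assumption on my part, not what the commit ships):

```python
import os
import tempfile

# Suffix -> response key, mirroring the artifacts api.py looks for after conversion.
ARTIFACTS = {
    ".md": "markdown",
    "_layout.pdf": "layout",
    "_spans.pdf": "spans",
    "_model.pdf": "model",
    "_content_list.json": "content_list",
    "_middle.json": "middle_json",
}

def gather_output_files(output_dir: str, stem: str) -> dict:
    """Return {key: "/output/<name>"} for each artifact that exists on disk."""
    found = {}
    for suffix, key in ARTIFACTS.items():
        name = f"{stem}{suffix}"
        if os.path.exists(os.path.join(output_dir, name)):
            found[key] = f"/output/{name}"
    return found

# Demo with a throwaway directory:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "doc.md"), "w").close()
    open(os.path.join(d, "doc_layout.pdf"), "w").close()
    print(gather_output_files(d, "doc"))
```

Adding a new artifact then means one new table entry rather than another copied if-block.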
convert_pdf.py
ADDED (+92 lines)

```python
import os
import sys
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

def convert_pdf(pdf_file_path):
    # Get filename and prepare output paths
    pdf_file_name = os.path.basename(pdf_file_path)
    name_without_suff = os.path.splitext(pdf_file_name)[0]

    # prepare env
    local_image_dir, local_md_dir = "output/images", "output"
    image_dir = str(os.path.basename(local_image_dir))

    os.makedirs(local_image_dir, exist_ok=True)
    os.makedirs(local_md_dir, exist_ok=True)

    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
        local_md_dir
    )

    # read bytes
    reader1 = FileBasedDataReader(os.path.dirname(pdf_file_path))
    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

    print(f"Processing PDF: {pdf_file_path}")

    # proc
    ## Create Dataset Instance
    ds = PymuDocDataset(pdf_bytes)

    ## inference
    if ds.classify() == SupportedPdfParseMethod.OCR:
        infer_result = ds.apply(doc_analyze, ocr=True)

        ## pipeline
        pipe_result = infer_result.pipe_ocr_mode(image_writer)

    else:
        infer_result = ds.apply(doc_analyze, ocr=False)

        ## pipeline
        pipe_result = infer_result.pipe_txt_mode(image_writer)

    ### draw model result on each page
    infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))

    ### get model inference result
    model_inference_result = infer_result.get_infer_res()

    ### draw layout result on each page
    pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))

    ### draw spans result on each page
    pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))

    ### get markdown content
    md_content = pipe_result.get_markdown(image_dir)

    ### dump markdown
    md_file_path = f"{name_without_suff}.md"
    pipe_result.dump_md(md_writer, md_file_path, image_dir)
    print(f"Markdown saved to: {os.path.join(local_md_dir, md_file_path)}")

    ### get content list content
    content_list_content = pipe_result.get_content_list(image_dir)

    ### dump content list
    pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)

    ### get middle json
    middle_json_content = pipe_result.get_middle_json()

    ### dump middle json
    pipe_result.dump_middle_json(md_writer, f'{name_without_suff}_middle.json')

    return md_content

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python convert_pdf.py <pdf_path>")
        sys.exit(1)

    pdf_path = sys.argv[1]

    if not os.path.exists(pdf_path):
        print(f"Error: PDF file not found at {pdf_path}")
        sys.exit(1)

    convert_pdf(pdf_path)
```
deploy_to_hf.sh
ADDED (+72 lines)

```bash
#!/bin/bash

# Script to deploy the PDF processor to Hugging Face Spaces

# Check if huggingface_hub is installed
pip show huggingface_hub > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "Installing huggingface_hub..."
    pip install huggingface_hub
fi

# Set up variables
if [ -z "$1" ]; then
    read -p "Enter your Hugging Face username: " HF_USERNAME
else
    HF_USERNAME=$1
fi

if [ -z "$2" ]; then
    read -p "Enter the name for your Space: " SPACE_NAME
else
    SPACE_NAME=$2
fi

SPACE_REPO="$HF_USERNAME/$SPACE_NAME"
SPACE_URL="https://huggingface.co/spaces/$SPACE_REPO"

# Check if the repo exists
echo "Checking if the Space already exists..."
huggingface-cli repo info spaces/$SPACE_REPO > /dev/null 2>&1
if [ $? -eq 0 ]; then
    echo "Space $SPACE_REPO already exists."
    read -p "Do you want to continue and update it? (y/n): " CONTINUE
    if [ "$CONTINUE" != "y" ] && [ "$CONTINUE" != "Y" ]; then
        echo "Deployment cancelled."
        exit 1
    fi
else
    echo "Creating new Space: $SPACE_REPO"
    huggingface-cli repo create spaces/$SPACE_NAME --type space --organization $HF_USERNAME
fi

# Create a temporary directory
TEMP_DIR=$(mktemp -d)
echo "Created temporary directory: $TEMP_DIR"

# Clone the repository
echo "Cloning repository..."
git clone https://huggingface.co/spaces/$SPACE_REPO $TEMP_DIR

# Copy files to the repository
echo "Copying files to the repository..."
cp -r api.py app.py convert_pdf.py download_models_hf.py requirements.txt Dockerfile README.md .gitattributes $TEMP_DIR/
mkdir -p $TEMP_DIR/output/images

# If magic-pdf.json exists, copy it
if [ -f "magic-pdf.json" ]; then
    cp magic-pdf.json $TEMP_DIR/
fi

# Configure Git LFS
cd $TEMP_DIR
git lfs install

# Add, commit, and push changes
echo "Committing changes..."
git add .
git commit -m "Update PDF processor application"
git push

echo "Deployment completed successfully!"
echo "Your Space is available at: $SPACE_URL"
```
download_models_hf.py
ADDED (+74 lines)

Comments translated from Chinese; indentation reconstructed so the `else` falls back to a fresh download only when no local config exists.

```python
import json
import os
import shutil

import requests
from huggingface_hub import snapshot_download


def download_json(url):
    # Download the JSON file
    response = requests.get(url)
    response.raise_for_status()  # check that the request succeeded
    return response.json()


def download_and_modify_json(url, local_filename, modifications):
    if os.path.exists(local_filename):
        data = json.load(open(local_filename))
        config_version = data.get('config_version', '0.0.0')
        if config_version < '1.2.0':
            data = download_json(url)
    else:
        data = download_json(url)

    # Apply the modifications
    for key, value in modifications.items():
        data[key] = value

    # Save the modified content
    with open(local_filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


if __name__ == '__main__':

    mineru_patterns = [
        # "models/Layout/LayoutLMv3/*",
        "models/Layout/YOLO/*",
        "models/MFD/YOLO/*",
        "models/MFR/unimernet_hf_small_2503/*",
        "models/OCR/paddleocr_torch/*",
        # "models/TabRec/TableMaster/*",
        # "models/TabRec/StructEqTable/*",
    ]
    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)

    layoutreader_pattern = [
        "*.json",
        "*.safetensors",
    ]
    layoutreader_model_dir = snapshot_download('hantian/layoutreader', allow_patterns=layoutreader_pattern)

    model_dir = model_dir + '/models'
    print(f'model_dir is: {model_dir}')
    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')

    # paddleocr_model_dir = model_dir + '/OCR/paddleocr'
    # user_paddleocr_dir = os.path.expanduser('~/.paddleocr')
    # if os.path.exists(user_paddleocr_dir):
    #     shutil.rmtree(user_paddleocr_dir)
    # shutil.copytree(paddleocr_model_dir, user_paddleocr_dir)

    json_url = 'https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json'
    config_file_name = 'magic-pdf.json'
    home_dir = os.path.expanduser('~')
    config_file = os.path.join(home_dir, config_file_name)

    json_mods = {
        'models-dir': model_dir,
        'layoutreader-model-dir': layoutreader_model_dir,
    }

    download_and_modify_json(json_url, config_file, json_mods)
    print(f'The configuration file has been configured successfully, the path is: {config_file}')
```
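Note that `download_and_modify_json` compares `config_version` as a raw string (`config_version < '1.2.0'`), which happens to work for these values but misorders versions such as `'0.10.0'`. A safer numeric comparison, as a sketch:

```python
def version_tuple(v: str) -> tuple:
    """Split '1.2.0' into (1, 2, 0) so each component compares numerically."""
    return tuple(int(part) for part in v.split("."))

# Lexicographic string comparison gets this pair backwards; tuples do not.
print("0.10.0" < "0.2.0")                                 # string compare: True (wrong)
print(version_tuple("0.10.0") < version_tuple("0.2.0"))   # numeric compare: False (right)
```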
magic-pdf.json
ADDED (+62 lines)

```json
{
    "bucket_info": {
        "bucket-name-1": [
            "ak",
            "sk",
            "endpoint"
        ],
        "bucket-name-2": [
            "ak",
            "sk",
            "endpoint"
        ]
    },
    "models-dir": "/Users/marcos/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/14efd64068741c8e1d79d635dd236a80a9db66ba/models",
    "layoutreader-model-dir": "/Users/marcos/.cache/huggingface/hub/models--hantian--layoutreader/snapshots/641226775a0878b1014a96ad01b9642915136853",
    "device-mode": "cuda",
    "layout-config": {
        "model": "doclayout_yolo"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": true
    },
    "table-config": {
        "model": "rapid_table",
        "sub_model": "slanet_plus",
        "enable": true,
        "max_time": 400
    },
    "latex-delimiter-config": {
        "display": {
            "left": "$$",
            "right": "$$"
        },
        "inline": {
            "left": "$",
            "right": "$"
        }
    },
    "llm-aided-config": {
        "formula_aided": {
            "api_key": "your_api_key",
            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            "model": "qwen2.5-7b-instruct",
            "enable": false
        },
        "text_aided": {
            "api_key": "your_api_key",
            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            "model": "qwen2.5-7b-instruct",
            "enable": false
        },
        "title_aided": {
            "api_key": "your_api_key",
            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            "model": "qwen2.5-32b-instruct",
            "enable": false
        }
    },
    "config_version": "1.2.1"
}
```
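When debugging a deployment it helps to check which of these knobs are actually switched on (after `download_models_hf.py` runs, the live copy sits at `~/magic-pdf.json`). A small stdlib sketch over a trimmed copy of the fields above:

```python
import json

# Trimmed copy of the config fields shown above, inlined for a self-contained demo.
snippet = """{
  "device-mode": "cuda",
  "formula-config": {"mfd_model": "yolo_v8_mfd", "enable": true},
  "table-config": {"model": "rapid_table", "enable": true, "max_time": 400},
  "config_version": "1.2.1"
}"""

cfg = json.loads(snippet)
# Collect the sub-configs whose "enable" flag is truthy.
enabled = [k for k, v in cfg.items() if isinstance(v, dict) and v.get("enable")]
print(cfg["device-mode"], enabled)  # → cuda ['formula-config', 'table-config']
```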
start.sh
ADDED (+12 lines)

```bash
#!/bin/bash

# Start script for Hugging Face Spaces deployment

# Activate the virtual environment
. /opt/mineru_venv/bin/activate

# Set environment variables
export HF_SPACE_ID="${SPACE_ID:-default}"

# Start the FastAPI server
python -m uvicorn api:app --host 0.0.0.0 --port 7860
```
test_api.py
ADDED (+75 lines)

```python
#!/usr/bin/env python3
"""
Test script for the PDF processor API
"""
import requests
import argparse
import os
import json
from pathlib import Path

def test_api(api_url, pdf_path):
    """
    Test the PDF processor API by sending a PDF file and checking the response
    """
    print(f"Testing API at {api_url} with PDF file: {pdf_path}")

    if not os.path.exists(pdf_path):
        print(f"Error: PDF file not found at {pdf_path}")
        return

    # Send the PDF file to the API
    with open(pdf_path, 'rb') as pdf_file:
        files = {'file': (os.path.basename(pdf_path), pdf_file, 'application/pdf')}

        try:
            print("Sending request to API...")
            response = requests.post(f"{api_url}/convert", files=files)

            if response.status_code == 200:
                print("Request successful!")
                result = response.json()

                # Print response summary
                print("\nResponse summary:")
                print(f"Filename: {result.get('filename', 'N/A')}")
                print(f"Status: {result.get('status', 'N/A')}")

                # Check output files
                output_files = result.get('output_files', {})
                print("\nOutput files:")
                for file_type, file_path in output_files.items():
                    print(f"- {file_type}: {file_path}")

                # Save the markdown content to a file
                md_content = result.get('markdown_content', '')
                output_dir = Path('test_output')
                output_dir.mkdir(exist_ok=True)

                output_file = output_dir / f"{Path(pdf_path).stem}_output.md"
                with open(output_file, 'w') as f:
                    f.write(md_content)

                print(f"\nMarkdown content saved to: {output_file}")

                # Save the full response as JSON
                response_file = output_dir / f"{Path(pdf_path).stem}_response.json"
                with open(response_file, 'w') as f:
                    json.dump(result, f, indent=2)

                print(f"Full response saved to: {response_file}")

            else:
                print(f"Request failed with status code: {response.status_code}")
                print(f"Response content: {response.text}")

        except Exception as e:
            print(f"Error during API test: {str(e)}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Test the PDF processor API")
    parser.add_argument("--api", default="http://localhost:7860", help="API URL (default: http://localhost:7860)")
    parser.add_argument("--pdf", required=True, help="Path to the PDF file to test")

    args = parser.parse_args()
    test_api(args.api, args.pdf)
```