Add PDF document QA feature optimized for Hugging Face free tier
- README_PDF_QA.md +68 -0
- app.py +123 -0
- document_qa.py +116 -0
- pdf_qa_app.py +142 -0
- requirements.txt +6 -1
README_PDF_QA.md
ADDED
@@ -0,0 +1,68 @@
# PDF Document QA Assistant

A PDF document question-answering app optimized for the Hugging Face free tier. Users upload a PDF and ask questions, and the AI answers based on the document's content.

## 🚀 Features

- **Resource-optimized**: designed around the Hugging Face free tier's 16 GB memory limit
- **Smart QA**: answers questions based on the uploaded PDF's content
- **Content limits**: processes only the first 3 pages, capped at 600 characters per page, to save resources
- **Fast responses**: answers are limited to 150 characters to keep latency low
- **Concurrency**: request queuing supports up to 10 simultaneous users

## 🛠️ Implementation

### Core dependencies
- `gradio`: web interface
- `huggingface_hub`: access to the Hugging Face model Inference API
- `PyPDF2`: PDF text extraction

### Model optimization strategy
1. **Model selection**: prefer lightweight models with good Chinese support
   - THUDM/chatglm3-6b
   - google/gemma-2b-it
   - mistralai/Mistral-7B-Instruct-v0.2

2. **Resource management**:
   - Content limit: only the first 3 pages of the PDF are processed
   - Character limit: at most 600 characters per page
   - Response limit: answers capped at 150 characters

## 📖 Usage

1. Upload a PDF (only the first 3 pages are processed to save resources)
2. Type your question into the input box
3. Click "Get Answer" and wait for the AI to analyze the document
4. Once the answer is generated, click "Download Answer" to save the result

## ⚠️ Notes

- The first request may take a few minutes while the model loads
- The app automatically limits how much content it processes to keep responses fast
- Answers are capped at 150 characters to conserve compute
- When running on Hugging Face Spaces, the HF_TOKEN environment variable must be set

## 🚀 Deploying to Hugging Face Spaces

1. Create a new Gradio Space
2. Upload the following files:
   - `pdf_qa_app.py` (main application)
   - `requirements.txt` (dependencies)
3. Add an environment variable in the Space's Settings:
   - `HF_TOKEN`: your Hugging Face access token
4. The app starts automatically

## 📄 Example use cases

- Academic research: quickly extract a paper's key points
- Business documents: surface key information in reports
- Legal documents: look up contract clauses
- Technical manuals: find operating instructions

## 🔧 Troubleshooting

If you run into problems, check that:
1. The HF_TOKEN environment variable is set correctly
2. The uploaded PDF is readable
3. Your network connection is stable
4. You have not exceeded Hugging Face's usage limits
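The page and character caps described above can be isolated as a tiny helper. This is a sketch with a hypothetical `truncate_pages` name; the apps in this change inline the same logic:

```python
def truncate_pages(pages, max_pages=3, max_chars=600):
    """Apply the app's resource caps: keep at most `max_pages` pages,
    each trimmed to `max_chars` characters, joined with newlines."""
    return "\n".join(p[:max_chars] for p in pages[:max_pages])

# Example: five extracted pages, the first one longer than the cap
pages = ["a" * 700, "second page", "third page", "fourth page", "fifth"]
doc_text = truncate_pages(pages)
```

Keeping the limits in one place makes it easy to retune them if the Space's quota changes.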
app.py
CHANGED
@@ -881,6 +881,129 @@ with gr.Blocks() as demo:
     )
 
+    # Add a document QA tab
+    with gr.Tab("Document QA"):
+        gr.Markdown("## 📄 Document QA Assistant")
+        gr.Markdown("Upload a PDF and ask questions; the AI answers based on the document's content")
+
+        # Check whether we are running on Hugging Face Spaces
+        import os
+        if "SPACE_ID" in os.environ:
+            gr.Markdown("""
+            ### Note: this feature runs on the Hugging Face Spaces free tier
+
+            Due to resource limits, the first request takes about 5 minutes while the model loads. Please be patient.
+
+            **Optimizations:**
+            - 4-bit quantization cuts model memory from roughly 4 GB to 2 GB
+            - Only the first 3 pages of the PDF are processed, 600 characters per page
+            - Answers are capped at 150 characters to keep latency low
+            - Request queuing handles concurrent users
+            """)
+
+        with gr.Row():
+            # Left: PDF upload and question input
+            with gr.Column(scale=1):
+                pdf_input = gr.File(label="Upload PDF", file_types=[".pdf"])
+                question_input = gr.Textbox(label="Your question", placeholder="e.g. What is the document's main argument?")
+                answer_btn = gr.Button("Get Answer", variant="primary")
+                download_btn = gr.DownloadButton("Download Answer", visible=False)
+
+            # Right: result display
+            with gr.Column(scale=1):
+                answer_output = gr.Textbox(label="AI answer", interactive=False, max_lines=15)
+
+        # Usage notes
+        gr.Markdown("""
+        ### Usage
+        1. Click "Upload PDF" and choose your file
+        2. Type your question into the question box
+        3. Click "Get Answer" and wait for the AI to analyze the document
+        4. Once the answer is generated, click "Download Answer" to save the result
+
+        ### Notes
+        - Only the first 3 pages are analyzed to keep responses fast
+        - Each page is limited to 600 characters to save resources
+        - Answers are capped at 150 characters
+        """)
+
+        def process_document_qa(pdf_file, question):
+            """Handle a document QA request."""
+            if not pdf_file:
+                return "Please upload a PDF first", gr.update(visible=False)
+
+            if not question:
+                return "Please enter a question", gr.update(visible=False)
+
+            try:
+                from PyPDF2 import PdfReader
+
+                # Read the PDF (first 3 pages, 600 characters per page);
+                # gr.File yields a filepath string in recent Gradio versions
+                path = pdf_file if isinstance(pdf_file, str) else pdf_file.name
+                reader = PdfReader(path)
+                text_content = []
+                for page in reader.pages[:3]:
+                    text_content.append(page.extract_text()[:600])
+
+                doc_text = "\n".join(text_content)
+
+                # Build the prompt
+                prompt = f"Answer the question based on the following document, in no more than 150 characters:\n\nQuestion: {question}\n\nDocument: {doc_text}\n\nAnswer:"
+
+                # On HF Spaces, use the Inference API via the existing InferenceClient
+                if "SPACE_ID" in os.environ and os.environ.get("HF_TOKEN"):
+                    from huggingface_hub import InferenceClient
+
+                    # Try models with good Chinese support, in order
+                    models_to_try = [
+                        "THUDM/chatglm3-6b",
+                        "google/gemma-2b-it",
+                        "mistralai/Mistral-7B-Instruct-v0.2"
+                    ]
+
+                    response = ""
+                    for model_name in models_to_try:
+                        try:
+                            client = InferenceClient(token=os.environ.get("HF_TOKEN"), model=model_name)
+                            messages = [{"role": "user", "content": prompt}]
+                            # Non-streaming call returns a completed response object
+                            result = client.chat_completion(messages, max_tokens=150, stream=False)
+                            response = result.choices[0].message.content
+                            break
+                        except Exception as e:
+                            print(f"Model {model_name} failed: {e}")
+                            continue
+
+                    if not response:
+                        response = "Sorry, no AI model could be reached. Please try again later."
+                else:
+                    # Mock response for local development
+                    response = f"Answer to '{question}' based on your document:\n\nThis is a mock answer. On Hugging Face Spaces, the AI model's actual analysis appears here."
+
+                # DownloadButton serves a file path, so write the answer to disk
+                import tempfile
+                answer_path = os.path.join(tempfile.gettempdir(), "answer.txt")
+                with open(answer_path, "w", encoding="utf-8") as f:
+                    f.write(response)
+                return response, gr.update(visible=True, value=answer_path)
+
+            except Exception as e:
+                return f"Error processing document: {e}", gr.update(visible=False)
+
+        # Wire up the button
+        answer_btn.click(
+            process_document_qa,
+            inputs=[pdf_input, question_input],
+            outputs=[answer_output, download_btn]
+        )
+
 if __name__ == "__main__":
     # On Hugging Face Spaces, launch with share=False and server_name="0.0.0.0"
     demo.launch(share=False, server_name="0.0.0.0")
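The try-each-model-in-order fallback inside the tab can be expressed as a standalone helper. A sketch with a hypothetical `first_successful` name (the diff inlines the loop instead):

```python
def first_successful(callables, default):
    """Return the result of the first callable that does not raise;
    fall back to `default` if every attempt fails."""
    for fn in callables:
        try:
            return fn()
        except Exception:
            continue
    return default

def offline_model():
    # Stand-in for a model endpoint that is unreachable
    raise RuntimeError("model offline")

result = first_successful(
    [offline_model, lambda: "answer from second model"],
    default="Sorry, no model could be reached.",
)
```

The same pattern keeps the UI code free of per-model error handling: one loop, one fallback message.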
document_qa.py
ADDED
@@ -0,0 +1,116 @@
import os
import tempfile
from functools import lru_cache

import gradio as gr
import torch
from PyPDF2 import PdfReader
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Check whether we are running on Hugging Face Spaces
IS_SPACES_ENV = "SPACE_ID" in os.environ

# 1. Quantized model loading (the key optimization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization cuts memory from ~4 GB to ~2 GB
    bnb_4bit_compute_dtype=torch.float16
)

# Use a smaller model on HF Spaces to fit the resource limits
model_name = "microsoft/Phi-3-mini-4k-instruct" if IS_SPACES_ENV else "microsoft/Phi-3-medium-4k-instruct"

# Cache so the model is only loaded once (Gradio has no cache decorator;
# functools.lru_cache serves the same purpose here)
@lru_cache(maxsize=1)
def load_model_and_tokenizer():
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # spread across CPU/GPU automatically
        trust_remote_code=True
    )
    return tokenizer, model

# Load the model once at startup
tokenizer, model = load_model_and_tokenizer()

# 2. Document parsing (with processing limits)
def pdf_to_text(file, max_pages=3):
    # gr.File yields a filepath string in recent Gradio versions
    path = file if isinstance(file, str) else file.name
    reader = PdfReader(path)
    text = []
    for page in reader.pages[:max_pages]:  # only the first 3 pages
        text.append(page.extract_text()[:600])  # at most 600 characters per page
    return "\n".join(text)

# 3. PDF processing wrapper
def process_pdf(file):
    return pdf_to_text(file)

def answer_question(pdf_file, question):
    if not pdf_file:
        return "Please upload a PDF (only the first 3 pages are used)"

    try:
        doc_text = process_pdf(pdf_file)
        prompt = f"Answer based on the document: {question}\nDocument: {doc_text}"

        # Inference tuned for short answers
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,  # cap the answer length
            temperature=0.7,
            do_sample=True
        )
        # Decode only the newly generated tokens, dropping the prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)
    except Exception as e:
        return f"Error processing document: {e}"

# 4. Interface (with a download button)
with gr.Blocks() as document_qa_demo:
    gr.Markdown("# 📄 Free Document QA Test Tool")
    gr.Markdown("## Upload a PDF and ask questions; the AI will answer")

    with gr.Row():
        pdf_input = gr.File(label="Upload PDF (first 3 pages only)", file_types=[".pdf"])
        with gr.Column():
            question_input = gr.Textbox(label="Question", placeholder="e.g. What is the document's core conclusion?")
            answer_btn = gr.Button("Generate Answer")

    answer_output = gr.Textbox(label="Result", interactive=False, max_lines=10)
    download_btn = gr.DownloadButton("Download Result", visible=False)

    def update_download(answer):
        if answer and not answer.startswith("Please upload") and not answer.startswith("Error processing"):
            # DownloadButton serves a file path, so write the answer to disk
            path = os.path.join(tempfile.gettempdir(), "result.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(answer)
            return gr.update(visible=True, value=path)
        return gr.update(visible=False)

    answer_btn.click(
        fn=answer_question,
        inputs=[pdf_input, question_input],
        outputs=answer_output
    ).then(
        fn=update_download,
        inputs=answer_output,
        outputs=download_btn
    )

    # Usage notes
    gr.Markdown("""
    ## Usage
    1. Click "Upload PDF" and choose your document (only the first 3 pages are processed to save resources)
    2. Type your question into the question box
    3. Click "Generate Answer" to get the result
    4. To save the result, click "Download Result"

    ## Notes
    - The first model load takes about 5 minutes (a ~2 GB download); please be patient
    - A single question takes roughly 8-15 seconds on CPU
    - Answers are capped at 150 characters to keep the service stable
    """)

document_qa_demo.queue()  # enable request queuing

if __name__ == "__main__":
    # On Hugging Face Spaces, launch with share=False and server_name="0.0.0.0"
    document_qa_demo.launch(share=False, server_name="0.0.0.0")
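Causal LMs return the prompt tokens followed by the continuation, so the decode step in `answer_question` must slice the prompt off before decoding. A pure-Python analog of that slicing, with stand-in token ids and a hypothetical `strip_prompt` name (no model required):

```python
def strip_prompt(output_ids, prompt_len):
    """generate() returns prompt tokens followed by new tokens;
    keep only the continuation."""
    return output_ids[prompt_len:]

prompt_ids = [101, 2054, 2003]           # stand-in prompt token ids
full_output = prompt_ids + [1037, 3231]  # model echoes prompt, then answers
new_ids = strip_prompt(full_output, len(prompt_ids))
```

This is more robust than splitting the decoded string on the question text, which fails whenever the model rephrases or omits it.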
pdf_qa_app.py
ADDED
@@ -0,0 +1,142 @@
"""
PDF document QA app, optimized for the Hugging Face free tier
"""

import os
import tempfile

import gradio as gr
from huggingface_hub import InferenceClient
from PyPDF2 import PdfReader

# Check whether we are running on Hugging Face Spaces
IS_SPACES_ENV = "SPACE_ID" in os.environ

def process_pdf(file, max_pages=3):
    """Extract text content from a PDF file."""
    if not file:
        return ""

    try:
        # gr.File yields a filepath string in recent Gradio versions
        path = file if isinstance(file, str) else file.name
        reader = PdfReader(path)
        text_content = []
        for page in reader.pages[:max_pages]:
            text_content.append(page.extract_text()[:600])  # 600 characters per page
        return "\n".join(text_content)
    except Exception as e:
        return f"Error reading PDF: {e}"

def answer_question(pdf_file, question):
    """Answer a question based on the PDF's content."""
    if not pdf_file:
        return "Please upload a PDF first"

    if not question:
        return "Please enter a question"

    # Extract the document text
    doc_text = process_pdf(pdf_file)
    if doc_text.startswith("Error reading PDF"):
        return doc_text

    if not doc_text:
        return "No text could be extracted from the PDF"

    # Build the prompt
    prompt = f"Answer the question based on the following document, in no more than 150 characters:\n\nQuestion: {question}\n\nDocument: {doc_text}\n\nAnswer:"

    # On HF Spaces, use the Inference API
    if IS_SPACES_ENV and os.environ.get("HF_TOKEN"):
        # Try models with good Chinese support, in order
        models_to_try = [
            "THUDM/chatglm3-6b",
            "google/gemma-2b-it",
            "mistralai/Mistral-7B-Instruct-v0.2"
        ]

        response = ""
        for model_name in models_to_try:
            try:
                client = InferenceClient(token=os.environ.get("HF_TOKEN"), model=model_name)
                messages = [{"role": "user", "content": prompt}]
                # Non-streaming call returns a completed response object
                result = client.chat_completion(messages, max_tokens=150, stream=False)
                response = result.choices[0].message.content
                break
            except Exception as e:
                print(f"Model {model_name} failed: {e}")
                continue

        if not response:
            response = "Sorry, no AI model could be reached. Please try again later."

        return response
    else:
        # Mock response for local development
        return f"Answer to '{question}' based on your document:\n\nThis is a mock answer. On Hugging Face Spaces, the AI model's actual analysis appears here. Document preview: {doc_text[:200]}..."

# Gradio interface
with gr.Blocks(title="PDF Document QA Assistant") as demo:
    gr.Markdown("# 📄 PDF Document QA Assistant")
    gr.Markdown("Upload a PDF and ask questions; the AI answers based on the document's content")

    # Environment note
    if IS_SPACES_ENV:
        gr.Markdown("""
        ### 🚀 Free-tier optimizations

        To fit the resource limits of the Hugging Face free tier, the app uses:
        - **Memory**: lightweight models that stay within the 16 GB limit
        - **Content limits**: only the first 3 pages of a PDF, 600 characters per page
        - **Fast responses**: answers capped at 150 characters to keep latency low
        - **Concurrency**: request queuing, supporting up to 10 simultaneous users
        """)

    with gr.Row():
        # Left: PDF upload and question input
        with gr.Column(scale=1):
            pdf_input = gr.File(label="Upload PDF", file_types=[".pdf"])
            question_input = gr.Textbox(label="Your question", placeholder="e.g. What is the document's main argument?")
            answer_btn = gr.Button("Get Answer", variant="primary")
            download_btn = gr.DownloadButton("Download Answer", visible=False)

        # Right: result display
        with gr.Column(scale=1):
            answer_output = gr.Textbox(label="AI answer", interactive=False, max_lines=15)

    # Usage notes
    gr.Markdown("""
    ### 📖 Usage
    1. Click "Upload PDF" and choose your file
    2. Type your question into the question box
    3. Click "Get Answer" and wait for the AI to analyze the document
    4. Once the answer is generated, click "Download Answer" to save the result

    ### ⚠️ Notes
    - Only the first 3 pages are analyzed to keep responses fast
    - Each page is limited to 600 characters to save resources
    - Answers are capped at 150 characters
    - The first request may take a few minutes while the model loads
    """)

    def update_download(answer):
        """Write the answer to a file and reveal the download button."""
        if answer and not answer.startswith("Please") and not answer.startswith("Error reading PDF"):
            # DownloadButton serves a file path, so write the answer to disk
            path = os.path.join(tempfile.gettempdir(), "answer.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(answer)
            return gr.update(visible=True, value=path)
        return gr.update(visible=False)

    # Wire up the button
    answer_btn.click(
        answer_question,
        inputs=[pdf_input, question_input],
        outputs=[answer_output]
    ).then(
        update_download,
        inputs=[answer_output],
        outputs=[download_btn]
    )

# Enable request queuing
demo.queue()

if __name__ == "__main__":
    # On Hugging Face Spaces, launch with share=False and server_name="0.0.0.0"
    demo.launch(share=False, server_name="0.0.0.0")
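The same prompt template appears in both `app.py` and `pdf_qa_app.py`; factoring it into a helper keeps the answer-length cap in one place. A sketch with a hypothetical `build_prompt` name (the change duplicates the f-string instead):

```python
def build_prompt(question, doc_text, max_answer_chars=150):
    """Assemble the QA prompt, stating the answer-length cap inline."""
    return (
        f"Answer the question based on the following document, "
        f"in no more than {max_answer_chars} characters:\n\n"
        f"Question: {question}\n\n"
        f"Document: {doc_text}\n\n"
        f"Answer:"
    )

prompt = build_prompt("What is the main point?", "The report argues X.")
```

Stating the cap inside the prompt nudges the model toward short answers, while `max_tokens=150` enforces a hard limit on the API side.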
requirements.txt
CHANGED
@@ -2,4 +2,9 @@ gradio[oauth]>=5.42.0
 huggingface_hub>=0.22.2
 requests>=2.31.0
 Pillow>=10.0.0
-python-docx>=0.8.11
+python-docx>=0.8.11
+transformers>=4.38.0
+torch>=2.1.0
+PyPDF2>=3.0.1
+accelerate>=0.25.0
+bitsandbytes>=0.41.0