Cross-Page Table Extraction and Merging from Images with PaddleOCR (Scanned PDFs)

🕗 Published 2024-07-13 14:44 pdf python

Preface

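Tables in PDF files often span more than one page, and some of those PDFs are scanned documents with no text layer. Merging such a split table back into one complete table by hand is tedious and error-prone. This article shows how to use PaddleOCR and Python to detect cross-page tables and merge them automatically.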

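1. Environment Setup

First, install the following libraries:

pandas: data processing
paddleocr: OCR and table structure recognition
pdf2image: converts PDF pages to images
beautifulsoup4: HTML parsing
numpy: array handling

The install command is shown below. Note that pdf2image also needs Poppler installed on the system, and paddleocr needs the PaddlePaddle framework (the paddlepaddle package) installed separately.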

pip install pandas paddleocr pdf2image beautifulsoup4 numpy

Import the required libraries and configure warnings and logging so that the output is not cluttered with unnecessary messages:

import pandas as pd
from paddleocr import PPStructure, save_structure_res
from pdf2image import convert_from_path
from bs4 import BeautifulSoup
import warnings
import numpy as np
import logging
import os

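warnings.filterwarnings("ignore")
# logging.disable(level) suppresses that level and everything below it,
# so a single call with WARNING also covers DEBUG and INFO.
logging.disable(logging.WARNING)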

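2. File Paths and Thresholds

Define the PDF folder, the output folder, and the thresholds used to decide whether a table continues onto the next page: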

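path = 'E:/Jobcontent/data/测试/'   # folder containing the input PDFs
output_folder = "E:/table/ex/"      # folder for per-page results and merged CSVs
topthreshold = 0.2                  # the next page's table must start within the top 20% of the page
dthreshold = 0.8                    # the current page's table must end below 80% of the page height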

table_engine = PPStructure(show_log=True)
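
To make the two thresholds concrete, the sketch below converts them into pixel rows for a given page height. The helper name threshold_lines and the 2000-pixel page height are illustrative assumptions and are not part of the original code.

```python
def threshold_lines(page_height, topthreshold=0.2, dthreshold=0.8):
    """Return the pixel rows that bound the 'top' and 'bottom' zones of a page."""
    # A table whose top edge is above this row counts as "at the top".
    top_limit = topthreshold * page_height
    # A table whose bottom edge is below this row counts as "at the bottom".
    bottom_limit = dthreshold * page_height
    return top_limit, bottom_limit


# Hypothetical page that is 2000 pixels tall.
top_limit, bottom_limit = threshold_lines(2000)
print(f"top zone ends at y={top_limit:.0f}, bottom zone starts at y={bottom_limit:.0f}")
```

Because the thresholds are fractions of the page height, the same values keep working if the rendering dpi changes.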

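3. Helper Functions

These helper functions extract table information from a page and decide whether a table continues across pages.

top_bottom_table_info: returns the column count and bounding box of the topmost table on a page.
find_bottom_table_info: returns the column count and bounding box of the bottommost table on a page.
is_continuation: decides whether a table continues from one page onto the next.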

def top_bottom_table_info(table_result):
    # Find the table whose top edge (bbox[1]) is closest to the top of the page.
    top_table = None
    min_y = 0
    for table in table_result:
        if table['type'] == 'table':
            bbox = table['bbox']
            current_top_y = bbox[1]
            if top_table is None or current_top_y < min_y:
                top_table = table
                min_y = current_top_y

    if top_table is not None:
        soup = BeautifulSoup(top_table['res']['html'], 'html.parser')
        # A continued table resumes at the FIRST row of the topmost table.
        first_row = soup.find_all('tr')[0]
        columns = first_row.find_all('td')
        top_row_columns = len(columns)
        top_xy = top_table['bbox']
        return top_row_columns, top_xy
    else:
        return None

def find_bottom_table_info(table_result):
    # Find the table whose bottom edge (bbox[3]) is closest to the bottom of the page.
    bottom_table = None
    bottom_y = 0
    for table in table_result:
        if table['type'] == 'table':
            bbox = table['bbox']
            current_bottom_y = bbox[3]
            if bottom_table is None or current_bottom_y > bottom_y:
                bottom_table = table
                bottom_y = current_bottom_y

    if bottom_table is not None:
        soup = BeautifulSoup(bottom_table['res']['html'], 'html.parser')
        # A table that continues on the next page breaks off at its LAST row.
        last_row = soup.find_all('tr')[-1]
        columns = last_row.find_all('td')
        last_row_columns = len(columns)
        bottom_xy = bottom_table['bbox']
        return last_row_columns, bottom_xy
    else:
        return None

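def is_continuation(top_row_columns, last_row_columns, bottom_xy, top_xy, page_height, dthreshold=0.8, topthreshold=0.2):
    # The two parts must have the same number of columns.
    if top_row_columns != last_row_columns:
        return False
    # The first part must end near the bottom of its page (bbox[3] is the bottom edge).
    is_last_table_at_bottom = bottom_xy[3] > dthreshold * page_height
    # The second part must start near the top of the next page (bbox[1] is the top edge).
    is_first_table_at_top = top_xy[1] < topthreshold * page_height
    return is_last_table_at_bottom and is_first_table_at_top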
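
As a quick check of the rule, the sketch below repeats is_continuation so it runs on its own and applies it to made-up bounding boxes. The page height and all coordinates are hypothetical values chosen for illustration.

```python
def is_continuation(top_row_columns, last_row_columns, bottom_xy, top_xy,
                    page_height, dthreshold=0.8, topthreshold=0.2):
    # Same rule as the helper above, repeated so this snippet is self-contained.
    if top_row_columns != last_row_columns:
        return False
    ends_near_bottom = bottom_xy[3] > dthreshold * page_height
    starts_near_top = top_xy[1] < topthreshold * page_height
    return ends_near_bottom and starts_near_top


page_height = 2000  # hypothetical page height in pixels

# Bounding boxes are [x1, y1, x2, y2]; every value here is made up.
bottom_xy = [100, 1200, 1500, 1900]  # last table on page N ends at y=1900
top_xy = [100, 150, 1500, 900]       # first table on page N+1 starts at y=150

same_table = is_continuation(5, 5, bottom_xy, top_xy, page_height)
different_columns = is_continuation(4, 5, bottom_xy, top_xy, page_height)
mid_page_start = is_continuation(5, 5, bottom_xy, [100, 800, 1500, 1600], page_height)
print(same_table, different_columns, mid_page_start)
```

Only the first case satisfies all three conditions: equal column counts, a table ending below 80% of the page, and a table starting within the top 20% of the next page.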

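4. Processing the PDF Files

Read each PDF and extract the table information page by page. For tables that cross a page boundary, collect their column counts and coordinates, then merge the two parts.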
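
Counting td elements gives the right column count only when a row has no merged cells. If the recognized HTML contains colspan attributes, a span-aware count may be more robust. The following is a minimal sketch that swaps in the standard-library HTMLParser in place of BeautifulSoup so that it runs without extra packages; RowWidthParser, row_widths, and the sample HTML are all hypothetical names and data.

```python
from html.parser import HTMLParser


class RowWidthParser(HTMLParser):
    """Collect the logical width (sum of colspans) of every <tr> in the HTML."""

    def __init__(self):
        super().__init__()
        self.row_widths = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Start counting cells for a new row.
            self.row_widths.append(0)
        elif tag in ("td", "th") and self.row_widths:
            # A missing or non-numeric colspan counts as one column.
            span = dict(attrs).get("colspan") or "1"
            self.row_widths[-1] += int(span) if span.isdigit() else 1


def row_widths(html):
    parser = RowWidthParser()
    parser.feed(html)
    return parser.row_widths


sample_html = (
    "<html><body><table>"
    "<tr><td colspan='2'>Name</td><td>Total</td></tr>"
    "<tr><td>A</td><td>B</td><td>3</td></tr>"
    "</table></body></html>"
)
widths = row_widths(sample_html)
print("first row:", widths[0], "last row:", widths[-1])
```

The first sample row has two td elements but spans three columns, so a plain element count would report a mismatch with the second row even though both rows belong to the same three-column table.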

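Get the list of PDF files: collect every file in the given folder whose name ends with ".pdf".
Process the files one by one: for each PDF, initialize a list to hold the cross-page table records.
Convert PDF pages to images: render every page of the PDF as an image and process the pages in order.
Extract table information: call table_engine on each page image and save the structured results.
Detect cross-page tables: compare the column count of the last row on the current page with that of the first row on the next page, and check that the two tables sit near the bottom and top of their pages; if all conditions hold, record the pair.
Merge cross-page tables: for each recorded pair, read the two saved parts, concatenate them, and save the result as a CSV file.
Finish: after all PDF files have been processed, print a completion message.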
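
The merge step in the list above is a plain row-wise concatenation. One case it may need to handle is a header row that is printed again at the top of the continuation page. The sketch below shows one way to drop such a repeat, using plain lists and the csv module in place of pandas so that it runs anywhere; merge_parts and the sample rows are hypothetical, and whether a header repeats depends on the document.

```python
import csv
import io


def merge_parts(first_part, second_part, drop_repeated_header=True):
    """Concatenate two lists of rows, optionally skipping a repeated header row."""
    rows = list(first_part)
    start = 0
    # Treat the second part's first row as a repeated header only if it is
    # identical to the first row of the first part.
    if (drop_repeated_header and first_part and second_part
            and second_part[0] == first_part[0]):
        start = 1
    rows.extend(second_part[start:])
    return rows


page_1 = [["Item", "Qty"], ["Bolt", "10"], ["Nut", "12"]]
page_2 = [["Item", "Qty"], ["Washer", "7"]]

merged = merge_parts(page_1, page_2)

# Write the merged rows as CSV text to an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(merged)
print(buffer.getvalue())
```

With OCR output the repeated header may differ by a character or two, so an exact comparison like this one is only a starting point.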

pdf_files = [f for f in os.listdir(path) if f.endswith('.pdf')]
print(pdf_files)

for pdf_file in pdf_files:
    cross_page_tables = []
    pdf_path = os.path.join(path, pdf_file)
    images = convert_from_path(pdf_path, dpi=200)
    print('Extracting tables, please wait...')
    # Analyze each page once and reuse the result for both the
    # "current page" and the "next page" checks.
    page_results = [table_engine(np.array(image)) for image in images]
    for page_number, image in enumerate(images):
        table_result = page_results[page_number]
        save_structure_res(table_result, output_folder, f'{page_number+1}')

        _, page_height = image.size
        bottom_info = find_bottom_table_info(table_result)
        if bottom_info is None or page_number+1 >= len(images):
            continue  # no table on this page, or this is the last page
        top_info = top_bottom_table_info(page_results[page_number+1])
        if top_info is None:
            continue  # no table on the next page
        last_row_columns, bottom_xy = bottom_info
        top_row_columns, top_xy = top_info

        if is_continuation(top_row_columns, last_row_columns, bottom_xy, top_xy, page_height,
                           dthreshold=dthreshold, topthreshold=topthreshold):
            cross_page_tables.append((bottom_xy, top_xy, page_number+1, page_number+2))
            print(f"{pdf_file}: a table spans pages {page_number+1} and {page_number+2}; the last row and the next page's first row have the same number of columns")

    if cross_page_tables:
        for (bottom_xy, top_xy, start_page, end_page) in cross_page_tables:
            # save_structure_res writes each table to <output_folder>/<page>/<bbox>_0.xlsx
            s_path = os.path.join(output_folder, f'{start_page}', f'{bottom_xy}_0.xlsx')
            e_path = os.path.join(output_folder, f'{end_page}', f'{top_xy}_0.xlsx')
            table_result_start = pd.read_excel(s_path, header=None)
            table_result_end = pd.read_excel(e_path, header=None)
            merged_table = pd.concat([table_result_start, table_result_end], ignore_index=True)
            pdf_name = os.path.splitext(pdf_file)[0]
            output_path = os.path.join(output_folder, f'{pdf_name}_merged_{start_page}_{end_page}.csv')
            # The sheets were read without a header, so do not write the integer column labels.
            merged_table.to_csv(output_path, index=False, header=False)

print('Table extraction complete')
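
The merge loop depends on a file-naming convention: each table is expected in a per-page subfolder, in a file named after its bounding box. The sketch below isolates that path construction. The helper name table_xlsx_path and the sample values are hypothetical, and the naming pattern is assumed from the code above, so it is worth checking against the files your PaddleOCR version actually writes.

```python
import os


def table_xlsx_path(output_folder, page_number, bbox, img_idx=0):
    """Build the path the merge loop expects for a table saved from a given page."""
    # Assumption: files are named "<bbox list as text>_<img_idx>.xlsx" inside a
    # subfolder named after the 1-based page number.
    return os.path.join(output_folder, str(page_number), f"{bbox}_{img_idx}.xlsx")


example_path = table_xlsx_path("out", 3, [100, 1200, 1500, 1900])
print(example_path)
```

Because the file name embeds the text form of the bounding box list, the bbox passed here must be exactly the one returned by the engine for that table; a tuple or a list of floats would produce a different name.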

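5. Summary

The code above detects tables that span consecutive pages of a scanned PDF, merges the two parts, and saves each merged table as a CSV file. This removes the manual copy-and-paste step when working with scanned PDF tables.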

Original article: https://blog.csdn.net/weixin_44733966/article/details/140373838
