Converting Common Document Files to TXT

Published 2024-06-21 17:03 · Tags: python, programming languages

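Code overview

This code walks through every file in a given directory, extracts the text from each supported format (PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, TXT), and writes it all to a single text file. Each format is handled by a different library: fitz (PyMuPDF) for PDF, docx (python-docx) for DOCX, python-pptx for PPTX, comtypes for PPT, openpyxl for XLSX, xlrd for XLS, aspose-words for DOC, and chardet for detecting the encoding of TXT files.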

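Installing the dependencies

Before running the code, install the following Python libraries:

Note: the code was written for Python 3.9. The PPT extractor drives PowerPoint through COM via comtypes, so it requires Windows with Microsoft PowerPoint installed.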

pip install pymupdf python-docx chardet python-pptx xlrd openpyxl comtypes aspose-words
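
Several of these packages are imported under a different name than the one passed to pip, which makes a missing dependency easy to overlook. The following is a minimal sketch, using only the standard library, that reports which modules can be found. The names `REQUIRED_MODULES` and `check_dependencies` are illustrative and not part of the original script.

```python
import importlib.util

# pip package name -> module name used in `import` (illustrative table)
REQUIRED_MODULES = {
    "pymupdf": "fitz",
    "python-docx": "docx",
    "chardet": "chardet",
    "python-pptx": "pptx",
    "xlrd": "xlrd",
    "openpyxl": "openpyxl",
    "comtypes": "comtypes",
    "aspose-words": "aspose.words",
}


def check_dependencies(modules=REQUIRED_MODULES):
    """Return {pip_name: True/False} depending on whether the module can be found."""
    status = {}
    for pip_name, module_name in modules.items():
        try:
            status[pip_name] = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # find_spec raises if a parent package (e.g. `aspose`) is missing
            status[pip_name] = False
    return status


if __name__ == "__main__":
    for pip_name, ok in check_dependencies().items():
        print(f"{pip_name}: {'ok' if ok else 'missing'}")
```

Running this before the main script shows which formats can be expected to fail on the current machine.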

Code walkthrough

1 Import the required libraries

import os
import fitz  # PyMuPDF
import docx
import chardet
from pptx import Presentation
import xlrd
import openpyxl
import comtypes.client
import aspose.words as aw

2 Define the text extraction functions

Define one function per file format to extract its text:

Extracting text from PPT files

def extract_text_from_ppt(file_path):
    # Requires Windows with Microsoft PowerPoint installed (COM automation)
    powerpoint = comtypes.client.CreateObject("Powerpoint.Application")
    powerpoint.Visible = 1
    text_runs = []
    try:
        # COM needs an absolute path to open the file
        abs_file_path = os.path.abspath(file_path)
        presentation = powerpoint.Presentations.Open(abs_file_path)
        for slide in presentation.Slides:
            for shape in slide.Shapes:
                if shape.HasTextFrame and shape.TextFrame.TextRange.Paragraphs().Count > 0:
                    try:
                        for paragraph in shape.TextFrame.TextRange.Paragraphs():
                            for run in paragraph.Runs():
                                text_runs.append(run.Text)
                    except Exception as e:
                        print(f"Error processing shape in slide: {e}")
        presentation.Close()
    finally:
        # Always shut PowerPoint down, even if opening or reading the file failed
        powerpoint.Quit()
    return '\n'.join(text_runs)

Extracting text from PDF files

def extract_text_from_pdf(file_path):
    # Open the document as a context manager so the file is closed afterwards
    with fitz.open(file_path) as doc:
        return "".join(page.get_text() for page in doc)

Extracting text from DOCX files

def extract_text_from_docx(file_path):
    doc = docx.Document(file_path)
    text = [paragraph.text for paragraph in doc.paragraphs]
    return "\n".join(text)

Extracting text from PPTX files

def extract_text_from_pptx(file_path):
    prs = Presentation(file_path)
    text_runs = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text_runs.append(run.text)
    return '\n'.join(text_runs)

Extracting text from XLSX files

def extract_text_from_xlsx(file_path):
    workbook = openpyxl.load_workbook(file_path)
    sheets = workbook.sheetnames
    text = []
    for sheet_name in sheets:
        sheet = workbook[sheet_name]
        for row in sheet.iter_rows(values_only=True):
            text.append("\t".join([str(cell) if cell is not None else "" for cell in row]))
    return "\n".join(text)

Extracting text from XLS files

def extract_text_from_xls(file_path):
    workbook = xlrd.open_workbook(file_path)
    text = []
    for sheet in workbook.sheets():
        for row_idx in range(sheet.nrows):
            row = sheet.row(row_idx)
            text.append("\t".join([str(cell.value) for cell in row]))
    return "\n".join(text)
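
xlrd returns every numeric cell as a float, so a cell holding the integer 3 is written as `3.0` by `str(cell.value)`. If that matters for the output, each value could be passed through a small helper such as the sketch below. `format_cell` is a hypothetical name and not part of the original code.

```python
def format_cell(value):
    """Render a cell value as text, dropping the trailing '.0' from whole floats."""
    if value is None:
        return ""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


# Example: one row of values as xlrd would return them
row_values = [1.0, 2.5, "name", ""]
line = "\t".join(format_cell(v) for v in row_values)
```

The same helper also covers the `None` check used in the XLSX extractor, so both spreadsheet functions could share it.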

Extracting text from DOC files

def extract_text_from_doc(file_path):
    try:
        doc = aw.Document(file_path)
        text = doc.get_text()
        return text
    except Exception as e:
        print(f"Error processing file {file_path}: {e}")
        return ""

Extracting text from TXT files

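def extract_text_from_txt(file_path):
    # Read the raw bytes first so chardet can guess the encoding
    with open(file_path, 'rb') as f:
        rawdata = f.read()
    result = chardet.detect(rawdata)
    # chardet reports None when it cannot guess (e.g. an empty file)
    encoding = result['encoding'] or 'utf-8'
    with open(file_path, 'r', encoding=encoding, errors='ignore') as f:
        return f.read()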
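
Encoding detection is a guess and can be wrong on short files. One possible alternative, sketched here with the standard library only, is to try a fixed list of candidate encodings in order. The helper name `decode_with_fallback` and the candidate list are assumptions chosen for illustration (`gbk` because the sample data appears to be Chinese, `latin-1` as a catch-all that never fails); neither comes from the original script.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gbk", "latin-1")):
    """Try each candidate encoding in order and return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Reached only if the candidate list has no catch-all such as latin-1
    return raw.decode("utf-8", errors="ignore"), "utf-8"
```

The order matters: strict encodings such as UTF-8 should come first, because permissive ones accept almost any byte sequence and would mask them.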

3 Map file extensions to extraction functions

EXTRACTORS = {
    '.pdf': extract_text_from_pdf,
    '.docx': extract_text_from_docx,
    '.pptx': extract_text_from_pptx,
    '.ppt': extract_text_from_ppt,
    '.xlsx': extract_text_from_xlsx,
    '.xls': extract_text_from_xls,
    '.doc': extract_text_from_doc,
    '.txt': extract_text_from_txt,
}
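
Because dispatch goes through this mapping, supporting another format only takes a new function and one new entry. As a sketch, here is how a CSV extractor might be registered. The function `extract_text_from_csv` is hypothetical, it assumes UTF-8 input, and the empty `EXTRACTORS` dictionary below is only a stand-in so the snippet runs on its own.

```python
import csv


def extract_text_from_csv(file_path):
    """Read a CSV file and return its rows as tab-separated lines."""
    with open(file_path, newline="", encoding="utf-8") as f:
        return "\n".join("\t".join(row) for row in csv.reader(f))


# Stand-in for the mapping above; in the real script, add the entry to EXTRACTORS.
EXTRACTORS = {}
EXTRACTORS[".csv"] = extract_text_from_csv
```

No other part of the script needs to change, since the conversion function looks the extractor up by extension.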

4 Function that generates the output file path

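def get_next_output_file_path(directory):
    # Start from the number of existing .txt files in the directory
    num = len([f for f in os.listdir(directory) if f.endswith('.txt')])
    while True:
        output_file_path = os.path.join(directory, f'output_{str(num).zfill(3)}.txt')
        # Skip names that are already taken, since the file is later opened
        # in 'x' mode, which fails if the file exists
        if not os.path.exists(output_file_path):
            return output_file_path
        num += 1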
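
Counting `.txt` files ties the output number to whatever else happens to be in the directory. An alternative, sketched below, is to parse the numbers out of existing `output_NNN.txt` names and use the highest one plus one. `next_output_path` and `OUTPUT_NAME` are illustrative names, not part of the original script.

```python
import os
import re

# Matches the output_NNN.txt naming scheme used by the script
OUTPUT_NAME = re.compile(r"^output_(\d+)\.txt$")


def next_output_path(directory):
    """Return output_NNN.txt where NNN is one more than the highest existing index."""
    indices = [int(m.group(1))
               for name in os.listdir(directory)
               if (m := OUTPUT_NAME.match(name))]
    next_index = max(indices, default=-1) + 1
    return os.path.join(directory, f"output_{next_index:03d}.txt")
```

With this approach, unrelated `.txt` files and deleted outputs no longer affect the numbering.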

5 File conversion function

def convert_to_text(file_path: str) -> str:
    _, ext = os.path.splitext(file_path)
    # Normalize case so that names such as "REPORT.PDF" are matched too
    extractor = EXTRACTORS.get(ext.lower())
    if extractor is None:
        raise ValueError(f"Unsupported file format: {ext}")
    return extractor(file_path)

6 Function that walks the directory and converts the files

def convert_files_in_directory(directory):
    file_paths = [os.path.join(root, filename)
                  for root, dirs, files in os.walk(directory)
                  for filename in files]
    target_directory = r'E:\MarkDeng\data2train'
    output_file_path = get_next_output_file_path(target_directory)
    total_files = len(file_paths)
    with open(output_file_path, 'x', encoding='utf-8') as f:
        for i, file_path in enumerate(file_paths, start=1):
            try:
                text = convert_to_text(file_path)
                f.write(text + '\n')
                print(f"Processed {i} out of {total_files} files.")
            except Exception as e:
                print(f"Error processing file {file_path}: {e}")
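
The function above passes every file to the converter and relies on the exception handler to skip unsupported ones, so the progress total includes files that can never be converted. A possible refinement, sketched here with the standard library only, is to filter the list first. `SUPPORTED_EXTS` and `list_supported_files` are hypothetical names; the extension set is assumed to mirror the keys of `EXTRACTORS`.

```python
import os

# Assumed to mirror the keys of the article's EXTRACTORS mapping
SUPPORTED_EXTS = {".pdf", ".docx", ".pptx", ".ppt", ".xlsx", ".xls", ".doc", ".txt"}


def list_supported_files(directory, supported_exts=SUPPORTED_EXTS):
    """Walk `directory` and return sorted paths whose extension is supported."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            ext = os.path.splitext(filename)[1].lower()
            # Skip Office lock files such as "~$report.docx"
            if ext in supported_exts and not filename.startswith("~$"):
                matches.append(os.path.join(root, filename))
    return sorted(matches)
```

Sorting the paths also makes the order of text in the output file reproducible between runs.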

7 Main entry point

if __name__ == '__main__':
    # Example usage
    your_directory = r'E:\MarkDeng\data2train\126套技巧教程'
    convert_files_in_directory(your_directory)
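
Both the source directory and the target directory are hard-coded in the script. One way to avoid editing the file for each run is to read them from the command line. The sketch below only builds the argument parser; the argument names `source` and `--output-dir` are assumptions, and `convert_files_in_directory` as written would need an extra parameter to accept the output directory.

```python
import argparse


def build_parser():
    """Build a command-line parser for the source and output directories."""
    parser = argparse.ArgumentParser(
        description="Convert supported documents in a directory to one text file.")
    parser.add_argument("source", help="directory to scan for documents")
    parser.add_argument("-o", "--output-dir", default=".",
                        help="directory where output_NNN.txt is written")
    return parser
```

In the main block, `args = build_parser().parse_args()` would then supply `args.source` in place of `your_directory`.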

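Summary

The code defines one extraction function per file format, walks the directory, converts every supported file, and writes the combined text to a file in the target directory. This takes the tedium out of pulling text from many files, and supporting a further format only requires a new extraction function and a new entry in EXTRACTORS.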

Original article: https://blog.csdn.net/weixin_47420447/article/details/139856418
