How do I generate XML or JSON from text extracted from slides?

Question

I am using this code to extract text from slides with python-pptx. How can I generate an XML or JSON file that contains the text of each slide?

from pptx import Presentation

local_pptxFileList = ["/content/drive/MyDrive/Slides/Backlog Management.pptx"]

for i in local_pptxFileList:
    ppt = Presentation(i)
    for slide in ppt.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                print(shape.text)

Answer

Store the extracted text in a data structure, for example a list (or a list of lists, with one list of texts per presentation).
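Since the question asks for the text of each slide, another option is to group the text by slide rather than by presentation. The following is only a minimal sketch: `slide_records` and the `"slide"` / `"texts"` keys are names made up for illustration, and the function assumes it receives an object shaped like a python-pptx `Presentation` (with `.slides`, `.shapes`, `has_text_frame`, and `.text`).

```python
def slide_records(ppt):
    """Group text by slide: one dict per slide, numbered from 1.

    `ppt` is assumed to behave like a python-pptx Presentation object.
    The function name and dict keys are illustrative, not part of any API.
    """
    records = []
    for number, slide in enumerate(ppt.slides, start=1):
        # Keep only shapes that can hold text, as in the question's loop
        texts = [shape.text for shape in slide.shapes if shape.has_text_frame]
        records.append({"slide": number, "texts": texts})
    return records
```

Calling `slide_records(Presentation(path))` would give a list of plain dicts and strings, which the `json` module can serialize without further conversion.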

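Use the json module to build JSON from your data structure and save it to a file. I have not dealt with encoding (e.g. UTF-8) to make sure the text is stored correctly, but plenty of information on that is easy to find.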

import json

from pptx import Presentation

local_pptxFileList = ["/content/drive/MyDrive/Slides/Backlog Management.pptx"]

all_texts = []  # one list of texts per presentation
for i in local_pptxFileList:
    ppt = Presentation(i)
    this_pres_texts = []  # texts from every slide of this presentation
    for slide in ppt.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                this_pres_texts.append(shape.text)
    all_texts.append(this_pres_texts)

# Serialize the nested list as JSON
with open('data.txt', 'w') as outfile:
    json.dump(all_texts, outfile)
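The code above leaves encoding open and only covers JSON. As a hedged sketch of both remaining pieces, the functions below write a nested list shaped like `all_texts` to a UTF-8 JSON file and to an XML file using the standard library's `xml.etree.ElementTree`. The function names and the element names (`presentations`, `presentation`, `text`) are assumptions chosen for illustration, not a standard schema. The replacement of vertical tabs assumes that python-pptx may return line breaks as `"\v"`, a character XML 1.0 does not allow.

```python
import json
import xml.etree.ElementTree as ET


def write_json(all_texts, path):
    # A UTF-8 file plus ensure_ascii=False keeps non-ASCII text readable
    # in the output instead of escaping it as \uXXXX sequences.
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(all_texts, outfile, ensure_ascii=False, indent=2)


def write_xml(all_texts, path):
    # Element names here are illustrative only.
    root = ET.Element("presentations")
    for pres_texts in all_texts:
        pres_el = ET.SubElement(root, "presentation")
        for text in pres_texts:
            # Assumption: line breaks may arrive as vertical tabs, which are
            # not valid in XML 1.0, so they are swapped for newlines.
            ET.SubElement(pres_el, "text").text = text.replace("\v", "\n")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

With the `all_texts` list built above, `write_json(all_texts, "data.json")` and `write_xml(all_texts, "data.xml")` would produce the two files. ElementTree escapes characters such as `<` and `&` in the text on its own.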

json python python-pptx xml