CoCo Caption数据集转SFT格式（parquet格式转jpg和json）

🕗 发布于 2024-12-11 18:43 json

# transformer to mllm instruction-tuning data
import pyarrow.parquet as pq
import pandas as pd
import cv2
import numpy as np
import os
import json
import random
from tqdm import tqdm

instructions = [
    "Give the caption of the image.",
    "Give the caption of the image.",
    "Give the caption of the image.",
    "Give the caption of the image in English.",
    "Describe the content of the image in detail, including all the objects and their interactions.",
    "Provide a comprehensive caption that summarizes the main elements and actions in the image.",
    "Explain what is happening in the image and the context surrounding the scene.",
    "Capture the essence of the image by describing the subject, setting, and any relevant actions.",
    "Create a caption that conveys the mood and atmosphere of the image.",
    "Describe the image with attention to detail, including colors, textures, and any unique features.",
    "Write a caption that highlights the most significant aspects of the image and makes them stand out.",
    "Provide a narrative caption that tells a story based on the image's content.",
    "Craft a caption that is both informative and engaging, drawing the viewer into the image.",
    "Describe the image in a way that captures the viewer's attention and encourages further exploration of the scene.",
    "Write a caption that is concise yet informative, giving just enough detail to paint a clear picture of the image.",
    "Provide a descriptive caption that helps the viewer understand the relationship between the objects in the image.",
    "Craft a caption that is not only accurate but also evokes emotion, reflecting the sentiment of the image.",
    "Write a caption that is suitable for an audience unfamiliar with the image's context, providing enough detail for understanding."
]
# 设置数据文件夹和图像保存目录
data_dir = '/mnt/workspace/data/coco_captions/data'
image_dir = 'images'
os.makedirs(image_dir, exist_ok=True)

# 初始化JSON数据结构
chat_test = []
chat_train = []

# 遍历数据文件夹中的所有Parquet文件
for file_name in os.listdir(data_dir):
    if file_name.startswith("test"):
        output_json = chat_test
        file_path = os.path.join(data_dir, file_name)
        print(file_path)
        parquet_file = pq.ParquetFile(file_path)
        data = parquet_file.read().to_pandas()
        for index, row in tqdm(data.iterrows()):
            filename = row['filename']
            save_path = os.path.join(image_dir, filename)
            
            image_feature = row['image']['bytes']
            image_array = np.frombuffer(image_feature, dtype=np.uint8)
            image = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
            if image is not None:
                cv2.imwrite(save_path, image)
            else:
                print(f"Failed to decode image for file {file_name}, row {index}.")
                continue
            
            caption = row['caption']
            cocoid = row['cocoid']
            instruction = random.choice(instructions)
            
            if random.choice([True, False]):
                instruction = f"<image>\n{instruction}"
            else:
                instruction = f"{instruction}\n<image>"
            
            conversation = {
                "id": cocoid,
                "image": filename,
                "conversations": [
                    {"from": "human", "value": instruction},
                    {"from": "gpt", "value": caption},
                ]
            }
            output_json.append(conversation)
    elif file_name.startswith(("train", "validation")):
        output_json = chat_train
        file_path = os.path.join(data_dir, file_name)
        print(file_path)
        parquet_file = pq.ParquetFile(file_path)
        data = parquet_file.read().to_pandas()
        for index, row in tqdm(data.iterrows()):
            filename = row['filename']
            save_path = os.path.join(image_dir, filename)
            
            image_feature = row['image']['bytes']
            image_array = np.frombuffer(image_feature, dtype=np.uint8)
            image = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
            if image is not None:
                cv2.imwrite(save_path, image)
            else:
                print(f"Failed to decode image for file {file_name}, row {index}.")
                continue
            
            caption = row['caption']
            cocoid = row['cocoid']
            instruction = random.choice(instructions)
            
            if random.choice([True, False]):
                instruction = f"<image>\n{instruction}"
            else:
                instruction = f"{instruction}\n<image>"
            
            conversation = {
                "id": cocoid,
                "image": filename,
                "conversations": [
                    {"from": "human", "value": instruction},
                    {"from": "gpt", "value": caption},
                ]
            }
            output_json.append(conversation)

# 将JSON数据保存为文件
with open('chat_test.json', 'w') as json_file:
    json.dump(chat_test, json_file, indent=4)

with open('chat_train.json', 'w') as json_file:
    json.dump(chat_train, json_file, indent=4)

print("Finished processing and saved chat data to chat_test.json and chat_train.json.")

原文地址：https://blog.csdn.net/weixin_54338498/article/details/144372838

免责声明：本站文章内容转载自网络资源，如本站内容侵犯了原著者的合法权益，可联系本站删除。更多内容请关注自学内容网（zxcms.com）！

上一篇：Ubuntu上使用system()函数运行不需要输入密码
下一篇：关于网站的权重和百度蜘蛛爬虫的关系

.NET(C#) 如何配置用户首选项及保存用户设置
.NET(C#) 如何配置用户首选项及保存用户设置
阅读更多2024-12-14
【最新】北大数字普惠金融指数数据集-省市县（2011-2023年）
郭峰,王靖一,王芳,孔涛,张勋,程志云.测度中国数字普惠金融发展:指数编制与空间特征[J].经济学(季刊),2020,19(04):1401-1418.时间跨度：省级和城市级指数时间跨度为2011-2
阅读更多2024-12-14
GESP202412 四级【Recamán】题解（AC）
a11ak−1−kkakak−1−kak−1k小杨想知道 Recamán 数列的前n项从小到大排序后的结果。手动计算非常困难，小杨希望你能帮他解决这个问题。
阅读更多2024-12-14
IDEA遇到EasyConnect中的网络资源无法访问的问题
版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。原文链接：https://blog.csdn.net/wanshanyu_/article/de
阅读更多2024-12-14
双目摄像头标定方法
此时已经完成标定，左下角为反投影误差，右边为外参可视化。将双目左右目拍的图像上传（左右目最好不少于20张）此时回到主页面，即可看到成功导出。把这些误差大的删除即可。
阅读更多2024-12-14
Servlet、omcat服务器架构与工作原理
Servlet是运行在服务器端的Java程序，它的主要职责之一是接收并处理来自客户端（如浏览器）的HTTP请求。当客户端发送一个请求到服务器时，Servlet可以解析请求中的信息，例如请求的URL路径
阅读更多2024-12-14
Vue生命周期钩子函数：深入解析与实践
作为高级Vue前端开发人员，对Vue组件的生命周期钩子函数有着深刻的理解是至关重要的。生命周期钩子函数是指在Vue组件的创建、更新、销毁等过程中，Vue自动调用的一系列方法。通过这些钩子函数，我们可以
阅读更多2024-12-14
安卓开发--使用android studio发布APP
app发布
阅读更多2024-12-14
数据结构与算法学习笔记----拓扑排序
@ author: 明月清了个风。
阅读更多2024-12-14
python 将数据保存到现有的Excel文件的新工作表
out_file = ‘query.xlsx’df1 = pd.DataFrame(out_data)若直接写入：df1.to_excel(out_file, index=False, sheet_n
阅读更多2024-12-14

CoCo Caption数据集转SFT格式（parquet格式转jpg和json）

相关文章