CUDA timing: measuring how long a GPU program or function takes with cudaEventCreate, cudaEventRecord, and cudaEventElapsedTime

🕗 Published 2024-12-10 14:25 · CUDA

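To measure how long GPU code takes, you can use the event-based timing functions that CUDA provides: cudaEventCreate, cudaEventRecord, and cudaEventElapsedTime. Two events are recorded around the operation of interest (here, setting the device), and cudaEventElapsedTime reports the time between them in milliseconds.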

1. Timing example

The following example program measures the time taken by calls to cudaSetDevice:

#include <iostream>
#include <vector>
#include <cuda_runtime.h>

// Empty kernel, launched once so that the CUDA context is initialized before timing
__global__ void dummyKernel() {}

int main() {
    int device1 = 0;         // CUDA device ID
    int numIterations = 10;  // Number of times to call cudaSetDevice

    // Set the initial device and run the dummy kernel, so that one-time
    // context creation is not included in the measurements
    cudaSetDevice(device1);
    dummyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Create CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Vector to store elapsed times (milliseconds)
    std::vector<float> elapsedTimes(numIterations);

    // Measure time for multiple cudaSetDevice calls
    for (int i = 0; i < numIterations; ++i) {
        // Record the start event on the default stream
        cudaEventRecord(start, 0);

        // Set the device (this is the operation we are timing)
        cudaSetDevice(device1);

        // Record the stop event and wait for it to complete; without this
        // wait, cudaEventElapsedTime can fail with cudaErrorNotReady
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        // Measure the elapsed time between the start and stop events
        cudaEventElapsedTime(&elapsedTimes[i], start, stop);

        // Output the result for this iteration
        std::cout << "Iteration " << i << ": time to set device " << device1
                  << ": " << elapsedTimes[i] << " ms" << std::endl;
    }

    // Calculate statistics (e.g., average time)
    float totalTime = 0.0f;
    for (float time : elapsedTimes) {
        totalTime += time;
    }
    float averageTime = totalTime / numIterations;

    // Output the summary
    std::cout << "Number of iterations: " << numIterations << std::endl;
    std::cout << "Average time to set device " << device1 << ": " << averageTime << " ms" << std::endl;

    // Clean up
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 0;
}

2. Compile and run

2.1 Compile: use nvcc to compile the CUDA program. (The source file above is named test_cudaSetDevice_multiple.cu.)

nvcc -o test_cudaSetDevice_multiple test_cudaSetDevice_multiple.cu

2.2 Run: then run the generated executable.

./test_cudaSetDevice_multiple

And that's it, the timing results are printed!

Original article: https://blog.csdn.net/lianghuaju/article/details/144340427
