[Shenzhen University Computer Systems (2)] Lab 5: Cache Lab Report, with Lab Code, Common Commands, and Experimental Data
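Preface:
The submitted lab report must include the account holder's full name in pinyin and student ID. The corresponding parts of this report have been masked; please complete the lab in your own environment!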
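I. Objectives:
1. Deepen understanding of how a cache works;
2. Experience how changes in a program's memory access pattern affect cache efficiency and, in turn, program performance;
3. Learn to probe the multi-level cache structure and the TLB size on a real x86 machine by adjusting a program's memory access pattern.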
II. Environment:
A real x86 machine
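III. Tasks and steps:
1. Analyze how the cache access pattern affects system performance
(1) Given a plain matrix multiplication program A, find a way to optimize it and improve performance.
(2) Vary the matrix size, record the data, and analyze the causes.
2. Write code to measure the cache hierarchy and capacities on an x86 machine (not a virtual machine)
(1) Design a scheme for measuring the cache hierarchy on an x86 machine and write the corresponding code;
(2) Run your code to collect test data;
(3) Based on the test data, analyze in detail how many cache levels your x86 machine has and the capacity of each;
(4) Based on the test data, analyze in detail how many lines the L1 cache has.
3. Try to measure how large the TLB of your x86 machine is (optional)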
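IV. Procedure and details:
Preparation:
First create a folder named experiment_5 on the virtual machine's desktop to hold all files used in this lab. After downloading main_a.c, compile it with gcc -o main_a main_a.c; it compiles without errors.
Figure: compiling main_a.c
In the figure, the first compilation attempt failed because of the "-o" option: the command given in the slides used "–" (an en dash) instead of the standard "-" (ASCII hyphen), so the option was not recognized. After the fix, the program compiles correctly and an executable is produced.
The experiment can now begin: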
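1. Analyze how the cache access pattern affects system performance:
Run the program with ./main_a x, where the matrix size is x*x, record the running time, and enter it in the table (in seconds).
Next, read the source code and try to optimize it:
Reading the code shows that the program multiplies two matrices: it multiplies the x*x matrices a and b and stores the result in matrix c. However, the code has poor spatial locality, because it often accesses addresses far from the one it just accessed. Specifically, while computing one element of c, each successive access to matrix b advances the address by size elements. When size is large, consecutive accesses are far apart in memory, so spatial locality is poor.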
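To make the access pattern described above concrete, here is a minimal sketch of a naive row-major triple loop. It is an assumption about what such code typically looks like, not a copy of main_a.c, and the names matmul_naive and b_stride_bytes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Naive row-major multiply: c = a * b, where all matrices are n x n. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {
                /* a is read along a row (stride of 1 element);
                   b is read down a column (stride of n elements). */
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* Byte distance between consecutive accesses to b in the inner loop. */
size_t b_stride_bytes(size_t n)
{
    return n * sizeof(float);
}
```

With 4-byte floats and n = 1000, consecutive accesses to b are 4000 bytes apart, which is far more than one cache line on common x86 processors (64 bytes is assumed here), so each access to b tends to land in a different line.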
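Optimization: split the multiplication of the whole matrices into multiplications of small matrix blocks. This improves locality, because each block is reused while it is still in the cache, and therefore improves running efficiency. The computational part of the code is shown in the figure:
Figure: core code
Here, BLOCK_SIZE is the size of a matrix block.
Since a suitable block size cannot be determined directly, I tried several values of this parameter to find a good one:
Figure: parameter tuning
The second argument in the figure is the value of BLOCK_SIZE. As the figure shows, 512 is a good value.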
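As a rough cross-check on a block size found by trial and error, one common rule of thumb is that the three tiles being worked on (one each from a, b and c) should fit in the cache level of interest. The sketch below only does that arithmetic; the function names are hypothetical, the rule is a heuristic rather than the method used in this report, and the cache capacity passed in is an assumption supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by one block_size x block_size float tile of each of a, b, c. */
size_t tile_working_set_bytes(size_t block_size)
{
    assert(block_size > 0);
    return 3 * block_size * block_size * sizeof(float);
}

/* Largest power-of-two block size whose three tiles fit in cache_bytes.
   Returns 0 if even three 1 x 1 tiles do not fit.
   Intended for realistic cache sizes; overflow is not handled. */
size_t largest_pow2_block(size_t cache_bytes)
{
    size_t best = 0;
    for (size_t b = 1; tile_working_set_bytes(b) <= cache_bytes; b *= 2) {
        best = b;
    }
    return best;
}
```

With 4-byte floats, three 512 x 512 tiles occupy 3 MB, so under this heuristic a block size of 512 fits in a cache of about 6 MB, while 1024 would need 12 MB.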
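Next, fix the parameter, run the experiment, and collect the results:
Figure: experiment runs
The final results are shown in the table:
As the table shows, the optimization brings a significant improvement.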
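2. Measure and analyze the cache hierarchy, the capacities, and the number of L1 cache lines
First extract the mountain archive: tar -xvf mountain.tar
Figure: extraction succeeded
Then build the executable with make and run it:
Figure: run output
Save the output locally and plot it with MATLAB:
Figure: memory mountain
The figure clearly shows read throughput falling into distinct levels.
From these results: as the amount of data read grows, read throughput drops three times, at 512KB~1024KB, 4MB~8MB, and 8MB~16MB. This indicates that the machine has three levels of cache and that the three drops correspond to the transitions between them. The L1 cache capacity is probably about 768KB, L2 about 6MB, and L3 about 12MB.
Read throughput also changes abruptly as the stride varies. This happens because data is loaded into the cache in blocks: with a sufficiently large stride, two consecutive accesses can never fall in the same block, which lowers read throughput. In this experiment the abrupt change appears at a stride of about 40, so the L1 cache has about 512KB / (40 × 8B) ≈ 1638 lines.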
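The kind of measurement behind a memory mountain can be sketched as a strided read loop plus a throughput formula. This is a simplified illustration, not the code in mountain.tar; the function names are hypothetical and the timing itself is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Sum every stride-th element among the first elems entries of data.
   Returning the sum keeps the compiler from discarding the reads. */
long strided_sum(const long *data, size_t elems, size_t stride)
{
    assert(data && stride > 0);
    long sum = 0;
    for (size_t i = 0; i < elems; i += stride) {
        sum += data[i];
    }
    return sum;
}

/* Throughput in MB/s for one sweep over size_bytes with the given stride.
   Only size_bytes / stride bytes are actually read during the sweep. */
double throughput_mb_per_s(size_t size_bytes, size_t stride, double seconds)
{
    assert(stride > 0 && seconds > 0.0);
    return ((double)size_bytes / (double)stride) / (seconds * 1e6);
}
```

Sweeping elems over a range of working-set sizes and stride over 1..64, and timing each call to strided_sum, yields one throughput value per (size, stride) pair, which is the shape of the data plotted as the mountain.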
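Open Task Manager to verify the results:
Figure: actual sizes of the three caches
As the figure shows, the actual capacities of the three caches agree with the experimental results.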
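3. Try to measure how large the TLB of your x86 machine is (optional)
Definition of the TLB: the TLB (Translation Lookaside Buffer) is a dedicated hardware cache inside the CPU that speeds up the translation of virtual addresses to physical addresses.
When the CPU executes a program, it generates virtual addresses to access memory. The CPU first looks in the TLB: if the TLB holds the translation for the virtual address, the cached physical address is used directly for the memory access. Otherwise the address has to be translated through the page table.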
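The lookup described above works on page numbers rather than on full addresses. The following sketch shows how a virtual address splits into a page number and an offset, assuming 4 KB pages; the helper names are hypothetical and no real TLB is modeled.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                          /* assumes 4 KB pages */
#define PAGE_BYTES (UINT64_C(1) << PAGE_SHIFT)

/* Virtual page number: the part of the address that is translated. */
uint64_t page_number(uint64_t vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

/* Offset within the page: unchanged by translation. */
uint64_t page_offset(uint64_t vaddr)
{
    return vaddr & (PAGE_BYTES - 1);
}

/* Rebuild an address from a page number and an in-page offset. */
uint64_t join_address(uint64_t number, uint64_t offset)
{
    assert(offset < PAGE_BYTES);
    return (number << PAGE_SHIFT) | offset;
}

/* Two addresses can share one translation only if they are in the same page. */
int same_page(uint64_t a, uint64_t b)
{
    return page_number(a) == page_number(b);
}
```

Accesses that stay inside one page reuse a single translation, whereas accesses spaced a page or more apart each need their own, which is why a page-sized stride puts pressure on the TLB.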
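I therefore wrote a program that first allocates a very large array (256MB here), far larger than what the TLB can cover, and then traverses the array with different strides: the stride starts at 1 and doubles each time. The access time for each stride is recorded. If stride × unit length (4KB here) is smaller than the TLB, the address can probably be found directly in the TLB, so the access is fast; otherwise it is slow. Finding the stride at which the access time changes abruptly therefore gives the TLB size.
Next, run the program; the results are shown in the figure. Multiplying the total access time by the corresponding stride gives the time needed to access the same amount of space. The relative time changes sharply between strides 8 and 16, so the computed TLB size is 32KB~64KB (8 × 4KB to 16 × 4KB).
Figure: run output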
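The normalization described above can be written out as a few small helpers. The array size and page size are the values stated in the text; the function names are hypothetical, and ns_per_access is an alternative view added here for illustration rather than something the report computes.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_BYTES ((size_t)256 * 1024 * 1024)  /* 256 MB, as in the text */
#define PAGE_BYTES  ((size_t)4096)               /* 4 KB, as in the text */

/* Number of array elements touched by one sweep with the given stride (in pages). */
size_t accesses_per_sweep(size_t stride_pages)
{
    assert(stride_pages > 0);
    size_t step = stride_pages * PAGE_BYTES;
    return (ARRAY_BYTES + step - 1) / step;      /* ceiling division */
}

/* Normalization used in the text: sweep time multiplied by the stride. */
double relative_time(double sweep_ms, size_t stride_pages)
{
    return sweep_ms * (double)stride_pages;
}

/* Alternative view: average time per access, in nanoseconds. */
double ns_per_access(double sweep_ms, size_t stride_pages)
{
    return sweep_ms * 1e6 / (double)accesses_per_sweep(stride_pages);
}
```

Because doubling the stride halves the number of accesses, multiplying the sweep time by the stride and dividing it by the access count both remove that effect, so a jump in either value points to a change in the cost of each individual access.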
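V. Summary and reflections
Summary:
In this lab I used my knowledge of program locality to optimize the matrix multiplication code, significantly improving its running efficiency, with a maximum speedup of about 173%. I then ran the mountain program to obtain read throughput for different working-set sizes and strides, and plotted the results. From these results I determined that the CPU has three levels of cache, with sizes of 768KB, 6MB, and 12MB respectively. I checked the sizes of the three caches directly in Task Manager, which confirmed these results. Finally, I studied the TLB, wrote code to measure the time a program needs to access the same array with different strides, and from this estimated the TLB size at about 48KB.
Reflections:
The hardest part of this lab was plotting the memory mountain, which required learning MATLAB's plotting functions to visualize the results. Measuring the TLB was also fairly difficult: it required first consulting references to learn about the TLB, then exploiting its properties by accessing an array with different strides to estimate its size.
Closing note
Questions and discussion are welcome, as are suggestions and comments; if you find any errors, please point them out!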
Appendix:
Lab code:
1. Matrix multiplication code:
#include <sys/time.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    float *a, *b, *c;
    long int i, j, k, size, m;
    struct timeval time1, time2;

    /* Both the matrix size and the block size are required. */
    if (argc < 3) {
        printf("\n\tUsage:%s <Row of square matrix> <Block size>\n", argv[0]);
        exit(-1);
    }
    size = atoi(argv[1]);
    int BLOCK_SIZE = atoi(argv[2]);
    /* A non-positive block size would make the blocked loops never advance. */
    if (size <= 0 || BLOCK_SIZE <= 0) {
        printf("Matrix size and block size must be positive integers\n");
        exit(-1);
    }

    m = size * size;
    a = (float*)malloc(sizeof(float) * m);
    b = (float*)malloc(sizeof(float) * m);
    c = (float*)malloc(sizeof(float) * m);
    if (a == NULL || b == NULL || c == NULL) {
        printf("Memory allocation failed\n");
        exit(-1);
    }

    /* Fill a and b with random values and clear c. */
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            a[i * size + j] = (float)(rand() % 1000 / 100.0);
            b[i * size + j] = (float)(rand() % 1000 / 100.0);
            c[i * size + j] = 0;
        }
    }

    gettimeofday(&time1, NULL);
    /* Blocked multiplication: the outer three loops pick a tile,
       the inner three loops multiply within that tile. */
    long int ii, jj, kk;
    for (ii = 0; ii < size; ii += BLOCK_SIZE)
        for (jj = 0; jj < size; jj += BLOCK_SIZE)
            for (kk = 0; kk < size; kk += BLOCK_SIZE)
                for (i = ii; i < ii + BLOCK_SIZE && i < size; i++)
                    for (j = jj; j < jj + BLOCK_SIZE && j < size; j++)
                        for (k = kk; k < kk + BLOCK_SIZE && k < size; k++)
                            c[i * size + j] += a[i * size + k] * b[k * size + j];
    gettimeofday(&time2, NULL);

    /* Elapsed time = time2 - time1, borrowing one second if needed. */
    time2.tv_sec -= time1.tv_sec;
    time2.tv_usec -= time1.tv_usec;
    if (time2.tv_usec < 0L) {
        time2.tv_usec += 1000000L;
        time2.tv_sec -= 1;
    }
    printf("Execution time = %ld.%06ld seconds\n", (long)time2.tv_sec, (long)time2.tv_usec);

    free(a);
    free(b);
    free(c);
    return 0;
}
2. Memory mountain plotting code (MATLAB):
% Data preparation
sizes = [2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768]; % size for each row
strides = 1:64; % stride for each column
% Throughput data
data = [
8519.7 8519.7 9727.3 10223.7 10208.7 8511.4 7288.4 8519.7 11331.9 20367.5 18570.3 16972.9 15675.0 14576.7 13578.3 12779.6 11980.9 11282.0 10682.9 10183.7 9684.5 9285.2 8885.8 8486.4 8087.1 7787.6 7488.0 7288.4 6988.8 6789.2 6589.5 6389.8 6190.1 5990.4 5790.8 5591.1 5491.2 5291.6 5191.7 5091.9 4892.2 4792.3 4692.5 4592.7 4492.8 4393.0 4293.1 4193.3 4093.5 3993.6 3993.6 3893.8 3793.9 3694.1 3694.1 3594.3 3494.4 3494.4 3394.6 3394.6 3294.7 3294.7 3194.9 3194.9;
8345.9 8519.7 9734.5 10223.7 11681.3 11348.5 11681.3 12779.6 11356.9 10208.7 9285.2 11348.5 10483.3 9717.8 13628.2 12779.6 11980.9 11331.9 21465.7 20367.5 9734.5 18570.3 17771.6 8486.4 16274.0 7837.5 7538.0 14576.7 14077.5 6789.2 13179.0 12779.6 12380.2 11980.9 11681.3 11282.0 5491.2 10682.9 10483.3 5091.9 9884.2 9684.5 9484.9 9285.2 9085.5 8885.8 8686.1 8486.4 8286.8 8087.1 7987.2 7787.6 7687.7 7488.0 7388.2 7288.4 7088.7 6988.8 6889.0 6789.2 6689.3 6589.5 6489.6 6389.8;
8431.9 8701.0 8792.4 9294.2 9085.5 9734.5 9734.5 10223.7 10095.0 11681.3 12380.2 11348.5 12579.9 9734.5 10902.6 12779.6 12005.8 11356.9 10757.8 10208.7 9734.5 9285.2 11847.7 11348.5 10882.6 10483.3 15125.8 14576.7 14077.5 13628.2 13179.0 12779.6 12380.2 11980.9 23362.7 11331.9 11032.4 21465.7 10483.3 10183.7 9934.1 9734.5 9484.9 18570.3 18171.0 17771.6 8686.1 16972.9 8336.7 16274.0 7987.2 15675.0 15375.4 7538.0 14776.4 14576.7 7138.6 14077.5 13778.0 13578.3 13378.6 6589.5 12979.3 6389.8;
8519.7 8609.4 8654.4 8701.0 8839.9 9085.5 8985.7 9294.2 9563.7 9085.5 9291.4 9734.5 9676.9 9734.5 9911.4 10223.7 9614.6 10095.0 10757.8 10221.2 12979.3 12380.2 11847.7 13618.3 13079.1 12579.9 12100.7 11681.3 14077.5 10902.6 10543.2 12779.6 12380.2 16007.8 9345.1 7571.2 8825.9 8606.3 8386.6 8167.0 9959.1 9734.5 12679.8 12380.2 9085.5 11847.7 11581.5 11348.5 16673.4 10882.6 10682.9 15724.9 10283.6 10083.9 9884.2 9717.8 14327.1 14077.5 13827.9 13628.2 13378.6 13179.0 12979.3 12779.6;
8453.7 8519.7 8453.2 8609.4 8608.6 8654.4 8654.7 8701.0 8863.9 8839.9 9009.9 8792.4 8985.7 8985.7 9085.5 8890.2 9161.6 9085.5 9059.2 9085.5 9734.5 9291.4 9478.2 9734.5 10060.9 9676.9 9315.9 9734.5 9393.3 9911.4 9593.8 9294.2 9904.2 12018.3 11681.3 10095.0 11044.9 10757.8 11980.9 10221.2 11396.1 11125.1 12679.8 10611.6 10383.4 10155.2 11598.1 9727.3 11115.6 10899.3 10682.9 10483.3 12340.3 12100.7 11881.0 11681.3 14327.1 11262.0 11082.3 13628.2 13403.6 13179.0 12979.3 10223.7;
8442.8 8453.7 8486.4 8519.7 8553.0 8586.3 8575.3 8609.4 8552.2 8608.6 8619.6 8654.4 8677.5 8818.0 8724.1 8701.0 8747.4 8863.9 8829.5 8839.9 8652.8 9009.9 8888.9 9085.5 9023.5 8985.7 8974.6 8985.7 9021.6 9085.5 9176.7 9294.2 9008.3 9161.6 9345.1 9563.7 9306.2 9562.5 9318.5 9085.5 9385.0 9734.5 9509.8 9291.4 9691.2 9478.2 9278.5 9734.5 9534.8 9342.2 9163.9 9676.9 9492.5 10092.2 9909.2 9734.5 10428.8 10247.3 10074.8 9911.4 9748.1 9593.8 10383.4 6815.8;
8426.5 8442.8 8486.4 8475.6 8497.5 8519.6 8536.1 8519.7 8552.8 8553.0 8558.3 8519.2 8530.4 8575.3 8637.7 8609.4 8649.1 8654.0 8608.8 8608.6 8654.2 8619.6 8619.6 8654.4 8722.7 8677.5 8654.0 8654.7 8676.5 8724.1 8794.3 8701.0 8810.4 8747.4 8900.1 8863.9 8840.9 8829.5 8828.0 8839.9 8863.6 8900.1 8950.4 8744.9 8810.2 8888.9 8698.6 9085.5 8899.1 9023.5 9163.9 8985.7 9144.7 8974.6 8811.9 8985.7 9181.3 9021.6 9239.4 9085.5 8935.7 9176.7 9029.1 6196.2;
8442.8 8464.6 8470.1 8475.6 8470.0 8470.0 8459.1 8475.6 8478.3 8497.5 8497.5 8486.4 8458.8 8497.3 8511.3 8519.7 8505.8 8502.8 8503.1 8497.5 8536.4 8497.1 8555.5 8519.2 8510.8 8530.4 8578.3 8575.3 8513.8 8553.0 8527.8 8519.7 8619.9 8553.0 8594.3 8552.2 8625.3 8502.5 8602.9 8608.6 8625.4 8654.2 8572.2 8619.6 8680.2 8619.6 8566.3 8519.2 8613.7 8579.7 8553.0 8677.5 8663.4 8654.0 8651.6 8654.7 8663.5 8676.5 8697.9 8724.1 8580.3 8614.8 8654.9 6103.7;
8434.6 8423.8 8425.1 8394.0 8395.3 8380.6 8392.7 8345.9 8332.5 8321.8 8204.5 8261.4 8085.3 8006.3 7754.7 8158.5 8277.1 8261.5 8175.1 8282.3 8281.1 8261.5 8275.9 8261.4 8308.6 8284.7 8285.1 8271.7 8279.4 8308.6 8277.0 8138.2 8304.5 8321.8 8308.4 8261.2 8321.4 8348.5 8336.4 8335.1 8290.0 8308.7 8337.4 8318.9 8308.2 8366.8 8373.8 8324.1 8345.3 8308.2 8344.4 8388.3 8369.7 8356.5 8134.0 8199.2 8199.4 8204.2 8214.7 8230.3 8250.3 8196.6 8145.8 5242.9;
8401.4 8412.9 8417.0 8412.9 8381.9 8368.5 8378.6 8324.6 8326.6 8328.5 8348.5 8269.4 8159.1 7646.1 7913.1 7190.3 7746.1 7796.4 7815.6 7991.5 8172.5 8247.1 8413.6 8372.4 8459.8 8371.0 8447.5 8440.0 8434.4 8388.5 8485.2 8220.0 8482.4 8344.4 8473.4 8478.3 8496.6 8425.1 8494.7 8361.7 8455.0 8478.3 8483.0 8497.5 8490.5 8492.0 8501.7 8486.4 8511.9 8476.7 8482.3 8494.5 8514.0 8503.1 8497.6 8458.9 8463.9 8473.8 8489.8 8511.3 8495.8 8527.8 8478.3 4972.0;
8426.8 8235.6 8431.2 8418.3 8395.4 8370.5 8376.2 8348.5 8329.6 8285.7 8330.2 8199.4 8188.2 8421.0 8393.7 7226.0 8384.2 8386.6 7041.5 7002.7 7246.0 7662.9 8374.9 8412.9 8408.9 8371.1 8438.3 8449.6 8454.3 8439.4 8442.8 8345.9 8392.7 8389.9 8437.6 8441.4 8371.1 8425.1 8388.6 8348.4 11959.8 12041.6 11993.2 12016.7 12054.1 12009.8 12007.8 11918.1 12070.8 12033.4 12074.8 11877.5 12044.2 12041.6 12047.2 12061.1 12004.4 12073.4 12070.5 12074.8 12086.0 12061.1 12085.4 7050.8;
8423.1 8273.0 8208.6 8158.5 8031.5 7905.0 7783.4 7945.5 8012.6 8056.2 7997.7 7791.2 7312.7 7014.9 6871.1 6369.6 6723.0 6725.8 6612.7 6676.7 6845.5 7326.6 8242.2 8429.1 8408.9 8384.3 8415.4 8369.2 8424.8 8353.5 8069.6 7573.1 8392.7 8384.2 8408.0 8410.9 8408.5 8405.8 8401.8 8361.8 8392.5 8399.7 8337.8 8400.1 8398.7 8398.0 8397.5 7234.0 8403.3 8408.9 8361.5 8397.3 8405.4 8438.3 8357.7 8440.0 8320.1 8444.4 8399.5 8388.6 8382.1 8432.2 8371.5 4433.0;
11865.5 11700.2 8081.0 8042.6 8051.6 7968.7 7914.7 7873.2 7690.8 7553.4 7405.0 6735.5 6465.8 6240.7 6071.6 6070.4 6352.0 6411.6 6216.3 6193.8 6500.7 9538.8 9767.7 9849.5 9744.3 9805.9 6837.0 6802.7 6793.7 6727.1 6724.0 6252.4 6607.5 6528.7 6462.1 6354.7 6305.2 6333.3 6340.4 6379.7 6291.2 6459.7 6413.3 6432.8 6546.5 6467.9 6646.6 6839.8 6614.7 6618.1 6547.8 6582.0 6578.8 6608.3 6515.9 6568.2 6484.3 6599.7 6481.8 5850.3 6014.0 6022.5 5871.9 4141.2;
8329.1 10210.8 6729.8 5982.7 5734.9 5447.6 5189.1 4880.2 4736.1 5329.8 4473.7 4295.4 4151.9 4001.4 3915.2 4320.7 4986.9 5222.4 5770.8 6041.0 6555.8 6941.9 7086.5 7039.9 6985.2 6872.7 6995.0 6952.9 6752.4 6687.6 6680.7 3381.3 6545.3 6474.7 6477.0 6417.8 6469.2 6541.0 6549.2 6628.1 6600.1 6545.5 6320.7 6335.4 6320.8 6323.0 6240.5 6461.2 6290.9 6226.9 6289.6 6342.3 6332.3 6235.0 6065.6 6261.6 6129.9 6210.8 6212.4 6162.8 6133.5 6100.0 5984.3 2975.5;
10210.0 7710.4 7007.7 6004.9 5391.6 4882.8 4340.9 3966.4 3602.9 3653.0 3095.2 2993.3 2985.3 2871.8 2834.5 2106.3 3583.1 3598.9 3562.2 3902.7 4139.8 4420.6 4835.1 5753.3 5478.3 7030.2 5832.4 6073.7 6225.8 6158.7 6317.2 1974.8 6242.3 6264.8 6378.2 6286.0 6374.0 6475.7 6691.1 6860.4 6722.9 6692.2 6630.6 9157.9 6628.7 6481.1 6615.8 5219.8 6508.9 6634.5 6662.8 6827.6 6406.8 6453.0 6144.5 6521.6 6537.7 6594.4 6507.1 6610.0 6515.5 6568.7 6328.2 1812.0;
];
% Create the grid
[X, Y] = meshgrid(strides, sizes);
% Draw the 3-D surface
figure;
surf(X, Y, data);
% Set the colormap
colormap jet;
colorbar;
% Add labels
xlabel('Stride');
ylabel('Size (KB)');
set(gca, 'YDir', 'reverse')
zlabel('Throughput (KB/sec)');
title('Memory Mountain');
% Adjust the view angle
view(45, 30);
3. TLB measurement code:
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

#define ARRAY_SIZE (256 * 1024 * 1024) // 256MB
#define PAGE_SIZE 4096                 // 4KB
#define NUM_TRIALS 1000                // number of timed sweeps per stride

// Write one int per page of a scratch buffer to displace previously cached data
void clear_cache() {
    int* dummy = (int*)malloc(ARRAY_SIZE);
    if (dummy == NULL) {
        return;
    }
    for (size_t i = 0; i < ARRAY_SIZE / sizeof(int); i += PAGE_SIZE / sizeof(int)) {
        dummy[i] = (int)i;
    }
    free(dummy);
}

void measure_tlb_miss_rate() {
    int* array = (int*)malloc(ARRAY_SIZE);
    int stride, trial;
    size_t i;
    LARGE_INTEGER start, end, frequency;
    double elapsed_time;

    if (array == NULL) {
        printf("Memory allocation failed\n");
        return;
    }

    // Get the frequency of the high-resolution timer
    QueryPerformanceFrequency(&frequency);
    printf("Stride (pages)\tTime (ms)\n");
    for (stride = 1; stride <= 1024; stride *= 2) {
        elapsed_time = 0.0;
        for (trial = 0; trial < NUM_TRIALS; trial++) {
            // Flush the cache
            clear_cache();
            // Start timing
            QueryPerformanceCounter(&start);
            // Access the array, one element every `stride` pages
            for (i = 0; i < ARRAY_SIZE / sizeof(int); i += (size_t)stride * PAGE_SIZE / sizeof(int)) {
                array[i] = (int)i;
            }
            // Stop timing
            QueryPerformanceCounter(&end);
            // Accumulate the elapsed time in milliseconds
            elapsed_time += (double)(end.QuadPart - start.QuadPart) * 1000.0 / frequency.QuadPart;
        }
        // Compute the average time per sweep
        elapsed_time /= NUM_TRIALS;
        printf("%d\t\t%f\n", stride, elapsed_time);
    }
    free(array);
}

int main() {
    measure_tlb_miss_rate();
    return 0;
}
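Common virtual machine commands:
Lab 5: /*
Enter the Lab 5 folder (create it first): cd ~/Desktop/experiment_5
Compile main_a.c: gcc -o main_a main_a.c
Run main_a with different sizes (100 here): ./main_a 100
Compile main_a_better.c: gcc -o main_a_better main_a_better.c
Run the optimized main_a with different matrix sizes and block sizes (1500 and 512 here):
./main_a_better 1500 512
Extract mountain: tar -xvf mountain.tar
*/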
Original article: https://blog.csdn.net/m0_74268508/article/details/140588181