1. Problem
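I need to copy 2048*1080 bytes (2,211,840 bytes) of memory. The simplest way is of course memcpy, but under our timing constraints memcpy takes too long, so the copy has to be optimized. We are on a DSP platform, which naturally provides fast instructions for this, yet my first attempt with them turned out even slower than memcpy, which meant something was wrong. Here is the code (my code quality, well...):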
#if 0
/* Original version: the buffer pointer expressions are repeated inside every call. */
for(i=0; i<2211840; i+=64)
{
    SUPER_LD32R((signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i),
                (signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+4),
                (unsigned char *)videoPacket->buffers[0].data+i, 0);

    SUPER_LD32R((signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+8),
                (signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+12),
                (unsigned char *)videoPacket->buffers[0].data+i+8, 0);

    SUPER_LD32R((signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+16),
                (signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+20),
                (unsigned char *)videoPacket->buffers[0].data+i+16, 0);

    SUPER_LD32R((signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+24),
                (signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+28),
                (unsigned char *)videoPacket->buffers[0].data+i+24, 0);

    SUPER_LD32R((signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+32),
                (signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+36),
                (unsigned char *)videoPacket->buffers[0].data+i+32, 0);

    SUPER_LD32R((signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+40),
                (signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+44),
                (unsigned char *)videoPacket->buffers[0].data+i+40, 0);

    SUPER_LD32R((signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+48),
                (signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+52),
                (unsigned char *)videoPacket->buffers[0].data+i+48, 0);

    SUPER_LD32R((signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+56),
                (signed long *)(compareSignal.tempMemP[compareSignal.auxFlag]+i+60),
                (unsigned char *)videoPacket->buffers[0].data+i+56, 0);
}
#else
/* Optimized version: look up both buffer pointers once, before the loop. */
unsigned char * tempP = compareSignalMem.tempMemP[compareSignal.auxFlag++];
unsigned char * tempPacketP = (unsigned char *)videoPacket->buffers[0].data;
for(i=0; i<2211840; i+=64)
{
    /* Destination address for this 64-byte chunk, computed once per iteration. */
    unsigned char * tempP1 = tempP+i;
    SUPER_LD32R((signed long *)(tempP1),    (signed long *)(tempP1+4),  tempPacketP+i,    0);
    SUPER_LD32R((signed long *)(tempP1+8),  (signed long *)(tempP1+12), tempPacketP+i+8,  0);
    SUPER_LD32R((signed long *)(tempP1+16), (signed long *)(tempP1+20), tempPacketP+i+16, 0);
    SUPER_LD32R((signed long *)(tempP1+24), (signed long *)(tempP1+28), tempPacketP+i+24, 0);
    SUPER_LD32R((signed long *)(tempP1+32), (signed long *)(tempP1+36), tempPacketP+i+32, 0);
    SUPER_LD32R((signed long *)(tempP1+40), (signed long *)(tempP1+44), tempPacketP+i+40, 0);
    SUPER_LD32R((signed long *)(tempP1+48), (signed long *)(tempP1+52), tempPacketP+i+48, 0);
    SUPER_LD32R((signed long *)(tempP1+56), (signed long *)(tempP1+60), tempPacketP+i+56, 0);
}
#endif
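a) Testing showed that fetching values from memory is very expensive, more expensive than arithmetic. So in the optimized version I hoisted the costly compareSignalMem.tempMemP[compareSignal.auxFlag++]; out of the for loop, and the running time went down.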
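b) The other thing I paid attention to is doing more work in each pass of the for loop so that the loop runs fewer times (loop unrolling). When the work per iteration is small but the iteration count is large, this is a way to cut down the number of iterations.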
c) The takeaway from solving this problem: reading values from memory really is rather time-consuming.
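Points a) and b) can be sketched in portable C as follows. This is only an illustration under assumptions, not the DSP code from the post: `SignalMem`, `copy8` and `copy_frame` are hypothetical names, and `copy8` is a plain-C stand-in for one 8-byte step, not the platform's SUPER_LD32R instruction. The sketch assumes the byte count is a multiple of 64, as 2211840 is.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the structure that holds the destination buffers. */
typedef struct {
    unsigned char *tempMemP[2]; /* destination buffers */
    int auxFlag;                /* index of the buffer to fill next */
} SignalMem;

/* Portable stand-in for one 8-byte load/store step (NOT the DSP intrinsic). */
static void copy8(unsigned char *dst, const unsigned char *src)
{
    uint64_t w;
    memcpy(&w, src, sizeof w);
    memcpy(dst, &w, sizeof w);
}

/* Copy n bytes from src into the buffer selected by auxFlag, then advance
 * auxFlag. Assumption: n is a multiple of 64. */
static void copy_frame(SignalMem *m, const unsigned char *src, size_t n)
{
    /* a) The loop-invariant lookup is done once, outside the loop. */
    unsigned char *dst = m->tempMemP[m->auxFlag++];
    size_t i;

    /* b) Each iteration handles 64 bytes, so the loop runs n/64 times. */
    for (i = 0; i < n; i += 64) {
        unsigned char *d = dst + i;
        const unsigned char *s = src + i;
        copy8(d,      s);
        copy8(d + 8,  s + 8);
        copy8(d + 16, s + 16);
        copy8(d + 24, s + 24);
        copy8(d + 32, s + 32);
        copy8(d + 40, s + 40);
        copy8(d + 48, s + 48);
        copy8(d + 56, s + 56);
    }
}
```

Whether hoisting by hand helps depends on the compiler: an optimizer can often do it by itself, but not always when the pointers might alias, so it is worth timing both versions on the target.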
That's all for now.