On Tue, Sep 13, 2011 at 1:17 PM, Niels Möller nisse@lysator.liu.se wrote:
Then you shouldn't need to bother about the lsh directory again. You have a symlink to the shared aclocal.m4 (and to some other shared files).
OK, I figured it out. I attach a working SSE2 detection.
Thanks. I'll have to think some more on how to organize this. Some properties I'd like to have:
- Don't require users to call any init function.
One could define memxor to jump via a function pointer, and have an initial value for that pointer which jumps to the routine to set the pointer to the right function, and then use it. Overwriting the pointer should be atomic, so no locking needed even for multithreaded programs.
I don't think locking is an issue if you only call a function on initialization. You can expect (and require) that a library isn't going to be initialized by multiple threads. However, I don't know of a portable way to do initialization transparently without an explicit function call.
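A minimal sketch of the self-initializing pointer idea described above, assuming a pointer store is atomic on the target (as the proposal does). All names here (demo_memxor, memxor_p, cpu_has_sse2_stub, memxor_sse2_stub) are hypothetical and not nettle's actual code; the "SSE2" routine is only a placeholder that reuses the C loop.

```c
#include <assert.h>
#include <stddef.h>

typedef void memxor_func(unsigned char *dst, const unsigned char *src, size_t n);

/* Portable byte-wise implementation. */
static void memxor_c(unsigned char *dst, const unsigned char *src, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Placeholder for an SSE2 routine that would live in an assembly file;
   here it simply reuses the C loop. */
static void memxor_sse2_stub(unsigned char *dst, const unsigned char *src, size_t n)
{
  memxor_c(dst, src, n);
}

/* Hypothetical CPU test; a real one would run cpuid. */
static int cpu_has_sse2_stub(void)
{
  return 0;
}

static memxor_func memxor_init;

/* The pointer initially targets the resolver, so no init call is needed. */
static memxor_func *memxor_p = memxor_init;

static void memxor_init(unsigned char *dst, const unsigned char *src, size_t n)
{
  /* Racing threads would all store the same value. This relies on the
     assumption that a pointer store is atomic on the target. */
  memxor_p = cpu_has_sse2_stub() ? memxor_sse2_stub : memxor_c;
  assert(memxor_p != memxor_init);
  memxor_p(dst, src, n);
}

/* Exported entry point; callers never see the pointer. */
void demo_memxor(unsigned char *dst, const unsigned char *src, size_t n)
{
  (*memxor_p)(dst, src, n);
}
```

The first call to demo_memxor goes through memxor_init, which picks an implementation and then forwards the call; later calls jump straight to the chosen routine. Strictly conforming code would probably use C11 atomics for the store instead of relying on the platform.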
- Avoid using gcc-specific things, including inline asm, in the C source files.
The cpuid test would then have to be moved to an assembly file.
Other obvious uses for cpu detection in nettle: * The AES code could check for the special aes instructions.
Indeed. Once a framework for overwriting functionality is set up, those would not be very hard to add. However, setting up such a framework in nettle seems to require substantial work, as all exported functions need to be replaced by function pointers, thus breaking the ABI. If this is done gradually (it has to be, as you never know what you will be able to optimize on a new processor), it would be worse, since every optimization added would break the ABI.
Maybe it makes sense to have a libgcrypt-like high-level interface, with optimizations used only there. The existing C API would remain an API to access the C implementation. This could also address the problem with optimized hash algorithms[0], since most cpu-assisted sha1 or sha256 implementations work on an output=hash(data,length) basis and do not map to the existing API.
[0]. http://www.mail-archive.com/openssl-dev@openssl.org/msg21787.html
* The serpent code can use %xmm and %ymm registers, when present. On x86_64, as far as I'm aware all current implementations have sse2, but one could check for, and make use of, the 256-bit %ymm registers.
I wouldn't care much about serpent optimizations :)
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
However, I don't know of a portable way to do initialization transparently without an explicit function call.
It works for gmp. Which doesn't imply that we should do it in exactly the same way, of course.
The cpuid test would then have to be moved to an assembly file.
Right.
Indeed. Once a framework for overwriting functionality is set up, those would not be very hard to add. However, setting up such a framework in nettle seems to require substantial work, as all exported functions need to be replaced by function pointers, thus breaking the ABI.
I don't think the function pointers should be exported. If "fat" library is enabled (default for x86), then the exported function should be
void memxor (...) { (*memxor_p)(...); }
I think one should have the possibility to choose between fat and non-fat builds, with the same ABI. There's going to be a small extra call overhead in the fat case.
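As a rough illustration of fat and non-fat builds sharing one ABI, the sketch below keeps the function pointer private and exports a single wrapper whose prototype is the same either way. DEMO_FAT, demo_xor, xor_p and the other names are hypothetical; a real build would presumably set the switch from the configure script and put the cpu-specific routine in assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical configure-time switch; define as 0 for a non-fat build. */
#ifndef DEMO_FAT
#define DEMO_FAT 1
#endif

/* Generic routine, present in both kinds of build. */
static void xor_generic(unsigned char *dst, const unsigned char *src, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    dst[i] ^= src[i];
}

#if DEMO_FAT
/* Stand-in for a cpu-specific routine; it reuses the generic loop. */
static void xor_alt(unsigned char *dst, const unsigned char *src, size_t n)
{
  xor_generic(dst, src, n);
}

/* Hypothetical feature test; a real one would query the processor. */
static int demo_cpu_has_feature(void)
{
  return 0;
}

/* Private to this file, so it is not part of the ABI. */
static void (*xor_p)(unsigned char *, const unsigned char *, size_t);

static void xor_select(void)
{
  xor_p = demo_cpu_has_feature() ? xor_alt : xor_generic;
}
#endif

/* The only exported symbol; identical prototype in both builds. */
void demo_xor(unsigned char *dst, const unsigned char *src, size_t n)
{
#if DEMO_FAT
  if (!xor_p)
    xor_select();
  assert(xor_p != NULL);
  (*xor_p)(dst, src, n);   /* the extra indirect call of the fat case */
#else
  xor_generic(dst, src, n);
#endif
}
```

Callers link against demo_xor in both configurations; only the body of the wrapper differs, which is where the small extra call overhead of the fat case comes from.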
As long as all implementations can use the same ctx structs, there should be no problem with the ABI. If we also want to support hardware accelerators which are like black boxes, then some API and/or ABI changes may be necessary.
I wouldn't care much about serpent optimizations :)
I'm not surprised ;-) But on processors which lack aes-instructions, but which have 256-bit %ymm-registers, serpent can most likely be twice as fast as aes if used in ctr mode (for the current code with 128-bit %xmm-registers, serpent speed was somewhere between aes-128 and aes-192 last time I measured).
/nisse
On 09/13/2011 03:56 PM, Niels Möller wrote:
Indeed. Once a framework for overwriting functionality is set up, those would not be very hard to add. However, setting up such a framework in nettle seems to require substantial work, as all exported functions need to be replaced by function pointers, thus breaking the ABI.
I don't think the function pointers should be exported. If "fat" library is enabled (default for x86), then the exported function should be
void memxor (...) { (*memxor_p)(...); }
I think one should have the possibility to choose between fat and non-fat builds, with the same ABI. There's going to be a small extra call overhead in the fat case.
Disabling the optional architectures would be possible, but reducing the call overhead in the "thin" case would require a lot of ifdefs. It would keep the external API intact, but the internals would look ugly.
As long as all implementations can use the same ctx structs, there should be no problem with the ABI. If we also want to support hardware accelerators which are like black boxes, then some API and/or ABI changes may be necessary.
This is not guaranteed. For example AES-NI and padlock require the AES key to be aligned to 16-byte boundaries, something that the current structures do not offer.
I wouldn't care much about serpent optimizations :)
I'm not surprised ;-) But on processors which lack aes-instructions, but which have 256-bit %ymm-registers, serpent can most likely be twice as fast as aes if used in ctr mode (for the current code with 128-bit %xmm-registers, serpent speed was somewhere between aes-128 and aes-192 last time I measured).
Why not use camellia as an alternative? It is a newer design than serpent and is pretty much standardized as the AES alternative.
In any case, I just noticed that on x86-64 you don't really need to detect SSE2; it is just there by default. So maybe the SSE2 xor can just replace the x86-64 xor. On plain x86, though, this is not the case.
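To make the x86-64 versus plain x86 distinction concrete, here is a small sketch that skips the runtime test on x86-64 and uses cpuid only on 32-bit x86. It relies on the <cpuid.h> header shipped with GCC and Clang, so it is compiler-specific (which is exactly why the test might end up in an assembly file instead); demo_have_sse2 is a hypothetical name and this is not the attached patch.

```c
#if defined(__x86_64__)

/* SSE2 is part of the x86-64 baseline, so no runtime test is needed. */
int demo_have_sse2(void)
{
  return 1;
}

#elif defined(__i386__) && defined(__GNUC__)

#include <cpuid.h>

int demo_have_sse2(void)
{
  unsigned int eax, ebx, ecx, edx;

  /* __get_cpuid returns 0 if the requested leaf is not supported. */
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    return 0;

  /* CPUID leaf 1, EDX bit 26 reports SSE2. */
  return (edx >> 26) & 1;
}

#else

/* Not an x86 target (or an unknown compiler): report no SSE2. */
int demo_have_sse2(void)
{
  return 0;
}

#endif
```

On x86-64 the function collapses to a constant, matching the observation that the SSE2 xor could simply replace the existing x86-64 one there.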
regards, Nikos
PS. The ECC patch just got very low in my priority stack. If anyone else is interested in porting it to nettle, they would be doing me a favor.
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
I think one should have the possibility to choose between fat and non-fat builds, with the same ABI. There's going to be a small extra call overhead in the fat case.
Disabling the optional architectures would be possible, but reducing the call overhead in the "thin" case would require a lot of ifdefs. It would keep the external API intact, but the internals would look ugly.
Even if fat is default on x86, the non-fat case is important for other architectures. I think the complexity will be manageable, and most of it will be in the configure script and assembly code, not in the C files.
This is not guaranteed. For example AES-NI and padlock require the AES key to be aligned to 16-byte boundaries, something that the current structures do not offer.
At least that's an ABI change which is harmless for other implementations. A different question is how to portably tell the C compiler that a certain structure must be 16-byte aligned.
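One possible answer to the alignment question, assuming a C11 compiler is acceptable, is the standard alignas specifier from <stdalign.h>. The struct below is a made-up layout for illustration, not nettle's struct aes_ctx, and demo_ctx_is_aligned is a hypothetical helper.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context layout; the key schedule is forced to a
   16-byte boundary, which raises the alignment of the whole struct. */
struct demo_aes_ctx
{
  alignas(16) uint32_t keys[60];
  unsigned nrounds;
};

/* Compile-time check that the requirement took effect. */
static_assert(alignof(struct demo_aes_ctx) == 16,
              "ctx must be 16-byte aligned");

/* Runtime check, e.g. for contexts handed in by a caller. */
int demo_ctx_is_aligned(const struct demo_aes_ctx *ctx)
{
  return ((uintptr_t) ctx->keys % 16) == 0;
}
```

This covers objects with static or automatic storage. Contexts placed in memory obtained by other means (a heap allocator, or a buffer inside a larger struct compiled by an older compiler) would still need to be checked or allocated with suitable alignment.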
In any case, I just noticed that on x86-64 you don't really need to detect SSE2; it is just there by default.
That's my understanding as well. You can test for it, but it's present in all existing x86_64 cpus.
/nisse
nettle-bugs@lists.lysator.liu.se