On Fri, Jan 18, 2013 at 9:35 AM, Niels Möller nisse@lysator.liu.se wrote:
How does this algorithm compare to others?
If I read the gnutls code right, ecc_mulmod_cache makes the same number of ecc_projective_dbl_point calls as the non-cached version? If that is right, any comb-like algorithm should be able to beat it, since those algorithms reduce the number of point doublings and the number of iterations in the outer loop. Depending on selected table size, of course.
In the main loop there is no difference in the wmnaf code, but outside the loop the cached version does one doubling less and several additions less.
Do you have any ecc benchmarking code I could borrow?
I use gnutls-cli's benchmark-tls-kx option. It measures TLS key exchange time (one of them is ECDHE-ECDSA using an 192-bit curve).
Also for that part you don't need constant time. [...] (in ECDH you also don't care unless it is fixed key ECDH).
In ECDH, any data dependent timing does leak some information about your secret exponent. What you're saying, if I understand you correctly, is that since you usually use each secret exponent only once, an attacker will in most cases not be able to collect enough information to do any damage.
Yes. That's the way ECDHE is used in gnutls. No key is reused and the timing information from a single session isn't sufficient to recover it.
I would put that of lower priority that signing. Servers (where performance matters) rarely verify ECDSA keys.
Noted (although I'm not sure I agree it's less important than signing performance).
Sorry, indeed, that's for TLS only. In TLS usually only the server is authenticated. If there is client authentication too, then this is also important.
Why is that? Did you see any issues on the timing resistant version of the function?
Let me first explain what I mean when I talk about a "side-channel silent" function. That means that if we call the function with operands which are of the same size, but otherwise different, the function should execute exactly the same sequence of instructions in both cases, and access memory in exactly the same pattern.
If one then assumes that the underlying machine instructions have data independent timing (true for most current cpus), we leak no side information from timing or cache behaviour. We may still leak information through power analysis, if, e.g., executing a multiplication instructions consumes different amount of energy depending on the input bit patterns.
You are correct, but power analysis is useful for dedicated devices such as smart cards. A generic-purpose library will not be used there anyway. Protection from timing analysis is important, and while it is not precisely measured, the gnutls wmnaf code contains counter-measures for that. If you check the main loop in ecc_mulmod_cached_timing() there is an "else" case where a dummy addition is being performed. That is to make the number of operations used independent of the input data.
I'm not intimately familiar with wmnaf, but it looks like the function ecc_wMNAF has a lot of branches depending on the exponent bits.
Could be, but ecc_wMNAF is a fixed conversion that doesn't depend on any input from an attacker. That is the best you can do is associate the key with a time needed for its wmNAF conversion. I don't think this is better than the association with its public key.
Next, in the main ecc_mulmod functions, the timing resistant versions calls the madd primitive the same number of times independent of the values of the digits. But there are some small remaining leaks (by the above pretty high standard):
- If wmnaf_len depends on the exponent bit pattern, then so does the number of iterations through the loop.
This is a fixed association again. An attack cannot control wmnaf_len or the number of iterations. Timing attacks usually work by the attacker trying different inputs and measuring the different timings. In the cases you describe, as far as I understand, different input will not provide additional timing information.
- Branches depending on whether digit is > 0, < 0 or == 0. Those will have slightly different timing. Since digit == 0 is unlikely, that branch will be badly predicted. And it also affects instruction cache.
- For the data cache, accesses to the wmnaf table are obviously data dependent. Also, data is written to different locations in memory (R or T) depending on whether or not the current digit is non-zero.
Could be, these are very cpu dependent. If your argumentation is that a really safe timing resistant multiplication version has to be used, then I wouldn't disagree, if it's up to the level that is not so slow that no-one uses it. However, there is no reason for the code that does not need to be timing resistant, not to be as fast as possible.
In the point addition, it seems hard to avoid special cases when adding a point to itself or to its negation, or adding the zero point.
Do you mean the infinity points or the point (0,0,0)?
The infinity point, which is the unit element of the group. I denote it 0 below (P + 0 = P for all P).
I've seen some implementations do not handle that at all, but I agree that it has to be handled properly.
regards, Nikos