As explained on the upstream bug, the pure SSE2 implementation of BLAKE2 is always slower than the reference implementation. On an Athlon64, it is as much as 2.5 times slower. It might be reasonable to disable the intrinsic variant if the CPU doesn't support at least SSSE3 (the lowest supported optimization level that may actually make the code faster). I suppose we should base that check on -march=, since that's what the upstream code relies on.
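For illustration only, here is a rough sketch of how such a check could look: ask the compiler which macros it predefines for the user's CFLAGS and see whether `__SSSE3__` is among them. The helper names (`defines_macro`, `compiler_macros`, `use_blake2_intrinsics`) are made up for this example and are not part of any ebuild or of the upstream build system. The only thing assumed about the compiler is that it accepts `-dM -E`, as GCC and Clang do.

```python
import shlex
import subprocess


def defines_macro(macro_dump, name):
    """Return True if a '#define NAME ...' line appears in a macro dump."""
    for line in macro_dump.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "#define" and fields[1] == name:
            return True
    return False


def compiler_macros(cc="cc", cflags=""):
    """Dump the macros predefined by the compiler for the given flags.

    Hypothetical helper: runs e.g. 'cc -march=native -dM -E -x c /dev/null'.
    """
    cmd = [cc, *shlex.split(cflags), "-dM", "-E", "-x", "c", "/dev/null"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def use_blake2_intrinsics(macro_dump):
    # Assumption from the comment above: SSSE3 is the lowest level at which
    # the intrinsic variant may beat the reference implementation, so plain
    # SSE2 is deliberately not enough.
    return defines_macro(macro_dump, "__SSSE3__")
```

With something like `use_blake2_intrinsics(compiler_macros("gcc", "-march=athlon64"))` the result would be expected to be false, since that target only goes up to SSE2, and the build could then fall back to the reference implementation.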
mgorny, from what I can see, your PR was merged upstream a while ago (congratulations!) and is included in Python 3.7.0. Can we consider this bug fixed, or do we want to backport the change to 3.6?
Nobody has reported a bug so far, so I suppose we should be good.