Actually, the barrel shifter was first introduced in the Intel 386 in the mid-80s, so it's nothing particularly new... It could do a variable-length shift in a small, fixed number of cycles regardless of the shift count, and so could the later 486 and Pentiums. Then in the P4 Intel simulated the shift instructions in microcode rather than with dedicated circuitry, making the P4 much slower for shifts (around 4 to 6 cycles). With the Intel Cores they went back to the old model... weird.
Even on a P4, the shifts must be faster than the ByteBuffer, because the latter involves object instantiation and method calls, each with relatively large processing (well over 6 cycles) and memory overheads. Even function calls in C can be noticeable in realtime apps, hence the "inline" keyword. Indeed, if I could in Java, I'd mark the getter methods on the class I created above "inline".
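To make the comparison concrete, here is a minimal sketch of the two approaches side by side. The class and method names (ShiftVsBuffer, readIntWithShifts, readIntWithBuffer) are made up for illustration and are not the class from the earlier post; it also assumes big-endian byte order, which is ByteBuffer's default.

```java
import java.nio.ByteBuffer;

// Hypothetical helper class, for illustration only.
final class ShiftVsBuffer {

    // Assemble a big-endian int from four bytes using shifts and masks.
    // The & 0xFF strips the sign extension Java applies when widening a byte.
    static int readIntWithShifts(byte[] b, int off) {
        return ((b[off]     & 0xFF) << 24)
             | ((b[off + 1] & 0xFF) << 16)
             | ((b[off + 2] & 0xFF) << 8)
             |  (b[off + 3] & 0xFF);
    }

    // Same result via ByteBuffer (big-endian by default).
    // wrap() creates a new buffer object on every call.
    static int readIntWithBuffer(byte[] b, int off) {
        return ByteBuffer.wrap(b).getInt(off);
    }

    public static void main(String[] args) {
        byte[] data = {0x12, 0x34, 0x56, 0x78};
        System.out.println(Integer.toHexString(readIntWithShifts(data, 0)));
        System.out.println(Integer.toHexString(readIntWithBuffer(data, 0)));
    }
}
```

Both methods return the same value; the difference is only in how much work happens per call. Reusing one wrapped buffer instead of calling wrap() each time would remove the allocation from the second version.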
Of course, it's all on a small scale, and a regular OS imposes so many other services and thread yields that the difference is probably negligible. If ByteBuffers are easier for what you're doing, then use them. If you were transforming a lot of network data, for example, a ByteBuffer might be more convenient, since the network bandwidth and not the CPU will most likely be the bottleneck. They can also be useful in direct mode for linking up efficiently with native JNI components.
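As a rough sketch of the direct-mode case, the following builds a small header in a direct buffer. The 6-byte layout (2-byte type, 4-byte length) and the names DirectBufferSketch and encodeHeader are assumptions made up for this example, not anything from a real protocol.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical example class; the header layout is invented for illustration.
final class DirectBufferSketch {

    // Encode a 2-byte type and a 4-byte length in network (big-endian) order.
    // allocateDirect() asks for memory outside the Java heap, which native
    // code can reach through JNI (e.g. GetDirectBufferAddress).
    static ByteBuffer encodeHeader(short type, int length) {
        ByteBuffer buf = ByteBuffer.allocateDirect(6).order(ByteOrder.BIG_ENDIAN);
        buf.putShort(type);
        buf.putInt(length);
        buf.flip(); // switch from writing to reading: position 0, limit 6
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer header = encodeHeader((short) 7, 1024);
        System.out.println(header.isDirect());
        System.out.println(header.remaining());
    }
}
```

Direct buffers are comparatively expensive to allocate, so in practice you would normally allocate one and reuse it rather than create a new one per message.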