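The SIMD extensions are primarily aimed at graphics and vector processing and, as the article points out, do not directly support much beyond basic arithmetic, so anything more complex has to be approximated. (SSE does include a packed square root, SQRTPS, along with fast approximate reciprocal and reciprocal square root, RCPPS and RSQRTPS, but nothing for trig, logs and the like.) They do the bare-minimum operations in parallel very efficiently, which is ideal for pixels, but can also be useful in other areas such as audio or simulation. One thing to watch: MMX aliases its registers onto the x87 FPU register stack, so you can't intermix FPU instructions with MMX instructions unless you issue an EMMS in between. SSE has its own XMM registers, so that restriction doesn't apply to SSE code that stays out of the MMX registers.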
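To give a feel for the kind of approximation involved, here is a minimal sketch in C using the standard SSE intrinsics from `<xmmintrin.h>` (C rather than PB, since PB has no SSE support). The helper name `approx_sqrt4` is made up for illustration. It assumes an x86 CPU with SSE and strictly positive inputs. It takes the fast reciprocal square root estimate, refines it with one Newton-Raphson step, and multiplies by x to get an approximate square root for four floats at once.

```c
#include <xmmintrin.h>  /* SSE intrinsics (_mm_rsqrt_ps and friends) */

/* Hypothetical helper: approximate sqrt() of four floats in parallel.
   Assumes every input is strictly positive; an input of 0 would
   produce inf * 0 = NaN in the final multiply. */
static void approx_sqrt4(const float in[4], float out[4])
{
    const __m128 half         = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);

    __m128 x = _mm_loadu_ps(in);   /* unaligned load of 4 floats */
    __m128 y = _mm_rsqrt_ps(x);    /* rough estimate of 1/sqrt(x) */

    /* One Newton-Raphson step: y = y * (1.5 - 0.5 * x * y * y) */
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);
    y = _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));

    /* sqrt(x) = x * (1/sqrt(x)) */
    _mm_storeu_ps(out, _mm_mul_ps(x, y));
}
```

Calling it on an array holding 1, 4, 9 and 16 should give results very close to 1, 2, 3 and 4. Whether the refinement step is worth the extra multiplies depends on how much precision you need; for pixel work the raw estimate is often good enough.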
Current versions of PB go as far as supporting MMX but no further, so you will have to emit the raw opcodes yourself to invoke the SSE instructions.