In this paper we revisited the parallel implementations of LEA. By taking the advantages of both the light-weight features of LEA and the parallel computation abilities of ARM-NEON platforms, performance is significantly improved. We firstly optimized the implementations on ARM and NEON architectures. For ARM processor, barrel shifter instruction is used to hide the latencies for rotation operations. For NEON engine, the minimum number of NEON registers are assigned to the round key variables by performing the on-time round key loading from ARM registers. This approach reduces the required NEON registers for round key variables by three registers and the registers and temporal registers are used to retain four more plaintext for encryption operation. Furthermore, we finely transform the data into SIMD format by using transpose and swap instructions. The compact ARM and NEON implementations are combined together and computed in mixed processing way. This approach hides the latency of ARM computations into NEON overheads. Finally, multiple cores are fully exploited to perform the maximum throughputs on the target devices. The proposed implementations achieved the fastest LEA encryption within 3.2 cycle/byte for Cortex-A9 processors.