The use of natural language for robot control is essential for developing successful human-friendly robotic systems. Although it is quite difficult to realize robots with cognitive capabilities high enough to understand natural instructions as humans do, there is high potential for introducing voice interfaces into most existing robotic systems. While there has been some interesting work in this domain, the scope and efficiency of natural-language-controlled robots are usually limited by constraints on the number of built-in commands, the amount of information contained in a command, the reuse of excessive commands, etc. We present a multimodal interface for a robotic manipulator that can learn from both the human user's voice instructions and vision input to overcome some of these drawbacks. Results of three experiments, i.e., learning situations, learning actions, and learning objects, are presented.