The human brain is one of the most complex dynamical systems, and it enables us to communicate in natural language. We have a good understanding of some principles underlying natural languages and language processing, some knowledge about the socio-cultural conditions framing acquisition, and some insights about where activity occurs in the brain. However, we do not yet understand the behavioural and mechanistic characteristics of natural language, or how mechanisms in the brain allow us to acquire and process language. In an effort to bridge the gap between insights from behavioural psychology and neuroscience, the goal of this paper is to contribute a computational understanding of the characteristics that favour language acquisition in a brain-inspired neural architecture. Accordingly, we provide concepts and refinements in cognitive modelling regarding principles and mechanisms in the brain, such as the hierarchical abstraction of context, in a plausible recurrent architecture. On this basis, we propose a neurocognitively plausible model for embodied language acquisition from the real-world interaction of a humanoid robot with its environment. The model is capable of learning language production grounded in both temporally dynamic somatosensation and vision. In particular, the architecture consists of a continuous-time recurrent neural network in which parts have different leakage characteristics and thus operate on multiple timescales for each modality, and in which the higher-level nodes of all modalities are associated into cell assemblies. Thus, this model features hierarchical concept abstraction in sensation as well as concept decomposition in production, multi-modal integration, and self-organisation of latent representations.
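To make the idea of parts with different leakage characteristics concrete, the following is a minimal sketch of a leaky continuous-time recurrent network with two groups of units that differ only in their time constant. It is an illustration under stated assumptions, not the model proposed here: the Euler discretisation with unit step, the tanh activation, the layer sizes, the values of `tau`, the random untrained weights, and the function name `ctrnn_step` are all hypothetical choices made for this example.

```python
import numpy as np


def ctrnn_step(u, x, W_rec, W_in, b, tau):
    """One Euler step (dt = 1) of a leaky continuous-time RNN.

    u   : internal states of all units, shape (n,)
    x   : external input, shape (n_in,)
    tau : per-unit time constants (>= 1); a larger tau means more
          leakage-limited, slower dynamics
    """
    y = np.tanh(u)                      # unit activations
    drive = W_rec @ y + W_in @ x + b    # recurrent plus external drive
    # Each unit moves a fraction 1/tau of the way towards its drive.
    return (1.0 - 1.0 / tau) * u + (1.0 / tau) * drive


rng = np.random.default_rng(0)

# Assumed sizes and timescales, chosen only for illustration.
n_fast, n_slow, n_in = 8, 4, 3
n = n_fast + n_slow
tau = np.concatenate([np.full(n_fast, 2.0), np.full(n_slow, 50.0)])

W_rec = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
W_in = rng.normal(0.0, 1.0, size=(n, n_in))
W_in[n_fast:] = 0.0   # assumption: only the fast units see the input
b = np.zeros(n)

# Drive the network with noise and record the internal states.
steps = 200
u = np.zeros(n)
trace = np.empty((steps, n))
for t in range(steps):
    x = rng.normal(size=n_in)
    u = ctrnn_step(u, x, W_rec, W_in, b, tau)
    trace[t] = u

# Mean absolute change per step, averaged within each group.
step_change = np.abs(np.diff(trace, axis=0)).mean(axis=0)
fast_change = step_change[:n_fast].mean()
slow_change = step_change[n_fast:].mean()
print(f"fast units: {fast_change:.4f}, slow units: {slow_change:.4f}")
```

In this sketch the slow group changes far less from one step to the next than the fast group, which is the property that lets slowly varying units hold more abstract context while fast units follow the details of the input. A trained model would additionally learn the weights and, for multi-modal integration, connect the slow units of several such modality-specific networks.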