Embodied symbiotic assistants that see, act, infer and chat
We present Symbiote, an embodied home assistant that maps images from its camera into objects and rooms, builds geometric semantic maps, parses human instructions and conversations into user intents and their arguments, explores in a goal-directed way to find relevant objects (if they are not present in the map), executes the inferred action plans using its navigation and manipulation policies, and asks questions to clarify intents and arguments when needed. Our main contribution is a hybrid approach to the semantic parsing of user instructions and their mapping to suitable action routines. We propose a text-to-text neural encoder-decoder language parsing model that maps user instructions to sequences of simplified utterances. The generated utterances are then mapped to parameterized action primitives by a rule-based parser. Our neural parser benefits from large-scale text-to-text unsupervised language pre-training, and our rule-based parser effectively covers the domain of simplified single-step instructions that our neural model generates. Training our neural parser to map language utterances directly to parameterized action programs would not work, as the output space would fall far outside the text domain on which the neural model has been pre-trained. We present ablations and evaluations of the different modules of our agent. We discuss our failure modes, which mostly stem from inaccurate referential object instance grounding, instruction parsing errors, and perception failures. We outline current and future experiments and research directions in the realms of open-vocabulary spatiotemporal 2D and 3D perception, memory-augmented vision-language parsing networks that handle continual learning without forgetting, and fast few-shot learning during deployment and interaction with human users. We also discuss our present conversational strategies and how we plan to make them more creative and engaging for the user.
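The hybrid pipeline above (neural model emits simplified single-step utterances; a rule-based stage grounds them into parameterized primitives) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual grammar: the patterns, primitive names (`NAVIGATE`, `PICK`, `PLACE`), and function names are all hypothetical.

```python
import re

# Hypothetical patterns for the simplified utterances the neural parser
# is assumed to generate; each maps to a parameterized action primitive.
PATTERNS = [
    (re.compile(r"^go to (?:the )?(?P<room>\w+)$"), "NAVIGATE"),
    (re.compile(r"^pick up (?:the )?(?P<object>[\w ]+)$"), "PICK"),
    (re.compile(r"^put (?:the )?(?P<object>[\w ]+) on (?:the )?(?P<target>[\w ]+)$"), "PLACE"),
]

def parse_utterance(utterance: str):
    """Map one simplified utterance to (primitive, arguments), or None."""
    text = utterance.strip().lower()
    for pattern, primitive in PATTERNS:
        match = pattern.match(text)
        if match:
            return primitive, match.groupdict()
    # Unmatched utterances are where the agent could instead ask a
    # clarifying question about the intent or its arguments.
    return None

def parse_program(utterances):
    """Map a sequence of simplified utterances to an action program."""
    return [parse_utterance(u) for u in utterances]
```

For example, an instruction like "bring me the red cup from the kitchen" might be decomposed by the neural parser into ["go to the kitchen", "pick up the red cup"], which the rule-based stage then grounds into `("NAVIGATE", {"room": "kitchen"})` and `("PICK", {"object": "red cup"})`.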