Abstract: I describe VoxML (Visual Object Concept Modeling Language), a modeling language for constructing 3D visualizations of concepts denoted by natural language expressions. VoxML serves as the platform for creating multimodal semantic simulations in the context of human-computer communication. In our prototype system, people and avatars cooperate to build blocks-world structures through the interaction of language, gesture, vision, and action, providing a platform for studying the computational issues involved in multimodal communication. To establish elements of the common ground in discourse between speakers, VoxML is used to create an embodied 3D simulation that enables both the generation and interpretation of multiple modalities, including language, gesture, and the visualization of objects moving and agents acting in their environment. VoxML encodes objects with rich semantic typing and action affordances, and actions themselves as multimodal programs, enabling contextually salient inferences and decisions in the environment. I illustrate this with a walk-through of multimodal communication in a shared task.
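To give a concrete sense of what "objects with rich semantic typing and action affordances" might look like, the following is a minimal, hypothetical Python sketch of a VoxML-style object encoding (a "voxeme"). The field names (`lex`, `head`, `components`, `affordances`) and the affordance strings are illustrative simplifications, not the official VoxML attribute set.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified sketch of a VoxML-style voxeme.
# Field names and affordance notation are assumptions for illustration only.

@dataclass
class Voxeme:
    lex: str                       # the lexeme this voxeme visualizes
    head: str                      # geometric head type, e.g. "cylindroid"
    components: list = field(default_factory=list)   # named subparts
    affordances: list = field(default_factory=list)  # actions the object enables

# A block affords grasping and being placed on a surface.
block = Voxeme(
    lex="block",
    head="rectangular_prism",
    affordances=["grasp(agent, this)", "put(this, on(surface))"],
)

# A cup additionally affords containment via its interior.
cup = Voxeme(
    lex="cup",
    head="cylindroid",
    components=["interior", "handle"],
    affordances=["grasp(agent, this)", "put(contents, in(this))"],
)

def afforded_actions(obj: Voxeme) -> list:
    """Return the action programs this object makes available in context."""
    return obj.affordances

print(afforded_actions(block))
```

In a simulation, an agent would consult an object's affordance structure to decide which actions are contextually available, e.g. that a block can be stacked but only a cup can contain its contents.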