The audio as currently implemented in the demo appears to be a fixed binaural recording, i.e. it 'moves with your head' rather than being fixed to the environment or to an object in the environment (e.g. the Domina's head). I assume this was done because recording binaural audio is a relatively well-known and well-documented process.
This may already be known, as the room immediately prior to the Trance/Binaural Session selection room has an environmentally positioned audio track, but Oculus have a 3D audio SDK available that can effectively do 'in software' what a binaural recording rig does physically: model the reflections and interference of sound with a person's head, shoulders and outer ears. The advantage over a fixed binaural recording is that the result is dynamically updated as the user moves their head, so the perceived audio remains correct for all head positions rather than just the one used at recording time.
Some audio sources are more suitable to be fixed to the head (e.g. binaural beats, the 'backing voices') rather than the environment, but the current demo at least would be enhanced by the Domina's voice emanating from the Domina herself. 
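To illustrate the difference between head-fixed and environment-fixed sources, here is a minimal Python sketch. It is not the Oculus SDK and does not model the head, shoulders or ears at all; it uses plain constant-power stereo panning as a crude stand-in, and every name in it (`relative_azimuth`, `pan_gains`, `gains_for_source`) is made up for this example. The only point is that an environment-fixed source has its direction recomputed from the current head yaw, while a head-fixed source ignores it.

```python
import math

def relative_azimuth(head_yaw_deg, source_bearing_deg):
    """Direction of a world-anchored source relative to where the listener
    is facing, in degrees in the range (-180, 180]. Positive means the
    source is to the listener's right."""
    diff = (source_bearing_deg - head_yaw_deg) % 360.0
    if diff > 180.0:
        diff -= 360.0
    return diff

def pan_gains(azimuth_deg):
    """Constant-power (left, right) gains for a given azimuth.
    Assumption: this is simple stereo panning, not an HRTF; anything
    beyond 90 degrees to either side is clamped to hard left/right."""
    az = max(-90.0, min(90.0, azimuth_deg))
    theta = (az + 90.0) / 180.0 * (math.pi / 2.0)  # 0 (left) .. pi/2 (right)
    return math.cos(theta), math.sin(theta)

def gains_for_source(head_yaw_deg, source_bearing_deg, head_locked):
    """Head-locked sources (e.g. binaural beats, backing voices) treat the
    bearing as relative to the head and ignore head yaw. World-anchored
    sources (e.g. the Domina's voice) are re-evaluated against head yaw."""
    if head_locked:
        azimuth = source_bearing_deg
    else:
        azimuth = relative_azimuth(head_yaw_deg, source_bearing_deg)
    return pan_gains(azimuth)

# The Domina stands straight ahead in the world (bearing 0).
# The listener turns their head 90 degrees to the right.
left, right = gains_for_source(90.0, 0.0, head_locked=False)
print(f"world-anchored voice: L={left:.2f} R={right:.2f}")  # voice moves to the left ear

left, right = gains_for_source(90.0, 0.0, head_locked=True)
print(f"head-locked backing:  L={left:.2f} R={right:.2f}")  # stays centred
```

A real spatialiser would replace `pan_gains` with per-ear filtering that also accounts for elevation, distance and front/back cues, but the per-frame 'recompute direction from head pose' step is the same idea.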
Lipsync would be appreciated, but is understandably a lot of extra animation work if baseline data is not captured at the time of recording. There are free-to-use facial capture programs available (http://blog.mashape.com/list-of-10-face-detection-recognition-apis/), as well as better-supported commercial software, that can capture lip movement and facial expression with a webcam at the time of recording for use in later animation. If you are not already doing so, at least capturing raw webcam video (1280x720 @ 60fps would be a reasonable baseline) during recording would make future animation easier.