Currently building a chatbot with a multimodal transformer, and I'm convinced this is the future of human-computer interaction, not just some gimmicky "conversational AI