Conversing with ChatGPT through a 3D character

Conversing with ChatGPT through a 3D character


Although it's not a novel attempt, we have finally conversed with ChatGPT through live voices and made the responses spoken by a 3D character. Please take a look at the video below.

We were able to create this sample in about two hours.
This is still the first stage of verification, so there are many points to be improved, such as the character and voice not matching and the unnatural lines, and so on.
We plan to continue to improve it and publish it as an article in the future.

Main technologies used


Now, I've listed the technologies used in this development below.


First of all, ChatGPT does not understand our live voices as they are. 

Also, the responses from ChatGPT are not spoken but returned as text. 

Therefore, by interposing the following technologies between the user's voice-based questions and the ChatGPT's text-based answers, we can create a communication that seems like we are talking to each other.


Speech-To-Text (process of converting voice into text)


In order for ChatGPT to understand the question, it is necessary to convert the voice of the questioner into text information. 

This is an automatic technology called "transcription". 

There are countless services that provide this technology online, but I've listed some of the representative ones below.



When you send voice information to the API provided by these services, the transcribed text will be returned instantly. 

We used Google Speech-To-Text for this verification.


Text-To-Speech (process of converting text into speech)


It is also necessary to convert the answer, which is returned as text information from ChatGPT, into a natural speaking voice. 

This is also called "voice synthesis." Here are some of the representative ones below.



When you send text information to the API provided by these services, the speech data that reads the text aloud is returned instantly. We used Google Text-to-Speech for this verification.


In addition, for Japanese language, there are many Japanese-specific services that synthesize so-called "anime voice."


Lip Sync (synchronization of voice and lips)


To make it look like the character is speaking the voice that Text-To-Speech reads aloud, you need to synchronize the character's lip movements with the voice. 

This technology is called lip sync.

The real-time voice is analyzed during playback to extract vowels from the waveform and move the character's lips. 

To use this technology, you need to prepare animations for moving the lips of the character's 3D model when generating each of the vowels "a", "e", "i", "o", and "u".


Flow of conversation with ChatGPT through voice


In summary, by utilizing the following technologies in order, the character will be able to respond to user's spoken questions with spoken answers:


Tips for achieving natural conversation


If you have used ChatGPT before, you probably know that it is generally very verbose in its responses. Additionally, since it returns responses in small chunks, it takes quite a bit of time for the full response to be written out. This is true even when using ChatGPT through its API. Waiting for this time delay makes for a very unnatural conversation.


Therefore, it is necessary to limit the maximum character count for ChatGPT's responses. The shorter the response, the faster the API response time will be. This also means that the Text-To-Speech synthesis process will be faster.


The easiest way to do this is to add a standardized phrase like "within 30 words" after transcribing the user's question with Speech-To-Text. 


For example, if a user asks "What is the difference between Onigiri and Omusubi?" you would send "What is the difference between Onigiri and Omusubi? Within 30 characters" to ChatGPT.


Without any limitations, verbose ChatGPT may begin with something like "Onigiri and omusubi are traditional and easy-to-eat Japanese foods that are centered around rice. Generally speaking,..." and so on.


With a limit of 30 words, ChatGPT's response will be something like "Onigiri has filling, is shaped like a triangle or sphere, and is wrapped in nori. Omusubi is seasoned with salt, formed into a ball, and eaten plain without any filling." Regardless of the truth, a pinpoint accurate answer will be returned.


If you further limit the response to 10 words, you will receive a succinct response that eliminates all unnecessary information such as "shape and filling".


To make the response seem more human-like, you can add specific details like "In 10 characters or less, in a female tone,". 

This may result in a response like "Difference in form and filling, dear!" which shows a nice attention to detail, such as replacing "shape" with "form".


Leave your Unity development using ChatGPT to Vitalify Asia!!


At Vitalify Asia, we develop cutting-edge apps using advanced technology in Unity every day. We specialize in game development, AR/VR app development, 2D illustration creation, and 3D modeling. In addition to the above, we can develop applications in a wide range of fields, so please feel free to contact us.