[ad_1]
It’s been precisely a decade since I began attending GeekCon (sure, a geeks’ convention 🙂) — a weekend-long hackathon-makeathon by which all initiatives have to be ineffective and just-for-fun, and this yr there was an thrilling twist: all initiatives had been required to include some type of AI.
My group’s mission was a speech-to-text-to-speech sport, and right here’s the way it works: the person selects a personality to speak to, after which verbally expresses something they’d prefer to the character. This spoken enter is transcribed and despatched to ChatGPT, which responds as if it had been the character. The response is then learn aloud utilizing text-to-speech expertise.
Now that the sport is up and working, bringing laughs and enjoyable, I’ve crafted this how-to information that can assist you create an analogous sport by yourself. All through the article, we’ll additionally discover the varied issues and selections we made in the course of the hackathon.
Need to see the complete code? Here is the link!
As soon as the server is working, the person will hear the app “speaking”, prompting them to decide on the determine they wish to speak to and begin conversing with their chosen character. Every time they wish to speak out loud — they need to press and maintain a key on the keyboard whereas speaking. After they end speaking (and launch the important thing), their recording can be transcribed by Whisper
(a text-to-speech mannequin by OpenAI
), and the transcription can be despatched to ChatGPT
for a response. The response can be learn out loud utilizing a text-to-speech library, and the person will hear it.
Disclaimer
Word: The mission was developed on a Home windows working system and incorporates the pyttsx3
library, which lacks compatibility with M1/M2 chips. As pyttsx3
isn’t supported on Mac, customers are suggested to discover various text-to-speech libraries which are appropriate with macOS environments.
Openai Integration
I utilized two OpenAI
fashions: Whisper
, for speech-to-text transcription, and the ChatGPT
API for producing responses based mostly on the person’s enter to their chosen determine. Whereas doing so prices cash, the pricing mannequin could be very low-cost, and personally, my invoice remains to be below $1 for all my utilization. To get began, I made an preliminary deposit of $5, and so far, I’ve not exhausted this accretion, and this preliminary deposit received’t expire till a yr from now.
I’m not receiving any fee or advantages from OpenAI
for penning this.
When you get your OpenAI
API key — set it as an setting variable to make use of upon making the API calls. Make sure that to not push your key to the codebase or any public location, and to not share it unsafely.
Speech to Textual content — Create Transcription
The implementation of the speech-to-text characteristic was achieved utilizing Whisper
, an OpenAI
mannequin.
Beneath is the code snippet for the operate chargeable for transcription:
async def get_transcript(audio_file_path: str,
text_to_draw_while_waiting: str) -> Non-obligatory[str]:
openai.api_key = os.environ.get("OPENAI_API_KEY")
audio_file = open(audio_file_path, "rb")
transcript = Noneasync def transcribe_audio() -> None:
nonlocal transcript
attempt:
response = openai.Audio.transcribe(
mannequin="whisper-1", file=audio_file, language="en")
transcript = response.get("textual content")
besides Exception as e:
print(e)
draw_thread = Thread(goal=print_text_while_waiting_for_transcription(
text_to_draw_while_waiting))
draw_thread.begin()
transcription_task = asyncio.create_task(transcribe_audio())
await transcription_task
if transcript is None:
print("Transcription not out there inside the specified timeout.")
return transcript
This operate is marked as asynchronous (async) for the reason that API name might take a while to return a response, and we await it to make sure that this system doesn’t progress till the response is obtained.
As you may see, the get_transcript
operate additionally invokes the print_text_while_waiting_for_transcription
operate. Why? Since acquiring the transcription is a time-consuming process, we needed to maintain the person knowledgeable that this system is actively processing their request and never caught or unresponsive. In consequence, this textual content is step by step printed because the person awaits the subsequent step.
String Matching Utilizing FuzzyWuzzy for Textual content Comparability
After transcribing the speech into textual content, we both utilized it as is, or tried to match it with an current string.
The comparability use circumstances had been: deciding on a determine from a predefined record of choices, deciding whether or not to proceed enjoying or not, and when opting to proceed – deciding whether or not to decide on a brand new determine or keep on with the present one.
In such circumstances, we needed to match the person’s spoken enter transcription with the choices in our lists, and due to this fact we determined to make use of the FuzzyWuzzy
library for string matching.
This enabled selecting the closest choice from the record, so long as the matching rating exceeded a predefined threshold.
Right here’s a snippet of our operate:
def detect_chosen_option_from_transcript(
transcript: str, choices: Checklist[str]) -> str:
best_match_score = 0
best_match = ""for choice in choices:
rating = fuzz.token_set_ratio(transcript.decrease(), choice.decrease())
if rating > best_match_score:
best_match_score = rating
best_match = choice
if best_match_score >= 70:
return best_match
else:
return ""
If you wish to study extra concerning the FuzzyWuzzy
library and its features — you may try an article I wrote about it here.
Get ChatGPT Response
As soon as we now have the transcription, we are able to ship it over to ChatGPT
to get a response.
For every ChatGPT
request, we added a immediate asking for a brief and humorous response. We additionally instructed ChatGPT
which determine to faux to be.
So our operate seemed as follows:
def get_gpt_response(transcript: str, chosen_figure: str) -> str:
system_instructions = get_system_instructions(chosen_figure)
attempt:
return make_openai_request(
system_instructions=system_instructions,
user_question=transcript).decisions[0].message["content"]
besides Exception as e:
logging.error(f"couldn't get ChatGPT response. error: {str(e)}")
increase e
and the system directions seemed as follows:
def get_system_instructions(determine: str) -> str:
return f"You present humorous and quick solutions. You might be: {determine}"
Textual content to Speech
For the text-to-speech half, we opted for a Python library known as pyttsx3
. This alternative was not solely easy to implement but additionally supplied a number of extra benefits. It’s freed from cost, gives two voice choices — female and male — and means that you can choose the talking fee in phrases per minute (speech pace).
When a person begins the sport, they decide a personality from a predefined record of choices. If we couldn’t discover a match for what they stated inside our record, we’d randomly choose a personality from our “fallback figures” record. In each lists, every character was related to a gender, so our text-to-speech operate additionally obtained the voice ID equivalent to the chosen gender.
That is what our text-to-speech operate seemed like:
def text_to_speech(textual content: str, gender: str = Gender.FEMALE.worth) -> None:
engine = pyttsx3.init()engine.setProperty("fee", WORDS_PER_MINUTE_RATE)
voices = engine.getProperty("voices")
voice_id = voices[0].id if gender == "male" else voices[1].id
engine.setProperty("voice", voice_id)
engine.say(textual content)
engine.runAndWait()
The Essential Stream
Now that we’ve roughly received all of the items of our app in place, it’s time to dive into the gameplay! The principle movement is printed beneath. You may discover some features we haven’t delved into (e.g. choose_figure
, play_round
), however you may discover the complete code by checking out the repo. Ultimately, most of those higher-level features tie into the inner features we’ve coated above.
Right here’s a snippet of the primary sport movement:
import asynciofrom src.handle_transcript import text_to_speech
from src.main_flow_helpers import choose_figure, begin, play_round,
is_another_round
def farewell() -> None:
farewell_message = "It was nice having you right here, "
"hope to see you once more quickly!"
print(f"n{farewell_message}")
text_to_speech(farewell_message)
async def get_round_settings(determine: str) -> dict:
new_round_choice = await is_another_round()
if new_round_choice == "new determine":
return {"determine": "", "another_round": True}
elif new_round_choice == "no":
return {"determine": "", "another_round": False}
elif new_round_choice == "sure":
return {"determine": determine, "another_round": True}
async def fundamental():
begin()
another_round = True
determine = ""
whereas True:
if not determine:
determine = await choose_figure()
whereas another_round:
await play_round(chosen_figure=determine)
user_choices = await get_round_settings(determine)
determine, another_round =
user_choices.get("determine"), user_choices.get("another_round")
if not determine:
break
if another_round is False:
farewell()
break
if __name__ == "__main__":
asyncio.run(fundamental())
We had a number of concepts in thoughts that we didn’t get to implement in the course of the hackathon. This was both as a result of we didn’t discover an API we had been happy with throughout that weekend, or because of the time constraints stopping us from growing sure options. These are the paths we didn’t take for this mission:
Matching the Response Voice with the Chosen Determine’s “Precise” Voice
Think about if the person selected to speak to Shrek, Trump, or Oprah Winfrey. We needed our text-to-speech library or API to articulate responses utilizing voices that matched the chosen determine. Nevertheless, we couldn’t discover a library or API in the course of the hackathon that supplied this characteristic at an inexpensive value. We’re nonetheless open to strategies you probably have any =)
Let the Customers Discuss to “Themselves”
One other intriguing thought was to immediate customers to supply a vocal pattern of themselves talking. We might then practice a mannequin utilizing this pattern and have all of the responses generated by ChatGPT learn aloud within the person’s personal voice. On this situation, the person might select the tone of the responses (affirmative and supportive, sarcastic, offended, and so on.), however the voice would carefully resemble that of the person. Nevertheless, we couldn’t discover an API that supported this inside the constraints of the hackathon.
Including a Frontend to Our Software
Our preliminary plan was to incorporate a frontend element in our software. Nevertheless, on account of a last-minute change within the variety of members in our group, we determined to prioritize the backend improvement. In consequence, the applying at the moment runs on the command line interface (CLI) and doesn’t have frontend aspect.
Latency is what bothers me most for the time being.
There are a number of parts within the movement with a comparatively excessive latency that for my part barely hurt the person expertise. For instance: the time it takes from ending offering the audio enter and receiving a transcription, and the time it takes for the reason that person presses a button till the system really begins recording the audio. So if the person begins speaking proper after urgent the important thing — there can be no less than one second of audio that received’t be recorded on account of this lag.
Need to see the entire mission? It’s right here!
Additionally, heat credit score goes to Lior Yardeni, my hackathon companion with whom I created this sport.
On this article, we realized the best way to create a speech-to-text-to-speech sport utilizing Python, and intertwined it with AI. We’ve used the Whisper
mannequin by OpenAI
for speech recognition, performed round with the FuzzyWuzzy
library for textual content matching, tapped into ChatGPT
’s conversational magic through their developer API, and introduced all of it to life with pyttsx3
for text-to-speech. Whereas OpenAI
’s companies (Whisper
and ChatGPT
for builders) do include a modest value, it’s budget-friendly.
We hope you’ve discovered this information enlightening and that it’s motivating you to embark in your initiatives.
Cheers to coding and enjoyable! 🚀
[ad_2]
Source link