Speech
Speech is a service that makes it easy to integrate Google speech recognition technologies into developer applications. The Speech class in gcp-pilot provides a high-level interface to the Google Cloud Speech-to-Text API.
Installation
To use the Speech functionality, you need to install gcp-pilot with the speech extra:
pip install "gcp-pilot[speech]"
Usage
Initialization
from gcp_pilot.speech import Speech
# Initialize with default credentials
speech = Speech()
# Initialize with specific project
speech = Speech(project_id="my-project")
# Initialize with service account impersonation
speech = Speech(impersonate_account="service-account@project-id.iam.gserviceaccount.com")
Converting Speech to Text
From Audio Content
# Convert speech from audio content to text
with open("audio.flac", "rb") as audio_file:
    audio_content = audio_file.read()
transcripts = speech.speech_file_to_text(
    flac_content=audio_content,
    language="en-US",  # Optional: defaults to "en"
    rate=16000,  # Optional: sample rate in Hz, defaults to 44100
    long_running=False,  # Optional: if True, uses asynchronous recognition
)
for transcript in transcripts:
    print(f"Transcript: {transcript}")
From Audio URI
# Convert speech from a GCS URI to text
transcripts = speech.speech_uri_to_text(
    uri="gs://my-bucket/audio.flac",
    language="en-US",  # Optional: defaults to "en"
    rate=16000,  # Optional: sample rate in Hz, defaults to 44100
    long_running=False,  # Optional: if True, uses asynchronous recognition
)
for transcript in transcripts:
    print(f"Transcript: {transcript}")
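Because the uri argument is expected to point at an object in Google Cloud Storage, it can help to validate its shape before making a request. The following is a minimal sketch using only the standard library; parse_gcs_uri is a hypothetical helper name, not part of gcp-pilot, and it does not handle object names that contain "?" or "#".

```python
from urllib.parse import urlparse


def parse_gcs_uri(uri: str) -> tuple[str, str]:
    """Split a gs://bucket/object URI into (bucket, object_name).

    Raises ValueError if the URI is not a well-formed GCS object URI.
    """
    parsed = urlparse(uri)
    object_name = parsed.path.lstrip("/")
    # Require the gs scheme, a bucket name, and a non-empty object path.
    if parsed.scheme != "gs" or not parsed.netloc or not object_name:
        raise ValueError(f"Not a valid GCS object URI: {uri!r}")
    return parsed.netloc, object_name


bucket, object_name = parse_gcs_uri("gs://my-bucket/audio.flac")
print(bucket, object_name)  # my-bucket audio.flac
```

Running a check like this first turns a malformed URI into a local ValueError instead of a failed API call.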
Long-Running Recognition
For longer audio files (more than 1 minute), you should use long-running recognition:
# Convert speech from a GCS URI to text using long-running recognition
transcripts = speech.speech_uri_to_text(
    uri="gs://my-bucket/long-audio.flac",
    language="en-US",
    rate=16000,
    long_running=True,  # Use asynchronous recognition
)
for transcript in transcripts:
    print(f"Transcript: {transcript}")
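If you know the number of samples and the sample rate of a recording, the choice between the two modes can be computed rather than guessed. This is a small sketch under that assumption; needs_long_running and SYNC_LIMIT_SECONDS are hypothetical names, and the 60-second threshold simply mirrors the "more than 1 minute" guidance above.

```python
SYNC_LIMIT_SECONDS = 60.0  # threshold taken from the "more than 1 minute" guidance


def needs_long_running(total_samples: int, sample_rate: int) -> bool:
    """Return True when the audio is longer than the synchronous limit."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    # Duration in seconds is samples per channel divided by samples per second.
    duration_seconds = total_samples / sample_rate
    return duration_seconds > SYNC_LIMIT_SECONDS


# 90 seconds of 16 kHz audio exceeds the limit; 30 seconds does not.
print(needs_long_running(total_samples=90 * 16000, sample_rate=16000))  # True
print(needs_long_running(total_samples=30 * 16000, sample_rate=16000))  # False
```

The result can be passed directly as the long_running argument of speech_uri_to_text or speech_file_to_text.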
Supported Audio Formats
The Speech class currently supports the FLAC format. The audio must meet the following requirements:
- FLAC (Free Lossless Audio Codec) encoding
- A sample rate (in hertz) that matches the rate value you pass
- Single channel (mono) or 2 channels (stereo)
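These requirements can be checked locally before sending audio, because a FLAC stream records its sample rate and channel count in its leading STREAMINFO metadata block. The sketch below reads those two fields from raw bytes; read_flac_stream_info is a hypothetical helper, not part of gcp-pilot, and it assumes the data starts directly with the "fLaC" marker (for example, no prepended ID3 tag).

```python
def read_flac_stream_info(data: bytes) -> tuple[int, int]:
    """Return (sample_rate_hz, channels) from the start of a FLAC byte string."""
    # Layout: 4-byte "fLaC" marker, 4-byte metadata block header,
    # then the 34-byte STREAMINFO block (42 bytes in total).
    if len(data) < 42 or data[:4] != b"fLaC":
        raise ValueError("Not a FLAC stream (missing 'fLaC' marker)")
    if data[4] & 0x7F != 0:
        raise ValueError("First metadata block is not STREAMINFO")
    # STREAMINFO starts at offset 8; the 20-bit sample rate begins 10 bytes in.
    b18, b19, b20 = data[18], data[19], data[20]
    sample_rate = (b18 << 12) | (b19 << 4) | (b20 >> 4)
    channels = ((b20 >> 1) & 0x07) + 1  # stored as (channels - 1) in 3 bits
    return sample_rate, channels


# Synthetic header for demonstration: 16 kHz, mono, 16 bits per sample.
header = (
    b"fLaC"
    + bytes([0x80, 0x00, 0x00, 34])    # last-block flag set, type 0, length 34
    + bytes(10)                        # block and frame size fields (zeroed)
    + bytes([0x03, 0xE8, 0x00, 0xF0])  # sample rate, channels, bits per sample
    + bytes(20)                        # rest of sample count and MD5 (zeroed)
)
print(read_flac_stream_info(header))  # (16000, 1)
```

With a real file, you would pass the bytes read from disk and use the returned sample rate as the rate argument, rejecting audio whose channel count is greater than 2.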
Language Support
The Speech API supports a wide range of languages. Some common language codes include:
- en-US: English (United States)
- en-GB: English (United Kingdom)
- es-ES: Spanish (Spain)
- fr-FR: French (France)
- de-DE: German (Germany)
- ja-JP: Japanese (Japan)
- ru-RU: Russian (Russia)
For a complete list of supported languages, refer to the Google Cloud Speech-to-Text documentation.
Error Handling
The Speech class handles common errors and converts them to more specific exceptions:
from gcp_pilot import exceptions
try:
    transcripts = speech.speech_uri_to_text(uri="gs://non-existent-bucket/audio.flac")
except exceptions.NotFound:
    print("Audio file not found")
except exceptions.InvalidArgument as e:
    print(f"Invalid argument: {e}")
Working with Service Account Impersonation
Service account impersonation allows you to act as a service account without having its key file. This is a more secure approach than downloading and storing service account keys.
# Initialize with service account impersonation
speech = Speech(impersonate_account="service-account@project-id.iam.gserviceaccount.com")
# Now all operations will be performed as the impersonated service account
transcripts = speech.speech_uri_to_text(uri="gs://my-bucket/audio.flac")
For more information on service account impersonation, see the Authentication documentation.
Best Practices
Here are some best practices for working with the Speech API:
- Use the right sample rate: Ensure the sample rate you specify matches the actual audio sample rate.
- Choose the appropriate recognition mode: Use synchronous recognition for short audio (< 1 minute) and asynchronous recognition for longer audio.
- Use GCS URIs for large files: For large audio files, upload them to Google Cloud Storage and use the URI instead of sending the content directly.
- Specify the correct language: Providing the correct language code improves recognition accuracy.
- Consider using enhanced models: For better accuracy, consider using enhanced models available in the Speech API.
- Optimize audio quality: Better audio quality leads to better recognition results. Reduce background noise and ensure clear speech.
- Handle errors gracefully: Implement proper error handling to manage issues like invalid audio formats or network problems.
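For transient failures such as network problems, one common way to handle errors gracefully is to retry with exponential backoff. The sketch below is a generic helper, not part of gcp-pilot: call_with_retry is a hypothetical name, and because this page does not say which exceptions gcp-pilot raises for transient failures, the default retry_on types are built-in placeholders you would replace with the ones you actually observe.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retry_on: tuple[type[Exception], ...] = (ConnectionError, TimeoutError),
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on the given exception types with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, 4x, ...
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky_transcribe() -> list[str]:
    # Stand-in for a call such as speech.speech_uri_to_text(...).
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network problem")
    return ["hello world"]


print(call_with_retry(flaky_transcribe, base_delay=0.01))  # ['hello world']
```

Errors that will not succeed on retry, such as a missing file or an invalid argument, are deliberately left out of retry_on so that they propagate immediately.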