[AIGC Study] Bark Text-to-Audio (2): Long-Form Audio Generation

Originally, this tool could only generate audio clips no longer than about 13 seconds, but a new version released last month focuses on long-form audio generation.

https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb

Before starting, you need to install the required environment.

#@title Install the environment - must be run before generating any audio
! pip install git+https://github.com/suno-ai/bark.git

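import os

# Select the first GPU; set this before bark (and PyTorch) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import nltk  # used to split the script into sentences
import numpy as np
from IPython.display import Audio

from bark import SAMPLE_RATE, generate_audio, preload_models
from bark.api import semantic_to_waveform
from bark.generation import generate_text_semantic

# Sentence tokenizer data for nltk.sent_tokenize
# (newer NLTK releases may need nltk.download('punkt_tab') instead).
nltk.download('punkt')

# Download and cache the Bark model weights.
preload_models()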

I tried a few long-form examples:

#@title Generate long audio

speaker = "v2/en_speaker_6"

script = """
Hey, have you heard about this new text-to-audio model called "Bark"? 
Apparently, it's the most realistic and natural-sounding text-to-audio model 
out there right now. People are saying it sounds just like a real person speaking. 
I think it uses advanced machine learning algorithms to analyze and understand the 
nuances of human speech, and then replicates those nuances in its own speech output. 
It's pretty impressive, and I bet it could be used for things like audiobooks or podcasts. 
In fact, I heard that some publishers are already starting to use Bark to create audiobooks. 
It would be like having your own personal voiceover artist. I really think Bark is going to 
be a game-changer in the world of text-to-audio technology! [end]
""".replace("\n", " ").strip()

sentences = nltk.sent_tokenize(script)

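GEN_TEMP = 0.6  # sampling temperature for the semantic tokens

# 0.1 seconds of silence, inserted between sentences
silence = np.zeros(int(0.1 * SAMPLE_RATE))

pieces = []
for sentence in sentences:
    # Step 1: text -> semantic tokens
    semantic_tokens = generate_text_semantic(
        sentence,
        history_prompt=speaker,
        temp=GEN_TEMP,
        min_eos_p=0.05,  # controls how likely the generation is to end
    )

    # Step 2: semantic tokens -> waveform, using the same speaker prompt
    audio_array = semantic_to_waveform(semantic_tokens, history_prompt=speaker)
    pieces += [audio_array, silence.copy()]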

Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
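
The loop above generates one clip per sentence, so a single very long sentence can still run past the roughly 13-second limit mentioned earlier. Below is a minimal sketch of how one might pre-split such sentences at commas and semicolons before feeding them to the loop. The helper names `estimate_seconds` and `split_long_sentence`, the 2.5 words-per-second speaking rate, and the 13-second budget are all illustrative assumptions and are not part of Bark.

```python
import re

MAX_SECONDS = 13.0        # approximate per-call limit mentioned in this post
WORDS_PER_SECOND = 2.5    # assumed speaking rate; tune for your speaker


def estimate_seconds(text):
    """Rough spoken duration of `text`, estimated from its word count."""
    return len(text.split()) / WORDS_PER_SECOND


def split_long_sentence(sentence, max_seconds=MAX_SECONDS):
    """Split a sentence at commas/semicolons so each part fits the budget.

    Hypothetical helper: clauses are merged greedily while the estimate stays
    under `max_seconds`. A single clause that is still too long is kept as-is.
    """
    clauses = [c.strip() for c in re.split(r"(?<=[,;])\s+", sentence) if c.strip()]
    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            chunks.append(current)
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


short = "Hey, have you heard about Bark?"
long_sentence = ", ".join(["this clause has exactly six words"] * 10) + "."

print(split_long_sentence(short))
print([round(estimate_seconds(c), 1) for c in split_long_sentence(long_sentence)])
```

To use it with the cell above, one could replace `sentences` with a flattened list of `split_long_sentence(s)` for each `s` in `nltk.sent_tokenize(script)`. The word-count estimate is crude, so treat the budget as a safety margin rather than an exact duration.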

This is what the generated audio sounds like:

We can also generate long conversations:

#@title Generate a long dialogue

speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_6"}

script = """
Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?
John: No, I haven't. What's so special about it?
Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.
John: Wow, that sounds amazing. How does it work?
Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.
John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?
Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.
John: I can imagine. It would be like having your own personal voiceover artist.
Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology."""
script = script.strip().split("\n")
script = [s.strip() for s in script if s]
script

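pieces = []
silence = np.zeros(int(0.1 * SAMPLE_RATE))  # 0.1 seconds of silence between turns
for line in script:
    # Split on the first ": " only, so a colon inside the spoken text is preserved
    speaker, text = line.split(": ", 1)
    audio_array = generate_audio(text, history_prompt=speaker_lookup[speaker])
    pieces += [audio_array, silence.copy()]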

Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
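
The dialogue cell relies on every line having the form `Speaker: text` with a speaker that exists in `speaker_lookup`. As a sketch of that parsing step on its own, the snippet below turns a script into `(preset, text)` pairs and fails early on malformed lines. The function `parse_dialogue` is a hypothetical helper written for this post, not something Bark provides.

```python
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_6"}


def parse_dialogue(script, lookup):
    """Turn 'Name: text' lines into (preset, text) pairs.

    Hypothetical helper: splits on the first ': ' only, so colons inside the
    spoken text survive, and raises on lines that cannot be mapped to a voice.
    """
    turns = []
    for number, raw in enumerate(script.strip().splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        name, sep, text = line.partition(": ")
        if not sep or name not in lookup:
            raise ValueError(f"line {number}: expected 'Speaker: text', got {line!r}")
        turns.append((lookup[name], text.strip()))
    return turns


demo = """
Samantha: Here is the thing: Bark handles dialogue too.

John: Nice!
"""
for preset, text in parse_dialogue(demo, speaker_lookup):
    print(preset, "->", text)
```

Each resulting pair can then be passed to `generate_audio(text, history_prompt=preset)` exactly as in the loop above. Validating the script first avoids discovering a typo in a speaker name only after several slow generation calls.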

This is what the generated audio sounds like:

Supported sound effects:

[laughter], [laughs], [sighs], [music], [gasps], [clears throat]
— or ... for hesitations
♪ for song lyrics
CAPITALIZATION for emphasis of a word
[MAN] and [WOMAN] to bias Bark toward male and female speakers
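
These markers are written directly into the text prompt. Since a misspelled tag is easy to miss, here is a small sketch that scans a prompt for bracketed tags that are not in the list above. `KNOWN_TAGS` and `unknown_tags` are hypothetical names for this post. The set only mirrors the tags listed here and is not an official or exhaustive specification, and Bark itself does not perform this check.

```python
import re

# Tags listed in this post; treat the set as a starting point, not a spec.
KNOWN_TAGS = {
    "laughter", "laughs", "sighs", "music", "gasps", "clears throat",
    "MAN", "WOMAN",
}


def unknown_tags(prompt):
    """Return bracketed tags in `prompt` that are not in KNOWN_TAGS."""
    found = re.findall(r"\[([^\[\]]+)\]", prompt)
    return [tag for tag in found if tag not in KNOWN_TAGS]


prompt = "[WOMAN] I can't believe it... [laughs] that is SO funny! [laugh] ♪ la la la ♪"
print(unknown_tags(prompt))
```

The prompt string itself shows how the markers combine: a speaker hint, `...` for hesitation, a sound-effect tag, a capitalized word for emphasis, and `♪` around sung text. Here the check flags `laugh`, which is not in the list.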

You can also change the language and the voice.
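
The voice is chosen through the `history_prompt` string, as with `"v2/en_speaker_6"` in the cells above. The sketch below builds such preset strings from a language code and a speaker number. The helper `speaker_preset` is hypothetical, and the language list and the 0 to 9 index range reflect my understanding of Bark's bundled v2 presets, so verify them against the project's speaker library before relying on them.

```python
# Language codes used in Bark's v2 preset names, to the best of my knowledge
# (assumption: check the official speaker library for the current list).
LANGUAGES = ("en", "de", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr", "zh")


def speaker_preset(language, index):
    """Build a preset string such as 'v2/en_speaker_6' (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if not 0 <= index <= 9:
        raise ValueError("speaker index is assumed to run from 0 to 9")
    return f"v2/{language}_speaker_{index}"


print(speaker_preset("en", 6))
print(speaker_preset("ja", 2))
```

The returned string would be passed as `history_prompt` to `generate_audio`, with the text prompt written in the matching language.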

All of the results are impressive, and because Bark is released under the MIT license, it is also very friendly to commercial use.