Audio course
Working with audio data
Loading an audio dataset
We'll mainly use Hugging Face's datasets library here; install it with:
pip install datasets[audio]
Then download a dataset in one step with the load_dataset() function.
As an example we'll use MINDS-14, which contains recordings of people asking an e-banking system questions in several languages and dialects.
Find the dataset on the Hub; here we are only interested in the Australian subset (en-AU) and limit it to the train split:
from datasets import load_dataset
minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds
Output:
Dataset(
{
features: [
"path",
"audio",
"transcription",
"english_transcription",
"intent_class",
"lang_id",
],
num_rows: 654,
}
)
The dataset contains 654 audio files, each paired with a transcription, an English translation, and a label indicating the intent of the query. The audio column holds the raw audio data. Let's look at the first example:
example = minds[0]
example
Output:
{
"path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
"audio": {
"path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
"array": array(
[0.0, 0.00024414, -0.00024414, ..., -0.00024414, 0.00024414, 0.0012207],
dtype=float32,
),
"sampling_rate": 8000,
},
"transcription": "I would like to pay my electricity bill using my card can you please assist",
"english_transcription": "I would like to pay my electricity bill using my card can you please assist",
"intent_class": 13,
"lang_id": 2,
}
The audio column contains several features (the short sketch after this list shows how they fit together):
- path: the path to the audio file
- array: the decoded audio data, represented as a 1-dimensional NumPy array
- sampling_rate: the sampling rate of the audio file (8000 Hz in this example)
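As a quick illustration (not from the course), the clip's duration in seconds is simply the number of samples in array divided by sampling_rate:
audio = example["audio"]
# duration in seconds = number of samples / samples per second
duration = len(audio["array"]) / audio["sampling_rate"]
print(f"{duration:.2f} s at {audio['sampling_rate']} Hz")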
intent_class is the classification category of the recording. To convert this number into a meaningful string, use int2str():
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])
Output:
"pay_bill"
If you plan to train an audio classifier on this subset, you won't need all of the features. For example, lang_id has the same value for every example, so it carries no information. english_transcription most likely duplicates transcription in this English subset, so we can safely remove both columns.
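As a quick sanity check (not part of the course), you can confirm that lang_id really is constant before dropping it:
# expect a single value, e.g. {2} as in the example above
print(set(minds["lang_id"]))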
Use 🤗 Datasets' remove_columns method to drop the irrelevant features:
columns_to_remove = ["lang_id", "english_transcription"]
minds = minds.remove_columns(columns_to_remove)
minds
Output:
Dataset({features: ["path", "audio", "transcription", "intent_class"], num_rows: 654})
Now let's listen to a few examples. Using Blocks and Audio from Gradio, we can decode a few random samples from the dataset:
import gradio as gr
def generate_audio():
    example = minds.shuffle()[0]
    audio = example["audio"]
    return (
        audio["sampling_rate"],
        audio["array"],
    ), id2label(example["intent_class"])

with gr.Blocks() as demo:
    with gr.Column():
        for _ in range(4):
            audio, label = generate_audio()
            output = gr.Audio(audio, label=label)

demo.launch(debug=True)
If you like, you can also visualize the waveform of an example:
import librosa
import matplotlib.pyplot as plt
import librosa.display
array = example["audio"]["array"]
sampling_rate = example["audio"]["sampling_rate"]
plt.figure().set_figwidth(12)
librosa.display.waveshow(array, sr=sampling_rate)

Preprocessing the dataset
Before you can train a model or run inference with it, you need to preprocess the data. This generally involves the following steps:
- Resampling the audio data
- Filtering the dataset
- Converting the audio into the input format the model expects
Resampling
The sampling rate a dataset was recorded at often isn't the one your model expects; if they differ, you need to resample.
Most pretrained models expect audio at 16 kHz, while our earlier example is sampled at 8 kHz, which means we have to upsample.
Use cast_column for this. The operation does not change the audio in place; instead it signals the dataset to resample the audio examples on the fly as they are loaded:
from datasets import Audio
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
Reload the first example and you'll see it is now at the sampling rate we wanted:
minds[0]
Output:
{
"path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
"audio": {
"path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
"array": array(
[
2.0634243e-05,
1.9437837e-04,
2.2419340e-04,
...,
9.3852862e-04,
1.1302452e-03,
7.1531429e-04,
],
dtype=float32,
),
"sampling_rate": 16000,
},
"transcription": "I would like to pay my electricity bill using my card can you please assist",
"intent_class": 13,
}
The array values are different too, and there are now roughly twice as many of them.
Note that upsampling cannot recover information that was never captured: by the Nyquist sampling theorem, an 8 kHz recording only contains frequencies up to 4 kHz, so resampling to 16 kHz merely interpolates between the existing samples.
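As a rough sanity check (not from the course), the number of samples divided by the new sampling rate should still give the original clip duration:
resampled = minds[0]["audio"]
# ~2x as many samples as at 8 kHz, but the same duration in seconds
print(len(resampled["array"]), len(resampled["array"]) / resampled["sampling_rate"])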
Filtering
For example, we can filter out any audio longer than 20 seconds to avoid running out of memory during training.
Here's a simple filtering function:
MAX_DURATION_IN_SECONDS = 20.0
def is_audio_length_in_range(input_length):
    return input_length < MAX_DURATION_IN_SECONDS
The filtering function can be applied to a dataset column, but this dataset doesn't have a column with the track durations. We can create one from the audio files, filter on it, and then remove it again:
# use librosa to get example's duration from the audio file
new_column = [librosa.get_duration(path=x) for x in minds["path"]]
minds = minds.add_column("duration", new_column)
# use 🤗 Datasets' `filter` method to apply the filtering function
minds = minds.filter(is_audio_length_in_range, input_columns=["duration"])
# remove the temporary helper column
minds = minds.remove_columns(["duration"])
minds
Output:
Dataset({features: ["path", "audio", "transcription", "intent_class"], num_rows: 624})
The dataset is now down to 624 examples.
Preprocessing audio data
Transformers provides feature extractors that convert raw audio into the input format a model expects.
Whisper's feature extractor first pads/truncates each example, then converts it to the frequency domain as a log-mel spectrogram:
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
Write a function that applies the feature extractor to a single audio example:
def prepare_dataset(example):
    audio = example["audio"]
    features = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"], padding=True
    )
    return features
Apply it to all training examples with map:
minds = minds.map(prepare_dataset)
minds
Output:
Dataset(
{
features: ["path", "audio", "transcription", "intent_class", "input_features"],
num_rows: 624,
}
)
Now we have log-mel spectrograms stored as input_features in the dataset.
Let's visualize one:
import numpy as np
example = minds[0]
input_features = example["input_features"]
plt.figure().set_figwidth(12)
librosa.display.specshow(
    np.asarray(input_features[0]),
    x_axis="time",
    y_axis="mel",
    sr=feature_extractor.sampling_rate,
    hop_length=feature_extractor.hop_length,
)
plt.colorbar()

Many audio tasks are multimodal, and Transformers also provides model-specific tokenizers to process the text inputs.
You can load the feature extractor and tokenizer separately, or load both at once through a processor. To keep things simple, use AutoProcessor to load a checkpoint's feature extractor and processor in one go:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("openai/whisper-small")
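A minimal sketch of how the processor handles both modalities at once (assuming the minds dataset and whisper-small checkpoint from above; this snippet is not part of the course):
example = minds[0]
inputs = processor(
    audio=example["audio"]["array"],
    sampling_rate=example["audio"]["sampling_rate"],
    text=example["transcription"],
)
# input_features comes from the feature extractor, labels from the tokenizer
print(inputs.keys())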
Streaming audio data
The datasets we've looked at so far are small. Audio datasets can easily reach tens of gigabytes or more, and for those you'll usually rely on streaming mode, because it:
- saves disk space
- saves time, since you can start working as soon as the first examples arrive
- makes experimentation easier
For example:
gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", streaming=True)
Here's how to access an example when streaming a dataset:
next(iter(gigaspeech["train"]))
Output:
{
"segment_id": "YOU0000000315_S0000660",
"speaker": "N/A",
"text": "AS THEY'RE LEAVING <COMMA> CAN KASH PULL ZAHRA ASIDE REALLY QUICKLY <QUESTIONMARK>",
"audio": {
"path": "xs_chunks_0000/YOU0000000315_S0000660.wav",
"array": array(
[0.0005188, 0.00085449, 0.00012207, ..., 0.00125122, 0.00076294, 0.00036621]
),
"sampling_rate": 16000,
},
"begin_time": 2941.89,
"end_time": 2945.07,
"audio_id": "YOU0000000315",
"title": "Return to Vasselheim | Critical Role: VOX MACHINA | Episode 43",
"url": "https://www.youtube.com/watch?v=zr2n1fLVasU",
"source": 2,
"category": 24,
"original_full_path": "audio/youtube/P0004/YOU0000000315.opus",
}
Use take() to get the first few elements:
gigaspeech_head = gigaspeech["train"].take(2)
list(gigaspeech_head)
Output:
[
{
"segment_id": "YOU0000000315_S0000660",
"speaker": "N/A",
"text": "AS THEY'RE LEAVING <COMMA> CAN KASH PULL ZAHRA ASIDE REALLY QUICKLY <QUESTIONMARK>",
"audio": {
"path": "xs_chunks_0000/YOU0000000315_S0000660.wav",
"array": array(
[
0.0005188,
0.00085449,
0.00012207,
...,
0.00125122,
0.00076294,
0.00036621,
]
),
"sampling_rate": 16000,
},
"begin_time": 2941.89,
"end_time": 2945.07,
"audio_id": "YOU0000000315",
"title": "Return to Vasselheim | Critical Role: VOX MACHINA | Episode 43",
"url": "https://www.youtube.com/watch?v=zr2n1fLVasU",
"source": 2,
"category": 24,
"original_full_path": "audio/youtube/P0004/YOU0000000315.opus",
},
{
"segment_id": "AUD0000001043_S0000775",
"speaker": "N/A",
"text": "SIX TOMATOES <PERIOD>",
"audio": {
"path": "xs_chunks_0000/AUD0000001043_S0000775.wav",
"array": array(
[
1.43432617e-03,
1.37329102e-03,
1.31225586e-03,
...,
-6.10351562e-05,
-1.22070312e-04,
-1.83105469e-04,
]
),
"sampling_rate": 16000,
},
"begin_time": 3673.96,
"end_time": 3675.26,
"audio_id": "AUD0000001043",
"title": "Asteroid of Fear",
"url": "http//www.archive.org/download/asteroid_of_fear_1012_librivox/asteroid_of_fear_1012_librivox_64kb_mp3.zip",
"source": 0,
"category": 28,
"original_full_path": "audio/audiobook/P0011/AUD0000001043.opus",
},
]
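The same preprocessing works on a streamed dataset. A minimal sketch (assuming the feature_extractor defined earlier; not part of the course), running the Whisper feature extractor over a couple of streamed examples on the fly:
for example in gigaspeech["train"].take(2):
    audio = example["audio"]
    features = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"], padding=True
    )
    print(example["segment_id"], features["input_features"][0].shape)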