Choosing an Offline TTS Service for CPU-Only Deployment (with Chinese Support): eSpeak, ChatTTS, CoquiTTS

Posted: 2025-09-28 02:27:34 | Author: admin

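Background

I still remember that when I had just graduated, my first projects were audio applications: speech recognition, speech synthesis, audio editing, and so on (see https://github.com/heartsuit/BaiduASRAndTTS). Back then I mostly called Baidu's ASR and TTS APIs. In later projects that needed speech synthesis, the front-end NPM package speak-tts was enough; it calls into the client operating system's speech libraries.

For TTS (Text to Speech), anyone who wants to self-host a service now has plenty of choices: ChatTTS, VITS, MeloTTS, CoquiTTS, and others, plus countless open-source models on HuggingFace and ModelScope.

The requirement here is to synthesize speech from Chinese text on a CPU-only machine with no internet access, within 5 seconds.

Given the actual use case, this post focuses on two things in a CPU-only setting: synthesis quality for Chinese text (how natural the voice sounds) and synthesis speed (how short the latency is). Below, I deploy and test three TTS options offline, eSpeak, ChatTTS, and CoquiTTS, and compare their strengths and weaknesses.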

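Results first

| Model | Synthesis quality | Synthesis time |
| --- | --- | --- |
| eSpeak | Glitchy, robotic voice; fairly poor | Milliseconds to a few seconds |
| ChatTTS | Lifelike and fluent, high audio quality | About 50 seconds |
| CoquiTTS | Normal human voice, average audio quality | About 5 seconds |

Note: these results were measured on a CPU-only machine with Chinese text of at most 100 characters.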
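To make the 5-second requirement concrete, here is a minimal sketch of how a single synthesis call could be timed against a latency budget. `measureLatency` and `fakeSynthesize` are hypothetical names introduced for illustration; in a real test `fakeSynthesize` would be replaced by a call to the TTS service under test.

```javascript
const { performance } = require('perf_hooks');

// Time one async synthesis call and compare it with a latency budget.
// `synthesize` is a hypothetical function that resolves once the audio is ready.
async function measureLatency(synthesize, text, budgetMs = 5000) {
  const start = performance.now();
  await synthesize(text);
  const elapsedMs = performance.now() - start;
  return { elapsedMs, withinBudget: elapsedMs <= budgetMs };
}

// Stand-in synthesizer that only waits 50 ms (an assumption for the demo).
const fakeSynthesize = (text) => new Promise((resolve) => setTimeout(resolve, 50));

measureLatency(fakeSynthesize, '这是一个测试文本').then((result) => {
  console.log(`${result.elapsedMs.toFixed(0)} ms, within budget: ${result.withinBudget}`);
});
```

Running the same measurement several times and looking at the slowest run gives a more honest picture than a single call, since the first request often includes model warm-up.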

eSpeak

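eSpeak is an open-source text-to-speech (TTS) synthesizer that supports many languages, including English. It uses formant synthesis, which keeps the program very small and the output intelligible, although the voice sounds clearly synthetic. Its small footprint and multilingual support have led to its use in a wide range of scenarios.

For the eSpeak approach I had Cursor write all of the code; I did not write a single line myself. After deployment to the server it ran successfully, but the Chinese output sounds glitchy and robotic, and is quite poor.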

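Directory structure

D:.
│  docker-compose.yml
│  Dockerfile
│  package.json
│  README.md
└─src
    │  index.js
    │  tts.js
    └─routes
            tts.routes.js

The code structure is simple: a standard Node.js back-end project built on the familiar Express web framework, so it needs little explanation.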

tts.js

const { execFile } = require('child_process');
const util = require('util');
const path = require('path');
const fs = require('fs').promises;

// execFile passes arguments as an array and never invokes a shell,
// so user-supplied text cannot inject shell commands.
const execFilePromise = util.promisify(execFile);

class TextToSpeech {
  constructor() {
    this.outputDir = process.env.OUTPUT_DIR || './output';
  }

  async convertToSpeech(text, options = {}) {
    const {
      language = 'zh',
      speed = 175,
      pitch = 50,
      volume = 100
    } = options;

    const fileName = `${Date.now()}-${Math.random().toString(36).substring(7)}`;
    const wavFile = path.join(this.outputDir, `${fileName}.wav`);
    const mp3File = path.join(this.outputDir, `${fileName}.mp3`);

    try {
      // Make sure the output directory exists
      await fs.mkdir(this.outputDir, { recursive: true });

      // 1. Generate a WAV file with espeak
      await execFilePromise('espeak', [
        '-v', String(language),
        '-s', String(speed),
        '-p', String(pitch),
        '-a', String(volume),
        '-w', wavFile,
        String(text)
      ]);

      // 2. Convert to MP3
      await execFilePromise('sox', [wavFile, mp3File]);

      // 3. Read the MP3 file
      return await fs.readFile(mp3File);
    } catch (error) {
      throw new Error(`TTS conversion failed: ${error.message}`);
    } finally {
      // 4. Clean up temporary files, ignoring any that were never created
      await Promise.all([
        fs.unlink(wavFile).catch(() => {}),
        fs.unlink(mp3File).catch(() => {})
      ]);
    }
  }
}

module.exports = TextToSpeech;
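Because `options` comes straight from the request body, it is worth validating it before it reaches the espeak command line. The sketch below is one way to do that; `sanitizeOptions` and `clampInt` are hypothetical helpers, and the numeric ranges (speed 80 to 450, pitch 0 to 99, amplitude 0 to 200) are assumptions based on the eSpeak command-line documentation that should be checked against the installed version.

```javascript
// Clamp a numeric option into [min, max]; fall back to a default for non-numbers.
function clampInt(value, min, max, fallback) {
  const n = Number.parseInt(value, 10);
  if (Number.isNaN(n)) return fallback;
  return Math.min(max, Math.max(min, n));
}

// Return a safe copy of the options object.
// The ranges are assumptions taken from the eSpeak CLI documentation.
function sanitizeOptions(options = {}) {
  // Voice names such as "zh" or "zh+f2": letters, digits, "+", "_" and "-",
  // and never starting with "-" so they cannot be mistaken for a flag.
  const voiceOk = typeof options.language === 'string' &&
    /^[A-Za-z0-9][A-Za-z0-9+_-]*$/.test(options.language);
  return {
    language: voiceOk ? options.language : 'zh',
    speed: clampInt(options.speed, 80, 450, 175),
    pitch: clampInt(options.pitch, 0, 99, 50),
    volume: clampInt(options.volume, 0, 200, 100)
  };
}

console.log(sanitizeOptions({ language: 'zh+f2', speed: 9999, pitch: -5 }));
```

In the service, `convertToSpeech(text, sanitizeOptions(options))` would then only ever see values in the expected shape.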

tts.routes.js

const express = require('express');
const router = express.Router();
const TextToSpeech = require('../tts');

const tts = new TextToSpeech();

router.post('/convert', async (req, res) => {
  try {
    const { text, options } = req.body;

    if (!text) {
      return res.status(400).json({
        error: 'Text is required'
      });
    }

    const audioBuffer = await tts.convertToSpeech(text, options);

    // Set the response headers
    res.set({
      'Content-Type': 'audio/mpeg',
      'Content-Disposition': `attachment; filename="speech-${Date.now()}.mp3"`
    });

    // Send the audio data
    res.send(audioBuffer);
  } catch (error) {
    console.error('TTS Error:', error);
    res.status(500).json({
      error: error.message
    });
  }
});

// Health-check endpoint
router.get('/health', (req, res) => {
  res.json({
    status: 'ok'
  });
});

module.exports = router;

index.js

const express = require('express');
const cors = require('cors');
const morgan = require('morgan');
const ttsRoutes = require('./routes/tts.routes');

const app = express();
const port = process.env.PORT || 4000;

// Middleware
app.use(cors());
app.use(morgan('dev'));
app.use(express.json());

// Routes
app.use('/api/tts', ttsRoutes);

// Error handling
app.use((err, req, res, next) => {
  console.error(err.stack);
  res.status(500).json({
    error: 'Something went wrong!'
  });
});

app.listen(port, () => {
  console.log(`Server running on port ${port}`);
});

package.json

{
  "name": "tts-service",
  "version": "1.0.0",
  "description": "Offline text-to-speech service",
  "main": "src/index.js",
  "dependencies": {
    "express": "^4.18.2",
    "cors": "^2.8.5",
    "morgan": "^1.10.0",
    "uuid": "^9.0.0"
  }
}

Docker deployment files

Dockerfile

FROM node:18-slim

# Set the time zone
ENV TZ=Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

# Install the required packages
RUN apt-get update && apt-get install -y \
    espeak \
    espeak-ng \
    sox \
    libsox-fmt-mp3 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy package.json and package-lock.json
COPY package*.json ./

# Install dependencies
RUN npm install --registry=https://registry.npmmirror.com/

# Copy the source code
COPY . .

# Create the output directory
RUN mkdir -p output && chmod 777 output

EXPOSE 4000

CMD ["node", "src/index.js"]

docker-compose.yml

version: '3'
services:
  tts-service:
    build: .
    ports:
      - "4000:4000"
    volumes:
      - ./output:/app/output
    environment:
      - PORT=4000
      - OUTPUT_DIR=/app/output

README.md

Usage:

1. Build and run the service:

docker-compose up --build

2. API usage example:

With curl:

curl -X POST http://192.168.44.171:4000/api/tts/convert \
  -H "Content-Type: application/json" \
  -d '{
    "text": "这是一个测试文本",
    "options": {
      "language": "zh+f2",
      "speed": 175,
      "pitch": 50,
      "volume": 100
    }
  }' \
  --output speech.mp3
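The same request can be sent from Node.js. The following is a minimal sketch, assuming Node 18 or later (for the global `fetch`) and the `/api/tts/convert` endpoint shown above; `buildConvertRequest` and `saveSpeech` are hypothetical helper names, not part of the project.

```javascript
const fs = require('fs').promises;

// Build the fetch() arguments for POST /api/tts/convert.
function buildConvertRequest(baseUrl, text, options = {}) {
  return {
    url: new URL('/api/tts/convert', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, options })
    }
  };
}

// Send the request and write the returned MP3 to disk.
// Relies on the global fetch available in Node 18+.
async function saveSpeech(baseUrl, text, outFile, options = {}) {
  const { url, init } = buildConvertRequest(baseUrl, text, options);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`TTS request failed with HTTP ${res.status}`);
  await fs.writeFile(outFile, Buffer.from(await res.arrayBuffer()));
}

// Usage (not executed here, since it needs the running service):
// saveSpeech('http://192.168.44.171:4000', '这是一个测试文本', 'speech.mp3', { language: 'zh+f2' });
```

Keeping the request construction separate from the network call makes the payload easy to inspect without a server.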

ChatTTS

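ChatTTS is a text-to-speech (TTS) model designed for conversational scenarios, with support for both Chinese and English. Trained on a large amount of data, it generates natural, fluent, and expressive speech, and its prosody control surpasses most open-source models. It gives fine-grained control over laughter, pauses, and interjections, and can be used to generate speech for many languages and scenarios. ChatTTS is particularly suited to dialogue tasks for large language model (LLM) assistants, as well as conversational audio and video introductions. Its quality is excellent and holds up even against commercial offerings such as Microsoft's Azure TTS.

ChatTTS offers an online tool for a quick trial: https://chattts.com/zh?__theme=dark#Demo. In my tests, though, an ordinary single sentence usually took about 20 seconds, and that was with a GPU. For local deployment I used ChatTTS-ui, an open-source UI project built on ChatTTS. It mainly containerizes the official ChatTTS and adds a web page for testing plus an open API.

Official online demo

Local deployment

# Download the ChatTTS-ui source code and enter its directory

cd ChatTTS-ui-main

# Run the CPU version

docker-compose -f docker-compose.cpu.yaml up -d

Results on CPU only

Open http://192.168.44.171:9966 in a browser.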

CoquiTTS

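CoquiTTS is an open-source text-to-speech (TTS) system that aims to make speech synthesis more accessible to researchers, developers, and creators. It relies on transfer learning across languages, carrying knowledge over from each language's training data, which reduces the amount of data required. CoquiTTS supports multiple languages, including cross-language voice cloning (for example English to Chinese and Chinese to English), for a total of 16 languages.

In addition, CoquiTTS is an advanced multilingual TTS library with pretrained models for more than 1100 languages, and it includes deep-learning models such as Tacotron2, VITS, and YourTTS. Beyond generating high-quality speech, it provides tools for training new models and fine-tuning existing ones, supports multi-speaker TTS, and offers dataset analysis utilities.

CoquiTTS has drawn wide attention for its efficiency and versatility. With 36.4k stars on GitHub at the time of writing, it is one of the leading open-source speech projects.

Installation

Containerized deployment, using the Chinese model tts_models/zh-CN/baker/tacotron2-DDC-GST.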

[root@tts ~]# docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu

Unable to find image 'ghcr.io/coqui-ai/tts-cpu:latest' locally

latest: Pulling from coqui-ai/tts-cpu

025c56f98b67: Pull complete

778656c04542: Pull complete

85485c9f43dd: Pull complete

23b3c91f0de2: Pull complete

fd19b936aab8: Pull complete

30b21c9aef2b: Pull complete

cc12d1e5322b: Pull complete

b91e9a336532: Pull complete

d679a5e35c77: Pull complete

0d84a5b8bca3: Pull complete

4f4fb700ef54: Pull complete

d170b2e70a00: Pull complete

c612db99f0b2: Pull complete

Digest: sha256:a2f6659245358c38efb1bb44b39f7b7b3459e03e9ed5687c447681cb82c35de3

Status: Downloaded newer image for ghcr.io/coqui-ai/tts-cpu:latest

root@b452b7513c7e:~# python3 TTS/server/server.py --list_models

Name format: type/language/dataset/model

1: tts_models/multilingual/multi-dataset/xtts_v2

2: tts_models/multilingual/multi-dataset/xtts_v1.1

3: tts_models/multilingual/multi-dataset/your_tts

4: tts_models/multilingual/multi-dataset/bark

5: tts_models/bg/cv/vits

6: tts_models/cs/cv/vits

7: tts_models/da/cv/vits

8: tts_models/et/cv/vits

9: tts_models/ga/cv/vits

10: tts_models/en/ek1/tacotron2

11: tts_models/en/ljspeech/tacotron2-DDC

12: tts_models/en/ljspeech/tacotron2-DDC_ph

13: tts_models/en/ljspeech/glow-tts

14: tts_models/en/ljspeech/speedy-speech

15: tts_models/en/ljspeech/tacotron2-DCA

16: tts_models/en/ljspeech/vits

17: tts_models/en/ljspeech/vits--neon

18: tts_models/en/ljspeech/fast_pitch

19: tts_models/en/ljspeech/overflow

20: tts_models/en/ljspeech/neural_hmm

21: tts_models/en/vctk/vits

22: tts_models/en/vctk/fast_pitch

23: tts_models/en/sam/tacotron-DDC

24: tts_models/en/blizzard2013/capacitron-t2-c50

25: tts_models/en/blizzard2013/capacitron-t2-c150_v2

26: tts_models/en/multi-dataset/tortoise-v2

27: tts_models/en/jenny/jenny

28: tts_models/es/mai/tacotron2-DDC

29: tts_models/es/css10/vits

30: tts_models/fr/mai/tacotron2-DDC

31: tts_models/fr/css10/vits

32: tts_models/uk/mai/glow-tts

33: tts_models/uk/mai/vits

34: tts_models/zh-CN/baker/tacotron2-DDC-GST

35: tts_models/nl/mai/tacotron2-DDC

36: tts_models/nl/css10/vits

37: tts_models/de/thorsten/tacotron2-DCA

38: tts_models/de/thorsten/vits

39: tts_models/de/thorsten/tacotron2-DDC

40: tts_models/de/css10/vits-neon

41: tts_models/ja/kokoro/tacotron2-DDC

42: tts_models/tr/common-voice/glow-tts

43: tts_models/it/mai_female/glow-tts

44: tts_models/it/mai_female/vits

45: tts_models/it/mai_male/glow-tts

46: tts_models/it/mai_male/vits

47: tts_models/ewe/openbible/vits

48: tts_models/hau/openbible/vits

49: tts_models/lin/openbible/vits

50: tts_models/tw_akuapem/openbible/vits

51: tts_models/tw_asante/openbible/vits

52: tts_models/yor/openbible/vits

53: tts_models/hu/css10/vits

54: tts_models/el/cv/vits

55: tts_models/fi/css10/vits

56: tts_models/hr/cv/vits

57: tts_models/lt/cv/vits

58: tts_models/lv/cv/vits

59: tts_models/mt/cv/vits

60: tts_models/pl/mai_female/vits

61: tts_models/pt/cv/vits

62: tts_models/ro/cv/vits

63: tts_models/sk/cv/vits

64: tts_models/sl/cv/vits

65: tts_models/sv/cv/vits

66: tts_models/ca/custom/vits

67: tts_models/fa/custom/glow-tts

68: tts_models/bn/custom/vits-male

69: tts_models/bn/custom/vits-female

70: tts_models/be/common-voice/glow-tts

Name format: type/language/dataset/model

1: vocoder_models/universal/libri-tts/wavegrad

2: vocoder_models/universal/libri-tts/fullband-melgan

3: vocoder_models/en/ek1/wavegrad

4: vocoder_models/en/ljspeech/multiband-melgan

5: vocoder_models/en/ljspeech/hifigan_v2

6: vocoder_models/en/ljspeech/univnet

7: vocoder_models/en/blizzard2013/hifigan_v2

8: vocoder_models/en/vctk/hifigan_v2

9: vocoder_models/en/sam/hifigan_v2

10: vocoder_models/nl/mai/parallel-wavegan

11: vocoder_models/de/thorsten/wavegrad

12: vocoder_models/de/thorsten/fullband-melgan

13: vocoder_models/de/thorsten/hifigan_v1

14: vocoder_models/ja/kokoro/hifigan_v1

15: vocoder_models/uk/mai/multiband-melgan

16: vocoder_models/tr/common-voice/hifigan

17: vocoder_models/be/common-voice/hifigan

Name format: type/language/dataset/model

1: voice_conversion_models/multilingual/vctk/freevc24

root@b452b7513c7e:~# python3 TTS/server/server.py --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST

> tts_models/zh-CN/baker/tacotron2-DDC-GST is already downloaded.

> Using model: tacotron2

> Setting up Audio Processor...

| > sample_rate:22050

| > resample:False

| > num_mels:80

| > log_func:np.log10

| > min_level_db:-100

| > frame_shift_ms:None

| > frame_length_ms:None

| > ref_level_db:0

| > fft_size:1024

| > power:1.5

| > preemphasis:0.0

| > griffin_lim_iters:60

| > signal_norm:True

| > symmetric_norm:True

| > mel_fmin:50.0

| > mel_fmax:7600.0

| > pitch_fmin:0.0

| > pitch_fmax:640.0

| > spec_gain:1.0

| > stft_pad_mode:reflect

| > max_norm:4.0

| > clip_norm:True

| > do_trim_silence:True

| > trim_db:60

| > do_sound_norm:False

| > do_amp_to_db_linear:True

| > do_amp_to_db_mel:True

| > do_rms_norm:False

| > db_level:None

| > stats_path:/root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST/scale_stats.npy

| > base:10

| > hop_length:256

| > win_length:1024

> Model's reduction rate `r` is set to: 2

* Serving Flask app 'server'

* Debug mode: off

INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.

* Running on all addresses (::)

* Running on http://[::1]:5002

* Running on http://[::1]:5002

INFO:werkzeug:Press CTRL+C to quit

INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:29:57] "GET / HTTP/1.1" 200 -

INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:39] "GET /favicon.ico HTTP/1.1" 404 -

INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:39] "GET /static/coqui-log-green-TTS.png HTTP/1.1" 200 -

> Model input: 你好。

> Speaker Idx:

> Language Idx:

> Text splitted to sentences.

['你好。']

Building prefix dict from the default dictionary ...

DEBUG:jieba:Building prefix dict from the default dictionary ...

Dumping model to file cache /tmp/jieba.cache

DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache

Loading model cost 0.612 seconds.

DEBUG:jieba:Loading model cost 0.612 seconds.

Prefix dict has been built successfully.

DEBUG:jieba:Prefix dict has been built successfully.

> Processing time: 1.3838930130004883

> Real-time factor: 1.3535681749760808

INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:51] "GET /api/tts?text=你好。&speaker_id=&style_wav={"0":%200.1}&language_id= HTTP/1.1" 200 -

> Model input: Coqui TTS 支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。

> Speaker Idx:

> Language Idx:

> Text splitted to sentences.

['Coqui TTS 支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。']

Coqui TTS dʒʏ1ʈʂʏ2 duo1dʒoŋ3y3iɛn2 , baʌ1kuo4 kua4 y3iɛn2 kø4loŋ2 , li4ʐu2 ɨŋ1wœn2 daʌ4 dʒoŋ1wœn2 dʒoŋ1wœn2 daʌ4 ɨŋ1wœn2 dɵŋ3 , goŋ4dʑi4 ʂʏ2lio4dʒoŋ3 y3iɛn2 。

[!] Character 'C' not found in the vocabulary. Discarding it.

Coqui TTS dʒʏ1ʈʂʏ2 duo1dʒoŋ3y3iɛn2 , baʌ1kuo4 kua4 y3iɛn2 kø4loŋ2 , li4ʐu2 ɨŋ1wœn2 daʌ4 dʒoŋ1wœn2 dʒoŋ1wœn2 daʌ4 ɨŋ1wœn2 dɵŋ3 , goŋ4dʑi4 ʂʏ2lio4dʒoŋ3 y3iɛn2 。

[!] Character 'T' not found in the vocabulary. Discarding it.

Coqui TTS dʒʏ1ʈʂʏ2 duo1dʒoŋ3y3iɛn2 , baʌ1kuo4 kua4 y3iɛn2 kø4loŋ2 , li4ʐu2 ɨŋ1wœn2 daʌ4 dʒoŋ1wœn2 dʒoŋ1wœn2 daʌ4 ɨŋ1wœn2 dɵŋ3 , goŋ4dʑi4 ʂʏ2lio4dʒoŋ3 y3iɛn2 。

[!] Character 'S' not found in the vocabulary. Discarding it.

Coqui TTS dʒʏ1ʈʂʏ2 duo1dʒoŋ3y3iɛn2 , baʌ1kuo4 kua4 y3iɛn2 kø4loŋ2 , li4ʐu2 ɨŋ1wœn2 daʌ4 dʒoŋ1wœn2 dʒoŋ1wœn2 daʌ4 ɨŋ1wœn2 dɵŋ3 , goŋ4dʑi4 ʂʏ2lio4dʒoŋ3 y3iɛn2 。

[!] Character 'g' not found in the vocabulary. Discarding it.

> Processing time: 3.930570125579834

> Real-time factor: 0.4210506765887842

INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:37:24] "GET /api/tts?text=Coqui%20TTS%20支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。&speaker_id=&style_wav={"0":%200.1}&language_id= HTTP/1.1" 200
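The two numbers the server prints per request are related: as far as I understand the Coqui TTS synthesizer, the real-time factor is the processing time divided by the duration of the generated audio, so the audio length can be recovered from the log. The sketch below does that arithmetic for the two requests above; `audioSeconds` is a hypothetical helper, and the relationship is an assumption about how the server computes the value.

```javascript
// Recover the generated audio duration from the two logged values,
// assuming real-time factor = processing time / audio duration.
function audioSeconds(processingTimeSec, realTimeFactor) {
  return processingTimeSec / realTimeFactor;
}

// Values copied from the log lines above.
const short = audioSeconds(1.3838930130004883, 1.3535681749760808);
const long = audioSeconds(3.930570125579834, 0.4210506765887842);

console.log(`"你好。": about ${short.toFixed(2)} s of audio`);
console.log(`long sentence: about ${long.toFixed(2)} s of audio`);
```

A factor below 1 means synthesis is faster than playback. The short request has a factor above 1 largely because its processing time includes the one-off jieba dictionary load visible in the log.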

Results on CPU only

Open http://192.168.44.171:5002 in a browser. In most cases the synthesized audio comes back in about 5 seconds.
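Besides the web page, the server log shows that synthesis is a plain GET request to `/api/tts`. A minimal sketch of building that URL from Node.js follows; `buildCoquiUrl` is a hypothetical helper, the parameter names are copied from the request lines in the log, and the default `style_wav` value simply mirrors what the log shows (whether it matters for this model is an assumption).

```javascript
// Build a GET URL for the Coqui TTS demo server's /api/tts endpoint.
// Parameter names mirror the request lines printed in the server log.
function buildCoquiUrl(baseUrl, text, params = {}) {
  const { speakerId = '', styleWav = '{"0": 0.1}', languageId = '' } = params;
  const url = new URL('/api/tts', baseUrl);
  url.searchParams.set('text', text);
  url.searchParams.set('speaker_id', speakerId);
  url.searchParams.set('style_wav', styleWav);
  url.searchParams.set('language_id', languageId);
  return url.toString();
}

console.log(buildCoquiUrl('http://192.168.44.171:5002', '你好。'));
```

`URLSearchParams` takes care of percent-encoding the Chinese text, so the result can be passed directly to `fetch` or `curl`.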

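Note:

Remember to end Chinese text with a Chinese full stop (。). Otherwise the model appends a drawn-out "ah" sound that pads the audio to 12 seconds, which is probably a bug. For longer text, for example more than 50 characters, the last part is sometimes cut off, which is probably also a bug.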
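Both quirks can be worked around on the client side before the text is sent. The sketch below is one possible approach, not something the model or library prescribes: `ensureTerminalPeriod` appends the full stop when it is missing, and `splitForSynthesis` packs sentences into chunks of at most 50 characters. Both function names and the 50-character limit are assumptions drawn from the observations above.

```javascript
// Append a Chinese full stop unless the text already ends with one.
function ensureTerminalPeriod(text) {
  const trimmed = text.trim();
  if (trimmed === '' || trimmed.endsWith('。')) return trimmed;
  return trimmed + '。';
}

// Split text at sentence-ending punctuation and greedily pack the sentences
// into chunks of at most maxChars. A single sentence longer than maxChars
// is kept whole rather than cut mid-sentence.
function splitForSynthesis(text, maxChars = 50) {
  // Keep each delimiter attached to the sentence it ends.
  const sentences = text.match(/[^。!?;]+[。!?;]?/g) || [];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current);
      current = '';
    }
    current += sentence;
  }
  if (current) chunks.push(current);
  return chunks.map(ensureTerminalPeriod).filter((chunk) => chunk !== '');
}

console.log(splitForSynthesis('你好。世界', 3));
```

Each chunk can then be synthesized separately and the resulting audio concatenated, at the cost of one request per chunk.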

Model download failure

Error: downloading the model file from GitHub timed out

root@b452b7513c7e:~# python3 TTS/server/server.py --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST

> Downloading model to /root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST

> Failed to download the model file to /root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST

Traceback (most recent call last):

File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn

sock = connection.create_connection(

File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection

raise err

File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection

sock.connect(sa)

TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send

resp = conn.urlopen(

File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen

retries = retries.increment(

File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment

raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /coqui-ai/TTS/releases/download/v0.6.1_models/tts_models--zh-CN--baker--tacotron2-DDC-GST.zip (Caused by ConnectTimeoutError(, 'Connection to github.com timed out. (connect timeout=None)'))

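Solution: download the model manually

The error message shows the model's download URL: https://github.com/coqui-ai/TTS/releases/download/v0.6.1_models/tts_models--zh-CN--baker--tacotron2-DDC-GST.zip. Download it manually, unzip it, and copy it into the container's /root/.local/share/tts directory.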

[root@tts opt]# unzip tts_models--zh-CN--baker--tacotron2-DDC-GST.zip

Archive: tts_models--zh-CN--baker--tacotron2-DDC-GST.zip

creating: tts_models--zh-CN--baker--tacotron2-DDC-GST/

extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/model_file.pth

extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/scale_stats.npy

extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/config.json

[root@tts opt]# docker cp tts_models--zh-CN--baker--tacotron2-DDC-GST nostalgic_hawking:/root/.local/share/tts

Successfully copied 686MB to nostalgic_hawking:/root/.local/share/tts
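Before starting the server on an offline machine, it can help to confirm that the copied model directory is complete. The following sketch checks for the three files listed in the unzip output above; `missingModelFiles` is a hypothetical helper, and treating exactly these three files as sufficient is an assumption based on that archive listing.

```javascript
const fs = require('fs');
const path = require('path');

// File names taken from the unzip output of the model archive.
const REQUIRED_FILES = ['model_file.pth', 'scale_stats.npy', 'config.json'];

// Return the names of required files that are absent from modelDir.
function missingModelFiles(modelDir) {
  return REQUIRED_FILES.filter((name) => !fs.existsSync(path.join(modelDir, name)));
}

const modelDir = '/root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST';
const missing = missingModelFiles(modelDir);
console.log(missing.length === 0 ? 'Model directory looks complete' : `Missing: ${missing.join(', ')}`);
```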

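Summary

Architecture is a matter of trade-offs.

Based on the actual requirement, here is how the three TTS options compare for Chinese speech synthesis on a CPU-only machine with no internet access, with a 5-second limit:

| Option | Quality | Speed | Meets the requirement? |
| --- | --- | --- | --- |
| eSpeak | Glitchy, robotic | Milliseconds | Fast enough, but quality is poor |
| ChatTTS | Best | About 50 seconds | Good quality, but too slow |
| CoquiTTS | Normal | About 5 seconds | Largely meets the requirement |

Overall, CoquiTTS is the best fit: on a CPU-only machine it finishes in about 5 seconds, and the voice quality is acceptable.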

Reference

https://chattts.com/zh?__theme=dark#Demo

https://github.com/jianchang512/ChatTTS-ui

https://github.com/coqui-ai/TTS

https://docs.coqui.ai/en/latest/tutorial_for_nervous_beginners.html

If you have any questions or any bugs are found, please feel free to contact me.

Your comments and suggestions are welcome!
