Spring AI Alibaba Multimodal in Practice: A Complete Tutorial on Text Chat, Text-to-Image, Speech Synthesis, and Speech-to-Text

张建站

2026/6/14 23:07:40

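Contents

I. Overview of the Core Models
II. ChatModel: Text Chat Model (the Foundation)
    1. How it works
III. ImageModel: Text-to-Image Model
    1. How it works
IV. AudioModel: Speech Synthesis (Text-to-Speech)
    1. How it works
V. Speech-to-Text Model (Audio Transcription; Key Pitfalls)
    1. How it works
    2. Key pitfalls (must read)
    3. Prerequisites (required)
VI. Summary

Spring AI Alibaba is an AI development framework built on the Spring ecosystem. It integrates with Alibaba Cloud Bailian (DashScope) models and hides the underlying API plumbing, so developers can quickly implement four core multimodal capabilities: text chat, image generation, speech synthesis, and speech-to-text. This article explains the core principle of each of the four models from scratch, gives complete runnable code, and adds fixes for pitfalls met in practice. It is aimed at beginners who want to get started quickly with multimodal development in Spring AI.

I. Overview of the Core Models

Spring AI Alibaba abstracts the different AI capabilities into four core model types. Each has its own responsibility, and all share a uniform, minimal interface:

ChatModel (text chat model): basic text interaction. It accepts plain-text input and returns a formatted text reply. This is the most basic large-model capability.

ImageModel (text-to-image model): image generation from text. It accepts a text description, the AI generates a matching image, and the public URL of the image is returned.

Audio speech synthesis model (text-to-speech): it takes text as input and produces an audio stream that can be downloaded and saved as an MP3 file.

Audio speech-to-text model (audio transcription): it takes an uploaded audio file, recognizes the spoken content, and outputs text. It suits scenarios such as meeting minutes and voice notes.

II. ChatModel: Text Chat Model (the Foundation)

1. How it works

ChatModel is Spring AI's top-level unified chat interface; here it is backed by Alibaba's Tongyi Qianwen (Qwen) model family. It takes the text prompt passed in by the front end and forwards the request to the DashScope model, which interprets the prompt, generates a reply, and returns it as formatted text.

```java
package org.example.ai_demo.controller;

import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ChatModelController {

    // The DashScope-backed ChatModel implementation is injected automatically
    private final ChatModel chatModel;

    public ChatModelController(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @RequestMapping("/chat2")
    public String chat2(String input) {
        // Wrap the input in a Prompt and call the model
        ChatResponse response = chatModel.call(new Prompt(input));
        // Extract the text of the model's reply
        return response.getResult().getOutput().getText();
    }
}
```

III. ImageModel: Text-to-Image Model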
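1. How it works

ImageModel is dedicated to generating images from text and is backed by Alibaba's Wanxiang (wanx) models. It takes the user's image description and calls the DashScope image generation API, which generates the image asynchronously and returns a publicly accessible image URL. The image size and the model variant are configurable.

```java
package org.example.ai_demo.controller;

import org.springframework.ai.image.*;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ImageModelController {

    // Inject the text-to-image model
    private final ImageModel imageModel;

    public ImageModelController(ImageModel imageModel) {
        this.imageModel = imageModel;
    }

    @RequestMapping("/image")
    public String image(String input) {
        // Configure image generation: model, height, width
        ImageOptions imageOptions = ImageOptionsBuilder
                .builder()
                .model("wanx2.1-t2i-turbo")
                .height(1024)
                .width(1024)
                .build();
        // Wrap the description and options in an image request
        ImagePrompt imagePrompt = new ImagePrompt(input, imageOptions);
        ImageResponse response = imageModel.call(imagePrompt);
        // Return the public URL of the generated image
        return response.getResult().getOutput().getUrl();
    }
}
```

IV. AudioModel: Speech Synthesis (Text-to-Speech)

1. How it works

SpeechSynthesisModel provides the text-to-speech capability. Given any Chinese or English text, the model synthesizes standard human-sounding speech and returns the audio as a byte stream, which the front end can download directly as an MP3 file.

```java
package org.example.ai_demo.controller;

import com.alibaba.cloud.ai.dashscope.audio.synthesis.SpeechSynthesisModel;
import com.alibaba.cloud.ai.dashscope.audio.synthesis.SpeechSynthesisPrompt;
import com.alibaba.cloud.ai.dashscope.audio.synthesis.SpeechSynthesisResponse;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.nio.ByteBuffer;

@RestController
public class AudioModelController {

    // Inject the speech synthesis model
    private final SpeechSynthesisModel speechSynthesisModel;

    public AudioModelController(SpeechSynthesisModel speechSynthesisModel) {
        this.speechSynthesisModel = speechSynthesisModel;
    }

    @GetMapping("/synthesize")
    public ResponseEntity<byte[]> synthesize(@RequestParam("text") String text) {
        // Wrap the text in a speech synthesis request
        SpeechSynthesisPrompt prompt = new SpeechSynthesisPrompt(text);
        SpeechSynthesisResponse response = speechSynthesisModel.call(prompt);
        // Copy the audio bytes out of the returned buffer
        ByteBuffer audio = response.getResult().getOutput().getAudio();
        byte[] audioBytes = new byte[audio.remaining()];
        audio.get(audioBytes);
        // Respond with an MP3 attachment that the browser downloads directly
        return ResponseEntity.ok()
                .contentType(MediaType.APPLICATION_OCTET_STREAM)
                .header("Content-Disposition", "attachment; filename=output.mp3")
                .body(audioBytes);
    }
}
```

V. Speech-to-Text Model (Audio Transcription; Key Pitfalls)

1. How it works

Built on the sensevoice-v1 model, this capability reads an audio file, recognizes the speech, and outputs structured text. The model runs as an asynchronous task, and it is the module where the most pitfalls turn up in practice.

2. Key pitfalls (must read)

Many beginners call the API with a local disk file (FileSystemResource) and immediately get a "spec is null" null pointer exception. The root causes:

Older versions of the Spring AI Alibaba SDK do not support uploading a local file directly.

The DashScope transcription API accepts only publicly accessible HTTPS audio links.

Temporary signed OSS links and OSS files with private permissions cannot be recognized; the API reports that the file type is not supported.

3. Prerequisites (required)

1. Open the Alibaba Cloud OSS console and create a Bucket.
2. Set the Bucket's read/write permission to public-read.
3. Upload the local MP3 file and copy its permanent, unsigned public HTTPS link.
4. If the link plays or downloads directly in a browser, it is valid.

```java
package org.example.ai_demo.controller;

import com.alibaba.cloud.ai.dashscope.audio.DashScopeAudioTranscriptionModel;
import com.alibaba.cloud.ai.dashscope.audio.DashScopeAudioTranscriptionOptions;
import org.springframework.ai.audio.transcription.AudioTranscriptionPrompt;
import org.springframework.core.io.UrlResource;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import java.net.MalformedURLException;

@RestController
public class AudioModelController2 {

    // Replace with the public link of your MP3 in a public-read OSS Bucket
    private static final String AUDIO_URL_PATH = "<your public URL>";

    private final DashScopeAudioTranscriptionModel dashScopeAudioTranscriptionModel;

    public AudioModelController2(DashScopeAudioTranscriptionModel dashScopeAudioTranscriptionModel) {
        this.dashScopeAudioTranscriptionModel = dashScopeAudioTranscriptionModel;
    }

    @GetMapping("/audio")
    public String audio() throws MalformedURLException {
        // Load the publicly hosted audio resource
        UrlResource resource = new UrlResource(AUDIO_URL_PATH);
        // Select the transcription model
        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(resource,
                DashScopeAudioTranscriptionOptions.builder()
                        .withModel("sensevoice-v1")
                        .build());
        // Return the transcribed text
        return dashScopeAudioTranscriptionModel.call(prompt).getResult().getOutput();
    }
}
```

VI. Summary

1. ChatModel: basic text chat. It works out of the box and has no special environment requirements.
2. ImageModel: text-to-image generation runs as an asynchronous task, so configure a retry strategy to avoid errors while the task is queued.
3. Audio speech synthesis: text is converted to audio and returned directly as a file stream, so no third-party storage is needed.
4. Audio speech-to-text: the hardest part. You must use a public link from a public-read OSS Bucket; local files, private links, and temporary signed links do not work.

Spring AI Alibaba greatly simplifies multimodal AI development. With these four models you can quickly build applications that cover chat, image generation, speech synthesis, and speech transcription.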
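The summary recommends a retry strategy for ImageModel but does not show one. The following is a minimal sketch of one possible approach, using only the JDK. `RetrySketch` and `callWithRetry` are hypothetical names introduced here for illustration; they are not part of Spring AI or Spring AI Alibaba. The article does not say which exception a queued image task raises, so this sketch assumes the failure surfaces as a `RuntimeException` and retries on any of them.

```java
import java.util.function.Supplier;

// Hypothetical helper (not a Spring AI Alibaba API): retries a call with exponential backoff.
final class RetrySketch {

    private RetrySketch() {
    }

    /**
     * Runs the call up to maxAttempts times, doubling the wait after each failure.
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                // Assumption: a queued or failed task shows up as a RuntimeException
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delay *= 2;
            }
        }
        throw last;
    }
}
```

In the image controller this could wrap the model call, for example `RetrySketch.callWithRetry(() -> imageModel.call(imagePrompt), 3, 1000L)`. In real code it is probably better to retry only on the specific exception type observed for queued tasks, so that genuine errors such as an invalid API key fail fast.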