OpenAI Multimodal Development

Posted on 2025-05-26 Edited on 2025-06-22 In Artificial Intelligence, Large Models

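1. Image to Text: GPT-4 Vision

Historically, language model systems accepted only text as input, and that single input modality limited where large models could be applied.
GPT-4 Turbo with Vision (GPT-4V for short), developed by OpenAI, lets the model take images as input and answer questions about them.
Note that image input is not yet supported when GPT-4 is used through the Assistants API.

Describing an online image (URL) with GPT-4V: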

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4-turbo",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0])

print(response.choices[0].message.content)

# Wrap the call in a function, query_image_description
def query_image_description(url, prompt="Describe this image."):
    client = OpenAI()  # Initialize the OpenAI client

    # Send the request to the OpenAI chat model
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # Model to use
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": url}},
                ],
            }
        ],
        max_tokens=300,
    )
    
    # Return the model's reply
    return response.choices[0].message.content

# Test the function
image_url = "https://p6.itc.cn/q_70/images03/20200602/0c267a0d3d814c9783659eb956969ba1.jpeg"
content = query_image_description(image_url)
print(content)

Describing a local image file (Base64-encoded) with GPT-4V:

from openai import OpenAI
import base64
import requests

client = OpenAI()  # Initialize the OpenAI client

def query_base64_image_description(image_path, prompt="Explain what this image shows.", max_tokens=1000):

    # Base64-encode the image file
    def encode_image(path):
        with open(path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    # Get the Base64-encoded string for the image
    base64_image = encode_image(image_path)

    # Build the HTTP request headers
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {client.api_key}"
    }

    # Build the request payload
    payload = {
        "model": "gpt-4-turbo",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ],
        "max_tokens": max_tokens
    }

    # Send the HTTP request
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

    # Check the response and extract the content field
    if response.status_code == 200:
        response_data = response.json()
        content = response_data['choices'][0]['message']['content']
        return content
    else:
        return f"Error: {response.status_code}, {response.text}"

content = query_base64_image_description("./images/gdp_1980_2020.jpg")
print(content)
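
The function above always labels the payload as image/jpeg, whatever the file actually is. The sketch below is one possible way to guess the MIME type from the file name before building the data URL. The helper names build_image_data_url and build_image_message are made up for this example and are not part of the OpenAI SDK.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_data_url(image_path: str) -> str:
    """Return a data URL for a local image, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path!r}")
    # Read the raw bytes and Base64-encode them, as encode_image does above
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(image_path: str, prompt: str) -> dict:
    """Build a user message in the same shape as the payload above (hypothetical helper)."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": build_image_data_url(image_path)}},
        ],
    }
```

The returned dict can be placed in the messages list of either the requests payload or a client.chat.completions.create call.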

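2. Text to Image: DALL·E 3

The OpenAI Images API offers three ways to work with images:
- Generate an image from a text prompt (DALL·E 3 and DALL·E 2)
- Edit an existing image by having the model replace selected regions according to a new text prompt (DALL·E 2 only)
- Create variations of an existing image (DALL·E 2 only)

This section covers the first, text-to-image generation (for more on the DALL·E 3 model update, see the OpenAI Cookbook). The image generation API: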

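New parameters:
- model ('dall-e-2' or 'dall-e-3'): the model to use. Set it to 'dall-e-3' explicitly, because it defaults to 'dall-e-2' when omitted.
- style ('natural' or 'vivid'): the style of the generated image. 'vivid' pushes the model toward hyper-real, dramatic images; 'natural' yields more natural, less hyper-real images. Defaults to 'vivid'.
- quality ('standard' or 'hd'): the quality of the generated image. 'hd' produces finer detail and greater consistency across the image. Defaults to 'standard'.
Other parameters:
- prompt (str): a text description of the desired image. Required. The maximum length is 1000 characters for dall-e-2 and 4000 characters for dall-e-3.
- n (int): the number of images to generate. Must be between 1 and 10. Defaults to 1. For dall-e-3, only n=1 is supported.
- size (…): the size of the generated image. For DALL·E 2 it must be one of 256x256, 512x512, or 1024x1024. For DALL·E 3 it must be one of 1024x1024, 1792x1024, or 1024x1792.
- response_format ('url' or 'b64_json'): the format in which the generated image is returned. Defaults to 'url'.
- user (str): a unique identifier for your end user, which helps OpenAI monitor and detect abuse.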
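
The size, n, style, and quality constraints listed above are easy to get wrong when switching between models. A minimal sketch of a local pre-check follows. The function validate_image_params is a made-up helper, not an SDK call, and it only encodes the rules stated in the list.

```python
# Allowed sizes per model, copied from the parameter list above
ALLOWED_SIZES = {
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
}


def validate_image_params(model="dall-e-2", n=1, size="1024x1024", quality=None, style=None):
    """Check the arguments against the listed constraints and return them as a kwargs dict."""
    if model not in ALLOWED_SIZES:
        raise ValueError(f"unknown model: {model}")
    if size not in ALLOWED_SIZES[model]:
        raise ValueError(f"size {size} is not supported by {model}")
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    if model == "dall-e-3" and n != 1:
        raise ValueError("dall-e-3 only supports n=1")
    if quality is not None and quality not in ("standard", "hd"):
        raise ValueError(f"quality must be 'standard' or 'hd', got {quality!r}")
    if style is not None and style not in ("natural", "vivid"):
        raise ValueError(f"style must be 'natural' or 'vivid', got {style!r}")
    # Only include quality/style when the caller set them explicitly
    params = {"model": model, "n": n, "size": size}
    if quality is not None:
        params["quality"] = quality
    if style is not None:
        params["style"] = style
    return params
```

The result can be unpacked into the request, for example client.images.generate(prompt="a white siamese cat", **validate_image_params(model="dall-e-3", quality="hd")).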

from openai import OpenAI
client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="a white siamese cat",
    size="1024x1024",
    quality="standard",
    n=1,
)

image_url = response.data[0].url
print(image_url)

# HD mode (quality="hd")
response = client.images.generate(
    model="dall-e-3",
    prompt="a white siamese cat",
    size="1024x1024",
    quality="hd",
    n=1,
)
print(response.data[0].url)

# Natural style (style="natural")
response = client.images.generate(
    model="dall-e-3",
    prompt="a white siamese cat",
    size="1024x1024",
    quality="standard",
    n=1,
    style="natural"
)
print(response.data[0].url)

# Dramatic style (style="vivid")
response = client.images.generate(
    model="dall-e-3",
    prompt="a white siamese cat",
    size="1024x1024",
    quality="standard",
    n=1,
    style="vivid"
)
print(response.data[0].url)
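
All the calls above print a URL. When response_format="b64_json" is requested instead, the image comes back as a Base64 string that has to be decoded before it can be saved. The following is a rough sketch of that decoding step. save_b64_image and the output path are made up for illustration.

```python
import base64
from pathlib import Path


def save_b64_image(b64_data: str, output_path: str) -> int:
    """Decode a Base64 string, write the bytes to output_path, and return the byte count."""
    image_bytes = base64.b64decode(b64_data)
    path = Path(output_path)
    # Create the target directory if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(image_bytes)
    return len(image_bytes)
```

With a response obtained via client.images.generate(..., response_format="b64_json"), the call would look like save_b64_image(response.data[0].b64_json, "./images/cat.png").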

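3. Text to Speech: TTS

The text-to-speech API provides a service built on a TTS (Text-To-Speech) model. It can be used to:
- Narrate a written blog post
- Produce spoken audio in multiple languages
- Deliver real-time audio output through streaming
TTS currently offers 6 built-in voices: alloy, echo, fable, onyx, nova, and shimmer.
Voice samples: https://platform.openai.com/docs/guides/text-to-speech/voice-options
The default response format is "mp3"; "opus", "aac", "flac", "wav", and "pcm" are also supported.
- Opus: for internet streaming and communication, low latency.
- AAC: for digital audio compression, preferred by YouTube, Android, and iOS.
- FLAC: for lossless audio compression, favored by audio enthusiasts for archiving.
- WAV: uncompressed WAV audio, suited to low-latency applications because it avoids decoding overhead.
- PCM: similar to WAV, but containing raw samples at 24 kHz (16-bit signed, little-endian) with no header.
The TTS model supports a wide range of languages (though it is optimized for English): Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.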
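
Because the PCM format has no header, most players cannot open it directly. As a sketch, the raw samples can be wrapped in a WAV container with Python's standard wave module. The 24 kHz rate and 16-bit width come from the format list above; the single (mono) channel is an assumption of this example, and pcm_to_wav is a made-up helper name.

```python
import wave


def pcm_to_wav(pcm_bytes: bytes, wav_path: str, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> None:
    """Wrap raw little-endian PCM samples in a WAV container."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)      # assumption: mono output
        wav_file.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)   # 24 kHz, per the format list
        wav_file.writeframes(pcm_bytes)
```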

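Text-to-speech API (generates audio from input text):

Parameters:
- model ('tts-1' or 'tts-1-hd'): the TTS model to use (required).
- input (text): the text to generate audio for, at most 4096 characters.
- voice ('alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer'): the voice used to generate the audio. Previews of the voices are available in the Text to Speech guide.
- response_format ('mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'): the output audio format. Defaults to 'mp3'.
- speed (0.25 to 4.0): the speed of the generated audio. Defaults to 1.0.
Returns:
- audio_file: the audio file content.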
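
Since input is capped at 4096 characters, longer texts have to be sent in several requests. Here is a minimal sketch of one way to split text into chunks that fit, preferring sentence boundaries (Chinese or Western punctuation). split_for_tts is a made-up helper, and the splitting rule is an assumption rather than anything the API prescribes.

```python
import re
from typing import List


def split_for_tts(text: str, limit: int = 4096) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    # Zero-width split right after sentence-ending punctuation or a newline,
    # so every piece keeps its own punctuation and nothing is dropped.
    sentences = re.split(r"(?<=[。！？.!?\n])", text)
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at fixed positions
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        # Start a new chunk when adding this sentence would exceed the limit
        if len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as input to a separate speech request, and the resulting audio files played or concatenated in order.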

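# Use TTS to voice a line spoken by Li Yunlong
from openai import OpenAI
client = OpenAI()

speech_file_path = "./audio/liyunlong.mp3"
# The usage in the official example triggers a deprecation warning, so the streaming-response pattern is used instead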
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="echo",
    input="二营长！你他娘的意大利炮呢？给我拉来！"
) as response:
    response.stream_to_file(speech_file_path)

# Use TTS to change the voice of a voice-chat message
speech_file_path = "./audio/quewang.mp3"
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="onyx",
    input="周三早上11点，雀王争霸赛，老地方23号房，经典三缺一！"
) as response:
    response.stream_to_file(speech_file_path)

# Use TTS to read a news bulletin
speech_file_path = "./audio/boyin.mp3"
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="onyx",
    input="""
    上海F1赛车时隔五年回归 首位中国车手周冠宇：我渴望站上领奖台
    2024年4月17日
    阔别五年的世界一级方程式（F1）赛车中国站即将于2024年4月19至21日在上海国际赛车场举行，并首次有中国籍赛车手参赛。
    
    作为中国第一位F1赛车手，24岁的上海小伙周冠宇称自己“渴望站上领奖台”。
    """
) as response:
    response.stream_to_file(speech_file_path)

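4. Speech to Text: Whisper

OpenAI provides two speech-to-text API services based on the open-source Whisper large-v2 model:
- Transcriptions: transcribe audio into whatever language the audio is in.
- Translations: translate and transcribe audio into English.
Uploads are currently limited to 25 MB, and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.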
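
It can save a round trip to check the size limit and file type locally before uploading. A minimal sketch, assuming "25 MB" means 25 * 1024 * 1024 bytes and using the extension list from the sentence above. check_audio_file is a made-up helper, not an SDK function.

```python
import os

# Assumption: the 25 MB limit is read as 25 * 1024 * 1024 bytes
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def check_audio_file(path: str, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError if the file type or size would be rejected by the limits above."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {extension or '(none)'}")
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"file is {size} bytes, above the {max_bytes}-byte limit")
```

Files that exceed the limit would need to be split or re-encoded at a lower bitrate before upload.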

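Transcription API (takes an audio file, returns a transcription object (JSON)):

Parameters
- file (file): the audio file object (not the file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
- model ('whisper-1'): the ID of the model to use. Currently only whisper-1, which is powered by OpenAI's open-source Whisper V2 model, is available.
- language (optional): the language of the input audio. Supplying it in ISO-639-1 format improves accuracy and latency.
- prompt (optional): text to guide the model's style or to continue a previous audio segment. The prompt should match the audio language.
- response_format (optional): the format of the transcription output. Defaults to json. Options: json, text, srt, verbose_json, or vtt.
- temperature (optional): the sampling temperature, between 0 and 1. Higher values such as 0.8 make the output more random; lower values such as 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to raise the temperature automatically until certain thresholds are hit.
- timestamp_granularities[] (optional): the timestamp granularities to populate for this transcription. Defaults to segment. response_format must be set to verbose_json to use timestamp granularities. Either or both of word and segment are supported. Note: segment timestamps add no extra latency, but generating word timestamps does.

Return value: a Transcription Object or a Verbose Transcription Object.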

Transcription Object:

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}

Verbose Transcription Object:

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ]
}
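
The segments in a verbose transcription carry start and end times, which is enough to build subtitles by hand. The sketch below turns such an object into SRT text, assuming the response is already available as a plain dict shaped like the JSON above. format_timestamp and segments_to_srt are made-up helper names.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(verbose: dict) -> str:
    """Build SRT text from the "segments" list of a verbose transcription dict."""
    blocks = []
    for index, segment in enumerate(verbose["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        # Segment text usually starts with a space, so strip it
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)
```

When only subtitles are needed, requesting response_format="srt" directly is simpler; this conversion is useful when the verbose JSON is wanted for other purposes as well.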

# Send the Li Yunlong line voiced by TTS (liyunlong.mp3) to the Whisper model for Chinese transcription
from openai import OpenAI
client = OpenAI()

# Open the audio in binary mode; the with block closes the file afterwards
with open("./audio/liyunlong.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcription.text) # 二营长,你他娘的意大利泡呢?给我拉来!

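Translation API (takes an audio file, returns the translated text):

Request body
- file (file): the audio file object (not the file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
- model ('whisper-1'): the ID of the model to use. Currently only whisper-1, which is powered by OpenAI's open-source Whisper V2 model, is available.
- prompt (optional): text to guide the model's style or to continue a previous audio segment. The prompt should be in English.
- response_format (optional): the format of the output. Defaults to json. Options: json, text, srt, verbose_json, or vtt.
- temperature (optional): the sampling temperature, between 0 and 1. Higher values such as 0.8 make the output more random; lower values such as 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to raise the temperature automatically until certain thresholds are hit.
Return value
- translated_text: the translated text.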

# Chinese speech recognition plus translation with Whisper
# Send the Li Yunlong line voiced by TTS (liyunlong.mp3) to the Whisper model for translation
with open("./audio/liyunlong.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        prompt="Translate into English",
    )

print(translation.text) # Second Battalion Commander, where is your Italian gun? Bring it to me.

# Use TTS to voice the English version of the Li Yunlong line
speech_file_path = "./audio/liyunlong_en.mp3"

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="onyx",
    input=translation.text
) as response:
    response.stream_to_file(speech_file_path)
    
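# Use Whisper + TTS to produce an English version of a Guo Degang crosstalk (xiangsheng) clip
gdg_speech_file = "./audio/gdg_en.mp3"

# The with block closes the source audio file after the upload
with open("./audio/gdg.mp3", "rb") as gdg_audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=gdg_audio_file
    )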

print(translation.text) # There are a lot of people here. Thank you for coming. Today is the Xiangsheng Conference. Today, He Yunwei and I, the two of us, have a special feature. We don't have much ability, but we give a lot. There are not many opportunities to cooperate with Mr. Xing. Give a little. This is also an old senior in our Xiangsheng world. Mr. Xing Wenzhao has been in the Xiangsheng circle for more than 50 years. That's right. We are going to hold a commemorative exhibition this year. Mr. Xing Wenzhao's third anniversary. 53rd anniversary. That's the 40th anniversary. Oh my god.

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="onyx",
    input=translation.text
) as response:
    response.stream_to_file(gdg_speech_file)