How to Create an OpenAI-Compatible API Service for Orca-2-13B

Rust Language Chinese Community (Rust语言中文社区), 2023-11-27

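Orca-2-13B [1] is a model in Microsoft's recently released Orca 2 series, which also includes a 7B version. The Orca 2 models are fine-tuned from the Llama 2 base models. They build on the original 13B Orca model and learn to imitate the reasoning process of more capable AI systems, which improves how small models handle complex tasks. The series is strong at reasoning, text summarization, math problem solving, and comprehension tasks.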

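Using Orca-2-13B as the example, this article covers:

How to run Orca-2-13B on your own device

How to create an OpenAI-compatible API service for Orca-2-13B

You can run the Orca-2-7B model the same way; just swap in the download link for the Orca-2-7B GGUF file.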

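We will use the Rust + Wasm stack to develop and deploy applications for this model. There is no need to install complex Python packages or C++ toolchains! See why we chose the Rust + Wasm stack [2].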

Run Orca-2-13B on your own device

Step 1: Install WasmEdge [3] with the following command.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml

Step 2: Download the model's GGUF file [4]. The file is several GB, so the download may take a long time.

curl -LO https://huggingface.co/second-state/Orca-2-13B-GGUF/resolve/main/Orca-2-13b-ggml-model-q4_0.gguf
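Before moving on, it can be worth checking that the download really produced a GGUF file rather than, say, an error page saved under the same name. GGUF files begin with the four ASCII bytes GGUF followed by a little-endian 32-bit version number, so a minimal Python sketch only needs to read that header. The function name read_gguf_version is ours (it is not part of WasmEdge or llama-utils), and the check says nothing about whether the rest of the file is complete.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError on a bad header.

    Hypothetical helper for illustration; it inspects only the first 8 bytes.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The magic is followed by a little-endian uint32 version number.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling read_gguf_version("Orca-2-13b-ggml-model-q4_0.gguf") in the download directory should return a small integer; a ValueError suggests the download needs to be repeated.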

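Step 3: Download a cross-platform, portable Wasm file for the chat application. The application lets you chat with the model from the command line. The Rust source code for the application is available here [5].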

curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm

That's it. You can now chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Orca-2-13b-ggml-model-q4_0.gguf llama-chat.wasm -p chatml -s 'You are Orca, an AI language model created by Microsoft. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.' --stream-stdout

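This portable Wasm application automatically takes advantage of the hardware accelerators on your device (such as a GPU).

On my M1 Mac with 32 GB of memory, it runs at about 9.15 tokens per second.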
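To get a feel for what that rate means in practice, here is a back-of-the-envelope sketch in Python. The 9.15 tokens per second figure is the measurement above; the 200-token reply length is purely an illustrative assumption, and the estimate ignores the time spent processing the prompt.

```python
def estimated_seconds(num_tokens, tokens_per_second=9.15):
    """Rough generation time for a reply of num_tokens tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# 200 tokens is an assumed reply length, chosen only for illustration.
print(f"{estimated_seconds(200):.1f} s")  # prints "21.9 s"
```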

[USER]: What is an Orca?

[ASSISTANT]: 
An orca, or killer whale, is a large toothed predator belonging to the oceanic dolphin family. They are highly intelligent and social animals, known for their curiosity and playfulness.

[USER]:
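The -p chatml option in the command above selects the ChatML prompt template, the format shown in the Orca 2 model card. As a rough illustration of what the model receives, the Python sketch below renders a conversation into a ChatML string. The helper to_chatml is hypothetical, and the exact whitespace that llama-chat emits is an assumption; the Rust source [5] has the real template code.

```python
def to_chatml(messages):
    """Render a list of {"role": ..., "content": ...} dicts as a ChatML prompt.

    Illustrative only: the real llama-chat template may differ in whitespace.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are Orca, an AI language model created by Microsoft."},
    {"role": "user", "content": "What is an Orca?"},
])
print(prompt)
```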

Create an OpenAI-compatible API service for Orca-2-13B

An OpenAI-compatible web API lets Orca-2-13B work with a wide range of LLM tools and agent frameworks, such as flows.network, LangChain, and LlamaIndex.

First, download an API server application. It is also a cross-platform, portable Wasm application that runs on many CPU and GPU devices.

curl -LO https://github.com/second-state/llama-utils/raw/main/api-server/llama-api-server.wasm

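Then start the API server for the model with the following command, which loads the GGUF file downloaded in Step 2.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Orca-2-13b-ggml-model-q4_0.gguf llama-api-server.wasm -p chatml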
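Loading a 13B model can take a little while, so a script that talks to the server may want to wait until it is accepting connections. Here is a minimal Python sketch that polls a TCP port. The helper wait_for_port is our own, and port 8080 is simply the port the curl example in this article uses; adjust it if your server listens elsewhere.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, or False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("localhost", 8080) returns True as soon as something accepts connections on that port. It does not confirm that the model itself has finished loading.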

From another terminal, you can use curl to interact with the API server.

curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"What is the capital of France?"}], "model":"Orca-2-13B"}'
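The same request can be sent from a program. The sketch below uses only the Python standard library and mirrors the curl call: the same path, headers, and JSON body. It assumes the server is reachable at localhost:8080 and that the response follows the OpenAI chat completion shape, with the reply text under choices[0].message.content; the function names are ours, not part of llama-utils.

```python
import json
import urllib.request

# Assumed address: the same endpoint as the curl call, reached via localhost.
API_URL = "http://localhost:8080/v1/chat/completions"


def build_request(user_message,
                  system_message="You are a helpful AI assistant",
                  model="Orca-2-13B",
                  url=API_URL):
    """Build the POST request carrying an OpenAI-style chat payload."""
    payload = {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "model": model,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style response dict."""
    return response_body["choices"][0]["message"]["content"]


def ask(user_message):
    """Send one question to the running API server and return the reply text."""
    with urllib.request.urlopen(build_request(user_message)) as resp:
        return extract_reply(json.load(resp))
```

With the API server running, ask("What is the capital of France?") should return the model's answer as a string.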

That's it. WasmEdge is the easiest, fastest, and safest way to run Orca-2-13B LLM applications [6]. Give it a try!

Reviewed and edited by Liu Qing
