An Optimized Version of the Llama 2 Model: Llama-2-Onnx



Llama-2-Onnx is an optimized version of the Llama 2 model. The Llama 2 model consists of a stack of decoder layers, and each decoder layer (or transformer block) is built from one self-attention layer and one feed-forward multi-layer perceptron. Compared with the classic transformer, Llama models use a different projection size in the feed-forward layer: both Llama 1 and Llama 2 project to roughly 2.7x the hidden size instead of the standard 4x. A key difference between Llama 1 and Llama 2 is an architectural change in the attention layer, where Llama 2 uses the Grouped Query Attention (GQA) mechanism to improve efficiency.


Llama 2 Powered By ONNX

This is an optimized version of the Llama 2 model, available from Meta under the Llama Community License Agreement found on this repository. Microsoft permits you to use, modify, redistribute and create derivatives of Microsoft’s contributions to the optimized version subject to the restrictions and disclaimers of warranty and liability in the Llama Community License agreement.

Before You Start

The sub-modules that contain the ONNX files in this repository are access controlled.
To get access permissions to the Llama 2 model, please fill out the Llama 2 ONNX sign up page. If approved, you will receive GitHub access within the next 48 hours, usually much sooner.

Cloning This Repository And The Submodules

Before you begin, ensure you have Git LFS installed. Git LFS (Large File Storage) is used to handle large files efficiently. You can find out how to install Git LFS for your operating system at https://git-lfs.com/.
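Once the Git LFS client is installed for your platform, it typically needs to be enabled once for your user account before cloning (a standard Git LFS setup step; see the Git LFS documentation for platform specifics):

git lfs install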

Next, you can choose which version of the Llama 2 model you would like to use by selecting the appropriate submodule.

Choose from the following sub-modules:

  • 7B_FT_float16
  • 7B_FT_float32
  • 7B_float16
  • 7B_float32
  • 13B_FT_float16
  • 13B_FT_float32
  • 13B_float16
  • 13B_float32
git clone https://github.com/microsoft/Llama-2-Onnx.git
cd Llama-2-Onnx
git submodule init <chosen_submodule> 
git submodule update

You can repeat the init command with a different submodule name to initialize additional submodules. Be careful: the contained files are very large! (The 7B float16 models are about 10GB.)
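For example, to pull only the 7B_FT_float16 model (assuming the submodule path matches the name shown in the list above):

git submodule init 7B_FT_float16
git submodule update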

What is Llama 2?

Llama 2 is a collection of pretrained and fine-tuned generative text models. To learn more about Llama 2, review the Llama 2 model card.

What Is The Structure Of Llama 2?

The Llama 2 model consists of a stack of decoder layers. Each decoder layer (or transformer block) is constructed from one self-attention layer and one feed-forward multi-layer perceptron. Llama models use a different projection size in the feed-forward layer compared with classic transformers: both Llama 1 and Llama 2 project to roughly 2.7x the hidden size rather than the standard 4x. A key difference between Llama 1 and Llama 2 is an architectural change in the attention layer, in which Llama 2 takes advantage of the Grouped Query Attention (GQA) mechanism to improve efficiency.
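As a rough illustration of the feed-forward shape described above, here is a minimal PyTorch-style sketch (not the exported ONNX graph); the default sizes assume the 7B configuration, where the hidden size is 4096 and the intermediate size is 11008, i.e. roughly 2.7x:

import torch
import torch.nn as nn

class LlamaStyleFeedForward(nn.Module):
    # SwiGLU-style feed-forward block: the gate/up projections expand to
    # ~2.7x the hidden size, and the down projection maps back to it.
    def __init__(self, hidden_size: int = 4096, intermediate_size: int = 11008):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))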


FAQ

Is There A Simple Code Example Running Llama 2 With ONNX?

There are two examples provided in this repository. There is a minimum working example shown in Llama-2-Onnx/MinimumExample. This is simply a command line program that will complete some text with the chosen version of Llama 2.

Given the following input:

python MinimumExample/Example_ONNX_LlamaV2.py --onnx_file 7B_FT_float16/ONNX/LlamaV2_7B_FT_float16.onnx --embedding_file 7B_FT_float16/embeddings.pth --tokenizer_path tokenizer.model --prompt "What is the lightest element?"

Output:

The lightest element is hydrogen. Hydrogen is the lightest element on the periodic table, with an atomic mass of 1.00794 u (unified atomic mass units).

Is There A More Complete Code Example Running Llama 2 With ONNX?

There is a more complete chat bot interface available in Llama-2-Onnx/ChatApp. This is a Python program based on the popular Gradio web interface framework. It allows you to interact with the chosen version of Llama 2 through a chat bot interface.

An example interaction can be seen here:

[Screenshot: an example ChatApp interaction]
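The repository's ChatApp is more involved, but the basic Gradio chat pattern it builds on looks roughly like this minimal, self-contained sketch; the echo reply is a stand-in for the actual ONNX Llama 2 generation code:

import gradio as gr

def respond(message, history):
    # Stand-in reply: the real ChatApp would feed the conversation into the
    # ONNX Llama 2 session and return the generated text instead.
    return f"(model reply to: {message})"

gr.ChatInterface(respond).launch()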

How Do I Use The Fine-tuned Models?

The fine-tuned models were trained for dialogue applications.

To get the expected features and performance from them, a specific prompt format needs to be followed, including the INST tags, the BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling strip() on inputs to avoid double spaces).

This puts the models in chat mode and enables additional safeguards to reduce potentially undesirable output.
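As a sketch of the single-turn chat format (based on Meta's published Llama 2 chat template; verify it against Meta's reference code before relying on it), the prompt given to a fine-tuned model looks like this, with the BOS and EOS tokens added by the tokenizer:

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_chat_prompt(system_prompt: str, user_message: str) -> str:
    # strip() avoids the double spaces mentioned above.
    return f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_message.strip()} {E_INST}"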

Why Is The First Inference Session Slow?

The ONNX Runtime execution provider might need to generate JIT binaries for the underlying hardware. Typically the binaries are cached and loaded directly in subsequent runs, which reduces the overhead.

Why Is FP16 ONNX Slower Than ONNX FP32 On My Device?

It is possible that your device does not support native FP16 math; in that case the weights will be cast to FP32 at runtime. Using the FP32 version of the model avoids the cast overhead.

How Do I Get Better Inference Speed?

It is recommended to place inputs and outputs on the target device to avoid expensive data copies; please refer to the following document for details.

I/O Binding | onnxruntime
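A minimal sketch of ONNX Runtime I/O binding (the model file name and input shape here are placeholders, not the Llama 2 graph; see the linked document for the full API):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

io_binding = session.io_binding()
x = np.zeros((1, 16), dtype=np.float32)
# Copy the input to the GPU once and bind it there, instead of letting every
# run() call copy it from host memory.
x_on_device = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
io_binding.bind_ortvalue_input(session.get_inputs()[0].name, x_on_device)
# Let the runtime allocate the output directly on the GPU.
io_binding.bind_output(session.get_outputs()[0].name, "cuda")
session.run_with_iobinding(io_binding)
outputs = io_binding.copy_outputs_to_cpu()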

What Parameters Should I Test With?

Users can perform temperature and top-p sampling using the model's output logits. Please refer to Meta's guidance for the best parameter combination; an example is located here.
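A minimal sketch of temperature plus top-p (nucleus) sampling over a logits vector (the defaults of 0.6 and 0.9 are commonly used values; check Meta's example for the recommended combination):

import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.6, top_p: float = 0.9) -> int:
    # Softmax over temperature-scaled logits.
    scaled = (logits - logits.max()) / temperature
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Nucleus filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))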

How Can I Develop With Llama 2 Responsibly?

In order to help developers innovate responsibly, Meta encourages you to review the Responsible Use Guide for the Llama 2 models.

Microsoft encourages you to learn more about its Responsible AI approach, including many publicly available resources and tools for developers.

References:
[1]http://github.com/microsoft/Llama-2-Onnx

