Inference code for CodeLlama models.
Code Llama is a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama was developed by fine-tuning Llama 2 using a higher sampling of code. As with Llama 2, we applied considerable safety mitigations to the fine-tuned versions of the model. For detailed information on model training, architecture and parameters, evaluations, responsible AI and safety refer to our research paper. Output generated by code generation features of the Llama Materials, including Code Llama, may be subject to third party licenses, including, without limitation, open source licenses.
We are unlocking the power of large language models and our latest version of Code Llama is now accessible to individuals, creators, researchers and businesses of all sizes so that they can experiment, innovate and scale their ideas responsibly. This release includes model weights and starting code for pretrained and fine-tuned Llama language models — ranging from 7B to 34B parameters.
This repository is intended as a minimal example to load Code Llama models and run inference.
Download
In order to download the model weights and tokenizers, please visit the Meta website and accept our License.
Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download. Make sure that you copy the URL text itself; do not use the 'Copy link address' option when you right-click the URL. If the copied URL text starts with https://download.llamameta.net, you copied it correctly; if it starts with https://l.facebook.com, you copied it the wrong way.
Pre-requisites: make sure you have wget and md5sum installed. Then run the script: bash download.sh.
Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.
Model sizes
Model | Size |
---|---|
7B | ~12.55GB |
13B | 24GB |
34B | 63GB |
Setup
In a conda env with PyTorch / CUDA available, clone the repo and run in the top-level directory:
pip install -e .
Inference
Different models require different model-parallel (MP) values:
Model | MP |
---|---|
7B | 1 |
13B | 2 |
34B | 4 |
All models support sequence lengths up to 100,000 tokens, but we pre-allocate the cache according to the max_seq_len and max_batch_size values. So set those according to your hardware and use case.
Pretrained Code Models
The Code Llama and Code Llama - Python models are not fine-tuned to follow instructions. They should be prompted so that the expected answer is the natural continuation of the prompt.
See example_completion.py for some examples. To illustrate, see the command below to run it with the CodeLlama-7b model (nproc_per_node needs to be set to the MP value):
torchrun --nproc_per_node 1 example_completion.py \
--ckpt_dir CodeLlama-7b/ \
--tokenizer_path CodeLlama-7b/tokenizer.model \
--max_seq_len 128 --max_batch_size 4
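Under the hood, example_completion.py builds a generator and requests continuations of code prompts. A minimal sketch of that flow, assuming the Llama.build / text_completion interface used by the example scripts in this repo (exact argument names may differ between versions, and the script must still be launched via torchrun as shown above):

# Minimal sketch of a completion call, modeled on example_completion.py.
# Assumes the Llama.build / text_completion interface from this repo.
from llama import Llama

generator = Llama.build(
    ckpt_dir="CodeLlama-7b/",
    tokenizer_path="CodeLlama-7b/tokenizer.model",
    max_seq_len=128,      # the KV cache is pre-allocated from these two values
    max_batch_size=4,
)

# Base models are not instruction-tuned: phrase the prompt so that the
# desired code is its natural continuation.
prompts = [
    'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n',
]
results = generator.text_completion(prompts, max_gen_len=64, temperature=0.2, top_p=0.95)
for prompt, result in zip(prompts, results):
    print(prompt + result["generation"])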
Pretrained code models are: the Code Llama models CodeLlama-7b, CodeLlama-13b, and CodeLlama-34b, and the Code Llama - Python models CodeLlama-7b-Python, CodeLlama-13b-Python, and CodeLlama-34b-Python.
Code Infilling
Code Llama and Code Llama - Instruct 7B and 13B models are capable of filling in code given the surrounding context.
See example_infilling.py for some examples. The CodeLlama-7b model can be run for infilling with the command below (nproc_per_node needs to be set to the MP value):
torchrun --nproc_per_node 1 example_infilling.py \
--ckpt_dir CodeLlama-7b/ \
--tokenizer_path CodeLlama-7b/tokenizer.model \
--max_seq_len 192 --max_batch_size 4
Pretrained infilling models are: the Code Llama models CodeLlama-7b and CodeLlama-13b, and the Code Llama - Instruct models CodeLlama-7b-Instruct and CodeLlama-13b-Instruct.
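For reference, the infilling-capable checkpoints are trained with a fill-in-the-middle prompt layout in which the code before and after the hole is marked with special tokens. The sketch below only illustrates that layout as it is commonly documented for Code Llama (the <PRE>/<SUF>/<MID> tokens); example_infilling.py assembles the actual prompt for you, and the exact token handling may differ:

# Illustrative fill-in-the-middle prompt layout for Code Llama (commonly
# documented special tokens <PRE>, <SUF>, <MID>); this is a sketch of the idea,
# not the exact assembly performed by example_infilling.py.
prefix = 'def remove_non_ascii(s: str) -> str:\n    """'
suffix = '\n    return result\n'

# The model generates the missing middle part that connects prefix and suffix.
infilling_prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"
print(infilling_prompt)
# Generation stops at an end-of-infilling token; the completed function is
# prefix + generated_middle + suffix.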
Fine-tuned Instruction Models
Code Llama - Instruct models are fine-tuned to follow instructions. To get the expected features and performance for them, a specific formatting defined in chat_completion needs to be followed, including the INST and <<SYS>> tags, BOS and EOS tokens, and the whitespaces and linebreaks in between (we recommend calling strip() on inputs to avoid double-spaces).
You can use chat_completion directly to generate answers with the instruct model.
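For illustration, the chat template that chat_completion applies follows the commonly documented Llama 2 [INST] / <<SYS>> layout, sketched below. In practice you pass plain role/content messages to chat_completion and let it handle the special tokens and whitespace:

# Rough sketch of the [INST] / <<SYS>> layout assembled for Code Llama - Instruct
# (commonly documented Llama 2 chat format). chat_completion builds this,
# including BOS/EOS tokens, from [{"role": ..., "content": ...}] dialogs.
system = "Provide answers in Python."
user = "Write a function that reverses a linked list."

prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user.strip()} [/INST]"
print(prompt)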
You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.
Examples using CodeLlama-7b-Instruct:
torchrun --nproc_per_node 1 example_instructions.py \
--ckpt_dir CodeLlama-7b-Instruct/ \
--tokenizer_path CodeLlama-7b-Instruct/tokenizer.model \
--max_seq_len 512 --max_batch_size 4
Fine-tuned instruction-following models are: the Code Llama - Instruct models CodeLlama-7b-Instruct, CodeLlama-13b-Instruct, and CodeLlama-34b-Instruct.
Code Llama is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the Responsible Use Guide. More details can be found in our research papers as well.
Issues
Please report any software “bug”, or other problems with the models through one of the following means:
- Reporting issues with the model: github.com/facebookresearch/codellama
- Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info
Model Card
See MODEL_CARD.md for the model card of Code Llama.
References
from https://github.com/facebookresearch/codellama
---------------------------------
Code Llama is fine-tuned from the Llama 2 base model and comes in three versions: the foundation model (Code Llama), a Python-specialized version (Code Llama - Python), and a natural-language instruction-tuned version (Code Llama - Instruct).
Each version is available in 7B, 13B, and 34B parameter sizes, and every model was trained on 500B tokens of code and code-related data.
Meta hopes Code Llama will inspire the community to build further on Llama 2 and become a new creative tool for research and commercial products.
Features
Supports a 100k-token context (large enough to fit an entire project).
Supports Python, C++, Java, PHP, TypeScript (JavaScript), SQL, C#, Bash, and other languages.
The Python 34B version scores 53.7% on HumanEval and 56.2% on MBPP, beating GPT-3.5's 48.1% and 52.2%.
Open source and licensed for commercial use.
Surprisingly, Code Llama also has an unreleased "Unnatural" version whose performance already surpasses ChatGPT and approaches GPT-4.
------------------------
Meta open-sources the coding LLM Code Llama, with performance approaching GPT-4

Takeaways
- Code Llama is a state-of-the-art LLM capable of generating code, and natural language about code, from both code and natural language prompts.
- Code Llama is free for research and commercial use.
- Code Llama is built on top of Llama 2 and is available in three models:
- Code Llama, the foundational code model;
- Code Llama - Python, specialized for Python;
- and Code Llama - Instruct, which is fine-tuned for understanding natural language instructions.
- In our own benchmark testing, Code Llama outperformed state-of-the-art publicly available LLMs on code tasks.
Today, we are releasing Code Llama, a large language model (LLM) that can use text prompts to generate code. Code Llama is state-of-the-art for publicly available LLMs on code tasks, and has the potential to make workflows faster and more efficient for current developers and lower the barrier to entry for people who are learning to code. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more robust, well-documented software.
The generative AI space is evolving rapidly, and we believe an open approach to today’s AI is the best one for developing new AI tools that are innovative, safe, and responsible. We are releasing Code Llama under the same community license as Llama 2.
How Code Llama works
Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets, sampling more data from that same dataset for longer. Essentially, Code Llama features enhanced coding capabilities, built on top of Llama 2. It can generate code, and natural language about code, from both code and natural language prompts (e.g., “Write me a function that outputs the fibonacci sequence.”) It can also be used for code completion and debugging. It supports many of the most popular languages being used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash.

We are releasing three sizes of Code Llama with 7B, 13B, and 34B parameters respectively. Each of these models is trained with 500B tokens of code and code-related data. The 7B and 13B base and instruct models have also been trained with fill-in-the-middle (FIM) capability, allowing them to insert code into existing code, meaning they can support tasks like code completion right out of the box.
The three models address different serving and latency requirements. The 7B model, for example, can be served on a single GPU. The 34B model returns the best results and allows for better coding assistance, but the smaller 7B and 13B models are faster and more suitable for tasks that require low latency, like real-time code completion.

The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens.
Aside from being a prerequisite for generating longer programs, having longer input sequences unlocks exciting new use cases for a code LLM. For example, users can provide the model with more context from their codebase to make the generations more relevant. It also helps in debugging scenarios in larger codebases, where staying on top of all code related to a concrete issue can be challenging for developers. When developers are faced with debugging a large chunk of code they can pass the entire length of the code into the model.

Additionally, we have further fine-tuned two additional variations of Code Llama: Code Llama - Python and Code Llama - Instruct.
Code Llama - Python is a language-specialized variation of Code Llama, further fine-tuned on 100B tokens of Python code. Because Python is the most benchmarked language for code generation – and because Python and PyTorch play an important role in the AI community – we believe a specialized model provides additional utility.
Code Llama - Instruct is an instruction fine-tuned and aligned variation of Code Llama. Instruction tuning continues the training process, but with a different objective. The model is fed a “natural language instruction” input and the expected output. This makes it better at understanding what humans expect out of their prompts. We recommend using Code Llama - Instruct variants whenever using Code Llama for code generation since Code Llama - Instruct has been fine-tuned to generate helpful and safe answers in natural language.
We do not recommend using Code Llama or Code Llama - Python to perform general natural language tasks since neither of these models are designed to follow natural language instructions. Code Llama is specialized for code-specific tasks and isn’t appropriate as a foundation model for other tasks.
When using the Code Llama models, users must abide by our license and acceptable use policy.

Evaluating Code Llama’s performance
To test Code Llama’s performance against existing solutions, we used two popular coding benchmarks: HumanEval and Mostly Basic Python Programming (MBPP). HumanEval tests the model’s ability to complete code based on docstrings and MBPP tests the model’s ability to write code based on a description.
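For context, a HumanEval-style problem gives the model a function signature and docstring and checks its completion against hidden unit tests. The illustrative example below is not taken from the benchmark suite; it only shows the shape of such a task and one body a model might produce:

# Illustrative HumanEval-style task: the model receives the signature and
# docstring and must generate the body, which is then checked by unit tests.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other than
    the given threshold."""
    # A body a model might produce:
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1:]
    )

assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False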
Our benchmark testing showed that Code Llama performed better than open-source, code-specific LLMs and outperformed Llama 2. Code Llama 34B, for example, scored 53.7% on HumanEval and 56.2% on MBPP, the highest compared with other state-of-the-art open solutions, and on par with ChatGPT.

As with all cutting edge technology, Code Llama comes with risks. Building AI models responsibly is crucial, and we undertook numerous safety measures before releasing Code Llama. As part of our red teaming efforts, we ran a quantitative evaluation of Code Llama’s risk of generating malicious code. We created prompts that attempted to solicit malicious code with clear intent and scored Code Llama’s responses to those prompts against ChatGPT’s (GPT3.5 Turbo). Our results found that Code Llama answered with safer responses.
Details about our red teaming efforts from domain experts in responsible AI, offensive security engineering, malware development, and software engineering are available in our research paper.
Releasing Code Llama
Programmers are already using LLMs to assist in a variety of tasks, ranging from writing new software to debugging existing code. The goal is to make developer workflows more efficient, so they can focus on the most human centric aspects of their job, rather than repetitive tasks.
At Meta, we believe that AI models, but LLMs for coding in particular, benefit most from an open approach, both in terms of innovation and safety. Publicly available, code-specific models can facilitate the development of new technologies that improve people's lives. By releasing code models like Code Llama, the entire community can evaluate their capabilities, identify issues, and fix vulnerabilities.
Code Llama’s training recipes are available on our Github repository.
Model weights are also available.
Responsible use
Our research paper discloses details of Code Llama’s development as well as how we conducted our benchmarking tests. It also provides more information into the model’s limitations, known challenges we encountered, mitigations we’ve taken, and future challenges we intend to investigate.
We’ve also updated our Responsible Use Guide and it includes guidance on developing downstream models responsibly, including:
- Defining content policies and mitigations.
- Preparing data.
- Fine-tuning the model.
- Evaluating and improving performance.
- Addressing input- and output-level risks.
- Building transparency and reporting mechanisms in user interactions.
Developers should evaluate their models using code-specific evaluation benchmarks and perform safety studies on code-specific use cases such as generating malware, computer viruses, or malicious code. We also recommend leveraging safety datasets for automatic and human evaluations, and red teaming on adversarial prompts.
The future of generative AI for coding
Code Llama is designed to support software engineers in all sectors – including research, industry, open source projects, NGOs, and businesses. But there are still many more use cases to support than what our base and instruct models can serve.
We hope that Code Llama will inspire others to leverage Llama 2 to create new innovative tools for research and commercial products.
Try Code Llama today
Download the Code Llama model | Read the research paper
-------------------------------------------------------------------
Deploying llama3 locally with Ollama and using the model through open-webui
Llama 3 is an open-source language model family released by Meta (Facebook) AI, consisting of an 8B (8 billion parameter) model and a 70B (70 billion parameter) model. Llama 3 supports a wide range of commercial and research uses and has demonstrated excellent performance on multiple industry-standard benchmarks.
Llama 3 uses an optimized auto-regressive Transformer architecture designed for complex text-generation tasks, improving the coherence and relevance of the generated text. The model combines supervised fine-tuning (SFT) with reinforcement learning from human feedback (RLHF); this hybrid approach improves both helpfulness and safety, making the model more reliable and better aligned with user expectations in real applications.
1. Install Ollama
Ollama website: https://ollama.com/
Download the Ollama build for your operating system.
Download page: https://ollama.com/download
After downloading, double-click the installer to run it and accept all the default options.
2. Download an AI model through Ollama
Open the Ollama model library: https://ollama.com/library
Find llama3, open its page, and copy the install command (the same steps apply to any other AI model):
ollama run llama3
Press Win+R, type "cmd" and press Enter, then right-click to paste the command you just copied and press Enter again to download and install llama3.
Other commonly used ollama commands include:
serve Start ollama
create Create a model from a Modelfile
show Show information for a model
run Run a model
pull Pull a model from a registry
push Push a model to a registry
list List models
cp Copy a model
rm Remove a model
help Help about any command
When you see the prompt below, the installation is complete; you can then type your question directly and start chatting.
The next time you want to chat, just run ollama run llama3 in a CMD window to start a new conversation.
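Besides the interactive CLI, Ollama also listens on a local HTTP API (by default http://localhost:11434), which is what web front-ends such as open-webui talk to. A minimal sketch of calling it from Python:

# Minimal sketch: query the local Ollama HTTP API (default port 11434),
# the same interface that front-ends such as open-webui use.
import json
import urllib.request

payload = {
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "stream": False,          # return the full answer in a single JSON response
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])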
Still, using the model from the CMD command line like this is inconvenient, so I install a free, open-source web interface such as open-webui or lobe-chat to use with it.
3. Download, install, and use open-webui
1. Enable Microsoft Hyper-V: open "Control Panel -> Programs -> Turn Windows features on or off".
2. Install the Docker environment.
Docker website: https://docker.com/
Download Docker Desktop.
Download page: https://docs.docker.com/desktop/install/windows-install/
After downloading, double-click the .exe to run it. Make sure to uncheck "Use WSL 2 instead of Hyper-V (recommended)"; otherwise it can cause a lot of problems (lesson learned the hard way).
Wait for the installation to finish.
When installation completes, click "Close and restart" to reboot the computer.
After the system restarts, double-click the "Docker Desktop" icon on the desktop and click "Accept" in the pop-up.
Click "Continue without signing in" to enter without logging in.
Click "Skip survey".
You will land on the Docker Desktop interface.
Switch to China-based registry mirrors (Settings ⚙ -> Docker Engine): paste the content below and click "Apply & restart" to save and restart Docker.
{
  "registry-mirrors": [
    "https://82m9ar63.mirror.aliyuncs.com",
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn"
  ],
  "builder": {
    "gc": {
      "defaultKeepStorage": "20GB",
      "enabled": true
    }
  },
  "experimental": false,
  "features": {
    "buildkit": true
  }
}
3. Install the open-webui service with Docker.
Press Win+R, type "cmd" and press Enter, then right-click to paste the command below and press Enter.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
Depending on your network, this step may take quite a while; please be patient until the screen below appears.
Open the Docker Desktop interface again and you will see the open-webui service already running.
Click the port number under "Port(s)" to open the open-webui web page.
Click "Sign up" to open the registration page, enter your details, and click "Create Account"; the first account registered becomes the administrator.
After registering, click the settings ⚙ in the top-right corner -> General -> Language and select Chinese if you want to switch the interface to Chinese.
You can then select the previously deployed llama3:8b model directly and start using it.
from http://web.archive.org/web/20240613014428/https://www.sunweihu.com/8838.html
--------------------------------------
The Llama 3 large model is open source!
Llama 3 is the latest large language model released by Meta, intended to let individuals, creators, researchers, and businesses of all sizes experiment, innovate, and scale their ideas responsibly.
Compared with previously released open models, Llama 3 offers:
Data volume: trained on more than 7x the data of the Llama 2 dataset.
Stronger capabilities: improved reasoning and coding abilities.
Training efficiency: 3x higher than Llama 2.
Model sizes: pretrained and instruction-tuned Llama 3 language models ranging from 8B to 70B parameters.
Download and usage: a guide for downloading the model weights and tokenizer, plus quick-start steps for running the model locally.
Model parallelism: different model sizes require different model-parallel (MP) values.
License: the models and weights are open to researchers and commercial entities, aiming to promote discovery and ethical AI progress.
Repository: https://github.com/meta-llama/llama3
( The official Meta Llama 3 GitHub site
Models on Hugging Face | Blog | Website | Get Started
We are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.
This release includes model weights and starting code for pre-trained and instruction-tuned Llama 3 language models — including sizes of 8B to 70B parameters.
This repository is a minimal example of loading Llama 3 models and running inference. For more detailed examples, see llama-recipes.
To download the model weights and tokenizer, please visit the Meta Llama website and accept our License.
Once your request is approved, you will receive a signed URL over email. Then, run the download.sh script, passing the URL provided when prompted to start the download.
Pre-requisites: Ensure you have wget and md5sum installed. Then run the script: ./download.sh.
Remember that the links expire after 24 hours and a certain amount of downloads. You can always re-request a link if you start seeing errors such as 403: Forbidden.
We also provide downloads on Hugging Face, in both transformers and native llama3 formats. To download the weights from Hugging Face, please follow these steps:
- Visit one of the repos, for example meta-llama/Meta-Llama-3-8B-Instruct.
- Read and accept the license. Once your request is approved, you'll be granted access to all the Llama 3 models. Note that requests used to take up to one hour to get processed.
- To download the original native weights to use with this repo, click on the "Files and versions" tab and download the contents of the original folder. You can also download them from the command line if you pip install huggingface-hub:
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct
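If you prefer to script the download, the same filter can be expressed with the huggingface_hub Python package. A sketch, assuming your license request has been approved and you are logged in with a token (e.g. via huggingface-cli login):

# Sketch: download only the native ("original") Llama 3 weights via huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    allow_patterns=["original/*"],               # only the native-format weights
    local_dir="meta-llama/Meta-Llama-3-8B-Instruct",
)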
You can follow the steps below to get up and running with Llama 3 models quickly. These steps will let you run quick inference locally. For more examples, see the Llama recipes repository.
- Clone and download this repository in a conda env with PyTorch / CUDA.
- In the top-level directory run:
pip install -e .
- Visit the Meta Llama website and register to download the model/s.
- Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.
- Once you get the email, navigate to your downloaded llama repository and run the download.sh script.
  - Make sure to grant execution permissions to the download.sh script.
  - During this process, you will be prompted to enter the URL from the email.
  - Do not use the “Copy Link” option; copy the link from the email manually.
- Once the model/s you want have been downloaded, you can run the model locally using the command below:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir Meta-Llama-3-8B-Instruct/ \
--tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
--max_seq_len 512 --max_batch_size 6
Note
- Replace Meta-Llama-3-8B-Instruct/ with the path to your checkpoint directory and Meta-Llama-3-8B-Instruct/tokenizer.model with the path to your tokenizer model.
- The --nproc_per_node should be set to the MP value for the model you are using.
- Adjust the max_seq_len and max_batch_size parameters as needed.
- This example runs the example_chat_completion.py found in this repository, but you can change that to a different .py file.
Different models require different model-parallel (MP) values:
Model | MP |
---|---|
8B | 1 |
70B | 8 |
All models support sequence lengths up to 8192 tokens, but we pre-allocate the cache according to the max_seq_len and max_batch_size values. So set those according to your hardware.
These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.
See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-3-8b model (nproc_per_node needs to be set to the MP value):
torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir Meta-Llama-3-8B/ \
--tokenizer_path Meta-Llama-3-8B/tokenizer.model \
--max_seq_len 128 --max_batch_size 4
The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, specific formatting defined in ChatFormat needs to be followed: the prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role (system, user, or assistant), and the <|end_header_id|> tag. After a double newline \n\n, the message's contents follow. The end of each message is marked by the <|eot_id|> token.
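Put together, a single-turn prompt in this format looks roughly as follows. This is only an illustrative sketch of the layout described above; ChatFormat in this repo builds the actual token sequence from role/content messages for you:

# Illustrative sketch of the Llama 3 chat prompt layout described above.
# In practice, ChatFormat / example_chat_completion.py assemble this, including
# the special tokens, from [{"role": ..., "content": ...}] messages.
system = "You are a helpful assistant."
user = "What is the capital of France?"

prompt = (
    "<|begin_of_text|>"
    f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
print(prompt)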
You can also deploy additional classifiers to filter out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.
Examples using llama-3-8b-chat:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir Meta-Llama-3-8B-Instruct/ \
--tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
--max_seq_len 512 --max_batch_size 6
Llama 3 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. To help developers address these risks, we have created the Responsible Use Guide.
Please report any software “bug” or other problems with the models through one of the following means:
- Reporting issues with the model: https://github.com/meta-llama/llama3/issues
- Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info
from https://github.com/meta-llama/llama3 )
------------------------------------------------------
Related post:
https://briteming.blogspot.com/2024/04/maxkb-llm.html
---------------------------------------------------------
A framework for building AI assistants
Phidata is a framework for building AI assistants with memory, knowledge, and tools, designed to address the context limitations of large language models (LLMs) and their inability to take actions. It works as follows (a rough sketch of this pattern appears after the repository link below):
Memory: stores chat history in a database so the LLM can hold long-running conversations.
Knowledge: stores information in a vector database to provide the LLM with context.
Tools: let the LLM take actions such as pulling data from an API, sending an email, or querying a database.
Repository: https://github.com/phidatahq/phidata
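The sketch below is not Phidata's actual API; it is a generic illustration, under assumed names, of the memory / knowledge / tools pattern described in the list above (chat history in SQLite, context retrieval, and a placeholder model call):

# Generic illustration of the memory / knowledge / tools pattern (NOT Phidata's API).
# All names here (Assistant, search_knowledge, call_llm) are assumptions for the sketch.
import sqlite3

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call (e.g. a local Ollama or a hosted API).
    return f"[model answer to: {prompt[:40]}...]"

def search_knowledge(query: str) -> list[str]:
    # Placeholder for a vector-database lookup returning relevant snippets.
    return [f"(retrieved context for '{query}')"]

class Assistant:
    def __init__(self, db_path: str = "chat_history.db"):
        self.db = sqlite3.connect(db_path)  # memory: persistent chat history
        self.db.execute("CREATE TABLE IF NOT EXISTS history (role TEXT, content TEXT)")

    def run(self, user_message: str) -> str:
        context = "\n".join(search_knowledge(user_message))          # knowledge
        history = self.db.execute("SELECT role, content FROM history").fetchall()
        prompt = f"History: {history}\nContext: {context}\nUser: {user_message}"
        answer = call_llm(prompt)   # a real framework could also route to a tool here
        self.db.execute("INSERT INTO history VALUES (?, ?)", ("user", user_message))
        self.db.execute("INSERT INTO history VALUES (?, ?)", ("assistant", answer))
        self.db.commit()
        return answer

print(Assistant().run("Summarize yesterday's sales numbers."))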
----------------------------------------------------------
An open-source RAG engine
RAGFlow is open-sourced by the developer infiniflow and has already collected 5.2K stars. The project is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding, offering a streamlined RAG workflow for businesses of all sizes.
Its key features include:
High-quality input and output: deep document understanding and knowledge extraction from unstructured data with complex formats.
Template-based chunking: intelligent and explainable template options.
Grounded citations: fewer hallucinations, with visualized text chunking that allows human intervention and quick access to key references and traceable citations supporting fact-based answers.
Heterogeneous data source compatibility: supports Word, PPT, Excel, TXT, images, scanned copies, structured data, web pages, and more.
Automated RAG workflow: streamlined RAG orchestration tailored to both individuals and large enterprises, with configurable LLMs and embedding models, multiple recall paired with fused re-ranking, and intuitive APIs for seamless business integration.
Repository: https://github.com/infiniflow/ragflow
( RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLM (Large Language Models) to provide truthful question-answering capabilities, backed by well-founded citations from various complex formatted data.
Try our demo at https://demo.ragflow.io.
- 2024-07-08 Supports workflow based on Graph.
- 2024-06-27 Supports Markdown and Docx in the Q&A parsing method.
- 2024-06-27 Supports extracting images from Docx files.
- 2024-06-27 Supports extracting tables from Markdown files.
- 2024-06-14 Supports PDF in the Q&A parsing method.
- 2024-06-06 Supports Self-RAG, which is enabled by default in dialog settings.
- 2024-05-30 Integrates BCE and BGE reranker models.
- 2024-05-28 Supports LLM Baichuan and VolcanoArk.
- 2024-05-23 Supports RAPTOR for better text retrieval.
- 2024-05-21 Supports streaming output and text chunk retrieval API.
- 2024-05-15 Integrates OpenAI GPT-4o.
- Deep document understanding-based knowledge extraction from unstructured data with complicated formats.
- Finds "needle in a data haystack" of literally unlimited tokens.
- Intelligent and explainable.
- Plenty of template options to choose from.
- Visualization of text chunking to allow human intervention.
- Quick view of the key references and traceable citations to support grounded answers.
- Supports Word, slides, excel, txt, images, scanned copies, structured data, web pages, and more.
- Streamlined RAG orchestration catered to both personal and large businesses.
- Configurable LLMs as well as embedding models.
- Multiple recall paired with fused re-ranking.
- Intuitive APIs for seamless integration with business.
- CPU >= 4 cores
- RAM >= 16 GB
- Disk >= 50 GB
- Docker >= 24.0.0 & Docker Compose >= v2.26.1
If you have not installed Docker on your local machine (Windows, Mac, or Linux), see Install Docker Engine.
Ensure vm.max_map_count >= 262144:
To check the value of vm.max_map_count:
$ sysctl vm.max_map_count
Reset vm.max_map_count to a value of at least 262144 if it is not.
# In this case, we set it to 262144:
$ sudo sysctl -w vm.max_map_count=262144
This change will be reset after a system reboot. To ensure your change remains permanent, add or update the vm.max_map_count value in /etc/sysctl.conf accordingly:
vm.max_map_count=262144
Clone the repo:
$ git clone https://github.com/infiniflow/ragflow.git
Build the pre-built Docker images and start up the server:
Running the following commands automatically downloads the dev version RAGFlow Docker image. To download and run a specified Docker version, update RAGFLOW_VERSION in docker/.env to the intended version, for example RAGFLOW_VERSION=v0.8.0, before running the following commands.
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
The core image is about 9 GB in size and may take a while to load.
Check the server status after having the server up and running:
$ docker logs -f ragflow-server
The following output confirms a successful launch of the system:
____ ______ __
/ __ \ ____ _ ____ _ / ____// /____ _ __
/ /_/ // __ `// __ `// /_ / // __ \| | /| / /
/ _, _// /_/ // /_/ // __/ / // /_/ /| |/ |/ /
/_/ |_| \__,_/ \__, //_/ /_/ \____/ |__/|__/
/____/
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:9380
* Running on http://x.x.x.x:9380
INFO:werkzeug:Press CTRL+C to quit
If you skip this confirmation step and directly log in to RAGFlow, your browser may prompt a network anomaly error because, at that moment, your RAGFlow may not be fully initialized.
- In your web browser, enter the IP address of your server and log in to RAGFlow.
  With the default settings, you only need to enter http://IP_OF_YOUR_MACHINE (sans port number), as the default HTTP serving port 80 can be omitted when using the default configurations.
- In service_conf.yaml, select the desired LLM factory in user_default_llm and update the API_KEY field with the corresponding API key. See llm_api_key_setup for more information.
The show is now on!
When it comes to system configurations, you will need to manage the following files:
- .env: Keeps the fundamental setups for the system, such as SVR_HTTP_PORT, MYSQL_PASSWORD, and MINIO_PASSWORD.
- service_conf.yaml: Configures the back-end services.
- docker-compose.yml: The system relies on docker-compose.yml to start up.
You must ensure that changes to the .env file are in line with what are in the service_conf.yaml file.
The ./docker/README file provides a detailed description of the environment settings and service configurations, and you are REQUIRED to ensure that all environment settings listed in the ./docker/README file are aligned with the corresponding configurations in the service_conf.yaml file.
To update the default HTTP serving port (80), go to docker-compose.yml and change 80:80 to <YOUR_SERVING_PORT>:80.
Updates to all system configurations require a system reboot to take effect:
$ docker-compose up -d
To build the Docker images from source:
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
$ docker build -t infiniflow/ragflow:dev .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
To launch the service from source:
Clone the repository:
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
Create a virtual environment, ensuring that Anaconda or Miniconda is installed:
$ conda create -n ragflow python=3.11.0
$ conda activate ragflow
$ pip install -r requirements.txt
# If your CUDA version is higher than 12.0, run the following additional commands:
$ pip uninstall -y onnxruntime-gpu
$ pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
Copy the entry script and configure environment variables:
# Get the Python path:
$ which python
# Get the ragflow project path:
$ pwd
$ cp docker/entrypoint.sh .
$ vi entrypoint.sh
# Adjust configurations according to your actual situation (the following two export commands are newly added):
# - Assign the result of `which python` to `PY`.
# - Assign the result of `pwd` to `PYTHONPATH`.
# - Comment out `LD_LIBRARY_PATH`, if it is configured.
# - Optional: Add Hugging Face mirror.
PY=${PY}
export PYTHONPATH=${PYTHONPATH}
export HF_ENDPOINT=https://hf-mirror.com
Launch the third-party services (MinIO, Elasticsearch, Redis, and MySQL):
$ cd docker
$ docker compose -f docker-compose-base.yml up -d
Check the configuration files, ensuring that:
- The settings in docker/.env match those in conf/service_conf.yaml.
- The IP addresses and ports for related services in service_conf.yaml match the local machine IP and ports exposed by the container.
Launch the RAGFlow backend service:
$ chmod +x ./entrypoint.sh
$ bash ./entrypoint.sh
Launch the frontend service:
$ cd web
$ npm install --registry=https://registry.npmmirror.com --force
$ vim .umirc.ts
# Update proxy.target to http://127.0.0.1:9380
$ npm run dev
Deploy the frontend service:
$ cd web
$ npm install --registry=https://registry.npmmirror.com --force
$ umi build
$ mkdir -p /ragflow/web
$ cp -r dist /ragflow/web
$ apt install nginx -y
$ cp ../docker/nginx/proxy.conf /etc/nginx
$ cp ../docker/nginx/nginx.conf /etc/nginx
$ cp ../docker/nginx/ragflow.conf /etc/nginx/conf.d
$ systemctl start nginx
See the RAGFlow Roadmap 2024
from https://github.com/infiniflow/ragflow)