Merge pull request #1810 from hzg0601/dev

1. Support loading p-tuning; see docs/chatchat加载ptuning.md for detailed steps. 2. Choose the default binding host automatically based on the operating system.
Zhi-guo Huang 2023-10-20 19:34:02 +08:00 committed by GitHub
commit 109bb7f6c5
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 702 additions and 13 deletions

View File

@@ -60,13 +60,14 @@ docker run -d --gpus all -p 80:8501 registry.cn-beijing.aliyuncs.com/chatchat/ch
## Minimum Environment Requirements
To run this code smoothly, please configure it according to the following minimum requirements:
+ Python version: >= 3.8.5, < 3.11
+ CUDA version: >= 11.7, with Python able to install successfully
If you want to run the local models (int4 version) smoothly on a GPU, you need at least the following hardware configuration:
+ chatglm2-6b & LLaMA-7B: minimum GPU memory 7GB, recommended cards RTX 3060, RTX 2060
+ LLaMA-13B: minimum GPU memory 11GB, recommended cards RTX 2060 12GB, RTX 3060 12GB, RTX 3080, RTX A2000
+ Qwen-14B-Chat: minimum GPU memory 13GB, recommended card RTX 3090
+ LLaMA-30B: minimum GPU memory 22GB, recommended cards RTX A5000, RTX 3090, RTX 4090, RTX 6000, Tesla V100, Tesla P40
+ LLaMA-65B: minimum GPU memory 40GB, recommended cards A100, A40, A6000
@@ -215,8 +216,11 @@ docker run -d --gpus all -p 80:8501 registry.cn-beijing.aliyuncs.com/chatchat/ch
For how to use custom text splitters and contribute your own, see the [Text Splitter contribution guide](docs/splitter.md).
## Agent Ecosystem
### Basic Agents
In this version we implement a simple React-style Agent based on OpenAI; in our testing, only the following two models support it:
+ OpenAI GPT4
+ ChatGLM2-130B
@@ -278,6 +282,7 @@ $ git clone https://huggingface.co/moka-ai/m3e-base
Before running the Web UI or command-line interface, please check that the model parameters in [configs/model_config.py](configs/model_config.py) and [configs/server_config.py](configs/server_config.py) meet your needs:
- Make sure that the local storage path of each downloaded LLM model is written into the `local_model_path` attribute of the corresponding model in `llm_model_dict`, e.g.:
```
"chatglm2-6b": "/Users/xxx/Downloads/chatglm2-6b",
@@ -374,9 +379,13 @@ CUDA_VISIBLE_DEVICES=0,1 python startup.py -a
#### 5.4 Loading PEFT (including lora, p-tuning, prefix tuning, prompt tuning, ia3, etc.)
This project loads the LLM service via FastChat, so PEFT paths must be loaded the FastChat way. For models other than chatglm, falcon, and codet5p, and for PEFT methods other than p-tuning, the steps are as follows (see the sketch after this list):
1. Rename the config.json file produced by PEFT training to adapter_config.json;
2. Rename the folder so that its name contains the word 'peft';
3. Set the environment variable `PEFT_SHARE_BASE_WEIGHTS=true`, then run `python startup.py -a`.
For p-tuning and chatglm models, fastchat itself needs substantial modification; see [Loading p-tuning with chatchat](docs/chatchat加载ptuning.md) for the detailed steps.
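A minimal shell sketch of steps 1-3 above, assuming a hypothetical adapter directory named `my-adapter` produced by PEFT training:
```shell
# hypothetical adapter directory produced by PEFT training
mv my-adapter/config.json my-adapter/adapter_config.json  # step 1: rename the config file
mv my-adapter peft-my-adapter                             # step 2: the folder name must contain "peft"
PEFT_SHARE_BASE_WEIGHTS=true python startup.py -a         # step 3: enable weight sharing and start all services
```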
#### **5.5 Notes:**
@@ -454,10 +463,7 @@ CUDA_VISIBLE_DEVICES=0,1 python startup.py -a
🎉 The langchain-Chatchat project WeChat group: if you are also interested in this project, welcome to join the group chat and take part in the discussion.
## Follow Us
<img src="img/official_account.png" alt="image" width="900" height="300" />
🎉 The official WeChat Official Account of the langchain-Chatchat project. Welcome to scan the QR code and follow.

View File

@@ -59,18 +59,18 @@ docker run -d --gpus all -p 80:8501 registry.cn-beijing.aliyuncs.com/chatchat/ch
## Environment Minimum Requirements
To run this code smoothly, please configure it according to the following minimum requirements:
+ Python version: >= 3.8.5, < 3.11
+ CUDA version: >= 11.7, with Python installed.
If you want to run the local models (int4 version) on a GPU without problems, you need at least the following hardware configuration:
+ chatglm2-6b & LLaMA-7B: minimum GPU memory 7GB, recommended cards RTX 3060, RTX 2060
+ LLaMA-13B: minimum GPU memory 11GB, recommended cards RTX 2060 12GB, RTX 3060 12GB, RTX 3080, RTX A2000
+ Qwen-14B-Chat: minimum GPU memory 13GB, recommended card RTX 3090
+ LLaMA-30B: minimum GPU memory 22GB, recommended cards RTX A5000, RTX 3090, RTX 4090, RTX 6000, Tesla V100, Tesla P40
+ LLaMA-65B: minimum GPU memory 40GB, recommended cards A100, A40, A6000
With int8 quantization, multiply the GPU memory requirement by 1.5; with fp16, by 2.5.
For example: running inference with the Qwen-7B-Chat model in fp16 requires 16GB of GPU memory.
@@ -137,7 +137,6 @@ The project uses [FastChat](https://github.com/lm-sys/FastChat) to provide the AP
* Any [EleutherAI](https://huggingface.co/EleutherAI) pythia model such as [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)
* Any [Peft](https://github.com/huggingface/peft) adapter trained on top of a model above. To activate, the model path must contain `peft`. Note: If loading multiple peft models, you can have them share the base model weights by setting the environment variable `PEFT_SHARE_BASE_WEIGHTS=true` in any model worker.
The above model support list may be updated continuously as [FastChat](https://github.com/lm-sys/FastChat) is updated; see the [FastChat Supported Models List](https://github.com/lm-sys/FastChat/blob/main/docs/model_support.md).
In addition to local models, this project also supports direct access to online models such as the OpenAI API and Zhipu AI. For specific settings, please refer to the configuration information of `llm_model_dict` in `configs/model_config.py.example`.
The following online LLM models are currently supported:
@@ -230,6 +229,7 @@ $ git clone https://huggingface.co/moka-ai/m3e-base
```
### 3. Setting Configuration
Copy the model parameter configuration template [configs/model_config.py.example](configs/model_config.py.example) into the project's `./configs` path and rename it `model_config.py`.
Copy the service parameter configuration template [configs/server_config.py.example](configs/server_config.py.example) into the project's `./configs` path and rename it `server_config.py`.
@@ -237,6 +237,7 @@ Copy the service-related parameter configuration template file [configs/server_c
Before running the Web UI or command-line interface, check that the model parameters in [configs/model_config.py](configs/model_config.py) and [configs/server_config.py](configs/server_config.py) meet your requirements:
- Please make sure that the local storage path of each downloaded LLM model is written into the `local_model_path` attribute of the corresponding model in `llm_model_dict`, e.g.:
```
"chatglm2-6b": "/Users/xxx/Downloads/chatglm2-6b",
@@ -330,9 +331,9 @@ CUDA_VISIBLE_DEVICES=0,1 python startup.py -a
Including lora, p-tuning, prefix tuning, prompt tuning, ia3.
This project loads the LLM service based on FastChat, so PEFT must be loaded the FastChat way. For models other than "chatglm", "falcon", and "codet5p", and PEFT methods other than "p-tuning": ensure that the word `peft` is in the path name, the configuration file is named `adapter_config.json`, and the path contains PEFT weights in `.bin` format. The PEFT path is specified in `args.model_names` of the `create_model_worker_app` function in `startup.py`; also enable the environment variable `PEFT_SHARE_BASE_WEIGHTS=true` (see the sketch below).
For "p-tuning" PEFT, please refer to [loading p-tuning with chatchat](docs/chatchat加载ptuning.md).
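A minimal launch sketch under the assumptions above (hypothetical folder `peft-my-adapter` already containing `adapter_config.json` and the `.bin` weights):
```shell
# share base weights across PEFT adapters, then start all services
PEFT_SHARE_BASE_WEIGHTS=true python startup.py -a
```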
#### **5.5 Some Notes**
@@ -369,6 +370,7 @@ Please refer to [FAQ](docs/FAQ.md)
- [X] Langchain applications
  - [X] Load local documents
    - [X] Unstructured documents
      - [X] .md
      - [X] .txt
@@ -376,29 +378,36 @@ Please refer to [FAQ](docs/FAQ.md)
    - [ ] Structured documents
      - [X] .csv
      - [ ] .xlsx
    - [ ] TextSplitter and Retriever
      - [X] Multiple TextSplitters
      - [X] ChineseTextSplitter
      - [ ] Reconstructed Context Retriever
    - [ ] Webpage
    - [ ] SQL
    - [ ] Knowledge Database
  - [X] Search Engines
    - [X] Bing
    - [X] DuckDuckGo
  - [X] Agent
    - [X] Agent implementation in the form of basic React, including calls to calculators, etc.
    - [X] Langchain's own Agent implementation and calls
    - [ ] More Agent support for models
    - [ ] More tools
- [X] LLM Models
  - [X] [FastChat](https://github.com/lm-sys/fastchat)-based LLM models
  - [ ] Multiple remote LLM APIs
- [X] Embedding Models
  - [X] HuggingFace-based embedding models
  - [ ] Multiple remote embedding APIs
- [X] FastAPI-based API
- [X] Web UI
  - [X] Streamlit-based Web UI
---
@@ -412,4 +421,4 @@ Please refer to [FAQ](docs/FAQ.md)
## Follow us
<img src="img/official_account.png" alt="image" width="900" height="300" />
🎉 The official WeChat Official Account of the langchain-Chatchat project. Welcome to scan the QR code and follow.

View File

@@ -9,7 +9,7 @@ HTTPX_DEFAULT_TIMEOUT = 300.0
OPEN_CROSS_DOMAIN = False
# Default bind host for each server. If it is changed to "0.0.0.0", the host of every XX_SERVER below needs to be changed as well
DEFAULT_BIND_HOST = "0.0.0.0" if sys.platform != "win32" else "127.0.0.1"
# webui.py server
WEBUI_SERVER = {
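Note that this conditional requires `sys` to be imported at the top of the config file; a minimal sketch of the platform-dependent default:
```python
import sys

# bind to all interfaces on Linux/macOS, but only to loopback on Windows
DEFAULT_BIND_HOST = "0.0.0.0" if sys.platform != "win32" else "127.0.0.1"
```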

View File

@@ -0,0 +1,674 @@
# Guide to loading p-tuning in chatchat
Although p-tuning is a PEFT method, it is not compatible with huggingface's peft Python package, and fastchat hard-codes model loading via string matching in several places. As a result, fastchat and chatchat cannot load p-tuning out of the box. After many attempts, the langchain-chatchat development team offers the following guide for loading p-tuning.
# 1. Modifying the peft folder
1. Rename the config.json file to adapter_config.json;
2. Make sure the folder contains a pytorch_model.bin file;
3. Rename the folder so that its name contains the word 'peft';
4. Add the following fields to the adapter_config.json file:
```json
"base_model_name_or_path": "/root/model/chatglm2-6b/"
"task_type": "CAUSAL_LM",
"peft_type": "PREFIX_TUNING",
"inference_mode": true,
"revision": "main",
"num_virtual_tokens": 16
```
**Here, "base_model_name_or_path" is the location of the base model.** Presumably `num_virtual_tokens` should match the `pre_seq_len` used when training the p-tuning weights (16 in this example).
5. Move the folder into the project folder, e.g. under the Langchain-Chatchat project directory. A sketch of the resulting layout follows.
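After these steps the adapter folder might look like this (hypothetical folder name):
```
peft-chatglm2-ptuning/      # folder name contains "peft"
├── adapter_config.json     # renamed from config.json, with the fields above added
└── pytorch_model.bin       # the trained p-tuning prefix weights
```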
# 2. Modifying the fastchat package code
## 2.1 Changes to fastchat.model.model_adapter
1. Change the load_model function in fastchat/model/model_adapter.py to:
```python
def load_model(
model_path: str,
device: str = "cuda",
num_gpus: int = 1,
max_gpu_memory: Optional[str] = None,
dtype: Optional[torch.dtype] = None,
load_8bit: bool = False,
cpu_offloading: bool = False,
gptq_config: Optional[GptqConfig] = None,
awq_config: Optional[AWQConfig] = None,
revision: str = "main",
    debug: bool = False,
    load_kwargs = {},  # modification: extra kwargs forwarded to the HF config, e.g. {"pre_seq_len": 16}
):
    """Load a model from Hugging Face."""
    # get model adapter
    adapter = get_model_adapter(model_path)
    kwargs = dict(load_kwargs)  # copy so the shared default dict is never mutated
# Handle device mapping
cpu_offloading = raise_warning_for_incompatible_cpu_offloading_configuration(
device, load_8bit, cpu_offloading
)
if device == "cpu":
kwargs["torch_dtype"]= torch.float32
if CPU_ISA in ["avx512_bf16", "amx"]:
try:
import intel_extension_for_pytorch as ipex
kwargs ["torch_dtype"]= torch.bfloat16
except ImportError:
warnings.warn(
"Intel Extension for PyTorch is not installed, it can be installed to accelerate cpu inference"
)
elif device == "cuda":
kwargs["torch_dtype"] = torch.float16
if num_gpus != 1:
kwargs["device_map"] = "auto"
if max_gpu_memory is None:
kwargs[
"device_map"
] = "sequential" # This is important for not the same VRAM sizes
available_gpu_memory = get_gpu_memory(num_gpus)
kwargs["max_memory"] = {
i: str(int(available_gpu_memory[i] * 0.85)) + "GiB"
for i in range(num_gpus)
}
else:
kwargs["max_memory"] = {i: max_gpu_memory for i in range(num_gpus)}
elif device == "mps":
kwargs["torch_dtype"] = torch.float16
# Avoid bugs in mps backend by not using in-place operations.
replace_llama_attn_with_non_inplace_operations()
elif device == "xpu":
kwargs["torch_dtype"] = torch.bfloat16
# Try to load ipex, while it looks unused, it links into torch for xpu support
try:
import intel_extension_for_pytorch as ipex
except ImportError:
warnings.warn(
"Intel Extension for PyTorch is not installed, but is required for xpu inference."
)
elif device == "npu":
kwargs["torch_dtype"]= torch.float16
# Try to load ipex, while it looks unused, it links into torch for xpu support
try:
import torch_npu
except ImportError:
warnings.warn("Ascend Extension for PyTorch is not installed.")
else:
raise ValueError(f"Invalid device: {device}")
if cpu_offloading:
# raises an error on incompatible platforms
from transformers import BitsAndBytesConfig
if "max_memory" in kwargs:
kwargs["max_memory"]["cpu"] = (
str(math.floor(psutil.virtual_memory().available / 2**20)) + "Mib"
)
kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_8bit_fp32_cpu_offload=cpu_offloading
)
kwargs["load_in_8bit"] = load_8bit
elif load_8bit:
if num_gpus != 1:
warnings.warn(
"8-bit quantization is not supported for multi-gpu inference."
)
else:
model, tokenizer = adapter.load_compress_model(
model_path=model_path,
device=device,
torch_dtype=kwargs["torch_dtype"],
revision=revision,
)
if debug:
print(model)
return model, tokenizer
elif awq_config and awq_config.wbits < 16:
assert (
awq_config.wbits == 4
), "Currently we only support 4-bit inference for AWQ."
model, tokenizer = load_awq_quantized(model_path, awq_config, device)
if num_gpus != 1:
device_map = accelerate.infer_auto_device_map(
model,
max_memory=kwargs["max_memory"],
no_split_module_classes=[
"OPTDecoderLayer",
"LlamaDecoderLayer",
"BloomBlock",
"MPTBlock",
"DecoderLayer",
],
)
model = accelerate.dispatch_model(
model, device_map=device_map, offload_buffers=True
)
else:
model.to(device)
return model, tokenizer
elif gptq_config and gptq_config.wbits < 16:
model, tokenizer = load_gptq_quantized(model_path, gptq_config)
if num_gpus != 1:
device_map = accelerate.infer_auto_device_map(
model,
max_memory=kwargs["max_memory"],
no_split_module_classes=["LlamaDecoderLayer"],
)
model = accelerate.dispatch_model(
model, device_map=device_map, offload_buffers=True
)
else:
model.to(device)
return model, tokenizer
kwargs["revision"] = revision
if dtype is not None: # Overwrite dtype if it is provided in the arguments.
kwargs["torch_dtype"] = dtype
# Load model
model, tokenizer = adapter.load_model(model_path, kwargs)
if (
device == "cpu"
and kwargs["torch_dtype"] is torch.bfloat16
and CPU_ISA is not None
):
model = ipex.optimize(model, dtype=kwargs["torch_dtype"])
if (device == "cuda" and num_gpus == 1 and not cpu_offloading) or device in (
"mps",
"xpu",
"npu",
):
model.to(device)
if device == "xpu":
model = torch.xpu.optimize(model, dtype=kwargs["torch_dtype"], inplace=True)
if debug:
print(model)
return model, tokenizer
```
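A minimal usage sketch of the new `load_kwargs` argument (hypothetical adapter path; the `pre_seq_len` value must match the one used when training the p-tuning weights):
```python
model, tokenizer = load_model(
    "peft-chatglm2-ptuning",          # path contains "peft", so PeftModelAdapter is selected
    device="cuda",
    num_gpus=1,
    load_kwargs={"pre_seq_len": 16},  # forwarded through the adapters into AutoConfig.from_pretrained
)
```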
2. Change the get_generate_stream_function function in fastchat/model/model_adapter.py to:
```python
def get_generate_stream_function(model: torch.nn.Module, model_path: str):
"""Get the generate_stream function for inference."""
from fastchat.serve.inference import generate_stream
model_type = str(type(model)).lower()
is_chatglm = "chatglm" in model_type
is_falcon = "rwforcausallm" in model_type
is_codet5p = "codet5p" in model_type
is_peft = "peft" in model_type
if is_chatglm:
return generate_stream_chatglm
elif is_falcon:
return generate_stream_falcon
elif is_codet5p:
return generate_stream_codet5p
elif peft_share_base_weights and is_peft:
# Return a curried stream function that loads the right adapter
# according to the model_name available in this context. This ensures
# the right weights are available.
@torch.inference_mode()
def generate_stream_peft(
model,
tokenizer,
params: Dict,
device: str,
context_len: int,
stream_interval: int = 2,
judge_sent_end: bool = False,
):
model.set_adapter(model_path)
if "chatglm" in str(type(model.base_model)).lower():
model.disable_adapter()
prefix_state_dict = torch.load(os.path.join(model_path, "pytorch_model.bin"))
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
if k.startswith("transformer.prefix_encoder."):
new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
elif k.startswith("transformer.prompt_encoder."):
new_prefix_state_dict[k[len("transformer.prompt_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
for x in generate_stream_chatglm(
model,
tokenizer,
params,
device,
context_len,
stream_interval,
judge_sent_end,
):
yield x
elif "rwforcausallm" in str(type(model.base_model)).lower():
for x in generate_stream_falcon(
model,
tokenizer,
params,
device,
context_len,
stream_interval,
judge_sent_end,
):
yield x
elif "codet5p" in str(type(model.base_model)).lower():
for x in generate_stream_codet5p(
model,
tokenizer,
params,
device,
context_len,
stream_interval,
judge_sent_end,
):
yield x
else:
for x in generate_stream(
model,
tokenizer,
params,
device,
context_len,
stream_interval,
judge_sent_end,
):
yield x
return generate_stream_peft
else:
return generate_stream
```
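Note how the chatglm branch of `generate_stream_peft` works: it disables the peft adapter and, on each call, loads the prefix-encoder weights directly from the adapter folder's pytorch_model.bin into `model.transformer.prefix_encoder`. This manual loading is what sidesteps the incompatibility between p-tuning checkpoints and the peft package.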
3. Change the load_model method of the PeftModelAdapter class in fastchat/model/model_adapter.py to:
```python
def load_model(self, model_path: str, from_pretrained_kwargs: dict):
"""Loads the base model then the (peft) adapter weights"""
from peft import PeftConfig, PeftModel
config = PeftConfig.from_pretrained(model_path)
base_model_path = config.base_model_name_or_path
if "peft" in base_model_path:
raise ValueError(
f"PeftModelAdapter cannot load a base model with 'peft' in the name: {config.base_model_name_or_path}"
)
# Basic proof of concept for loading peft adapters that share the base
# weights. This is pretty messy because Peft re-writes the underlying
# base model and internally stores a map of adapter layers.
# So, to make this work we:
# 1. Cache the first peft model loaded for a given base models.
# 2. Call `load_model` for any follow on Peft models.
# 3. Make sure we load the adapters by the model_path. Why? This is
# what's accessible during inference time.
# 4. In get_generate_stream_function, make sure we load the right
# adapter before doing inference. This *should* be safe when calls
# are blocked the same semaphore.
if peft_share_base_weights:
if base_model_path in peft_model_cache:
model, tokenizer = peft_model_cache[base_model_path]
# Super important: make sure we use model_path as the
# `adapter_name`.
model.load_adapter(model_path, adapter_name=model_path)
else:
base_adapter = get_model_adapter(base_model_path)
base_model, tokenizer = base_adapter.load_model(
base_model_path, from_pretrained_kwargs
)
# Super important: make sure we use model_path as the
# `adapter_name`.
                from peft import get_peft_model
                model = get_peft_model(base_model, config, adapter_name=model_path)
                peft_model_cache[base_model_path] = (model, tokenizer)
return model, tokenizer
# In the normal case, load up the base model weights again.
base_adapter = get_model_adapter(base_model_path)
base_model, tokenizer = base_adapter.load_model(
base_model_path, from_pretrained_kwargs
)
        from peft import get_peft_model
        model = get_peft_model(base_model, config, adapter_name=model_path)
return model, tokenizer
```
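Note the design choice here: the adapter is attached with peft's `get_peft_model`, which builds a fresh adapter from adapter_config.json, rather than the usual `PeftModel.from_pretrained`. Since a p-tuning checkpoint is not stored in peft's weight format, the real prefix weights are loaded later inside `generate_stream_peft` (see step 2 above).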
4. Change the load_model method of the ChatglmAdapter class in fastchat/model/model_adapter.py to:
```python
def load_model(self, model_path: str, from_pretrained_kwargs: dict):
revision = from_pretrained_kwargs.get("revision", "main")
tokenizer = AutoTokenizer.from_pretrained(
model_path, trust_remote_code=True, revision=revision
)
        config = AutoConfig.from_pretrained(model_path, trust_remote_code=True, **from_pretrained_kwargs)  # modification: forwards load_kwargs into the config
model = AutoModel.from_pretrained(
model_path, trust_remote_code=True, config=config
)
return model, tokenizer
```
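Forwarding `**from_pretrained_kwargs` into `AutoConfig.from_pretrained` is what carries the `load_kwargs` fields (in particular `pre_seq_len`) into the ChatGLM config, so the base model is constructed with its prefix encoder.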
## 2.2 Changes to fastchat.serve.model_worker
1. Change the __init__ method of the ModelWorker class in fastchat/serve/model_worker.py as follows:
```python
class ModelWorker(BaseModelWorker):
def __init__(
self,
controller_addr: str,
worker_addr: str,
worker_id: str,
model_path: str,
model_names: List[str],
limit_worker_concurrency: int,
no_register: bool,
device: str,
num_gpus: int,
max_gpu_memory: str,
dtype: Optional[torch.dtype] = None,
load_8bit: bool = False,
cpu_offloading: bool = False,
gptq_config: Optional[GptqConfig] = None,
awq_config: Optional[AWQConfig] = None,
stream_interval: int = 2,
conv_template: Optional[str] = None,
embed_in_truncate: bool = False,
seed: Optional[int] = None,
        load_kwargs = {},  # modification: extra kwargs forwarded to load_model
**kwargs,
):
super().__init__(
controller_addr,
worker_addr,
worker_id,
model_path,
model_names,
limit_worker_concurrency,
conv_template=conv_template,
)
logger.info(f"Loading the model {self.model_names} on worker {worker_id} ...")
self.model, self.tokenizer = load_model(
model_path,
device=device,
num_gpus=num_gpus,
max_gpu_memory=max_gpu_memory,
dtype=dtype,
load_8bit=load_8bit,
cpu_offloading=cpu_offloading,
gptq_config=gptq_config,
awq_config=awq_config,
            load_kwargs=load_kwargs,  # modification: forward the new argument
)
self.device = device
        if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
self.context_len = get_context_length(self.model.config)
print("**"*100)
self.generate_stream_func = get_generate_stream_function(self.model, model_path)
print(f"self.generate_stream_func{self.generate_stream_func}")
print("*"*100)
self.stream_interval = stream_interval
self.embed_in_truncate = embed_in_truncate
self.seed = seed
if not no_register:
self.init_heart_beat()
```
2. In the create_model_worker function of fastchat/serve/model_worker.py, add the following argument:
```python
parser.add_argument("--load_kwargs",type=dict,default={})
```
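Note that argparse's `type=dict` cannot parse a dict from a command-line string, so in practice the value takes effect through the `default` (as modified at the end of this section) or is set programmatically via kwargs, as in section 3.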
and change the following statement:
```python
worker = ModelWorker(
args.controller_address,
args.worker_address,
worker_id,
args.model_path,
args.model_names,
args.limit_worker_concurrency,
no_register=args.no_register,
device=args.device,
num_gpus=args.num_gpus,
max_gpu_memory=args.max_gpu_memory,
dtype=str_to_torch_dtype(args.dtype),
load_8bit=args.load_8bit,
cpu_offloading=args.cpu_offloading,
gptq_config=gptq_config,
awq_config=awq_config,
stream_interval=args.stream_interval,
conv_template=args.conv_template,
embed_in_truncate=args.embed_in_truncate,
seed=args.seed,
)
```
to:
```python
worker = ModelWorker(
args.controller_address,
args.worker_address,
worker_id,
args.model_path,
args.model_names,
args.limit_worker_concurrency,
no_register=args.no_register,
device=args.device,
num_gpus=args.num_gpus,
max_gpu_memory=args.max_gpu_memory,
dtype=str_to_torch_dtype(args.dtype),
load_8bit=args.load_8bit,
cpu_offloading=args.cpu_offloading,
gptq_config=gptq_config,
awq_config=awq_config,
stream_interval=args.stream_interval,
conv_template=args.conv_template,
embed_in_truncate=args.embed_in_truncate,
seed=args.seed,
load_kwargs=args.load_kwargs
)
```
This completes all the fastchat changes needed to load p-tuning. When invoking fastchat to load p-tuning, set `PEFT_SHARE_BASE_WEIGHTS=true` and supply the --load_kwargs parameter as a dict whose pre_seq_len equals the value used when training the p-tuning weights. For example, change `parser.add_argument("--load_kwargs", type=dict, default={})` from step 2 of section 2.2 to:
`parser.add_argument("--load_kwargs", type=dict, default={"pre_seq_len": 16})`
# 3. Modifying the langchain-chatchat code
1. Add the following field to the FSCHAT_MODEL_WORKERS dict in configs/server_config.py (a fuller sketch of the dict follows below):
```
"load_kwargs": {"pre_seq_len": 16} #值修改为adapter_config.json中的pre_seq_len值
```
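A minimal sketch of where this field sits (hypothetical model key; the surrounding structure follows configs/server_config.py.example):
```python
FSCHAT_MODEL_WORKERS = {
    "chatglm2-6b": {
        # ... existing host / port / device settings ...
        "load_kwargs": {"pre_seq_len": 16},  # match the pre_seq_len in adapter_config.json
    },
}
```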
2. Change create_model_worker_app in startup.py to:
```python
def create_model_worker_app(log_level: str = "INFO", **kwargs) -> FastAPI:
"""
kwargs包含的字段如下
host:
port:
model_names:[`model_name`]
controller_address:
worker_address:
对于online_api:
online_api:True
worker_class: `provider`
对于离线模型:
model_path: `model_name_or_path`,huggingface的repo-id或本地路径
device:`LLM_DEVICE`
"""
import fastchat.constants
fastchat.constants.LOGDIR = LOG_PATH
from fastchat.serve.model_worker import worker_id, logger
import argparse
logger.setLevel(log_level)
parser = argparse.ArgumentParser()
args = parser.parse_args([])
for k, v in kwargs.items():
setattr(args, k, v)
    # online model API
if worker_class := kwargs.get("worker_class"):
from fastchat.serve.model_worker import app
worker = worker_class(model_names=args.model_names,
controller_addr=args.controller_address,
worker_addr=args.worker_address)
sys.modules["fastchat.serve.model_worker"].worker = worker
    # local model
else:
from configs.model_config import VLLM_MODEL_DICT
if kwargs["model_names"][0] in VLLM_MODEL_DICT and args.infer_turbo == "vllm":
import fastchat.serve.vllm_worker
from fastchat.serve.vllm_worker import VLLMWorker,app
from vllm import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs,EngineArgs
            args.tokenizer = args.model_path  # add here if the tokenizer differs from model_path
args.tokenizer_mode = 'auto'
args.trust_remote_code= True
args.download_dir= None
args.load_format = 'auto'
args.dtype = 'auto'
args.seed = 0
args.worker_use_ray = False
args.pipeline_parallel_size = 1
args.tensor_parallel_size = 1
args.block_size = 16
args.swap_space = 4 # GiB
args.gpu_memory_utilization = 0.90
args.max_num_batched_tokens = 2560
args.max_num_seqs = 256
args.disable_log_stats = False
args.conv_template = None
args.limit_worker_concurrency = 5
args.no_register = False
            args.num_gpus = 1  # vllm workers shard by tensor parallelism; set to the number of GPUs
args.engine_use_ray = False
args.disable_log_requests = False
if args.model_path:
args.model = args.model_path
if args.num_gpus > 1:
args.tensor_parallel_size = args.num_gpus
for k, v in kwargs.items():
setattr(args, k, v)
engine_args = AsyncEngineArgs.from_cli_args(args)
engine = AsyncLLMEngine.from_engine_args(engine_args)
worker = VLLMWorker(
controller_addr = args.controller_address,
worker_addr = args.worker_address,
worker_id = worker_id,
model_path = args.model_path,
model_names = args.model_names,
limit_worker_concurrency = args.limit_worker_concurrency,
no_register = args.no_register,
llm_engine = engine,
conv_template = args.conv_template,
)
sys.modules["fastchat.serve.vllm_worker"].engine = engine
sys.modules["fastchat.serve.vllm_worker"].worker = worker
else:
from fastchat.serve.model_worker import app, GptqConfig, AWQConfig, ModelWorker
args.gpus = "0" # GPU的编号,如果有多个GPU可以设置为"0,1,2,3"
args.max_gpu_memory = "20GiB"
args.num_gpus = 1 # model worker的切分是model并行这里填写显卡的数量
args.load_8bit = False
args.cpu_offloading = None
args.gptq_ckpt = None
args.gptq_wbits = 16
args.gptq_groupsize = -1
args.gptq_act_order = False
args.awq_ckpt = None
args.awq_wbits = 16
args.awq_groupsize = -1
args.model_names = []
args.conv_template = None
args.limit_worker_concurrency = 5
args.stream_interval = 2
args.no_register = False
args.embed_in_truncate = False
args.load_kwargs = {"pre_seq_len": 16} # 改*************************
for k, v in kwargs.items():
setattr(args, k, v)
if args.gpus:
if args.num_gpus is None:
args.num_gpus = len(args.gpus.split(','))
if len(args.gpus.split(",")) < args.num_gpus:
raise ValueError(
f"Larger --num-gpus ({args.num_gpus}) than --gpus {args.gpus}!"
)
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpus
gptq_config = GptqConfig(
ckpt=args.gptq_ckpt or args.model_path,
wbits=args.gptq_wbits,
groupsize=args.gptq_groupsize,
act_order=args.gptq_act_order,
)
awq_config = AWQConfig(
ckpt=args.awq_ckpt or args.model_path,
wbits=args.awq_wbits,
groupsize=args.awq_groupsize,
)
worker = ModelWorker(
controller_addr=args.controller_address,
worker_addr=args.worker_address,
worker_id=worker_id,
model_path=args.model_path,
model_names=args.model_names,
limit_worker_concurrency=args.limit_worker_concurrency,
no_register=args.no_register,
device=args.device,
num_gpus=args.num_gpus,
max_gpu_memory=args.max_gpu_memory,
load_8bit=args.load_8bit,
cpu_offloading=args.cpu_offloading,
gptq_config=gptq_config,
awq_config=awq_config,
stream_interval=args.stream_interval,
conv_template=args.conv_template,
embed_in_truncate=args.embed_in_truncate,
                load_kwargs=args.load_kwargs,  # modification: forward load_kwargs to ModelWorker
)
sys.modules["fastchat.serve.model_worker"].args = args
sys.modules["fastchat.serve.model_worker"].gptq_config = gptq_config
sys.modules["fastchat.serve.model_worker"].worker = worker
MakeFastAPIOffline(app)
app.title = f"FastChat LLM Server ({args.model_names[0]})"
app._worker = worker
return app
```
This completes everything needed for langchain-chatchat to load p-tuning, which can now be started as follows:
```shell
PEFT_SHARE_BASE_WEIGHTS=true python startup.py -a
```