[Misc][Model][Refactor] Pass the prefix into Linear layers #28259

MengqingCao · 2025-11-07T01:24:23Z

Purpose

Refactor the modeling code: pass the prefix into Linear layers

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: MengqingCao <cmq0113@163.com>

gemini-code-assist

Code Review

This pull request is a large-scale refactoring to pass a prefix argument to various linear layers across multiple models. This is a good improvement for code consistency and modularity, especially for weight loading and quantization. The changes are mostly correct, but I've identified two critical copy-paste errors that would break model loading. Please see the detailed comments for fixes.

vllm/model_executor/models/chameleon.py

vllm/model_executor/models/persimmon.py
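
For readers unfamiliar with the pattern this PR applies, here is a minimal sketch of what "passing the prefix" means and why it matters for per-layer settings such as quantization. The classes below (`ToyQuantConfig`, `ToyLinear`, `ToyMLP`) are hypothetical stand-ins written for illustration; they are not vLLM's actual layer classes or signatures. The only assumption is that each layer receives a dotted, fully qualified name built by its parent.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQuantConfig:
    """Hypothetical config: layers whose full name is listed stay unquantized."""

    skip_layers: set[str] = field(default_factory=set)

    def is_quantized(self, prefix: str) -> bool:
        return prefix not in self.skip_layers


class ToyLinear:
    """Stand-in for a linear layer that records its fully qualified name."""

    def __init__(self, in_features, out_features, quant_config=None, prefix=""):
        self.in_features = in_features
        self.out_features = out_features
        self.prefix = prefix
        # With an empty prefix, the layer cannot be matched against
        # per-layer settings, so the lookup silently falls through.
        self.quantized = (
            quant_config is not None and quant_config.is_quantized(prefix)
        )


class ToyMLP:
    """Parent module: extends its own prefix with each child's attribute name."""

    def __init__(self, hidden, inner, quant_config=None, prefix=""):
        self.up_proj = ToyLinear(
            hidden, inner, quant_config, prefix=f"{prefix}.up_proj"
        )
        self.down_proj = ToyLinear(
            inner, hidden, quant_config, prefix=f"{prefix}.down_proj"
        )


config = ToyQuantConfig(skip_layers={"model.layers.0.mlp.down_proj"})
mlp = ToyMLP(8, 32, quant_config=config, prefix="model.layers.0.mlp")
print(mlp.up_proj.prefix, mlp.up_proj.quantized)
print(mlp.down_proj.prefix, mlp.down_proj.quantized)
```

In this toy setup, giving `down_proj` the wrong suffix (for example, reusing `up_proj` after a copy-paste) would make the skip-list lookup miss that layer, which is why each suffix needs to match the attribute name.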

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


vllm/model_executor/models/persimmon.py

Signed-off-by: MengqingCao <cmq0113@163.com>

jeejeelee

LGTM

MengqingCao · 2025-11-07T06:26:26Z

CI failed because the llama3.1 + ngram accept rate doesn't reach 66%. This seems unrelated to this PR; is it a known issue on CI?

[2025-11-07T04:19:06Z] =================================== FAILURES ===================================
[2025-11-07T04:19:06Z] ____________ test_ngram_and_suffix_correctness[speculative_config1] ____________
[2025-11-07T04:19:06Z]
[2025-11-07T04:19:06Z] speculative_config = {'method': 'suffix', 'suffix_decoding_max_spec_factor': 2.0, 'target_model_config': ModelConfig(model='meta-llama/Llam...rank=0, _data_parallel_master_port_list=[], decode_context_parallel_size=1, _api_process_count=1, _api_process_rank=0)}
[2025-11-07T04:19:06Z] monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f6ce5646ab0>
[2025-11-07T04:19:06Z] sampling_config = SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0, top_p=1.0, top...tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None)
[2025-11-07T04:19:06Z] model_name = 'meta-llama/Llama-3.1-8B-Instruct'
[2025-11-07T04:19:06Z]
[2025-11-07T04:19:06Z]     @pytest.mark.parametrize(
[2025-11-07T04:19:06Z]         "speculative_config",
[2025-11-07T04:19:06Z]         [
[2025-11-07T04:19:06Z]             {
[2025-11-07T04:19:06Z]                 "method": "ngram",
[2025-11-07T04:19:06Z]                 "prompt_lookup_max": 5,
[2025-11-07T04:19:06Z]                 "prompt_lookup_min": 3,
[2025-11-07T04:19:06Z]                 "num_speculative_tokens": 3,
[2025-11-07T04:19:06Z]             },
[2025-11-07T04:19:06Z]             {
[2025-11-07T04:19:06Z]                 "method": "suffix",
[2025-11-07T04:19:06Z]                 "suffix_decoding_max_spec_factor": 2.0,
[2025-11-07T04:19:06Z]             },
[2025-11-07T04:19:06Z]         ],
[2025-11-07T04:19:06Z]     )
[2025-11-07T04:19:06Z]     def test_ngram_and_suffix_correctness(
[2025-11-07T04:19:06Z]         speculative_config: dict,
[2025-11-07T04:19:06Z]         monkeypatch: pytest.MonkeyPatch,
[2025-11-07T04:19:06Z]         sampling_config: SamplingParams,
[2025-11-07T04:19:06Z]         model_name: str,
[2025-11-07T04:19:06Z]     ):
[2025-11-07T04:19:06Z]         """
[2025-11-07T04:19:06Z]         Compare the outputs of an original LLM and a speculative LLM
[2025-11-07T04:19:06Z]         should be the same when using ngram speculative decoding.
[2025-11-07T04:19:06Z]         """
[2025-11-07T04:19:06Z]         test_prompts = get_test_prompts(mm_enabled=False)
[2025-11-07T04:19:06Z]
[2025-11-07T04:19:06Z]         ref_llm = LLM(model=model_name, max_model_len=1024)
[2025-11-07T04:19:06Z]         ref_outputs = ref_llm.chat(test_prompts, sampling_config)
[2025-11-07T04:19:06Z]         del ref_llm
[2025-11-07T04:19:06Z]         torch.cuda.empty_cache()
[2025-11-07T04:19:06Z]         cleanup_dist_env_and_memory()
[2025-11-07T04:19:06Z]

jeejeelee · 2025-11-07T06:48:43Z

I also see this failure in other PRs. cc @DarkLight1337

MengqingCao · 2025-11-07T11:04:44Z

CI passes now; it seems that was an intermittent issue. @jeejeelee could you help merge this? Thanks!

[Misc][Model][Refactor] Pass the prefix into Linear layers

0f559ee

Signed-off-by: MengqingCao <cmq0113@163.com>

MengqingCao requested a review from sighingnow as a code owner November 7, 2025 01:24

mergify bot added deepseek Related to DeepSeek models qwen Related to Qwen models labels Nov 7, 2025

gemini-code-assist bot reviewed Nov 7, 2025

View reviewed changes

vllm/model_executor/models/chameleon.py Outdated Show resolved Hide resolved

vllm/model_executor/models/persimmon.py Show resolved Hide resolved

chatgpt-codex-connector bot reviewed Nov 7, 2025

View reviewed changes

vllm/model_executor/models/persimmon.py Outdated Show resolved Hide resolved

fix typo

f944030

Signed-off-by: MengqingCao <cmq0113@163.com>

jeejeelee approved these changes Nov 7, 2025

View reviewed changes

jeejeelee added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 7, 2025

DarkLight1337 merged commit 1958bda into vllm-project:main Nov 7, 2025
54 checks passed

ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025

[Misc][Model][Refactor] Pass the prefix into Linear layers (vllm-proj…

380e0ca

…ect#28259) Signed-off-by: MengqingCao <cmq0113@163.com>

MengqingCao deleted the prefix branch November 10, 2025 08:39

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025

[Misc][Model][Refactor] Pass the prefix into Linear layers (vllm-proj…

36651e8

…ect#28259) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Alnusjaponica mentioned this pull request Nov 14, 2025

[Model] [Bugfix] Fix inconsistencies in the handling of layer names #27453

Closed

2 tasks


3 participants
