
Conversation

@MengqingCao (Contributor) commented Nov 7, 2025

Purpose

Refactor the modeling code: pass the `prefix` argument into Linear layer constructors.
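
For context, a minimal sketch of the pattern this refactor applies (illustrative only, not code from this diff; `ExampleAttention` and its dimensions are hypothetical). vLLM's parallel linear layers accept a `prefix` argument naming the layer's fully qualified module path, which weight loading and quantization use to identify the layer:

```python
import torch

from vllm.model_executor.layers.linear import (
    QKVParallelLinear,
    RowParallelLinear,
)


class ExampleAttention(torch.nn.Module):
    """Hypothetical attention block showing the prefix-threading pattern."""

    def __init__(self, hidden_size: int, head_dim: int, num_heads: int,
                 num_kv_heads: int, quant_config=None, prefix: str = ""):
        super().__init__()
        # Before the refactor, layers like these were often constructed
        # without a prefix; after it, each receives "<parent>.<attr>",
        # e.g. "model.layers.0.self_attn.qkv_proj".
        self.qkv_proj = QKVParallelLinear(
            hidden_size,
            head_dim,
            num_heads,
            num_kv_heads,
            quant_config=quant_config,
            prefix=f"{prefix}.qkv_proj",
        )
        self.o_proj = RowParallelLinear(
            num_heads * head_dim,
            hidden_size,
            quant_config=quant_config,
            prefix=f"{prefix}.o_proj",
        )
```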


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: MengqingCao <cmq0113@163.com>
@mergify bot added the deepseek (Related to DeepSeek models) and qwen (Related to Qwen models) labels Nov 7, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request is a large-scale refactoring to pass a prefix argument to various linear layers across multiple models. This is a good improvement for code consistency and modularity, especially for weight loading and quantization. The changes are mostly correct, but I've identified two critical copy-paste errors that would break model loading. Please see the detailed comments for fixes.
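
To make the reviewer's point concrete: quantization configs in vLLM dispatch per layer via `get_quant_method(layer, prefix)`, so a missing or wrong prefix can mis-quantize a layer or break checkpoint name matching. A hedged sketch (`ExampleQuantConfig` and `ignored_layers` are invented for illustration):

```python
from vllm.model_executor.layers.linear import (
    LinearBase,
    UnquantizedLinearMethod,
)


class ExampleQuantConfig:
    """Hypothetical config illustrating prefix-based dispatch."""

    def __init__(self, ignored_layers: list[str]):
        self.ignored_layers = ignored_layers

    def get_quant_method(self, layer, prefix: str):
        # Layers whose qualified name matches ignored_layers stay in full
        # precision; everything else would get the scheme's quantized
        # method. Without an accurate prefix this matching cannot work.
        if isinstance(layer, LinearBase) and any(
            prefix.startswith(p) for p in self.ignored_layers
        ):
            return UnquantizedLinearMethod()
        raise NotImplementedError("return the scheme's quantized method here")
```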

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Signed-off-by: MengqingCao <cmq0113@163.com>
@jeejeelee (Collaborator) left a comment

LGTM

@jeejeelee added the ready (ONLY add when PR is ready to merge/full CI is needed) label Nov 7, 2025
@MengqingCao (Contributor, Author) commented

CI failed because the llama3.1 + ngram acceptance rate doesn't reach 66%, which seems unrelated to this PR. Is it a known issue on CI?

```
[2025-11-07T04:19:06Z] =================================== FAILURES ===================================
[2025-11-07T04:19:06Z] ____________ test_ngram_and_suffix_correctness[speculative_config1] ____________
[2025-11-07T04:19:06Z]
[2025-11-07T04:19:06Z] speculative_config = {'method': 'suffix', 'suffix_decoding_max_spec_factor': 2.0, 'target_model_config': ModelConfig(model='meta-llama/Llam...rank=0, _data_parallel_master_port_list=[], decode_context_parallel_size=1, _api_process_count=1, _api_process_rank=0)}
[2025-11-07T04:19:06Z] monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f6ce5646ab0>
[2025-11-07T04:19:06Z] sampling_config = SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0, top_p=1.0, top...tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None)
[2025-11-07T04:19:06Z] model_name = 'meta-llama/Llama-3.1-8B-Instruct'
[2025-11-07T04:19:06Z]
[2025-11-07T04:19:06Z]     @pytest.mark.parametrize(
[2025-11-07T04:19:06Z]         "speculative_config",
[2025-11-07T04:19:06Z]         [
[2025-11-07T04:19:06Z]             {
[2025-11-07T04:19:06Z]                 "method": "ngram",
[2025-11-07T04:19:06Z]                 "prompt_lookup_max": 5,
[2025-11-07T04:19:06Z]                 "prompt_lookup_min": 3,
[2025-11-07T04:19:06Z]                 "num_speculative_tokens": 3,
[2025-11-07T04:19:06Z]             },
[2025-11-07T04:19:06Z]             {
[2025-11-07T04:19:06Z]                 "method": "suffix",
[2025-11-07T04:19:06Z]                 "suffix_decoding_max_spec_factor": 2.0,
[2025-11-07T04:19:06Z]             },
[2025-11-07T04:19:06Z]         ],
[2025-11-07T04:19:06Z]     )
[2025-11-07T04:19:06Z]     def test_ngram_and_suffix_correctness(
[2025-11-07T04:19:06Z]         speculative_config: dict,
[2025-11-07T04:19:06Z]         monkeypatch: pytest.MonkeyPatch,
[2025-11-07T04:19:06Z]         sampling_config: SamplingParams,
[2025-11-07T04:19:06Z]         model_name: str,
[2025-11-07T04:19:06Z]     ):
[2025-11-07T04:19:06Z]         """
[2025-11-07T04:19:06Z]         Compare the outputs of an original LLM and a speculative LLM
[2025-11-07T04:19:06Z]         should be the same when using ngram speculative decoding.
[2025-11-07T04:19:06Z]         """
[2025-11-07T04:19:06Z]         test_prompts = get_test_prompts(mm_enabled=False)
[2025-11-07T04:19:06Z]
[2025-11-07T04:19:06Z]         ref_llm = LLM(model=model_name, max_model_len=1024)
[2025-11-07T04:19:06Z]         ref_outputs = ref_llm.chat(test_prompts, sampling_config)
[2025-11-07T04:19:06Z]         del ref_llm
[2025-11-07T04:19:06Z]         torch.cuda.empty_cache()
[2025-11-07T04:19:06Z]         cleanup_dist_env_and_memory()
```

@jeejeelee (Collaborator) commented Nov 7, 2025

I see this failure in other PRs as well; cc @DarkLight1337

@MengqingCao (Contributor, Author) commented

CI passes now; it seems that was a transient issue. @jeejeelee, could you help merge this? Thanks!

@DarkLight1337 merged commit 1958bda into vllm-project:main on Nov 7, 2025 (54 checks passed)
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
@MengqingCao deleted the prefix branch November 10, 2025 08:39
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
…ect#28259)

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

deepseek (Related to DeepSeek models), qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed)


3 participants