
Conversation

@rogeryoungh

What does this PR do?

This PR adds the MiniMax-M2 model from MiniMaxAI to Hugging Face Transformers.

Relevant Links:

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@ArthurZucker @Cyrilvallez

xuebi added 6 commits October 31, 2025 14:17
Signed-off-by: xuebi <xuebi@minimaxi.com> (same sign-off on all 6 commits)
@github-actions

github-actions bot commented Nov 5, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, minimax_m2

Signed-off-by: xuebi <xuebi@minimaxi.com>

@molbap molbap left a comment


Very clean integration, no particular comments. Thank you! cc @Cyrilvallez for core review


## Overview

MiniMax-M2 redefines efficiency for agents. It's a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.

Changed a bit to be more factual

Suggested change
MiniMax-M2 redefines efficiency for agents. It's a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.
MiniMax-M2 is a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.
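For readers who want to try the model once this PR is merged, here is a minimal generation sketch using the standard Auto classes. The Hub id `MiniMaxAI/MiniMax-M2`, the dtype, and the device settings are assumptions for illustration, not something fixed by this PR.

```python
# Minimal sketch, assuming the checkpoint is published as MiniMaxAI/MiniMax-M2
# and that this PR's MiniMaxM2ForCausalLM is wired into the Auto classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-M2"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Write a Python function that reverses a string.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```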

Comment on lines +139 to +157
    keys_to_ignore_at_inference = ["past_key_values"]
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.block_sparse_moe.gate": "colwise_rep",  # we need to replicate here to correctly route experts
        "layers.*.block_sparse_moe.experts.*.w1": "colwise",
        "layers.*.block_sparse_moe.experts.*.w2": "rowwise",
        "layers.*.block_sparse_moe.experts.*.w3": "colwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }
    attribute_map = {
        "num_experts": "num_local_experts",
    }

For further reviews: this part is identical to the MiniMax1 configuration, but inheriting that configuration would require deleting a bunch of keys (full_attn_beta_factor and so on), so it's OK to keep.
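Side note on the `attribute_map` entry above: in Transformers, a config-level `attribute_map` aliases attribute access, so `config.num_experts` resolves to `config.num_local_experts`. A stripped-down illustration of that aliasing (a toy class, not the actual `PretrainedConfig` implementation):

```python
# Simplified stand-in for how a config-level attribute_map alias behaves;
# the real logic lives in transformers.PretrainedConfig.
class ToyConfig:
    attribute_map = {"num_experts": "num_local_experts"}

    def __init__(self, num_local_experts: int = 8):
        self.num_local_experts = num_local_experts

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, i.e. for aliases.
        target = type(self).attribute_map.get(name)
        if target is not None:
            return getattr(self, target)
        raise AttributeError(name)


config = ToyConfig(num_local_experts=32)
print(config.num_experts)  # 32, read through the "num_experts" alias
```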

Comment on lines +238 to +239
class MiniMaxM2MLP(MixtralMLP):
pass

should be safe to remove

Suggested change
class MiniMaxM2MLP(MixtralMLP):
pass

Comment on lines +347 to +350

class MiniMaxM2DecoderLayer(MixtralDecoderLayer):
pass


safe to delete as well

Suggested change
class MiniMaxM2DecoderLayer(MixtralDecoderLayer):
pass
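To make the "safe to remove/delete" reasoning concrete: a subclass whose body is only `pass` adds no behavior of its own, so the shim exists purely for the name. A generic plain-Python illustration (whether the modular conversion tooling still emits a renamed copy is up to that tooling and not shown here):

```python
# A pass-through subclass behaves exactly like its parent; only the name
# differs. That is why an empty shim such as
# `class MiniMaxM2DecoderLayer(MixtralDecoderLayer): pass` carries no logic.
class Parent:
    def forward(self, x):
        return x * 2


class Child(Parent):  # analogous to the empty MiniMaxM2 shims above
    pass


assert Child().forward(3) == Parent().forward(3) == 6
```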
