Add support for MiniMax-M2 #42028
base: main
Conversation
Signed-off-by: xuebi <xuebi@minimaxi.com>
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, minimax_m2
Very clean integration, no particular comments. Thank you! cc @Cyrilvallez for core review
## Overview

MiniMax-M2 redefines efficiency for agents. It's a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.
Changed a bit to be more factual
- MiniMax-M2 redefines efficiency for agents. It's a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.
+ MiniMax-M2 is a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.
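As a quick orientation for readers of this PR (not part of the diff), here is a minimal sketch of how the model would be loaded once this support lands; the Hub repo id `MiniMaxAI/MiniMax-M2` and the dtype/device settings are assumptions, so check the model card for the actual values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; adjust to the actual Hub repository.
model_id = "MiniMaxAI/MiniMax-M2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the MoE weights (230B total, ~10B active) across available devices.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Write a function that reverses a linked list.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```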
    keys_to_ignore_at_inference = ["past_key_values"]
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.block_sparse_moe.gate": "colwise_rep",  # we need to replicate here to correctly route experts
        "layers.*.block_sparse_moe.experts.*.w1": "colwise",
        "layers.*.block_sparse_moe.experts.*.w2": "rowwise",
        "layers.*.block_sparse_moe.experts.*.w3": "colwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }
    attribute_map = {
        "num_experts": "num_local_experts",
    }
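A toy sketch (not code from this PR) of why the MoE gate is tagged `colwise_rep` in the tensor-parallel plan above while the expert projections `w1`/`w2`/`w3` are sharded: every rank has to see the full router logits so that all ranks select the same top-k experts; the shapes and `k` value below are made up for illustration.

```python
import torch

hidden = torch.randn(4, 16)                # (tokens, hidden_size)
gate = torch.nn.Linear(16, 8, bias=False)  # router over 8 experts, replicated on every TP rank

logits = gate(hidden)                      # identical on all ranks because the gate is replicated
weights, chosen = torch.topk(logits.softmax(dim=-1), k=2, dim=-1)
# `chosen` agrees across ranks; each rank then runs only its shard of w1/w2/w3
# for the selected experts, and the partial expert outputs are reduced afterwards.
print(chosen)
```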
For further reviews: the configuration snippet above is identical to the MiniMax1 configuration, but inheriting that configuration would require deleting a bunch of keys (full_attn_beta_factor and so on), so it's OK to keep as-is.
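One more detail from that snippet worth calling out for reviewers: the `attribute_map` entry makes `num_experts` an alias for the stored `num_local_experts` value. A toy sketch of how `PretrainedConfig` resolves that alias (the class below is invented for illustration):

```python
from transformers import PretrainedConfig

class ToyMoEConfig(PretrainedConfig):
    # Same aliasing pattern as the MiniMaxM2 configuration above.
    attribute_map = {"num_experts": "num_local_experts"}

    def __init__(self, num_local_experts=8, **kwargs):
        self.num_local_experts = num_local_experts
        super().__init__(**kwargs)

cfg = ToyMoEConfig(num_local_experts=32)
print(cfg.num_experts)  # 32 -- resolved through attribute_map
```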
class MiniMaxM2MLP(MixtralMLP):
    pass
should be safe to remove
- class MiniMaxM2MLP(MixtralMLP):
-     pass
class MiniMaxM2DecoderLayer(MixtralDecoderLayer):
    pass
safe to delete as well
- class MiniMaxM2DecoderLayer(MixtralDecoderLayer):
-     pass
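For context on the two deletions suggested above, a rough sketch of the modular-transformers convention they rely on (my reading of the converter, not verified against it): in a `modular_*.py` file, a subclass whose body is just `pass` adds nothing, because the code generator copies and renames the parent implementation wherever another MiniMaxM2 class depends on it, so the stub can simply be dropped.

```python
# Hypothetical modular_minimax_m2.py excerpt, for illustration only.
from transformers.models.mixtral.modeling_mixtral import MixtralMLP

class MiniMaxM2MLP(MixtralMLP):
    # Pure pass-through: the generated modeling_minimax_m2.py would contain an
    # equivalent MiniMaxM2MLP anyway (pulled in as a dependency of the classes
    # that are kept), so this stub is redundant and can be deleted.
    pass
```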
What does this PR do?
This PR adds the MiniMax-M2 model from MiniMaxAI to Hugging Face Transformers.
Relevant Links:
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @Cyrilvallez