Skip to content

Conversation

@Xqle
Copy link
Contributor

@Xqle Xqle commented Nov 4, 2025

What does this PR do?

1. Improve fps handling in Qwen2_5_VLProcessor.__call__()

  • Previously, the __call__() function always defaulted to fps=2, regardless of the input video.

  • Now, when fps is not explicitly provided and cannot be read from the video metadata, it falls back to 24 by default. Otherwise, the provided or metadata fps is used. It follows a priority order:

    kwargs["fps"] → video_metadata["fps"] → 24 (default)

    The default value 24 is chosen as it represents a common frame rate for most videos.

  • This prevents potential errors in second_per_grid_ts calculation when the actual video frame rate differs from the assumed default.

2. More robust do_sample_frames behavior

  • If fps is manually provided but do_sample_frames is not specified, do_sample_frames will now be automatically set to True.
  • This avoids cases where custom fps is ignored due to missing sampling flag.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@Rocketknight1
Copy link
Member

cc @molbap @yonigozlan for processors

Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Xqle , thanks for the PR though I am not convinced we have to automatically change the flag depending on input args. Please see my comments below

Comment on lines 931 to 935
output_kwargs["videos_kwargs"]["do_sample_frames"] = True
logger.info(
"User specified 'fps' without 'do_sample_frames'; "
"'do_sample_frames' has been automatically enabled."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this is a good idea. We can't automatically change the flag only for a single model. If we are to change to all models, it will be much harder because the sampling can be set either with num_frames or fps and optionally other args. Checking for what required args are present per model is not as straightforward always

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this part, we can just revert the change.

Comment on lines 946 to 948
fps = output_kwargs["videos_kwargs"].get(
"fps", [metadata.fps if metadata.fps is not None else 24 for metadata in video_metadata]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this is the fps if the original video here we are using. It's more probably the sampling fps rate used which is in videos_kwargs

The sampling fps however isn't always there, we can sample with number of frames, with default values of video processor or do not sample at all. So the 2 here is just the default value of the video processor

Copy link
Contributor Author

@Xqle Xqle Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zucchini-nlp

Qwen2VL Video Processor Default Parameters image

Thanks for the feedback.

In Qwen2.5-VL, the video processor we are using is the Qwen2-VL video processor, which does not initialize fps or num_frames by default, see above. So the model will not sample the input video by default.

And I think the fps in this context is not only used for frame sampling, but is later involved in computing seconds_per_grid_ts, which directly affects the temporal RoPE index inside get_rope_index. So, the fps used here should correspond to the effective fps of the processed video (i.e., the fps after potential frame sampling).

Specifically:

  • When fps and do_sample_frames are passed in video_kwargs, the processor will sample the video according to that fps, so the processed video’s fps should match the sampling fps in video_kwargs.

  • When eitherfps ordo_sample_frames is not provided, no sampling occurs, and the processed video keeps its original fps, which is metadata.fps. In that case, using metadata.fps ensures that seconds_per_grid_ts and thus the temporal RoPE positions remain consistent with the actual video duration.

  • Finally, if no sampling occurs and metadata.fps is unavailable, it’s reasonable to fall back to a typical video frame rate such as 24 fps, to maintain temporal consistency in downstream computations.

So, I think we can revert the part of automatic change of flag do_sample_frames, and change this part to something like:

fps = [metadata.fps if metadata.fps is not None else 24 for metadata in video_metadata]
if output_kwargs["videos_kwargs"].get("do_sample_frames") is not None:
    if output_kwargs["videos_kwargs"].get("fps") is not None:
        fps = output_kwargs["videos_kwargs"]["fps"]
    elif output_kwargs["videos_kwargs"].get("num_frames") is not None:
        fps = [output_kwargs["videos_kwargs"]["num_frames"] / metadata.duration for metadata in video_metadata]

@Xqle
Copy link
Contributor Author

Xqle commented Nov 5, 2025

@zucchini-nlp

Thanks for the review. But I think the fps should align with the fps of video after video processing, no matter the video is sampled during video processing or not.
Please see my reply for your comment above.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 5, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_5_vl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants