
Apple researchers have developed an adapted version of the SlowFast-LLaVA model that beats larger models at long-form video analysis and understanding. Here’s what that means.
The nerdy bits
Very basically, when an LLM is adapted to also understand video, the system splits the video into frames, applies computer vision to extract visual features from each one, analyzes how those features change over time, and aligns all of that with language so the model can describe or reason about the video in text.
One very inefficient way to do this is to analyze every single frame of a video, which produces an overwhelming amount of duplicated information, since consecutive frames rarely differ much from one to the next.
All of that duplicated information makes it easy to blow past the LLM’s context window, which is the maximum amount of information it can retain at once. Once an LLM exceeds its context window, it has to drop older tokens to make room for new ones as it predicts each new token, just to keep the conversation going.
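For a rough sense of scale, here’s a back-of-the-envelope sketch in Python. The frame rate, tokens-per-frame count, and context window size are illustrative assumptions, not figures from Apple’s paper:

```python
# Back-of-the-envelope illustration of why tokenizing every single frame
# quickly exhausts a context window. All numbers are assumed for illustration.

FPS = 30                  # typical video frame rate (assumed)
TOKENS_PER_FRAME = 256    # plausible visual-token count per frame (assumed)
CONTEXT_WINDOW = 128_000  # a large but finite context window, in tokens (assumed)

def seconds_until_full(fps=FPS, tokens_per_frame=TOKENS_PER_FRAME,
                       context_window=CONTEXT_WINDOW):
    """Seconds of video that fit when every frame is turned into tokens."""
    return context_window / (fps * tokens_per_frame)

print(f"~{seconds_until_full():.0f} seconds of video fill the entire window")
# With these assumptions, well under a minute of footage leaves no room
# for the user's prompt or the model's answer.
```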
Of course, there are more efficient ways to train video LLMs (NVIDIA recently published an interesting paper on this), but this is the general idea to keep in mind for Apple’s study.
Apple’s study
As Apple’s researchers explain it in the paper SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding:
“Video large language models (LLMs) integrate video perception into pre-trained LLMs to process videos and generate responses to user commands. Although significant progress has been made, notable limitations remain in existing Video LLMs.”
The limitations, according to them, are threefold:
- Existing models tend to rely heavily on long context windows and huge numbers of frames, which is inefficient and not easily transferable to smaller models;
- Most of them require complex multi-stage training pipelines (often using private datasets) that are hard to reproduce;
- Many are optimized only for video tasks, which limits their usefulness as general-purpose models that also understand images.
To address those limitations, Apple first looked at SlowFast-LLaVA, an open-source model that had already shown promising results by combining spatial and temporal cues through a two-stream setup: a slow stream that looks at fewer frames in higher detail to capture what’s in the scene, and a fast stream that looks at more frames in lower detail to track how things move over time.
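To make that two-stream idea concrete, here’s a minimal sketch of how such a setup might budget its frames and visual tokens. The frame and token counts below are illustrative assumptions, not the paper’s exact configuration:

```python
import numpy as np

def two_stream_sample(total_frames,
                      slow_frames=8, slow_tokens_per_frame=576,
                      fast_frames=64, fast_tokens_per_frame=32):
    """Pick evenly spaced frame indices for a detailed 'slow' stream and a
    coarse 'fast' stream, and report the combined visual-token budget.
    All counts here are hypothetical, for illustration only."""
    slow_idx = np.linspace(0, total_frames - 1, slow_frames).astype(int)
    fast_idx = np.linspace(0, total_frames - 1, fast_frames).astype(int)
    budget = (slow_frames * slow_tokens_per_frame
              + fast_frames * fast_tokens_per_frame)
    return slow_idx, fast_idx, budget

# A five-minute clip at 30 fps has 9,000 frames, but the model only ever
# looks at a small, fixed subset of them.
slow_idx, fast_idx, budget = two_stream_sample(total_frames=5 * 60 * 30)
print(len(slow_idx), "detailed frames,", len(fast_idx), "coarse frames,",
      budget, "visual tokens in total")
```

The key property is that the visual-token budget stays fixed no matter how long the video is.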
Apple first fine-tuned SlowFast-LLaVA on images to build general visual reasoning capabilities, then trained it jointly on both images and videos (from public datasets) so it could learn temporal structure without sacrificing image understanding.

The result was SlowFast-LLaVA-1.5 (or SF-LLaVA-1.5), a family of models at 1B, 3B, and 7B parameter scales that outperforms much larger models across a range of video tasks, sometimes “by significant margins,” as the researchers themselves note.

In fact, on long-form video benchmarks like LongVideoBench and MLVU, Apple’s model sets new state-of-the-art results across all model sizes, including its smallest 1B version.
What’s more, the model also overcomes the third shortcoming noted by the researchers: it performs well on image tasks too, including benchmarks for knowledge, math reasoning, OCR, and text-rich scenarios.

The team even tested several video compression strategies, but found that their setup struck the best balance between speed, accuracy, and token count.
Still, there are limitations
With SF-LLaVA-1.5, Apple’s researchers capped the model’s input at a maximum of 128 frames.
This means that whether it is analyzing a clip that is a few minutes or a few hours long, it always maxes out at 128 frames, with 96 evenly spaced frames selected for the fast stream and 32 for the slow stream.
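Here’s a small sketch of what that fixed cap means in practice. The 96/32 frame split follows the numbers above; the 30 fps frame rate and everything else is an illustrative assumption:

```python
import numpy as np

def sample_indices(total_frames, fast_frames=96, slow_frames=32):
    """Evenly spaced frame indices for the fast and slow streams."""
    fast_idx = np.linspace(0, total_frames - 1, fast_frames).astype(int)
    slow_idx = np.linspace(0, total_frames - 1, slow_frames).astype(int)
    return fast_idx, slow_idx

for minutes in (2, 120):          # a two-minute clip vs. a two-hour video
    total = minutes * 60 * 30     # frame count at an assumed 30 fps
    fast_idx, _ = sample_indices(total)
    gap_seconds = (fast_idx[1] - fast_idx[0]) / 30
    print(f"{minutes:>3}-minute video: one fast-stream frame every ~{gap_seconds:.0f} s")
```

Under these assumptions, a two-hour video ends up with more than a minute of footage between consecutive sampled frames.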
With that in mind, the researchers acknowledge:
“This approach may miss some key frames in long-form videos and mislead the model about a video’s playback speed. (…) SF-LLaVA-1.5’s performance can be further improved by tuning all parameters, including the visual encoder. However, we found this is not trivial for Long Video LLMs due to the high GPU memory cost of caching the activation values. Future studies could explore the integration of memory-saving techniques, such as Stochastic BP.”
That said, Apple’s approach yielded a state-of-the-art model, with the added distinction of being trained exclusively on public datasets. SF-LLaVA-1.5 is now an open-source model available on GitHub and Hugging Face, and you can find the complete study on arXiv.
Below are a few examples of the model in action:


