Video to Text

Summary

Video to text is performed by connecting a Video Block to a Text Block. Similar to Image to text, it can be used to extract information from any video input.

This combination of blocks can be used to describe the video or to give a numbered list of frames with a detailed description of the visual content of each.

Models

Video to Text Model

Llava, a fine-tuned version of the Qwen model.

Read more about Llava here.

Prompt

  • "Describe this video in a few sentences."

  • "Describe the composition and focal points of each frame, focusing on how elements are arranged and how they guide the viewer's attention."

  • "Illustrate the atmosphere of each scene, focusing on sensory details such as lighting, visceral color sensation, and spatial depth."

  • "Describe how recurring motifs and visual patterns contribute to thematic development in the video."

  • "Give me a list of the frames in this video and describe the visual composition of each."
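
If you reuse these prompts often, it can help to keep them as templates and fill in details as needed. The following is a minimal Python sketch, not part of the product: the names PROMPT_TEMPLATES and build_prompt, and the optional frame cap, are hypothetical and only illustrate how prompt text for the Text Block might be assembled before pasting it in.

```python
from typing import Optional

# Hypothetical prompt templates based on the examples above.
# These names are illustrative and not part of any product API.
PROMPT_TEMPLATES = {
    "summary": "Describe this video in a few sentences.",
    "frames": (
        "Give me a numbered list of the frames in this video "
        "and describe the visual composition of each."
    ),
}


def build_prompt(kind: str, max_frames: Optional[int] = None) -> str:
    """Return the prompt text for `kind`, optionally asking for a frame cap."""
    prompt = PROMPT_TEMPLATES[kind]
    if max_frames is not None:
        # Assumption: the model follows a plain-language limit in the prompt.
        prompt += f" Limit the answer to at most {max_frames} frames."
    return prompt


if __name__ == "__main__":
    print(build_prompt("summary"))
    print(build_prompt("frames", max_frames=5))
```

Keeping the wording in one place makes it easier to compare how small changes to a prompt affect the description returned for the same video.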
