
Video Understanding Showdown: Gemini, ChatGPT, Claude

A hands-on test compares how well Gemini, ChatGPT, and Claude understand YouTube and local video files, plus how they handle thumbnail creation.

The promise that AI can “watch” a video is starting to feel less like marketing and more like a practical workflow, and this showdown makes the differences hard to miss. In testing focused on video understanding, Gemini came out on top for reading video content across common formats, while Claude flatly declined and ChatGPT leaned heavily on an extra tool to get real work done.

The tests put three systems through the same basic challenge: could they process what’s happening in a video? The setup included both a YouTube link and local video files, intentionally stripped of helper signals like transcripts or metadata. The goal wasn’t just to see whether an AI could generate generic commentary, but whether it could actually interpret the contents from frames and then turn that understanding into something useful, like a better YouTube thumbnail.

The first video was a YouTube upload about the scientific process of annealing, created to test whether the AI could grasp a narrative that includes verbal explanation. The second test clip was a drone motion exercise: the filmer stands in front of a DJI Neo 2, uses hand gestures to control the drone’s movement, and the video contains no audio. The third item was the original MOV file used for a “walk-and-talk” about YouTube posting strategy, evaluated locally so the models wouldn’t rely on any YouTube-provided context.

A key detail was how the prompts were worded, which determined whether the models tried to “search around” for metadata. The tester found that asking the AIs to “watch this video” worked better than requesting “understand” or “summarize,” because those latter prompts sometimes pushed the systems toward looking for information like titles, transcripts, or other cues. With “watch,” the models generally treated the request as an instruction to actually process the video itself.

Claude was unable to take the first step. Whether using the app or a web interface, it returned essentially the same limitation: it couldn’t watch video content, couldn’t process video or audio streams, and couldn’t handle YouTube links for video understanding. For this specific test, the interaction didn’t evolve beyond that hard boundary.

Gemini, by contrast, handled every format in the test directly in the browser. The tester didn’t even need a standalone app: Gemini’s web experience worked with a YouTube URL as well as large local MP4 and MOV files. That meant the system could stay in one place while interpreting what was shown, rather than forcing a switch into an additional capability layer.

The drone clip was particularly telling. With no audio and no visible drone, the only information came from what the camera did in response to hand gestures, like raising a palm toward the lens to signal a stop or a move, and then guiding the drone around the yard while the camera stayed centered on the subject. Gemini’s interpretation matched the apparent action closely, describing the gesture-driven changes in angle and distance as the footage progressed.

The annealing video also landed well with Gemini. It could identify sections, report on specific points made verbally, and demonstrate an understanding that went beyond generic summarization. The same strength carried into the uploaded walk-and-talk MOV, where Gemini was able to connect location changes and shifts in what the creator was saying across the length of the clip.

Gemini did hit a wall when the task shifted from understanding to style control in image creation. The test asked Gemini to pick a single “maximum impact” frame for a YouTube thumbnail and then use an image-generation step to create a new thumbnail, with the creator’s existing style as context. The generated attempts were vivid and potentially clickable, but they drifted away from the intended look, such as by inventing a bearded figure instead of using the creator’s own image, and even produced misspellings. The result underscored that Gemini’s video comprehension didn’t automatically translate into disciplined thumbnail execution.

Where ChatGPT fell behind was more immediate: the base ChatGPT experience couldn’t read the YouTube link in this test, and it also didn’t handle the local video files as provided. The tester reported a practical limitation tied to video size, with the tested clips exceeding what the system would accept directly.

The good news was that ChatGPT became far more capable when paired with Codex. Codex was able to process local files and interpret what was happening, including the drone scene. Its description matched the setup: a residential backyard test, gestures like a hand raise or wave, camera viewpoint movement that shifts angle and distance while keeping the subject mostly centered, and the absence of major scene changes or additional activities.

Codex also faced an early snag with the walk-and-talk MOV file, but it responded by requesting permission to install Python code and libraries to enable audio transcription. After that setup work, it could view and interpret the video context rather than treating the file as uninterpretable raw data.
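
The article doesn’t name the tools Codex set up, so the following is only a rough sketch of what an audio-transcription step like that could look like, assuming ffmpeg plus the open-source openai-whisper package; the file names are purely illustrative.

import subprocess

import whisper  # pip install openai-whisper

VIDEO = "walk-and-talk.mov"   # hypothetical local path
AUDIO = "walk-and-talk.wav"

# Pull the audio track out of the MOV as 16 kHz mono WAV with ffmpeg.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-vn", "-ac", "1", "-ar", "16000", AUDIO],
    check=True,
)

# Transcribe with Whisper; result["segments"] carries per-segment timestamps.
model = whisper.load_model("base")
result = model.transcribe(AUDIO)
for seg in result["segments"]:
    print(f"{seg['start']:7.1f}s  {seg['text'].strip()}")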

The YouTube portion required another workaround. Codex couldn’t watch the YouTube stream directly, so the tester asked it to download the full video and process it locally. That approach worked: Codex produced a Python script, installed libraries, and helped develop video-downloading logic on the fly before moving back into watching mode.
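
The article only says that Codex wrote a Python script and installed libraries; a minimal sketch of that download-then-watch-locally workaround, assuming the yt-dlp package and a placeholder URL, might look like this.

from yt_dlp import YoutubeDL  # pip install yt-dlp

URL = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder, not the tested video

options = {
    "format": "mp4/bestvideo+bestaudio",    # prefer a single MP4 when available
    "outtmpl": "downloaded_video.%(ext)s",  # local file name template
}

# Download the full video so it can be processed locally like the uploaded clips.
with YoutubeDL(options) as ydl:
    ydl.download([URL])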

Thumbnail generation was where the workflow became more complex. Codex indicated that it had access to image-generation tools in the session, but it didn’t recognize an exposed “Images 2.0” tool by name at first. After the tester clarified the Images 2.0 concept and pointed Codex toward OpenAI’s site so it could locate the capability, the tool awareness improved, but it still couldn’t do the thumbnail task effectively on its own.

At that point, the tester bridged the systems: Codex selected a single frame for maximum impact, and then the tester used ChatGPT as the image-generation endpoint. The workflow involved exporting the frame so it could be uploaded, then providing a prompt that incorporated context from the video and alignment with the creator’s existing thumbnail style. Compared with Gemini’s thumbnail direction, ChatGPT and Codex’s final output picked up several intended visual cues, including the creator’s lettering color scheme.
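
The article doesn’t say how the chosen frame was exported for upload; one way to grab a single frame at a known timestamp, assuming OpenCV and illustrative file names and times, is sketched below.

import cv2  # pip install opencv-python

VIDEO = "walk-and-talk.mov"    # hypothetical local path
TIMESTAMP_MS = 4 * 60 * 1000   # hypothetical "maximum impact" moment at 4:00

cap = cv2.VideoCapture(VIDEO)
cap.set(cv2.CAP_PROP_POS_MSEC, TIMESTAMP_MS)  # seek to the chosen moment
ok, frame = cap.read()
cap.release()

if not ok:
    raise RuntimeError("could not read a frame at the requested timestamp")
cv2.imwrite("thumbnail_frame.png", frame)  # this file gets uploaded to the image model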

Even then, the result wasn’t perfect. The generated thumbnail omitted certain elements like the logo, shifted design details, and depicted the metal differently from what was intended. The tester iterated by prompting the image model to correct the aluminum shape, then refined again to address how the bend should be drawn and how marker placement should align with the intended guidance marks. With those prompt revisions, the thumbnail came closer to the target look.

Both Gemini and ChatGPT were also timed in a way that highlighted how quickly they could process a full clip relative to its runtime. The science video and the walk-and-talk were each about fifteen minutes long, yet both systems were able to “watch” and parse the content in roughly a couple of minutes each. That speed matters because it changes what “video understanding” can realistically support: scanning, extracting key points, and jumping to moments without manually watching end to end.

The silent drone clip was again used as proof of interpretive ability. Even with no audio, just frames and camera motion, both Gemini and the ChatGPT-with-Codex approach were able to infer contextual meaning: that a gesture-driven camera/drone test was underway, with the drone operating around human height and maintaining framing of the subject.

Practical uses emerged immediately from the test results. Gemini was used with a YouTube news report to request details about the topics discussed, and the tester also described an approach for security-camera-style workflows: using video scanning to find a specific kind of action quickly. Another highlighted feature was timestamping key thoughts, which enables skipping directly to relevant sections by clicking into the video at the labeled moments.
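
The tester did all of this inside Gemini’s own interface, but the same scan-instead-of-watch idea can be approximated programmatically; the sketch below, assuming OpenCV and illustrative paths, samples one frame every few seconds so each sampled moment can be reviewed or passed to a vision model.

import cv2  # pip install opencv-python

VIDEO = "security_clip.mp4"   # hypothetical footage
STEP_SECONDS = 5              # sample one frame every five seconds

cap = cv2.VideoCapture(VIDEO)
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

for index in range(0, total_frames, int(fps * STEP_SECONDS)):
    cap.set(cv2.CAP_PROP_POS_FRAMES, index)
    ok, frame = cap.read()
    if not ok:
        break
    seconds = int(index / fps)
    cv2.imwrite(f"frame_{seconds:05d}s.png", frame)  # candidate moment to review

cap.release()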

For creators, the combination of video understanding and frame selection points to a new kind of production tool. The tester still preferred manual thumbnail creation, but the ability to extract usable frames and construct thumbnails suggests a faster drafting method: the AI can narrow down candidates and generate early options, leaving human taste to finalize the design.

Claude, while it was the only system that effectively failed the core “watch video” test, still retained relevance in the tester’s broader toolkit. The report notes Claude’s strength in other areas, including vibe coding, even though it wasn’t able to participate in this particular video-focused capability comparison.

Overall, the experiment reinforces a split that is becoming clearer across the market: some models are genuinely strong at interpreting video from frames inside their primary interface, while others require additional agent tooling to bridge gaps like video input handling, transcription, or downloading before they can perform the “watch” part. The most interesting part may be that the right workflow can still turn those limitations into results, even if it takes more steps than simply opening the app and pressing play.

