Grok Imagine 1.5 Just Took #1 on the Image-to-Video AI Leaderboard. The Real Story Is the Pace.

all thing growth @ ngram.com

TL;DR

Grok Imagine 1.5 debuted May 31, 2026, entered the Artificial Analysis Image-to-Video Arena at #1 with a 1404 Elo score, and opened the API to developers on June 3, 2026.
The model runs on Aurora, an autoregressive architecture that generates each frame from all previous ones, a structural departure from the parallel diffusion approach used by most competitors.
Native audio is generated in the same inference pass as the video: dialogue, sound effects, ambient sound, and music, without a separate audio pipeline.
The leaderboard top spot has changed hands at least four times since January 2026, roughly every 8 to 10 weeks. The pace of change, not any single model, is the main signal.
For teams building on AI video, this is an argument for routing across multiple models rather than committing to one. When any model improves, a multi-model setup absorbs it without rebuilding.

In July 2025, xAI had no video product. On May 31, 2026, Grok Imagine Video 1.5 launched and immediately climbed to #1 on the Artificial Analysis Image-to-Video Arena leaderboard with a 1404 Elo score, beating HappyHorse-1.0, Seedance 2.0, and Google Veo. Ten months from zero to #1 on a global benchmark.

That timeline is worth pausing on. The image-to-video AI category has reshuffled the top spot at least four times since January 2026: Grok Imagine 1.0 in January, Seedance 2.0 in February, HappyHorse-1.0 in April, Grok Imagine 1.5 in June. Roughly every eight to ten weeks, someone new holds the title. That pace of change tells you more about where the model layer is going than any single launch does.

This post covers what Grok Imagine 1.5 actually is, what the Aurora architecture means technically, what the API looks like for developers, and why the speed of leaderboard churn matters for anyone building on top of AI video generation.

What Grok Imagine 1.5 Is

Grok Imagine Video 1.5 is xAI's image-to-video generation model. It takes a still image as input and animates it into a short clip, with camera movement, scene motion, and native audio generated alongside the video frames. The model does not support text-to-video in the current API release; image-to-video is the only path, according to xAI's developer documentation.

Clips run from 6 to 15 seconds at 24 FPS. Resolution is 720p for final output and 480p for draft/preview. The model supports seven aspect ratios including 16:9, 9:16, and 1:1, which covers the main web and social formats.

The consumer tiers are SuperGrok Lite at $10 per month (480p, 6-second maximum) and SuperGrok at $30 per month (720p, up to 15 seconds). The API opened to developers on June 3, 2026, with pricing at $0.08 per second of output at 480p, $0.14 per second at 720p, and an additional $0.01 per input image, according to Roo's detailed breakdown.
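The API pricing above reduces to a simple formula: seconds of output times the per-second rate, plus the per-image fee. Here is a minimal sketch of that arithmetic. The function name, the 6-to-15-second validation, and the rounding to cents are assumptions made for illustration, not documented xAI billing behavior.

```python
# Prices as quoted in this post; check xAI's current docs before relying on them.
PRICE_PER_SECOND = {"480p": 0.08, "720p": 0.14}
INPUT_IMAGE_FEE = 0.01
MIN_SECONDS, MAX_SECONDS = 6, 15  # clip length range stated in this post


def clip_cost(seconds: float, resolution: str = "720p", input_images: int = 1) -> float:
    """Estimate the cost in USD of one generated clip (hypothetical helper)."""
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution}")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(f"clip length must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
    total = seconds * PRICE_PER_SECOND[resolution] + input_images * INPUT_IMAGE_FEE
    return round(total, 2)  # rounding to cents is an assumption


print(clip_cost(15, "720p"))  # longest clip at full resolution
print(clip_cost(6, "480p"))   # shortest clip at draft resolution
```

Under these assumptions, the longest 720p clip comes to $2.11 including one input image, and the shortest 480p draft comes to $0.49.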

The performance jump over version 1.0 is measurable. Grok Imagine 1.0 debuted at #1 in January 2026 with around 1352 Elo (it has since settled at 1336). Version 1.5 enters at 1404, a 52-point gain over that debut score on the same benchmark in roughly five months.

The Leaderboard Right Now

The Artificial Analysis Image-to-Video Arena collects blind human preference votes and converts them into Elo scores, the rating system originally developed for chess. A higher Elo means the model wins more head-to-head comparisons against other models in the dataset. The current top six in the image-to-video category, as of June 2026, look like this.

Grok Imagine 1.5 leads with 1404 Elo, ahead of HappyHorse-1.0 at 1357 and Seedance 2.0 at 1352. Chart: Artificial Analysis Image-to-Video Arena Elo scores, top 6 models, June 2026. Source: Artificial Analysis leaderboard.

Image-to-video arena Elo scores by model, June 2026
Model	Elo score
Grok Imagine 1.5	1404
HappyHorse-1.0	1357
Seedance 2.0	1352
Grok Imagine 1.0	1336
Google Veo	1325
Runway Gen-4.5	1247

The 47-point gap between Grok Imagine 1.5 and HappyHorse-1.0 is meaningful. On this type of preference leaderboard, a gap of that size represents a consistent edge in blind votes, not a borderline result. The 157-point gap between Grok Imagine 1.5 and Runway Gen-4.5 shows how quickly last year's benchmark leaders have fallen relative to newer entrants.
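To put those gaps in perspective, the standard Elo formula converts a rating difference into an expected head-to-head win rate. The sketch below assumes the conventional 400-point logistic scale used in chess. Artificial Analysis may fit its arena ratings differently, so treat the output as rough intuition rather than the leaderboard's own numbers.

```python
# Elo scores from the June 2026 table above.
ELO = {
    "Grok Imagine 1.5": 1404,
    "HappyHorse-1.0": 1357,
    "Seedance 2.0": 1352,
    "Grok Imagine 1.0": 1336,
    "Google Veo": 1325,
    "Runway Gen-4.5": 1247,
}


def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head votes won by A under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


leader = "Grok Imagine 1.5"
for model, rating in ELO.items():
    if model != leader:
        p = expected_win_rate(ELO[leader], rating)
        print(f"{leader} vs {model}: {p:.1%}")
```

Under that assumption, the 47-point lead over HappyHorse-1.0 works out to winning roughly 57% of blind comparisons, and the 157-point lead over Runway Gen-4.5 to roughly 71%.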

Aurora: Why the Architecture Is Actually Different

Most image-to-video AI generation today uses diffusion models. The broad pattern: the model gets a noisy version of every output frame at once, then iteratively denoises them in parallel. It is fast, and it works well for many scenes. The weakness shows when a scene requires tight temporal consistency: the model has to negotiate coherence across frames without a clear causal structure baked into the generation process.

Aurora is different. It is an autoregressive architecture, meaning each frame is predicted from all the frames that came before it. Generation is sequential rather than parallel. That gives the model a cleaner handle on motion continuity, character consistency through camera changes, and the cause-and-effect logic of how things move. xAI says Aurora is a Mixture-of-Experts network that jointly models text, image, video, and audio tokens, which is also what enables native audio in a single pass, according to Roo's technical summary.
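The difference in dependency structure can be shown with a toy sketch. Nothing here reflects xAI's actual implementation or a real diffusion model: frames are single numbers, and `denoise_step` and `predict_next` are made-up stand-ins. The only point is the shape of the two loops, one updating the whole clip together and the other building it frame by frame from everything generated so far.

```python
import random


def denoise_step(frames: list[float]) -> list[float]:
    """Toy diffusion step: every frame is refined in the same pass (here: halve the noise)."""
    return [f * 0.5 for f in frames]


def generate_parallel(num_frames: int, steps: int = 4, seed: int = 0) -> list[float]:
    """Diffusion-style loop: start with noise for all frames, refine them together."""
    rng = random.Random(seed)
    frames = [rng.gauss(0, 1) for _ in range(num_frames)]
    for _ in range(steps):
        frames = denoise_step(frames)
    return frames


def predict_next(history: list[float]) -> float:
    """Toy autoregressive step: the next frame depends on all earlier frames."""
    return sum(history) / len(history) + 1.0


def generate_autoregressive(first_frame: float, num_frames: int) -> list[float]:
    """Autoregressive loop: frames are produced one at a time, in order."""
    frames = [first_frame]
    while len(frames) < num_frames:
        frames.append(predict_next(frames))
    return frames


num_frames = 6 * 24  # a 6-second clip at 24 FPS
print(len(generate_parallel(num_frames)), len(generate_autoregressive(0.0, num_frames)))
```

In the sequential loop, frame N cannot be produced until frames 1 through N-1 exist. That ordering is what gives an autoregressive model its causal handle on motion, and it is also why generation cannot be parallelized across frames in the same way.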

The training scale is also notable: xAI says the model was trained on 110,000 NVIDIA GB200 GPUs. For context, most labs report training runs in the single-digit or low tens of thousands of GPUs. That cluster size is not something every team can replicate.

The practical upshot of the Aurora approach is that audio does not need a separate generation step. Dialogue, sound effects, ambient sound, and music are all generated in the same inference pass as the video. Competitors producing audio-inclusive clips either run a separate audio model afterward or rely on post-processing. Both approaches can introduce alignment errors between what is seen and what is heard. Grok Imagine 1.5 avoids that gap by design, according to The Decoder's coverage.

The Leaderboard History: Four Reshuffles in Six Months

The more interesting number might not be 1404. It might be four.

The Artificial Analysis image-to-video top position has changed hands at least four times since January 2026. Grok Imagine 1.0 debuted at #1 in January. ByteDance's Seedance 2.0 climbed to the top in February. Alibaba's HappyHorse-1.0, released in April, jumped to #1 with 1357 Elo before Grok Imagine 1.5 moved past it in June. Runway, which once held the leading position, has dropped considerably in relative ranking as newer models entered.

And OpenAI's Sora shut down its consumer product on April 26, 2026, removing one of the better-known names in the space entirely. We covered the economics behind that decision in our Sora shutdown analysis.

The chart below shows how the top Elo score on the image-to-video leaderboard has moved across the main contenders over the first half of 2026.

Image-to-video #1 ranking changed four times between January and June 2026, with Elo scores rising from around 1295 to 1404. Chart: Leaderboard position changes across Grok Imagine, Seedance 2.0, and HappyHorse-1.0, January to June 2026. Source: Artificial Analysis Image-to-Video Arena.

AI image-to-video arena Elo score by model and month, 2026
Month	Model at #1	Approx. Elo
January 2026	Grok Imagine 1.0	1352
February 2026	Seedance 2.0	1295
March 2026	Seedance 2.0	1320
April 2026	HappyHorse-1.0	1357
May 2026	Seedance 2.0	1352
June 2026	Grok Imagine 1.5	1404

This pace is not slowing down. Kling 3.0, Kuaishou's model with multilingual lip sync, has multiple entries in the top 10 on the text-to-video leaderboard as of June. Google's Veo 3.1 variant remains competitive on the audio-inclusive track. The field is producing capable models faster than the market can settle on a standard.

What the API Looks Like for Developers

The Grok Imagine Video 1.5 API is available now in preview at api.x.ai, model alias grok-imagine-video-1.5-2026-05-30. Access requires an xAI API key. The model is image-to-video only in the current release.

How does xAI's API pricing ($0.08 per second at 480p, $0.14 at 720p) stack up against other major image-to-video generation providers?

Grok Imagine 1.5 charges $0.14 per second at 720p, compared to $0.40 for Google Veo 3.1 and $0.20 for Runway Gen-4.5. Chart: Published API price per second of 720p output, major image-to-video providers, June 2026. Source: xAI API docs, Google Gemini API pricing, Runway API pricing, Kling API docs.

API price per second of 720p AI video generation output, June 2026
Provider	Price per second (720p)
Google Veo 3.1	$0.40
Runway Gen-4.5	$0.20
Grok Imagine 1.5	$0.14
Kling 3.0	$0.12
Seedance 2.0	$0.10

At $0.14 per second, Grok Imagine 1.5 sits between Runway ($0.20) and Kling ($0.12). A 10-second clip at 720p costs $1.40. For a team generating high volumes of clips, that is a material number. Google Veo 3.1 at $0.40 per second is almost three times the price for a similar clip length. These pricing differences matter most for teams that run hundreds or thousands of generations per month.
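To see how those per-second differences add up at volume, here is a small sketch that projects monthly spend from the prices in the table above. The workload of 1,000 ten-second clips per month is a hypothetical figure chosen for illustration. The calculation also ignores per-image fees, retries, and any volume discounts.

```python
# Published 720p price per second, as listed in the table above (June 2026).
PRICE_PER_SECOND_720P = {
    "Google Veo 3.1": 0.40,
    "Runway Gen-4.5": 0.20,
    "Grok Imagine 1.5": 0.14,
    "Kling 3.0": 0.12,
    "Seedance 2.0": 0.10,
}


def monthly_cost(price_per_second: float, clip_seconds: int, clips_per_month: int) -> float:
    """Projected monthly generation spend in USD, rounded to cents."""
    return round(price_per_second * clip_seconds * clips_per_month, 2)


def cost_table(clip_seconds: int = 10, clips_per_month: int = 1000) -> dict[str, float]:
    """Monthly cost per provider for a hypothetical workload."""
    return {
        provider: monthly_cost(price, clip_seconds, clips_per_month)
        for provider, price in PRICE_PER_SECOND_720P.items()
    }


for provider, cost in sorted(cost_table().items(), key=lambda item: item[1]):
    print(f"{provider:<18} ${cost:,.2f}")
```

For that workload, Grok Imagine 1.5 comes to $1,400 per month against $4,000 for Google Veo 3.1 and $1,000 for Seedance 2.0. The gap is negligible for a handful of clips and significant at thousands.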

One limitation worth noting: there is no text-to-video path in the current API. If your workflow needs video generation from a prompt alone without a starting image, Grok Imagine 1.5 is not the right tool yet. That may change in a future version.

The Model Layer Is Commoditizing Faster Than Anyone Expected

Here is what four reshuffles in six months actually means: no single model is a durable moat. Being the best image-to-video AI generator in January does not mean you are the best in June. HappyHorse-1.0 was ahead for about six weeks. Grok Imagine 1.0 held the top spot for roughly the same window. The generation quality keeps improving, but so does everyone else's.

This is the same pattern playing out in language models, image generation, and now video. The raw generation capability (turning a prompt or image into output) is becoming a shared infrastructure problem. The models that can do it are multiplying. The gap between the best and the rest is narrowing. And the best model six months from now is not the one that's best today.

For teams building products on top of AI video generation, this creates a structural question: should you commit deeply to one model, or build in a way that lets you swap the underlying generation layer as the leaderboard moves? The answer is increasingly obvious. Single-model commitment works until the leaderboard shifts again, as it will, on roughly an 8-to-10-week cycle.

A multi-model routing approach absorbs leaderboard changes without rebuilding. When Grok Imagine 1.5 is the best choice for a given generation task, you route there. When the next model comes along in two months, you route there instead. The orchestration layer stays stable; only the model underneath gets updated. This is already how the strongest AI video platforms route across providers for language, image, and video generation alike.

ngram takes exactly this approach for AI image generation, routing across FAL (primary), Replicate (fallback), and Grok Imagine, so improvements at the model layer come through automatically. The same logic applies across the video generation stack.
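As a rough illustration of what routing with fallback can look like, here is a minimal sketch. It is not ngram's implementation. The `Provider` type, the `route` function, and the stub generators are all hypothetical, and a production router would add timeouts, narrower error handling, and per-task model selection.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Provider:
    name: str
    # Hypothetical interface: takes an image reference, returns a video reference.
    generate: Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the priority list has failed."""


def route(image_ref: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in priority order; return (provider name, result) from the first success."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.generate(image_ref)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{provider.name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real model APIs.
def flaky_primary(image_ref: str) -> str:
    raise TimeoutError("primary timed out")


def stable_fallback(image_ref: str) -> str:
    return f"video-for-{image_ref}"


providers = [Provider("primary", flaky_primary), Provider("fallback", stable_fallback)]
print(route("hero.png", providers))
```

With this shape, adopting a new leaderboard leader means adding one `Provider` entry and reordering the list. The calling code does not change.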

What This Means in Practice

For teams already integrating image-to-video AI into their workflows, Grok Imagine 1.5 is worth evaluating. The 720p quality and native audio are genuine differentiators from most of the field right now. The autoregressive architecture produces more temporally consistent output for scenes with motion and character continuity. At $0.14 per second it is priced competitively.

The limitation to keep in mind: image-to-video only. If your pipeline needs text-to-video (starting from a prompt with no reference image), this model does not cover that case today. And because the leaderboard moves fast, it is worth tracking what the field looks like in August and October.

xAI went from no video product to #1 on the global benchmark in ten months. That tells you something about both what xAI is capable of and how fast the model layer moves. Whatever is sitting at #1 six months from now probably does not exist yet.

The teams best positioned for that future are not the ones who picked the current #1 model and locked in. They are the ones who built for a world where the best image-to-video AI generator keeps changing, and designed their stack to change with it.

Frequently Asked Questions

What is Grok Imagine Video 1.5?

Grok Imagine Video 1.5 is xAI's image-to-video generation model, released May 31, 2026. It takes a still image and animates it into a 6-to-15-second clip at 720p and 24 FPS, with audio generated in the same inference pass. The API became available on June 3, 2026. As of June 2026, it holds the #1 position on the Artificial Analysis Image-to-Video Arena leaderboard with a 1404 Elo score.

What is the Aurora architecture?

Aurora is xAI's autoregressive model architecture. Unlike diffusion models that generate all frames in parallel and then denoise them, Aurora generates each frame sequentially, with each new frame conditioned on everything that came before it. This gives the model tighter control over motion continuity and character consistency. Aurora is also a Mixture-of-Experts network that jointly models text, image, video, and audio tokens, which is how Grok Imagine 1.5 produces native audio without a separate pipeline.

How much does the Grok Imagine Video 1.5 API cost?

API pricing is $0.08 per second of output at 480p and $0.14 per second at 720p, plus $0.01 per input image. A 15-second clip at 720p costs $2.10 in generation fees, not counting the input image. Consumer tiers are SuperGrok Lite at $10 per month (480p, 6-second clips) and SuperGrok at $30 per month (720p, up to 15 seconds).

Can Grok Imagine 1.5 do text-to-video?

Not in the current API release. The model is image-to-video only: it needs a starting still image to work from. Text-to-video (starting from a prompt alone with no image input) is not supported in the Grok Imagine Video 1.5 preview API, according to xAI's documentation.

Why does the AI video leaderboard keep changing?

The image-to-video AI leaderboard is changing because multiple well-resourced teams are shipping capable models in parallel, and the quality ceiling for each generation is rising quickly. Since January 2026, the #1 spot has changed hands at least four times: Grok Imagine 1.0, Seedance 2.0, HappyHorse-1.0, and Grok Imagine 1.5. The underlying technical approaches are also diverging (diffusion vs. autoregressive vs. hybrid), which means different models may hold quality advantages on different tasks.

What happened to Sora's position on the leaderboard?

OpenAI shut down the Sora consumer product on April 26, 2026, removing it from active evaluation. The economics of running large video generation models at consumer scale were unsustainable, as we covered in our Sora shutdown analysis. The API is scheduled to keep running through September 2026.

Does the AI video model layer matter if I'm building a product?

Yes, but not in the way you might expect. The model layer is important, but committing to a single model is increasingly risky given how fast the leaderboard moves. Teams that build with routing logic (the ability to swap the underlying image-to-video AI model without rebuilding the product layer) are better positioned to absorb improvements as they come. This is why AI video statistics for 2026 consistently show usage spreading across multiple providers rather than concentrating on one.
