
Grok Imagine 1.5 Just Took #1 on the Image-to-Video AI Leaderboard. The Real Story Is the Pace.

xAI's Grok Imagine 1.5 debuted May 31, 2026 and hit #1 on the Artificial Analysis Image-to-Video Arena with 1404 Elo, beating HappyHorse, Seedance 2.0, and Google Veo. Built on the Aurora autoregressive architecture with native audio in a single pass, it reached the top in 10 months from zero. Here's what the pace means.

10 min read | Updated June 17, 2026
Written and edited by
Rishikesh Ranjan
all things growth @ ngram.com

In July 2025, xAI had no video product. On May 31, 2026, Grok Imagine Video 1.5 launched and immediately climbed to #1 on the Artificial Analysis Image-to-Video Arena leaderboard with a 1404 Elo score, beating HappyHorse-1.0, Seedance 2.0, and Google Veo. Ten months from zero to #1 on a global benchmark.

That timeline is worth pausing on. The image-to-video AI category has reshuffled the top spot at least four times since January 2026: Grok Imagine 1.0 in January, Seedance 2.0 in February, HappyHorse-1.0 in April, Grok Imagine 1.5 in June. Roughly every eight to ten weeks, someone new holds the title. That pace of change tells you more about where the model layer is going than any single launch does.

This post covers what Grok Imagine 1.5 actually is, what the Aurora architecture means technically, what the API looks like for developers, and why the speed of leaderboard churn matters for anyone building on top of AI video generation.

What Grok Imagine 1.5 Is

Grok Imagine Video 1.5 is xAI's image-to-video generation model. It takes a still image as input and animates it into a short clip, with camera movement, scene motion, and native audio generated alongside the video frames. The model does not support text-to-video in the current API release; image-to-video is the only path, according to xAI's developer documentation.

Clips run from 6 to 15 seconds at 24 FPS. Resolution is 720p for final output and 480p for draft/preview. The model supports seven aspect ratios including 16:9, 9:16, and 1:1, which covers the main web and social formats.

The consumer tiers are SuperGrok Lite at $10 per month (480p, 6-second maximum) and SuperGrok at $30 per month (720p, up to 15 seconds). The API opened to developers on June 3, 2026, with pricing at $0.08 per second of output at 480p, $0.14 per second at 720p, and an additional $0.01 per input image, according to Roo's detailed breakdown.
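To make the per-second pricing concrete, here is a minimal Python sketch that estimates the API cost of one clip from the rates quoted above ($0.08 per second at 480p, $0.14 per second at 720p, $0.01 per input image). It works in integer cents to avoid floating-point rounding and assumes whole-second clip lengths in the 6-to-15-second range reported for version 1.5. The function and constant names are hypothetical, and the rates are the June 2026 figures cited in this post, not values read from xAI's API.

```python
# Hypothetical helper: estimates Grok Imagine 1.5 API cost per clip.
# Rates (in cents) are the June 2026 figures cited in this post; check
# xAI's current pricing before relying on them.
RATE_CENTS_PER_SECOND = {"480p": 8, "720p": 14}
IMAGE_FEE_CENTS = 1  # $0.01 per input image

MIN_SECONDS, MAX_SECONDS = 6, 15  # clip length range reported for version 1.5


def clip_cost_cents(seconds: int, resolution: str, input_images: int = 1) -> int:
    """Return the estimated cost of one clip in integer cents."""
    if resolution not in RATE_CENTS_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(f"clip length must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
    return seconds * RATE_CENTS_PER_SECOND[resolution] + input_images * IMAGE_FEE_CENTS


def format_dollars(cents: int) -> str:
    """Render integer cents as a dollar string, e.g. 141 -> "$1.41"."""
    return f"${cents // 100}.{cents % 100:02d}"


if __name__ == "__main__":
    for seconds, res in [(6, "480p"), (10, "720p"), (15, "720p")]:
        print(f"{seconds:>2} s at {res}: {format_dollars(clip_cost_cents(seconds, res))}")
```

Including the input-image fee, a 6-second 480p clip comes to $0.49, a 10-second 720p clip to $1.41, and a 15-second 720p clip to $2.11.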

The performance jump over version 1.0 is measurable. Grok Imagine 1.0 debuted at #1 in January 2026 with around 1336 Elo. Version 1.5 enters at 1404, a 68-point gain on the same benchmark in roughly five months.

The Leaderboard Right Now

The Artificial Analysis Image-to-Video Arena uses blind human preference votes to assign Elo scores, the same method used in chess rating systems. A higher Elo means the model wins more head-to-head comparisons against other models in the dataset. The current top six in the image-to-video category, as of June 2026, look like this.

Image-to-video arena Elo scores by model, top 6, June 2026. Source: Artificial Analysis leaderboard.

Model               Elo score
Grok Imagine 1.5    1404
HappyHorse-1.0      1357
Seedance 2.0        1352
Grok Imagine 1.0    1336
Google Veo          1325
Runway Gen-4.5      1247

The 47-point gap between Grok Imagine 1.5 and HappyHorse-1.0 is meaningful. On this type of preference leaderboard, a gap of that size represents a consistent edge in blind votes, not a borderline result. The 157-point gap between Grok Imagine 1.5 and Runway Gen-4.5 shows how quickly last year's benchmark leaders have fallen relative to newer entrants.
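To translate an Elo gap into something more intuitive, the standard Elo formula gives the expected head-to-head win rate of the higher-rated model. The sketch below assumes the classic logistic Elo curve with a 400-point scale; Artificial Analysis may compute its arena scores differently (for example with a Bradley-Terry fit), so treat the percentages as illustrative rather than official.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head votes model A wins against model B
    under the standard Elo logistic curve (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


if __name__ == "__main__":
    leader = 1404  # Grok Imagine 1.5
    for name, rating in [("HappyHorse-1.0", 1357), ("Runway Gen-4.5", 1247)]:
        print(f"vs {name}: {expected_win_rate(leader, rating):.0%}")
```

Under that assumption, the 47-point gap works out to roughly a 57% expected win rate against HappyHorse-1.0, and the 157-point gap to roughly 71% against Runway Gen-4.5.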

Aurora: Why the Architecture Is Actually Different

Most image-to-video AI generation today uses diffusion models. The broad pattern: the model starts from a noisy version of every output frame at once, then iteratively denoises them in parallel. It is fast, and it works well for many scenes. The weakness shows when a scene requires tight temporal consistency: the model has to negotiate coherence across frames without a clear causal structure baked into the generation process.

Aurora is different. It is an autoregressive architecture, meaning each frame is predicted from all the frames that came before it. Generation is sequential rather than parallel. That gives the model a cleaner handle on motion continuity, character consistency through camera changes, and the cause-and-effect logic of how things move. xAI says Aurora is a Mixture-of-Experts network that jointly models text, image, video, and audio tokens, which is also what enables native audio in a single pass, according to Roo's technical summary.
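The difference in dependency structure can be shown with a toy Python sketch. It is not Aurora or any real diffusion model: "frames" are single numbers, `predict_next` stands in for a learned autoregressive predictor, and `denoise_step` stands in for a learned denoiser (both are hypothetical callables). What it illustrates is the order of computation: the autoregressive loop produces frame t only after frames 0 to t-1 exist and conditions on all of them, while the diffusion-style loop refines every frame at once on each step.

```python
import random
from typing import Callable, List


def generate_autoregressive(
    first_frame: float,
    num_frames: int,
    predict_next: Callable[[List[float]], float],
) -> List[float]:
    """Sequential generation: each new frame is predicted from all prior frames."""
    frames = [first_frame]  # image-to-video: the input image acts as frame 0
    while len(frames) < num_frames:
        frames.append(predict_next(frames))  # the predictor sees the full history
    return frames


def generate_parallel_denoise(
    num_frames: int,
    num_steps: int,
    denoise_step: Callable[[List[float], int], List[float]],
    seed: int = 0,
) -> List[float]:
    """Diffusion-style generation: start from noise for every frame,
    then refine all frames together on each step."""
    rng = random.Random(seed)
    frames = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]
    for step in range(num_steps):
        frames = denoise_step(frames, step)  # updates every frame at once
    return frames


if __name__ == "__main__":
    # Toy predictor: continue the motion implied by the last two frames.
    constant_velocity = lambda h: h[-1] + (h[-1] - h[-2] if len(h) > 1 else 1.0)
    print(generate_autoregressive(0.0, 5, constant_velocity))

    # Toy denoiser: pull every frame halfway toward a target ramp each step.
    toward_ramp = lambda f, s: [x + 0.5 * (i - x) for i, x in enumerate(f)]
    print(generate_parallel_denoise(5, 30, toward_ramp))
```

Both toy runs land on the same ramp of values; the contrast is that the first builds it one frame at a time with access to history, while the second converges on all frames simultaneously.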

The training scale is also notable: xAI says the model was trained on 110,000 NVIDIA GB200 GPUs. For context, most labs report training runs on GPU counts in the single-digit thousands to low tens of thousands. That cluster size is not something every team can replicate.

The practical output of the Aurora approach is that audio does not need a separate generation step. Dialogue, sound effects, ambient sound, and music are all generated in the same inference pass as the video. Many competing pipelines that produce audio-inclusive clips either run a separate audio model afterward or rely on post-processing. Both approaches can introduce alignment errors between what is seen and what is heard. Grok Imagine 1.5 avoids that gap by design, according to The Decoder's coverage.

The Leaderboard History: Four Reshuffles in Six Months

The more interesting number might not be 1404. It might be four.

The Artificial Analysis image-to-video top position has changed hands at least four times since January 2026. Grok Imagine 1.0 debuted at #1 in January. ByteDance's Seedance 2.0 climbed to the top in February. Alibaba's HappyHorse-1.0, released in April, jumped to #1 with 1357 Elo before Grok Imagine 1.5 moved past it in June. Runway, which once held the leading position, has dropped considerably in relative ranking as newer models entered.

And OpenAI shut down its Sora consumer product on April 26, 2026, removing one of the better-known names in the space entirely. We covered the economics behind that decision in our Sora shutdown analysis.

The table below shows which model held #1 on the image-to-video leaderboard each month in the first half of 2026, along with its approximate Elo score.

Model at #1 on the Artificial Analysis Image-to-Video Arena by month, January to June 2026, with approximate Elo. Source: Artificial Analysis Image-to-Video Arena.

Month           Model at #1         Approx. Elo
January 2026    Grok Imagine 1.0    1352
February 2026   Seedance 2.0        1295
March 2026      Seedance 2.0        1320
April 2026      HappyHorse-1.0      1357
May 2026        Seedance 2.0        1352
June 2026       Grok Imagine 1.5    1404

This pace is not slowing down. Kling 3.0, Kuaishou's model with multilingual lip sync, has multiple entries in the top 10 on the text-to-video leaderboard as of June. Google's Veo 3.1 remains competitive on the audio-inclusive track. The field is producing capable models faster than the market can settle on a standard.

What the API Looks Like for Developers

The Grok Imagine Video 1.5 API is available now in preview at api.x.ai, model alias grok-imagine-video-1.5-2026-05-30. Access requires an xAI API key. The model is image-to-video only in the current release.
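As a rough sketch of what a request might carry, the Python below assembles headers and a JSON body for an image-to-video job and checks the constraints described earlier. Only the model alias and the constraints (a starting image is required, 6 to 15 seconds, 480p or 720p, 24 FPS) come from this post. The JSON field names, the PNG data-URL encoding for the image, the bearer-token header, and the `XAI_API_KEY` environment variable are all assumptions, so check xAI's API reference for the actual endpoint and schema. The sketch only builds the request; it does not send anything.

```python
import base64
import json
import os
from typing import Dict, Optional, Tuple

MODEL_ALIAS = "grok-imagine-video-1.5-2026-05-30"  # alias cited in this post
SUPPORTED_RESOLUTIONS = {"480p", "720p"}
FPS = 24


def build_video_request(
    image_bytes: bytes,
    seconds: int,
    resolution: str = "720p",
    aspect_ratio: str = "16:9",
    api_key: Optional[str] = None,
) -> Tuple[Dict[str, str], bytes]:
    """Return (headers, JSON body) for a hypothetical image-to-video request."""
    if not image_bytes:
        raise ValueError("image-to-video only: a starting image is required")
    if not 6 <= seconds <= 15:
        raise ValueError("clip length must be between 6 and 15 seconds")
    if resolution not in SUPPORTED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")

    key = api_key or os.environ.get("XAI_API_KEY", "")  # assumed env var name
    headers = {
        "Authorization": f"Bearer {key}",  # assumed bearer-token auth
        "Content-Type": "application/json",
    }
    # Hypothetical field names; consult xAI's API reference for the real schema.
    body = {
        "model": MODEL_ALIAS,
        "image": "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii"),
        "duration_seconds": seconds,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
    }
    return headers, json.dumps(body).encode("utf-8")


def frame_count(seconds: int) -> int:
    """Number of frames in a clip at the reported 24 FPS."""
    return seconds * FPS
```

Validating locally before any network call keeps failures cheap: a request for a 20-second clip or a 1080p render fails immediately instead of costing a round trip. At 24 FPS, a 15-second clip is 360 frames.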

How does Grok Imagine 1.5's API pricing stack up against other major image-to-video generation providers?

API price per second of 720p output, major image-to-video providers, June 2026. Source: xAI API docs, Google Gemini API pricing, Runway API pricing, Kling API docs.

Provider            Price per second (720p)
Google Veo 3.1      $0.40
Runway Gen-4.5      $0.20
Grok Imagine 1.5    $0.14
Kling 3.0           $0.12
Seedance 2.0        $0.10

At $0.14 per second, Grok Imagine 1.5 sits between Runway ($0.20) and Kling ($0.12). A 10-second clip at 720p costs $1.40 in generation fees, plus $0.01 for the input image. Google Veo 3.1, at $0.40 per second, is almost three times the price for the same clip length. These differences are small per clip but become material for teams running hundreds or thousands of generations per month.
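To see how those per-second differences compound at volume, the sketch below multiplies the published 720p rates from the pricing table by a monthly output volume. The workload (1,000 ten-second clips per month) is a hypothetical example, and the calculation ignores per-image fees, draft renders, and any volume discounts a provider might offer.

```python
from typing import Dict

# Published 720p API prices per second of output, June 2026, in cents
# (figures as cited in this post).
PRICE_CENTS_PER_SECOND_720P: Dict[str, int] = {
    "Google Veo 3.1": 40,
    "Runway Gen-4.5": 20,
    "Grok Imagine 1.5": 14,
    "Kling 3.0": 12,
    "Seedance 2.0": 10,
}


def monthly_cost_dollars(clips_per_month: int, seconds_per_clip: int) -> Dict[str, int]:
    """Generation cost per provider in whole dollars (no image fees or discounts)."""
    total_seconds = clips_per_month * seconds_per_clip
    return {
        provider: total_seconds * cents // 100
        for provider, cents in PRICE_CENTS_PER_SECOND_720P.items()
    }


if __name__ == "__main__":
    # Hypothetical workload: 1,000 ten-second clips per month.
    costs = monthly_cost_dollars(1000, 10)
    for provider, dollars in sorted(costs.items(), key=lambda kv: kv[1]):
        print(f"{provider:<18} ${dollars:,}")
```

At that hypothetical volume, monthly generation spend ranges from $1,000 (Seedance 2.0) to $4,000 (Google Veo 3.1), with Grok Imagine 1.5 at $1,400.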

One limitation worth noting: there is no text-to-video path in the current API. If your workflow needs video generation from a prompt alone without a starting image, Grok Imagine 1.5 is not the right tool yet. That may change in a future version.

The Model Layer Is Commoditizing Faster Than Anyone Expected

Here is what four reshuffles in six months actually means: no single model is a durable moat. Being the best image-to-video AI generator in January does not mean you are the best in June. HappyHorse-1.0 was ahead for about six weeks. Grok Imagine 1.0 held the top spot for roughly the same window. Any one lab's models keep improving, but so do everyone else's.

This is the same pattern playing out in language models, image generation, and now video. The raw generation capability (turning a prompt or image into output) is becoming a shared infrastructure problem. The models that can do it are multiplying. The gap between the best and the rest is narrowing. And the best model six months from now is unlikely to be the one that's best today.

For teams building products on top of AI video generation, this creates a structural question: should you commit deeply to one model, or build in a way that lets you swap the underlying generation layer as the leaderboard moves? The evidence increasingly favors the second. Single-model commitment works until the leaderboard shifts again, as it will, on roughly an 8-to-10-week cycle.

A multi-model routing approach absorbs leaderboard changes without rebuilding. When Grok Imagine 1.5 is the best choice for a given generation task, you route there. When the next model comes along in two months, you route there instead. The orchestration layer stays stable; only the model underneath gets updated. Many AI platforms already route across providers this way for language, image, and video generation alike.
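A minimal sketch of that orchestration layer, assuming a simple task-to-model routing table: product code asks for a capability (here "image_to_video") and the table decides which model serves it. The task names, the `route` function, and the placeholder model identifiers other than the Grok alias are hypothetical. The point is that reacting to a leaderboard change becomes a one-line table edit rather than a rebuild.

```python
from typing import Dict, Optional

# Hypothetical routing table: capability -> model identifier.
# Following the leaderboard means editing this table, not the callers.
ROUTES: Dict[str, str] = {
    "image_to_video": "grok-imagine-video-1.5-2026-05-30",
    "text_to_video": "placeholder-t2v-model",  # hypothetical; Grok Imagine 1.5 has no text-to-video path
}


def route(task: str, routes: Optional[Dict[str, str]] = None) -> str:
    """Return the model identifier currently assigned to a task."""
    table = ROUTES if routes is None else routes
    if task not in table:
        raise ValueError(f"no model configured for task {task!r}")
    return table[task]


if __name__ == "__main__":
    print(route("image_to_video"))
    # When the leaderboard moves, only the table changes; callers keep calling route().
    next_routes = {**ROUTES, "image_to_video": "next-leader-model"}  # hypothetical model id
    print(route("image_to_video", next_routes))
```

In a real system the table would typically live in configuration rather than code, so swapping models does not require a deploy.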

ngram takes exactly this approach for AI image generation, routing across FAL (primary), Replicate (fallback), and Grok Imagine, so improvements at the model layer come through automatically. The same logic applies across the video generation stack.
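The primary/fallback ordering can be sketched as a chain that tries providers in order and moves to the next one on failure. This is an illustrative pattern, not ngram's implementation: the provider callables below are hypothetical stand-ins, and production code would catch provider-specific errors and add timeouts, retries, and logging.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (name, generate function)


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def generate_with_fallback(request: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, result) from the first success."""
    errors: List[str] = []
    for name, generate in providers:
        try:
            return name, generate(request)
        except Exception as exc:  # real code would catch narrower, provider-specific errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def flaky_primary(request: str) -> str:
        raise TimeoutError("primary timed out")  # simulated outage

    def steady_fallback(request: str) -> str:
        return f"generated: {request}"

    chain = [("primary", flaky_primary), ("fallback", steady_fallback)]
    print(generate_with_fallback("a red bicycle", chain))
```

Returning the provider name alongside the result makes it easy to track how often the fallback is actually serving traffic.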

What This Means in Practice

For teams already integrating image-to-video AI into their workflows, Grok Imagine 1.5 is worth evaluating. The 720p quality and native audio are genuine differentiators from most of the field right now. The autoregressive architecture is designed to produce more temporally consistent output for scenes with motion and character continuity. At $0.14 per second it is priced competitively.

The limitation to keep in mind: image-to-video only. If your pipeline needs text-to-video (starting from a prompt with no reference image), this model does not cover that case today. And because the leaderboard moves fast, it is worth tracking what the field looks like in August and October.

xAI went from no video product to #1 on the global benchmark in ten months. That tells you something about both what xAI is capable of and how fast the model layer moves. Whatever is sitting at #1 six months from now probably does not exist yet.

The teams best positioned for that future are not the ones who picked the current #1 model and locked in. They are the ones who built for a world where the best image-to-video AI generator keeps changing, and designed their stack to change with it.

Frequently Asked Questions

What is Grok Imagine Video 1.5?

Grok Imagine Video 1.5 is xAI's image-to-video generation model, released May 31, 2026. It takes a still image and animates it into a 6-to-15-second clip at 720p and 24 FPS, with audio generated in the same inference pass. The API became available on June 3, 2026. As of June 2026, it holds the #1 position on the Artificial Analysis Image-to-Video Arena leaderboard with a 1404 Elo score.

What is the Aurora architecture?

Aurora is xAI's autoregressive model architecture. Unlike diffusion models that generate all frames in parallel and then denoise them, Aurora generates each frame sequentially, with each new frame conditioned on everything that came before it. This gives the model tighter control over motion continuity and character consistency. Aurora is also a Mixture-of-Experts network that jointly models text, image, video, and audio tokens, which is how Grok Imagine 1.5 produces native audio without a separate pipeline.

How much does the Grok Imagine Video 1.5 API cost?

API pricing is $0.08 per second of output at 480p and $0.14 per second at 720p, plus $0.01 per input image. A 15-second clip at 720p costs $2.10 in generation fees, not counting the input image. Consumer tiers are SuperGrok Lite at $10 per month (480p, 6-second clips) and SuperGrok at $30 per month (720p, up to 15 seconds).

Can Grok Imagine 1.5 do text-to-video?

Not in the current API release. The model is image-to-video only: it needs a starting still image to work from. Text-to-video (starting from a prompt alone with no image input) is not supported in the Grok Imagine Video 1.5 preview API, according to xAI's documentation.

Why does the AI video leaderboard keep changing?

The image-to-video AI leaderboard is changing because multiple well-resourced teams are shipping capable models in parallel, and the quality ceiling for each generation is rising quickly. Since January 2026, the #1 spot has changed hands at least four times: Grok Imagine 1.0, Seedance 2.0, HappyHorse-1.0, and Grok Imagine 1.5. The underlying technical approaches are also diverging (diffusion vs. autoregressive vs. hybrid), which means different models may hold quality advantages on different tasks.

What happened to Sora's position on the leaderboard?

OpenAI shut down the Sora consumer product on April 26, 2026, removing it from active evaluation. The economics of running large video generation models at consumer scale were unsustainable, as we covered in our Sora shutdown analysis. The Sora API is scheduled to keep running through September 2026.

Does the AI video model layer matter if I'm building a product?

Yes, but not in the way you might expect. The model layer is important, but committing to a single model is increasingly risky given how fast the leaderboard moves. Teams that build with routing logic (the ability to swap the underlying image-to-video AI model without rebuilding the product layer) are better positioned to absorb improvements as they come. That also fits the AI video statistics for 2026, which consistently show usage spreading across multiple providers rather than concentrating on one.

