OpenAI Launches GDPval: New AI Test for Real-World Work

OpenAI today introduced GDPval, a new evaluation system that measures how well AI models perform on real, economically valuable tasks.

Instead of judging AI on exams or puzzles, GDPval tests what these models can do in actual jobs — like drafting reports, designing layouts, or planning medical care.

What Is GDPval?

The name “GDPval” is short for “Gross Domestic Product evaluation.” It’s inspired by GDP, which measures an economy’s total output.

In the same way, GDPval measures AI by how well it handles productive, “work-like” tasks.

Unlike older benchmarks, which often test math, reasoning puzzles, or abstract problems, GDPval is about real tasks that contribute to industries and jobs.

What It Measures

GDPval covers 1,320 tasks across 44 occupations in the nine biggest U.S. industries — industries each responsible for more than 5% of GDP. These sectors include healthcare, technology, education, manufacturing, government, finance, and more.

The tasks are varied and realistic. They involve deliverables like legal briefs, engineering blueprints, customer support conversations, spreadsheets, slide decks, and even multimedia files.

Each task comes with real context, such as reference files and a detailed prompt, rather than a bare text question.

How Models Are Graded

Grading relies on blind comparisons: human experts review an AI output and a human expert’s output for the same task without knowing which is which. Using detailed rubrics specific to each role, they rate the AI output as “better,” “as good as,” or “worse” than the human one.

OpenAI also built an automated grader — an AI system trained to predict how human graders would score an output.

The automated grader is available to the public as a research tool, though OpenAI notes it’s not yet as reliable as human experts.
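To make the headline metric concrete, here is a minimal sketch of how a win-or-tie rate could be computed from blind verdicts like the ones described above. The function name, the label strings, and the sample verdicts are all hypothetical, chosen for illustration; this is not OpenAI's grading code and the sample is not GDPval data.

```python
from collections import Counter

# Hypothetical verdict labels, mirroring the three ratings named in the article.
BETTER, AS_GOOD, WORSE = "better", "as good as", "worse"


def win_or_tie_rate(verdicts):
    """Return the share of tasks where the AI output was rated
    'better' or 'as good as' the human expert's output."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - {BETTER, AS_GOOD, WORSE}
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    # Wins and ties both count toward the reported rate.
    return (counts[BETTER] + counts[AS_GOOD]) / len(verdicts)


# Made-up verdicts for illustration only; not GDPval data.
sample = [BETTER, WORSE, AS_GOOD, WORSE, WORSE,
          BETTER, WORSE, AS_GOOD, WORSE, WORSE]
rate = win_or_tie_rate(sample)
print(f"win-or-tie rate: {rate:.1%}")
```

With two "better" and two "as good as" verdicts out of ten, the sample works out to a 40% win-or-tie rate, which is the kind of figure quoted for each model in the results.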

Early Results

The first results show that top AI models are coming close to expert-level work:

Anthropic’s Claude Opus 4.1 was the best performer, producing work rated as good as or better than human experts in 49% of tasks. It excelled in visual presentations and polished documents.

OpenAI’s GPT-5 achieved a 40.6% win/tie rate against humans. While behind Claude Opus 4.1 overall, GPT-5 was stronger on technical knowledge and accuracy.

The progress is striking: in just 15 months, AI’s performance on GDPval tasks more than doubled from GPT-4o to GPT-5.

AI models also showed huge speed and cost advantages: they can complete many tasks about 100× faster and 100× cheaper than human professionals — though these numbers don’t include human oversight or integration work.
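The caveat about oversight matters more than it may seem. As a rough sketch, the snippet below shows how a raw 100× speedup shrinks once human review time is added back in. The function name and every number in the example are hypothetical assumptions for illustration, not figures from OpenAI.

```python
def effective_speedup(human_hours, model_speedup, review_hours):
    """Speedup of 'model plus human review' relative to a human working alone.

    human_hours   -- time a professional needs for the task unaided
    model_speedup -- raw factor by which the model is faster (e.g. 100)
    review_hours  -- human time spent checking and fixing the model's output
    """
    if human_hours <= 0 or model_speedup <= 0 or review_hours < 0:
        raise ValueError("hours and speedup must be positive; review_hours >= 0")
    model_hours = human_hours / model_speedup
    return human_hours / (model_hours + review_hours)


# Hypothetical scenario: a 10-hour task, a 100x raw speedup,
# and 0.9 hours of expert review of the model's draft.
speedup = effective_speedup(human_hours=10.0, model_speedup=100.0, review_hours=0.9)
print(f"effective speedup: {speedup:.1f}x")
```

Under these assumed numbers the end-to-end gain is roughly 10× rather than 100×, because review time dominates once the model's own time becomes negligible.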

Real-World Roles Tested

GDPval covers a wide variety of jobs, many beyond the usual tech fields:

Technology: Software engineers, data scientists, video editors

Healthcare: Nurses, pharmacists, medical technicians

Finance & Law: Investment bankers, analysts, lawyers

Public service: Detectives, social workers, administrators

Creative industries: Journalists, marketing pros, designers

Tasks required AI to work with different formats — documents, spreadsheets, images, audio, video, and even CAD files — making the test more realistic than text-only evaluations.

Limitations & What’s Next

The first version, called GDPval-v0, has some limits:

It measures single-task performance only, not long-term projects with feedback and revisions.

It focuses on computer-based knowledge work, not physical labor or company-specific workflows.

In future versions, OpenAI plans to expand GDPval’s scope to include more industries, interactive tasks, ambiguous scenarios, and work that unfolds over time.

News Gist

OpenAI has launched GDPval, a benchmark that measures how AI performs on real U.S. job tasks across 44 occupations and 9 major industries.

Built with industry experts, GDPval shows leading AI models approaching human-level performance while working faster and cheaper.

FAQs

Q1. What is GDPval?

GDPval is OpenAI’s new evaluation system that tests AI models on realistic, economically valuable work tasks instead of academic problems.

Q2. How many tasks does GDPval cover?

It includes 1,320 tasks across 44 professions in the nine largest U.S. industries.

Q3. Which AI models performed best?

Claude Opus 4.1 ranked highest overall, while GPT-5 showed strong technical accuracy and big improvements over GPT-4o.

Q4. Why is GDPval important?

It bridges the gap between lab benchmarks and real workplace performance, helping businesses and policymakers understand AI’s real economic impact.

Q5. Is GDPval public?

Yes. OpenAI has released a subset of tasks, an automated grader at evals.openai.com, and datasets on Hugging Face.


AI Binger