
DOCX, PDFs Were Not Built for AI. This New Open Standard Wants to Change That

The LF AI & Data Foundation has announced the formation of the DocLang Specification Working Group, kicking off a collaborative effort to build an open, AI-native document format standard.

The working group operates under the Joint Development Foundation's vendor-neutral governance model, ensuring that no single company controls the roadmap.

The founding members are IBM, NVIDIA, Red Hat, ABBYY, and HumanSignal. The spec documentation also credits Forgis as a founding member, although the announcement does not mention it.

DocLang is not the only piece in play here. By pairing the open document format specification with Docling, IBM's open source document processing toolkit that is also hosted under LF AI & Data, the initiative aims to build a more complete open source document AI stack under one roof.

Together, the two cover the full pipeline from document ingestion and parsing through standardized representation and downstream consumption by language models and agentic AI systems.

As for the specification itself, it is already at v0.6, is available under the Apache 2.0 License, and covers document structure and semantics, geometric layout, pagination, and complex components like tables, charts, formulas, and code blocks.

There's also native support for audio, image, and video content, and governance metadata like privacy flags and model training constraints are embedded directly in the document rather than stored in a separate file.
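To make the idea of embedded governance metadata concrete, here is a minimal Python sketch of how a pipeline might act on flags that travel with the document itself. The record layout and the field names (`governance`, `contains_pii`, `allow_training`) are assumptions made up for this illustration; they are not taken from the DocLang specification, which should be consulted for the real schema.

```python
# Hypothetical records: the field names below are invented for illustration
# and are NOT taken from the DocLang specification.
documents = [
    {"id": "contract-001", "governance": {"contains_pii": True, "allow_training": False}},
    {"id": "manual-002", "governance": {"contains_pii": False, "allow_training": True}},
    {"id": "memo-003", "governance": {}},  # no flags set at all
]


def allowed_for_training(doc):
    """Return True only if training is explicitly allowed and no PII is flagged.

    Missing flags are treated conservatively: absent permission means "no",
    and an absent PII flag is treated as if PII were present.
    """
    gov = doc.get("governance", {})
    return bool(gov.get("allow_training", False)) and not gov.get("contains_pii", True)


# Keep only the documents whose own metadata permits use as training data.
training_set = [doc["id"] for doc in documents if allowed_for_training(doc)]
print(training_set)
```

Because the flags sit inside each record, the filter needs no lookup in a separate policy file, which is the practical benefit the paragraph above describes.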

Who is it for?

The primary target is enterprises running generative AI and agentic workflows on large document sets. Formats like PDF, DOCX, and JPEG were designed for human consumption, not machine interpretation.

When such files are fed into AI pipelines, their reading order gets mangled, tables flatten into plain text, and figures disappear entirely. The result is that document quality, not the model itself, becomes the bottleneck.
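The table problem is easy to reproduce. The short Python sketch below uses a made-up three-row table and a deliberately naive `flatten` helper (both are illustrative assumptions, not output from any real extraction tool) to show how dropping empty cells erases the link between a value and its column.

```python
# Illustrative data only: a small table with empty cells, not DocLang syntax.
table = [
    ["Region", "Q1", "Q2"],
    ["North", "10", ""],
    ["South", "", "7"],
]


def flatten(rows):
    """Mimic naive text extraction: join the non-empty cells of each row with spaces."""
    return "\n".join(" ".join(cell for cell in row if cell) for row in rows)


def column_of(rows, row_label, value):
    """Look up which header a value belongs to, using the preserved cell positions."""
    header = rows[0]
    for row in rows[1:]:
        if row[0] == row_label:
            return header[row.index(value)]
    return None


flat = flatten(table)
print(flat)
# In the flattened text, "North 10" and "South 7" look identical in shape,
# so nothing says that 10 is a Q1 figure while 7 is a Q2 figure.

print(column_of(table, "North", "10"))
print(column_of(table, "South", "7"))
```

With the cell positions intact, the column lookup is trivial; once the table is flattened, a downstream model has to guess.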

DocLang is meant to fix that by giving pipelines a single, unambiguous representation where the same document always produces the same output regardless of which tool processed it.

It is also relevant to anyone building with LLMs and vision-language models on real-world content. Docling and ABBYY FineReader Engine already support DocLang output natively, so existing pipelines can adopt the standard without overhauling their tooling.

You can read the DocLang specification on GitHub.



Source: It's FOSS