tekom - Conferences

Using LLMs to Convert PDF Documents into DITA Source

  • Presentation
  • Artificial Intelligence (AI) in Technical Communication
  • Dr. Zhijun Gao

    • Peking University

Content

We will introduce our research on automatically converting PDF documents into structured DITA sources. The conversion of legacy documents into DITA format traditionally poses significant challenges due to the extensive time and manual labor required. To address these challenges, we first optimized layout parsing using PaddleX, enhanced by manually labeled DITA layout data. This optimization enables the model to more accurately identify various content types such as headings, paragraphs, and code blocks.
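As a rough illustration of what happens once layout parsing has labeled each region, the sketch below assembles labeled blocks into a minimal DITA topic. The label names, the `LABEL_TO_DITA` mapping, and the `blocks_to_topic` helper are assumptions made for this example. They are not PaddleX output formats and not the authors' implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from layout labels to DITA body elements.
LABEL_TO_DITA = {"paragraph": "p", "code": "codeblock"}

def blocks_to_topic(blocks, topic_id="topic_1"):
    """Build a minimal DITA <topic> from (label, text) layout blocks."""
    topic = ET.Element("topic", id=topic_id)
    title = ET.SubElement(topic, "title")
    body = ET.SubElement(topic, "body")
    title_set = False
    for label, text in blocks:
        if label == "heading" and not title_set:
            # The first heading becomes the topic title.
            title.text = text
            title_set = True
        elif label in LABEL_TO_DITA:
            ET.SubElement(body, LABEL_TO_DITA[label]).text = text
        else:
            # Unknown or extra labels fall back to <p> for human review.
            ET.SubElement(body, "p").text = text
    return topic

blocks = [
    ("heading", "Installing the driver"),
    ("paragraph", "Run the installer as administrator."),
    ("code", "setup.exe /quiet"),
]
xml_text = ET.tostring(blocks_to_topic(blocks), encoding="unicode")
print(xml_text)
```

The fallback to `<p>` for unrecognized labels keeps misclassified regions visible in the draft, so a technical writer can correct them.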

Furthermore, we fine-tuned the Qwen2.5-7B large language model to improve the accuracy of converting content into appropriate DITA topics. After obtaining the DITA source files, we scan them to identify reusable elements and turn these into DITA content references (the conref attribute). We also apply clustering algorithms to generate relationship tables (<reltable>) automatically. In addition, our approach extracts terms and tags the content with them (<indexterm>).
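One simple way to picture the scan for reusable elements is to look for block-level text that recurs across topic files. The sketch below is only an assumption about how such a scan could work. It uses exact matching on whitespace- and case-normalized text, which is not necessarily the method used in this research, and the file names, the `find_conref_candidates` helper, and the `min_chars` threshold are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict

REUSABLE_TAGS = ("p", "note", "li")  # assumed set of block elements worth reusing

def normalize(text):
    """Collapse whitespace and lowercase so trivial differences do not matter."""
    return " ".join(text.split()).lower()

def find_conref_candidates(topics, min_chars=20):
    """topics maps file name -> DITA XML string.

    Returns groups of (file name, normalized text) that occur in more than one file.
    """
    seen = defaultdict(list)
    for name, xml in topics.items():
        root = ET.fromstring(xml)
        for elem in root.iter():
            if elem.tag not in REUSABLE_TAGS:
                continue
            text = normalize("".join(elem.itertext()))
            if len(text) < min_chars:
                continue  # very short strings are rarely worth a conref
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            seen[digest].append((name, text))
    return [hits for hits in seen.values() if len({n for n, _ in hits}) > 1]

topics = {
    "install.dita": '<topic id="install"><title>Install</title><body>'
                    '<p>Disconnect the power supply before opening the housing.</p>'
                    '<p>Run the installer.</p></body></topic>',
    "repair.dita": '<topic id="repair"><title>Repair</title><body>'
                   '<p>Disconnect the power  supply before opening the housing.</p>'
                   '</body></topic>',
}
candidates = find_conref_candidates(topics)
for group in candidates:
    print(sorted(name for name, _ in group), "->", group[0][1])
```

Each reported group is a candidate only: a writer would still decide whether to move the text into a shared file and reference it via conref.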

Finally, we will demonstrate our tool, which automatically converts PDFs while keeping technical writers in the loop.

What you will learn

  1. Automated PDF → DITA conversion pipeline
  2. Optimized layout parsing with PaddleX
  3. LLM-driven semantic tagging of topics
  4. Automatic generation of conref references, <reltable>, and <indexterm>

Prerequisites

Basic understanding of DITA and LLMs

Speaker

Dr. Zhijun Gao

  • Peking University

Biography

Gao Zhijun is an Assistant Professor in the School of Software and Microelectronics at Peking University (Beijing, China) and Secretary-General of the China Technical Communication Alliance (CTCA). He holds a Ph.D. in Technical Communication from the University of Twente and leads the Information Experience Design research group at Peking University. Dr. Gao chairs the development of the Chinese standard “Guidelines for Evaluating the User Experience of Technical Documentation.” His work centers on AI-driven technical writing and information experience design.