From Word to Markdown [Completed]

Published 2023-12-17, updated 2025-01-09

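Do you know Markdown? Has formatting ever given you a headache while taking notes?

Are you tired of fighting Word's formatting? Would you like to convert your Word documents to Markdown?

If so, this article is for you.

It presents a Python script that converts Word documents to Markdown.

It also records the afternoon I spent getting that conversion to work.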

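Why convert Word to Markdown?

We have legacy Word documents that need converting, and rewriting them by hand is too much work
Markdown is simpler and better suited to blogging
Markdown formulas use LaTeX syntax, whereas Word stores formulas in its own format, which is not LaTeX compatible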

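Word storage format

Word documents are stored as .docx files. A .docx file is a ZIP archive and can be opened with 7zip.
The archive contains many files. One of them, word/document.xml, holds the main body of the document.
This XML contains the text, tables, and formulas, plus references to the images, which are stored as separate files inside the archive.
We only need to extract the content of this XML and convert it to Markdown.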
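
To make the "docx is a ZIP archive" point concrete, here is a minimal sketch that uses only the standard-library zipfile module. The helper names list_docx_parts and read_main_xml are made up for this illustration, and the archive is a tiny stand-in built in memory; a real .docx contains more parts than the three shown.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory so the example runs without a real file.
# A real .docx has more parts (styles, relationships, and so on).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document><body>Hello</body></document>")
    archive.writestr("word/media/image1.png", b"\x89PNG")


def list_docx_parts(docx_file):
    """Return the names of all files stored inside a .docx (ZIP) archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def read_main_xml(docx_file):
    """Return the main document part, word/document.xml, as a string."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml").decode("utf-8")


buffer.seek(0)
parts = list_docx_parts(buffer)
buffer.seek(0)
main_xml = read_main_xml(buffer)
print(parts)
print(main_xml)
```

Both helpers also accept a path to a real file, for example list_docx_parts("notes.docx").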

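Conversion rules

File name: use the file name as the Markdown document title and as the level-1 heading
Paragraphs: the first line of each block of paragraphs (the first paragraph after a blank one) becomes a level-2 heading
Formulas: convert Word formulas to Markdown formulas (inline and display)
Images: convert embedded images to Markdown image links
Lists: convert list bullets to Markdown list markers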
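
The file-name rule can be sketched as follows. This is only an illustration: the function name markdown_header is hypothetical, and the YAML front matter block with a title key is an assumption about the blog engine, not something the script in this article writes.

```python
import os


def markdown_header(docx_file):
    """Build a title block and a level-1 heading from the file name."""
    title = os.path.splitext(os.path.basename(docx_file))[0]
    # Assumption: the blog engine reads a YAML front matter block with a `title` key.
    return f"---\ntitle: {title}\n---\n\n# {title}\n\n"


header = markdown_header("notes/Linear Algebra.docx")
print(header)
```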

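Python script

We need a Python library called python-docx, which parses docx files.

Its documentation is here: https://python-docx.readthedocs.io/en/latest/

Another commonly used library is lxml, which parses XML.

We use its etree module to parse the formula XML.

To work with file paths and create the image directory, we also use os, which is part of the standard library.

Finally, because images from the Word document are saved locally, we use urllib.parse, also part of the standard library, to percent-encode the image paths in the generated links.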

Installing the libraries

First, install the third-party libraries with pip. os and urllib.parse ship with Python, so they need no installation:

pip install lxml
pip install python-docx

Importing the libraries

Next, import the libraries:

import os
import xml.etree.ElementTree as ET
from docx import Document
from docx.oxml.ns import qn
from lxml import etree
import urllib.parse

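Parsing formulas in the Word document

Formulas in Word documents are stored in OMML (Office Math Markup Language), an XML format that we can parse with lxml.

First, parse the formula's XML string into an Element object with etree: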

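def lxml_parse(xml_str):
    # Parse the XML; on failure, report the error and return the input unchanged
    try:
        root = etree.fromstring(clean_xml(xml_str))
    except Exception as e:
        print(f'Error: {e}')
        return xml_str

    # Convert the root element and wrap the result as an inline formula
    formula_text = convert_omath_to_text(root)
    return f'${formula_text}$'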
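
To see what such a formula string looks like, here is a small hand-written OMML fragment. The sketch uses the standard-library xml.etree.ElementTree instead of lxml so that it runs without extra packages, and the sample XML is an illustration rather than output copied from Word. It also shows why the script can test tags with endswith('}t'): parsed tag names carry the namespace in braces.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"

# Hand-written sample: the formula "x+y" split over two math runs (m:r).
SAMPLE_OMML = (
    f'<m:oMath xmlns:m="{M_NS}">'
    "<m:r><m:t>x</m:t></m:r>"
    "<m:r><m:t>+y</m:t></m:r>"
    "</m:oMath>"
)

root = ET.fromstring(SAMPLE_OMML)

# Parsed tags are namespace-qualified: "{namespace-uri}localname".
tags = [elem.tag for elem in root.iter()]

# Collect the text of every m:t node, in document order.
text = "".join(elem.text or "" for elem in root.iter() if elem.tag.endswith("}t"))

print(tags[0])
print(text)  # x+y
```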

Next, convert the special characters and structures in the formula to their Markdown (LaTeX) equivalents:

# Extraction and conversion functions
def extract_text(elem):
    if elem.tag.endswith('}t'):  # text node
        return parse_text(elem.text or '')  # map symbols to LaTeX; empty text nodes yield ''
    elif elem.tag.endswith('}d'):  # delimiter (brackets)
        return parse_bracket(elem)
    elif elem.tag.endswith('}sSup'):  # superscript
        return parse_ssup(elem)
    elif elem.tag.endswith('}sSub'):  # subscript
        return parse_ssub(elem)
    elif elem.tag.endswith('}sSubSup'):  # subscript and superscript together
        return parse_ssubsup(elem)
    elif elem.tag.endswith('}f'):  # fraction
        return parse_fraction(elem)
    elif elem.tag.endswith('}rad'):  # radical (square root or n-th root)
        return parse_radical(elem)
    else:
        return ''.join(extract_text(child) for child in elem)  # recurse into all other elements

def convert_omath_to_text(omath_elem):
    return extract_text(omath_elem)

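The functions that parse these structures are all much alike, so here is the fraction parser as an example.

We find the numerator and denominator in elem (the num and den tags) and output them in the form \frac{numerator}{denominator}.

This is recursive: the numerator and denominator are themselves parsed with extract_text.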

def parse_fraction(elem):
    # num and den are direct children; a child lookup ('m:num' rather than './/m:num')
    # cannot pick up the den of a fraction nested inside the numerator by mistake
    numerator = extract_text(elem.find('m:num', namespaces={'m': 'http://schemas.openxmlformats.org/officeDocument/2006/math'}))
    denominator = extract_text(elem.find('m:den', namespaces={'m': 'http://schemas.openxmlformats.org/officeDocument/2006/math'}))
    return f"\\frac{{{numerator}}}{{{denominator}}}"
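
The recursion is easiest to see on a nested fraction. The following is a reduced sketch, not the script itself: it uses the standard-library ElementTree, handles only text and fractions, and to_latex and text_run are names invented for this example. The sample XML is hand-written.

```python
import xml.etree.ElementTree as ET

M_URI = "http://schemas.openxmlformats.org/officeDocument/2006/math"
NS = {"m": M_URI}


def to_latex(elem):
    """Recursively convert a small subset of OMML (text and fractions) to LaTeX."""
    if elem.tag == f"{{{M_URI}}}t":
        return elem.text or ""
    if elem.tag == f"{{{M_URI}}}f":
        # Direct children only: "m:num" rather than ".//m:num".
        num = to_latex(elem.find("m:num", NS))
        den = to_latex(elem.find("m:den", NS))
        return f"\\frac{{{num}}}{{{den}}}"
    # Any other element: concatenate the conversion of its children.
    return "".join(to_latex(child) for child in elem)


def text_run(s):
    """Wrap a string in a math run, as hand-written sample XML."""
    return f"<m:r><m:t>{s}</m:t></m:r>"


# (a / b) / c: the numerator of the outer fraction is itself a fraction.
inner = f"<m:f><m:num>{text_run('a')}</m:num><m:den>{text_run('b')}</m:den></m:f>"
outer = (
    f'<m:oMath xmlns:m="{M_URI}">'
    f"<m:f><m:num>{inner}</m:num><m:den>{text_run('c')}</m:den></m:f>"
    "</m:oMath>"
)

latex = to_latex(ET.fromstring(outer))
print(latex)  # \frac{\frac{a}{b}}{c}
```

With a descendant search (".//m:den") the outer fraction would find the inner denominator b first, because it appears earlier in document order than the outer denominator c.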

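Parsing images in the Word document

We now have the functions for parsing formulas. Next, we extract the embedded images and convert them to Markdown image links.

In Word, content is organized into paragraphs, and each paragraph may contain text, images, formulas, and so on.

Within a paragraph, text and images live in runs, while formulas are stored as oMath elements next to the runs. We extract the content of each run and parse it.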
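
The layout of a paragraph can be illustrated with a hand-written XML sample. This sketch uses the standard-library ElementTree rather than python-docx, and local_name is a helper invented for the example. It shows a formula sitting between two runs as a sibling, which is why the full script checks the element that follows each run.

```python
import xml.etree.ElementTree as ET

W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_URI = "http://schemas.openxmlformats.org/officeDocument/2006/math"

# Hand-written sample paragraph: run, formula, run.
SAMPLE_PARAGRAPH = (
    f'<w:p xmlns:w="{W_URI}" xmlns:m="{M_URI}">'
    "<w:r><w:t>Let </w:t></w:r>"
    "<m:oMath><m:r><m:t>x</m:t></m:r></m:oMath>"
    "<w:r><w:t> be a real number.</w:t></w:r>"
    "</w:p>"
)


def local_name(tag):
    """Strip the '{namespace}' prefix from a qualified tag name."""
    return tag.rsplit("}", 1)[-1]


paragraph = ET.fromstring(SAMPLE_PARAGRAPH)
children = [local_name(child.tag) for child in paragraph]
print(children)  # ['r', 'oMath', 'r']
```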

def save_image_from_run(run, image_dir, image_counter):
    """
    Save the image in a run to the specified directory and return the image link.
    """
    for inline in run.element.findall('.//wp:inline', namespaces={'wp': 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing'}):
        # Assuming each run contains only one image for simplicity
        image_rid = inline.find('.//a:blip', namespaces={'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}).attrib['{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed']
        image_part = run.part.related_parts[image_rid]

        if not os.path.exists(image_dir):
            os.makedirs(image_dir)

        # The .png extension is assumed, whatever the real format of the embedded image
        image_path = os.path.join(image_dir, f"image_{image_counter}.png")
        image_path = image_path.replace('\\', '/')  # forward slashes keep the link valid when run on Windows
        with open(image_path, 'wb') as img_file:
            img_file.write(image_part.blob)

        return f"![Image {image_counter}]({urllib.parse.quote(image_path)})"

    return ''  # no inline image found in this run

def contains_image(run):
    """
    Check if the run contains an image.
    """
    return 'wp:inline' in run._element.xml
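
The link-building step can be tried on its own. In this sketch, image_link is a name invented for the example; it repeats the two operations the script applies to the path (backslashes to forward slashes, then urllib.parse.quote) so the effect on a path containing a space is visible.

```python
import urllib.parse


def image_link(image_path, index):
    """Build a Markdown image link with a percent-encoded, forward-slash path."""
    normalized = image_path.replace("\\", "/")
    # quote() leaves "/" alone by default and percent-encodes spaces and non-ASCII characters.
    return f"![Image {index}]({urllib.parse.quote(normalized)})"


link = image_link("my notes_images\\image_0.png", 0)
print(link)  # ![Image 0](my%20notes_images/image_0.png)
```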

Traversing the whole Word document

We generally do not need tables (and I was too lazy to handle them), so we only parse formulas and images.

def word_to_markdown(docx_file, md_file):
    doc = Document(docx_file)
    image_dir = os.path.splitext(docx_file)[0] + '_images'
    image_counter = 0
    with open(md_file, 'w', encoding='utf-8') as md:
        next_para = 1  # the next paragraph gets special handling: it becomes a level-2 heading
        for para in doc.paragraphs:
            # print(para._element.xml)
            if not para.text:
                if hasFormula(para):
                    formula_text = extract_formula(para)
                    md.write(f'${formula_text}$')
                    md.write('\n\n')
                    continue
                elif contains_image(para):
                    md.write(save_image_from_run(para.runs[0], image_dir, image_counter))
                    image_counter += 1
                    md.write('\n\n')
                    continue
                else:
                    # print('blank line')
                    next_para = 1
                    md.write('\n\n')
                    continue
            text = para.text.strip()
            
            paragraph_text = ""
            
            # Turn the first paragraph after a blank one into a heading
            if next_para:
                paragraph_text = '## '
                next_para = 0

            for run in para.runs:
                # print(run.text)
                if contains_image(run):
                    paragraph_text += save_image_from_run(run, image_dir, image_counter)
                    image_counter += 1
                text = run.text
                if text.startswith(('•', '➢', ' •', ' ➢')):
                    text = convert_list_item(text)
                paragraph_text += text
                if has_formula_after_run(run):  # Check if a formula is present after the run
                    paragraph_text += extract_formula_from_sibling(run)

                
            md.write(paragraph_text + '\n\n')
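
The heading logic driven by next_para is a small state machine, and it can be sketched on plain strings without any Word file. mark_headings is a hypothetical helper for this illustration; unlike the script, it drops blank paragraphs instead of writing empty lines.

```python
def mark_headings(paragraphs):
    """Prefix the first non-empty paragraph of each block with '## '."""
    output = []
    heading_next = True  # the very first paragraph also starts a block
    for text in paragraphs:
        if not text:
            heading_next = True  # a blank paragraph ends the current block
            continue
        output.append(("## " + text) if heading_next else text)
        heading_next = False
    return output


sample = ["Vectors", "A vector has a length.", "", "Matrices", "A matrix has rows."]
result = mark_headings(sample)
print(result)
```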

Full code

import os
import xml.etree.ElementTree as ET
from docx import Document
from docx.oxml.ns import qn
from lxml import etree
import urllib.parse

def clean_xml(xml_str):
    try:
        # Parse the XML string
        root = etree.fromstring(xml_str)

        # Normalize the XML structure
        cleaned_str = etree.tostring(root, pretty_print=True, encoding='unicode')

        return cleaned_str
    except etree.XMLSyntaxError as e:
        return f"Error cleaning XML: {e}"

def lxml_parse(xml_str):
    # Parse the XML; on failure, report the error and return the input unchanged
    try:
        root = etree.fromstring(clean_xml(xml_str))
    except Exception as e:
        print(f'Error: {e}')
        return xml_str

    # Convert the root element and wrap the result as an inline formula
    formula_text = convert_omath_to_text(root)
    return f'${formula_text}$'

# Extraction and conversion functions
def extract_text(elem):
    if elem.tag.endswith('}t'):  # text node
        return parse_text(elem.text or '')  # map symbols to LaTeX; empty text nodes yield ''
    elif elem.tag.endswith('}d'):  # delimiter (brackets)
        return parse_bracket(elem)
    elif elem.tag.endswith('}sSup'):  # superscript
        return parse_ssup(elem)
    elif elem.tag.endswith('}sSub'):  # subscript
        return parse_ssub(elem)
    elif elem.tag.endswith('}sSubSup'):  # subscript and superscript together
        return parse_ssubsup(elem)
    elif elem.tag.endswith('}f'):  # fraction
        return parse_fraction(elem)
    elif elem.tag.endswith('}rad'):  # radical (square root or n-th root)
        return parse_radical(elem)
    else:
        return ''.join(extract_text(child) for child in elem)  # recurse into all other elements

def convert_omath_to_text(omath_elem):
    return extract_text(omath_elem)

# Namespace map for OMML (Office Math) elements
M_NS = {'m': 'http://schemas.openxmlformats.org/officeDocument/2006/math'}

# The parts of each structure (num, den, e, sub, sup, deg) are direct children,
# so plain child lookups are used; './/' searches could match a nested structure first.

def parse_fraction(elem):
    numerator = extract_text(elem.find('m:num', namespaces=M_NS))
    denominator = extract_text(elem.find('m:den', namespaces=M_NS))
    return f"\\frac{{{numerator}}}{{{denominator}}}"

def parse_ssup(elem):
    base = extract_text(elem.find('m:e', namespaces=M_NS))
    exponent = extract_text(elem.find('m:sup', namespaces=M_NS))
    return f"{base}^{{{exponent}}}"

def parse_ssub(elem):
    base_elem = elem.find('m:e', namespaces=M_NS)
    subscript_elem = elem.find('m:sub', namespaces=M_NS)
    base = extract_text(base_elem) if base_elem is not None else ""
    subscript = extract_text(subscript_elem) if subscript_elem is not None else ""
    return f"{base}_{{{subscript}}}"

def parse_ssubsup(elem):
    base = extract_text(elem.find('m:e', namespaces=M_NS))
    exponent = extract_text(elem.find('m:sup', namespaces=M_NS))
    subscript = extract_text(elem.find('m:sub', namespaces=M_NS))
    return f"{base}_{{{subscript}}}^{{{exponent}}}"

def parse_bracket(elem):
    # Only round brackets are emitted, whatever delimiter the document uses
    base = extract_text(elem.find('m:e', namespaces=M_NS))
    return f"\\left( {base} \\right)"

def parse_radical(elem):
    # Extract the content under the radical
    content = extract_text(elem.find('m:e', namespaces=M_NS))

    # The deg element may be present but empty (plain square root), so test its text
    degree_elem = elem.find('m:deg', namespaces=M_NS)
    degree = extract_text(degree_elem) if degree_elem is not None else ''
    if degree:
        return f"\\sqrt[{degree}]{{{content}}}"
    return f"\\sqrt{{{content}}}"

def parse_text(text):
    """
    Parse the text and replace common symbol names with LaTeX equivalents.
    """
    symbol_map = {
        'α': '\\alpha',
        'β': '\\beta',
        'γ': '\\gamma',
        'δ': '\\delta',
        'ε': '\\epsilon',
        'ζ': '\\zeta',
        'η': '\\eta',
        'θ': '\\theta',
        'ι': '\\iota',
        'κ': '\\kappa',
        'λ': '\\lambda',
        'μ': '\\mu',
        'ν': '\\nu',
        'ξ': '\\xi',
        'ο': '\\omicron',  # not normally used in LaTeX
        'π': '\\pi',
        'ρ': '\\rho',
        'σ': '\\sigma',
        'τ': '\\tau',
        'υ': '\\upsilon',
        'φ': '\\phi',
        'χ': '\\chi',
        'ψ': '\\psi',
        'ω': '\\omega',
        'Α': '\\Alpha',
        'Β': '\\Beta',
        'Γ': '\\Gamma',
        'Δ': '\\Delta',
        'Ε': '\\Epsilon',
        'Ζ': '\\Zeta',
        'Η': '\\Eta',
        'Θ': '\\Theta',
        'Ι': '\\Iota',
        'Κ': '\\Kappa',
        'Λ': '\\Lambda',
        'Μ': '\\Mu',
        'Ν': '\\Nu',
        'Ξ': '\\Xi',
        'Ο': '\\Omicron',  # not normally used in LaTeX
        'Π': '\\Pi',
        'Ρ': '\\Rho',
        'Σ': '\\Sigma',
        'Τ': '\\Tau',
        'Υ': '\\Upsilon',
        'Φ': '\\Phi',
        'Χ': '\\Chi',
        'Ψ': '\\Psi',
        'Ω': '\\Omega',
        '*': '\\times',
        '≤': '\\leq',
        '≥': '\\geq',
        '≠': '\\neq',
        '≈': '\\approx',
        '∞': '\\infty',
        '∑': '\\sum',
        '∏': '\\prod',
        '∫': '\\int',
        '∂': '\\partial',
        '∇': '\\nabla',
        '√': '\\sqrt',
        '∝': '\\propto',
    }

    for symbol, latex in symbol_map.items():
        text = text.replace(symbol, latex + " ")

    return text






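def convert_list_item(text):
    """Convert a list bullet to a Markdown list marker."""
    # Strip leading spaces first, so that ' •' loses the bullet and not just the space
    return '- ' + text.lstrip()[1:].strip()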

def extract_formula(paragraph):
    """Try to extract the formula text from a paragraph."""
    xml = paragraph._element.xml
    tree = ET.fromstring(xml)
    formula_text = ""
    for el in tree.iter():
        if el.tag == qn('m:oMath'):
            # Serialize each oMath element back to an XML string
            formula_text += ET.tostring(el, encoding='unicode')
    return lxml_parse(formula_text) if formula_text else ''

def hasFormula(para):
    # True if any direct child of the paragraph is an oMathPara (display formula)
    for child in para._element:
        if child.tag.endswith('oMathPara'):
            return True
    return False

def save_image_from_run(run, image_dir, image_counter):
    """
    Save the image in a run to the specified directory and return the image link.
    """
    for inline in run.element.findall('.//wp:inline', namespaces={'wp': 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing'}):
        # Assuming each run contains only one image for simplicity
        image_rid = inline.find('.//a:blip', namespaces={'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}).attrib['{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed']
        image_part = run.part.related_parts[image_rid]
        
        if not os.path.exists(image_dir):
            os.makedirs(image_dir)
        
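        # The .png extension is assumed, whatever the real format of the embedded image
        image_path = os.path.join(image_dir, f"image_{image_counter}.png")
        image_path = image_path.replace('\\', '/')  # forward slashes keep the link valid when run on Windows
        with open(image_path, 'wb') as img_file:
            img_file.write(image_part.blob)

        return f"![Image {image_counter}]({urllib.parse.quote(image_path)})"

    return ''  # no inline image found in this run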

def contains_image(run):
    """
    Check if the run contains an image.
    """
    return 'wp:inline' in run._element.xml

def has_formula_after_run(run):
    """
    Check if a formula (oMath) is present immediately after the run.
    """
    next_sibling = run._element.getnext()
    if next_sibling is not None and next_sibling.tag.endswith('oMath'):
        return True
    return False

def extract_formula_from_sibling(run):
    """
    Extract the formula from the run's next sibling.
    """
    # Assuming formula is in the next sibling as an oMath element
    formula_sibling = run._element.getnext()
    if formula_sibling is not None:
        # Try to extract the formula text from the sibling element
        tree = formula_sibling
        formula_text = ""
        if tree.tag == qn('m:oMath'):
            # el.xpath('')
            formula_text = ET.tostring(tree, encoding='unicode')
            return lxml_parse(formula_text) if formula_text else ''
        for el in tree.iter():
            if el.tag == qn('m:oMath'):
                # el.xpath('')
                formula_text += ET.tostring(el, encoding='unicode')
        # print(formula_text)
        return lxml_parse(formula_text) if formula_text else ''
    return ""

def word_to_markdown(docx_file, md_file):
    doc = Document(docx_file)
    image_dir = os.path.splitext(docx_file)[0] + '_images'
    image_counter = 0
    with open(md_file, 'w', encoding='utf-8') as md:
        next_para = 1  # the next paragraph gets special handling: it becomes a level-2 heading
        for para in doc.paragraphs:
            # print(para._element.xml)
            if not para.text:
                if hasFormula(para):
                    formula_text = extract_formula(para)
                    md.write(f'${formula_text}$')
                    md.write('\n\n')
                    continue
                elif contains_image(para):
                    md.write(save_image_from_run(para.runs[0], image_dir, image_counter))
                    image_counter += 1
                    md.write('\n\n')
                    continue
                else:
                    # print('blank line')
                    next_para = 1
                    md.write('\n\n')
                    continue
            text = para.text.strip()
            
            paragraph_text = ""
            
            # Turn the first paragraph after a blank one into a heading
            if next_para:
                paragraph_text = '## '
                next_para = 0

            for run in para.runs:
                # print(run.text)
                if contains_image(run):
                    paragraph_text += save_image_from_run(run, image_dir, image_counter)
                    image_counter += 1
                text = run.text
                if text.startswith(('•', '➢', ' •', ' ➢')):
                    text = convert_list_item(text)
                paragraph_text += text
                if has_formula_after_run(run):  # Check if a formula is present after the run
                    paragraph_text += extract_formula_from_sibling(run)

                
            md.write(paragraph_text + '\n\n')

            

def convert_docs_in_directory(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.docx') and not filename.startswith('~$'):
            docx_path = os.path.join(directory, filename)
            md_path = os.path.splitext(docx_path)[0] + '.md'
            word_to_markdown(docx_path, md_path)
            print(f"Converted {filename} to Markdown")

if __name__ == "__main__":
    convert_docs_in_directory('./')  # Convert all DOCX files in the current directory
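
The file filter in convert_docs_in_directory can be checked in isolation. The sketch below uses a temporary directory with empty placeholder files; find_docx_files is a name invented for this example. Files starting with ~$ are the temporary lock files Word creates while a document is open, which is why they are skipped.

```python
import os
import tempfile


def find_docx_files(directory):
    """Return sorted .docx paths in a directory, skipping Word's '~$' lock files."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".docx") and not name.startswith("~$")
    )


with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files: only the names matter for the filter.
    for name in ("notes.docx", "~$notes.docx", "readme.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(path) for path in find_docx_files(tmp)]

print(found)  # ['notes.docx']
```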