An attempt at recognizing image-based PDFs with Python (unresolved)

2024-06-05 10:23
python, pdf

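I want to recognize a serial number located near the bottom-right corner of each PDF page. The PDF pages are images, not text. I looked up approaches: some source code first extracts each page as a JPG, then uses pytesseract to extract the content of the image file.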
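The page-to-image step, followed by cropping to the corner where the number sits, could be sketched roughly as below. This is not the source code mentioned above. It assumes PyMuPDF (imported as fitz) for rendering, and the names bottom_right_box and ocr_page_corner, the crop fractions and the 300 dpi value are all hypothetical choices that would need tuning for a real document.

```python
def bottom_right_box(width, height, frac_w=0.25, frac_h=0.125):
    """Return the (left, upper, right, lower) crop box for the bottom-right corner.

    frac_w and frac_h are the fractions of the page width and height to keep;
    the defaults are guesses and must be tuned to where the number sits.
    """
    left = width - round(width * frac_w)
    upper = height - round(height * frac_h)
    return (left, upper, width, height)


def ocr_page_corner(pdf_path, page_index=0, dpi=300):
    """Render one PDF page, crop its bottom-right corner and OCR the crop."""
    # Third-party imports are deferred so bottom_right_box can be used
    # without PyMuPDF, Pillow or pytesseract installed.
    import fitz  # PyMuPDF (an assumption; the post does not name a library)
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        # The dpi argument needs a reasonably recent PyMuPDF release.
        pix = page.get_pixmap(dpi=dpi)  # RGB without alpha by default
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

    corner = image.crop(bottom_right_box(image.width, image.height))
    # --psm 7 asks Tesseract to treat the crop as a single line of text.
    return pytesseract.image_to_string(corner, lang="eng", config="--psm 7")
```

Cropping first means Tesseract only sees the number, so the Chinese text elsewhere on the page cannot interfere with it. Whether this is enough for a given scan is untested here.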

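I tried recognizing an image directly. Images containing only digits, such as a barcode, can be recognized. Images containing Chinese cannot; the output is a mess.

It was recognized as: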

How can I get the kind of text recognition quality that WPS achieves on images?
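I do not know what WPS does internally. One commonly suggested step that sometimes helps Tesseract is to clean up the image before recognition: convert to greyscale, stretch the contrast, enlarge, then binarize. A minimal sketch, assuming Pillow; the function names, the scale factor of 2 and the threshold of 160 are hypothetical and untuned:

```python
def binarize_table(threshold=160):
    """Build a 256-entry lookup table mapping grey levels to pure black or white."""
    return [0 if level < threshold else 255 for level in range(256)]


def preprocess_for_ocr(image, scale=2, threshold=160):
    """Return an enlarged, high-contrast black-and-white copy of a PIL image."""
    from PIL import Image, ImageOps  # deferred so binarize_table needs no Pillow

    grey = ImageOps.grayscale(image)
    grey = ImageOps.autocontrast(grey)
    # Small screenshot text is often recognized better after enlarging.
    big = grey.resize((grey.width * scale, grey.height * scale), Image.LANCZOS)
    # For a greyscale image, point() applies the lookup table to every pixel.
    return big.point(binarize_table(threshold))
```

The result can be passed to pytesseract.image_to_string in place of the opened image. Whether it improves the Chinese output for a particular picture has to be checked by trial.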

import pytesseract
from PIL import Image

# lang = 'chi_sim'
# lang = 'eng'
lang = 'eng+chi_sim'


def extract_text_from_image(image_path):
    # Open the image and run Tesseract OCR with the languages selected above.
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang=lang)
    return text


# image_path = r"D:\11.png"
image_path = r"D:\1111.png"
text = extract_text_from_image(image_path)
print(f"Image content:\n{text}\n")
print('Installed language packs:', pytesseract.get_languages(config=''))
# Output: ['chi_sim', 'chi_sim_vert', 'chi_tra', 'chi_tra_vert', 'eng', 'osd']

Recognizing the image below:

The result is:

Image content:
Python 图 几 丶 孛 识 刑 别 tesseract 史 题
解 决

T 图 片 文 字 识 别 测 试 代 码

RRoTRt

pip install Pitlow
pip install pytesseract

ANA


Installed language packs: ['chi_sim', 'chi_sim_vert', 'chi_tra', 'chi_tra_vert', 'eng', 'osd']
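The output above has a space between every pair of Chinese characters. That part can be cleaned up after recognition with a regular expression. A minimal sketch using only the standard library; remove_cjk_spaces is a hypothetical helper name, and it only fixes spacing, not misrecognized characters:

```python
import re

# Spaces or tabs that sit between two CJK ideographs (basic block U+4E00..U+9FFF).
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])[ \t]+(?=[\u4e00-\u9fff])")


def remove_cjk_spaces(text):
    """Delete spaces between adjacent Chinese characters, keeping other spacing."""
    return _CJK_GAP.sub("", text)
```

Spaces next to Latin letters and digits are left alone, so a line such as "pip install pytesseract" passes through unchanged.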

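With lang set to 'chi_sim', the Chinese recognition is mediocre and the English content is missing. The result is:

With lang set to 'eng', no Chinese is recognized, but the English content is. The result is:

The result is better only when lang includes both languages.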

lang = 'eng+chi_sim'

This still does not match the results shown in this post: "Python image text recognition and tesseract problem solving" (CSDN blog), https://blog.csdn.net/weixin_50357986/article/details/134233359

Check from the command line whether the language packs are installed. chi_sim and eng are the ones needed.
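The command-line check is tesseract --list-langs. If it is more convenient to run it from Python, a sketch along these lines should work, assuming the tesseract executable is on PATH; parse_language_list and installed_languages are hypothetical names:

```python
import re
import subprocess


def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Language codes look like "eng" or "chi_sim"; the header line does not.
        if re.fullmatch(r"[A-Za-z0-9_]+", line):
            codes.append(line)
    return codes


def installed_languages(tesseract_cmd="tesseract"):
    """Run `tesseract --list-langs` and return the installed language codes."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Scan both streams in case the list is not printed to stdout.
    return parse_language_list(result.stdout + "\n" + result.stderr)
```

If both 'chi_sim' and 'eng' appear in the returned list, the language packs used above are in place.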