PDFPlumber: A Python Library for Extracting Text and Tables from PDF Files


4/21/2020 PDFPlumberPDFPython - Python | CTOLib

PDF generated with the free version of http://www.html2pdf.solutions

 


PDFPlumberPDFPython

Plumb a PDF for detailed information about each char, rectangle, line, et cetera —and easily extract text and tables.

 





Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual

debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer and pdfminer.six.

Currently tested on Python 2.7, 3.1, 3.4, 3.5, and 3.6.

Installation

Command line interface

Python library

Visual debugging

Extracting tables

Extracting form values

Demonstrations

Acknowledgments / Contributors

Contributing

pip install pdfplumber

To use pdfplumber's visual-debugging tools, you'll also need to have ImageMagick installed on your computer. Installation

instructions here.

curl "https://cdn.rawgit.com/jsvine/pdfplumber/master/examples/pdfs/background-checks.pdf" > background-checks.pdf

pdfplumber < background-checks.pdf > background-checks.csv

The output will be a CSV containing info about every character, line, and rectangle in the PDF.

Argument Description

Go beyond APM

The all-in-one software

intelligence platform.

Dynatrace

PDFPlumber v0.5.13

Table of Contents

Installation

Command line interface

Basic example

Options

4/21/2020 PDFPlumberPDFPython - Python | CTOLib

PDF generated with the free version of http://www.html2pdf.solutions

--format [format] csv or json. The json format returns slightly more information; it includes PDF-level metadata

and height/width information about each page.

--pages [list of pages]

A space-delimited, 1-indexed list of pages or hyphenated page ranges. E.g., 1, 11-15, which

would return data for pages 1, 11, 12, 13, 14, and 15.

--types [list of object

types to extract]

Choices are char, anno, line, curve, rect, rect_edge. Defaults to char, anno, line, curve, rect.

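Python library

Basic example

    import pdfplumber

    # Open the PDF; the context manager closes the file when the block exits.
    with pdfplumber.open("path/to/file.pdf") as pdf:
        first_page = pdf.pages[0]  # pages are held in a 0-indexed list
        print(first_page.chars[0])  # dict describing the first character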

pdfplumber provides two main ways to load a PDF:

pdfplumber.open("path/to/ﬁle.pdf")

pdfplumber.load(ﬁle_like_object)

Both methods return an instance of the pdfplumber.PDF class.

To load a password-protected PDF, pass the password keyword argument, e.g., pdfplumber.open("ﬁle.pdf", password =

"test").

The top-level pdfplumber.PDF class represents a single PDF and has two main properties:

Property Description

.metadata

A dictionary of metadata key/value pairs, drawn from the PDF's Info trailers. Typically includes

"CreationDate," "ModDate," "Producer," et cetera.

.pages A list containing one pdfplumber.Page instance per page loaded.

The pdfplumber.Page class is at the core of pdfplumber. Most things you'll do with pdfplumber will revolve around this class.

It has these main properties:

Property Description

.page_number The sequential page number, starting with 1 for the ﬁrst page, 2 for the second, and so on.

.width The page's width.

.height The page's height.

.objects / .chars /

.lines / .rects

Each of these properties is a list, and each list contains one dictionary for each such object

embedded on the page. For more detail, see "Objects" below.

... and these main methods:

Method Description

Python library

Basic example

Loading a PDF

The pdfplumber.PDF class

The pdfplumber.Page class

4/21/2020 PDFPlumberPDFPython - Python | CTOLib

PDF generated with the free version of http://www.html2pdf.solutions

.crop(bounding_box) Returns a version of the page cropped to the bounding box, which should be

expressed as 4-tuple with the values (x0, top, x1, bottom). Cropped pages retain

objects that fall at least partly within the bounding box. If an object falls only partly

within the box, its dimensions are sliced to ﬁt the bounding box.

.within_bbox(bounding_box) Similar to .crop, but only retains objects that fall entirely within the bounding box.

.ﬁlter(test_function)

Returns a version of the page with only the .objects for which test_function(obj)

returns True.

.extract_text(x_tolerance=0,

y_tolerance=0)

Collates all of the page's character objects into a single string. Adds spaces where

the diﬀerence between the x1 of one character and the x0 of the next is greater

than x_tolerance. Adds newline characters where the diﬀerence between the doctop

of one character and the doctop of the next is greater than y_tolerance.

.extract_words(x_tolerance=0,

y_tolerance=0)

Returns a list of all word-looking things and their bounding boxes. Words are

considered to be sequences of characters where the diﬀerence between the x1 of

one character and the x0 of the next is less than or equal to x_tolerance and where

the doctop of one character and the doctop of the next is less than or equal to

y_tolerance.

.extract_tables(table_settings) Extracts tabular data from the page. For more details see "Extracting tables" below.

.to_image(**conversion_kwargs)

Returns an instance of the PageImage class. For more details, see "Visual debugging"

below. For conversion_kwargs, see here.
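The tolerance rules for .extract_text and .extract_words can be sketched on plain dictionaries. This is a simplified illustration, not pdfplumber's implementation. It assumes the characters are already in reading order, that each one carries a doctop key (the character's distance from the top of the document), and it returns only word text rather than bounding boxes. The helper names and sample coordinates are hypothetical.

```python
def make_char(text, x0, x1, doctop):
    """Build a minimal char-like dict (hypothetical sample data)."""
    return {"text": text, "x0": x0, "x1": x1, "doctop": doctop}


def collate_chars(chars, x_tolerance=0, y_tolerance=0):
    """Join chars into a string, adding spaces and newlines by tolerance."""
    out = []
    prev = None
    for ch in chars:
        if prev is not None:
            if abs(ch["doctop"] - prev["doctop"]) > y_tolerance:
                out.append("\n")  # vertical jump: start a new line
            elif ch["x0"] - prev["x1"] > x_tolerance:
                out.append(" ")  # horizontal gap: insert a space
        out.append(ch["text"])
        prev = ch
    return "".join(out)


def group_words(chars, x_tolerance=0, y_tolerance=0):
    """Group chars into words; a word continues while both gaps are small."""
    words, current, prev = [], [], None
    for ch in chars:
        same_word = (
            prev is not None
            and ch["x0"] - prev["x1"] <= x_tolerance
            and abs(ch["doctop"] - prev["doctop"]) <= y_tolerance
        )
        if current and not same_word:
            words.append("".join(c["text"] for c in current))
            current = []
        current.append(ch)
        prev = ch
    if current:
        words.append("".join(c["text"] for c in current))
    return words


sample_chars = [
    make_char("H", 0, 5, 10), make_char("i", 5, 8, 10),
    make_char("y", 12, 17, 10), make_char("o", 17, 22, 10),
    make_char("o", 0, 5, 25), make_char("k", 5, 10, 25),
]
text = collate_chars(sample_chars, x_tolerance=1, y_tolerance=1)
words = group_words(sample_chars, x_tolerance=1, y_tolerance=1)
print(repr(text), words)
```

Raising x_tolerance above the gap between "i" and "y" (4 units in this sample) would merge "Hi" and "yo" into a single word, which is the usual reason to tune these parameters.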

Each instance of pdfplumber.PDF and pdfplumber.Page provides access to four types of PDF objects. The following

properties each return a Python list of the matching objects:

.chars, each representing a single text character.

.annos, each representing a single annotation-text character.

.lines, each representing a single 1-dimensional line.

.rects, each representing a single 2-dimensional rectangle.

.curves, each representing a series of connected points.

Each object is represented as a simple Python dict, with the following properties:

char / anno properties

| Property | Description |
| --- | --- |
| page_number | Page number on which this character was found. |
| text | E.g., "z", or "Z" or " ". |
| fontname | Name of the character's font face. |
| size | Font size. |
| adv | Equal to text width * the font size * scaling factor. |
| upright | Whether the character is upright. |
| height | Height of the character. |
| width | Width of the character. |
| x0 | Distance of left side of character from left side of page. |
| x1 | Distance of right side of character from left side of page. |
