Python XML processing with lxml
John W. Shipman
2013-08-24 12:39
Abstract
Describes the lxml package for reading and writing XML files with the Python programming
language.
This publication is available in Web form[1] and also as a PDF document[2]. Please forward any
comments to tcc-doc@nmt.edu.
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License[3].
Table of Contents
1. Introduction: Python and XML
2. How ElementTree represents XML
3. Reading an XML document
4. Handling multiple namespaces
4.1. Glossary of namespace terms
4.2. The syntax of multi-namespace documents
4.3. Namespace maps
5. Creating a new XML document
6. Modifying an existing XML document
7. Features of the etree module
7.1. The Comment() constructor
7.2. The Element() constructor
7.3. The ElementTree() constructor
7.4. The fromstring() function: Create an element from a string
7.5. The parse() function: build an ElementTree from a file
7.6. The ProcessingInstruction() constructor
7.7. The QName() constructor
7.8. The SubElement() constructor
7.9. The tostring() function: Serialize as XML
7.10. The XMLID() function: Convert text to XML with a dictionary of id values
8. class ElementTree: A complete XML document
8.1. ElementTree.find()
8.2. ElementTree.findall(): Find matching elements
8.3. ElementTree.findtext(): Retrieve the text content from an element
8.4. ElementTree.getiterator(): Make an iterator
8.5. ElementTree.getroot(): Find the root element
[1] http://www.nmt.edu/tcc/help/pubs/pylxml/
[2] http://www.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf
[3] http://creativecommons.org/licenses/by-nc/3.0/
Python XML processing with lxml, New Mexico Tech Computer Center
8.6. ElementTree.xpath(): Evaluate an XPath expression
8.7. ElementTree.write(): Translate back to XML
9. class Element: One element in the tree
9.1. Attributes of an Element instance
9.2. Accessing the list of child elements
9.3. Element.append(): Add a new element child
9.4. Element.clear(): Make an element empty
9.5. Element.find(): Find a matching sub-element
9.6. Element.findall(): Find all matching sub-elements
9.7. Element.findtext(): Extract text content
9.8. Element.get(): Retrieve an attribute value with defaulting
9.9. Element.getchildren(): Get element children
9.10. Element.getiterator(): Make an iterator to walk a subtree
9.11. Element.getroottree(): Find the ElementTree containing this element
9.12. Element.insert(): Insert a new child element
9.13. Element.items(): Produce attribute names and values
9.14. Element.iterancestors(): Find an element's ancestors
9.15. Element.iterchildren(): Find all children
9.16. Element.iterdescendants(): Find all descendants
9.17. Element.itersiblings(): Find other children of the same parent
9.18. Element.keys(): Find all attribute names
9.19. Element.remove(): Remove a child element
9.20. Element.set(): Set an attribute value
9.21. Element.xpath(): Evaluate an XPath expression
10. XPath processing
10.1. An XPath example
11. The art of Web-scraping: Parsing HTML with Beautiful Soup
12. Automated validation of input files
12.1. Validation with a Relax NG schema
12.2. Validation with an XSchema (XSD) schema
13. etbuilder.py: A simplified XML builder module
13.1. Using the etbuilder module
13.2. CLASS(): Adding class attributes
13.3. FOR(): Adding for attributes
13.4. subElement(): Adding a child element
13.5. addText(): Adding text content to an element
14. Implementation of etbuilder
14.1. Features differing from Lundh's original
14.2. Prologue
14.3. CLASS(): Helper function for adding CSS class attributes
14.4. FOR(): Helper function for adding XHTML for attributes
14.5. subElement(): Add a child element
14.6. addText(): Add text content to an element
14.7. class ElementMaker: The factory class
14.8. ElementMaker.__init__(): Constructor
14.9. ElementMaker.__call__(): Handle calls to the factory instance
14.10. ElementMaker.__handleArg(): Process one positional argument
14.11. ElementMaker.__getattr__(): Handle arbitrary method calls
14.12. Epilogue
14.13. testetbuilder: A test driver for etbuilder
15. rnc_validate: A module to validate XML against a Relax NG schema
15.1. Design of the rnc_validate module
15.2. Interface to the rnc_validate module
15.3. rnc_validate.py: Prologue
15.4. RelaxException
15.5. class RelaxValidator
15.6. RelaxValidator.validate()
15.7. RelaxValidator.__init__(): Constructor
15.8. RelaxValidator.__makeRNG(): Find or create an .rng file
15.9. RelaxValidator.__getModTime(): When was this file last changed?
15.10. RelaxValidator.__trang(): Translate .rnc to .rng format
16. rnck: A standalone script to validate XML against a Relax NG schema
16.1. rnck: Prologue
16.2. rnck: main()
16.3. rnck: checkArgs()
16.4. rnck: usage()
16.5. rnck: fatal()
16.6. rnck: message()
16.7. rnck: validateFile()
16.8. rnck: Epilogue
1. Introduction: Python and XML
With the continued growth of both Python and XML, there is a plethora of packages out there that help
you read, generate, and modify XML files from Python scripts. Compared to most of them, the lxml[4]
package has two big advantages:
• Performance. Reading and writing even fairly large XML files takes an almost imperceptible amount
of time.
• Ease of programming. The lxml package is based on ElementTree, which Fredrik Lundh invented
to simplify and streamline XML processing.
lxml is similar in many ways to two other, earlier packages:
• Fredrik Lundh continues to maintain his original version of ElementTree[5].
• xml.etree.ElementTree[6] is now an official part of the Python library. There is a C-language
version called cElementTree which may be even faster than lxml for some applications.
However, the author prefers lxml for providing a number of additional features that make life easier.
In particular, support for XPath makes it considerably easier to manage more complex XML structures.
2. How ElementTree represents XML
If you have done XML work using the Document Object Model (DOM), you will find that the lxml
package has a quite different way of representing documents as trees. In the DOM, trees are built out
of nodes represented as Node instances. Some nodes are Element instances, representing whole elements.
Each Element has an assortment of child nodes of various types: Element nodes for its element children;
Attribute nodes for its attributes; and Text nodes for textual content.
[4] http://lxml.de/
[5] http://effbot.org/zone/element-index.htm
[6] http://docs.python.org/library/xml.etree.elementtree.html
Here is a small fragment of XHTML, and its representation as a DOM tree:
<p>To find out <em>more</em>, see the
<a href="http://www.w3.org/XML">standard</a>.</p>
[Figure: DOM tree representation of the XHTML fragment]
The above diagram shows the conceptual structure of the XML. The lxml view of an XML document,
by contrast, builds a tree of only one node type: the Element.
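The DOM view of this fragment can be listed from Python. The following is a minimal sketch, assuming the standard library's xml.dom.minidom as the DOM implementation (it is not part of lxml); the helper name describe() is hypothetical and exists only for this illustration.

```python
from xml.dom import minidom

FRAGMENT = (
    '<p>To find out <em>more</em>, see the\n'
    '<a href="http://www.w3.org/XML">standard</a>.</p>'
)


def describe(node, depth=0):
    """Return one indented line per DOM node, visiting the tree depth first."""
    indent = "  " * depth
    if node.nodeType == node.TEXT_NODE:
        lines = [indent + "Text %r" % node.data]
    else:
        lines = [indent + "Element <%s>" % node.tagName]
        # In the DOM, attributes are separate nodes attached to the element.
        for name, value in node.attributes.items():
            lines.append(indent + "  Attr %s=%r" % (name, value))
    # Text nodes have no children, so the recursion stops there.
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines


dom_lines = describe(minidom.parseString(FRAGMENT).documentElement)
print("\n".join(dom_lines))
```

In the listing this prints, the text that follows the em element appears as a Text child of p, next to the Element children, and the href attribute is reported as its own node.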
The main difference between the ElementTree view used in lxml and the classical view is the
association of text with elements, which lxml handles very differently.
An instance of lxml's Element class contains these attributes:
.tag
The name of the element, such as "p" for a paragraph or "em" for emphasis.
.text
The text inside the element, if any, up to the first child element. This attribute is None if the element
is empty or has no text before the first child element.
.tail
The text following the element. This is the most unusual departure. In the DOM model, any text
following an element E is associated with the parent of E; in lxml, that text is considered the “tail”
of E.
.attrib
A Python dictionary containing the element's XML attribute names and their corresponding values.
For example, for the element “<h2 class="arch" id="N15">”, that element's .attrib would
be the dictionary “{"class": "arch", "id": "N15"}”.
(element children)
To access sub-elements, treat an element as a list. For example, if node is an Element instance,
node[0] is the first sub-element of node. If node doesn't have any sub-elements, this operation
will raise an IndexError exception.
You can find out the number of sub-elements using the len() function. For example, if node has
five children, len(node) will return a value of 5.
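These attributes can be tried out on a small document. This is a minimal sketch, not taken from the original text: the sample markup (a div wrapping the h2 element and two paragraphs) is invented for illustration, and the fallback import of xml.etree.ElementTree is an assumption added so that the example still runs where lxml is not installed. dict() is applied to .attrib because lxml hands back a dictionary-like object rather than a plain dict.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same basic API.
    import xml.etree.ElementTree as etree

# Hypothetical sample document: one h2 element and two paragraphs inside a div.
node = etree.fromstring(
    '<div><h2 class="arch" id="N15">Arches</h2><p>One</p><p>Two</p></div>'
)

heading = node[0]                # the first sub-element of node
print(heading.tag)               # the element name
print(dict(heading.attrib))      # attribute names mapped to their values
print(len(node))                 # the number of sub-elements of node

# Indexing an element that has no sub-elements raises IndexError.
try:
    heading[0]
    has_children = True
except IndexError:
    has_children = False
print(has_children)
```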
One advantage of the lxml view is that a tree is now made of only one type of node: each node is an
Element instance. Here is our XML fragment again, and a picture of its representation in lxml.
<p>To find out <em>more</em>, see the
<a href="http://www.w3.org/XML">standard</a>.</p>
[Figure: lxml tree representation of the XHTML fragment]
Notice that in the lxml view, the text ", see the\n" (which includes the newline) is contained in
the .tail attribute of the em element, not associated with the p element as it would be in the DOM
view. Also, the "." at the end of the paragraph is in the .tail attribute of the a (link) element.
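The placement of text described here can be checked directly. A minimal sketch: the fragment is the one shown above, parsed with fromstring(); the fallback import of xml.etree.ElementTree is an assumption added so the example also runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same basic API.
    import xml.etree.ElementTree as etree

FRAGMENT = (
    '<p>To find out <em>more</em>, see the\n'
    '<a href="http://www.w3.org/XML">standard</a>.</p>'
)

p = etree.fromstring(FRAGMENT)
em, a = p[0], p[1]

# Show where each piece of text is stored.
for element in (p, em, a):
    print(element.tag, repr(element.text), repr(element.tail))

# Joining the pieces in document order gives back the paragraph's full text.
rebuilt = p.text + em.text + em.tail + a.text + a.tail
print(repr(rebuilt))
```

Because p is the root of the parsed fragment and nothing follows it, its .tail is None.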
Now that you know how XML is represented in lxml, there are three general application areas.
• Section 3, “Reading an XML document”.
• Section 5, “Creating a new XML document”.
• Section 6, “Modifying an existing XML document”.
3. Reading an XML document
Suppose you want to extract some information from an XML document. Here's the general procedure:
1. You'll need to import the lxml package. Here is one way to do it:
from lxml import etree
2. Typically your XML document will be in a file somewhere. Suppose your file is named test.xml;
to read the document, you might say something like:
doc = etree.parse('test.xml')
The returned value doc is an instance of the ElementTree class that represents your XML document
in tree form.
Once you have your document in this form, refer to Section 8, “class ElementTree: A complete
XML document” to learn how to navigate around the tree and extract the various parts of its
structure.
For other methods of creating an ElementTree, refer to Section 7, “Features of the etree module”.
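The two steps above can be put together as follows. This is a minimal sketch under stated assumptions: the contents of test.xml (an inventory element with one item) are invented for illustration, the file is written to a temporary directory only so that the example is self-contained, and the fallback import of xml.etree.ElementTree is added so the example also runs without lxml installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same basic API.
    import xml.etree.ElementTree as etree

# Hypothetical file contents; any well-formed XML document would do.
SAMPLE = '<?xml version="1.0"?>\n<inventory><item id="1">bolt</item></inventory>\n'

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "test.xml")
    with open(path, "w", encoding="utf-8") as xml_file:
        xml_file.write(SAMPLE)

    doc = etree.parse(path)      # the whole document, in tree form
    root = doc.getroot()         # the document's root element
    root_tag = root.tag
    first_item_text = root[0].text

print(root_tag, first_item_text)
```

The tree stays in memory after parsing, so it remains usable once the file itself is gone.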