This post explains in detail how to work with XML files in Python. We will also look at some of the more advanced concepts of the ElementTree module.
What are XML files?
Extensible Markup Language (XML) is a file format that is used for the serialization of data, that is, storing, transmitting, and reconstructing arbitrary data, in a format that is both human-readable and machine-readable.
As a markup language, XML labels, categorizes, and structurally organizes information. XML tags represent the data structure and contain metadata. What’s within the tags is data, encoded in the way the XML standard specifies.
XML is used in thousands of applications ranging from the transfer of data between two systems to web applications and even documentation.
Creating a simple XML document
An XML file may begin with an XML declaration that gives information about the XML itself, such as its version and encoding: <?xml version="1.0" encoding="UTF-8"?>
Every XML document has exactly one root element, whose start and end tags enclose all other content. The root can have any name you choose (there is no fixed <xml> tag), and it is the first element that software processing the document will encounter.
Consider the XML code below:
<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>Learning Python XML</title>
  <author> Mark Wilkins </author>
  <description> The book is very helpful for learning Python core Basics </description>
</book>
<book>, <title>, <author>, and <description> are markup symbols, called tags in XML, which are used to define data. You can choose the tag names yourself. Note: by convention, save your file with the .xml extension.
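As a preview of the ElementTree module covered later in this post, here is a minimal sketch that reads the sample document above from a bytes literal. The variable names (BOOK_XML, book, child_tags) are only illustrative, and the sketch assumes Python 3 with the standard library.

```python
import xml.etree.ElementTree as ET

# The sample document from above, held as bytes so that the
# encoding declaration is honoured by the parser.
BOOK_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>Learning Python XML</title>
  <author> Mark Wilkins </author>
  <description> The book is very helpful for learning Python core Basics </description>
</book>"""

book = ET.fromstring(BOOK_XML)            # returns the root element
title = book.findtext("title")            # text of the first <title> child
author = book.findtext("author").strip()  # strip the padding spaces
child_tags = [child.tag for child in book]

print(book.tag, title, author, child_tags)
```

The tag names in the output are exactly the ones chosen in the document, which illustrates that XML itself does not prescribe them.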
What are the benefits of using XML?
Maintain data integrity: XML allows you to transfer data along with the data’s description, which helps prevent loss of data integrity. You can use this descriptive information to do the following:
- Verify data accuracy
- Automatically customize data presentation for different users
- Store data consistently across multiple platforms
Improve search engine visibility: Search engines can sort and categorize XML files more efficiently and precisely than other types of documents. Based on XML tags, search engines can accurately categorize content for relevant search results. Thus, XML helps computers interpret natural language more efficiently.
Design flexible applications: Many technologies, especially newer ones, come with built-in XML support. They can automatically read and process XML data files so that you can make changes without having to reformat your entire database. This way you can conveniently upgrade or modify your application design.
You can read more about XML and its features and applications on the Amazon site.
The XML Information Set (Infoset) is an attempt to define a data model complete enough to represent anything that can be stored in an XML document. The infoset defines a number of “abstract” building blocks, such as elements, attributes, and characters.
Python XML Parsing Modules
To parse XML data (and to read, write, and modify it) in Python, we have several modules to choose from, such as xml.etree.ElementTree, lxml, and xml.dom.minidom (Minimal DOM Implementation).
ElementTree provides a simple and lightweight way to parse and modify XML data, while lxml provides a more feature-rich library with a higher-level API for working with XML and HTML data. In this post, we will learn only about the ElementTree module.
ElementTree is a simple and lightweight module that provides functionality to parse, create, and modify XML data.
The module represents an XML document as a tree structure, where each element in the document is represented as a node in the tree. This makes it easy to navigate the document and modify its contents. The ElementTree module provides a number of methods for parsing XML data, creating new elements and sub-elements, and modifying existing elements. It’s a great choice for working with simple to moderately complex XML data.
The Element type is a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory. The element type can be described as a cross between a Python list and a Python dictionary.
The ElementTree wrapper type adds code to load XML files as trees of Element objects, and save them back again.
Each element has a number of properties associated with it:
- a tag. This is a string identifying what kind of data this element represents (the element type, in other words).
- a number of attributes, stored in a Python dictionary.
- a text string.
- an optional tail string.
- a number of child elements, stored in a Python sequence
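The properties listed above can be seen on a small hand-built element. This is a minimal sketch assuming Python 3 and the standard library's xml.etree.ElementTree; the tag and attribute names (item, note, id) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Build one element with an attribute, some text, and one child.
elem = ET.Element("item", id="42")
elem.text = "first"            # text before the first child

child = ET.SubElement(elem, "note")
child.text = "inner"           # text inside the child
child.tail = "after"           # text that follows the child's end tag

print(elem.tag)                # the tag string
print(elem.attrib)             # attributes as a dictionary
print(len(elem))               # number of child elements

serialized = ET.tostring(elem, encoding="unicode")
print(serialized)
# → <item id="42">first<note>inner</note>after</item>
```

Note how the tail string belongs to the child element, even though it appears inside the parent in the serialized output.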
The Element type was originally distributed as a separate pure-Python package, with an optional C implementation. The core components of both have shipped with Python as xml.etree.ElementTree since Python 2.5, and since Python 3.3 the C accelerator is used automatically when it is available.
Working with ElementTree
ElementTree is part of the Python standard library, so no separate installation is needed. You can import the ElementTree module, and other modules from the xml.etree package, directly:
>>> from xml.etree import ElementTree
It’s common practice to import ElementTree under an alias, both to minimize typing and to make it easier to switch between different implementations.
Each Element instance can have an identifying tag, any number of attributes, any number of child element instances, and an associated object (usually a string). To create elements, you can use the Element or SubElement factories:
import xml.etree.ElementTree as ET

# build a tree structure
root = ET.Element("html")
head = ET.SubElement(root, "head")
title = ET.SubElement(head, "title")
title.text = "Page Title"
body = ET.SubElement(root, "body")
body.set("bgcolor", "#ffffff")
body.text = "Hello, World!"

# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xhtml")
The ElementTree wrapper adds code to load XML files as trees of Element objects, and save them back again. You can use the parse function to quickly load an entire XML document into an ElementTree instance:
import xml.etree.ElementTree as ET

tree = ET.parse("page.xhtml")

# the tree root is the toplevel html element
print(tree.findtext("head/title"))

# if you need the root element, use getroot
root = tree.getroot()

# ...manipulate tree...

tree.write("out.xml")
Basic operations with ElementTree
To create an element, call the Element constructor, and pass the tag string as the first argument:
from xml.etree.ElementTree import Element

root = Element("root")
You can access the tag string via the tag attribute; for the element created above, root.tag is "root".
To build a tree, create more elements, and append them to the parent element:
root = Element("root")
root.append(Element("one"))
root.append(Element("two"))
root.append(Element("three"))
Since this is a very common operation, the library provides a helper function called SubElement that creates a new element and adds it to its parent, in one step:
from xml.etree.ElementTree import Element, SubElement

root = Element("root")
SubElement(root, "one")
SubElement(root, "two")
SubElement(root, "three")
To access the subelements, you can use ordinary list (sequence) operations. This includes len(element) to get the number of subelements, element[i] to fetch the i’th subelement, and using the for-in statement to loop over the subelements:
for node in root:
    print(node)
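The sequence operations mentioned above can be tried on a small tree. This is a minimal sketch assuming Python 3; the variable names (count, second, tags) are only illustrative.

```python
from xml.etree.ElementTree import Element, SubElement

# Build a root with three children, as in the earlier examples.
root = Element("root")
for name in ("one", "two", "three"):
    SubElement(root, name)

count = len(root)                  # number of direct subelements
second = root[1]                   # index access, zero-based
tags = [node.tag for node in root]  # iterate over direct children only

print(count, second.tag, tags)
# → 3 two ['one', 'two', 'three']
```

Iteration and len only look at direct children; nested elements further down the tree are not included.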
The element type also supports slicing (including slice assignment), and the standard append, insert and remove methods:
nodes = node[1:5]
node.append(subnode)
node.insert(0, subnode)
node.remove(subnode)
Note that remove takes an element, not a tag. To find the element to remove, you can either loop over the parent, or use one of the find methods described below.
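Because remove needs the element object itself, a typical pattern is to look the element up first and then remove it. The following is a minimal sketch assuming Python 3; the sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><one/><two/><three/></root>")

# Look up the element object first, then remove it from its parent.
target = root.find("two")
if target is not None:          # compare with None: an element without
    root.remove(target)         # children is falsy in a boolean test

remaining = [child.tag for child in root]
first_only = [child.tag for child in root[0:1]]  # slicing returns a list

print(remaining, first_only)
# → ['one', 'three'] ['one']
```

The explicit "is not None" check matters here, since an element with no subelements evaluates as false even though it exists.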
In addition to the tag and the list of subelements, each element can have one or more attributes. Each element attribute consists of a string key, and a corresponding value. As for ordinary Python dictionaries, all keys must be unique.
Element attributes are in fact stored in a standard Python dictionary, which can be accessed via the attrib attribute. To set attributes, you can simply assign to attrib members:
from xml.etree.ElementTree import Element

elem = Element("tag")
elem.attrib["first"] = "1"
elem.attrib["second"] = "2"
When creating a new element, you can pass in element attributes using keyword arguments. The previous example is better written as:
from xml.etree.ElementTree import Element

elem = Element("tag", first="1", second="2")
The Element type provides shortcuts for attrib.get, attrib.keys, and attrib.items. There’s also a set method, to set the value of an element attribute:
from xml.etree.ElementTree import Element

elem = Element("tag", first="1", second="2")

# print 'first' attribute
print(elem.attrib.get("first"))

# same, using shortcut
print(elem.get("first"))

# print list of keys (using shortcuts)
print(elem.keys())
print(elem.items())

# the 'third' attribute doesn't exist
print(elem.get("third"))
print(elem.get("third", "default"))

# add the attribute and try again
elem.set("third", "3")
print(elem.get("third", "default"))
The element type also provides a text attribute, which can be used to hold additional data associated with the element. As the name implies, this attribute is usually used to hold a text string, but it can be used for other, application-specific purposes.
from xml.etree.ElementTree import Element

elem = Element("tag")
elem.text = "this element also contains text"
If there is no additional data, this attribute is set to an empty string, or None.
The element type actually provides two attributes that can be used in this way; in addition to text, there’s a similar attribute called tail. It too can contain a text string, an application-specific object, or None. The tail attribute is used to store trailing text nodes when reading mixed-content XML files; text that follows directly after an element is stored in the tail attribute for that element:
<tag><elem>this goes into elem's text attribute</elem>this goes into elem's tail attribute</tag>
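To see where the parser puts each piece of text, the fragment above can be parsed and inspected. This is a minimal sketch assuming Python 3; it also uses the itertext method to collect all text in document order, which is an addition not discussed in the text above.

```python
import xml.etree.ElementTree as ET

tag = ET.fromstring(
    "<tag><elem>this goes into elem's text attribute</elem>"
    "this goes into elem's tail attribute</tag>"
)
elem = tag[0]

print(elem.text)   # text inside <elem>
print(elem.tail)   # text after </elem>, still inside <tag>
print(tag.text)    # None: nothing precedes the first child

# itertext() walks the subtree and yields text and tail strings in order.
full_text = "".join(tag.itertext())
print(full_text)
```

When you need the complete text of a mixed-content element, reading only the text attribute is not enough; the tails of the children have to be included as well.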
Searching for Subelements
The Element type provides a number of methods that can be used to search for subelements:
find(pattern) returns the first subelement that matches the given pattern, or None if there is no matching element.
findtext(pattern) returns the value of the text attribute for the first subelement that matches the given pattern. If there is no matching element, this method returns None.
findall(pattern) returns a list (or another iterable object) of all subelements that match the given pattern.
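The three search methods can be compared side by side on a small document. This is a minimal sketch assuming Python 3; the catalog document and the "n/a" default are made up for illustration.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<catalog>"
    "<book id='b1'><title>First</title></book>"
    "<book id='b2'><title>Second</title></book>"
    "</catalog>"
)

first_book = catalog.find("book")               # first matching child
first_title = catalog.findtext("book/title")    # text of first match
all_titles = [t.text for t in catalog.findall("book/title")]

missing = catalog.find("magazine")              # no match gives None
missing_text = catalog.findtext("magazine/title", default="n/a")

print(first_book.get("id"), first_title, all_titles, missing, missing_text)
# → b1 First ['First', 'Second'] None n/a
```

A plain tag name matches direct children only; a path such as book/title is needed to reach elements further down.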
In addition, the iter method can be used to loop over the tree in depth-first order. (Older versions of the library called this method getiterator; that name was deprecated and has been removed in Python 3.9.)
iter(tag) returns an iterator over the element itself and all subelements that have the given tag, on all levels in the subtree. The elements are returned in document order (that is, in the same order as they would appear if you saved the tree as an XML file).
iter() (without argument) returns an iterator over the element and all subelements in the subtree.
getchildren() returned a list of all direct child elements. This method was deprecated and has been removed in Python 3.9; use indexing or slicing to access the children, or list(elem) to get a list.
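In current versions of the standard library, tree traversal is done with iter (the present-day name for getiterator), and list(elem) replaces getchildren. The following is a minimal sketch assuming Python 3; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b><c/></b><c/><d><c/></d></a>")

# iter() walks the whole subtree in document order, starting with
# the element it is called on.
all_tags = [e.tag for e in doc.iter()]

# iter(tag) restricts the walk to elements with that tag, at any depth.
c_count = sum(1 for _ in doc.iter("c"))

# list(elem) gives only the direct children.
children = [e.tag for e in list(doc)]

print(all_tags, c_count, children)
# → ['a', 'b', 'c', 'c', 'd', 'c'] 3 ['b', 'c', 'd']
```

Unlike findall with a plain tag name, iter reaches elements at every level of nesting.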
Reading and Writing XML Files
The Element type can be used to represent XML files in memory. The ElementTree wrapper class is used to read and write XML files.
To load an XML file into an Element structure, use the parse function:
from xml.etree.ElementTree import parse

tree = parse(filename)
elem = tree.getroot()
You can also pass in a file handle (or any object with a read method):
from xml.etree.ElementTree import parse

# binary mode lets the parser detect the encoding from the XML declaration
with open(filename, "rb") as file:
    tree = parse(file)
elem = tree.getroot()
The parse function returns an ElementTree object. To get the topmost element object, use the getroot method.
In recent versions of the ElementTree module, you can also use the file keyword argument to create a tree, and fill it with contents from a file in one operation:
from xml.etree.ElementTree import ElementTree

tree = ElementTree(file=filename)
elem = tree.getroot()
To save an element tree back to disk, use the write method on the ElementTree class. Like the parse function, it takes either a filename or a file object (or any object with a write method):
from xml.etree.ElementTree import ElementTree

tree = ElementTree(file=infile)
tree.write(outfile)
If you want to write an Element object hierarchy to disk, wrap it in an ElementTree instance:
from xml.etree.ElementTree import Element, SubElement, ElementTree

html = Element("html")
body = SubElement(html, "body")
ElementTree(html).write(outfile)
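Since write accepts any object with a write method, an in-memory buffer can stand in for a file, which is handy for experiments. The following is a minimal sketch assuming Python 3; it also passes the encoding and xml_declaration arguments of write so that the output starts with an XML declaration. The buffer and variable names are only illustrative.

```python
import io
import xml.etree.ElementTree as ET

# Build a small element hierarchy.
html = ET.Element("html")
body = ET.SubElement(html, "body")
body.text = "Hello, World!"

# Write to an in-memory binary buffer instead of a file on disk.
buffer = io.BytesIO()
ET.ElementTree(html).write(buffer, encoding="utf-8", xml_declaration=True)
data = buffer.getvalue()
print(data.decode("utf-8"))

# Read the same bytes back to confirm the round trip.
reloaded = ET.parse(io.BytesIO(data)).getroot()
print(reloaded.tag, reloaded.findtext("body"))
```

Replacing the buffer with a filename gives the same result on disk.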
To convert between XML and strings, you can use the XML, fromstring, and tostring helpers:
from xml.etree.ElementTree import XML, fromstring, tostring

elem = XML(text)
elem = fromstring(text) # same as XML(text)
text = tostring(elem) # returns bytes by default; pass encoding="unicode" for a str
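A short round trip shows what the string helpers return in Python 3, where tostring produces bytes unless asked otherwise. This is a minimal sketch; the greeting element is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Parse a string into an element.
elem = ET.fromstring("<greeting lang='en'>Hello</greeting>")

as_bytes = ET.tostring(elem)                     # bytes by default
as_text = ET.tostring(elem, encoding="unicode")  # str when requested

print(type(as_bytes).__name__, as_text)
# → bytes <greeting lang="en">Hello</greeting>

# Parsing the serialized text again gives an equivalent element.
again = ET.fromstring(as_text)
print(again.get("lang"), again.text)
```

The serializer normalizes attribute quoting to double quotes, so the output text need not match the input character for character.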