XML Processing Modules in Python

XML stands for "Extensible Markup Language". It is widely used to store and exchange data that has a well-defined structure, including on the web. An XML document is made up of elements, each delimited by a start tag and an end tag. A tag is a markup construct that begins with < and ends with >. The characters between the start tag and the end tag are the element's content. An element can contain other elements, which are called "sub-elements".

Example

Below is the sample XML file used throughout this tutorial.

<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101">
      <author>Vicky, Matthew</author>
      <title>Geo-Spatial Data Analysis</title>
      <stream>Python</stream>
      <price>4.95</price>
      <publish_date>2020-07-01</publish_date>
      <description>Learn geo Spatial data Analysis using Python.</description>
   </Tutorial>
   <Tutorial id="Tu102">
      <author>Bolan, Kim</author>
      <title>Data Structures</title>
      <stream>Computer Science</stream>
      <price>12.03</price>
      <publish_date>2020-1-19</publish_date>
      <description>Learn Data structures using different programming lanuages.</description>
   </Tutorial>
   <Tutorial id="Tu103">
      <author>Sora, Everest</author>
      <title>Analytics using Tensorflow</title>
      <stream>Data Science</stream>
      <price>7.11</price>
      <publish_date>2020-1-19</publish_date>
      <description>Learn Data analytics using Tensorflow.</description>
   </Tutorial>
</Tutorials>

Reading XML using xml.etree.ElementTree

This module parses an XML file into a tree and gives access to its root element, from which we can reach the content of the inner elements. Iterating over an element yields its direct sub-elements. In the example below, we read the text attribute of each inner element to get its content.

Example

import xml.etree.ElementTree as ET
xml_tree = ET.parse('E:\\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
for xml_elmt in xml_root:
   for inner_elmt in xml_elmt:
      print(inner_elmt.text)

Output

Running the code above produces the following output:

Tutorial List :
Vicky, Matthew
Geo-Spatial Data Analysis
Python
4.95
2020-07-01
Learn geo Spatial data Analysis using Python.
Bolan, Kim
Data Structures
Computer Science
12.03
2020-1-19
Learn Data structures using different programming lanuages.
Sora, Everest
Analytics using Tensorflow
Data Science
7.11
2020-1-19
Learn Data analytics using Tensorflow.
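
The loop above prints values without saying which field they belong to. As a minimal sketch, the variant below pairs each value with its tag name. It assumes a shortened copy of the sample document embedded in a string (SAMPLE_XML) so that it runs without a file on disk; the helper name list_fields is hypothetical and not part of the module.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (an assumption made for this sketch),
# embedded as a string so that no file on disk is needed.
SAMPLE_XML = """<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101">
      <title>Geo-Spatial Data Analysis</title>
      <price>4.95</price>
   </Tutorial>
   <Tutorial id="Tu102">
      <title>Data Structures</title>
      <price>12.03</price>
   </Tutorial>
</Tutorials>"""


def list_fields(root):
    """Return (tag, text) pairs for every child of every top-level element."""
    return [(child.tag, child.text) for record in root for child in record]


# fromstring() parses a string and returns the root element directly.
root = ET.fromstring(SAMPLE_XML)
for tag, text in list_fields(root):
    print(f"{tag}: {text}")
```

Using ET.fromstring() instead of ET.parse() is only a convenience here; with a file, getroot() on the parsed tree gives the same kind of element.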

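Getting XML attributes

Every element exposes its XML attributes and their values as a dictionary through the attrib property. Once we know an element's attributes, such as the id of each Tutorial element, it becomes easy to navigate the XML tree.

Example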

import xml.etree.ElementTree as ET
xml_tree = ET.parse('E:\\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
# iter() walks the whole tree and yields every element with the given tag
for tutorial in xml_root.iter('Tutorial'):
   print(tutorial.attrib)

Output

Running the code above produces the following output:

Tutorial List :
{'id': 'Tu101'}
{'id': 'Tu102'}
{'id': 'Tu103'}
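
When only one attribute is needed, an element's get() method reads it directly and accepts a fallback value. The sketch below illustrates this under a few assumptions: the XML is a shortened inline copy of the sample, and the third Tutorial element without an id is a hypothetical addition that is not in the sample file, included only to show the fallback.

```python
import xml.etree.ElementTree as ET

# Shortened inline sample; the last Tutorial (no id) is hypothetical and
# exists only to demonstrate the fallback value of get().
SAMPLE_XML = """<Tutorials>
   <Tutorial id="Tu101"><title>Geo-Spatial Data Analysis</title></Tutorial>
   <Tutorial id="Tu102"><title>Data Structures</title></Tutorial>
   <Tutorial><title>Untitled draft</title></Tutorial>
</Tutorials>"""

root = ET.fromstring(SAMPLE_XML)

# Map each id to its title. get() returns "unknown" when the attribute is
# absent, and findtext() returns the text of the first matching sub-element.
titles_by_id = {
    tut.get("id", "unknown"): tut.findtext("title")
    for tut in root.iter("Tutorial")
}
print(titles_by_id)
```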

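Filtering results

We can also filter elements out of the XML tree using the findall() method, which accepts a limited subset of XPath expressions. In the example below, we find the id of the tutorial whose price is 12.03. Note that the predicate compares the text of the price element as a string, not as a number.

Example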

import xml.etree.ElementTree as ET
xml_tree = ET.parse('E:\\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
# Select Tutorial elements that have a price sub-element with text '12.03'
for tutorial in xml_root.findall("./Tutorial[price='12.03']"):
   print(tutorial.attrib)

Output

Running the code above produces the following output:

Tutorial List :
{'id': 'Tu102'}
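
Because the predicate matches text exactly, it cannot express a range such as "cheaper than 10". One possible approach, sketched below, is to select the elements with findall() and compare the prices in Python after converting them with float(). The inline SAMPLE_XML, the helper name ids_cheaper_than and the limit of 10.0 are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the sample document (an assumption for this sketch).
SAMPLE_XML = """<Tutorials>
   <Tutorial id="Tu101"><price>4.95</price></Tutorial>
   <Tutorial id="Tu102"><price>12.03</price></Tutorial>
   <Tutorial id="Tu103"><price>7.11</price></Tutorial>
</Tutorials>"""


def ids_cheaper_than(root, limit):
    """Return the ids of Tutorial elements whose price is below limit."""
    return [
        tut.get("id")
        for tut in root.findall("./Tutorial")
        # findtext() returns the price as a string, so convert before comparing.
        if float(tut.findtext("price")) < limit
    ]


root = ET.fromstring(SAMPLE_XML)
print(ids_cheaper_than(root, 10.0))  # prints ['Tu101', 'Tu103']
```

This sketch assumes every Tutorial has a price element containing a valid number; a missing or malformed price would raise an exception.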

Parsing XML with the DOM API

The xml.dom.minidom module is a minimal implementation of the DOM interface. It provides a simple parser that quickly builds a DOM tree from an XML file. The example calls the module's parse(file[, parser]) function, which parses the XML file given by file (a file name or a file object) into a DOM tree object. We then use getElementsByTagName() to find elements and read the text from their child nodes.

Example

from xml.dom.minidom import parse

# Open the XML document using the minidom parser
DOMTree = parse('E:\\TutorialsList.xml')
collection = DOMTree.documentElement

# Get all the tutorials in the collection
tut_list = collection.getElementsByTagName("Tutorial")

print("*****Tutorials*****")
# Print the stream and price of each tutorial
for tut in tut_list:

   strm = tut.getElementsByTagName('stream')[0]
   # print() already separates its arguments with a single space
   print("Stream:", strm.childNodes[0].data)

   prc = tut.getElementsByTagName('price')[0]
   print("Price:", prc.childNodes[0].data)

Output

Running the code above produces the following output:

*****Tutorials*****
Stream: Python
Price: 4.95
Stream: Computer Science
Price: 12.03
Stream: Data Science
Price: 7.11
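
The DOM API can also read attributes, which the example above does not show. The following is a minimal sketch, assuming a shortened inline copy of the sample document parsed with parseString(); the helper name text_of is hypothetical. It collects the id, stream and price of each tutorial into a tuple.

```python
from xml.dom.minidom import parseString

# Shortened inline copy of the sample document (an assumption for this sketch).
SAMPLE_XML = """<Tutorials>
   <Tutorial id="Tu101"><stream>Python</stream><price>4.95</price></Tutorial>
   <Tutorial id="Tu102"><stream>Computer Science</stream><price>12.03</price></Tutorial>
</Tutorials>"""


def text_of(parent, tag):
    """Return the text of the first <tag> element found under parent."""
    node = parent.getElementsByTagName(tag)[0]
    # The text lives in a child text node, not on the element itself.
    return node.firstChild.data


dom = parseString(SAMPLE_XML)
rows = [
    # getAttribute() returns the attribute value as a string.
    (tut.getAttribute("id"), text_of(tut, "stream"), float(text_of(tut, "price")))
    for tut in dom.documentElement.getElementsByTagName("Tutorial")
]
for row in rows:
    print(row)
```

The sketch assumes each element holds plain text as its first child; elements that are empty or contain nested markup would need extra checks.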