Pattern Matching in Python with Regex

What is a regular expression?

In practice, string parsing in most programming languages is handled with regular expressions. In Python, a regular expression is a way of matching patterns in text.

The "re" module, which ships with every Python installation, provides regular expression support.

In Python, a regular expression search is typically written as:

match = re.search(pattern, string)

The re.search() method takes two arguments, a regular expression pattern and a string, and searches for that pattern within the string. If the pattern is found, search() returns a match object; otherwise it returns None. So, given a string, a regular expression lets you determine whether the string matches a given pattern and, optionally, collect the substrings that contain relevant information. Regular expressions can be used to answer questions such as:

Is this string a valid URL?
Which users in /etc/passwd belong to a given group?
What are the date and time of every warning message in a log file?
What username and document were requested by the URL a visitor typed?
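
As a minimal sketch of the first question above, the helper below uses re.search() to test whether a string looks like an HTTP(S) URL. The function name looks_like_url and the deliberately simplified pattern are assumptions made for illustration; real URL validation is considerably more involved.

```python
import re

# Simplified, illustrative pattern: scheme, "://", then at least one
# non-space character. This is not a full URL validator.
URL_PATTERN = r"^https?://\S+$"

def looks_like_url(text):
    """Return True if re.search() finds the pattern, False otherwise."""
    match = re.search(URL_PATTERN, text)
    # re.search() returns a match object on success and None on failure
    return match is not None

print(looks_like_url("https://www.python.org/"))  # True
print(looks_like_url("not a url"))                # False
```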

Matching patterns

Regular expressions are a complex mini-language that relies on special characters to match unknown strings. Let's start, though, with literal characters such as letters, digits, and the space character, which always match themselves. Here is a basic example:

# The 're' module provides regular expression support
import re

search_string = "TutorialsPoint"
pattern = "Tutorials"
# re.match() only matches at the beginning of the string
match = re.match(pattern, search_string)
# The if-statement tests whether match() succeeded (None means no match)
if match:
   print("regex matches:", match.group())
else:
   print("pattern not found")

Output

regex matches: Tutorials
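
The example above calls re.match(), which only looks for the pattern at the start of the string, whereas re.search() scans the whole string. The short sketch below contrasts the two on the same input; the variable names are illustrative.

```python
import re

text = "TutorialsPoint"

# re.match() anchors at the beginning of the string
print(re.match("Tutorials", text) is not None)  # True: pattern is at index 0
print(re.match("Point", text) is not None)      # False: "Point" is not at the start

# re.search() scans forward and finds the pattern anywhere
found = re.search("Point", text)
print(found.group(), found.start())             # Point 9
```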

Matching a string

Python's "re" module offers many methods. To test whether a regular expression matches a string, you can use re.search(). The match object it returns (re.Match) provides additional information, such as which part of the string was matched.

Syntax

matchObject = re.search(pattern, input_string, flags=0)

Example

#Need module 're' for regular expression
import re
# Let's use a regular expression to match a date string.
regex = r"([a-zA-Z]+) (\d+)"
# Search once and reuse the resulting match object
match = re.search(regex, "Jan 2")
if match:
   # This will print 0 and 5: the match spans the half-open range [0, 5),
   # i.e. the whole string
   print("Match at index %s, %s" % (match.start(), match.end()))
   # The groups contain the matched values. In particular:
   # match.group(0) always returns the fully matched string
   # match.group(1), match.group(2), ... will return the capture
   # groups in order from left to right in the input string  
   # match.group() is equivalent to match.group(0)
   # So this will print "Jan 2"
   print("Full match: %s" % (match.group(0)))
   # So this will print "Jan"
   print("Month: %s" % (match.group(1)))
   # So this will print "2"
   print("Day: %s" % (match.group(2)))
else:
   # If re.search() does not match, then None is returned
   print("Pattern not Found! ")

Output

Match at index 0, 5
Full match: Jan 2
Month: Jan
Day: 2

Because re.search() stops after the first match, it is better suited to testing a regular expression than to extracting data.
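
To illustrate the point, the sketch below runs the same date pattern over a string containing several dates. re.search() reports only the first one, while re.finditer() yields a match object for each. The sample string is made up for this example.

```python
import re

regex = r"([a-zA-Z]+) (\d+)"
text = "Jan 2, Feb 14 and Mar 30"  # made-up sample input

# re.search() stops at the first match
first = re.search(regex, text)
print(first.group(0))  # Jan 2

# re.finditer() yields one match object per non-overlapping match
dates = [(m.group(1), int(m.group(2))) for m in re.finditer(regex, text)]
print(dates)  # [('Jan', 2), ('Feb', 14), ('Mar', 30)]
```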

Capturing groups

If the pattern contains two or more parenthesized groups, re.findall() returns a list of tuples instead of a list of strings. Each match is represented by one tuple, and each tuple holds group(1), group(2), and so on.

import re

regex = r'([\w\.-]+)@([\w\.-]+)'
text = 'hello john@hotmail.com, hello@Tutorialspoint.com, hello python@gmail.com'
# With two groups, findall() returns one (username, host) tuple per match
matches = re.findall(regex, text)
print(matches)
for username, host in matches:
   print("Username:", username)
   print("Host:", host)

Output

[('john', 'hotmail.com'), ('hello', 'Tutorialspoint.com'), ('python', 'gmail.com')]
Username: john
Host: hotmail.com
Username: hello
Host: Tutorialspoint.com
Username: python
Host: gmail.com
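
The shape of the list that re.findall() returns depends on how many groups the pattern has. The sketch below, using a made-up input string, shows the three cases: no group, one group, and two groups.

```python
import re

text = "alice@example.com, bob@example.org"  # made-up sample input

# No groups: a list of the full matches
full = re.findall(r"[\w.-]+@[\w.-]+", text)
print(full)   # ['alice@example.com', 'bob@example.org']

# One group: a list of strings, one per match, holding that group only
users = re.findall(r"([\w.-]+)@[\w.-]+", text)
print(users)  # ['alice', 'bob']

# Two groups: a list of tuples, one tuple per match
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", text)
print(pairs)  # [('alice', 'example.com'), ('bob', 'example.org')]
```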

Searching and replacing strings

Another common task is to find every instance of a pattern in a given string and replace it. re.sub(pattern, replacement, string) does exactly that, for example to replace every instance of an old email domain.

Code

# required library
import re

# given string (the addresses from the previous example)
text = 'hello john@hotmail.com, hello@Tutorialspoint.com, hello python@gmail.com, Hello World!'
# pattern to match
pattern = r'([\w\.-]+)@([\w\.-]+)'
# replacement: keep the username (\1) and substitute a new domain;
# example.com is a placeholder domain used for illustration
replace = r'\1@example.com'
# re.sub(pat, replacement, str) returns a new string with all replacements;
# \1 is group(1), \2 is group(2) in the replacement
print(re.sub(pattern, replace, text))

Output

hello john@example.com, hello@example.com, hello python@example.com, Hello World!
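
Backreferences can also reorder captured text. The sketch below reuses the date pattern from earlier in this article and swaps month and day; the input string is made up, and \g<2> is the explicit form of \2.

```python
import re

regex = r"([a-zA-Z]+) (\d+)"
text = "Jan 2 and Feb 14"  # made-up sample input

# \2 refers to group 2 (the day), \1 to group 1 (the month)
swapped = re.sub(regex, r"\2 \1", text)
print(swapped)     # 2 Jan and 14 Feb

# count=1 limits the substitution to the first match
first_only = re.sub(regex, r"\g<2> \g<1>", text, count=1)
print(first_only)  # 2 Jan and Feb 14
```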

Optional re flags

In Python regular expressions such as those above, optional flags can be used to modify the behavior of pattern matching. These flags are passed as an extra argument to functions such as search() or findall(), for example re.search(pattern, string, re.IGNORECASE).

IGNORECASE

As the name suggests, re.IGNORECASE makes the pattern case-insensitive, so both 'a' and 'A' match.

DOTALL

re.DOTALL lets the dot metacharacter (.) match any character, including the newline (\n).

MULTILINE

re.MULTILINE lets ^ and $ match at the start and end of each line of the string. By default, ^ and $ match the start and end of the whole string.
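
The effect of each flag is easiest to see side by side. The sketch below applies the three flags to a made-up two-line string and compares each result with the default behavior; the variable names are illustrative.

```python
import re

text = "first line\nSecond LINE"  # made-up two-line sample

# IGNORECASE: "line" also matches "LINE"
case_sensitive = re.findall(r"line", text)
case_insensitive = re.findall(r"line", text, re.IGNORECASE)
print(case_sensitive, case_insensitive)

# DOTALL: "." also matches "\n", so one match can span both lines
no_dotall = re.search(r"first.*LINE", text)
with_dotall = re.search(r"first.*LINE", text, re.DOTALL)
print(no_dotall, repr(with_dotall.group()))

# MULTILINE: "^" matches at the start of every line, not only the string
default_starts = re.findall(r"^\w+", text)
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(default_starts, multiline_starts)
```

Flags can be combined with the bitwise OR operator, for example re.IGNORECASE | re.MULTILINE.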