Removing stop words with NLTK in Python

When a computer processes natural language, some very common words add little value in helping to select documents that match a user's need, so they are excluded from the vocabulary entirely. These words are called stop words.

For example, if you give this input sentence:

John is a person who takes care of the people around him.

After stop word removal, you get this output:

['John', 'person', 'takes', 'care', 'people', 'around', '.']
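
The filtering step itself can be sketched without any library. The version below is a minimal illustration, not NLTK's implementation: `TINY_STOP_WORDS`, `simple_tokenize`, and `drop_stop_words` are hypothetical names, the stop list is hand-picked for this one sentence, and the regex tokenizer is only a rough stand-in for a real tokenizer.

```python
import re

# Hypothetical, hand-picked stop list for illustration only;
# a real stop word list is much longer.
TINY_STOP_WORDS = {"is", "a", "who", "of", "the", "him"}


def simple_tokenize(text):
    # Split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def drop_stop_words(tokens, stop_words):
    # Keep every token that does not appear in the stop list.
    return [tok for tok in tokens if tok not in stop_words]


sentence = "John is a person who takes care of the people around him."
result = drop_stop_words(simple_tokenize(sentence), TINY_STOP_WORDS)
print(result)
# ['John', 'person', 'takes', 'care', 'people', 'around', '.']
```

Note that the final period survives, because punctuation marks are tokens of their own and are not stop words.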

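NLTK ships with a collection of stop words that we can use to remove these words from a given sentence. It lives in the nltk.corpus module, and stopwords.words('english') returns the English list. We can use that list to filter the stop words out of a sentence.

Example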

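# One-time setup (downloads the corpus and tokenizer data):
#   import nltk; nltk.download('stopwords'); nltk.download('punkt')
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

my_sent = "John is a person who takes care of the people around him."
tokens = word_tokenize(my_sent)

# Load the English list once, as a set, instead of calling
# stopwords.words() (all languages) again for every token.
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in tokens if w not in stop_words]

print(filtered_sentence)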

Output

This gives the following output:

['John', 'person', 'takes', 'care', 'people', 'around', '.']
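
One detail worth keeping in mind is letter case. NLTK's English stop words are stored in lowercase, so a capitalized token such as "The" at the start of a sentence is not matched by a direct comparison. The sketch below shows one way to handle this. `remove_stop_words` is a hypothetical helper, and `sample_stop_words` is a small stand-in list used so the snippet runs without NLTK data; with NLTK one would pass `stopwords.words('english')` instead.

```python
def remove_stop_words(tokens, stop_words):
    # Build the lookup set once, and compare lowercased tokens so that
    # capitalized forms such as "The" are matched as well.
    stop_set = {w.lower() for w in stop_words}
    return [tok for tok in tokens if tok.lower() not in stop_set]


# Stand-in stop list (an assumption for this sketch).
sample_stop_words = ["the", "is", "a", "of", "him"]
tokens = ["The", "people", "around", "him", "."]

case_sensitive = [t for t in tokens if t not in sample_stop_words]
case_insensitive = remove_stop_words(tokens, sample_stop_words)

print(case_sensitive)    # ['The', 'people', 'around', '.']
print(case_insensitive)  # ['people', 'around', '.']
```

The original casing of the tokens that are kept is preserved, since only the comparison is lowercased.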