Text Analysis in Python 3

In this article we work with files. Files are an essential part of any computer system, and an operating system itself consists of a large number of files.

Python works with two kinds of files: text files and binary files.
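
As a minimal sketch of that distinction, the snippet below writes a small temporary file and reads it back in both modes. The file name sample.txt and its contents are made up for illustration. Text mode decodes the bytes and returns a str, while binary mode returns the raw bytes.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("CSS3 is fun")

    # Text mode ("r"): the bytes are decoded and read() returns a str.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode ("rb"): read() returns the raw bytes unchanged.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(type(as_text).__name__, type(as_bytes).__name__)
```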

Here we deal with text files only.

We focus on a few useful measurements that can be taken from a text file:

Word count
Character count
Average word length
Stop word count
Special character count
Numeric count
Uppercase word count

We use a test file, "css3.txt", and run each file-based example below against it.

Word count

To count the words in a text, we use the split() method, which is the simplest approach. It breaks the string on whitespace and returns a list of words, and the length of that list is the word count.

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   # Read the whole file into a single string
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "sorry " + filename
   print(message)
else:
   # split() with no argument splits on any run of whitespace
   words = contents.split()
   number_words = len(words)
   print("Total words of " + filename, "is", str(number_words))

Output

Total words of C:/Users/TP/Desktop/css3.txt is 3574
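
To see what split() does without needing css3.txt, here is a small sketch on an in-memory string. The helper name count_words and the sample text are invented for illustration.

```python
def count_words(text):
    """Return the number of whitespace-separated tokens in text."""
    return len(text.split())


# Invented sample with repeated spaces and a newline between rules.
sample = "body {  color: red; }\nh1 { font-size: 2em; }"

print(sample.split())
print(count_words(sample))
```

Note that punctuation stays attached to the neighbouring word and a lone "{" counts as a word, so this is a count of whitespace-separated tokens rather than of dictionary words.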

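Character count

Here we count the characters that make up the words of the file. The length of a word is its number of characters (a word of length 5 has 5 characters), so we add up the lengths of all the words. Whitespace is not counted.

Example code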

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "sorry " + filename
   print(message)
else:
   # Add up the lengths of all whitespace-separated words
   wordslist = contents.split()
   characters = sum(len(word) for word in wordslist)
   print("TOTAL CHARACTERS IN A TEXT FILE =", characters)

Output

TOTAL CHARACTERS IN A TEXT FILE = 17783
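
The count above excludes whitespace, which is easy to overlook. The sketch below contrasts it with len() on the whole string, using an invented sample and a hypothetical helper named count_characters.

```python
def count_characters(text):
    """Count the characters that belong to words, ignoring all whitespace."""
    return sum(len(word) for word in text.split())


# Invented sample: two spaces after "Hello" and a newline before "foo".
sample = "Hello  world\nfoo"

print(len(sample))               # every character, whitespace included
print(count_characters(sample))  # whitespace excluded
```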

Average word length

Here we add up the lengths of all the words and divide the sum by the number of words.

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "sorry " + filename
   print(message)
else:
   wordslist = contents.split()
   words = len(wordslist)
   # Assumes the file contains at least one word; an empty file would divide by zero
   average = sum(len(word) for word in wordslist) / words
   print("Average=", average)

Output

Average= 4.97
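
The file-based example raises ZeroDivisionError when the file contains no words. One possible way to guard against that is sketched below. The function name average_word_length and the choice to return 0.0 for empty input are assumptions, not part of the original example.

```python
def average_word_length(text):
    """Mean token length, or 0.0 when the text has no words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


sample = "to be or not"
print(round(average_word_length(sample), 2))
print(average_word_length(""))
```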

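Stop word count

For this task we use NLTK, a natural language processing (NLP) library for Python. The example removes the stop words from a sample sentence; the tokens that get removed are the stop words.

Example code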

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The 'stopwords' corpus and the tokenizer data must be downloaded once with nltk.download()
my_example_sent = "This is a sample sentence"
mystop_words = set(stopwords.words('english'))
my_word_tokens = word_tokenize(my_example_sent)

# Compare in lowercase, because the NLTK stop word list is lowercase
my_filtered_sentence = [w for w in my_word_tokens if w.lower() not in mystop_words]

print(my_word_tokens)
print(my_filtered_sentence)
# The stop word count is the number of tokens that were filtered out
print(len(my_word_tokens) - len(my_filtered_sentence))
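
If NLTK is not available, the same idea can be sketched with plain Python. The stop word set below is a tiny hand-picked list that is an assumption for this illustration only, and count_stop_words is a hypothetical helper. A real analysis would use a fuller list such as the one NLTK provides.

```python
# Tiny hand-picked stop word set, an assumption for this sketch only.
SAMPLE_STOP_WORDS = {"a", "an", "the", "is", "this", "of", "and"}


def count_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Count tokens whose lowercase form appears in stop_words."""
    return sum(1 for word in text.split() if word.lower() in stop_words)


print(count_stop_words("This is a sample sentence"))
```

Because this version splits on whitespace only, a stop word followed directly by punctuation (for example "is,") would not be matched. A tokenizer such as word_tokenize avoids that.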

Special character count

Here we can count the hashtags or mentions that appear in the text, which helps extract additional information from the text data. The example below counts the tokens that consist of the "#" character alone.

Example code

import collections as ct
filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "sorry " + filename
   print(message)
else:
   words = contents.split()
   special_chars = "#"
   # Counts tokens equal to "#"; a token such as "#tag" is not contained in "#" and is skipped
   new = sum(v for k, v in ct.Counter(words).items() if k in special_chars)
   print("Total Special Characters", new)

Output

Total Special Characters 0
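
To count hashtags or mentions rather than standalone "#" tokens, one option is to test how each token starts. The sketch below is an assumption about what such a count could look like; the helper name count_prefixed and the sample text are invented.

```python
def count_prefixed(text, prefix):
    """Count tokens that start with prefix and contain at least one more character."""
    return sum(
        1
        for word in text.split()
        if word.startswith(prefix) and len(word) > len(prefix)
    )


# Invented sample with two hashtags, one mention and one standalone "#".
sample = "Learning #css3 with @alice and #python # today"

print("hashtags:", count_prefixed(sample, "#"))
print("mentions:", count_prefixed(sample, "@"))
```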

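Numeric count

Here we count the numeric tokens in the text file, in much the same way as we counted the characters in words. Each whitespace-separated token is tested with isdigit(), which is true only when every character in the token is a digit.

Example code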

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "sorry " + filename
   print(message)
else:
   # True counts as 1, so the sum is the number of all-digit tokens
   words = sum(map(str.isdigit, contents.split()))
   print("TOTAL NUMERIC IN A TEXT FILE =", words)

Output

TOTAL NUMERIC IN A TEXT FILE = 2
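
A short sketch on an invented sentence shows which tokens isdigit() accepts. Tokens that mix letters and digits, or that contain a decimal point, are not counted.

```python
# Invented sample sentence for illustration.
tokens = "In 2011 CSS3 added 3 modules and cost 9.5 hours".split()

# Keep only the tokens made up entirely of digits.
numeric_tokens = [t for t in tokens if t.isdigit()]

print(numeric_tokens)
print(len(numeric_tokens))
```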

Uppercase word count

Using the isupper() method, we can count the words in the text that are written entirely in uppercase.

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "sorry " + filename
   print(message)
else:
   # True counts as 1, so the sum is the number of all-uppercase tokens
   words = sum(map(str.isupper, contents.split()))
   print("TOTAL UPPERCASE WORDS IN A TEXT FILE =", words)

Output

TOTAL UPPERCASE WORDS IN A TEXT FILE = 121
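
The sketch below, on an invented sentence, shows how isupper() treats mixed tokens. A token passes when it has at least one letter and all of its letters are uppercase, so digits and punctuation inside the token do not matter, and a purely numeric token is rejected.

```python
# Invented sample sentence for illustration.
tokens = "CSS3 is A W3C standard, NOT css; see 2011".split()

# Keep only the tokens whose letters are all uppercase.
uppercase_tokens = [t for t in tokens if t.isupper()]

print(uppercase_tokens)
print(len(uppercase_tokens))
```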