How can the Iliad dataset be prepared for training using Python?

TensorFlow is an open-source machine learning framework provided by Google. It is used together with Python to implement algorithms, deep learning applications, and much more, both in research and in production.

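The example later in this article also uses the 'tensorflow_text' module, which is distributed separately as the 'tensorflow-text' package. The 'tensorflow' package itself can be installed on Windows using the command below: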

pip install tensorflow

A tensor is the data structure used in TensorFlow. Tensors travel along the edges of a flow diagram known as the 'data flow graph', connecting the operations at its nodes. A tensor is essentially a multidimensional array or list.
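To make the "multidimensional array or list" description concrete, the sketch below uses plain nested Python lists rather than TensorFlow itself. The helper name shape_of is hypothetical, and it assumes the nesting is regular (every sub-list at a given depth has the same length).

```python
def shape_of(value):
    """Return the shape of a regularly nested list as a tuple of ints."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        # Assumption: all siblings have the same length, so the first one
        # is representative of the whole axis.
        value = value[0]
    return tuple(shape)


scalar = 7                        # rank 0: a single number
vector = [1, 2, 3]                # rank 1: one axis of length 3
matrix = [[1, 2, 3], [4, 5, 6]]   # rank 2: 2 rows, 3 columns

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    shape = shape_of(value)
    print(name, "rank:", len(shape), "shape:", shape)
```

In TensorFlow the same values would typically be wrapped with tf.constant, and the resulting tensor reports this information through its shape attribute.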

We will use the Iliad dataset, which contains the text of three translations of the work, by William Cowper, Edward (Earl of Derby), and Samuel Butler. The model is trained to identify the translator when given a single line of text. The text files have been preprocessed, which includes removing the document header and footer, line numbers, and chapter titles.
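The labeling idea behind this task can be sketched in plain Python: every line is paired with an integer that identifies its translator, giving (text, label) pairs of the kind a labeled dataset for this model would hold. The file names follow the TensorFlow text-loading tutorial; the in-memory placeholder lines and the helper label_lines are assumptions made purely for illustration and are not real excerpts from the translations.

```python
# Hypothetical stand-ins for the three preprocessed text files.
# The strings are placeholders, not real lines from the translations.
FILES = {
    "cowper.txt": ["placeholder line a", "placeholder line b"],
    "derby.txt": ["placeholder line c"],
    "butler.txt": ["placeholder line d", "placeholder line e"],
}


def label_lines(files):
    """Pair every line with the index of the file it came from."""
    labeled = []
    for label, lines in enumerate(files.values()):
        for line in lines:
            labeled.append((line, label))
    return labeled


labeled_data = label_lines(FILES)
for text, label in labeled_data:
    print(label, text)
```

A real pipeline would typically also shuffle the combined lines so that the three translators are interleaved rather than grouped.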

We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires no configuration, and provides free access to GPUs (graphics processing units). Colaboratory is built on top of Jupyter Notebook.

Example

Following is the code snippet:

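# Assumes `all_labeled_data` (a tf.data.Dataset of (text, label) pairs)
# was built in an earlier step, as in the credited tutorial.
import tensorflow_text as tf_text

print("Prepare the dataset for training")
# Splits text at whitespace and at changes of Unicode script.
tokenizer = tf_text.UnicodeScriptTokenizer()
print("Defining a function named 'tokenize' to tokenize the text data")
def tokenize(text, unused_label):
    # Lowercase first so that 'But' and 'but' map to the same token.
    lower_case = tf_text.case_fold_utf8(text)
    return tokenizer.tokenize(lower_case)
# The label is dropped here; each element becomes a 1-D tensor of tokens.
tokenized_ds = all_labeled_data.map(tokenize)
print("Iterate over the dataset and print a few samples")
for text_batch in tokenized_ds.take(6):
    print("Tokens: ", text_batch.numpy())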

Code credit: https://www.tensorflow.org/tutorials/load_data/text

Output

Prepare the dataset for training
Defining a function named 'tokenize' to tokenize the text data
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: batch_gather (from
tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.
Iterate over the dataset and print a few samples
Tokens: [b'but' b'i' b'have' b'now' b'both' b'tasted' b'food' b',' b'and' b'given']
Tokens: [b'all' b'these' b'shall' b'now' b'be' b'thine' b':' b'but' b'if' b'the'
b'gods']
Tokens: [b'their' b'spiry' b'summits' b'waved' b'.' b'there' b',' b'unperceived']
Tokens: [b'"' b'i' b'pray' b'you' b',' b'would' b'you' b'show' b'your' b'love'
b',' b'dear' b'friends' b',']
Tokens: [b'entering' b'beneath' b'the' b'clavicle' b'the' b'point']
Tokens: [b'but' b'grief' b',' b'his' b'father' b'lost' b',' b'awaits' b'him'
b'now' b',']

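Explanation

The 'tokenize' function lowercases each line and splits it into tokens, discarding whitespace; as the output shows, punctuation marks become tokens of their own.
The function is applied to every element of the dataset with 'map', and the label is dropped in the process.
A few samples of the tokenized dataset are displayed on the console.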
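As a rough, library-free illustration of what the tokenization step produces, the sketch below lowercases a line and splits it into word and punctuation tokens with a regular expression. This is only an approximation: tf_text.UnicodeScriptTokenizer splits on Unicode script boundaries, which a simple regex does not reproduce for every input (for instance, it may group adjacent punctuation differently), and str.casefold stands in for case_fold_utf8. The function name approx_tokenize is hypothetical, and the sample sentence is reconstructed from the first row of tokens in the output above, so its capitalization is a guess.

```python
import re

# Runs of word characters, or a single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def approx_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.casefold())


tokens = approx_tokenize("But I have now both tasted food, and given")
print(tokens)
```

Unlike the TensorFlow version, this returns ordinary Python strings rather than byte strings inside a tensor, which is why the printed tokens lack the b'...' prefix seen in the output above.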