Anaconda Learning
Data Preparation for Large Language Models
Transforming and cleaning data for LLMs.
31 lessons
Getting Started (03:27)
Getting started with Anaconda Notebooks
Course Overview and Learning Objectives
Introduction to Language Model Data (38:06)
What is Natural Language Processing?
What Are Large Language Models?
How Computers See Text
Strengths and Limits of Large Language Models
Concerns in Procuring LLM Data
Exercise: Procuring LLM Data
Cleaning Data for Language Models (30:34)
Text Cleaning
Manual Tokenization
Using the Natural Language Toolkit (NLTK)
Stemming
Using spaCy
Exercise: Tokenize Text
Vectorizing and Encoding (30:06)
Converting Text to Numbers
Word Counts
Word Frequencies
Word Hashing
Binary and Other Parameters
Exercise: Vectorize Text
Bag of Words Project (24:23)
Data Preparation with a Real-world Dataset
Cleaning and Tokenizing the Data
Vectorizing the Data
Exercise: Data Quality
Word Embeddings (33:07)
What are Word Embeddings?
Word2Vec and GloVe
Word Embedding with Gensim
Word Embedding with Gensim Continued
Exercise: Word Embedding
Conclusion (02:55)
Summary
End of course survey