A comprehensive guide to NLP workflows using Hugging Face, focusing on data preparation, model training, and practical applications.
Getting Started • Example Notebooks • Contributions • Contact
Welcome to the NLP Essentials with Hugging Face repository! 🎉 This repository is a collection of useful examples and notebooks to help you get the most out of the Hugging Face ecosystem, including datasets, tokenizers, collators, and various NLP models.
- Work with Hugging Face Datasets: Learn how to load, manipulate, and utilize datasets effectively.
- Leverage Tokenizers: Explore different tokenization strategies and how to implement them in your projects.
- Implement Data Collators: Understand how to use data collators for different data chunking strategies.
- Develop NLP Applications: Discover how to implement NLP models for tasks such as Named Entity Recognition (NER), Question Answering, and more.
Learn how to load, process, and tokenize datasets using Hugging Face’s datasets and transformers libraries. Explore different data collation strategies, including padding and truncation, to prepare batches for training.
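For example, loading and tokenizing a dataset takes only a few lines. The sketch below is a minimal illustration, not part of the notebooks; the imdb dataset and bert-base-uncased checkpoint are placeholder choices.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder choices: any dataset with a text column and any checkpoint work.
dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate to the model's maximum input length; leave padding to a
    # collator so each batch is only padded to its longest sequence.
    return tokenizer(batch["text"], truncation=True)

tokenized_dataset = dataset.map(tokenize, batched=True)
print(tokenized_dataset[0]["input_ids"][:10])
```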
1) Notebook | tokenizer-three-approaches: This notebook provides a walkthrough of chunking and splitting text with a Hugging Face tokenizer.
2) Notebook | tokenizer-three-approaches-with-chat-template: This notebook provides a walkthrough of chunking and splitting text with a Hugging Face tokenizer using a Chat Template.
3) Notebook | dataset-collator: This notebook demonstrates how to use a data collator effectively to prepare inputs for a model; a minimal collator sketch follows this list.
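As a quick illustration of the collator notebook's topic, the sketch below pads a small batch dynamically with DataCollatorWithPadding; the checkpoint name is a placeholder.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Tokenize without padding; the collator pads each batch to its longest member.
features = [tokenizer(text) for text in ["Short example.", "A noticeably longer example sentence."]]
batch = collator(features)
print(batch["input_ids"].shape)  # (2, length of the longest sequence in the batch)
```

Dynamic per-batch padding like this avoids padding every example to a single global length, which saves compute when sequence lengths vary widely.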
Truncation and chunking strategies are crucial when input sequences exceed a model's maximum input length, and the choice of strategy can significantly affect model performance and prediction quality. Three common approaches are listed below, followed by a short sketch of each.
- Approach 1: Right-Side Truncation: Handle sequences that exceed the model’s maximum input length by truncating tokens from the right side. Right-side truncation keeps the beginning of the sequence and discards the trailing (most recent) tokens.
- Approach 2: Left-Side Truncation: Truncate sequences from the left side, keeping the most recent information.
- Approach 3: Chunking with Overlap: Split long sequences into overlapping chunks to preserve context while handling large text inputs.
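A minimal sketch of all three approaches, assuming a fast tokenizer; the checkpoint, max_length, and stride values are placeholder choices.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
long_text = " ".join(["word"] * 1000)  # toy input far longer than max_length

# Approach 1: right-side truncation (the default) keeps the beginning.
right = tokenizer(long_text, truncation=True, max_length=32)

# Approach 2: left-side truncation keeps the end of the sequence.
tokenizer.truncation_side = "left"
left = tokenizer(long_text, truncation=True, max_length=32)
tokenizer.truncation_side = "right"  # restore the default

# Approach 3: split into overlapping chunks; stride is the token overlap
# shared between consecutive chunks, preserving context across boundaries.
chunks = tokenizer(
    long_text,
    truncation=True,
    max_length=32,
    stride=8,
    return_overflowing_tokens=True,
)
print(len(right["input_ids"]), len(left["input_ids"]), len(chunks["input_ids"]))
```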
Padding and Truncation
Below is the padding and truncation reference table from the Hugging Face documentation.
| Truncation | Padding | Instruction |
|---|---|---|
| no truncation | no padding | tokenizer(batch_sentences) |
| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True) or tokenizer(batch_sentences, padding='longest') |
| | padding to max model input length | tokenizer(batch_sentences, padding='max_length') |
| | padding to specific length | tokenizer(batch_sentences, padding='max_length', max_length=42) |
| | padding to a multiple of a value | tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8) |
| truncation to max model input length | no padding | tokenizer(batch_sentences, truncation=True) or tokenizer(batch_sentences, truncation=STRATEGY) |
| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True, truncation=True) or tokenizer(batch_sentences, padding=True, truncation=STRATEGY) |
| | padding to max model input length | tokenizer(batch_sentences, padding='max_length', truncation=True) or tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY) |
| | padding to specific length | Not possible |
| truncation to specific length | no padding | tokenizer(batch_sentences, truncation=True, max_length=42) or tokenizer(batch_sentences, truncation=STRATEGY, max_length=42) |
| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True, truncation=True, max_length=42) or tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42) |
| | padding to max model input length | Not possible |
| | padding to specific length | tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42) or tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42) |
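As a quick check of two rows from the table, the sketch below pads to the longest sequence in the batch and then pads and truncates to a specific length; the checkpoint is a placeholder.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
batch_sentences = ["A short sentence.", "A somewhat longer sentence to pad against."]

# Pad to the longest sequence in the batch; truncate at the model maximum.
batch = tokenizer(batch_sentences, padding=True, truncation=True)

# Pad and truncate every sequence to a specific length (42, as in the table).
fixed = tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=42)
print(len(fixed["input_ids"][0]), len(fixed["input_ids"][1]))  # both print 42
```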
Happy coding! 🛠️
