A comprehensive guide to NLP workflows using Hugging Face, focusing on data preparation, model training, and practical applications.
Getting Started • Example Notebooks • Contributions • Contact
Welcome to the NLP Essentials with Hugging Face repository! 🎉 This repository is a collection of useful examples and notebooks to help you get the most out of the Hugging Face ecosystem, including datasets, tokenizers, collators, and various NLP models.
- Work with Hugging Face Datasets: Learn how to load, manipulate, and utilize datasets effectively.
- Leverage Tokenizers: Explore different tokenization strategies and how to implement them in your projects.
- Implement Data Collators: Understand how to use data collators for different data chunking strategies.
- Develop NLP Applications: Discover how to implement NLP models for tasks such as Named Entity Recognition (NER), Question Answering, and more.
Learn how to load, process, and tokenize datasets using Hugging Face’s datasets and transformers libraries. Explore different data collation strategies, including padding and truncation, to prepare batches for training.
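For example, loading and tokenizing a dataset takes only a few lines. The sketch below is a minimal illustration, not part of the notebooks; the imdb dataset and bert-base-uncased checkpoint are placeholder choices.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder choices: any dataset with a text column and any checkpoint work.
dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate to the model's maximum input length; leave padding to a
    # collator so each batch is only padded to its longest sequence.
    return tokenizer(batch["text"], truncation=True)

tokenized_dataset = dataset.map(tokenize, batched=True)
print(tokenized_dataset[0]["input_ids"][:10])
```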
1) Notebook | tokenizer-three-approaches: This notebook provides a walkthrough of chunking and splitting text with a Hugging Face tokenizer.
2) Notebook | tokenizer-three-approaches-with-chat-template: This notebook provides a walkthrough of chunking and splitting text with a Hugging Face tokenizer using a Chat Template.
3) Notebook | dataset-collator: This notebook demonstrates how to use a data collator effectively to prepare inputs for a model; a minimal collator sketch follows this list.
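As a quick illustration of the collator notebook's topic, the sketch below pads a small batch dynamically with DataCollatorWithPadding; the checkpoint name is a placeholder.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Tokenize without padding; the collator pads each batch to its longest member.
features = [tokenizer(text) for text in ["Short example.", "A noticeably longer example sentence."]]
batch = collator(features)
print(batch["input_ids"].shape)  # (2, length of the longest sequence in the batch)
```

Dynamic per-batch padding like this avoids padding every example to a single global length, which saves compute when sequence lengths vary widely.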
Truncation and chunking strategies are crucial when input sequences exceed a model's maximum input length, and the choice of strategy can significantly affect model performance and prediction quality. Three common approaches are listed below, followed by a short sketch of each.
- Approach 1: Right-Side Truncation: Handle sequences that exceed the model’s maximum input length by truncating tokens from the right side. Right-side truncation keeps the beginning of the sequence and discards the trailing (most recent) tokens.
- Approach 2: Left-Side Truncation: Truncate sequences from the left side, keeping the most recent information.
- Approach 3: Chunking with Overlap: Split long sequences into overlapping chunks to preserve context while handling large text inputs.
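A minimal sketch of all three approaches, assuming a fast tokenizer; the checkpoint, max_length, and stride values are placeholder choices.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
long_text = " ".join(["word"] * 1000)  # toy input far longer than max_length

# Approach 1: right-side truncation (the default) keeps the beginning.
right = tokenizer(long_text, truncation=True, max_length=32)

# Approach 2: left-side truncation keeps the end of the sequence.
tokenizer.truncation_side = "left"
left = tokenizer(long_text, truncation=True, max_length=32)
tokenizer.truncation_side = "right"  # restore the default

# Approach 3: split into overlapping chunks; stride is the token overlap
# shared between consecutive chunks, preserving context across boundaries.
chunks = tokenizer(
    long_text,
    truncation=True,
    max_length=32,
    stride=8,
    return_overflowing_tokens=True,
)
print(len(right["input_ids"]), len(left["input_ids"]), len(chunks["input_ids"]))
```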
Padding and Truncation
Below is the padding and truncation reference table from the Hugging Face documentation.
| Truncation | Padding | Instruction |
|---|---|---|
| no truncation | no padding | tokenizer(batch_sentences) |
| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True) or tokenizer(batch_sentences, padding='longest') |
| | padding to max model input length | tokenizer(batch_sentences, padding='max_length') |
| | padding to specific length | tokenizer(batch_sentences, padding='max_length', max_length=42) |
| | padding to a multiple of a value | tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8) |
| truncation to max model input length | no padding | tokenizer(batch_sentences, truncation=True) or tokenizer(batch_sentences, truncation=STRATEGY) |
| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True, truncation=True) or tokenizer(batch_sentences, padding=True, truncation=STRATEGY) |
| | padding to max model input length | tokenizer(batch_sentences, padding='max_length', truncation=True) or tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY) |
| | padding to specific length | Not possible |
| truncation to specific length | no padding | tokenizer(batch_sentences, truncation=True, max_length=42) or tokenizer(batch_sentences, truncation=STRATEGY, max_length=42) |
| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True, truncation=True, max_length=42) or tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42) |
| | padding to max model input length | Not possible |
| | padding to specific length | tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42) or tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42) |
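As a quick check of two rows from the table, the sketch below pads to the longest sequence in the batch and then pads and truncates to a specific length; the checkpoint is a placeholder.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
batch_sentences = ["A short sentence.", "A somewhat longer sentence to pad against."]

# Pad to the longest sequence in the batch; truncate at the model maximum.
batch = tokenizer(batch_sentences, padding=True, truncation=True)

# Pad and truncate every sequence to a specific length (42, as in the table).
fixed = tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=42)
print(len(fixed["input_ids"][0]), len(fixed["input_ids"][1]))  # both print 42
```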
Happy coding! 🛠️
