Conversation

Contributor

Copilot AI commented Oct 29, 2025

Adds a comprehensive how-to guide for the Microsoft.ML.Tokenizers library, addressing the documentation gap for this ML.NET tokenization package, which is increasingly used in AI scenarios.

Changes

  • New how-to article at docs/ai/how-to/use-tokenizers.md

    • Installation instructions including required data packages
    • Tiktoken tokenizer examples (GPT-4/OpenAI models)
    • Llama tokenizer examples with Hugging Face model loading
    • BPE tokenization demonstrations
    • Advanced encoding options (normalization, pretokenization control)
    • Migration guidance from DeepDev.TokenizerLib and SharpToken
  • Working code examples in docs/ai/how-to/snippets/use-tokenizers/csharp/TokenizersExamples/

    • Complete, compilable project with all tokenizer types
    • Demonstrates token counting, encoding/decoding, and text trimming operations
  • TOC and cross-references

    • Added "Tokenization" section to AI docs TOC
    • Linked from existing "Understanding tokens" conceptual article

Example usage

// Initialize tokenizer for GPT-4o (cache this instance)
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

string text = "Text tokenization is the process of splitting a string into a list of tokens.";

// Count tokens for cost estimation
Console.WriteLine($"Tokens: {tokenizer.CountTokens(text)}");  // Output: 16

// Encode to token IDs
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

// Trim text to token limit
// Trim text to a token limit. The out parameter is the normalized text,
// or null if no normalization was applied, so fall back to the original.
int index = tokenizer.GetIndexByTokenCount(text, 5, out string? processed, out _);
Console.WriteLine($"First 5 tokens: {(processed ?? text).Substring(0, index)}");
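The PR's snippets also demonstrate decoding. A minimal round-trip sketch, assuming the same cached gpt-4o Tiktoken tokenizer and the base `Tokenizer.Decode` method for mapping IDs back to text:

```csharp
using Microsoft.ML.Tokenizers;

// Reuse the cached tokenizer instance from above.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

string text = "Text tokenization is the process of splitting a string into a list of tokens.";

// Encode to token IDs, then decode back to verify the round trip.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
string decoded = tokenizer.Decode(ids);

// For plain ASCII input like this, the round trip is lossless.
Console.WriteLine(decoded == text);
```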

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • huggingface.co
    • Triggering command: /home/REDACTED/work/docs/docs/docs/ai/how-to/snippets/use-tokenizers/csharp/TokenizersExamples/bin/Debug/net9.0/TokenizersExamples (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details the original issue you should resolve.

<issue_title>Add content for Microsoft.ML.Tokenizers</issue_title>
<issue_description>Add a conceptual doc for using the Microsoft.ML.Tokenizers package, which is technically part of the ML.NET set of libraries. This is an area with little documentation beyond the NuGet README, but it has received more investment in the past few months.

The new article should either live with the ML.NET docs or the .NET AI docs.

Content from NuGet readme:

About

Microsoft.ML.Tokenizers provides an abstraction for tokenizers as well as implementations of common tokenization algorithms.

Key Features

  • Extensible tokenizer architecture that allows for specialization of Normalizer, PreTokenizer, Model/Encoder, Decoder
  • BPE - Byte pair encoding model
  • English Roberta model
  • Tiktoken model
  • Llama model
  • Phi2 model

How to Use

using Microsoft.ML.Tokenizers;
using System.IO;
using System.Net.Http;

//
// Using Tiktoken Tokenizer
//

// Initialize the tokenizer for the `gpt-4o` model. This instance should be cached for all subsequent use.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

string source = "Text tokenization is the process of splitting a string into a list of tokens.";

Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16

var trimIndex = tokenizer.GetIndexByTokenCountFromEnd(source, 5, out string processedText, out _);
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end:  a list of tokens.

trimIndex = tokenizer.GetIndexByTokenCount(source, 5, out processedText, out _);
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the

IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13

//
// Using Llama Tokenizer
//

// Open a stream to the remote Llama tokenizer model data file.
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);

// Create the Llama tokenizer using the remote stream. This should be cached for all subsequent use.
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);

string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991

Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5
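The Key Features list above also mentions the BPE model, which the README's How to Use section doesn't cover. A hedged sketch of constructing a BPE tokenizer, assuming the `BpeTokenizer.Create` overloads that accept vocabulary and merges inputs; the file names here are placeholders, not files shipped with the package:

```csharp
using Microsoft.ML.Tokenizers;

// vocab.json and merges.txt are placeholder paths; supply real
// GPT-2-style vocabulary and merges files for your model.
Tokenizer bpeTokenizer = BpeTokenizer.Create("vocab.json", "merges.txt");

string input = "Hello, world!";

// The same base Tokenizer API applies regardless of the underlying model.
IReadOnlyList<int> bpeIds = bpeTokenizer.EncodeToIds(input);
Console.WriteLine($"Tokens: {bpeTokenizer.CountTokens(input)}");
```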

Main Types

The main types provided by this library are:

  • Microsoft.ML.Tokenizers.Tokenizer
  • Microsoft.ML.Tokenizers.BpeTokenizer
  • Microsoft.ML.Tokenizers.EnglishRobertaTokenizer
  • Microsoft.ML.Tokenizers.TiktokenTokenizer
  • Microsoft.ML.Tokenizers.Normalizer
  • Microsoft.ML.Tokenizers.PreTokenizer
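The `Normalizer` and `PreTokenizer` types above relate to the article's "advanced encoding options" bullet. A sketch of inspecting individual tokens and bypassing normalization/pretokenization, assuming the `EncodeToTokens` method and the `considerPreTokenization`/`considerNormalization` parameters found in recent package versions (names may differ across releases):

```csharp
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
string text = "Text tokenization is the process.";

// Inspect individual tokens (ID and string value), not just raw IDs.
foreach (EncodedToken token in tokenizer.EncodeToTokens(text, out string? normalized))
{
    Console.WriteLine($"{token.Id}: '{token.Value}'");
}

// Skip normalization and pretokenization when the input is already prepared.
IReadOnlyList<int> rawIds = tokenizer.EncodeToIds(
    text, considerPreTokenization: false, considerNormalization: false);
```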

Additional Documentation

Comments on the Issue (you are @copilot in this section)




Internal previews

File | Preview link
docs/ai/conceptual/understanding-tokens.md | Understand tokens
docs/ai/how-to/use-tokenizers.md | Use Microsoft.ML.Tokenizers for text tokenization
docs/ai/toc.yml | docs/ai/toc

Copilot AI and others added 3 commits October 29, 2025 23:00
Co-authored-by: gewarren <24882762+gewarren@users.noreply.github.com>
Co-authored-by: gewarren <24882762+gewarren@users.noreply.github.com>
Co-authored-by: gewarren <24882762+gewarren@users.noreply.github.com>
Copilot AI changed the title [WIP] Add conceptual documentation for Microsoft.ML.Tokenizers Add Microsoft.ML.Tokenizers documentation to .NET AI docs Oct 29, 2025
Copilot AI requested a review from gewarren October 29, 2025 23:06
Copilot finished work on behalf of gewarren October 29, 2025 23:06
@gewarren gewarren marked this pull request as ready for review October 30, 2025 00:25
@gewarren gewarren requested a review from a team as a code owner October 30, 2025 00:25
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds comprehensive documentation for the Microsoft.ML.Tokenizers library, introducing a new how-to guide that covers tokenization concepts and provides practical examples for working with different tokenizer models in .NET applications.

Key changes:

  • New how-to article explaining Microsoft.ML.Tokenizers usage with examples for Tiktoken, Llama, and BPE tokenizers.
  • Code examples demonstrating common tokenization operations including encoding, decoding, token counting, and text trimming.
  • Integration with existing documentation through TOC updates and cross-references.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

File | Description
docs/ai/toc.yml | Adds new "Tokenization" section with link to the tokenizers guide
docs/ai/how-to/use-tokenizers.md | New how-to article covering Microsoft.ML.Tokenizers library usage
docs/ai/how-to/snippets/use-tokenizers/csharp/TokenizersExamples/TokenizersExamples.csproj | Project file for tokenizer code examples
docs/ai/how-to/snippets/use-tokenizers/csharp/TokenizersExamples/TiktokenExample.cs | Example code demonstrating Tiktoken tokenizer usage
docs/ai/how-to/snippets/use-tokenizers/csharp/TokenizersExamples/Program.cs | Entry point for running all tokenizer examples
docs/ai/how-to/snippets/use-tokenizers/csharp/TokenizersExamples/LlamaExample.cs | Example code for Llama tokenizer usage
docs/ai/how-to/snippets/use-tokenizers/csharp/TokenizersExamples/BpeExample.cs | Example code for BPE tokenizer usage
docs/ai/how-to/snippets/use-tokenizers/csharp/TokenizersExamples/.gitignore | Ignores build artifacts for example project
docs/ai/conceptual/understanding-tokens.md | Adds cross-reference to new tokenizers guide

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

Add content for Microsoft.ML.Tokenizers
