MITLibraries/timdex-embeddings

timdex-embeddings

A CLI application for creating embeddings for TIMDEX.

Development

  • To preview a list of available Makefile commands: make help
  • To install with dev dependencies: make install
  • To update dependencies: make update
  • To run unit tests: make test
  • To lint the repo: make lint
  • To run the app: embeddings --help (the command name matches the entry under project.scripts in pyproject.toml)

Environment Variables

Required

SENTRY_DSN=### If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
WORKSPACE=### Set to `dev` for local development; Terraform sets this to `stage` or `prod` in those environments.

Optional

TE_MODEL_URI=# HuggingFace model URI
TE_MODEL_PATH=# Path where the model will be downloaded to and loaded from
HF_HUB_DISABLE_PROGRESS_BARS=# Boolean to disable progress bars for HuggingFace model downloads; defaults to 'true' in deployed contexts

Configuring an Embedding Model

This CLI application is designed to create embeddings for input texts. To do this, a pre-trained model must be identified and configured for use.

To this end, there is a base embedding class BaseEmbeddingModel that is designed to be extended and customized for a particular embedding model.
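
As a rough illustration of the extension pattern described above, the following sketch defines a stand-in base class and one toy subclass. The real BaseEmbeddingModel interface is not shown in this README, so the constructor arguments and the method names `load` and `create_embedding` are assumptions, and `ToyHashEmbeddingModel` is a hypothetical placeholder rather than a real model.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseEmbeddingModel(ABC):
    """Stand-in for the project's base class; this interface is assumed."""

    def __init__(self, model_uri: str, model_path: str | None = None) -> None:
        self.model_uri = model_uri
        self.model_path = model_path

    @abstractmethod
    def load(self) -> None:
        """Load the model from its local snapshot."""

    @abstractmethod
    def create_embedding(self, text: str) -> list[float]:
        """Return an embedding vector for a single input text."""


class ToyHashEmbeddingModel(BaseEmbeddingModel):
    """Hypothetical subclass: counts tokens into fixed buckets, not a real model."""

    dimensions = 8

    def load(self) -> None:
        # A real subclass would read model weights from self.model_path here.
        self.loaded = True

    def create_embedding(self, text: str) -> list[float]:
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            # Deterministic bucket choice based on the token's UTF-8 bytes.
            bucket = sum(token.encode("utf-8")) % self.dimensions
            vector[bucket] += 1.0
        return vector
```

A real subclass would replace the body of `create_embedding` with a call into the chosen pre-trained model; the toy version only shows where that logic would live.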

Once an embedding class has been created, the preferred approach is to set env vars TE_MODEL_URI and TE_MODEL_PATH directly in the Dockerfile to a) download a local snapshot of the model during image build, and b) set this model as the default for the CLI.

With these set, the CLI can be invoked without specifying a model URI or local path, and this model serves as the default, e.g.:

uv run --env-file .env embeddings test-model-load
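
The fallback from command-line options to TE_MODEL_URI and TE_MODEL_PATH can be sketched in a few lines. This is not the application's actual code: it uses the standard library's argparse for illustration, and the function name `resolve_model_config` and the program name `embeddings-sketch` are hypothetical.

```python
import argparse
import os


def resolve_model_config(argv, environ=None):
    """Return (model_uri, model_path), preferring CLI options over env vars."""
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(prog="embeddings-sketch")
    # Env vars supply the defaults, so the options can be omitted when set.
    parser.add_argument("--model-uri", default=environ.get("TE_MODEL_URI"))
    parser.add_argument("--model-path", default=environ.get("TE_MODEL_PATH"))
    args = parser.parse_args(argv)

    missing = [
        name
        for name, value in (
            ("--model-uri / TE_MODEL_URI", args.model_uri),
            ("--model-path / TE_MODEL_PATH", args.model_path),
        )
        if not value
    ]
    if missing:
        # argparse prints usage and exits with status 2.
        parser.error("missing required value(s): " + ", ".join(missing))

    return args.model_uri, args.model_path
```

Under this sketch, an image that sets both variables at build time needs no model arguments at run time, while an explicit `--model-uri` or `--model-path` still takes precedence.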

CLI Commands

For local development, all CLI commands should be invoked with the following format to pick up environment variables from .env:

uv run --env-file .env embeddings <COMMAND> <ARGS>

ping

Usage: embeddings ping [OPTIONS]

  Emit 'pong' to debug logs and stdout.

download-model

Usage: embeddings download-model [OPTIONS]

  Download a model from HuggingFace and save locally.

Options:
  --model-uri TEXT   HuggingFace model URI (e.g., 'org/model-name')
                     [required]
  --model-path PATH  Path where the model will be downloaded to and loaded
                     from, e.g. '/path/to/model'.  [required]
  --help             Show this message and exit.

test-model-load

Usage: embeddings test-model-load [OPTIONS]

  Test loading of embedding class and local model based on env vars.

  In a deployed context, the following env vars are expected:
    - TE_MODEL_URI
    - TE_MODEL_PATH

  With these set, the embedding class should be registered successfully and
  initialized, and the model loaded from a local copy.

  This CLI command is NOT used during normal workflows.  This is used primarily
  during development and after model downloading/loading changes to ensure the
  model loads correctly.

Options:
  --model-uri TEXT   HuggingFace model URI (e.g., 'org/model-name')
                     [required]
  --model-path PATH  Path where the model will be downloaded to and loaded
                     from, e.g. '/path/to/model'.  [required]
  --help             Show this message and exit.

create-embeddings

Usage: embeddings create-embeddings [OPTIONS]

  Create embeddings for TIMDEX records.

Options:
  --model-uri TEXT             HuggingFace model URI (e.g., 'org/model-name')
                               [required]
  --model-path PATH            Path where the model will be downloaded to and
                               loaded from, e.g. '/path/to/model'.  [required]
  -d, --dataset-location PATH  TIMDEX dataset location, e.g.
                               's3://timdex/dataset', to read records from.
                               [required]
  --run-id TEXT                TIMDEX ETL run id.  [required]
  --run-record-offset INTEGER  TIMDEX ETL run record offset to start from,
                               default = 0.  [required]
  --record-limit INTEGER       Limit number of records after --run-record-
                               offset, default = None (unlimited).  [required]
  --strategy [full_record]     Pre-embedding record transformation strategy.
                               Repeatable to apply multiple strategies.
                               [required]
  --output-jsonl TEXT          Optionally write embeddings to local JSONLines
                               file (primarily for testing).
  --help                       Show this message and exit.
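
When testing with --output-jsonl, the resulting file can be inspected with a small reader like the one below. This is a minimal sketch that assumes only the JSONLines convention of one JSON object per line; the fields inside each record are not documented here, so the helper (a hypothetical `read_jsonl`) makes no assumptions about them.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONLines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def count_records(path):
    """Count the records in a JSONLines file without keeping them in memory."""
    return sum(1 for _ in read_jsonl(path))
```

For example, `count_records("embeddings.jsonl")` (a placeholder filename) gives a quick check that the number of rows written matches the --record-limit that was requested.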
