Object Model Reference

Your inference service is described and controlled via two primary objects: Packaged Model and Deployment. Additionally, when referencing a Packaged Model via an external hosting method, such as an S3 bucket or the huggingface.co registry, you may need to configure a Registry to enable that access.

Deployment

The Deployment object controls when a Packaged Model is deployed. A sketch of creating a deployment via the REST interface follows the attribute lists below. The Deployment object has the following attributes:

Input Attributes

  • name The name of the deployment, used to access it via the REST interface or CLI. This is also the name of the associated KServe inference service that will be created.
  • namespace The Kubernetes namespace into which the service is deployed. It must already exist.
  • model The name (or id) of the packaged model to be deployed.
  • security Encapsulates the security option (authenticationRequired) for the deployed service.
  • autoScaling Controls the scaling limits (minReplicas/maxReplicas), the metric that drives scaling, and the metric's target value.
  • canaryTrafficPercent The percentage of traffic to route to this particular model version. The default is 100.
  • goalStatus Specifies the intended status to be achieved by the deployment. The default is Ready.
    • Ready
      The inference service will be deployed to enable inference calls.
    • Paused
      The inference service will be stopped and will no longer accept calls.
  • environment Environment variables to be provided to the container image when started.
  • arguments Arguments to be passed to the container image when started. These are in addition to any configured on the packaged model.

Managed Attributes

  • id A unique identifier for this deployed service.
  • status Summary status of the deployed service.
    • Deploying
      Service configuration is in progress.
    • Failed
      The service configuration failed.
    • Ready
      The service has been successfully configured and is serving.
    • Updating
      A new service revision is being rolled out.
    • UpdateFailed
      The current service revision failed to roll out due to an error. The prior version is still serving requests.
    • Deleting
      The deployed service is being removed.
    • Paused
      The deployed service has been stopped by the user or an external action.
    • Unknown
      Unable to determine the status.
    • Canceled
      The deployed service has been canceled.
  • state State details of the current service configuration requested. See the DeploymentStateDetails component for details.
  • secondaryState State details of a prior service configuration until the currently requested configuration has been fully rolled out.
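
The following minimal sketch shows how the input attributes above might be assembled into a deployment request. The REST host, endpoint path (/api/v1/deployments), authentication header, and metric name are illustrative assumptions, not a documented API; the field names mirror the attribute list above.

    import requests

    # Assumed host, path, and credentials for illustration only;
    # substitute the REST interface of your installation.
    BASE_URL = "https://mlis.example.com"
    HEADERS = {"Authorization": "Bearer <token>"}

    deployment = {
        "name": "sentiment-svc",          # also the KServe inference service name
        "namespace": "models",            # must already exist
        "model": "sentiment-model",       # packaged model name or id
        "security": {"authenticationRequired": True},
        "autoScaling": {
            "minReplicas": 1,
            "maxReplicas": 4,
            "metric": "concurrency",      # assumed metric name
            "target": 10,
        },
        "canaryTrafficPercent": 100,      # default
        "goalStatus": "Ready",            # or "Paused"
        "environment": {"LOG_LEVEL": "info"},
        "arguments": ["--workers=2"],     # appended to the packaged model's arguments
    }

    resp = requests.post(f"{BASE_URL}/api/v1/deployments",
                         json=deployment, headers=HEADERS)
    resp.raise_for_status()
    print(resp.json()["id"])              # the managed id assigned by the server

Once created, the managed attributes (id, status, state, secondaryState) are populated by the server and can be read back but not set.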

PackagedModel

The Packaged Model object identifies the model and code that make up your inference service. The code may be provided via a container image or via an external hosting method (an S3 bucket or the huggingface.co registry). A sketch of defining a packaged model follows the attribute lists below. The Packaged Model object has the following attributes:

Input Attributes

  • name The name of the model.
  • description A text description of the model.
  • modelFormat The model format for downloaded models (e.g. from S3, HTTP, etc.).
    • custom
      The packaged model is provided in a container image.
    • bento-archive
      The packaged model is a bento archive file (.bento). It will be downloaded, expanded, and then served using the bentoml serve command in a provided bentoml serving container.
    • openllm
      The packaged model will be served using the openllm serve command in a provided openllm serving container.
    • nim
      The packaged model will be served using the specified NVIDIA NIM microservices container image.
  • registry The name or id of a registry object. If the model data is not provided via a container image, this must be specified.
  • url Reference to the Bento or model to be served.
    • openllm
      An OpenLLM model name from huggingface.co, dynamically loaded and executed with a vLLM backend.
    • s3
      An OpenLLM model path which will be dynamically downloaded from an associated S3 registry bucket.
    • ngc
      An NVIDIA NGC model that will be dynamically downloaded from the associated NGC registry bucket and executed with the specified NVIDIA NIM microservices container image.
  • resources The resource requirements for running the service (requests/limits) for cpu/memory/gpu.
  • image The containerized bento to be deployed as the inference service.
  • environment Environment variables to be provided to the container image when started. See Packaged Model Environment Variables for a list of default options.
  • arguments Arguments to be passed to the container image when started.

Managed Attributes

  • id A unique identifier for this particular model version.
  • version An integer version number that increments automatically as you make changes to the model.
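
As a sketch under the same assumptions as the deployment example above (hypothetical host, path, and credentials), a packaged model in the openllm format might be defined like this; the field names mirror the attribute list above:

    import requests

    BASE_URL = "https://mlis.example.com"          # assumed host
    HEADERS = {"Authorization": "Bearer <token>"}  # assumed credentials

    packaged_model = {
        "name": "sentiment-model",
        "description": "OpenLLM-served sentiment model",
        "modelFormat": "openllm",
        "registry": "hf-registry",        # required: model data is not in an image
        "url": "facebook/opt-125m",       # an OpenLLM model name on huggingface.co
        "resources": {
            "requests": {"cpu": "2", "memory": "8Gi", "gpu": "1"},
            "limits": {"cpu": "4", "memory": "16Gi", "gpu": "1"},
        },
        "environment": {},                # see Packaged Model Environment Variables
        "arguments": [],
    }

    resp = requests.post(f"{BASE_URL}/api/v1/models",
                         json=packaged_model, headers=HEADERS)
    resp.raise_for_status()
    created = resp.json()
    print(created["id"], created["version"])       # managed attributes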

Registry

The Registry object provides the metadata that describes how to download a Packaged Model for deployment. A sketch of defining a registry follows the attribute lists below.

Input Attributes

  • name The name of the registry, used to access it via the REST interface or CLI.
  • description A text description of the registry.
  • type The type of this model registry.
    • s3
      Configuration to enable access to an S3 bucket.
    • openllm
      Configuration to enable direct download of openllm models from huggingface.co. Provide your access token in the secretKey field.
    • ngc
      Configuration to enable direct download from the NGC: AI Development Catalog.
  • endpointUrl The registry endpoint (host).
    • s3
      The S3 registry endpoint for the associated S3 region. Required.
    • openllm
      The huggingface.co-compatible endpoint (default https://huggingface.co).
    • ngc
      The NVIDIA NGC-compatible API endpoint (default https://api.ngc.nvidia.com).
  • bucket The bucket or organization name, depending on the selected model registry type.
    • s3
      The required S3 bucket name.
    • ngc
      The required NGC org name.
  • accessKey The access key, username, or team name for the registry.
    • s3
      The access key/username.
    • ngc
      The optional NGC team name.
  • secretKey The password, secret key, or access token for the registry.
    • s3
      The secret key for the S3 bucket.
    • openllm
      The secretKey is the access token for huggingface.co and is supplied to the launched container via the HF_TOKEN environment variable.
    • ngc
      The NVIDIA NGC API key.
  • insecureHttps If set, the server certificate for HTTPS endpoints is accepted without validation.

Managed Attributes

  • id A unique identifier for this registry.
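
The sketch below defines an s3-type registry under the same illustrative assumptions (hypothetical host, path, and credentials); the field names mirror the attribute list above:

    import requests

    BASE_URL = "https://mlis.example.com"          # assumed host
    HEADERS = {"Authorization": "Bearer <token>"}  # assumed credentials

    registry = {
        "name": "team-model-bucket",
        "description": "S3 bucket holding packaged models",
        "type": "s3",
        "endpointUrl": "https://s3.us-east-1.amazonaws.com",  # required for s3
        "bucket": "team-models",          # the required S3 bucket name
        "accessKey": "<aws-access-key-id>",
        "secretKey": "<aws-secret-access-key>",
        "insecureHttps": False,
    }

    resp = requests.post(f"{BASE_URL}/api/v1/registries",
                         json=registry, headers=HEADERS)
    resp.raise_for_status()
    print(resp.json()["id"])              # the managed id

For an openllm registry, typically only name, type, and secretKey (the huggingface.co access token) are needed, since endpointUrl defaults to https://huggingface.co.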

DeploymentStateDetails

The state details of an inference service Deployment are described by the following attributes; a sketch of polling these details follows the list below:

Attributes

  • endpoint The endpoint URI used to access the inference service.
  • nativeAppName The name of the Kubernetes application for the specific service version. Use this name to match the app value in Grafana/Prometheus to obtain logs and metrics for this deployed service version.
  • status The status of a particular inference service revision.
    • Deploying
      The service configuration is in progress.
    • Failed
      The service configuration failed.
    • Ready
      The service has been successfully configured and is serving.
    • Updating
      A new service revision is being rolled out.
    • UpdateFailed
      The current service revision failed to roll out due to an error. The prior version is still serving requests.
    • Deleting
      The service is being removed.
    • Paused
      The service has been stopped by the user or an external action.
    • Unknown
      Unable to determine the service’s status.
    • Canceled
      The specified model version of the deployment was canceled by the user.
  • trafficPercentage Percent of traffic being processed by this service/model version.
  • failureInfo A list of any failures associated with the deployment of this service/model version.
  • modelId The id of the deployed packaged model associated with this state.
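
These attributes make it possible to poll a deployment until it is serving. The sketch below assumes the same hypothetical host, path, and credentials as the earlier examples, and that the GET response carries the deployment's state field as described above:

    import time
    import requests

    BASE_URL = "https://mlis.example.com"          # assumed host
    HEADERS = {"Authorization": "Bearer <token>"}  # assumed credentials

    def wait_until_ready(deployment_id, timeout=600, interval=10):
        """Poll the deployment's state until it is Ready, fails, or times out."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            resp = requests.get(f"{BASE_URL}/api/v1/deployments/{deployment_id}",
                                headers=HEADERS)
            resp.raise_for_status()
            state = resp.json()["state"]   # DeploymentStateDetails
            if state["status"] == "Ready":
                return state["endpoint"]   # the URI for inference calls
            if state["status"] in ("Failed", "UpdateFailed"):
                raise RuntimeError(state.get("failureInfo"))
            time.sleep(interval)
        raise TimeoutError("deployment did not reach Ready before the timeout")

During a rollout, secondaryState continues to describe the prior revision while state tracks the new one, so the trafficPercentage of each shows how traffic is split between the two model versions.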