Enable blob storage
By default, LangSmith stores run inputs, outputs, and errors in ClickHouse. In addition, LangSmith will store run manifests in Postgres. If you so choose, you can instead store this information in blob storage, which has a couple of notable benefits:
- In high trace environments, inputs, outputs, errors, and manifests may balloon the size of your databases.
- If using LangSmith Managed ClickHouse, you may want sensitive information in blob storage that resides in your environment. To alleviate this, LangSmith supports storing run manifests, inputs, outputs, and errors in an external blob storage system.
Requirements
- Access to a valid blob storage service
- Amazon S3
- Google Cloud Storage
- Currently, Azure Blob Storage is not supported (coming soon)
- A bucket/directory in your blob storage to store the data. We highly recommend creating a separate bucket/directory for LangSmith data.
- If you are using TTLs, you will need to set up a lifecycle policy to delete old data. You can find more information on configuring TTLs here. These policies should mirror the TTLs you have set in your LangSmith configuration, or you may experience data loss. See here on how to setup the lifecycle rules for TTLs for blob storage.
- Credentials to permit LangSmith Services to access the bucket/directory
- You will need to provide your LangSmith instance with the necessary credentials to access the bucket/directory. Read the authentication section below for more information.
- An API url for your blob storage service
- This will be the URL that LangSmith uses to access your blob storage system
- In the case of Amazon S3, this will be the URL of the S3 endpoint. Something like:
https://s3.amazonaws.com
orhttps://s3.us-west-1.amazonaws.com
if using a regional endpoint. - In the case of Google Cloud Storage, this will be the URL of the GCS endpoint. Something like:
https://storage.googleapis.com
Authentication
Amazon S3
To authenticate to Amazon S3, you will need to create an IAM policy granting admin permissions on your bucket. This will look something like the following:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
]
}
]
}
Once you have the correct policy, there are two ways to authenticate with Amazon S3:
- IAM Role for Service Account: You can create an IAM role for your LangSmith instance and attach the policy to that role. You can then provide the role to LangSmith. This is the recommended way to authenticate with Amazon S3 in production.
- You will need to create an IAM role with the policy attached.
- You will need to allow LangSmith service accounts to assume the role. The
langsmith-queue
andlangsmith-backend
service accounts will need to be able to assume the role.Service Account NamesThe service account names will be different if you are using a custom release name. You can find the service account names by running
kubectl get serviceaccounts
in your cluster. - You will need to provide the role ARN to LangSmith. You can do this by adding the
eks.amazonaws.com/role-arn: "<role_arn>"
annotation to thelangsmith-queue
andlangsmith-backend
services in your Helm Chart installation.
- Access Key and Secret Key: You can provide LangSmith with an access key and secret key. This is the simplest way to authenticate with Amazon S3. However, it is not recommended for production use as it is less secure.
- You will need to create a user with the policy attached. Then you can provision an access key and secret key for that user.
Google Cloud Storage
To authenticate with Google Cloud Storage, you will need to create a service account
with the necessary permissions to access your bucket.
Your service account will need the Storage Admin
role or a custom role with equivalent permissions. This can be scoped to the bucket that LangSmith will be using.
Once you have a provisioned service account, you will need to generate a HMAC key
for that service account. This key and secret will be used to authenticate with Google Cloud Storage.
CH Search
By default, LangSmith will still store tokens for search in ClickHouse. If you are using LangSmith Managed Clickhouse, you may want to disable this feature to avoid sending potentially sensitive information to ClickHouse. You can do this in your blob storage configuration.
Configuration
After creating your bucket and obtaining the necessary credentials, you can configure LangSmith to use your blob storage system.
- Helm
- Docker
config:
blobStorage:
enabled: true
chSearchEnabled: true # Set to false if you want to disable CH search (Recommended for LangSmith Managed Clickhouse)
bucketName: "your-bucket-name"
apiURL: "Your connection url"
accessKey: "Your access key" # Optional. Only required if using access key and secret key
accessKeySecret: "Your access key secret" # Optional. Only required if using access key and secret key
backend: # Optional, only required if using IAM role for service account
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "<role_arn>"
queue: # Optional, only required if using IAM role for service account
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "<role_arn>"
# In your .env file
BLOB_STORAGE_ENABLED=false # Set to true if you want to enable blob storage
BLOB_STORAGE_BUCKET_NAME=langsmith-blob-storage # Change to your desired blob storage bucket name
BLOB_STORAGE_API_URL=https://s3.us-west-2.amazonaws.com # Change to your desired blob storage API URL
BLOB_STORAGE_ACCESS_KEY=your-access-key # Change to your desired blob storage access key
BLOB_STORAGE_ACCESS_KEY_SECRET=your-access-key-secret # Change to your desired blob storage access key secret
If using an access key and secret, you can also provide an existing Kubernetes secret that contains the access key and secret key. This is recommended over providing the access key and secret key directly in your config.
TTL Configuration
If using the TTL feature with LangSmith, you'll also have to configure TTL rules for your blob storage. Trace information stored on blob storage is stored on a particular prefix path, which determines the TTL for the data. When a trace's retention is extended, its corresponding blob storage path changes to ensure that it matches the new extended retention.
The following TTL prefix are used:
ttl_s/
: Short term TTL, configured for 14 days.ttl_l/
: Long term TTL, configured for 400 days.
If you have customized the TTLs in your LangSmith configuration, you will need to adjust the TTLs in your blob storage configuration to match.
Amazon S3
If using S3 for your blob storage, you will need to setup a filter lifecycle configuration that matches the prefixes above. You can find information for this in the Amazon Documentation.
As an example, if you are using Terraform to manage your S3 bucket, you would setup something like this:
rule {
id = "short-term-ttl"
prefix = "ttl_s/"
enabled = true
expiration {
days = 14
}
}
rule {
id = "long-term-ttl"
prefix = "ttl_l/"
enabled = true
expiration {
days = 400
}
}
Google Cloud Storage
You will need to setup lifecycle conditions for your GCS buckets that you are using. You can find information for this in the Google Documentation, specifically using matchesPrefix.
As an example, if you are using Terraform to manage your GCS bucket, you would setup something like this:
lifecycle_rule {
condition {
age = 14
matches_prefix = ["ttl_s"]
}
action {
type = "Delete"
}
}
lifecycle_rule {
condition {
age = 400
matches_prefix = ["ttl_l"]
}
action {
type = "Delete"
}
}