Pinecone - Cost Optimization & Performance Best Practices
By John "NextWord" Hwang (@nextworddev)
In this post, I will provide 17 best practices for optimizing cost with Pinecone, aimed specifically at newcomers to vector databases (or to building AI apps in general). Following these best practices can save your startup tens of thousands of dollars, or help you avoid a surprise $200 bill for your hobby side project.
All advice in this article is actionable and based on my personal experience using Pinecone in production settings, as well as my time at AWS and Alexa, where I helped hundreds of customers architect production ML systems. The best practices are bucketed into four categories:
- General Tips & Common Mistakes: a list of common mistakes to avoid and quick fixes
- Application-Level Best Practices: how to structure your app and data to save money
- Infrastructure-Level Best Practices: how to extract the most performance per dollar when creating and configuring indices
- Paid-Tier Specific Advice: knowing which paid features to pay for, and when.
Unlike some folks, I have found Pinecone’s free tier to be quite generous, and its pricing formula to be transparent, compared to managed databases like DynamoDB or BigQuery (which can have confusing custom capacity units). However, there are some gotchas that may be overlooked by those who are new to using managed databases. This article will take you through the vast majority of gotchas you will encounter. Let’s jump in.
TLDR / Table of Contents
General Tips
- Quick mental model of how Pinecone and its pricing works
- A list of common pitfalls for newcomers
- You don't need a premium tier unless you have 350K+ vectors
- Use `namespace` to isolate your apps
- How to leverage `collections` to shut down & rehydrate
- Avoid `p1` or `p2` types when prototyping
Application Level Best Practices
- Don’t treat Pinecone like MongoDB or DynamoDB
- Store `foreign_key` in your meta, not the whole JSON
- Index only a subset of `metadata` fields
Infrastructure Level Advice
- TLDR Scaling Advice
- Understand the Scaling Model
- How and When to Use Replicas
- How to Scale Horizontally
- Scale Out Dynamically, Adjusting to Usage (Code Sample)
- Vertical Scaling Gotchas
Miscellaneous
- How to Pick the Initial Pod Configurations
- Get steep discounts by committing to a spend
- My final thoughts on various VectorDB options
General Tips
Quick mental model of how Pinecone and its pricing works
Before discussing specific tips, let's first establish a mental model of how Pinecone pricing works. It's a simple "pay as you go" model, without any weirdness like custom 'compute credits' or 'capacity units' as with Snowflake or DynamoDB. There's no separate ingress/egress/network charge either, which is a delight.
That said, here are 5 main things you need to know about Pinecone pricing:
- Pinecone is basically built on Kubernetes, so you get literal “pods” when you pay for Pinecone.
- “Pod” is the unit of infrastructure you pay for, kind of like “instances” for AWS EC2. Having 3x more pods will cost you 3x more, all else equal.
- You pay for each pod's usage time, rounded up to the nearest 15 minutes. So using a pod for 3 hours costs 3x as much as using it for an hour.
- There are 3 types of pods (S1, P1, P2); P2 is ~50% more expensive than S1 or P1.
- Where you deploy Pinecone (GCP or AWS) matters. AWS costs ~14% more than GCP.
So basically, Pinecone pricing boils down to the pseudocode below. It's a function of 1) which pods you use, 2) for how long, and 3) in which cloud.
```python
def pinecone_bill(account):
    if account.free_tier:
        return 0  # free tier is free
    sum_total = 0
    for pod in account.pods_used_in_last_30_days:
        pod_type = pod.get_type()
        cloud_env = pod.get_cloud_env()
        pod_price_per_hr = get_pod_price_per_hr(pod_type, cloud_env)
        num_hrs_used = pod.get_rounded_meter()
        sum_total += pod_price_per_hr * num_hrs_used
    return sum_total
```
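To make this concrete, here's a back-of-envelope example. The hourly rate below is a hypothetical placeholder, not a quoted price - check Pinecone's pricing page for current numbers.

```python
# Hypothetical rate, for illustration only -- not Pinecone's actual price.
HOURLY_RATE_GCP_S1 = 0.10   # $/pod-hour (placeholder)
AWS_MARKUP = 1.14           # AWS runs ~14% more than GCP

# 2 s1 pods left running for a 30-day month on GCP:
gcp_bill = 2 * HOURLY_RATE_GCP_S1 * 24 * 30   # ~$144
# The same setup deployed in AWS:
aws_bill = gcp_bill * AWS_MARKUP              # ~$164
print(f"GCP: ${gcp_bill:.0f}/mo, AWS: ${aws_bill:.0f}/mo")
```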
A list of common pitfalls for newcomers
So the formula above already foreshadows some rookie mistakes:
- If you leave your index (or indices) on while your hobby project sits idle with no traffic, you are wasting money.
  - Most hobby projects don't require an always-on index, so if you have to use a paid tier, it's okay to turn it off and rehydrate from a snapshot.
  - Snapshots are called `collections` in Pinecone-speak. I'll walk you through this in an upcoming section.
  - Note, not everyone - companies and hobbyists alike - needs a paid index. Rule of thumb: if you have fewer than 300K documents using OpenAI's embeddings (1536 dimensions) in a non-prod environment, then stick with the free tier.
- If you have more pods than necessary, then you are wasting money by definition.
  - Pinecone provides an `index_fullness` metric that you can use to dynamically upsize and downsize pods in real time. I'll show you a code snippet to do this later in this post.
  - Once you are in production, chances are you aren't dynamically resizing your indices based on metrics (most newcomers aren't).
- If you have the wrong pod mix (e.g. renting P2 pods without actually needing them), you are wasting money, too.
  - The same goes for the wrong pod-hour allocation (e.g. renting more expensive P2 pods when cheaper S1 pods would do). It's best to use cheaper pods for development, testing, and pretty much everything other than real-time serving.
- If you are deployed in AWS when you don't need to be, you are wasting money, since GCP deployments are ~14% cheaper.
  - Yes, I get that your Lambdas and EC2s are in AWS, but can you live with the extra latency? Unless you are running a production app serving super impatient users, the answer is probably yes.
Now, before we go deeper into the above tips, let's first discuss the biggest cost-saving tip, which is…
You don't need a premium tier unless you have 350K+ vectors
Pinecone has a very generous free tier - a single P1 pod can hold 300K-450K vectors, and an S1 pod can hold ~2 million. Chances are you aren't creating millions of embeddings. Here's how many 1536-dimension vectors each pod type can handle:
- S1 ~ 1.5m to 2m vectors. You can always scale vertically to `s1.x2`, `x4`, etc. to increase this number.
- P1 ~ 400K
- P2 ~ 600K
Note, the numbers above were derived from this table, which assumes 768-dimension BERT embeddings. Since OpenAI's embeddings are twice as large, we adjust by a factor of 2. Embeddings for images and videos, which have larger dimensionality, mean fewer vectors per pod.
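If you want a rough capacity estimate for other embedding sizes, you can apply the same adjustment. A minimal sketch, treating this post's approximate 1536-dimension capacities as assumptions:

```python
# Rough vectors-per-pod at 1536 dimensions, per the figures above (approximate).
CAPACITY_1536 = {"s1": 1_750_000, "p1": 400_000, "p2": 600_000}

def est_vectors_per_pod(pod_type: str, dim: int) -> int:
    # Capacity scales roughly inversely with embedding dimension.
    return int(CAPACITY_1536[pod_type] * 1536 / dim)

print(est_vectors_per_pod("p1", 768))   # ~800K for BERT-sized embeddings
```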
But here's the truth - most side projects and companies don't have 300K+ embeddings to store, unless they are doing something truly "big data". So why do people usually upgrade? Mostly to get around two issues: 1) to have multiple indices for their projects (the free tier only affords one index), and 2) to get better reliability, since free-tier indices can go down or degrade often.
Issue 2 is a legitimate reason to upgrade, especially when you have paying customers. But issue 1 - getting separate indices for multiple projects/apps - is not a good reason to upgrade. What you should do instead is use a separate `namespace` to isolate each app.
Use `namespace` to isolate your apps
In other words, use a single index, but have `namespace=f"{project_name}_{subnamespace}"` and so on. You can `upsert` vectors and isolate them in different `namespaces`. And at query time, vectors are also retrieved in an isolated fashion, like this:
```python
# ...
index.query(
    vector=[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
    top_k=3,
    namespace="app_1_namespace_2"
)
```
Basically, by using a naming convention for your namespaces, you get the isolation benefits of having separate indices. A minimal upsert sketch follows the notes below.
- Note 1: you can have any number of namespaces, so you shouldn't worry about "squandering" namespaces
- Note 2: You can't query across namespaces, so make sure the vector embeddings are indeed logically distinct. That's by design, however. You don't want your embeddings for `FinanceNewsArticles` to be fetched alongside embeddings for `SECFilings` (unless you somehow want both at the same time, in which case it's fine to just make two separate queries, one per namespace).
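Putting it together, here's a minimal sketch of writing to and reading from per-app namespaces. The naming helper and values are illustrative, not prescriptive:

```python
def ns(project_name: str, subnamespace: str) -> str:
    # e.g. "app_1" + "namespace_2" -> "app_1_namespace_2"
    return f"{project_name}_{subnamespace}"

# Each app writes into its own namespace within the shared index...
index.upsert(
    vectors=[("doc-1", [0.3] * 1536, {"date": "2023-01-01"})],
    namespace=ns("app_1", "namespace_2"),
)

# ...and reads only from its own namespace.
index.query(vector=[0.3] * 1536, top_k=3, namespace=ns("app_1", "namespace_2"))
```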
Now, having separate indices is the 'cleaner' way to isolate vectors, and allows for creating snapshots (aka `collections`) more cleanly. But given that each index costs at least 1 pod, the upgrade usually isn't worth it for most small projects.
But what if you MUST have a paid plan? How do we save money? Let's lead with the lowest-hanging-fruit advice, which is…
How to leverage `collections` to shut down & rehydrate
Now, suppose you are on a paid plan, but want to avoid paying for billable hours when your app is sitting idle. There are many ways to skin the cat, but the simplest one would be to:
- Create a `collection` out of your `index`, which is essentially creating a point-in-time snapshot
- Delete your `index`
- Then if and when you do need the `index` back up, you can create an `index` from the `collection` - this process doesn't take long (less than ~5 minutes) even for hundreds of thousands of vectors.
I have seen people freak out about deleting indices - but that's because they are viewing Pinecone indices as storage, like a MongoDB collection, which is just the wrong perspective. Indices aren't meant to be storage for your data. An index is exactly what the name implies - an index into your data. Just do it.
You can get fancy and actively monitor your utilization, triggering this shutdown/recover-from-snapshot logic from a metrics feed like CloudWatch. In code:
```python
def backup(index_name="my-index", create_new=True):
    collection_name = index_name + "_collection"
    # describe_collection raises if the collection doesn't exist,
    # so check the listing instead
    collection_exists = collection_name in pinecone.list_collections()
    if not collection_exists or create_new:
        pinecone.create_collection(collection_name, index_name)

def restore(index_name):
    pinecone.create_index(index_name,
                          dimension=1536,
                          source_collection=index_name + "_collection")

def shut_down_if_idle(index_name, utilization, utilization_threshold=0.7):
    if utilization < utilization_threshold:
        backup(index_name)  # snapshot before deleting
        pinecone.delete_index(index_name)

### every 30 minutes
while ...:
    shut_down_if_idle(...)
```
Avoid `p1` or `p2` types when prototyping
When prototyping, you are not as throughput- or latency-sensitive, which makes storage-optimized `S`-type pods far more suitable than the `P` types, which get you 3-5x less storage capacity. Since `S` instances can easily store a seven-figure vector count, it's hard to imagine you running out of pod space. And since it's a pre-MVP product, you won't have throughput issues.
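For example, spinning up a throwaway dev index might look like this sketch (the index name is a placeholder, and this assumes the v2 Python client):

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# Cheapest reasonable dev setup: a single storage-optimized s1 pod.
pinecone.create_index(
    "dev-scratch-index",   # placeholder name
    dimension=1536,        # e.g. OpenAI ada-002 embeddings
    pod_type="s1.x1",
    pods=1,
)
```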
Application Level Advice
Don’t treat Pinecone like MongoDB or DynamoDB
Many newcomers mistakenly see vector DBs like Pinecone or Weaviate as a drop-in replacement for NoSQL databases like Mongo or DynamoDB. This encourages bad habits, like trying to stuff entire JSON objects into metadata when upserting embeddings. This is suboptimal, because:
- Having more metadata crowds out the space you need to store vectors
- The `metadata` field in Pinecone is not meant to store entire objects, including the full text of documents. This design philosophy is apparent in the 40 KB size limit on Pinecone's metadata field, which is 10x smaller than DynamoDB's 400 KB size limit for `Items`.
In reality, indices are exactly what the name implies - indices for fast lookup of documents. An index should be treated as a view into your data - potentially ephemeral, and elastic based on workload. It is not equivalent to the data itself.
So what should you do instead of storing entire JSONs in `metadata`?
Store `foreign_key` in your meta, not the whole JSON
Don't store the whole JSON - store references to the actual document (in S3, Filestore, etc.) in your metadata field, rather than wasting an egregious amount of space storing whole blobs of text. Of course, this means you need another datastore to actually store your data, but it's certainly cheaper than storing it inside a live Pinecone index.
- The obvious caveat is that now you need another data store to hold your documents, but chances are you already have that anyway, and it's more expensive to migrate it out of AWS or GCP into Pinecone.
For example:

```python
# instead of doing this
index.upsert([("<id>", [...], {  # [...] stands in for your vector values
    "full_text": "really long text...",  # this is wasteful
    "date": "2023-01-01",  # you actually filter on date, so 'date' belongs in metadata
    # ...
})])

# do this
index.upsert([("<id>", [...], {
    "document_id": uuid,   # uuid of the document
    "date": "2023-01-01",  # keep this
})])
```
And at query time, you make one additional query to your document store (DynamoDB, etc.) using the `foreign_key` or `uuid`. This adds a small latency from the extra hop, which should be under ~10 milliseconds as long as your Pinecone index is in the same region as your servers.
```python
import boto3

dynamodb = boto3.resource("dynamodb")
table_name = "documents"  # placeholder: your DynamoDB table

# query results live under the "matches" key of the response
matches = index.query(...)["matches"]
uuid = matches[0]["metadata"]["uuid"]
# Get the table object
table = dynamodb.Table(table_name)
# Define the key for the item to retrieve
key = {"uuid": uuid}
# Use the get_item method to retrieve the item
response = table.get_item(Key=key)
# Extract the item from the response
item = response.get("Item")
```
Note, there are a few gotchas:
- Gotcha 1: decoupling the document store from the vector store may not be an option if your infrastructure is deployed in a region where Pinecone isn't available (such as APAC) - in which case it's better to store relevant document info inside the metadata field.
- Gotcha 2: you could also compromise by storing just a preview of each document in the vector store, along with the foreign key. This strikes a happy medium between latency/convenience and storage efficiency, as shown in the sketch below.
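If you go the Gotcha-2 route, the upsert might look like this sketch (the field names and preview length are illustrative):

```python
PREVIEW_CHARS = 200  # illustrative cutoff, comfortably under the 40 KB metadata limit

index.upsert([(
    "<id>",
    embedding,  # your vector values
    {
        "document_id": doc_uuid,               # foreign key into S3/DynamoDB
        "preview": full_text[:PREVIEW_CHARS],  # enough to render a result snippet
        "date": "2023-01-01",
    },
)])
```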
Index only a subset of `metadata` fields
Another huge gotcha with Pinecone is that it indexes all metadata fields by default. This is often not desirable, because not all metadata fields will be used in metadata-filtering queries. Each additional metadata index consumes both memory (RAM) and storage, resulting in fewer vectors fitting inside each pod and degraded performance.
Luckily, you can pre-define which `metadata` fields are indexed using metadata configs. Example: say you are using `date`, `customer_id`, and `billing_address` in your metadata field. If you never use `billing_address` as a field to filter on, it's a complete waste of space to create an index on it (which is the default behavior). To prevent that, you can do this:
```python
metadata_config = {
    "indexed": ["date", "customer_id"]
}
pinecone.create_index("openai-index", dimension=1536,
                      metadata_config=metadata_config)
```
As always, there is a gotcha: `metadata_config` can only be set at `index` creation time, meaning you can't add additional indexed metadata fields on a running index. Note, it's often hard to predict in advance which metadata fields you will need to filter on.
Your best option is to snapshot the index (as a collection), and restart the index just to reconfigure it. This can lead to some downtime on your index, so it's advisable to think through which fields you will most likely filter on.
What should you put in `metadata` fields?
Generally speaking, avoid putting whole blocks of text into your metadata fields. That's apparent when you consider what filtering operations are available for metadata fields (below). None of them supports real free-text search (not even a `contains()`). Time or categorical fields such as `date` and `category` are a good fit.
- `$eq` - Equal to (number, string, boolean)
- `$ne` - Not equal to (number, string, boolean)
- `$gt` - Greater than (number)
- `$gte` - Greater than or equal to (number)
- `$lt` - Less than (number)
- `$lte` - Less than or equal to (number)
- `$in` - In array (string or number)
- `$nin` - Not in array (string or number)
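For instance, a query filtered with these operators might look like this sketch (field names follow the earlier example; note that range operators only apply to numbers, so store dates numerically if you want to range-filter on them):

```python
# Fetch the 5 nearest vectors among recent documents for specific customers.
matches = index.query(
    vector=query_embedding,  # assumed to be computed elsewhere
    top_k=5,
    filter={
        "date": {"$gte": 20230101},                     # date stored as a YYYYMMDD number
        "customer_id": {"$in": ["cust_42", "cust_43"]},
    },
    include_metadata=True,
)
```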
Infrastructure Level Advice
So you have decided you are ready to graduate to a paid plan, and deal with multiple indices & pods. This unlocks an entire new class of cost optimization strategies, but first we need to understand how horizontal and vertical scaling works with Pinecone.
TLDR Scaling Advice
- Start small and start with cheaper configurations
- Work backwards from your expected usage to figure out the initial cluster configurations, but don’t expect your projections to be accurate. Stay flexible.
- Use `index_fullness` and `request_latency_seconds` metrics as signals to scale in and out dynamically
- There are things you can only configure at index-creation time, versus dynamically:
  - Only at index-creation time: pod type (`s1` vs `p1`, etc.), number of pods
  - Can change at index runtime: number of replicas, pod sizes (`p1.x1` to `p1.x4`, etc.)
    - Note, you CAN'T downgrade a pod size dynamically (`p1.x4` to `p1.x1`)
- Here’s a simple rule of thumb:
- To handle bursts of traffic or usage in a production application, go with vertical scaling or add read replicas
- For every other use case, just delete your index and start over (from a collection / snapshot) with new pod settings
If you want more, here are some additional details.
Understand the Scaling Model
There are three ways to scale your index from just 1 pod to N pods:
- horizontally:
  - adding read replicas
  - adding pods
- vertically (upping the 'size' of each pod, e.g. `x1` to `x2`):
  - (warning) this has the effect of sizing up ALL pods in your cluster.
But there are some key things and gotchas to know for each strategy. At the end of this section, I include a simple auto-scaling function that synthesizes all of this into working code.
How and When to Use Replicas
TLDR: If you are overloaded with traffic and your latency is skyrocketing, add replicas.
- When to use: you are experiencing degraded query performance, which can be measured with the `pinecone_request_latency_seconds` metric.
- Benefit: each replica will help read query performance by distributing the load. Adding 5x the number of replicas will theoretically increase throughput by 5x (the table below shows how many Queries Per Second (QPS) each replica adds, per pod type).
Note, the pod type of your replica is going to be the same as the ‘base type’ you specified when you created the index - which you CANNOT change.
- E.g. You provisioned an index with 3 pods of type S1. If you add 5 replicas, you get 5 more of type S1, and you won’t get to choose P1 or P2.
- Gotcha: adding replicas WON'T increase your vector storage capacity. All replicas do is copy over vectors from your primary - therefore, they won't help you store more vectors, which you can only accomplish by either adding more pods (of the same kind) or sizing up each pod.

Queries Per Second by Pod Type and `top_k`

| Pod type | top_k 10 | top_k 250 | top_k 1000 |
|---|---|---|---|
| p1 | 30 | 25 | 20 |
| p2 | 150 | 50 | 20 |
| s1 | 10 | 10 | 10 |
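Adding or removing replicas is a runtime operation; a minimal sketch, with the index name as a placeholder:

```python
# Scale read throughput under load: bump replicas on the live index.
pinecone.configure_index("my-index-name", replicas=5)

# Unlike pod size, replicas CAN be scaled back down once traffic subsides,
# so wind them down to stop paying for them.
pinecone.configure_index("my-index-name", replicas=1)
```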
How to Scale Horizontally
- TLDR:
- Increase the # of pods if you want to add storage and query throughput to your index, and can afford to shut down your site while you recreate the index (which may take 2-30 minutes)
- How It Works:
  - When you create an index, you pass the number of `pods` and the `pod_type` as configuration. This is your lever. You can only specify the number of pods at index creation time, so changing this figure requires a full deletion of the index and a restore from a snapshot with the new configuration. Recreating the index MAY lead to the DB URL changing, which can be a problem for production apps. To mitigate this, use an API gateway to load balance across Pinecone indices.
- Here is a simple snippet illustrating how to scale your pods in and out:
Scale Out Dynamically, Adjusting to Usage (Code Sample)
```python
### your code ###
SCALE_OUT_THRESHOLD = 0.7
BUFFER = 0.1
POD_INCREMENTS = 2

### run this in some interval ###
def dynamic_scale(scale_type="horizontal"):
    # get index fullness
    index = pinecone.Index("my-index-name")
    index_stats_response = index.describe_index_stats()
    index_fullness = index_stats_response["index_fullness"]

    index_description = pinecone.describe_index("my-index-name")
    num_pods = index_description["pods"]
    pod_type = index_description["pod_type"]

    if index_fullness > SCALE_OUT_THRESHOLD + BUFFER:
        delta = POD_INCREMENTS   # add 2 pods at a time, could be configurable
    elif index_fullness < SCALE_OUT_THRESHOLD - BUFFER:
        delta = -POD_INCREMENTS
    else:
        return  # within the buffer band, nothing to do

    if scale_type == "horizontal":
        # snapshot, delete, then recreate the index with the new pod count
        pinecone.create_collection("my-index-name-collection", "my-index-name")
        pinecone.delete_index("my-index-name")
        pinecone.create_index("my-index-name",
                              dimension=1536,
                              source_collection="my-index-name-collection",
                              metric="dotproduct",
                              pods=num_pods + delta,
                              pod_type=pod_type)
    elif scale_type == "vertical" and delta > 0:  # vertical scaling
        size = int(pod_type[-1])      # e.g. "s1.x2" -> 2
        new_size = min(size * 2, 8)   # sizes go x1 -> x2 -> x4 -> x8
        pinecone.configure_index("my-index-name",
                                 pod_type=pod_type[:-1] + str(new_size))
        # note: can't scale down vertically
```
Vertical Scaling Gotchas
- TLDR: Allows you to `2x`, `4x`, or `8x` both your throughput and storage for an active index, which makes it different from adding more pods, which requires recreating indices.
- When to use: useful to handle spikes in usage when you can't afford the downtime to 1) snapshot a collection, 2) shut down the index, and 3) recreate the index from the collection.
- Gotchas:
- If you scale from X1 to X2, X4, etc - then this scale up applies to ALL of your pods. In other words, if you are paying for 10 pods and scale up vertically 2x, you will instantly double your bill.
- You cannot scale down, once you scale up. Therefore, when the spike is over and you want to wind down, you will have to eventually delete the index and recreate it. You may need a mechanism to manage the HTTPS endpoints for your Pinecone index URLs.
- Neither can you change pod types with vertical scaling. If you want to change your pod type while scaling, then horizontal scaling is the better option.
From https://docs.pinecone.io/docs/manage-indexes#replicas: "The default pod size is `x1`. After index creation, you can increase the pod size for an index. Increasing the pod size of your index does not result in downtime. Reads and writes continue uninterrupted during the scaling process. Currently, you cannot reduce the pod size of your indexes. Your number of replicas and your total number of pods remain the same, but each pod changes size. Resizing completes in about 10 minutes."
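In code, the resize described above is a single `configure_index` call; a sketch (the index name is a placeholder):

```python
# x1 -> x2: roughly doubles storage/throughput for EVERY pod in the index,
# and therefore roughly doubles the bill. There's no way back down short of
# recreating the index from a collection.
pinecone.configure_index("my-index-name", pod_type="p1.x2")
```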
Miscellaneous Advice
How to Pick the Initial Pod Configurations
Now, to do this, you want to work backwards from your application's main requirements. One could write another 10K+ words on this topic alone, so I'll save it for another post - but don't spend too much time forecasting, because you will be wrong anyway. Some questions to ask are:
- How many vectors + metadata do you have?
- What are your most common queries?
- What’s your tolerance for latency?
- What’s your requirement for throughput?
Based on this, you can back out a guess as to whether you need an S-pod or a P-pod. Most people who aren't serving production apps should go with S-pods, which are storage-optimized.
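As a starting point, here's a back-of-envelope sizing sketch. The capacity constants are rough assumptions carried over from earlier in this post, not quoted limits:

```python
import math

# Approximate vectors-per-pod at 1536 dimensions (assumptions, see above).
CAPACITY_1536 = {"s1": 1_750_000, "p1": 400_000, "p2": 600_000}

def initial_pods(num_vectors: int, pod_type: str = "s1", headroom: float = 0.3) -> int:
    # Pods needed to hold num_vectors while leaving ~30% headroom for growth.
    usable = CAPACITY_1536[pod_type] * (1 - headroom)
    return max(1, math.ceil(num_vectors / usable))

print(initial_pods(5_000_000, "s1"))   # -> 5 pods
```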
Get steep discounts by committing to a spend
Once you have validated your Pinecone use case and see yourself using it for at least a year, it's best to secure a discount based on a spend commitment - contact sales. This advice applies to any cloud service, really. It's not uncommon to get a 25%+ discount by committing to a 1- or 2-year spend. AWS, for example, has done this for pretty much all of its popular products, such as EC2 reserved instances.
My final thoughts on various VectorDB options
In upcoming posts, I will review other vector databases like Weaviate, as well as some open-source projects like LangChain and LlamaIndex. I help individuals and enterprises alike get their AI-embedding-based projects to MVP and beyond. You can reach me via https://twitter.com/nextworddev
Cost optimization is closely tied to making good decisions about both infrastructure AND your code. As you saw, there are many levers to pull to optimize your spend without settling for worse performance.