Pinecone DB - Cost Optimization & Performance Best Practices

In this post, I will provide 17 best practices for optimizing cost with Pinecone, aimed specifically at newcomers to vector databases (or to building AI apps in general). Following these best practices can save your startup tens of thousands of dollars, or help you avoid a surprise $200 bill for your hobby side project.

All advice in this article is actionable and based on my personal experience using Pinecone in production settings, as well as my time at AWS and Alexa, where I helped hundreds of customers architect production ML systems. The best practices are bucketed into four categories:

  • General Tips & Common Mistakes: a list of common mistakes to avoid and quick fixes
  • Application-Level Best Practices: how to structure your app and data to save money
  • Infrastructure-Level Best Practices: how to extract the most performance per dollar when creating and configuring indices
  • Paid-Tier Specific Advice: knowing which paid features to pay for, and when.

Unlike some folks, I have found Pinecone’s free tier to be quite generous, and its pricing formula to be transparent, compared to managed databases like DynamoDB or BigQuery (which can have confusing custom capacity units). However, there are some gotchas that may be overlooked by those who are new to using managed databases. This article will take you through the vast majority of gotchas you will encounter. Let’s jump in.

TLDR / Table of Contents

  • General Tips
  • Application Level Advice
  • Infrastructure Level Advice
  • Miscellaneous Advice

General Tips

Quick mental model of how Pinecone and its pricing works

Before discussing specific tips, let’s first establish a mental model of how Pinecone pricing works. It’s a simple “pay as you go” model, without any weirdness like custom ‘compute credits’ or ‘capacity units’ as in Snowflake or DynamoDB. There’s also no separate ingress/egress/network charge, which is a delight.

That said, here are 5 main things you need to know about Pinecone pricing:

  • Pinecone is basically built on Kubernetes, so you get literal “pods” when you pay for Pinecone.
  • “Pod” is the unit of infrastructure you pay for, kind of like “instances” for AWS EC2. Having 3x more pods will cost you 3x more, all else equal.
  • You pay for each pod by usage time, rounded up to the nearest 15 minutes. So using a pod for 3 hours costs 3x more than renting it for 1 hour.
  • There are 3 types of pods (S1, P1, P2); P2 is 50% more expensive than S1 or P1.
  • Where you deploy Pinecone (GCP or AWS) matters. AWS costs ~14% more than GCP.

So basically, Pinecone pricing can be broken down into the pseudocode below. Your bill is a function of 1) which pods you use, 2) for how long, and 3) in which cloud.

def pinecone_bill(account):
    if account.free_tier:
        return 0  # free tiers are free
    sum_total = 0
    for pod in account.pods_used_in_last_30_days:
        pod_type = pod.get_type()
        cloud_env = pod.get_cloud_env()
        pod_price_per_hr = get_pod_price_per_hr(pod_type, cloud_env)
        num_hrs_used = pod.get_rounded_meter()  # usage rounded up to 15-minute increments
        sum_total += pod_price_per_hr * num_hrs_used
    return sum_total

A list of common pitfalls for newcomers

So the formula above already foreshadows some rookie mistakes:

  • If you leave indexes running while your hobby project sits idle with no traffic, you are wasting money.
    • Most hobby projects don’t require an always-on index, so if you have to use the paid tier, it’s okay to turn it off and rehydrate from a snapshot.
    • Snapshots are called collections in Pinecone-speak. I’ll walk you through this in an upcoming section.
    • Note, not everyone - companies and hobbyists alike - needs a paid index. Rule of thumb: if you have fewer than 300K documents using OpenAI’s embeddings (1536 dimensions) in a non-prod environment, switch to the free tier.
  • If you have more pods than necessary, then you are wasting money by definition.
    • Pinecone provides an index_fullness metric that you can use to dynamically upsize and downsize pods in real time. I'll show you a code snippet to do this later in this post.
      • Once in production, chances are most newcomers aren’t dynamically resizing their indices based on metrics.
  • If you have the wrong pod mix (e.g. renting P2 pods without actually needing them), you are wasting money, too.
  • If you have the wrong pod-hour allocation (e.g. renting more expensive P2 pods when cheaper S1 pods would do), you are also wasting money. It’s best to use cheaper pods for development, testing, and pretty much everything other than real-time serving.
  • If you are deployed in AWS when you don’t need to be, you are wasting money, since GCP deployments are ~14% cheaper.
    • Yes, I get that your Lambdas and EC2s are in AWS, but can you live with the extra latency? Unless you are a production app serving super impatient users, the answer is probably yes.

Now, before we go deeper into the above tips, let’s first discuss the biggest cost savings gain tip, which is…

You don't need a premium tier unless you have 350K+ vectors

Pinecone has a very generous free tier - a single P1 pod can hold 300K-450K vectors, and an S1 pod can hold ~2 million. Chances are you aren’t creating millions of embeddings. Here’s how many 1536-dimension vectors each pod type can handle:

  • S1 ~ 1.5m to 2m vectors. You can always scale vertically to S1.X2, X4, etc and increase this number.
  • P1 ~ 400K
  • P2 ~ 600K

Note, the numbers above were derived from this table, which assumes 768-dimension BERT embeddings. Since OpenAI’s embeddings are twice as large, we adjust by a factor of 2. Embeddings for images and videos, which have larger dimensionality, mean fewer vectors per pod.
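To make that adjustment concrete, here’s the arithmetic as a throwaway helper (the 4M input below is purely illustrative, not an official capacity figure):

# capacity scales roughly inversely with embedding dimension
def adjust_capacity(capacity_at_768, new_dim=1536):
    return int(capacity_at_768 * 768 / new_dim)

print(adjust_capacity(4_000_000))  # -> 2_000_000, in line with the S1 estimate above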

But here’s the truth - most side projects and companies don’t have 300K+ embeddings to store, unless they are doing something truly “big data”. So why do people usually upgrade? Mostly to get around two issues: 1) to have multiple indices for their projects (because the free tier only affords 1 index), and 2) to get better reliability, since free-tier indices can go down or degrade often.

Issue 2 is a legitimate reason to upgrade, especially when you have paying customers. But issue 1 - getting separate indices for multiple projects or apps - is not a good reason to upgrade. What you should do instead is use a separate namespace to isolate each app.

Use namespaces to isolate your apps

In other words, use a single index, but set the namespace to something like {project_name}_{subnamespace}. You can upsert vectors into different namespaces to isolate them, and at query time vectors are retrieved in an isolated fashion, like this:

...
index.query(
  vector=[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
  top_k=3,
  namespace="app_1_namespace_2"
)
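
The write path works the same way - upserts target a namespace explicitly. A minimal sketch (the ID, vector values, and metadata here are placeholders):

index.upsert(
    vectors=[("doc-42", [0.3] * 8, {"date": "2023-01-01"})],
    namespace="app_1_namespace_2"  # same naming convention as at query time
)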

Basically, by using a naming convention for your namespaces, you get the isolation benefits of having separate indices.

  • Note 1: you can have any number of namespaces, so you shouldn't worry about "squandering" them.
  • Note 2: you can’t query across namespaces, so make sure the vector embeddings are indeed logically distinct. That’s by design, however. You don’t want your embeddings for FinanceNewsArticles to be fetched alongside embeddings for SECFilings (unless you somehow want both at the same time, in which case it’s fine to just make two separate queries, one to each namespace).

Now, having separate indices is the ‘cleaner’ way to isolate vectors, and allows for creating snapshots (aka collections) more cleanly. But given that each index costs at least 1 pod, the upgrade is usually not worth it for most small projects.

But what if you MUST have a paid plan? How do we save money? Let’s lead with the lowest hanging fruit-advice, which is…

How to leverage collections to shut down & rehydrate

Now, suppose you are on a paid plan, but want to avoid paying for billable hours while your app sits idle. There are many ways to skin a cat, but the simplest one is to:

  • Create a collection out of your index, which is essentially a point-in-time snapshot
  • Delete your index
  • Then if and when you do need the index back up, you can create an index from the collection - this process doesn’t take long (less than ~5 minutes) even for hundreds of thousands of vectors.

I have seen people freak out about deleting indices - but that’s because they view Pinecone indices as storage, like a MongoDB collection, which is just the wrong perspective. Indices aren’t meant to be storage for your data. An index is exactly what the name implies - an index into your data. Just do it.

You can get fancy and actively monitor your utilization, triggering this shutdown/recover-from-snapshot logic from metric feeds like CloudWatch. In code:

def backup(index_name="my-index", create_new=True):
    collection_name = index_name + "_collection"
    try:
        collection = pinecone.describe_collection(collection_name)
        collection_exists = collection.status == "Ready"
    except Exception:  # describe_collection raises if the collection doesn't exist
        collection_exists = False
    if not collection_exists or create_new:
        pinecone.create_collection(collection_name, index_name)

def restore(index_name):
    pinecone.create_index(index_name,
        dimension=1536,
        source_collection=index_name + "_collection")

def shut_down_if_idle(index_name, utilization, utilization_threshold=0.7):
    if utilization < utilization_threshold:
        backup(index_name)  # snapshot before deleting
        pinecone.delete_index(index_name)

### every 30 minutes
while ...:
    shut_down_if_idle(...)

Avoid P1 or P2 pod types when prototyping

When prototyping, you are not as throughput- or latency-sensitive, which makes the storage-optimized S1 pods far more suitable than the P types, which give you 3-5x less storage capacity. Since S1 pods can easily hold a 7-figure vector count, it’s hard to imagine running out of pod space. And since it’s a pre-MVP product, you won’t have throughput issues.

Application Level Advice

Don’t treat Pinecone like MongoDB or DynamoDB

Many newcomers mistakenly see vector DBs like Pinecone or Weaviate as a drop-in replacement for NoSQL databases like Mongo or DynamoDB. This encourages bad habits, like stuffing entire JSON objects into metadata when upserting embeddings. This is suboptimal, because:

  • Having more metadata crowds out the space you need to store vectors
  • The metadata field in Pinecone is not meant to store entire objects, including the full text of documents. This design philosophy is apparent in the 40KB size limit on Pinecone’s metadata field, which is 10x smaller than DynamoDB’s 400KB size limit for items.

In reality, indices are exactly what the name implies - indices for fast lookup of documents. An index should be treated as a view into your data, one that is potentially ephemeral and elastic based on workload. It is not equivalent to the data itself.

So what should you do instead of storing entire JSONs in metadata?

Store a foreign key in your metadata, not the whole JSON

Don’t store the whole JSON. Store a reference to the actual document (in S3, Filestore, etc.) in your metadata field, rather than wasting an egregious amount of space storing whole blobs of text.

  • The obvious caveat is that now you need another data store to hold your documents, but chances are you already have one anyway, and it’s certainly cheaper than storing everything inside a live Pinecone index.

For example:

# instead of doing this
index.upsert([("<id>", [...vector...], {
    "full_text": "really long text...",  # this is wasteful
    "date": "2023-01-01",  # you actually filter on date, so 'date' belongs in metadata
    ...
})])

# do this
index.upsert([("<id>", [...vector...], {
    "document_id": uuid,  # foreign key: uuid of the document in your primary datastore
    "date": "2023-01-01"  # keep this
})])

And at query time, you make one additional query to your document store (DynamoDB, etc.) using the foreign key (uuid). This adds a small latency from the extra hop, which should be under ~10 milliseconds as long as your Pinecone index is in the same region as your servers.

import boto3

# assumes the DynamoDB table uses 'uuid' as its partition key
dynamodb = boto3.resource('dynamodb')

matches = index.query(...)["matches"]
uuid = matches[0]["metadata"]["document_id"]

# Get the table object
table = dynamodb.Table(table_name)

# Define the key for the item to retrieve
key = {'uuid': uuid}

# Use the get_item method to retrieve the item
response = table.get_item(Key=key)

# Extract the item from the response
item = response.get('Item')

Note, there are a few gotchas:

  • Gotcha 1: decoupling the document store from the vector store may not be an option if your infrastructure is deployed in a region where Pinecone isn’t available (such as APAC) - in which case it’s better to store the relevant document info inside the metadata field.
  • Gotcha 2: you could also compromise by storing just a preview of each document in the vector store, alongside the foreign key. This strikes a happy medium between latency and convenience on one side and storage efficiency on the other.

Index only a subset of metadata fields

Another huge gotcha with Pinecone is that it indexes all metadata fields by default. This is often not desirable behavior, because not all metadata fields will be used in metadata filtering queries. Each additional metadata index consumes both memory (RAM) and storage, so fewer vectors fit inside each pod and performance degrades.

Luckily, you can pre-define which metadata fields are indexed using metadata configs. Example: say you have date, customer_id, and billing_address in your metadata field. If you never use billing_address as a field to filter on, it’s a complete waste of space to create an index on it (which is the default behavior). To prevent that, you can do this:

metadata_config = {
    "indexed": ["date", "customer_id"]
}

pinecone.create_index("openai-index", dimension=1536,
                      metadata_config=metadata_config)

As always there is a gotcha:

  • Metadata indexing can only be configured at index creation time, meaning you can’t add additional indexed fields to a running index. Unfortunately, it’s often hard to predict in advance which metadata fields you will need to filter on.

Your best option is to snapshot the index (as a collection) and recreate it just to reconfigure. This can cause some downtime on your index, so it’s advisable to think through up front which fields you will most likely filter on.
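
If you do get it wrong, the snapshot-and-recreate dance looks roughly like this (a sketch; the index/collection names and the added "region" field are placeholders):

# snapshot the live index, drop it, then rebuild with the new metadata_config
pinecone.create_collection("my-index_collection", "my-index")
pinecone.delete_index("my-index")  # the name must be free before recreating
pinecone.create_index("my-index",
    dimension=1536,
    source_collection="my-index_collection",
    metadata_config={"indexed": ["date", "customer_id", "region"]})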

What should you put in metadata fields?

Generally speaking, avoid putting whole blocks of text into your metadata fields. That’s apparent when you consider which filtering operations are available for metadata fields (below): none of them support real free-text search (not even a contains()). Time or categorical fields such as date and category are a good fit; see the example query after the list.

  • $eq - Equal to (number, string, boolean)
  • $ne - Not equal to (number, string, boolean)
  • $gt - Greater than (number)
  • $gte - Greater than or equal to (number)
  • $lt - Less than (number)
  • $lte - Less than or equal to (number)
  • $in - In array (string or number)
  • $nin - Not in array (string or number)
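
For illustration, here’s what a typical filtered query looks like (the field names and query embedding are hypothetical; note that $gte only works on numbers, so dates are stored as epoch seconds):

matches = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "category": {"$eq": "finance"},
        "date": {"$gte": 1672531200}  # 2023-01-01 as epoch seconds
    }
)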

Infrastructure Level Advice

So you have decided you are ready to graduate to a paid plan and deal with multiple indices and pods. This unlocks an entirely new class of cost optimization strategies, but first we need to understand how horizontal and vertical scaling work with Pinecone.

TLDR Scaling Advice

  • Start small and start with cheaper configurations
  • Work backwards from your expected usage to figure out the initial cluster configurations, but don’t expect your projections to be accurate. Stay flexible.
  • Use index_fullness and request_latency_seconds metrics as signals to scale in and out dynamically
  • Some things can only be configured at index-creation time; others can be changed dynamically.
    • Only at index-creation time:
      • Pod type (s1 vs p1, etc), number of pods
    • Can change at index runtime:
      • Number of replicas, pod sizes (p1.x1 to p1.x4, etc)
      • Note, you CAN’T downgrade a pod size dynamically (p1.x4 to p1.x1)
  • Here’s a simple rule of thumb:
    • To handle bursts of traffic or usage in a production application, go with vertical scaling or add read replicas
    • For every other use case, just delete your index and start over (from a collection / snapshot) with new pod settings

If you want more, here are some additional details.

Understand the Scaling Model

There are three ways to scale your index from just 1 pod to N pods:

  • horizontally:
    • adding read replicas
    • adding pods
  • vertically (upgrading the size of each pod, e.g. x1 to x2)
    • (warning) this has the effect of sizing up ALL pods in your cluster.

But there are some key things and gotchas to know for each strategy. At the end of this section, I include a simple auto-scaling function that synthesizes all of this into working code.

How and When to Use Replicas

  • TLDR: If you are overloaded with traffic and your latency is skyrocketing, add replicas.

  • When to use: You are experiencing degraded query performance which can be measured with pinecone_request_latency_seconds metric.

  • Benefit: Each replica helps read query performance by distributing the load. Adding 5x the number of replicas will theoretically increase throughput by 5x (the table below shows how many queries per second (QPS) each replica adds, per pod type).

  • Note, the pod type of your replica is going to be the same as the ‘base type’ you specified when you created the index - which you CANNOT change.

    • E.g. You provisioned an index with 3 pods of type S1. If you add 5 replicas, you get 5 more of type S1, and you won’t get to choose P1 or P2.
  • Gotcha: Adding replicas WON’T increase your vector storage capacity. All replicas do is copy over the vectors from your primary - so they won’t help you store more vectors, which you can only accomplish by adding more pods (of the same kind) or sizing up each pod.

Queries Per Second by Pod Type and top_k

| Pod type | top_k 10 | top_k 250 | top_k 1000 |
|----------|----------|-----------|------------|
| p1       | 30       | 25        | 20         |
| p2       | 150      | 50        | 20         |
| s1       | 10       | 10        | 10         |
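
Unlike pod count, replicas can be changed on a live index, so no snapshot/recreate dance is needed. A minimal sketch (the index name is a placeholder):

# add read replicas to a live index; each replica inherits the base pod type
pinecone.configure_index("my-index-name", replicas=4)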

How to Scale Horizontally

  • TLDR:
    • Increase the number of pods if you want to add storage and query throughput to your index, and can afford to shut down your site while you recreate the index (which may take 2-30 minutes)
  • How It Works:
    • When you create an index, you pass the number of pods and pod_type as configuration. This is your lever. Since the pod count can only be specified at index creation time, changing it requires fully deleting the index and restoring from a snapshot with the new configuration. Recreating the index MAY change the index URL, which can be a problem for production apps. To mitigate this, put an API gateway in front of your Pinecone indices to load balance across them.
  • Here is a simple snippet illustrating how to scale your pods in and out:

Scale Out Dynamically, Adjusting to Usage (Code Sample)

### your code ###

SCALE_OUT_THRESHOLD = 0.7
BUFFER = 0.1
POD_INCREMENTS = 2

### run this on some interval ###
def dynamic_scale(scale_type="horizontal"):
    # get index fullness and current pod configuration
    index = pinecone.Index('my-index-name')
    index_stats_response = index.describe_index_stats()
    index_fullness = index_stats_response["index_fullness"]
    index_description = pinecone.describe_index('my-index-name')
    num_pods = index_description["pods"]
    pod_type = index_description["pod_type"]

    delta = 0
    if index_fullness > SCALE_OUT_THRESHOLD + BUFFER:
        delta = POD_INCREMENTS  # move 2 pods at a time, could be configurable
    elif index_fullness < SCALE_OUT_THRESHOLD - BUFFER:
        delta = -POD_INCREMENTS
    if delta == 0:
        return  # inside the buffer zone, nothing to do

    if scale_type == "horizontal":
        # pod count is fixed at creation, so snapshot, delete, and recreate
        pinecone.create_collection("my-index-name-collection", "my-index-name")
        pinecone.delete_index("my-index-name")
        pinecone.create_index("my-index-name",
                dimension=1536,
                source_collection="my-index-name-collection",
                metric="dotproduct",
                pods=num_pods + delta,
                pod_type=pod_type)
    elif scale_type == "vertical" and delta > 0:  # vertical scaling
        # pod_type looks like "s1.x2"; a bare "s1" means size x1
        size = int(pod_type.split(".x")[1]) if ".x" in pod_type else 1
        base_type = pod_type.split(".x")[0]  # e.g. "s1"
        new_size = min(size * 2, 8)  # cap at the largest size, x8
        pinecone.configure_index("my-index-name",
                pod_type=base_type + ".x" + str(new_size))
        # note: you can't scale pod size back down

Vertical Scaling Gotchas

  • TLDR: Vertical scaling lets you 2x, 4x, or 8x both your throughput and storage on an active index, unlike adding pods, which requires recreating the index.
  • When to use: handling spikes in usage when you can't afford the downtime to 1) snapshot a collection, 2) shut down the index, and 3) recreate the index from the collection.
  • Gotchas:
    • If you scale from X1 to X2, X4, etc - then this scale up applies to ALL of your pods. In other words, if you are paying for 10 pods and scale up vertically 2x, you will instantly double your bill.
    • You cannot scale down, once you scale up. Therefore, when the spike is over and you want to wind down, you will have to eventually delete the index and recreate it. You may need a mechanism to manage the HTTPS endpoints for your Pinecone index URLs.
    • Neither can you change pod types with vertical scaling. If you want to change your pod type while scaling, then horizontal scaling is the better option.

      from: https://docs.pinecone.io/docs/manage-indexes#replicas … The default pod size is x1. After index creation, you can increase the pod size for an index. Increasing the pod size of your index does not result in downtime. Reads and writes continue uninterrupted during the scaling process. Currently, you cannot reduce the pod size of your indexes. Your number of replicas and your total number of pods remain the same, but each pod changes size. Resizing completes in about 10 minutes.
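
In code, sizing up a live index is a single call (a sketch; assumes the index currently runs x1 pods of type s1):

# double the size of every pod in the index; this can't be undone in place
pinecone.configure_index("my-index-name", pod_type="s1.x2")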

Miscellaneous Advice

How to Pick the Initial Pod Configurations

Now, to do this, you want to work backwards from your application’s main requirements. One could write another 10K+ words on this topic alone, so I’ll save that for another post - but don’t spend too much time forecasting, because you will be wrong anyway. Some questions to ask are:

  • How many vectors + metadata do you have?
  • What are your most common queries?
  • What’s your tolerance for latency?
  • What’s your requirement for throughput?

Based on this, you can back out a guess as to whether you need an S-pod or a P-pod. Most people who aren’t serving production apps should go with S-pods, which are storage-optimized. For a rough starting point, see the sizing sketch below.
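
As a back-of-the-envelope helper (the per-pod capacities are the rough 1536-dimension estimates from earlier in this post, not official figures):

import math

# rough vectors-per-pod estimates for 1536-dim embeddings (see the pod section above)
CAPACITY = {"s1": 1_500_000, "p1": 400_000, "p2": 600_000}

def estimate_pods(num_vectors, pod_type="s1", target_fullness=0.7):
    # leave headroom so index_fullness stays below your scale-out threshold
    return math.ceil(num_vectors / (CAPACITY[pod_type] * target_fullness))

print(estimate_pods(2_000_000, "s1"))  # -> 2 pods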

Get steep discounts by committing to a spend

Once you have validated your Pinecone use case and see yourself using it for at least a year, it’s best to secure a discount based on a spend commitment - contact sales. This advice applies to any cloud service, really. It’s not uncommon to get a 25%+ discount by committing to a 1- or 2-year spend; AWS, for example, has done this for pretty much all of its popular products, such as EC2 reserved instances.

My final thoughts on various VectorDB options

In upcoming posts, I will review other vector databases like Weaviate, as well as open-source projects like LangChain and LlamaIndex. I help individuals and enterprises alike get their AI-embedding-based projects to MVP and beyond. You can reach me via https://twitter.com/nextworddev.

Cost optimization is closely tied to making good decisions about both infrastructure AND your code. As you saw, there are many levers you can pull to optimize your spend without settling for worse performance.