A 13-month-old LlamaIndex bug re-embeds unchanged content
A 13-month-old hashing default in LlamaIndex silently re-embeds byte-identical content. The bug fires on half the supported storage backends today; the other half is one upstream commit away from activating.
A re-indexing hash bug in llama-index-core re-embeds byte-identical content on every scheduled run that crosses a calendar day. It has been shipping in defaults for thirteen months, inside the SimpleDirectoryReader path that every official quickstart uses. I verified it end to end against real OpenAI billing, and source-verified the behavior across every fsspec backend maintained by the fsspec organization.
The bug is interesting because of how it behaves across the ecosystem. Two underlying issues interact in ways that produce opposite outcomes depending on where your documents live. On roughly half the storage backends LlamaIndex supports (local filesystem, GCS, SFTP, SMB, HDFS via pyarrow, in-memory), the bug fires today at day-level precision. On the other half (S3, Azure Blob, Alibaba OSS, Google Drive, Dropbox, plus ten others), it sits dormant, masked by an unrelated mismatch. A single well-intentioned change to that mismatch would activate the bug across every masked backend simultaneously, at sub-second precision instead of day-level.
This is a walk through both issues, why they produce opposite behavior, the experiments that demonstrate them, and the three-line fix that decouples them. Reproducers: github.com/stirelli/llamaindex-embedding-churn. Fix submitted as issue #21461 and PR #21462.
TL;DR
- Node.hash, TextNode.hash, and the IngestionCache key all include metadata via MetadataMode.ALL, ignoring excluded_embed_metadata_keys. Any change to a “volatile” metadata field flips the hash and forces a re-embed of otherwise-unchanged content.
- SimpleDirectoryReader on a local filesystem populates date-only timestamps. Scheduled re-indexing silently re-embeds modified files across calendar-day boundaries. Verified end to end against real OpenAI billing.
- The bug fires today on every fsspec backend whose _info() emits a lowercase mtime, atime, or created key. Source-verified active: local filesystem, GCS, SFTP, SMB, HDFS via pyarrow, in-memory.
- The bug is masked today on every backend that emits timestamps under different key names (LastModified, last_modified, modifiedTime, etc.) or no timestamps at all. Source-verified masked: S3, Azure Blob, Alibaba OSS, OpenStack Swift, TOS, Google Drive, Dropbox, IPFS, OpenDAL, plus the built-in HTTP, WebHDFS, FTP, DBFS, GitHub, Gist, and Git backends. A single change to default_file_metadata_func would lift the mask on every one of them simultaneously.
- Any caller on any backend that writes a custom file_metadata callable to get proper timestamps is already vulnerable, at whatever precision that callable provides, regardless of which backend the documents live on. Proof in Experiment C of the reproducer.
- Fix: change MetadataMode.ALL to MetadataMode.EMBED in three sites. Respects the existing excluded_embed_metadata_keys exclusion list. Submitted as issue #21461 and PR #21462.
Am I affected? A 60-second check
- Do you use LlamaIndex with SimpleDirectoryReader or any IngestionPipeline reader on a scheduled re-index? If no, this does not apply to you. If yes, continue.
- Where do your documents live?
  - Local filesystem, Google Cloud Storage, SFTP, SMB (Windows file shares), HDFS via pyarrow, or in-memory: affected today.
  - S3, Azure Blob, Alibaba OSS, Google Drive, Dropbox, OpenStack Swift, or any other masked backend listed below: dormant today, will activate when the upstream change lifts the mask.
  - A custom file_metadata callable that emits a real datetime: affected regardless of backend, at whatever precision the callable provides.
- Which embedding model? LlamaIndex’s default OpenAIEmbedding() is text-embedding-ada-002 at $0.10 per million tokens, five times the price of text-embedding-3-small. If you did not set the model explicitly, that is what you are running.
- Fix. Three lines, below. If you cannot wait for the PR to merge, patch MetadataMode.ALL to MetadataMode.EMBED in your local copy of llama-index-core to stop the churn without changing any other behavior.
Bug 1: Node.hash includes all metadata
The first piece is in llama-index-core/llama_index/core/schema.py. The hash property on a node computes a SHA-256 over a string that concatenates the text hash with the full metadata string:
    @property
    def hash(self) -> str:
        doc_identities = []
        # MetadataMode.ALL: every metadata key goes into the hash input.
        metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
        if metadata_str:
            doc_identities.append(metadata_str)
        # ... audio, image, video resources
        if self.text_resource is not None:
            doc_identities.append(self.text_resource.hash)
        doc_identity = "-".join(doc_identities)
        return str(sha256(doc_identity.encode("utf-8", "surrogatepass")).hexdigest())
The relevant detail is MetadataMode.ALL. Documents carry two exclusion lists, excluded_embed_metadata_keys and excluded_llm_metadata_keys, that control which fields are included in the text sent to the embedder and the LLM. MetadataMode.ALL ignores both lists. Every key goes into the hash, regardless of whether it was marked as volatile.
Under IngestionPipeline._handle_upserts, a hash mismatch triggers vector_store.delete() followed by a full re-embedding. Any field in the metadata dict that can change between ingestion runs will therefore trigger re-embed on the next run, even when the text content is byte-identical.
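The upsert decision described above can be sketched with the standard library alone. This is a simplified stand-in, not the LlamaIndex implementation: `node_hash`, `CountingEmbedder`, and `upsert` are hypothetical names, the docstore and vector store are plain dicts, and the dates are made up for illustration.

```python
from hashlib import sha256


def node_hash(text: str, metadata: dict) -> str:
    # Stand-in for a hash that covers ALL metadata keys plus the text.
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return sha256(f"{metadata_str}-{text}".encode("utf-8")).hexdigest()


class CountingEmbedder:
    """Hypothetical embedder that only counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, text: str) -> list:
        self.calls += 1
        return [0.0]  # placeholder vector


def upsert(docstore: dict, vectors: dict, embedder: CountingEmbedder,
           doc_id: str, text: str, metadata: dict) -> None:
    new_hash = node_hash(text, metadata)
    if docstore.get(doc_id) == new_hash:
        return  # hash unchanged: the embedder is never called
    vectors.pop(doc_id, None)  # hash mismatch: delete the stale vector ...
    vectors[doc_id] = embedder.embed(text)  # ... and pay for a new one
    docstore[doc_id] = new_hash


docstore, vectors, embedder = {}, {}, CountingEmbedder()
text = "Quarterly report, unchanged."
upsert(docstore, vectors, embedder, "a.md", text, {"last_modified_date": "2026-04-01"})
upsert(docstore, vectors, embedder, "a.md", text, {"last_modified_date": "2026-04-01"})
upsert(docstore, vectors, embedder, "a.md", text, {"last_modified_date": "2026-04-02"})
print(embedder.calls)  # 2: the third call re-embeds byte-identical text
```

The second call is skipped because nothing changed. The third call changes only a metadata value, yet it costs a full embed.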
To confirm the mechanism in isolation, the simplest experiment is a hash comparison under four distinct metadata mutations plus a control:
| Scenario | Text same? | Hash same? | Churn |
|---|---|---|---|
| atime | True | False | 4/4 (100%) |
| mtime | True | False | 4/4 (100%) |
| size | True | False | 4/4 (100%) |
| ctime | True | False | 4/4 (100%) |
| CONTROL | True | True | 0/4 (0%) |
Every mutation type flips the hash for 100% of chunks. The control, identical metadata on both runs, produces identical hashes. The hash is deterministic, not noisy. Bug 1 is real.
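A rough stdlib approximation of that comparison follows. It is not the reproducer script: `sketch_hash` only mirrors the shape of the hash property shown earlier, and the four chunks and the metadata values are invented for illustration.

```python
from hashlib import sha256


def sketch_hash(text: str, metadata: dict) -> str:
    # Same shape as the property above: metadata string, then the text hash,
    # joined with "-". Every metadata key is included, as in MetadataMode.ALL.
    identities = []
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    if metadata_str:
        identities.append(metadata_str)
    identities.append(sha256(text.encode("utf-8")).hexdigest())
    return sha256("-".join(identities).encode("utf-8", "surrogatepass")).hexdigest()


chunks = ["chunk one", "chunk two", "chunk three", "chunk four"]
base = {"atime": "2026-04-01", "mtime": "2026-04-01", "ctime": "2026-04-01", "size": 1024}
mutations = {
    "atime": {"atime": "2026-04-02"},
    "mtime": {"mtime": "2026-04-02"},
    "size": {"size": 1025},
    "ctime": {"ctime": "2026-04-02"},
    "CONTROL": {},  # identical metadata on both runs
}

churn = {}
for name, change in mutations.items():
    mutated = {**base, **change}
    # Count chunks whose hash differs even though the text is identical.
    churn[name] = sum(sketch_hash(c, base) != sketch_hash(c, mutated) for c in chunks)
    print(f"{name:8s} churn {churn[name]}/{len(chunks)}")
```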
End-to-end proof against the real OpenAI API
The hash comparison is a mechanical demonstration. The production-relevant question is whether this bug fires in real ingestion flows.
The smallest real flow is SimpleDirectoryReader over a local filesystem, the pattern every LlamaIndex quickstart uses. The reader populates file-stat-derived timestamps via default_file_metadata_func, formatting each via strftime("%Y-%m-%d"). That date-only granularity matters: modifications within the same calendar day do not change the metadata string. Only cross-day transitions flip it.
The production scenario is straightforward. A daily or weekly re-indexing job runs SimpleDirectoryReader over /data/docs. Between runs, a subset of files gets modified by a formatter, a sync tool, an editor save, a git checkout, anything that advances mtime. On the next run, each modified file’s last_modified_date string changes because the date crossed a day boundary since the last recorded mtime. The hash flips, the vectors get deleted, and the embedding provider gets called again.
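The day-granularity effect is easy to see in isolation. In this sketch `format_day` is a hypothetical helper that applies the same strftime pattern; pinning the conversion to UTC is an assumption made here to keep the example deterministic, not a statement about how LlamaIndex handles time zones.

```python
from datetime import datetime, timezone


def format_day(timestamp: float) -> str:
    # Date-only formatting: everything below day resolution is discarded.
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")


morning = datetime(2026, 4, 1, 9, 0, tzinfo=timezone.utc).timestamp()
evening = datetime(2026, 4, 1, 21, 0, tzinfo=timezone.utc).timestamp()
next_day = datetime(2026, 4, 2, 0, 5, tzinfo=timezone.utc).timestamp()

# Twelve hours apart on the same day: the metadata string is unchanged.
print(format_day(morning) == format_day(evening))   # True
# Three hours apart across midnight: the string changes, so the hash flips.
print(format_day(evening) == format_day(next_day))  # False
```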
To verify this against real billed usage, I set up a minimal pipeline with three small .md files, mtime pinned to yesterday via os.utime, SimpleDirectoryReader feeding a SentenceSplitter feeding OpenAIEmbedding(model="text-embedding-3-small"). Four phases: initial ingest, re-ingest with no change, touch() to advance mtime to today, re-ingest.
The billed token counts matched a local tiktoken estimate exactly, zero delta. Half of the run’s cost was the legitimate first ingest. The other half was a full re-embed of byte-identical content, triggered by a single touch that moved mtime from yesterday to today.
| | calls | tokens | cost (USD) |
|---|---|---|---|
| Legitimate (Ph 1) | 3 | 465 | $0.00000930 |
| Wasted (Ph 4) | 3 | 465 | $0.00000930 |
| TOTAL (actual) | 6 | 930 | $0.00001860 |

Overhead: 100% on top of the legitimate ingest, all of it spent re-embedding identical content.
At this scale the cost is a rounding error. The production-relevant question is not the cost of three files but how the trigger rate scales across a real corpus once it is firing, and whether the mask on cloud-backed readers is stable. I come back to the cost math in its own section below.
Testing one cloud backend end to end
At this point the claim was narrow but verified on local filesystem: SimpleDirectoryReader, cross-day modification, date-granular strftime producing a hash flip. The natural next question was whether the same behavior reproduced on a cloud backend.
I picked S3 because that was the one I had an account on. The specific numbers in this section are from S3. The behavior I describe does not generalize uniformly across every fsspec backend (see the GCS exception traced in the next section).
I expected the bug to be worse on cloud than on local filesystem. Cloud object timestamps have second or sub-second precision. If that field ended up in the hash, every re-upload, even sub-second apart, even of byte-identical content, would flip the hash. That would be a much broader trigger surface than “cross-day modifications on local files.”
I set up a real S3 bucket, uploaded three .md files, wired up S3Reader with the same counting-embedder pipeline, and ran the same experiment: first ingest, re-ingest without changes, re-upload with two seconds of delay, re-ingest again. The result was:
    Phase A1 (first ingest):    embed_calls = 3
    Phase A2 (no change):       embed_calls = 3  (expect unchanged)
    Phase A4 (after re-upload): embed_calls = 3  (expect unchanged if bug does NOT fire)
Zero new embed calls after the re-upload. The bug did not fire. I rebuilt the test from scratch twice, suspecting the docstore strategy or the cache configuration. The result was consistent. Re-uploading the same content through the default S3Reader path and re-running load_data() produces zero new embed calls.
The Document metadata explained why:
    >>> docs[0].metadata.keys()
    dict_keys(['file_path', 'file_name', 'file_type', 'file_size'])
No last_modified_date. No creation_date. None of the temporal fields that fire the bug on local filesystems are here.
Bug 2: default_file_metadata_func uses POSIX-only keys
The missing temporal fields on S3 are not missing because S3 does not have them. s3fs.stat() returns LastModified as a native datetime. Fsspec’s standard API exposes it via s3fs.modified(path). The data is there. It just is not being queried.
Look at how default_file_metadata_func in llama-index-core/llama_index/core/readers/file/base.py extracts timestamps:
    creation_date = _format_file_timestamp(stat_result.get("created"))
    last_modified_date = _format_file_timestamp(stat_result.get("mtime"))
    last_accessed_date = _format_file_timestamp(stat_result.get("atime"))
It queries the lowercase POSIX-style keys that fsspec’s local filesystem derives from Python’s os.stat(): "created", "mtime", "atime". Whether any other fsspec backend emits those keys is a decision made independently by each backend’s maintainers, and they have not agreed.
I read the _info() (or equivalent) of every filesystem maintained under the fsspec GitHub organization and every built-in implementation shipped in filesystem_spec. The split is clean. Roughly half of the ecosystem fires today, the other half is masked, and one wrapper depends on what it wraps.
Active today (the bug fires now)
| Backend | Timestamp keys _info() emits | Keys LlamaIndex reads |
|---|---|---|
| Local filesystem (built-in local) | mtime, atime, created, size, mode, uid, gid, ino, nlink | mtime, created |
| GCS (gcsfs) | mtime (legacy alias of updated), timeCreated | mtime |
| SFTP (sshfs external + built-in sftp) | mtime, time (atime-derived) | mtime |
| SMB / CIFS (built-in smb) | mtime (from SMB stats) | mtime |
| HDFS via pyarrow (built-in arrow) | mtime | mtime |
| In-memory (built-in memory) | created | created |
Masked today (the bug is dormant)
No POSIX-style keys reach the metadata, so stat_result.get("mtime") is always None. One change to default_file_metadata_func lifts the mask on all of them at once.
| Backend | Timestamp keys _info() emits |
|---|---|
| S3 (s3fs) | LastModified, ETag, StorageClass, VersionId, ContentType |
| Azure Blob (adlfs) | last_modified, creation_time, etag, content_settings, tags |
| Alibaba OSS (ossfs) | LastModified, ETag, Size |
| OpenStack Swift (swiftspec) | No stat timestamps |
| TOS (tosfs) | LastModified |
| Google Drive (gdrive-fsspec) | createdTime, modifiedTime (camelCase) |
| Dropbox (dropboxdrivefs) | No POSIX keys |
| IPFS (ipfsspec) | Content-addressed, no timestamps |
| OpenDAL (opendalfs) | No POSIX keys |
| Databricks FS (built-in dbfs) | No POSIX keys |
| HTTP (built-in http) | No filesystem metadata |
| WebHDFS, FTP, GitHub, Gist, Git (built-ins) | No POSIX keys |
Depends on the wrapped backend
| Backend | Behavior |
|---|---|
| Alluxio (alluxiofs) | Delegates _info() to the underlying filesystem. Status inherits from whichever one that is. |
GCS and SFTP stand out for a compatibility reason. gcsfs/core.py explicitly sets result["mtime"] = self._parse_timestamp(object_metadata["updated"]) in the path that populates _info(), with a TODO comment about removing the legacy name. That comment is a few years old. The code is still there. Both SFTP implementations emit mtime by POSIX convention because SFTP itself is a POSIX-shaped protocol. SMB and HDFS emit it for the same reason.
The result on the “Active today” side is that default_file_metadata_func gets a real datetime, formats it via strftime("%Y-%m-%d"), and feeds a day-granular timestamp into the node hash. Same cadence as local filesystem.
On the “Masked today” side, stat_result.get("mtime") returns None, _format_file_timestamp(None) returns None, and the postprocessor filters the None out before returning. Temporal fields silently don’t make it into the Document metadata. No warning, no log, no indication that the reader is working with a stripped-down metadata set.
Bug 2 is a stat-key mismatch, not a bug in any individual cloud backend. It is a LlamaIndex-level assumption that fsspec backends emit POSIX keys. About half do and half don’t. That split, not any particular cloud, determines whether Bug 1 is firing on you today.
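The masking can be illustrated without any cloud account. The sketch below is a simplified imitation of the lookup-and-filter pattern described above, not the LlamaIndex source: `sketch_metadata` is a hypothetical function, and the two stat dicts are hand-written stand-ins that only borrow key names from the tables above.

```python
from datetime import datetime, timezone


def sketch_metadata(stat_result: dict) -> dict:
    def fmt(value):
        # A missing key arrives here as None and stays None.
        if value is None:
            return None
        return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")

    metadata = {
        "file_size": stat_result.get("size"),
        "creation_date": fmt(stat_result.get("created")),
        "last_modified_date": fmt(stat_result.get("mtime")),
        "last_accessed_date": fmt(stat_result.get("atime")),
    }
    # None values are dropped silently: no warning, no log line.
    return {key: value for key, value in metadata.items() if value is not None}


# Hand-written stand-ins for what two kinds of backend might report.
local_like = {"size": 10, "mtime": 1775000000.0, "created": 1775000000.0}
s3_like = {"size": 10, "LastModified": datetime(2026, 4, 1, tzinfo=timezone.utc)}

print(sorted(sketch_metadata(local_like)))  # ['creation_date', 'file_size', 'last_modified_date']
print(sorted(sketch_metadata(s3_like)))     # ['file_size']
```

The S3-style dict carries a perfectly good timestamp, but under a key the lookup never asks for, so no temporal field reaches the metadata.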
Proof that fixing Bug 2 would activate Bug 1
The third experiment in the reproducer repo demonstrates what happens when a caller bypasses Bug 2 by writing a file_metadata callable that queries fsspec correctly:
    def s3_aware_metadata(file_path: str) -> dict:
        # Drop cached listings so modified() reflects the latest upload.
        shared_s3fs.invalidate_cache()
        return {
            "file_path": file_path,
            "file_size": shared_s3fs.size(file_path),
            # Second-precision timestamp queried through fsspec's modified().
            "last_modified_date":
                shared_s3fs.modified(file_path).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }

    reader_c = S3Reader(bucket=BUCKET, prefix=PREFIX,
                        file_metadata=s3_aware_metadata)
The only difference from Experiment A is the file_metadata argument. The rest of the pipeline is identical. With this in place, the same experiment fires the bug:
    Experiment A (default S3Reader.load_data()):
      embed_calls = 3 → bug fires: False
    Experiment C (custom s3_aware_metadata callable):
      embed_calls = 6 → bug fires: True
With correctly-queried datetime in metadata, a 2-second S3 re-upload flips last_modified_date, flips the hash, triggers the delete and re-embed. Three extra embed calls for three re-uploaded files, for zero content change. The mechanism is identical to Bug 1 in the first experiment. The only reason Experiment A did not fire is that the default reader path did not populate the field. Once that gap is closed, the churn bug is back. Now with second precision instead of day precision.
What this means in practice
If your documents live on a backend from the active list above (local filesystem, GCS, SFTP, SMB, HDFS-via-pyarrow, in-memory), the bug is live today. Any file whose mtime or created timestamp crosses a calendar day between scheduled indexing runs re-embeds on the next run.
The bug also fires on triggers that don’t look like scheduled re-indexing. Any operation that advances a file’s mtime across a calendar-day boundary qualifies. git checkout resets mtime to the moment of checkout by default, so a team that versions its document corpus in git reindexes on every fresh clone or branch switch. Docker image rebuilds set every file’s timestamp to the build time. Kubernetes rolling updates that mount ConfigMap or Secret volumes refresh timestamps on each deploy. Backup restores flip every timestamp at once and produce a full unnecessary re-index on the next cycle. A/B tests of different embedding models or chunking strategies re-index the entire corpus per variant. For multi-tenant SaaS, the bug applies per tenant and scales linearly with customer count.
If your documents live on a backend from the masked list above (S3, Azure Blob, Alibaba OSS, Swift, TOS, Google Drive, Dropbox, IPFS, OpenDAL, Databricks, plus the metadata-less HTTP/WebHDFS/FTP/Git family), the bug is not firing today, but not because your stack is correct. It is not firing because Bug 2 strips the metadata before Bug 1 can trip on it. The masking is environmental, not defensive. It survives until one of two things happens:
- LlamaIndex patches default_file_metadata_func to use fs.modified(path) instead of POSIX-only keys. The moment that lands, the hash-churn bug activates for every masked backend simultaneously, at sub-second precision because that is what cloud object APIs expose.
- Your team writes a custom file_metadata callable that queries fsspec directly to get real timestamps. Anyone who wants proper freshness tracking reaches for fs.modified(). The bug comes with it, at whatever precision that callable emits.
Beyond individual stacks, the bug propagates through the ecosystem at scale. llama-index-core sees roughly 6.5 million downloads per month on PyPI (verified April 2026), making it one of the most widely deployed retrieval frameworks in production. Downstream dependencies inherit the bug transitively. Pathway, a 63,000-star Python ETL framework for stream processing and real-time RAG, pins llama-index-core as a hard dependency in its pyproject.toml. Pathway users running scheduled indexing over active-list backends inherit this bug without necessarily knowing LlamaIndex is underneath.
What the overhead actually costs
Before the numbers, a note on how to read them. The formula below is exactly linear in five inputs, and the output is highly sensitive to the assumptions. Substitute your own values to get your number. The most assumption-heavy part of this analysis is the dev/CI multiplier in the larger scenarios. If your team does not run full ingestion in CI for every PR (most don’t), ignore that multiplier and look only at the production line.
    wasted_tokens/year = N_docs × daily_churn × tokens_per_doc × cycles_per_year
    wasted_usd/year    = wasted_tokens/year × embedding_price_per_token
The embedding-price spread matters: text-embedding-3-small is $0.02/M, text-embedding-3-large is $0.13/M, voyage-3-large is $0.18/M (verified against provider pricing, April 2026). The most expensive common choice is 9× the cheapest.
Three scales, running the math for each.
Small (startup or early SaaS RAG feature). 50,000 documents, 2% daily churn, 2,000 tokens/doc, daily re-indexing, text-embedding-3-small.
- Wasted tokens/year: 730M
- Production embedding cost: ~$15/year
- Not material at this scale. Correctness bug, not a spend problem.
Mid (established SaaS with serious RAG product). 500,000 documents, 5% daily churn, 3,000 tokens/doc, daily re-indexing, text-embedding-3-large.
- Wasted tokens/year: 27.4B
- Production embedding cost: ~$3,560/year
- Teams that run full ingestion in CI for every PR add dev/CI burn on top. For a 5-engineer team running full-pipeline ingestion tests on a 50K-document test corpus through every CI build, that adds an estimated $25-65K/year. This estimate is the most assumption-sensitive part of the model. If your CI does not run full ingestion, production is the whole picture.
- Vector-store write costs at this scale: negligible.
Large (enterprise at the kind of scale visible in LlamaIndex’s case-study list). 5M documents, 5% daily churn, 5,000 tokens/doc, daily re-indexing, text-embedding-3-large.
- Wasted tokens/year: 456B
- Production embedding cost: ~$59,000/year with 3-large, ~$82,000/year with voyage-3-large
- For a platform team shipping RAG as a product, dev/CI burn typically adds 1× to 2× production token spend on top, depending on how much ingestion runs in CI versus staging. Same caveat as above: this multiplier is the biggest source of uncertainty in the estimate.
- Vector-store writes at 1B wasted upserts/year: ~$4,000/year in Pinecone write units.
- Total realistic range: roughly $60-200K/year in production-relevant spend, with a wider band possible once dev/CI is fully counted.
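The production lines in the three scenarios follow directly from the formula. The sketch below encodes it as a small function so you can substitute your own inputs. The function name and the per-million-token price unit are choices made for this example, and it covers only production embedding spend, not the dev/CI or vector-store estimates.

```python
def wasted_per_year(n_docs, daily_churn, tokens_per_doc, cycles_per_year,
                    usd_per_million_tokens):
    """Return (wasted tokens per year, wasted USD per year)."""
    wasted_tokens = n_docs * daily_churn * tokens_per_doc * cycles_per_year
    wasted_usd = wasted_tokens / 1_000_000 * usd_per_million_tokens
    return wasted_tokens, wasted_usd


# Inputs copied from the Small, Mid, and Large scenarios above.
scenarios = {
    "Small": (50_000, 0.02, 2_000, 365, 0.02),     # text-embedding-3-small
    "Mid": (500_000, 0.05, 3_000, 365, 0.13),      # text-embedding-3-large
    "Large": (5_000_000, 0.05, 5_000, 365, 0.13),  # text-embedding-3-large
}

for name, inputs in scenarios.items():
    tokens, usd = wasted_per_year(*inputs)
    print(f"{name}: {tokens / 1e9:.2f}B tokens/year, about ${usd:,.0f}/year")
```

Swapping the last input of the Large scenario to 0.18 gives the voyage-3-large figure.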
A second consideration: rate-limit capacity. OpenAI’s embedding rate limits are enforced at the organization level. A team running scheduled ingestion and customer-facing inference under the same billing account watches user-facing traffic compete for capacity with re-embedding waste. At the Large scenario above, the wasted tokens consume meaningful capacity that could otherwise serve real product traffic. The exact impact depends on your tier and traffic shape, but the pattern is worth being aware of: most teams share a key across workloads by default, so this competition is live unless someone has explicitly separated ingestion from inference traffic.
The money is not the main reason to care about this bug. At small scale it is trivial. At enterprise scale it is material but not the most expensive thing in the stack. The reason to care is the systemic property: the same pair of issues produces opposite behavior across half the storage backends LlamaIndex supports, has shipped in defaults for thirteen months, and a single well-intentioned cleanup commit flips every masked backend to firing at once. The dollars demonstrate the scale. The underlying concern is the equilibrium itself.
Reproducing it
All five levels of reproducer are in a public repo: github.com/stirelli/llamaindex-embedding-churn.
    git clone https://github.com/stirelli/llamaindex-embedding-churn
    cd llamaindex-embedding-churn
    uv venv
    uv pip install -r requirements.txt
    uv run python verify_embedding_churn_lvl1.py   # hash comparison
    uv run python verify_embedding_churn_lvl2.py   # counting embedder
    uv run python verify_embedding_churn_lvl4.py   # reader format survey
Levels 1, 2, and 4 run in under ten seconds with no external dependencies. Level 3 needs an OpenAI API key (cost per run is around two thousandths of a cent, and your dashboard will reflect the exact tokens). Level 5 needs AWS credentials and an S3 bucket (cost: fractions of a cent in S3 request fees).
The fix
The minimal set of changes is three lines across two files. All three sites use MetadataMode.ALL today. Changing them to MetadataMode.EMBED makes the hash align with the text that is actually sent to the embedder, which is the semantically correct thing to hash.
    - metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
    + metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
This respects excluded_embed_metadata_keys, the mechanism that already exists for marking fields as not-content-relevant. SimpleDirectoryReader populates that list with the volatile file-stat fields by default. The fix closes the churn without touching the original behavior added to detect meaningful metadata changes.
I submitted the findings as issue #21461 and PR #21462, with the three-line fix plus a regression test that covers both directions: volatile metadata must not force a re-embed, content-relevant metadata still must.
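The two directions of that regression test can be sketched with a stdlib stand-in. This is not the test from the PR: `sketch_hash`, its `mode` switch, and the sample metadata are hypothetical, and the exclusion tuple plays the role of excluded_embed_metadata_keys.

```python
from hashlib import sha256


def sketch_hash(text, metadata, excluded_embed_keys=(), mode="EMBED"):
    # EMBED mode drops the excluded keys before hashing; ALL mode keeps everything.
    if mode == "EMBED":
        visible = {k: v for k, v in metadata.items() if k not in excluded_embed_keys}
    else:
        visible = dict(metadata)
    metadata_str = "\n".join(f"{key}: {value}" for key, value in visible.items())
    return sha256(f"{metadata_str}-{text}".encode("utf-8")).hexdigest()


excluded = ("last_modified_date",)  # volatile field, marked as not embed-relevant
text = "unchanged body"
before = {"title": "Pricing", "last_modified_date": "2026-04-01"}
touched = {"title": "Pricing", "last_modified_date": "2026-04-02"}
retitled = {"title": "Pricing v2", "last_modified_date": "2026-04-01"}

# Direction 1: a volatile-only change must not force a re-embed.
assert sketch_hash(text, before, excluded) == sketch_hash(text, touched, excluded)
# Direction 2: a content-relevant metadata change still must.
assert sketch_hash(text, before, excluded) != sketch_hash(text, retitled, excluded)
# For contrast, hashing all keys flips on the volatile change.
assert sketch_hash(text, before, excluded, "ALL") != sketch_hash(text, touched, excluded, "ALL")
print("both directions hold")
```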
Closing
Two issues producing opposite behavior across a large ecosystem form an equilibrium that holds only until someone disturbs it. It survived thirteen months because neither half produces a visible failure on its own. Bug 1 looks like normal re-embedding to the user who has never measured it. Bug 2 looks like the reader just not emitting temporal metadata, which is fine if nobody notices. On roughly half the storage backends LlamaIndex supports, the two align and the churn fires quietly at day-level precision today. On the other half, they cancel and the churn is dormant, waiting for one fix to lift the mask.
The narrow takeaway: audit your ingestion pipeline’s hash key against the input your embedder actually sees. If the two are not the same, you are paying for work that produces the same output. The broader takeaway is that this kind of interaction sits quietly in large codebases, and the same cleanup PR that fixes one half of a pair can release the other half at scale.