A 13-month-old LlamaIndex bug re-embeds unchanged content
A re-indexing hash bug in llama-index-core re-embeds byte-identical content on every scheduled run. It has been shipping in defaults for thirteen months. A source-level inspection of the fsspec-maintained ecosystem shows the bug fires today on local filesystem, GCS, SFTP, SMB, and HDFS-via-pyarrow; the same bug sits dormant on S3, Azure Blob, Alibaba OSS, Google Drive, Dropbox, and ten others, masked by a stat-key mismatch that a single upstream commit would lift. Verified end to end against real OpenAI billing on local filesystem and S3, and source-verified across every fsspec backend maintained by the fsspec org.
The core finding: llama-index-core's node hash includes volatile metadata, so any scheduled re-index that crosses a calendar day re-embeds byte-identical content. I verified it end to end against real OpenAI billing. The bug has shipped in defaults for thirteen months, inside the SimpleDirectoryReader path that every official quickstart uses.
What makes it interesting is that the blast radius depends on which fsspec backend the documents live on. The same two bugs produce opposite behaviour across the ecosystem:
- Active today, at day-level precision: local filesystem, GCS (`gcsfs`), SFTP (`sshfs` and the built-in `sftp`), SMB (`smb`), HDFS via pyarrow (`arrow`), in-memory (`memory`). Any backend whose `_info()` returns a lowercase `mtime`, `atime`, or `created` key makes it through to the hash.
- Masked today, waiting on one upstream fix to activate: S3 (`s3fs`), Azure Blob (`adlfs`), Alibaba OSS (`ossfs`), OpenStack Swift (`swiftspec`), TOS (`tosfs`), Google Drive (`gdrive-fsspec`), Dropbox (`dropboxdrivefs`), IPFS (`ipfsspec`), OpenDAL (`opendalfs`), Databricks FS (`dbfs`), plus the built-in HTTP, WebHDFS, FTP, GitHub, Gist, and Git backends. Each of these either emits timestamps under non-POSIX key names or emits no timestamp at all; `default_file_metadata_func` looks for keys that do not exist and silently drops the temporal metadata.
That split is source-verified: I read every `_info()` implementation in the fsspec GitHub organisation and the built-in implementations in filesystem_spec. I verified local filesystem and S3 end to end with real billed OpenAI usage. The rest of the split is by static inspection of each backend's stat-key emission.
This is a walk through both bugs, why the same pair produces opposite behaviour across the ecosystem, the experiments that demonstrate them, and the three-line fix that decouples them. Reproducers: github.com/stirelli/llamaindex-embedding-churn. Fix filed upstream: issue #21461, PR #21462.
TL;DR
- `Node.hash`, `TextNode.hash`, and the `IngestionCache` key all include metadata via `MetadataMode.ALL`, ignoring `excluded_embed_metadata_keys`. Any change to a "volatile" metadata field flips the hash and forces a re-embed of otherwise-unchanged content. `SimpleDirectoryReader` on a local filesystem populates date-only timestamps, so scheduled re-indexing silently re-embeds modified files across calendar-day boundaries. Verified end to end against real OpenAI billing.
- The bug fires today, at day-level precision, on every fsspec backend whose `_info()` emits a lowercase `mtime`, `atime`, or `created` key. Source-verified active today: local filesystem, GCS (`gcsfs`), SFTP (`sshfs` and built-in `sftp`), SMB (built-in `smb`), HDFS via pyarrow (built-in `arrow`), and in-memory (built-in `memory`).
- The bug is masked today on every fsspec backend that emits timestamps under other key names (`LastModified`, `last_modified`, `modifiedTime`, `creation_time`, etc.), or no timestamps at all. Source-verified masked today: S3 (`s3fs`), Azure Blob (`adlfs`), Alibaba OSS (`ossfs`), OpenStack Swift (`swiftspec`), TOS (`tosfs`), Google Drive (`gdrive-fsspec`), Dropbox (`dropboxdrivefs`), IPFS (`ipfsspec`), OpenDAL (`opendalfs`), and the built-in HTTP, WebHDFS, FTP, DBFS, GitHub, Gist, and Git. A single upstream commit to `default_file_metadata_func` would lift the mask on every one of them simultaneously.
- Any caller on any backend that writes a custom `file_metadata` callable to get proper timestamps is already vulnerable, at whatever precision that callable provides, regardless of which backend the documents live on. Proof in Experiment C of the reproducer.
- Fix: change `MetadataMode.ALL` to `MetadataMode.EMBED` at three sites. This respects the existing `excluded_embed_metadata_keys` exclusion list. Filed as issue #21461 and PR #21462.
Am I affected? A 60-second check
- Do you use LlamaIndex with `SimpleDirectoryReader` or any `IngestionPipeline` reader on a scheduled re-index? If no, this does not apply to you. If yes, continue.
- Where do your documents live?
  - Local filesystem, Google Cloud Storage, SFTP, SMB (Windows file shares), HDFS via pyarrow, or in-memory → affected today.
  - S3, Azure Blob, Alibaba OSS, Google Drive, Dropbox, OpenStack Swift, or any other masked backend listed further down → dormant today; will activate when the upstream fix lifts the mask.
  - A custom `file_metadata` callable that emits a real datetime → affected regardless of backend, at whatever precision the callable provides.
- Which embedding model? LlamaIndex's default `OpenAIEmbedding()` is `text-embedding-ada-002` at $0.10 per million tokens, five times the price of `text-embedding-3-small`. If you did not set the model explicitly, that is what you are running.
- Fix: three lines, below. If you cannot wait for the PR to merge, patch `MetadataMode.ALL` to `MetadataMode.EMBED` in your local copy of `llama-index-core` to stop the churn without changing any other behaviour.
Bug 1: Node.hash includes all metadata
The first piece is in llama-index-core/llama_index/core/schema.py. The hash property on a node computes a SHA-256 over a string that concatenates the text hash with the full metadata string:
```python
@property
def hash(self) -> str:
    doc_identities = []
    metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
    if metadata_str:
        doc_identities.append(metadata_str)
    # ... audio, image, video resources
    if self.text_resource is not None:
        doc_identities.append(self.text_resource.hash)
    doc_identity = "-".join(doc_identities)
    return str(sha256(doc_identity.encode("utf-8", "surrogatepass")).hexdigest())
```
The relevant detail is MetadataMode.ALL. Documents carry two exclusion lists, excluded_embed_metadata_keys and excluded_llm_metadata_keys, that control which fields are included in the text sent to the embedder and the LLM. MetadataMode.ALL ignores both lists. Every key goes into the hash, regardless of whether it was marked as volatile.
Under IngestionPipeline._handle_upserts, a hash mismatch triggers vector_store.delete() followed by a full re-embedding. Any field in the metadata dict that can change between ingestion runs will therefore trigger re-embed on the next run, even when the text content is byte-identical.
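The mechanism can be mimicked with the stdlib alone. The sketch below is a hand-rolled stand-in for the hashing behaviour, not LlamaIndex's actual code; the `respect_exclusions` flag models the difference between `MetadataMode.ALL` and `MetadataMode.EMBED`:

```python
from hashlib import sha256

def node_hash(text: str, metadata: dict, excluded: set,
              respect_exclusions: bool) -> str:
    # Mimic Node.hash: concatenate a metadata string with the text hash.
    keys = sorted(k for k in metadata
                  if not respect_exclusions or k not in excluded)
    metadata_str = "\n".join(f"{k}: {metadata[k]}" for k in keys)
    text_hash = sha256(text.encode("utf-8")).hexdigest()
    return sha256(f"{metadata_str}-{text_hash}".encode("utf-8")).hexdigest()

meta_day1 = {"file_name": "a.md", "last_modified_date": "2026-04-01"}
meta_day2 = {"file_name": "a.md", "last_modified_date": "2026-04-02"}
excluded = {"last_modified_date"}

# ALL-style hashing: a timestamp-only change flips the hash.
assert node_hash("same text", meta_day1, excluded, False) != \
       node_hash("same text", meta_day2, excluded, False)
# EMBED-style hashing: the exclusion list keeps it stable.
assert node_hash("same text", meta_day1, excluded, True) == \
       node_hash("same text", meta_day2, excluded, True)
```

Same text, one metadata key changed: the unexclusive hash flips, the exclusion-respecting one does not.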
To confirm the mechanism in isolation, the simplest experiment is a hash comparison under four distinct metadata mutations plus a control:
| Scenario | Text same? | Hash same? | Churn |
|---|---|---|---|
| atime | True | False | 4/4 (100%) |
| mtime | True | False | 4/4 (100%) |
| size | True | False | 4/4 (100%) |
| ctime | True | False | 4/4 (100%) |
| CONTROL | True | True | 0/4 (0%) |
Every mutation type flips the hash for 100% of chunks. The control, identical metadata on both runs, produces identical hashes. The hash is deterministic, not noisy. Bug 1 is real.
End-to-end proof against the real OpenAI API
The hash comparison is a mechanical demonstration. The production-relevant question is whether this bug fires in real ingestion flows.
The smallest real flow is SimpleDirectoryReader over a local filesystem, the pattern every LlamaIndex quickstart uses. The reader populates file-stat-derived timestamps via default_file_metadata_func, formatting each via strftime("%Y-%m-%d"). That date-only granularity matters: modifications within the same calendar day do not change the metadata string. Only cross-day transitions flip it.
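The day-granularity behaviour is easy to see in isolation (plain stdlib, not the reader's code):

```python
from datetime import datetime, timedelta

fmt = "%Y-%m-%d"  # the date-only format the metadata string carries

t1 = datetime(2026, 4, 1, 9, 0)
t2 = t1 + timedelta(hours=6)   # modified later the same day
t3 = t1 + timedelta(days=1)    # modified the next day

assert t1.strftime(fmt) == t2.strftime(fmt)  # same-day edit: metadata string unchanged
assert t1.strftime(fmt) != t3.strftime(fmt)  # cross-day edit: string changes, hash flips
```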
The production scenario is straightforward. A daily or weekly re-indexing job runs SimpleDirectoryReader over /data/docs. Between runs, a subset of files gets modified by a formatter, a sync tool, an editor save, a git checkout, anything that advances mtime. On the next run, each modified file’s last_modified_date string changes because the date crossed a day boundary since the last recorded mtime. The hash flips, the vectors get deleted, and the embedding provider gets called again.
To verify this against real billed usage, I set up a minimal pipeline with three small .md files, mtime pinned to yesterday via os.utime, SimpleDirectoryReader feeding a SentenceSplitter feeding OpenAIEmbedding(model="text-embedding-3-small"). Four phases: initial ingest, re-ingest with no change, touch() to advance mtime to today, re-ingest.
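The mtime setup for that experiment can be sketched with the stdlib alone (a minimal stand-in, not the repo's script):

```python
import os
import pathlib
import tempfile
import time

with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / "doc.md"
    p.write_text("byte-identical content")

    # Phase 1 setup: pin mtime 25 hours back, guaranteeing a prior calendar day.
    yesterday = time.time() - 25 * 3600
    os.utime(p, (yesterday, yesterday))
    date_before = time.strftime("%Y-%m-%d", time.localtime(p.stat().st_mtime))

    # Phase 3: touch the file forward to "now" without changing its bytes.
    p.touch()
    date_after = time.strftime("%Y-%m-%d", time.localtime(p.stat().st_mtime))

# The date-only string differs, which is all the hash needs to flip.
assert date_before != date_after
```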
The billed token counts matched a local tiktoken estimate exactly, zero delta. Half of the run’s cost was the legitimate first ingest. The other half was a full re-embed of byte-identical content, triggered by a single touch that moved mtime from yesterday to today.
| | calls | tokens | cost (USD) |
|---|---|---|---|
| Legitimate (Ph 1) | 3 | 465 | $0.00000930 |
| Wasted (Ph 4) | 3 | 465 | $0.00000930 |
| TOTAL (actual) | 6 | 930 | $0.00001860 |

Overhead: the wasted re-embed equals 100% of the legitimate ingest cost, and every one of those re-embedded tokens was spent on identical content.
At this scale the cost is a rounding error. The production-relevant question is not the cost of three files but how the trigger rate scales across a real corpus once it is firing, and whether the mask on cloud-backed readers is stable. I come back to the cost math in its own section below.
Testing one cloud backend end to end
At this point the claim was narrow but verified on local filesystem: SimpleDirectoryReader, cross-day modification, date-granular strftime producing a hash flip. The natural next question was whether the same behaviour reproduced on a cloud backend.
I picked S3 because that was the one I had an account on. The specific numbers in this section are from S3; the behaviour I describe does not generalise uniformly across every fsspec backend (see the GCS exception traced in the next section).
I expected the bug to be worse on cloud than on local filesystem. Cloud object timestamps have second or sub-second precision. If that field ended up in the hash, every re-upload, even sub-second apart, even of byte-identical content, would flip the hash. That would be a much broader trigger surface than “cross-day modifications on local files.”
I set up a real S3 bucket, uploaded three .md files, wired up S3Reader with the same counting-embedder pipeline, and ran the same experiment: first ingest, re-ingest without changes, re-upload with two seconds of delay, re-ingest again. The result was:
```
Phase A1 (first ingest):     embed_calls = 3
Phase A2 (no change):        embed_calls = 3 (expect unchanged)
Phase A4 (after re-upload):  embed_calls = 3 (expect unchanged if bug does NOT fire)
```
Zero new embed calls after the re-upload. The bug did not fire. I rebuilt the test from scratch twice, suspecting the docstore strategy or the cache configuration. The result was consistent: re-uploading the same content to S3 and re-running load_data() through the default reader produces zero new embed calls.
The Document metadata explained why:
```python
>>> docs[0].metadata.keys()
dict_keys(['file_path', 'file_name', 'file_type', 'file_size'])
```
No last_modified_date. No creation_date. None of the temporal fields that fire the bug on local filesystems are here.
Bug 2: default_file_metadata_func uses POSIX-only keys, and not every fsspec backend plays along
The missing temporal fields on S3 are not missing because S3 lacks them. s3fs's `stat()` returns `LastModified` as a native datetime, and fsspec's portable API exposes it via `fs.modified(path)`. The data is there; it just is not being queried.
Look at how default_file_metadata_func in llama-index-core/llama_index/core/readers/file/base.py extracts timestamps:
```python
creation_date = _format_file_timestamp(stat_result.get("created"))
last_modified_date = _format_file_timestamp(stat_result.get("mtime"))
last_accessed_date = _format_file_timestamp(stat_result.get("atime"))
```
It queries the lowercase POSIX-style keys that Python’s os.stat() returns for local files: "created", "mtime", "atime". Whether a given fsspec backend emits those keys is a decision made independently by each backend’s maintainers, and they have not agreed.
I read the _info() (or equivalent) of every filesystem maintained under the fsspec GitHub organisation and every built-in implementation shipped in filesystem_spec. The split is clean: roughly half of the ecosystem fires today, the other half is masked, and one wrapper depends on what it wraps.
Active today (the bug fires now)
| Backend | Timestamp keys `_info()` emits | Keys LlamaIndex reads |
|---|---|---|
| Local filesystem (built-in `local`) | `mtime`, `atime`, `created`, `size`, `mode`, `uid`, `gid`, `ino`, `nlink` | `mtime`, `created` |
| GCS (`gcsfs`) | `mtime` (legacy alias of `updated`), `timeCreated` | `mtime` |
| SFTP (`sshfs` external + built-in `sftp`) | `mtime`, `time` (atime-derived) | `mtime` |
| SMB / CIFS (built-in `smb`) | `mtime` (from SMB stats) | `mtime` |
| HDFS via pyarrow (built-in `arrow`) | `mtime` | `mtime` |
| In-memory (built-in `memory`) | `created` | `created` |
Masked today (the bug is dormant)
No POSIX-style keys reach the metadata, so stat_result.get("mtime") is always None. One upstream commit to default_file_metadata_func lifts the mask on all of them at once.
| Backend | Timestamp keys `_info()` emits |
|---|---|
| S3 (`s3fs`) | `LastModified`, `ETag`, `StorageClass`, `VersionId`, `ContentType` |
| Azure Blob (`adlfs`) | `last_modified`, `creation_time`, `etag`, `content_settings`, `tags` |
| Alibaba OSS (`ossfs`) | `LastModified`, `ETag`, `Size` |
| OpenStack Swift (`swiftspec`) | No stat timestamps |
| TOS (`tosfs`) | `LastModified` |
| Google Drive (`gdrive-fsspec`) | `createdTime`, `modifiedTime` (camelCase) |
| Dropbox (`dropboxdrivefs`) | No POSIX keys |
| IPFS (`ipfsspec`) | Content-addressed, no timestamps |
| OpenDAL (`opendalfs`) | No POSIX keys |
| Databricks FS (built-in `dbfs`) | No POSIX keys |
| HTTP (built-in `http`) | No filesystem metadata |
| WebHDFS, FTP, GitHub, Gist, Git (built-ins) | No POSIX keys |
Depends on the wrapped backend
| Backend | Behaviour |
|---|---|
| Alluxio (`alluxiofs`) | Delegates `_info()` to the underlying filesystem; status inherits from whichever one that is. |
GCS and SFTP stand out for a compatibility-layer reason. `gcsfs/core.py` explicitly sets `result["mtime"] = self._parse_timestamp(object_metadata["updated"])` in the path that populates `_info()`, with a TODO comment about removing the legacy name. That comment is a few years old; the code is still there. Both SFTP implementations (`sshfs` and the built-in) emit `mtime` by POSIX convention because SFTP itself is a POSIX-shaped protocol. SMB and HDFS emit it for the same reason.
The result on the “Active today” side is that default_file_metadata_func gets a real datetime, formats it via strftime("%Y-%m-%d"), and feeds a day-granular timestamp into the node hash. Same cadence as local filesystem.
On the “Masked today” side, stat_result.get("mtime") returns None, _format_file_timestamp(None) returns None, and the default_file_metadata_func postprocessor filters the None out before returning. Temporal fields silently don’t make it into the Document metadata. No warning, no log, no indication that the reader is working with a stripped-down metadata set.
Bug 2 is a stat-key mismatch, not a bug in any individual cloud backend. It is a LlamaIndex-level assumption that fsspec backends emit POSIX keys. About half do and half don’t. That split, not any particular cloud, determines whether Bug 1 is firing on you today.
Proof that fixing Bug 2 would activate Bug 1
The third experiment in the reproducer repo demonstrates what happens when a caller bypasses Bug 2 by writing a file_metadata callable that queries fsspec correctly:
```python
def s3_aware_metadata(file_path: str) -> dict:
    shared_s3fs.invalidate_cache()
    return {
        "file_path": file_path,
        "file_size": shared_s3fs.size(file_path),
        "last_modified_date":
            shared_s3fs.modified(file_path).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

reader_c = S3Reader(bucket=BUCKET, prefix=PREFIX,
                    file_metadata=s3_aware_metadata)
```
The only difference from Experiment A is the file_metadata argument. The rest of the pipeline is identical. With this in place, the same experiment fires the bug:
```
Experiment A (default S3Reader.load_data()):
  embed_calls = 3  → bug fires: False
Experiment C (custom s3_aware_metadata callable):
  embed_calls = 6  → bug fires: True
```
With correctly-queried datetime in metadata, a 2-second S3 re-upload flips last_modified_date, flips the hash, triggers the delete and re-embed. Three extra embed calls for three re-uploaded files, for zero content change. The mechanism is identical to Bug 1 in the first experiment. The only reason Experiment A did not fire is that the default reader path did not populate the field. Once that gap is closed, the churn bug is back. Now with second precision instead of day precision.
What this means in practice
If your documents live on a backend from the active list above (local filesystem, GCS, SFTP, SMB, HDFS-via-pyarrow, in-memory), the bug is live today. Any file whose mtime or created timestamp crosses a calendar day between scheduled indexing runs re-embeds on the next run. Verified end to end on local filesystem; the rest share the exact same code path once the POSIX key reaches default_file_metadata_func. For teams running daily or weekly re-indexing over a corpus that sees writes between runs (dataset sync, upload pipeline, editor workflow, customer-facing write paths), this is silent overhead that scales with modification rate.
Triggers that do not look like scheduled re-indexing. The bug fires on any operation that advances a file’s mtime across a calendar-day boundary. git checkout resets mtime to the moment of checkout by default, so a team that versions its document corpus in git reindexes on every fresh clone or branch switch. Docker image rebuilds set every file’s timestamp to the build time. Kubernetes rolling updates that mount ConfigMap or Secret volumes refresh timestamps on each deploy. Backup restores flip every timestamp at once and produce a full unnecessary re-index on the next cycle. A/B tests of different embedding models or chunking strategies re-index the entire corpus per variant. For multi-tenant SaaS, the bug applies per tenant and scales linearly with customer count. Few of these look like “scheduled re-indexing” from a CFO’s angle; all of them hit the same code path.
If your documents live on a backend from the masked list above (S3, Azure Blob, Alibaba OSS, Swift, TOS, Google Drive, Dropbox, IPFS, OpenDAL, Databricks, plus the metadata-less HTTP/WebHDFS/FTP/Git family), the bug is not firing today, but not because your stack is correct. It is not firing because Bug 2 strips the metadata before Bug 1 can trip on it. The masking is environmental, not defensive. It survives until one of two things happens:
- LlamaIndex patches `default_file_metadata_func` to use `fs.modified(path)` instead of POSIX-only keys. That is the natural fix, and it is a one-liner. The moment it lands, the hash-churn bug activates for every masked backend simultaneously, at sub-second precision, because that is what cloud object APIs expose. Every S3 re-upload, every Azure Blob update, every OSS or Swift or TOS put becomes a re-embed trigger, even for byte-identical content.
- Your team writes a custom `file_metadata` callable that queries fsspec directly to get real timestamps into the document metadata. Anyone who wants proper freshness tracking reaches for `fs.modified()`. The bug comes with it, at whatever precision that callable emits. That path is already live for any project that has written one, regardless of backend.
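For illustration, a backend-agnostic lookup in the spirit of the first path might try the key names from the survey above in order. This is a hypothetical helper, not the filed PR:

```python
from datetime import datetime

# Key names observed across the fsspec ecosystem (see the backend tables above).
_TIMESTAMP_KEYS = ("mtime", "LastModified", "last_modified", "modifiedTime")

def portable_mtime(stat_result: dict):
    """Return the first timestamp found under any known key name, else None."""
    for key in _TIMESTAMP_KEYS:
        value = stat_result.get(key)
        if value is not None:
            return value
    return None

assert portable_mtime({"mtime": 1712000000.0}) == 1712000000.0             # local-style
assert portable_mtime({"LastModified": datetime(2026, 4, 1)}) is not None  # s3fs-style
assert portable_mtime({"ETag": '"abc"'}) is None                           # no timestamp at all
```

Note that landing a helper like this without the hash fix would activate the churn on every masked backend at once, which is exactly the coupling this article is about.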
What the overhead actually costs
The dollar impact is exactly linear in five inputs. The formula:
```
wasted_tokens/year  = N_docs × daily_churn × tokens_per_doc × cycles_per_year
wasted_usd/year     = wasted_tokens/year × embedding_price_per_token
dev_ci_multiplier   = tokens spent re-running ingestion in CI, dev, and staging,
                      as a multiple of production burn
vector_store_cost   = writes/year × store_write_price
total_real_overhead ≈ wasted_usd/year × dev_ci_multiplier + vector_store_cost
```
The assumptions are published so the reader can substitute their own. The embedding-price spread matters a lot: text-embedding-3-small is $0.02/M tokens, text-embedding-3-large is $0.13/M, and voyage-3-large is $0.18/M (verified against provider pricing, April 2026). The most expensive common choice is 9× the cheapest.
Three scales, running the math for each.
Small (startup / early SaaS RAG feature). 50,000 documents, 2% daily churn, 2,000 tokens/doc, daily re-indexing, text-embedding-3-small.
- Wasted tokens/year: 730M
- Production embedding cost: ~$15/year
- Dev/CI multiplier: lunch money
- Not material. Correctness bug, not a spend problem at this scale.
Mid (established SaaS with serious RAG product, small eng team). 500,000 documents, 5% daily churn, 3,000 tokens/doc, daily re-indexing, text-embedding-3-large.
- Wasted tokens/year: 27.4B
- Production embedding cost: ~$3,560/year
- Plus dev/CI burn: a 5-engineer team running ingestion tests 3× each on a 50K-doc test corpus through every CI build (20 PRs/day, ~100M tokens per full-pipeline run) easily adds 200–500B tokens/year across dev, CI, and staging. At $0.13/M that is $26K–$65K/year on top.
- Plus Pinecone-style write amplification ($4/M writes): ~$100/year. Negligible.
- Total realistic range: $30K–70K/year once you count the pipeline outside production.
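As a sanity check, the mid-scale production numbers above drop straight out of the formula:

```python
def wasted_embedding_cost(n_docs, daily_churn, tokens_per_doc,
                          cycles_per_year, usd_per_m_tokens):
    """Return (wasted tokens/year, wasted USD/year) from the linear formula."""
    wasted_tokens = n_docs * daily_churn * tokens_per_doc * cycles_per_year
    return wasted_tokens, wasted_tokens * usd_per_m_tokens / 1_000_000

# Mid scenario: 500K docs, 5% daily churn, 3K tokens/doc, daily runs, 3-large.
tokens, usd = wasted_embedding_cost(500_000, 0.05, 3_000, 365, 0.13)
assert round(tokens / 1e9, 1) == 27.4  # ~27.4B wasted tokens/year
assert 3_500 < usd < 3_600             # ~$3,560/year in production embedding spend
```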
Large (enterprise on the kind of scale visible in LlamaIndex’s own case-study list: Boeing/Jeppesen, Experian, KPMG, Carlyle, Cemex, NTT DATA). 5M documents, 5% daily churn, 5,000 tokens/doc, daily re-indexing, text-embedding-3-large.
- Wasted tokens/year: 456B
- Production embedding cost: ~$59,000/year with 3-large, ~$82,000/year with `voyage-3-large`
- Dev/CI multiplier for a platform team shipping RAG as a product: typically 1.5–3× production token burn
- Vector-store writes at 1B wasted upserts/year: ~$4,000/year in Pinecone write units
- Total realistic range: $120K–300K/year once dev/CI is counted
- At hourly re-indexing or with sub-second triggers (post-activation), multiply by 10–50×. That is the territory where individual pipeline cost lines show up in quarterly reviews.
Ecosystem-aggregate. llama-index-core sees roughly 6.5 million downloads per month on PyPI (verified April 2026). Only a fraction of those are production users at scale, but the tail is long. If even 1% of active installations run the default SimpleDirectoryReader on an active-list backend with daily re-indexing and modest churn, the aggregate wasted embedding spend across the ecosystem is in the millions of dollars per month, with most of it invisible in any single user’s bill. That is the shape of drift that is hardest to catch, because nobody has enough skin in their own line to investigate it.
A notable downstream amplifier worth flagging: Pathway, a 63,000-star Python ETL framework for stream processing and real-time RAG, pins llama-index-core as a hard dependency in its pyproject.toml. Pathway users running scheduled indexing over active-list backends inherit this bug without necessarily knowing LlamaIndex is underneath.
A second multiplier worth pulling out separately: rate-limit capacity. The dollar cost above is one metric; OpenAI's rate limits are another, and they are enforced at the organisation level, not per API key. Tier 1 caps embedding traffic at 1M tokens per minute and 3,000 requests per minute; Tier 5 tops out at 10M TPM / 10K RPM. At the large-enterprise scenario above (456 billion wasted tokens per year), that waste alone consumes roughly 7,600 hours of Tier 1 TPM capacity per year, or about 32 days of Tier 5. Teams whose OpenAI organisation runs scheduled ingestion and live customer-facing inference under the same billing account watch user-facing traffic get throttled by waste rather than by real growth. The cost shows up as 429 rate-limit errors on the product and as procurement decisions to move to a higher tier sooner than real capacity demanded. Most teams share a key across workloads by default (a single OPENAI_API_KEY env var, one Kubernetes secret, one enterprise agreement per organisation), so this multiplier is live unless someone has explicitly separated ingestion from inference traffic.
The money is not the main reason to care about this bug. At small scale it is trivial; at enterprise scale it is material but not the most expensive thing in the stack; in aggregate it is significant; in every case it is preventable with a three-line fix. The real reason to care is the other thing in the table above: the same pair of bugs produces opposite behaviour across half the fsspec ecosystem, shipped in defaults of a top-tier framework for thirteen months, and a single well-intentioned cleanup commit flips every masked backend to firing at once. The dollars are just the sanity check.
Reproducing it
All five levels of reproducer are in a public repo: github.com/stirelli/llamaindex-embedding-churn.
```shell
git clone https://github.com/stirelli/llamaindex-embedding-churn
cd llamaindex-embedding-churn
uv venv
uv pip install -r requirements.txt

uv run python verify_embedding_churn_lvl1.py  # hash comparison
uv run python verify_embedding_churn_lvl2.py  # counting embedder
uv run python verify_embedding_churn_lvl4.py  # reader format survey
```
Levels 1, 2, and 4 run in under ten seconds with no external dependencies. Level 3 needs an OpenAI API key (the cost per run is around two thousandths of a cent, and your dashboard will reflect the exact tokens). Level 5 needs AWS credentials and an S3 bucket (cost: fractions of a cent in S3 request fees).
Suggested fix
The minimal set of changes is three lines across two files. All three sites use MetadataMode.ALL today. Changing them to MetadataMode.EMBED makes the hash align with the text that is actually sent to the embedder, which is the semantically correct thing to hash.
```diff
- metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+ metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
```
This respects excluded_embed_metadata_keys, the mechanism that already exists for marking fields as not content-relevant. SimpleDirectoryReader populates that list with the volatile file-stat fields by default. The fix closes the churn without breaking the intended behaviour of detecting meaningful metadata changes.
I filed the findings upstream as issue #21461 and submitted PR #21462 with the three-line fix plus a regression test that covers both directions: volatile metadata must not force a re-embed, content-relevant metadata still must. Both are public. Both are verifiable.
Closing
Two bugs producing opposite behaviour across a large ecosystem turned out to be a stable equilibrium, but an accidental one. It survived thirteen months because neither half produces a visible failure on its own. Bug 1 looks like normal re-embedding to a user who has never measured it. Bug 2 looks like the reader simply not emitting temporal metadata, which is fine if nobody notices. On roughly half the fsspec ecosystem the two align, and the churn fires quietly at day-level precision today. On the other half they cancel, and the churn lies dormant, waiting for one upstream commit to lift the mask.
The narrow takeaway is: audit your ingestion pipeline’s hash key against the input your embedder actually sees. If the two are not the same, you are paying for work that produces the same output. The broader takeaway is that this kind of interaction sits quietly in large codebases, and the same cleanup PR that fixes one half of a pair can release the other half at scale.