Data validation & accuracy

How we stay anchored to ground truth, and the limits you should understand before relying on any value rendered here.

Where the data comes from

Every value rendered on insiderdelta is derived from one of three first-party sources. We do not redistribute any third-party aggregator’s database, and we do not synthesize values.

SEC EDGAR - www.sec.gov and data.sec.gov. Source for all filings (Form 4, 10-K, 10-Q, 8-K, 13D, 13G, 13F), XBRL Company Facts, the bulk Financial Statement Data Sets, and per-CIK submissions metadata (SIC, exchange, fiscal year end). This is the authoritative source for company-reported financials.
Yahoo Finance - daily closing prices, dividend ex-dates, and stock-split events, via the yahoo-finance2 library. Yahoo’s prices are split-adjusted, which matches our split-adjusted historical EPS so that historical valuation ratios are consistent.
Loughran & McDonald - the canonical finance-text sentiment lexicons used for our 10-K sentiment scoring. Distributed by the authors at the University of Notre Dame.

How we resolve canonical values

The SEC’s XBRL tag space is intentionally open and inconsistent - issuers report the same economic concept under different tag names across years, taxonomies, and reporting conventions. Reading the raw filing data without normalization gives wrong answers. Our resolver applies the following layered strategy:

Two-pass alias resolution. For each logical line item we maintain an ordered alias chain (e.g. for Revenue: RevenuesNetOfInterestExpense → RevenueFromContractWithCustomerExcludingAssessedTax → Revenues). We try the authoritative Company Facts API source across the WHOLE alias chain before falling back to the bulk-data alternate, never the other way around.
Largest-match strategy for ambiguous concepts. Some issuers report multiple revenue concepts (e.g. Bunge: both an aggregated Revenues tag and a subset RevenueFromContract... tag). We pick the larger by absolute value - the consolidated total is always ≥ any sub-component.
Period-shape filtering. We classify XBRL period contexts by their duration (days between start and end), not by the issuer-supplied fiscal-period label, because some filers tag standalone quarters as fp="FY" and many Q2/Q3 facts come as both 90-day standalone and year-to-date 180/270-day variants under the same label.
Split-adjusted historical EPS. Per-share values from periods prior to a subsequent stock split are divided by the cumulative forward-split factor of all later splits, so YoY and multi-year EPS comparisons are like-for-like.
Dimensional segment extraction. Segment, product, and geographic revenue breakdowns are extracted from the raw inline-XBRL instance document of each 10-K - we read the dimensional axis context that the Company Facts JSON API drops.
Source provenance markers. When a cell is sourced from the FSDS bulk fallback rather than the authoritative Company Facts API, we mark it with a small ◊ on the financials page so the reader knows.
IFRS taxonomy coverage for foreign issuers. ASML, Toyota, BP, Shell and other 20-F filers tag concepts under the ifrs-full taxonomy. We accept IFRS-equivalent aliases alongside US-GAAP ones for every line item.
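
Three of the steps above (two-pass alias resolution, period-shape filtering, and split adjustment) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production resolver: the function names, the dictionary inputs, the day bands used to classify periods, and the (date, ratio) split representation are all hypothetical. Largest-match selection and IFRS aliases are left out for brevity.

```python
from datetime import date

# Alias chain for Revenue, in the order given in the text above.
REVENUE_ALIASES = [
    "RevenuesNetOfInterestExpense",
    "RevenueFromContractWithCustomerExcludingAssessedTax",
    "Revenues",
]


def resolve(aliases, company_facts, bulk_fallback):
    """Two-pass lookup: try every alias in the primary source first.

    Both sources are assumed to be plain {tag: value} dicts.
    Returns (value, source_name), or (None, None) if nothing matches.
    """
    passes = (("company_facts", company_facts), ("fsds", bulk_fallback))
    for source_name, source in passes:
        for tag in aliases:
            if tag in source:
                return source[tag], source_name
    return None, None


def period_shape(start, end):
    """Classify a period by its length in days, ignoring the fp label.

    The day bands are illustrative assumptions, not documented thresholds.
    """
    days = (end - start).days
    if 80 <= days <= 100:
        return "quarter"
    if 170 <= days <= 190:
        return "ytd_6m"
    if 260 <= days <= 280:
        return "ytd_9m"
    if 350 <= days <= 380:
        return "annual"
    return "other"


def split_adjust(value, period_end, splits):
    """Divide a per-share value by the cumulative factor of later splits.

    `splits` is a list of (effective_date, ratio); 4.0 means 4-for-1.
    """
    factor = 1.0
    for effective, ratio in splits:
        if effective > period_end:
            factor *= ratio
    return value / factor


# A later alias in the primary source beats an earlier alias in the fallback.
facts = {"Revenues": 100.0}
bulk = {"RevenuesNetOfInterestExpense": 95.0}
value, source = resolve(REVENUE_ALIASES, facts, bulk)

# An EPS of 8.00 reported before a 4-for-1 split is restated as 2.00.
splits = [(date(2020, 8, 31), 4.0)]
adjusted = split_adjust(8.0, date(2019, 12, 31), splits)
```

The outer loop over sources is what enforces the ordering rule: the fallback is consulted only after the whole alias chain has failed against the primary source. Classifying by duration means a 180-day fact labeled as a quarter is still treated as year-to-date.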

Continuous validation

We maintain a curated set of canary tickers with hand-verified values pulled directly from each issuer’s 10-K cover page, plus Yahoo Finance snapshots of corporate-action events. A validation script in our pipeline runs the same data path the website uses against those known-good values and flags any discrepancy outside tolerance:

Latest-FY headline financials (revenue, net income, EPS, assets, equity, operating cash flow) - ±1% tolerance
Sanity bounds on ratios (ROE sign, net margin band, current ratio > 0)
Stock splits present in our DB match Yahoo’s history
TTM dividends per share within ±5% of Yahoo
Most recent ex-dividend date within ±7 days
Segment revenue axes populated as expected per issuer
Same-SIC peer set surfaces the expected siblings
Issuer metadata (SIC, exchange) matches

Tolerances are explicit, not implicit. Failures are surfaced; we do not silently mask them. To check the known-good values yourself, read data/validation-truths.json in the repository.
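
A relative-tolerance check of the kind listed above might look like the following. This is a sketch only: the function names, the flat {field: value} shape of the truth record, and the ticker "XYZ" are assumptions made for illustration, and the actual schema of data/validation-truths.json may differ.

```python
def within_relative(actual, expected, tolerance):
    """True when actual is within +/- tolerance (a fraction) of expected."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) <= tolerance * abs(expected)


def validate(ticker, observed, truths, tolerance=0.01):
    """Compare observed values to known-good ones.

    Returns a list of failure messages; an empty list means every
    field passed. Missing fields are reported rather than skipped.
    """
    failures = []
    for field, expected in truths.items():
        actual = observed.get(field)
        if actual is None:
            failures.append(f"{ticker}.{field}: missing")
        elif not within_relative(actual, expected, tolerance):
            failures.append(
                f"{ticker}.{field}: got {actual}, expected {expected}"
            )
    return failures


# Hypothetical canary record: revenue is 0.5% off (passes at 1%),
# EPS is 4% off (fails at 1%).
truths = {"revenue": 1000.0, "eps": 5.0}
observed = {"revenue": 1005.0, "eps": 5.2}
failures = validate("XYZ", observed, truths)
```

Returning the failures as a list, instead of stopping at the first one, keeps every discrepancy visible in a single run, which fits the stated policy of surfacing failures instead of masking them.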

What we deliberately do not do

No analyst estimates. We surface what issuers themselves disclosed in filings - no consensus EPS, no price targets. Those numbers come from third-party survey vendors whose methodology we cannot independently verify.
No prediction labels. Our 10-K signals report a sentiment shift and a language-diff score, not a buy/sell/hold rating. The academic results we lean on (see Methodology) are statistical regularities, not instructions.
No undisclosed sources. Everything we render is derived from public filings and Yahoo’s public market-data feed. No proprietary data, no scraped paid services, no "alternative data" we cannot trace.

Known limitations

Honest limits we have measured and have not yet solved. Knowing them is the cost of relying on any value here.

Issuer-side errors and restatements. We faithfully reflect what the issuer filed. When an issuer amends a filing, the amended values appear after our next ingest tick - until then we show the prior version. We do not catch issuer reporting errors before the SEC does.
XBRL coverage by vintage. XBRL tagging was phased in by the SEC starting in 2009 and was not universal across filers until the mid-2010s. Pre-2015 financials may have lower coverage; pre-2009 filings have essentially no XBRL.
Yahoo coverage limits. Yahoo’s split / dividend history reliably starts in 1990; earlier events may be missing.
Segment extraction is per-filing. We display the latest 10-K’s segmentation; multi-year segment trends require joining across filings whose member taxonomies may differ across years (a Microsoft “Office” segment in 2017 is not the same member as “Productivity and Business Processes” in 2024).
CUSIP-to-ticker mapping for 13F holdings. 13F-HR filings report holdings by CUSIP; we do not yet maintain a paid CUSIP-to-ticker map, so 13F holdings are displayed by CUSIP + name-of-issuer as filed.
Peer comparison narrowness. Peers are matched on exact 4-digit SIC code. SIC is the SEC’s canonical industry grouping but is too narrow to capture, for example, “tech megacap” as a peer set spanning multiple SIC codes.

Disclaimers

Not investment advice. insiderdelta is a research tool that surfaces and normalizes data from public SEC filings and public market-data feeds. We are not a registered investment advisor, broker-dealer, or fiduciary. Nothing rendered on this site constitutes a recommendation to buy, sell, or hold any security.

No warranty of accuracy. We use the techniques described above to maximize fidelity to our underlying sources, but we make no representation or warranty as to the accuracy, completeness, timeliness, or fitness for any purpose of any value rendered. Errors in underlying sources, errors in our parsing of those sources, and gaps in coverage are inevitable. You are responsible for independently verifying any value before relying on it.

No affiliation with the SEC, Yahoo, or any issuer. References to securities, issuers, regulators, or third-party data providers are for identification only. We are not endorsed by, sponsored by, or affiliated with any of them.

Past performance does not predict future returns. Historical insider behavior, language-change signals, dividend streaks, activist filings, and any other patterns surfaced here are descriptive of past events. Markets are forward-looking and past patterns frequently break.

Use at your own risk. To the maximum extent permitted by applicable law, the insiderdelta authors and operators disclaim all liability for any loss, damage, or consequence arising from the use of this site or its data, whether direct, indirect, incidental, consequential, or otherwise. Your access to and use of the site is at your sole discretion and risk.

Reporting an issue

If you believe a value rendered on insiderdelta is materially incorrect, please tell us what you observed, which page, and what you believe the correct value is. The validation suite documented above means we can usually reproduce, isolate, and fix a real discrepancy quickly - assuming we know about it.

See Methodology for how our 10-K language signals are computed.