How it works

From 25+ feeds to one coherent newsstand.

Every few minutes, we pull headlines from across the Nepali press, match them across languages, group them into stories, extract the people and places inside them, and ship the result to the feed you see. Here’s each stage of that pipeline — with the trade-offs we made along the way.

Ingest · Normalize · Embed · Cluster · Entities · Summarise · Render

Ingestion

Pulling 25+ newsrooms, every few minutes

A scheduled fetcher runs continuously against the RSS feeds and public sitemaps of every outlet we track. Each run collects fresh headlines, URLs, publication timestamps, and a short excerpt — never the full article body.

  • Respectful crawling. We honour robots.txt and skip any outlet that disallows our User-Agent. We rate-limit requests per domain and back off on errors, so the fetcher never puts undue load on a publisher’s servers.
  • Headlines only. We store title, canonical URL, timestamp, and an excerpt. Every item on the site links back to the publisher’s own page for the full read.
  • Deduplication. Canonical URLs and content hashes suppress the same story if a publisher re-pushes it under a slightly different link.
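
The deduplication rule above can be sketched in a few lines. This is a minimal illustration, not the production fetcher: the `TRACKING_PARAMS` list, the URL clean-up rules, and the choice of SHA-256 over title plus excerpt are all assumptions made for the example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not change the article identity.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def canonical_url(url: str) -> str:
    """Lower-case scheme and host, drop tracking params, fragment, trailing slash."""
    parts = urlsplit(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def content_hash(title: str, excerpt: str) -> str:
    """Hash of whitespace-collapsed, case-folded title plus excerpt."""
    text = " ".join(f"{title} {excerpt}".split()).casefold()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers canonical URLs and content hashes that were already stored."""

    def __init__(self) -> None:
        self.seen_urls: set[str] = set()
        self.seen_hashes: set[str] = set()

    def is_new(self, url: str, title: str, excerpt: str) -> bool:
        url_key = canonical_url(url)
        hash_key = content_hash(title, excerpt)
        if url_key in self.seen_urls or hash_key in self.seen_hashes:
            return False  # same link, or same text under a different link
        self.seen_urls.add(url_key)
        self.seen_hashes.add(hash_key)
        return True
```

Checking both keys matters: the URL key catches a re-push with tracking parameters attached, and the content hash catches the same headline republished under a new slug.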

Are you a publisher?

We’re continually adding new Nepali and English outlets — if your newsroom publishes in the public interest and you’d like to be tracked, write in. We also honour takedown and delisting requests and will respond within 72 hours. Ownership or editorial-metadata corrections go to the same address.

Normalize

Two scripts, one schema

Devanagari and Latin headlines live side-by-side in the same database. Each incoming article is tagged with its language, attached to its publisher record (including ownership type, HQ, and editorial stance), and normalised so that केपी ओली and KP Oli are addressable as the same concept later on.

No transliteration, no auto-translation at this stage — both scripts are preserved exactly as the publisher wrote them. Script mixing happens only in downstream views, explicitly.
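
One way to make both spellings addressable as the same concept without touching the stored text is an alias index kept beside the article record. The sketch below assumes hypothetical concept IDs (`person:kp-oli`, `place:kathmandu`) and a crude script-based language tag; the real schema is not described in this document.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional

# Hypothetical concept IDs mapped to the aliases they are known by in both scripts.
ALIASES = {
    "person:kp-oli": ["केपी ओली", "KP Oli"],
    "place:kathmandu": ["काठमाडौं", "Kathmandu"],
}


def alias_key(text: str) -> str:
    """Comparison key only: NFC-normalised, whitespace-collapsed, case-folded."""
    return " ".join(unicodedata.normalize("NFC", text).split()).casefold()


INDEX = {alias_key(a): cid for cid, names in ALIASES.items() for a in names}


def concept_for(surface: str) -> Optional[str]:
    """Return the concept ID for a surface form, or None if it is unknown."""
    return INDEX.get(alias_key(surface))


def detect_language(title: str) -> str:
    """Assumed heuristic: any character in the Devanagari block means Nepali."""
    return "ne" if any("\u0900" <= ch <= "\u097f" for ch in title) else "en"


@dataclass
class Article:
    title: str         # stored exactly as the publisher wrote it
    language: str      # "ne" or "en"
    publisher_id: str  # links to ownership type, HQ, and editorial stance


def make_article(title: str, publisher_id: str) -> Article:
    return Article(title=title, language=detect_language(title), publisher_id=publisher_id)
```

The key is used only for lookup, so the headline itself is never transliterated or rewritten.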

Embed

Every headline becomes a point in meaning-space

A multilingual embedding model converts every headline (plus its excerpt) into a 1,024-dimensional vector. The key property: two headlines about the same event land close together in this space — even if one is in Nepali and the other in English, and even if they share zero words.

Vectors are L2-normalised, so cosine similarity (the cosine of the angle between two vectors) reduces to a plain dot product and becomes a direct measure of how much two headlines are “about” the same thing. We store them in Postgres via pgvector with an HNSW index, so “find the nearest article to this one” is a millisecond query across the whole corpus.
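
To see why normalisation makes the comparison cheap, here is a small pure-Python sketch. The function names are illustrative, and two-dimensional toy vectors stand in for the 1,024-dimensional ones.

```python
import math


def l2_normalise(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """General definition: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = l2_normalise(a), l2_normalise(b)

# For unit vectors the denominator is 1, so the dot product is the cosine.
assert math.isclose(dot(ua, ub), cosine_similarity(a, b))
```

In pgvector, the `<=>` operator returns cosine distance, which is one minus this similarity, so a similarity threshold translates directly into a distance cutoff.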

Cluster

Grouping the coverage of one event

When a new article arrives, we search its embedding against every article from the last ten days. If its similarity to the closest neighbour exceeds our threshold (cosine ≈ 0.78), the new article joins that neighbour’s cluster. Otherwise it opens a new one.

This is a single-pass, online algorithm — cheap, incremental, and self-healing as more members join. A typical busy day yields 40–60 multi-article clusters out of roughly 1,000 headlines. Clusters with three or more distinct publishers get promoted to the Stories view — those are the pieces of news the whole press is talking about.

Edge cases (long transitive chains, near-duplicates, cross-event spillover) are surfaced in an internal diagnostics view so we can tune the threshold over time without changing the core loop.
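
The core loop can be sketched as follows. This is an illustration under stated assumptions: a linear scan stands in for the HNSW index lookup, vectors are assumed to be L2-normalised already, and the class and field names are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

THRESHOLD = 0.78             # cosine similarity needed to join a cluster
WINDOW = timedelta(days=10)  # only articles this recent are candidates


@dataclass
class Article:
    id: str
    publisher: str
    published: datetime
    vector: list  # assumed L2-normalised, so dot product == cosine
    cluster_id: Optional[int] = None


class OnlineClusterer:
    def __init__(self) -> None:
        self.articles: list[Article] = []
        self.next_cluster = 0

    def add(self, art: Article) -> int:
        """Assign the article to its nearest neighbour's cluster, or open a new one."""
        cutoff = art.published - WINDOW
        best, best_sim = None, -1.0
        for other in self.articles:  # brute force; an ANN index would replace this
            if other.published < cutoff:
                continue
            sim = sum(x * y for x, y in zip(art.vector, other.vector))
            if sim > best_sim:
                best, best_sim = other, sim
        if best is not None and best_sim > THRESHOLD:
            art.cluster_id = best.cluster_id
        else:
            art.cluster_id = self.next_cluster
            self.next_cluster += 1
        self.articles.append(art)
        return art.cluster_id

    def story_clusters(self, min_publishers: int = 3) -> set:
        """Cluster IDs covered by at least `min_publishers` distinct publishers."""
        publishers: dict = {}
        for a in self.articles:
            publishers.setdefault(a.cluster_id, set()).add(a.publisher)
        return {cid for cid, pubs in publishers.items() if len(pubs) >= min_publishers}
```

Because each article only needs to be close to one existing member, clusters can grow by chaining. That is the long-transitive-chain edge case mentioned above.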

Entities

People, places, organisations — surfaced, not guessed

ENTITY TAGS EXTRACTED FROM A HEADLINE

प्रधानमन्त्री केपी ओलीले सर्वोच्चको फैसलालाई स्वागत गरे

केपी ओली (PER) · Prime Minister (PER) · नेपाल कांग्रेस (ORG) · Supreme Court (ORG) · काठमाडौं (LOC) · Pokhara (LOC) · बालेन साह (PER) · UML (ORG)

PER = person · ORG = organisation · LOC = place

Before anything AI-driven runs, we scan each headline against a curated registry of Nepali entities — politicians, parties, ministries, districts, companies, institutions — with every alias they’re known by in both scripts. Most matches resolve here, deterministically, for free.

Anything missed by the registry goes to a second pass that identifies named entities, disambiguates them against known records, and queues genuinely new names for operator review before they enter the registry. Role-sensitive entries (Prime Minister, Mayor of Kathmandu) are resolved against time-bounded assignments, so a story from last year points to the right person — not whoever holds the office today.

The result: every article carries a list of tagged entities, which powers the Topics page, entity profile pages, and the ability to filter the feed by who or what it’s about.
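
The two deterministic pieces, registry matching and time-bounded role resolution, might look roughly like this. The registry entries, entity IDs, and especially the role holders and dates are placeholder data invented for the sketch, not real tenures.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical registry: alias -> (entity ID, type). Aliases cover both scripts.
REGISTRY = {
    "केपी ओली": ("person:kp-oli", "PER"),
    "KP Oli": ("person:kp-oli", "PER"),
    "काठमाडौं": ("place:kathmandu", "LOC"),
    "Kathmandu": ("place:kathmandu", "LOC"),
}


def match_registry(headline: str) -> list:
    """Return unique (entity ID, type) pairs whose alias occurs in the headline.

    Substring matching is a deliberate simplification: it also catches
    Devanagari names with an attached postposition, at the cost of
    occasional false positives on short aliases.
    """
    text = headline.casefold()
    found = {}
    for alias, entity in REGISTRY.items():
        if alias.casefold() in text:
            found[entity] = None  # dict keeps insertion order and drops repeats
    return list(found)


@dataclass
class RoleAssignment:
    role: str
    holder: str
    start: date
    end: Optional[date]  # None means the assignment is still current


# Placeholder assignments with made-up holders and dates.
ROLE_ASSIGNMENTS = [
    RoleAssignment("Prime Minister", "person:example-a", date(2020, 1, 1), date(2021, 6, 30)),
    RoleAssignment("Prime Minister", "person:example-b", date(2021, 7, 1), None),
]


def resolve_role(role: str, on: date) -> Optional[str]:
    """Who held `role` on the article's publication date, if known."""
    for a in ROLE_ASSIGNMENTS:
        if a.role == role and a.start <= on and (a.end is None or on <= a.end):
            return a.holder
    return None
```

Resolving against the publication date rather than today's date is what keeps an old story pointing at the office holder of its time.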

Summarise

A bilingual brief for every multi-outlet story

ENGLISH · AI

The Supreme Court reinstated the lower-house dissolution petition for a fresh hearing, six outlets reported. Parties welcomed the move but disagreed on the timeline.

cites 6 articles · 4 publishers

नेपाली · AI

सर्वोच्च अदालतले प्रतिनिधिसभा विघटनको मुद्दा पुनः सुनुवाइका लागि दर्ता गरेको छ। छ वटा सञ्चारमाध्यमले समाचार दिए तर समय-तालिकामा मतभेद देखियो।

६ लेख · ४ प्रकाशक

Once a cluster has three or more articles from different outlets, a nightly job drafts a story summary in both Nepali and English and a single neutral headline for the cluster. The summaries use only facts that appear in the contributing articles — nothing is inferred or imported from background knowledge.

  • Summaries are labelled as AI-generated on every surface they appear.
  • Every summary cites only the articles that contributed to it; the source list is visible from the story page.
  • Nepali and English are generated at parity — neither is a translation of the other. Both are drafted directly from the source material.
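
The gating and citation rules above are simple enough to express as checks that run around whatever drafts the text. The sketch below covers only those checks; the drafting step itself is out of scope, and the type and function names are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Member:
    article_id: str
    publisher: str


@dataclass
class Summary:
    language: str  # "ne" or "en"
    text: str
    cited_ids: list = field(default_factory=list)
    ai_generated: bool = True  # label shown wherever the summary appears


def eligible_for_summary(members: list, min_outlets: int = 3) -> bool:
    """A cluster qualifies once it has articles from enough different outlets."""
    return len({m.publisher for m in members}) >= min_outlets


def citations_valid(summary: Summary, members: list) -> bool:
    """Labelled as AI, cites at least one article, and cites only cluster members."""
    member_ids = {m.article_id for m in members}
    return (
        summary.ai_generated
        and bool(summary.cited_ids)
        and set(summary.cited_ids) <= member_ids
    )
```

A summary that fails `citations_valid` would be held back rather than published. That policy is an assumption of this sketch.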

Render

Served as a reading surface, not a dashboard

PAST HOUR · New stories arriving · 12
1–3H · Momentum picking up · 28
3–12H · Top clusters consolidating · 64
EARLIER · Long tail · 182

TIME-BUCKETED · ROUND-ROBIN PUBLISHER MIX · BILINGUAL

The final stage is the site you’re on. It combines server-rendered pages, a time-bucketed feed (past hour · past three hours · past twelve · earlier today · yesterday), a round-robin publisher mix so no single outlet dominates a segment, a bilingual language toggle, and an ownership badge on every card. No tracking pixels, no account required.
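
Time bucketing and the round-robin mix can be sketched as two small functions. The bucket labels and age cutoffs here are assumptions, and the calendar-day split between "earlier today" and "yesterday" is collapsed into a single `earlier` bucket to keep the example short.

```python
from collections import deque
from datetime import datetime, timedelta

# Assumed age cutoffs, checked in order; anything older falls into "earlier".
BUCKETS = [
    ("past_hour", timedelta(hours=1)),
    ("past_3h", timedelta(hours=3)),
    ("past_12h", timedelta(hours=12)),
]


def bucket_for(published: datetime, now: datetime) -> str:
    """Label an article by how long ago it was published."""
    age = now - published
    for label, limit in BUCKETS:
        if age < limit:
            return label
    return "earlier"


def round_robin(items: list) -> list:
    """Interleave (publisher, title) pairs so publishers take turns.

    Input order is preserved within each publisher, so newest-first input
    stays newest-first per outlet.
    """
    queues: dict = {}
    for publisher, title in items:
        queues.setdefault(publisher, deque()).append((publisher, title))
    mixed = []
    while queues:
        for publisher in list(queues):  # copy: exhausted queues are deleted below
            mixed.append(queues[publisher].popleft())
            if not queues[publisher]:
                del queues[publisher]
    return mixed
```

Applied per bucket, an outlet that published five items in the past hour still contributes only one card per round, so the others surface before its second item does.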

Surfaces like Stories, Topics, and Publishers are different cuts of the same underlying pipeline — events, entities, and outlets, respectively.

Principles the pipeline follows

A few rules we don’t break

  • Headlines only

    Full article bodies are never republished. Every item links out.

  • Ownership on

    Every headline carries its publisher’s ownership type: state, private, independent, or non-profit.

  • AI labelled

    Summaries and entity extractions are tagged as AI-generated and cite their sources.

  • No invented facts

    Summaries may only use facts that appear in contributing articles. No bridging, no inference.

  • Bilingual at parity

    Nepali and English are first-class — neither is a translation of the other.

  • Outside Nepal

    Infrastructure runs outside Nepal by design — a deliberate regulatory posture.

See it running

The pipeline is live right now.

Open the live feed, browse today’s clustered stories, explore trending topics, or meet the outlets we track.