General Discussion

erronis

Wed Jun 24, 2026, 07:39 PM

Web archive lets you easily search millions of government documents

https://phys.org/news/2026-06-web-archive-easily-millions-documents.html
by Stefan Milne, University of Washington

The search site looks to be operational, although they say it is still indexing content:

https://govscape.net/

At the end of every presidential term, the End of Term Web Archive preserves that administration's web presence as a vast trove of documents and webpages. The archive began in 2008, with George W. Bush's second term, and runs through 2024, collecting images, text, graphs, redacted pages and other media. So while it contains important public information, finding that information in the glut can prove difficult.

A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like "FAFSA," or use a semantic search, which finds documents on a topic even if the exact search terms don't appear on the page. A visual search option lets them query for qualities like "redacted documents," "aerial photographs" or "pie charts." The system can currently search the 10 million PDFs hosted online during Donald Trump's first term; the team plans to expand it to the whole archive.

Because researchers used highly efficient artificial intelligence models to read the documents, processing all the PDFs cost less than $1,500, or about $1 per 47,000 pages. By comparison, Google might charge consumers $1 to parse around 100 pages with AI.
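For anyone who wants to sanity-check those cost figures, here is a rough back-of-the-envelope sketch. It only uses the numbers quoted in the excerpt; the variable names are mine, and the derived totals are estimates (the article says "less than" $1,500 and "about" 47,000 pages, so treat the results as approximate upper bounds, not figures from the researchers).

```python
# Figures quoted in the article excerpt (all approximate).
TOTAL_COST_USD = 1_500               # "less than $1,500", so an upper bound
PAGES_PER_DOLLAR = 47_000            # "about $1 per 47,000 pages"
COMPARISON_PAGES_PER_DOLLAR = 100    # "$1 to parse around 100 pages"
PDF_COUNT = 10_000_000               # "the 10 million PDFs"

# Derived estimates (my arithmetic, not stated in the article).
implied_pages = TOTAL_COST_USD * PAGES_PER_DOLLAR            # total pages processed
pages_per_pdf = implied_pages / PDF_COUNT                    # average PDF length
cost_ratio = PAGES_PER_DOLLAR / COMPARISON_PAGES_PER_DOLLAR  # how many times cheaper
comparison_cost = implied_pages / COMPARISON_PAGES_PER_DOLLAR  # same job at $1 per 100 pages

print(f"Implied pages processed: about {implied_pages:,}")
print(f"Average pages per PDF:   about {pages_per_pdf:.2f}")
print(f"Cost advantage:          about {cost_ratio:.0f}x")
print(f"Cost at $1/100 pages:    about ${comparison_cost:,.0f}")
```

So under these assumptions the quoted numbers are at least internally consistent: roughly 70 million pages, about 7 pages per PDF, and a price a few hundred times lower than the consumer rate used for comparison.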

The team will present its research July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego. The work is published on the arXiv preprint server.

. . .