- “Only 7% of the sources Topix.net crawls have XML feeds. I’d estimate that only a few hundred of the top 3,000 newspapers we crawl have RSS support. The rest we obtain with a news crawler which is good about finding articles on news sites, leaving behind the ads and navigation sidebars. It’s low maintenance so we don’t have to change anything every time a site redesigns its HTML.
“Even for sites which offer feeds, we’ll generally continue to crawl the human-readable version. We’ve seen sites where the RSS broke but no one at the paper seemed to notice, or cases where the RSS was out of sync with the human-viewable web content. By crawling both we get full coverage of the content available.
“There are approximately 1,400 daily newspapers in the US, and over 2,600 weeklies. There are around 3,000 magazines, and thousands of radio and TV station websites. Not to mention the city government websites we crawl looking for local announcements.
“Despite the enthusiasm around RSS, there is a long way to go before the bulk of this content will be available in feeds.”
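
To make the "crawl both" idea concrete, here is a minimal sketch of how a feed and a crawl of the same site might be reconciled. It is not Topix.net's code: the function names, the result keys, and the assumption that the crawler hands back a plain list of article URLs are all hypothetical, and the feed is assumed to be simple RSS 2.0 with one `<link>` per `<item>`.

```python
import xml.etree.ElementTree as ET


def feed_links(rss_xml):
    """Return the set of <link> values found in the <item> elements of an RSS document."""
    root = ET.fromstring(rss_xml)
    links = set()
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.add(link.strip())
    return links


def compare_coverage(rss_xml, crawled_links):
    """Merge feed and crawl results and report where the two disagree.

    `crawled_links` is assumed to be the article URLs a (hypothetical)
    HTML crawler extracted from the human-readable site.
    """
    try:
        from_feed = feed_links(rss_xml)
    except ET.ParseError:
        # A broken feed should not cost us coverage: fall back to the crawl alone.
        from_feed = set()
    crawled = set(crawled_links)
    return {
        "all": from_feed | crawled,                # full coverage from both sources
        "missing_from_feed": crawled - from_feed,  # on the site, absent from the feed
        "missing_from_site": from_feed - crawled,  # in the feed, not found by the crawl
    }
```

A non-empty `missing_from_feed` set is the kind of signal the quote describes: the feed has broken or drifted out of sync with the site, and only the crawl picked up the difference.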