"Only 7% of the sources Topix.net crawls have XML feeds. I'd estimate that only a few hundred of the top 3,000 newspapers we crawl have RSS
support. The rest we obtain with a news crawler that is good at finding articles on news sites while leaving behind the ads and navigation sidebars. It's low-maintenance, so we don't have to change anything every time a site redesigns its HTML."