December 29, 2003

More thoughts on RSS Technology

Home-ripped RSS feeds
Yesterday I said that it was a pain that some websites I read regularly do not offer RSS syndication feeds. An obvious countermeasure would be for people to scrape these sites and rip their own RSS feeds. The three sites I named yesterday, CNN, Debka, and Drudgereport, have all been around a long time and appear to have pretty stable formats that could be parsed.
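
As a rough sketch of what ripping a feed might involve, the Python below pulls the links out of a page with the standard library's html.parser and wraps them in a bare-bones RSS 2.0 document. The sample HTML, the example.com URLs, and the names LinkCollector and build_rss are all made up for illustration. A real site would need its own rules for deciding which links are actually headlines.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> on the page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.links.append((title, self._href))
            self._href = None


def build_rss(site_title, site_url, links):
    """Wrap scraped (title, href) pairs in a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Home-ripped feed for " + site_title
    for title, href in links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


# Hypothetical page fragment standing in for a real front page.
SAMPLE_HTML = (
    '<p><a href="http://example.com/a">First story</a> '
    '<a href="http://example.com/b">Second story</a></p>'
)

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
feed_xml = build_rss("Example News", "http://example.com/", collector.links)
print(feed_xml)
```

This treats every link as a headline, which is far too generous for a real front page, but it shows the two halves of the job: pick items out of the HTML, then serialize them as RSS.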

So if I now have a little engine that is ripping bootleg RSS feeds for me, why not share them? Yeah, sure, but how do I let people know they're available? Presumably the sites themselves aren't going to publish the link for me.

P2P RSS networks
I had another idea -- set up a P2P RSS network where everyone shares their RSS feeds, whether they're official or home-ripped. So everyone's P2P agent would be sitting around asking each other, "What feeds do you have available? Do you have a feed for BoingBoing that is more recent than mine? What feeds do you need? Maybe I have one."

Since we're just exchanging modest-size text files, and not MP3s or DVDs, the bandwidth burden is pretty low, even for home dialup users. That means there would be fewer defectors who only take and do not share. It also takes some of the bandwidth burden off the original publisher's site.

It shouldn't be too hard to design. All you need to know is the feed's URL, the timestamp of when it was received from the publisher's site, and a hash of the contents so that you can easily avoid downloading dupes.
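
To make that concrete, here is a minimal Python sketch of the bookkeeping, not a worked-out protocol. The names FeedRecord, make_record, and feeds_to_request are hypothetical, SHA-256 is just one reasonable choice of hash, and I'm assuming each agent keeps a simple URL-to-record table that it can compare against a peer's.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedRecord:
    url: str           # the feed's URL
    fetched_at: int    # Unix time this copy was received from the publisher
    content_hash: str  # hash of the feed body, used to spot duplicates


def make_record(url, fetched_at, body):
    """Build a record for a feed body (SHA-256 is an assumed choice of hash)."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return FeedRecord(url, fetched_at, digest)


def feeds_to_request(mine, theirs):
    """Given two {url: FeedRecord} tables, list the URLs worth asking the peer for.

    A feed is worth fetching if we don't have it at all, or if the peer's
    copy is newer AND its contents actually differ from ours.
    """
    wanted = []
    for url, remote in theirs.items():
        local = mine.get(url)
        if local is None:
            wanted.append(url)
        elif (remote.fetched_at > local.fetched_at
              and remote.content_hash != local.content_hash):
            wanted.append(url)
    return sorted(wanted)


# Made-up feeds on example.com, purely for illustration.
mine = {
    "http://example.com/a.rss": make_record("http://example.com/a.rss", 100, "old body"),
    "http://example.com/b.rss": make_record("http://example.com/b.rss", 100, "same body"),
}
theirs = {
    "http://example.com/a.rss": make_record("http://example.com/a.rss", 200, "new body"),
    "http://example.com/b.rss": make_record("http://example.com/b.rss", 200, "same body"),
    "http://example.com/c.rss": make_record("http://example.com/c.rss", 150, "other body"),
}
requests = feeds_to_request(mine, theirs)
print(requests)
```

In this example the peer's copy of b.rss is newer but has the same hash as mine, so it is skipped. Only a.rss (changed) and c.rss (missing) get requested.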

GUI-based Spidering
Oh, and back to scraping web sites to produce RSS feeds. I'm sure there is an O'Reilly book out there that describes the best practices for spidering a site and scraping its contents, but hell, do I really want to write and maintain a Perl script every time I want to scrape a new site? No. It's a mess picking through all the piles of badly-formatted and baroque HTML source.

What would be much cooler is a GUI-based engine that looks something like a browser, loads up a page and displays its structure, and lets you annotate it with GUI devices to describe the landmarks on the page, what to scrape and what to ignore. It would then take that information and mechanically produce a Perl script for scraping the site according to those hints.
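
Here is a rough idea of what the GUI's output could boil down to. This Python sketch does not generate a Perl script; it interprets the hints directly, which is a simpler stand-in for the same idea. The hint format (a start landmark, an end landmark, and an item pattern), the function name scrape_with_hints, and the sample page are all invented for illustration.

```python
import re


def scrape_with_hints(html, hints):
    """Extract items from the region of the page between two landmarks.

    hints is a dict with three assumed keys:
      start_landmark - literal text marking where the interesting region begins
      end_landmark   - literal text marking where it ends
      item_pattern   - regex whose groups capture the fields of one item
    """
    start = html.find(hints["start_landmark"])
    if start == -1:
        return []  # landmark missing: the page layout has probably changed
    start += len(hints["start_landmark"])
    end = html.find(hints["end_landmark"], start)
    if end == -1:
        end = len(html)
    region = html[start:end]
    return re.findall(hints["item_pattern"], region)


# A made-up page: navigation and ads to ignore, headlines to scrape.
PAGE = (
    '<div id="nav"><a href="/about">About</a></div>'
    '<!-- headlines -->'
    '<a href="/s1">Story one</a><a href="/s2">Story two</a>'
    '<!-- end headlines -->'
    '<a href="/ads">Ad</a>'
)

# What the annotations from the GUI might reduce to.
HINTS = {
    "start_landmark": "<!-- headlines -->",
    "end_landmark": "<!-- end headlines -->",
    "item_pattern": r'<a href="([^"]+)">([^<]+)</a>',
}

stories = scrape_with_hints(PAGE, HINTS)
print(stories)
```

The point is that once the landmarks are written down as data, the scraper itself is generic, and a new site only needs a new set of hints rather than a new script.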

Posted by Nils Blutig at December 29, 2003 12:12 AM