Publish And Subscribe

aka PubSub

In the BlogWeb context, this is about handling a reader's interest in multiple blogs: specifically, having a mechanism that alerts the reader to changes (new postings).

Russ Lipton's page talks about various points, including the cognitive issue of browsing hundreds of subscribed blogs. (See Universal Inbox for ideas on that.)

Why is anything resembling real-time necessary? Would it be such a bad thing if everyone just grabbed stuff once a day?

  • that would probably suck the energy out of the "HotLink-s"/Flash Crowd-s process

    • on the other hand, will we get to a point when we feel like that process is more like watching TV than thinking?

RssAggregator polling (not really PubSub)

An RssAggregator typically has a single global setting (sometimes not even user-editable) controlling how frequently it polls every subscribed blog to see whether it's changed.

Some aggregators do an HTTP GET, which can involve grabbing a fair amount of data if the blog includes its full content.

Some aggregators (I believe) just do an HTTP HEAD first, so the blog server responds with the date-time of the last change - then the aggregator compares that to the last time it got data, and if the last change is more recent it then does a full HTTP GET.

If an aggregator passes, as part of its request, the date-time it most recently got data, the blog server can in theory decide to send no data if there's been no change. That reduces/avoids the bandwidth concern but adds some computational overhead to each such request. Still, it's probably the best trade-off. See HttpConditionalGet.
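A rough sketch of the server-side decision in that last approach, assuming the aggregator sends its last-fetch time in an If-Modified-Since header. The function name respond and its return shape are made up for illustration; a real blog server would hook this into its own request handling.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since=None):
    """Return (status, send_body) for a feed request.

    last_modified: timezone-aware datetime of the feed's last change.
    if_modified_since: raw header value from the aggregator, or None.
    (Hypothetical helper, not any particular server's API.)
    """
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                # Treat a date without usable zone info as UTC.
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, False  # Not Modified: headers only, no feed body
    return 200, True  # send the full feed


# What the aggregator would send back on its next poll:
changed = datetime(2003, 7, 21, 16, 6, tzinfo=timezone.utc)
header = format_datetime(changed, usegmt=True)
status, send_body = respond(changed, header)
```

The extra work per request is one date parse and one comparison, which is the "computational overhead" being traded against sending the whole feed.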

At the opposite end of the cooperativeness scale, Mark Pilgrim's had horrible problems with Newsmonster.


Another approach might be to get last-update info from a separate aggregator Ping Service, like WeblogsCom.

RSS readers should probably check here before asking individual sites for updated RSS. This would improve scalability (vs current practice of grabbing every chosen site's RSS every 30-60 min regardless of whether it's been updated).

  • actually, this assumes that RSS feeds/files are updated at the same time as content, which isn't necessarily true. So WeblogsCom has a separate process for feeds.

  • hmmm, this also strengthens the idea that even if Blogger doesn't give RSS feeds with their free service, they could separately consider pinging WeblogsCom. I wonder whether they do, or would?

  • I wonder what % of sites with RSS ping weblogs.com? (or any other site) I wonder how many do separate pings for main content and feeds?

How to handle a world of non-centralized weblogs.com-like sites? Assuming that each blog picks a single ping-site to ping, they could store that site's URL in (a) a 'link' tag (just like many sites point to their RSS feed) and (b) an RSS channel property.
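A minimal sketch of option (a) from the reader's side: scan a blog's HTML for a 'link' tag naming its ping-site. The rel value "pingservice" is purely an assumption for illustration, since no such convention is defined here, and the class and function names are hypothetical too.

```python
from html.parser import HTMLParser


class PingSiteFinder(HTMLParser):
    """Remember the href of the first <link> whose rel matches."""

    def __init__(self, rel="pingservice"):  # rel value is an assumption
        super().__init__()
        self.rel = rel
        self.ping_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.ping_url is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated tokens.
        rels = (attributes.get("rel") or "").lower().split()
        if self.rel in rels and attributes.get("href"):
            self.ping_url = attributes["href"]


def find_ping_site(html):
    """Return the advertised ping-site URL, or None if there isn't one."""
    finder = PingSiteFinder()
    finder.feed(html)
    return finder.ping_url
```

An aggregator could run this once when a blog is subscribed and then poll that ping-site instead of the blog itself.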

An issue is what to do when you've been offline for more than 3 hours (or haven't been running your ping-checker). I guess:

  • RSS reader could store last-check-datetime for each Ping Service it checks

  • if past the 3-hour window (or whatever window each Ping Service reports, since it needn't be the same), then grab the RSS from the blog.

  • is the "real" solution to have Ping Service-s allow querying, so you could submit a list of blogs and get back a list of last-ping-times? I suppose that's more "expensive" for the ping-site to process.
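The first two bullets could be sketched roughly like this. It assumes the reader has already turned a Ping Service's recent-changes list into a dict of last-ping times; the function name, parameters, and the default 3-hour window are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def blogs_to_fetch(subscriptions, recent_pings, last_check, now,
                   window=timedelta(hours=3)):
    """Decide which subscribed feeds to download right now.

    subscriptions: blog URLs the reader follows.
    recent_pings: dict of blog URL -> datetime of its latest ping, as
        reported by the Ping Service's recent-changes list.
    last_check: when this reader last consulted that Ping Service, or None.
    window: how far back the service's list reaches (assumed, per service).
    """
    subs = list(subscriptions)
    if last_check is None or now - last_check > window:
        # We were away longer than the list covers, so pings may have been
        # missed: fall back to grabbing every feed directly.
        return subs
    # Otherwise trust the list: fetch only blogs pinged since the last check.
    return [url for url in subs
            if url in recent_pings and recent_pings[url] > last_check]
```

Storing last_check per Ping Service, as the first bullet suggests, lets each service have its own window.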

Phil Ringnalda describes the apparently-common practice of pinging multiple services.


True PubSub

Radio Userland, Userland Manila - using XmlRpc or SOAP (over HTTP only?) - http://www.thetwowayweb.com/soapmeetsrss and walkthrough directions for Userland Manila: http://backend.userland.com/publishSubscribeWalkthrough

Some Perl-based cloning by Ben Hammersley.

DJAdams did something similar with Jabber. There's a Jabber PubSub spec: http://www.jabber.org/jeps/jep-0024.html

I could see this making sense for use while the reader's machine is online, then with a batch catch-up process after a period offline. You'd want the update messages dumped into your Universal Inbox for prioritizing.

While it's not an issue for engines that render upstream to the server separately from saving content, in simpler systems there are lots of incremental changes to items (esp on a wiki, vs a weblog). Does it really make sense to shove all those "saves" out to the network? No, it makes sense to use some sort of periodicity/batching model. One would be to have the author trigger an updated-ping to his subscribers when he thinks it's appropriate (I manually ping WeblogsCom). Another would be to have an agent check for changes on a time period (half an hour?), then generate updated-ping messages.
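A sketch of the second (agent) model, assuming saves are only recorded locally and an agent calls flush_if_due on some schedule. The class name, method names, and the 1800-second default are illustrative, and actually sending the pings is left out.

```python
class PingBatcher:
    """Coalesce many saves into at most one updated-ping batch per period."""

    def __init__(self, period_seconds=1800):  # half an hour, as floated above
        self.period = period_seconds
        self.dirty = set()      # pages saved since the last batch went out
        self.last_flush = None  # time (in seconds) of the last batch

    def record_save(self, page):
        # Repeated saves of the same page collapse into one entry.
        self.dirty.add(page)

    def flush_if_due(self, now):
        """Return the pages to announce, or [] if nothing is due yet."""
        if not self.dirty:
            return []
        if self.last_flush is not None and now - self.last_flush < self.period:
            return []
        pages = sorted(self.dirty)
        self.dirty.clear()
        self.last_flush = now
        return pages
```

The author-triggered model is the same thing with the period check removed: the author decides when to flush.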

Scalability

Will we hit a point where there are 10million+ blogs? And 10million+ blog readers? If everyone's sucking in lots more stuff than they can or want to read (so their Universal Inbox can rate/filter it), what does that imply for bandwidth scalability? Will ISP-s run Caching Proxy Server-s? Or will writers start to look for ways to limit the number of semi-readers they get? "If I subscribe to your feed then you get real-time PubSub pings, otherwise you can only read/poll once a day."

Maybe this isn't a big deal if people use some existing HTTP features? (HttpConditionalGet)


Other contexts

Matt Haughey on wanting it for MetaFilter

Apache module: http://mod-pubsub.sourceforge.net/

Web Sphere Java Messaging System support: http://advisorevents.com/CIW0306p.nsf/4e89a750092af55b88256b66006b2eef/4b451ac5cc3583ae88256c8e005bbe0f?OpenDocument

Traditional enterprise messaging like Tib Co.

Shrook's Distributed Checking --BillSeitz, 2003/07/21 16:06 GMT
described here - basically a cluster of servers tracking last-updated times for many RSS feeds

