Design by Committee (Jonathan's blog)
May 27

A scraping incident

One of my mashups fell prey to the dreaded scrape rot - a complete overhaul of the target site that invalidated all of my scraping rules. The pages in question are from globalincidentmap.com, which previously powered my internationalincident mashup (see Sri Lanka Incident Mashup). The change was catastrophic - queryable content from the site is no longer free, but requires a paid membership and a login. One option would be to pay for a membership, but the price is steep and I doubt the license terms allow republishing the data. I could support the mashup only for paid users of the service, collecting their credentials and forwarding them on, but that is both questionably secure (a user would have to trust me not to abuse the credentials temporarily in my possession) and unrealistic, since few if any of my audience would spring for the cost of membership. In effect, my mashup has been totaled.

This illustrates one reason why scraping should be used only as a last resort, when no more stable form of content, such as a feed or Web service, is available. When content and presentation are mixed, changes in the presentation are easily confused with changes in the content. Although the scraping features of the WSO2 Mashup Server are popular, I like to think of them as a stop-gap while publishers find cost-effective ways of serving up presentation-free content, such as delivering simple services using the Mashup Server ;-). Ideally, more and more publishers will recognize the value of raw content, and the need for Web scraping will diminish. Gonna take a while though...
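A minimal sketch of that point, in Python rather than anything specific to the Mashup Server. The page markup, the feed, and the scraping rule are all made up for illustration; the only claim is that a rule written against presentation breaks on a redesign even when the content is unchanged, while a feed reader does not care how the page looks.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page markup before and after a redesign; the content is identical.
OLD_PAGE = '<table><tr><td class="incident">Explosion reported in Colombo</td></tr></table>'
NEW_PAGE = '<ul><li><span class="evt-title">Explosion reported in Colombo</span></li></ul>'

# Hypothetical feed carrying the same content with no presentation attached.
FEED = """<rss version="2.0"><channel>
  <item><title>Explosion reported in Colombo</title></item>
</channel></rss>"""

# A scraping rule written against the old layout (an assumption, not a real rule).
SCRAPE_RULE = re.compile(r'<td class="incident">(.*?)</td>')


def scrape_titles(html):
    """Return every title the layout-specific rule can find in the page."""
    return SCRAPE_RULE.findall(html)


def feed_titles(xml_text):
    """Return every item title in an RSS-style feed."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]


print(scrape_titles(OLD_PAGE))  # ['Explosion reported in Colombo']
print(scrape_titles(NEW_PAGE))  # [] -- same content, but the rule no longer matches
print(feed_titles(FEED))        # ['Explosion reported in Colombo']
```

Note that the scraper returns an empty list rather than an error, which is exactly why a presentation change is so easy to mistake for "no new content."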

Even without scraping, one of the deep problems with mashups and distributed programming remains: services disappear, are altered, or change their usage terms, breaking dependent mashups in the process. A lot has been written on this, which can generally be summed up as "this is a hard problem."

One thing we plan to do in the future to keep service changes from harming downstream dependents is to use more of the advanced functionality of the WSO2 Registry, upon which the WSO2 Mashup Server is built - namely, versioning. Today, each time a service changes, an old copy is retained in the database but is no longer "alive" as a service. Some future version will have a simple interface for keeping old versions online and helping users lock into one of them. Some cool dependency management features on the drawing board for the Registry will also help find and record dependencies and notify dependents of changes.
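To make the idea concrete, here is a toy in-memory sketch of versioning with pinning and change notification. It is not the WSO2 Registry API; the class `VersionedRegistry`, its methods, and the `incidents` service are all hypothetical names invented for this example.

```python
class VersionedRegistry:
    """Toy registry: every published version stays callable (hypothetical design)."""

    def __init__(self):
        self._versions = {}    # service name -> list of implementations, oldest first
        self._dependents = {}  # service name -> callbacks to notify when it changes

    def depend_on(self, name, callback):
        """Record a dependency so the dependent hears about new versions."""
        self._dependents.setdefault(name, []).append(callback)

    def publish(self, name, implementation):
        """Add a new version without retiring old ones, then notify dependents."""
        versions = self._versions.setdefault(name, [])
        versions.append(implementation)
        number = len(versions)  # versions are numbered from 1
        for callback in self._dependents.get(name, []):
            callback(name, number)
        return number

    def call(self, name, *args, version=None):
        """Call the latest version, or a pinned one if `version` is given."""
        versions = self._versions[name]
        if version is None:
            return versions[-1](*args)
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1](*args)


registry = VersionedRegistry()
notices = []
registry.depend_on("incidents", lambda name, number: notices.append((name, number)))

# Version 1 returns a list; version 2 changes the shape of the result.
registry.publish("incidents", lambda city: [f"{city}: incident"])
registry.publish("incidents", lambda city: {"city": city, "items": ["incident"]})

print(registry.call("incidents", "Colombo", version=1))  # ['Colombo: incident']
print(registry.call("incidents", "Colombo"))             # {'city': 'Colombo', 'items': ['incident']}
print(notices)                                           # [('incidents', 1), ('incidents', 2)]
```

A mashup pinned to version 1 keeps getting the list it was written for, and the notification gives its author a chance to migrate on their own schedule.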

But would these help in the case of the internationalincident service? Here the change was deliberate, made to prevent "unauthorized" access. The solution in this case was to mark the service as obsolete and go out and find a whole new source of data. The new srilankanincident service is the result - though the data is slightly different, perhaps as a result of focusing narrowly on Colombo, it was a fairly short task to reprogram it, and even improve it, once I had found a new source of data. Fixing catastrophic failures quickly is my current best hope against scrape rot.
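Fixing things quickly starts with noticing quickly. The sketch below is one way to do that, not what the mashup actually does: it treats an empty or failing source as obsolete and moves on to the next one. The names `SourceObsolete`, `fetch_incidents`, and both fetchers are hypothetical.

```python
class SourceObsolete(Exception):
    """Raised when every known source has failed or gone quiet."""


def fetch_incidents(sources):
    """Try each (name, fetch) pair in order; return the first non-empty result."""
    failures = []
    for name, fetch in sources:
        try:
            items = fetch()
        except Exception as error:
            failures.append(f"{name}: {error}")
            continue
        if items:
            return name, items
        # An empty result is suspicious: a broken scraping rule looks just like this.
        failures.append(f"{name}: returned no items")
    raise SourceObsolete("; ".join(failures))


def old_source():
    # Stand-in for a site that now demands a login.
    raise PermissionError("login required")


def new_source():
    # Stand-in for a replacement source with a narrower focus.
    return ["Colombo: incident"]


print(fetch_incidents([("old", old_source), ("new", new_source)]))
```

When no source works, the raised `SourceObsolete` carries every failure reason, which is the signal to go looking for new data.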
