May 27

A scraping incident

One of my mashups fell prey to the dreaded scrape rot - a complete overhaul of the target site that invalidated all of my scraping rules. The pages in question are from globalincidendmap.com, which previously powered my internationalincident mashup (see Sri Lanka Incident Mashup). The change was catastrophic - queryable content from the site is no longer free, but requires a paid membership and a login process.

One option would be to pay for a membership, but besides the steep price, I doubt that the license terms allow republishing of the data. I could support the mashup only for paid users of the service, collecting their credentials and forwarding them on, but that is both questionably secure (a user would have to trust that I wouldn't abuse the credentials temporarily in my possession) and unrealistic, since few if any of my audience would spring for the cost of membership. In effect, my mashup has been totaled.

This illustrates one reason why scraping should be used only as a last resort, when no more stable forms of content - feeds or Web services - are available. When content and presentation are mixed, changes in the presentation are easily confused with changes in the content. Although the scraping features of the WSO2 Mashup Server are popular, I like to think of them as a stop-gap while publishers find cost-effective ways of serving up presentation-free content, such as delivering simple services using the Mashup Server ;-). Ideally, more and more publishers will recognize the value of raw content, and the need for Web scraping will diminish. Gonna take a while though...

Even without scraping, one of the deep problems with mashups and distributed programming remains: services that disappear, are altered, change their usage terms, and so on, breaking their dependent mashups in the process. Much has been written on this, which can generally be summed up as "this is a hard problem."
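
To make the point about mixing content and presentation concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the sample markup, the `IncidentScraper` class, and the `scrape` and `read_feed` helpers are hypothetical, and none of it reflects the real site's HTML or the Mashup Server's scraping features. It only shows how a rule keyed to markup silently returns nothing after a redesign, while a feed carrying the same content is unaffected.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same incident in two page layouts and one feed.
OLD_PAGE = '<table><tr><td class="incident">Sample incident A</td></tr></table>'
NEW_PAGE = '<ul><li><span class="evt-title">Sample incident A</span></li></ul>'
FEED = "<rss><channel><item><title>Sample incident A</title></item></channel></rss>"


class IncidentScraper(HTMLParser):
    """Scraping rule tied to presentation: text inside <td class="incident">."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.incidents = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "incident") in attrs:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.incidents.append(data.strip())


def scrape(html):
    """Apply the markup-based rule to a page and return the matched text."""
    parser = IncidentScraper()
    parser.feed(html)
    parser.close()
    return parser.incidents


def read_feed(xml_text):
    """Read item titles from a feed; no dependence on page layout."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]


old_result = scrape(OLD_PAGE)   # the rule matches the old layout
new_result = scrape(NEW_PAGE)   # same content, new layout: the rule finds nothing
feed_result = read_feed(FEED)   # the feed does not care how the page looks
```

Note that the scraper does not raise an error on the redesigned page; it just comes back empty, which is exactly why a change in presentation is so easy to mistake for a change in content.
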
One thing we plan to do in the future to make sure that service changes don't harm downstream dependents is to use more of the advanced functionality of the WSO2 Registry, upon which the WSO2 Mashup Server is built - namely, versioning. Today, each time a service changes, an old copy is retained in the database, but it is no longer "alive" as a service. Some future version will have a simple interface for keeping the old versions online, and will help users lock into one of these previous versions. Some cool dependency management features on the drawing board for the Registry will also help find and record dependencies and notify dependents of changes.

But would these help in the case of the internationalincident service? This is a case where a deliberate change prevents "unauthorized" access. The solution in this case was to mark the service as obsolete, and go out and find a whole new source of data. The new srilankanincident service is the result - though the data is slightly different, perhaps because it focuses narrowly on Colombo, it was a fairly short task to reprogram the mashup, and even improve it, once I had found a new source of data. The speed of fixing catastrophic failures is my current best hope against scrape rot.
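
As a rough illustration of the versioning and notification ideas mentioned above, here is a toy in-memory sketch in Python. It is not the WSO2 Registry API: `VersionedRegistry` and its `publish`, `subscribe`, and `resolve` methods are hypothetical names, and the "services" are plain functions standing in for deployed mashups.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class VersionedRegistry:
    """Toy registry (hypothetical): keeps every version of a service callable."""

    versions: Dict[str, List[Callable[[], str]]] = field(default_factory=dict)
    subscribers: Dict[str, List[Callable[[str, int], None]]] = field(default_factory=dict)

    def publish(self, name: str, service: Callable[[], str]) -> int:
        """Store a new version alongside the old ones and notify dependents."""
        history = self.versions.setdefault(name, [])
        history.append(service)
        version = len(history)  # versions are numbered from 1
        for notify in self.subscribers.get(name, []):
            notify(name, version)
        return version

    def subscribe(self, name: str, callback: Callable[[str, int], None]) -> None:
        """Register a dependent that wants to hear about new versions."""
        self.subscribers.setdefault(name, []).append(callback)

    def resolve(self, name: str, version: Optional[int] = None) -> Callable[[], str]:
        """Return the pinned version, or the latest when no pin is given."""
        history = self.versions[name]
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{name} has no version {version}")
        return history[version - 1]


registry = VersionedRegistry()
notices = []
registry.subscribe("incidents", lambda name, v: notices.append((name, v)))

v1 = registry.publish("incidents", lambda: "free listing")
pinned = registry.resolve("incidents", version=v1)   # a dependent locks into v1
registry.publish("incidents", lambda: "login required")

latest = registry.resolve("incidents")
```

Here `pinned` keeps returning the old behavior after the second `publish`, and the subscriber hears about both versions. A pin like this only protects against changes to the service itself, though; a pinned scraping service would still call the live site, so it would not have saved the internationalincident mashup.
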