October 26
Mashing up a National Geographic Photo of the Day Feed
(This article first appeared a few days ago on the WSO2 Oxygen Tank.)
I recently wrote a neat little mashup which demonstrates a little of the power of the WSO2 Mashup Server to flow information from one place to another, and from one format to another. I had a simple set of requirements:
Essentially, then, the task was to scrape the URLs from the photo of the day and package them into a feed. The complication is that there doesn’t seem to be a list of photos of the day available on the National Geographic web site, just links from a particular photo to the one for the previous (or next) day. Because a feed of 30 photos requires 30 different pages to be scraped, caching becomes necessary to keep performance acceptable, especially since feed readers can be expected to bombard the service if it proves popular. I initially broke down the task into three parts:
Here’s how I approached each of these tasks.
Scraping a photo of the day page
The first order of business for scraping a page like this is simply to fetch the page and tidy it into XML so we can navigate it using tools like XPath. The WSO2 Mashup Server provides a “Scraper” object that accepts an XML language describing the steps involved in scraping. This configuration language is defined by the Web Harvest component that we use for scraping. I usually start a scraping mashup with a simple function that configures and performs the scrape, and returns the results:
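The original listing is not reproduced here. As a rough sketch only, the two pieces such a function needs could look like the following. The element names (`<var-def>`, `<html-to-xml>`, `<http>`) come from Web Harvest; the helper names `buildScrapeConfig` and `stripXmlDeclaration` are hypothetical, and the configuration is built as a plain string (rather than an E4X literal) purely so the snippet runs outside the Mashup Server.

```javascript
// Sketch only: helper names are hypothetical, not the article's code.
// Builds a Web Harvest configuration that fetches a URL, tidies the HTML
// into XML, and stores the result in a variable named "response".
function buildScrapeConfig(url) {
  // Escape characters that are not allowed inside an XML attribute value.
  const safeUrl = String(url)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
  return (
    "<config>" +
    '<var-def name="response">' +
    "<html-to-xml>" +
    '<http method="get" url="' + safeUrl + '"/>' +
    "</html-to-xml>" +
    "</var-def>" +
    "</config>"
  );
}

// Removes a leading XML declaration (<?xml ... ?>) so the remaining text
// can be handed to a parser that does not accept declarations.
function stripXmlDeclaration(text) {
  return text.replace(/^\s*<\?xml[^?]*\?>\s*/, "");
}

// Inside the Mashup Server the scrape itself would look roughly like this
// (Scraper and E4X's XML are provided by the host, so this stays a comment,
// and passing the config as a string is an assumption):
//   var scraper = new Scraper(buildScrapeConfig(url));
//   var page = new XML(stripXmlDeclaration(scraper.response));
```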
The config language itself is pretty straightforward once you learn to read it inside out: the <http> element fetches the requested URL, and the <html-to-xml> element does just what it sounds like and tidies the result, which is put into a variable named “response”. The scrape is performed by initializing a new “Scraper” object with the config, and the result is made available through the “response” property on that object, corresponding to the “response” variable we defined within the config. One trick though: the result is a stream of XML text, including an XML declaration. The E4X extensions can parse this into XML (new XML()), but can’t handle the XML declaration, so we have to strip off the declaration ourselves using string manipulation.
By placing the above function in a file named “nationalgeographic.js” in the “scripts” directory of the Mashup Server, a Web service with a scrape_picture_page operation will be deployed. We can get to it through the try-it page (http://localhost:7762/services/jonathan/nationalgeographic?tryit) and see what the tidied HTML looks like for the page.
Extracting the data from the page can be a tedious process, involving looking at HTTP request-response pairs and trawling through the HTML source of a page. Fortunately the National Geographic site’s HTML is simple and straightforwardly structured, with a number of well-placed identifiers to help us zero in on the interesting content. I usually end up using Firebug (a Firefox debugging extension) to navigate the live HTML of the page and develop some XPath expressions that extract the desired metadata for the page.
I’ve also found that, since Web Harvest communicates between components using strings rather than parsed XML, defining a lot of XPath filters to extract information one element at a time during a scrape can perform poorly. Instead it seems much faster to wrap a series of XPath expressions into a simple XSLT stylesheet, so the XML can be parsed once, queried as much as needed, and an XML structure containing the results returned in one action. To do that, I added an XSLT stylesheet to the above configuration:
var config =
Again, fairly straightforward: the <xslt> task has two inputs, <xml> and <stylesheet>. The stylesheet unfortunately has to be enclosed in a CDATA section rather than written as straight XML. One other nice trick though: when the output is produced by an XSLT stylesheet, the “omit-xml-declaration” flag can be used to strip off the XML declaration, so we don’t have to do it through text manipulation, simplifying and accelerating our JavaScript code.
So we’re almost there with this capability. Some minor improvements and adding caching are all we need:
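The full listing is not reproduced here. As a loose sketch of the caching idea only: wrap the scrape so that every successfully scraped page is stored under its date. In this sketch an in-memory `Map` stands in for the external storexml service, `scrapeFn` stands in for the Web Harvest scrape, and the `date` and `previous` field names are assumptions about the shape of the scraped metadata.

```javascript
// Sketch only: a Map stands in for the storexml service, and the metadata
// shape ({ date, previous, ... }) is an assumption for illustration.
function makeCachingScraper(scrapeFn, cache) {
  return function scrapePicturePage(url) {
    const metadata = scrapeFn(url);
    // Only store successful scrapes, keyed by the photo's date, so a page
    // that failed to load can be retried on a later request.
    if (metadata && metadata.date) {
      cache.set(metadata.date, metadata);
    }
    return metadata;
  };
}
```

Keying the cache by date rather than by URL is a design choice in this sketch: it lets a later lookup ask "do we already have the photo for this day?" without knowing that day's page URL.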
system.include("storexml.stub.js");
xsDate.visible = false;
As an aside, this shows a couple of my wishes:
Finding a picture for a particular date
Now that we have a function that can scrape a page given a URL, and given that the data returned and cached by that function contains a link to the previous day’s page, we can do some walking around in the cache to find data for a particular date. That’s what this function does.
First, we look in the cache for a photo’s metadata. If it’s there, we can simply return it and we’re done. Otherwise we need to find the URL for the page representing that date and call the scrape_picture_page operation. If I can’t find the requested date in the cache, I look for the next later date, and so on, until I do find a photo in the cache (or I reach today’s date). That’s the first while loop. Then, using the <previous> page URL, I work backward again, incidentally populating the cache as I go, until I’m back to the date I was looking for.
The couple of “if” statements look for exceptional conditions: the first one handles the case where I’ve looked all the way forward to today but still haven’t found anything in the cache, and the second makes sure that if a page can’t be scraped for some reason, we give up and return what little we have before we dig ourselves any deeper.
picture_for_date.inputTypes = {"date" : "xs:string"};
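The body of picture_for_date is not reproduced here. The cache walk described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the original code: dates are "YYYY-MM-DD" strings, `cache` is a `Map` standing in for the storexml service, `scrapePage` and `todayUrl` are hypothetical parameters, each `previous` link is assumed to point to exactly one day earlier, and the sketch returns `null` when it gives up.

```javascript
// Sketch only: parameter names and the null-on-failure convention are
// assumptions. Dates are "YYYY-MM-DD" strings, which sort chronologically.
function addDays(isoDate, days) {
  const d = new Date(isoDate + "T00:00:00Z");
  d.setUTCDate(d.getUTCDate() + days);
  return d.toISOString().slice(0, 10);
}

function pictureForDate(date, today, cache, scrapePage, todayUrl) {
  if (cache.has(date)) return cache.get(date); // already cached: done
  if (date > today) return null;               // no photo for future dates

  // First loop: walk forward until a cached photo is found or today is reached.
  let cursor = date;
  while (cursor < today && !cache.has(cursor)) {
    cursor = addDays(cursor, 1);
  }

  // First exceptional case: nothing cached all the way up to today,
  // so scrape today's page to get a starting point.
  let current = cache.get(cursor);
  if (!current) {
    current = scrapePage(todayUrl);
    if (!current) return null;
    cache.set(cursor, current);
  }

  // Second loop: follow the "previous" links back to the requested date,
  // populating the cache along the way.
  while (cursor > date) {
    const previous = current.previous ? scrapePage(current.previous) : null;
    // Second exceptional case: a page could not be scraped, so give up.
    if (!previous) return null;
    cursor = addDays(cursor, -1);
    cache.set(cursor, previous);
    current = previous;
  }
  return current;
}
```

With an empty cache, asking for a date three days back scrapes four pages (today plus three previous days). A later request for a date two days earlier than that scrapes only the two missing pages.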
Generating the feed
Now we have all the pieces in place to aggregate the data and generate a list of some kind as output. The picture_of_the_day operation does that for us. The function has some parameters controlling aspects of the feed: whether to link to the small, medium, large, or wide aspect ratio images, and how many items to include. If no number is specified, we generate a feed of the latest 30 photos, long enough to enjoy each photo but not so long that we get tired of it.
The WSO2 Mashup Server has a Feed object to help construct feeds, but because I’m targeting this feed at the Google Photos Screensaver I need to include some feed extensions that aren’t supported in the 0.2 release (though they’ve just been added to the nightly build!). It’s not hard to create an RSS feed by hand though, so that’s what I chose to do. First I prepopulate the channel with title, links, and description, and then loop through the photos adding an item for each of them. The first time through the loop, I also add in a <pubDate> reflecting the date of today’s photo.
Again, this isn’t rocket science; the hardest thing is simply to format the dates appropriately. During the loop I use JavaScript Date objects to increment days and tick over at the end of the month. I convert each date to an xs:date to access the cache, to an RSS Profile-conformant string for the <pubDate>, and to an xs:dateTime for use in the <atom:published/> element, which seems useful for the subscription page displayed in Internet Explorer 7.
picture_of_the_day.inputTypes =
You can access this operation through the try-it page at http://localhost:7762/services/jonathan/nationalgeographic?tryit and see that the operation returns a feed. However, the try-it page uses SOAP by default under the covers, which isn’t terribly friendly to feed readers like the Google Photos Screensaver. No problem: the Mashup Server also exposes its operations through a REST interface. By accessing the URL http://localhost:7762/services/jonathan/nationalgeographic/picture_of_the_day?size=wide, you can see the feed directly in the browser, point the screen saver at it, subscribe to it, etc. By adjusting the “size” and “numPhotos” parameters you can generate variants of the feed that suit your purpose.
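Returning to the date handling inside the feed loop: the three conversions mentioned above (xs:date, a `<pubDate>` string, and xs:dateTime) could be sketched like this. The helper names are hypothetical, and the sketch assumes all dates are treated as UTC, which the article does not state.

```javascript
// Sketch only: helper names are hypothetical, and all dates are treated as UTC.
const DAY_NAMES = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
const MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

function pad2(n) {
  return String(n).padStart(2, "0");
}

// Time of day as HH:MM:SS.
function utcTime(d) {
  return pad2(d.getUTCHours()) + ":" + pad2(d.getUTCMinutes()) + ":" +
    pad2(d.getUTCSeconds());
}

// xs:date (YYYY-MM-DD), e.g. for use as a cache key.
function toXsDate(d) {
  return d.getUTCFullYear() + "-" + pad2(d.getUTCMonth() + 1) + "-" +
    pad2(d.getUTCDate());
}

// RFC 822 style date with a four-digit year, for <pubDate>.
function toRssDate(d) {
  return DAY_NAMES[d.getUTCDay()] + ", " + pad2(d.getUTCDate()) + " " +
    MONTH_NAMES[d.getUTCMonth()] + " " + d.getUTCFullYear() + " " +
    utcTime(d) + " GMT";
}

// xs:dateTime without fractional seconds, for <atom:published>.
function toXsDateTime(d) {
  return toXsDate(d) + "T" + utcTime(d) + "Z";
}

// Returns `count` Date objects, starting at `start` and stepping back one
// day at a time; setUTCDate handles the roll-over at month boundaries.
function previousDays(start, count) {
  const dates = [];
  for (let i = 0; i < count; i++) {
    const d = new Date(start.getTime());
    d.setUTCDate(d.getUTCDate() - i);
    dates.push(d);
  }
  return dates;
}
```

For example, `previousDays(new Date(Date.UTC(2007, 10, 2)), 3).map(toXsDate)` steps from 2 November back across the month boundary to 31 October.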
Publishing the feed
Once I had written the service and tried it for a day or two to ensure it was stable (fixing a couple of edge cases as a result), I used the administrative UI in the Mashup Server to publish it to http://mooshup.com, which hosts the service live on the internet for others to use. The publishing process is simple: click the share button, confirm that http://mooshup.com is the destination, and click OK. While we have lots to do to make this site an attractive and useful place for members of the mashup community to hang out, it does give me a stable internet URL for the feed (for example http://mooshup.com/services/jonathan/nationalgeographic/picture_of_the_day?size=wide) so others can enjoy it. You can exercise the try-it page live from there, look at the metadata, or download the service to your local installation of the Mashup Server and run it there.
Last Word
Hopefully this helps you get a feel for the Mashup Server in action. We did some screen scraping, fairly sophisticated caching by invoking an external storexml Web service, formulated an RSS feed, and made it (and intermediate functions) available through a Web service with SOAP 1.2, SOAP 1.1, and HTTP bindings, including an HTTP GET binding amenable to RSS agents. Although we didn’t look at them in detail in this article, the Mashup Server generated a try-it page for debugging and exercising the service, WSDL, Schema, stubs for accessing the service simply from JavaScript or E4X environments, and even some human-readable documentation for the mashup. We ran the service locally, then published it live onto the internet. It also would not be hard to generate a custom HTML interface providing (for example) a slideshow of these photos, but in this case I wanted to show that user interfaces can go beyond just HTML pages by using Google Photos Screensaver as my ultimate user interface.
So what’s next for this service? The main improvement I can think of is rewriting the code to use the Feed object when it becomes capable of handling the images. It took me a while to figure out which RSS extensions were necessary, and it would be nice not to worry about the representation of dates. Maybe I could even offer an Atom feed in parallel. Another idea related to performance would be to experiment with a different, perhaps additional, caching strategy: cache the entire feed to disk and periodically refresh it using the recurrence capabilities of the Mashup Server. But those are perhaps good topics for future articles! Until then, enjoy the great photos available from National Geographic!
[Updated 6 Feb 2008 - added "jonathan" user to endpoint urls as required by the Mashup Server 1.0 release, and changed the online links to point to http://mooshup.com.]
October 09
WSO2 Mashup Server 0.2 Released
The WSO2 Mashup Server 0.2 release is now available for download, right on schedule three months after our 0.1 release! As I said on the advent of the 0.1 release, the approach we've taken to the Web Service composition space is simple:
The end result is a scriptable Web Services composition platform. We didn't make much noise around the 0.1 release, as we were still working on some of the fundamentals. But I'm very proud of the 0.2 release and encourage you to give it a whirl. This release marks major improvements in a number of areas:
One area where we didn't innovate was our user interface, which still has a number of usability and functionality issues. We held off on incremental improvements between 0.1 and 0.2 in order to focus on a significant revamp in 0.3. I'll be talking more about specific features and use cases of the WSO2 Mashup Server in weeks to come.