Integrating Existing Tools Into The VO - Part I
Note: The presentation of this material is scheduled to last only about 20 minutes. There is more here than can be discussed in that time, so this document is aimed at the proceedings; the presentation itself will be much quicker and cover only the bullet items.
IRAF Web Services Development
Introduction
Many of the tools and protocols being used in the VO (and the Summer School) today are naturally new to astronomy (e.g. VOTable, SIA/Cone protocols, the VOPlot viewer), or were easily adapted to work in a VO environment given their existing purpose to work as a front-end for archives or data centers (e.g. Aladin already supported tables so adding VOTable wasn't difficult) or explore large multi-dimensional datasets (e.g. Mirage). Newly developed interfaces such as the STIL table library and the TOPCAT viewer built with it are developed in a "modern" language like Java. In the case of IRAF, however, we have an extensive collection of existing tasks that do everything from the strictly utilitarian processing of image data to highly specialized analysis tools. Additionally, IRAF is a scripting/user environment already familiar to many astronomers that can be used in developing new science applications. Tasks, however, are implemented in a "custom" language (CL scripts or SPP) and lack the needed system infrastructure (e.g. native XML parsers or SOAP message handling) to easily handle VOTables or web-services interfaces, and resources needed for major system enhancements are scarce. Many other legacy systems share these same traits, but that doesn't make the systems or applications any less desirable in the VO era.
So, how can the VO make use of IRAF and the many other legacy software systems used by astronomers every day, and how can developers in environments that don't traditionally support web-services or network/web protocols make use of VO data? The answer, or at least one possible answer, is the tried-and-true method of using interfaces and wrappers.
Project Goals
The primary goals of the IRAF integration project are to
- Expose existing IRAF tasks as web services;
- Do so in a way that minimizes or eliminates required changes to existing tasks;
- Make it easy to develop new VO services (specifically science applications) using the IRAF system;
- Not require a Java/web-services expert to do the above;
- Automate the process as much as possible to be able to create or modify large numbers of services quickly;
- Allow the system to be packaged so it may be used elsewhere (i.e. closer to the data for processing before being shipped back to the user);
- And if there's time left over, make the solution work for systems other than IRAF.
This aspect of the larger IRAF-VO development project focuses mainly on server-side integration of the system. We've seen from the mini-tutorial that it is possible to use host tools from within IRAF tasks to do much of the actual work required of a VO client application; however, more complete and seamless integration of the system itself is an ongoing effort and is not discussed here.
Challenges
IRAF tasks may be compiled executables, script tasks, host commands, or some combination, and so "standard" approaches to legacy code deployment as web-services are not always practical or appropriate. We can treat all tasks as a form of command-line tool (i.e. call the binary directly or use a host wrapper on CL scripts), but we then ignore our goal of limiting the number of changes required to each task and still need to handle the particulars of each task (parameters, binary location, etc). Additionally, we must still deal with the serialization of VO data and protocols (VOTable, SOAP message handling, URLs, etc) in an environment where we may not be able to easily use existing interfaces for such things. Lastly, we must address the move from what is traditionally a desktop analysis environment operated by an intelligent user to the distributed data/computing model of the VO, where interactive applications need to operate easily in batch mode and it is now remote software invoking the task.
Approach
There are far too many tasks to consider integrating each into the VO individually, even if only trivial modifications are needed to each task -- a more systematic approach is needed. The strategy adopted then was to:
- Keep it Natural and Build Bridges:
- Java is a natural environment for implementing the web-service, dealing with SOAP and attachments, and URL handling.
- IRAF is the natural environment for implementing new science functionality
- So, figure out how Java can "run" IRAF the way a user would.
- Separate the System from the Application:
- Use core infrastructure where possible, i.e. stock IRAF/packages
- Hand-code support routines only where necessary
- Automate the generation of the web-service endpoints
- Think BIG:
- Interfacing individual tasks introduces too many complexities and too much hand-coding per task
- Interface the entire system and we can separate the web-service from the legacy code entirely
System Overview
The architecture for the current implementation is shown in Figure 1. At (1) the user initiates a request to the web-service using any form of client application (Java, Python, Perl, etc) and sends the standard SOAP message, perhaps containing attachments of data to be processed. The web-service endpoint at (2) is auto-generated Java code that handles the message (unpacks or converts data, validates input, etc) and constructs a command string to be executed in the legacy system. Utility code (hand-written) in (3) sends the message as a simple text string over a socket to an "adapter process" that is already running on the system. This adapter at (4) is a Unix process, optionally started from the xinetd super-daemon, that spawns the entire IRAF system as a child process and opens an input socket to accept command input. Commands are passed through to the IRAF CL in (5) just as a user would have entered them at the keyboard, and output is returned along the same path to the user.
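The socket pass-through in steps (3) and (4) can be illustrated with a minimal Python sketch. This is not the project's CLSocketClient or adapter code: the newline-terminated wire format, the `send_command` helper and the `FakeAdapter` stand-in (which only echoes the command rather than running a CL) are all assumptions made for illustration.

```python
import socket
import socketserver
import threading


class FakeAdapter(socketserver.StreamRequestHandler):
    """Hypothetical stand-in for the adapter process; it does not run IRAF."""

    def handle(self):
        # Read one newline-terminated command string from the endpoint.
        command = self.rfile.readline().decode("utf-8").strip()
        # A real adapter would pass `command` to the child CL and relay
        # its output; here we simply acknowledge what was received.
        self.wfile.write(f"received: {command}\n".encode("utf-8"))


def send_command(host, port, command, timeout=5.0):
    """Send one command line over a socket and return the text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((command + "\n").encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the adapter closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").strip()


def demo():
    """Start the fake adapter on a free local port and send one command."""
    with socketserver.TCPServer(("127.0.0.1", 0), FakeAdapter) as server:
        port = server.server_address[1]
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            return send_command("127.0.0.1", port, 'print ("hello")')
        finally:
            server.shutdown()
```

Calling `demo()` exercises the whole round trip on the local machine. In a real deployment the endpoint would connect to whatever port the adapter for that service class is listening on.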
The WEBSVC external package is like any other IRAF package but was created to hold the various small wrapper scripts or new science tasks we wish to deploy. Wrapper scripts are often necessary for a number of reasons:
- To expose only a few of the IRAF task parameters to the web-service interface;
- To combine more than one task to gain the functionality needed;
- To filter undesirable output from tasks, or to have tasks create output;
- To deploy an entirely new task in the system.
In addition, any other external package, core system task or host command is available just as it would be through the CL command prompt. The "IRAF adapter" process starts a CL session and initializes the child CL with the needed environment.
Developing a new service is a matter of writing an XML configuration file that describes the web-service interface (name, argument lists, etc) and the associated IRAF functionality (i.e. the command string to be sent). XSLT stylesheets transform this XML config file into the Java endpoint code (much of which is redundant with other services), client test code and the user documentation via Javadocs. Build scripts tie this all together and will automatically deploy the service if needed. Abstract data types such as "VOTable" trigger code to be generated that automatically converts the VOTable, which IRAF doesn't understand, to a FITS binary table it can deal with and passes this temporary table through for processing; output tables are likewise converted back to VOTable when returning from the service.
A sample (minimal) configuration file for an echo service is shown in Figure 2. The <svc> root element defines the service class; there may be any number of <method> blocks defining individual services. The <method> blocks define the calling argument list for the service as well as the associated IRAF command in the <iraf> block. In this case we simply call the print command of the CL with the input argument and return a string. The <session> block allows us to configure in part how the service will be deployed, e.g. running on a different inet port so that simple calculator services won't be blocked while a long-running image mosaicing service is processing another request. Any number of service classes may be running simultaneously (including private user sessions), each operating on a different socket. In the generated code fragment we see the use of the CLSocketClient class, which is one of the hand-coded utilities needed; here it simply sends the command string to the adapter process and returns the result to the service.
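To make the idea of a configuration-driven command string concrete, here is a small Python sketch. It does not reproduce the actual file in Figure 2: only the element names <svc>, <session>, <method> and <iraf> come from the text, while the attribute names, the <arg> element, the port number and the `$text` placeholder syntax are hypothetical. The real system generates Java via XSLT; this sketch only interprets the description at run time.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical configuration; attribute names and <arg> are illustrative only.
CONFIG = """
<svc name="Demo">
  <session port="3200"/>
  <method name="echo" returns="string">
    <arg name="text" type="string"/>
    <iraf>print ("$text")</iraf>
  </method>
</svc>
""".strip()


def load_service(xml_text):
    """Return (port, methods) parsed from the service description."""
    root = ET.fromstring(xml_text)
    port = int(root.find("session").get("port"))
    methods = {}
    for method in root.findall("method"):
        methods[method.get("name")] = {
            "args": [arg.get("name") for arg in method.findall("arg")],
            "template": Template(method.findtext("iraf").strip()),
        }
    return port, methods


def build_command(methods, name, **values):
    """Validate the arguments and fill in the command string for one method."""
    spec = methods[name]
    missing = [arg for arg in spec["args"] if arg not in values]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    for key, value in values.items():
        # Reject characters that could break out of the quoted CL string.
        if '"' in str(value) or "\n" in str(value):
            raise ValueError(f"unsafe value for {key!r}")
    return spec["template"].substitute({k: str(v) for k, v in values.items()})
```

With this description, `build_command(methods, "echo", text="hello")` yields the string `print ("hello")`, which is what would be sent to the adapter process.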
The Science Theme Services -- A Real World Example
The GALCAS/GALMORPH demo services developed for the science theme use both the above system to deploy the web-service and some of the techniques discussed in the IRAF mini-tutorial to implement the science functionality as a new IRAF task. In this case we used a third-party external package (GIM2D from Luc Simard at CADC) for Bulge+Disk decomposition in GALMORPH and a modified version of SExtractor to compute the CAS (Concentration, Asymmetry and Surface brightness) parameters for both tasks. Unlike most IRAF packages, source code for GIM2D is not available from the author and documentation is sometimes anecdotal at best.
To achieve the science aim of the task, a number of individual tasks in the package needed to be combined to automate the process a user would normally execute as separate steps in the analysis. We also tuned parameters and the script itself to better handle the WESIX input catalog we are using to find galaxies and the SDSS image data we are expecting (since image-specific quantities such as seeing estimates, pixel-scale, detector gain etc are needed by GIM2D).
WESIX gave us a cross-matched catalog for galaxy identification, but the service doesn't provide the required localized background estimate as one of its options. This was an easy addition to SExtractor but meant we had to additionally run a localized SExtractor and merge the resulting table with the WESIX input to produce a suitable input table for the GIM2D tasks. The STILTS table utilities were used to convert the input tables to FITS tables for processing in the script; thereafter we used standard TABLES package tasks for manipulating tables.
With the proper input tables for our selected galaxies GIM2D ran much more reliably, but unfortunately it also ran more slowly (on the order of minutes/galaxy cutout to fit the model). As part of a research program this may be acceptable, but as a demonstration during a tutorial it wasn't nearly fast enough to process a sufficient number of galaxies to show a useful plot or provide students with a hands-on experience.
The next step was to create a faster service that could be used somewhat interactively. The script developed so far was already using a modified SExtractor with custom output columns, so the decision was made to use a further modified version to compute the CAS parameters at the same time. For the GALCAS service we stop at this point and format a return table; for GALMORPH we continue on to the GIM2D processing stage and return all derived values in a merged table. This brought the total processing time down to a few tens of seconds for large clusters with only a slight overhead for development and processing.
Lessons Learned
In developing the system used to deploy IRAF tasks as web-services, and in looking at the different use-cases presented by existing applications of all types, a number of lessons were learned:
- Operating in a New Environment
- Legacy tasks were probably designed with a specific purpose or user interaction in mind -- this doesn't always match with the way that task may be used as a service by client software. Similarly, even general-purpose tasks make assumptions or have requirements on their data and don't always trap cases where these assumptions aren't met.
- Not All Tasks Make Good Web-Services
- Tasks which are even slightly interactive or produce screen graphics generally make poor web-service candidates. In other cases an application may have too many parameters/options, may bury the one output number we're interested in under a stack of other printout, or may need to be run as part of a larger processing pipeline. Here we simply make use of a wrapper script to combine functionality, filter output or present a different interface.
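As an illustration of the output-filtering role of a wrapper, the Python sketch below pulls a single value out of a larger printout. The `key = value` layout of `SAMPLE_OUTPUT` is invented for the example and does not represent the output of any particular IRAF task; a real wrapper would be written against the actual task output, typically as a CL script.

```python
import re

# Hypothetical task printout; the layout is an assumption for illustration.
SAMPLE_OUTPUT = """\
# statistics for cutout.fits
image    = cutout.fits
npix     = 2500
mean     = 103.72
stddev   = 4.18
"""


def extract_value(text, key):
    """Return the numeric value printed for `key`, ignoring everything else."""
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*=\s*(\S+)", re.MULTILINE)
    match = pattern.search(text)
    if match is None:
        raise KeyError(key)
    return float(match.group(1))
```

Here `extract_value(SAMPLE_OUTPUT, "mean")` returns only the one number the service needs, so the endpoint can hand back a simple typed result instead of raw text.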
- Not All VO Data are Alike
- Another presentation will discuss how to deal with invalid data. In our experience, however, we encountered data that aren't necessarily invalid, but aren't quite what we expected:
- Plate-solution keywords instead of FITS WCS keywords in the header
- Gzip-compressed FITS images
- VOTables with little or no UCD usage
- Optimizations are Important Considerations
- Consider the costs of data transfer when designing the service. For example,
- 2K x 2K real image ==> ~16 MB
- 100 50x50 cutouts ==> ~1 MB (pixels only)
- As FITS files ==> ~5 MB (but half is padding!)
In the case of GALMORPH the transfer time is tiny compared to the processing time required by the service, and although we do end up using the entire image we could just as easily have used cutouts with a bit more cleverness.
In cases where we don't need associated image headers, the use of in-line data in the VOTable itself can significantly reduce the amount of bandwidth required for a service. However, support for in-line data is not always available and presents its own challenges in IRAF.
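The pixel-only sizes quoted for the full image and the cutouts follow from simple arithmetic, assuming 4-byte real pixels. The short Python sketch below reproduces them and also shows how FITS block padding inflates small cutouts. The 2880-byte block and 80-byte header card sizes are part of the FITS format; the 36-card header is purely illustrative, since the header size of the real files is not stated here and the ~5 MB figure depends on it.

```python
import math

FITS_BLOCK = 2880  # FITS headers and data are padded to 2880-byte blocks
CARD = 80          # each FITS header card occupies 80 bytes


def padded(nbytes):
    """Round a byte count up to a whole number of FITS blocks."""
    return math.ceil(nbytes / FITS_BLOCK) * FITS_BLOCK


def pixel_bytes(nx, ny, bytes_per_pixel=4):
    """Raw pixel storage, assuming 4-byte real pixels by default."""
    return nx * ny * bytes_per_pixel


def fits_file_bytes(nx, ny, header_cards, bytes_per_pixel=4):
    """Size of a single-image FITS file with the given number of header cards."""
    return padded(header_cards * CARD) + padded(pixel_bytes(nx, ny, bytes_per_pixel))


full_image = pixel_bytes(2048, 2048)          # 16,777,216 bytes, ~16 MB
cutout_pixels = 100 * pixel_bytes(50, 50)     # 1,000,000 bytes, ~1 MB
# Illustrative only: 36 header cards is the minimum one-block header.
cutout_files = 100 * fits_file_bytes(50, 50, header_cards=36)
```

Even with the smallest possible header, each 10,000-byte cutout grows to 14,400 bytes on disk, and realistic survey headers are considerably larger.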
- Handling VOTables
- The majority of data providers use only the simplest features of a VOTable, and converting these to a standard FITS binary table for processing is not a problem. For IRAF script tasks we use a host command to do the conversion and native tasks for the processing. There are a few examples, however, where handling VOTables becomes an issue:
- Multiple resources in a single VOTable document don't necessarily map well to FITS bintables.
- Variable length array elements. Bintables support these but the data itself may represent e.g. the CD matrix of an image WCS where applications may expect this information to be in the image header.
- In-line image data aren't well-supported by interfaces other than STIL (fortunately they aren't used much either).
It is up to the service implementer to decide whether a VOTable is a sensible return object.
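For the simple single-table case, the host-command conversion might look like the following Python sketch, which only builds the argument list for a STILTS `tcopy` call. The parameter spellings (`ifmt`, `ofmt`, `in`, `out`) are given from memory of the STILTS documentation and should be checked against the installed version; the `stilts` launcher name and the file names are assumptions, and this is not the command the project actually used.

```python
def stilts_votable_to_fits_argv(votable_path, fits_path, stilts="stilts"):
    """Build (but do not run) a STILTS tcopy command line.

    The parameter names are assumed from the STILTS documentation;
    verify them before relying on this.
    """
    return [
        stilts,
        "tcopy",
        "ifmt=votable",
        "ofmt=fits",
        f"in={votable_path}",
        f"out={fits_path}",
    ]


argv = stilts_votable_to_fits_argv("wesix_input.xml", "wesix_input.fits")
```

The resulting list can be passed to `subprocess.run` on the host, or the equivalent command declared as a foreign task so that a CL script can call it directly.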
- URL Handling
- Image access references from SIA queries, and the query itself, depend heavily on the resolution of a URL to return an object. While we can lean on host tools to resolve the URL, we do so using the foreign task mechanism that executes the command in a spawned shell. URLs however typically contain characters ('?', '&', etc) that are special to the shell and so must be escaped. While this can be made to work it is not nearly as efficient as having URL support directly in the system.
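The shell-escaping problem is easy to demonstrate on the host side. The Python sketch below quotes each argument with the standard `shlex.quote` before joining them into a command line; the URL is a made-up example and `curl` is only one possible host tool. It addresses quoting for the spawned shell only, not any additional quoting the CL itself may require.

```python
import shlex


def host_fetch_command(url, outfile, tool="curl"):
    """Build a shell-safe command line that downloads `url` to `outfile`."""
    parts = [tool, "-s", "-o", outfile, url]
    # shlex.quote wraps any argument containing shell-special characters
    # ('?', '&', etc.) in single quotes and leaves safe arguments untouched.
    return " ".join(shlex.quote(part) for part in parts)


# Hypothetical SIA-style access URL, for illustration only.
command = host_fetch_command(
    "http://example.org/sia?POS=180.0,-0.5&SIZE=0.2", "cutout.fits"
)
```

Without the quoting, the shell would treat `&` as a request to run the command in the background and would try to glob the `?`, so the tool would see a truncated URL.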
- Acting as a VO Client
- The GALMORPH service takes as input an image and the output of a WESIX call with specific parameters. A more logical implementation would be for the service to require only the image and then make the WESIX call itself. This isn't such a hard thing to do when you consider there is already a WESIX client demo available in the software distribution that could be easily modified to do just what we need. We can get around the lack of a SOAP interface in IRAF by declaring this client to be just another foreign task. Similarly, SIA/SSA/Cone requests can be done using general-purpose tasks written in other languages until more integrated support is available in the system.
- Exception Handling
- Tasks may fail, and report that failure, in any number of ways: a legacy Fortran code may simply print an error message repeatedly as it iterates over the image, expecting that the human operator will read the message and know something went wrong. A command-line task may abort and set an error code in the system that can be checked, and still other environments like IRAF may try to recover from the error internally.
The problem when deploying any of these tasks as services is to find a way to trap those errors and report back to the Java endpoint in some standard way. Here again we can use a wrapper to check for anticipated error strings or status codes and return something that causes an exception. GALCAS/GALMORPH were developed using the interactive CL environment as normal scripts, but when these were first deployed as services there were a number of unexpected errors due to missing files or environment definitions that could only be debugged in the service itself. Since we didn't see these when building the service, they weren't specifically trapped in the script, making the transition from user-script to web-service more difficult. When in doubt, heavily error check your legacy tasks.
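A wrapper that traps both kinds of failure might be sketched in Python as follows. The `TaskError` class, the `run_task` helper and the list of error patterns are hypothetical; a real wrapper would check for whatever strings and status codes the particular legacy task is known to produce.

```python
import re
import subprocess
import sys


class TaskError(RuntimeError):
    """Raised when a wrapped task fails or prints a recognized error message."""


# Hypothetical patterns; tune these to the messages the task really emits.
ERROR_PATTERNS = [
    re.compile(r"\berror\b", re.IGNORECASE),
    re.compile(r"cannot open", re.IGNORECASE),
]


def run_task(argv, timeout=60):
    """Run a host command and turn any detected failure into an exception."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    output = result.stdout + result.stderr
    # Case 1: the task aborted and set a non-zero exit status.
    if result.returncode != 0:
        raise TaskError(f"exit status {result.returncode}: {output.strip()}")
    # Case 2: the task "succeeded" but printed a message meant for a human.
    for line in output.splitlines():
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            raise TaskError(f"task reported: {line.strip()}")
    return result.stdout


# Usage: a well-behaved command returns its output unchanged.
ok_output = run_task([sys.executable, "-c", "print('42.0')"])
```

Raising a single exception type gives the endpoint one thing to catch and convert into a service fault, regardless of how the underlying task chose to report its failure.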
