Behind the Scenes: How the TAP Service Works
============================================
Once you get past the issues of databases, ADQL and the like, TAP itself is fairly
straightforward. There are multiple paths through the processing, which can look
complicated, but once you understand the basics those variants fall into place.
The simplest TAP query is a "synchronous" request, where you hand the service an ADQL statement,
it gets processed, and the result table streams back as the response. But it is easier to
understand the processing by examining an "asynchronous" request, then treating the synchronous
request as a special case. Use synchronous queries if you know the query time will be short. If not
(or if you don't know for sure) use asynchronous queries.
This information is for completeness; most users will probably use either synchronous requests
or a TAP client library that hides all this from the end user. Users who want to build their
own client and have queries that run long enough to preclude synchronous querying will need to
understand what follows.
Asynchronous TAP Requests
-------------------------
The user starts by contacting the TAP service and hands it an ADQL query.
This is done through a URL like::

    https://exoplanetarchive.ipac.caltech.edu/TAP/async?query=select+pl_name,ra,dec+from+ps

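
When the submission URL is built in a script rather than typed by hand, the query
string has to be URL-encoded. The following is a minimal sketch using only the Python
standard library; the function name ``build_async_submit_url`` is hypothetical and the
base URL is the one from the example above. Note that ``urlencode`` percent-encodes
the commas (``%2C``), which is equivalent to the hand-written form.

```python
from urllib.parse import urlencode

# Base URL taken from the example above.
TAP_BASE = "https://exoplanetarchive.ipac.caltech.edu/TAP"


def build_async_submit_url(adql, base=TAP_BASE):
    """Return the URL that creates an asynchronous job for an ADQL query."""
    # urlencode() turns spaces into "+" and percent-encodes reserved
    # characters such as commas.
    return base + "/async?" + urlencode({"query": adql})


submit_url = build_async_submit_url("select pl_name,ra,dec from ps")
print(submit_url)
```
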
The service creates a workspace (with a random name) and a status.xml file in it containing
information on the query and the state of the processing::

    tap_4pxj0j5c
    10709
    PENDING
    0
    votable
    ADQL
    -1
    select pl_name,ra,dec from ps

It returns the job ID string (here "tap_4pxj0j5c") attached to the base URL and exits.
Technically, the status (and final data) could be retrieved by anyone but since the ID
is a random string and active for a short time (see Operations Issues) this
obfuscation provides adequate security.
The rest of the processing involves interacting with this job. You can query the status
(including retrieving the whole status structure) but the obvious next step is to actually
start the query running. The TAP specification requires that this be done through an
HTTP POST request but we support HTTP GET as well::

    https://exoplanetarchive.ipac.caltech.edu/TAP/async/tap_4pxj0j5c/phase?PHASE=RUN

This also returns immediately; the "phase" in the status XML is changed to "EXECUTING"
and a background process is started that runs the query. When this process completes,
the result data is written to the workspace, the status phase is updated to "COMPLETED"
and a "results" section is added to the status::

    tap_4pxj0j5c
    13957
    COMPLETED
    2020-06-06T08:33:10.76
    2020-06-06T08:33:39.28
    28.5
    2020-06-10T08:33:10.76
    votable
    ADQL
    -1
    select pl_name,ra,dec from ps

But the client cannot know this without asking. So after submitting the RUN request we have to
poll the phase information (or the whole status) until it is COMPLETED (or the job fails)::

    https://exoplanetarchive.ipac.caltech.edu/TAP/async/tap_4pxj0j5c/phase

The result link::

    https://exoplanetarchive.ipac.caltech.edu:443/workspace/TAP/tap_4pxj0j5c/result.xml

returns the final data.
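
The polling step described above can be sketched as a small loop. This is an
illustration under stated assumptions, not the service's own client code: it assumes
the ``phase`` URL returns the phase as plain text (the usual UWS behaviour), it treats
``COMPLETED``, ``ERROR`` and ``ABORTED`` as final, and the names ``fetch_phase`` and
``wait_for_job`` are hypothetical.

```python
import time
from urllib.request import urlopen

# Phases after which the job will not change again (UWS phase names).
TERMINAL_PHASES = {"COMPLETED", "ERROR", "ABORTED"}


def fetch_phase(job_url):
    """Read the current phase; assumes the endpoint returns plain text."""
    with urlopen(job_url + "/phase", timeout=30) as response:
        return response.read().decode("utf-8").strip()


def wait_for_job(job_url, get_phase=fetch_phase, interval=2.0,
                 max_polls=1000, sleep=time.sleep):
    """Poll until the job reaches a terminal phase and return that phase."""
    for _ in range(max_polls):
        phase = get_phase(job_url)
        if phase in TERMINAL_PHASES:
            return phase
        # Not finished yet: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"{job_url} still running after {max_polls} polls")
```

A caller would pass the job URL returned at submission, for example
``wait_for_job("https://exoplanetarchive.ipac.caltech.edu/TAP/async/tap_4pxj0j5c")``,
and retrieve the result link only if the returned phase is ``COMPLETED``. The
``get_phase`` and ``sleep`` arguments exist so the loop can be exercised without a
network connection.
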
Synchronous TAP Requests
------------------------
Blocking ("synchronous") requests simply shortcut much of the preceding. We still maintain
all the same information in the workspace but the query starts running immediately and the
original web connection stays up until the results are available and streamed back.
Obviously, this is much easier on the user but there is a big "but".
Simple HTTP requests time out, usually at somewhere around five minutes. Database queries
can literally last for days if you are doing something complex. So unless you can be sure
your query will finish quickly, it is better to run asynchronously.
The example we have been using here is a query to the Exoplanet Archive for a list of
planets with names and sky coordinates. This table currently has a few thousand records
so in fact synchronous queries work fine.
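
For comparison, a blocking request can be sketched in a few lines. Two things here
are assumptions rather than facts from this page: the ``/sync`` endpoint name (it is
the name used by the TAP standard, but only the ``/async`` URL is shown above) and
the 300-second client timeout, chosen to mirror the "around five minutes" figure.
The function name ``run_sync`` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

TAP_BASE = "https://exoplanetarchive.ipac.caltech.edu/TAP"


def run_sync(adql, base=TAP_BASE, timeout=300, opener=urlopen):
    """Send a blocking query and return the raw response body as bytes."""
    # "/sync" is assumed from the TAP standard; this page only shows "/async".
    url = base + "/sync?" + urlencode({"query": adql})
    # The timeout (in seconds) makes the client give up rather than hang.
    with opener(url, timeout=timeout) as response:
        return response.read()
```

Calling ``run_sync("select pl_name,ra,dec from ps")`` would return the result table
in whatever format the service sends by default. If the call times out, that is the
signal to switch to the asynchronous path.
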
Refinements
-----------
There are a variety of additional things you can do to an asynchronous query. Before
it starts running you can adjust the maximum number of return records through the maxrec
parameter (this is different from including a TOP directive in the ADQL; that is handled
by the DBMS). Likewise, you can adjust the maximum allowable execution duration.
While it is running, you can kill it by setting the phase to ABORT. Refer to the
TAP spec for details.
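
As a rough sketch of what these adjustments look like on the wire, the helpers below
build (but do not send) the corresponding POST requests. The ``phase`` endpoint
follows the pattern shown earlier; the ``parameters`` endpoint and the ``MAXREC``
key are assumptions based on the general UWS pattern and should be checked against
the service before use. Both function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def phase_request(job_url, phase):
    """Build a POST that sets the job phase, e.g. "RUN" or "ABORT"."""
    data = urlencode({"PHASE": phase}).encode("ascii")
    return Request(job_url + "/phase", data=data, method="POST")


def maxrec_request(job_url, maxrec):
    """Build a POST that changes MAXREC before the job is started.

    Assumption: the service accepts parameter updates at <job>/parameters,
    as in the UWS pattern; verify this before relying on it.
    """
    data = urlencode({"MAXREC": int(maxrec)}).encode("ascii")
    return Request(job_url + "/parameters", data=data, method="POST")


job_url = "https://exoplanetarchive.ipac.caltech.edu/TAP/async/tap_4pxj0j5c"
abort = phase_request(job_url, "ABORT")
print(abort.get_method(), abort.full_url, abort.data)
```

Either request can then be sent with ``urllib.request.urlopen(abort)``. Building the
request separately from sending it makes it easy to inspect exactly what will be
posted.
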
Clients
-------
As you can see, it is perfectly possible to interact with TAP "manually" using either
a browser or wget/curl scripts. However, there is enough to keep track of, especially
in the asynchronous case with polling, that client support software is advisable.
In Python, there are multiple options, notably Astroquery/TAPPlus and PyVO. However,
none of these is (so far) perfect, so be sure to test your use case thoroughly.