
GSAC Strawman Document

6 November 1997

-draft prepared by Myron McCallum, Lou Estey, and Chuck Meertens

Distributed Access to GPS Data Centers

This paper discusses mechanisms for distributing a catalog of GPS data and discusses implementing a common method to access the data. Feedback is requested! The intended audience is participants at the Seamless Archive Workshop at the UNAVCO Boulder Facility, Nov 11. Areas needing more work or group input may be identified at the meeting. Our goal is to leave the meeting with a plan to provide users with a mechanism for (1) identical data request and (2) identical data delivery, regardless of which data center is contacted. Of course, data centers will have extensions and enhancements to the basic method according to their own needs and capabilities.

Concept of Seamless Access for GPS Data Centers

A distributed and cooperative data environment is needed between current GPS data centers and existing or newly forming networks to provide the user community with seamless access to GPS data and metadata. Each participant in the proposed GPS Seamless Archive Data Centers (GSADC) will maintain its individuality and continue to bring its own strengths into play, yet give the user community a familiar, consistent look and feel for data access by providing standardized data and data products.

Issues related to seamless GPS data exchange:

  • data delivery requirements: near real-time data access to permanent station data for network solutions; "data prospecting" style of search and retrieval of legacy data; access to GPS distributed regional centers and other data centers.
  • data types to be distributed: raw receiver data; compressed files; RINEX files; descriptive information such as logs and site descriptions; orbits; polar motion parameters; antenna phase information.
  • data access mechanisms: anonymous FTP; automated query response by mail, http, LDM; interactive Web-Form-Tool interface.
  • standardized metadata (information about the data) to locate data sets (monument names, locations, instrument parameters, data observation times) and timeliness of updates of this information.
  • seamless data delivery mechanisms: ftp pulls or pushes; LDM; email; media via postal mail.

Each group or organization currently provides some method to access and retrieve data, in some cases frustratingly so, because the methods are different yet similar enough to be confusing. Implicit in our goals is to take advantage of existing methods by keeping the parts that work, and to converge the similar methods until they are identical.

Seamless Process Discussion

Resolving the primary types of data users and their requirements allows us to outline the core functions of a seamless archive. The resulting definition of a minimum set of metadata enables efficient "data discovery" by users, and exchanges of data to users and between data and analysis centers. Increased efficiency is required to handle the influx of new data types and growth in data networks and to handle specific processing requirements.

Real-time data distribution can obviously best be provided by the center closest to the collection of the data, should it wish to do so. This same center might wish to defer distribution of legacy data to avoid the impact on staff, computer storage resources and network capacity, especially when the data are available elsewhere. We discuss below some mechanisms to handle each case.

One-stop shopping for data products should be provided where possible so that researchers don't waste time seeking combinations of data from various centers, each with a different access mechanism. However, it is impossible to package every product that some users may want. Initial complete packages - such as RINEX, orbits and polar motion parameters - should be defined.

Two methods of accessing GPS data centers in use today are anonymous FTP, and query methods (mostly by Web-based form tools). FTP provides the simplest approach to making pre-identified data available, for example from permanent stations, where a hierarchical directory structure provides a known path to locate data by site name and time. Limitations of FTP include: non-uniform directory structures between data centers; potential for non-unique site names; and difficulty in maintaining and distributing automated updates. Anonymous FTP is in wide use for distributing data soon after collection, however.

A query method manages requests for accessing a wider range of GPS data files, where various data suppliers provide their own naming convention and where delays in arrival of data may approach weeks or months. Queries can request specific data to be delivered, or request an inventory or catalog of information about data availability. The query can be interactive, such as with a Web form, or from an automated or "batch" background process. Examples of batch queries in English:

  1. Send list of all monuments occupied on 1997 10 20.
  2. Send list of monuments near coastlines in the Caribbean Basin.

One problem with the query method is immediately obvious: what query language can be used, and will the user community adopt it? A batch query requires a formatted method to enter specific information and to deliver the request to the data center through an established communications channel. There is no standard mechanism to handle GPS data in this manner at this time. Requests for various data types and delivery methods (such as ftp, email, and media type) must be handled, as well as the return of an acknowledgment. We suggest adopting an existing query mechanism, described below: the IRIS NETDC EMAIL request, which evolved from the BREQ_FAST mechanism for requesting and delivering seismic data.

An example of an interactive query that demonstrates the Seamless concept:

  • The GPS researcher contacts any GSADC site via the Web, and sees a Web form.
  • After the researcher enters information on the web form to select data, that GSADC becomes the coordinator for the user's request.
  • The GSADC now assembles the data from its own data holdings and, if necessary, by coordinating retrieval from other GSADC sites. If the contacted GSADC does not hold the requested data, it may decline to support the request for outside data, but the response makes clear which requests were declined. A status report with an estimate of the time for data delivery is sent.
  • The data are transferred to the requester.

In a different example using a batch method instead of a web browser, the researcher may request an inventory of data holdings using a query language (discussed below) and may specify selection criteria to restrict the data returned. The GSADC will then perform identically to the first example.

Development Plan

Each GSADC will create and maintain an index of specific core information about GPS sites for which it is responsible and make it available to other interested GSADC participants in a timely manner. Each GSADC may therefore extend its catalog to include information from other GSADC's. Between the various GSADC sites supporting this option, queries to a GSADC may result in the GSADC returning data from its own storage or retrieving the data from a participant GSADC. So in some cases, regional GSADC's will maintain only data which they collect themselves, and more comprehensive GSADC's may maintain lists and retrieve data from multiple regional GSADC's. This coordinated index will exist on each GSADC system, will be regularly updated, and will be the primary information accessible by the user community as the first step to providing access to the data in the seamless archive.

A similar look-and-feel user interface for basic queries and retrieval of data and information will be provided by each GSADC participant. GSADC's will also coordinate to provide delivery of identical information and data in response to a user's queries.

To enable this process:

  1. Define a basic set of meta-information to coordinate and exchange, consisting of:
  • Monument names (both 4-character IDs, and additional information such as network where the data originated, for unique identification) and locations in latitude, longitude, elevation;
  • Time spans for which data exist for these monuments. This information should be presented as accessible data files, and as tabular listings for on-line interfaces;
  • These temporal listings should take two forms. One is panoramic and shows the earliest and latest period for which some data exist for the site; the other is more detailed, with a separate file for each data type, by GPS week. The panoramic file lists, for each site, the year, month and day of the earliest and latest data. The site name and unique identifier are followed by the time ranges. The detailed file will be organized by file type and GPS week, containing a list of each site name, followed on the same line (record) by a flag for each day of the week for which some data exist (à la Bock, like R, X, or B for raw, RINEX, or both files available, respectively) and other information. Granted, the granularity of this presentation means that some days with minimal data will be listed as present, but any further detail is not warranted.
  • Eventually include RINEX2-required header information such as antenna heights, and identify commonly desired optional RINEX header information.
  • Eventually include all relevant IGS log information, to track station occupation history, but initially just provide the limited data-discovery information above.
  2. Support two methods of data inquiry and retrieval.
  • The first is the FTP method (in wide use today for timely access to on-line data for permanent stations). We should standardize FTP structures (maintain identical directory hierarchy) to simplify locating data (according to the panoramic and detailed files explained above) or agree on a method to manage the different styles. If we know a 4-char ID, the GPS week and day, then the full pathname to locate files can be generated implicitly if we have a standard, and would not have to be explicitly listed. Without the standard, we must retain information or file paths specific to each data center. (Note: no files necessarily need be moved around - in some cases links could be used to establish a standard without impacting on-going work).
  • Second, establish a standard query method to allow data retrieval according to several selection criteria. We propose to implement the IRIS NETDC method, modified for GPS data.
  3. Define a standardized response to a user's queries that is easy for users to understand and for computer systems to parse. Again, the IRIS NETDC method is suggested.

GSADC Exchange

GSADC participants will therefore exchange information on data holdings consisting of the following two types of files. This will coordinate data holdings and metadata:

  1. Panoramic listing: a single file with one header line and then a line for each station:
  • Header line: institution/group name, contact information
  • Monument Names, Mi
  • Monument ID's, MIDi
  • Monument location (low-precision coordinates in data file) in lat, lon, elev.
  • Earliest start time for which data exist.
  • Latest end time for which data exist.
  2. Detailed listing - for each GPS week a file with one header line and then one line for each station with the following fields separated by white space:
  • Header line: institution/group name, contact information, FTP access name (if any)
  • Monument Names, Mi (4-character name)
  • Operational network data was collected in (4-character abbreviation)
  • Monument ID, MIDi, (Domes number if IGS site; otherwise network specific ID)
  • Monument location in low-precision coordinates (to 3 decimal places, so f8.3 for fortran people) from data file: lat, lon, elev.
  • The following information:
    7 bytes detailing data existence for each day of the week, one ASCII character per day. For GSADC, the data type is already identified, so a single byte per day suffices: "x" means no data; "0" means some data, amount unknown; "1"-"9" give the percentage of data present divided by 10; and "?" flags a change in station setup, such as a receiver or antenna swap, antenna height change, or reprogramming of the receiver. (The details of the changes are supposedly noted in the station log.)
    Instrument model name & Antenna - we need to define standard names
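
The 7-byte availability field described above could be decoded roughly as follows. This is only a sketch of one reading of the proposal: the function name is hypothetical, and treating a digit d as "about d x 10 percent" is an assumption drawn from the phrase "percentages divided by 10".

```python
def decode_day_flags(flags: str) -> list[str]:
    """Decode a 7-character weekly availability field, one character per day.

    Assumed meanings (taken from the draft text, not a finalized format):
      x    no data
      0    some data, amount unknown
      1-9  roughly digit * 10 percent of the day's data present
      ?    change in station setup (see the station log)
    """
    if len(flags) != 7:
        raise ValueError("expected exactly 7 flag characters, one per day")
    decoded = []
    for ch in flags:
        if ch == "x":
            decoded.append("no data")
        elif ch == "0":
            decoded.append("some data, amount unknown")
        elif ch in "123456789":
            decoded.append(f"about {int(ch) * 10}% present")
        elif ch == "?":
            decoded.append("station setup changed")
        else:
            raise ValueError(f"unrecognized flag {ch!r}")
    return decoded


# Example with a made-up flag string (not from any real listing).
for day, meaning in enumerate(decode_day_flags("x05?9x1")):
    print(day, meaning)
```

Whether "?" should also convey how much data exist for that day is left open by the draft; the sketch simply reports the flag.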

Examples: GSADC Requests

The GSADC requests will be email based, with a specific format modeled after IRIS' NETDC format but adapted for GPS data. A properly formatted request would be emailed to a fixed email address at each GSADC participant, for example: Mail gsadc@unavco.org.

The requests are parsable by computers yet readable by people, and consist of:

  • HEADER
  • INFO_TYPE

The HEADER identifies the entity making the request, and is described later.

The INFO_TYPE request line is an extendable method, which initially will return three types of information: 1) a catalog (inventory of data holdings, .INV); 2) response information (.RESP); and 3) data (.DATA). Generally a request looks like:

.<INFO_TYPE> <DATA_CENTER> <NETWORK> <STATION_ID> <LOCATION> <START> <END>
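
As a rough illustration of how a data center might split such a request line into its fields, the sketch below uses Python's standard shlex module so that quoted time strings stay intact. The Request type and the function name are hypothetical, and the assumption that every line carries exactly six fields after the INFO_TYPE comes from the general form above.

```python
import shlex
from typing import NamedTuple


class Request(NamedTuple):
    info_type: str    # INV, RESP or DATA (without the leading dot)
    data_center: str
    network: str
    station_id: str
    location: str
    start: str        # "*" or a time string such as "1996 11 24 00 00 00"
    end: str


def parse_request_line(line: str) -> Request:
    """Split one INFO_TYPE request line into named fields."""
    tokens = shlex.split(line)  # keeps "1996 11 24 00 00 00" as one token
    if len(tokens) != 7 or not tokens[0].startswith("."):
        raise ValueError(f"malformed request line: {line!r}")
    return Request(tokens[0][1:], *tokens[1:])


req = parse_request_line(
    '.INV UNAVCO_DMG IGS IGS.POL2 * "1996 11 24 00 00 00" "1996 12 08 23 59 59"'
)
print(req.info_type, req.station_id, req.start)
```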

Applying this format to examine the catalog of holdings, to determine existence of data for a particular IGS station, held at a specific data center:

.INV CDDIS IGS IGS.POL2 * * *

This request returns the basic location and time information from the panoramic file:

POL2 42.6798 74.6943 1725 1995-05-25 1997-08-22
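
A client could turn a panoramic response line like the one above into a typed record along these lines. This is a sketch only: the record and function names are hypothetical, and the units (degrees for latitude and longitude, meters for elevation) are assumed rather than stated in the draft.

```python
from datetime import date
from typing import NamedTuple


class PanoramicEntry(NamedTuple):
    site: str
    lat: float     # assumed degrees
    lon: float     # assumed degrees
    elev: float    # assumed meters
    first: date    # earliest day with data
    last: date     # latest day with data


def parse_panoramic(line: str) -> PanoramicEntry:
    """Parse one whitespace-separated panoramic response line."""
    site, lat, lon, elev, first, last = line.split()
    return PanoramicEntry(
        site,
        float(lat),
        float(lon),
        float(elev),
        date.fromisoformat(first),
        date.fromisoformat(last),
    )


entry = parse_panoramic("POL2 42.6798 74.6943 1725 1995-05-25 1997-08-22")
print(entry.site, entry.first, entry.last)
```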

Similarly, requesting detailed occupation information within a specific time range:

.INV UNAVCO_DMG IGS IGS.POL2 * "1996 11 24 00 00 00" "1996 12 08 23 59 59"

generates the detailed list of daily data availability by searching through detail-listing files:

POL2 42.6798 74.6943 1725 881 0 1 2 3
POL2 42.6798 74.6943 1725 882 0 1 2 3 4 5 6
POL2 42.6798 74.6943 1725 883 0 1 2 3 4 5 6

The missing data for days 4-6 on the 1st line is real. On the 3rd line, only the first day of data was requested but since the query returns information in one-week quantities all days are presented.
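
The mapping between calendar dates and the GPS weeks shown in the response can be checked with a few lines of code. The sketch below counts whole weeks from the GPS epoch of 6 January 1980 (a Sunday, the start of GPS week 0); the function name is hypothetical, and the calculation works on calendar days only.

```python
from datetime import date

GPS_EPOCH = date(1980, 1, 6)  # Sunday, start of GPS week 0


def gps_week_and_day(d: date) -> tuple[int, int]:
    """Return (GPS week, day of week) with day 0 = Sunday."""
    days = (d - GPS_EPOCH).days
    if days < 0:
        raise ValueError("date precedes the GPS epoch")
    return divmod(days, 7)


print(gps_week_and_day(date(1996, 11, 24)))  # (881, 0)
print(gps_week_and_day(date(1996, 12, 8)))   # (883, 0)
```

The requested range of 24 November to 8 December 1996 therefore spans GPS weeks 881 through 883 and touches only day 0 of week 883, which matches the three lines returned.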

As a last example, searching for any station data for May 21, 1989:

.INV UNAVCO * * * "1989 05 21" "1989 05 21"

would return:

AIRP 44.7 248.9 1946 489 0 1 2 3 4 5 6
PLAT 40.18 255.27 1520 489 0 1 2 3 4 5 6
HEBG 44.864 248.665 1979 489 0 1 2 3
REST 44.89 248.42 1801 489 0 1 2

Apparently you can get REST for 3 days, HEBG for 4 days, and the other two stations for the entire GPS week.
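
Counting available days per station, as done by eye above, is easy to automate. The sketch below assumes the response layout seen in these examples (site, latitude, longitude, elevation, GPS week, then the day-of-week numbers that have data); the function name is hypothetical.

```python
def parse_detail_line(line: str) -> dict:
    """Parse one detailed-availability response line into a dictionary."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"too few fields: {line!r}")
    return {
        "site": fields[0],
        "lat": float(fields[1]),
        "lon": float(fields[2]),
        "elev": float(fields[3]),
        "gps_week": int(fields[4]),
        "days": [int(d) for d in fields[5:]],  # days of the week with data
    }


response = """\
AIRP 44.7 248.9 1946 489 0 1 2 3 4 5 6
PLAT 40.18 255.27 1520 489 0 1 2 3 4 5 6
HEBG 44.864 248.665 1979 489 0 1 2 3
REST 44.89 248.42 1801 489 0 1 2
"""

for line in response.splitlines():
    rec = parse_detail_line(line)
    print(rec["site"], rec["gps_week"], len(rec["days"]), "days")
```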

A number of HEADER lines must precede any INFO_TYPE requests. The HEADER lines appear as follows:

.GSADC_REQUEST
.HUB_ID <machine assigned request label>
.NAME <user or organization requesting information>
.INST <name of institution>
.MAIL <return mail address>
.EMAIL <return email address>
.PHONE <phone #>
.FAX <fax #>
.LABEL <user-assigned identifier label for this request>
.END

The line labels are relatively self-explanatory. .GSADC_REQUEST is the required first line of the message and identifies the email as a GSADC request. The .HUB_ID is assigned to the request by the receiving data center and tracks the time of the request and the responding entity. It is never modified once set. The HUB_ID will look like:

.HUB_ID UNAVCO_DMG;Nov_11,01:00:05
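
A requester's software might assemble the email body along the following lines. This is a sketch under stated assumptions: the function name and all contact details are invented for illustration, .HUB_ID is omitted because the receiving data center assigns it, and placing the INFO_TYPE lines after .END simply follows the statement that the HEADER lines precede them.

```python
def build_request(name, inst, mail, email, phone, fax, label, request_lines):
    """Assemble a GSADC request email body: HEADER lines, then INFO_TYPE lines."""
    header = [
        ".GSADC_REQUEST",
        f".NAME {name}",
        f".INST {inst}",
        f".MAIL {mail}",
        f".EMAIL {email}",
        f".PHONE {phone}",
        f".FAX {fax}",
        f".LABEL {label}",
        ".END",
    ]
    return "\n".join(header + list(request_lines)) + "\n"


# All contact details below are placeholders, not real addresses.
body = build_request(
    name="A. Researcher",
    inst="Example University",
    mail="1 Example Street, Exampleville",
    email="researcher@example.edu",
    phone="555-0100",
    fax="555-0101",
    label="pol2_inventory",
    request_lines=[".INV CDDIS IGS IGS.POL2 * * *"],
)
print(body)
```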

SUMMARY & ACTIONS:

This is a strawman, to be more fully detailed at, and after, the meeting. Additional handouts will be provided. We should be able to state a specific goal, like:
GSADC participants will meet the following criteria:

  1. Provide information to be listed in identical file copies kept accessible at each GSADC's FTP area, and which will list their FTP access information, including staff contact information and the internet address to reach the data-holdings computer. Information about the type of data access and delivery provided - FTP, NETDC style of query, LDM or other methods will be provided in these files.
  2. Create the two files - the panoramic and detailed listings of data holdings - and update them on a daily basis. The files will be in identical formats at each GSADC.
  3. The panoramic and detailed-listings files will be accessible in a common FTP area, of the form: <GSADC_internet_address_name>:~ftp/pub/GSADC/gsadc.sum and
    <GSADC_internet_address_name>:~ftp/pub/GSADC/data_lists/GPS_WEEK/gsadc_ftype.cmp
    where GPS_WEEK is the numerical gpsweek, ftype is the data type: rinex (rxobs, rxnav, rxmet...), erp, smp, apc, etc. (Erp is Earth Rotation Parameters, apc is "antenna phase center"...)
  4. Each participant will provide one or both of two methods of data delivery:
    (A) a definition or a specific path to locate data files based on their names as above, i.e. by GPS week and site name. For example, if the comprehensive file list from UNAVCO identifies an FTP transport method and has this line:

    POL2 IGS 12348M001 42.6798 74.6943 1725 881 0 1 2 3
    then the directory path to the file will be accessible through anonymous FTP like this:
    archive.unavco.org:~ftp/pub/GSADC/data/rxobs/881/pol20881.96O

    (B) data access via the NETDC-style mechanism as described previously.
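
With a common layout agreed, the FTP locations in items 3 and 4(A) can be generated rather than listed. The sketch below builds only the listing-file path and the data directory; the function names are hypothetical, the GPS week is written without zero padding as in the example (an assumption), and the data file name itself is left out because the draft does not spell out its naming rule.

```python
SUMMARY_PATH = "~ftp/pub/GSADC/gsadc.sum"  # panoramic listing, per item 3


def listing_path(ftype: str, gps_week: int) -> str:
    """Path of the detailed listing for one data type and GPS week."""
    return f"~ftp/pub/GSADC/data_lists/{gps_week}/gsadc_{ftype}.cmp"


def data_directory(ftype: str, gps_week: int) -> str:
    """Directory holding data files of one type for one GPS week."""
    return f"~ftp/pub/GSADC/data/{ftype}/{gps_week}/"


host = "archive.unavco.org"
print(f"{host}:{listing_path('rxobs', 881)}")
print(f"{host}:{data_directory('rxobs', 881)}")
```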



Last modified Wednesday, 16-Nov-2005 21:21:00 MST

 
