Data - GSAC Structure and Data Exchange Formats
Version 1.1 Draft proposal 1 August 2002

1. Introduction

There are many GPS data archiving centers in the U.S. and worldwide. Each typically has an interest in at most a few particular types of data, for instance, data from a regional network, data from the world-wide IGS network, or data which were collected under sponsorship of a particular agency. These archives have operated in an independent manner to this point, requiring data users to be fairly sophisticated in order to find information of interest. From the user's point of view, it would be much easier if they could get any piece of data simply by contacting any one of these archive centers, instead of contacting each one separately. For such a system to work, the data archive centers would need to know what data are contained not only in their own holdings but also in the holdings of all other archives; this will require a large degree of coordination which does not currently exist. This document proposes a data-holdings exchange mechanism which can create this "seamless" data environment; we call the resulting multi-archive system the GPS Seamless Archive Centers (GSAC).

Participation in GSAC denotes a willingness on the part of scientists and surveyors who collect GPS data to organize their holdings according to standards agreed to by the GSAC Working Group, and to provide knowledge of these holdings to the international community. Although participants are encouraged to use GSAC as an efficient means of distributing data, this is not a requirement: information "about" the data can be displayed on GSAC while the data themselves are still under distribution restrictions that might be imposed by governments, sponsoring agencies, or the collectors themselves.

2. GSAC Hierarchy

The GSAC are structured in three levels to reflect existing community functions:

Data Provider: An individual or agency that provides information to a Wholesaler (defined below) to be archived and published to the GSAC.
There must be only a single pathway into the GSAC for each piece of information so that there is a well-defined original copy. To accomplish this the Data Provider must make each piece of information available to one Data Wholesaler. Data are defined as raw and RINEX GPS observations, network solutions, orbit information, site information, and any other information useful to the analysis of GPS observations. Currently allowed data types are defined in Table 3. The data must be distributable in electronic form (though some meta data may be scanned images of paper records), preferably via the Internet.

Data Wholesaler: The Data Wholesalers are the warehousers of information and operate a data archive the contents of which they agree to make available to users of the GSAC as well as other GSAC participants (e.g., other Wholesalers and Retailers). Publishing information to the GSAC occurs when a Data Wholesaler creates a Data Holding Record (defined in Section 3) describing a data file held in the Wholesaler's archive. For each piece of information known to the GSAC, there is one Data Wholesaler who is responsible for the original copy of that information, though other Wholesalers may keep and provide duplicate copies, as described in Section 3.

Each Data Wholesaler must provide a unique identifier (the unique_info_id field of the DHR), which is an integer number, for each piece of information that it publishes to the GSAC. Each Data Wholesaler must also provide a unique name (the unique_site_id field of the DHR) for all of its published site-specific information. Wholesalers may assign unique names in any way they wish (e.g. DOMES numbers or an internal scheme keyed to geographic area as is done for the Southern California Earthquake Center). Hence, the unique_site_id for a given site will in general not be the same for all Wholesalers.

A Data Wholesaler performs at least the following three functions:
A fully participating Wholesaler also:
Data Retailer: The Data Retailers are the point of entry to the GSAC for the user community. Data Retailers do not archive data though a single institution may act as both a Retailer and a Wholesaler. A Data Retailer performs at least the following three functions:
3. GSAC Meta-Data Exchange

Meta-data describing a Data Wholesaler's holdings are essential for identifying and exchanging information between Wholesalers and Retailers, and for local querying of data that reside at remote Data Wholesalers' archives. There must be a few well-defined, computer-parsable files that can be exchanged over the Internet for this purpose. The Data Holdings Record, Data Holdings Files, and Monument Catalog accomplish this.

Data Holdings Record (DHR): Describes a piece of information published to the GSAC by a Data Wholesaler. There are two types of DHRs, one describing a file of time-dependent data (RINEX, SINEX, etc.), the other describing a monument in the Wholesaler's catalog of sites. For each site described in a Wholesaler's data DHRs, there must be a corresponding monument DHR. The DHRs are collected by Data Retailers so that they can accurately inform users what data are available from the entire GSAC. DHRs describe the data of interest to the user, but are not the actual data that would be sent to the user. Each DHR is a single line of ASCII text comprising fields of information separated by a semi-colon delimiter and terminated by a newline. The exact fields of the two types of DHR are defined in Tables 1 and 2 at the end of this document.

Data Holdings File (DHF): A DHF is an ASCII flat-file containing multiple DHRs describing information published by a single Data Wholesaler. A "Full DHF" is one which contains information for all of the published DHRs available from a given Data Wholesaler, where the data pertain to a given year and day; that is, there are many full DHFs, each of which contains information about data for a single date. An "Incremental DHF" is one which only contains information on new, updated, or deleted DHRs of a given Data Wholesaler for a given year and day; that is, an incremental DHF records the changes to a Data Wholesaler's archive on a given date.
This is analogous to the difference between a "full data backup" and an "incremental data backup."

Filename used: wholesaler_name.yyyy.ddd.[full, inc].dhf

Monument Catalog (MC): The MC is an ASCII flat-file containing multiple DHRs describing GPS monuments, including their locations, for which information exists from a single Data Wholesaler. A "Full MC" is one which contains information for all of the GPS monuments available from a given Data Wholesaler. An "Incremental MC" is one which only contains information on new, updated, or deleted GPS monuments available from a given Data Wholesaler. This is analogous to the difference between a "full data backup" and an "incremental data backup." Publication of this file by a Data Wholesaler is the minimum level of participation allowed within the GSAC. Note that this file only tells of the existence of data from particular monuments and the location of these monuments, but gives few details about the data or monuments. In the future the GSAC may provide additional information (e.g. site descriptions, photos, 'get-to' instructions) via new data types, not yet defined.

Filename used: wholesaler_name.yyyy.ddd.[full, inc].mc

3.1 DHF and MC Format
A DHF/MC is an ASCII flat-file consisting of a series of DHRs concatenated together. The format of each DHR is a single line of text with fields of information separated by semi-colons (;) and terminated by a newline character. In cases where a semi-colon is needed for the information in a field it will be escaped with a back-slash (\;). If a back-slash is needed it must be escaped by itself (\\). Some fields of a DHR are allowed to contain multiple entries (see tables below); these multiple entries are separated by commas (,). In cases where a comma is needed for the information in a field it will be escaped by a back-slash (\,). The maximum number of characters allowed on a line in a DHF/MC is 2048 including the trailing newline (this is the POSIX standard for text files). It may happen that a DHR needs more than 2048 characters, for instance when multiple entries are used for a field that allows them. In these cases the DHR will be split into multiple lines, each of which must contain 2048 or fewer characters; the line before the split must contain 2048 characters with the remaining characters on the line-continuation. These split-lines will include a dollar sign ($) as the last printable character (just before the newline: character position 2047) before the line is split and another dollar sign as the first character of the line-continuation. Thus, any line in a DHF that has a dollar sign as the first character is a continuation from the previous line and any line that has a dollar sign as the last printable character is continued on the next line. In cases where a dollar sign is needed for the information in a field it will be escaped by a back-slash (\$). All fields in a DHR must exist, but some fields are allowed to be Null (empty). Thus, when a certain field is Null the DHR will contain a pair of semi-colons with no characters between (;;). 
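The escaping and line-continuation rules above can be sketched in Python as follows. This is a minimal illustration, not a reference parser: the function names are hypothetical, lines whose first character is # are skipped as non-data lines, and the sketch assumes that a writer never splits a record in the middle of an escape pair (the text does not say what happens in that case).

```python
def _ends_with_unescaped_dollar(line):
    """True if the last character is a '$' that is not escaped by a back-slash."""
    if not line.endswith("$"):
        return False
    body = line[:-1]
    backslashes = len(body) - len(body.rstrip("\\"))
    return backslashes % 2 == 0


def join_continuations(lines):
    """Merge physical lines that were split with the '$' continuation marker."""
    records, pending = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if pending is not None:
            if not line.startswith("$"):
                raise ValueError("expected a continuation line starting with '$'")
            line = pending + line[1:]
            pending = None
        if _ends_with_unescaped_dollar(line):
            pending = line[:-1]  # drop the marker and wait for the next line
        else:
            records.append(line)
    if pending is not None:
        raise ValueError("file ended in the middle of a continued record")
    return records


def split_unescaped(text, delimiter):
    """Split on a delimiter, leaving back-slash escape pairs untouched."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair intact
            i += 2
        elif ch == delimiter:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(text):
    """Replace each back-slash escape pair with the character it protects."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_dhr(record):
    """Return one list of entries per field; a Null field becomes an empty list."""
    fields = []
    for raw_field in split_unescaped(record, ";"):
        if raw_field == "":
            fields.append([])
        else:
            fields.append([unescape(e) for e in split_unescaped(raw_field, ",")])
    return fields


def parse_dhf(lines):
    """Parse the data lines of a DHF/MC, skipping lines that start with '#'."""
    data_lines = [ln for ln in lines if not ln.startswith("#")]
    return [parse_dhr(rec) for rec in join_continuations(data_lines) if rec]
```

Splitting is done in two passes (fields first, then entries) with unescaping deferred to the end, so that an escaped comma inside a field is not mistaken for an entry separator.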
The top three lines of every DHF/MC contain header information detailing the name of the Wholesaler who created it, the format version of the DHRs, and the individual fields of information included in each DHR. These lines are distinguished by a # character as the first character of the line. Headers for a DHF are made up of three lines of information in the following order:
There are thus a total of five ASCII characters that have a special meaning in a DHF/MC:
3.2 Use of DHFs and MCs in the GSAC System

Incremental DHF/MC Area. Each Data Wholesaler will make its Incremental DHFs and MCs available via ftp to all other GSAC participants. These incremental files are organized into sub-directories named by the UTC year/day on which the files were created. This sub-directory structure may exist anywhere that a Data Wholesaler prefers as long as it is "mapped" (for instance, through use of the Unix "link" concept) to the standard directory name ~ftp/pub/GSAC/inc.

The directories in the incremental area keep track of the work being done on a daily basis by the Wholesaler. They can thus be thought of as incremental information that was published on a given day. A DHR for information with a start_time (see tables below) on 1998:001, for example, will be kept in the incremental DHF for that year and day, called "wholesaler_name.1998.001.inc.dhf." Once a Retailer becomes synchronized with a Wholesaler it need only access the incremental DHFs and MC in order to stay synchronized. On a daily basis, after the UTC day boundary, each Retailer will be responsible for collecting the incremental DHFs and MC from each Wholesaler and incorporating them into the Retailer's Relational Data Base Management System (RDBMS).

For example, assume the current day is 1998:320, and examine the ~ftp/pub/GSAC/inc/1998/320 directory at SOPAC. A hypothetical listing of files there might be:

As the above listing notes, several incremental DHFs and an incremental MC exist in this directory. The DHFs are an incremental update of DHRs that were published by the SOPAC wholesale archive on day 1998:320. The incremental DHF for day 1998:317 includes DHRs for information published "today" (1998:320) but where the information pertains to (that is, has a start time) three days ago (1998:317) and similarly for the DHFs for days 1998:318 and 1998:319.
The MC would contain any DHRs for new or updated monument information that SOPAC wants to make known to the other GSAC participants "today."

In each of the incremental directories the Wholesaler will also keep a file containing the filenames and times of modification of the DHFs and MC in that directory. This additional file is named, in the above example, wholesaler_name.1998.320.inc.list. The format of this "listing" file is:

Where the time_stamps must be in UTC and in the ISO standard format (see tables below for details). So in the above example, the file sopac.1998.320.inc.list would look like the following.

If a Data Retailer wants to get an up-to-the-minute listing of data holdings on a Wholesaler's archive it can get this listing-file first to see if anything in that directory has changed in the course of the UTC day; if the filesize and modification time of the DHF and MC are unchanged from an earlier check by the Retailer, there is no reason to copy the DHF or MC to the Data Retailer's system. This minimizes the amount of information exchange needed to check the current state of each Data Wholesale archive.

At the end of the UTC day the Data Wholesaler will stop adding any DHRs to the DHFs and MC in that day's incremental directory and begin placing new DHRs into the DHFs and MC in the "next" day's incremental directory. That is, once the UTC day boundary is passed the Data Wholesaler will not modify anything in that day's incremental directory, but will move on to the next day's directory. Since each Wholesaler's computers will keep their own time, and often the clocks in computers are wrong by many minutes or more, each Wholesaler will think that UTC midnight occurs at a different time. Retailers will need to take this into account when probing each Wholesaler's computer system.
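A Retailer's daily check against a Wholesaler's incremental area might be sketched as below. This is an illustration under stated assumptions: the listing file is taken to be already parsed into a dictionary mapping each filename to a (file size, modification time) pair, the function names are hypothetical, and the 30-minute allowance for clock error is an arbitrary example value, not something the GSAC specifies.

```python
from datetime import datetime, timedelta, timezone

INC_ROOT = "~ftp/pub/GSAC/inc"  # standard incremental directory name


def incremental_dir(day):
    """Directory for a UTC day, for example ~ftp/pub/GSAC/inc/1998/320."""
    return f"{INC_ROOT}/{day.year:04d}/{day.timetuple().tm_yday:03d}"


def changed_files(previous, current):
    """Names in `current` that are new or whose (size, mtime) pair differs.

    Both arguments map filename -> (size, mtime); this layout is an
    assumption made for the sketch, not the listing-file format itself.
    """
    return sorted(
        name for name, stamp in current.items() if previous.get(name) != stamp
    )


def last_closed_day(now_utc, skew_margin=timedelta(minutes=30)):
    """Most recent UTC day that can be treated as finished.

    The margin (hypothetical value) delays the cut-over so that a
    Wholesaler whose clock runs slow is not read before it has stopped
    writing to that day's directory.
    """
    return (now_utc - skew_margin).date() - timedelta(days=1)
```

For instance, a Retailer running shortly after UTC midnight would still treat the previous day's directory as open until the margin has elapsed, and would fetch only the files reported by changed_files.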
The sub-directories in the incremental storage area will not be saved past a certain date; on a daily basis, as a new sub-directory is created the oldest one will be deleted. This will provide a ring-buffer of incremental holdings information that will allow the Data Retailers to collect the DHFs and MCs late if necessary. This ring-buffer will be 30 days in length. It is, of course, the Data Retailer's responsibility to keep track of which incremental DHFs and MCs it has collected in order to stay in synchronization with each Data Wholesaler. It is important for the Retailers to stay in synchronization with the Wholesalers so that their RDBMS is an accurate representation of the holdings of the GSAC. If for some reason synchronization is lost the Retailer will have to start over from the Wholesaler's full DHFs and MC records (a "full restore").

Data Wholesalers dealing primarily with daily-download continuous GPS data from Data Providers would almost always have a set of incremental directories that contain Incremental DHFs and MCs which hold DHRs pertaining to information from one day to a few days in the past if the Provider's data downloads take place after the UTC day boundary. If the Provider downloads data during the UTC day, for instance hourly, the Wholesaler's incremental directories might also contain Incremental DHFs for the present day.

Data Wholesalers dealing primarily with survey mode GPS data from Data Providers would almost always have a set of incremental directories that contain Incremental DHFs and MCs which hold DHRs pertaining to information from many days in the past (perhaps data that were collected many years in the past) as they work through their backlog of data files supplied by the Data Providers.

Full DHF/MC Area. Each Data Wholesaler will also keep a copy of its full DHFs and full MC available through ftp to all GSAC participants.
These files can be stored in whatever directory the Data Wholesaler prefers as long as that directory is "mapped" to the standard directory name ~ftp/pub/GSAC/full. There is one full MC which contains information on all monuments in a Wholesaler's archive, while there are many full DHFs split by year and day. The DHFs are split to keep filesizes to a manageable level, and also to make it easier for a Wholesaler to keep them up to date. As with the incremental DHFs, a DHR for information with a start_time (see tables below) on 1998:001, for example, will be kept in the full DHF for that year and day, called "wholesaler_name.1998.001.full.dhf." These files serve as a permanent record for each Wholesaler of what data they have published from their archive and should be kept up to date at all times; that is, at any given time they should be a complete representation of all information published by the Wholesaler. The full DHFs and MC are not a description at a single (frozen) point in time; they are always kept up to date. These files are made available to other GSAC participants so that they may re-synchronize themselves with the Wholesaler if necessary. Under normal circumstances this should not be required. Using the same example as above, examine the ~ftp/pub/GSAC/full directory at SOPAC. A hypothetical listing of files there might be: Since there is only one full MC there is no yyyy.ddd added to its name. The "listing" file is the same format as above, with the filename and modification time (in UTC and ISO format) of each file listed one per line. In summary, there are two file storage areas at a Data Wholesaler's archive where DHFs and MCs are kept:
When to Generate a New DHR

A Data Wholesaler will generate a new DHR for a piece of information when it first publishes the information to the GSAC. At that time, the Wholesaler must provide, in field 0 of the DHR, a unique integer number that will be used to track this published information within the GSAC (see DHR specification above). This unique number is only unique to an individual Wholesaler, not to the entire GSAC. The combination of the Wholesaler's name and this unique number does provide a unique identification within the entire GSAC.

How to Update Previously Published Information

If a change occurs to a piece of previously published information which causes any of the fields of the previously generated DHR (whether part of a DHF or MC) to change, then the Wholesaler must publish an updated DHR which reflects the correct values for all DHR fields. That is, the Wholesaler generates a new DHR with all mandatory fields filled with the newly correct values. Naturally, if the Wholesaler is updating information in a DHF, they must use the same unique_info_id value in field 0 of the DHR which was supplied when the information was first published. Similarly, if the Wholesaler is updating information in an MC, they must use the same unique_site_id value in field 0 of the DHR which was supplied when the information was first published.

How to Delete Previously Published Information

A Wholesaler might need to delete published information if, for instance, it is found to have been wrong in some way that is not repairable; if the information could be repaired then the Wholesaler would repair it and then publish an update as discussed in the previous paragraph. When a Wholesaler wishes to completely remove a previously published piece of information from the GSAC, the Wholesaler will generate a DHR that has values in only three fields: field 0, the unique integer identifier; field 1, the Wholesaler's name; field 6, the date and time at which this DHR was written.
All other fields must be Null (empty). When the new DHR is published to the appropriate incremental table, the original DHR is simultaneously dropped from the full table. Once a unique identifier number has been removed from the GSAC it should not be re-used by the Wholesaler; it simply disappears from the system.

Providing Distributed Data Backup

Some Data Wholesalers may choose to copy published information from other Wholesalers and make these backup copies available to the GSAC. This is desirable on several accounts. First, it provides a distributed backup mechanism at geographically separate locations for the data holdings of GSAC participants. Second, it provides multiple access to the duplicated files for users; if the original Wholesaler's archive is off-line for any reason, then the information will still be available to users through the backup Wholesaler. Also, if one wholesale archive is closer to a user then the data can be transferred from that archive instead of from a different Wholesaler farther away; this speeds delivery and allows the user to avoid Internet bottlenecks.

When a Data Wholesaler mirrors information available to the GSAC through another Wholesaler, it must, naturally, publish its own DHR describing the copy and how to access it so that all Data Retailers will know about the copy. The Wholesaler must also notify the other GSAC participants that this is a backup and not an original copy (so that the backup is not backed-up yet again by some other Wholesaler; a possible infinite loop problem). The Wholesaler does this in two ways. First, by keeping the name of the original Wholesaler in field 1 of the DHR. The mismatch between this field and both the Wholesaler name on the first header line of the DHF and the filename of the DHF (which contains the publishing Wholesaler's name as a part) signifies that the DHR pertains to a backup.
Second, the backup Wholesaler creates its own unique number to refer to the published information in field 0, and also places the original unique number from the original Wholesaler after its own in the field, separated by a comma (as for all multiple-entry fields). This establishes a one-to-one correspondence between the original published information and the backup published information. Without this correspondence other GSAC participants would have no way of knowing exactly which piece of information is being mirrored.

In a DHR pointing to a mirrored file, field entries describing the data (start_time and provider) will be the same as the DHR for the original Wholesaler, whereas the entries describing the physical file (dhr_create_time, info_url, file_size, file_create_time, file_checksum, file_grouping, and file_compression) will be that of the local (i.e., mirroring) Wholesaler. The unique_site_id must also point to the Monument Catalog of the local Wholesaler in order for the Wholesaler to check that the monument exists.

DHF flat-file storage requirements

For each piece of information published by a Data Wholesaler there will be a DHR stored in a flat-file on their system. An estimate of the number of bytes needed for one DHR is about 300 including the semi-colon delimiters between field values. Thus, for the entire SCEC survey mode archive, which contains about 10,000 raw and 10,000 RINEX observation files, the total storage needed to keep these DHFs on-line is about 6 MB. For an archive which deals with 250 continuous sites every day (both raw and RINEX) this would require about 150 KB per day or 55 MB per year. In either case this is small compared to the storage required for the actual data files themselves.

4. User Interface and Requesting Data

The main purpose of the GSAC is to make data requests simple, and to allow users to access multiple archives from one web site.
The following describes one model for how to accomplish this, but is not the only possible approach. The GSAC does not place any restrictions on how the user interface will work as long as it provides the basic functionality of allowing users to locate information of interest to them and to retrieve that information in a reasonably simple manner. Beyond this it is up to the individual Retailer; that is, we encourage individual creativity on the part of the Retailers. Deciphering the request: The user contacts a Data Retailer using a web interface and determines what type(s) of data the user wants and for which sites (if it is site-specific data). For site-specific data the user could do this by supplying the specific names of desired sites or by selecting all sites within some geographic region. In the first case the Retailer will have to resolve the problem of nonunique site names; if the user requests data for a site called PEAK the Retailer could reasonably ask, "which PEAK do you mean out of the following possible sites?" and then present the user with the coordinates of several locations with this name. Since each MC includes site coordinates it should be possible for the Retailer to sort this out, but determining exactly what the user wants from the Retailer will be a difficult question to answer and it is up to the imagination of the Retailer to figure out how they wish to deal with it. Note that this confusion about naming is going to be a problem even though the GSAC participants have renamed sites (internally to the GSAC) to get uniqueness within each Wholesaler's holdings and have a naming cross-reference (because all MCs are available to each Retailer), because each user will likely have his own favorite name for a site and will be unaware of what the GSAC participants have done with respect to unique site-names. 
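One way a Retailer might present the "which PEAK do you mean?" choice is sketched below. It is only an illustration: apart from unique_site_id, the record layout and field names are hypothetical (the Monument Catalog fields are defined in Table 1, not here), and the region filter assumes a simple latitude/longitude box that does not cross the 180-degree meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monument:
    """Hypothetical, simplified view of one Monument Catalog entry."""
    wholesaler: str
    unique_site_id: str
    common_name: str
    latitude: float   # degrees north
    longitude: float  # degrees east


def candidates_by_name(monuments, name):
    """All monuments, from any Wholesaler, whose common name matches."""
    wanted = name.strip().upper()
    return [m for m in monuments if m.common_name.upper() == wanted]


def candidates_in_region(monuments, south, north, west, east):
    """All monuments inside a latitude/longitude box (no meridian wrap)."""
    return [
        m for m in monuments
        if south <= m.latitude <= north and west <= m.longitude <= east
    ]


def describe_choices(candidates):
    """One line per candidate, suitable for asking the user to pick."""
    return [
        f"{m.common_name} ({m.wholesaler}:{m.unique_site_id}) "
        f"at {m.latitude:.3f}, {m.longitude:.3f}"
        for m in candidates
    ]
```

If candidates_by_name returns more than one entry, the Retailer can show the lines from describe_choices and let the user choose by location; a single match needs no prompt.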
Notify User of Data Availability: Once the request is understood, the Retailer notifies the user of the volume of data involved and whether the entire request can be serviced at this time (i.e., some data may be off-line). The user is presented with (at least) two choices:
Collecting Data: If the user chooses for the Data Retailer to bundle the entire request (option 2 above), then the Retailer could either assemble the requested data files from the various sources onto the Retailer's computers, or they could supply the user with, for instance, an ftp script to be executed by the user from the user's computer. If the requested information is accumulated onto the Retailer's computers then once it is all available the Retailer would notify the user and the information could be transmitted by ftp, or perhaps by some non-electronic means like mailing data tapes (for instance, for very large requests). If the user requested any data that are not on-line, the Data Retailer will note which files are off-line and supply the user with contact information for the Wholesaler in possession of each of these files.

NOTE: The Retailer should place a limit on the amount of information it will transmit at one time (e.g., to deal with confused or "prank" requests).

Table 1. Monument Catalog DHR (Version 1.1, unchanged from Version 1.0)
Table 2. Data Holdings File DHR (Version 1.1, modified from Version 1.0)
Table 3. Field Specifications for Version 1.1 Data Types
Comments or questions about this page? Send e-mail to Lou Estey.
Last modified Wednesday, 16-Nov-2005 21:21:00 MST