Museums and their collections maintain customized databases in order to gather and record their holdings and the metadata associated with their specimens. To share, exchange, and publish data, an appropriate data standard is essential. ABCD (Access to Biological Collection Data) is a standard for biological collection units, including living and preserved specimens, together with field observation data. Its extension, EFG (Extension for Geoscience), enables sharing and publishing data related to paleontological, mineralogical, and petrological objects. The standard is very granular and allows detailed descriptions, including information about the collection event itself, the holding institution, stratigraphy, chemical analysis, and host rock. The extension was developed in 2006 and has since been used by different initiatives and applied to the publication of collection-related data in domain-specific and interdisciplinary portals.
Natural history museums and universities harbor millions of collection items gathered on expeditions over hundreds of years. Most collections are highly diverse and comprise voucher specimens from different scientific disciplines such as botany, zoology, paleontology, geology, and anthropology. Regardless of discipline and type of preservation, each collection item is accompanied by various metadata regarding not only its collecting event (time, region, and collector) but also research data aggregated over time (laboratory analyses, relations to other collection items, etc.). In addition, images or other multimedia files of the collection item or the gathering locality are linked to the stored objects. Furthermore, born-digital collections, e.g., animal sound archives, are likewise commonly associated with natural history museums. This variety of information makes each collection unique and leads to specific customizations in the databases of each museum or even of a single collection.
However, once mobilized, the data should not only reside in institutional databases but also be shared with the scientific community and the general public via the internet. Leading scientific organizations have signed agreements to facilitate the exchange of scientific results in the sense of open access (e.g., Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, 2003; Bouchout Declaration, 2014). Furthermore, large funding bodies have implemented principles for the sharing and dissemination of research data (e.g., Alliance of German Science Organizations, 2010; National Science Foundation, 2014), and various journals require data to be accessible after publication.
In order to effectively share data relating to collection units from structurally diverse databases, the data need to be mapped to a universally intelligible data schema. Data standards are designed to facilitate the exchange and publication of information via a range of different approaches and patterns. Their use provides semantics and structure, which helps to avoid errors in interpreting the data and supports automated processing, e.g., in web portals and services.
ABCD (
Here we review the history and usage of ABCDEFG and highlight current developments as well as its uniqueness and importance as a data standard in the earth sciences.
The development of ABCD started in 2000. Five years later, ABCD version 2.06
was ratified as a standard by the non-profit scientific and educational
association
The ABCDEFG schema is very granular and allows a detailed description of data gathered in geoscience (see the technical schema documentation). It covers geological and geomorphological observations; geological specimens and their preparation; paleontological, mineral, rock, and sediment specimens; anomalous items (e.g., glacial erratics, transported assemblages); stratigraphic and absolute dates, measured stratigraphic sections, and borehole logs; identifications (extended to cover rock and mineral classifications, varietal names, etc.) and analyses (techniques and results, e.g., chemical composition of minerals, petrological analysis of rock); and an integrated description of the host rock as part of a unit record (e.g., mineralization and lithological context of a fossil).
As fossil objects share a considerable number of properties with both biological and geological objects, it is a great advantage to establish a geoscience schema as an extension of a biological data standard rather than creating a completely separate standard for the geosciences. Another area of overlap is the set of data elements that refer to the collection or the collection event itself. The EFG extension is not applicable without its backbone, the ABCD schema (Fig. 1). Elements describing the collection, the holding institution, contact persons, and any additional information on the dataset are placed within ABCD. Furthermore, details of the gathering event (location, date, collector, etc.) and links to object-related multimedia material, including their respective licensing, or to any related publication can be specified in the ABCD standard.
Figure 1. Simplified overview of the schema extension EFG. Shown are the relationships among EFG components and between EFG components and the ABCD schema. EFG elements are shown in yellow and elements originating from ABCD in green.
ABCDEFG is an XML-based schema and thus can be easily processed by
software and is readable by humans. The schema is hierarchically structured
and the descriptive elements are aggregated in thematic concepts. All
elements are arranged according to their contextual relation to other
elements and their cardinality (Holetschek et al., 2012). The full potential
of ABCDEFG is unlocked in combination with the free and open-source
BioCASe Provider Software (BPS;
The BPS XML wrapper is able to connect to a wide array of database management systems (e.g., MySQL, PostgreSQL, Microsoft SQL Server, Oracle) as well as to Excel spreadsheets, and provides a user interface for mapping the database fields to the elements of the data standard. This generic approach of mapping individual data models to the standard schema is the essential functionality for standardized data mobilization. Once an institution has set up the mapping for its database, the released content (including revisions, updates, and new records) is exposed in the same manner through the BPS. Thus, only changes in the data model itself necessitate modifications to the mapping.
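The mapping idea described above can be illustrated with a minimal sketch. It is not the BPS implementation: the column names, the simplified element paths in `FIELD_MAPPING`, and the helper `record_to_xml` are hypothetical and chosen only to show how a flat database row might be projected onto a hierarchical, ABCD-style XML structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from local database columns to simplified,
# ABCD-style element paths. The paths are illustrative assumptions and
# are not copied from the ABCD or EFG schema definition.
FIELD_MAPPING = {
    "inventory_no": "Unit/UnitID",
    "taxon": "Unit/Identifications/Identification/FullName",
    "country": "Unit/Gathering/Country",
    "period": "Unit/Stratigraphy/Period",
}


def record_to_xml(record, mapping):
    """Build one nested XML element from a flat database row."""
    root = ET.Element("DataSet")
    for column, path in mapping.items():
        value = record.get(column)
        if value is None:
            continue  # columns without a value produce no element
        node = root
        for tag in path.split("/"):
            child = node.find(tag)  # reuse an existing branch if present
            if child is None:
                child = ET.SubElement(node, tag)
            node = child
        node.text = str(value)
    return root


# Example row with made-up values; "period" is empty on purpose.
row = {
    "inventory_no": "MB.X.0001",
    "taxon": "Example taxon",
    "country": "Germany",
    "period": None,
}
unit_xml = record_to_xml(row, FIELD_MAPPING)
print(ET.tostring(unit_xml, encoding="unicode"))
```

Because the mapping is kept separate from the data, new or updated rows pass through the same function unchanged; only a change in the column layout would require editing `FIELD_MAPPING`, which mirrors the point made above about the data model.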
Although mapping of data is straightforward, the (potentially heterogeneous)
content needs to be checked on a regular basis to verify that the results
are appropriate within the data standard's context. This practice is
supported by the BioCASe technology, as different datasets and sources
provided in ABCD and ABCDEFG can easily be monitored using the
BioCASe Monitor Service (Glöckler et al., 2013). By using the BioCASe
data interface and protocol in the backend and a comprehensible user
interface as a frontend, the monitoring tool facilitates checks of the
structure, plausibility, and completeness of datasets. Furthermore, it
verifies compliance of provided data for transformation into other target
schemas and can be used for summaries and simple descriptive statistics
across different datasets. Thus, data aggregators like the Geosciences
Collection Access Service (GeoCASe) can make use of the tool in the
supervision of the progress in data provision. A full implementation of the
BioCASe Monitor Service for summarizing the data provision and checking the
minimum requirements of data mapped to ABCDEFG can be found on the
GeoCASe website (
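The kind of completeness check described in this paragraph can be sketched as follows. This is a simplified illustration, not the BioCASe Monitor Service itself: the field names in `REQUIRED_FIELDS` and the functions `missing_fields` and `completeness_report` are assumptions made for the example and do not reproduce the actual minimum requirements for ABCDEFG.

```python
# Hypothetical minimum requirements; the element names are illustrative
# assumptions, not the mandatory concepts defined for ABCDEFG.
REQUIRED_FIELDS = ("UnitID", "SourceInstitutionID", "FullName")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one record."""
    return [name for name in required if not str(record.get(name) or "").strip()]


def completeness_report(records, required=REQUIRED_FIELDS):
    """Summarize how many records satisfy the minimum requirements."""
    incomplete = {}
    for index, record in enumerate(records):
        missing = missing_fields(record, required)
        if missing:
            incomplete[index] = missing
    total = len(records)
    return {
        "total": total,
        "complete": total - len(incomplete),
        "incomplete": incomplete,
    }


# Made-up records: the second has an empty value, the third lacks a field.
records = [
    {"UnitID": "1", "SourceInstitutionID": "XYZ", "FullName": "Quartz"},
    {"UnitID": "2", "SourceInstitutionID": "", "FullName": "Calcite"},
    {"UnitID": "3", "SourceInstitutionID": "XYZ"},
]
report = completeness_report(records)
print(report)
```

A report of this shape could feed simple descriptive statistics across datasets, for example the share of complete records per provider.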
For several geoscientific disciplines, recommendations and data structures
exist that provide guidance on what kind of data should be recorded in what
form during fieldwork and which research information should be made publicly
available. The Evolution of Terrestrial Ecosystems Program (ETE; Damuth,
1997) was designed in the late 1990s and provided the basis for later
developments of various data structures (Reed et al., 2015). At about the
same time the Paleobiology Database (PBDB;
All these data structures (ETE, PBDB, and FAUNMAP/MIOMAP) are well suited to
managing localities and faunal lists, but cannot be applied to specimen data
in a straightforward fashion. Gilbert and Carlson (2011) established a data
dictionary for paleoanthropology, including a specific table schema with
specimens as the central unit of organization and localities as clusters of
specimens. This structure facilitates data collection in the field, the
curation of objects in institutional collections, and the publication of
specimen-related information. Along with the International Geo Sample Number
(IGSN), used for registering samples with the System for Earth Sample
Registration (SESAR), two metadata sets were defined. These metadata schemas
comprise core metadata used for the registration itself (registration
metadata) as well as basic characteristics of physical samples and collections
(descriptive metadata); the latest version of both sets is available via
In biodiversity science and related disciplines, Darwin Core (DwC) and ABCD
are the best-known and most commonly used standards. In parallel with ABCD,
Darwin Core evolved over time and its development is mainly driven by the
community (Wieczorek et al., 2012). Although both schemas aim at
facilitating the mobilization of biodiversity data, they differ in their
purpose. ABCD is a highly structured schema and was intended to be
applicable to a great variety of data in various degrees of granularity,
including their relations and cardinality. It is almost exclusively tied to
the XML format. Darwin Core, in contrast, aims to provide a flexible
vocabulary for sharing information about biological diversity and can be
represented using various technologies including XML, RDF, or plain CSV
files. It is used to describe the recorded occurrence of a species in space
and time, and to link related information and evidence to this declaration.
The Darwin Core standard was ratified by TDWG in 2009 and has since been
the most common standard for publishing datasets in the Global Biodiversity
Information Facility (GBIF;
The usage of Darwin Core for geoscience is limited. Although specimens can
be classified as “FossilSpecimen” (
In summary, ABCDEFG is the most comprehensive data standard for geoscientific collection data. Although there have been different approaches, none of them provides a comparably detailed and atomized schema for both paleontological and geological data.
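The difference between a hierarchical schema and a flat vocabulary discussed in this section can be made concrete with a small sketch. The nested record and the function `flatten_unit` are hypothetical; the flat column names are chosen to resemble Darwin Core terms, and the reduction rule (keep only the preferred identification) is an assumption made for illustration, not a prescribed conversion.

```python
# A hierarchical record with a repeatable element (two identifications).
# All keys and values are illustrative assumptions.
unit = {
    "UnitID": "F-100",
    "Identifications": [
        {"FullName": "Taxon A", "Preferred": False},
        {"FullName": "Taxon B", "Preferred": True},
    ],
    "Stratigraphy": {"Period": "Jurassic"},
}


def flatten_unit(unit):
    """Reduce a nested unit to one flat row, keeping only one identification."""
    identifications = unit.get("Identifications", [])
    fallback = identifications[0] if identifications else {}
    # Prefer the identification flagged as preferred, else the first one.
    preferred = next((i for i in identifications if i.get("Preferred")), fallback)
    return {
        "occurrenceID": unit["UnitID"],
        "basisOfRecord": "FossilSpecimen",
        "scientificName": preferred.get("FullName", ""),
        "earliestPeriodOrLowestSystem": unit.get("Stratigraphy", {}).get("Period", ""),
    }


row = flatten_unit(unit)
print(row)
```

In this sketch the second identification is lost in the single flat row, whereas the nested form retains both together with their relation to the unit. This is one way to picture why cardinality and relations are easier to express in a structured schema.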
ABCDEFG is already used for data aggregation and for data provision to a
number of thematic data portals. The following examples illustrate different
data portals and aggregators showing collection-related information from the
same data source endpoints mapped to the ABCDEFG standard.
The Global Biodiversity Information Facility (GBIF,
Table 1. Use cases of ABCDEFG. Given are the different portals, the number of institutions using ABCDEFG, and the number of fossils, rocks, and minerals provided in ABCDEFG to different internet platforms. The same datasets might be published in several portals. Please note that most fossil data provided in ABCDEFG and published on GBIF do not originate directly from collection items but are extracted from the literature via the PBDB.
Table 1 summarizes the number of institutions using the ABCDEFG schema and the number of objects published in different portals.
ABCDEFG has become established in the national and international community, and its reach is constantly growing. Once the standard and the supporting software are implemented at an institution and its database systems are connected, many applications become possible. Apart from enabling highly distributed data exchange and publication, the standard also supports institutional processes, such as designing a collection database around ABCDEFG terms, and data management with documentation and archiving based on ABCDEFG structures and definitions.
The ABCD schema and its extension EFG are currently being revised. Within
the project “ABCD 3.0 – A community platform for the development and
documentation of the ABCD standard for natural history collections” (funded
by the Deutsche Forschungsgemeinschaft (DFG; German Research Foundation),
In a further step, the XML-based structure is being transformed into a more semantic form (Resource Description Framework, RDF), including terms from external ontologies. To facilitate the usage of ABCDEFG, specific application schemas will be developed. These schemas will include mandatory concepts (see Holetschek, 2016) as well as commonly used elements (Holetschek, 2015) and further relevant concepts identified by the community. The application schemas will thus represent all information that scientists and curators would like to share with colleagues and the broader public. An application schema for minerals, meteorites, and drill cores commonly found in geological collections will be provided.
The ABCDEFG schema has been submitted to TDWG for ratification;
the decision is pending. TDWG's Paleobiology Interest Group
(
ABCDEFG XML Schema Definition (XSD) is available at
The authors declare that they have no conflict of interest. The authors and the handling editor are affiliated with the same institution, but this did not influence the choice of journal. The journal was chosen in order to raise awareness of the data standard in the geoscientific and paleontological community.
The authors thank various project collaborators for their input during the
development, improvement, and usage of the ABCDEFG schema. The
following projects and initiatives, and the people involved therein, had a
lasting influence on the current status of the schema: GeoCASe (
Funding was provided via GeoCASe (European Commission, SYNTHESYS I, Activity D) for the initial development phase and for workshops, GBIF-D (Federal Ministry of Education and Research, Bundesministerium für Bildung und Forschung, BMBF) for provider support and maintenance, and ABCD 3.0 (DFG) for current developments and revisions. All this support is gratefully acknowledged.
Edited by: Johannes Müller
Reviewed by: Donald Hobern and Walter Berendsohn