Publication of Type 1 Data via BioCASe Data Pipelines at BGBM Data Center


The BGBM Data Center is one of the seven GFBio Collection Data Centers, which are core components of the GFBio Submission, Repository and Archiving Infrastructure. Data curation, archiving and publication at BGBM include management processes with the international herbarium data management system JACQ, the reBiND workflow (see details below; DFG-funded project 2011-2015) and the EDIT Platform for Cybertaxonomy. Management tools and archiving processes as conducted at the GFBio Data Center BGBM are described under Technical Documentations. In addition, BGBM provides general metadata storage and media management, a data quality service platform, and transformation and import services. DOI assignment for research data (via ZB MED) is provided.

Figure 1: The BGBM Workflow, BioCASe (Biological Collection Access Service) data pipelines for GFBio Type 1 Data.
ABCD - Access to Biological Collections Data schema (V2.06 within GFBio)
SIP - Submission Information Package
AIP - Archival Information Package
DIP - Dissemination Information Package
VAT - Visualizing and Analysing Tool

The workflow comprising these central components is illustrated in Figure 1; a detailed description can be found in the text below.


Data pipeline - Provision of (versioned) DIPs

Central components of the BGBM BioCASe data pipelines for the GFBio publication of Type 1 data are the international herbarium data management system JACQ (for Type 1a = Botanical Specimens) and the reBiND workflow (for Type 1b = Observations), which derives from the DFG-funded reBiND project (2011-2015).

  • JACQ workflow:
At BGBM, collection data (herbarium specimens) are managed within the collection data management system JACQ. To facilitate data imports into the JACQ system, the BGBM collection workflow team developed a collection data form. This Microsoft Excel based template needs to be filled in by scientists/data producers and is often used already during fieldwork; the template is openly available for GFBio users and other scientists (see BGBM collection data form (xls template)).
Prior to the import into JACQ, initial data quality tests and data harmonization regarding typos, geolocations etc. are executed (the kind of checks involved is sketched below). OpenRefine serves as one tool for collaborative data cleaning prior to the import. The main curatorial work (e.g. taxonomy) is processed within JACQ.
After mapping and transformation via the connected BioCASe Provider Software (BPS) installation, the ABCD XML archives are exported.
Since BPS version 3.7.1, released in October 2018, the Filtered Export feature allows exporting and downloading a subset of the records of a data source as an additional XML archive. Where appropriate, this feature is used to create the ABCD XML archive.
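The following minimal Python sketch illustrates the kind of pre-import quality checks referred to above. The column names and value ranges are assumptions made for this example and do not reflect the actual layout of the BGBM collection data form; tools such as OpenRefine handle the collaborative part of the cleaning.

    # Minimal sketch of pre-import quality checks on a filled-in collection
    # data form. The column names are hypothetical placeholders, not the
    # actual fields of the BGBM collection data form (xls template).
    import pandas as pd

    REQUIRED = ["collector", "collection_date", "latitude", "longitude"]

    def check_collection_form(path: str) -> list:
        """Return a list of human-readable problems found in the spreadsheet."""
        problems = []
        df = pd.read_excel(path)  # needs openpyxl for .xlsx files

        for col in REQUIRED:
            if col not in df.columns:
                problems.append(f"missing required column: {col}")

        if {"latitude", "longitude"} <= set(df.columns):
            lat = pd.to_numeric(df["latitude"], errors="coerce")
            lon = pd.to_numeric(df["longitude"], errors="coerce")
            for idx in df.index[(lat < -90) | (lat > 90) | (lon < -180) | (lon > 180)]:
                problems.append(f"row {idx + 2}: coordinates out of range")

        return problems

    for problem in check_collection_form("collection_data_form.xlsx"):
        print(problem)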


  • reBiND workflow:
The BGBM reBiND workflow can be used for generic data. It combines software tools for transforming data (e.g. OpenRefine, Altova MapForce) stored in database systems and/or submitted files (XLS spreadsheets) into well-documented, standardized, and commonly understood XML formats (preferably ABCD 2.06) with a system for storing (eXist-db), documenting, and publishing the information as a web service. The BioCASe protocol is implemented here, but it is not a separate BGBM BioCASe installation. The XML archives are provided within an inventory for harvesting/indexing. A strongly simplified example of the transformation step is sketched below.
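As an illustration of the transformation step, the following Python sketch maps one flat record (such as a spreadsheet row) to a strongly simplified ABCD 2.06 unit. The record fields and the small selection of ABCD elements are assumptions made for this example; the actual reBiND mappings (e.g. built with Altova MapForce) cover far more of the schema.

    # Strongly simplified sketch: map one flat record to an ABCD 2.06 Unit.
    # Only a handful of ABCD elements are shown for illustration.
    import xml.etree.ElementTree as ET

    ABCD = "http://www.tdwg.org/schemas/abcd/2.06"  # ABCD 2.06 namespace

    def record_to_abcd_unit(record: dict) -> ET.Element:
        unit = ET.Element(f"{{{ABCD}}}Unit")
        ET.SubElement(unit, f"{{{ABCD}}}SourceInstitutionID").text = record["institution"]
        ET.SubElement(unit, f"{{{ABCD}}}UnitID").text = record["unit_id"]

        identifications = ET.SubElement(unit, f"{{{ABCD}}}Identifications")
        identification = ET.SubElement(identifications, f"{{{ABCD}}}Identification")
        result = ET.SubElement(identification, f"{{{ABCD}}}Result")
        taxon = ET.SubElement(result, f"{{{ABCD}}}TaxonIdentified")
        sci_name = ET.SubElement(taxon, f"{{{ABCD}}}ScientificName")
        ET.SubElement(sci_name, f"{{{ABCD}}}FullScientificNameString").text = record["scientific_name"]
        return unit

    # Illustrative record with made-up values
    record = {
        "institution": "BGBM",
        "unit_id": "B 10 0000001",
        "scientific_name": "Campanula persicifolia L.",
    }
    print(ET.tostring(record_to_abcd_unit(record), encoding="unicode"))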

Export of DIPs with BGBM in-house management systems used in GFBio (reBiND, JACQ, others)

  • Citation
Based on the data provider's input (submission metadata), the citation of the dataset is prepared by the BGBM data curator, who adjusts the submission metadata to conform with the GFBio citation pattern. The citation is finalised in collaboration with the data producer/customer.
Conformity with the GFBio citation pattern is strongly recommended, as exceptions might cause errors or representation problems in the GFBio portal's search.
Example: Kilian, N.; Borsch, T.; Müller, K.; Güntsch, A.; Henning, T.; Plitzner, P. & Müller, A. (2018). Digitisation / Cataloguing of non-textual objects: Development of a subject indexing system for collections of the north hemispherical flowering plant genus Campanula (DNA Sample Dataset). [Dataset]. Data Publisher: Botanic Garden and Botanical Museum Berlin. https://data.bgbm.org/dataset/gfbio/0019/.
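A minimal Python sketch of how a citation string following this pattern could be assembled from submission metadata; the dictionary keys and most example values are assumptions (only publisher and URL are taken from the example above), and in practice the string is finalised manually by the BGBM data curator together with the data producer.

    # Minimal sketch: assemble a GFBio-style citation string from submission
    # metadata. The metadata keys are assumptions made for this example.
    def build_citation(meta: dict) -> str:
        if len(meta["authors"]) > 1:
            authors = "; ".join(meta["authors"][:-1]) + " & " + meta["authors"][-1]
        else:
            authors = meta["authors"][0]
        return (
            f"{authors} ({meta['year']}). {meta['title']}. [Dataset]. "
            f"Data Publisher: {meta['publisher']}. {meta['url']}."
        )

    meta = {
        "authors": ["Kilian, N.", "Borsch, T.", "Müller, A."],
        "year": 2018,
        "title": "Example dataset title",
        "publisher": "Botanic Garden and Botanical Museum Berlin",
        "url": "https://data.bgbm.org/dataset/gfbio/0019/",
    }
    print(build_citation(meta))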
  • Licenses
Subsequent to the submission by the data producer/customer, the license of a data package is ingested into the BGBM archiving facilities via the JIRA API / metadata ingest (a hedged sketch of such a lookup follows below).
GFBio promotes CC licenses; the preferred license for data packages is CC BY-SA 4.0.
For exceptional reasons the user and/or the BGBM Data Center might want to publish data packages under a license other than CC BY 4.0 (for example the CC0 license, as used by the Herbarium Berolinense (B): textual metadata on specimens from the Herbarium Berolinense (B) are released under the Creative Commons 1.0 Public Domain Dedication waiver [1]).
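The following Python sketch indicates how the license of a submitted data package could be read from its submission ticket via the JIRA REST API. The base URL, issue key and custom field id are placeholders, not the actual GFBio/BGBM configuration.

    # Hedged sketch: read the license of a data package from a JIRA ticket
    # via the JIRA REST API. Base URL, issue key and custom field id are
    # placeholders, not the real GFBio/BGBM configuration.
    import requests

    JIRA_BASE = "https://jira.example.org"   # placeholder
    LICENSE_FIELD = "customfield_10000"      # placeholder custom field id

    def fetch_license(issue_key: str, auth: tuple):
        url = f"{JIRA_BASE}/rest/api/2/issue/{issue_key}"
        response = requests.get(url, auth=auth, timeout=30)
        response.raise_for_status()
        return response.json()["fields"].get(LICENSE_FIELD)  # e.g. "CC BY-SA 4.0"

    print(fetch_license("GFBIO-123", auth=("user", "password")))  # placeholder ticket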
  • GFBio data and metadata created during submission
The metadata generated through the GFBio submission are stored in the GFBio JIRA ticket system. A connection via the JIRA API allows for metadata ingest and the ingest of the original research data into the BGBM archiving facilities (work in progress; proof of concept successful). Concluding parameter refinement is done manually by the BGBM data curator in cooperation with the data producer/customer.
All GFBio IDs as well as other external IDs at dataset and data unit level, as far as available (e.g. DOIs, GenBank accession numbers, BOLD numbers, MycoBank numbers, ORCID IDs, GFBio submission IDs, DSMZ strain numbers, IDs provided by other GFBio data centers for linked datasets), are stored in appropriate databases/tables of the BGBM archiving facilities or are generated if needed (e.g. BGBM internal AIP-IP). As far as they are part of GFBio consensus documents, they will be published.
The occurrence data are stored at two levels of granularity: (a) at dataset level and (b) at unit level.
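A simple sketch of how identifiers at dataset level and at unit level could be kept apart; the class and attribute names are illustrative assumptions and do not correspond to the actual BGBM database layout.

    # Illustrative sketch: identifiers kept at dataset level vs. unit level.
    # Class and attribute names are assumptions, not the BGBM table layout.
    from dataclasses import dataclass, field

    @dataclass
    class UnitIdentifiers:
        unit_id: str                                       # e.g. specimen barcode
        genbank_accessions: list = field(default_factory=list)
        bold_numbers: list = field(default_factory=list)

    @dataclass
    class DatasetIdentifiers:
        gfbio_submission_id: str
        doi: str = ""                                      # assigned on publication
        orcid_ids: list = field(default_factory=list)      # data producers
        units: list = field(default_factory=list)          # UnitIdentifiers objects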
  • Other (meta)data
Other (meta)data recommended or mandatory for export are compiled and stored in the BGBM archiving facilities.
Multimedia data are stored in Cleop/IIIF (the tool is being implemented).
Subsequently, they are included in the data pipeline via a semi-automated process within MapForce and eXist-db.

Transformation of DIPs as AIPs for BGBM archiving system

  • Archiving and versioning of DIPs (e.g., starting with AIP created by the BioCASe Provider Software)
All DIPs are created as zipped ABCD 2.06 XML archives (with the internal date "lastModified") using a regular manual function of the BioCASe Provider Software.
The zipped ABCD 2.06 XML archives are provided via a web-accessible storage (https://data.bgbm.org/dataset/gfbio/); a landing page per dataset provides the original version and potential subsequent versions of the .zip archive(s).
Version numbering at BGBM: every snapshot of the dataset is recognisable by its version number, which consists of two parts: major version.minor version (e.g. 2.1). Major changes (e.g. adding further data to the dataset) lead to an increase of the first number; minor changes (e.g. correction of typing errors) are visible as an increase of the second part of the version number (see the sketch below). For an example of a versioned dataset see https://data.bgbm.org/dataset/gfbio/0004/.
An AIP (a .zip archive including the DIP and related documents, as well as the originally submitted files) is created and stored from the point of creation onwards within the BGBM long-term archiving storage. The AIP is not accessible from outside the BGBM archiving facilities.
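A minimal Python sketch of the major.minor numbering rule described above; whether the minor part resets to zero on a major change is an assumption made here for illustration.

    # Minimal sketch of the major.minor version numbering described above.
    def bump_version(version: str, major_change: bool) -> str:
        """Return the next version number for a new snapshot of a dataset."""
        major, minor = (int(part) for part in version.split("."))
        if major_change:                 # e.g. further data added to the dataset
            return f"{major + 1}.0"      # resetting the minor part is an assumption
        return f"{major}.{minor + 1}"    # e.g. correction of typing errors

    assert bump_version("2.1", major_change=True) == "3.0"
    assert bump_version("2.1", major_change=False) == "2.2"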

Transformation of DIPs for data publication in GFBio data portal and VAT tool

In general, all seven GFBio Collection Data Centers transform and publish DIPs in a similar way using local installations of the BioCASe Provider Software.

  • Access via BioCASe Local Query Tool, Landingpage
In general, all BGBM data sources deriving from the JACQ workflow are accessible via the BGBM BioCASe Local Query Tool http://ww3.bgbm.org/biocase/querytool/main.cgi (Caution: it contains all BGBM data sources, but not necessarily all BGBM GFBio data sources).
For the reBiND workflow, unit landing pages are automatically generated using a URL structure that mirrors the one of the Local Query Tool, e.g. [2]. This, however, is not a functional reimplementation of the Local Query Tool. Under the URL https://data.bgbm.org/dataset/gfbio/ BGBM provides a separate overview of dataset landing pages for data processed via/for GFBio. A landing page for each data package is generated semi-automatically.
  • Access via BioCASe Monitor Service
see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Citation of published dataset
The proposed citation string is given according to the scheme exemplified above, for details see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • DOI assignment
BGBM (as a registered customer of ZB MED) creates DOIs for research data via DOI Fabrica/DataCite. The DOI is also part of the citation of the dataset. An example will follow soon; a hedged registration sketch is given at the end of this section.
  • Indexing/harvesting by central GFBio indexing processes
see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Access via GFBio and VAT Data Portal
see General part: GFBio publication of type 1 data via BioCASe data pipelines
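As mentioned under "DOI assignment" above, DOIs for research data are created via DOI Fabrica/DataCite. The following Python sketch shows what a draft DOI registration via the DataCite REST API could look like; the endpoint (DataCite test system), credentials, prefix and all metadata values are placeholders, and the actual BGBM/ZB MED procedure may instead use DOI Fabrica interactively.

    # Hedged sketch: register a draft DOI via the DataCite REST API.
    # Credentials, the DOI prefix and all metadata values are placeholders.
    import requests

    payload = {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": "10.99999",                           # placeholder prefix
                "titles": [{"title": "Example dataset title"}],
                "creators": [{"name": "Example, Author"}],
                "publisher": "Botanic Garden and Botanical Museum Berlin",
                "publicationYear": 2018,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": "https://data.bgbm.org/dataset/gfbio/0000/",  # placeholder
            },
        }
    }

    response = requests.post(
        "https://api.test.datacite.org/dois",                   # DataCite test endpoint
        json=payload,
        auth=("REPOSITORY_ID", "PASSWORD"),                     # placeholder credentials
        headers={"Content-Type": "application/vnd.api+json"},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json()["data"]["id"])                        # the newly created DOI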