Publication of Type 1 Data via BioCASe Data Pipelines at DSMZ Data Center

From GFBio Public Wiki

The DSMZ Data Center is one of the seven GFBio Collection Data Centers, which are components of the GFBio Submission, Repository and Archiving Infrastructure. Data archiving and publication are based on an in-house MySQL server system with an MS Access frontend. For a more detailed description see Technical Documentations. The data are structured according to the ABCD conceptual schema.

The workflow is illustrated in Figure 1 and described in the text below.
Figure 1: The DSMZ workflow, BioCASe data pipelines for GFBio type 1 data.
SIP - Submission Information Package
AIP - Archival Information Package
DIP - Dissemination Information Package

Data pipeline

Export of GFBio DIP from DSMZ in-house management system

  • Citation
The citation is based on the data provider's input and follows the GFBio citation pattern. It is finalised in collaboration with the data producer/customer.
Example: Reimer, L. C.; Vetcininova, A.; Sardà Carbasse, J.; Söhngen, C.; Gleim, D.; Ebeling, C. & Overmann, J. (2019). BacDive in 2019: bacterial phenotypic data for High-throughput biodiversity analysis. [Dataset]. Version: 20190108. Data Publisher: Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures.
  • Licenses
The licenses for the data packages are ingested during the submission/ingestion process. Data providers or curators may define their own rules for the use of their data. In general, the DSMZ prefers the CC BY-SA 4.0 license for multimedia URLs.
  • GFBio data and metadata created during submission
The metadata generated during the GFBio submission are stored in the GFBio JIRA ticket system. Additional parameters are assigned manually by the DSMZ data curator in close cooperation with the data provider.
All GFBio IDs and other external IDs available so far at dataset and data unit level (e.g. DOIs, GenBank accession numbers, BOLD numbers, MycoBank numbers, ORCID IDs, GFBio submission IDs, DSMZ strain numbers, IDs provided by other GFBio data centers for linked datasets) are stored in the appropriate tables of the MySQL server installations. They will be published insofar as they are part of GFBio consensus documents.
The occurrence data are stored at two levels of granularity: (a) dataset level and (b) unit level.
  • Other (meta)data
Multimedia data and other metadata are stored on the DSMZ's own server system and linked to the corresponding datasets in the MySQL server.
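
The citation pattern shown in the example under Citation can be illustrated with a short Python sketch. It is only an illustration derived from that single example, not the GFBio or DSMZ implementation: the class name DatasetCitation, its field names and the function names are hypothetical, and the separator rules (authors joined by "; " with " & " before the last author, version written as YYYYMMDD) are read off the example string rather than taken from a formal specification.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetCitation:
    """Hypothetical container for the parts of a dataset citation."""
    authors: list[str]   # each entry already in "Surname, Initials" form
    year: int
    title: str
    version: date        # rendered as YYYYMMDD in the citation string
    publisher: str


def format_authors(authors: list[str]) -> str:
    """Join authors with '; ' and put ' & ' before the last one."""
    if len(authors) <= 1:
        return "".join(authors)
    return "; ".join(authors[:-1]) + " & " + authors[-1]


def format_citation(c: DatasetCitation) -> str:
    """Assemble the citation string in the order used by the example."""
    return (
        f"{format_authors(c.authors)} ({c.year}). {c.title}. [Dataset]. "
        f"Version: {c.version:%Y%m%d}. Data Publisher: {c.publisher}."
    )
```

Calling format_citation with the authors, year, title, version date and publisher of the BacDive example reproduces the example string character for character, which is a quick way to check that a manually finalised citation still follows the pattern.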

Transformation of DIPs for DSMZ archiving system

  • Archiving of DIPs and AIPs
All DIPs are created as zipped ABCD 2.06 or ABCD 2.1 XML archives using a standard manual function of the BioCASe Provider Software. The URIs are included as links in the landing pages. The AIP consists of the DIP, metadata and related documents (for example image files) and is stored in-house. The AIP is not accessible from outside. DSMZ IT backs up the stored DIPs and AIPs daily, according to the Technical documentation of long-term archiving solutions at the GFBio collection data centers.
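
The composition of an AIP described above (DIP plus metadata plus related documents) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DSMZ archiving procedure or a function of the BioCASe Provider Software: the function name build_aip, the folder names inside the archive and the SHA-256 manifest file are all hypothetical additions made for the example.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_aip(dip_archive, metadata_files, related_files, aip_path):
    """Bundle a DIP archive, metadata and related documents into one zip.

    Returns a dict mapping each archive member name to its checksum.
    The folder layout and the manifest are assumptions for illustration.
    """
    members = [("dip", Path(dip_archive))]
    members += [("metadata", Path(p)) for p in metadata_files]
    members += [("documents", Path(p)) for p in related_files]

    checksums = {}
    with zipfile.ZipFile(aip_path, "w", compression=zipfile.ZIP_DEFLATED) as aip:
        for folder, path in members:
            arcname = f"{folder}/{path.name}"
            aip.write(path, arcname=arcname)
            checksums[arcname] = sha256_of(path)
        # Hypothetical manifest so that later backups can be verified.
        manifest = "".join(
            f"{digest}  {name}\n" for name, digest in sorted(checksums.items())
        )
        aip.writestr("manifest-sha256.txt", manifest)
    return checksums
```

Keeping the zipped DIP unchanged inside the AIP means the published package and the archived one can be compared by checksum at any later point.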

Transformation of DIPs for publication in GFBio data portal and VAT tool

  • Access via BioCASe Local Query Tool and landing pages
In general, the DSMZ data sources are accessible via the BioCASe Monitor Service. This includes access to the landing pages and the local DSMZ BioCASe query tools.
  • Access via BioCASe Monitor Service (BMS)
see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Citation of published dataset
The proposed citation string follows the scheme exemplified above; for details see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Indexing/harvesting by central GFBio indexing processes
see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Access via GFBio Data Portal
see General part: GFBio publication of type 1 data via BioCASe data pipelines