Publication of Type 1 Data via BioCASe Data Pipelines at DSMZ Data Center
From GFBio Public Wiki
The DSMZ Data Center is one of the seven GFBio Collection Data Centers, which are components of the GFBio Submission, Repository and Archiving Infrastructure. Data archiving and publication are based on an in-house MySQL server system with an MS Access frontend. For a more detailed description see Technical Documentations. The data are structured according to the ABCD conceptual schema.
- The workflow is illustrated in figure 1 and described in the text below.
Data pipeline
Export of GFBio DIP from DSMZ in-house-management system
- Citation
- The citation is based on the data provider's input, follows the GFBio citation pattern, and is finalised in collaboration with the data producer/customer.
- Example: Reimer, L. C.; Vetcininova, A.; Sardà Carbasse, J.; Söhngen, C.; Gleim, D.; Ebeling, C. & Overmann, J. (2019). BacDive in 2019: bacterial phenotypic data for High-throughput biodiversity analysis. [Dataset]. Version: 20190108. Data Publisher: Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures.
- Licenses
- The licenses for the data packages are recorded during the submission/ingestion process. Data providers or curators may define their own rules for the use of their data, but in general the DSMZ favours the CC BY-SA 4.0 license for multimedia URLs.
- GFBio data and metadata created during submission
- The metadata generated through the GFBio submission are stored in the GFBio JIRA ticket system. Additional parameters are assigned manually by the DSMZ data curator in close cooperation with the data provider.
- GFBio IDs according to GFBio consensus documents
- All GFBio IDs, as well as other external IDs available so far at dataset and data unit level (e.g. DOIs, GenBank accession numbers, BOLD numbers, MycoBank numbers, ORCID IDs, GFBio submission IDs, DSMZ strain numbers, IDs provided by other GFBio data centers for linked datasets), are stored in appropriate tables of the MySQL Server installations. As far as they are part of GFBio consensus documents, they will be published.
- Occurrence data according to GFBio consensus documents
- The occurrence data are stored at two levels of granularity: (a) at dataset level and (b) at unit level.
- Other (meta)data
- Multimedia data and other metadata are stored on the DSMZ's own server system and linked to the corresponding datasets in the MySQL Server.
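The citation pattern shown in the Citation item above can be sketched as a small formatting function. This is only an illustration inferred from the single example given on this page, not the official GFBio specification; the class `DatasetCitation` and the function `format_citation` are hypothetical names, and the rule that the last author is joined with "&" is an assumption read off the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetCitation:
    """Hypothetical container for the fields visible in the example citation."""
    authors: list[str]   # each author already formatted as "Surname, I."
    year: int
    title: str
    version: date        # rendered as YYYYMMDD, as in the example
    publisher: str


def format_citation(c: DatasetCitation) -> str:
    """Build a citation string following the pattern of the example above."""
    if len(c.authors) > 1:
        # Authors are separated by "; ", the last one is attached with " & ".
        authors = "; ".join(c.authors[:-1]) + " & " + c.authors[-1]
    else:
        authors = c.authors[0]
    title = c.title.rstrip(".")
    return (
        f"{authors} ({c.year}). {title}. [Dataset]. "
        f"Version: {c.version:%Y%m%d}. Data Publisher: {c.publisher}."
    )


example = DatasetCitation(
    authors=[
        "Reimer, L. C.", "Vetcininova, A.", "Sardà Carbasse, J.",
        "Söhngen, C.", "Gleim, D.", "Ebeling, C.", "Overmann, J.",
    ],
    year=2019,
    title="BacDive in 2019: bacterial phenotypic data for High-throughput biodiversity analysis",
    version=date(2019, 1, 8),
    publisher="Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures",
)
print(format_citation(example))
```

Called on the values from the example, the function reproduces the citation string quoted above; in practice the final wording is still agreed with the data producer.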
Transformation of DIPs for DSMZ archiving system
- Archiving of DIPs and AIPs
- All DIPs are created as zipped ABCD 2.06 or ABCD 2.1 XML archives using a standard function of the BioCASe Provider Software that is run manually. The URIs are included as links in the landing pages. The AIP consists of the DIP, metadata and related documents (for example image files) and is stored in-house; it is not accessible from outside. DSMZ IT backs up the stored DIPs and AIPs daily, according to the Technical documentation of long-term archiving solutions at the GFBio collection data centers.
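To make the relation between DIP and AIP more concrete, the following minimal sketch bundles a zipped DIP with metadata and related documents into a single AIP archive. It is not the BioCASe Provider Software function mentioned above and does not describe the actual DSMZ storage layout; the function name `package_aip`, the folder names inside the archive and all file names are assumptions made for illustration, using only the Python standard library.

```python
import zipfile
from pathlib import Path


def package_aip(dip_archive, metadata_files, related_documents, aip_path):
    """Bundle a DIP archive, metadata and related documents into one AIP zip.

    The internal folder names ("dip", "metadata", "documents") are
    hypothetical and chosen only to keep the three parts apart.
    """
    aip_path = Path(aip_path)
    with zipfile.ZipFile(aip_path, "w", compression=zipfile.ZIP_DEFLATED) as aip:
        # The DIP is the zipped ABCD XML archive; it is stored unchanged.
        dip_archive = Path(dip_archive)
        aip.write(dip_archive, arcname=f"dip/{dip_archive.name}")
        for path in map(Path, metadata_files):
            aip.write(path, arcname=f"metadata/{path.name}")
        for path in map(Path, related_documents):
            aip.write(path, arcname=f"documents/{path.name}")
    return aip_path
```

A call such as `package_aip("dataset_abcd2.06.zip", ["submission.json"], ["strain_image.jpg"], "dataset_aip.zip")` (file names invented here) would produce one archive that can then be covered by the regular backup.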
Transformation of DIPs for publication in GFBio data portal and VAT tool
- Access via BioCASe Local Query Tool, landing page
- In general, the DSMZ data sources are accessible via the BioCASe Monitor Service. This includes access to the landing pages and the local DSMZ BioCASe query tools.
- Access via BioCASe Monitor service (BMS)
- Citation of published dataset
- The proposed citation string follows the scheme exemplified above; for details see General part: GFBio publication of type 1 data via BioCASe data pipelines.
- Indexing/harvesting by central GFBio indexing processes
- Access via GFBio Data Portal