Publication of Type 1 Data via BioCASe Data Pipelines at SNSB Data Center

The SNSB Data Center is one of the seven GFBio Collection Data Centers, which together form the backbone of the GFBio Submission, Repository and Archiving Infrastructure. Archiving and publication of type 1 data at SNSB involve management processes using several Diversity Workbench modules (DC, DP, DTN, DA, DG, DR and DST). The management tools and archiving processes used at the GFBio data center SNSB are described under Technical Documentations. These include services for documenting, processing and regularly archiving the incoming original (meta)data sets and multimedia objects (source data; SIP), using DiversityProjects (DP) functionality to ingest metadata from the GFBio submission tool. Data producers are welcome to use the xls templates provided under Templates for data submission. SNSB uses DWB tools for data and metadata import, metadata enrichment and data quality control (see

A Diversity Workbench (DWB) tool with a GUI for organising data publication is used to transfer, filter and transform data and metadata for publication (DWB Video on Export --> Overview). The data transfer is partly automated. Implementation of DOI assignment for published research datasets is planned for the coming months.

Figure 1: The SNSB Workflow, BioCASe (Biological Collection Access Service) data pipelines for GFBio Type 1 Data.
ABCD - Access to Biological Collections Data schema (V2.06 within GFBio)
AIP - Archival Information Package
DIP - Dissemination Information Package
SIP - Submission Information Package
VAT - Visualizing and Analysing Tool
The workflow with these central components is illustrated in Figure 1 and described in detail below.

Data pipeline - Provision of (versioned) DIPs

Data producers using the GFBio installation of the DWB database suite at the SNSB for (interim) data management have write and read access to their ingested datasets, at least for a certain period of time. Thus, they are able to fulfil certain tasks of data quality control and data curation.

  • Description of DWB workflow 1 applied for biodiversity and collection datasets without material deposit at SNSB
The data producers also have to decide whether their dataset will be published (a) in the traditional way, i.e. without the option of revising the DIP, so that the data publication process is carried out only once and results in a single DIP published via GFBio with a citation, possibly with DOI assignment, and with a zipped ABCD 2.06 xml archive, or (b) with the option of mid-term data management of the processed SIP and later enrichment, revision and emendation of the dataset. With solution (a), a new data revision has to be treated as a completely independent data publication, starting with a new GFBio submission process. Alternative (b) is a dynamic data publication. In this case, versions can be handled in the way that is routine for collection data records and for growing data packages with observation data (DIPs with version changes). All decisions regarding the envisaged data pipelines are documented in DP as part of the submission process.
  • Description of DWB workflow 2 applied for collection datasets with material deposit at SNSB
In most aspects, workflow 2 is identical to workflow 1, but it includes a final step of data replication from the GFBio installation of the DWB database suite to the SNSB master installation of the DWB database suite. The pipeline is intended for medium-sized sets of collection data and will be applied in two cases: (a) when data producers use existing SNSB collection material for their research and produce well-structured linked research data and multimedia objects, and (b) when they are going to deposit their vouchers together with extended metadata and multimedia objects at the SNSB. As in workflow 1, the data producers are guided to use the GFBio installation of the DWB database suite at the SNSB for interim data management and have write and read access to their ingested datasets. After an agreed period of time the data are published as a DIP via GFBio, e.g. with a citation, possibly with DOI assignment, and with a zipped ABCD 2.06 xml archive. Thereafter the data are replicated and kept versioned in the collection databases of the SNSB.

Export of DIPs with the SNSB in-house installation of DWB used in GFBio

The DIPs are created by data curators (data stewards, data scientists) at the SNSB using the DWB BioCASe data publication tool.

  • Citation
The citation as it appears in the respective ABCD 2.06 element is the result of aggregating several DiversityProjects elements and element entries, followed by data processing. The year and version given in the citation of the DIP are those of the creation of the DIP (DWB ABCD_package). The citation agents might have more than one role (authors, publisher). The exact creation date of the DIP as a zipped ABCD 2.06 xml archive (produced with a manual function of the BioCASe Provider Software) is the one shown in the BioCASe Provider Software as "lastmodified".
Based on the data provider's input (submission metadata), the citation of the dataset is curated to conform to the GFBio citation pattern. The citation is finalized in close collaboration with the data provider.
Example: Kölbl-Ebert, M. (2017). The Fossil Fish Collection at the Jura-Museum Eichstätt. [Dataset]. Version: 20171205. Data Publisher: Staatliche Naturwissenschaftliche Sammlungen Bayerns – SNSB IT Center, München.
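The citation pattern of the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `CitationParts`, the function `format_citation` and the "; " separator for multiple authors are hypothetical and do not reflect the DiversityProjects schema or the actual DWB processing. The year and the version string are both derived from the creation date of the DIP, as described above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class CitationParts:
    """Hypothetical container for citation elements (not the DP schema)."""
    authors: list[str]   # citation agents in the author role
    title: str           # dataset title
    publisher: str       # citation agent in the publisher role
    dip_created: date    # creation date of the DIP


def format_citation(parts: CitationParts) -> str:
    """Assemble a citation string following the pattern of the example."""
    authors = "; ".join(parts.authors)  # separator is an assumption
    year = parts.dip_created.year       # year of DIP creation
    version = parts.dip_created.strftime("%Y%m%d")  # version as YYYYMMDD
    return (
        f"{authors} ({year}). {parts.title}. [Dataset]. "
        f"Version: {version}. Data Publisher: {parts.publisher}."
    )
```

Called with the author, title and publisher of the example and `date(2017, 12, 5)` as the DIP creation date, `format_citation` reproduces the example citation string.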
  • Date and time specification regarding data publications with version changes
The available ABCD 2.06 xml zip archive includes the date of the latest data export from the DWB master databases to the Microsoft SQL cache database (xml element: 'date modified') as well as the date of the last change to each unit record in the DiversityCollection master database (xml element: 'last edited'). It also includes the citation with versioning as given in the example above. In rare cases the ABCD 2.06 xml zip archive might represent an earlier version of the DIP, with a citation that deviates from the one provided in parallel by the BioCASe Provider Software for dynamic access. (In contrast to GFBio, GBIF harvests the dynamic representation of datasets.) The reason is that the export from the DWB to the BioCASe Provider Software runs automatically at certain intervals, whereas the zip files have to be created manually via the BioCASe Provider Software.
For data publications with version changes, the year in which the data were first delivered is given on the dataset's home page ("overview page").
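Whether a zip archive lags behind the dynamic representation can be checked by comparing the version tokens of the two citations. The following sketch assumes that both citations carry a "Version: YYYYMMDD" token as in the citation example; the function names `citation_version` and `archive_lags_dynamic` are hypothetical and not part of the DWB or the BioCASe Provider Software.

```python
import re
from datetime import date, datetime

# Matches the version token of the citation pattern, e.g. "Version: 20171205"
VERSION_RE = re.compile(r"Version:\s*(\d{8})")


def citation_version(citation: str) -> date:
    """Extract the version date from a citation string."""
    match = VERSION_RE.search(citation)
    if match is None:
        raise ValueError("no 'Version: YYYYMMDD' token found in citation")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def archive_lags_dynamic(archive_citation: str, dynamic_citation: str) -> bool:
    """Return True if the zip archive carries an older version than the
    citation served for dynamic access."""
    return citation_version(archive_citation) < citation_version(dynamic_citation)
```

A result of `True` would indicate that the zip archive has not yet been regenerated manually after the latest automated export.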
  • Licenses
The licenses for the data packages are those ingested in DP during the GFBio submission/ingestion process. The license for each single multimedia object is handled separately and stored in DC together with the respective multimedia URI. GFBio promotes CC licenses; the preferred license for data packages is CC BY 4.0 and the preferred license for multimedia objects is CC BY-SA 4.0.
  • GFBio data and metadata created during submission
The metadata generated through the GFBio submission are stored in the GFBio JIRA ticket system. A connection via the JIRA API allows metadata ingest into DiversityProjects (one of two DP installations at SNSB is used: one for SNSB collection projects and one for GFBio projects). Additional metadata and original research data are imported into DWB RDMS via DWB ImportWizards. Additional parameters are assigned manually by the data producers and by SNSB data curators using special (web) services of DiversityProjects, which link to services of the GFBio Terminology Service.
All GFBio IDs as well as other external IDs on dataset and data unit level as far as available (e.g. DOIs of assigned documents and data, GenBank accession numbers, BOLD numbers, MycoBank numbers, ORCID IDs, GFBio submission IDs, DSMZ strain numbers, IDs provided by other GFBio data centers for linked datasets) are stored in appropriate tables of the DWB installations. As far as they are part of GFBio consensus documents they will be published.
The occurrence data are stored at two levels of granularity: (a) at dataset level in DP (setting elements) and (b) at unit level in DC with DA, DTN, DZ etc. Insofar as they are mandatory or recommended in GFBio consensus documents, they will be published.
  • Other (meta)data
Other (meta)data recommended or mandatory for export are stored in DP, DA or DC.

Transformation of DIPs as AIPs for SNSB archiving system, DOI assignment

  • Archiving of DIPs (e.g., starting with DIPs created by the BioCASe Provider Software)
All DIPs are created as zipped ABCD 2.06 xml archives (with internal date "lastModified") using a regular manual function of the BioCASe Provider Software. At least the original/primary version and the most recent subsequent version of the zipped ABCD 2.06 xml archives are provided via web-accessible storage (ongoing implementation); see the next section. At the SNSB all DIPs are archived locally as AIPs, together with the Pywrapper config data, by an automated process. This process starts every 24 hours and uses the "lastModified" date to name the AIPs. Services of the Leibniz-Rechenzentrum (LRZ) München are used for long-term archiving of the AIPs. The AIPs at the LRZ are not accessible from outside. AIP versions are numbered by adding ISO time strings. Where DOIs are assigned (done manually during DWB curation in DP), these PIDs are included in the metadata of the DIPs and AIPs.
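The naming of AIPs by "lastModified" date and ISO time string can be illustrated as follows. This is a minimal sketch, not the SNSB implementation: the file name pattern, the dataset key `fossil_fish` used in the usage note and the function names `aip_name` and `primary_and_latest` are assumptions made for illustration. The ISO 8601 basic format is used so that names of the same dataset sort chronologically.

```python
from __future__ import annotations

from datetime import datetime


def aip_name(dataset: str, last_modified: datetime) -> str:
    """Build a hypothetical AIP file name from the 'lastModified' date."""
    stamp = last_modified.strftime("%Y%m%dT%H%M%S")  # ISO 8601 basic format
    return f"{dataset}_{stamp}.zip"


def primary_and_latest(names: list[str]) -> tuple[str, str]:
    """Pick the original/primary and the most recent version.

    Assumes all names belong to the same dataset and were built by
    aip_name, so lexicographic order equals chronological order.
    """
    if not names:
        raise ValueError("no archive versions given")
    ordered = sorted(names)
    return ordered[0], ordered[-1]
```

For example, `aip_name("fossil_fish", datetime(2017, 12, 5, 14, 30, 0))` yields `fossil_fish_20171205T143000.zip`, and `primary_and_latest` applied to a list of such names returns the two versions that are provided at minimum via web-accessible storage.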
  • SNSB DOI assignment
The assignment has to be done after the DWB data curation, filtering and transformation process is finished. It concerns the DIPs which are already processed by the BioCASe Provider Software (see below).

Transformation of DIPs for data publication in GFBio data portal and VAT tool

In general, all seven data centers transform and publish DIPs in a similar way, using local installations of the BioCASe Provider Software to create ABCD zip archives.

  • Access via BioCASe Local Query Tool, DOI Landingpage with DOI assigned DIPs
The SNSB data center is going to organise the creation of "Overview pages" and "DOI landingpages" for DIPs in two ways:
(a) DIPs with versioning have their own individually designed "overview web page" with links to a DOI landingpage with the zipped files and one link to the BPS access page created by the BioCASe Provider Software. The DOI landingpages (semi-automatically generated from the PostgreSQL cache db) will provide links to the DOI-assigned data packages (ABCD xml, zip-archive) and a link to the "overview page" (ongoing implementation),
(b) DIPs without versioning are currently documented by the BPS access page created by the BioCASe Provider Software. In future, the DOI landingpage will be generated semi-automatically from the PostgreSQL cache db as in case (a) and linked with the BPS access page created by the BioCASe Provider Software (ongoing implementation).
  • Access via BioCASe Monitor Service (BMS)
see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Citation of published dataset
The proposed citation string follows the scheme exemplified above; for details see General part: GFBio publication of type 1 data via BioCASe data pipelines. A DOI might finally be added to the citation pattern.
  • DOI assignment
SNSB is registered at ZB MED and can create unique DOIs for data packages. The DOI is created in DataCite DOI Fabrica, assigned to the corresponding version of the data package and stored in appropriate tables of DiversityProjects. It is also part of the citation of the dataset.
  • Indexing/harvesting by central GFBio indexing processes
see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Access via GFBio and VAT Data Portal
see General part: GFBio publication of type 1 data via BioCASe data pipelines