Publication of Type 1 Data via BioCASe Data Pipelines at SNSB Data Center

The SNSB Data Center is one of the seven GFBio Collection Data Centers, which form the backbone of the GFBio Submission, Repository and Archiving Infrastructure. The archiving and publication of type 1 data at SNSB includes management processes with several Diversity Workbench modules (DC, DP, DTN, DA, DG, DR and DST). The management tools and archiving processes used at the GFBio data center SNSB are described under Technical Documentations. This includes services for documentation, processing and regular archiving of the incoming original (meta)data sets and multimedia objects (source data; SIP), using DiversityProjects (DP) functionality for metadata ingest from the GFBio submission tool. Data producers are welcome to use the xls templates provided under Templates for data submission. SNSB uses DWB tools for data and metadata import, metadata enrichment and data quality control (see https://www.gfbio.org/data/tools).

A Diversity Workbench (DWB) tool with a GUI for organising data publication is used to transfer, filter and transform data and metadata for publication (DWB Video on Export --> Overview). The data transfer is partly automated. The implementation of DOI assignment for published research datasets is planned for the coming months.

[[File:SNSB-Workflow.png||framed|right|Figure 1.1: The SNSB Workflow, BioCASe (Biological Collection Access Service) data pipelines for GFBio Type 1 Data. Clicking will enlarge the chart.

ABCD - Access to Biological Collections Data schema (V2.06 within GFBio)

AIP - Archival Information Package

DIP - Dissemination Information Package

SIP - Submission Information Package

VAT - Visualizing and Analysing Tool]]


 * The workflow with these central components is illustrated in Figure 1.1; a detailed description is given in the text below.

Data pipeline - Provision of (versioned) DIPs
Data producers using the GFBio installation of the DWB database suite at the SNSB for (interim) data management have write and read access to their ingested datasets, at least for a certain period of time. Thus, they are able to fulfil certain tasks of data quality control and data curation.


 * Description of DWB workflow 1 applied for biodiversity and collection datasets without material deposit at SNSB
 * The data producers also have to decide whether their dataset will be published (a) in the traditional way, without the option of revising the DIP, or (b) with the option of mid-term data management of the processed SIP, including later data enrichment, revision and emendation of the dataset. With option (a), the data publication process is done only once and results in a single DIP published via GFBio, with citation, possibly with DOI assignment, and with a zipped ABCD 2.06 xml archive; any new data revision has to be treated as a new, completely independent data publication starting with a new GFBio submission process. Option (b) is a dynamic data publication: versions can be handled in the way that is routinely used for collection data records and for growing data packages with observation data (DIPs with version changes). All decisions regarding the envisaged data pipelines are documented as part of the submission process in DP.
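
The difference between variants (a) and (b) can be illustrated with a minimal Python sketch. It is not part of the DWB tools; the names `PublicationMode`, `DataPackage` and `publish_revision` are hypothetical and only model the decision described above.

```python
from dataclasses import dataclass, field
from enum import Enum


class PublicationMode(Enum):
    STATIC = "a"   # published once, no revision of the DIP
    DYNAMIC = "b"  # mid-term management, the DIP may receive new versions


@dataclass
class DataPackage:
    title: str
    mode: PublicationMode
    versions: list = field(default_factory=list)  # version strings, oldest first


def publish_revision(package, version):
    """Add a version to a package, or signal that a new submission is needed."""
    if package.mode is PublicationMode.STATIC and package.versions:
        # Variant (a): a revision is a new, independent data publication.
        return "new GFBio submission required"
    # First publication, or variant (b): the DIP gets a further version.
    package.versions.append(version)
    return f"published version {version}"
```

In this model a static package accepts exactly one version, whereas a dynamic package accumulates versions over time.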


 * Description of DWB workflow 2 applied for collection datasets with material deposit at SNSB
 * In most aspects, workflow 2 is identical to workflow 1, but it includes a final step of data replication from the GFBio installation of the DWB database suite to the SNSB master installation of the DWB database suite. The pipeline is intended for medium-sized assets of collection data and will be applied in two cases: (a) when data producers use existing SNSB collection material for their research and produce well-structured linked research data and multimedia objects, and (b) when they are going to deposit their vouchers together with extended metadata and multimedia objects at the SNSB. As in workflow 1, the data producers are guided to use the GFBio installation of the DWB database suite at the SNSB for interim data management and have write and read access to their ingested datasets. After an agreed period of time the data are published as a DIP via GFBio, e.g. with citation, possibly with DOI assignment, and with a zipped ABCD 2.06 xml archive. Thereafter the data are replicated and kept versioned in the collection databases of the SNSB.

Export of DIPs with the SNSB in-house installation of DWB used in GFBio
The DIPs are created by data curators (data stewards, data scientists) at the SNSB using the DWB BioCASe data publication tool.


 * Citation
 * The citation as it appears in the respective ABCD 2.06 element is the result of an aggregation of DiversityProjects element entries and data processing. The year and version given in the citation of the DIP are those of the creation of the DIP (DWB ABCD_package). The exact date of creation of the DIP as a zipped ABCD 2.06 xml archive (using a manual function of the BioCASe Provider Software) is the one indicated in the BioCASe Provider Software as "lastmodified".


 * Example: Kölbl-Ebert, M. (2017). The Fossil Fish Collection at the Jura-Museum Eichstätt. [Dataset]. Version: 20171205. Data Publisher: Staatliche Naturwissenschaftliche Sammlungen Bayerns – SNSB IT Center, München. http://www.snsb.info/DatabaseClients/JMEpiscescoll/About.html.
 * Kölbl-Ebert, M.: Agent, Role: Author
 * (2017): inserted through data processing (DWB ABCD_package)
 * The Fossil Fish Collection at the Jura-Museum Eichstätt: Title
 * [Dataset]: inserted through data processing (DWB ABCD_package)
 * Version: 20171205: inserted through data processing (DWB ABCD_package)
 * Staatliche Naturwissenschaftliche Sammlungen Bayerns – SNSB IT Center, München: Agent, Role: Publisher
 * http://www.snsb.info/DatabaseClients/JMEpiscescoll/About.html – might be linked to a DOI
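
The aggregation shown in the example can be sketched in a few lines of Python. This is only an illustration of the citation scheme, not the code of the DWB ABCD_package; the function name `build_citation` and its parameters are hypothetical, and the sketch assumes that both the year and the version string are derived from the creation date of the DIP.

```python
from datetime import date


def build_citation(author, title, publisher, url, created):
    """Assemble a citation string following the scheme of the example above."""
    year = created.year                    # year of creation of the DIP
    version = created.strftime("%Y%m%d")   # version as YYYYMMDD
    return (
        f"{author} ({year}). {title}. [Dataset]. Version: {version}. "
        f"Data Publisher: {publisher}. {url}."
    )


citation = build_citation(
    author="Kölbl-Ebert, M.",
    title="The Fossil Fish Collection at the Jura-Museum Eichstätt",
    publisher="Staatliche Naturwissenschaftliche Sammlungen Bayerns – SNSB IT Center, München",
    url="http://www.snsb.info/DatabaseClients/JMEpiscescoll/About.html",
    created=date(2017, 12, 5),
)
```

Author, title and publisher correspond to the DiversityProjects entries, while "(2017)", "[Dataset]" and "Version: 20171205" are the parts inserted through data processing.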


 * Date and time specification regarding data publications with version changes
 * The available ABCD 2.06 xml zip archive includes the date of the latest data export from the DWB master databases to the Microsoft SQL cache database (xml element: 'date modified') as well as the date of the last change of each single unit record in the DiversityCollection master database (xml element: 'last edited'). It also includes the citation with versioning as given in the example above. In rare cases the ABCD 2.06 xml zip archive might represent an earlier version of the DIP, with a citation that deviates from the one provided in parallel by the BioCASe Provider Software for dynamic access. (In contrast to GFBio, GBIF harvests the dynamic representation of datasets.) The reason is that the export from the DWB to the BioCASe Provider Software runs automatically at certain intervals, whereas the zip files have to be created manually via the BioCASe Provider Software.
 * For data publications with version changes, the historical first year of data delivery is given on the description home page (landing page) of the dataset.
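
As an illustration, the two dates can be read from an ABCD document with the Python standard library. The sketch assumes that 'date modified' and 'last edited' correspond to the ABCD 2.06 elements `DateModified` (dataset level) and `DateLastEdited` (unit level) in the namespace `http://www.tdwg.org/schemas/abcd/2.06`; the sample document, its unit IDs and its dates are invented for this example.

```python
import xml.etree.ElementTree as ET

# Assumed ABCD 2.06 namespace.
NS = {"abcd": "http://www.tdwg.org/schemas/abcd/2.06"}

# Invented minimal sample; real archives contain many more elements.
SAMPLE = """<abcd:DataSets xmlns:abcd="http://www.tdwg.org/schemas/abcd/2.06">
  <abcd:DataSet>
    <abcd:Metadata>
      <abcd:RevisionData>
        <abcd:DateModified>2017-12-05T09:30:00</abcd:DateModified>
      </abcd:RevisionData>
    </abcd:Metadata>
    <abcd:Units>
      <abcd:Unit>
        <abcd:UnitID>U-1</abcd:UnitID>
        <abcd:DateLastEdited>2017-11-20T10:15:00</abcd:DateLastEdited>
      </abcd:Unit>
      <abcd:Unit>
        <abcd:UnitID>U-2</abcd:UnitID>
        <abcd:DateLastEdited>2017-12-01T16:45:00</abcd:DateLastEdited>
      </abcd:Unit>
    </abcd:Units>
  </abcd:DataSet>
</abcd:DataSets>"""


def read_dates(xml_text):
    """Return the dataset export date and a dict of unit ID -> last edit date."""
    root = ET.fromstring(xml_text)
    modified = root.findtext(".//abcd:DateModified", namespaces=NS)
    last_edited = {
        unit.findtext("abcd:UnitID", namespaces=NS):
            unit.findtext("abcd:DateLastEdited", namespaces=NS)
        for unit in root.iterfind(".//abcd:Unit", NS)
    }
    return modified, last_edited


modified, last_edited = read_dates(SAMPLE)
```

Comparing `modified` with the "lastModified" date of the zip archive is one way to notice that an archive lags behind the dynamic representation.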


 * Licenses
 * The licenses for the data packages are those ingested in DP during the GFBio submission/ingestion process. The license for each single multimedia object is handled separately and stored in DC together with the respective multimedia URI. GFBio promotes CC licenses; the preferred license for data packages is CC BY 4.0, the preferred license for multimedia objects is CC BY-SA 4.0.


 * GFBio data and metadata created during submission
 * The metadata generated through the GFBio submission are stored in the GFBio JIRA ticket system. A connection via the JIRA API allows for metadata ingest into DiversityProjects (one of two DP installations at SNSB is involved: one for SNSB collection projects and one for GFBio projects). Additional metadata and original research data are imported into the DWB RDMS via DWB ImportWizards. Additional parameter assignment is done manually by the data producers and by SNSB data curators using special (web) services of DiversityProjects, which link to services of the GFBio Terminology Service.


 * GFBio IDs according to GFBio consensus documents
 * All GFBio IDs as well as other external IDs at dataset and data unit level, as far as available (e.g. DOIs, GenBank accession numbers, BOLD numbers, MycoBank numbers, ORCID IDs, GFBio submission IDs, DSMZ strain numbers, IDs provided by other GFBio data centers for linked datasets), are stored in appropriate tables of the DWB installations. Insofar as they are covered by GFBio consensus documents, they will be published.


 * Occurrence data according to GFBio consensus documents
 * The occurrence data are stored at two levels and two granularities: (a) at dataset level in DP (setting elements) and (b) at unit level in DC with DA, DTN, DZ etc. Insofar as they are mandatory or recommended according to GFBio consensus documents, they will be published.


 * Other (meta)data
 * Other (meta)data recommended or mandatory for export are stored in DP, DA or DC.

Transformation of DIPs as AIPs for SNSB archiving system

 * Archiving of DIPs (e.g., starting with DIPs created by the BioCASe Provider Software)


 * All DIPs are created as zipped ABCD 2.06 xml archives (with internal date "lastModified") using a regular manual function of the BioCASe Provider Software. At least the original (primary) version and the latest version of the zipped ABCD 2.06 xml archive of each data package are provided via web-accessible storage (ongoing implementation). The URIs are included as links in the landing pages. At the SNSB these DIPs are archived locally as AIPs, together with the PyWrapper configuration data, by an automated process. This process runs every 24 hours and uses the "lastModified" date for naming the AIPs. Services of the Leibniz-Rechenzentrum (LRZ) München are used for long-term archiving of the AIPs. The AIPs at the LRZ are not accessible from outside. AIP versions are numbered by adding ISO time strings.
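
The naming and selection rules can be sketched as follows. The actual naming pattern used at the SNSB is not documented here, so the pattern `<package>_<ISO time string>.zip`, the ISO 8601 basic format and the function names `aip_name` and `web_accessible_versions` are assumptions made purely for illustration.

```python
from datetime import datetime


def aip_name(package_id, last_modified):
    """Derive an AIP file name from a package ID and its "lastModified" date."""
    # ISO 8601 basic format, e.g. 20171205T143000 (assumed pattern).
    stamp = last_modified.strftime("%Y%m%dT%H%M%S")
    return f"{package_id}_{stamp}.zip"


def web_accessible_versions(last_modified_dates):
    """Select the original and the latest version of a data package."""
    ordered = sorted(set(last_modified_dates))
    if len(ordered) <= 1:
        return ordered
    return [ordered[0], ordered[-1]]


# Hypothetical package ID and dates.
name = aip_name("ExamplePackage", datetime(2017, 12, 5, 14, 30, 0))
```

Because the time string sorts lexicographically in chronological order, AIPs of one data package line up by version when listed by name.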

Transformation of DIPs for data publication in GFBio data portal and VAT tool
In general, all seven data centers transform and publish DIPs in a similar way using local installations of the BioCASe Provider Software.
 * Access via BioCASe Local Query Tool, landing page
 * The SNSB data center organises the landing pages for data packages in two ways: (a) Large DIPs and DIPs with versioning have their own individually designed web page as landing page. This landing page will provide a link to an overview page with the current and some former versions of the data package (ABCD xml) (ongoing implementation); (b) DIPs without versioning are currently documented by the landing pages created by the BioCASe Provider Service. In future, this type of landing page, with DOI assignment, will be generated semi-automatically from a DiversityProjects export (ongoing implementation).


 * Access via BioCASe Monitor Service (BMS)
 * see General part: GFBio publication of type 1 data via BioCASe data pipelines


 * Citation of published dataset
 * The proposed citation string follows the scheme exemplified above; for details see General part: GFBio publication of type 1 data via BioCASe data pipelines


 * DOI assignment
 * The implementation of DOI assignment for research data without versioning is planned.


 * Indexing/harvesting by central GFBio indexing processes
 * see General part: GFBio publication of type 1 data via BioCASe data pipelines


 * Access via GFBio and VAT Data Portal
 * see General part: GFBio publication of type 1 data via BioCASe data pipelines