Publication of Type 1 Data via BioCASe Data Pipelines at SMNS Data Center

From GFBio Public Wiki

The SMNS Data Center is one of the seven GFBio Collection Data Centers that form the backbone of the GFBio Submission, Repository and Archiving Infrastructure. Archiving and publishing type 1 data at SMNS involves management processes based on several Diversity Workbench modules (DC, DP, DTN, DA, DG, DR and DST). The management tools and archiving processes used at the GFBio data center SMNS are described under Technical Documentations. They include services for documenting, processing and regularly archiving the incoming original (meta)data sets and multimedia objects (source data; SIP), using DiversityProjects (DP) functionality to ingest metadata from the GFBio submission tool. Data producers are welcome to use the xls templates provided under Templates for data submission. SMNS uses DWB tools for data and metadata import, metadata enrichment and data quality control (see https://www.gfbio.org/data/tools).

A Diversity Workbench (DWB) tool with a GUI for organising data publication is used to transfer, filter and transform data and metadata for publication (DWB Video on Export --> Overview). The data transfer is partly automated.

Figure 1: The SMNS Workflow, BioCASe (Biological Collection Access Service) data pipelines for GFBio Type 1 Data.
ABCD - Access to Biological Collections Data schema (V2.06 within GFBio)
AIP - Archival Information Package
DIP - Dissemination Information Package
SIP - Submission Information Package
VAT - Visualizing and Analysing Tool
The workflow with these central components is illustrated in Figure 1; a detailed description follows in the text below.


Data pipeline - Provision of (versioned) DIPs

Data producers using the GFBio installation of the DWB database suite at the SMNS for (interim) data management have write and read access to their ingested datasets, at least for a certain period of time. This enables them to carry out data quality control and curation tasks.

  • Description of DWB workflow 1 applied for biodiversity and collection datasets without voucher deposition at SMNS
The data producers have to decide whether their datasets will be published (a) in the traditional way, i.e. without the option of DIP revision, so that the data publication process is run only once and results in a single DIP published via GFBio, with citation, optionally with DOI assignment, and with a zipped ABCD 2.06 xml archive, or (b) with the option of mid-term data management of the processed SIP, later data enrichment, dataset revision and amendment. With option (a), a new data revision has to be treated as a completely new, independent data publication that starts with a new GFBio submission process. Option (b) is a dynamic data publication: versions can be handled in the way that is regularly used for collection data records and for growing data packages with observation data (DIPs with version changes). All decisions regarding the envisaged data pipelines are documented in DP as part of the submission process.
  • Description of DWB workflow 2 applied for collection datasets with voucher specimens deposited at SMNS
In most aspects, workflow 2 is identical to workflow 1, but it includes a final step in which the data are replicated from the GFBio installation of the DWB database suite to the SMNS master installation of the DWB database suite. The pipeline is intended for medium-sized collection data assets and applies in two cases: (a) when data producers use existing SMNS collection material for their research and produce well-structured linked research data and multimedia objects, and (b) when they deposit their vouchers at the SMNS together with extended metadata and multimedia objects. As in workflow 1, the data producers are guided to use the GFBio installation of the DWB database suite at the SMNS for interim data management and have write and read access to their ingested datasets. With the owner’s permission the data are published as a DIP via GFBio, e.g. with citation, optionally with DOI assignment, and with a zipped ABCD 2.06 xml archive. Thereafter the data are replicated and kept versioned in the collection databases of the SMNS.
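The difference between publication options (a) and (b) can be summarised as a small decision sketch. The names used below (PublicationOption, handle_revision) are hypothetical and only illustrate the rule described above; they are not part of the DWB or of the GFBio submission tool.

```python
from enum import Enum


class PublicationOption(Enum):
    STATIC = "a"   # option (a): published once, no DIP revision
    DYNAMIC = "b"  # option (b): mid-term management, DIPs with version changes


def handle_revision(option: PublicationOption) -> str:
    """Describe what a later data revision triggers under each option."""
    if option is PublicationOption.STATIC:
        # A revision is an independent publication with its own submission.
        return "new GFBio submission"
    # Dynamic publication: the existing data package receives a new version.
    return "new version of the existing DIP"


for option in PublicationOption:
    print(option.name, "->", handle_revision(option))
```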

Export of DIPs with the SMNS in-house installation of DWB used in GFBio

The DIPs are created by data curators and data scientists at the SMNS using the DWB BioCASe data publication tool.

  • Citation
The citation in the respective ABCD 2.06 element results from aggregating and processing several DiversityProjects elements and element entries. The year and version given in the citation of the DIP are those of the DIP’s creation (DWB ABCD_package). Citation agents may have more than one role (author, publisher). The exact date on which the DIP was created as a zipped ABCD 2.06 xml archive (using a manual function of the BioCASe Provider Software) is the one shown in the BioCASe Provider Software as "lastmodified".
Example: Holstein, J. (2018). Semantische Anreicherung und Mobilisierung von Daten. [Dataset]. Version: 20180919. Data Publisher: Staatliches Museum für Naturkunde Stuttgart. http://col.smns-bw.org/SMNS-E-AraMob/About.html.
  • Date and time specification regarding data publications with version changes
The available ABCD 2.06 xml zip archive includes the date of the latest data export from the DWB master databases to the Microsoft SQL cache database (xml element: 'date modified') as well as the date of the last change of each single unit record in the DiversityCollection master database (xml element: 'last edited'). It also includes the citation with versioning as given in the example above. In rare cases the ABCD 2.06 xml zip archive may represent an earlier version of the DIP, with a citation that deviates from the one provided in parallel by the BioCASe Provider Software for dynamic access. (In contrast to GFBio, GBIF harvests the dynamic representation of datasets.) The reason is that the automated export from the DWB to the BioCASe Provider Software runs at certain intervals, whereas the zip files have to be created manually via the BioCASe Provider Software.
For data publications with version changes, the historical first year of data delivery is given on the description home page (landing page) of the dataset.
  • Licenses
The licenses for the data packages are those ingested in DP during the GFBio submission/ingestion process. The license for each multimedia object is handled separately and stored in DC together with the respective multimedia URI. GFBio promotes CC licenses; the preferred license for data packages is CC BY 4.0 and the preferred license for multimedia objects is CC BY-SA 4.0.
  • GFBio data and metadata created during submission
The metadata generated through the GFBio submission are stored in the GFBio JIRA ticket system. A connection via the JIRA API allows metadata ingest into DiversityProjects. Additional metadata and original research data are imported into the DWB RDMS via DWB ImportWizards. Additional parameters are assigned manually by the data producers and by SMNS data curators using special (web) services of DiversityProjects, which link to services of the GFBio Terminology Service.
All GFBio IDs as well as other external IDs on dataset and data unit level available so far (e.g. DOIs, GenBank accession numbers, BOLD numbers, MycoBank numbers, ORCID IDs, GFBio submission IDs, DSMZ strain numbers, IDs provided by other GFBio data centers for linked datasets) are stored in appropriate tables of the DWB installations. They will be published insofar as they are part of GFBio consensus documents.
The occurrence data are stored at two levels and two granularities: (a) at dataset level in DP (setting elements) and (b) at unit level in DC with DA, DTN, DZ etc. They will be published insofar as GFBio consensus documents make them mandatory or recommended.
  • Other (meta)data
Other (meta)data that are recommended or mandatory for export are stored in DP, DA or DC.
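The two kinds of dates described above under "Date and time specification" can be read from an ABCD xml document with a few lines of Python. This is a minimal sketch under stated assumptions: the element names DateModified and DateLastEdited are assumed to correspond to what the text calls 'date modified' and 'last edited' and should be checked against the ABCD 2.06 schema; the sample document is heavily simplified and all values in it are invented.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# ABCD 2.06 namespace; the element names used below are assumptions.
NS = {"abcd": "http://www.tdwg.org/schemas/abcd/2.06"}

# Heavily simplified, invented sample; a real archive has many more elements.
SAMPLE = """<abcd:DataSets xmlns:abcd="http://www.tdwg.org/schemas/abcd/2.06">
  <abcd:DataSet>
    <abcd:Metadata>
      <abcd:RevisionData>
        <abcd:DateModified>2018-09-19T02:00:00</abcd:DateModified>
      </abcd:RevisionData>
    </abcd:Metadata>
    <abcd:Units>
      <abcd:Unit>
        <abcd:UnitID>U1</abcd:UnitID>
        <abcd:DateLastEdited>2018-05-02T10:15:00</abcd:DateLastEdited>
      </abcd:Unit>
      <abcd:Unit>
        <abcd:UnitID>U2</abcd:UnitID>
        <abcd:DateLastEdited>2018-09-01T08:30:00</abcd:DateLastEdited>
      </abcd:Unit>
    </abcd:Units>
  </abcd:DataSet>
</abcd:DataSets>"""


def read_dates(xml_text):
    """Return (dataset 'date modified', {unit id: 'last edited'})."""
    root = ET.fromstring(xml_text)
    modified = datetime.fromisoformat(
        root.findtext(".//abcd:DateModified", namespaces=NS))
    edited = {
        unit.findtext("abcd:UnitID", namespaces=NS): datetime.fromisoformat(
            unit.findtext("abcd:DateLastEdited", namespaces=NS))
        for unit in root.iterfind(".//abcd:Unit", NS)
    }
    return modified, edited


modified, edited = read_dates(SAMPLE)
print("export to cache database:", modified.date())
print("latest unit edit:", max(edited.values()).date())
```

Comparing the dataset-level date of a zip archive with the one served dynamically is one way to notice the rare case in which the archive represents an earlier version of the DIP.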

Transformation of DIPs as AIPs for SMNS archiving system

  • Archiving of DIPs (e.g., starting with DIPs created by the BioCASe Provider Software)
All DIPs are created as zipped ABCD 2.06 xml archives (with internal date "lastModified") using a regular manual function of the BioCASe Provider Software. For each data package, at least the original (primary) version and the latest subsequent version of the zipped ABCD 2.06 xml archive are provided via web-accessible storage (ongoing implementation). The URIs are included as links in the landing pages. AIPs containing the metadata, the zipped ABCD 2.06 xml archive and the packaging information are stored provisionally in-house. The State Museum Württemberg has applied for funding to start a project called LAZARMUS. The aim of this project is to develop and evaluate a concept for long-term archiving of museum-related data at either the State Archive in Stuttgart or the Rechenzentrum of the University Tübingen. Once a concept has been developed and established, AIPs resulting from GFBio activities will be stored at one of these institutions.
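The following sketch illustrates two points from the paragraph above: selecting the original and the latest version of a data package for provision, and recording packaging information for an AIP. It is not the SMNS implementation. The function names, the JSON layout and the use of a SHA-256 checksum are assumptions made for illustration, as is the assumption that versions are stamped as YYYYMMDD (as in the citation example).

```python
import hashlib
import json


def versions_to_provide(versions):
    """Pick the original and the latest version from YYYYMMDD stamps."""
    ordered = sorted(set(versions))  # YYYYMMDD strings sort chronologically
    if len(ordered) <= 1:
        return ordered
    return [ordered[0], ordered[-1]]


def packaging_info(package_id, version, archive_name, archive_bytes):
    """Build hypothetical packaging information for one AIP as JSON."""
    record = {
        "package_id": package_id,
        "version": version,
        "archive": archive_name,
        "size_bytes": len(archive_bytes),
        # A checksum allows later integrity checks (an assumption here).
        "sha256": hashlib.sha256(archive_bytes).hexdigest(),
    }
    return json.dumps(record, indent=2, sort_keys=True)


print(versions_to_provide(["20190301", "20180919", "20200115"]))
print(packaging_info("example-package", "20180919",
                     "example-package_20180919.zip", b"dummy archive content"))
```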

Transformation of DIPs for data publication in GFBio data portal and VAT tool

In general, all seven data centers transform and publish DIPs in a similar way using local installations of the BioCASe Provider Software.

  • Access via BioCASe Local Query Tool, landing page
The SMNS data center organises the description of landing pages for data packages in two ways: (a) Large DIPs and DIPs with versioning have their own individually designed web page as landing page. This landing page will provide a link to an overview page with the current and some former versions of the data package (ABCD xml) (ongoing implementation). (b) DIPs without versioning are currently documented by the landing pages created by the BioCASe Provider Service. In future, this type of landing page, with DOI assignment, will be generated semi-automatically from a DiversityProjects export (ongoing implementation).
  • Access via BioCASe Monitor Service (BMS)
see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Citation of published dataset
The proposed citation string follows the scheme exemplified above; for details see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • DOI assignment
SMNS is registered at ZB MED and can create unique DOIs for each data package. The DOI is created at DataCite DOI Fabrica, annotated to the corresponding version of the information package and stored in appropriate tables of DiversityProjects. It is also part of the citation of the dataset.
  • Indexing/harvesting by central GFBio indexing processes
see General part: GFBio publication of type 1 data via BioCASe data pipelines
  • Access via GFBio and VAT Data Portal
see General part: GFBio publication of type 1 data via BioCASe data pipelines
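The citation scheme referred to under "Citation of published dataset" can be illustrated with a short sketch that reproduces the Holstein (2018) example given earlier. The function name and its parameters are hypothetical, and the rule that the version string is the DIP creation date written as YYYYMMDD is an assumption inferred from that example; in practice the citation is aggregated from DiversityProjects entries.

```python
from datetime import date


def build_citation(authors, title, created, publisher, url):
    """Compose a dataset citation string following the example scheme."""
    # Assumption: the version is the creation date formatted as YYYYMMDD.
    version = created.strftime("%Y%m%d")
    return (f"{authors} ({created.year}). {title}. [Dataset]. "
            f"Version: {version}. Data Publisher: {publisher}. {url}.")


print(build_citation(
    authors="Holstein, J.",
    title="Semantische Anreicherung und Mobilisierung von Daten",
    created=date(2018, 9, 19),
    publisher="Staatliches Museum für Naturkunde Stuttgart",
    url="http://col.smns-bw.org/SMNS-E-AraMob/About.html",
))
```

Where a DOI has been assigned it also becomes part of the citation; its position in the string is not specified here, so the sketch leaves it out.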