Introduction

biowdl-input-converter converts human-readable samplesheets into a format that can be easily processed by BioWDL pipelines.

For more information on BioWDL, see the documentation at https://biowdl.github.io.

Installation

  • Create a new virtualenv
  • Run pip install biowdl-input-converter

Usage

Parse samplesheets for BioWDL pipelines.

usage: biowdl-input-converter [-h] [-o OUTPUT] [--validate] [--old]
                              [--skip-file-check] [--check-file-md5sums]
                              samplesheet

Positional Arguments

samplesheet The input samplesheet. Format will be automatically detected.

Named Arguments

-o, --output The output file to which the JSON is written. Default: stdout

--validate Do not generate output; only validate the samplesheet. Default: False

--old Output old-style JSON as used in the BioWDL germline-DNA and RNA-seq version 1 pipelines. Default: False

--skip-file-check Skip checking whether the files in the samplesheet are present. Default: False

--check-file-md5sums Perform an md5sum check on reads that have md5sums in the samplesheet. Default: False
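
When the converter is called from a script, the flags described above can be assembled programmatically. The following is a minimal sketch, not part of the package: build_command is a hypothetical helper, and the file names in the usage note are placeholders. Only the flags listed in the usage text are used.

```python
from typing import List, Optional


def build_command(samplesheet: str,
                  output: Optional[str] = None,
                  validate: bool = False,
                  old: bool = False,
                  skip_file_check: bool = False,
                  check_file_md5sums: bool = False) -> List[str]:
    """Assemble a biowdl-input-converter command line (hypothetical helper)."""
    command = ["biowdl-input-converter"]
    if output is not None:
        command += ["-o", output]
    if validate:
        command.append("--validate")
    if old:
        command.append("--old")
    if skip_file_check:
        command.append("--skip-file-check")
    if check_file_md5sums:
        command.append("--check-file-md5sums")
    # The samplesheet is the single positional argument and goes last
    command.append(samplesheet)
    return command
```

For example, build_command("samplesheet.csv", output="inputs.json") yields a list that could be passed to subprocess.run(command, check=True), provided the tool is installed and on the PATH.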

Samplesheet

A samplesheet provides the following information about fastq files:

  • Sample name
  • Library name (for each sample usually one library is used to prepare the sample for sequencing)
  • Readgroup name (which lane on the sequencer was used)
  • Location of the fastq file containing forward reads (R1) on the filesystem
  • Forward reads fastq (R1) md5sum
  • Location of the fastq file containing reverse reads (R2) on the filesystem
  • Reverse reads fastq (R2) md5sum
  • Additional properties (if necessary)
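
The md5sum fields in the list above can be filled in with any MD5 tool. As an illustration, here is a small Python sketch using only the standard library. It assumes that the md5sum refers to the fastq file exactly as it is stored on disk; md5sum is a hypothetical helper name, not a function of this package.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large fastq files do not need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string has the same 32-character hexadecimal form as the R1_md5 and R2_md5 values shown in the examples.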

CSV/TSV Format

A samplesheet can be a comma- or tab-delimited file. An example looks like this:

"sample","library","readgroup","R1","R1_md5","R2","R2_md5"
"s1","lib1","rg1","r1_1.fq","181a657e3f9c3cde2d3bb14ee7e894a3","r1_2.fq","ebe473b62926dcf6b38548851715820e"
"s2","lib1","rg1","r2_1.fq","7e79b87d95573b06ff2c5e49508e9dbf","r2_2.fq","dc2776dc3a07c4f468455bae1a8ff872"

The md5sums are optional and can be left out:

"sample","library","readgroup","R1","R1_md5","R2","R2_md5"
"s1","lib1","rg1","r1_1.fq",,"r1_2.fq",
"s2","lib1","rg1","r2_1.fq",,"r2_2.fq",

Additional properties at the sample level can be set using additional columns:

"sample","library","readgroup","R1","R1_md5","R2","R2_md5","HiSeq4000","other_property"
"s1","lib1","rg1","r1_1.fq",,"r1_2.fq",,"yes","pizza"
"s2","lib1","rg1","r2_1.fq",,"r2_2.fq",,"no","broccoli"

These files can be easily generated using a spreadsheet program (such as Microsoft Excel or LibreOffice Calc).

Create a table:

sample  library  readgroup  R1       R1_md5                            R2       R2_md5                            HiSeq4000  other_property
s1      lib1     rg1        r1_1.fq  181a657e3f9c3cde2d3bb14ee7e894a3  r1_2.fq                                    yes        pizza
s2      lib1     rg1        r2_1.fq                                    r2_2.fq  dc2776dc3a07c4f468455bae1a8ff872  no

Note

Optional fields can be left blank.

Then save the table in CSV or TSV format from your spreadsheet program.
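
A samplesheet can also be written from a script instead of a spreadsheet. The sketch below uses Python's csv module and is an illustration only: samplesheet_csv and COLUMNS are hypothetical names. The writer uses default (minimal) quoting, so values are not wrapped in double quotes as in the examples above. This is equivalent CSV, but the source does not state whether the converter treats quoted and unquoted values differently.

```python
import csv
import io

# Column names as used in the CSV examples in this section
COLUMNS = ["sample", "library", "readgroup", "R1", "R1_md5", "R2", "R2_md5"]


def samplesheet_csv(rows, extra_columns=()):
    """Render row dictionaries as CSV text; missing optional fields become empty cells."""
    fieldnames = COLUMNS + list(extra_columns)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"sample": "s1", "library": "lib1", "readgroup": "rg1",
     "R1": "r1_1.fq", "R2": "r1_2.fq", "HiSeq4000": "yes"},
    {"sample": "s2", "library": "lib1", "readgroup": "rg1",
     "R1": "r2_1.fq", "R2": "r2_2.fq", "HiSeq4000": "no"},
]
text = samplesheet_csv(rows, extra_columns=["HiSeq4000"])
print(text)
```

Passing delimiter="\t" to csv.DictWriter would produce the tab-delimited variant.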

YAML format

Alternatively, a YAML format can be used:

samples:
    - id: s1
      libraries:
        - id: lib1
          readgroups:
            - id: rg1
              reads:
                R1: r1_1.fq
                R1_md5: 181a657e3f9c3cde2d3bb14ee7e894a3
                R2: r1_2.fq
                R2_md5: ebe473b62926dcf6b38548851715820e
    - id: s2
      libraries:
        - id: lib1
          readgroups:
            - id: rg1
              reads:
                R1: r2_1.fq
                R1_md5: 7e79b87d95573b06ff2c5e49508e9dbf
                R2: r2_2.fq
                R2_md5: dc2776dc3a07c4f468455bae1a8ff872

Optional fields can be omitted and extra properties can be added:

samples:
    - id: s1
      HiSeq4000: no
      libraries:
        - id: lib1
          readgroups:
            - id: rg1
              reads:
                R1: r1_1.fq
                R1_md5: 181a657e3f9c3cde2d3bb14ee7e894a3
                R2: r1_2.fq
    - id: s2
      HiSeq4000: yes
      libraries:
        - id: lib1
          readgroups:
            - id: rg1
              reads:
                R1: r2_1.fq
                R2: r2_2.fq
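
To make the relationship between the flat CSV rows and the nested YAML layout concrete, the following sketch groups rows into samples, libraries and readgroups. It is an illustration under stated assumptions and not the converter's actual implementation: nest_rows is a hypothetical helper, extra sample-level properties are ignored, and the result is printed as JSON only to stay within the Python standard library.

```python
import json


def nest_rows(rows):
    """Group flat samplesheet rows into samples -> libraries -> readgroups -> reads."""
    samples = {}
    for row in rows:
        sample = samples.setdefault(
            row["sample"], {"id": row["sample"], "libraries": {}})
        library = sample["libraries"].setdefault(
            row["library"], {"id": row["library"], "readgroups": []})
        # Keep only the read fields that are actually filled in
        reads = {key: row[key]
                 for key in ("R1", "R1_md5", "R2", "R2_md5") if row.get(key)}
        library["readgroups"].append({"id": row["readgroup"], "reads": reads})
    # Turn the lookup dictionaries into the lists used in the YAML layout
    return {"samples": [
        {"id": s["id"], "libraries": list(s["libraries"].values())}
        for s in samples.values()
    ]}


rows = [
    {"sample": "s1", "library": "lib1", "readgroup": "rg1",
     "R1": "r1_1.fq", "R1_md5": "181a657e3f9c3cde2d3bb14ee7e894a3",
     "R2": "r1_2.fq", "R2_md5": ""},
    {"sample": "s2", "library": "lib1", "readgroup": "rg1",
     "R1": "r2_1.fq", "R1_md5": "", "R2": "r2_2.fq", "R2_md5": ""},
]
nested = nest_rows(rows)
print(json.dumps(nested, indent=2))
```

Empty md5sum cells are dropped from the reads mapping, which mirrors how optional fields are omitted in the second YAML example.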

Changelog

0.1.0

  • Added documentation and readthedocs page
  • Added changelog and release procedures
  • Added test suite with coverage metrics, enabled CI
  • Added validate flag to allow users to validate files
  • Added command line interface with ability to write to stdout and files
  • Added ability to check files for presence and to verify md5sums
  • Added sample group -> old style JSON/YAML conversion
  • Added sample group -> new style JSON/YAML conversion
  • Added yaml -> sample group conversion
  • Reworked csv conversion by @DavyCats to fit the new sample group structure
  • Added sample group structure to enable any-to-any conversions