Introduction
biowdl-input-converter converts human-readable samplesheets into a format that can be easily processed by BioWDL pipelines.
For more information on BioWDL, see the documentation at https://biowdl.github.io.
Installation
- Create a new virtualenv.
- Run:
  pip install biowdl-input-converter
Usage
Parse samplesheets for BioWDL pipelines.
usage: biowdl-input-converter [-h] [-o OUTPUT] [--validate] [--old]
                              [--skip-file-check] [--check-file-md5sums]
                              samplesheet
Positional Arguments
samplesheet | The input samplesheet. The format is detected automatically. |
Named Arguments
-o, --output | The output file to which the JSON is written. Default: stdout |
--validate | Do not generate output; only validate the samplesheet. Default: False |
--old | Output old style JSON as used in BioWDL germline-DNA and RNA-seq version 1 pipelines. Default: False |
--skip-file-check | Skip checking whether the files in the samplesheet are present. Default: False |
--check-file-md5sums | Perform an md5sum check for reads that have md5sums in the samplesheet. Default: False |
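The flags above can also be assembled programmatically, for example when the converter is called from a Python script. The following is a minimal sketch and not part of biowdl-input-converter itself: `build_command` is a hypothetical helper and the file names are placeholders. It uses only the flags listed above.

```python
import shlex
from typing import List, Optional


def build_command(samplesheet: str,
                  output: Optional[str] = None,
                  validate: bool = False,
                  old: bool = False,
                  skip_file_check: bool = False,
                  check_file_md5sums: bool = False) -> List[str]:
    """Assemble an argument list from the documented command line flags.

    Hypothetical helper for illustration; it does not run the converter.
    """
    command = ["biowdl-input-converter"]
    if output is not None:
        command += ["-o", output]
    if validate:
        command.append("--validate")
    if old:
        command.append("--old")
    if skip_file_check:
        command.append("--skip-file-check")
    if check_file_md5sums:
        command.append("--check-file-md5sums")
    # The samplesheet is the single positional argument and goes last.
    command.append(samplesheet)
    return command


# Placeholder file names, chosen only for this example.
example = build_command("samplesheet.csv", output="inputs.json",
                        check_file_md5sums=True)
print(shlex.join(example))
```

Passing the resulting list to `subprocess.run(example, check=True)` would run the converter, assuming it is installed and available on the PATH.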
Samplesheet
A samplesheet provides the following information about fastq files:
- Sample name
- Library name (usually one library per sample is used to prepare the sample for sequencing)
- Readgroup name (which lane on the sequencer was used)
- Location of the fastq file containing forward reads (R1) on the filesystem
- Forward reads fastq (R1) md5sum
- Location of the fastq file containing reverse reads (R2) on the filesystem
- Reverse reads fastq (R2) md5sum
- Additional properties (if necessary)
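The md5sum fields in the list above hold the hexadecimal md5 digest of each fastq file. As a minimal sketch, such a value could be computed with Python's standard `hashlib` module. `file_md5sum` is a hypothetical helper, not a function provided by biowdl-input-converter.

```python
import hashlib


def file_md5sum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hexadecimal md5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large fastq files are not loaded
        # into memory all at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be placed in the R1_md5 or R2_md5 field of the samplesheet.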
CSV/TSV Format
A samplesheet can be a comma- or tab-delimited file. An example looks like this:
"sample","library","readgroup","R1","R1_md5","R2","R2_md5"
"s1","lib1","rg1","r1_1.fq","181a657e3f9c3cde2d3bb14ee7e894a3","r1_2.fq","ebe473b62926dcf6b38548851715820e"
"s2","lib1","rg1","r2_1.fq","7e79b87d95573b06ff2c5e49508e9dbf","r2_2.fq","dc2776dc3a07c4f468455bae1a8ff872"
The md5sums are optional and can be left out:
"sample","library","readgroup","R1","R1_md5","R2","R2_md5"
"s1","lib1","rg1","r1_1.fq",,"r1_2.fq",
"s2","lib1","rg1","r2_1.fq",,"r2_2.fq",
Additional properties at the sample level can be set using additional columns:
"sample","library","readgroup","R1","R1_md5","R2","R2_md5","HiSeq4000","other_property"
"s1","lib1","rg1","r1_1.fq",,"r1_2.fq",,"yes","pizza"
"s2","lib1","rg1","r2_1.fq",,"r2_2.fq",,"no","broccoli"
These files can be easily generated using a spreadsheet program (such as Microsoft Excel or LibreOffice Calc).
Create a table:
sample | library | readgroup | R1 | R1_md5 | R2 | R2_md5 | HiSeq4000 | other_property |
s1 | lib1 | rg1 | r1_1.fq | 181a657e3f9c3cde2d3bb14ee7e894a3 | r1_2.fq | | yes | pizza |
s2 | lib1 | rg1 | r2_1.fq | | r2_2.fq | dc2776dc3a07c4f468455bae1a8ff872 | no | |
Note
Optional fields can be left blank.
Then save the table in CSV or TSV format from your spreadsheet program.
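Besides a spreadsheet program, a samplesheet can also be written from a script. The following minimal sketch uses Python's standard `csv` module with the column names from the examples above. `samplesheet_csv` is a hypothetical helper, not part of biowdl-input-converter. It uses the module's default quoting, so fields are written without the surrounding quotes shown in the examples; a CSV reader treats both forms the same way.

```python
import csv
import io

# Column names taken from the samplesheet examples above.
COLUMNS = ["sample", "library", "readgroup", "R1", "R1_md5", "R2", "R2_md5"]


def samplesheet_csv(rows, extra_columns=()):
    """Render a list of row dictionaries as CSV text.

    Keys missing from a row (such as the optional md5sums) are written
    as empty fields. `extra_columns` names additional property columns.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=COLUMNS + list(extra_columns),
                            restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"sample": "s1", "library": "lib1", "readgroup": "rg1",
     "R1": "r1_1.fq", "R2": "r1_2.fq",
     "HiSeq4000": "yes", "other_property": "pizza"},
    {"sample": "s2", "library": "lib1", "readgroup": "rg1",
     "R1": "r2_1.fq", "R2": "r2_2.fq",
     "HiSeq4000": "no", "other_property": "broccoli"},
]
print(samplesheet_csv(rows, extra_columns=["HiSeq4000", "other_property"]),
      end="")
```

For a tab-delimited samplesheet, `delimiter="\t"` could be passed to `csv.DictWriter` instead.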
YAML format
Alternatively, a YAML format can be used:
samples:
  - id: s1
    libraries:
      - id: lib1
        readgroups:
          - id: rg1
            reads:
              R1: r1_1.fq
              R1_md5: 181a657e3f9c3cde2d3bb14ee7e894a3
              R2: r1_2.fq
              R2_md5: ebe473b62926dcf6b38548851715820e
  - id: s2
    libraries:
      - id: lib1
        readgroups:
          - id: rg1
            reads:
              R1: r2_1.fq
              R1_md5: 7e79b87d95573b06ff2c5e49508e9dbf
              R2: r2_2.fq
              R2_md5: dc2776dc3a07c4f468455bae1a8ff872
Optional fields can be omitted and extra properties can be added:
samples:
  - id: s1
    HiSeq4000: no
    libraries:
      - id: lib1
        readgroups:
          - id: rg1
            reads:
              R1: r1_1.fq
              R1_md5: 181a657e3f9c3cde2d3bb14ee7e894a3
              R2: r1_2.fq
  - id: s2
    HiSeq4000: yes
    libraries:
      - id: lib1
        readgroups:
          - id: rg1
            reads:
              R1: r2_1.fq
              R2: r2_2.fq
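The YAML layout nests readgroups inside libraries and libraries inside samples, whereas the CSV/TSV layout has one flat row per readgroup. The following sketch illustrates how flat rows could map onto the nested layout. It is not the converter's actual implementation, and `rows_to_nested` is a hypothetical name. Extra columns are attached at the sample level, as described in the CSV/TSV section, and blank optional fields are dropped. The result is printed as JSON to avoid a third-party YAML dependency.

```python
import json

ID_KEYS = ("sample", "library", "readgroup")
READ_KEYS = ("R1", "R1_md5", "R2", "R2_md5")


def rows_to_nested(rows):
    """Group flat samplesheet rows (dicts) into the nested layout."""
    samples = {}
    for row in rows:
        sample = samples.setdefault(
            row["sample"], {"id": row["sample"], "libraries": {}})
        # Columns that are neither identifiers nor read fields are treated
        # as sample-level properties; blank values are skipped.
        for key, value in row.items():
            if key not in ID_KEYS and key not in READ_KEYS and value:
                sample[key] = value
        library = sample["libraries"].setdefault(
            row["library"], {"id": row["library"], "readgroups": []})
        # Keep only the read fields that are actually filled in.
        reads = {key: row[key] for key in READ_KEYS if row.get(key)}
        library["readgroups"].append({"id": row["readgroup"], "reads": reads})
    # Replace the lookup dictionaries by plain lists, as in the YAML layout.
    for sample in samples.values():
        sample["libraries"] = list(sample["libraries"].values())
    return {"samples": list(samples.values())}


example_rows = [
    {"sample": "s1", "library": "lib1", "readgroup": "rg1",
     "R1": "r1_1.fq", "R1_md5": "181a657e3f9c3cde2d3bb14ee7e894a3",
     "R2": "r1_2.fq", "R2_md5": "", "HiSeq4000": "no"},
    {"sample": "s2", "library": "lib1", "readgroup": "rg1",
     "R1": "r2_1.fq", "R1_md5": "", "R2": "r2_2.fq", "R2_md5": "",
     "HiSeq4000": "yes"},
]
print(json.dumps(rows_to_nested(example_rows), indent=2))
```

Note that in this sketch the property values stay plain strings, while a YAML parser may read unquoted `yes` and `no` as booleans.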
Changelog
0.1.0
- Added documentation and readthedocs page
- Added changelog and release procedures
- Added test suite with coverage metrics, enabled CI
- Added validate flag to allow users to validate files
- Added command line interface with ability to write to stdout and files
- Added ability to check files for presence and md5sum checking
- Added sample group -> old style JSON/YAML conversion
- Added sample group -> new style JSON/YAML conversion
- Added yaml -> sample group conversion
- Reworked csv conversion by @DavyCats to fit the new sample group structure
- Added sample group structure to enable any-to-any conversions