Introduction¶
biowdl-input-converter converts human-readable samplesheets into a format that can be easily processed by BioWDL pipelines.
For more information on BioWDL, see the documentation at https://biowdl.github.io.
Installation¶
- Create a new virtualenv.
- Run:
pip install biowdl-input-converter
Usage¶
Parse samplesheets for BioWDL pipelines.
usage: biowdl-input-converter [-h] [-f FORMAT] [-o OUTPUT] [--validate]
                              [--old] [--skip-file-check]
                              [--skip-duplicate-check] [--check-file-md5sums]
                              samplesheet
Positional Arguments¶
samplesheet | The input samplesheet. The format is detected automatically from the file suffix if the --format argument is not provided. |
Named Arguments¶
-f, --format | The input samplesheet format: tsv, csv, json, or yaml. |
-o, --output | The output file to which the JSON is written. Default: stdout |
--validate | Do not generate output; only validate the samplesheet. Default: False |
--old | Output old-style JSON as used in the BioWDL germline-DNA and RNA-seq version 1 pipelines. Default: False |
--skip-file-check | Skip checking whether the files in the samplesheet are present. Default: True |
--skip-duplicate-check | Skip the checks for duplicate files in the samplesheet. Default: True |
--check-file-md5sums | Do an md5sum check for reads that have md5sums in the samplesheet. Default: False |
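As a rough illustration, the documented options can be assembled into a command line from Python. The helper name build_command and the file names are hypothetical and not part of the tool; only the flags themselves come from the usage text above.

```python
import shlex
from typing import List, Optional


def build_command(samplesheet: str,
                  output: Optional[str] = None,
                  fmt: Optional[str] = None,
                  validate: bool = False) -> List[str]:
    """Assemble a biowdl-input-converter call from the documented flags.

    build_command is a hypothetical helper for illustration only.
    """
    command = ["biowdl-input-converter"]
    if fmt is not None:
        command += ["--format", fmt]
    if output is not None:
        command += ["--output", output]
    if validate:
        command.append("--validate")
    # The samplesheet is the single positional argument and goes last.
    command.append(samplesheet)
    return command


if __name__ == "__main__":
    # Hypothetical file names. Passing the list to
    # subprocess.run(command, check=True) would execute it,
    # provided the tool is installed.
    command = build_command("samplesheet.csv", output="inputs.json")
    print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.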
Samplesheet¶
A samplesheet provides the following information about fastq files:
- Sample name
- Library name (for each sample usually one library is used to prepare the sample for sequencing)
- Readgroup name (which lane on the sequencer was used)
- Location of the fastq file containing forward reads (R1) on the filesystem
- Forward reads fastq (R1) md5sum
- Location of the fastq file containing reverse reads (R2) on the filesystem
- Reverse reads fastq (R2) md5sum
- Additional properties (if necessary)
CSV/TSV Format¶
A samplesheet can be a comma- or tab-delimited file. An example looks like this:
"sample","library","readgroup","R1","R1_md5","R2","R2_md5"
"s1","lib1","rg1","r1_1.fq","181a657e3f9c3cde2d3bb14ee7e894a3","r1_2.fq","ebe473b62926dcf6b38548851715820e"
"s2","lib1","rg1","r2_1.fq","7e79b87d95573b06ff2c5e49508e9dbf","r2_2.fq","dc2776dc3a07c4f468455bae1a8ff872"
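The md5sum values in the R1_md5 and R2_md5 columns are hexadecimal md5 digests of the fastq files. A minimal sketch of how such a value could be computed with the Python standard library follows; the function name md5sum is chosen here for illustration and is not taken from the tool.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hexadecimal md5 digest of a file.

    The file is read in chunks so that large fastq files do not have to
    fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be pasted into the matching md5 column of the samplesheet.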
The md5sum fields and the R2 field are optional and can be empty:
"sample","library","readgroup","R1","R1_md5","R2","R2_md5"
"s1","lib1","rg1","r1_1.fq",,"r1_2.fq",
"s2","lib1","rg1","r2_1.fq",,"r2_2.fq",
The R1_md5, R2 and R2_md5 columns are optional and can be left out entirely:
"sample","library","readgroup","R1"
"s1","lib1","rg1","r1_1.fq"
"s2","lib1","rg1","r2_1.fq"
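To make the two kinds of optionality concrete, here is a sketch of a reader that treats an absent column and an empty cell the same way. It is an assumption about how such a file could be read with the standard csv module, not the tool's own parser; read_rows and OPTIONAL_COLUMNS are hypothetical names.

```python
import csv
import io
from typing import Dict, List, Optional

# Columns that may be empty or left out, per the text above.
OPTIONAL_COLUMNS = ("R1_md5", "R2", "R2_md5")


def read_rows(text: str,
              delimiter: str = ",") -> List[Dict[str, Optional[str]]]:
    """Parse samplesheet text; absent or empty optional fields become None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        row = dict(record)
        for column in OPTIONAL_COLUMNS:
            # DictReader has no key for an absent column and yields ""
            # for an empty cell; both are normalised to None here.
            row[column] = record.get(column) or None
        rows.append(row)
    return rows
```

Passing delimiter="\t" reads the tab-delimited variant in the same way.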
Additional properties at the sample level can be set using additional columns:
"sample","library","readgroup","R1","R1_md5","R2","R2_md5","HiSeq4000","other_property"
"s1","lib1","rg1","r1_1.fq",,"r1_2.fq",,"yes","pizza"
"s2","lib1","rg1","r2_1.fq",,"r2_2.fq",,"no","broccoli"
Additional properties for the same sample only have to be defined on one line. This avoids a lot of duplication for samples with many readgroups or libraries and makes the file easier to read.
"sample","library","readgroup","R1","R1_md5","R2","R2_md5","HiSeq4000","other_property"
"s1","lib1","rg1","r1_1.fq",,"r1_2.fq",,"yes","pizza"
"s1","lib1","rg2","r1_1.fq",,"r1_2.fq",,,
"s1","lib2","rg1","r1_1.fq",,"r1_2.fq",,,
"s2","lib1","rg1","r2_1.fq",,"r2_2.fq",,"no","broccoli"
"s2","lib1","rg2","r2_1.fq",,"r2_2.fq",,,
"s2","lib1","rg3","r2_1.fq",,"r2_2.fq",,,
If an additional column contains two conflicting values for the same sample, an error is raised.
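The merging rule described above can be sketched as follows. This is an illustration of the stated behaviour, not the tool's implementation; the function name and the use of ValueError are assumptions, since the source does not say which error is raised.

```python
from typing import Dict, Iterable, Optional, Tuple


def merge_sample_properties(
        rows: Iterable[Tuple[str, Dict[str, Optional[str]]]]
) -> Dict[str, Dict[str, str]]:
    """Collect additional properties per sample.

    rows yields (sample_name, properties) pairs, one per samplesheet line.
    Blank values ("" or None) are ignored, so a property only needs to be
    filled in on one line. Conflicting values raise ValueError (assumed
    error type).
    """
    merged: Dict[str, Dict[str, str]] = {}
    for sample, properties in rows:
        sample_properties = merged.setdefault(sample, {})
        for key, value in properties.items():
            if not value:
                continue  # Property left blank on this line.
            # setdefault stores the first value seen and returns it afterwards.
            previous = sample_properties.setdefault(key, value)
            if previous != value:
                raise ValueError(
                    f"Sample {sample!r} has conflicting values for {key!r}: "
                    f"{previous!r} and {value!r}")
    return merged
```

Repeating the same value on several lines is accepted in this sketch; only two different non-empty values for one sample count as a conflict.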
Creating comma-delimited files¶
These files can be easily generated using a spreadsheet program (such as Microsoft Excel or LibreOffice Calc).
Create a table:
sample | library | readgroup | R1 | R1_md5 | R2 | R2_md5 | HiSeq4000 | other_property |
s1 | lib1 | rg1 | r1_1.fq | 181a657e3f9c3cde2d3bb14ee7e894a3 | r1_2.fq | | yes | pizza |
s2 | lib1 | rg1 | r2_1.fq | | r2_2.fq | dc2776dc3a07c4f468455bae1a8ff872 | no | |
Note
Optional fields can be left blank.
Then save the table in CSV or TSV format from your spreadsheet program.
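As an alternative to a spreadsheet program, a file with the same layout can be written from a script. The sketch below uses the standard csv module; write_samplesheet and HEADER are hypothetical names, and extra property columns would have to be appended to HEADER. Blank optional fields are written as "", which CSV readers treat as an empty field.

```python
import csv
import io

# Column order taken from the examples in this section.
HEADER = ["sample", "library", "readgroup", "R1", "R1_md5", "R2", "R2_md5"]


def write_samplesheet(rows, handle, delimiter=","):
    """Write rows (dicts keyed by HEADER names) as a quoted samplesheet.

    Keys missing from a row are written as empty fields.
    """
    writer = csv.DictWriter(handle, fieldnames=HEADER, delimiter=delimiter,
                            quoting=csv.QUOTE_ALL, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_samplesheet(
        [{"sample": "s1", "library": "lib1", "readgroup": "rg1",
          "R1": "r1_1.fq", "R2": "r1_2.fq"}],
        buffer)
    print(buffer.getvalue(), end="")
```

To write to disk instead of a buffer, pass a file opened with open(path, "w", newline="") as the handle.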
YAML format¶
Alternatively, a YAML format can be used:
samples:
  - id: s1
    libraries:
      - id: lib1
        readgroups:
          - id: rg1
            reads:
              R1: r1_1.fq
              R1_md5: 181a657e3f9c3cde2d3bb14ee7e894a3
              R2: r1_2.fq
              R2_md5: ebe473b62926dcf6b38548851715820e
  - id: s2
    libraries:
      - id: lib1
        readgroups:
          - id: rg1
            reads:
              R1: r2_1.fq
              R1_md5: 7e79b87d95573b06ff2c5e49508e9dbf
              R2: r2_2.fq
              R2_md5: dc2776dc3a07c4f468455bae1a8ff872
Optional fields can be omitted and extra properties can be added:
samples:
  - id: s1
    HiSeq4000: no
    libraries:
      - id: lib1
        readgroups:
          - id: rg1
            reads:
              R1: r1_1.fq
              R1_md5: 181a657e3f9c3cde2d3bb14ee7e894a3
              R2: r1_2.fq
  - id: s2
    HiSeq4000: yes
    libraries:
      - id: lib1
        readgroups:
          - id: rg1
            reads:
              R1: r2_1.fq
              R2: r2_2.fq
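The YAML layout nests readgroups inside libraries inside samples, whereas the CSV/TSV layout has one flat row per readgroup. The following sketch shows how flat rows could be grouped into the same nesting. It only mirrors the structure of the YAML examples above; nest_rows is a hypothetical helper, additional sample-level properties are not handled, and the result is not claimed to match the tool's output.

```python
import json
from typing import Dict, List, Optional

READ_KEYS = ("R1", "R1_md5", "R2", "R2_md5")


def nest_rows(rows: List[Dict[str, Optional[str]]]) -> dict:
    """Group flat rows into samples -> libraries -> readgroups -> reads."""
    # sample id -> library id -> readgroup id -> readgroup entry
    samples: Dict[str, Dict[str, Dict[str, dict]]] = {}
    for row in rows:
        libraries = samples.setdefault(row["sample"], {})
        readgroups = libraries.setdefault(row["library"], {})
        # Keep only the read fields that are actually filled in.
        reads = {key: row[key] for key in READ_KEYS if row.get(key)}
        readgroups[row["readgroup"]] = {"id": row["readgroup"],
                                        "reads": reads}
    return {"samples": [
        {"id": sample, "libraries": [
            {"id": library, "readgroups": list(readgroups.values())}
            for library, readgroups in libraries.items()]}
        for sample, libraries in samples.items()]}


if __name__ == "__main__":
    example = [{"sample": "s1", "library": "lib1", "readgroup": "rg1",
                "R1": "r1_1.fq", "R2": "r1_2.fq"}]
    print(json.dumps(nest_rows(example), indent=2))
```

Dictionaries keep insertion order, so samples, libraries and readgroups appear in the order in which they first occur in the rows.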
Changelog¶
0.3.0¶
- Added option to specify the samplesheet file format explicitly.
- The tool now also checks for duplicated paths in the samplesheet to prevent copy-paste errors.
- Added testing for Python 3.8 and 3.9.
0.2.0¶
- Make sure only one line of additional properties per sample is needed in a csv file.
- Fix a bug where an empty field for an additional property in a csv samplesheet would be defined as "" instead of None.
0.1.0¶
- Added documentation and readthedocs page
- Added changelog and release procedures
- Added test suite with coverage metrics, enabled CI
- Add validate flag to allow users to validate files
- Added command line interface with ability to write to stdout and files
- Added ability to check files for presence and md5sum checking
- Added sample group -> old style JSON/YAML conversion
- Added sample group -> new style JSON/YAML conversion
- Added yaml -> sample group conversion
- Reworked csv conversion by @DavyCats to fit the new sample group structure
- Added sample group structure to enable any-to-any conversions