An Example Recipe in NGS
Here we show a real 'recipe' in NGS (Next Generation Sequencing). If you are not into this area, don't worry: we won't touch too much of it. This page is only meant to illustrate in detail how to use Fourtytwo.
Our task description:
- We have a bunch of sequence files in Fastq.gz format. This is our starting point.
- We have 2 programs to run, the first called BWA, the second called SAMTOOLS. BWA needs to be called twice, so in total we have 3 tasks.
- We will get a bunch of files in BAM format. This is our end point.
Let's look directly at how you configure a 'task'. You can do this in a web page form, or fire T-SQL if you prefer to get down to the metal.
ID             1
ExePath        <xdrive>bow\bwa.exe
ExeArgs        aln -q 15 -t 4 -f <o:<nb>_<no>.sai> <i:<xdrive>hs37d5.fa.gz> <i:<n>>
Memory         5000
Thread         4
FileType       fastq_gz
FileCount      1
Container_in   raw-10
Container_out  sai-10
IsEnabled      1
This is our first task. Each parameter name is followed by the value that you should provide. We now explain each one:
- ID. This is an internal value that would normally be hidden from you. It is listed here to emphasize one concept that is key to Fourtytwo: we DON'T designate task order. Fourtytwo does not work like a butler; it is File-Driven, and each running instance of your deployment works independently, like a lone commando. It checks the environment (task types, file list) regularly; once it finds a match, it fetches the file and runs the task.
- ExePath. The path to your program, in our case bwa.exe, under a subfolder named bow. Note the <xdrive> here: at runtime, everything enclosed in <> is replaced by a value, and xdrive is one of our reserved variable names. So at runtime, <xdrive>bow\bwa.exe becomes F:\bow\bwa.exe. You can read more about reserved variable names and xdrive in the Glossary.
- ExeArgs. The arguments for your program. Here <n> is a new reserved name; it means the full name of the fetched file. <nb> and <no> are not reserved names; they are variable names defined by you, which we'll explain later. You might have noticed <o:> and <i:> wrapped around variables: these mark a file name that needs to be either downloaded from (<i:>) or uploaded to (<o:>) a Windows Azure Storage Container (check the Glossary).
- Memory. The minimum free memory, in MB, that the program requires; only an instance with enough free memory can pick up this type of task.
- Thread. The minimum number of free threads/CPU cores that the program requires; more in the Glossary.
- FileType. The file type for this task/program; more to follow.
- FileCount. How many files the program expects; here bwa.exe wants just one fastq.gz file.
- Container_in. The container in which to find the input file (wrapped by <i:>).
- Container_out. The container to which the program's output file (wrapped by <o:>) is uploaded.
- IsEnabled. Allows you to disable/enable a task. A disabled task won't be picked up by any instance.
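To make the File-Driven idea a bit more concrete, here is a minimal Python sketch of the check an instance might perform before picking up a task. It is not Fourtytwo's actual code: the TaskType class, the find_match function, the waiting_files dictionary and the "first match wins" rule are all assumptions made for illustration, and the file names are sample values.

```python
from dataclasses import dataclass


@dataclass
class TaskType:
    """Hypothetical mirror of a few task parameters described above."""
    id: int
    memory: int       # minimum free memory in MB
    thread: int       # minimum free threads/CPU cores
    file_type: str
    file_count: int
    is_enabled: bool


def find_match(tasks, waiting_files, free_memory, free_threads):
    """Return (task, files) for the first task this instance can run, else None.

    waiting_files maps a file type name to the names of files of that type
    that are waiting to be processed (an assumed data structure).
    """
    for task in tasks:
        if not task.is_enabled:
            continue  # disabled tasks are never picked up
        if free_memory < task.memory or free_threads < task.thread:
            continue  # this instance does not have enough free resources
        files = waiting_files.get(task.file_type, [])
        if len(files) >= task.file_count:
            return task, files[:task.file_count]
    return None


tasks = [TaskType(1, 5000, 4, "fastq_gz", 1, True)]
waiting = {"fastq_gz": ["2000_1.fastq.gz", "2000_2.fastq.gz"]}
match = find_match(tasks, waiting, free_memory=7000, free_threads=4)
```

With 7000 MB and 4 threads free, the sketch returns task 1 together with one fastq.gz file; with only 2 free threads it returns None, because task 1 asks for 4.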
Now for the next two tasks. This time we show them in an XML-like format (no need to be scared; just look at the text between the tags). We'll explain the new things in detail.
<TaskType>
  <ID>2</ID>
  <ExePath>bow\bwa-mt.exe</ExePath>
  <ExeArgs>sampe -a 1431 -t 4 -P -f <o:<nb>.sam> hs37d5.fa.gz <i:sai-10:<n:1>> <i:sai-10:<n:2>> <i:raw-10:<nb>_<no:1>.fq.gz> <i:raw-10:<nb>_<no:2>.fq.gz></ExeArgs>
  <ExeExtras><i:fourtytwovhd:bow/bwa-mt.exe> <i:fourtytwovhd:hs37d5.fa.gz> <i:fourtytwovhd:hs37d5.fa.gz.amb> <i:fourtytwovhd:hs37d5.fa.gz.ann> <i:fourtytwovhd:hs37d5.fa.gz.bwt> <i:fourtytwovhd:hs37d5.fa.gz.fai> <i:fourtytwovhd:hs37d5.fa.gz.pac> <i:fourtytwovhd:hs37d5.fa.gz.sa></ExeExtras>
  <Memory>5000</Memory>
  <Thread>4</Thread>
  <FileType>sai</FileType>
  <FileCount>2</FileCount>
  <Container_in>sai-10</Container_in>
  <Container_out>sam-10</Container_out>
  <IsEnabled>1</IsEnabled>
</TaskType>
- The program to run is bwa-mt.exe, a version of bwa with improved multithreading. Note that there is no <xdrive> in the path; this means the path is relative to each instance's local 'scratch' folder.
- There is new syntax in the ExeArgs. In <i:sai-10:>, the 'sai-10' between the two colons is a container name; it means download this file from that container rather than from Container_in. In <n:1>, the :1 after our reserved variable n means the full name of the first file.
- ExeExtras. This is null for task 1 but not here. You can use <i:> and <o:> here to specify extra downloads and uploads that do not belong in ExeArgs. Downloads happen before the program runs; uploads happen after it finishes. In this case we download the program itself from a container called 'fourtytwovhd' with a relative path of bow\, so bwa-mt.exe is saved to the 'scratch' folder under \bow. We also download 7 data files from the same container; these are saved to the 'scratch' folder without a relative path.
- Here FileCount is 2, meaning bwa-mt.exe wants two files as input. That is why we have <n:1> and <n:2>, which simply mean the full names of file 1 and file 2.
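If it helps to see the <i:container:name> syntax taken apart, the following Python sketch splits the ExeExtras value of task 2 into download and upload lists. The TOKEN pattern and the parse_extras function are our own illustration, not Fourtytwo's parser, and they only handle the explicit-container form used in ExeExtras above (no nested variables).

```python
import re

# ExeExtras of task 2, copied from the definition above.
EXE_EXTRAS = (
    "<i:fourtytwovhd:bow/bwa-mt.exe> <i:fourtytwovhd:hs37d5.fa.gz> "
    "<i:fourtytwovhd:hs37d5.fa.gz.amb> <i:fourtytwovhd:hs37d5.fa.gz.ann> "
    "<i:fourtytwovhd:hs37d5.fa.gz.bwt> <i:fourtytwovhd:hs37d5.fa.gz.fai> "
    "<i:fourtytwovhd:hs37d5.fa.gz.pac> <i:fourtytwovhd:hs37d5.fa.gz.sa>"
)

# Assumed token shape: <i:container:name> or <o:container:name>.
TOKEN = re.compile(r"<([io]):([^:<>]+):([^<>]+)>")


def parse_extras(extras):
    """Split an ExeExtras string into (container, name) download and upload lists."""
    downloads, uploads = [], []
    for direction, container, name in TOKEN.findall(extras):
        target = downloads if direction == "i" else uploads
        target.append((container, name))
    return downloads, uploads


downloads, uploads = parse_extras(EXE_EXTRAS)
```

Here downloads holds 8 entries (the program plus the 7 data files), all from container 'fourtytwovhd', and uploads is empty.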
<TaskType>
  <ID>3</ID>
  <ExePath>bow\samtools.exe</ExePath>
  <ExeArgs>view -bS <i:<n>> -o <o:<nb>.bam></ExeArgs>
  <ExeExtras><i:fourtytwovhd:bow/samtools.exe></ExeExtras>
  <Memory>5</Memory>
  <Thread>1</Thread>
  <FileType>sam</FileType>
  <FileCount>1</FileCount>
  <Container_in>sam-10</Container_in>
  <Container_out>bam-10</Container_out>
  <IsEnabled>1</IsEnabled>
</TaskType>
Task 3 has no new parameters. One thing to note: the memory required is only 5 MB, and Thread is just 1. In this case, if our instance size is Large (4 cores), one instance can spin up 4 tasks to use all the cores, i.e. 4 copies of samtools.exe run on 4 sam input files at the same time. When one finishes, another is spun up to keep the instance busy.
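The arithmetic behind "4 tasks on a Large instance" can be sketched in a few lines of Python. The function name concurrent_tasks is hypothetical, and the 7000 MB of free memory is an assumed figure used only for illustration; the page itself states only the core count.

```python
def concurrent_tasks(free_cores, free_memory_mb, task_threads, task_memory_mb):
    """How many copies of a task fit at once, limited by cores and by memory."""
    by_cores = free_cores // task_threads
    by_memory = free_memory_mb // task_memory_mb
    return min(by_cores, by_memory)


ASSUMED_FREE_MEMORY_MB = 7000  # assumption, not a value from this page

# Task 3 (samtools.exe): Thread 1, Memory 5
samtools_slots = concurrent_tasks(4, ASSUMED_FREE_MEMORY_MB, 1, 5)

# Task 1 (bwa.exe): Thread 4, Memory 5000
bwa_slots = concurrent_tasks(4, ASSUMED_FREE_MEMORY_MB, 4, 5000)
```

Under these assumptions samtools_slots is 4 and bwa_slots is 1: task 3 is limited only by the 4 cores, while a single bwa.exe run already takes all of them.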
So that's it. We hope you agree that the 'recipe' is easy to cook. :)
About File Type
One last thing that needs your input, and that is part of the 'recipe', is the definition of a File Type. You already saw some of these in the task definitions:
fastq_gz  ^(?<nb>.+)_(?<no>\d+)\.(?<ne>fastq.gz|fq.gz)$
sai       ^(?<nb>.+)_(?<no>\d+)\.(?<ne>sai)$
sam       ^(?<nb>.+)\.(?<ne>sam)$
First we have the type name; you can use whatever is meaningful to you, so we are not limited to the traditional file name extensions. Then we have a Regular Expression. If you don't know what that is, put simply, it's a pattern used to match strings (a file name in any operating system is just a string).
You might recognize <nb>, <no>, and <ne>; they all have to do with file name manipulation. Programs take input files and produce output files. By using a Regular Expression, you can 'divide' a file name into named parts and use them to construct a new file name. Remember that only <n> is reserved and means the full name; the others are defined here. So <nb> stands for name base, <no> for name order, and <ne> for name extension, but you can define your own.
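As a small illustration of 'dividing' a file name, the Python sketch below applies the fastq_gz pattern to a sample file name and builds a new name from the parts. Python spells named groups as (?P<name>...), so the pattern is rewritten accordingly; the split_name function and the sample file name are ours, not part of Fourtytwo.

```python
import re

# The fastq_gz pattern from above, with (?<name>...) rewritten as (?P<name>...)
# because that is how Python's re module writes named groups.
FASTQ_GZ = r"^(?P<nb>.+)_(?P<no>\d+)\.(?P<ne>fastq.gz|fq.gz)$"


def split_name(pattern, file_name):
    """Return the named parts of file_name, or None if it does not match."""
    match = re.match(pattern, file_name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["n"] = file_name  # <n> is the reserved variable for the full name
    return parts


parts = split_name(FASTQ_GZ, "2000_1.fastq.gz")
new_name = "{nb}_{no}.sai".format(**parts)
```

For this sample name, parts contains nb '2000', no '1' and ne 'fastq.gz', and new_name is '2000_1.sai', which is the same construction as <nb>_<no>.sai in task 1.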
Suppose
We have file 1 named 2000_1.fastq.gz and file 2 named 2000_2.fastq.gz, both inside container raw-10, and both match the regular expression for file type fastq_gz. For both files <nb> is '2000' and <ne> is 'fastq.gz'; <no> is '1' for file 1 and '2' for file 2.
When an instance grabs file 1 and task 1, this is what is actually called, as in a command window:
f:\bow\bwa.exe aln -q 15 -t 4 -f 2000_1.sai f:\hs37d5.fa.gz 2000_1.fastq.gz
For file 2 and task 1:
f:\bow\bwa.exe aln -q 15 -t 4 -f 2000_2.sai f:\hs37d5.fa.gz 2000_2.fastq.gz
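The two task 1 command lines above can be reproduced with a short Python sketch. It is a simplified stand-in for whatever Fourtytwo does internally: the expand function and its two-pass approach are assumptions, and it only handles the forms that appear in task 1 (plain variables and <i:>/<o:> without a container name or a :1 index).

```python
import re


def expand(template, values):
    """Expand a task 1 style template; return (text, downloads, uploads)."""
    # Pass 1: replace plain variables such as <nb>, <no>, <n> and <xdrive>.
    text = re.sub(r"<(\w+)>", lambda m: values[m.group(1)], template)

    # Pass 2: unwrap <i:...> and <o:...>, remembering which names they marked.
    downloads, uploads = [], []

    def unwrap(m):
        (downloads if m.group(1) == "i" else uploads).append(m.group(2))
        return m.group(2)

    text = re.sub(r"<([io]):([^<>]*)>", unwrap, text)
    return text, downloads, uploads


# Values for file 1; nb and no come from the fastq_gz file type pattern.
values = {"xdrive": "f:\\", "n": "2000_1.fastq.gz", "nb": "2000", "no": "1"}

exe_path, _, _ = expand(r"<xdrive>bow\bwa.exe", values)
exe_args, downloads, uploads = expand(
    "aln -q 15 -t 4 -f <o:<nb>_<no>.sai> <i:<xdrive>hs37d5.fa.gz> <i:<n>>",
    values,
)
command = exe_path + " " + exe_args
```

For file 1 this yields the first command line shown above, with 2000_1.sai as the only name marked for upload. Changing n to 2000_2.fastq.gz and no to '2' gives the command line for file 2.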
2000_1.sai and 2000_2.sai will be uploaded into container 'sai-10'. Their names match the file type 'sai', so an instance will grab them to run task 2:
C:\Resources\directory\970cb6c6f6e449168d0b6329c0aee830.WorkerRole.Scratch\bow\bwa-mt.exe sampe -a 1431 -t 4 -P -f 2000.sam hs37d5.fa.gz 2000_1.sai 2000_2.sai 2000_1.fastq.gz 2000_2.fastq.gz
The scratch folder has a rather long path. 2000.sam will be uploaded into container 'sam-10'; then an instance will grab it and run task 3:
C:\Resources\directory\970cb6c6f6e449168d0b6329c0aee830.WorkerRole.Scratch\bow\samtools.exe view -bS 2000.sam -o 2000.bam
2000.bam will end up in container 'bam-10'. We didn't define a file type for 'bam', and no task is defined for 'bam', so this is our end point. Job done!
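To see why the chain stops at the bam file, here is a rough Python sketch that looks up which task, if any, would pick up a given file name. The task_for function and the TASK_FOR_TYPE table are hypothetical helpers built from the three task definitions on this page; the patterns use Python's (?P<name>...) spelling of named groups, and container checks are left out for brevity.

```python
import re

# File types from this page, in Python's named-group syntax.
FILE_TYPES = {
    "fastq_gz": r"^(?P<nb>.+)_(?P<no>\d+)\.(?P<ne>fastq.gz|fq.gz)$",
    "sai": r"^(?P<nb>.+)_(?P<no>\d+)\.(?P<ne>sai)$",
    "sam": r"^(?P<nb>.+)\.(?P<ne>sam)$",
}

# FileType of each task, taken from the three task definitions.
TASK_FOR_TYPE = {"fastq_gz": 1, "sai": 2, "sam": 3}


def task_for(file_name):
    """Return the ID of the task whose file type matches file_name, or None."""
    for type_name, pattern in FILE_TYPES.items():
        if re.match(pattern, file_name):
            return TASK_FOR_TYPE.get(type_name)
    return None


chain = [task_for(name) for name in
         ["2000_1.fastq.gz", "2000_1.sai", "2000.sam", "2000.bam"]]
```

In this sketch chain is [1, 2, 3, None]: each intermediate file name matches the file type of the next task, and 2000.bam matches nothing, so no instance will ever pick it up.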