Pipelines

BED File Indexing

This pipeline processes BED files into BigBed and HDF5 index files for web use.

Running from the command line

Parameters

assembly : str
Genome assembly ID (e.g. GCA_000001405.22)
chrom : str
Location of chrom.size file
bed_file : str
Location of input bed file
h5_file : str
Location of HDF5 output file

Returns

BigBed : file
BigBed file
HDF5 : file
HDF5 index file

Example

When using a local version of the [COMPS virtual machine](http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation):

chrom.size file:

```
1  123000000
2  50000000
3  25000000
4  10000000
5  5000000
X  75000000
Y  12000000
```

```
runcompss --lang=python /home/compss/mg-process-files/process_bed.py --assembly GCA_000001405.22 --chrom chrom.size --bed_file <data_dir>/expt.bed --h5_file <data_dir>/expt.hdf5
```
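The chrom.size file maps chromosome names to their lengths, as in the example above. A minimal sketch of how such a file could be parsed (the helper name is illustrative, not part of the pipeline's API):

```python
def parse_chrom_sizes(path):
    """Parse a whitespace-delimited chrom.size file into a dict
    mapping chromosome name -> length (int)."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                sizes[fields[0]] = int(fields[1])
    return sizes
```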

Methods

class process_bed.process_bed(configuration=None)[source]

Workflow to index BED formatted files within the Multiscale Genomics (MuG) Virtual Research Environment (VRE)

run(input_files, metadata, output_files)[source]

Main run function to index the BED files ready for use in the RESTful API. BED files are indexed in two different ways to allow for optimal data retrieval. The first is a BigBed file, which allows the data to be easily extracted as BED documents and served to the user. The second is an HDF5 file used to identify which BED files have information at a given location. This helps REST clients make only the required calls to the relevant BED files rather than needing to poll all potential BED files.

Parameters:
  • input_files (list) – List of file locations
  • metadata (list) –
Returns:

outputfiles – List of locations for the output BED and HDF5 files

Return type:

list
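The HDF5 index described above can be thought of as a presence matrix: for each genomic bin, which files contain features there. A rough, stdlib-only sketch of that idea (the bin size and data structures are illustrative; the actual pipeline stores this matrix in an HDF5 file):

```python
BIN_SIZE = 1000  # illustrative bin width in base pairs

def build_presence_index(bed_features):
    """bed_features: {file_name: [(chrom, start, end), ...]}.
    Returns {(chrom, bin): set(file_names)} marking which files
    have at least one feature overlapping each bin."""
    index = {}
    for name, features in bed_features.items():
        for chrom, start, end in features:
            for b in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
                index.setdefault((chrom, b), set()).add(name)
    return index

def files_with_data(index, chrom, pos):
    """Which files should a REST client query for this position?"""
    return index.get((chrom, pos // BIN_SIZE), set())
```

A client that first consults this index only fetches from the BED files that actually cover the requested position.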

WIG File Indexing

This pipeline processes WIG files into BigWig and HDF5 index files for web use.

Running from the command line

Parameters

assembly : str
Genome assembly ID (e.g. GCA_000001405.22)
chrom : str
Location of chrom.size file
wig_file : str
Location of input wig file
h5_file : str
Location of HDF5 output file

Returns

BigWig : file
BigWig file
HDF5 : file
HDF5 index file

Example

When using a local version of the [COMPS virtual machine](http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation):

chrom.size file:

```
1  123000000
2  50000000
3  25000000
4  10000000
5  5000000
X  75000000
Y  12000000
```

```
runcompss --lang=python /home/compss/mg-process-files/process_wig.py --assembly GCA_000001405.22 --chrom chrom.size --wig_file <data_dir>/expt.wig --h5_file <data_dir>/expt.hdf5
```

Methods

class process_wig.process_wig(configuration=None)[source]

Workflow to index WIG formatted files within the Multiscale Genomics (MuG) Virtual Research Environment (VRE)

run(input_files, metadata, output_files)[source]

Main run function to index the WIG files ready for use in the RESTful API. WIG files are indexed in two different ways to allow for optimal data retrieval. The first is a BigWig file, which allows the data to be easily extracted as WIG documents and served to the user. The second is an HDF5 file used to identify which WIG files have information at a given location. This helps REST clients make only the required calls to the relevant WIG files rather than needing to poll all potential WIG files.

Parameters:
  • input_files (list) – List of file locations
  • metadata (list) –
Returns:

outputfiles – List of locations for the output WIG and HDF5 files

Return type:

list
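Unlike BED, WIG data is value-per-position rather than interval-based. As background for the indexing step, a minimal sketch of expanding a fixedStep WIG block into (chrom, position, value) records (this handles only the fixedStep variant of the format; the helper name is illustrative):

```python
def parse_fixedstep(lines):
    """Expand fixedStep WIG lines into (chrom, position, value)
    tuples. Positions are 1-based, as in the WIG format."""
    records = []
    chrom, pos, step = None, 0, 1
    for line in lines:
        line = line.strip()
        if line.startswith("fixedStep"):
            # e.g. "fixedStep chrom=1 start=100 step=10"
            fields = dict(f.split("=") for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
        elif line and chrom is not None:
            records.append((chrom, pos, float(line)))
            pos += step
    return records
```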

GFF3 File Indexing

This pipeline can process GFF3 files into Tabix and HDF5 index files for web use.

Running from the command line

Parameters

assembly : str
Genome assembly ID (e.g. GCA_000001405.22)
gff3_file : str
Location of the source gff3 file
h5_file : str
Location of HDF5 index file

Returns

Tabix : file
Tabix index file
HDF5 : file
HDF5 index file

Example

When using a local version of the [COMPS virtual machine](http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation):

```
runcompss --lang=python /home/compss/mg-process-files/process_gff3.py --assembly GCA_000001405.22 --gff3_file <data_dir>/expt.gff3 --h5_file <data_dir>/expt.hdf5
```

Methods

class process_gff3.process_gff3(configuration=None)[source]

Workflow to index GFF3 formatted files within the Multiscale Genomics (MuG) Virtual Research Environment (VRE)

run(input_files, metadata, output_files)[source]

Main run function to index the GFF3 files ready for use in the RESTful API. GFF3 files are indexed in two different ways to allow for optimal data retrieval. The first is a Tabix index, which allows the data to be easily extracted as GFF3 documents and served to the user. The second is an HDF5 file used to identify which GFF3 files have information at a given location. This helps REST clients make only the required calls to the relevant GFF3 files rather than needing to poll all potential GFF3 files.

Parameters:
  • input_files (list) – List of file locations
  • metadata (list) –
Returns:

outputfiles – List of locations for the output Tabix and HDF5 files

Return type:

list
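Tabix retrieval works over coordinate-sorted, indexed records. The real Tabix format is a compressed binned index, but the lookup concept can be illustrated with a stdlib-only binary search over sorted start positions (all names here are illustrative):

```python
import bisect

def build_sorted_index(features):
    """features: [(chrom, start, end, payload), ...].
    Returns {chrom: (sorted_starts, records)} for binary search."""
    by_chrom = {}
    for rec in sorted(features, key=lambda r: (r[0], r[1])):
        starts, records = by_chrom.setdefault(rec[0], ([], []))
        starts.append(rec[1])
        records.append(rec)
    return by_chrom

def query(index, chrom, start, end):
    """Return records on chrom overlapping the range [start, end)."""
    if chrom not in index:
        return []
    starts, records = index[chrom]
    # Candidates start before `end`; keep those that end after `start`.
    hi = bisect.bisect_left(starts, end)
    return [r for r in records[:hi] if r[2] > start]
```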

3D JSON Indexing

This pipeline processes the 3D JSON models that have been generated via TADbit into a single HDF5 file that can be used as part of a RESTful API for efficient querying and retrieval of the models.

Running from the command line

Parameters

gz_file : str
Location of the input tar.gz file containing all of the output models and data from the TADbit modelling stage.

Returns

HDF5 : file
HDF5 index file

Example

When using a local version of the [COMPS virtual machine](http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation):

```
runcompss --lang=python /home/compss/mg-process-files/process_json_3d.py --gz_file <data_dir>/expt.tar.gz
```

Methods

class process_json_3d.process_json_3d(configuration=None)[source]

Workflow to index JSON formatted files within the Multiscale Genomics (MuG) Virtual Research Environment (VRE) that have been generated as part of the Hi-C analysis pipeline to model the 3D structure of the genome within the nucleus of the cell.

run(input_files, metadata, output_files)[source]

Main run function to index the 3D JSON files, generated as part of the Hi-C analysis pipeline to model the 3D structure of the genome within the nucleus of the cell, ready for use in the RESTful API.

Parameters:
  • input_files (list) –
    file : str
    Location of the tar.gz file of JSON files representing the 3D models of the nucleus
  • metadata (list) –
Returns:

outputfiles – List with the location of the HDF5 index file for the given dataset

Return type:

list
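As a rough illustration of the consolidation step, the sketch below merges several per-region JSON model documents into one nested structure keyed by region. The field names are invented for illustration; the actual pipeline reads the TADbit output from the tar.gz archive and writes the merged result to an HDF5 file:

```python
import json

def merge_model_documents(docs):
    """docs: {region_id: json_string}. Returns a single dict
    gathering every region's models under one top-level object,
    so the whole dataset can be queried from one place."""
    merged = {"regions": {}}
    for region_id, text in docs.items():
        merged["regions"][region_id] = json.loads(text)
    return merged
```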