auto_process_ngs.qc.modules.cellranger_count
Implements the ‘cellranger_count’ QC module:
CellrangerCount: core QCModule class
Pipeline task classes:
DetermineRequired10xPackage
GetCellrangerReferenceData
MakeCellrangerArcCountLibraries
CheckCellrangerCountOutputs
RunCellrangerCount
Pipeline helper functions:
add_cellranger_count: adds tasks to pipeline to run ‘cellranger* count’
filter_10x_pipelines: filters a list of 10x pipelines
verify_10x_pipeline: check for and verify outputs for a 10x package
Also imports the following pipeline tasks:
Get10xPackage
Additional helper functions:
check_cellranger_count_outputs: fetch names of samples with missing ‘cellranger count’ outputs
check_cellranger_atac_count_outputs: fetch names of samples with missing ‘cellranger-atac count’ outputs
check_cellranger_arc_count_outputs: fetch names of samples with missing ‘cellranger-arc count’ outputs
- class auto_process_ngs.qc.modules.cellranger_count.CellrangerCount
Class for handling the ‘cellranger_count’ QC module
- classmethod add_to_pipeline(*args, **kws)
Adds tasks for ‘cellranger_count’ module to pipeline
Wrapper for the ‘add_cellranger_count’ function
- classmethod collect_qc_outputs(qc_dir)
Collect information on Cellranger count outputs
Returns an AttributeDictionary with the following attributes:
name: set to the QC module name
software: dictionary of software and versions
references: list of associated reference datasets
fastqs: list of associated Fastq names
samples: list of associated sample names
pipelines: list of tuples defining 10x pipelines in the form (name,version,reference)
samples_by_pipeline: dictionary with lists of sample names associated with each 10x pipeline tuple
config_files: list of associated config files (‘libraries.<SAMPLE>.csv’)
output_files: list of associated output files
tags: list of associated output classes
- Parameters:
qc_dir (QCDir) – QC directory to examine
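To illustrate the shape of the returned data, the sketch below builds a stand-in object with the attribute names listed above and walks ‘samples_by_pipeline’. The SimpleNamespace stand-in, the helper ‘summarise_pipelines’ and all values are invented for the example; the real method returns an AttributeDictionary populated from the QC directory.

```python
from types import SimpleNamespace

# Stand-in for the AttributeDictionary returned by collect_qc_outputs();
# every value here is invented for illustration
qc_outputs = SimpleNamespace(
    name="cellranger_count",
    software={"cellranger": ["6.1.2"]},
    references=["refdata-gex-2020"],
    fastqs=[],
    samples=["SMPL1", "SMPL2"],
    pipelines=[("cellranger", "6.1.2", "refdata-gex-2020")],
    samples_by_pipeline={
        ("cellranger", "6.1.2", "refdata-gex-2020"): ["SMPL1", "SMPL2"],
    },
    config_files=[],
    output_files=[],
    tags=[],
)

def summarise_pipelines(outputs):
    """Return one summary line per 10x pipeline tuple (hypothetical helper)."""
    lines = []
    for name, version, reference in outputs.pipelines:
        # Pipeline tuples double as keys into samples_by_pipeline
        samples = outputs.samples_by_pipeline[(name, version, reference)]
        lines.append(f"{name} {version} ({reference}): {', '.join(samples)}")
    return lines

for line in summarise_pipelines(qc_outputs):
    print(line)
```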
- classmethod verify(params, qc_outputs)
Verify ‘cellranger_count’ QC module against outputs
Returns one of 3 values:
True: outputs verified ok
False: outputs failed to verify
None: verification not possible
- Parameters:
params (AttributeDictionary) – values of parameters used as inputs
qc_outputs (AttributeDictionary) – QC outputs returned from the ‘collect_qc_outputs’ method
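Because the result has three states, callers need to distinguish False from None rather than rely on truthiness. A minimal sketch of one way to do that follows; the helper name ‘describe_verification’ is hypothetical and not part of the module.

```python
def describe_verification(status):
    """Map the tri-state result of verify() to a short message."""
    # Compare by identity: 'if not status' would conflate False and None
    if status is True:
        return "outputs verified ok"
    if status is False:
        return "outputs failed to verify"
    return "verification not possible"

for status in (True, False, None):
    print(status, "->", describe_verification(status))
```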
- class auto_process_ngs.qc.modules.cellranger_count.CheckCellrangerCountOutputs(_name, *args, **kws)
Check the outputs from cellranger(-atac) count
- finish()
Perform actions on task completion
Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.
Must be implemented by the subclass
- init(project, fastq_dir=None, samples=None, qc_dir=None, qc_module=None, extra_projects=None, cellranger_version=None, cellranger_ref_data=None, verbose=False)
Initialise the CheckCellrangerCountOutputs task.
- Parameters:
project (AnalysisProject) – project to run QC for
fastq_dir (str) – directory holding Fastq files
samples (list) – list of samples to restrict checks to (all samples in project are checked by default)
qc_dir (str) – top-level QC directory to look for ‘count’ QC outputs (e.g. metrics CSV and summary HTML files)
qc_module (str) – QC protocol being used
extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline
cellranger_version (str) – version number of 10xGenomics package
cellranger_ref_data (str) – name or path to reference dataset for single library analysis
verbose (bool) – if True then print additional information from the task
- Outputs:
- fastq_dir (PipelineParam): pipeline parameter instance that resolves to a string with the path to directory with Fastq files
- samples (list): list of sample names that have missing outputs from ‘cellranger count’
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.modules.cellranger_count.DetermineRequired10xPackage(_name, *args, **kws)
Determine which 10xGenomics software package is required
By default determines the package name based on the supplied QC module, but this can be overridden by explicitly supplying a required package (which can also be a path to an executable).
The output ‘require_cellranger’ parameter should be supplied to the ‘Get10xPackage’ task, which will do the job of actually locating an executable.
- init(qc_module, require_cellranger=None)
Initialise the DetermineRequired10xPackage task
- Parameters:
qc_module (str) – QC module being used
require_cellranger (str) – optional package name or path to an executable; if supplied then overrides the automatic package determination
- Outputs:
- require_cellranger (PipelineParam): the 10xGenomics software package name or path to use
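The override behaviour described above can be sketched as a lookup with an explicit escape hatch. The mapping of QC module names to package names below is an assumption made for illustration, as is the function name; the real task may use different names or logic.

```python
# Hypothetical mapping of QC module names to 10x package names
# (illustrative only, not taken from the module source)
DEFAULT_PACKAGES = {
    "cellranger_count": "cellranger",
    "cellranger-atac_count": "cellranger-atac",
    "cellranger-arc_count": "cellranger-arc",
}

def determine_required_package(qc_module, require_cellranger=None):
    """Return the package name (or executable path) to look for."""
    # An explicitly supplied package or path always wins
    if require_cellranger is not None:
        return require_cellranger
    # Otherwise fall back to the default for the QC module (None if unknown)
    return DEFAULT_PACKAGES.get(qc_module)

print(determine_required_package("cellranger-arc_count"))
print(determine_required_package("cellranger_count", "/opt/cellranger/bin/cellranger"))
```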
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.modules.cellranger_count.GetCellrangerReferenceData(_name, *args, **kws)
- init(project, organism=None, transcriptomes=None, premrna_references=None, atac_references=None, multiome_references=None, cellranger_exe=None, cellranger_version=None, force_reference_data=None)
Initialise the GetCellrangerReferenceData task
- Parameters:
project (AnalysisProject) – project to run QC for
organism (str) – if supplied then must be a string with the names of one or more organisms, with multiple organisms separated by spaces (defaults to the organisms associated with the project)
transcriptomes (mapping) – mapping of organism names to reference transcriptome data for cellranger
premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger
atac_references (mapping) – mapping of organism names to reference genome data for cellranger-atac
multiome_references (mapping) – mapping of organism names to reference datasets for cellranger-arc
cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)
cellranger_version (str) – the version string for the Cellranger package
force_reference_data (str) – if supplied then will be used as the reference dataset, instead of trying to locate appropriate reference data automatically
- Outputs:
- reference_data_path (PipelineParam): pipeline parameter instance which resolves to a string with the path to the reference data set corresponding to the supplied organism.
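A minimal sketch of the lookup implied by these parameters: a forced reference dataset takes precedence, otherwise the organism name is used as a key into the relevant mapping. The function name and the handling of multiple organisms (returning None) are assumptions for illustration, not the task's documented behaviour.

```python
def lookup_reference_data(organism, references, force_reference_data=None):
    """Pick a reference dataset path for an organism string (sketch)."""
    # A forced dataset bypasses the automatic lookup entirely
    if force_reference_data:
        return force_reference_data
    # The organism string may hold several space-separated names
    names = organism.split()
    if len(names) != 1:
        # Assumption: no single dataset can be chosen for zero or many organisms
        return None
    return references.get(names[0])

transcriptomes = {"human": "/data/refdata-gex-GRCh38"}  # made-up path
print(lookup_reference_data("human", transcriptomes))
```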
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.modules.cellranger_count.MakeCellrangerArcCountLibraries(_name, *args, **kws)
Make ‘libraries.csv’ files for cellranger-arc count
- init(project, qc_dir)
Initialise the MakeCellrangerArcCountLibraries task.
- Parameters:
project (AnalysisProject) – project to run QC for
qc_dir (str) – top-level QC directory to put ‘libraries.csv’ files
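For orientation, a libraries CSV for ‘cellranger-arc count’ has the columns ‘fastqs’, ‘sample’ and ‘library_type’. The sketch below generates the text of such a file; the helper name, sample names and paths are invented, and how the real task pairs gene expression and ATAC samples is not shown here.

```python
import csv
import io

def make_libraries_csv(rows):
    """Return the text of a libraries CSV for 'cellranger-arc count'.

    rows: iterable of (fastq_dir, sample, library_type) tuples
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["fastqs", "sample", "library_type"])
    writer.writerows(rows)
    return out.getvalue()

# Made-up example pairing a GEX library with its ATAC counterpart
text = make_libraries_csv([
    ("/data/gex/fastqs", "SMPL1_GEX", "Gene Expression"),
    ("/data/atac/fastqs", "SMPL1_ATAC", "Chromatin Accessibility"),
])
print(text)
```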
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.modules.cellranger_count.RunCellrangerCount(_name, *args, **kws)
Run ‘cellranger* count’
- finish()
Perform actions on task completion
Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.
Must be implemented by the subclass
- init(samples, fastq_dir, reference_data_path, library_type, out_dir, single_nuclei=None, qc_dir=None, cellranger_exe=None, cellranger_version=None, chemistry='auto', fastq_dirs=None, force_cells=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None)
Initialise the RunCellrangerCount task.
- Parameters:
samples (list) – list of sample names to run cellranger count on (it is expected that this list will come from the CheckCellrangerCountOutputs task)
fastq_dir (str) – path to directory holding the Fastq files
reference_data_path (str) – path to the cellranger compatible reference dataset
library_type (str) – type of data being analysed (e.g. ‘scRNA-seq’)
out_dir (str) – top-level directory to copy all final ‘count’ outputs into. Outputs won’t be copied if no value is supplied
single_nuclei (bool) – explicitly indicate that data are single nuclei rather than single cell
qc_dir (str) – top-level QC directory to put ‘count’ QC outputs (e.g. metrics CSV and summary HTML files) into. Outputs won’t be copied if no value is supplied
cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)
cellranger_version (str) – the version string for the Cellranger package
fastq_dirs (dict) – optional, a dictionary mapping sample names to Fastq directories which will be used to override the path set by the ‘fastq_dir’ argument
force_cells (int) – optional, if set then bypasses the cell detection algorithm in ‘cellranger’ and ‘cellranger-atac’ using the ‘--force-cells’ option (does nothing for ‘cellranger-arc’)
chemistry (str) – assay configuration (set to ‘auto’ to let cellranger determine this automatically; ignored if not scRNA-seq)
cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)
cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)
cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)
cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)
cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (defaults to number of slots set in runner)
cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)
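To show how several of these parameters could map onto a command line, the sketch below assembles an argument list for a single-sample scRNA-seq ‘cellranger count’ run. The option names are standard ‘cellranger count’ options, but the function itself is hypothetical and which options the real task passes, and under what conditions, is an assumption.

```python
def build_count_command(sample, fastq_dir, reference_data_path,
                        cellranger_exe="cellranger", chemistry="auto",
                        force_cells=None, jobmode="local",
                        localcores=None, localmem=None):
    """Assemble an argument list for 'cellranger count' (sketch)."""
    cmd = [cellranger_exe, "count",
           "--id", sample,
           "--fastqs", fastq_dir,
           "--sample", sample,
           "--transcriptome", reference_data_path,
           "--chemistry", chemistry,
           "--jobmode", jobmode]
    # Only bypass cell detection when a value was explicitly supplied
    if force_cells is not None:
        cmd += ["--force-cells", str(force_cells)]
    # Core and memory limits are only meaningful in 'local' job mode
    if jobmode == "local":
        if localcores is not None:
            cmd += ["--localcores", str(localcores)]
        if localmem is not None:
            cmd += ["--localmem", str(localmem)]
    return cmd

cmd = build_count_command("SMPL1", "/data/fastqs", "/data/refdata-gex-2020",
                          force_cells=5000, localcores=8)
print(" ".join(cmd))
```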
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- auto_process_ngs.qc.modules.cellranger_count.add_cellranger_count(p, project_name, project, qc_dir, organism, fastq_dir, qc_module_name, library_type, chemistry, transcriptome_references, premrna_references, atac_references, multiome_references, force_cells, single_nuclei=None, samples=None, fastq_dirs=None, cellranger_exe=None, reference_dataset=None, extra_projects=None, cellranger_out_dir=None, cellranger_jobmode=None, cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None, required_tasks=None, verify_runner=None, cellranger_runner=None, envmodules=None, verbose=False)
Add tasks to pipeline to run ‘cellranger* count’
- Parameters:
p (Pipeline) – pipeline to extend
project_name (str) – name to associate with project for reporting tasks
project (AnalysisProject) – project to run 10x cellranger pipeline within
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
organism (str) – organism for pipeline
fastq_dir (str) – directory holding Fastq files
qc_module_name (str) – QC module being used
library_type (str) – type of data being analysed (e.g. ‘scRNA-seq’)
chemistry (str) – chemistry to use in single library analysis
transcriptome_references (mapping) – mapping of organism names to reference transcriptome data for cellranger
premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger
atac_references (mapping) – mapping of organism names to ATAC-seq reference genome data for cellranger-atac
multiome_references (mapping) – mapping of organism names to multiome reference datasets for cellranger-arc
force_cells (int) – if set then bypasses the cell detection algorithm in ‘cellranger’ and ‘cellranger-atac’ using the ‘--force-cells’ option (does nothing for ‘cellranger-arc’)
single_nuclei (bool) – if set then indicates data are single nuclei rather than single cell
samples (list) – optional, list of samples to restrict single library analyses to (or None to use all samples in project)
fastq_dirs (dict) – optional, a dictionary mapping sample names to Fastq directories which will be used to override the path set by the ‘fastq_dir’ argument
cellranger_exe (str) – optional, explicitly specify the cellranger executable to use for single library analysis (default: cellranger executable is determined automatically)
reference_dataset (str) – optional, path to reference dataset (otherwise will be determined automatically based on organism)
extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline
cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)
cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)
cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)
cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)
cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (default: None)
cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)
required_tasks (list) – list of tasks that the cellranger pipeline should wait for
verify_runner (JobRunner) – runner to use for checks
cellranger_runner (JobRunner) – runner to use for running ‘cellranger* count’
envmodules (list) – environment module names to load for running Cellranger
verbose (bool) – enable verbose output
- auto_process_ngs.qc.modules.cellranger_count.check_cellranger_arc_count_outputs(project, qc_dir=None, prefix='cellranger_count')
Return samples missing QC outputs from ‘cellranger-arc count’
Returns a list of the samples from a project for which one or more associated outputs from cellranger-arc count don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to QC directory (if not the default QC directory for the project)
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
list of sample names with missing outputs
- Return type:
list
- auto_process_ngs.qc.modules.cellranger_count.check_cellranger_atac_count_outputs(project, qc_dir=None, prefix='cellranger_count')
Return samples missing QC outputs from ‘cellranger-atac count’
Returns a list of the samples from a project for which one or more associated outputs from cellranger-atac count don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to QC directory (if not the default QC directory for the project)
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
list of sample names with missing outputs
- Return type:
list
- auto_process_ngs.qc.modules.cellranger_count.check_cellranger_count_outputs(project, qc_dir=None, prefix='cellranger_count')
Return samples missing QC outputs from ‘cellranger count’
Returns a list of the samples from a project for which one or more associated outputs from cellranger count don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to QC directory (if not the default QC directory for the project)
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
list of sample names with missing outputs
- Return type:
list
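The general pattern behind the three ‘check_*_outputs’ functions can be sketched as a per-sample existence check. The directory layout (‘<qc_dir>/<prefix>/<sample>/outs’) and the expected file names below are assumptions chosen for illustration; the real functions take an AnalysisProject rather than a list of names and may check a different set of files.

```python
import os

# Assumed output file names; the real functions may look for others
EXPECTED_OUTPUTS = ("metrics_summary.csv", "web_summary.html")

def samples_with_missing_outputs(sample_names, qc_dir, prefix="cellranger_count"):
    """Return the names of samples lacking one or more expected outputs."""
    missing = []
    for sample in sample_names:
        # Assumed layout: <qc_dir>/<prefix>/<sample>/outs/<file>
        outs_dir = os.path.join(qc_dir, prefix, sample, "outs")
        if not all(os.path.exists(os.path.join(outs_dir, f))
                   for f in EXPECTED_OUTPUTS):
            missing.append(sample)
    return missing
```

Called with a QC directory that holds complete outputs for only some samples, the function returns the remaining sample names in their original order.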
- auto_process_ngs.qc.modules.cellranger_count.filter_10x_pipelines(p, pipelines)
Filter list of 10x pipelines
Pipelines are described using tuples of the form:
(NAME,VERSION,REFERENCE)
for example:
('cellranger','6.1.2','refdata-gex-2020')
Only pipelines matching the specified name, version and reference data will be included in the returned list.
Where the supplied version or reference dataset name are either None or ‘*’, these will match any version and/or reference dataset.
- Parameters:
p (tuple) – tuple specifying pipeline(s) to match against
pipelines (list) – list of pipeline tuples to filter
- Returns:
list of matching 10x pipeline tuples.
- Return type:
list
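The matching rules described above (exact name, with None or ‘*’ acting as a wildcard for version and reference) can be sketched as follows. This is an illustrative re-implementation under those stated rules, using the hypothetical name ‘filter_pipelines’, and not the module's own code.

```python
def filter_pipelines(p, pipelines):
    """Return the pipeline tuples from 'pipelines' that match 'p' (sketch)."""
    name, version, reference = p
    matches = []
    for candidate in pipelines:
        c_name, c_version, c_reference = candidate
        # The package name must always match exactly
        if c_name != name:
            continue
        # None or '*' matches any version
        if version not in (None, "*") and c_version != version:
            continue
        # None or '*' matches any reference dataset
        if reference not in (None, "*") and c_reference != reference:
            continue
        matches.append(candidate)
    return matches

available = [
    ("cellranger", "6.1.2", "refdata-gex-2020"),
    ("cellranger", "7.0.0", "refdata-gex-2020"),
    ("cellranger-atac", "2.0.0", "refdata-atac-2020"),  # made-up entry
]
print(filter_pipelines(("cellranger", "*", None), available))
```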
- auto_process_ngs.qc.modules.cellranger_count.verify_10x_pipeline(pipeline, samples, qc_outputs)
Check for and verify outputs for 10x package
- Parameters:
pipeline (tuple) – tuple specifying pipeline(s) to verify
samples (list) – list of sample names to verify
qc_outputs (AttributeDictionary) – QC outputs returned from the ‘collect_qc_outputs’ method
- Returns:
- True if at least one set of valid outputs exists for the specified pipeline and sample list, False otherwise.
- Return type:
Boolean
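One plausible reading of this check is sketched below: a pipeline specification is verified if any matching pipeline in the QC outputs covers every requested sample. The helper names, the SimpleNamespace stand-in for the QC outputs and the example data are all assumptions; the real function's criteria for "valid outputs" may be stricter.

```python
from types import SimpleNamespace

def _matches(pattern, pipeline):
    """True if a pipeline tuple matches a (name, version, reference) pattern."""
    name, version, reference = pattern
    return (pipeline[0] == name
            and (version in (None, "*") or pipeline[1] == version)
            and (reference in (None, "*") or pipeline[2] == reference))

def verify_pipeline(pattern, samples, qc_outputs):
    """True if some matching pipeline has outputs for all samples (sketch)."""
    for pipeline in qc_outputs.pipelines:
        if not _matches(pattern, pipeline):
            continue
        covered = qc_outputs.samples_by_pipeline.get(pipeline, [])
        if all(sample in covered for sample in samples):
            return True
    return False

# Made-up QC outputs: one pipeline covering a single sample
outputs = SimpleNamespace(
    pipelines=[("cellranger", "6.1.2", "refdata-gex-2020")],
    samples_by_pipeline={("cellranger", "6.1.2", "refdata-gex-2020"): ["SMPL1"]},
)
print(verify_pipeline(("cellranger", "*", "*"), ["SMPL1"], outputs))
```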