auto_process_ngs.qc.modules.cellranger_count

Implements the ‘cellranger_count’ QC module:

  • CellrangerCount: core QCModule class

Pipeline task classes:

  • DetermineRequired10xPackage

  • GetCellrangerReferenceData

  • MakeCellrangerArcCountLibraries

  • CheckCellrangerCountOutputs

  • RunCellrangerCount

Pipeline helper functions:

  • add_cellranger_count: adds tasks to pipeline to run ‘cellranger* count’

  • filter_10x_pipelines: filters a list of 10x pipelines

  • verify_10x_pipeline: checks for and verifies outputs for a 10x package

Also imports the following pipeline tasks:

  • Get10xPackage

Additional helper functions:

  • check_cellranger_count_outputs: fetch names of samples with missing ‘cellranger count’ outputs

  • check_cellranger_atac_count_outputs: fetch names of samples with missing ‘cellranger-atac count’ outputs

  • check_cellranger_arc_count_outputs: fetch names of samples with missing ‘cellranger-arc count’ outputs

class auto_process_ngs.qc.modules.cellranger_count.CellrangerCount

Class for handling the ‘cellranger_count’ QC module

classmethod add_to_pipeline(*args, **kws)

Adds tasks for ‘cellranger_count’ module to pipeline

Wrapper for the ‘add_cellranger_count’ function

classmethod collect_qc_outputs(qc_dir)

Collect information on Cellranger count outputs

Returns an AttributeDictionary with the following attributes:

  • name: set to the QC module name

  • software: dictionary of software and versions

  • references: list of associated reference datasets

  • fastqs: list of associated Fastq names

  • samples: list of associated sample names

  • pipelines: list of tuples defining 10x pipelines in the form (name,version,reference)

  • samples_by_pipeline: dictionary with lists of sample names associated with each 10x pipeline tuple

  • config_files: list of associated config files (‘libraries.<SAMPLE>.csv’)

  • output_files: list of associated output files

  • tags: list of associated output classes

Parameters:

qc_dir (QCDir) – QC directory to examine
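
To illustrate how the ‘pipelines’ and ‘samples_by_pipeline’ attributes relate to each other, the following minimal sketch builds a stand-in for the returned object. It uses a plain dictionary rather than an AttributeDictionary, and the sample names and the samples_for helper are hypothetical, introduced only for this example.

```python
# Hypothetical stand-in for the object returned by collect_qc_outputs();
# a plain dict is used here instead of an AttributeDictionary.
pipeline = ("cellranger", "6.1.2", "refdata-gex-2020")

qc_outputs = {
    "name": "cellranger_count",
    "pipelines": [pipeline],
    # Keys are the same (name, version, reference) tuples as in 'pipelines'
    "samples_by_pipeline": {pipeline: ["SAMPLE1", "SAMPLE2"]},
    "samples": ["SAMPLE1", "SAMPLE2"],
}


def samples_for(outputs, pipeline_tuple):
    """Return the sample names recorded for a given 10x pipeline tuple."""
    return outputs["samples_by_pipeline"].get(pipeline_tuple, [])


for p in qc_outputs["pipelines"]:
    name, version, reference = p
    print(f"{name} {version} ({reference}): {samples_for(qc_outputs, p)}")
```

Using the pipeline tuple itself as the dictionary key means that samples processed with different versions or reference datasets stay separate.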

classmethod verify(params, qc_outputs)

Verify ‘cellranger_count’ QC module against outputs

Returns one of 3 values:

  • True: outputs verified ok

  • False: outputs failed to verify

  • None: verification not possible

Parameters:
  • params (AttributeDictionary) – values of parameters used as inputs

  • qc_outputs (AttributeDictionary) – QC outputs returned from the ‘collect_qc_outputs’ method
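
Because ‘verify’ returns one of three values, callers need to distinguish None from False explicitly. The sketch below shows one way to do that; the describe_verification helper is hypothetical and is not part of the module.

```python
def describe_verification(status):
    """Translate a tri-state verification result into a message."""
    # 'is None' is checked first: a plain truthiness test would
    # treat None ("not possible") the same as False ("failed")
    if status is None:
        return "verification not possible"
    return "outputs verified ok" if status else "outputs failed to verify"


for status in (True, False, None):
    print(status, "->", describe_verification(status))
```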

class auto_process_ngs.qc.modules.cellranger_count.CheckCellrangerCountOutputs(_name, *args, **kws)

Check the outputs from cellranger(-atac) count

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, fastq_dir=None, samples=None, qc_dir=None, qc_module=None, extra_projects=None, cellranger_version=None, cellranger_ref_data=None, verbose=False)

Initialise the CheckCellrangerCountOutputs task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • fastq_dir (str) – directory holding Fastq files

  • samples (list) – list of samples to restrict checks to (all samples in project are checked by default)

  • qc_dir (str) – top-level QC directory to look for ‘count’ QC outputs (e.g. metrics CSV and summary HTML files)

  • qc_module (str) – QC protocol being used

  • extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline

  • cellranger_version (str) – version number of 10xGenomics package

  • cellranger_ref_data (str) – name or path to reference dataset for single library analysis

  • verbose (bool) – if True then print additional information from the task

Outputs:
  • fastq_dir (PipelineParam) – pipeline parameter instance that resolves to a string with the path to the directory with Fastq files

  • samples (list) – list of sample names that have missing outputs from ‘cellranger count’

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.modules.cellranger_count.DetermineRequired10xPackage(_name, *args, **kws)

Determine which 10xGenomics software package is required

By default determines the package name based on the supplied QC module, but this can be overridden by explicitly supplying a required package (which can also be a path to an executable).

The output ‘require_cellranger’ parameter should be supplied to the ‘Get10xPackage’ task, which will do the job of actually locating an executable.

init(qc_module, require_cellranger=None)

Initialise the DetermineRequired10xPackage task

Parameters:
  • qc_module (str) – QC module being used

  • require_cellranger (str) – optional package name or path to an executable; if supplied then overrides the automatic package determination

Outputs:
  • require_cellranger (PipelineParam) – the 10xGenomics software package name or path to use

setup()

Set up commands to be performed by the task

Must be implemented by the subclass
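
The override behaviour described for DetermineRequired10xPackage can be sketched as follows. This is not the task's implementation: the DEFAULT_PACKAGES mapping and the QC module names in it are assumptions made for illustration, since the actual mapping is not given here.

```python
# Hypothetical mapping of QC module names to 10x package names; the
# real mapping used by the task is not given in this documentation.
DEFAULT_PACKAGES = {
    "cellranger_count": "cellranger",
    "cellranger-atac_count": "cellranger-atac",
    "cellranger-arc_count": "cellranger-arc",
}


def determine_required_package(qc_module, require_cellranger=None):
    """Return the package name (or executable path) to look for."""
    # An explicitly supplied package or path always takes precedence
    if require_cellranger is not None:
        return require_cellranger
    # Otherwise fall back to the package implied by the QC module
    return DEFAULT_PACKAGES.get(qc_module)


print(determine_required_package("cellranger_count"))
print(determine_required_package("cellranger_count", "/opt/10x/cellranger"))
```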

class auto_process_ngs.qc.modules.cellranger_count.GetCellrangerReferenceData(_name, *args, **kws)

init(project, organism=None, transcriptomes=None, premrna_references=None, atac_references=None, multiome_references=None, cellranger_exe=None, cellranger_version=None, force_reference_data=None)

Initialise the GetCellrangerReferenceData task

Parameters:
  • project (AnalysisProject) – project to run QC for

  • organism (str) – if supplied then must be a string with the names of one or more organisms, with multiple organisms separated by spaces (defaults to the organisms associated with the project)

  • transcriptomes (mapping) – mapping of organism names to reference transcriptome data for cellranger

  • premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger

  • atac_references (mapping) – mapping of organism names to reference genome data for cellranger-atac

  • multiome_references (mapping) – mapping of organism names to reference datasets for cellranger-arc

  • cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)

  • cellranger_version (str) – the version string for the Cellranger package

  • force_reference_data (str) – if supplied then will be used as the reference dataset, instead of trying to locate appropriate reference data automatically

Outputs:
  • reference_data_path (PipelineParam) – pipeline parameter instance which resolves to a string with the path to the reference dataset corresponding to the supplied organism

setup()

Set up commands to be performed by the task

Must be implemented by the subclass
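
The ‘organism’ parameter of GetCellrangerReferenceData may hold several space-separated names, each of which is looked up in a mapping such as ‘transcriptomes’. The sketch below shows that lookup in isolation. The lookup_references helper, the organism names and the paths are hypothetical; how the task normalises names or handles more than one organism is not covered here.

```python
def lookup_references(organism, references):
    """Map each organism name in a space-separated string to its
    reference dataset, or to None if the mapping has no entry."""
    return {name: references.get(name) for name in organism.split()}


# Hypothetical mapping of organism names to reference datasets
transcriptomes = {
    "human": "/refs/refdata-gex-GRCh38-2020-A",
    "mouse": "/refs/refdata-gex-mm10-2020-A",
}

print(lookup_references("human mouse", transcriptomes))
print(lookup_references("zebrafish", transcriptomes))
```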

class auto_process_ngs.qc.modules.cellranger_count.MakeCellrangerArcCountLibraries(_name, *args, **kws)

Make ‘libraries.csv’ files for cellranger-arc count

init(project, qc_dir)

Initialise the MakeCellrangerArcCountLibraries task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – top-level QC directory to put ‘libraries.csv’ files

setup()

Set up commands to be performed by the task

Must be implemented by the subclass
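
For reference, a ‘libraries.csv’ file for cellranger-arc count has three columns (fastqs, sample, library_type). The sketch below writes one with the standard csv module. It is not the MakeCellrangerArcCountLibraries implementation, and the make_libraries_csv helper, paths and sample names are hypothetical.

```python
import csv
import io


def make_libraries_csv(rows):
    """Return the text of a libraries CSV for the supplied
    (fastqs, sample, library_type) rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["fastqs", "sample", "library_type"])
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical paths and sample names for a single multiome sample
content = make_libraries_csv([
    ("/data/gex/fastqs", "SAMPLE1_GEX", "Gene Expression"),
    ("/data/atac/fastqs", "SAMPLE1_ATAC", "Chromatin Accessibility"),
])
print(content)
```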

class auto_process_ngs.qc.modules.cellranger_count.RunCellrangerCount(_name, *args, **kws)

Run ‘cellranger* count’

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(samples, fastq_dir, reference_data_path, library_type, out_dir, single_nuclei=None, qc_dir=None, cellranger_exe=None, cellranger_version=None, chemistry='auto', fastq_dirs=None, force_cells=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None)

Initialise the RunCellrangerCount task.

Parameters:
  • samples (list) – list of sample names to run cellranger count on (it is expected that this list will come from the CheckCellrangerCountOutputs task)

  • fastq_dir (str) – path to directory holding the Fastq files

  • reference_data_path (str) – path to the cellranger compatible reference dataset

  • library_type (str) – type of data being analysed (e.g. ‘scRNA-seq’)

  • out_dir (str) – top-level directory to copy all final ‘count’ outputs into. Outputs won’t be copied if no value is supplied

  • single_nuclei (bool) – explicitly indicate that data are single nuclei rather than single cell

  • qc_dir (str) – top-level QC directory to put ‘count’ QC outputs (e.g. metrics CSV and summary HTML files) into. Outputs won’t be copied if no value is supplied

  • cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)

  • cellranger_version (str) – the version string for the Cellranger package

  • fastq_dirs (dict) – optional, a dictionary mapping sample names to Fastq directories which will be used to override the paths set by the ‘fastq_dir’ argument

  • force_cells (int) – optional, if set then bypasses the cell detection algorithm in ‘cellranger’ and ‘cellranger-atac’ using the ‘--force-cells’ option (does nothing for ‘cellranger-arc’)

  • chemistry (str) – assay configuration (set to ‘auto’ to let cellranger determine this automatically; ignored if not scRNA-seq)

  • cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)

  • cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)

  • cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)

  • cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)

  • cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (defaults to number of slots set in runner)

  • cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass
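
To show how several RunCellrangerCount parameters might map onto a command line, the following sketch assembles an argument list for a single ‘cellranger count’ run. It is an illustration under stated assumptions rather than the task's implementation: the build_count_command helper, paths and sample name are hypothetical, only a subset of options is covered, and ‘cellranger-atac’ and ‘cellranger-arc’ take ‘--reference’ in place of ‘--transcriptome’.

```python
def build_count_command(sample, fastq_dir, reference_data_path,
                        cellranger_exe="cellranger", chemistry="auto",
                        force_cells=None, jobmode="local",
                        localcores=None, localmem=None):
    """Build an argument list for a single 'cellranger count' run."""
    cmd = [
        cellranger_exe, "count",
        f"--id={sample}",
        f"--fastqs={fastq_dir}",
        f"--sample={sample}",
        f"--transcriptome={reference_data_path}",
        f"--chemistry={chemistry}",
        f"--jobmode={jobmode}",
    ]
    # Optional settings are only added when a value was supplied
    optional = (
        ("--force-cells", force_cells),
        ("--localcores", localcores),
        ("--localmem", localmem),
    )
    for flag, value in optional:
        if value is not None:
            cmd.append(f"{flag}={value}")
    return cmd


cmd = build_count_command("SAMPLE1", "/data/fastqs",
                          "/refs/refdata-gex-2020", force_cells=5000)
print(" ".join(cmd))
```

Building the command as a list, rather than as a single string, avoids quoting problems if it is later passed to something like subprocess.run.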

auto_process_ngs.qc.modules.cellranger_count.add_cellranger_count(p, project_name, project, qc_dir, organism, fastq_dir, qc_module_name, library_type, chemistry, transcriptome_references, premrna_references, atac_references, multiome_references, force_cells, single_nuclei=None, samples=None, fastq_dirs=None, cellranger_exe=None, reference_dataset=None, extra_projects=None, cellranger_out_dir=None, cellranger_jobmode=None, cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None, required_tasks=None, verify_runner=None, cellranger_runner=None, envmodules=None, verbose=False)

Add tasks to pipeline to run ‘cellranger* count’

Parameters:
  • p (Pipeline) – pipeline to extend

  • project_name (str) – name to associate with project for reporting tasks

  • project (AnalysisProject) – project to run 10x cellranger pipeline within

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • organism (str) – organism for pipeline

  • fastq_dir (str) – directory holding Fastq files

  • qc_module_name (str) – QC module being used

  • library_type (str) – type of data being analysed (e.g. ‘scRNA-seq’)

  • chemistry (str) – chemistry to use in single library analysis

  • transcriptome_references (mapping) – mapping of organism names to reference transcriptome data for cellranger

  • premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger

  • atac_references (mapping) – mapping of organism names to ATAC-seq reference genome data for cellranger-atac

  • multiome_references (mapping) – mapping of organism names to multiome reference datasets for cellranger-arc

  • force_cells (int) – if set then bypasses the cell detection algorithm in ‘cellranger’ and ‘cellranger-atac’ using the ‘--force-cells’ option (does nothing for ‘cellranger-arc’)

  • single_nuclei (bool) – if set then indicates data are single nuclei rather than single cell

  • samples (list) – optional, list of samples to restrict single library analyses to (or None to use all samples in project)

  • fastq_dirs (dict) – optional, a dictionary mapping sample names to Fastq directories which will be used to override the paths set by the ‘fastq_dir’ argument

  • cellranger_exe (str) – optional, explicitly specify the cellranger executable to use for single library analysis (default: cellranger executable is determined automatically)

  • reference_dataset (str) – optional, path to reference dataset (otherwise will be determined automatically based on organism)

  • extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline

  • cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)

  • cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)

  • cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)

  • cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)

  • cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (default: None)

  • cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)

  • required_tasks (list) – list of tasks that the cellranger pipeline should wait for

  • verify_runner (JobRunner) – runner to use for checks

  • cellranger_runner (JobRunner) – runner to use for running ‘cellranger* count’

  • envmodules (list) – environment module names to load for running Cellranger

  • verbose (bool) – enable verbose output

auto_process_ngs.qc.modules.cellranger_count.check_cellranger_arc_count_outputs(project, qc_dir=None, prefix='cellranger_count')

Return samples missing QC outputs from ‘cellranger-arc count’

Returns a list of the samples from a project for which one or more associated outputs from cellranger-arc count don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to QC directory (if not the default QC directory for the project)

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

list of sample names with missing outputs

Return type:

List

auto_process_ngs.qc.modules.cellranger_count.check_cellranger_atac_count_outputs(project, qc_dir=None, prefix='cellranger_count')

Return samples missing QC outputs from ‘cellranger-atac count’

Returns a list of the samples from a project for which one or more associated outputs from cellranger-atac count don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to QC directory (if not the default QC directory for the project)

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

list of sample names with missing outputs

Return type:

List

auto_process_ngs.qc.modules.cellranger_count.check_cellranger_count_outputs(project, qc_dir=None, prefix='cellranger_count')

Return samples missing QC outputs from ‘cellranger count’

Returns a list of the samples from a project for which one or more associated outputs from cellranger count don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to QC directory (if not the default QC directory for the project)

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

list of sample names with missing outputs

Return type:

List
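
The three check_cellranger*_count_outputs functions share the same idea: report each sample for which at least one expected output is absent from the QC directory. The sketch below shows that idea with sample names passed directly instead of an AnalysisProject. The samples_missing_outputs helper, the file names in EXPECTED and the <qc_dir>/<prefix>/<sample>/ layout are assumptions for illustration; the files actually checked by each function are not listed here.

```python
import os
import tempfile

# Assumed output file names, for illustration only
EXPECTED = ("metrics_summary.csv", "web_summary.html")


def samples_missing_outputs(samples, qc_dir, prefix="cellranger_count"):
    """Return the samples lacking one or more expected output files
    under <qc_dir>/<prefix>/<sample>/ (assumed layout)."""
    missing = []
    for sample in samples:
        sample_dir = os.path.join(qc_dir, prefix, sample)
        if not all(os.path.isfile(os.path.join(sample_dir, f))
                   for f in EXPECTED):
            missing.append(sample)
    return missing


with tempfile.TemporaryDirectory() as qc_dir:
    # Create a complete set of outputs for SAMPLE1 only
    done = os.path.join(qc_dir, "cellranger_count", "SAMPLE1")
    os.makedirs(done)
    for f in EXPECTED:
        open(os.path.join(done, f), "w").close()
    result = samples_missing_outputs(["SAMPLE1", "SAMPLE2"], qc_dir)

print(result)
```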

auto_process_ngs.qc.modules.cellranger_count.filter_10x_pipelines(p, pipelines)

Filter list of 10x pipelines

Pipelines are described using tuples of the form:

(NAME,VERSION,REFERENCE)

for example:

(‘cellranger’,‘6.1.2’,‘refdata-gex-2020’)

Only pipelines matching the specified name, version and reference data will be included in the returned list.

Where the supplied version or reference dataset name are either None or ‘*’, these will match any version and/or reference dataset.

Parameters:
  • p (tuple) – tuple specifying pipeline(s) to match against

  • pipelines (list) – list of pipeline tuples to filter

Returns:

list of matching 10x pipeline tuples.

Return type:

List
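
The matching rules described for filter_10x_pipelines can be sketched in a few lines. This is a re-implementation for illustration, not the library function itself; the filter_pipelines_sketch name and the example tuples are hypothetical.

```python
def filter_pipelines_sketch(p, pipelines):
    """Return the pipeline tuples matching p, where a version or
    reference of None or '*' matches any value."""
    name, version, reference = p

    def matches(wanted, actual):
        return wanted is None or wanted == "*" or wanted == actual

    return [pl for pl in pipelines
            if pl[0] == name
            and matches(version, pl[1])
            and matches(reference, pl[2])]


pipelines = [
    ("cellranger", "6.1.2", "refdata-gex-2020"),
    ("cellranger", "7.0.0", "refdata-gex-2020"),
    ("cellranger-atac", "2.0.0", "refdata-atac-2020"),
]

print(filter_pipelines_sketch(("cellranger", "*", None), pipelines))
print(filter_pipelines_sketch(("cellranger", "6.1.2", "*"), pipelines))
```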

auto_process_ngs.qc.modules.cellranger_count.verify_10x_pipeline(pipeline, samples, qc_outputs)

Check for and verify outputs for 10x package

Parameters:
  • pipeline (tuple) – tuple specifying pipeline(s) to verify

  • samples (list) – list of sample names to verify

  • qc_outputs (AttributeDictionary) – QC outputs returned from the ‘collect_qc_outputs’ method

Returns:

True if at least one set of valid outputs exists for the specified pipeline and sample list, False otherwise.

Return type:

Boolean