pyllelic package

Submodules

pyllelic.config module

Configuration options for pyllelic.

class pyllelic.config.Config(base_directory: Path = PosixPath('/'), promoter_file: Path = PosixPath('/promoter.txt'), results_directory: Path = PosixPath('/results'), analysis_directory: Path = PosixPath('/test'), promoter_start: int = 1293200, promoter_end: int = 1296000, chromosome: str = '5', offset: int = 1298163, viz_backend: str = 'plotly', fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$'))[source]

Bases: object

pyllelic configuration dataclass with TERT promoter defaults.

Pyllelic expects the .bam files to be analyzed to be in analysis_directory, which sits under base_directory, with the promoter reference sequence also in base_directory.

analysis_directory: Path = PosixPath('/test')
base_directory: Path = PosixPath('/')
chromosome: str = '5'
fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$')
offset: int = 1298163
promoter_end: int = 1296000
promoter_file: Path = PosixPath('/promoter.txt')
promoter_start: int = 1293200
results_directory: Path = PosixPath('/results')
viz_backend: str = 'plotly'
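As an illustration of how the default fname_pattern behaves, the sketch below applies it to two made-up filenames. The filenames, the helper name extract_label, and the reading of the captured group as a sample or cell line label are assumptions for illustration, not taken from the pyllelic source.

```python
import re

# Default pattern copied from Config.fname_pattern above.
FNAME_PATTERN = re.compile(r"^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$")


def extract_label(filename):
    """Return the first captured group, or None if the name does not match."""
    match = FNAME_PATTERN.match(filename)
    return match.group(1) if match else None


# Hypothetical filenames, for illustration only.
print(extract_label("fh_SW1710_BLADDER.TERT.bam"))
print(extract_label("sample.bam"))
```

A filename must contain at least two underscores and end in "bam" to match; anything else yields None in this sketch.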

pyllelic.process module

Utilities to pre-process and prepare data for use in pyllelic.

exception pyllelic.process.FileNameError[source]

Bases: Exception

Error for invalid filetypes.

exception pyllelic.process.ShellCommandError[source]

Bases: Exception

Error for shell utilities that aren’t installed.
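One common way to detect a missing shell utility before calling it is shutil.which from the standard library. The sketch below is only a guess at the general idea; the require_tool helper and the locally defined exception class are hypothetical stand-ins, and pyllelic may detect missing tools differently.

```python
import shutil


class ShellCommandError(Exception):
    """Local stand-in for pyllelic.process.ShellCommandError."""


def require_tool(name):
    """Return the full path of an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise ShellCommandError(f"{name} is not installed or not on PATH")
    return path


# A deliberately nonexistent tool name, to show the failure path.
try:
    require_tool("definitely-not-a-real-tool-xyz")
except ShellCommandError as err:
    print(err)
```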

pyllelic.process.bismark(genome: Path, fastq: Path) str[source]

Helper function to run external bismark tool.

Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs

Parameters:
  • genome (Path) – filepath to directory of bismark processed genome files.

  • fastq (Path) – filepath to fastq file to process.

Returns:

output from bismark shell command, usually discarded

Return type:

str

Raises:

ShellCommandError – bismark is not installed.

pyllelic.process.bowtie2_fastq_to_bam(index: Path, fastq: Path, cores: int) str[source]

Helper function to run external bowtie2 and samtools tools to align a fastq file into a bam file.

Parameters:
  • index (Path) – filepath to bowtie index file

  • fastq (Path) – filepath to fastq file to convert to bam

  • cores (int) – number of cores to use for processing

Returns:

output from bowtie2 and samtools shell command, usually discarded

Return type:

str

Raises:

ShellCommandError – bowtie2 is not installed.

pyllelic.process.build_bowtie2_index(fasta: Path) str[source]

Helper function to run external bowtie2-build tool.

Parameters:

fasta (Path) – filepath to fasta file to build index from

Returns:

output from bowtie2-build shell command, usually discarded

Return type:

str

Raises:

ShellCommandError – bowtie2-build is not installed.

pyllelic.process.convert_methbank_bed(path: Path, chrom: str, start: int, stop: int, viz: str = 'plotly') GenomicPositionData[source]

Helper function to convert MethBank BED file into GenomicPositionData obj.

MethBank: https://ngdc.cncb.ac.cn/methbank/

Parameters:
  • path (Path) – path to MethBank formatted BED file

  • chrom (str) – chromosome identifier

  • start (int) – genomic start position

  • stop (int) – genomic stop position

  • viz (str) – Plotting backend to use, defaults to plotly

Returns:

mostly complete pyllelic object with data from BED.

Return type:

GenomicPositionData

pyllelic.process.fastq_to_list(filepath: Path) List[SeqRecord][source]

Read a .fastq or fastq.gz file into an in-memory record_list.

This is a time and memory intensive operation!

Parameters:

filepath (Path) – file path to a fastq.gz file

Returns:

list of biopython sequence records from the fastq file

Return type:

List[SeqRecord]

Raises:

FileNameError – “Wrong filetype”
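The filetype check described above could look roughly like the following. This is a standard-library sketch only: open_fastq and the local FileNameError class are hypothetical, the record contents are made up, and the real function returns Biopython SeqRecords rather than a file handle.

```python
import gzip
import tempfile
from pathlib import Path


class FileNameError(Exception):
    """Local stand-in for pyllelic.process.FileNameError."""


def open_fastq(filepath):
    """Open a .fastq or .fastq.gz file in text mode, rejecting other names."""
    name = Path(filepath).name
    if name.endswith(".fastq.gz"):
        return gzip.open(filepath, "rt")
    if name.endswith(".fastq"):
        return open(filepath, "rt")
    raise FileNameError("Wrong filetype")


# Write a tiny made-up gzipped fastq record and read its header line back.
with tempfile.TemporaryDirectory() as tmpdir:
    fastq_path = Path(tmpdir) / "reads.fastq.gz"
    with gzip.open(fastq_path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    with open_fastq(fastq_path) as handle:
        first_line = handle.readline().strip()

print(first_line)
```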

pyllelic.process.index_bam(bamfile: Path) bool[source]

Helper function to run external samtools index.

Parameters:

bamfile (Path) – filepath to bam file

Returns:

verification of samtools command, usually discarded

Return type:

bool

pyllelic.process.make_records_to_dictionary(record_list: List[SeqRecord]) Dict[str, SeqRecord][source]

Take in a list of biopython SeqRecords and output a dictionary with keys of the record name.

Parameters:

record_list (List[SeqRecord]) – biopython sequence records from a fastq file

Returns:

dict of biopython SeqRecords from a fastq file

Return type:

Dict[str, SeqRecord]
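The list-to-dictionary conversion can be sketched without Biopython by using a simplified record type. Record and records_to_dict are hypothetical names, and the behavior on duplicate names shown here (last one wins) is an assumption rather than documented pyllelic behavior.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    """Simplified stand-in for Bio.SeqRecord.SeqRecord (name and sequence only)."""

    name: str
    seq: str


def records_to_dict(record_list: List[Record]) -> Dict[str, Record]:
    """Key each record by its name; a later duplicate name overwrites an earlier one."""
    return {record.name: record for record in record_list}


# Made-up records, for illustration only.
records = [Record("read1", "ACGT"), Record("read2", "TTGA")]
by_name = records_to_dict(records)
print(by_name["read2"].seq)
```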

pyllelic.process.prepare_genome(index: Path, aligner: Optional[Path] = None) str[source]

Helper function to run external bismark genome preparation tool.

Uses genomes from, e.g.: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs

Parameters:
  • index (Path) – filepath to unprocessed genome file.

  • aligner (Optional[Path]) – filepath to bowtie2 alignment program.

Returns:

output from genome preparation shell command, usually discarded

Return type:

str

Raises:

ShellCommandError – bismark_genome_preparation is not installed.

pyllelic.process.retrieve_seq(filename: str, chrom: str, start: int, end: int) None[source]

Retrieve the genomic sequence of interest from UCSC Genome Browser.

Parameters:
  • filename (str) – path to store genomic sequence

  • chrom (str) – chromosome of interest, e.g. “chr5”

  • start (int) – start position for region of interest

  • end (int) – end position for region of interest

pyllelic.process.sort_bam(bamfile: Path) bool[source]

Helper function to run pysam samtools sort.

Parameters:

bamfile (Path) – filepath to bam file

Returns:

verification of samtools command, usually discarded

Return type:

bool

pyllelic.pyllelic module

pyllelic: a tool for detection of allele-specific variation in reduced representation bisulfite DNA sequencing.

class pyllelic.pyllelic.AD_stats(sig: bool, stat: Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], crits: List[Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]])[source]

Bases: NamedTuple

Helper class for NamedTuple results from anderson_darling_test

crits: List[Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]]

Alias for field number 2

sig: bool

Alias for field number 0

stat: Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]

Alias for field number 1
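Because AD_stats is a NamedTuple, its fields can be read by name, by index, or by unpacking. The sketch below uses a simplified stand-in with plain float types in place of the array-like annotations; the class name ADStats and all numbers are made up for illustration.

```python
from typing import List, NamedTuple


class ADStats(NamedTuple):
    """Simplified stand-in for AD_stats, with plain float types."""

    sig: bool          # field number 0
    stat: float        # field number 1
    crits: List[float]  # field number 2


# Made-up numbers, for illustration only.
result = ADStats(sig=True, stat=1.23, crits=[0.5, 0.6, 0.7])

# Tuple unpacking follows the declared field order.
sig, stat, crits = result
print(result[0] is result.sig, result[1] == result.stat)
```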

class pyllelic.pyllelic.BamOutput(sam_directory: Path, genome_string: str, config: Config)[source]

Bases: object

Storage container to process BAM sequencing files and store processed results.

genome_values: Dict[str, str]

dictionary of read files and contents.

Type:

Dict[str, str]

name: str

path to bam file analyzed.

Type:

str

positions: List[str]

index of genomic positions in the bam file.

Type:

pd.Index

values: Dict[str, str]

dictionary of reads at a given position

Type:

Dict[str, str]

class pyllelic.pyllelic.GenomicPositionData(config: Config, files_set: List[str])[source]

Bases: object

Class to process reduced representation bisulfite methylation sequencing data.

When initialized, GenomicPositionData reads sequencing file (.bam) locations from a config object, and then automatically performs alignment into BamOutput objects, and then performs methylation analysis, storing the results as QumaResults.

Finally, the aggregate data is analyzed to create some aggregate metrics such as means, modes, and differences (diffs), as well as expose methods for plotting and statistical analysis.
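To make the aggregate metrics concrete, the sketch below computes a mean, a mode, and their difference (mean minus mode) for the reads at a single position using only the standard library. The values are invented, and pyllelic itself works on DataFrames, so this shows only the arithmetic.

```python
from statistics import mean, mode

# Made-up fractional methylation values for reads at one position.
read_values = [1.0, 1.0, 0.0, 1.0, 0.5]

position_mean = mean(read_values)
position_mode = mode(read_values)
position_diff = position_mean - position_mode  # mean minus mode

print(position_mean, position_mode, round(position_diff, 3))
```

When reads fall into distinct populations, the mean is pulled away from the most common value, so the difference moves away from zero.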

allelic_data: DataFrame

dataframe of Chi-squared p-values.

Type:

pd.DataFrame

cell_types: List[str]

list of cell types in the data.

Type:

List[str]

config: Config

pyllelic config object.

Type:

Config

diffs: DataFrame

dataframe of differences (mean minus mode) between methylation values.

Type:

pd.DataFrame

file_names: List[str]

list of bam filenames in the data.

Type:

List[str]

files_set: List[str]

list of bam files analyzed.

Type:

List[str]

static from_pickle(filename: str) GenomicPositionData[source]

Read pickled GenomicPositionData back to an object.

Parameters:

filename (str) – filename to read pickle

Returns:

GenomicPositionData object

Return type:

GenomicPositionData

generate_ad_stats() DataFrame[source]

Generate Anderson-Darling normality statistics for an individual data df.

Returns:

df of a-d test statistics

Return type:

pd.DataFrame

heatmap(min_values: int = 1, width: int = 800, height: int = 2000, cell_lines: Optional[List[str]] = None, data_type: str = 'means', backend: Optional[str] = None) None[source]

Display a graph figure showing heatmap of mean methylation across cell lines.

Parameters:
  • min_values (int) – minimum number of data points that must exist at a position

  • width (int) – figure width, defaults to 800

  • height (int) – figure height, defaults to 2000

  • cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

  • data_type (str) – type of data to plot. Can be ‘means’, ‘modes’, ‘diffs’, or ‘pvalue’.

  • backend (Optional[str]) – plotting backend to override default

Raises:
  • ValueError – invalid data type

  • ValueError – invalid plotting backend

histogram(cell_line: str, position: str, backend: Optional[str] = None) None[source]

Display a graph figure showing fractional methylation in a given cell line at a given site.

Parameters:
  • cell_line (str) – name of cell line

  • position (str) – genomic position

  • backend (Optional[str]) – plotting backend to override default

Raises:
  • ValueError – invalid plotting backend

  • ValueError – No data available at that position

individual_data: DataFrame

dataframe of individual methylation values.

Type:

pd.DataFrame

means: DataFrame

dataframe of mean methylation values.

Type:

pd.DataFrame

modes: DataFrame

dataframe of modes of methylation values.

Type:

pd.DataFrame

positions: List[str]

list of genomic positions in the data.

Type:

List[str]

quma_results: Dict[str, QumaResult]

dictionary of QumaResults.

Type:

Dict[str, QumaResult]

reads_graph(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) None[source]

Display a graph figure showing methylation of reads across cell lines.

Parameters:
  • cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

  • backend (Optional[str]) – plotting backend to override default

Raises:
  • ValueError – invalid plotting backend

  • ValueError – Unable to plot more than 20 cell lines at once.

save(filename: str = 'output.xlsx') None[source]

Save quma results to an excel file.

Parameters:

filename (str) – Filename to save to. Defaults to “output.xlsx”.

save_pickle(filename: str) None[source]

Save GenomicPositionData object as a pickled file.

Parameters:

filename (str) – filename to save pickle
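The save and load round trip can be illustrated with the standard pickle module. This is a sketch under the assumption that save_pickle and from_pickle rely on ordinary pickling; the dictionary below is a made-up stand-in for a GenomicPositionData object.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for an analysis object; built-in types only, values made up.
analysis = {"positions": ["1293588", "1294262"], "means": [0.7, 0.25]}

with tempfile.TemporaryDirectory() as tmpdir:
    pickle_path = Path(tmpdir) / "analysis.pickle"

    # Save: serialize the object to disk.
    with open(pickle_path, "wb") as handle:
        pickle.dump(analysis, handle)

    # Load: read the bytes back into an equivalent object.
    with open(pickle_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == analysis)
```

Unpickling can execute arbitrary code, so only load pickle files from a trusted source.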

sig_methylation_differences(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) None[source]

Display a graph figure showing a bar chart of significantly different mean / mode methylation across all or a subset of cell lines.

Parameters:
  • cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

  • backend (Optional[str]) – plotting backend to override default

Raises:

ValueError – invalid plotting backend

summarize_allelic_data(cell_lines: Optional[List[str]] = None) DataFrame[source]

Create a dataframe only of likely allelic methylation positions.

Parameters:
  • cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

Returns:

dataframe of cell lines with likely allelic positions

Return type:

pd.DataFrame

write_means_modes_diffs(filename: str) None[source]

Write out files of means, modes, and diffs for future analysis.

Parameters:

filename (str) – desired root filename

class pyllelic.pyllelic.QumaResult(read_files: List[str], genomic_files: List[str], positions: List[str])[source]

Bases: object

Storage container to process and store quma-style methylation results.

quma_output: List[Quma]

list of Quma result objects.

Type:

List[quma.Quma]

values: DataFrame

dataframe of quma methylation analysis values.

Type:

pd.DataFrame

pyllelic.pyllelic.configure(base_path: str, prom_file: str, prom_start: int, prom_end: int, chrom: str, offset: int, test_dir: Optional[str] = None, fname_pattern: Optional[str] = None, viz_backend: Optional[str] = None, results_dir: Optional[str] = None) Config[source]

Helper method to set up all our environmental variables, such as for testing.

Parameters:
  • base_path (str) – directory where all processing will occur; put .bam files in the “test” sub-directory of this folder

  • prom_file (str) – filename of genomic sequence of promoter region of interest

  • prom_start (int) – start position to analyze in promoter region

  • prom_end (int) – final position to analyze in promoter region

  • chrom (str) – chromosome promoter is located on

  • offset (int) – genomic position of promoter to offset reads

  • test_dir (Optional[str]) – name of test directory where bam files are located

  • fname_pattern (Optional[str]) – regex pattern for processing filenames

  • viz_backend (Optional[str]) – which plotting backend to use

  • results_dir (Optional[str]) – name of results directory

Returns:

configuration dataclass instance.

Return type:

Config

pyllelic.pyllelic.make_list_of_bam_files(config: Config) List[str][source]

Check analysis directory for all valid .bam files.

Parameters:

config (Config) – pyllelic configuration options.

Returns:

list of files

Return type:

List[str]

pyllelic.pyllelic.pyllelic(config: Config, files_set: List[str]) GenomicPositionData[source]

Wrapper to call pyllelic routines.

Parameters:
  • config (Config) – pyllelic config object.

  • files_set (List[str]) – list of bam files to analyze.

Returns:

GenomicPositionData pyllelic object.

Return type:

GenomicPositionData
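Taken together, configure, make_list_of_bam_files, and pyllelic suggest a three-step workflow, sketched below using only the signatures documented in this module. The wrapper name run_example, the base_path argument, and the promoter filename are hypothetical; the numeric values reuse the TERT promoter defaults shown for Config.

```python
def run_example(base_path):
    """Run the configure -> list files -> analyze sequence for one directory."""
    # Imported lazily so this sketch can be loaded without pyllelic installed.
    from pyllelic import pyllelic

    config = pyllelic.configure(
        base_path=base_path,       # hypothetical directory containing "test/"
        prom_file="promoter.txt",  # hypothetical promoter sequence filename
        prom_start=1293200,
        prom_end=1296000,
        chrom="5",
        offset=1298163,
    )
    # Collect the valid .bam files from the analysis directory.
    files_set = pyllelic.make_list_of_bam_files(config)
    # Run the analysis and return the GenomicPositionData object.
    return pyllelic.pyllelic(config=config, files_set=files_set)
```

Calling run_example requires pyllelic to be installed and a directory laid out as described for Config.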

pyllelic.quma module

Tools to quantify methylation in reduced representation bisulfite sequencing reads.

class pyllelic.quma.Fasta(com: str = '', pos: Optional[str] = None, seq: str = '')[source]

Bases: object

Dataclass to wrap fasta results.

com: str = ''
pos: Optional[str] = None
seq: str = ''

class pyllelic.quma.Quma(gfile_contents: str, qfile_contents: str)[source]

Bases: object

Quma methylation analysis parser for bisulfite conversion DNA sequencing.

data: List[Reference]

QUMA Output in object form.

values: str

QUMA output values in tabular form.

class pyllelic.quma.Reference(fasta: Fasta, res: Result, dir: int, gdir: int, exc: int)[source]

Bases: object

Dataclass of quma analysis intermediates.

Includes fasta sequence, quma results, direction of read, genomic direction, and whether result meets exclusion criteria.

dir: int
exc: int
fasta: Fasta
gdir: int
res: Result

class pyllelic.quma.Result(qAli: str = '', gAli: str = '', val: str = '', perc: float = 0.0, pconv: float = 0.0, gap: int = 0, menum: int = 0, unconv: int = 0, conv: int = 0, match: int = 0, aliMis: int = 0, aliLen: int = 0)[source]

Bases: object

Dataclass of quma alignment comparison results.

aliLen: int = 0
aliMis: int = 0
conv: int = 0
gAli: str = ''
gap: int = 0
match: int = 0
menum: int = 0
pconv: float = 0.0
perc: float = 0.0
qAli: str = ''
unconv: int = 0
val: str = ''

pyllelic.visualization module

Utilities to visualize data for use in pyllelic.

pyllelic.visualization._create_heatmap(df: DataFrame, min_values: int, width: int, height: int, title_type: str, backend: str) Union[Figure, Figure][source]

Generate a graph figure showing heatmap of mean methylation across cell lines.

Parameters:
  • df (pd.DataFrame) – dataframe of mean methylation

  • min_values (int) – minimum number of data points that must exist at a position

  • width (int) – figure width

  • height (int) – figure height

  • title_type (str) – type of figure being plotted

  • backend (str) – which plotting backend to use

Returns:

plotly or matplotlib figure object

Return type:

Union[go.Figure, plt.Figure]

Raises:

ValueError – invalid plotting backend

pyllelic.visualization._create_histogram(data: DataFrame, cell_line: str, position: str, backend: str) Union[Figure, Figure][source]

Generate a graph figure showing fractional methylation in a given cell line at a given site.

Parameters:
  • data (pd.DataFrame) – dataframe of individual data

  • cell_line (str) – name of cell line

  • position (str) – genomic position

  • backend (str) – which plotting backend to use

Returns:

plotly or matplotlib figure object

Return type:

Union[go.Figure, plt.Figure]

Raises:

ValueError – invalid plotting backend provided

pyllelic.visualization._create_methylation_diffs_bar_graph(df: DataFrame, backend: str) Union[Figure, Figure][source]

Generate a graph figure showing bar graph of significant methylation across cell lines.

Parameters:
  • df (pd.DataFrame) – dataframe of significant methylation positions

  • backend (str) – which plotting backend to use

Returns:

plotly or matplotlib figure object

Return type:

Union[go.Figure, plt.Figure]

Raises:

ValueError – invalid plotting backend

pyllelic.visualization._make_binary(data: Optional[List[int]]) List[int][source]

pyllelic.visualization._make_methyl_df(df: DataFrame, row: str) DataFrame[source]

pyllelic.visualization._make_stacked_fig(df: DataFrame, backend: str) Union[Figure, Figure][source]

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters:
  • df (pd.DataFrame) – dataframe of individual read data

  • backend (str) – plotting backend to use

Returns:

plotly or matplotlib figure

Return type:

Union[go.Figure, plt.Figure]

Raises:

ValueError – invalid plotting backend

pyllelic.visualization._make_stacked_mpl_fig(df: DataFrame) Figure[source]

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters:

df (pd.DataFrame) – dataframe of individual read data

Returns:

matplotlib figure

Return type:

plt.Figure

pyllelic.visualization._make_stacked_plotly_fig(df: DataFrame) Figure[source]

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters:

df (pd.DataFrame) – dataframe of individual read data

Returns:

plotly figure

Return type:

go.Figure

pyllelic.main module

pyllelic: module level interface to run pyllelic from the command line.

Example usage:

python -m pyllelic -o my_data -f fh_cellline_tissue.fastq.gz -g hg19chr5 -chr chr5 -s 1293000 -e 1296000 --viz plotly

This command would save pyllelic results in files with the prefix my_data, analyzing the specified fastq file using the specified reference genome, in the genomic region indicated.
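The same command can be assembled as an argument list for use from a Python script. The sketch below only builds and prints the command; the long option is assumed to be spelled with two hyphens (--viz), and the file and genome names are the placeholders from the example usage.

```python
import shlex
import sys

# Arguments mirroring the example command, using the current interpreter.
args = [
    sys.executable, "-m", "pyllelic",
    "-o", "my_data",
    "-f", "fh_cellline_tissue.fastq.gz",
    "-g", "hg19chr5",
    "-chr", "chr5",
    "-s", "1293000",
    "-e", "1296000",
    "--viz", "plotly",
]

# Printed rather than executed; pass `args` to subprocess.run to launch it.
print(shlex.join(args))
```

Passing a list to subprocess.run avoids shell quoting problems when file paths contain spaces.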

pyllelic.__main__.run_pyllelic() None[source]

Run all processing and analysis steps of pyllelic.