pyllelic package

Submodules

pyllelic.config module

Configuration options for pyllelic.

class pyllelic.config.Config(base_directory: Path = PosixPath('/'), promoter_file: Path = PosixPath('/promoter.txt'), results_directory: Path = PosixPath('/results'), analysis_directory: Path = PosixPath('/test'), promoter_start: int = 1293200, promoter_end: int = 1296000, chromosome: str = '5', offset: int = 1298163, viz_backend: str = 'plotly', fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$'))[source]

Bases: object

pyllelic configuration dataclass with TERT promoter defaults.

Pyllelic expects the .bam files to be analyzed to be in analysis_directory, a subdirectory of base_directory, with the promoter reference sequence in base_directory itself.
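The expected layout can be sketched with pathlib. The sub-path names mirror the documented defaults (test, results, promoter.txt); the temporary base directory, the placeholder sequence, and the sample .bam filename are illustrative assumptions only.

```python
import tempfile
from pathlib import Path

# Hypothetical base directory; in practice this is wherever your data lives.
base_directory = Path(tempfile.mkdtemp())

# Sub-paths named after the documented Config defaults.
analysis_directory = base_directory / "test"
results_directory = base_directory / "results"
promoter_file = base_directory / "promoter.txt"

analysis_directory.mkdir()
results_directory.mkdir()
promoter_file.write_text("ACGT")  # placeholder, not a real promoter sequence

# An empty placeholder standing in for a real alignment file.
sample_bam = analysis_directory / "fh_SW1710_URINARY_TRACT.bam"
sample_bam.touch()
```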

analysis_directory: Path = PosixPath('/test')

base_directory: Path = PosixPath('/')

chromosome: str = '5'

fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$')

offset: int = 1298163

promoter_end: int = 1296000

promoter_file: Path = PosixPath('/promoter.txt')

promoter_start: int = 1293200

results_directory: Path = PosixPath('/results')

viz_backend: str = 'plotly'
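As a minimal sketch of how the default fname_pattern behaves, the snippet below applies the documented regular expression to two filenames. The helper name captured_field and the sample filenames are hypothetical; what pyllelic does with the captured group is not shown here.

```python
import re
from typing import Optional

# Default pattern copied from the Config signature above.
FNAME_PATTERN = re.compile(r"^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$")


def captured_field(filename: str) -> Optional[str]:
    """Return the text captured by group 1, or None if the name does not match."""
    match = FNAME_PATTERN.match(filename)
    return match.group(1) if match else None


matched = captured_field("fh_SW1710_URINARY_TRACT.bam")
unmatched = captured_field("SW1710.bam")  # no letters-only prefix followed by "_"
```

Filenames that do not follow the prefix_field_rest.bam shape return None, so a custom fname_pattern is needed for other naming schemes.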

pyllelic.process module

Utilities to pre-process and prepare data for use in pyllelic.

exception pyllelic.process.FileNameError[source]

Bases: Exception

Error for invalid filetypes.

exception pyllelic.process.ShellCommandError[source]

Bases: Exception

Error for shell utilities that aren’t installed.

pyllelic.process.bismark(genome: Path, fastq: Path) → str[source]

Helper function to run external bismark tool.

Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs

Parameters:

genome (Path) – filepath to directory of bismark processed genome files.
fastq (Path) – filepath to fastq file to process.

Returns:

output from bismark shell command, usually discarded

Return type:

str

Raises:

ShellCommandError – bismark is not installed.
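One way a helper of this kind could detect a missing tool is sketched below. This is not pyllelic's implementation: the ShellCommandError class is a local stand-in, and run_external is a hypothetical name.

```python
import shutil
import subprocess
from typing import List


class ShellCommandError(Exception):
    """Local stand-in for pyllelic.process.ShellCommandError."""


def run_external(command: List[str]) -> str:
    """Run a command and return its stdout, raising if the tool is missing."""
    if shutil.which(command[0]) is None:
        raise ShellCommandError(f"{command[0]} is not installed")
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


# A tool name that should not exist on any system.
try:
    run_external(["definitely-not-a-real-tool-xyz"])
    missing_tool_detected = False
except ShellCommandError:
    missing_tool_detected = True
```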

pyllelic.process.bowtie2_fastq_to_bam(index: Path, fastq: Path, cores: int) → str[source]

Helper function to run the external bowtie2 tool and convert a fastq file to bam.

Parameters:

index (Path) – filepath to bowtie index file
fastq (Path) – filepath to fastq file to convert to bam
cores (int) – number of cores to use for processing

Returns:

output from bowtie2 and samtools shell command, usually discarded

Return type:

str

Raises:

ShellCommandError – bowtie2 is not installed.

pyllelic.process.build_bowtie2_index(fasta: Path) → str[source]

Helper function to run external bowtie2-build tool.

Parameters: fasta (Path) – filepath to fasta file to build index from
Returns: output from bowtie2-build shell command, usually discarded
Return type: str
Raises: ShellCommandError – bowtie2-build is not installed.

pyllelic.process.convert_methbank_bed(path: Path, chrom: str, start: int, stop: int, viz: str = 'plotly') → GenomicPositionData[source]

Helper function to convert MethBank BED file into GenomicPositionData obj.

MethBank: https://ngdc.cncb.ac.cn/methbank/

Parameters:

path (Path) – path to MethBank formatted BED file
chrom (str) – chromosome identifier
start (int) – genomic start position
stop (int) – genomic stop position
viz (str) – Plotting backend to use, defaults to plotly

Returns:

mostly complete pyllelic object with data from BED.

Return type:

GenomicPositionData

pyllelic.process.fastq_to_list(filepath: Path) → List[SeqRecord][source]

Read a .fastq or .fastq.gz file into an in-memory record_list.

This is a time- and memory-intensive operation!

Parameters: filepath (Path) – file path to a .fastq or .fastq.gz file
Returns: list of biopython sequence records from the fastq file
Return type: List[SeqRecord]
Raises: FileNameError – “Wrong filetype”
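The extension check and record reading can be illustrated with the standard library alone. This sketch is not the library's code: the documented function returns Biopython SeqRecord objects, whereas the hypothetical read_fastq below returns plain tuples and assumes the simple four-lines-per-record FASTQ layout.

```python
import gzip
import tempfile
from pathlib import Path
from typing import List, Tuple


class FileNameError(Exception):
    """Local stand-in for pyllelic.process.FileNameError."""


def read_fastq(filepath: Path) -> List[Tuple[str, str, str]]:
    """Return (name, sequence, quality) tuples from a .fastq or .fastq.gz file."""
    name = filepath.name
    if name.endswith(".fastq.gz"):
        opener = gzip.open
    elif name.endswith(".fastq"):
        opener = open
    else:
        raise FileNameError("Wrong filetype")

    with opener(filepath, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle]

    # Assumes four lines per record (header, sequence, "+", quality).
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i : i + 4]
        records.append((header[1:], sequence, quality))
    return records


# Usage: write a tiny gzipped FASTQ file and read it back.
tmp_path = Path(tempfile.mkdtemp()) / "reads.fastq.gz"
with gzip.open(tmp_path, "wt") as fh:
    fh.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nJJJJ\n")
records = read_fastq(tmp_path)

try:
    read_fastq(Path("reads.txt"))
    wrong_type_rejected = False
except FileNameError:
    wrong_type_rejected = True
```

Because every record is held in memory at once, large sequencing files can be slow and memory hungry, as the warning above notes.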

pyllelic.process.index_bam(bamfile: Path) → bool[source]

Helper function to run external samtools index.

Parameters: bamfile (Path) – filepath to bam file
Returns: verification of samtools command, usually discarded
Return type: bool

pyllelic.process.make_records_to_dictionary(record_list: List[SeqRecord]) → Dict[str, SeqRecord][source]

Take in a list of biopython SeqRecords and output a dictionary keyed by record name.

Parameters: record_list (List[SeqRecord]) – biopython sequence records from a fastq file
Returns: dict of biopython SeqRecords from a fastq file
Return type: Dict[str, SeqRecord]
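The list-to-dictionary step amounts to indexing records by name. In this sketch, Record is a hypothetical stand-in carrying only the two fields needed for illustration; real SeqRecord objects hold more.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    """Minimal stand-in for a Biopython SeqRecord (illustration only)."""
    name: str
    seq: str


def records_to_dictionary(record_list: List[Record]) -> Dict[str, Record]:
    """Index records by name; a later duplicate name overwrites an earlier one."""
    return {record.name: record for record in record_list}


records = [Record("read1", "ACGT"), Record("read2", "TTGA")]
by_name = records_to_dictionary(records)
```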

pyllelic.process.prepare_genome(index: Path, aligner: Optional[Path] = None) → str[source]

Helper function to run external bismark genome preparation tool.

Uses genomes from, e.g.: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs

Parameters:

index (Path) – filepath to unprocessed genome file.
aligner (Optional[Path]) – filepath to bowtie2 alignment program.

Returns:

output from genome preparation shell command, usually discarded

Return type:

str

Raises:

ShellCommandError – bismark_genome_preparation is not installed.

pyllelic.process.retrieve_seq(filename: str, chrom: str, start: int, end: int) → None[source]

Retrieve the genomic sequence of interest from UCSC Genome Browser.

Parameters:

filename (str) – path to store genomic sequence
chrom (str) – chromosome of interest, e.g. “chr5”
start (int) – start position for region of interest
end (int) – end position for region of interest

pyllelic.process.sort_bam(bamfile: Path) → bool[source]

Helper function to run pysam samtools sort.

Parameters: bamfile (Path) – filepath to bam file
Returns: verification of samtools command, usually discarded
Return type: bool

pyllelic.pyllelic module

pyllelic: a tool for detection of allele-specific variation in reduced representation bisulfite DNA sequencing.

class pyllelic.pyllelic.AD_stats(sig: bool, stat: ArrayLike, crits: List[ArrayLike])[source]

Bases: NamedTuple

Helper class for NamedTuple results from anderson_darling_test.

crits: List[ArrayLike]: Alias for field number 2

sig: bool: Alias for field number 0

stat: ArrayLike: Alias for field number 1

class pyllelic.pyllelic.BamOutput(sam_directory: Path, genome_string: str, config: Config)[source]

Bases: object

Storage container to process BAM sequencing files and store processed results.

genome_values: Dict[str, str]

dictionary of read files and contents.

Type: Dict[str, str]

name: str

path to bam file analyzed.

Type: str

positions: List[str]

index of genomic positions in the bam file.

Type: pd.Index

values: Dict[str, str]

dictionary of reads at a given position

Type: Dict[str, str]

class pyllelic.pyllelic.GenomicPositionData(config: Config, files_set: List[str])[source]

Bases: object

Class to process reduced representation bisulfite methylation sequencing data.

When initialized, GenomicPositionData reads sequencing file (.bam) locations from a config object, and then automatically performs alignment into BamOutput objects, and then performs methylation analysis, storing the results as QumaResults.

Finally, the aggregate data is analyzed to create some aggregate metrics such as means, modes, and differences (diffs), as well as expose methods for plotting and statistical analysis.

allelic_data: DataFrame

dataframe of Chi-squared p-values.

Type: pd.DataFrame

cell_types: List[str]

list of cell types in the data.

Type: List[str]

config: Config

pyllelic config object.

Type: Config

diffs: DataFrame

dataframe of differences (mean minus mode) in methylation values.

Type: pd.DataFrame
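The mean-minus-mode difference can be illustrated for a single position with the standard library. The list of per-read methylation fractions is made up, and treating each read as a single fraction is an assumption for illustration rather than a description of pyllelic's internals.

```python
from statistics import mean, mode
from typing import List


def mean_minus_mode(values: List[float]) -> float:
    """Difference between the mean and the most common value."""
    return mean(values) - mode(values)


# Hypothetical per-read methylation fractions at one genomic position.
reads = [1.0, 1.0, 0.5, 0.0]
diff = mean_minus_mode(reads)  # mean is 0.625, mode is 1.0
```

A difference far from zero indicates that the typical read differs from the average, which is what a mixed population of methylated and unmethylated reads would produce.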

file_names: List[str]

list of bam filenames in the data.

Type: List[str]

files_set: List[str]

list of bam files analyzed.

Type: List[str]

static from_pickle(filename: str) → GenomicPositionData[source]

Read pickled GenomicPositionData back to an object.

Parameters: filename (str) – filename to read pickle
Returns: GenomicPositionData object
Return type: GenomicPositionData

generate_ad_stats() → DataFrame[source]

Generate Anderson-Darling normality statistics for an individual data df.

Returns: df of a-d test statistics
Return type: pd.DataFrame
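For readers unfamiliar with the test, the sketch below computes the textbook Anderson-Darling A² statistic against a normal distribution whose mean and standard deviation are estimated from the sample. It is a from-scratch illustration using only the standard library, not the routine pyllelic calls, and both sample datasets are invented.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import List


def anderson_darling_statistic(values: List[float]) -> float:
    """Textbook A^2 statistic against a normal with estimated mean and sd."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    cdf = [fitted.cdf(x) for x in sorted(values)]
    total = sum(
        (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


# Evenly spaced normal quantiles: close to an ideal normal sample.
normal_like = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
# Half the values at 0 and half at 1: a strongly bimodal sample.
bimodal = [0.0] * 10 + [1.0] * 10

a2_normal = anderson_darling_statistic(normal_like)
a2_bimodal = anderson_darling_statistic(bimodal)
```

The bimodal sample yields a much larger statistic than the normal-like one, which is the kind of departure from normality the test is meant to flag.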

heatmap(min_values: int = 1, width: int = 800, height: int = 2000, cell_lines: Optional[List[str]] = None, data_type: str = 'means', backend: Optional[str] = None) → None[source]

Display a graph figure showing heatmap of mean methylation across cell lines.

Parameters:

min_values (int) – minimum number of data points that must exist at a position
width (int) – figure width, defaults to 800
height (int) – figure height, defaults to 2000
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
data_type (str) – type of data to plot. Can be ‘means’, ‘modes’, ‘diffs’, or ‘pvalue’.
backend (Optional[str]) – plotting backend to override default

Raises:

ValueError – invalid data type
ValueError – invalid plotting backend
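The two ValueError cases can be pictured as a small validation step. This is a hypothetical sketch: the data type names come from the parameter list above, while the backend spelling "matplotlib" and the helper name check_heatmap_args are assumptions.

```python
from typing import Optional

VALID_DATA_TYPES = {"means", "modes", "diffs", "pvalue"}  # from the parameter list
VALID_BACKENDS = {"plotly", "matplotlib"}  # "matplotlib" spelling is an assumption


def check_heatmap_args(
    data_type: str, backend: Optional[str], default_backend: str = "plotly"
) -> str:
    """Validate arguments and return the backend that would be used."""
    if data_type not in VALID_DATA_TYPES:
        raise ValueError(f"invalid data type: {data_type!r}")
    chosen = backend if backend is not None else default_backend
    if chosen not in VALID_BACKENDS:
        raise ValueError(f"invalid plotting backend: {chosen!r}")
    return chosen


chosen = check_heatmap_args("means", backend=None)  # falls back to the default
try:
    check_heatmap_args("medians", backend=None)
    rejected = False
except ValueError:
    rejected = True
```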

histogram(cell_line: str, position: str, backend: Optional[str] = None) → None[source]

Display a graph figure showing fractional methylation in a given cell line at a given site.

Parameters:

cell_line (str) – name of cell line
position (str) – genomic position
backend (Optional[str]) – plotting backend to override default

Raises:

ValueError – invalid plotting backend
ValueError – No data available at that position

individual_data: DataFrame

dataframe of individual methylation values.

Type: pd.DataFrame

means: DataFrame

dataframe of mean methylation values.

Type: pd.DataFrame

modes: DataFrame

dataframe of modes of methylation values.

Type: pd.DataFrame

positions: List[str]

list of genomic positions in the data.

Type: List[str]

quma_results: Dict[str, QumaResult]

dictionary of QumaResults.

Type: Dict[str, QumaResult]

reads_graph(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) → None[source]

Display a graph figure showing methylation of reads across cell lines.

Parameters:

cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
backend (Optional[str]) – plotting backend to override default

Raises:

ValueError – invalid plotting backend
ValueError – Unable to plot more than 20 cell lines at once.

save(filename: str = 'output.xlsx') → None[source]

Save quma results to an Excel file.

Parameters: filename (str) – Filename to save to. Defaults to “output.xlsx”.

save_pickle(filename: str) → None[source]

Save GenomicPositionData object as a pickled file.

Parameters: filename (str) – filename to save pickle
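The save_pickle and from_pickle pair follows the usual pickle round trip. The sketch below shows that pattern on a plain dictionary standing in for a GenomicPositionData object; the function bodies and the sample data are assumptions, not the library's code.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_pickle(obj: Any, filename: str) -> None:
    """Serialize an object to a file."""
    with open(filename, "wb") as handle:
        pickle.dump(obj, handle)


def from_pickle(filename: str) -> Any:
    """Load an object back from a pickle file."""
    with open(filename, "rb") as handle:
        return pickle.load(handle)


# Hypothetical results; a real object would hold dataframes and more.
original = {"positions": ["1293589"], "means": {"1293589": 0.75}}
pickle_path = str(Path(tempfile.mkdtemp()) / "results.pickle")
save_pickle(original, pickle_path)
restored = from_pickle(pickle_path)
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.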

sig_methylation_differences(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) → None[source]

Display a graph figure showing a bar chart of significantly different mean / mode methylation across all or a subset of cell lines.

Parameters:

cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
backend (Optional[str]) – plotting backend to override default

Raises:

ValueError – invalid plotting backend

summarize_allelic_data(cell_lines: Optional[List[str]] = None) → DataFrame[source]

Create a dataframe only of likely allelic methylation positions.

Parameters:

cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

Returns:

dataframe of cell lines with likely allelic positions

Return type:

pd.DataFrame

write_means_modes_diffs(filename: str) → None[source]

Write out files of means, modes, and diffs for future analysis.

Parameters: filename (str) – desired root filename

class pyllelic.pyllelic.QumaResult(read_files: List[str], genomic_files: List[str], positions: List[str])[source]

Bases: object

Storage container to process and store quma-style methylation results.

quma_output: List[Quma]

list of Quma result objects.

Type: List[quma.Quma]

values: DataFrame

dataframe of quma methylation analysis values.

Type: pd.DataFrame

pyllelic.pyllelic.configure(base_path: str, prom_file: str, prom_start: int, prom_end: int, chrom: str, offset: int, test_dir: Optional[str] = None, fname_pattern: Optional[str] = None, viz_backend: Optional[str] = None, results_dir: Optional[str] = None) → Config[source]

Helper function to set up all configuration options, such as for testing.

Parameters:

base_path (str) – directory where all processing will occur; put .bam files in the “test” sub-directory of this folder
prom_file (str) – filename of genomic sequence of promoter region of interest
prom_start (int) – start position to analyze in promoter region
prom_end (int) – final position to analyze in promoter region
chrom (str) – chromosome promoter is located on
offset (int) – genomic position of promoter to offset reads
test_dir (Optional[str]) – name of test directory where bam files are located
fname_pattern (Optional[str]) – regex pattern for processing filenames
viz_backend (Optional[str]) – which plotting backend to use
results_dir (Optional[str]) – name of results directory

Returns:

configuration dataclass instance.

Return type:

Config

pyllelic.pyllelic.make_list_of_bam_files(config: Config) → List[str][source]

Check analysis directory for all valid .bam files.

Parameters: config (Config) – pyllelic configuration options.
Returns: list of files
Return type: List[str]
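A directory scan of this kind might look like the sketch below. What pyllelic treats as a "valid" .bam file is not specified here, so this hypothetical list_bam_files only checks the file extension; the sample filenames are invented.

```python
import tempfile
from pathlib import Path
from typing import List


def list_bam_files(analysis_directory: Path) -> List[str]:
    """Return sorted names of files ending in .bam directly under the directory."""
    return sorted(
        path.name
        for path in analysis_directory.iterdir()
        if path.is_file() and path.suffix == ".bam"
    )


# Usage: index files (.bam.bai) and unrelated files are skipped.
directory = Path(tempfile.mkdtemp())
for name in ("fh_A_x.bam", "fh_B_y.bam", "fh_A_x.bam.bai", "notes.txt"):
    (directory / name).touch()
found = list_bam_files(directory)
```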

pyllelic.pyllelic.pyllelic(config: Config, files_set: List[str]) → GenomicPositionData[source]

Wrapper to call pyllelic routines.

Parameters:

config (Config) – pyllelic config object.
files_set (List[str]) – list of bam files to analyze.

Returns:

GenomicPositionData pyllelic object.

Return type:

GenomicPositionData
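Putting the documented functions together, a typical session might look like the sketch below. It assumes pyllelic is installed and that base_path follows the directory layout described for Config; the argument values reuse the documented TERT defaults, and the wrapper name run_tert_example is hypothetical.

```python
def run_tert_example(base_path: str):
    """Configure pyllelic, collect .bam files, and run the analysis."""
    # Imported lazily so that defining this function does not require pyllelic.
    from pyllelic import pyllelic

    config = pyllelic.configure(
        base_path=base_path,
        prom_file="promoter.txt",
        prom_start=1293200,
        prom_end=1296000,
        chrom="5",
        offset=1298163,
        test_dir="test",
        viz_backend="plotly",
    )
    files_set = pyllelic.make_list_of_bam_files(config)
    return pyllelic.pyllelic(config=config, files_set=files_set)
```

The returned GenomicPositionData object then exposes the documented attributes and methods, for example data.means, data.heatmap(min_values=1), or data.save("output.xlsx").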

pyllelic.quma module

Tools to quantify methylation in reduced representation bisulfite sequencing reads.

class pyllelic.quma.Fasta(com: str = '', pos: Optional[str] = None, seq: str = '')[source]

Bases: object

Dataclass to wrap fasta results.

com: str = ''

pos: Optional[str] = None

seq: str = ''

class pyllelic.quma.Quma(gfile_contents: str, qfile_contents: str)[source]

Bases: object

Quma methylation analysis parser for bisulfite conversion DNA sequencing.

data: List[Reference]: QUMA Output in object form.

values: str: QUMA output values in tabular form.

class pyllelic.quma.Reference(fasta: Fasta, res: Result, dir: int, gdir: int, exc: int)[source]

Bases: object

Dataclass of quma analysis intermediates.

Includes fasta sequence, quma results, direction of read, genomic direction, and whether result meets exclusion criteria.

dir: int

exc: int

fasta: Fasta

gdir: int

res: Result

class pyllelic.quma.Result(qAli: str = '', gAli: str = '', val: str = '', perc: float = 0.0, pconv: float = 0.0, gap: int = 0, menum: int = 0, unconv: int = 0, conv: int = 0, match: int = 0, aliMis: int = 0, aliLen: int = 0)[source]

Bases: object

Dataclass of quma alignment comparison results.

aliLen: int = 0

aliMis: int = 0

conv: int = 0

gAli: str = ''

gap: int = 0

match: int = 0

menum: int = 0

pconv: float = 0.0

perc: float = 0.0

qAli: str = ''

unconv: int = 0

val: str = ''

pyllelic.visualization module

Utilities to visualize data for use in pyllelic.

pyllelic.visualization._create_heatmap(df: DataFrame, min_values: int, width: int, height: int, title_type: str, backend: str) → Union[Figure, Figure][source]

Generate a graph figure showing heatmap of mean methylation across cell lines.

Parameters:

df (pd.DataFrame) – dataframe of mean methylation
min_values (int) – minimum number of data points that must exist at a position
width (int) – figure width
height (int) – figure height
title_type (str) – type of figure being plotted
backend (str) – which plotting backend to use

Returns:

plotly or matplotlib figure object

Return type:

Union[go.Figure, plt.Figure]

Raises:

ValueError – invalid plotting backend

pyllelic.visualization._create_histogram(data: DataFrame, cell_line: str, position: str, backend: str) → Union[Figure, Figure][source]

Generate a graph figure showing fractional methylation in a given cell line at a given site.

Parameters:

data (pd.DataFrame) – dataframe of individual data
cell_line (str) – name of cell line
position (str) – genomic position
backend (str) – which plotting backend to use

Returns:

plotly or matplotlib figure object

Return type:

Union[go.Figure, plt.Figure]

Raises:

ValueError – invalid plotting backend provided

pyllelic.visualization._create_methylation_diffs_bar_graph(df: DataFrame, backend: str) → Union[Figure, Figure][source]

Generate a graph figure showing bar graph of significant methylation across cell lines.

Parameters:

df (pd.DataFrame) – dataframe of significant methylation positions
backend (str) – which plotting backend to use

Returns:

plotly or matplotlib figure object

Return type:

Union[go.Figure, plt.Figure]

Raises:

ValueError – invalid plotting backend

pyllelic.visualization._make_binary(data: Optional[List[int]]) → List[int][source]

pyllelic.visualization._make_methyl_df(df: DataFrame, row: str) → DataFrame[source]

pyllelic.visualization._make_stacked_fig(df: DataFrame, backend: str) → Union[Figure, Figure][source]

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters:

df (pd.DataFrame) – dataframe of individual read data
backend (str) – plotting backend to use

Returns:

plotly or matplotlib figure

Return type:

Union[go.Figure, plt.Figure]

Raises:

ValueError – invalid plotting backend

pyllelic.visualization._make_stacked_mpl_fig(df: DataFrame) → Figure[source]

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters: df (pd.DataFrame) – dataframe of individual read data
Returns: matplotlib figure
Return type: plt.Figure

pyllelic.visualization._make_stacked_plotly_fig(df: DataFrame) → Figure[source]

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters: df (pd.DataFrame) – dataframe of individual read data
Returns: plotly figure
Return type: go.Figure

pyllelic.__main__ module

pyllelic: module level interface to run pyllelic from the command line.

Example usage:

python -m pyllelic -o my_data -f fh_cellline_tissue.fastq.gz -g hg19chr5 -chr chr5 -s 1293000 -e 1296000 --viz plotly

This command would save pyllelic results in files with the prefix my_data, analyzing the specified fastq file using the specified reference genome, in the genomic region indicated.
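To make the flags concrete, the sketch below parses the example command line with argparse. It is not pyllelic's actual parser: the flag spellings are taken from the example (reading the last flag as the long option --viz), while every dest name and the meaning attached to each flag are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(prog="pyllelic")
parser.add_argument("-o", dest="output_prefix")  # prefix for result files
parser.add_argument("-f", dest="fastq")          # fastq file to analyze
parser.add_argument("-g", dest="genome")         # reference genome
parser.add_argument("-chr", dest="chrom")        # chromosome of the region
parser.add_argument("-s", dest="start", type=int)
parser.add_argument("-e", dest="end", type=int)
parser.add_argument("--viz", dest="viz", default="plotly")

args = parser.parse_args(
    [
        "-o", "my_data",
        "-f", "fh_cellline_tissue.fastq.gz",
        "-g", "hg19chr5",
        "-chr", "chr5",
        "-s", "1293000",
        "-e", "1296000",
        "--viz", "plotly",
    ]
)
```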

pyllelic.__main__.run_pyllelic() → None[source]

Run all processing and analysis steps of pyllelic.