pyllelic package
Submodules
pyllelic.config module
Configuration options for pyllelic.
- class pyllelic.config.Config(base_directory: Path = PosixPath('/'), promoter_file: Path = PosixPath('/promoter.txt'), results_directory: Path = PosixPath('/results'), analysis_directory: Path = PosixPath('/test'), promoter_start: int = 1293200, promoter_end: int = 1296000, chromosome: str = '5', offset: int = 1298163, viz_backend: str = 'plotly', fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$'))[source]
Bases:
object
pyllelic configuration dataclass with TERT promoter defaults.
Pyllelic expects the .bam files to be analyzed in analysis_directory, which sits under base_directory, with the promoter reference sequence also in base_directory.
- analysis_directory: Path = PosixPath('/test')
- base_directory: Path = PosixPath('/')
- chromosome: str = '5'
- fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$')
- offset: int = 1298163
- promoter_end: int = 1296000
- promoter_file: Path = PosixPath('/promoter.txt')
- promoter_start: int = 1293200
- results_directory: Path = PosixPath('/results')
- viz_backend: str = 'plotly'
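As an illustration of the fname_pattern default, the sketch below applies the same regular expression to made-up filenames. The filenames are hypothetical, and reading the captured group as a cell-line identifier is an assumption based on the fh_cellline_tissue naming used in the command-line example, not something this page states.

```python
import re
from typing import Optional

# Same pattern as the Config.fname_pattern default listed above.
FNAME_PATTERN = re.compile(r"^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$")


def captured_name(filename: str) -> Optional[str]:
    """Return the first capture group, or None if the filename does not match."""
    match = FNAME_PATTERN.match(filename)
    return match.group(1) if match else None


# Hypothetical filenames, for illustration only.
print(captured_name("fh_SW1710_URINARY_TRACT.TERT.bam"))  # SW1710
print(captured_name("notes.txt"))  # None
```

A filename needs at least two underscores and must end in bam to match, so a name such as fh_SW1710.bam would not be captured by the default pattern.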
pyllelic.process module
Utilities to pre-process and prepare data for use in pyllelic.
- exception pyllelic.process.ShellCommandError[source]
Bases:
Exception
Error for shell utilities that aren’t installed.
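A minimal sketch of how a helper might raise an error of this kind when an external tool is missing. The exception class here is a local stand-in with the same name, and run_tool is a hypothetical helper; neither is taken from the pyllelic source.

```python
import shutil
import subprocess
from typing import List


class ShellCommandError(Exception):
    """Local stand-in for pyllelic.process.ShellCommandError."""


def run_tool(command: List[str]) -> str:
    """Run an external command and return its stdout.

    Raises ShellCommandError if the executable cannot be found on PATH.
    """
    if shutil.which(command[0]) is None:
        raise ShellCommandError(f"{command[0]} is not installed or not on PATH")
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.stdout


try:
    run_tool(["pyllelic-doc-example-missing-tool"])
except ShellCommandError as err:
    print(f"caught: {err}")
```

Checking with shutil.which before launching the process gives a clear error message instead of a bare FileNotFoundError from subprocess.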
- pyllelic.process.bismark(genome: Path, fastq: Path) str [source]
Helper function to run external bismark tool.
Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs
- Parameters:
genome (Path) – filepath to directory of bismark processed genome files.
fastq (Path) – filepath to fastq file to process.
- Returns:
output from bismark shell command, usually discarded
- Return type:
str
- Raises:
ShellCommandError – bismark is not installed.
- pyllelic.process.bowtie2_fastq_to_bam(index: Path, fastq: Path, cores: int) str [source]
Helper function to run the external bowtie2 and samtools tools to align a fastq file into a bam file.
- Parameters:
index (Path) – filepath to bowtie index file
fastq (Path) – filepath to fastq file to convert to bam
cores (int) – number of cores to use for processing
- Returns:
output from bowtie2 and samtools shell command, usually discarded
- Return type:
str
- Raises:
ShellCommandError – bowtie2 is not installed.
- pyllelic.process.build_bowtie2_index(fasta: Path) str [source]
Helper function to run external bowtie2-build tool.
- Parameters:
fasta (Path) – filepath to fasta file to build index from
- Returns:
output from bowtie2-build shell command, usually discarded
- Return type:
str
- Raises:
ShellCommandError – bowtie2-build is not installed.
- pyllelic.process.convert_methbank_bed(path: Path, chrom: str, start: int, stop: int, viz: str = 'plotly') GenomicPositionData [source]
Helper function to convert MethBank BED file into GenomicPositionData obj.
MethBank: https://ngdc.cncb.ac.cn/methbank/
- Parameters:
path (Path) – path to MethBank formatted BED file
chrom (str) – chromosome identifier
start (int) – genomic start position
stop (int) – genomic stop position
viz (str) – Plotting backend to use, defaults to plotly
- Returns:
mostly complete pyllelic object with data from BED.
- Return type:
GenomicPositionData
- pyllelic.process.fastq_to_list(filepath: Path) List[SeqRecord] [source]
Read a .fastq or .fastq.gz file into an in-memory record_list.
This is a time and memory intensive operation!
- Parameters:
filepath (Path) – file path to a .fastq or .fastq.gz file
- Returns:
list of biopython sequence records from the fastq file
- Return type:
List[SeqRecord]
- Raises:
FileNameError – “Wrong filetype”
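The sketch below shows the general shape of such a reader using only the standard library. It is not the pyllelic implementation: it returns plain tuples rather than Biopython SeqRecords, raises ValueError as a stand-in for FileNameError, and assumes simple four-line fastq records.

```python
import gzip
from pathlib import Path
from typing import List, Tuple

FastqRecord = Tuple[str, str, str]  # (name, sequence, quality)


def read_fastq(filepath: Path) -> List[FastqRecord]:
    """Read a .fastq or .fastq.gz file fully into memory."""
    name = filepath.name
    if name.endswith(".fastq.gz"):
        opener = gzip.open
    elif name.endswith(".fastq"):
        opener = open
    else:
        # Stand-in for the FileNameError described above.
        raise ValueError("Wrong filetype")

    with opener(filepath, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle]

    records: List[FastqRecord] = []
    # Each record spans four lines: @name, sequence, '+', quality.
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i : i + 4]
        records.append((header[1:], seq, qual))
    return records
```

Because every record is held in a list, memory use grows with file size, which is the cost the warning above refers to.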
- pyllelic.process.index_bam(bamfile: Path) bool [source]
Helper function to run external samtools index.
- Parameters:
bamfile (Path) – filepath to bam file
- Returns:
verification of samtools command, usually discarded
- Return type:
bool
- pyllelic.process.make_records_to_dictionary(record_list: List[SeqRecord]) Dict[str, SeqRecord] [source]
Take in a list of biopython SeqRecords and output a dictionary keyed by record name.
- Parameters:
record_list (List[SeqRecord]) – biopython sequence records from a fastq file
- Returns:
dict of biopython SeqRecords from a fastq file
- Return type:
Dict[str, SeqRecord]
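The conversion can be pictured as a dictionary comprehension over the records. This sketch uses a small NamedTuple as a stand-in for Biopython's SeqRecord so that it runs without Biopython; the Record type and the sample reads are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    """Minimal stand-in for a Biopython SeqRecord."""

    name: str
    seq: str


def records_to_dictionary(record_list: List[Record]) -> Dict[str, Record]:
    """Index records by name; a later duplicate name overwrites an earlier one."""
    return {record.name: record for record in record_list}


reads = [Record("read1", "ACGT"), Record("read2", "TTGA")]
by_name = records_to_dictionary(reads)
print(by_name["read2"].seq)  # TTGA
```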
- pyllelic.process.prepare_genome(index: Path, aligner: Optional[Path] = None) str [source]
Helper function to run external bismark genome preparation tool.
Uses genomes from, e.g.: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs
- Parameters:
index (Path) – filepath to unprocessed genome file.
aligner (Optional[Path]) – filepath to bowtie2 alignment program.
- Returns:
output from genome preparation shell command, usually discarded
- Return type:
str
- Raises:
ShellCommandError – bismark_genome_preparation is not installed.
- pyllelic.process.retrieve_seq(filename: str, chrom: str, start: int, end: int) None [source]
Retrieve the genomic sequence of interest from UCSC Genome Browser.
- Parameters:
filename (str) – path to store genomic sequence
chrom (str) – chromosome of interest, e.g. “chr5”
start (int) – start position for region of interest
end (int) – end position for region of interest
pyllelic.pyllelic module
pyllelic: a tool for detection of allelic-specific variation in reduced representation bisulfite DNA sequencing.
- class pyllelic.pyllelic.AD_stats(sig: bool, stat: Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], crits: List[Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]])[source]
Bases:
NamedTuple
Helper class for NamedTuple results from anderson_darling_test
- crits: List[Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]]
Alias for field number 2
- sig: bool
Alias for field number 0
- stat: Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]
Alias for field number 1
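The "Alias for field number N" notes reflect that a NamedTuple can be read both by attribute name and by position. The stand-in below uses plain floats in place of the array-like types in the real signature, and the numbers are made up.

```python
from typing import List, NamedTuple


class ADStatsSketch(NamedTuple):
    """Stand-in with the same field order as AD_stats: sig, stat, crits."""

    sig: bool
    stat: float
    crits: List[float]


result = ADStatsSketch(sig=True, stat=1.25, crits=[0.5, 0.6, 0.7])

# Field 0 is sig, field 1 is stat, field 2 is crits.
print(result[0] == result.sig, result[1] == result.stat, result[2] == result.crits)

# Tuple unpacking follows the same order.
sig, stat, crits = result
```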
- class pyllelic.pyllelic.BamOutput(sam_directory: Path, genome_string: str, config: Config)[source]
Bases:
object
Storage container to process BAM sequencing files and store processed results.
- genome_values: Dict[str, str]
dictionary of read files and contents.
- Type:
Dict[str, str]
- name: str
path to bam file analyzed.
- Type:
str
- positions: List[str]
index of genomic positions in the bam file.
- Type:
pd.Index
- values: Dict[str, str]
dictionary of reads at a given position
- Type:
Dict[str, str]
- class pyllelic.pyllelic.GenomicPositionData(config: Config, files_set: List[str])[source]
Bases:
object
Class to process reduced representation bisulfite methylation sequencing data.
When initialized, GenomicPositionData reads sequencing file (.bam) locations from a config object, and then automatically performs alignment into BamOutput objects, and then performs methylation analysis, storing the results as QumaResults.
Finally, the aggregate data is analyzed to create some aggregate metrics such as means, modes, and differences (diffs), as well as expose methods for plotting and statistical analysis.
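To make the aggregate metrics concrete, the sketch below computes a mean, a mode, and their difference (mean minus mode) for one list of methylation values using the standard library. The input values are invented, and treating them as per-read methylation fractions at a single position is an assumption for illustration.

```python
from statistics import mean, mode
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Return the mean, the mode, and mean minus mode for one position."""
    value_mean = mean(values)
    value_mode = mode(values)
    return {"mean": value_mean, "mode": value_mode, "diff": value_mean - value_mode}


# Hypothetical values: three fully methylated reads, two unmethylated.
summary = summarize([1.0, 1.0, 1.0, 0.0, 0.0])
print(summary)
```

A mean that sits well away from the mode, as in this example, indicates a mixed population of reads at the position.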
- allelic_data: DataFrame
dataframe of Chi-squared p-values.
- Type:
pd.DataFrame
- cell_types: List[str]
list of cell types in the data.
- Type:
List[str]
- diffs: DataFrame
dataframe of the differences between mean and mode methylation values (mean minus mode).
- Type:
pd.DataFrame
- file_names: List[str]
list of bam filenames in the data.
- Type:
List[str]
- files_set: List[str]
list of bam files analyzed.
- Type:
List[str]
- static from_pickle(filename: str) GenomicPositionData [source]
Read pickled GenomicPositionData back to an object.
- Parameters:
filename (str) – filename to read pickle
- Returns:
GenomicPositionData object
- Return type:
GenomicPositionData
- generate_ad_stats() DataFrame [source]
Generate Anderson-Darling normality statistics for the individual data dataframe.
- Returns:
dataframe of Anderson-Darling test statistics
- Return type:
pd.DataFrame
- heatmap(min_values: int = 1, width: int = 800, height: int = 2000, cell_lines: Optional[List[str]] = None, data_type: str = 'means', backend: Optional[str] = None) None [source]
Display a graph figure showing heatmap of mean methylation across cell lines.
- Parameters:
min_values (int) – minimum number of data points that must exist at a position
width (int) – figure width, defaults to 800
height (int) – figure height, defaults to 2000
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
data_type (str) – type of data to plot. Can be ‘means’, ‘modes’, ‘diffs’, or ‘pvalue’.
backend (Optional[str]) – plotting backend to override default
- Raises:
ValueError – invalid data type
ValueError – invalid plotting backend
- histogram(cell_line: str, position: str, backend: Optional[str] = None) None [source]
Display a graph figure showing fractional methylation in a given cell line at a given site.
- Parameters:
cell_line (str) – name of cell line
position (str) – genomic position
backend (Optional[str]) – plotting backend to override default
- Raises:
ValueError – invalid plotting backend
ValueError – No data available at that position
- individual_data: DataFrame
dataframe of individual methylation values.
- Type:
pd.DataFrame
- means: DataFrame
dataframe of mean methylation values.
- Type:
pd.DataFrame
- modes: DataFrame
dataframe of modes of methylation values.
- Type:
pd.DataFrame
- positions: List[str]
list of genomic positions in the data.
- Type:
List[str]
- quma_results: Dict[str, QumaResult]
dictionary of QumaResults.
- Type:
Dict[str, QumaResult]
- reads_graph(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) None [source]
Display a graph figure showing methylation of reads across cell lines.
- Parameters:
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
backend (Optional[str]) – plotting backend to override default
- Raises:
ValueError – invalid plotting backend
ValueError – Unable to plot more than 20 cell lines at once.
- save(filename: str = 'output.xlsx') None [source]
Save quma results to an Excel file.
- Parameters:
filename (str) – Filename to save to. Defaults to “output.xlsx”.
- save_pickle(filename: str) None [source]
Save GenomicPositionData object as a pickled file.
- Parameters:
filename (str) – filename to save pickle
- sig_methylation_differences(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) None [source]
Display a graph figure showing a bar chart of significantly different mean / mode methylation across all or a subset of cell lines.
- Parameters:
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
backend (Optional[str]) – plotting backend to override default
- Raises:
ValueError – invalid plotting backend
- summarize_allelic_data(cell_lines: Optional[List[str]] = None) DataFrame [source]
Create a dataframe only of likely allelic methylation positions.
- Parameters:
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
- Returns:
dataframe of cell lines with likely allelic positions
- Return type:
pd.DataFrame
- class pyllelic.pyllelic.QumaResult(read_files: List[str], genomic_files: List[str], positions: List[str])[source]
Bases:
object
Storage container to process and store quma-style methylation results.
- values: DataFrame
dataframe of quma methylation analysis values.
- Type:
pd.DataFrame
- pyllelic.pyllelic.configure(base_path: str, prom_file: str, prom_start: int, prom_end: int, chrom: str, offset: int, test_dir: Optional[str] = None, fname_pattern: Optional[str] = None, viz_backend: Optional[str] = None, results_dir: Optional[str] = None) Config [source]
Helper method to set up all our environmental variables, such as for testing.
- Parameters:
base_path (str) – directory where all processing will occur; put .bam files in the “test” sub-directory of this folder
prom_file (str) – filename of genomic sequence of promoter region of interest
prom_start (int) – start position to analyze in promoter region
prom_end (int) – final position to analyze in promoter region
chrom (str) – chromosome promoter is located on
offset (int) – genomic position of promoter to offset reads
test_dir (Optional[str]) – name of test directory where bam files are located
fname_pattern (Optional[str]) – regex pattern for processing filenames
viz_backend (Optional[str]) – which plotting backend to use
results_dir (Optional[str]) – name of results directory
- Returns:
configuration dataclass instance.
- Return type:
Config
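The following sketch shows one plausible way the configure arguments could map onto Config fields. ConfigSketch and configure_sketch are hypothetical stand-ins, and the path-joining rules (promoter file, "test", and "results" placed under base_path) are assumptions inferred from the Config defaults, not taken from the pyllelic source.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ConfigSketch:
    """Stand-in holding a subset of the Config fields."""

    base_directory: Path
    promoter_file: Path
    results_directory: Path
    analysis_directory: Path
    promoter_start: int
    promoter_end: int
    chromosome: str
    offset: int


def configure_sketch(
    base_path: str,
    prom_file: str,
    prom_start: int,
    prom_end: int,
    chrom: str,
    offset: int,
    test_dir: Optional[str] = None,
    results_dir: Optional[str] = None,
) -> ConfigSketch:
    """Build a config, resolving sub-directories relative to base_path (assumed)."""
    base = Path(base_path)
    return ConfigSketch(
        base_directory=base,
        promoter_file=base / prom_file,
        results_directory=base / (results_dir or "results"),
        analysis_directory=base / (test_dir or "test"),
        promoter_start=prom_start,
        promoter_end=prom_end,
        chromosome=chrom,
        offset=offset,
    )


# Numeric values are the TERT promoter defaults listed for Config; the path is made up.
config = configure_sketch("/data/tert", "promoter.txt", 1293200, 1296000, "5", 1298163)
print(config.analysis_directory)
```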
- pyllelic.pyllelic.make_list_of_bam_files(config: Config) List[str] [source]
Check analysis directory for all valid .bam files.
- Parameters:
config (Config) – pyllelic configuration options.
- Returns:
list of files
- Return type:
list[str]
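A minimal sketch of collecting .bam filenames from an analysis directory with pathlib. What pyllelic counts as a "valid" file is not stated here, so this stand-in only checks the .bam suffix; list_bam_files is a hypothetical name.

```python
from pathlib import Path
from typing import List


def list_bam_files(analysis_directory: Path) -> List[str]:
    """Return sorted names of regular files ending in .bam (index files are skipped)."""
    return sorted(
        path.name
        for path in analysis_directory.iterdir()
        if path.is_file() and path.suffix == ".bam"
    )
```

Index files such as sample.bam.bai have the suffix .bai and are therefore excluded.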
- pyllelic.pyllelic.pyllelic(config: Config, files_set: List[str]) GenomicPositionData [source]
Wrapper to call pyllelic routines.
- Parameters:
config (Config) – pyllelic config object.
files_set (List[str]) – list of bam files to analyze.
- Returns:
GenomicPositionData pyllelic object.
- Return type:
GenomicPositionData
pyllelic.quma module
Tools to quantify methylation in reduced representation bisulfite sequencing reads.
- class pyllelic.quma.Fasta(com: str = '', pos: Optional[str] = None, seq: str = '')[source]
Bases:
object
Dataclass to wrap fasta results.
- com: str = ''
- pos: Optional[str] = None
- seq: str = ''
- class pyllelic.quma.Quma(gfile_contents: str, qfile_contents: str)[source]
Bases:
object
Quma methylation analysis parser for bisulfite conversion DNA sequencing.
- values: str
QUMA output values in tabular form.
- class pyllelic.quma.Reference(fasta: Fasta, res: Result, dir: int, gdir: int, exc: int)[source]
Bases:
object
Dataclass of quma analysis intermediates.
Includes fasta sequence, quma results, direction of read, genomic direction, and whether result meets exclusion criteria.
- dir: int
- exc: int
- gdir: int
- class pyllelic.quma.Result(qAli: str = '', gAli: str = '', val: str = '', perc: float = 0.0, pconv: float = 0.0, gap: int = 0, menum: int = 0, unconv: int = 0, conv: int = 0, match: int = 0, aliMis: int = 0, aliLen: int = 0)[source]
Bases:
object
Dataclass of quma alignment comparison results.
- aliLen: int = 0
- aliMis: int = 0
- conv: int = 0
- gAli: str = ''
- gap: int = 0
- match: int = 0
- pconv: float = 0.0
- perc: float = 0.0
- qAli: str = ''
- unconv: int = 0
- val: str = ''
pyllelic.visualization module
Utilities to visualize data for use in pyllelic.
- pyllelic.visualization._create_heatmap(df: DataFrame, min_values: int, width: int, height: int, title_type: str, backend: str) Union[Figure, Figure] [source]
Generate a graph figure showing heatmap of mean methylation across cell lines.
- Parameters:
df (pd.DataFrame) – dataframe of mean methylation
min_values (int) – minimum number of data points that must exist at a position
width (int) – figure width
height (int) – figure height
title_type (str) – type of figure being plotted
backend (str) – which plotting backend to use
- Returns:
plotly or matplotlib figure object
- Return type:
Union[go.Figure, plt.Figure]
- Raises:
ValueError – invalid plotting backend
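One way to read the min_values parameter is as a per-position threshold on the number of non-missing values. The sketch below applies that reading to a plain dictionary; the data layout, the position labels, and the counting rule are all assumptions for illustration, not the pyllelic implementation.

```python
from typing import Dict, List, Optional

PositionData = Dict[str, List[Optional[float]]]


def filter_positions(data: PositionData, min_values: int) -> PositionData:
    """Keep positions that have at least min_values non-missing entries."""
    return {
        position: values
        for position, values in data.items()
        if sum(value is not None for value in values) >= min_values
    }


# Hypothetical mean methylation per position, one entry per cell line.
means = {
    "1293589": [0.9, None, 0.8],
    "1293690": [None, None, 0.4],
    "1293730": [None, None, None],
}
print(sorted(filter_positions(means, min_values=2)))  # ['1293589']
```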
- pyllelic.visualization._create_histogram(data: DataFrame, cell_line: str, position: str, backend: str) Union[Figure, Figure] [source]
Generate a graph figure showing fractional methylation in a given cell line at a given site.
- Parameters:
data (pd.DataFrame) – dataframe of individual data
cell_line (str) – name of cell line
position (str) – genomic position
backend (str) – which plotting backend to use
- Returns:
plotly or matplotlib figure object
- Return type:
Union[go.Figure, plt.Figure]
- Raises:
ValueError – invalid plotting backend provided
- pyllelic.visualization._create_methylation_diffs_bar_graph(df: DataFrame, backend: str) Union[Figure, Figure] [source]
Generate a graph figure showing bar graph of significant methylation across cell lines.
- Parameters:
df (pd.DataFrame) – dataframe of significant methylation positions
backend (str) – which plotting backend to use
- Returns:
plotly or matplotlib figure object
- Return type:
Union[go.Figure, plt.Figure]
- Raises:
ValueError – invalid plotting backend
- pyllelic.visualization._make_stacked_fig(df: DataFrame, backend: str) Union[Figure, Figure] [source]
Generate a graph figure showing methylated and unmethylated reads across cell lines.
- Parameters:
df (pd.DataFrame) – dataframe of individual read data
backend (str) – plotting backend to use
- Returns:
plotly or matplotlib figure
- Return type:
Union[go.Figure, plt.Figure]
- Raises:
ValueError – invalid plotting backend
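Each helper in this module accepts a backend string and raises ValueError for an unsupported one. A dispatch table is one common way to implement that check; the sketch below uses placeholder builders that return strings rather than real plotly or matplotlib figures, and all names in it are hypothetical.

```python
from typing import Callable, Dict


def _plotly_placeholder(title: str) -> str:
    """Placeholder for code that would build a plotly figure."""
    return f"plotly figure: {title}"


def _matplotlib_placeholder(title: str) -> str:
    """Placeholder for code that would build a matplotlib figure."""
    return f"matplotlib figure: {title}"


BACKENDS: Dict[str, Callable[[str], str]] = {
    "plotly": _plotly_placeholder,
    "matplotlib": _matplotlib_placeholder,
}


def create_figure(title: str, backend: str) -> str:
    """Dispatch to the requested backend, rejecting unknown names."""
    try:
        builder = BACKENDS[backend]
    except KeyError:
        raise ValueError(f"invalid plotting backend: {backend!r}") from None
    return builder(title)


print(create_figure("Mean methylation", "plotly"))
```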
pyllelic.main module
pyllelic: module level interface to run pyllelic from the command line.
Example usage:
python -m pyllelic -o my_data -f fh_cellline_tissue.fastq.gz -g hg19chr5 -chr chr5 -s 1293000 -e 1296000 --viz plotly
This command would save pyllelic results in files with the prefix my_data, analyzing the specified fastq file using the specified reference genome, in the genomic region indicated.
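For readers who want to see how flags of this shape can be parsed, here is a small argparse sketch that accepts the options used in the example command. It is not the pyllelic command-line implementation: the dest names are hypothetical, and it assumes the long option is spelled --viz.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser accepting the flags from the example command (dest names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="pyllelic")
    parser.add_argument("-o", dest="output_prefix", required=True, help="prefix for result files")
    parser.add_argument("-f", dest="fastq", required=True, help="fastq file to analyze")
    parser.add_argument("-g", dest="genome", required=True, help="reference genome")
    parser.add_argument("-chr", dest="chrom", required=True, help="chromosome, e.g. chr5")
    parser.add_argument("-s", dest="start", type=int, required=True, help="region start")
    parser.add_argument("-e", dest="end", type=int, required=True, help="region end")
    parser.add_argument("--viz", dest="viz_backend", default="plotly", help="plotting backend")
    return parser


example = (
    "-o my_data -f fh_cellline_tissue.fastq.gz -g hg19chr5 "
    "-chr chr5 -s 1293000 -e 1296000 --viz plotly"
)
args = build_parser().parse_args(example.split())
print(args.chrom, args.start, args.end, args.viz_backend)
```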