Installation
Installing DiffMethylTools and its dependencies is quick and straightforward.
1. Install the Package
Option 1 — Install from GitHub
Clone the repository:
Ensure that Python and pip are installed, then install the package from the project root directory:Option 2 — Install directly from PyPI
Alternativly, you can install the package using pip
2. Setup Reference Genome
Before running any downstream analysis or annotations, you must initialize the reference files for your target genome build.
Default Setup
By default, DiffMethylTools comes pre-packaged with the necessary reference files for both hg38 and hg19. To initialize the tool with the default files, simply run:
Once this command completes, the tool is ready to use!Advanced Setup: Using Custom Reference Files
If you prefer to use your own custom reference files (e.g., a specific GTF version, custom CpG tracks, or custom array manifests), you can do so by editing the shared configuration file (hg38_min.yml).
In this YAML file, you must define the path, the column separator (sep), and for array manifests, the specific columns to extract (using 0-based indexing). If an optional file path is left empty or is incorrect, the script will safely generate dummy files to prevent pipeline failure.
Here is the template for the yml file:
# Mandatory Information
ref_name: "hg38_custom"
gtf:
path: "../gencode.v42.chr_patch_hapl_scaff.annotation.gtf"
sep: "\t"
CpG:
path: "../CpG.bed"
sep: "\t"
# Optional Files (Leave path empty "" if not using)
ccre:
path: "../encodeCcreCombined.bed"
sep: "\t"
rmsk:
path: "../rmsk.txt"
sep: "\t"
epic_v2:
path: "../EPIC-8v2-0_A1.csv"
sep: ","
columns: [1, 15, 16]
epic_v1:
path: "../infinium-methylationepic-v-1-0-b5-manifest-file-csv"
sep: ","
columns: [1, 48, 49, 50]
hm450:
path: "../HM450.hg38.manifest.tsv"
sep: "\t"
columns: [0, 1, 2, 8]
Required File Formats
To ensure your custom files are parsed correctly, they must match the following structures:
- Reference Name (
ref_name) - Mandatory A unique identifier for your custom reference. The script will generate a directory with this exact name to store all processed files. You must use this name as the input for theref_folderparameter in all subsequent downstream tasks. -
GTF (
gtf) - Mandatory
Standard Ensembl/GENCODE GTF format (tab-separated). Must contain standard feature columns (chr,source,feature,start,end,score,strand,frame,attribute). -
CpG Base (
CpG) - Mandatory
A simple 3-column, tab-separated BED-like file. Note: The index must start from 1.
Format:chr|pos|index
Example:chr1 \t 10469 \t 1 -
cCRE (
ccre) - Optional
Standard ENCODE candidate cis-Regulatory Elements BED file (tab-separated).
Example:chr1 \t 181251 \t 181601 \t EH38E1310153 ... -
RepeatMasker (
rmsk) - Optional
Tab-separated repeats file. Must contain basic coordinates and repeat categories.
Format:chr|start|end|strand|repeat_cat|repeat_subcat... -
Illumina Array Manifests (epic_v2, epic_v1, hm450) - Optional
When providing Illumina manifest files (CSV or TSV), you must specify the 0-based column indices that correspond to the Probe ID, Chromosome, and Position.-
EPIC v2: columns:
[1, 15, 16]extracts Name, CHR, and MAPINFO. -
HM450: columns:
[0, 1, 2, 8]extracts CpG_chrm, CpG_beg, CpG_end, and Probe_ID. -
EPIC v1: columns:
[1, 48, 49, 50]extracts the corresponding Probe ID and coordinate columns based on the v1 manifest structure.
-