VCFormer

VCF data is typically nested and difficult to work with. To facilitate analysis of VCF data, this library reads VCF files into Pandas or Polars DataFrames and extracts nested fields to the top level, with a focus on the INFO fields and sample-associated fields.
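To make "nested" concrete: a single INFO cell packs key=value pairs, multi-valued entries, and bare flags into one string. The sketch below flattens one such string using only the standard library. It illustrates the idea and is not VCFormer's implementation; `flatten_info` is a hypothetical helper, and values stay as strings because no header schema is consulted.

```python
def flatten_info(info: str) -> dict:
    """Split a raw INFO string into a flat {key: value} mapping."""
    flat = {}
    if info == ".":  # "." marks an empty INFO column
        return flat
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            flat[key] = True  # a bare key is a flag: presence means true
        elif "," in value:
            flat[key] = value.split(",")  # multi-valued field, kept as a list
        else:
            flat[key] = value
    return flat


row = flatten_info("DP=14;AF=0.5,0.25;DB")
print(row)
```

Each key in the resulting mapping corresponds to what would become a top-level column; converting the strings to integers or floats requires the type information declared in the VCF header.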

Note: This project builds on our previous work at the "VCF Files for Population Genomics: Scaling to Millions of Samples" codeathon, co-hosted by NCBI and NIAID.

API docs

vcformer.read_info_schema(path: str)

Read the schema of the INFO column of a VCF.

This currently uses pysam.

Parameters:

path (str) – Path to VCF file.

Returns:

DataFrame with the columns [name, number, type], one row per INFO field.

Return type:

pd.DataFrame

Notes

Possible values for type are: “Integer”, “Float”, “String”, “Flag”.

Possible values for number are:

  • An integer (e.g. 0, 1, 2, 3, 4, etc.) for fields where the number of values per VCF record is fixed. 0 means the field is a “Flag”.

  • A string (“A”, “R”, “G”) - for fields where the number of values per VCF record is determined by the number of alts, the total number of alleles (reference plus alts), or the number of possible genotypes, respectively.

  • A dot (“.”) - for fields where the number of values per VCF record varies, is unknown, or is unbounded.
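Once the number of alternate alleles at a record is known, the Number codes can be turned into a concrete value count. The following is a minimal sketch based on the VCF specification's definitions (A: one value per alternate allele, R: one per allele including the reference, G: one per possible genotype). `expected_value_count` is a hypothetical helper, not part of VCFormer, and it assumes the schema's number value has been converted to a string first.

```python
from math import comb
from typing import Optional


def expected_value_count(number: str, n_alt: int, ploidy: int = 2) -> Optional[int]:
    """Return how many values a field holds per record, or None if unbounded."""
    if number == ".":
        return None  # varies, unknown, or unbounded
    if number == "A":
        return n_alt  # one value per alternate allele
    if number == "R":
        return n_alt + 1  # one value per allele, reference included
    if number == "G":
        n_alleles = n_alt + 1
        # Unordered genotypes: multisets of size `ploidy` drawn from n_alleles.
        return comb(n_alleles + ploidy - 1, ploidy)
    return int(number)  # fixed count; 0 denotes a Flag


print(expected_value_count("G", n_alt=2))
```

For a diploid sample at a site with two alternate alleles, the G case counts the genotypes 0/0, 0/1, 1/1, 0/2, 1/2, and 2/2.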

vcformer.read_sample_schema(path: str)

Read the schema of the genotype sample columns of a VCF.

This currently uses pysam.

Parameters:

path (str) – Path to VCF file.

Returns:

DataFrame with the columns [name, number, type], one row per sample (FORMAT) field.

Return type:

pd.DataFrame

Notes

Possible values for type are: “Integer”, “Float”, and “String”.
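To show where the [name, number, type] rows come from, the sketch below pulls them out of `##FORMAT` header lines with the standard library. This is an illustration under stated assumptions, not the library's code path (which uses pysam); `parse_format_header` is a hypothetical helper, and the header lines are made-up sample input.

```python
import re

# Matches key=value pairs; quoted values may contain commas.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')
PREFIX = "##FORMAT=<"


def parse_format_header(lines):
    """Collect name/number/type for every ##FORMAT declaration."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line.startswith(PREFIX):
            continue  # skip other header lines and records
        body = line[len(PREFIX):-1]  # drop the prefix and the closing ">"
        attrs = {key: value.strip('"') for key, value in FIELD_RE.findall(body)}
        rows.append(
            {"name": attrs["ID"], "number": attrs["Number"], "type": attrs["Type"]}
        )
    return rows


header = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths, ref first">',
]
schema_rows = parse_format_header(header)
print(schema_rows)
```

A list of dictionaries in this shape can be passed directly to a DataFrame constructor to obtain a table with the same three columns.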

vcformer.read_vcf_as_pandas(path: str, query: str | None = None, info_fields: list[str] | None = None, sample_fields: list[str] | None = None, samples: list[str] | None = None, include_unspecified: bool = False) -> DataFrame

Read a VCF into a pandas dataframe, extracting INFO and sample genotype fields.

Parameters:
  • path (str) – Path to VCF file.

  • query (str, optional) – Genomic range query string. If None, all records will be read.

  • info_fields (list[str], optional) – List of fields to extract from the INFO column. If None, all fields will be extracted.

  • sample_fields (list[str], optional) – List of fields to extract from the sample genotype columns. If None, all fields will be extracted.

  • samples (list[str], optional) – List of samples to extract. If None, all samples will be extracted.

Returns:

Pandas DataFrame with columns corresponding to the requested fields.

Return type:

pd.DataFrame
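A short usage sketch for the parameters described above. The field names DP, AF, and GT are assumptions about the input file, and `load_depth_and_genotypes` is a hypothetical wrapper. Only the documented keyword arguments are used. The layout and naming of the resulting columns are not specified here, so the sketch simply returns the frame.

```python
from typing import Optional


def load_depth_and_genotypes(path: str, region: Optional[str] = None):
    """Read selected INFO and sample fields from a VCF into a pandas DataFrame."""
    # Imported lazily so this helper can be defined where vcformer is absent.
    import vcformer

    return vcformer.read_vcf_as_pandas(
        path,
        query=region,              # None reads every record
        info_fields=["DP", "AF"],  # assumed to be declared in the INFO header
        sample_fields=["GT"],      # assumed to be declared as a FORMAT field
    )
```

A call might look like `load_depth_and_genotypes("cohort.vcf.gz", "chr1:1-100000")`, where both the path and the contig:start-end region syntax are illustrative assumptions. Use `read_info_schema` and `read_sample_schema` to check which fields a given file actually declares before requesting them.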

vcformer.read_vcf_as_polars(path: str, query: str | None = None, info_fields: list[str] | None = None, sample_fields: list[str] | None = None, samples: list[str] | None = None, include_unspecified: bool = False) -> DataFrame

Read a VCF into a polars dataframe, extracting INFO and sample genotype fields.

Parameters:
  • path (str) – Path to VCF file.

  • query (str, optional) – Genomic range query string. If None, all records will be read.

  • info_fields (list[str], optional) – List of fields to extract from the INFO column. If None, all fields will be extracted.

  • sample_fields (list[str], optional) – List of fields to extract from the sample genotype columns. If None, all fields will be extracted.

  • samples (list[str], optional) – List of samples to extract. If None, all samples will be extracted.

Returns:

Polars DataFrame with columns corresponding to the requested fields.

Return type:

pl.DataFrame
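Because `read_vcf_as_pandas` and `read_vcf_as_polars` take the same parameters according to the signatures above, a thin wrapper can select the backend by name. This is a sketch of one possible convenience layer; `reader_name` and `read_vcf` are hypothetical helpers and not part of VCFormer.

```python
SUPPORTED_BACKENDS = ("pandas", "polars")


def reader_name(backend: str) -> str:
    """Map a backend name to the matching vcformer reader function name."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"backend must be one of {SUPPORTED_BACKENDS}, got {backend!r}"
        )
    return f"read_vcf_as_{backend}"


def read_vcf(path: str, backend: str = "polars", **kwargs):
    """Read a VCF with the chosen backend, forwarding all other arguments."""
    name = reader_name(backend)  # validate before importing anything
    import vcformer

    return getattr(vcformer, name)(path, **kwargs)
```

Keyword arguments such as `query`, `info_fields`, `sample_fields`, and `samples` pass through unchanged, so switching between pandas and polars output only requires changing the `backend` argument.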