Data Helpers
Data loading and simulation helpers.
Loading
read_mtx
read_mtx(
mtx_file_name: Union[str, Path],
gene_file_name: Union[str, Path],
barcode_file_name: Optional[Union[str, Path]],
) -> pd.DataFrame
Read a Matrix Market (`.mtx`) count matrix, together with its gene and barcode label files, into a DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mtx_file_name` | `Union[str, Path]` | File name of the mtx data | required |
| `gene_file_name` | `Union[str, Path]` | File name of the gene vector | required |
| `barcode_file_name` | `Optional[Union[str, Path]]` | File name of the barcode vector | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `df` | `DataFrame` | A DataFrame with genes as rows and cells as columns |
Source code in scTenifold/data/_io.py
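To make the three-file layout concrete, here is a minimal, dependency-free sketch of how a coordinate-format `.mtx` file and its two label files combine into a genes-by-cells table. It is not scTenifold's implementation: `load_mtx_as_rows` is a hypothetical helper, a nested dict stands in for the DataFrame, and the sketch assumes one label per line and a matrix whose rows are genes.

```python
import tempfile
from pathlib import Path


def load_mtx_as_rows(mtx_path, gene_path, barcode_path):
    """Hypothetical helper: parse a coordinate .mtx file into {gene: {barcode: count}}."""
    genes = Path(gene_path).read_text().splitlines()
    barcodes = Path(barcode_path).read_text().splitlines()
    # Start from an all-zero genes-by-cells table; .mtx stores only non-zero entries.
    table = {g: {b: 0 for b in barcodes} for g in genes}
    with open(mtx_path) as fh:
        # Drop blank lines and the banner/comment lines, which start with '%'.
        lines = [ln for ln in fh if ln.strip() and not ln.startswith("%")]
    # First remaining line is the size line: rows, columns, stored entries.
    n_rows, n_cols, _n_entries = map(int, lines[0].split())
    if (n_rows, n_cols) != (len(genes), len(barcodes)):
        raise ValueError("matrix shape does not match the label files")
    for ln in lines[1:]:
        i, j, value = ln.split()
        # Matrix Market indices are 1-based.
        table[genes[int(i) - 1]][barcodes[int(j) - 1]] = int(value)
    return table


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "matrix.mtx").write_text(
        "%%MatrixMarket matrix coordinate integer general\n"
        "3 2 3\n"
        "1 1 5\n"
        "2 2 7\n"
        "3 1 1\n"
    )
    (tmp / "genes.tsv").write_text("GeneA\nGeneB\nGeneC\n")
    (tmp / "barcodes.tsv").write_text("cell_1\ncell_2\n")
    table = load_mtx_as_rows(
        tmp / "matrix.mtx", tmp / "genes.tsv", tmp / "barcodes.tsv"
    )
```

In practice `read_mtx` takes the same three paths and returns a `pd.DataFrame`; the sketch only illustrates the file format it consumes.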
read_folder
read_folder(
file_dir: Union[str, Path],
matrix_fn: str = "matrix",
gene_fn: str = "genes",
barcodes_fn: str = "barcodes",
) -> pd.DataFrame
Read mtx + genes + barcodes from a directory by filename substring.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_dir` | `Union[str, Path]` | Path to a directory containing matrix, gene, and barcode files. | required |
| `matrix_fn` | `str` | Substring identifying the matrix file (e.g. …) | `'matrix'` |
| `gene_fn` | `str` | Substring identifying the gene file. | `'genes'` |
| `barcodes_fn` | `str` | Substring identifying the barcode file. | `'barcodes'` |
Returns:
| Type | Description |
|---|---|
| `pd.DataFrame` | Genes-by-cells DataFrame. |
Source code in scTenifold/data/_io.py
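Matching files by filename substring can be sketched as follows. This is an illustration of the idea only, not the library's code: `find_by_substring` is a hypothetical helper, and raising on zero or multiple matches is an assumption about how ambiguity might be handled.

```python
import tempfile
from pathlib import Path


def find_by_substring(file_dir, substring):
    """Hypothetical helper: return the one file in file_dir whose name contains substring."""
    matches = sorted(
        p for p in Path(file_dir).iterdir() if p.is_file() and substring in p.name
    )
    # Assumption: anything other than exactly one match is treated as an error.
    if len(matches) != 1:
        raise ValueError(
            f"expected one file containing {substring!r}, found {len(matches)}"
        )
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_matrix.mtx", "sample_genes.tsv", "sample_barcodes.tsv"):
        (Path(tmp) / name).touch()
    # The default substrings from the read_folder signature.
    found = {
        key: find_by_substring(tmp, key).name
        for key in ("matrix", "genes", "barcodes")
    }
```

Because matching is by substring, a directory holding both `matrix.mtx` and `matrix_filtered.mtx` would be ambiguous under this sketch; passing a more specific `matrix_fn` would be the natural way to disambiguate.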
fetch_data
fetch_data(
ds_name: str,
dataset_path: Path = Path(__file__).parent.parent.parent
/ Path("datasets"),
owner: str = "qwerty239qwe",
) -> Dict[str, pd.DataFrame]
Fetch and load a remote scTenifold dataset by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ds_name` | `str` | Dataset name (one of :data: …) | required |
| `dataset_path` | `Path` | Local directory to cache downloads. | `Path(__file__).parent.parent.parent / Path("datasets")` |
| `owner` | `str` | GitHub owner of the … | `'qwerty239qwe'` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, pd.DataFrame]` | Mapping from sample-group name to a genes-by-cells DataFrame. |
Source code in scTenifold/data/_get.py
list_data
list_data(
owner: str = "qwerty239qwe", return_list: bool = True
) -> Union[Dict[str, Dict[str, List[str]]], List[str]]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `owner` | `str` | Owner name of the dataset repo | `'qwerty239qwe'` |
| `return_list` | `bool` | If True, return a list of dataset names; otherwise return a dict describing the repo structure | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `data_info_tree` | `list` or `dict` | The available datasets, either as a dict with the structure `{'data_name': {'group': ['file_names']}}` or as a list of data names |
Source code in scTenifold/data/_get.py
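The two return shapes can be illustrated with a small sketch. The dataset, group, and file names below are made up for illustration, and `summarize` is a hypothetical stand-in that only mimics the effect of `return_list`; it performs no network access.

```python
from typing import Dict, List, Union

Tree = Dict[str, Dict[str, List[str]]]

# Made-up example following the documented {'data_name': {'group': ['file_names']}} shape.
example_tree: Tree = {
    "dataset_a": {
        "control": ["matrix.mtx", "genes.tsv", "barcodes.tsv"],
        "treated": ["matrix.mtx", "genes.tsv", "barcodes.tsv"],
    },
    "dataset_b": {
        "control": ["matrix.mtx", "genes.tsv", "barcodes.tsv"],
    },
}


def summarize(tree: Tree, return_list: bool = True) -> Union[Tree, List[str]]:
    """Hypothetical helper: return only the dataset names, or the full tree."""
    return list(tree) if return_list else tree


names = summarize(example_tree)                       # list of dataset names
full = summarize(example_tree, return_list=False)     # nested dict
groups_of_a = list(full["dataset_a"])                 # group names of one dataset
```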
Simulation
get_test_df
get_test_df(
n_cells: int = 100,
n_genes: int = 1000,
random_state: Optional[int] = None,
) -> pd.DataFrame
Generate a test DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_cells` | `int` | Number of cells in the generated DataFrame | `100` |
| `n_genes` | `int` | Number of genes in the generated DataFrame | `1000` |
| `random_state` | `Optional[int]` | Random seed for the generated data; use the same seed to reproduce the same dataset | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `test_df` | `DataFrame` | Testing data |
Source code in scTenifold/data/_sim.py
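The role of `random_state` can be shown with a dependency-free sketch. The distribution used by `get_test_df` is not stated on this page, so the uniform integer counts below are purely an assumption; `make_counts` is a hypothetical helper and a nested list stands in for the DataFrame.

```python
import random


def make_counts(n_cells=100, n_genes=1000, random_state=None):
    """Hypothetical helper: build a genes-by-cells table of pseudo-random counts."""
    # A dedicated generator keeps the seed local instead of touching global state.
    rng = random.Random(random_state)
    # Assumption: uniform integer counts, chosen only to keep the sketch short.
    return [[rng.randint(0, 20) for _ in range(n_cells)] for _ in range(n_genes)]


a = make_counts(n_cells=5, n_genes=10, random_state=0)
b = make_counts(n_cells=5, n_genes=10, random_state=0)

same_seed_same_data = a == b       # identical seeds reproduce the same table
shape = (len(a), len(a[0]))        # (n_genes, n_cells): genes are rows
```

Leaving `random_state` as `None` seeds the generator from system entropy, so repeated calls would generally differ.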
TestDataGenerator
dataclass
TestDataGenerator(
n_genes: int = 1000,
n_samples: int = 100,
pos_eff_ratio: float = 0.3,
neg_eff_ratio: float = 0,
target_pos: Optional[Sequence[str]] = None,
target_neg: Optional[Sequence[str]] = None,
n_bins: int = 25,
n_ctrl: int = 50,
random_state: int = 42,
)
A test data generator that produces test data for cell scoring functions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_genes` | `int` | Number of genes in the data | `1000` |
| `n_samples` | `int` | Number of cells (samples) in the data | `100` |
| `pos_eff_ratio` | `float` | Fraction of up-regulated cells | `0.3` |
| `neg_eff_ratio` | `float` | Fraction of down-regulated cells | `0` |
| `target_pos` | `Optional[Sequence[str]]` | | `None` |
| `target_neg` | `Optional[Sequence[str]]` | | `None` |
| `n_bins` | `int` | | `25` |
| `n_ctrl` | `int` | | `50` |
| `random_state` | `int` | | `42` |
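As a rough illustration of what the two ratio parameters mean, the sketch below splits `n_samples` cells into up-regulated, down-regulated, and unaffected groups. It is not the generator's actual logic: `split_labels`, the label strings, and the rounding rule are all assumptions made for this example.

```python
def split_labels(n_samples=100, pos_eff_ratio=0.3, neg_eff_ratio=0.0):
    """Hypothetical helper: assign each cell an 'up', 'down', or 'neutral' label."""
    # Assumption: fractions are converted to counts by rounding to the nearest integer.
    n_pos = round(n_samples * pos_eff_ratio)
    n_neg = round(n_samples * neg_eff_ratio)
    if n_pos + n_neg > n_samples:
        raise ValueError("pos_eff_ratio + neg_eff_ratio must not exceed 1")
    n_neutral = n_samples - n_pos - n_neg
    return ["up"] * n_pos + ["down"] * n_neg + ["neutral"] * n_neutral


# With the documented defaults: 100 samples, 30% up-regulated, none down-regulated.
labels = split_labels()
counts = {name: labels.count(name) for name in ("up", "down", "neutral")}
```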
__post_init__
__post_init__() -> None
Build the simulated count matrix and gene/sample labels.
Source code in scTenifold/data/_sim.py
save_data
save_data(
file_path: Union[str, Path], use_normalized: bool
) -> None
Save the simulated count matrix as CSV to file_path.
Source code in scTenifold/data/_sim.py
get_data
get_data(
data_type: str, use_normalized: bool
) -> Dict[str, object]
Return the simulated data packaged for downstream scorers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_type` | `str` | One of … | required |
| `use_normalized` | `bool` | If True, return log-CPM-like normalized counts; otherwise raw counts. | required |
Returns:
| Type | Description |
|---|---|
| `Dict[str, object]` | Keyword arguments suitable for `cell_cycle_score` or `adobo_score`. |
Source code in scTenifold/data/_sim.py