Overview

Dataset statistics

Number of variables: 6
Number of observations: 1357230
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 62.1 MiB
Average record size in memory: 48.0 B
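
The overview figures can be cross-checked with a little arithmetic. The sketch below assumes that every gene is measured in every cell line (which the uniform per-value counts further down suggest) and that each stored value occupies 8 bytes (a 64-bit float or an object pointer). Neither assumption is stated by the report itself.

```python
# Consistency check of the overview figures (inputs copied from the report).
N_GENES = 19670        # distinct values of "Gene"
N_CELL_LINES = 69      # distinct values of "Cell line"
N_VARIABLES = 6
BYTES_PER_VALUE = 8    # assumption: 64-bit float or object pointer per cell


def overview(n_genes, n_cell_lines, n_variables, bytes_per_value):
    """Return (rows, bytes per record, total size in MiB)."""
    rows = n_genes * n_cell_lines
    record_bytes = n_variables * bytes_per_value
    total_mib = rows * record_bytes / 2**20
    return rows, record_bytes, total_mib


rows, record_bytes, total_mib = overview(
    N_GENES, N_CELL_LINES, N_VARIABLES, BYTES_PER_VALUE
)
print(rows, record_bytes, round(total_mib, 1))  # 1357230 48 62.1
```

Under these assumptions the product of genes and cell lines reproduces the reported number of observations, and the per-record and total sizes match the reported 48.0 B and 62.1 MiB.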

Variable types

Categorical: 3
Numeric: 3

Warnings

Gene has a high cardinality: 19670 distinct values [High cardinality]
Gene name has a high cardinality: 19651 distinct values [High cardinality]
Cell line has a high cardinality: 69 distinct values [High cardinality]
TPM is highly correlated with pTPM [High correlation]
pTPM is highly correlated with TPM [High correlation]
TPM is highly skewed (γ1 = 32.03325779) [Skewed]
pTPM is highly skewed (γ1 = 31.89212424) [Skewed]
Gene is uniformly distributed [Uniform]
Gene name is uniformly distributed [Uniform]
Cell line is uniformly distributed [Uniform]
TPM has 388709 (28.6%) zeros [Zeros]
pTPM has 371434 (27.4%) zeros [Zeros]
NX has 374658 (27.6%) zeros [Zeros]
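
As a rough illustration of how the "Zeros" and "Skewed" warnings relate to the data, the sketch below recomputes the zero percentages from the counts above and defines a moment-based skewness. The helper names `zero_share` and `skewness` are hypothetical, and the report's own γ1 estimator may apply a bias correction that this sketch omits.

```python
def zero_share(zeros, total):
    """Percentage of observations that are exactly zero."""
    return 100.0 * zeros / total


def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


N_OBS = 1357230  # number of observations, from the overview
for name, zeros in [("TPM", 388709), ("pTPM", 371434), ("NX", 374658)]:
    print(f"{name}: {zero_share(zeros, N_OBS):.1f}% zeros")

# Toy data (not from the report): many zeros plus one large value
# produce a long right tail and therefore positive skewness.
print(skewness([0, 0, 0, 0, 10]))  # 1.5
```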

Reproduction

Analysis started: 2021-04-30 11:01:36.755441
Analysis finished: 2021-04-30 11:02:07.814285
Duration: 31.06 seconds
Software version: pandas-profiling v2.11.0
Download configuration: config.yaml

Variables

Gene
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 19670
Distinct (%): 1.4%
Missing: 0
Missing (%): 0.0%
Memory size: 10.4 MiB

Value                   Count
ENSG00000143171            69
ENSG00000284546            69
ENSG00000148308            69
ENSG00000172568            69
ENSG00000080815            69
Other values (19665)  1356885

Length

Max length: 15
Median length: 15
Mean length: 15
Min length: 15

Characters and Unicode

Total characters: 20358450
Distinct characters: 14
Distinct categories: 2
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
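
As a minimal sketch of what such a character-property analysis looks like, Python's standard `unicodedata` module exposes the general category of each code point. The helper `category_counts` is a hypothetical name, and this sketch covers only categories; the standard library does not expose scripts or blocks, so those parts of the report are not reproduced here.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count characters per Unicode general category (two-letter codes)."""
    return Counter(unicodedata.category(ch) for ch in text)


# Lu = Uppercase Letter, Nd = Decimal Number,
# Pd = Dash Punctuation, Zs = Space Separator
print(category_counts("ENSG00000000003"))  # 4 x Lu, 11 x Nd
print(category_counts("U-251 MG"))         # 3 x Lu, 3 x Nd, 1 x Pd, 1 x Zs
```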

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: ENSG00000000003
2nd row: ENSG00000000003
3rd row: ENSG00000000003
4th row: ENSG00000000003
5th row: ENSG00000000003

Value                   Count  Frequency (%)
ENSG00000143171            69  < 0.1%
ENSG00000284546            69  < 0.1%
ENSG00000148308            69  < 0.1%
ENSG00000172568            69  < 0.1%
ENSG00000080815            69  < 0.1%
ENSG00000239998            69  < 0.1%
ENSG00000198695            69  < 0.1%
ENSG00000205089            69  < 0.1%
ENSG00000278299            69  < 0.1%
ENSG00000168944            69  < 0.1%
Other values (19660)  1356540  99.9%

Figure: Histogram of lengths of the category

Value                   Count  Frequency (%)
ensg00000197345            69  < 0.1%
ensg00000188820            69  < 0.1%
ensg00000236334            69  < 0.1%
ensg00000205639            69  < 0.1%
ensg00000149596            69  < 0.1%
ensg00000161905            69  < 0.1%
ensg00000081377            69  < 0.1%
ensg00000256771            69  < 0.1%
ensg00000268750            69  < 0.1%
ensg00000179088            69  < 0.1%
Other values (19660)  1356540  99.9%

Most occurring characters

Value               Count  Frequency (%)
0                 7621740  37.4%
1                 1676424  8.2%
E                 1357230  6.7%
N                 1357230  6.7%
S                 1357230  6.7%
G                 1357230  6.7%
2                  814062  4.0%
6                  743337  3.7%
8                  718980  3.5%
7                  712908  3.5%
Other values (4)  2642079  13.0%

Most occurring categories

Value                Count  Frequency (%)
Decimal Number    14929530  73.3%
Uppercase Letter   5428920  26.7%

Most frequent character per category

Decimal Number
Value    Count  Frequency (%)
0      7621740  51.1%
1      1676424  11.2%
2       814062  5.5%
6       743337  5.0%
8       718980  4.8%
7       712908  4.8%
4       690138  4.6%
3       687723  4.6%
5       666816  4.5%
9       597402  4.0%

Uppercase Letter
Value    Count  Frequency (%)
E      1357230  25.0%
N      1357230  25.0%
S      1357230  25.0%
G      1357230  25.0%

Most occurring scripts

Value      Count  Frequency (%)
Common  14929530  73.3%
Latin    5428920  26.7%

Most frequent character per script

Common
Value    Count  Frequency (%)
0      7621740  51.1%
1      1676424  11.2%
2       814062  5.5%
6       743337  5.0%
8       718980  4.8%
7       712908  4.8%
4       690138  4.6%
3       687723  4.6%
5       666816  4.5%
9       597402  4.0%

Latin
Value    Count  Frequency (%)
E      1357230  25.0%
N      1357230  25.0%
S      1357230  25.0%
G      1357230  25.0%

Most occurring blocks

Value     Count  Frequency (%)
ASCII  20358450  100.0%

Most frequent character per block

ASCII
Value               Count  Frequency (%)
0                 7621740  37.4%
1                 1676424  8.2%
E                 1357230  6.7%
N                 1357230  6.7%
S                 1357230  6.7%
G                 1357230  6.7%
2                  814062  4.0%
6                  743337  3.7%
8                  718980  3.5%
7                  712908  3.5%
Other values (4)  2642079  13.0%

Gene name
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 19651
Distinct (%): 1.4%
Missing: 0
Missing (%): 0.0%
Memory size: 10.4 MiB

Value                   Count
ABCF2                     138
DIABLO                    138
SCO2                      138
POLR2J3                   138
CCDC39                    138
Other values (19646)  1356540

Length

Max length: 15
Median length: 5
Mean length: 5.55526182
Min length: 2

Characters and Unicode

Total characters: 7539768
Distinct characters: 41
Distinct categories: 5
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: TSPAN6
2nd row: TSPAN6
3rd row: TSPAN6
4th row: TSPAN6
5th row: TSPAN6

Value                   Count  Frequency (%)
ABCF2                     138  < 0.1%
DIABLO                    138  < 0.1%
SCO2                      138  < 0.1%
POLR2J3                   138  < 0.1%
CCDC39                    138  < 0.1%
COG8                      138  < 0.1%
IGF2                      138  < 0.1%
ALDOA                     138  < 0.1%
SOD2                      138  < 0.1%
ATXN7                     138  < 0.1%
Other values (19641)  1355850  99.9%

Figure: Histogram of lengths of the category

Value                   Count  Frequency (%)
ccdc39                    138  < 0.1%
sco2                      138  < 0.1%
diablo                    138  < 0.1%
tbce                      138  < 0.1%
txnrd3nb                  138  < 0.1%
pde11a                    138  < 0.1%
atxn7                     138  < 0.1%
igf2                      138  < 0.1%
aldoa                     138  < 0.1%
h2bfs                     138  < 0.1%
Other values (19641)  1355850  99.9%

Most occurring characters

Value                Count  Frequency (%)
1                   587604  7.8%
A                   547308  7.3%
C                   474375  6.3%
P                   447465  5.9%
R                   380742  5.0%
2                   378879  5.0%
T                   315675  4.2%
S                   313881  4.2%
L                   305670  4.1%
N                   286281  3.8%
Other values (31)  3501888  46.4%

Most occurring categories

Value                Count  Frequency (%)
Uppercase Letter   5416776  71.8%
Decimal Number     2006727  26.6%
Lowercase Letter     75348  1.0%
Other Punctuation    25599  0.3%
Dash Punctuation     15318  0.2%

Most frequent character per category

Uppercase Letter
Value                Count  Frequency (%)
A                   547308  10.1%
C                   474375  8.8%
P                   447465  8.3%
R                   380742  7.0%
T                   315675  5.8%
S                   313881  5.8%
L                   305670  5.6%
N                   286281  5.3%
M                   275724  5.1%
D                   273102  5.0%
Other values (16)  1796553  33.2%

Decimal Number
Value   Count  Frequency (%)
1      587604  29.3%
2      378879  18.9%
3      235911  11.8%
4      172569  8.6%
5      143796  7.2%
6      118611  5.9%
0      104535  5.2%
7      100395  5.0%
8       89700  4.5%
9       74727  3.7%

Lowercase Letter
Value  Count  Frequency (%)
o      25116  33.3%
r      25116  33.3%
f      25116  33.3%

Dash Punctuation
Value  Count  Frequency (%)
-      15318  100.0%

Other Punctuation
Value  Count  Frequency (%)
.      25599  100.0%

Most occurring scripts

ValueCountFrequency (%)
Value     Count  Frequency (%)
Latin   5492124  72.8%
Common  2047644  27.2%

Most frequent character per script

Latin
Value                Count  Frequency (%)
A                   547308  10.0%
C                   474375  8.6%
P                   447465  8.1%
R                   380742  6.9%
T                   315675  5.7%
S                   313881  5.7%
L                   305670  5.6%
N                   286281  5.2%
M                   275724  5.0%
D                   273102  5.0%
Other values (19)  1871901  34.1%

Common
Value              Count  Frequency (%)
1                 587604  28.7%
2                 378879  18.5%
3                 235911  11.5%
4                 172569  8.4%
5                 143796  7.0%
6                 118611  5.8%
0                 104535  5.1%
7                 100395  4.9%
8                  89700  4.4%
9                  74727  3.6%
Other values (2)   40917  2.0%

Most occurring blocks

ValueCountFrequency (%)
Value    Count  Frequency (%)
ASCII  7539768  100.0%

Most frequent character per block

ASCII
Value                Count  Frequency (%)
1                   587604  7.8%
A                   547308  7.3%
C                   474375  6.3%
P                   447465  5.9%
R                   380742  5.0%
2                   378879  5.0%
T                   315675  4.2%
S                   313881  4.2%
L                   305670  4.1%
N                   286281  3.8%
Other values (31)  3501888  46.4%

Cell line
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 69
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 10.4 MiB

Value                Count
U-251 MG             19670
SiHa                 19670
HAP1                 19670
AN3-CA               19670
U-2197               19670
Other values (64)  1258880

Length

Max length: 31
Median length: 6
Mean length: 6.942028986
Min length: 2

Characters and Unicode

Total characters: 9421930
Distinct characters: 50
Distinct categories: 7
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: A-431
2nd row: A549
3rd row: AF22
4th row: AN3-CA
5th row: ASC diff

Value                Count  Frequency (%)
U-251 MG             19670  1.4%
SiHa                 19670  1.4%
HAP1                 19670  1.4%
AN3-CA               19670  1.4%
U-2197               19670  1.4%
MOLT-4               19670  1.4%
BJ hTERT+            19670  1.4%
ASC TERT1            19670  1.4%
EFO-21               19670  1.4%
HaCaT                19670  1.4%
Other values (59)  1160530  85.5%

Figure: Histogram of lengths of the category

Value                Count  Frequency (%)
bj                   78680  4.4%
mg                   59010  3.3%
htert                59010  3.3%
sv40                 39340  2.2%
large                39340  2.2%
t                    39340  2.2%
asc                  39340  2.2%
tert1                39340  2.2%
htcepi               19670  1.1%
u-2                  19670  1.1%
Other values (68)  1337560  75.6%

Most occurring characters

ValueCountFrequency (%)
Value                Count  Frequency (%)
-                   767130  8.1%
T                   708120  7.5%
E                   531090  5.6%
H                   472080  5.0%
2                   432740  4.6%
(space)             413070  4.4%
R                   413070  4.4%
C                   393400  4.2%
1                   354060  3.8%
S                   314720  3.3%
Other values (40)  4622450  49.1%

Most occurring categories

ValueCountFrequency (%)
Value                Count  Frequency (%)
Uppercase Letter   5114200  54.3%
Decimal Number     2026010  21.5%
Lowercase Letter    924490  9.8%
Dash Punctuation    767130  8.1%
Space Separator     413070  4.4%
Math Symbol          98350  1.0%
Other Punctuation    78680  0.8%

Most frequent character per category

Uppercase Letter
Value                Count  Frequency (%)
T                   708120  13.8%
E                   531090  10.4%
H                   472080  9.2%
R                   413070  8.1%
C                   393400  7.7%
S                   314720  6.2%
M                   295050  5.8%
A                   275380  5.4%
U                   216370  4.2%
B                   196700  3.8%
Other values (13)  1298220  25.4%

Lowercase Letter
Value              Count  Frequency (%)
a                 216370  23.4%
h                 137690  14.9%
e                  98350  10.6%
i                  78680  8.5%
d                  59010  6.4%
f                  59010  6.4%
r                  59010  6.4%
p                  59010  6.4%
g                  39340  4.3%
s                  39340  4.3%
Other values (3)   78680  8.5%

Decimal Number
Value   Count  Frequency (%)
2      432740  21.4%
1      354060  17.5%
4      196700  9.7%
3      196700  9.7%
6      196700  9.7%
7      157360  7.8%
0      137690  6.8%
8      137690  6.8%
9      118020  5.8%
5       98350  4.9%

Dash Punctuation
Value   Count  Frequency (%)
-      767130  100.0%

Space Separator
Value     Count  Frequency (%)
(space)  413070  100.0%

Math Symbol
Value  Count  Frequency (%)
+      98350  100.0%

Other Punctuation
Value  Count  Frequency (%)
/      78680  100.0%

Most occurring scripts

ValueCountFrequency (%)
Value     Count  Frequency (%)
Latin   6038690  64.1%
Common  3383240  35.9%

Most frequent character per script

Latin
Value                Count  Frequency (%)
T                   708120  11.7%
E                   531090  8.8%
H                   472080  7.8%
R                   413070  6.8%
C                   393400  6.5%
S                   314720  5.2%
M                   295050  4.9%
A                   275380  4.6%
a                   216370  3.6%
U                   216370  3.6%
Other values (26)  2203040  36.5%

Common
Value              Count  Frequency (%)
-                 767130  22.7%
2                 432740  12.8%
(space)           413070  12.2%
1                 354060  10.5%
4                 196700  5.8%
3                 196700  5.8%
6                 196700  5.8%
7                 157360  4.7%
0                 137690  4.1%
8                 137690  4.1%
Other values (4)  393400  11.6%

Most occurring blocks

ValueCountFrequency (%)
Value    Count  Frequency (%)
ASCII  9421930  100.0%

Most frequent character per block

ASCII
Value                Count  Frequency (%)
-                   767130  8.1%
T                   708120  7.5%
E                   531090  5.6%
H                   472080  5.0%
2                   432740  4.6%
(space)             413070  4.4%
R                   413070  4.4%
C                   393400  4.2%
1                   354060  3.8%
S                   314720  3.3%
Other values (40)  4622450  49.1%

TPM
Real number (ℝ≥0)

HIGH CORRELATION
SKEWED
ZEROS

Distinct: 15670
Distinct (%): 1.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 41.02075669
Minimum: 0
Maximum: 42716.3
Zeros: 388709
Zeros (%): 28.6%
Memory size: 10.4 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 3.4
Q3: 21.7
95-th percentile: 122.5
Maximum: 42716.3
Range: 42716.3
Interquartile range (IQR): 21.7

Descriptive statistics

Standard deviation: 281.4168207
Coefficient of variation (CV): 6.860351768
Kurtosis: 1981.124696
Mean: 41.02075669
Median Absolute Deviation (MAD): 3.4
Skewness: 32.03325779
Sum: 55674601.6
Variance: 79195.42696
Monotonicity: Not monotonic
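
Several of these statistics are determined by the others, which gives a quick sanity check. The sketch below uses the TPM values reported above and the standard textbook relationships (CV = standard deviation / mean, variance = standard deviation squared, sum = mean × number of observations, IQR = Q3 − Q1). It is an illustration of those identities, not the report's own computation.

```python
import math

# Values copied from the TPM section of the report.
n = 1357230
mean = 41.02075669
std = 281.4168207
q1, q3 = 0.0, 21.7

cv = std / mean          # coefficient of variation
variance = std ** 2      # variance is the squared standard deviation
total = mean * n         # sum of all values
iqr = q3 - q1            # interquartile range

# Compare against the reported figures, allowing for rounding.
print(math.isclose(cv, 6.860351768, rel_tol=1e-6))
print(math.isclose(variance, 79195.42696, rel_tol=1e-6))
print(math.isclose(total, 55674601.6, rel_tol=1e-6))
print(iqr)
```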

Figure: Histogram with fixed size bins (bins=50)

Value                   Count  Frequency (%)
0                      388709  28.6%
0.1                     55925  4.1%
0.2                     32537  2.4%
0.3                     22283  1.6%
0.4                     16956  1.2%
0.5                     13525  1.0%
0.6                     11470  0.8%
0.7                      9903  0.7%
0.8                      8899  0.7%
0.9                      7922  0.6%
Other values (15660)   789101  58.1%

Minimum 5 values
Value   Count  Frequency (%)
0      388709  28.6%
0.1     55925  4.1%
0.2     32537  2.4%
0.3     22283  1.6%
0.4     16956  1.2%

Maximum 5 values
Value    Count  Frequency (%)
42716.3      1  < 0.1%
39185.3      1  < 0.1%
28687.5      1  < 0.1%
28333.9      1  < 0.1%
27057.4      1  < 0.1%

pTPM
Real number (ℝ≥0)

HIGH CORRELATION
SKEWED
ZEROS

Distinct: 17361
Distinct (%): 1.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 50.79897158
Minimum: 0
Maximum: 53349.2
Zeros: 371434
Zeros (%): 27.4%
Memory size: 10.4 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 4.2
Q3: 26.9
95-th percentile: 151.9
Maximum: 53349.2
Range: 53349.2
Interquartile range (IQR): 26.9

Descriptive statistics

Standard deviation: 347.020905
Coefficient of variation (CV): 6.831258473
Kurtosis: 1988.872797
Mean: 50.79897158
Median Absolute Deviation (MAD): 4.2
Skewness: 31.89212424
Sum: 68945888.2
Variance: 120423.5085
Monotonicity: Not monotonic

Figure: Histogram with fixed size bins (bins=50)

Value                   Count  Frequency (%)
0                      371434  27.4%
0.1                     55282  4.1%
0.2                     32796  2.4%
0.3                     23006  1.7%
0.4                     17234  1.3%
0.5                     13998  1.0%
0.6                     11399  0.8%
0.7                     10041  0.7%
0.8                      8760  0.6%
0.9                      7885  0.6%
Other values (17351)   805395  59.3%

Minimum 5 values
Value   Count  Frequency (%)
0      371434  27.4%
0.1     55282  4.1%
0.2     32796  2.4%
0.3     23006  1.7%
0.4     17234  1.3%

Maximum 5 values
Value    Count  Frequency (%)
53349.2      1  < 0.1%
48942.7      1  < 0.1%
35831.9      1  < 0.1%
34597.9      1  < 0.1%
33796.9      1  < 0.1%

NX
Real number (ℝ≥0)

ZEROS

Distinct: 2757
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 8.447667897
Minimum: 0
Maximum: 1186.6
Zeros: 374658
Zeros (%): 27.6%
Memory size: 10.4 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 2.6
Q3: 11.1
95-th percentile: 32.2
Maximum: 1186.6
Range: 1186.6
Interquartile range (IQR): 11.1

Descriptive statistics

Standard deviation: 16.73002829
Coefficient of variation (CV): 1.980431581
Kurtosis: 177.0465664
Mean: 8.447667897
Median Absolute Deviation (MAD): 2.6
Skewness: 8.389758854
Sum: 11465428.3
Variance: 279.8938467
Monotonicity: Not monotonic

Figure: Histogram with fixed size bins (bins=50)

Value                  Count  Frequency (%)
0                     374658  27.6%
0.1                    52810  3.9%
0.2                    33142  2.4%
0.3                    24551  1.8%
0.4                    19423  1.4%
0.5                    16181  1.2%
0.6                    13963  1.0%
0.7                    12342  0.9%
0.8                    11089  0.8%
0.9                     9967  0.7%
Other values (2747)   789104  58.1%

Minimum 5 values
Value   Count  Frequency (%)
0      374658  27.6%
0.1     52810  3.9%
0.2     33142  2.4%
0.3     24551  1.8%
0.4     19423  1.4%

Maximum 5 values
Value   Count  Frequency (%)
1186.6      1  < 0.1%
1073.7      1  < 0.1%
959.6       1  < 0.1%
885.9       1  < 0.1%
872.7       1  < 0.1%

Interactions


Correlations


Pearson's r

Pearson's correlation coefficient (r) measures the linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear relationship the steepness of the line (its angle to the x-axis) does not affect the magnitude of r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
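
The calculation described above can be sketched in plain Python. The function name `pearson_r` is a hypothetical helper, and the input lists are the TPM and pTPM values of the first five sample rows shown at the end of this report, used purely for illustration.

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations.

    The 1/n normalisation factors cancel between numerator and
    denominator, so plain sums of deviations are used throughout.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


tpm = [27.8, 37.6, 108.1, 51.8, 32.3]    # first five TPM sample values
ptpm = [33.9, 45.5, 134.5, 64.4, 37.4]   # matching pTPM sample values
print(pearson_r(tpm, ptpm))

# Shifting and rescaling one variable leaves r unchanged.
print(pearson_r(tpm, [3 * v + 7 for v in ptpm]))
```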

Spearman's ρ

Spearman's rank correlation coefficient (ρ) measures the monotonic correlation between two variables and is therefore better than Pearson's r at capturing nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
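
A minimal sketch of this rank-based calculation follows. The helpers `average_ranks` and `spearman_rho` are hypothetical names, and tied values receive the average of their rank positions, which is a common convention assumed here rather than something the report states. Ties matter for this dataset because more than a quarter of the expression values are zero.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the rank variables of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


print(average_ranks([10, 20, 20, 30]))            # [1.0, 2.5, 2.5, 4.0]
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))  # monotonic but nonlinear
```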

Kendall's τ

Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one counts the concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
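
The pair-counting definition can be sketched directly. The function `kendall_tau_a` is a hypothetical helper implementing the simple variant described in the text, where tied pairs count as neither concordant nor discordant. Statistical libraries often use a tie-corrected variant instead, and the report does not say which one it uses.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # A pair is concordant when both variables order it the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


print(kendall_tau_a([1, 2, 3], [1, 2, 3]))  # 1.0
print(kendall_tau_a([1, 2, 3], [3, 2, 1]))  # -1.0
print(kendall_tau_a([1, 2, 3], [1, 3, 2]))  # 2 concordant, 1 discordant
```

The quadratic loop over all pairs is fine for illustration but would be slow on the 1357230 observations profiled here.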

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

Figure: A simple visualization of nullity by column.
Figure: Nullity matrix, a data-dense display which lets you quickly visually pick out patterns in data completion.

Sample

First rows

   Gene             Gene name  Cell line                   TPM    pTPM   NX
0  ENSG00000000003  TSPAN6     A-431                       27.8   33.9   7.9
1  ENSG00000000003  TSPAN6     A549                        37.6   45.5   10.6
2  ENSG00000000003  TSPAN6     AF22                        108.1  134.5  28.7
3  ENSG00000000003  TSPAN6     AN3-CA                      51.8   64.4   14.5
4  ENSG00000000003  TSPAN6     ASC diff                    32.3   37.4   12.6
5  ENSG00000000003  TSPAN6     ASC TERT1                   17.7   20.8   6.8
6  ENSG00000000003  TSPAN6     BEWO                        42.7   53.5   11.6
7  ENSG00000000003  TSPAN6     BJ                          14.9   18.4   4.5
8  ENSG00000000003  TSPAN6     BJ hTERT+                   22.3   26.5   7.1
9  ENSG00000000003  TSPAN6     BJ hTERT+ SV40 Large T+     31.8   38.4   9.0

Last rows

         Gene             Gene name   Cell line  TPM  pTPM  NX
1357220  ENSG00000285509  AP000646.1  U-138 MG   0.6  0.7   2.0
1357221  ENSG00000285509  AP000646.1  U-2 OS     0.0  0.0   0.2
1357222  ENSG00000285509  AP000646.1  U-2197     0.1  0.1   0.5
1357223  ENSG00000285509  AP000646.1  U-251 MG   0.2  0.3   0.9
1357224  ENSG00000285509  AP000646.1  U-266/70   0.0  0.1   0.2
1357225  ENSG00000285509  AP000646.1  U-266/84   0.0  0.0   0.1
1357226  ENSG00000285509  AP000646.1  U-698      0.0  0.0   0.1
1357227  ENSG00000285509  AP000646.1  U-87 MG    0.0  0.0   0.1
1357228  ENSG00000285509  AP000646.1  U-937      0.0  0.0   0.0
1357229  ENSG00000285509  AP000646.1  WM-115     0.1  0.1   0.4