Protocol to define benchmark sets

  1. PDB, SwissProt and BIOGRID were downloaded on the 20th of February, 2024
  2. For each cutoff date (30.04.2018; 31.05.2020; 15.02.2021; 30.09.2021; 15.07.2022; 01.11.2022; 01.01.2023; 01.01.2024), we took the first day of the following month as the effective cutoff and performed the following searches:
    1. Sequence search (PSI-BLAST: e-value 0.0001, 3 iterations, maximum target sequences 50 000) of all structures after the cutoff date against structures before the cutoff date. All queries having any hit longer than 10 amino acids with at least 20% sequence identity were filtered out. For training cutoff dates (30.04.2018; 30.09.2021), NMR structures were not included in the background database.
    2. Structure search (Foldseek, maximum target structures 50 000) of all structures after the cutoff date against structures before the cutoff date. All queries having any hit longer than 10 amino acids with a TM-score of at least 0.25 were filtered out. For training cutoff dates (30.04.2018; 30.09.2021), NMR structures were not included in the background database.
    3. Sequence search (PSI-BLAST: e-value 0.0001, 3 iterations, maximum target sequences 50 000) of SwissProt proteins against structures before the cutoff date. All queries having any hit longer than 10 amino acids with at least 20% sequence identity were filtered out. For training cutoff dates (30.04.2018; 30.09.2021), NMR structures were not included in the background database.
  3. We also searched for interactions:
    1. We used Voronota to detect all interchain interactions in PDB structures, considering the most probable oligomerization state stored on PDBe.
    2. We used BIOGRID (only 'direct interactions') to search for interactions in SwissProt sequences.
  4. For each cutoff date, we list:
    1. Single PDB chains without a homolog in the training (30.04.2018; 30.09.2021) or training/template (31.05.2020; 15.02.2021; 15.07.2022; 01.11.2022; 01.01.2023; 01.01.2024) set.
    2. Interacting PDB chain pairs without a homolog in the training (30.04.2018; 30.09.2021) or training/template (31.05.2020; 15.02.2021; 15.07.2022; 01.11.2022; 01.01.2023; 01.01.2024) set.
    3. SwissProt sequences without a homolog in the training (30.04.2018; 30.09.2021) or training/template (31.05.2020; 15.02.2021; 15.07.2022; 01.11.2022; 01.01.2023; 01.01.2024) set.
    4. Interacting SwissProt proteins without a homolog in the training (30.04.2018; 30.09.2021) or training/template (31.05.2020; 15.02.2021; 15.07.2022; 01.11.2022; 01.01.2023; 01.01.2024) set.
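
The homolog filter from step 2 and the resulting lists from step 4 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the scripts used to build the benchmark: the `Hit` record and all function names are hypothetical, the hits are assumed to be already parsed from PSI-BLAST or Foldseek output, and "first day of the following month" is read literally.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Hit:
    """One search hit (hypothetical record, not a PSI-BLAST/Foldseek format)."""
    query: str            # structure or sequence released after the cutoff
    target: str           # structure released before the cutoff
    aligned_length: int   # hit length in amino acids
    score: float          # percent sequence identity, or TM-score


def first_day_of_following_month(cutoff: date) -> date:
    """Return the first day of the month after the given cutoff date."""
    if cutoff.month == 12:
        return date(cutoff.year + 1, 1, 1)
    return date(cutoff.year, cutoff.month + 1, 1)


def queries_with_homologs(hits: Iterable[Hit], min_score: float,
                          min_length: int = 10) -> Set[str]:
    """Queries having any hit longer than min_length with score >= min_score."""
    return {
        h.query
        for h in hits
        if h.aligned_length > min_length and h.score >= min_score
    }


def novel_queries(all_queries: Iterable[str], hits: Iterable[Hit],
                  min_score: float, min_length: int = 10) -> List[str]:
    """Keep, in input order, the queries that have no qualifying hit."""
    flagged = queries_with_homologs(hits, min_score, min_length)
    return [q for q in all_queries if q not in flagged]


if __name__ == "__main__":
    # Made-up hits, for illustration only.
    example_hits = [
        Hit("q1", "t1", aligned_length=50, score=35.0),   # homolog: filtered out
        Hit("q2", "t2", aligned_length=8, score=90.0),    # hit too short: kept
        Hit("q3", "t3", aligned_length=120, score=19.9),  # identity below 20%: kept
    ]
    print(first_day_of_following_month(date(2018, 4, 30)))
    print(novel_queries(["q1", "q2", "q3", "q4"], example_hits, min_score=20.0))
```

The same functions would cover the structure search by passing TM-scores in `score` and setting `min_score=0.25`. A chain would appear in the final lists only if it passes both the sequence and the structure filter, which here amounts to intersecting the two results of `novel_queries`.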