The diversity of immunoglobulin (IG) and T cell receptor (TR) chains
depends on several mechanisms: combinatorial diversity, which is a consequence
of the number of V, D and J genes, and the N-REGION diversity, which creates an
extensive and clonal somatic diversity at the V-J and V-D-J junctions. For the
IG, the diversity is further increased by somatic hypermutations. The number of
different junctions per chain and per individual is estimated to be $10^{12}$.
We have chosen the human TRAV-TRAJ junctions as an
example in order to characterize the required criteria for a standardized
analysis of the IG and TR V-J and V-D-J junctions, based on the IMGT-ONTOLOGY
concepts, and to serve as a first IMGT junction reference set
(IMGT®, http://imgt.cines.fr). We performed
a thorough statistical analysis of 212 human rearranged TRAV-TRAJ sequences,
which were aligned and analysed with the integrated IMGT/V-QUEST software
(which includes IMGT/JunctionAnalysis) and then verified manually by experts. Furthermore, we
compared these 212 sequences with 37 other human TRAV-TRAJ junction sequences
for which some particularities (potential sequence polymorphisms, sequencing
errors, etc.) did not allow IMGT/JunctionAnalysis to provide the correct
biological results, according to expert verification. Using statistical
learning, we constructed an automatic warning system to predict whether new,
automatically analysed TRAV-TRAJ sequences should be manually re-checked. We
estimated the robustness of this automatic warning system.
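
As an illustration only, the idea of a learned warning system with a robustness estimate can be sketched as follows. This is a minimal sketch under stated assumptions, not the method used in the study: the three features (N-region length and the number of nucleotides trimmed from the V and J ends), the nearest-centroid classifier, the leave-one-out estimate, and all data values are hypothetical choices made for the example.

```python
from statistics import mean

# Hypothetical features per junction (assumed for illustration):
# (N-region length, nt trimmed from 3' end of V, nt trimmed from 5' end of J).
# All values are synthetic toy data, NOT taken from the study.
OK = [(3, 2, 4), (5, 1, 3), (2, 3, 5), (4, 0, 2), (6, 2, 6), (1, 4, 3)]
FLAG = [(14, 9, 12), (12, 11, 10), (16, 8, 13), (11, 10, 9)]


def centroid(rows):
    """Column-wise mean of a list of feature tuples."""
    return tuple(mean(col) for col in zip(*rows))


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(ok_rows, flag_rows):
    """The 'model' is simply the pair of class centroids."""
    return centroid(ok_rows), centroid(flag_rows)


def needs_recheck(model, features):
    """Warn when the junction lies closer to the flagged centroid."""
    ok_c, flag_c = model
    return sq_dist(features, flag_c) < sq_dist(features, ok_c)


def leave_one_out_accuracy(ok_rows, flag_rows):
    """Crude robustness estimate: hold out each junction in turn,
    retrain on the rest, and count correct predictions."""
    labelled = [(r, False) for r in ok_rows] + [(r, True) for r in flag_rows]
    hits = 0
    for i, (row, label) in enumerate(labelled):
        rest = labelled[:i] + labelled[i + 1:]
        model = train([r for r, flagged in rest if not flagged],
                      [r for r, flagged in rest if flagged])
        hits += needs_recheck(model, row) == label
    return hits / len(labelled)


if __name__ == "__main__":
    model = train(OK, FLAG)
    print("re-check (13, 10, 11)?", needs_recheck(model, (13, 10, 11)))
    print("leave-one-out accuracy:", leave_one_out_accuracy(OK, FLAG))
```

In such a scheme, a junction flagged by `needs_recheck` would be routed to manual expert verification. The leave-one-out accuracy is one simple way to gauge how stable the warnings are when the training set changes.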