
The sequence of the acidic repeat protein (arp) gene differentiates venereal from nonvenereal Treponema pallidum subspecies, and the gene has evolved under strong positive selection in the subspecies that causes syphilis

Kristin N. Harper, Hsi Liu, Paolo S. Ocampo, Bret M. Steiner, Amy Martin, Keith Levert, Dongxia Wang, Madeline Sutton, George J. Armelagos
DOI: http://dx.doi.org/10.1111/j.1574-695X.2008.00427.x 322-332 First published online: 1 August 2008


Despite the completion of the Treponema pallidum genome project, only minor genetic differences have been found between the subspecies that cause venereal syphilis (ssp. pallidum) and the nonvenereal diseases yaws (ssp. pertenue) and bejel (ssp. endemicum). In this paper, we describe sequence variation in the arp gene which allows straightforward differentiation of ssp. pallidum from the nonvenereal subspecies. We also present evidence that this region is subject to positive selection in ssp. pallidum, consistent with pressure from the immune system. Finally, the presence of multiple, but distinct, repeat motifs in both ssp. pallidum and Treponema paraluiscuniculi (the pathogen responsible for rabbit syphilis) suggests that a diverse repertoire of repeat motifs is associated with sexual transmission. This study suggests that variations in the number and sequence of repeat motifs in the arp gene have clinical, epidemiological, and evolutionary significance.

  • Treponema pallidum
  • repeat region
  • arp
  • typing
  • syphilis
  • yaws


The Treponema pallidum group of bacteria causes the human diseases syphilis (ssp. pallidum), yaws (ssp. pertenue), and bejel (ssp. endemicum). Serological tests can identify treponemal infection, but they cannot differentiate between the subspecies responsible. This is problematic, as different infections may require different treatment and prevention measures. For example, medical professionals in New Zealand misdiagnosed a pregnant woman with active syphilis as having yaws, with the result that she was given an ineffective treatment and gave birth to a child with congenital infection (Wilson & Mauger, 1973). Researchers in Niger have reported being unable to tell whether high infant infection rates are due to yaws or congenital syphilis (Julvez et al., 1998). Similarly, differentiating between syphilis and yaws has proved a challenge for public health workers in rural Tanzania, where infection rates in children are high (Klouman et al., 1997); because syphilis infection might suggest sexual abuse or ongoing congenital transmission, the inability to conclusively determine the nature of the infection is of concern.

In the absence of a specific serological test, health workers must rely on symptoms and population-level data to make a diagnosis. However, lesions may be ambiguous. For example, yaws may mimic venereal syphilis (Wilson, 1973; Engelkens et al., 1990), or the symptoms of a previous yaws infection may mask those of a new syphilis infection (Wilson & Mauger, 1973). In addition, population data are not always readily available, as for patients who have recently immigrated (Engelkens et al., 1989), or two infections may be present in the same population (Lagarde et al., 2003). As yaws eradication efforts near their goal (Lahariya & Pradhan, 2007) and case diagnosis becomes more difficult, a molecular means of determining whether the infection is venereal or nonvenereal becomes essential.

Comparative genetic studies of the T. pallidum subspecies have been limited by the small number of nonvenereal strains available for study and the relatively low level of variation present between the subspecies. The first DNA study, which used hybridization to estimate the extent of polymorphism between strains, found that the level of divergence between ssp. pallidum and pertenue was so low as to fall below the level of detection (Miao & Fieldsteel, 1980). In this context, the discoveries of single nucleotide polymorphisms (SNPs) that appeared useful in distinguishing between the venereal and nonvenereal subspecies warranted publication throughout the 1990s (Noordhoek et al., 1990; Centurion-Lara et al., 1998; Cameron et al., 1999). Recently, more extensive variation has been described between the subspecies (Gray et al., 2006; Harper et al., 2008), paving the way for the development of a new, molecular method for identifying the subspecies responsible for an infection. Though several potential systems have been described, they rely on the presence of one SNP (Noordhoek et al., 1990; Centurion-Lara et al., 1998; Cameron et al., 1999) or involve the use of multiple genes (Centurion-Lara et al., 2006). The use of a SNP to differentiate between subspecies has yielded inconsistent results, as several studies have shown (Noordhoek et al., 1990; Harper et al., 2008). In contrast, relying on a multigene system may yield more reliable results, but the testing process is labor intensive and interpretation may be difficult when strains of a single subspecies yield different patterns (Centurion-Lara et al., 2006). As a result of the limited comparative data available, an ideal molecular test with which to distinguish the subspecies has not yet been developed and the genetic differences responsible for the divergent clinical manifestations of the treponematoses remain unclear.

Tandem repeat proteins in pathogenic bacteria were originally identified while studying virulence genes (Denoeud & Vergnaud, 2004). It has been frequently observed that genes containing tandem repeats encode outer membrane proteins, suggesting that such genes help pathogens adapt to their hosts (Denoeud & Vergnaud, 2004). In the past, this relationship has been exploited to identify novel pathogenicity factors on the basis of sequence data in systems such as Haemophilus influenzae (Hood et al., 1996) and Neisseria meningitidis (Saunders et al., 2000). In addition, it is well-documented that variation in tandem repeats accumulates more rapidly than do SNPs (Ellegren, 2000; Keim et al., 2004). Because of the lack of polymorphism in the T. pallidum genome, a focus on repeat regions may prove helpful in developing a subspecies typing system and identifying potential virulence determinants.

The acidic repeat protein gene (arp) has been described as a useful component of a strain subtyping system in ssp. pallidum (Pillay et al., 1998). The 5′ and 3′ ends of the arp gene are conserved. However, a central repeat region composed of 60 base-pair repeat motifs is highly variable with regard to the number of repeat units and the sequence of the nucleotide repeat motifs (Liu et al., 2007). The size of the gene's repeat region has been used to type clinical specimens in South Africa (Pillay et al., 2002; Molepo et al., 2007), North and South Carolina (Pope et al., 2005), and Arizona (Sutton et al., 2001). Recently, a single ssp. pallidum strain has been shown to contain at least three distinct repeat motifs, while the three ssp. pertenue and endemicum strains sequenced to date contain only one type of repeat motif (Liu et al., 2007), suggesting that this gene has potential for use in a subspecies typing system.

Although the function of the Arp has not been demonstrated, it has been shown that the repeat motifs are highly immunogenic and contain a potential fibronectin-binding domain (Liu et al., 2007). Based on analysis of gene sequences, the Arp appears to be either secreted or localized in the plasma membrane. The 20 amino acid repeated motif contains the sequence E-V-E-D, which may bind to fibronectin, as has been described in other systems (Joh et al., 1994). In addition, the Arp is similar to many proteins secreted by type I machinery in both its acidic pH, due to the high concentration of glutamic acid residues, and its many repeated sequences (Delepelaire, 2004).

The first goal of this study was to characterize the arp gene in a greater number of clinical and laboratory isolates of T. pallidum. It is currently unknown whether the variation present in repeat length is matched by sequence variation in ssp. pallidum, or whether variation may be used to distinguish ssp. pallidum from the nonvenereal subspecies. The second goal was to characterize arp homologs in a simian strain of Treponema and Treponema paraluiscuniculi (the agent responsible for rabbit syphilis). We sought to discover whether these strains harbor their own versions of the arp gene, and if so, how these copies have diverged from those found in T. pallidum. The third and final goal of this study was to interpret sequence variation in light of selective pressure and the evolutionary relationships between strains.

Materials and methods


The repeat region of the arp gene was examined in sixteen strains of T. pallidum ssp. pallidum, including eight laboratory strains and eight clinical strains (Table 1). Nine laboratory strains of ssp. pertenue, two strains of ssp. endemicum, and two strains labeled as pertenue but suspected to be pallidum strains (Clark, 1967; Centurion-Lara et al., 1998) were also sequenced. Finally, the arp gene was characterized in the Fribourg–Blanc or simian strain of Treponema, isolated from a wild baboon in Guinea in 1963, and two strains of T. paraluiscuniculi, which causes syphilis in rabbits. DNA was obtained from T. pallidum organisms grown in rabbit tissue or from clinical specimens. Laboratory isolates of T. pallidum were grown in New Zealand white rabbits, consistent with guidelines set out by the Institutional Animal Care and Use Committee at the United States Centers for Disease Control and Prevention (CDC). Clinical samples were obtained from swabs of genital ulcers during a previous study in Arizona (Sutton et al., 2001). DNA was extracted from tissue and swab transport media using the QIAamp DNA extraction minikit (Qiagen, Valencia, CA, catalog number 51306), according to the manufacturer's instructions for tissue or fluid preparation.

Table 1

Pathogenic Treponema strains (n=32) examined in this study

Name | Subspecies | Place collected | Year | Sample type | Site sampled | Source
Brazzaville | T.p. pertenue | Congo Rep. | 1960 | DNA only | Unknown | L. Schouls
CDC-1 | T.p. pertenue | Ghana | 1980 | Live strain | Papillomatous lesion | CDC
CDC-2 | T.p. pertenue | Ghana | 1980 | Live strain | Papillomatous lesion | CDC
CDC-2575 | T.p. pertenue | Ghana | 1980 | DNA only | Unknown | L. Schouls
Gauthier | T.p. pertenue | Nigeria | 1960 | Live strain | Papillomatous lesion on scapula | S. Lukehart
Ghana | T.p. pertenue | Ghana | 1988 | DNA only | Unknown | L. Schouls
Pariaman | T.p. pertenue | Indonesia | 1988 | DNA only | Unknown | L. Schouls
Samoa D | T.p. pertenue | Samoa | 1953 | Live strain | Frambesioma | CDC
Samoa F | T.p. pertenue | Samoa | 1953 | Live strain | Frambesioma | CDC
Bosnia | T.p. endemicum | Bosnia | 1950 | Live strain | Ulcer on shaft of penis | CDC
Iraq | T.p. endemicum | Iraq | 1951 | Live strain | Unknown | CDC
 | T.p. pallidum | United States | 1998–1999 | DNA only | Primary lesions, ulcers | M. Sutton
Baltimore-9 | T.p. pallidum | United States | 1970s | Live strain | Liver biopsy | CDC
Chicago B | T.p. pallidum | United States | 1951 | Live strain | Lesion on prepuce | CDC
Dallas-1 | T.p. pallidum | United States | 1988 | Live strain | Amniotic fluid | CDC
Grady | T.p. pallidum | United States | 1990s | Live strain | Unknown | CDC
Haiti B | T.p. pallidum | Haiti | 1951 | Live strain | Lesions on lower abdomen | CDC
Madras | T.p. pallidum | India | 1954 | Live strain | Ulcerative lesion on thigh | CDC
Mexico A | T.p. pallidum | Mexico | 1953 | Live strain | Primary lesion | CDC
Nichols | T.p. pallidum | United States | 1912 | Live strain | Spinal fluid, neurosyphilis | CDC
Philadelphia-1 | T.p. pallidum | United States | 1988 | Live strain | Unknown | CDC
Street-14 | T.p. pallidum | United States | ? | Live strain | Unknown | CDC
Simian strain | T.p. ? | Guinea | 1966 | Live strain | Lymph node of baboon | S. Lukehart
T. paraluiscuniculi Strain A | N/A | United States, but rabbit host descended from a European type | Pre-1957 | Live strain | Laboratory rabbit | CDC
T. paraluiscuniculi Strain H | N/A | United States, but rabbit host descended from a European type | Pre-1957 | DNA only | Laboratory rabbit | S. Lukehart
  • Noordhoek et al. (1990).

  • Laboratory notebooks of Rob George, CDC.

  • Turner & Hollander (1957).

  • Sutton et al. (2001).

  • Strains were originally labeled ssp. pertenue, but sequence analysis and clinical manifestations in a rabbit infection model suggest that they are, in reality, ssp. pallidum, as described in the Materials and methods.

  • Fribourg-Blanc et al. (1966).

DNA amplification and sequencing

The repeat region of the arp gene was amplified using PCR. Because of the difficulty of amplifying long repeat regions, we found it necessary to optimize the PCR using the Opt-2 kit (Sigma, St. Louis, MO) in order to obtain a single PCR product. Each 50-µL reaction contained 2.5 U of Taq polymerase (Sigma), 5 µL of Opt-2 PCR Buffer #7, 1 µL of the Opt-2 50X Universal Buffer, 0.5 µM dNTPs (Sigma), 0.2 µM of the forward and reverse primers (Invitrogen, Carlsbad, CA), and 1 µg of single-stranded binding protein (Sigma). The addition of single-stranded binding protein, in particular, was found to prevent the formation of satellite bands during the amplification process. The forward primer 5′-AGCGTGATCCTCTGTCATCC-3′ and reverse primer 5′-TTGGGAGCTGAGTTGGAAAC-3′ were designed to bind to the conserved areas flanking the repeat region. The PCR program was as follows: one cycle of 94 °C for 5 min; 35–45 cycles of 94 °C for 1 min, 55 °C for 1 min, and 72 °C for 2.5 min; one cycle of 72 °C for 7 min. Amplicons were resolved on a 1% agarose gel, and DNA was purified for sequencing using the Qiagen Gel Purification kit (Qiagen), according to the manufacturer's instructions. The dnasize program (Raghava, 2001) was used to compute the sizes of the T. paraluiscuniculi repeat regions, which were exceptionally large. Repeat size was determined based on the distance these amplicons migrated on a 0.8% agarose gel as compared with a DNA ladder (1-kb Plus DNA Ladder, Invitrogen). Purified amplicons were sequenced by SeqWright (Fisher Sequencing, Houston, TX). Sequences were deposited in GenBank, under accession numbers EU101724–EU101754.
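As a quick sanity check on the primer pair quoted above, the following minimal Python sketch computes the length, GC fraction, and a rule-of-thumb melting temperature for each primer. The Wallace rule used here (2 °C per A/T, 4 °C per G/C) is a rough approximation chosen for illustration; it is not stated in the text as the method used to design the primers, and the function names are hypothetical.

```python
# Primer sequences as quoted in the text (5' to 3').
FORWARD = "AGCGTGATCCTCTGTCATCC"
REVERSE = "TTGGGAGCTGAGTTGGAAAC"


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rule-of-thumb melting temperature in deg C: 2*(A+T) + 4*(G+C).

    Assumption: the Wallace rule is only a coarse estimate for short
    oligonucleotides; it ignores salt and primer concentration.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq):
    """Reverse complement, i.e. the template-strand site a primer anneals to."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    for name, primer in (("forward", FORWARD), ("reverse", REVERSE)):
        print(name, len(primer), gc_fraction(primer), wallace_tm(primer))
```

Under this approximation both primers have estimated melting temperatures a few degrees above the 55 °C annealing step, which is at least consistent with the cycling conditions described.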

Analysis of amino acid substitutions

Amino acid substitutions were designated as either conserved or radical according to three independent measures: (1) charge (positive, negative, or neutral); (2) polarity and volume (polar and large, polar and small, neutral and small, neutral and large, nonpolar and small, or nonpolar and large); and (3) Grantham's distance, which reflects volume, weight, polarity, and carbon composition (Grantham, 1974). For the first two measures, a change from one class to another was considered radical. For the last measure, a value of <100 was considered conservative and a value of >100 was considered radical (100 being the mean chemical distance obtained from Grantham's formula).
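The charge and Grantham criteria can be illustrated with a short Python sketch. The charge classes below treat histidine as positive, which is an assumption that does not affect the substitutions shown. The Grantham distances are entered by hand for six residue pairs only and should be checked against the published matrix (Grantham, 1974) before reuse. The polarity/volume criterion is omitted because the class assignments are not listed in the text. A distance of exactly 100 is left undefined in the text and is treated as conservative here.

```python
# Charge class per residue (one-letter codes). Treating H as positive is an
# assumption; it does not affect the substitutions classified below.
CHARGE = {aa: "neutral" for aa in "ACFGILMNPQSTVWY"}
CHARGE.update({"K": "positive", "R": "positive", "H": "positive",
               "D": "negative", "E": "negative"})

# Grantham distances for six residue pairs only, entered by hand;
# verify against Grantham (1974) before relying on them.
GRANTHAM = {
    frozenset("AV"): 64,
    frozenset("KC"): 202,
    frozenset("EI"): 134,
    frozenset("RF"): 97,
    frozenset("EL"): 138,
    frozenset("GV"): 109,
}


def charge_change(a, b):
    """'R' (radical) if the charge class changes, else 'C' (conservative)."""
    return "R" if CHARGE[a] != CHARGE[b] else "C"


def grantham_change(a, b, threshold=100):
    """'R' if the Grantham distance exceeds the threshold, else 'C'."""
    return "R" if GRANTHAM[frozenset((a, b))] > threshold else "C"


if __name__ == "__main__":
    for a, b in ("AV", "KC", "EI", "RF", "EL", "GV"):
        print(f"{a} to {b}: charge={charge_change(a, b)}, "
              f"Grantham={grantham_change(a, b)}")
```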

Statistical analyses

Data analysis was performed on repeat size frequency in clinical samples by pooling data from several studies. Statistical analysis was performed using the R statistics package (Ihaka & Gentleman, 1996). Our analysis of T. pallidum clinical strains (n=267) drew on data from four studies based in Phoenix, AZ (Sutton et al., 2001), North and South Carolina (Pope et al., 2005), South Africa (Pillay et al., 2002), and the United States and Africa (Pillay et al., 1998). Central moments were calculated and the Shapiro–Wilk test was used to assess the normality of both the pooled data and the African clinical strains alone (n=185). The comparison of venereal (n=10) and nonvenereal laboratory strains (n=11) drew on data collected in this study. Again, the Shapiro–Wilk test was used to assess the normality of each set. The Kolmogorov–Smirnov test was used to determine whether samples were drawn from the same distribution, a one-tailed Wilcoxon rank sum test was used to compare means of the two samples, and the Fligner–Killeen test of homogeneity of variances was also performed. Finally, our comparisons of strains from early (n=161) and late (n=13) infection consisted of repeat region size data from two studies on South African patients. One study analyzed strains collected from the genital lesions characteristic of early syphilis (Pillay et al., 2002), while the other characterized strains collected from the central nervous system of patients diagnosed with tertiary-stage neurosyphilis (Molepo et al., 2007). A Shapiro–Wilk test was used to assess whether the sample sets were drawn from a normal distribution, a two-tailed Wilcoxon rank sum test was used to compare the means of the two samples, and a Fligner–Killeen test of homogeneity of variances was performed.


Sequence variation in repeat motifs

A total of four repeat motifs were found in T. pallidum strains. While only the type II repeat motif was identified in the nonvenereal ssp. pertenue and endemicum, four types of repeat motif were identified within ssp. pallidum strains: types I–III and type II/III (Fig. 1). The type II/III motif, however, appeared to result from recombination between types II and III, a process often responsible for sequence degeneracy in bacterial tandem repeats (van Belkum et al., 1998). Two strains of unknown subspecies, Haiti B and Madras, appeared to be ssp. pallidum strains based on the sequence at this gene. These data are consistent with previous studies (Clark, 1967; Centurion-Lara et al., 1998; Cameron et al., 1999; Harper et al., 2008). The arp gene was also found in T. paraluiscuniculi and the simian strain of T. pallidum. The simian strain, with 15 repeats, was found to contain only the type II repeat motif, as in the nonvenereal human yaws subspecies. In contrast, four new repeat motifs were identified in two T. paraluiscuniculi strains (Fig. 1). Although highly similar to the repeat motifs found in T. pallidum, none were identical.

Figure 1

Nucleotide and translated amino acid alignment of the repeat motifs observed in the arp gene of various pathogenic Treponema strains.

Five codons, two clustered near the beginning of the repeat motif, at positions six and eight, and three clustered near the end, at positions 15–17, were found to be hypervariable in T. pallidum (Fig. 1). Codons six and eight were also found to be polymorphic within T. paraluiscuniculi (Fig. 1). In both species, codon six was polymorphic for alanine or valine, while codon eight was polymorphic for lysine or glycine. All but one of the codon polymorphisms observed involved the second base position. A single first position nucleotide substitution was found in codon nineteen, which distinguished T. paraluiscuniculi from T. pallidum (Fig. 1). The E-V-E-D region potentially involved in fibronectin binding was conserved in all repeat motifs.

Repeat motif substitutions were analyzed in terms of predicted functional change and evolutionary relationships. All nucleotide differences observed in the arp repeat motifs were predicted to result in amino acid substitutions. Amino acid substitutions were assessed as radical or conservative on the basis of three different criteria: charge, polarity/volume, and Grantham's distance index (Table 2). All six observed substitutions were radical by at least one of these criteria. Five of the six, all but the alanine to valine substitution at residue 6, were radical in at least two of three categories. Three substitutions constituted radical differences in every category: lysine to cysteine at residue 8, glutamic acid to isoleucine at residue 15, and glutamic acid to leucine at residue 17.

Table 2

Radical amino acid changes in the repeat units of the arp gene

Residue number | Change | Charge | Polarity/volume | Grantham distance
6 | A to V | C | R | C
8 | K to C | R | R | R
15 | E to I | R | R | R
16 | R to F | R | R | C
17 | E to L | R | R | R
19 | G to V | C | R | R
  • Substitutions were determined to be either conservative (C) or radical (R) with respect to three different measures, as described in the Materials and methods.

Size and order of repeat motifs

The size of the arp repeat region exhibited great variability even within subspecies. The subspecies pallidum strains sequenced in this study contained from 4 to 16 repeat units (Fig. 2). Even the shortest ssp. pallidum isolate, which had only four repeats, contained the three most common types of repeat motif (I–III), as did every other strain of this subspecies. In contrast, all nonvenereal strains, including the simian isolate which had 15 repeats, contained only type II repeats. Repeat regions in the ssp. pertenue and endemicum strains ranged from three repeat units to 12 repeat units. The repeat region of the arp gene in T. paraluiscuniculi was much longer than those found in T. pallidum strains. Treponema paraluiscuniculi strain A contained 21 repeat units. Using size data, Strain H was determined to contain c. 25 repeat units, the most central of which could not be sequenced (gel not shown). Both T. paraluiscuniculi strains ended with a partial type V repeat (not pictured), before reverting to the conserved flanking sequence.
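Estimating a repeat count from amplicon size, as was done for Strain H, comes down to simple arithmetic: subtract the nonrepeat flanking sequence included in the amplicon and divide by the 60-bp motif length. The sketch below is illustrative only. The flanking length is not given in the text, so it is passed in as a parameter, and the value used in the example is hypothetical.

```python
REPEAT_UNIT_BP = 60  # repeat motif length stated in the text


def estimate_repeat_units(amplicon_bp, flank_bp):
    """Estimate the number of repeat units in an amplicon.

    amplicon_bp: amplicon size estimated from gel migration.
    flank_bp: total nonrepeat sequence between the primers and the repeat
              region (hypothetical input; not given in the text).
    """
    repeat_region_bp = amplicon_bp - flank_bp
    if repeat_region_bp < 0:
        raise ValueError("flanking length exceeds amplicon size")
    return round(repeat_region_bp / REPEAT_UNIT_BP)


if __name__ == "__main__":
    # Hypothetical example: a 1000-bp amplicon with 160 bp of flanking sequence.
    print(estimate_repeat_units(1000, 160))
```

Because gel-based size estimates carry some error, a count obtained this way is approximate, which matches the "c. 25" reported for Strain H.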

Figure 2

Composition of the repeat region observed in the arp gene of various pathogenic Treponema strains. Descriptions of all strains, including subspecies and the country in which they were gathered, can be found in Table 1. The dashed line denotes missing sequence data.

Specific patterns were present not only in the motifs themselves, but in the order of motifs within the repeat region. For example, in ssp. pallidum strains, repeat regions began with alternating type I and II motifs and ended with a short run of type III or II/III motifs (Fig. 2). All but one of the clinical strains sequenced had identical repeat regions, suggesting that the outbreak from which they were gathered (Phoenix, AZ) consisted of closely related strains. The ends of the repeat region were conserved between the two T. paraluiscuniculi strains examined, with variability occurring in central repeats.

Analysis of selection based on clinical strain arp size variation

Data on the frequency of repeat region size variants in various populations were used to test hypotheses about the evolution of the arp gene. Under neutral evolution, in the absence of selection pressure, tandem repeats are believed to evolve primarily through the addition or deletion of one repeat unit (Kruglyak et al., 1998). Thus, the distribution of repeat sizes of the arp gene in a given population would be expected to approximate the normal distribution, with a peak at the mean size and decreasing frequency of variants as distance from the mean increases. If, on the other hand, escape from the immune system is a significant force in shaping the evolution of the arp gene, then size variants extremely different from the mean should be observed more frequently than expected. The distribution of size variants obtained from pooled clinical data (n=267) was not drawn from the normal distribution (Shapiro–Wilk normality test, P<0.001). Instead, the data followed a leptokurtic distribution; divergence from the normal distribution was due primarily to an excess number of observations occurring very near the mean size variant (containing 14 repeats) as well as an overabundance of very small and very large repeats (Fig. 3a). Because syphilis is relatively rare in the United States, strains gathered from outbreaks there were likely to be contracted in a short period of time and closely related. In order to see whether strains gathered from the United States (n=82) contributed inordinately to the high center peak, or ‘clumping,’ evident in the distribution of repeat size variants, an analysis of only clinical strains gathered in Africa was performed, using the data from Pillay et al. (1998, 2002). Although syphilis is much more common at the African sites from which strains were collected, and infection in the community is longstanding, the distribution of repeat size variants was qualitatively and quantitatively similar to the pooled data (Fig. 3a). Like the pooled data, the African data were not drawn from the normal distribution (Shapiro–Wilk normality test, P<0.001) and followed a leptokurtic distribution. A comparison of ssp. pallidum laboratory strains (n=10) with ssp. endemicum and pertenue laboratory strains (n=11) revealed that ssp. endemicum and pertenue strains contained significantly shorter repeat regions than venereal strains (one-tailed Wilcoxon rank sum test, P<0.001). In addition, it was found that size variant frequency in nonvenereal strains did not statistically differ from the normal distribution (Shapiro–Wilk test, P=0.558), while the set of observations from the venereal laboratory strains was, like clinical strains, significantly different from the normal distribution (Shapiro–Wilk test, P=0.001) (Fig. 3b). Additionally, the Kolmogorov–Smirnov test indicated that venereal and nonvenereal laboratory strains were likely drawn from different distributions (two-sided P=0.002), although the variance of each sample did not differ significantly (Fligner–Killeen test for homogeneity of variances, P=0.089).
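To make the notion of a leptokurtic distribution concrete, the sketch below computes central moments and excess kurtosis in plain Python. A sharp central peak combined with a few extreme values gives positive excess kurtosis, whereas a flat distribution gives a negative value. The two data sets are invented toy values chosen only to illustrate the calculation; they are not the clinical data analyzed in this study, and the analyses reported here were run in R.

```python
def central_moment(values, k):
    """k-th central moment (population form, dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    m2 = central_moment(values, 2)
    m4 = central_moment(values, 4)
    return m4 / m2 ** 2 - 3.0


# Toy repeat-count data (invented for illustration only).
# Strong peak at 14 repeats plus a few very small and very large variants.
peaked = [14] * 20 + [13] * 2 + [15] * 2 + [4, 5, 22, 24]
# One observation at every size from 4 to 24 repeats: a flat distribution.
flat = list(range(4, 25))

if __name__ == "__main__":
    print("peaked:", excess_kurtosis(peaked))  # positive: leptokurtic
    print("flat:", excess_kurtosis(flat))      # negative: platykurtic
```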

Figure 3

Statistical analysis of arp repeat size variants. (a) Repeat size variants of the arp gene observed in pooled clinical studies of Treponema pallidum ssp. pallidum strains (n=267). Strains were analyzed in previous studies in the United States and Africa, as described in the Materials and methods. A solid line representing the normal distribution for the empirical mean size and SD is included to illustrate deviations from normality. (b) Repeat size of the arp gene in laboratory propagated strains of T. pallidum. Venereal (ssp. pallidum, n=10) vs. nonvenereal (ssp. endemicum and pertenue, n=11) strains were compared. These rabbit propagated strains, sequenced in this study, are distinct from the clinical sample-derived strains analyzed in Fig. 3a. (c) Repeat size of the arp gene in T. pallidum ssp. pallidum strains obtained from early (n=197) vs. late (n=13) stage infections.

Further, if the persistence of arp variants is driven by escape from host immune pressure, one would expect that a population with a greater average length of infection would host more variants with repeat sizes extremely different from the mean than a population with a shorter average time before treatment. Repeat regions from strains obtained from patients that showed clinical manifestations consistent with tertiary stage disease were compared to strains obtained from early stage patients drawn from the same population. The mean repeat size in the tertiary stage strains was significantly less than that found in the early stage strains (9.923 vs. 14.137 repeat units, Wilcoxon rank sum test, P=0.009). This difference in means is primarily explained by an abundance of tertiary strains with a very low number of repeated motifs in addition to the standard 14-repeat alleles (Fig. 3c). A single tertiary strain containing 17 repeats was also found. There was a significant difference in the variance of the two samples; tertiary strains had approximately five times the variance of primary strains (Fligner–Killeen test for homogeneity of variances, P=0.005).


Sequence data suggest positive selection acting upon venereal strains

In this study, the repeat region of the arp gene was analyzed in all available laboratory strains of T. pallidum ssp. pertenue and endemicum, laboratory and clinical strains of T. pallidum ssp. pallidum, the simian strain of T. pallidum, and two laboratory strains of T. paraluiscuniculi. Nucleotide substitutions in the T. pallidum genome as a whole appear to be quite rare and for the most part concentrated in regions such as the large, gene-conversion prone tpr gene family (Centurion-Lara et al., 2004; Gray et al., 2006; Harper et al., 2008). For example, comparative studies of the entire tpp15 (429 bp), gpd (1068 bp) and tpF-1 (531 bp) genes have yielded no substitutions, one synonymous substitution, and one nonsynonymous substitution, respectively, between the T. pallidum strains examined (Noordhoek et al., 1990; Centurion-Lara et al., 1997; Cameron et al., 1999). However, in this study we show that substitutions are frequent in the repeat region of the arp gene. The presence of a high number of substitutions, in close proximity, which are all nonsynonymous, as well as the high frequency of radical amino acid changes in the arp repeat motifs, suggest that positive selection is operating on this gene in T. pallidum ssp. pallidum and T. paraluiscuniculi. The fact that the most common Arp repeat motif has been shown to be a target of the host immune response (Liu et al., 2007) adds to the case for positive selection.

Sequence evidence based on arp polymorphism suggests that a long repeat region filled with a diverse repertoire of repeat motifs is an important indication of venereal infection by Treponema. The sexually transmitted T. pallidum ssp. pallidum and T. paraluiscuniculi strains contained multiple types of repeat motifs, while the three nonvenereal subspecies (pertenue, endemicum and the simian strain) contained only one type of motif. Further, all ssp. pallidum strains studied here contained at least three types of repeat motifs, even the strain with only four repeat units (Fig. 2). Thus, the presence of multiple Arp repeat motifs is correlated with transmission mode and could play some direct role in sexual transmission and pathogen persistence. A phylogeny of the treponemes indicates that the nonvenereal subspecies arose first, followed by ssp. pallidum, with T. paraluiscuniculi acting as an outgroup (Harper et al., 2008). Thus, it appears that a diverse repertoire of motifs evolved twice, being present in the venereal outgroup, T. paraluiscuniculi, absent in the nonvenereal T. pallidum subspecies that arose first in the T. pallidum family, and present again in the venereal ssp. pallidum, which diverged most recently. The evidence for positive selection in ssp. pallidum only, and not in the nonvenereal subspecies, raises the possibility that a recent expansion of repeat motifs may have been involved in a novel, sexual transmission strategy. This scenario is consistent with several of the prominent theories concerning ssp. pallidum's origin, including the hypotheses that the pathogen that causes syphilis arose with the return of Columbus from the New World (Baker & Armelagos, 1988) or the rise of urban living and increased clothing (Cockburn, 1961).

Characterization of size variation in arp

What function might arp repeat length serve? Repeat size frequencies in nonvenereal lab strains were consistent with neutral evolution, i.e. with a normal distribution, suggesting that size may not play an important role in the survival of these subspecies. However, 67% of the ssp. pallidum strains we examined in this study contained a repeat region with 14 repeat units. The high frequency of this type of repeat could be explained by strong stabilizing selection for repeats containing this number of units. Alternatively, it is possible that too little evolutionary time has elapsed since the emergence of syphilis-causing strains to permit much variation from an original arp gene containing a 14-unit repeat. However, the rapid rate of evolution in tandem repeats such as those in the arp gene, and the odds against long tandem repeats persisting in the absence of strong selection (Achaz et al., 2002), argue for strong selection for the 14-repeat variant. In addition, this variant is larger than any found in ssp. pertenue or endemicum strains. As in the α C protein of group B streptococci, there may initially be selection for a large repeat region, perhaps for binding or some activity involved in transmission, or because such a region is less immunogenic (Gravekamp et al., 1996, 1997). However, greater than expected frequencies of very small and very large repeat sizes were observed in clinical strains of ssp. pallidum (Fig. 3c). This suggests that such repeats may also be favored by selection in certain situations, perhaps through escape from an immune response primed to the most common 14-repeat variant, as observed in the α C protein (Madoff et al., 1996). The high frequency of small repeat regions in tertiary vs. early stage strains provides further support for this hypothesis. Similarly, laboratory strains passaged from rabbit to rabbit at peak infection levels do not appear to yield different-sized variants of the arp gene (Pillay et al., 1998), at least not frequently enough to be observed, suggesting that the length of infection may be important in selection for repeat variants.
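
The size-frequency reasoning above can be illustrated with a minimal Python sketch. It is not the analysis performed in this study: the function name summarize_repeat_sizes and the example counts are hypothetical, chosen only to show how the modal repeat size and its share of strains might be tabulated from a list of repeat-unit counts.

```python
from collections import Counter


def summarize_repeat_sizes(unit_counts):
    """Return (modal size, share of strains with that size, share per size).

    unit_counts: one repeat-unit count per strain.
    Ties for the modal size are broken in favor of the smaller size.
    """
    if not unit_counts:
        raise ValueError("need at least one strain")
    tally = Counter(unit_counts)
    total = len(unit_counts)
    # Highest count wins; among equal counts, the smaller size wins.
    modal_size, modal_n = max(tally.items(), key=lambda kv: (kv[1], -kv[0]))
    shares = {size: n / total for size, n in sorted(tally.items())}
    return modal_size, modal_n / total, shares


# Hypothetical repeat-unit counts, one per strain (NOT data from this study).
example_counts = [14, 14, 14, 14, 12, 10, 16, 4]
modal_size, modal_share, shares = summarize_repeat_sizes(example_counts)
print(modal_size, modal_share)
print(shares)
```

A distribution dominated by one size, with additional strains at the extremes, is the kind of pattern the text contrasts with the normal distribution seen in the nonvenereal laboratory strains; a formal test of that contrast would require the actual strain counts.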

Implications of the data and future research directions

The arp region is similar in some respects to the clustered, regularly interspaced, short palindromic repeats (CRISPRs) used for epidemiological and taxonomic purposes in other pathogenic bacteria (Jansen et al., 2002). As in the CRISPRs of Mycobacterium tuberculosis and Streptococcus pyogenes, the repeat units are clustered within a single locus and the number and sequence of repeats differ from strain to strain. However, unlike CRISPRs, the repeat units in the arp gene occur in tandem, rather than being separated by nonrepeat regions. As with CRISPRs, the high level of variation within the arp gene makes it a valuable tool for subtyping. Using the method described in this paper, amplification and analysis of only one gene is necessary to distinguish between the venereal and nonvenereal T. pallidum subspecies. Moreover, the arp gene is already part of the only available ssp. pallidum typing system, making it a convenient test for diagnostic laboratories.
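
As a rough illustration of the motif-based distinction discussed above (multiple repeat motif types in the venereal strains, a single type in the nonvenereal ones), the following Python sketch counts distinct motif types in a tandem repeat region. It is a simplified sketch under stated assumptions, not the typing protocol used in this study: the function names, the fixed unit length passed as a parameter, the exact-match comparison of units, and the toy sequences are all hypothetical.

```python
def split_repeat_units(region, unit_len):
    """Split a tandem repeat region into consecutive units of unit_len bases."""
    if unit_len <= 0 or len(region) % unit_len != 0:
        raise ValueError("region length must be a positive multiple of unit_len")
    return [region[i:i + unit_len] for i in range(0, len(region), unit_len)]


def motif_profile(region, unit_len):
    """Count repeat units and distinct motif types (exact sequence matches)."""
    units = split_repeat_units(region.upper(), unit_len)
    n_types = len(set(units))
    return {
        "n_units": len(units),
        "n_motif_types": n_types,
        # Assumption for illustration: more than one motif type is flagged.
        "multiple_motifs": n_types > 1,
    }


# Toy sequences with a 6-base unit (hypothetical, NOT real arp repeats).
single_motif_region = "ACGTGA" * 5
mixed_motif_region = "ACGTGA" + "ACGTGC" + "ACTTGA" + "ACGTGA"

single_profile = motif_profile(single_motif_region, 6)
mixed_profile = motif_profile(mixed_motif_region, 6)
print(single_profile)
print(mixed_profile)
```

In practice the repeat region would first have to be amplified, sequenced, and delimited, and the unit length set to that of the actual arp repeat; the sketch only shows the final counting step.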

This study, like others on the T. pallidum subspecies, is limited by the number of nonvenereal strains available for study. Although we examined more strains than previous articles on subspecies typing systems and included all laboratory strains currently available, the use of the arp system to differentiate between subspecies will be strengthened if additional strains can be collected and examined in the future. In addition, it will be important to sequence more ssp. pallidum strains gathered outside of the United States, to confirm the pattern of variation described here and to determine whether additional repeat motifs exist. Similarly, our analysis of the repeat regions in early vs. late stage syphilis rested on the study of a relatively small number of tertiary syphilis cases diagnosed by serology and symptoms, all of which came from South Africa. Testing additional strains from late stage syphilis infections, especially those collected in other regions, will clarify the pattern of arp polymorphism generated during long infections. Sequencing these strains will facilitate comparison of repeat region composition between strains gathered from early and late syphilis infections. If repeat region composition, like repeat region size, exhibits greater variation in strains from longstanding infections, this will lend further support to the hypothesis that the immune system is driving evolution in the arp gene. Finally, ssp. pallidum strains collected from late-stage infections that contained as few as two or three repeat units were described in the studies analyzed in this article (Fig. 3c). It will be important to determine whether even these ssp. pallidum strains, with very small repeat regions, contain multiple types of repeat motifs.

In the past, minisatellites have been associated with virulence genes in bacteria. The Arp's high immunogenicity, evidence for positive selection consistent with pressure from the immune system, hypothesized membrane localization and binding ability, and richness in radical substitutions point toward an important role for this protein as well. If this protein plays a functional role in venereal transmission, then using the gene that encodes it to differentiate between venereal and nonvenereal subspecies should be an extremely reliable method. Treponema pallidum is a genetically intractable organism that cannot be cultured. Thus, classical testing of the importance of a protein in pathogenesis is not possible in this system. Nevertheless, this study suggests several avenues of future research that may prove useful. The hypothesis that small or large Arp size variants are able to escape a host immune response primed to the standard ssp. pallidum 14-repeat variant can be tested in the laboratory. Similarly, the ability of the Arp to bind fibronectin, an event believed to be important in T. pallidum pathogenesis (Steiner & Sell, 1985; Thomas et al., 1985), can be investigated. Finally, given the demonstrated immunogenic properties of the Arp repeat motifs (Liu et al., 2007), it is possible that the difference in motif content between ssp. pallidum and ssp. pertenue and endemicum may be translated into a long-sought serological test to distinguish between venereal and nonvenereal infections. Perhaps monoclonal antibodies can be produced against the specific repeat peptides, which would allow differentiation of syphilis from yaws and bejel immunologically.


We would like to thank James Thomas and Sonia Altizer for providing laboratory space; Lance Waller and Todd Schlenke for their helpful comments and suggestions; Adrian Gonzalez for assistance with sequencing; and Sheila Lukehart and Leo Schouls for providing strains used in this study. This work was supported by dissertation improvement grants to K.N.H. from the National Science Foundation (award number 0622399) and the Wenner-Gren Foundation. K.N.H. was a Howard Hughes Medical Institute predoctoral fellow at the time of this study. The Arp protein is covered by US patent No. 7 005 270, EU patent No. 1240519, and multiple international patents. The findings and conclusions in this paper are those of the authors and do not necessarily represent the views of the United States Centers for Disease Control and Prevention.


  • Editor: Kai Man Kam

