Abstract
Concatenation, or joining together, of strings or other values, possibly with extra punctuation such as spaces, is supported in Stata by addition of strings and by the
1 The problem stated
In this column, I address a bundle of related questions that arises most often in a threefold context. First, people or other individuals (firms, places, etc.) each have an identifier. Second, each of those individuals is observed over time. Third, and in particular, categories are observed over time. So, imagine a toy dataset like this:
The jargon of panel or longitudinal data will be familiar to many readers, and this is one such dataset.
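A toy dataset of this kind could be entered as follows. This is only a sketch: the variable names id, time, and state and the values themselves are illustrative assumptions, chosen so that each identifier has three observations.

```stata
* Hypothetical toy panel: id, time, and state are assumed names
clear
input id time str1 state
1 1 "A"
1 2 "B"
1 3 "C"
2 1 "A"
2 2 "A"
2 3 "B"
end

* Show the data with a separator line between panels
list, sepby(id) noobs
```

With these values, the history for identifier 1 would read ABC and that for identifier 2 would read AAB.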
An appealing step is to concatenate each history of category values. We can imagine joining the strings together, thinking that identifier 1 has a profile or history
So, how do we do that easily in Stata? And what related problems arise, and how are they tackled? All that is the focus of this column.
Before we get down to business, here are some broad-brush comments about what is and what is not included. Identifiers can be string or numeric, unlike what is compulsory with
Further, it should be clear that the example is deliberately short and simple. Serious examples could be much longer. Yet further, the entire string or history may not be of most interest, but for example, whether and which individuals include spells such as
What I am not imagining is that it is especially interesting or useful to concatenate, say,
There are several similarities of spirit and some of substance between the ideas in this column and the much more elaborate project on sequence analysis reported by Kohler and Brzinsky-Fay (2005) and Brzinsky-Fay, Kohler, and Luniak (2006). It seems simpler to explain the few basic (and unoriginal) ideas here directly and independently. Furthermore, there are overlaps with work on spells or runs (Cox 2007, 2015).
2 What’s in a name?
The Latin root here underlying “concatenation” is
In computing, it is quite common to talk of concatenation of strings or, more generally, arrays, which can be seen to mean chaining (together). Here “together”, and similarly the con- syllable, is at best emphatic and at worst redundant. Hence, “catenation” would have exactly the same meaning, but it appears much less popular, with notable exceptions in accounts of APL and related programming languages (Iverson 1962, 1977; Brown, Pakin, and Polivka 1988; Thomson and Polivka 1995).
Similarly, Unix users are likely to be aware of the command
3 Concatenation across variables
Concatenation across variables has been supported for a long time in Stata. If
concatenates, meaning that the operator
or something similar with commas or other punctuation.
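As a minimal sketch of concatenation across variables, assuming two hypothetical string variables named first and second, addition of strings can be compared with the official egen function concat() and its punct() option:

```stata
* Hypothetical data: first and second are assumed names
clear
input str8 first str8 second
"alpha" "beta"
"gamma" "delta"
end

* Addition of strings concatenates
generate both = first + second

* Punctuation, here a space, is added as a literal string
generate spaced = first + " " + second

* egen's concat() function with punct() gives the same result
egen viaegen = concat(first second), punct(" ")

list, noobs
```

Addition is usually enough for two or three variables; concat() becomes more convenient as the variable list grows, and it can also handle numeric variables.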
The
Mentioning that might create an expectation that a new
4 Concatenation across observations
Let me underline that concatenation of string values across observations has long been possible in Stata. The point of this column is just to discuss how best to do it. Thus, in the toy dataset of section 1,
yields
yields
If you did not know that, then you have now heard about what is likely to be the most valuable tip in all of this column for your future use of Stata. Taking that more slowly (see also Cox [2002]), we note that
ensures sorting of the dataset first by identifier and then by the time variable and specifies that what is to follow is to be carried out separately within the groups of observations defined by the distinct values of
adds the first three values (in our toy dataset, that is all of them) of the variable
That is good, but going further in the same direction does not appeal. In the expression needed for the history, adding 3 terms is bearable, but adding 30 or 300 terms, or many more, will not be. Moreover, the recipe is not general if the number of observations in each panel is not the same. It so happens that a reference to a nonexistent value, say,
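The fixed-subscript approach can be sketched as follows, assuming hypothetical variables id, time, and state and exactly three observations in each panel. Under by:, subscripts are interpreted within each group, and a reference to an observation that does not exist returns missing, which for a string variable is the empty string.

```stata
* Hypothetical toy panel: id, time, and state are assumed names
clear
input id time str1 state
1 1 "A"
1 2 "B"
1 3 "C"
2 1 "A"
2 2 "A"
2 3 "B"
end

* Sort by id and then time; work separately within each id.
* state[1], state[2], state[3] are the first three values in each panel.
bysort id (time): generate history3 = state[1] + state[2] + state[3]

list, sepby(id) noobs
```

The expression has to name every position explicitly, which is why it does not scale to long or unequal panels.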
There is a more general recipe without too much associated pain. The procedure includes three steps:
Initialize a new variable:
Add new values in a cascade so that, within each panel, each value of the new variable is its value in the previous observation plus the current observation’s value of the variable being concatenated:
Consistently copy the total string from the last observation in each panel to all observations in the same panel:
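The three steps can be sketched as follows. The variable names id, time, state, and history are assumptions for illustration, as are the data values.

```stata
* Hypothetical toy panel: id, time, and state are assumed names
clear
input id time str1 state
1 1 "A"
1 2 "B"
1 3 "C"
2 1 "A"
2 2 "A"
2 3 "B"
end

* Step 1: initialize a new string variable as empty
generate history = ""

* Step 2: cascade within panels. replace works through observations
* in order, so history[_n-1] is already updated when it is used.
* For the first observation in a panel, history[_n-1] refers to a
* nonexistent observation and is therefore empty.
bysort id (time): replace history = history[_n-1] + state

* Step 3: copy the complete string from the last observation
* in each panel to every observation in that panel
by id: replace history = history[_N]

list, sepby(id) noobs
```

Omitting step 3 would leave each observation holding the history so far rather than the complete history. Nothing in the code depends on the number of observations in each panel.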
Key here is, once again, that under
The cascade just mentioned works like this. The qualifier
which means, as we could type if we so wished,
Then, in turn for the third observation, again spelling out all the details,
And so on: An excellent fact is that this looping is automated for us, and Stata takes care of any awkward details such as differing numbers of observations in the various panels.
The process ends with tidying up so that all observations in a panel contain the same values for
A helpful detail about this approach is that punctuation between strings need be added only once. So, if we had reason to add spaces between strings, the only tweak needed is to
and evidently commas or other punctuation can be added just as easily.
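One way to add a separator is sketched here, again with assumed variable names. The use of cond() to avoid a leading space in the first observation of each panel is a choice made for this sketch; trimming the result afterward would be an alternative.

```stata
* Hypothetical toy panel: id, time, and state are assumed names
clear
input id time str1 state
1 1 "A"
1 2 "B"
1 3 "C"
2 1 "A"
2 2 "A"
2 3 "B"
end

generate history = ""

* The separator appears once, in the cascade step only.
* The first observation in each panel takes state without a separator.
bysort id (time): replace history = cond(_n == 1, state, history[_n-1] + " " + state)

by id: replace history = history[_N]

list, sepby(id) noobs
```

Replacing " " with ", " or any other string changes the punctuation without any other change to the code.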
Now, let us see what needs to be varied if the variable of interest were not string but numeric. In practice, this would usually mean integers, often small integers. The variation needed is just conversion to string on the fly. Let’s give the code in one, but after mapping
When the numeric values are just single-digit integers 0 to 9, then no ambiguity can arise, and people might often prefer a compact concatenation:
Let us spell out that spaces or other punctuation should typically be used whenever ambiguities might arise in interpreting a string correctly, as when 1, 0, and 10 are all possible values. It may also be a good idea for readability, as when a person’s moves from a place are coded by
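For a numeric variable, conversion on the fly can be sketched with the string() function. The variable name code and its values are assumptions, chosen so that 1, 0, and 10 all occur and the compact form is visibly ambiguous.

```stata
* Hypothetical numeric panel: id, time, and code are assumed names
clear
input id time code
1 1 1
1 2 10
1 3 0
2 1 10
2 2 1
2 3 1
end

* Compact concatenation: safe only if every value is a single digit
generate compact = ""
bysort id (time): replace compact = compact[_n-1] + string(code)
by id: replace compact = compact[_N]

* Spaced concatenation: unambiguous whatever the values
generate spaced = ""
by id: replace spaced = cond(_n == 1, string(code), spaced[_n-1] + " " + string(code))
by id: replace spaced = spaced[_N]

list, sepby(id) noobs
```

Here the compact strings are 1100 and 1011, which cannot be decoded reliably, whereas the spaced strings 1 10 0 and 10 1 1 can.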
Let’s now consider some variations on the theme so far. The underlying message is that, given the main ideas, these are easy to solve with simple modifications of the code.
5 Tagging each panel just once
Panel datasets imply two scales: the small scale of individual observations and the larger scale of the panels themselves. In the code discussed here, each panel’s history is repeated in all observations of that panel (unless, as mentioned briefly, code shows the history so far). Often, we wish, say, to count, tabulate, or graph results for panels, not their constituent observations. The standard trick here is to use the
creates an indicator variable that is 1 for one observation in each panel and 0 for the others so that doing anything
Because groups could generally be as small as single observations, there are only two systematic ways to tag just one observation: to tag the first observation or to tag the last; in groups of one, that is the same observation. For your information,
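Tagging can be sketched with the egen function tag(), which marks just one observation in each group. The variable names id, time, state, history, and tag are assumptions for illustration.

```stata
* Hypothetical toy panel: id, time, and state are assumed names
clear
input id time str1 state
1 1 "A"
1 2 "B"
1 3 "C"
2 1 "A"
2 2 "A"
2 3 "B"
end

* Build the history as a string repeated within each panel
generate history = ""
bysort id (time): replace history = history[_n-1] + state
by id: replace history = history[_N]

* Indicator that is 1 for one observation per panel and 0 otherwise
egen tag = tag(id)

* Results for panels, not for their constituent observations
tabulate history if tag
list id history if tag, noobs
```

Without the if tag qualifier, each history would be counted once for every observation in its panel.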
6 Using this idea
That’s the main idea, and speaking proverbially, all is easier than rocket science or brain surgery. Uses of the idea are manifold and include searches for particular patterns, say, using the string function
One method for counting substrings within strings was discussed in Cox (2011). The idea is to compare the lengths of strings with those of substrings removed. Thus, the difference
is necessarily the number of occurrences of
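Both uses can be sketched with the official string functions strpos(), subinstr(), and length(). The variable names and data are assumptions. For a substring longer than one character, the difference in lengths must be divided by the length of the substring, and the count is of nonoverlapping occurrences.

```stata
* Hypothetical toy panel: id, time, and state are assumed names
clear
input id time str1 state
1 1 "A"
1 2 "B"
1 3 "C"
2 1 "A"
2 2 "A"
2 3 "B"
3 1 "C"
3 2 "B"
3 3 "A"
end

generate history = ""
bysort id (time): replace history = history[_n-1] + state
by id: replace history = history[_N]

* Search: strpos() returns 0 if the pattern does not occur
generate hasAB = strpos(history, "AB") > 0

* Count a single character: length lost on removing every "A"
generate nA = length(history) - length(subinstr(history, "A", "", .))

* Count a longer substring: divide by the length of the substring
generate nAB = (length(history) - length(subinstr(history, "AB", "", .))) / length("AB")

list id history hasAB nA nAB if time == 1, noobs
```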
7 Conclusion
Standard advice within the Stata community is that panel or longitudinal data are almost always better analyzed with a long layout (structure or format, if you insist) in which each individual panel is represented by one or more observations. This column introduces a simple method that subverts that principle in a sometimes useful way by concatenating strings or other values in a series of observations into single strings. The machinery needed is just simple looping set up within the framework of
