1 The problem: Looking for words
Searching for particular text within strings is a common data management problem. One frequent context arises when several possible answers to a question are bundled together in the values of a string variable. Suppose people are asked which sports they enjoy or, more interestingly, which statistical software they use routinely. To keep matters simple, we will first imagine just lists of one or more numbers that are concise codes for distinct answers, say,
The precise problem discussed in this tip is finding text in strings whenever such text is a word in Stata’s sense, or something close to that. This needs a little explanation.
Here is a tiny sandbox dataset that will be enough to show the problem and some devices that can yield solutions. By way of example, we will focus mainly on a goal of generating indicator variables, sometimes known as dummy variables. For one overview of generating such variables, see Cox and Schechter (2019). We will also touch on the problem of counting instances of a word.
Searching for
Finding such single characters is easy and unproblematic if the possible answers are one character long at most. More generally, searches are easy if there is no ambiguity. Consider
The function
returns 1 if true and 0 if false; see Cox and Schechter (2019) or, more directly, Cox (2005, 2016).
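As a minimal sketch of such an unambiguous search, assume the function in question is strpos() and that every code is a single digit. The dataset and the variable names answers and has3 are hypothetical, not the author's sandbox.

```stata
* Hypothetical sandbox: single-digit codes only, so no ambiguity arises
clear
input str10 answers
"1"
"2 3"
"1 3 5"
"4"
end

* strpos() returns the position of the first match, or 0 if there is none;
* comparing that position with 0 yields a 1/0 indicator
generate byte has3 = strpos(answers, "3") > 0

list
```

Because no code here is longer than one character, a match on "3" can only mean that the answer 3 was given.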
If you look again at the sandbox, you should see what is coming next. Looking for
will still work, fortunately, but looking for
will yield a false positive. The problem is that we want to find
In some Stata contexts, double quotation marks bind together more strongly than spaces separate, so
2 A solution: Looking for spaces too
Let’s carry forward the idea that we need to look for spaces too. At first sight, this is a beautiful idea that just does not work very well because there are too many possibilities to catch. Thus, looking for
But that last idea can be made to work with a simple twist. Congratulations if you thought of this directly!
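One way the twist might look is sketched here, under the assumption that words are separated by single spaces. The data and the variable names naive1 and has1 are hypothetical. The naive search is shown alongside for contrast.

```stata
* Hypothetical sandbox with multi-character codes
clear
input str10 answers
"1"
"10"
"2 10"
"3 1"
"1 2"
"11 2"
end

* Naive search: also flags 10 and 11, which merely contain the character 1
generate byte naive1 = strpos(answers, "1") > 0

* Padded search: add a space at each end on the fly, then look for " 1 "
generate byte has1 = strpos(" " + answers + " ", " 1 ") > 0

list
```

The padded expression exists only while generate is evaluated; the variable answers itself is left unchanged.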
So we solve the problem of initial and following spaces by supplying them on the fly. Note that we do not need to
3 What about other separators?
Suppose our string variable used another separator, say, commas, which could just be a different convention or a good idea anyway if spaces occur naturally. Someone’s favorite sport might be
We could still use a similar idea of looking for
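For comma-separated values, a hedged sketch of the same device pads with commas instead of spaces. The data are hypothetical, and the sketch assumes that no spaces follow the commas.

```stata
* Hypothetical comma-separated codes, with no spaces after the commas
clear
input str20 answers
"1,10"
"10,11"
"2,1,3"
"1"
end

* Pad with a comma at each end, then look for the word with its separators
generate byte has1 = strpos("," + answers + ",", ",1,") > 0

list
```

If stray spaces may follow the commas, they would need to be dealt with first, for example by searching a cleaned copy of the variable.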
4 A solution: What would change if we deleted words?
Here is another solution. This time around, an example comes before the explanation.
We get the same answer, so how did that work?
The function
The function
But how does replacing text help? We do not want to change text; we are just searching for it. Yet, if the result of replacing text by an empty string (deleting it, to put it plainly) would be to reduce the length of the string, then evidently we did find that text.
Notice “would be”. As before, we do not have to
Whether the length of the string is greater than the length of the string with the word removed is a true or false question. Either the first length is greater because there is at least one instance of the word or the two lengths are the same because there is no such instance. If the expression is true, 1 is returned; and if it is false, 0 is returned, giving us an indicator variable.
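The comparison of lengths can be sketched as follows. The sketch assumes the two functions are strlen() and subinword(), which may differ from the author's exact code, and it uses hypothetical data and variable names.

```stata
* Hypothetical sandbox
clear
input str10 answers
"1"
"10"
"2 10"
"3 1"
"1 10 1"
end

* subinword() replaces s2 in s1 only where s2 occurs as a space-separated
* word; here the first such occurrence would be replaced by an empty string.
* If that would shorten the string, the word must be present.
generate byte has1 = strlen(answers) > strlen(subinword(answers, "1", "", 1))

list
```

Nothing is actually deleted from answers: the shortened string is a temporary result used only in the comparison.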
This method is of interest for another reason: you may want to count instances of a word. We could have written
The difference is in the last argument fed to
If the problem is counting instances instead of checking for existence, then the difference in lengths
is precisely the number of times
For more on counting substrings, see Cox (2011b).
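A hedged sketch of counting, again assuming strlen() and subinword() and using hypothetical data: passing missing (.) as the last argument of subinword() asks for all occurrences to be replaced. For a word longer than one character, the difference in lengths is divided by the length of the word.

```stata
* Hypothetical sandbox in which some codes are repeated
clear
input str10 answers
"1"
"10"
"2 10"
"1 10 1"
"10 3 10"
end

* One-character word: the drop in length is the number of occurrences
generate byte n1 = strlen(answers) - strlen(subinword(answers, "1", "", .))

* Longer word: divide the drop in length by the length of the word
generate byte n10 = (strlen(answers) - strlen(subinword(answers, "10", "", .))) / strlen("10")

list
```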
5 Nonnumeric words
Datasets may include one or more nonnumeric words bundled in a string variable. Suppose there was a survey question about which programming languages are routine for Stata users, with possible answers such as one or more of
Handling such nonnumeric words can be both easier and more difficult than handling numeric words. The possibility of ambiguity is less but still present, as witness checking for mentions of
Greater difficulty can arise because of variations in spelling and punctuation, depending sensitively on how such data were entered and collated. Suppose that
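As an illustration only, the padding device carries over to nonnumeric words, and lowercasing a copy on the fly is one way to absorb variations in case. The language names, the variable langs, and the decision to ignore case are all assumptions made for this sketch.

```stata
* Hypothetical answers with inconsistent capitalization
clear
input str30 langs
"Python R"
"c C++"
"Ruby"
"C Mata"
end

* Lowercase on the fly so that "C" and "c" count as the same answer,
* then search for the word together with its surrounding spaces
generate byte hasC = strpos(" " + strlower(langs) + " ", " c ") > 0
generate byte hasR = strpos(" " + strlower(langs) + " ", " r ") > 0

list
```

Searching for the padded word means that C++ is not mistaken for C and Ruby is not mistaken for R. Variations in spelling or punctuation would still need separate handling.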
6 A list of tricks
We have covered two main ideas. First, words are separated by spaces, so look for a word together with its preceding and following spaces, remembering how to catch words at the beginning or the end of a string (sections 2 and 3). Second, if we ask Stata whether and by how much the length of a string would change if we were to delete a word, we can detect whether that word occurs or, if that is what we seek, count its occurrences (section 4).
That is not a complete treatise, even on this small topic. A longer account might mention other possibilities, complications that may arise, or possible solutions.
First, I will mention other problems: I have focused on plain ASCII characters, but searching for Unicode needs more care and different functions. I have mentioned but not fully solved the complication of “words” that include spaces. But the more complicated the string we are searching for, the less likely ambiguity is to bite. I have focused on simple searching of string variables, but string manipulation is needed in other contexts, such as parsing user input if you are writing Stata programs.
Now, I will signal other solutions: Many readers will already know about regular expression syntax. Sometimes, we cannot solve a problem with one command line. We may need to use the
All of these matters deserve detailed treatment, which is left to other accounts.
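To give just a flavor of the regular expression route, here is a minimal sketch, not a recommendation. It assumes Stata 14 or later, where ustrregexm() is available, and uses hypothetical data; \b marks a word boundary.

```stata
* Hypothetical sandbox mixing space and comma separators
clear
input str10 answers
"1"
"10"
"2 10"
"3 1"
"1,2"
end

* ustrregexm() returns 1 if the pattern matches and 0 if it does not;
* \b1\b matches the character 1 only when it stands alone as a word
generate byte has1 = ustrregexm(answers, "\b1\b")

list
```

A word boundary also falls next to punctuation, so this pattern treats the comma in the last observation as a separator.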
7 Acknowledgment
William Lisowski made helpful comments on a draft.
