Stata tip 153: Extracting text data from webpages

Abstract

Webpages frequently contain a wealth of text data that are useful to researchers. However, accessing these data can be difficult because webpages are designed for human end users, not for automated use. Because copying and pasting is tedious and often infeasible for large volumes of text, users rely on computer programs to automate the process. Most text extractors (and, more broadly, web scrapers) available today are written in Java, Python, or Ruby (Sirisuriya 2015).¹ Because Stata introduced Python integration in version 16 and Java integration in version 17, users can leverage the text-extraction capabilities of these languages from within Stata. The downside is that this requires familiarity with these programming languages, which many Stata users lack. The purpose of this tip is to illustrate how one can use Stata’s official commands and functions for text extraction. Because webpages are built using text-based markup languages (for example, HTML and XHTML), the procedure involves first reading the contents of the webpages as text files and then parsing the resulting files.

To illustrate, suppose that we want to extract the titles of the threads appearing on the first page of Statalist’s general forum. The URL for accessing these data is https://www.statalist.org/forums/forum/general-stata-discussion/general. Figure 1 shows part of the webpage as it appeared at 12:25 p.m. Central European Time on 2 January 2023. Apart from the titles, the webpage contains the names of the users who started the threads, the dates and times the threads were started, and the numbers of posts and views for each thread, among other pieces of information.

Figure 1. Screenshot of the Statalist general forum

To extract the titles, I use the fileread() function (see [FN] Programming functions) in combination with the commands export delimited and import delimited (see [D] import delimited).²
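The listing for this step is not reproduced in this copy. What follows is a minimal sketch of the sequence described in the text, not necessarily the author's exact code. The scratch file name statalist.txt, the variable name html, and the particular options passed to export delimited and import delimited are assumptions made for illustration.

```stata
* Read the whole page into a single strL value held in one observation
clear
set obs 1
generate strL html = fileread("https://www.statalist.org/forums/forum/general-stata-discussion/general")

* Write the text to disk and read it back so that it is split across lines
* (statalist.txt is a hypothetical scratch file)
export delimited using "statalist.txt", novarnames replace
import delimited using "statalist.txt", clear varnames(nonames) stringcols(_all) encoding("utf-8")

* Combine the pieces created by import delimited into one string variable
egen var = concat(v*)
keep var

* Widen the display format so that long lines are easier to browse
format var %-500s
```

Because the page changes continuously, the number of observations and their contents will differ from run to run.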

After fileread(), the extracted contents are stored entirely in the first observation. The subsequent export delimited and import delimited commands split these contents over several lines, which eases browsing because we need to parse the resulting file to extract the titles. We use the concat() function of egen (see [D] egen) to concatenate all string variables created by import delimited into one string variable named var, and we keep only this variable. The format command (see [D] format) increases the display width of the variable to 500 characters to further enhance readability.

The contents are split over 10,000 observations. To find out where the selected threads shown in figure 1 are located, we can use several string functions available in Stata. These include strpos() and regexm() or their Unicode counterparts, ustrpos() and ustrregexm(), respectively (see [FN] String functions, Cox [2011, 2022], and Koplenig [2018] for an introduction to string functions). For example, we may want to search for the position of the string “bilateral dataset”, which appears in the title of the penultimate thread in figure 1. We do this by listing the observations that contain the string.
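A minimal sketch of such a search follows. To keep it self-contained, it runs on three toy observations standing in for the parsed page; the toy strings are invented for illustration, and strpos() is only one of the functions that would serve here.

```stata
* Toy stand-in for the parsed page (contents are hypothetical)
clear
set obs 3
generate str60 var = ""
replace var = "<div>some markup</div>" in 1
replace var = "Question about a bilateral dataset" in 2
replace var = "<div>more markup</div>" in 3

* List every observation whose text contains the search string, untrimmed
list var if strpos(var, "bilateral dataset") > 0, notrim

* The Unicode regular-expression counterpart selects the same observation
list var if ustrregexm(var, "bilateral dataset"), notrim
```

On the real data, the same list command is applied to the variable var created from the downloaded page.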

The notrim option of list (see [D] list) instructs Stata to suppress string trimming so that we can view the entire string. We observe that this string is at observation number 5,993. Similarly, we can search for the entire title of the last thread shown in figure 1.

This, in turn, is at observation number 6,088. What is apparent from comparing these two observations is that the titles of the threads are between the substrings

and

Identifying such patterns and writing code that extracts strings matching that particular pattern are typically all that is needed to extract some desired piece of information for standardized entries in a webpage. Regular expressions are well suited for this purpose, but in many cases other string functions are equally effective.³ Below, for example, either of the two commands will suffice to extract the titles.
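The two commands are not reproduced in this copy. As a hedged sketch of the idea, the example below extracts titles from toy data in two ways: once with a regular expression and once with strpos() and substr(). The delimiters used here (class="topic-title"> and </a>) and the toy strings are hypothetical; on the real page, one would substitute the substrings identified by inspecting the data.

```stata
* Toy stand-in for the parsed page; markup and titles are hypothetical
clear
set obs 3
generate str100 var = ""
replace var = `"<a href="/t/1" class="topic-title">Merging a bilateral dataset</a>"' in 1
replace var = `"<div class="posts-count">12</div>"' in 2
replace var = `"<a href="/t/2" class="topic-title">Reshape question</a>"' in 3

* Option 1: regular expression; the capture group holds the title
generate title1 = ustrregexs(1) if ustrregexm(var, `"class="topic-title">(.*?)</a>"')

* Option 2: locate the delimiters and take the text between them
local open `"class="topic-title">"'
generate long start = strpos(var, `"`open'"') + strlen(`"`open'"')
generate long stop  = strpos(var, "</a>")
generate title2 = substr(var, start, stop - start) if strpos(var, `"`open'"') > 0

* Keep only the lines that yielded a title
keep if title1 != ""
list title1 title2, notrim
```

Both approaches return the same titles on this toy input. The regular expression is more compact, while the strpos() and substr() version makes each step explicit.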

In total, we have extracted all 52 titles present on the first page of Statalist’s general forum, including the 7 shown in figure 1.

Extending this to extraction of other elements in the webpage, such as numbers of posts in the threads, involves the same procedure of identifying similarities in the strings and writing code that extracts the string pieces based on the identified patterns. Additionally, extending this to multiple webpages is simply a matter of specifying additional URLs. For Statalist’s general forum, the second page adds the suffix “/page2” to the URL, the third page “/page3”, and so on. This makes it easy to define a forvalues loop (see [P] forvalues) when extracting information from multiple consecutive webpages beyond the first.⁴
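A minimal sketch of such a loop follows, assuming that pages 2 and 3 are wanted and that each page is saved to a hypothetical file named statalist_page#.dta; the parsing step is left as a placeholder comment because it repeats the procedure used for the first page.

```stata
* Base URL of the forum; the page suffix is appended inside the loop
local base "https://www.statalist.org/forums/forum/general-stata-discussion/general"

forvalues p = 2/3 {
    clear
    set obs 1
    generate strL html = fileread("`base'/page`p'")

    * fileread() returns a string beginning "fileread() error" if the read fails
    assert strpos(html, "fileread() error") != 1

    * ... split and parse here exactly as for the first page ...

    * Output file name is hypothetical
    save "statalist_page`p'.dta", replace
}
```

The saved files can then be combined with append once each page has been parsed.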

References

Cox, N. J. 2011. Stata tip 98: Counting substrings within strings. Stata Journal 11: 318–320. https://doi.org/10.1177/1536867X1101100212.

Cox, N. J. 2022. Stata tip 148: Searching for words within strings. Stata Journal 22: 998–1003. https://doi.org/10.1177/1536867X221141068.

Koplenig. 2018. Stata tip 129: Efficiently processing textual data with Stata’s new Unicode features. Stata Journal 18: 287–289. https://doi.org/10.1177/1536867X1801800117.

Sirisuriya, D. S. 2015. A comparative study on web scraping. In Proceedings of the 8th International Research Conference, October 7–10, 135–140. Palisades, NY: KDU.