Wikipedia Text Segmentation

Contributors: Marko Martin, Torsten Zesch, Nicolai Erbs, Iryna Gurevych


Natural-text corpora are difficult to generate because reliable gold standard segment boundaries are rarely available for a large collection of texts. Creating a gold standard manually would be too laborious and time-consuming for this thesis. We therefore extracted a corpus from articles of the English Wikipedia and took their division into sections as the gold standard for segments.

For corpus generation, we extracted the top-level sections of featured articles and concatenated their textual contents into a plain-text corpus file. The content of a section is the concatenation of the text of its paragraph elements and the content of the sections it contains. In particular, other elements such as tables and image captions are ignored when generating the text for a section, because text segmentation is meant to be applied to prose and not to isolated pieces of information such as table fields. Furthermore, sections titled “See also”, “References”, or “External links” are skipped, as they do not contain prose for which segmentation is meaningful.
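The extraction rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the corpus: the `Paragraph`, `Table`, and `Section` classes are hypothetical stand-ins for whatever a Wikipedia parser returns, the pieces are joined with a single space, and the title filter is applied at every nesting level.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Section titles that are skipped, as listed in the text above.
SKIPPED_TITLES = {"See also", "References", "External links"}


# Hypothetical, simplified article model (not a real parser API).
@dataclass
class Paragraph:
    text: str


@dataclass
class Table:
    cells: List[str]


@dataclass
class Section:
    title: str
    children: List[Union[Paragraph, Table, "Section"]] = field(default_factory=list)


def section_text(section: Section) -> str:
    """Concatenate paragraph text and nested section content, in document order."""
    parts = []
    for child in section.children:
        if isinstance(child, Paragraph):
            parts.append(child.text.strip())
        elif isinstance(child, Section) and child.title not in SKIPPED_TITLES:
            # Assumption: the title filter also applies to nested sections.
            parts.append(section_text(child))
        # Any other element (tables, captions, ...) is ignored.
    # Assumption: pieces are joined with a single space, never a line break.
    return " ".join(part for part in parts if part)


def article_segments(top_level_sections: List[Section]) -> List[str]:
    """Return one text segment per retained, non-empty top-level section."""
    segments = []
    for section in top_level_sections:
        if section.title in SKIPPED_TITLES:
            continue
        text = section_text(section)
        if text:
            segments.append(text)
    return segments


if __name__ == "__main__":
    article = [
        Section("History", [
            Paragraph("Founded in 1900."),
            Table(["ignored", "cells"]),
            Section("Early years", [Paragraph("It grew quickly.")]),
        ]),
        Section("References", [Paragraph("Some citation.")]),
    ]
    print(article_segments(article))
```

Each returned string corresponds to one gold standard segment; sections that end up with no prose at all are dropped here so that they do not produce empty segments, which is likewise an assumption of this sketch.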


Text and segment boundaries are stored in two separate files. The text file simply contains the sentences of the text without any line breaks, because line breaks would give the segmenter information about sentence boundaries that would not be available in real systems. The gold standard file contains one line per segment boundary, holding the character offset of that boundary. For example, for a boundary after the 200th character of the text file, the gold standard file would contain a line with the number “200”. The numbers refer exactly to the positions of the boundaries in the text file. In particular, if the text file uses Windows-style line endings, i.e., a carriage return followed by a line feed at each line end, every line break is counted as two characters. However, a character that takes two bytes in the text file, e.g., in UTF-16 encoding, is still counted as one single character.
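The offset convention can be illustrated with a short Python sketch. All function names here are hypothetical, and two points are assumptions rather than statements from the text: segments are concatenated without any extra separator, and the end of the text is not listed as a boundary. Python's `len` counts Unicode code points rather than bytes, which matches the rule that a two-byte character counts once, and opening the file with `newline=""` keeps a carriage return plus line feed as two characters.

```python
from typing import List


def boundaries_from_segments(segments: List[str]) -> List[int]:
    """Character offsets of the boundaries between consecutive segments.

    Assumption: segments are concatenated as-is and the end of the text
    is not recorded as a boundary.
    """
    offsets = []
    position = 0
    for segment in segments[:-1]:
        position += len(segment)  # counts characters, not bytes
        offsets.append(position)
    return offsets


def format_gold_standard(offsets: List[int]) -> str:
    """One line per boundary, each holding a single character offset."""
    return "".join(f"{offset}\n" for offset in offsets)


def parse_gold_standard(content: str) -> List[int]:
    """Read the offsets back from the gold standard file content."""
    return [int(line) for line in content.splitlines() if line.strip()]


def read_text_file(path: str, encoding: str = "utf-8") -> str:
    """Read the text file without translating line endings.

    newline="" leaves "\r\n" untouched, so it counts as two characters.
    """
    with open(path, encoding=encoding, newline="") as handle:
        return handle.read()


def split_at_boundaries(text: str, offsets: List[int]) -> List[str]:
    """Cut the text into segments at the given character offsets."""
    segments = []
    start = 0
    for offset in offsets:
        segments.append(text[start:offset])
        start = offset
    segments.append(text[start:])
    return segments


if __name__ == "__main__":
    parts = ["First part. ", "Second part.\r\n", "Third."]
    offsets = boundaries_from_segments(parts)
    gold = format_gold_standard(offsets)
    restored = split_at_boundaries("".join(parts), parse_gold_standard(gold))
    print(offsets, restored == parts)
```

One caveat of this sketch: for characters outside the Basic Multilingual Plane, Python's code-point count differs from a count of UTF-16 code units, so offsets produced this way may not agree with tools that count UTF-16 units.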

