srcML: A document-oriented XML representation of source code

srcML

srcML is a combination of source code (text) and selective AST information (tags) in a single XML document:

<unit xmlns="https://sdml.info/srcML/src" xmlns:cpp="https://sdml.info/srcML/cpp" language="C++" filename="ex.cpp">
// copy the input to the output
while (std::cin >> n)
  std::cout << n << '\n';

The focus is on constructing a document representation in XML rather than a more traditional data representation of the source code. Representing source code as semi-structured text supports a programmer-centric rather than a compiler-centric view, providing full access to the source code at the lexical, documentary (e.g., comments, white space), structural (e.g., classes, functions), and syntactic (e.g., statement) levels.

srcML Toolkit

srcML Downloads

The srcML toolkit consists of two command-line programs: src2srcml, a translator from source code to srcML, and srcml2src, a translator from srcML back to source code. srcml2src also supports direct queries and transformations of srcML archives. Both programs are actively developed, currently support C, C++, and Java, and are released under a GPL license.

Please send any questions or bug reports to Michael Collard.

Note: We are compiling a list of projects that use srcML. If you have used, or are currently using, srcML, please let us know by emailing a description to Jonathan Maletic and Michael Collard. Your use can be for practical software development or for research. We are especially interested in students who are using it for their thesis or dissertation work.

srcML Features

Preservation of all source-code text, e.g., comments, formatting (white space), and preprocessor directives, in the original document order. This allows full access to the source code at the lexical and documentary levels, with an equivalent forward and reverse mapping between source code and srcML. These elements are identified for further processing by development environments and program-comprehension tools.
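Because every character of the original file survives as text content, the reverse mapping can be sketched as concatenating the text nodes of a srcML document in document order. The Python sketch below is only an illustration under stated assumptions: the srcML fragment is hand-written and abbreviated (it is not the exact output of src2srcml), and the helper name srcml_to_source is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified srcML for illustration only (an assumption);
# real src2srcml output marks up more of the syntax than this.
SRCML = (
    '<unit xmlns="https://sdml.info/srcML/src" language="C++" filename="ex.cpp">'
    '<comment>// copy the input to the output</comment>\n'
    '<while>while <condition>(std::cin &gt;&gt; n)</condition>\n'
    "  std::cout &lt;&lt; n &lt;&lt; '\\n';</while>\n"
    '</unit>'
)


def srcml_to_source(srcml_text: str) -> str:
    """Recover the source text by joining all text nodes in document order."""
    root = ET.fromstring(srcml_text)
    return "".join(root.itertext())


if __name__ == "__main__":
    # Prints the original C++ text, including the comment and white space.
    print(srcml_to_source(SRCML), end="")
```

In practice srcml2src performs this translation; the sketch only shows why no information is lost when the markup is removed.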

Tags for comments, preprocessor directives, statements, and other syntax allow source code to be accessed through XML at the documentary, structural, and syntactic levels. These levels can be addressed using XPath, e.g., /unit/while/condition. Round-trip transformation (i.e., source code to srcML to source code) can use XML transformation languages and tools.
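As a rough illustration of the XPath address /unit/while/condition, the following Python sketch uses the limited XPath support in the standard xml.etree.ElementTree module. It assumes a hand-written, abbreviated srcML fragment rather than real src2srcml output, and the prefix src and the function name while_conditions are choices made for this example. Because srcML elements live in an XML namespace, the path has to be written with a prefix bound to that namespace.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this example; the URI is the one declared on <unit>.
NS = {"src": "https://sdml.info/srcML/src"}

# Hand-written, simplified srcML for illustration only (an assumption).
SRCML = (
    '<unit xmlns="https://sdml.info/srcML/src" language="C++" filename="ex.cpp">'
    '<comment>// copy the input to the output</comment>\n'
    '<while>while <condition>(std::cin &gt;&gt; n)</condition>\n'
    "  std::cout &lt;&lt; n &lt;&lt; '\\n';</while>\n"
    '</unit>'
)


def while_conditions(srcml_text: str) -> list[str]:
    """Return the source text of every condition of a top-level while statement."""
    root = ET.fromstring(srcml_text)  # root is the <unit> element
    # ElementTree paths are relative to the root, so this path plays the
    # role of /unit/while/condition in full XPath.
    matches = root.findall("src:while/src:condition", NS)
    return ["".join(m.itertext()) for m in matches]


if __name__ == "__main__":
    for text in while_conditions(SRCML):
        print(text)
```

A full XPath or XQuery processor would accept the absolute path directly; ElementTree is used here only because it needs no extra installation.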

Opportunistic use of XML technologies: addressing with XPath, querying with XPath and XQuery, transformation with DOM, SAX, JDOM, XOM, TextReader, XSLT, and STX, and validation with schema languages DTD and RelaxNG. The srcML format is not tied to any specific XML technology and should be compatible with any XML tools and standards developed in the future.
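To make the event-based style of processing concrete, here is a minimal sketch that streams a srcML document through Python's standard xml.sax module and counts how often each element occurs. The srcML fragment is hand-written and abbreviated (an assumption, not real src2srcml output), and the names ElementCounter and count_elements are hypothetical.

```python
import xml.sax
from collections import Counter

# Hand-written, simplified srcML for illustration only (an assumption).
SRCML = (
    b'<unit xmlns="https://sdml.info/srcML/src" language="C++" filename="ex.cpp">'
    b'<comment>// copy the input to the output</comment>\n'
    b'<while>while <condition>(std::cin &gt;&gt; n)</condition>\n'
    b"  std::cout &lt;&lt; n &lt;&lt; '\\n';</while>\n"
    b'</unit>'
)


class ElementCounter(xml.sax.ContentHandler):
    """SAX handler that tallies start-element events by element name."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def startElement(self, name, attrs):
        # Called once per opening tag; no document tree is ever built.
        self.counts[name] += 1


def count_elements(srcml_bytes: bytes) -> Counter:
    """Stream the document through SAX and return per-element counts."""
    handler = ElementCounter()
    xml.sax.parseString(srcml_bytes, handler)
    return handler.counts


if __name__ == "__main__":
    for name, n in sorted(count_elements(SRCML).items()):
        print(name, n)
```

Because the handler keeps only a counter, memory use does not grow with the document, which is the property that makes event interfaces attractive for very large srcML files.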

Representation and toolkit robust to source-code irregularities, e.g., uncompilable code, code fragments, single statements, and single files, with the representation based on local document information only, i.e., no symbol table is used. The parser is based on the concept of island grammars for robustness. Complete handling of encoding issues (e.g., ISO-8859-1, UTF-8).

Scalable storage and translation with reasonable file sizes, typically less than 4 times the size of the corresponding text file. The source-code-to-srcML translator (src2srcml) is a stream parser that supports event interfaces, with a translation speed of over 25 KLOC/sec.

File and directory aware, with metadata at the file level, i.e., language, file location, and version information.

srcML Archive: multiple source-code files can be stored in one srcML file, e.g., storing the entire Linux kernel in a single srcML file. The toolkit fully supports the archive format.
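The idea of an archive can be sketched as an outer unit element that wraps one unit element per file, each carrying its own file-level metadata. The layout below is an assumption made for illustration, as are the file names a.c and b.c and the function name files_in_archive; consult the toolkit documentation for the exact archive format.

```python
import xml.etree.ElementTree as ET

NS = {"src": "https://sdml.info/srcML/src"}

# Hand-written archive for illustration only (an assumption): an outer
# <unit> that wraps one <unit> per source file.
ARCHIVE = (
    '<unit xmlns="https://sdml.info/srcML/src">\n'
    '<unit language="C" filename="a.c">int a;\n</unit>\n'
    '<unit language="C" filename="b.c">int b;\n</unit>\n'
    '</unit>'
)


def files_in_archive(archive_text: str) -> dict[str, str]:
    """Map each stored filename to the source text recovered from its unit."""
    root = ET.fromstring(archive_text)
    return {
        unit.get("filename"): "".join(unit.itertext())
        for unit in root.findall("src:unit", NS)
    }


if __name__ == "__main__":
    for filename, source in files_in_archive(ARCHIVE).items():
        print(f"{filename}: {len(source)} characters")
```

For an archive the size of an operating-system kernel, a streaming parser would be a better fit than loading the whole document into memory as this sketch does.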

Extensible format by adding attributes on existing elements and extending the element set. XML translation on the srcML format permits further refinement of parsing and markup.