Modeling Questions: Experiences from the dbGaP Project

Kimberly A. Tryka
Jeff Beck
Matt Mailman


dbGaP (database of Genotype and Phenotype) is a project of the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). The stated purpose of the project is to "archive and distribute the results of studies that have investigated the interaction of genotype and phenotype." [1] To this end, dbGaP is collecting not only data from these studies but also the documents that were part of the studies. The documents are predominantly questionnaires and research protocols. These documents often contain particular areas (sections, paragraphs, questions) that can be associated with a particular phenotype in the database. One goal of the dbGaP project is to create web-based access to the database that will allow users to move easily between the stored data and references to the data in the documents. To achieve this goal, the documents have been marked up in XML using a variation of the NLM-DTD. As with most things, it is the "variations" that are the most interesting. In this case the variations are mostly related to the challenge of modeling the different question types found in the questionnaires in a systematic way. That model can then serve as the basis for web-based versions of the questionnaires that accurately reflect the intent of the original forms even when, as in many cases, they cannot reproduce the original layout.

Keywords: Modeling

Kimberly A. Tryka

Kim is currently working with the PubMed Central group at NCBI to integrate documents into the dbGaP project. She has also worked on digital projects at the University of Virginia with the University Library and the Virginia Center for Digital History. Previously, she was an astronomer, studying icy objects in the outer solar system. She holds degrees in physics, planetary science, and library science.

Jeff Beck

Matt Mailman

Modeling Questions: Experiences from the dbGaP Project

Kimberly A. Tryka [National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) ]
Jeff Beck [National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) ]
Matt Mailman [National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) ]

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 Kimberly A. Tryka, Jeff Beck, and Matt Mailman. Reproduced with permission.


Recently it has become both time- and cost-effective to obtain high-quality genetic data on large numbers of individuals that can be used to identify associations between genotype and phenotypic characteristics. For example, this type of analysis has led to the identification of a genetic marker for macular degeneration. [2] To help researchers leverage these large data sets, the National Center for Biotechnology Information (NCBI) has created the Database of Genotype and Phenotype (dbGaP), a public repository for phenotype and genotype data from research studies as well as the documentation, such as the research protocols and questionnaires, that helps to contextualize the data. Integrating this data into a publicly accessible central repository will not only allow access to the data itself but will also aid research by creating a uniform way for community members to reference the data. dbGaP will assign stable identifiers to all studies, study documents, phenotypic variables, and contributed genetic analyses, and will version the data as needed. [3]

The Data

The data coming into dbGaP is not structurally complex; it arrives as tables where rows represent individuals and columns represent certain measured, or calculated, phenotype values. The phenotype information collected varies widely. It may be a physical attribute, such as weight, blood pressure, or age; a psychological or sociological indicator, such as whether a person hears voices or feels isolated; or a reported lifestyle factor, such as whether a person exercises or smokes. All these various types can be mixed in any given study.

Because of laws regulating the transmission of personal health information, there are two levels of access to the data. Controlled-access data is the individual-level data submitted by contributors; access to that data is granted only after application to, and authorization by, a Data Access Committee. (All data that comes to dbGaP is HIPAA compliant, and in some cases additional fields that might be identifying have been removed to help prevent the identification of individuals.) Open-access data is freely available on the website and consists of a summary page for each particular phenotype (i.e., column of a data table).

These summary pages include a prose description of the phenotype and a histogram representing the range of values found in the database and the occurrence of each value. Additionally, there are links to the documents where this phenotype is mentioned (for example, a link to the form used to collect the information, a link to the part of the protocol manual that gives instructions for measuring blood pressure, and a second link to the part of the manual that details which model of instrument was used). Creating these links into the documents will be discussed further below.
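The counting behind such a histogram can be sketched in a few lines of Python. The table, the column names (`diabetes`, `weight_kg`), and the helper `value_counts` are all hypothetical; the sketch only illustrates tallying the occurrences of each value in one phenotype column and is not a description of how dbGaP builds its summary pages.

```python
from collections import Counter

# Hypothetical individual-level table: one row per individual,
# one column per phenotype (names and values are invented for illustration).
rows = [
    {"subject": "S1", "diabetes": "yes", "weight_kg": 70},
    {"subject": "S2", "diabetes": "no", "weight_kg": 82},
    {"subject": "S3", "diabetes": "no", "weight_kg": 70},
    {"subject": "S4", "diabetes": None, "weight_kg": 91},
]


def value_counts(table, column):
    """Count how often each value occurs in one phenotype column."""
    return Counter(row[column] for row in table)


counts = value_counts(rows, "diabetes")

# Print a crude text histogram, one bar per distinct value.
for value, n in sorted(counts.items(), key=lambda kv: str(kv[0])):
    print(f"{value!s:>5} {'#' * n} ({n})")
```

Only the aggregated counts, not the individual rows, would correspond to what the open-access summary page shows.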


As noted before, not only are studies depositing their data with dbGaP, they are also depositing study documents and other documentation to support their data. The documentation is of three types:

  1. data dictionaries—which provide descriptions of phenotype data
  2. forms—the original questionnaires used to collect data
  3. supporting materials—documentation of procedures and administrative processes

Data Dictionaries

The data dictionaries are documents that describe the phenotype data that have been submitted in the data tables. They may include descriptions of the phenotype, lists of accepted values, and keys to code/value pairs. Some examples from various data dictionary entries are:

Table 1
  • 25.50 - 55.00
  • . UNKNOWN (225)
Somatic_Tactile [I12520]
  • Somatic or tactile hallucinations (when psychotic and without other comorbid features present), followed by the raw repository SAS database field name in brackets for cross-reference to repository database
  • 0=absent,1=present,2=suspected,-999=missing (could not rate)

The data dictionaries are very important for the people who load data into the database and for dbGaP's curators, who are responsible for making connections between the phenotypes in the database and the places where they are mentioned in documents. If the data dictionaries are particularly well structured, they can be used to help create a 'skeleton' version of the XML encoding for the questionnaires. The description of the phenotype in the data dictionary also appears on the phenotype summary page. However, the data dictionaries are not treated as "documents"; each is a table loaded into the database. Because of this they do not fall into the same workflow as the rest of the documentation, such as the questionnaires and other supporting materials, which is marked up in XML.
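As a rough illustration of how a well-structured data dictionary entry might seed a 'skeleton' encoding, the following Python sketch turns the Somatic_Tactile entry from Table 1 into a `variable` element of the kind defined in the Appendix. The function names, the id `v1`, and the choice of `select1` as the `style` value are assumptions made for this example; the actual dbGaP tooling is not described in this paper.

```python
import xml.etree.ElementTree as ET


def parse_code_key(key):
    """Split a key such as '0=absent,1=present' into (code, meaning) pairs."""
    pairs = []
    for part in key.split(","):
        code, _, meaning = part.partition("=")
        pairs.append((code.strip(), meaning.strip()))
    return pairs


def skeleton_variable(var_id, name, description, key):
    """Build a skeleton <variable> with one <item> per code/value pair."""
    # "select1" is an assumed style value: one choice from a fixed list.
    variable = ET.Element(
        "variable", {"id": var_id, "style": "select1", "name": name}
    )
    items = ET.SubElement(variable, "items")
    for code, meaning in parse_code_key(key):
        ET.SubElement(items, "item", {"value": code}).text = meaning
    # The dictionary's prose description is kept in <description>.
    ET.SubElement(variable, "description").text = description
    return variable


var = skeleton_variable(
    "v1",  # hypothetical id
    "Somatic_Tactile",
    "Somatic or tactile hallucinations",
    "0=absent,1=present,2=suspected,-999=missing (could not rate)",
)
print(ET.tostring(var, encoding="unicode"))
```

The skeleton has no `question` element because the data dictionary does not record the wording used on the form; that text would still have to be added from the questionnaire itself.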


All of the data in the database were collected, either on paper or electronically, based on a questionnaire. Although all the questionnaires share the property that they are made up of a series of questions which require inputs, they are surprisingly diverse and complex in their structure. Some of this complexity is intrinsic to the meaning of the question; a complex structure may reflect relationships between questions, such as grouping related concepts (often expressed as a primary question or statement and nested sub-questions) or contingencies such as answering or skipping a question based on an earlier answer. But sometimes the complexity arises as a result of the need to compress questions onto the printed page. This often creates what appears to be a complex question structure, when in reality the structure of the underlying questions is quite straightforward, though it needs to be teased apart.

The questionnaires are intimately connected to the data. Any place there is an input on a form is a potential phenotype column in the database. Rarely, however, does every possible input on a form have a corresponding column in the database, and it is not always possible to say why. In some cases we know that the contributor has decided to share only a particular subset of variables, or has chosen to provide only derived values, calculated from raw values that they are choosing not to share. In some cases collected data has not been shared to prevent identification of individuals. In some cases there simply isn't a corresponding column of data; maybe the study at some point determined it was unnecessary and deleted it, or maybe, even if it was collected, it never made its way into the database. Regardless, if we have a phenotype in the database, we want not only to connect that phenotype with the questionnaire that was used to collect it, but also to point people directly to the appropriate question or questions on the form. And if people are looking at the form, we want them to be able to see that a specific question is associated with a phenotype in the database and to view the summary of the phenotype data.

Other Supporting Materials

These are, on the whole, relatively straightforward prose documents that offer additional context for the phenotypes stored in the database and that can elaborate on the short descriptions of the phenotypes found in the data dictionaries. Many are study protocols (or manuals of operation), which include information such as descriptions of how to take measurements, specifications of what equipment should be used to make measurements, the rationale for the study, a summary of previous work in the field, and guidance to interviewers as to how they should clarify questions if someone they are interviewing is confused or has trouble answering. We have also seen consent forms, flow charts, and documents that record the algorithms used for manipulating the raw data values into derived values.

In the same way that we want to be able to link back and forth between the questionnaires and the phenotypes in the database, we also want to be able to link back and forth between the supporting materials and the database. Additionally, if documents refer to each other, in whole or in part, we want to be able to capture that information so that users will be able to move seamlessly from document to document.


Marking up the documents for the dbGaP project has two primary goals:

  • to create linkages between the phenotypes in the database and documents
  • to create a markup scheme that can be used to encode the legacy documents we are receiving and be able to render them on the web in a way that will be true to the intellectual content of the originals, but may not reproduce the exact look and feel of the originals (the retrospective markup problem)

A third goal, which is not immediately needed by the project but has informed it since the beginning, is:

  • to create a markup scheme that can be used in the future to generate questionnaires that can be filled out online (the prospective markup problem)

To achieve these goals we decided that the documents would be encoded using XML (eXtensible Markup Language) because of its ability to encode documents semantically, naming the structural and intellectual components, which will allow future reuse and repurposing.

New or Pre-existing DTD?

NCBI created and maintains the NLM Archiving and Interchange XML TagSet, which is used for tagging journal articles and books in XML. The XML markup scheme that is being used for the dbGaP documents is a modification of the NLM Archiving and Interchange DTD, which is defined using elements from the TagSet and is used for the journal articles in PubMed Central, NLM’s electronic archive of life sciences journal literature. [4]

The NLM journal DTD needed to be modified in three ways to meet the needs of the dbGaP documents. First, the model for the "front matter" (usually bibliographic information when dealing with journal articles or books) was altered to allow dbGaP-specific metadata including information about the document's parent study. Second, we added the ability to reference phenotype variables from relevant pieces of text (for example, to link a paragraph describing how a measurement was taken or an item in a list describing the piece of equipment used to make a measurement to the variable that contains the measurement data). Finally, we created a model for "questions" for the data forms.

The first two modifications were minor and straightforward changes to the base DTD. We renamed the root element, so that there would be no confusion about the fact that we were using a version of the DTD modified for dbGaP, and added attributes to hold information unique to the project. We created a new <front> element, which allows us to capture information about the clinical (or research) study that a document is a part of, any information about the original source document that might need to be captured (though we have not had occasion to use this), and information about the electronic document itself. Each of these pieces included models already defined in the DTD. We also added an attribute on certain structural elements (sections, paragraphs, list-items, table cells, etc.) that allows them to be linked to phenotypes in the database.
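To make the linking attribute concrete, here is a minimal Python sketch that scans a document for such an attribute and builds an index from each referenced variable to the document parts that cite it. The attribute name `variable-ref`, the identifiers `VAR1` and `VAR2`, and the fragment itself are invented for illustration; this paper does not give the actual attribute name or identifier format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document fragment. The attribute name "variable-ref" and the
# identifiers "VAR1"/"VAR2" are invented; they are not the real dbGaP names.
DOC = """
<sec id="s1">
  <p id="p1" variable-ref="VAR1">Measure blood pressure after five minutes of rest.</p>
  <list>
    <list-item id="li1" variable-ref="VAR1"><p>Use the standard cuff.</p></list-item>
    <list-item id="li2" variable-ref="VAR2"><p>Record weight in kilograms.</p></list-item>
  </list>
  <p id="p2">General remarks with no linked variable.</p>
</sec>
"""


def index_links(xml_text, attr="variable-ref"):
    """Map each referenced variable to the ids of the elements that cite it."""
    index = defaultdict(list)
    # iter() walks the tree in document order, so ids come out in reading order.
    for element in ET.fromstring(xml_text).iter():
        ref = element.get(attr)
        if ref is not None:
            index[ref].append(element.get("id"))
    return dict(index)


links = index_links(DOC)
print(links)
```

An index of this shape supports both directions of navigation: from a variable to its document parts, and, by reading the attribute on a given element, from a document part back to its variable.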

The final modification, creating a model for questions, proved to be harder than anticipated. The reasons for this are twofold. First, questions can have both complex structures and complex relationships to other questions, or other items, on the form. Second, determining the underlying intellectual structure of the questions is confounded by the presentation of the questions on the forms. Because most of the forms that we are currently working with were designed to be viewed and filled out on the printed page, the questions are laid out with the purpose of fitting them within a page structure and saving space. Additionally, the forms often use visual cues, such as arrows, to provide guidance in filling out the form, and these cues are not easily incorporated into textual markup.

Before attempting to model the questions, we tried to find published XML question models, but met with little success. Since we created our question model, colleagues have pointed us to the IMS Question & Test Interoperability (QTI) specification. While interesting, QTI contains a great deal of overhead that was not relevant to our particular problem. [5] During initial research we also found that there is a particular field of study revolving around the presentation of questions, both on paper and in HTML-based web forms, which is focused on making sure that questions are clear and that lists of answers are presented in a way that does not lead to bias. While interesting, that research wasn't applicable to the problem of creating an XML model for questions. [6]

Examples of Question Types

Following are a series of examples of questions of varying complexity from different questionnaires we have had to work with and the different features they present.

Stand-alone with user input and stand-alone with provided input

Figures 1 and 2 show what can be thought of as "stand-alone" questions. In each case a question is asked, for which a single answer is expected. There are no subquestions or contingencies involved. The most obvious difference between the two questions is that one allows the answer to be "free input" and the other asks that you select an answer from a list. A subtle difference is that in one case the question is labeled ("3") and in the other case there is no label.
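Using the element names from the Appendix, the two kinds of stand-alone question might be encoded roughly as in the Python sketch below, which parses each encoding and reports its style, label, and allowed values. The question wording, ids, and item values are invented (the figures are not reproduced here), and the `style` values `input` and `select1` are assumed to be among those permitted by the `%question-styles;` entity.

```python
import xml.etree.ElementTree as ET

# A labeled question with free input (wording and ids are invented).
FREE_INPUT = """
<variable id="q3" style="input">
  <label>3</label>
  <question>What is your date of birth?</question>
  <input-box width="10"/>
</variable>
"""

# An unlabeled question answered by choosing one item from a list.
PROVIDED_INPUT = """
<variable id="q-sex" style="select1">
  <question>Sex</question>
  <items>
    <item value="1">Male</item>
    <item value="2">Female</item>
  </items>
</variable>
"""


def summarize(xml_text):
    """Return (style, label or None, list of allowed values) for one variable."""
    variable = ET.fromstring(xml_text)
    values = [item.get("value") for item in variable.iter("item")]
    return variable.get("style"), variable.findtext("label"), values


print(summarize(FREE_INPUT))
print(summarize(PROVIDED_INPUT))
```

The label is optional in the model, so the unlabeled question simply omits the `label` element rather than carrying an empty one.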

Question group with an opening line

In Figure 3 you see an example of a question group. There is an opening line, in this case a question which requires no explicit answer, followed by specific questions, which can be recognized individually as "stand-alone" questions. It is possible that a question group may not have any opening text, just the questions that belong with the group.

Question group, leading question with contingency

In Figure 4 you see another question group. In this case the answer to the first question determines whether or not the subsequent questions should be answered. Figure 4 is also a good example of a visual cue (text along with the "pointing hand" dingbat) that is far from trivial to translate into XML without editing the original textual content.
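One way such a group could be expressed with the elements from the Appendix is sketched below in Python: a `variable-group` holding a leading question, an `instructions` element carrying the contingency in words, and the follow-up questions. The wording and ids are invented, and placing the contingency in `instructions` is an assumption of this sketch rather than a documented dbGaP practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical question group with a leading, contingent question.
GROUP = """
<variable-group id="g1">
  <variable id="g1-lead" style="select1">
    <question>Have you ever smoked cigarettes?</question>
    <items>
      <item value="1">Yes</item>
      <item value="0">No</item>
    </items>
  </variable>
  <instructions>If yes, answer the questions below; otherwise skip them.</instructions>
  <variable id="g1-a" style="input">
    <question>At what age did you start?</question>
    <input-box width="3"/>
  </variable>
  <variable id="g1-b" style="input">
    <question>How many cigarettes per day?</question>
    <input-box width="3"/>
  </variable>
</variable-group>
"""


def outline(xml_text):
    """List the direct children of a variable-group as (tag, id-or-text) pairs."""
    group = ET.fromstring(xml_text)
    return [
        (child.tag, child.get("id") or (child.text or "").strip())
        for child in group
    ]


for tag, detail in outline(GROUP):
    print(f"{tag:13} {detail}")
```

In this sketch the contingency lives only in human-readable text; the element declarations in the Appendix show no attribute for machine-readable skip logic.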

Question with complex text

Many of these forms were designed to be filled out not by the person answering the questions, but by someone who is interviewing the study participant. For that reason the forms often contain explanatory text for the interviewer. This additional text seems to break down into two categories: instructions for the interviewer and scripted (or at least suggested) dialog for the interviewer. An example including all these categories of text is shown in Figure 5. We believe that these two categories are distinct from each other and are distinct from the text of the question being asked, though intimately related to the interpretation of the answers to the question.

Multiple nesting

Figure 6 shows a question that contains multiply nested questions. In this case there is a contingency question (not requiring an answer, but directing the person filling out the form to answer the subquestions if the answer is yes), followed by three subquestions (a, b, and c), with each of a, b, and c having two subquestions (one answer each for the left and right eyes).
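The nesting described here maps naturally onto nested `variable-group` elements. The Python sketch below builds such a structure and then walks it recursively to collect the individual variables; all ids and question text are hypothetical, and `select1` is an assumed `style` value.

```python
import xml.etree.ElementTree as ET


def build_nested_group():
    """Build a group with subquestions a, b, c, each asked for both eyes."""
    top = ET.Element("variable-group", {"id": "q10"})
    # The contingency question needs no answer, so it is text, not a variable.
    ET.SubElement(top, "question").text = (
        "Hypothetical contingency question; if yes, answer a, b, and c."
    )
    for letter in "abc":
        sub = ET.SubElement(
            top, "variable-group", {"id": f"q10{letter}", "layout": "horizontal"}
        )
        ET.SubElement(sub, "label").text = letter
        for eye in ("right", "left"):
            var = ET.SubElement(
                sub, "variable", {"id": f"q10{letter}-{eye}", "style": "select1"}
            )
            ET.SubElement(var, "question").text = f"{eye.capitalize()} eye"
            items = ET.SubElement(var, "items")
            for value, text in (("1", "Yes"), ("0", "No")):
                ET.SubElement(items, "item", {"value": value}).text = text
    return top


def leaf_variables(group):
    """Recursively collect the ids of all variables under a variable-group."""
    ids = []
    for child in group:
        if child.tag == "variable":
            ids.append(child.get("id"))
        elif child.tag == "variable-group":
            ids.extend(leaf_variables(child))
    return ids


nested = build_nested_group()
print(leaf_variables(nested))
```

Each of the six leaf variables is a place where a value can be entered, and so each is a candidate for a link to a phenotype column.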

Tables - can format and content be separated and still give you enough information to make sense of things?

Tables are often a source of trouble when separating form and content. In most cases when we have found tables in questionnaires, it is relatively simple to decompose them into series of nested question groups. But from time to time one appears where the format informs the question in a way that it seems necessary to reproduce the table layout.

Figure 7 is a part of a tabular question that easily breaks apart into a series of nested questions. The hierarchy is: a group of questions related to fruits and juices, containing question groups related to particular fruits, each containing individual questions related to frequency eaten and serving sizes.

Figure 8 shows a series of questions laid out in a tabular format. While this table could be pulled apart (there is a single question per "row" related to the number of correct letters read, with explanatory text listing the Snellen Equivalent and the Chart 1 Letters), it seems that there is a lot gained in understanding by showing the question and its various explanatory pieces of text within its original tabular structure.

Text after input boxes

Figures 9 and 10 illustrate a particular problem that comes up from time to time: text following an input area. Figure 9 is a simple example, with the text after the input space telling the person filling out the form what units should be used for the value being placed in the box. Figure 10 is more complex, combining nested questions with the need to place some questions next to each other (the implicit question of choosing "+" or "-" appearing next to the input box) and also to have input areas followed by text.
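Because the content model for `variable` in the Appendix allows text elements and `input-box` in any order, trailing text can simply follow the input box in the markup. The Python sketch below renders such a variable as a line of text in document order. The example question, the use of `instructions` with `type="units"` for the trailing text, and the reading of `width` as a character count are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical variable whose input box is followed by a units label.
WEIGHT = """
<variable id="w1" style="input" layout="horizontal">
  <question>Weight:</question>
  <input-box width="4"/>
  <instructions type="units">kg</instructions>
</variable>
"""


def render_inline(xml_text):
    """Render a variable's children on one line, preserving document order."""
    variable = ET.fromstring(xml_text)
    parts = []
    for child in variable:
        if child.tag == "input-box":
            # Assumption: width is a number of characters; draw it as blanks.
            parts.append("_" * int(child.get("width", "5")))
        else:
            parts.append((child.text or "").strip())
    return " ".join(parts)


line = render_inline(WEIGHT)
print(line)
```

Keeping the units in their own element, rather than folding them into the question text, preserves the distinction between what is asked and how the answer should be recorded.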

Creating the Question Model

The underlying premise of the question modeling is that there is a certain "atomic" unit that is common across all questions: the variable (which can also be thought of as the input), which might be represented by a blank line or a checkbox on a questionnaire. The variable is the appropriate point of focus of our modeling because any value input on a form could, potentially, become a value in our database, and the purpose of the markup is to make appropriate links between the XML documents and the values in the database.

After looking at a number of questions, including those shown above, we have come up with a series of assumptions that form the basis for modeling the structure of questions found on forms:

  1. there is a basic unit, the "variable" which corresponds to any place on a form which allows input
  2. a variable can contain different types of text (questions, instructions, script) and may either have an area for free-form input or a list of items from which to select
  3. a variable has a particular type (input [7], which allows user-entered values; select1, which allows the user to choose one from a list of choices; select, which allows a user to choose more than one from a list of choices)
  4. if a variable is of the type select1 or select, then the items that can be chosen must be defined
  5. if a variable is of the type input then the 'input-box' size and location must be defined
  6. variables can be grouped together into "variable-groups"
  7. variable-groups may contain variables and text (questions, instructions, scripts, description)
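Assumptions 4 and 5 are constraints that a DTD content model alone does not enforce, so they lend themselves to a small consistency check. The Python sketch below is one hypothetical way to write such a check; the function names and the sample fragment are invented, and the three `style` values are taken from assumption 3.

```python
import xml.etree.ElementTree as ET


def check_variable(variable):
    """Return a list of problems found in one <variable> element."""
    problems = []
    var_id = variable.get("id")
    style = variable.get("style")
    if style in ("select1", "select"):
        # Assumption 4: selectable items must be defined.
        if not variable.findall("items/item"):
            problems.append(f"{var_id}: style {style} needs at least one item")
    elif style == "input":
        # Assumption 5: an input variable must define its input box.
        if variable.find("input-box") is None:
            problems.append(f"{var_id}: style input needs an input-box")
    else:
        problems.append(f"{var_id}: unknown style {style!r}")
    return problems


def check_document(root):
    """Check every variable in a tree, however deeply it is nested."""
    problems = []
    for variable in root.iter("variable"):
        problems.extend(check_variable(variable))
    return problems


SAMPLE = """
<variable-group id="g1">
  <variable id="v1" style="select1"><question>Smoker?</question></variable>
  <variable id="v2" style="input"><question>Age</question><input-box width="3"/></variable>
  <variable-group id="g2">
    <variable id="v3" style="input"><question>Height</question></variable>
  </variable-group>
</variable-group>
"""

for problem in check_document(ET.fromstring(SAMPLE)):
    print(problem)
```

A check like this could run after DTD validation, which confirms element order and required attributes but cannot tie the presence of `items` or `input-box` to the value of `style`.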

A series of refinements can be made to this model that don't necessarily reflect the structure of questions. These are of two types: identifiers that the system needs in order to work with the XML in an automated fashion, and information that may be present in the initial format of the documents and that we would prefer not to lose. So:

  1. every variable and variable-group should be given a unique id
  2. a variable or variable-group that is associated with a value in the database needs a way to reference the correct database identifier
  3. if a data dictionary is well-written, and can be used as the basis for the questionnaire markup, then it is reasonable to keep any written description of the variables

Although we fully understand XML's goal of not conflating content and format, we believe that some legacy documents contain questions presented in such a way that content and format cannot easily be disentangled. Also, while again an issue of format rather than content, we believe that there should be a way to prompt the rendering software to lay out questions in a way that may aid in their understanding. Thus, we add two other refinements to our list:

  1. allow variables (and text) to be placed into variable-groups in such a way that they retain their 'tableness'
  2. allow variables and variable-groups to indicate whether they should be displayed horizontally (side-by-side) or vertically (above-and-below)
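How a renderer might honor the second refinement can be sketched as follows. This Python example is only an illustration of reading the `layout` attribute: it joins sibling questions with spaces when the layout is `horizontal` and with line breaks otherwise. The function name, the sample fragment, and the treatment of every other value as vertical are assumptions, and this plain-text output is not the HTML rendering that dbGaP actually produces.

```python
import xml.etree.ElementTree as ET

GROUP = """
<variable-group id="g" layout="vertical">
  <variable-group id="g-eyes" layout="horizontal">
    <variable id="g-r" style="input"><question>Right eye</question><input-box/></variable>
    <variable id="g-l" style="input"><question>Left eye</question><input-box/></variable>
  </variable-group>
  <variable id="g-n" style="input"><question>Notes</question><input-box/></variable>
</variable-group>
"""


def render(element, fallback_layout="vertical"):
    """Render question text, joining siblings according to the layout hint."""
    if element.tag == "variable":
        return element.findtext("question", default="").strip()
    layout = element.get("layout", fallback_layout)
    parts = [
        render(child, fallback_layout)
        for child in element
        if child.tag in ("variable", "variable-group")
    ]
    # Simplification: anything other than "horizontal" is stacked vertically.
    separator = "   " if layout == "horizontal" else "\n"
    return separator.join(parts)


rendered = render(ET.fromstring(GROUP))
print(rendered)
```

Because the hint is an attribute rather than part of the element structure, a renderer that ignores it still has the full intellectual content of the questions.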

The implementation of these assumptions and refinements as part of the NLM DTD can be seen in the Appendix.


Following is an example of how the linking occurs between documents and the database. (Note to reviewers: the screen shots below will become obsolete sometime within the next month or so, but the functionality that they illustrate will not change.)

Let's assume that you are looking at the variable report summary for a phenotype related to diabetes:

On this page you will see a summary of the data in the database for this variable, and below the data summary there is a section "Document Parts Related to Variable." This section shows any piece of any document that is associated with this variable. What is visible in this case is a question on a form asking if the subject has diabetes. This document part is linked back to the original document via the hyperlinked phrase "See document part in context." If you choose to follow that link, you will be taken to the following part of the document:

This question is the HTML representation of the example shown in Figure 3 above. You will notice that there is a small icon (a blue square with a 'v') that indicates that there is a variable in the database associated with that question. If you were to hover over the icon, a box would pop up with links to any appropriate variables (in this case there are 13 links because this question was asked during 13 different follow-up visits). If you were interested in whether any additional questions asked on this questionnaire are represented in the database, you could scroll through the document and find:

If you then choose to follow the link to the variable, you will arrive at the summary page for the variable related to cataract. From there you can link to the two visible documents, which reference the cataract variable, as well as to other documents (such as the questionnaire we just came from) that fall outside this screen shot:

Future Directions

Of the three goals listed for the document markup for dbGaP, the third, to create a markup scheme that would allow newly designed questionnaires to be marked up and rendered as web-based forms for the collection of electronic data, has not yet been put to the test. We believe that the markup scheme outlined above (and shown in the Appendix) will be capable of creating web-based forms that can be used to collect data. In fact, as we were developing our scheme we did so with an eye toward the XForms standards. We have not chosen to use XForms to mark up and display our questionnaires at this point because we do not feel that there are any implementations easily available to us that meet the needs of the people who would be using the forms.

The dbGaP project has made a tentative step toward electronic data collection in collaboration with another NIH-funded program, CETT (Collaboration, Education and Test Translation) [8], which is helping to fund consortiums of researchers who study, clinical laboratories that test for, and groups that advocate for particular rare diseases to collect data to add to the dbGaP database. Because of a number of constraints, the initial data collection is not happening via a web form based on our XML markup but via a fillable PDF form. It is believed that most general practitioners will not be happy having to fill out a form online and that they will not fill out a form that does not print out on a single sheet of 8.5 x 11 inch paper. Hence the choice of PDF format, which looks the same on a sheet of paper as it does on a screen; the doctor can fill the form out on paper as they speak with a patient, and then have an office worker enter it into the online version of the form to submit the data.

Appendix: The Question Model in DTD Formatting

        <!-- The parameter entities %text; and %question-styles; are not
             defined in this excerpt. -->

        <!-- A group of variables and text; groups may be nested. -->
        <!ELEMENT variable-group  ((label?, caption?, thead?,
                                    (variable-group | variable | question |
                                     instructions | script)*),
                                   description?)>
        <!ATTLIST variable-group
                  id        ID                                     #REQUIRED
                  layout    (horizontal | vertical | table | row)  #IMPLIED
                  required  (yes | no)                             #IMPLIED
                  datatype  CDATA                                  #IMPLIED
                  name      CDATA                                  #IMPLIED>

        <!-- The basic unit: any place on a form that allows input. -->
        <!ELEMENT variable  (label?,
                             (question | instructions | script |
                              input-box | items)*,
                             description?)>
        <!ATTLIST variable
                  id        ID                           #REQUIRED
                  style     %question-styles;            #REQUIRED
                  layout    (horizontal | vertical | row)  #IMPLIED
                  required  (yes | no)                   #IMPLIED
                  datatype  CDATA                        #IMPLIED
                  name      CDATA                        #IMPLIED
                  size      (xs | s | m | l | xl | xxl)  #IMPLIED>

        <!-- Text that can appear in a variable or variable-group. -->
        <!ELEMENT question (%text;)*>
        <!ATTLIST question
                  type   CDATA  #IMPLIED
                  label  CDATA  #IMPLIED>

        <!ELEMENT instructions (%text;)*>
        <!ATTLIST instructions
                  type  CDATA  #IMPLIED>

        <!ELEMENT script (%text;)*>
        <!ATTLIST script
                  type  CDATA  #IMPLIED>

        <!ELEMENT description (%text;)*>
        <!ATTLIST description
                  type  CDATA  #IMPLIED>

        <!-- The list of choices for a select1 or select variable. -->
        <!ELEMENT items (item+)>

        <!ELEMENT item (%text;)*>
        <!ATTLIST item
                  value  CDATA  #REQUIRED>

        <!-- The free-input area for an input variable. -->
        <!ELEMENT input-box EMPTY>
        <!ATTLIST input-box
                  width  CDATA  #IMPLIED
                  depth  CDATA  #IMPLIED>



1. The dbGaP home page is: General information about the project can be found by choosing the "About dbGaP" link in the left-hand navigation column.

3. Mailman, M.D., S.T. Sherry, Y. Jin, M. Kimura, K.A. Tryka, R. Bagoutdinov, L. Hao, J. Paschall, L. Phan, N. Popova, S. Pretel, Y. Shao, Z.Y. Wang, M. Ward, K. Zbicz, J. Beck, D. Preuss, K. Sirotkin, E. Yaschenko, J. Ostell 2007 "dbGaP: database of Genotype and Phenotype," in preparation.

4. Beck, J., D. Lapeyre 2003 "New Public Domain Journal Article Archiving and Interchange DTDs," delivered at XML 2003, Philadelphia, PA.

5. The following site offers a description of QTI as well as links to various versions of the specification

6. For example, Professor Don Dillman at Washington State University, a sociologist who has worked with the U.S. Census Bureau, is a contributor to this literature.

7. In XForms and HTML forms there is a distinction between 'input' and 'textarea'. For the purposes of this paper these can both be thought of as 'input', as they are both areas where text is entered freely.

