I’ve worked with, and created, both relational and flat databases during my public health career. It never occurred to me to consider constructing databases and mining them for humanities-type projects until this Digital History course. For my class project I’ll likely be working with a flat database, where the categories and fields are not related to one another. I don’t know if I have enough data or time in the next month to construct a relational database using Structured Query Language (SQL). It also may not be relevant for the project’s scope, which looks qualitatively at language and discourse around menopause in the postwar decades.
That being said, in my dorky academic fantasy, it might be interesting to expand this project and look at women’s medical records from the time (if they exist and I could get hold of them) and mine the data. (I’ve always been a fan of mixed-methods approaches for research projects.)
FileMaker Pro and Microsoft Access are two relational database applications I’m familiar with (I’ve also used SPSS, though it’s really a statistical package rather than a database). In my fantasy project, I’d probably use FileMaker and train an assistant to enter data from the medical records. (Of course I’d have written an awesome grant for a bottomless pool of money to hire all the monkey data crunchers I needed.) We would need to set up the database tables (categories) and construct the relationships between the tables. Off the top of my head, an initial set of tables might look something like this:

- DEMOGRAPHICS: age, race, household income range, education
- LIFESTYLE: sedentary, moderately active, very active
- NUMBER OF CHILDREN: 0, 1, 2, 3, 4, 5
- SYMPTOMS: hot flashes, vaginal dryness, mood swings, headaches
- TREATMENT: Premarin, estrogen-only pills, vaginal suppositories, progesterone-only pills
- ONSET OF SYMPTOMS: <35, 35-39, 40-44, 45-49, 50-54, 55-59, 60+, never
- GEOGRAPHY: East Coast, Midwest, South, West Coast
- DENSITY: urban, suburban, rural
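To make the “tables plus relationships” idea concrete, here is a minimal sketch of what that schema could look like in SQLite (via Python’s built-in sqlite3 module). The table and column names are my own illustration of the categories above, not an actual FileMaker layout; the key move is that SYMPTOMS and TREATMENT each point back to a patient ID, which is what relates the tables to one another.

```python
import sqlite3

# Rough sketch of the fantasy project's schema (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    age INTEGER, race TEXT, income_range TEXT, education TEXT,
    lifestyle TEXT CHECK (lifestyle IN ('sedentary','moderately active','very active')),
    num_children INTEGER,
    onset_of_symptoms TEXT,   -- e.g. '<35', '35-39', ... '60+', 'never'
    geography TEXT,           -- East Coast, Midwest, South, West Coast
    density TEXT CHECK (density IN ('urban','suburban','rural'))
);

-- One row per reported symptom, related back to the patient
CREATE TABLE symptoms (
    patient_id INTEGER REFERENCES patients(patient_id),
    symptom TEXT              -- hot flashes, vaginal dryness, mood swings, headaches
);

-- One row per treatment, related back to the patient
CREATE TABLE treatments (
    patient_id INTEGER REFERENCES patients(patient_id),
    treatment TEXT            -- Premarin, estrogen-only pills, ...
);
""")
```

Splitting symptoms and treatments into their own tables (instead of one wide flat table) is what lets one patient have any number of each — the relational part of “relational database.”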
We could also import other contextual datasets such as census data (again, if they are digitized, neatly organized in an Excel spreadsheet, and available for use). With these tables and datasets, I could join the SYMPTOMS table to the TREATMENT table and see if there’s a relationship between the types of symptoms women had and their treatments. It might also be interesting to see if geography had anything to do with treatment plans. We could also look at whether age relates to the types of symptoms most frequently reported. These would be fairly basic searches.
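The symptoms-versus-treatments question above is a classic SQL join. A small sketch, using made-up toy rows rather than real data: match symptom rows to treatment rows on the shared patient ID, then count how often each pairing occurs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symptoms (patient_id INTEGER, symptom TEXT);
CREATE TABLE treatments (patient_id INTEGER, treatment TEXT);
-- Toy rows purely for illustration
INSERT INTO symptoms VALUES (1, 'hot flashes'), (1, 'headaches'), (2, 'vaginal dryness');
INSERT INTO treatments VALUES (1, 'Premarin'), (2, 'vaginal suppositories');
""")

# Cross-tabulate symptom against treatment: how often does each pairing occur?
rows = conn.execute("""
    SELECT s.symptom, t.treatment, COUNT(*) AS n
    FROM symptoms AS s
    JOIN treatments AS t ON s.patient_id = t.patient_id
    GROUP BY s.symptom, t.treatment
    ORDER BY n DESC
""").fetchall()
for symptom, treatment, n in rows:
    print(symptom, treatment, n)
```

The same JOIN pattern would answer the geography and age questions — swap in the patients table and group by geography or an age band instead.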
But in this fantasy research project, I don’t know if I would start with the database itself. I’d want to qualitatively look at the language used to describe certain symptoms or treatments, for example. The term “hot flash” may not have been used universally. Other words such as “flushed,” “blood rush,” or “heat flusters” may have been used instead. I would have to create conditions on these words so that when “hot flash” (or any related word) is queried, all variants come up. I would also consider putting quantitative conditions on symptoms (or even treatments) that could organize individual patients by the number of symptoms they experienced or the number of kinds of treatments they had received. For example, if one woman experienced vaginal dryness and headaches, she could be assigned “2” for symptoms.
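Both ideas — collapsing period-specific variants onto one canonical term, and counting symptoms per patient — can be sketched in a few lines. The variant words are the ones named above; the mapping and the sample records are my own illustration.

```python
# Map historical variants onto one canonical symptom label (illustrative).
SYNONYMS = {
    "flushed": "hot flash",
    "blood rush": "hot flash",
    "heat flusters": "hot flash",
    "hot flash": "hot flash",
}

def normalize(term: str) -> str:
    """Collapse period-specific variants onto one canonical symptom label."""
    cleaned = term.lower().strip()
    return SYNONYMS.get(cleaned, cleaned)

# Count symptoms per patient, as in the "assigned 2 for symptoms" example.
records = {
    "patient_1": ["vaginal dryness", "headaches"],
    "patient_2": ["blood rush"],
}
symptom_counts = {p: len({normalize(s) for s in syms}) for p, syms in records.items()}
print(symptom_counts)  # patient_1 is assigned 2, as in the example above
```

Using a set before counting means a patient who reported both “flushed” and “blood rush” is still counted as having one symptom, not two.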
One issue that these Atlantic and New York Times articles don’t explicitly discuss is data preparation. Having worked with (and been a data entry monkey for) large-ish datasets (5,000-plus participants), I’m painfully aware of the importance of cleaning up and preparing datasets for research. Was the data entered correctly? How did you spot-check the entries? After the import process, are the variables aligned correctly? Are there double entries of the same person? How do we group clusters of variables or separate variables out? What kind of predictive power do our numbers have? Do we need to collect more data? Are the variables categorical or continuous? How do we account for and properly code missing values or nullified responses in certain categories?
“Fixing” data is a pain, but so much of data mining and creating relational databases hinges on how “clean” the dataset is, or how well prepped it is for certain types of analysis. With FileMaker and other relational databases, it’s also possible to create restrictions on categories while entering individual data (rather than importing a dataset) to ensure that the data going in is “clean” and complies with the range of responses you’re looking for.
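A minimal sketch of what entry-time restrictions look like in code, mimicking the field validation FileMaker or Access can enforce (the field names and allowed values are illustrative, drawn from the table sketch earlier). It flags three of the problems listed above: duplicate entries of the same person, missing values, and out-of-range responses.

```python
# Allowed values per field (illustrative, mirroring the LIFESTYLE and
# DENSITY categories sketched earlier in the post).
ALLOWED = {
    "lifestyle": {"sedentary", "moderately active", "very active"},
    "density": {"urban", "suburban", "rural"},
}

def validate(record: dict, seen_ids: set) -> list:
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    if record.get("patient_id") in seen_ids:
        problems.append("duplicate patient_id")          # double entry of the same person
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing value")   # must be coded explicitly, not left blank
        elif value not in allowed:
            problems.append(f"{field}: {value!r} outside allowed range")
    return problems

seen = {101}
# A typo ("sedentry") and a repeated ID are both caught before entry.
print(validate({"patient_id": 101, "lifestyle": "sedentry", "density": "urban"}, seen))
```

Catching a typo like “sedentry” at entry time is far cheaper than discovering, months later, that the LIFESTYLE category has a fourth phantom value scattered through 5,000 records.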
One last aspect of working with databases that could be further examined is the effect of confounding variables. This semester, I’m also working on a qualitative project about morality and breastfeeding culture. The government and reputable health organizations have advocated for breastfeeding, making claims about better health outcomes for children such as reduced risk of diabetes, obesity, and autism, and improved cognitive performance. In Joan Wolf’s book, Is Breast Best?: Taking on the Breastfeeding Experts and the New High Stakes of Motherhood, she argues that many of these studies didn’t adequately consider confounding variables such as the mother’s prenatal health and lifestyle. For example, she argues that women who breastfed were likely physically active and ate organic food prior to having children. This lifestyle difference alone can contribute to healthier breast milk and predict a strong likelihood that the child will be active, therefore preventing obesity and diabetes.
With all research methods, it’s important to consider their strengths, limitations, and hidden assumptions. When it comes to quantitative methods and relational databases, I’m still stuck viewing them as tools for scientific and social-scientific research. I’m curious to see more examples of how they can be used in the humanities, given the field’s strong tradition of not using quantitative methods.