CEGA Trace Research Project Reflection
While I originally joined the BIDS Course Mapping team, during one of the meetings I learned about another BIDS URAP project called CEGA Trace, which aimed to identify the influence of research publications on public policy (government decision-making) by searching the grey literature and social media for digital fingerprints of specific economic studies. Because this combined my interests in data science and economics, I asked to sit in on one of their meetings. While there, I noticed that the team had a strong foundation in data analysis and economic analysis but lacked someone experienced in data scraping. Having worked with scraping before, especially on my CalHacks hackathon project, I saw an opportunity to help the team. After getting approval from the leaders of both my original and new projects, I switched over.
During my first official meeting, I noticed there was no clear directive on how to organize the project, so I stepped up and created a game plan with milestones and subteams. The first step was to collect data, since without data there is nothing to analyze. I split the group into two subteams: the one with more programming experience, especially in Python, was assigned to scrape the World Bank database, while the other would build a SQL database that we could push our data to and query later for analysis.

I led the scraping team, and the task ended up being far more difficult than we originally imagined. We started with a database of policy reports and had to manually extract the bibliography from each PDF and format it properly, since every report used a different format. We realized this manual step was a major bottleneck and definitely not sustainable in the long term, but we wanted an initial dataset so we could start doing some analysis. We first tried to use Google Scholar to extract information from each citation, but it turned out to be unreliable because our IP kept getting blocked. So we came up with a new system using article DOIs and the crossref.org API.

After building this initial database and handing it off to the analysis subteam, the scraping subteam went back to fixing the manual bottleneck. After doing some research, I realized that Google Scholar tracks what has cited an article. So instead of starting with a database of policy reports and finding the research they cite, I could start with a database of research articles and find the policy reports that cited them. We scraped the Web of Knowledge database to get these research articles. We then ran into our old Google Scholar blocking issue again.
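The DOI-based lookup can be sketched roughly like this. This is a minimal illustration rather than our exact code: the endpoint and response fields (`DOI`, `title`, `issued`, `is-referenced-by-count`) follow Crossref's public REST API, while the function names, the `User-Agent` string, and the email placeholder are assumptions made for the sketch.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/"

def fetch_metadata(doi, email="someone@example.edu"):
    """Look up one DOI on Crossref; the mailto in the User-Agent opts into
    Crossref's 'polite' pool, which is less likely to throttle requests."""
    req = urllib.request.Request(
        CROSSREF_API + urllib.parse.quote(doi),
        headers={"User-Agent": f"citation-scraper/0.1 (mailto:{email})"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["message"]

def summarize(message):
    """Flatten a Crossref record into the columns a citation table might hold."""
    return {
        "doi": message.get("DOI"),
        "title": (message.get("title") or [""])[0],          # Crossref returns a list
        "year": (message.get("issued", {}).get("date-parts") or [[None]])[0][0],
        "cited_by": message.get("is-referenced-by-count"),
    }
```

Unlike scraping Google Scholar's HTML, this returns structured JSON keyed by a stable identifier, which is what made the DOI approach reliable.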
But now, since we were no longer rushing to produce an MVP dataset, I was able to put more time into getting around the roadblock, using techniques like driving a real browser with Selenium and partitioning my data to run it in parallel across multiple IPs.
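The two techniques above can be sketched as follows. The `webdriver.Firefox` call and the Scholar query URL are real Selenium/Scholar usage, but the round-robin partitioning helper and the function names are simplified assumptions; in practice each chunk would be run from a machine or proxy with a different IP.

```python
def partition(items, n_workers):
    """Deal the DOI list round-robin into one roughly equal chunk per worker/IP."""
    return [items[i::n_workers] for i in range(n_workers)]

def scrape_chunk(dois):
    """Fetch each Scholar results page with a real browser (one worker's share)."""
    # Imported lazily so partition() works even without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # a real browser gets blocked far less than raw HTTP
    try:
        pages = {}
        for doi in dois:
            driver.get("https://scholar.google.com/scholar?q=" + doi)
            pages[doi] = driver.page_source  # "Cited by" links parsed downstream
        return pages
    finally:
        driver.quit()
```

Each worker then calls `scrape_chunk` on its own partition, so a block on one IP only stalls that worker's share of the data.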
Overall, being a part of this project taught me a lot. Along with data science and web scraping techniques, I gained real leadership skills: despite being the newcomer, I stepped up and took charge of organizing the team when I saw there was no clear direction. Working with my team and mentors has been an amazing experience!