
DATA

        Bottom: Attributing articles to authors  

The data was scraped from Google Scholar. The Google API provides access to many author and article features, but extensive cleaning and transformation were required to bring the data into a uniform, Python-readable format. Google Scholar does not link authors to articles by ID, so regex and pandas were used to build a dictionary of every possible variation of each author's name (following Google Scholar's citation conventions), and additional regex matching was used to link authors to articles. Authors' affiliations were categorically encoded, and their interests were vectorized with word2vec.
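The name-matching step described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the helper names (`name_variants`, `build_pattern`, `link_authors`), the sample authors, and the assumed citation style (initials followed by the last name, such as "JA Smith") are all hypothetical.

```python
import re

def name_variants(full_name):
    """Return plausible citation-style spellings of a full name (assumed conventions)."""
    parts = full_name.split()
    first, middles, last = parts[0], parts[1:-1], parts[-1]
    all_initials = first[0] + "".join(m[0] for m in middles)
    variants = {
        full_name,                  # Jane Alice Smith
        f"{first} {last}",          # Jane Smith
        f"{first[0]} {last}",       # J Smith
        f"{all_initials} {last}",   # JA Smith
    }
    return {v.lower() for v in variants}

def build_pattern(variants):
    """Compile one case-insensitive regex that matches any variant as a whole name."""
    alternatives = "|".join(
        re.escape(v) for v in sorted(variants, key=len, reverse=True)
    )
    # The lookarounds stop "J Smith" from matching inside "AJ Smith" or "J Smithson".
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)

# Hypothetical author table: id -> full name.
authors = {"a1": "Jane Alice Smith", "a2": "Robert Jones"}
patterns = {aid: build_pattern(name_variants(name)) for aid, name in authors.items()}

def link_authors(author_field):
    """Return the ids of known authors whose name variants occur in an article's author string."""
    return [aid for aid, pattern in patterns.items() if pattern.search(author_field)]

linked = link_authors("JA Smith, R Jones, K Lee")
```

Matching on initials is inherently ambiguous (two different people can both be cited as "J Smith"), so a real pipeline would likely need an extra disambiguation signal such as affiliation.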

Additional cleaning was required to create appropriate edges, because Google Scholar lists authors as their own co-authors. Edge weights were calculated so that authors who co-published multiple articles are joined by a single weighted edge. Lastly, several forms of deduplication were needed, as Google Scholar contained duplicate entries for both authors and articles.
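The edge-cleaning steps above could look something like the sketch below. It assumes each raw record is an `(article_id, author_id, coauthor_id)` tuple and that an edge's weight is the number of distinct articles the two authors share; both the record layout and the weighting rule are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical raw records: (article id, author id, co-author id).
records = [
    ("p1", "a1", "a1"),  # self-loop: author listed as their own co-author
    ("p1", "a1", "a2"),
    ("p2", "a2", "a1"),
    ("p2", "a2", "a1"),  # duplicate entry for the same article
    ("p3", "a3", "a2"),
]

def build_weighted_edges(records):
    """Collapse raw co-authorship records into one weighted, undirected edge per pair."""
    unique = set()
    for article, src, dst in records:
        if src == dst:
            continue  # drop self-loops
        u, v = sorted((src, dst))  # treat (a1, a2) and (a2, a1) as the same edge
        unique.add((article, u, v))  # the set removes duplicate article entries
    # Weight = number of distinct articles shared by the pair (assumed definition).
    return dict(Counter((u, v) for _, u, v in unique))

edges = build_weighted_edges(records)
```

Sorting the endpoints before counting is what guarantees a single edge per author pair, regardless of which author's profile the record came from.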

 

A consistency analysis of the data revealed that the co-author lists provided by Google Scholar were incomplete, with many co-authors missing.
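One way such a check could be run is to compare the co-authors listed for each author against the co-authors derived from article bylines. The sketch below assumes both are available as dictionaries mapping an author id to a set of co-author ids; the function name and sample data are hypothetical, and the original analysis may have been done differently.

```python
def missing_coauthors(listed, derived):
    """For each author, return co-authors seen on articles but absent from the listed co-authors."""
    gaps = {}
    for author, from_articles in derived.items():
        missing = from_articles - listed.get(author, set())
        if missing:
            gaps[author] = sorted(missing)
    return gaps

# Hypothetical inputs.
listed = {"a1": {"a2"}, "a2": set()}  # co-authors as listed on profiles
derived = {"a1": {"a2", "a3"}, "a2": {"a1"}, "a3": {"a1"}}  # co-authors found via articles

gaps = missing_coauthors(listed, derived)
```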

 
