-
Notifications
You must be signed in to change notification settings - Fork 1
/
notes.txt
112 lines (71 loc) · 6.09 KB
/
notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
Inspired by paper at:
http://advances.sciencemag.org/content/1/1/e1400005
where they look at the hiring network for CS, History, and Buisness.
http://www.wired.com/2015/02/academic-hiring-uphill-battle/
http://tuvalu.santafe.edu/~aaronc/facultyhiring/
http://www.wired.com/2015/02/infoporn-college-faculties-serious-diversity-problem/
http://networkx.github.io/documentation/networkx-1.9.1/reference/algorithms.html
Functions to write:
* Given an affiliation string, reduce it down to a simple unique identifier (e.g., 'Astronomy Dept, University of Washington, Box 23764, Seattle, WA' --> 'U Washington')
* Construct a graph object out of a bunch of papers.
* Trim graph down to only include papers by a single author
* Given the trimmed graph, construct the movement from intitution to institution, and label each step (grad school, postdoc, postdoc, long-term (prob faculty)).
* Need to re-write the matlab stuff in python: http://tuvalu.santafe.edu/~aaronc/facultyhiring/
*hmm, hard to tell how we can replicate the gender stuff. Could try and guess a gender for each author.
So I can do things in ADS labs like:
bibstem:"PhDT." year:2010
to get the list of phd's from a given year.
1) So I'd like to go through and grab the name and affiliation of each thesis.
1.5) chop down to only US institutions
2) then for each name, do a search and compute h-index and publication years and affiliations. record each year there's a change in affiliation. Need to solve the problem of people with common names. Make a graph object for each person--nodes are papers. Make edges between nodes that share identical co-authors, or identical year+institution. Maybe edge for sharing N-uncommon words in title? Can use networkX and has_path. Looks like ADS already does some paper network stuff. Pretty cool, but still tangles up some Williams, B.
Then we can run the MCMC to see how the institutions are related!
Can also look at how different phD classes do for h-index distribution.
Can look at # still publishing as a function of time and PhD class.
Can compute an estimate of young researcher attrition as anyone who publishes only 2-6 years after their phd.
So fun things to find out:
1) Can I just get a data-dump of the astronomy database? Why not have anual data releases for crazy people like me? 2-million records can't be that big, especially if you strip out the abstract text.--nope, publishers would never agree, but the API should be plenty!
2) Is there a way to connect besides the web and the perl script?
3) Any good tools for separating authors with the same name? Shouldn't it be possible to cloud-link these things together and find the group I want?
-------
things I'd like to see from the data:
1) what's the ranking of different depts/observatories/labs? Any major changes over time?
2) How long do people continue publishing after their phd? Can we see a change in the fraction of people leaving the field over time?
another fun thing would be to look at h-index (or similar) as a function of time. Is the long postdoc period necissary, or can we spot the productive astronomers early?
Another good thing to look at, what do the publication records of people look like when they get hired? Need to be careful that we don't use a crazy h-index, but could even grab a list of names from the rumor wiki and then say what their record looked like right before they got hired.
-----
It occers to me a potentially faster way to do things would be to search by institution over a set of years, pull the authors publishing there, and then decide which of those are probably faculty...
One of the possibly most painful parts will be converting affiliation strings to a simple affiliation. Might just have to manually make a giant dict.
Actually, not so bad b/c we can get all the universities from the phd affiliations. That should make it pretty easy to start a dict that has keys of string snippet and values of institution:
instituteDict = {'UCLA': 'UCLA',
'Univerisity of California, Los Angeles': 'UCLA'
'MIT': 'MIT'
'Massachusets Institute of Technology':'MIT'}
ObservatoryDict = {'Lowell':'Lowell',
'Lick':'Lick',
'KPNO':'KPNO',
'Kitt Peak National Observatory':'KPNO',}
------
Got key: in ~/.ads/dev_key
Some handy examples of the ads API at: https://github.com/adsabs/adsabs-dev-api
clearly need to use the client app here: https://github.com/andycasey/ads
For the hire network, just need name, phd institution, final instituiton
But also want to get path (so we can see people at different times)
So, after trimming down to the papers by an author:
find the unique inst,year combinations.
columns for US PhD Astronomers:
ID, name, phd year, phd institution, 1st paper year, 1st 1st author paper year, last paper year, last 1st author year, h-index, Final institution, years at final inst.
Maybe another table for institution full path?
author ID, inst-1, inst-1 year start, inst-1 year end, inst0, inst0 year start, inst0 year end, ... inst7, inst7
How to take a search on author name and get the data I want:
Search on name and 4(?) years after phd. How to decide if paper belongs to the same author? Make a network, each paper is a node,
*link nodes if they share an affiliation within +/- a year
*link nodes if they share a co-author
* others? if they cite anything?
Pull all the papers that connect to the thesis that started the seach
So, as usual, ADS has a very nice "paper-network" tool for looking at links and groupings of papers:
https://ui.adsabs.harvard.edu/#search/q=author%3A%22Yoachim%2C+P%22&sort=date+desc/paper-network
so now I just need to figure out if the API will let me download that network, then I'm almost done!
Then I don't have to roll my own silly graph network with networkX or anything https://networkx.github.io/index.html
---------
Fun way to spin the paper might be, ranking astro phd depts via where their grads end up (like the other paper), and then also by how fast their grads leave the field. I suspect they would be similar, but possibly different.
Probably need to do some screening on making sure the author is actually an "astronomer", and not a kinda-related physics phd.