

GOTO ranking made affordable with MAG


We are living in a modern society where performance is quantified to inform our actions and decisions. This is true in business, where companies are often measured by their revenues, earnings, and growth. It is also true in our daily lives, where fans obsess over the stats of star athletes and sports teams, and we check online reviews and “likes” before trying out a new restaurant or making a purchase. Naturally, the academic world is also focused on measurement. Some of these measures are relatively low-stakes, such as those used by university librarians to decide which journals and magazines to subscribe to. Others, however, are consequential, such as those used to decide whether a faculty member should earn tenure or which research areas should see a budget increase (or cut!). Scientists have treated the question of designing and applying proper metrics to characterize scientific activities with the same rigor as they would approach any other topic.

Unfortunately, many commercial products and services do not employ standards as rigorous as those of the scientific community when they try to measure academic effectiveness. For example, the faculty of Rutgers University raised serious concerns and disputes with their administration, which had spent $500,000 on a data mining firm that produced faulty reports on their research outputs, and the Computing Research Association (CRA) recently issued a strongly worded rebuke of the university rankings put forth by U.S. News and World Report. One recurring theme in these incidents is that commercial entities often underestimate the effort it takes to gather the high-quality data on which they base their analyses. Additionally, the methods and algorithms they use to derive analytics from their proprietary datasets are often shrouded in secrecy, making the results difficult to understand and trust, especially when users can observe obvious errors.

It is against this backdrop that a taskforce at the last CRA meeting proposed a principle that should be the bedrock of any academic assessment. It is playfully called GOTO, which stands for Good and Open data with Transparent and Objective processes and methodologies. A viewpoint article on this subject has been published in the July 2019 issue of Communications of the ACM, and two proof-of-concept websites implementing the GOTO principle to assess universities based on their computer science programs have been created (gotorankings.org).

For Microsoft, the four tenets of the GOTO principle are second nature, as reflected in the Microsoft Academic project we started five years ago. To ensure we have good data, we teamed up with our partners in Bing to apply best-in-class machine reading technologies to process the entire web index and produce the Microsoft Academic Graph (MAG). MAG provides fresh, accurate, and comprehensive coverage of scholarly communications. Peer-reviewed studies published in Scientometrics and the Journal of Informetrics suggest we are on the right track: MAG, although curated from the web by machines, compares favorably to other datasets created directly from publisher data. To promote open collaboration, we distribute MAG under the Open Data Commons Attribution license (ODC-BY), which encourages mining, redistributing, and improving the dataset as appropriate, including building commercial products and services on top of it. To promote transparent and objective analytics, we have published our source code, some in U-SQL for Azure Data Lake and some in Python for Spark, to GitHub; the links to these resources are included in our MAG documentation. We hope that, by publishing these scripts, the results shown on the Microsoft Academic website can be precisely understood, reproduced, and even adapted to other purposes.

To drive the point home, let us take a look at the issue of university ranking that prompted CRA’s blistering critique. The script shown in last week’s blog can be slightly modified to evaluate institutions rather than individual researchers. Since most commercial reports rank universities by high-level fields of study (for example, “best computer science/business schools”), we also analyze institutions based on the 18 top-level fields of study in MAG. The U-SQL code is as follows:

@affiliationPaperCitation =
    SELECT DISTINCT // authors may share the same affiliation
        (long) A.AffiliationId AS AffiliationId,
        A.PaperId,
        Q.FieldOfStudyId,
        R.DisplayName AS FosName,
        P.EstimatedCitation,
        P.Rank
    FROM @paperAuthorAffiliation AS A
    INNER JOIN @papers AS P
        ON A.PaperId == P.PaperId 
    INNER JOIN @paperFos AS Q
        ON A.PaperId == Q.PaperId
    INNER JOIN @fos AS R
        ON Q.FieldOfStudyId == R.FieldOfStudyId
    WHERE A.AffiliationId != null // only consider known affiliations
        AND P.Year > 2008 // consider only past 10 years of publications
        AND R.Level == 0; // consider only top level fields
//
// Compute Paper Rank using EstimatedCitation
//
@affiliationPaperRankByCitation =
    SELECT
        PaperId,
        AffiliationId,
        FieldOfStudyId,
        FosName,
        EstimatedCitation,
        Rank,
        ROW_NUMBER() OVER(PARTITION BY AffiliationId, FieldOfStudyId ORDER BY EstimatedCitation DESC) AS PaperRank
    FROM @affiliationPaperCitation;
//
// Compute HIndex, Saliency and total citation count
//
@affiliationHIndex =
    SELECT
        AffiliationId,
        FieldOfStudyId,
        ANY_VALUE(FosName) AS FosName,
        COUNT(*) AS PaperCount,
        SUM(EstimatedCitation) AS CitationCount,
        // h-index: largest h such that the h-th most cited paper has at least h citations
        MAX((EstimatedCitation >= PaperRank) ? PaperRank : 0) AS Hindex,
        // saliency: per-paper saliency recovered from the MAG Rank attribute
        SUM(Math.Exp(-1.0*Rank/1000)) AS Saliency
    FROM @affiliationPaperRankByCitation
    GROUP BY AffiliationId, FieldOfStudyId;
//
// Look up the Affiliation Name
//
@affiliationStats =
    SELECT
        A.DisplayName AS AffiliationName,
        FosName,
        H.PaperCount,
        H.Saliency,
        H.CitationCount,
        H.Hindex
    FROM @affiliationHIndex AS H
    INNER JOIN @affiliations AS A
    ON  H.AffiliationId == A.AffiliationId;

OUTPUT @affiliationStats
TO @outStream
ORDER BY FosName ASC, Saliency DESC, AffiliationName ASC, Hindex DESC
USING Outputters.Tsv(quoting : false);

Again, as described in last week’s post, we first execute CreateFunctions.usql to simplify access to MAG:

@paperFos = PaperFieldsOfStudy(@uriPrefix);
@fos = FieldsOfStudy(@uriPrefix);
@papers = Papers(@uriPrefix);
@paperAuthorAffiliation = PaperAuthorAffiliations(@uriPrefix);
@affiliations = Affiliations(@uriPrefix);

The script generates the results shown on the Microsoft Academic Institution Analytics page, where institutional rankings can be viewed with various metrics and over several time periods. Note that the script above restricts the analysis to academic impact over the past 10 years only. Again, like the examples shown previously, the script, which ranks all of the more than 25,000 institutions in MAG across the 18 top-level fields, is fast (12 minutes and 39 seconds) and quite affordable (U.S. $1.35):

MAG: Azure cost for ranking institutions in 18 fields of study
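
If a different time window is of interest, only the year filter in the first statement needs to change. A minimal sketch (the cutoff year below is just an illustrative value; the rest of the statement stays the same):

    WHERE A.AffiliationId != null // only consider known affiliations
        AND P.Year > 2013 // e.g., roughly the past 5 years instead of 10
        AND R.Level == 0; // consider only top level fields

Rerunning the script with different cutoff years should produce per-period views similar to those offered on the analytics page.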

However, if you explore the analytics pages further, you may notice that institution rankings are much more nuanced and fluid than most commercial rankings would lead you to believe. Specifically, if you drill into the fields-of-study hierarchy, you can see that where an institution stands in the rankings can vary dramatically. Often, this reflects the fact that institutions have strategic focus areas: just because institution A ranks lower than institution B in one field does not mean A will rank lower than B in all of that field’s subfields. In other words, without carefully accounting for academic specialization in fields of study, university rankings are more likely to be meaningless and misleading than helpful. MAG can help avoid this issue by allowing you to compute the rankings on all 18 fields and their 660,000 subfields. The script above can be modified to do so by relaxing the conditional clause “AND R.Level == 0” in the first statement, as sketched below. Obviously, the amount of computation grows tremendously, but the Azure bill you will incur, at $3.80, is still less than an average visit to Starbucks:

MAG: Cost for ranking institutions for 18 fields and all their subfields
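
Concretely, the modification amounts to dropping the level restriction from the WHERE clause of the first statement; a sketch of the relaxed clause (everything else in the script is unchanged):

    WHERE A.AffiliationId != null // only consider known affiliations
        AND P.Year > 2008; // past 10 years; no R.Level filter, so fields at every level are included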

With MAG making GOTO analytics so affordable, there is really no good reason not to do the right thing.

Happy researching!