I worked with the University of Minnesota Libraries on a pilot data curation program, which I have discussed before. The article about the project, Preserving Data for Future Research, is now online in their Continuum magazine. There is a rare photo of me smiling.
“This is a service that the Libraries can provide and nobody else on campus is currently providing,” said Lisa Johnston, a University of Minnesota librarian, who also is Co-Director of the University Digital Conservancy. Johnston is working on a plan to meet the federal mandate.
“This is just a new type of resource that we will be providing,” she said. “It’s a natural extension of library services.”
Johnston led a pilot data curation project last year that involved faculty members, researchers, and students representing five different data sets. The project leveraged the Libraries’ existing infrastructure, the University Digital Conservancy, the institutional repository for the University of Minnesota (conservancy.umn.edu).
“Feedback from the faculty in the pilot was very positive and anticipated that this service might satisfy the upcoming requirements from federal funding agencies,” Johnston said. Now she’s working toward building a repository for the campus, which may be open for business later this fall.
“University libraries are the natural repository for research conducted at a particular university,” said David Levinson, professor in the Department of Civil Engineering. Levinson – who conducts research in the area of infrastructure, particularly transportation infrastructure – currently maintains some of his research data on his office desktop computer.
“I won’t be here in 20 years; I’ll be retired. What will happen to the data sets when I retire?” he asks. “What if someone forgets to migrate it?”
Levinson was involved in the pilot study. He called it a “step in the right direction, but it’s a baby step,” citing potential lack of resources and compliance as two challenges to a fully functioning data curation repository.
“You could probably have one librarian for every department at the University … who could have a full-time job collating and collecting the data for that department each year,” he said, noting that a funding model has not yet been established. He adds, “[The funding] should come from the grants.”
So, why is it important for publicly funded research data to be preserved?
“First of all, the data is oftentimes unique, you could never recreate it,” Johnston said. “It’s also very expensive. And what do you get out of it? One, two, five papers? You could instead make that underlying research data available so that other researchers can take a look at the data, re-analyze it and come up with new results – perhaps competing results, perhaps validating results.”
Levinson agreed, saying that Libraries already have the infrastructure, the resources and the tools to not only preserve the data but to make it “findable” by the public.
“There’s 7 billion people in the world – most of whom don’t want to use my data – but a couple of whom might. And they might not know that the data exist if it’s just sitting on my computer,” he said. “Putting it out into a standardized, findable public forum makes it easier for them to: A) Know that the data exists; and B) Actually get at the data.”
Academia should be faster than it is. This especially applies to the transportation and planning journals with which I am familiar. It often takes more than a year to do less than 8 hours of work (reviews and editorial decision-making).
Some peer reviewed journals (which will remain nameless in this post) are best described as Black Holes. Articles are submitted to be reviewed and never escape. There may or may not be an acknowledgment of receipt. The paper may or may not have been sent to reviewers. The reviewers may or may not have acknowledged receiving the paper in a timely fashion. They may or may not conduct the review in a timely fashion. Some reviewers might do their job, but the editor may be waiting on the slow reviewer before making a decision.
There are several causes for this black hole:
- Authors – Why would you be foolish enough to place your trust in an editor you don’t know? But of course, for graduate students and tenure track faculty, what choice do you have when your career is determined by success in the publication game? In this game, the author is in general the supplicant. If the author is famous, the situation might be reversed, and the editor should be seeking the author’s paper to make the journal stronger, but given there are well over 1 million peer-reviewed journal articles published each year (and I guess twice that many submitted), and only 20,000 Scopus-indexed journals, the average journal has leverage over the average author.
- Reviewers – Why would I do free labor for a stranger (the Editor) for a community I don’t know (prospective readers) to help an anonymous person (the Author)? Why would I do it quickly?
  - The noble answer is to stay on top of cutting edge research.
  - A plausible answer is the opportunity to ensure your own work is properly referenced. This might appear sketchy to you as an author when reviewers say cite X, Y, and Z (and so reveal themselves), yet you will still cite these works in the revised manuscript, and it looks perfectly natural to the reader. This motivates the reviewer and raises the citation rate of the reviewer’s own works.
  - Another answer is to accumulate social capital.
Where exactly do I redeem the social capital I am accumulating? Where is the social capital bank?
Editors write promotion (or immigration!) letters in support of good, quick, helpful reviewers. Editors might more favorably view the papers of helpful reviewers. Editors might more favorably review proposals of helpful reviewers. Editors might be more likely to nominate good reviewers for awards. Editors might nominate good reviewers to an Editorial Advisory Board and bestow upon them some prestige. The reviewer might be an editor elsewhere and be able to “return the favor”. But all of this is probabilistic and a bit vaporous. Journals sometimes publish lists of reviewers. In any case, a (self-reported) count of reviews by journal is one of the beans that is counted in the promotion and tenure process.
- Editors – What leverage do I have over unpaid labor (reviewers), and why should I care personally about ungrateful authors who have submitted an unready paper to my journal, which will almost inevitably not be accepted in the first round? The leverage is the future favors I might bestow to advance potential reviewers (see above). This suggests Editors should favor graduate students and assistant professors as reviewers over full professors. Unfortunately, full professors are more famous and more likely to be selected to do reviews. I am personally running at a rate of about 100 review requests per year now. If I were really famous, I would need to decline far more than I do now. If I were really, really famous, I would not have time to decline requests (or perhaps I would have staff decline requests for me).
So there is a social network at play in this process, and if any link breaks between author and editor, between editor and reviewer, or back from reviewer to editor or editor to author, the circuit is not complete: the paper has entered the system and, like light from a black hole, cannot escape.
This is one reason I like journals that have systems to automatically track publication status and nag reviewers, and that have quick turnaround times. This is one reason I like the idea of “desk reject”. It is much better to be rejected immediately than after 6, 9, 18, or 24 months of review. Fast has value.
There is a second black hole, not quite as large, dealing with accepted papers that have yet to be formatted for publication. This is usually solved by an online “articles in press” or “online first” section of the journal website. The advantage to the journal of this is the ability for papers to accumulate citations before they are actually “published”, thereby gaming the ISI impact factors, which look at the number of citations in the first 2 years from publication.
The problem with a 2-year window when journals are slow is apparent. I cite only papers published before I submit my paper. If it takes 2 years to accept and publish, my paper will not cite anything from the two years before it appears. Therefore slow fields have lower impact factors than fast fields. This feeds the notion (in a positive feedback way) that these fields are sleepy backwaters of scientific research rather than cutting edge fields where people care about progress.
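A toy calculation makes the lag effect concrete. This is only a sketch under stated assumptions: the reference ages below are invented, time is counted in whole years, and a citation is taken to count toward a 2-year impact factor when the cited paper is 1 or 2 years old in the year the citing paper is published.

```python
def share_in_window(ref_ages_at_submission, lag_years, window=(1, 2)):
    """Fraction of a paper's references that fall in the impact-factor window.

    ref_ages_at_submission: age in years of each cited paper when the
        citing paper was submitted (hypothetical data).
    lag_years: years between submission and publication of the citing paper.
    """
    lo, hi = window
    # By publication time, every cited paper has aged by the review lag.
    counted = [a for a in ref_ages_at_submission if lo <= a + lag_years <= hi]
    return len(counted) / len(ref_ages_at_submission)


# Hypothetical reference list: ages (in years) of ten cited papers at submission.
ref_ages = [0, 1, 1, 1, 2, 2, 3, 4, 5, 8]

for lag in (0, 1, 2, 3):
    print(f"lag {lag} years: {share_in_window(ref_ages, lag):.0%} of references count")
```

With this made-up reference list, the share of references that count falls from 50% with no lag to 40%, 10%, and 0% as the lag grows to 1, 2, and 3 years. The identical reference list contributes less and less to anyone's impact factor purely because of review time.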
To break the black holes I have a couple of ideas:
- A “name and shame” open database (or even a wiki) which tracks article submissions by journal, so that authors have a realistic assessment of review, and possibly re-review and publication times. Also the amount of time in the author’s hands for revision would be tracked.
- Money to pay reviewers and editors to act in a timely fashion and publication charges to finance open scholarly communication. A few journals pay reviewers. When I get one of those, I am far more likely to review quickly than when I get requests from other journals, especially for journals outside my core area, especially when the likelihood of withdrawing social capital is minimal. Other journals charge authors and use the funds to speed the process (but as far as I know these journals don’t pay reviewers). Of course, we need to be clear to avoid “pay to play”. Libraries could help here, redirecting funds from the traditional subscription model to a new open access model, helping their university’s authors publish in truly open access journals. The new federal initiative will hopefully tip the balance.
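As a rough illustration of the first idea, here is a minimal sketch of what a record in such a tracking database might hold and how turnaround could be summarized by journal. The field names, journal names, and dates are all hypothetical; no such database is described in detail here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class Submission:
    journal: str
    submitted: date
    decided: Optional[date] = None   # None means still in the black hole
    days_with_author: int = 0        # time spent on revisions, tracked separately


def median_journal_days(submissions, today):
    """Median days each journal held a paper, excluding time in the author's hands.

    Undecided papers are counted up to `today`, so their figure is a lower bound.
    """
    waits = defaultdict(list)
    for s in submissions:
        end = s.decided or today
        waits[s.journal].append((end - s.submitted).days - s.days_with_author)
    return {journal: median(days) for journal, days in waits.items()}


# Hypothetical entries
records = [
    Submission("Journal A", date(2013, 1, 1), date(2013, 1, 21)),
    Submission("Journal A", date(2013, 1, 1), date(2013, 1, 31), days_with_author=6),
    Submission("Journal B", date(2013, 1, 11)),  # no decision yet
]
print(median_journal_days(records, today=date(2013, 1, 31)))
```

Separating `days_with_author` from the total keeps journals from being blamed for slow revisions, which is the distinction the proposal calls for.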
We all know the journal system as we have known it is unlikely to survive as is for the next 100 years. It is surprising it has lasted as long as it has, but academia is one of the last guilds.
There are lots of cool models out there beyond the traditional one in which the library pays for subscriptions to expensive journals: open access journals with sponsors (JTLU), author fees (PLOS ONE), membership (PeerJ), decentralized archives (RePEc), and centralized electronic archives (arXiv).
Yet we need some way of separating the wheat from the chaff, and peer-review, as imperfect as it is, has advantages over the open internet where any crank can write a blog post.
Eventually time will act as a filter, but peer-review, the review of papers by experts to filter out the poorly written, the wrong, the repetitive and the redundant, can save readers much time.
We have an entry in the Knight Foundation’s Knight News Challenge, which asks “How might we improve the way citizens and governments interact?”. Ours is OpenScheduleTracker. Please go there to read the details and “applaud”.
OpenScheduleTracker archives public transit schedules and provides an easy-to-use interface for understanding how schedules change over time, comparing different schedule versions, and identifying what areas are most affected by schedule changes.
What’s The Problem?
OpenScheduleTracker addresses three primary weaknesses in the way that transit system changes are currently reported and discussed:
1. Small changes are ignored
Public transit schedules evolve constantly, but we often focus only on big changes — new routes, new stations, line closures — and ignore small changes like schedule adjustments, frequency changes, and transfer synchronization. These small changes are not glamorous, but they can have a big impact on the way that a transit system meets or misses the needs of local communities.
2. Big changes are misunderstood
When a new bus route is added or a new rail station opens, the public discussion tends to focus on effects near the new facility: people want to know what’s happening “in my backyard.” These effects are important, but they are only part of the whole picture. Changes to transit systems have network effects which extend through the entire system: a new station in one neighborhood provides access to local opportunities for all users of the system.
3. Old schedules aren’t available for comparison
Analyzing schedule changes over time is often frustrated by the inconsistent availability of previous transit schedule versions. Transit operators’ policies for archiving historical schedule data vary widely, and even when schedules are archived the public often has access only to the current version. Public transit system schedules are significant investments of time, money, and expertise; when they are lost or inaccessible, the public loses the value of that investment.
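To give a flavor of what comparing schedule versions can surface, here is a minimal sketch that flags added routes, removed routes, and frequency changes between two archived versions. It is not OpenScheduleTracker's actual implementation: the data layout (route name mapped to departure times at one stop, in minutes after midnight) and the route names are assumptions made for illustration.

```python
def mean_headway(departures):
    """Average gap in minutes between consecutive departures (None if fewer than two)."""
    gaps = [later - earlier for earlier, later in zip(departures, departures[1:])]
    return sum(gaps) / len(gaps) if gaps else None


def compare_versions(old, new):
    """Report routes that were added, removed, or whose mean headway changed."""
    changes = {}
    for route in sorted(set(old) | set(new)):
        if route not in old:
            changes[route] = "added"
        elif route not in new:
            changes[route] = "removed"
        else:
            before, after = mean_headway(old[route]), mean_headway(new[route])
            if before != after:
                changes[route] = (before, after)  # (old headway, new headway)
    return changes


# Hypothetical schedule versions: departures in minutes after midnight
old_version = {"A": [360, 375, 390, 405], "B": [360, 370, 380]}
new_version = {"A": [360, 380, 400], "B": [360, 370, 380], "C": [365, 395]}

print(compare_versions(old_version, new_version))
```

Here route A quietly goes from a bus every 15 minutes to one every 20, exactly the kind of small change that rarely makes the news, while route C shows up as new. None of this is possible unless the old version was archived.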
Andrew Owen will represent the Nexus group at the FOSS4G (Free and Open Source Software for Geospatial – North America 2013) conference happening in Minneapolis, May 22-24.
FOSS Experiences in Transportation and Land Use Research
Andrew Owen, University of Minnesota Nexus Research Group
The Nexus Research Group at the University of Minnesota focuses on understanding the intersections of transportation and land use. In this presentation, we will examine case studies of how open source geospatial software has fit into specific research projects. We will discuss why and how open source software was chosen, how it strengthened our research, what areas we see as most important for development, and offer suggestions for increasing the use of open source geospatial software in transportation and land use research. Over the past two years, we have begun incorporating open source geospatial data and analysis tools into a research workflow that had been dominated by commercial packages. Most significantly, we implemented an instance of OpenTripPlanner Analyst for calculation of transit travel time matrices, and deployed QGIS and PostGIS for data manipulation and analysis. The project achieved a completely open research workflow, though this brought both benefits and challenges. Strengths of open source software in this research context include cutting edge transit analysis tools, efficient parallel processing of large data sets, and default creation of open data formats. We hope that our experience will encourage research users to adopt open source geospatial research tools, and inspire developers to target enhancements that can specifically benefit research users.
Akamai: State of the Internet Report [Comment: It's not faster than last year, because, like roads, it is not rationed or priced properly]
Tim Lee @ Ars: Why bandwidth caps could be a threat to competition: “Since the first dot-com boom, unmetered Internet access has been the industry standard. But recently, usage-based billing has been staging a comeback. Comcast instituted a bandwidth cap in 2008, and some other wired ISPs, including AT&T, have followed suit. In 2010, three of the four national wireless carriers—Sprint is the only holdout—switched from unlimited data plans to plans featuring bandwidth caps.”
Tom Vanderbilt @ The Wilson Quarterly: The Call of the Future : “Today we worry about the social effects of the Internet. A century ago, it was the telephone that threatened to reinvent society.” ["He is currently at work on You May Also Like, a book about the mysteries of human preferences."]
David Willetts @ The Guardian: The UK government is promising: Open, free access to academic research [Woot!]
Lynne Kiesling @ Knowledge Problem Be indomitable. Refuse to be terrorized. : “And to what end — how justified is this fear? High financial, human, cultural costs, to avert events that are one-quarter as likely as being struck by lightning. Some may criticize the performance of relative risk assessments between accidents and deliberate attacks, but it’s precisely these crucial relative risk assessments that enable us to recognize the unavoidable reality that neither accidents nor deliberate attacks can be prevented, and that to maintain both mental and financial balance we cannot delude ourselves about that, or give in to the panic that is the objective of the deliberate attacks in the first place. Thus the title of this post, which comes from two separate quotes from Bruce Schneier — the first from his excellent remarks at EPIC’s January The Stripping of Freedom event about the TSA’s use of x-ray body scanners, the second from his classic 2006 Wired essay of the same title:
The point of terrorism is to cause terror, sometimes to further a political goal and sometimes out of sheer hatred. The people terrorists kill are not the targets; they are collateral damage. And blowing up planes, trains, markets or buses is not the goal; those are just tactics.
The real targets of terrorism are the rest of us: the billions of us who are not killed but are terrorized because of the killing. The real point of terrorism is not the act itself, but our reaction to the act.
And we’re doing exactly what the terrorists want.”
Reason Foundation – Out of Control Policy Blog > Airport Policy and Security Newsletter: Airport Security 10 Years After 9/11: “Although my airline friends will disagree, I’ve concluded that the cost of aviation security measures is somewhat analogous to insurance. If you engage in risky behavior (drive a sports car, live in a beach house, etc.) you expose yourself to higher risks, and you rightly pay somewhat more for the relevant kind of insurance. Likewise, while it’s not the fault of air travelers or airlines that aviation is a high-profile terrorist target, the fact is that it is. So from a resource-allocation standpoint, I think a sector-specific user-tax approach is less bad than having general taxpayers pay for this.” [and much other good stuff]
The Long Now Blog » The Archive Team – Long Views: The Long Now Blog: “One of our favorite rogue digital archivists, Jason Scott, has just posted a video of his talk at DefCon 19 about The Archive Team exploits. This is perhaps the most eloquent (and freely peppered with profanity) explanations of the problems inherent with preserving our digital cultural heritage. He also describes in a fair amount of detail what he and The Archive Team have been doing to help remedy the problem.” [On a related note, The Metropolitan Travel Survey Archive has had its funding re-upped for another year, so we have more archiving to do, hopefully under less stressful conditions than Jason Scott above]