Information extraction and the missing Mark2Cure module

In our previous post, we asked readers, 'What is your preferred moniker?' Here are the responses:

Mark2Curator: 36%
Citizen Scientist: 36%
Contributor: 18%
"Anything BUT volunpeer": 10%

Although it may seem a little strange that researchers have been struggling to answer the "What's in a name?" question for citizen science, this struggle is deeply representative of some of the important work biocurators do. "What's in a name? A citizen scientist by any other name still makes important contributions"

Researchers need a common vocabulary to be able to coherently exchange information, but settling on that vocabulary, and on how it is structured, is difficult. Without a common vocabulary, it is easy for scientists to miss research that is valuable to their field of study. It remains to be seen how the citizen science research community will settle this issue, but in biomedical research, biocurators help with that sort of determination. Biocurators help standardize terms and define the rules governing how terms are classified and organized. In doing so, they facilitate information quality control and exchange. Biocurators do all this and more.

Given that biocurators do very important, very tedious, and often very difficult work, one question we get quite a bit is:

"How is it possible to train citizen scientists to replace such important, skilled researchers?"

But this question is built on a fundamentally incorrect assumption about the goals of Mark2Cure. We KNOW biocurators do very important work, and that one of the most tedious and time-consuming things they do is information extraction.

Information extraction can generally be broken down into three tasks:
1. Named Entity Recognition (identifying and classifying words/phrases in text)
2. Normalization (linking that text to an ontology)
3. Relationship Extraction (identifying the relationship between different entities).
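
To make the three tasks above a little more concrete, here is a minimal Python sketch. It is only an illustration, not how Mark2Cure or biocurators actually do this work: the lexicon, the entity types, and the "EX:" concept IDs are all made up for the example, NER is done by simple dictionary lookup, and relationship extraction is crudely approximated by noting when a gene and a disease appear in the same sentence.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon: surface form -> (entity type, made-up concept ID).
# The "EX:" IDs are placeholders, not identifiers from any real ontology.
LEXICON = {
    "ngly1": ("gene", "EX:0001"),
    "n-glycanase 1": ("gene", "EX:0001"),
    "alacrima": ("disease", "EX:0002"),
    "absence of tears": ("disease", "EX:0002"),
}


def recognize_entities(sentence):
    """Task 1 (NER): find and classify known phrases in one sentence."""
    mentions = []
    for term, (entity_type, _) in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            mentions.append((match.start(), match.group(0), entity_type))
    return sorted(mentions)  # ordered by position in the sentence


def normalize(surface):
    """Task 2 (Normalization): link a mention to a concept ID."""
    return LEXICON[surface.lower()][1]


def extract_relations(text):
    """Task 3 (Relationship Extraction), crudely approximated here by
    pairing a gene and a disease mentioned in the same sentence."""
    relations = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        mentions = recognize_entities(sentence)
        for (_, a, type_a), (_, b, type_b) in combinations(mentions, 2):
            if {type_a, type_b} == {"gene", "disease"}:
                gene, disease = (a, b) if type_a == "gene" else (b, a)
                relations.add(
                    (normalize(gene), "co-occurs with", normalize(disease))
                )
    return relations
```

In this toy setup, "NGLY1" and "N-glycanase 1" are different strings in the text, but normalization links both to the same (made-up) ID, which is what lets mentions from different abstracts be counted as the same concept.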

We want to train citizen scientists to help with this task, so that biocurators can apply their unique training towards solving problems in biomedical research analogous to the ones we're seeing in the citizen science field.

Since Mark2Cure is a citizen science project, the "What's in a name?" issue applies to us as well. Although our informal poll was only for fun, I was personally very happy with the results for two reasons:

1. I am a fan of wordplay, and I love that many users liked the term Mark2Curator, which blends Mark2Cure and biocurator. I love science puns.

2. Even if I'm reading too much into it, I like to think that our users picked 'citizen scientist' or 'contributor' because they feel that the help they provide to Mark2Cure is important. And it is.

If you've gotten this far, you are probably one of our many astute readers and may have noticed that information extraction was divided into THREE tasks, when Mark2Cure only has TWO. Where is the third task? And why is the missing task the step in between the first and the last?

The missing task, 'Normalization', is the task in between NER and Relationship Extraction. We started with NER because NER has been well investigated, so there was a solid foundation for us to build upon. We followed with the Relationship Extraction task because this would allow us to unlock some of the most valuable and hardest-to-access information in the text.

As for the Normalization task...it's currently being built by volunteers. Mark2Curators have been helping us investigate NER mappings to different ontologies, and a very talented programmer and machine learning expert has been busy building the Normalization module. But we could use more help. We need feedback on potential interfaces for how parts of the module might work. If you'd like to help with that, answer the poll in our newsletter.

Of note for our U.S.-based Mark2Curators over 65 years of age.

Did you know? The US National Park Service has a lifetime pass for seniors that will allow you to enter or park at US national parks for free or at a discounted rate. These passes cost only $10 now through August 27th. Starting August 28th, the price will go up to $80.

If you enjoy hiking, nature, or plan to visit any of our beautiful national parks, you may want to get your pass while it's still $10. In San Diego, the closest national park where you can purchase one in person is Cabrillo. To find the national park closest to you, visit the NPS's site. If you don't live near a park, but plan on visiting some in the future, you can purchase a pass by mail or online.

Happy Father's Day!

A HUGE thanks to all the dads (and EVERYONE) who have been contributing to make a difference for the NGLY1 families.

Shipping delays

Apologies to international prize and drawing winners who have been waiting for their prizes. Most of the international packages that we shipped out in May/June have been returned to us due to customs issues. Fortunately, this happened at some point prior to shipping, so the postage on these is still good. Unfortunately, it took a long time for the packages to get back to us so that we could address the issue. We’ll be trying again to get these out ASAP.

Max’s original project slam now online

As mentioned in our previous newsletter, Max delivered the project slam for Mark2Cure at the Citizen Science Conference in Minnesota. The project slam talks were supposed to have been recorded and may still be released by the Citizen Science Association someday, but we couldn’t wait. Here’s our recording of Max’s project slam. He finished within his allotted four minutes, and was engaging enough to win one of three invitations to deliver an even shorter version of the slam at an event the following day.
You can check it out here: https://www.youtube.com/watch?v=7kxlhhFLdmM

You be the scientist!

One thing we’ve heard (and quite agree with) at the Citizen Science Conference is that trained volunteers are capable of doing more than simple tasks. Mark2Curators have very much fed into the tutorial process, and played an important role in testing and improving the design of the interface. The entities our users have identified from the text have already yielded interesting clues which we’ve used to expand the set of documents to investigate, and by now, there are users who have read a lot of abstracts. A LOT! If you’ve read something that sticks out in your mind as being potentially related to NGLY1-deficiency, share it with us! We’d love to hear YOUR hypothesis on what might be an interesting term to explore and why.