Open Standards – SteinBlog

Job offer: Bioinformatician in Metaspace project for untargeted spatial metabolomics

Christoph Steinbeck — Mon, 27 Apr 2015 15:54:35 +0000

We are looking to recruit a talented bioinformatician to work within the MetaboLights team at the European Bioinformatics Institute (EMBL-EBI) located on the Wellcome Trust Genome Campus near Cambridge in the UK. You will work closely with a consortium of 8 European partners on the METASPACE project.

METASPACE will enable untargeted spatial metabolomics for translational research and clinical applications by providing novel bioinformatics tools, and will demonstrate their potential using several case studies relating to personalised health, precision medicine and quality of life in chronic afflictions.

You will work with other bioinformaticians, domain experts and software engineers on the development of novel database-driven spectral and spatial algorithms, machine learning approaches for multiple-mass fingerprinting, web services and more. You will also help coordinate outreach and training, both online and face-to-face, in collaboration with EMBL-EBI’s professional outreach and training team.

The EBI is part of the European Molecular Biology Laboratory (EMBL) and it is a world-leading bioinformatics centre providing biological data to the scientific community with expertise in data storage, analysis and representation. EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry. We are part of EMBL, Europe’s flagship laboratory for the life sciences.

Please submit your application via the EMBL job site.

Open Position for a Scientific Database Curator/Annotator in ChEBI team

Christoph Steinbeck — Wed, 11 Apr 2012 08:23:42 +0000

An early chemistry lab

We are looking for a Scientific Database Curator/Annotator to work on the ChEBI (Chemical Entities of Biological Interest) project within the Cheminformatics and Metabolism Team. The position is based at the European Bioinformatics Institute (EMBL-EBI) located on the Wellcome Trust Genome Campus near Cambridge in the UK.

The successful applicant will work on the existing ChEBI database, in particular on extending it towards better coverage of natural products and metabolites. (S)He will also annotate entities as requested by our users.

The Cheminformatics and Metabolism Team conducts research and creates community resources for cheminformatics and metabolism research. We are a friendly, multi-national team with collaborations all around the world. The aim of the ChEBI project is to provide a high-quality, hand-curated chemistry database to the biomedical community, to standardise chemical and biochemical terminology across biological databases and to develop a chemical ontology.

Qualifications and Experience

Applicants should have a degree in Biochemistry/Chemistry and ideally some years of professional experience. Knowledge of natural products and metabolites would be a definite plus. Self-motivated candidates with good communication skills and ambition to work in an international field are invited to apply.

Application Instructions

Please apply online through www.embl.org/jobs

Additional Information

EMBL is an inclusive, equal opportunity employer offering attractive conditions and benefits appropriate to an international research organisation.

Please note that appointments on fixed term contracts can be renewed, depending on circumstances at the time of the review.

Note that special visa requirements apply to employees from non-EU countries working at EMBL-EBI in the UK. The period of work does not qualify them for the Highly Skilled Migrants Programme.

CDK-Taverna paper published

Christoph Steinbeck — Mon, 29 Mar 2010 07:31:36 +0000

CDK-Taverna workflow

We are glad to announce that our article about CDK-Taverna, an open workflow solution for cheminformatics, is now online on BMC Bioinformatics. CDK-Taverna, which lives at http://www.cdk-taverna.de/, features more than 160 workers for various tasks in molecular informatics.

The workflow paradigm allows scientists to flexibly create generic workflows using different kinds of data sources, filters and algorithms, which can later be adapted to changing needs. In order to achieve this, library methods are encapsulated in Lego(TM)-like building blocks which can be manipulated with a mouse or any pointing device in a graphical environment, relieving the scientist from the need to learn a programming language. Building blocks, so-called workers, are connected by data pipelines to enable data flow between them, which is why “pipelining” is often used interchangeably with “workflow”.
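To make the worker-and-pipeline idea concrete, here is a minimal Java sketch. It is not the CDK-Taverna API: the class name, the two workers and the deliberately naive carbon counter are all made up for illustration. It only shows how small building blocks can be chained so that the output of one becomes the input of the next.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from CDK-Taverna.
class WorkflowSketch {

    // Worker 1: a filter that drops blank input lines.
    static final Function<List<String>, List<String>> DROP_EMPTY =
            lines -> lines.stream()
                          .filter(s -> !s.trim().isEmpty())
                          .collect(Collectors.toList());

    // Worker 2: maps each SMILES-like string to a carbon count.
    static final Function<List<String>, List<Integer>> COUNT_CARBONS =
            smiles -> smiles.stream()
                            .map(WorkflowSketch::countCarbons)
                            .collect(Collectors.toList());

    // The "pipeline": andThen feeds the output of one worker into the next.
    static final Function<List<String>, List<Integer>> WORKFLOW =
            DROP_EMPTY.andThen(COUNT_CARBONS);

    // Deliberately naive: counts the characters 'C' and 'c', so it would
    // miscount strings containing Cl, Ca and similar. Good enough for a demo.
    static int countCarbons(String smiles) {
        int count = 0;
        for (char ch : smiles.toCharArray()) {
            if (ch == 'C' || ch == 'c') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("CCO", "", "c1ccccc1");
        System.out.println(WORKFLOW.apply(input)); // prints [2, 6]
    }
}
```

Swapping a worker for another one with the same input and output types leaves the rest of the chain untouched, which is the property that makes such workflows easy to adapt later.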

ChEBI chemistry ontology development funded by BBSRC

Christoph Steinbeck — Tue, 19 May 2009 11:30:45 +0000

We received our official award letter from BBSRC Tools and Resources Fund today for the ChEBI ontology development grant. Needless to say, we are thrilled. We are now going to work together with Michael Ashburner’s group at the University of Cambridge to align ChEBI with other OBO Foundry ontologies by adoption of the Basic Formal Ontology (BFO) and the Relationship Types Ontology (RO). This will include extensive annotation of the ChEBI ontology required after adoption of BFO and RO. The adoption of the BFO will require a major reorganisation of the upper levels of the ChEBI ontology in order to allow it to align to the BFO. This reorganisation can only be achieved by manual annotation, although some semi-automatic means will be employed to aid the curator. In addition to the reorganisation of the upper levels, new relationships will be introduced semi-automatically, but as the ChEBI ethos requires that all data is manually checked to maintain ChEBI’s high standards of data quality, we expect a major annotation task. The project is funded for three years. Stay tuned. We’ll report on our progress on a regular basis.

ChEBI behind the scenes

Christoph Steinbeck — Fri, 08 May 2009 08:12:37 +0000

With ChEBI release 56 behind us, I thought I’d share some insight into how ChEBI is created and what we do to prepare a release. In recent years, the ChEBI team has on average consisted of two software engineers maintaining and improving the software and two to three curators doing the data entry and curation. It is remarkable that, by now, the question of which chemical compounds make it into ChEBI is completely community driven. Requests to enter compounds are submitted by users and other database maintainers via the ChEBI curator request tracker on SourceForge. Besides increasing the public knowledge of mankind, the biggest benefit and driving force for submitters is the assignment of a stable ChEBI identifier which can then be cited and linked to from other resources.

With ChEBI release 55 we introduced the new submission tool, which allows our submitters to create ChEBI datasets themselves. This a) gives our users more control over what they want to see in ChEBI and b) saves our curators some duplicate work.

In preparation for a release, here is what the ChEBI team does.

Create automatic cross-references to PubChem, UniProt, IntEnz, BRENDA, SABIO-RK, ArrayExpress, IntAct, Patents etc. These are all run a week before the release and are based on ChEBI identifier matching or text matching.
Annotation of entity of the month
Submissions deposited directly into the database by users are processed by our annotators.
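
As a rough idea of what identifier matching can look like, here is a small Java sketch. It is not the ChEBI production code. It assumes that identifiers appear in the other resource's text in the form "CHEBI:" followed by digits, and the sample record is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the ChEBI code base.
class ChebiIdMatcher {

    // Assumption: identifiers are written as "CHEBI:" followed by digits.
    private static final Pattern CHEBI_ID = Pattern.compile("\\bCHEBI:\\d+\\b");

    // Returns each distinct identifier found in the text, in order of first appearance.
    static List<String> findIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = CHEBI_ID.matcher(text);
        while (matcher.find()) {
            String id = matcher.group();
            if (!ids.contains(id)) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Invented record text, for illustration only.
        String record = "Participants: CHEBI:15377 and CHEBI:15422; see also CHEBI:15377.";
        System.out.println(findIds(record)); // prints [CHEBI:15377, CHEBI:15422]
    }
}
```

Text matching on names is a much fuzzier problem (synonyms, spelling variants) and is not covered by this sketch.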

On the release day:

Data is exported overnight into multiple formats, OBO format, SDF, Oracle data dumps and PostgreSQL/MySQL dumps.
Public web site updated with the entity of the month.
Statistics generated and stored.
Sitemaps are generated to be used by search engines like Google for indexing.
Finally data is deposited into PubChem and the EB-eye search engine is updated.

3rd International Biocuration Conference in Berlin

Christoph Steinbeck — Fri, 17 Apr 2009 07:24:52 +0000

Berlin Dahlem-Dorf tube station

I’m attending the 3rd International Biocuration Conference in Berlin, which looks like a pretty successful meeting in terms of numbers of participants. Seems like somewhere between 100 and 200 participants. It looks like the time for recognition of biocuration and curated biological resources has come. The International Society for Biocuration was inaugurated yesterday. People from publishing companies such as Nature are attending.

Janet Thornton, director of EBI, gave the opening keynote yesterday evening, rehearsing some of the history of biocuration and looking into the future of securing funding for biocuration through the Elixir project.

I’m now listening to Philip Bourne talking about “Changes in Scholarly Communication and the Potential Impact on Biocuration”. He talks, among many other things, about the author embedding semantic information into the original manuscript and introduces part of his own work with Microsoft on a plug-in for Word to do this enrichment.

There is nothing overly particular about this meeting but it strengthens my feeling that we are at the point where the time has finally come for the idea of preserving the information in the first place, in the scientific document. Both Dietrich’s semantic enrichment conference and this one were well attended by publishers – Elsevier and Nature were at both. This scientific document can then become both a scientific article and one or many database entries.

Another notion that has come up a couple of times is the question of reward for authors to make and submit semantically rich documents. One of the ideas is fast-tracking those documents – publishing them faster.

ChEBI at the Fall 2009 ACS meeting in Washington

Christoph Steinbeck — Fri, 27 Mar 2009 15:05:23 +0000

I’ve been invited to present our ChEBI ontology at the 2009 Fall Meeting of the American Chemical Society. Here is our abstract:

ChEBI – An open ontology for Chemical Entities of Biological Interest

Paula de Matos (1), Kirill Degtyarenko (2), Marcus Ennis (1), Janna
Hastings (1), Inma Spiteri (1) and Christoph Steinbeck (1)

(1) European Bioinformatics Institute, Hinxton, Cambridge, UK
(2) European Patent Office, The Hague, The Netherlands

Chemical Entities of Biological Interest (ChEBI) is a freely available, manually annotated resource providing data such as chemical nomenclature, an ontology and chemical structures. The ChEBI ontology imposes meaning onto the data according to four subontologies: molecular structure, application, biological role and subatomic particle. As a cheminformatics resource it provides chemical substructure and similarity searching using the Chemistry Development Kit (CDK). ChEBI annotates structures with various properties such as charge and mass and names including brand names and International Nonproprietary Name (INN). This extended coverage is complemented by manually annotated names appearing in Patents and Patent identifiers. In addition names can now appear in French, German, Latin and Spanish. Acting as a chemoinformatics portal to other bioinformatics resources, ChEBI has introduced automatically generated links to resources such as UniProtKB, IntAct, ArrayExpress, SABIO-RK or PubChem. ChEBI lives at http://www.ebi.ac.uk/chebi/, where it is also available for download in a variety of formats and accessible via web services.

Industry-funded medical research will double your impact factor

Christoph Steinbeck — Mon, 16 Feb 2009 09:00:52 +0000

The Guardian has a nice piece by Ben Goldacre reporting on a study published by the British Medical Journal entitled “Relation of study quality, concordance, take home message, funding, and impact in studies of influenza vaccines: systematic review”. Both the newspaper article and the study are worth reading and seem to be open. Besides many other interesting findings, the BMJ article finds that the journal impact factors for industry-funded studies of influenza vaccines are on average more than twice as high as those for purely academic studies (Impact Factor 8.78 vs 3.74). Both Ben and I find it quite likely that this is not limited to the study of influenza vaccines :-). Judge for yourself.

Creating and Reviewing Patches in the Chemistry Development Kit (CDK)

Christoph Steinbeck — Mon, 01 Sep 2008 17:13:34 +0000

In order to prevent major turbulence in the main source code development line of the Chemistry Development Kit (CDK), we decided a while ago to have separate branches in our Subversion source code management system for each developer and each of his subprojects. Once a project has been finalized by a developer in her branch, she would then publish a patch in the CDK patch tracker system and ask for it to be reviewed by posting to the CDK developers mailing list. A CDK senior developer would then assign the patch to himself or another senior developer.

I have just been assigned the task of reviewing the recent Iterator/Iterable patch for CDK and will document the process here for reference. The patch was published on the CDK patch tracker.

The executive summary of the reviewing task goes like:

browse the code
mark up code you think is buggy
note missing unit tests
note missing JavaDoc
warn for subjected PMD warnings
optionally note other problems
optionally any other comment you have

So, let’s see how it went:

Browse the Code

I got the gzipped archive with Egon’s patch and looked at the code. A large part of the changes involve

removing

    public Iterator<IIsotope> isotopes() {

and adding

    public Iterable<IIsotope> isotopes() {

to enable things like

    double overallCharge = 0.0;
    for (IAtom atom : molecule.atoms()) {
        overallCharge += atom.getCharge();
    }

In order to implement Iterable, one needs to have methods returning an Iterator, so a lot of code essentially implements those.
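
For readers who have not written one, this is roughly what backing an Iterable with an Iterator looks like. The sketch is self-contained and does not use any CDK types: `ChargeList` is a made-up stand-in for an atom container that only stores charges.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a container class; not a CDK class.
class ChargeList implements Iterable<Double> {

    private final double[] charges;

    ChargeList(double... charges) {
        this.charges = charges.clone();
    }

    // Iterable requires exactly one method: a factory for fresh Iterators.
    @Override
    public Iterator<Double> iterator() {
        return new Iterator<Double>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < charges.length;
            }

            @Override
            public Double next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return charges[next++];
            }
        };
    }

    // Any Iterable can be used directly in an enhanced for loop.
    static double overallCharge(Iterable<Double> atomCharges) {
        double sum = 0.0;
        for (double charge : atomCharges) {
            sum += charge;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(overallCharge(new ChargeList(1.0, -1.0, 0.5))); // prints 0.5
    }
}
```

The caller never sees `hasNext()` or `next()`; the enhanced for loop calls them behind the scenes, which is what makes the calling code so much shorter.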

Remove:

    public java.util.Iterator<IAtom> atoms() {

and add:

    public Iterable<IAtom> atoms() {
        logger.debug("Getting atoms iterator");
        return super.atoms();
    }

And then there is code actually using those iterators and all of these instances had to be adapted too (I’m just giving the patch syntax):

     for(IReactionScheme rm : scheme.reactionSchemes()){
    -    for(Iterator<IAtomContainer> iter = getAllMolecules(rm, molSet).atomContainers(); iter.hasNext(); ){
    -        IAtomContainer ac = iter.next();
    -        boolean contain = false;
    -        for(Iterator<IAtomContainer> it2 = molSet.molecules();it2.hasNext();){
    -            if(it2.next().equals(ac)){
    -                contain = true;
    -                break;
    -            }
    -        }
    -        if(!contain)
    -            molSet.addMolecule((IMolecule)(ac));
    -    }
    +    for (IAtomContainer ac : getAllMolecules(rm, molSet).atomContainers()) {
    +        boolean contain = false;
    +        for (IAtomContainer atomContainer : molSet.molecules()) {
    +            if (atomContainer.equals(ac)) {
    +                contain = true;
    +                break;
    +            }
    +        }
    +        if (!contain)
    +            molSet.addMolecule((IMolecule) (ac));
    +    }

Overall, the patch affected 288 classes including test classes, with almost 2000 lines of code changed.

Mark up code you think is buggy

Impossible to do for me for such a large bunch of changes, so one must rely here on the unit tests to work.

Note missing unit tests

Egon had posted some notes about comparing failing and passing between unit tests earlier but we also need an automatic check for unit test coverage. And yes, of course, there are limits to what such an automated coverage tool can do.

With regard to failing unit tests, the “iterable” branch did not have any more failures and errors than the head branch.

Note missing JavaDoc

We’ve got DocCheck results on our CDK nightly pages but nothing tells you whether a patched method is missing necessary JavaDoc. Presumably, we could “grep” the patched class names into a DocCheck input file and get customized info about it.
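
A first step towards this could be pulling the class names out of the patch itself. The following Java sketch assumes a unified diff in which each changed file is announced by a header line starting with "+++ " followed by the path, and it simply treats the file name without ".java" as the class name. How the resulting list would be fed to DocCheck is left open, and the paths in the example are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of the CDK build.
class PatchedClasses {

    // Collects the simple class names of all Java files touched by a unified diff.
    // Naive on purpose: an added source line that itself starts with "++ " would
    // also look like a header, and paths containing spaces are cut short.
    static Set<String> classNames(List<String> patchLines) {
        Set<String> names = new TreeSet<>();
        for (String line : patchLines) {
            if (!line.startsWith("+++ ")) {
                continue;
            }
            // Header looks like "+++ some/path/File.java<TAB>(working copy)".
            String path = line.substring(4).split("[\\t ]", 2)[0];
            if (!path.endsWith(".java")) {
                continue;
            }
            String file = path.substring(path.lastIndexOf('/') + 1);
            names.add(file.substring(0, file.length() - ".java".length()));
        }
        return names;
    }

    public static void main(String[] args) {
        // Invented patch excerpt, for illustration only.
        List<String> patch = Arrays.asList(
                "--- src/org/example/Foo.java\t(revision 1)",
                "+++ src/org/example/Foo.java\t(working copy)",
                "+    int x = 1;",
                "+++ src/org/example/BarTest.java\t(working copy)",
                "+++ build.xml\t(working copy)");
        System.out.println(classNames(patch)); // prints [BarTest, Foo]
    }
}
```

For a real patch file, the lines could be read with `java.nio.file.Files.readAllLines` before being passed to `classNames`.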

Warn for subjected PMD warnings

PMD is a tool for checking code with respect to adherence to certain coding standards. Again, the CDK nightly page contains all PMD reports on the CDK code, generated in nightly runs. The same can be achieved for each branch with an “ant -f pmd.xml” on your local copy of the branch.

Optionally note other problems

I love optional things and tend to let them be optional

Optionally any other comment you have

Dto.

So, overall I would like to conclude that according to the best of my knowledge, the Iterable patch should be safe and can be applied to the HEAD branch.

Linus on GIT on Google TechTalks

Christoph Steinbeck — Tue, 26 Aug 2008 10:31:58 +0000

I’m a big fan of Google TechTalks and watch a lot of them during flights. This week I enjoyed the recording of Linus Torvalds insulting all kinds of people including the whole SVN development team while introducing his distributed source code management system GIT. Egon had pointed me to GIT quite a while ago but seeing Linus himself discuss the issue made a difference. While CDK is still considerably smaller than the Linux kernel, I can see a lot of commonalities and I think that with our current practice of having our fellow coadmins review important patches and branches, GIT sounds like a much easier way to do it.

In GIT the source code is distributed – there is no concept of a central source repository. Developers commit their changes to their local GIT systems, with all the advantages of versioning and source code history. Other developers pull code from you if they think that the changes you’ve advertised via your favourite communication channels are interesting. In theory, this allows for a very democratic and evolutionary code development. In addition to being distributed, GIT seems to be very fast when it comes to merging. Linus reports that he does hundreds of full merges per day and nothing takes longer than 5 secs.

In practice, as Linus points out in his talk, there will always be one or very few repositories that people pull from – for the Linux kernel it will be Linus’s machine. In CDK it will very likely be Egon’s. Sorry Egon, you’ve got to be online all day.

The last sentence already brings me to the point. I wonder if we should give GIT a try for CDK development. The advantages do sound enormous. Ok, there are disadvantages too, such as losing the central web browsing of the SVN repository on SF. There may be ways around this, as Egon described here, but this seems like not using the real thing.

This is a brief impression dump after watching Linus’ talk today and I’m happy to hear your opinions.