Improve quality of automatic metadata extraction
Improve the quality of the automatic metadata extraction; add automatic retrieval of metadata from arXiv, PubMed etc.
I want the abstract to be extracted automatically.
I would like to see better ways of dealing with cases where the title doesn't get extracted right and the rest of the metadata is therefore worse than useless... I'm getting a lot of this with stuff from http://uninformed.org
I've noticed difficulty with author names such as McElroy or McFarley. I have yet to see it correctly handle these names with capital letters other than at the beginning.
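A note on the name problem above: how the extractor treats capitalisation is not known from this thread. One plausible cause is blanket title-casing of every name. The sketch below is a hypothetical helper (`normalize_name` is not a Mendeley function) that re-capitalises a name only when the source gives no case information, so mixed-case names such as McElroy pass through unchanged.

```python
def normalize_name(name):
    """Re-capitalise a name only when the source carries no case information.

    Hypothetical helper for illustration; not part of any Mendeley API.
    """
    # All-caps or all-lowercase input tells us nothing about the intended
    # capitalisation, so fall back to simple title-casing.
    if name.isupper() or name.islower():
        return name.title()
    # Mixed case (e.g. "McElroy") is assumed to be deliberate and is kept.
    return name


print(normalize_name("McElroy"))
print(normalize_name("smith"))
```

An all-caps source such as "MCELROY" would still come out as "Mcelroy" with this approach; recovering the inner capital would need a list of known prefixes or a lookup against an external record.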
Many papers have DOIs that are not imported, or sometimes imported without the full entry, due to "unusual" characters [:, -, (, ), etc.]
Also, automatically getting/comparing data without review via DOI/PMID would be useful.
Robert Knight commented
Indeed, the New England Journal of Medicine is a classic example of a journal for which the automated extraction does not work well (yet). I cannot promise when that will be fixed as we generally try to avoid hacks to fix specific journals and will instead try to find algorithmic improvements that improve accuracy across the board (or perhaps improvements that target a particular class of problem paper).
I've noticed it's especially bad with journals with large headers at the top. I don't think it has extracted the correct data from any NEJM article it has seen.
This feature has improved over time with new Mendeley releases and works really well for most new PDF files. However, metadata from older PDF files is still not retrieved correctly; I understand this will be more difficult. One good example is Acta Crystallographica Section D. For some reason Mendeley finds the editor's name and not the authors in older papers (pre-2006 or so). This means that Mendeley matches the paper with ones already in the database and adds the file to that reference instead of making a new reference. This is annoying.
It would be nice if Mendeley could collect metadata from various sources at once.
For example, for an ebook Google supplies an abstract but no data on publisher, ISBN, etc. Amazon supplies publisher, ISBN, etc. but no abstract, so it would make sense to merge them.
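The merging idea can be sketched roughly as follows. This is only an illustration: `merge_records` is a hypothetical function, the records are placeholder dictionaries rather than real Google or Amazon responses, and the rule "earlier sources win, later sources only fill gaps" is an assumption.

```python
def merge_records(*records):
    """Combine metadata dicts; earlier records win, later ones fill gaps."""
    merged = {}
    for record in records:
        for field, value in record.items():
            # Skip empty values so they never block a later source.
            if value in (None, "", []):
                continue
            # Keep the first non-empty value seen for each field.
            merged.setdefault(field, value)
    return merged


# Placeholder records standing in for two different sources.
source_a = {"title": "Example Title", "abstract": "Example abstract.", "publisher": ""}
source_b = {"title": "Example Title (2nd ed.)", "publisher": "Example Press", "isbn": "000-0-00-000000-0"}

print(merge_records(source_a, source_b))
```

Putting the most trusted source first decides which value is kept when two sources disagree on the same field.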
Please implement an option to manually correct automatically extracted metadata (from automatic extraction, Google Scholar, and DOI lookup) and to review the process of automatic extraction:
The user should be able to correct mistakes by accessing various online databases (not only Google) and searching for complete metadata using all of the extracted data as the query (not only the title of the entry). Right now the only option is to fetch the rest of the fields if the title field is correct. But when I drag and drop a PDF that has no XML metadata but has author, year, and title in its filename, all three pieces of information end up in the title field. For a successful Google Scholar search I then have to delete the author and year by hand, and even then it might fetch information for the wrong article.
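Splitting such a filename into separate fields might look like the sketch below. The filename layout (author, then a four-digit year, then the title, separated by spaces, commas, underscores or hyphens) is an assumption, and `split_filename` is a hypothetical helper, not something Mendeley provides.

```python
import re
from pathlib import Path

# Assumed layout: <author> <sep> <year> <sep> <title>, where <sep> is any run
# of spaces, commas, underscores or hyphens and the year may be in parentheses.
FILENAME_RE = re.compile(
    r"^(?P<author>.+?)[\s_,\-]+\(?(?P<year>(?:19|20)\d{2})\)?[\s_,\-]+(?P<title>.+)$"
)


def split_filename(path):
    """Split 'Author - Year - Title.pdf' style names into separate fields."""
    stem = Path(path).stem
    match = FILENAME_RE.match(stem)
    if match is None:
        # No recognisable year: keep the whole stem as the title.
        return {"title": stem}
    return {
        "author": match.group("author"),
        "year": int(match.group("year")),
        "title": match.group("title"),
    }


print(split_filename("Smith - 2009 - An example title.pdf"))
```

With the fields separated, only the title (or title plus author) would need to be sent as the search query.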
The Google Scholar search is also unsatisfying in that it apparently uses the first Google Scholar match right away. But many articles have been published more than once, so it'd be great to be able to choose from the list of Google Scholar (and other databases') results.
I second the comments by Fabien P. Title extraction should be much better; even when it extracts only part of the title and cannot source the reference, I can manually copy and paste the title fragment into PubMed and almost always get back the single, correct reference (while Google Scholar fails). Please rely more on PubMed; where librarians have already done the work, use it! (Using PMIDs also results in the abstract being imported nicely.)
I've noticed that the metadata extraction from Google Scholar doesn't include the volume and issue numbers.
Along with symbols and Greek letters, even common punctuation like single and double quotes does not import correctly.
Stephen Hill commented
There is a slight bug with recognition of the DOIs that could probably be easily fixed. Most DOIs from ASM publications look like this: doi:10.1128/JCM.44.4.1495-1501.2006. Mendeley seems to cut off everything after the dash and interprets the dash as a long dash (–) instead of a short one (-). I think this fix would clear about 90% of my imports that need reviewing!
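One way to handle the dash problem is to normalise dash-like characters before matching the DOI. The sketch below is an illustration only: the pattern is a deliberately loose approximation of DOI syntax, and `find_doi`, `DASH_TABLE` and `DOI_RE` are hypothetical names, not Mendeley internals.

```python
import re

# Unicode dash variants that PDF text extraction often substitutes for "-".
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"
DASH_TABLE = str.maketrans({d: "-" for d in DASHES})

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# made of any non-space characters (so ":", "-", "(" and ")" are all kept).
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")


def find_doi(text):
    """Return the first DOI-like string in text, or None."""
    normalized = text.translate(DASH_TABLE)
    match = DOI_RE.search(normalized)
    if match is None:
        return None
    # Trim punctuation that usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;")


print(find_doi("doi:10.1128/JCM.44.4.1495\u20131501.2006"))
```

Because the suffix is matched as "anything up to the next whitespace", a DOI that is itself wrapped in parentheses in the running text would keep the closing bracket; a real implementation would need an extra check for that case.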
Fabien P commented
Would it be possible to improve title detection with a "font size" based approach? (In many articles the title is set in a larger font than anything else.)
That would make it possible to search for this title in PubMed and automatically use the PMID to extract the remaining data...
(Maybe a validation checkpoint could be added: for example, if the DOI or arXiv ID is extracted from the document, automatically compare the title linked from the DOI to the PMID title. And keep the "Details are correct" button to prevent mistakes, with manual validation.)
Additionally, providing a "first page" preview from the document could help the manual validation stage...
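The font-size idea could be sketched as below. The input is assumed to be a list of text spans from the first page, each with its font size, as produced by whatever PDF parser is in use; `Span` and `guess_title` are hypothetical names and no particular PDF library is implied.

```python
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    size: float  # font size in points


def guess_title(spans, min_chars=10):
    """Join the largest-font spans on the first page into a title guess."""
    candidates = [s for s in spans if s.text.strip()]
    # Try sizes from largest down, skipping very short runs such as a
    # drop cap or a single logo letter set in a huge font.
    for size in sorted({s.size for s in candidates}, reverse=True):
        text = " ".join(s.text.strip() for s in candidates if s.size == size)
        if len(text) >= min_chars:
            return text
    return None


first_page = [
    Span("J", 30.0),
    Span("A study of", 18.0),
    Span("example things", 18.0),
    Span("Body text here that is long", 10.0),
]
print(guess_title(first_page))
```

A heuristic like this would still be fooled by journals that print a large masthead above the title, so it would work best as one signal among several rather than on its own.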
Andrey Chetverikov commented
The metadata extraction from the APA site still needs to be improved.
Try this, for example: http://psycnet.apa.org/psycinfo/2009-07773-029 Only the authors' names are extracted correctly.
This feature does need to be improved. It would also be useful if the program let you search for a paper in PubMed using details other than the ID, instead of relying on a Google search. This would improve metadata collection.
Also, there should be a feature to undo the automatic selection. When a new PDF file is "Automatically Imported" there is a "Search by Title" button under the "Document Details" tab. This is good, but more often than not it fails and fills the fields with junk information. The user should be able to undo this and get back to whatever the state was before the "Search by Title" button was hit. The same goes for the DOI lens button (although DOI results are seldom junk).
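The undo behaviour requested here amounts to taking a snapshot of the fields before each lookup overwrites them. A minimal sketch, assuming the document details can be treated as a plain dictionary (`DocumentDetails`, `apply_lookup` and `undo` are hypothetical names):

```python
import copy


class DocumentDetails:
    """Holds metadata fields and remembers the state before each lookup."""

    def __init__(self, fields):
        self.fields = dict(fields)
        self._history = []

    def apply_lookup(self, result):
        # Save a copy of the current state so the lookup can be reverted.
        self._history.append(copy.deepcopy(self.fields))
        self.fields.update(result)

    def undo(self):
        # Restore the most recent snapshot, if there is one.
        if self._history:
            self.fields = self._history.pop()
        return self.fields


details = DocumentDetails({"title": "My paper"})
details.apply_lookup({"title": "Junk result", "year": 1900})
print(details.undo())
```

Keeping a stack of snapshots rather than a single one means several lookups in a row (title search, then DOI lookup) can each be reverted in turn.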
Tim Kietzmann commented
Hello everyone, for papers that still cannot be tagged by Mendeley (yes, this still happens), I have set up another ticket that asks for a display of several "close" matches from which one can select the correct one.
I have a lot of difficulty with articles from the Journal of Inclusion Phenomena and Macrocyclic Chemistry. Mendeley can't extract the document's details using either the Google Scholar search or the DOI. This is rather strange because I can find the paper using the DOI resolver (dx.doi.org). Also, when one searches for the article by name in Google, it is in 1st or 2nd position in the results.
Keith Refson commented
I have found the quality of automatic PDF metadata extraction to be pretty poor, even on recent journals such as the APS journals. The only thing that saves the situation is the DOI lookup.
However, DOIs are not detected in American Chemical Society journals even though they are present, meaning that, for example, importing from the Journal of Physical Chemistry is a "type by hand" process (since the title search does not work for titles containing chemical formulae).