Improve quality of automatic metadata extraction

Improve the quality of the automatic metadata extraction; add automatic retrieval of metadata from arXiv, PubMed etc.

2,367 votes
Mendeley (Admin, Mendeley) shared this idea
Tim McNamara shared a merged idea: Support Dublin Core metadata extraction from webpages

180 comments

  • naesten commented

    I would like to see better ways of dealing with cases where the title doesn't get extracted right and the rest of the metadata is therefore worse than useless... I'm getting a lot of this with stuff from http://uninformed.org

  • jhamon commented

    I've noticed difficulty with author names such as McElroy or McFarley. I have yet to see it correctly handle these names with capital letters other than at the beginning.

  • mike.shearn commented

    Many papers have DOIs that are not imported, or sometimes imported without the full entry, due to "unusual" characters [:, -, (, ), etc.]

    Also, automatically getting/comparing data without review via DOI/PMID would be useful.

  • Robert Knight commented

    @emschaub

    Indeed, the New England Journal of Medicine is a classic example of a journal for which the automated extraction does not work well (yet). I cannot promise when that will be fixed as we generally try to avoid hacks to fix specific journals and will instead try to find algorithmic improvements that improve accuracy across the board (or perhaps improvements that target a particular class of problem paper).

  • emschaub commented

    I've noticed it's especially bad with journals with large headers at the top. I don't think it has extracted the correct data from any NEJM article it has seen.

  • bez commented

    This feature has improved over time with new Mendeley releases and works really well for most new PDF files. However, metadata from older PDF files is still not retrieved correctly. I understand this will be more difficult. One good example is Acta Crystallographica Section D: for some reason Mendeley finds the editor's name and not the authors in older papers (pre-2006 or so). This means that Mendeley matches the paper with ones already in the database and adds the file to that reference instead of making a new reference. This is annoying.

  • zanzibar commented

    It would be nice if Mendeley could collect metadata from various sources at once.
    For example, for an ebook Google supplies an abstract but no data on publisher, ISBN, etc. Amazon supplies data on publisher, ISBN, etc. but no abstract, so it would make sense to merge them.
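A minimal sketch of the merge suggested here, assuming each source returns a flat dictionary of fields. The function name `merge_records`, the field names, and the sample values are all placeholders for illustration, not anything Mendeley, Google, or Amazon actually exposes.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def merge_records(records: List[Record]) -> Record:
    """Merge records in priority order: the first non-empty value per field wins."""
    merged: Record = {}
    for record in records:
        for field, value in record.items():
            # Only fill a field that is still missing or empty.
            if value and not merged.get(field):
                merged[field] = value
    return merged


# Placeholder records standing in for two lookup results (hypothetical data).
google = {"title": "Example Book", "abstract": "A short abstract.", "publisher": None, "isbn": None}
amazon = {"title": "Example Book", "abstract": None, "publisher": "Example Press", "isbn": "0000000000"}

combined = merge_records([google, amazon])
```

Listing the sources in priority order keeps the rule simple: a later source can only fill gaps, never overwrite a value an earlier source already supplied.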

  • Jan commented

    Please implement an option to manually correct automatically extracted metadata (from automatic extraction, Google Scholar, and DOI lookup) and to review the automatic extraction process:
    The user should be able to correct mistakes by accessing various online databases (not only Google) and searching for complete metadata using the extracted data as the query (not only the title of the entry). Right now the only option is to fetch the rest of the fields when the title field is correct. But when I drag and drop a PDF that has no XML metadata but has author, year, and title in its filename, all three pieces of information are shown in the title field. For a successful Google Scholar search I then have to delete the author and year, and even then it might fetch information for the wrong article.
    The Google Scholar search is also unsatisfying in that it apparently uses the first Google Scholar match right away. But many articles have been published more than once, so it'd be great to be able to choose from the list of results from Google Scholar (and other databases).
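The filename case described above could be handled by splitting the name into separate fields before any search is run. The sketch below assumes a hypothetical naming convention of "Author, Year, Title" with spaces, commas, underscores, or hyphens as separators; real filenames vary widely, so the pattern and the helper `split_filename` are illustrative only.

```python
import re
from typing import Dict, Optional

# Hypothetical convention: "<Author> <4-digit year> <Title>.pdf", where the
# parts are separated by spaces, commas, underscores, or hyphens.
FILENAME_PATTERN = re.compile(
    r"^(?P<author>[^\d]+?)[\s_,\-]+\(?(?P<year>(?:19|20)\d{2})\)?[\s_,\-]+"
    r"(?P<title>.+?)(?:\.pdf)?$",
    re.IGNORECASE,
)


def split_filename(filename: str) -> Optional[Dict[str, str]]:
    """Return author, year, and title fields, or None if the name does not fit."""
    match = FILENAME_PATTERN.match(filename.strip())
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


fields = split_filename("Smith - 2009 - Metadata extraction from PDFs.pdf")
```

With the fields separated, the title alone can be used as the search query while author and year remain available to rank or filter the candidate matches.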

  • ChrisL commented

    I second the comments by Fabien P. Title extraction should be much better; and even when it only extracts part of the title and cannot source the reference, I can manually copy/paste the title fragment into PubMed and almost always get back the single, correct reference (while Google Scholar fails). Please rely more on PubMed; in instances where librarians have already done the work, use it! (Using PMIDs also results in the abstract being imported nicely.)
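For illustration, a title-fragment search against PubMed can be expressed as a query to the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and sends nothing; the helper name `pubmed_title_query` and the default of five results are assumptions, and nothing here reflects how Mendeley itself performs lookups.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_title_query(title_fragment: str, max_results: int = 5) -> str:
    """Build an esearch URL that restricts the search term to the title field."""
    params = {
        "db": "pubmed",
        "term": f"{title_fragment}[Title]",
        "retmode": "json",
        "retmax": str(max_results),
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = pubmed_title_query("metadata extraction")
```

Asking for several results rather than one leaves room to show the user a list of candidate PMIDs instead of silently accepting the first match.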

  • jon.reeve commented

    I've noticed that the metadata extraction from Google Scholar doesn't include the volume and issue numbers.

  • arikleiman commented

    Along with symbols and Greek letters, even common punctuation like single and double quotes does not import correctly.

  • Stephen Hill commented

    There is a slight bug with recognition of DOIs that could probably be easily fixed. Most DOIs from ASM publications look like this: doi:10.1128/JCM.44.4.1495-1501.2006. Mendeley seems to cut off everything after the dash and interprets the dash as a long dash (–) instead of a short one (-). I think this fix would clear about 90% of my imports that need reviewing!
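A rough sketch of the kind of fix described here, assuming the extracted text is available as a plain string: normalise dash look-alikes to an ASCII hyphen first, then match the DOI with a pattern that allows punctuation such as parentheses and colons. The function `extract_doi` and the regular expression are illustrative assumptions, not Mendeley's actual parser.

```python
import re
from typing import Optional

# Dash-like characters that PDF text extraction may substitute for "-".
DASH_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2212"
DASH_TABLE = str.maketrans({ch: "-" for ch in DASH_VARIANTS})

# "10.", a 4-9 digit registrant code, "/", then any run of non-space characters.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in text, or None if there is none."""
    normalized = text.translate(DASH_TABLE)
    match = DOI_PATTERN.search(normalized)
    if match is None:
        return None
    # Trim sentence punctuation that commonly trails a DOI in running text.
    return match.group(0).rstrip(".,;")


doi = extract_doi("doi:10.1128/JCM.44.4.1495\u20131501.2006.")
```

Accepting any non-space character after the slash is deliberately permissive; the trade-off is that trailing punctuation has to be trimmed afterwards, which could clip a DOI that genuinely ends in a period or comma.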

  • Fabien P commented

    Would it be possible to improve title detection with a "font size" based approach? (In many articles the title is set larger than anything else.)
    That would make it possible to search for the title in PubMed and automatically use the PMID to extract the remaining data...
    (Maybe a validation checkpoint could be added: for example, if a DOI or arXiv ID is extracted from the document, automatically compare the title retrieved via the DOI with the PMID title. And keep the "Details are correct" button to prevent mistakes through manual validation.)

    Additionally, providing a "first page" preview from the document could help the manual validation stage...
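The font-size idea and the title cross-check could look roughly like the sketch below. It assumes some PDF layout extractor has already produced (text, font size) pairs for the first page; that input format, the helpers `guess_title` and `titles_agree`, the 10-character minimum, and the 0.9 similarity threshold are all hypothetical choices for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical input: (text, font_size) pairs for the lines of the first page.
Line = Tuple[str, float]


def guess_title(lines: List[Line], min_chars: int = 10) -> str:
    """Join, in page order, the lines set in the largest font size."""
    # Skip very short lines, which are unlikely to be a full title line.
    candidates = [(text.strip(), size) for text, size in lines if len(text.strip()) >= min_chars]
    if not candidates:
        return ""
    largest = max(size for _, size in candidates)
    return " ".join(text for text, size in candidates if size == largest)


def titles_agree(a: str, b: str, threshold: float = 0.9) -> bool:
    """Case-insensitive similarity check between two title strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


first_page = [
    ("Journal of Examples", 9.0),
    ("A Font-Size Approach to", 18.0),
    ("Title Detection", 18.0),
    ("A. Author, B. Author", 11.0),
]
title = guess_title(first_page)
```

A call such as `titles_agree(title, title_from_doi_lookup)` could then serve as the validation checkpoint, with any disagreement routed to manual review rather than accepted automatically.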

  • bez commented

    This feature does need to be improved. It would also be useful if the program allowed you to search for a paper in PubMed using details other than the ID, instead of using Google search. This would improve metadata collection.

  • kapatp commented

    Also, there should be a feature to UNDO the auto selection. When a new PDF file is "Automatically Imported" there is a "Search by Title" button under the "Document Details" tab. This is good, but more often than not it fails and fills the fields with junk information. The user should be able to UNDO this and get back to whatever the state was before the "Search by Title" button was hit. The same goes for the DOI lens button (although DOI results are seldom junk).

  • kvnlinux commented

    I have many difficulties with articles from the Journal of Inclusion Phenomena and Macrocyclic Chemistry. Mendeley can't extract the document's details using either the Google Scholar search or the DOI. This is rather strange because I can find the paper using the DOI resolver (dx.doi.org). Also, when one searches for the article by name in Google, it is in 1st or 2nd position in the results.

  • Keith Refson commented

    I have found the quality of automatic PDF metadata extraction to be pretty poor, even on recent journals such as the APS journals. The only thing that saves the situation is the DOI lookup.

    However, DOIs are not detected in American Chemical Society journals even though they are present, meaning that, for example, importing from the Journal of Physical Chemistry is a "type by hand" process (since the title search does not work for titles containing chemical formulae).
