<div class="csl-bib-body">
<div class="csl-entry">Fries, J., Seelam, N., Altay, G., Weber, L., Kang, M., Datta, D., Su, R., Garda, S., Wang, B., Ott, S., Samwald, M., & Kusa, W. (2022). Dataset Debt in Biomedical Language Modeling. In <i>Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models</i> (pp. 137–145). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.bigscience-1.10</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/175704
-
dc.description.abstract
Large-scale language modeling and natural language prompting have demonstrated exciting capabilities for few- and zero-shot learning in NLP. However, translating these successes to specialized domains such as biomedicine remains challenging, due in part to biomedical NLP’s significant dataset debt – the technical costs associated with data that are not consistently documented or easily incorporated into popular machine learning frameworks at scale. To assess this debt, we crowdsourced curation of datasheets for 167 biomedical datasets. We find that only 13% of datasets are available via programmatic access and 30% lack any documentation on licensing and permitted reuse. Our dataset catalog is available at: https://tinyurl.com/bigbio22.
en
dc.description.sponsorship
European Commission
-
dc.language.iso
en
-
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
-
dc.subject
biomedical NLP
en
dc.subject
language modelling
en
dc.subject
dataset debt
en
dc.subject
biomedical datasets
en
dc.subject
biomedical language modelling
en
dc.title
Dataset Debt in Biomedical Language Modeling
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.rights.license
Creative Commons Namensnennung 4.0 International
de
dc.rights.license
Creative Commons Attribution 4.0 International
en
dc.contributor.affiliation
Stanford University, United States of America
-
dc.contributor.affiliation
Sherlock Biosciences, USA
-
dc.contributor.affiliation
Tempus Labs (United States), United States of America
-
dc.contributor.affiliation
Humboldt-Universität zu Berlin, Germany
-
dc.contributor.affiliation
Immuneering (United States), United States of America
-
dc.contributor.affiliation
University of Virginia, United States of America
-
dc.contributor.affiliation
Sway AI, USA
-
dc.contributor.affiliation
Humboldt-Universität zu Berlin, Germany
-
dc.contributor.affiliation
Massachusetts General Hospital, United States of America