DarkBERT: A Language Model for the Dark Side of the Internet

Date

03.06.23

Original article is here

Large language models are all the rage these days, and new ones are popping up every other day. Most of these linguistic behemoths, including OpenAI’s ChatGPT and Google’s Bard, are trained on text data from all over the internet: websites, articles, books, you name it. This means that their output can be a mixed bag. But what if, instead of the open web, LLMs were trained on the dark web? Researchers have done just that with DarkBERT, with some surprising results. Let’s take a look.

What is DarkBERT?

A team of South Korean researchers has released a paper detailing how they built an LLM on a large-scale dark web corpus collected by crawling the Tor network. The data included a host of shady sites from various categories, including cryptocurrency, pornography, hacking, and weaponry. However, due to ethical concerns, the team did not use the data as is. To keep the model from being trained on sensitive data that bad actors could later extract, the researchers filtered the pre-training corpus before feeding it to DarkBERT.
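To make the idea of filtering a crawled corpus concrete, here is a minimal Python sketch. It is not the researchers' actual pipeline: the article only says the corpus was filtered, so the regular expressions, the placeholder tokens such as `[EMAIL]`, the `min_words` threshold, and the function names `mask_sensitive` and `filter_corpus` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns: the article does not say which identifiers
# were filtered or how, so these are illustrative assumptions only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ONION_RE = re.compile(r"\b[a-z2-7]{16,56}\.onion\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask_sensitive(text: str) -> str:
    """Replace identifier-like strings with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = ONION_RE.sub("[ONION]", text)
    text = IPV4_RE.sub("[IP]", text)
    return text


def filter_corpus(pages, min_words=5):
    """Mask each page, then drop very short pages and exact duplicates."""
    seen = set()
    cleaned = []
    for page in pages:
        masked = mask_sensitive(page).strip()
        if len(masked.split()) < min_words:
            continue  # too little text to be useful (assumed threshold)
        if masked in seen:
            continue  # exact duplicate after masking
        seen.add(masked)
        cleaned.append(masked)
    return cleaned


if __name__ == "__main__":
    sample = [
        "short",
        "mail me at a@b.io for the full list today",
        "mail me at c@d.io for the full list today",
    ]
    for page in filter_corpus(sample):
        print(page)
```

Masking before deduplication means that pages differing only in an email address or IP collapse into a single entry, which is one plausible way to keep personal details out of the training text while also trimming repetitive pages.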

If you are wondering about the rationale behind the name, DarkBERT is based on the RoBERTa architecture, a transformer-based model developed in 2019 by researchers at Facebook (now Meta).

Meta described RoBERTa as a “robustly optimized method for pre-training natural language processing (NLP) systems” that improves upon BERT, which Google released in 2018. After Google open-sourced BERT, Meta was able to improve its performance.

Cut to the present: the Korean researchers have improved upon the original model even further by feeding it data from the dark web over the course of 15 days, eventually arriving at DarkBERT. The research paper notes that a machine with an Intel Xeon Gold 6348 CPU and four NVIDIA A100 80GB GPUs was used for the purpose.

Read the full article at: indianexpress.com
