DeepSeek


Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., doing business as DeepSeek, is a Chinese artificial intelligence company that develops large language models. Based in Hangzhou, Zhejiang, DeepSeek is owned and funded by the Chinese hedge fund High-Flyer. DeepSeek was founded in July 2023 by Liang Wenfeng, the co-founder of High-Flyer, who serves as CEO of both companies. The company launched an eponymous chatbot alongside its DeepSeek-R1 model in January 2025.
Released under the MIT License, DeepSeek-R1 provides responses comparable to those of other contemporary large language models, such as OpenAI's GPT-4 and o1. Its training cost was reported to be significantly lower than that of other LLMs. The company claims that it trained its V3 model for US$6 million, far less than the US$100 million cost for OpenAI's GPT-4 in 2023, and with approximately one-tenth of the computing power consumed by Meta's comparable model, Llama 3.1. DeepSeek's success against larger and more established rivals has been described as "upending AI".
DeepSeek's models are described as "open weight," meaning the exact parameters are openly shared, although certain usage conditions differ from typical open-source software. The company reportedly recruits AI researchers from top Chinese universities and also hires from outside traditional computer science fields to broaden its models' knowledge and capabilities.
DeepSeek significantly reduced training expenses for its R1 model by incorporating techniques such as mixture-of-experts layers. The company also trained its models during ongoing trade restrictions on AI chip exports to China, using weaker AI chips intended for export and employing fewer units overall. Observers said the breakthrough sent "shock waves" through the industry and described it as a "Sputnik moment" for the US in the field of artificial intelligence, particularly because of DeepSeek's open-source, cost-effective, and high-performing models. The development threatened established AI hardware leaders such as Nvidia, whose share price dropped sharply, losing US$600 billion in market value, the largest single-day decline for any company in U.S. stock market history.

History

Founding and early years (2016–2023)

In February 2016, High-Flyer was co-founded by AI enthusiast Liang Wenfeng, who had been trading since the 2008 financial crisis while attending Zhejiang University. The company began stock trading using a GPU-dependent deep learning model on 21 October 2016; before then, it had used CPU-based linear models. By the end of 2017, most of its trading was driven by AI.
Liang established High-Flyer as a hedge fund focused on developing and using AI trading algorithms, and by 2021 the firm was trading exclusively with AI, often running on Nvidia chips.
In 2019, the company began constructing its first computing cluster, Fire-Flyer, at a cost of 200 million yuan; it contained 1,100 GPUs interconnected at 200 Gbit/s and was retired after 1.5 years in operation.
By 2021, Liang had started buying large quantities of Nvidia GPUs for an AI project, reportedly obtaining 10,000 Nvidia A100 GPUs before the United States restricted chip sales to China. Construction of the Fire-Flyer 2 computing cluster began in 2021 with a budget of 1 billion yuan.
In 2022, Fire-Flyer 2 reportedly ran at over 96% of capacity, totaling 56.74 million GPU hours, of which 27% was used to support scientific computing outside the company.
During 2022, Fire-Flyer 2 had 5,000 PCIe A100 GPUs in 625 nodes, each containing 8 GPUs. It used the PCIe rather than the DGX version of the A100 because the models it trained at the time fit within a single GPU's 40 GB of VRAM, so the higher interconnect bandwidth of DGX systems was not needed. Later, it incorporated NVLink and NCCL to train larger models that required model parallelism.
On 14 April 2023, High-Flyer announced the launch of an artificial general intelligence research lab, stating that the new lab would focus on developing AI tools unrelated to the firm's financial business. On 17 July 2023, the lab was spun off into an independent company, DeepSeek, with High-Flyer as its principal investor and backer. Venture capital investors were reluctant to provide funding, as they considered it unlikely that the venture would be able to quickly generate an "exit".

Model releases (2023–present)

DeepSeek released its first model, DeepSeek Coder, on 2 November 2023, followed by the DeepSeek-LLM series on 29 November 2023. In January 2024, it released two DeepSeek-MoE models, and in April, three DeepSeek-Math models.
DeepSeek-V2 was released in May 2024, followed a month later by the DeepSeek-Coder V2 series. DeepSeek V2.5 was introduced in September 2024 and revised in December. On 20 November 2024, a preview of DeepSeek-R1-Lite became available via the company's chat interface. In December, DeepSeek-V3-Base and DeepSeek-V3 were released.
On 20 January 2025, DeepSeek launched the DeepSeek chatbot, based on the DeepSeek-R1 model, free of charge for iOS and Android. By 27 January, it had surpassed ChatGPT as the most downloaded free app on the iOS App Store in the United States, triggering an 18% drop in Nvidia's share price.
On 24 March 2025, DeepSeek released DeepSeek-V3-0324 under the MIT License.
On 28 May 2025, DeepSeek released DeepSeek-R1-0528 under the MIT License. The model has been noted for following official Chinese Communist Party ideology and censorship in its answers more closely than prior models.
On 21 August 2025, DeepSeek released DeepSeek V3.1 under the MIT License. The model features a hybrid architecture with thinking and non-thinking modes, and it surpasses prior models such as V3 and R1 by over 40% on certain benchmarks, including SWE-bench and Terminal-bench. It was updated to V3.1-Terminus on 22 September 2025. V3.2-Exp was released on 29 September 2025; it uses DeepSeek Sparse Attention, a more efficient attention mechanism based on earlier research the company published that February.

Company operation

DeepSeek is headquartered in Hangzhou, Zhejiang, and is owned and funded by High-Flyer. Its co-founder, Liang Wenfeng, serves as CEO. As of May 2024, Liang personally held an 84% stake in DeepSeek through two shell corporations.

Strategy

DeepSeek has stated that it focuses on research and does not have immediate plans for commercialization. This posture also allows it to avoid certain provisions of China's AI regulations that apply to consumer-facing technologies.
DeepSeek's hiring approach emphasizes skills over lengthy work experience, resulting in many hires fresh out of university. The company likewise recruits individuals without computer science backgrounds to broaden the range of expertise incorporated into its models, for instance in poetry or advanced mathematics. According to The New York Times, dozens of DeepSeek researchers have, or previously had, affiliations with People's Liberation Army laboratories and the Seven Sons of National Defence universities.
In response to United States restrictions on chip exports, DeepSeek refined its algorithms to maximize computational efficiency, which allowed it to leverage older hardware and reduce energy consumption.
DeepSeek has also expanded on the African continent, where it offers more affordable and less power-hungry AI solutions. The company has bolstered models for African languages and spurred a number of startups, for example in Nairobi. Together with Huawei's storage and cloud computing services, its impact on the technology scene in sub-Saharan Africa has been considerable, as DeepSeek offers local data sovereignty and more flexibility than Western AI platforms.

Training framework

High-Flyer/DeepSeek has operated at least two primary computing clusters, Fire-Flyer and Fire-Flyer 2. Fire-Flyer was constructed in 2019 and retired after 1.5 years of operation, while Fire-Flyer 2 remains in operation as of 2025. Fire-Flyer 2 consists of co-designed software and hardware. On the hardware side, Nvidia GPUs are connected by 200 Gbit/s interconnects; the cluster is divided into two "zones", and the platform supports cross-zone tasks. The network topology is two fat trees, chosen for high bisection bandwidth. The software stack comprises:
  • 3FS: A distributed parallel file system, specifically designed for asynchronous random reads. It uses Direct I/O and RDMA Read. In contrast to standard Buffered I/O, Direct I/O does not cache data; caching is useless in this case, since each data read is random and not reused.
  • hfreduce: A library for asynchronous communication, originally designed to replace the Nvidia Collective Communications Library (NCCL). It is mainly used for allreduce, especially of gradients during backpropagation. It runs asynchronously on the CPU to avoid blocking GPU kernels, and uses two-tree broadcast like NCCL.
  • hfai.nn: Software library of commonly used operators for neural network training, similar to torch.nn in PyTorch.
  • HaiScale Distributed Data Parallel: A parallel training library that implements various forms of parallelism, such as data parallelism, pipeline parallelism, tensor parallelism, expert parallelism, Fully Sharded Data Parallel, and the Zero Redundancy Optimizer. It is similar to PyTorch DDP, which uses NCCL as its backend (see the sketch after this list).
  • HAI Platform: Various applications such as task scheduling, fault handling, and disaster recovery.
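The sketch below, in standard PyTorch, illustrates the style of data parallelism that HaiScale Distributed Data Parallel is described as resembling: each worker holds a full model replica, and gradients are allreduced over NCCL after the backward pass. It is an illustrative sketch rather than DeepSeek's code; the model, dimensions, and training loop are placeholders.

# A minimal sketch (not DeepSeek's internal code): conventional data-parallel
# training with PyTorch DistributedDataParallel (DDP) over the NCCL backend.
# The model, dimensions, and training loop are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; LOCAL_RANK is set by a launcher such as torchrun.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()     # placeholder model
    model = DDP(model, device_ids=[local_rank])    # replicates model, allreduces gradients
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                            # toy training loop with random data
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()        # DDP allreduces gradient buckets during the backward pass
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Such a script is typically launched with torchrun, one process per GPU. Like hfreduce's asynchronous design, DDP overlaps the allreduce of gradient buckets with the remainder of the backward pass, so communication is largely hidden behind computation.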
As of 2022, Fire-Flyer 2 had 5,000 PCIe A100 GPUs in 625 nodes, each containing 8 GPUs. It later incorporated NVLink and NCCL to train larger models that required model parallelism.

Development and release history

The first DeepSeek models were essentially the same as Llama: dense decoder-only transformers. Later models incorporated multi-head latent attention (MLA), mixture of experts (MoE), and KV caching.
A decoder-only transformer consists of multiple identical decoder layers, each with two main components: an attention layer and a feedforward network (FFN) layer. V2 replaced the standard multi-head attention mechanism with multi-head latent attention, which introduces compressed latent vectors to reduce the size of the KV cache and thus memory usage.
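The following sketch, in PyTorch, shows the core idea of caching a compressed latent in place of full keys and values: only a low-dimensional latent per token is stored, and keys and values are re-expanded from it when attention is computed. It is a simplified illustration rather than DeepSeek's published architecture; the module names and dimensions are invented for the example, and details such as the separate handling of rotary position embeddings, query compression, and causal masking are omitted.

# Simplified sketch of latent KV compression (not the exact DeepSeek-V2 design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=1024, n_heads=16, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress each token to a small latent
        self.k_up = nn.Linear(d_latent, d_model)      # expand latent back into keys
        self.v_up = nn.Linear(d_latent, d_model)      # expand latent back into values
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        c = self.kv_down(x)                           # (b, t, d_latent)
        if latent_cache is not None:
            c = torch.cat([latent_cache, c], dim=1)   # only latents are ever cached
        T = c.size(1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(c).view(b, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c).view(b, T, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)   # causal masking omitted for brevity
        out = attn.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), c                  # c is the updated cache

Caching c costs d_latent values per token instead of 2 × d_model for full keys and values, which is the source of the memory savings described above.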
A standard MoE transformer generally uses sparsely gated MoE layers in place of the FFN layers. In such an MoE layer, several FFN modules ("experts") run in parallel, and a small classifier (the router) computes a score for each of these modules for every token; only the highest-scoring modules are activated. Starting with DeepSeekMoE, DeepSeek adopted a variant that adds "shared experts", which are always activated.
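A minimal sketch of such a layer follows. The expert counts, dimensions, and softmax-then-top-k gating are illustrative choices for the example, and refinements described in DeepSeek's papers, such as fine-grained expert segmentation and load-balancing objectives, are omitted.

# Illustrative sparsely gated MoE FFN layer with always-active "shared experts".
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(d_model, d_hidden):
    # One "expert": an ordinary two-layer feedforward block.
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                         nn.Linear(d_hidden, d_model))

class MoEWithSharedExperts(nn.Module):
    def __init__(self, d_model=512, d_hidden=1024, n_routed=8, n_shared=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_routed)   # small classifier scoring each routed expert
        self.routed = nn.ModuleList([ffn(d_model, d_hidden) for _ in range(n_routed)])
        self.shared = nn.ModuleList([ffn(d_model, d_hidden) for _ in range(n_shared)])

    def forward(self, x):                            # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)   # (n_tokens, n_routed)
        topv, topi = scores.topk(self.top_k, dim=-1) # keep only the top-k routed experts
        out = sum(expert(x) for expert in self.shared)   # shared experts see every token
        for slot in range(self.top_k):
            for e_idx, expert in enumerate(self.routed):
                mask = topi[:, slot] == e_idx        # tokens whose slot chose this expert
                if mask.any():
                    out[mask] = out[mask] + topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

Because only the top-k routed experts run for each token, the number of parameters activated per token stays roughly constant even as the total number of experts, and therefore total parameters, grows.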