DeepSeek Janus series: unified multimodal understanding and generation models

(See details below ⬇️.) Compared with other current closed-source or open-source solutions, a certain gap still remains.

Overview of the three models

Janus: multimodal understanding and generation through decoupled visual encoding

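Janus is a novel autoregressive framework that aims to unify multimodal understanding and generation. Its distinctive advantages include:

Separating the visual encoding paths effectively mitigates the conflict between generation and understanding while still using a single, unified Transformer architecture.
The decoupled design makes the framework more flexible; on multimodal tasks it not only surpasses earlier unified models but also matches task-specific models.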

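Technical features:

Simple design: reduces architectural complexity.
Efficiency: performs well across multiple tasks, making it a strong candidate for next-generation multimodal models.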

Model download link and paper:

Janus-1.3B: https://huggingface.co/deepseek-ai/Janus-1.3B
Paper: https://arxiv.org/abs/2410.13848

JanusFlow: harmonizing autoregression and rectified flow

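JanusFlow combines autoregression with rectified flow. Its technical highlights include:

Rectified flow can be trained directly inside the large language model framework, with no complex architectural changes required.
On vision and language tasks it performs on par with or better than task-specific models.
It clearly exceeds the benchmark performance of existing unified approaches.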

Model download link and paper:

JanusFlow-1.3B: https://huggingface.co/deepseek-ai/JanusFlow-1.3B
Paper: https://arxiv.org/abs/2411.07975

Janus-Pro: a unified multimodal framework through data and model scaling

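Janus-Pro is an advanced version of Janus and integrates the following key improvements:
Substantially improves generalization and stability.
Enriches the scenarios and diversity of multimodal inputs.
Strengthens the model's understanding ability and generation stability.
With these improvements, Janus-Pro performs strongly in the following areas:
Reaches top-tier industry level on complex tasks.
Handles complex instructions more efficiently and generates high-quality images consistently.
Model download links and paper:
Janus-Pro-1B: https://huggingface.co/deepseek-ai/Janus-Pro-1B
Janus-Pro-7B: https://huggingface.co/deepseek-ai/Janus-Pro-7B
Paper: https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf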
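
For readers who want to fetch the checkpoints listed in the download links, here is a minimal sketch. It assumes the `huggingface_hub` package is installed; the helper name `download_model` and the dictionary `MODEL_REPOS` are made up for illustration, while the repository IDs are taken from the links in this article.

```python
# Repository IDs copied from the Hugging Face links in this article.
MODEL_REPOS = {
    "Janus-1.3B": "deepseek-ai/Janus-1.3B",
    "JanusFlow-1.3B": "deepseek-ai/JanusFlow-1.3B",
    "Janus-Pro-1B": "deepseek-ai/Janus-Pro-1B",
    "Janus-Pro-7B": "deepseek-ai/Janus-Pro-7B",
}


def download_model(name, local_dir=None):
    """Download all files of one checkpoint and return the local folder path.

    Hypothetical helper: it only wraps huggingface_hub.snapshot_download.
    """
    # Imported lazily so the mapping above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=MODEL_REPOS[name], local_dir=local_dir)


# Example call (triggers a large download, so it is left commented out):
# path = download_model("Janus-Pro-1B")
```

This only retrieves the weight files; running inference still requires the code from the official Janus repository.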

Image understanding trial

Understand this meme

Janus

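This meme humorously compares two AI assistants. The left side uses the OpenAI assistant and shows a camera mounted on the head, suggesting a bias toward surveillance. The right side uses the DeepSee assistant, holding a camera in hand and facing a screen, suggesting a focus on visual analysis and data collection. The meme satirizes OpenAI as emphasizing surveillance while DeepSee concentrates on visual analysis.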

JanusFlow

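OpenAI and DeepSeek are two kinds of AI models, each designed to help people complete a variety of tasks. OpenAI is a general-purpose large language model used for a wide range of applications such as chatbots, language translation, and text generation. DeepSeek, on the other hand, focuses on retrieving specific information from large datasets. The two models each have their own strengths and are often used together to complete more complex tasks.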

Janus-Pro

This image is a humorous meme that compares two AI systems, "OpenAI" and "DeepSeek", through a visual metaphor.
Left side (OpenAI):
The person wears a helmet fitted with multiple cameras or lenses, pointed at a target. This symbolizes OpenAI being more "focused" or "precise" on a specific task or goal.
Right side (DeepSeek):
The person is aiming a gun, but no target is visible. This suggests that DeepSeek is aiming in "some direction" while the specific target or outcome is not yet clear.
Overall meaning:
The meme humorously implies that OpenAI is more focused and directed towards a specific goal, while DeepSeek is more ambiguous or uncertain in its approach.

Spot the differences

Correct answer:

Janus

No differences (Incorrect ❌)

JanusFlow

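The image shows a girl sitting at a table holding an ice cream and an ice cream cone. The differences between the two images are as follows:
In the second image, the girl's hairstyle is slightly different: her hair is straight in the first image but slightly wavy in the second. (Incorrect ❌)
In the second image, the girl is wearing a pink ribbon that is absent from the first image. (Correct ✅)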

Janus-Pro

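The difference between the two images is the color of the ice cream: it is green in the left image and white in the right image. (Incorrect ❌)

Solve a math problem

ChatGPT o1 and o1 Pro can solve this.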

Janus

To solve this math problem, we need to determine the value of ( h_1(-2) ).
First, we need to find the value of ( h_1(-2) ).
Given the equation:
[ A^2 = 24^2 - 48t + 2, h_1(-2) = -9 ]
We can substitute ( A^2 = 24^2 - 48t ) into the given equation:
[ 24^2 - 48t + 2 = -9 ]
Next, we need to solve the equation for ( t ):
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t + 2 = -9 ]
[ 24^2 - 48t +

(Incorrect ❌)

JanusFlow

To solve this math problem, we need to find the value of ( h_1(1) ) given that ( h_1'(1) = 24t^2 - 48t + 2 ) and ( h_1(-2) = -9 ).
First, we need to find ( h_1'(1) ):
[ h_1'(1) = 24t^2 - 48t + 2 ]
Next, we need to find ( h_1(-2) ):
[ h_1(-2) = -9 ]
Now, we can substitute ( h_1'(1) ) and ( h_1(-2) ) into the equation:
[ 24t^2 - 48t + 2 - 9 = -9 ]
Simplifying the equation:
[ 24t^2 - 47t + 1 = -9 ]
Subtract 1 from both sides:
[ 24t^2 - 47t + 1 - 1 = -9 ]
[ 24t^2 - 46t + 2 = -9 ]
Now, we need to find ( h_1(1) ):
[ h_1(1) = 24t^2 - 46t + 2 ]
Substitute ( h_1(1) ) back into the equation we derived:
[ 24t^2 - 46t + 2 = 24t^2 - 46t + 2 ]
Simplify the equation:
[ 24t^2 - 46t + 2 = 24t^2 - 46t + 2 ]
This equation is already correct, so there is no need to solve it further. The solution is:
[ h_1(1) = 24t^2 - 46t + 2 ]

(Incorrect ❌)

Janus-Pro

We have the equation (24^2 - 48t + 2h = -9) and (h(-2) = -4).
First, let's simplify the equation (24^2 - 48t + 2h = -9).
[
\begin{align*}
24^2 - 48t + 2h &= -9 \
(24)^2 - 48t + 2h &= -9 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48t + 2h + 9 &= 0 \
(24)^2 - 48

(Incorrect ❌)

Count how many cars are in the parking lot

Correct answer: 18 cars (ChatGPT o1 Pro also got this one wrong previously)

Janus

There are 20 cars in this parking lot.

(Incorrect ❌)

JanusFlow

There are 21 cars in the parking lot.

(Incorrect ❌)

Janus-Pro

There are 14 cars in the parking lot.

(Incorrect ❌)
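
To make the size of each miss concrete, the short sketch below compares the three answers quoted in this section with the count of 18 given for this task. The variable and function names are illustrative only.

```python
# Count of 18 and the three model answers are the ones quoted in this section.
CORRECT_COUNT = 18
answers = {"Janus": 20, "JanusFlow": 21, "Janus-Pro": 14}


def absolute_errors(answers, correct):
    """Return how far each model's count is from the reference count."""
    return {model: abs(count - correct) for model, count in answers.items()}


errors = absolute_errors(answers, CORRECT_COUNT)

# Print the models from smallest to largest miss.
for model, err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{model}: off by {err}")
```

On this single image Janus is off by 2 cars, JanusFlow by 3, and Janus-Pro by 4, so the newest model is the furthest from the reference count here.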

Image generation trial

Prompt:

The image features an intricately designed eye set against a circular backdrop adorned with ornate swirl patterns that evoke both realism and surrealism. At the center of attention is a strikingly vivid blue iris surrounded by delicate veins radiating outward from the pupil to create depth and intensity. The eyelashes are long and dark, casting subtle shadows on the skin around them which appears smooth yet slightly textured as if aged or weathered over time.

Above the eye, there's a stone-like structure resembling part of classical architecture, adding layers of mystery and timeless elegance to the composition. This architectural element contrasts sharply but harmoniously with the organic curves surrounding it. Below the eye lies another decorative motif reminiscent of baroque artistry, further enhancing the overall sense of eternity encapsulated within each meticulously crafted detail.

Overall, the atmosphere exudes a mysterious aura intertwined seamlessly with elements suggesting timelessness, achieved through the juxtaposition of realistic textures and surreal artistic flourishes. Each component—from the intricate designs framing the eye to the ancient-looking stone piece above—contributes uniquely towards creating a visually captivating tableau imbued with enigmatic allure.

Janus

JanusFlow

The demo had a bug, so generation did not succeed.

Janus-Pro

Prompt:

A charming, cute girl with big sparkling eyes, soft pastel-colored hair (e.g., pink, lavender, or mint green), wearing a stylish outfit with subtle frills and bows, standing in a dreamy background filled with soft lighting, cherry blossoms, and gentle gradients. The atmosphere is cheerful and heartwarming, with warm, glowing highlights and delicate details in the surroundings, anime-inspired style

Janus

JanusFlow

The demo had a bug, so generation did not succeed.

Janus-Pro

More image examples

Janus

JanusFlow

Janus-Pro

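Evaluation

Janus

Benchmark performance

Visual generation results

JanusFlow

Benchmark performance

Visual generation results

Janus-Pro

Average performance on four multimodal understanding benchmarks

Performance on text-to-image instruction-following benchmarks

Technical framework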

Janus

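Unlike previous approaches, Janus does not adopt the common assumption that visual understanding and generation share the same visual encoder; instead, it decouples the visual encoder into independent modules for visual understanding and visual generation. "Und. Encoder" and "Gen. Encoder" are abbreviations of "Understanding Encoder" and "Generation Encoder", respectively.

The three-stage training process of Janus:

Stage I: The goal is to establish a conceptual connection between vision and language in the embedding space and to give the model an initial visual generation ability. The visual encoders and the LLM are frozen in this stage; only the understanding adaptor, the generation adaptor, and the image head are updated.
Stage II: The model is trained on a multimodal corpus so that it learns both multimodal understanding and generation. The LLM is unfrozen and trained on pure text data, multimodal understanding data, and visual generation data; visual generation training starts from ImageNet-1k and then extends to open-domain text-to-image data.
Stage III: The model is fine-tuned on instruction-tuning data to strengthen its instruction-following and dialogue abilities. All parameters except the generation encoder are unfrozen, and the data mixture includes pure text dialogue, multimodal understanding, and visual generation to keep the model versatile across scenarios.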
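
The freeze and unfreeze schedule of the three Janus stages can be summarized as a small lookup table. This is only a sketch: the module names are hypothetical labels, not identifiers from the Janus code base, and keeping both encoders frozen in the second stage is an assumption, since the text only says that the LLM is unfrozen there.

```python
# Hypothetical module labels for the components named in the text.
MODULES = [
    "und_encoder",   # understanding encoder
    "gen_encoder",   # generation encoder
    "und_adaptor",   # understanding adaptor
    "gen_adaptor",   # generation adaptor
    "llm",
    "image_head",
]

# Which modules receive gradient updates in each stage.
TRAINABLE = {
    1: {"und_adaptor", "gen_adaptor", "image_head"},
    # Assumption: the encoders stay frozen while the LLM is unfrozen.
    2: {"und_adaptor", "gen_adaptor", "image_head", "llm"},
    # Everything except the generation encoder.
    3: set(MODULES) - {"gen_encoder"},
}


def frozen_modules(stage):
    """Return the sorted list of modules that are NOT updated in a stage."""
    return sorted(set(MODULES) - TRAINABLE[stage])


for stage in (1, 2, 3):
    print(f"stage {stage}: frozen = {frozen_modules(stage)}")
```

In a real training script the same table would typically drive which parameters have gradients enabled before each stage starts.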

JanusFlow

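JanusFlow architecture: for visual understanding, the LLM generates responses by autoregressive prediction; for image generation, the LLM starts from Gaussian noise (t = 0) and iteratively updates z_t by predicting velocity vectors until t = 1 is reached. For simplicity, the VAE encoder, the skip connection used in generation, and the linear layer after f_enc are omitted.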
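
The generation loop described here, starting from Gaussian noise at t = 0 and following predicted velocities until t = 1, can be illustrated with a toy sketch. Several things are assumptions: the fixed-step Euler update and the step count are illustrative choices (the text only says the update is iterative), and `make_toy_velocity` is a stand-in for the LLM's velocity prediction that simply points at a known target vector.

```python
import random


def euler_sample(velocity_fn, z0, num_steps=30):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with fixed Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t runs over 0, dt, ..., 1 - dt
        v = velocity_fn(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def make_toy_velocity(target):
    """Toy replacement for the model: velocity that moves z straight to target."""
    def velocity_fn(z, t):
        # On a straight path from noise to target, the remaining displacement
        # divided by the remaining time is the constant path velocity.
        return [(xi - zi) / (1.0 - t) for xi, zi in zip(target, z)]
    return velocity_fn


random.seed(0)
target = [0.5, -1.0, 2.0]                          # pretend latent of an image
noise = [random.gauss(0.0, 1.0) for _ in target]   # z at t = 0
sample = euler_sample(make_toy_velocity(target), noise)
print(sample)
```

With this idealized velocity the path is a straight line, so the sample lands on the target up to floating-point error; a learned velocity field is only approximately straight, which is why the number of steps matters in practice.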

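The three-stage training process of JanusFlow:

Stage I: The randomly initialized linear layers, generation encoder, and generation decoder are trained so that these new modules are adapted to work with the pre-trained LLM and SigLIP encoder.
Stage II: The whole model is trained with the visual encoder kept frozen. Three types of data are used: multimodal understanding data, image generation data, and pure text data. The early phase emphasizes multimodal understanding, and the later phase raises the proportion of image generation data to accommodate the convergence needs of diffusion models.
Stage III: The model is fine-tuned on instruction-tuning data (dialogues, task-specific conversations, and high-quality text-to-image generation samples). The SigLIP encoder is also unfrozen, which improves instruction following on multimodal understanding and image generation tasks.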

Janus-Pro

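Janus-Pro architecture: visual encoding is decoupled into multimodal understanding (Understanding Encoder, abbreviated "Und. Encoder") and visual generation (Generation Encoder, abbreviated "Gen. Encoder").

Compared with Janus, the three training stages of Janus-Pro are optimized as follows:

Stage I: The number of training steps is increased so that the model trains sufficiently on the ImageNet dataset. Even with the LLM parameters fixed, the model can still effectively model pixel dependencies and generate reasonable images from category names.
Stage II: ImageNet data is dropped, and ordinary text-to-image data is used directly. This lets the model generate images from dense descriptions and improves training efficiency and overall performance.
Stage III: During supervised fine-tuning, the ratio of multimodal data, pure text data, and text-to-image data is adjusted from 7:3:10 to 5:1:4. Slightly lowering the share of text-to-image data improves multimodal understanding performance while the model keeps strong visual generation ability.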
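
The change in the fine-tuning data mix from 7:3:10 to 5:1:4 is easier to read as shares of the total. The sketch below only normalizes the two ratios quoted in the text; the dictionary keys and the helper name are illustrative.

```python
from fractions import Fraction


def normalize(ratio):
    """Turn integer ratio parts into exact fractions of the total."""
    total = sum(ratio.values())
    return {name: Fraction(part, total) for name, part in ratio.items()}


# Order in the text: multimodal : pure text : text-to-image.
before = normalize({"multimodal": 7, "text": 3, "text_to_image": 10})
after = normalize({"multimodal": 5, "text": 1, "text_to_image": 4})

for name in before:
    print(f"{name}: {float(before[name]):.0%} -> {float(after[name]):.0%}")
```

Read this way, text-to-image data goes from half of the mix to two fifths, multimodal data rises from 35% to 50%, and pure text falls from 15% to 10%.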