SaRang

Topology- Set Theory, Functions

2022-07-31T00:00:00+00:00

References
[1] Topology 2e, James Munkres

Introduction
Rule of assignment
Define function
restriction
injective, surjective, bijective
Lemma by inverse function
image, preimage of set under \(f\)

Introduction

This post approaches the definition of a function through the concept of sets.
It also covers the components of a function and the operations on functions.

Rule of assignment

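A rule of assignment is a subset of the Cartesian product of two sets \(C\) and \(D\).
It has the property that each element of \(C\) appears as the first coordinate of at most one ordered pair.
It is easiest to think of it as many-to-one: several elements of \(C\) may be assigned the same element of \(D\), but no element of \(C\) is assigned more than one.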

\[\begin{aligned} \text{a subset r of C } \times \text{ D is a rule of assignment if} \\[1em] [(c, d) \in r \text{ and } (c, e) \in r ] \Rightarrow [d = e] \end{aligned}\]

Given a rule of assignment r, the domain of r is the subset of \(C\) consisting of all first coordinates of the ordered pairs in r. Likewise, the subset of \(D\) consisting of all second coordinates is called the image set of r.

\[\begin{aligned} \text{domain r } = \{ c | \text{ there exists } d \in D \text{ such that } (c, d) \in r \} \\[1em] \text{image r } = \{ d | \text{ there exists } c \in C \text{ such that } (c, d) \in r \} \\[1em] \end{aligned}\]
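For finite sets the definition can be checked mechanically. The following is a minimal sketch, assuming a relation is stored as an array of `[c, d]` pairs; the data and the helper names `isRuleOfAssignment`, `domain`, and `image` are made up for illustration.

```javascript
// A relation r ⊂ C × D stored as an array of [c, d] pairs (made-up data).
const r = [["a", 1], ["b", 1], ["c", 2]];

// r is a rule of assignment if no first coordinate is paired with two different values.
function isRuleOfAssignment(pairs) {
  const seen = new Map();
  for (const [c, d] of pairs) {
    if (seen.has(c) && seen.get(c) !== d) return false;
    seen.set(c, d);
  }
  return true;
}

// domain r: all first coordinates; image r: all second coordinates.
const domain = (pairs) => new Set(pairs.map(([c]) => c));
const image = (pairs) => new Set(pairs.map(([, d]) => d));

console.log(isRuleOfAssignment(r)); // true: "a" and "b" may share the value 1
console.log(isRuleOfAssignment([["a", 1], ["a", 2]])); // false: "a" is paired with both 1 and 2
console.log([...domain(r)], [...image(r)]); // domain {a, b, c}, image {1, 2}
```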

Define function

A function \(f\) is a rule of assignment r, together with a set B that contains the image set of r.
The domain of the function \(f\) is the domain A of the rule.
The image of the function \(f\) is the image set of the rule.
The range of the function \(f\) is the whole set B.
When a function \(f\) has domain A and range B, we write it as follows.

\[\begin{aligned} \text{"} f \text{ is function from A to B"} \\[1em] f : A \rightarrow B \end{aligned}\]

restriction

Given a function \(f : A \rightarrow B\) and a subset \(A_0\) of \(A\), the restriction of \(f\) to \(A_0\) is the following set.

\[\{(a, f(a)) | a \in A_0\}\]

injective, surjective, bijective

A function \(f : A \rightarrow B\) is called injective if distinct elements of A are always assigned distinct elements of B.

\[[f(a) = f(b)] \Rightarrow [a = b]\]

When the image of \(f\) equals the range of \(f\), the function \(f\) is called surjective.

\[[b \in B] \Rightarrow [b = f(a) \text{ for at least one } a \in A]\]

A function that satisfies both the injective and the surjective condition is called bijective.
If \(f\) is bijective, there exists a function \(f^{-1} : B \rightarrow A\), called the inverse of \(f\).
\(f^{-1}(b) = a\) means \(f(a) = b\). Surjectivity of \(f\) guarantees that \(f^{-1}(b)\) exists, and injectivity guarantees that \(f^{-1}(b)\) is a single value.
The inverse of \(f\) is also a bijective function.
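On finite sets these three conditions are easy to test. The sketch below assumes a function is stored as a `Map` from elements of A to elements of B; the sets and the helper names are made up for illustration.

```javascript
// A function f : A -> B on finite sets, stored as a Map (made-up example).
const A = ["a", "b", "c"];
const B = [1, 2, 3];
const f = new Map([["a", 2], ["b", 3], ["c", 1]]);

// Injective: no two keys share a value. Surjective: every element of the range is hit.
const isInjective = (fn) => new Set(fn.values()).size === fn.size;
const isSurjective = (fn, range) => range.every((b) => [...fn.values()].includes(b));

// For a bijection, the inverse swaps each pair (a, b) to (b, a).
function inverse(fn, range) {
  if (!isInjective(fn) || !isSurjective(fn, range)) {
    throw new Error("only a bijection has an inverse function");
  }
  return new Map([...fn].map(([a, b]) => [b, a]));
}

const fInv = inverse(f, B);
console.log(fInv.get(2)); // "a", because f("a") = 2
console.log(A.every((a) => fInv.get(f.get(a)) === a)); // true: f^{-1}(f(a)) = a
```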

Lemma by inverse function

\[\begin{aligned} &f : A \rightarrow B, g : B \rightarrow A, h : B \rightarrow A \\[1em] &\text{if } g(f(a)) = a \text{ for every } a \text{ in } A \text{ and } f(h(b)) = b \text{ for every } b \text{ in } B, \\[1em] &\text{then } f \text{ is bijective function and } g = h = f^{-1} \end{aligned}\]

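\(g(f(a)) = a\) guarantees that \(f\) is injective (if \(f(a) = f(a')\), applying \(g\) to both sides gives \(a = a'\)), and \(f(h(b)) = b\) guarantees that \(f\) is surjective (every \(b\) is the image of \(h(b)\)).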

image, preimage of set under \(f\)

Given a function \(f : A \rightarrow B\) and a subset \(A_0\) of \(A\), we define \(f(A_0)\) as follows and call it the image of \(A_0\) under \(f\).

\[f(A_0) = \{ b | b = f(a) \text{ for at least one } a \in A_0 \}\]

Given a subset \(B_0\) of \(B\), we define \(f^{-1}(B_0)\) as follows and call it the preimage of \(B_0\) under \(f\).

\[f^{-1}(B_0) = \{ a | f(a) \in B_0 \}\]

Here \(f^{-1}\) does not necessarily denote the inverse of \(f\). The preimage is defined even when \(f\) is not bijective, and it may be the empty set.
\(f^{-1}\) preserves inclusions, unions, intersections, and differences of sets, whereas \(f\) preserves only inclusions and unions.

\[\begin{aligned} B_0 \subset B_1 &\Rightarrow f^{-1}(B_0) \subset f^{-1}(B_1) \\[1em] f^{-1}(B_0 \cup B_1) &= f^{-1}(B_0) \cup f^{-1}(B_1) \\[1em] f^{-1}(B_0 \cap B_1) &= f^{-1}(B_0) \cap f^{-1}(B_1) \\[1em] f^{-1}(B_0 - B_1) &= f^{-1}(B_0) - f^{-1}(B_1) \\[1em] A_0 \subset A_1 &\Rightarrow f(A_0) \subset f(A_1) \\[1em] f(A_0 \cup A_1) &= f(A_0) \cup f(A_1) \\[1em] f(A_0 \cap A_1) &\subset f(A_0) \cap f(A_1) \quad \text{ equality holds when f is injective.} \\[1em] f(A_0 - A_1) &\supset f(A_0) - f(A_1) \quad \text{ equality holds when f is injective.}\\[1em] \end{aligned}\]

If \(f : A \rightarrow B\) and \(A_0 \subset A \text{ and } B_0 \subset B\), then the following hold.

\[\begin{aligned} &A_0 \subset f^{-1}(f(A_0)) \quad \text{ equality holds when f is injective } \\[1em] &f(f^{-1}(B_0)) \subset B_0 \quad \text{ equality holds when f is surjective } \\[1em] \end{aligned}\]
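A small non-injective function shows why the inclusions for images can be strict. This is only an illustrative sketch: the function \(f(x) = x^2\), the sets, and the helper names are assumptions chosen for the example.

```javascript
// f(x) = x * x on A = {-2, -1, 0, 1, 2}; not injective, since f(-1) = f(1).
const f = (x) => x * x;
const A = [-2, -1, 0, 1, 2];

const image = (subset) => new Set(subset.map(f));
const preimage = (targets) => new Set(A.filter((a) => targets.includes(f(a))));
const intersect = (s, t) => new Set([...s].filter((x) => t.has(x)));

const A0 = [-2, -1];
const A1 = [1, 2];

// A0 ∩ A1 is empty, so f(A0 ∩ A1) is empty ...
const imageOfIntersection = image(A0.filter((a) => A1.includes(a)));
// ... but f(A0) ∩ f(A1) = {1, 4}: the inclusion is strict.
const intersectionOfImages = intersect(image(A0), image(A1));

// A0 ⊂ f^{-1}(f(A0)) is strict as well: the preimage of {1, 4} is {-2, -1, 1, 2}.
const back = preimage([...image(A0)]);

console.log(imageOfIntersection.size, intersectionOfImages.size, back.size); // 0 2 4
```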

Topology- Set Theory, Fundamental Concepts

2022-07-27T00:00:00+00:00

References
[1] Topology 2e, James Munkres

Introduction
property
Union, Intersection and Empty Set
Vacuous Truth
Negation
Distributive law
Collections of Sets
Arbitrary Unions and Intersections
Cartesian Product

Introduction

This post covers the basic concept of a set in set theory and some simple operations on sets.

property

A property \(P(x)\) is a function whose value is true or false. It may take parameters.
It is used to restrict which elements make up a set.

\[B = \{x|x\text{ is integer greater than 5}\}\]

In words, the equation above reads “B is the set of all \(x\) such that \(x\) is integer greater than 5”.
\(\{\}\) marks the definition of a set, \(x|\) means “all x such that”, and \(x \text{ is integer greater than 5}\) is the property.
In the equation above the property is \(P(x, 5) = x > 5 \text{ and x is integer}\), with 5 as its parameter.

Union, Intersection and Empty Set

union of \(A\) and \(B\) :

\[A \cup B = \{x|x\in A \text{ or } x\in B \}\]

intersection of \(A\) and \(B\) :

\[A \cap B = \{x|x\in A \text{ and } x\in B \}\]

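If there is no x at all satisfying \(x\in A \text{ and } x\in B\), we write \(A \cap B = \varnothing\), and in this case \(A\) and \(B\) are said to be disjoint.
The empty set is the set that has no elements.
The concept of the empty set can be hard to accept. Can something with no elements be called a set? It is analogous to accepting 0 as a number, which was presumably also difficult at first. Adopting the empty set as a convention makes many theorems and proofs work out cleanly, so although it feels odd intuitively, it is a concept used for mathematical convenience.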

\[\begin{aligned} A \cup \varnothing = A \quad \text{ and } \quad A \cap \varnothing = \varnothing \quad \quad \text{for every set A} \end{aligned}\]

Vacuous Truth

Is \(\varnothing \subset A\) true or false?
First consider a statement of the form “if P, then Q”. What happens if no case at all satisfies P?
In logic, a statement whose hypothesis is never satisfied is regarded as true. This is called vacuous truth.
The statement below is true, because no real x satisfies the if-clause.

\[\text{if } x^2 < 0, \text{then } x = 23\]

Now look at \(\varnothing \subset A\) again. Spelled out, it says \(\text{if } x \in \varnothing \text{, then } x \in A\). No x satisfies the if-clause, so the statement is vacuously true. Even \(\varnothing \subset \varnothing\) is true. However, \(\varnothing \in \varnothing\) is not true.
Vacuous truth is less an intuitive fact than a convention fixed by mathematical logic. For example, the contrapositive of a vacuously true statement is also true.

\[\begin{aligned} \text{if } x^2 < 0&, \text{then } x = 23 \\[1em] \text{if } x \neq 23 &, \text{then } x^2 \geq 0 \end{aligned}\]

A statement is vacuously true in either of the following forms.

\[\begin{cases} \forall x: P(x) \Rightarrow Q(x), \text{ where it is the case that } \forall x: \neg P(x) \\[1em] \forall x \in A : Q(x), \text{ where the set A is empty} \end{cases}\]

For a more intuitive picture, think of \(\text{if P,then Q}\) in terms of sets. Let \(P\) be the set of cases satisfying P and \(Q\) the set of cases satisfying Q. The statement then becomes the set relation \(P \subseteq Q\). If \(P\) is the empty set, it is contained in \(Q\), so the statement is true. For the contrapositive, \(\neg Q \subseteq \neg P\), note that \(\neg P\) is the set of all elements, so the statement holds for every set \(Q\) as well.
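JavaScript's array methods follow the same convention, which gives a quick way to experiment with it. A minimal sketch, with made-up sample values and a hypothetical helper `isSubset`:

```javascript
const isNegativeSquare = (x) => x * x < 0; // P(x): never true for real numbers
const xs = [-3, -1, 0, 2, 23, 40];

// "if x^2 < 0, then x = 23": check Q on exactly those x that satisfy P.
const satisfyingP = xs.filter(isNegativeSquare); // []
console.log(satisfyingP.every((x) => x === 23)); // true (vacuously)

// "for every x in the empty set, Q(x)" is true whatever Q is ...
console.log([].every(() => false)); // true
// ... while "for at least one x in the empty set, Q(x)" is false.
console.log([].some(() => true)); // false

// ∅ ⊂ A for any A: every element of ∅ is in A.
const isSubset = (s, t) => [...s].every((x) => t.has(x));
console.log(isSubset(new Set(), new Set([1, 2]))); // true
console.log(isSubset(new Set(), new Set())); // true
```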

Negation

The negation of a statement \(P\) is the statement not \(P\).
In most cases forming not \(P\) is easy, but logical quantifiers such as “for every” and “for at least one” can cause confusion.

\[\text{For every } x \in A, \text{statement P holds}\]

The negation of the statement above is as follows. Note that the negation of “for every” is “for at least one”.

\[\text{For at least one } x \in A, \text{statement P does not hold}\]

Conversely, the negation of “for at least one” is “for every”.

Distributive law

The set operations \(\cup, \cap\) satisfy the distributive law.
Note that it holds both ways: \(\cup\) distributes over \(\cap\), and \(\cap\) distributes over \(\cup\).

\[\begin{aligned} A \cup (B \cap C) = (A \cup B) \cap (A \cup C) \\[1em] A \cap (B \cup C) = (A \cap B) \cup (A \cap C) \end{aligned}\]

The following equations are called DeMorgan’s laws.

\[\begin{aligned} A - (B \cup C) = (A - B)\cap(A - C) \\[1em] A - (B \cap C) = (A - B)\cup(A - C) \end{aligned}\]

Collections of Sets

A set can have sets as its elements.
If every element of a set is itself a set, it is called a collection of sets.
A typical example is the power set of \(A\).
The power set of \(A\) is the set whose elements are all the subsets of \(A\). It is written \(\mathcal{P}(A)\).

\[\begin{aligned} A &= \{a, b, c\} \\[1em] a &\in A, \\[1em] \{a\} &\subset A, \\[1em] \{a\} &\in \mathcal{P}(A) \end{aligned}\]
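The distinction between \(a \in A\) and \(\{a\} \in \mathcal{P}(A)\) becomes concrete when the power set is enumerated. A minimal sketch, representing sets as arrays; `powerSet` is a hypothetical helper written for this example.

```javascript
// Build all subsets of a finite set given as an array.
function powerSet(elements) {
  // Start with {∅}; each new element doubles the collection.
  let subsets = [[]];
  for (const e of elements) {
    subsets = subsets.concat(subsets.map((s) => [...s, e]));
  }
  return subsets;
}

const P = powerSet(["a", "b", "c"]);
console.log(P.length); // 8 = 2^3
console.log(P.some((s) => s.length === 1 && s[0] === "a")); // true: {a} ∈ P(A)
console.log(P.some((s) => s === "a")); // false: the element a itself is not in P(A)
```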

Arbitrary Unions and Intersections

Unions and intersections need not be limited to two sets; they can be taken over many sets at once.
Given a collection \(\mathcal{A}\), the union of the elements of \(\mathcal{A}\) is defined as follows.

\[\bigcup_{A \in \mathcal{A}} A = \{x|x \in A \text{ for at least one } A \in \mathcal{A}\}\]

The intersection of the elements of \(\mathcal{A}\) is defined as follows.

\[\bigcap_{A \in \mathcal{A}} A = \{x|x \in A \text{ for every } A \in \mathcal{A}\}\]

What happens if \(\mathcal{A}\) is the empty collection?
Consider the union first. Looking at the property alone, for x to qualify there must be at least one \(A \in \mathcal{A}\) that contains it. Since \(\mathcal{A}\) is empty, no such \(A\) exists, so no \(x\) satisfies the property and the union is empty.
Now consider the intersection. Written as a formula, its property is \(\forall A \in \mathcal{A}, x \in A\). This has the same structure as the second form of vacuous truth, so the statement is true no matter what \(x\) is.
Not all mathematicians accept this reasoning, however, so the intersection of a collection is left undefined when the collection is empty.
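The empty-collection case can be tried out directly. In the sketch below the intersection is taken relative to an explicit universe `X`, which is an assumption of this example (the definition above has no universe); `unionAll` and `intersectAll` are hypothetical helper names.

```javascript
// Union over a collection of Sets: x belongs if it is in at least one member.
const unionAll = (collection) => new Set(collection.flatMap((s) => [...s]));

// Intersection over a collection, with candidates x drawn from a universe X:
// x belongs if it is in every member.
const intersectAll = (collection, universe) =>
  new Set([...universe].filter((x) => collection.every((s) => s.has(x))));

const X = new Set([1, 2, 3, 4]);
const collection = [new Set([1, 2, 3]), new Set([2, 3, 4])];

console.log([...unionAll(collection)]); // [ 1, 2, 3, 4 ]
console.log([...intersectAll(collection, X)]); // [ 2, 3 ]

// Empty collection: the union is empty, while every x in X passes the
// "for every A" test vacuously, so the intersection is all of X.
console.log(unionAll([]).size); // 0
console.log(intersectAll([], X).size); // 4
```

Without a fixed universe there is no set of candidates to filter, which is one way to see why the empty intersection is usually left undefined.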

Cartesian Product

The Cartesian product \(A \times B\) is defined as follows.

\[A \times B = \{(a, b)\text{ }|\text{ } a \in A \text{ and } b \in B\}\]

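\((a, b)\) is called an ordered pair.
An ordered pair is usually treated simply as an object built from \(a\) and \(b\) in which the order matters, but how it is realized as a set depends on the definition chosen; Munkres, for example, defines \((a, b) = \{\{a\}, \{a, b\}\}\).
A Cartesian product can also be taken between sets that are themselves Cartesian products (identifying \(((a, b), (c, d))\) with \((a, b, c, d)\)).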

\[(A \times B) \times (C \times D) = \{(a, b, c, d)\text{ }|\text{ } a \in A, b \in B, c \in C, d \in D \}\]

Normalized Cut

2022-07-08T00:00:00+00:00

References
[1] https://people.eecs.berkeley.edu/~malik/papers/SM-ncut.pdf

Code
https://github.com/tinnunculus/Ncut/blob/master/Ncut.ipynb

Introduction
Conventional Cutting algorithm
Normalized Cut
Computing the Optimal Partition
Grouping Algorithm
Example: Brightness Images
Review

Introduction

This paper, published in 2000, proposes a new graph partitioning method based on spectral graph theory.
It introduces the Normalized Cut algorithm, a new graph cutting method that fixes a problem in earlier graph cutting algorithms.
Minimizing the normalized cut exactly is NP-complete; the paper handles this efficiently by treating it as a generalized eigenvalue problem.

Conventional Cutting algorithm

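The problem of splitting a graph \(G = (\mathbf{V}, \mathbf{E})\) into two disjoint sets \(\mathbf{A}, \mathbf{B}\) with \(\mathbf{A} \cup \mathbf{B} = \mathbf{V}, \mathbf{A} \cap \mathbf{B} = \emptyset\) is called a graph cut.
The graph consists of a set of nodes \(\mathbf{V}\) and a set of edges \(\mathbf{E}\) whose weights express the similarity between two nodes.
The function that measures the association between two sets \(\mathbf{A}, \mathbf{B}\) is written \(assoc(\mathbf{A}, \mathbf{B})\). (When \(\mathbf{A}\) and \(\mathbf{B}\) are the two sides of a partition, the paper calls this sum \(cut(\mathbf{A}, \mathbf{B})\).)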

\[assoc(\mathbf{A}, \mathbf{B}) = \displaystyle\sum_{u \in \mathbf{A}, v \in \mathbf{B}} w(u, v)\]

Optimal graph partitioning looks for the \(\mathbf{A}, \mathbf{B}\) that minimize \(assoc(\mathbf{A}, \mathbf{B})\), that is, the two disjoint sets that are most dissociated from each other.
The number of ways to split \(\mathbf{V}\) into two sets grows exponentially, so brute force is very expensive; however, minimum cut was already a well-studied problem at the time, and efficient algorithms for it existed.
Cutting so as to minimize the expression above, however, favors cutting off nodes that sit isolated in the graph (nodes with low similarity to the rest). Because the association function is an unnormalized sum, cuts with fewer terms in the sum tend to win, so the result tends to be a small set of nodes. In the figure below, for a graph where nearby nodes have high weight, partitioning along the central line looks ideal, but the cut actually separates nodes n1 and n2.

The problem comes from measuring association with a plain sum. Would it help to normalize by dividing by the number of edges instead? It would not. Normalizing by edge count still tends to cut off isolated (low-similarity) nodes, because it selects the edges with the smallest average weight. In the figure below, the cut would again separate the single farthest node rather than follow the central line.

Normalized Cut

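The paper proposes a new normalized association function (the paper's name for this quantity is the normalized cut, Ncut).
Instead of simply summing the similarity between the two sets \(\mathbf{A}, \mathbf{B}\), it measures that sum as a fraction of each set's total connection to all nodes.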

\[\begin{aligned} Nassoc(\mathbf{A}, \mathbf{B}) = \frac{assoc(\mathbf{A}, \mathbf{B})}{assoc(\mathbf{A}, \mathbf{V})} + \frac{assoc(\mathbf{A}, \mathbf{B})}{assoc(\mathbf{B}, \mathbf{V})} \end{aligned}\]

With the new association function, the tendency of earlier algorithms to cut off isolated nodes can be avoided.
If one set has far fewer nodes than the other, one term of \(Nassoc\) approaches 0 while the other approaches 1, so the total is not small.
Moreover,

\[\begin{aligned} Nassoc(\mathbf{A}, \mathbf{B}) &= \frac{assoc(\mathbf{A}, \mathbf{B})}{assoc(\mathbf{A}, \mathbf{V})} + \frac{assoc(\mathbf{A}, \mathbf{B})}{assoc(\mathbf{B}, \mathbf{V})} \\[1em] &= \frac{assoc(\mathbf{A}, \mathbf{V}) - assoc(\mathbf{A}, \mathbf{A})}{assoc(\mathbf{A}, \mathbf{V})} + \frac{assoc(\mathbf{B}, \mathbf{V}) - assoc(\mathbf{B}, \mathbf{B})}{assoc(\mathbf{B}, \mathbf{V})} \\[1em] &= 2 - (\frac{assoc(\mathbf{A}, \mathbf{A})}{assoc(\mathbf{A}, \mathbf{V})} + \frac{assoc(\mathbf{B}, \mathbf{B})}{assoc(\mathbf{B}, \mathbf{V})}) \end{aligned}\]

Minimizing Nassoc is equivalent to maximizing the bracketed term in the last line above, \((\frac{assoc(\mathbf{A}, \mathbf{A})}{assoc(\mathbf{A}, \mathbf{V})} + \frac{assoc(\mathbf{B}, \mathbf{B})}{assoc(\mathbf{B}, \mathbf{V})})\). In other words, unlike the original association function, which considers only the edges being cut, the normalized association function cuts in the direction that increases the association within each group, so the tendency of earlier algorithms to split off small sets disappears.
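The difference between the two criteria shows up on a toy graph. The sketch below uses a made-up 7-node weight matrix and this post's notation (`assoc` for the cross-set sum, `nassoc` for the normalized version); it illustrates the formulas and is not code from the paper.

```javascript
// Symmetric weights (made up): two triangles {0,1,2} and {3,4,5} joined by a
// weak edge (2,3), plus node 6 hanging off node 5 by an even weaker edge.
const W = [
  [0, 1, 1, 0, 0, 0, 0],
  [1, 0, 1, 0, 0, 0, 0],
  [1, 1, 0, 0.1, 0, 0, 0],
  [0, 0, 0.1, 0, 1, 1, 0],
  [0, 0, 0, 1, 0, 1, 0],
  [0, 0, 0, 1, 1, 0, 0.05],
  [0, 0, 0, 0, 0, 0.05, 0],
];
const V = [0, 1, 2, 3, 4, 5, 6];

// assoc(A, B): sum of w(u, v) over u in A, v in B.
const assoc = (A, B) =>
  A.reduce((sum, u) => sum + B.reduce((s, v) => s + W[u][v], 0), 0);

// Nassoc(A, B) = assoc(A, B) / assoc(A, V) + assoc(A, B) / assoc(B, V).
const nassoc = (A, B) => assoc(A, B) / assoc(A, V) + assoc(A, B) / assoc(B, V);

const balanced = [[0, 1, 2], [3, 4, 5, 6]];
const isolated = [[6], [0, 1, 2, 3, 4, 5]];

console.log(assoc(...balanced), assoc(...isolated)); // 0.1 vs 0.05: the plain sum prefers cutting off node 6
console.log(nassoc(...balanced), nassoc(...isolated)); // about 0.03 vs about 1.0: the normalized value prefers the balanced cut
```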

\[\begin{aligned} \text{number of total nodes} &= N \\[1em] |\mathbf{A}| &= x \\[1em] \end{aligned}\] \[\begin{aligned} \frac{assoc(\mathbf{A}, \mathbf{A})}{assoc(\mathbf{A}, \mathbf{V})} &= \frac{\displaystyle\sum_{i=1}^{x}i}{x(N-1)} \\[1em] &= \frac{x(x + 1)}{2x(N - 1)} \\[1em] &= \frac{x + 1}{2(N - 1)} \\[1em] \end{aligned}\] \[\begin{aligned} \therefore \frac{assoc(\mathbf{A}, \mathbf{A})}{assoc(\mathbf{A}, \mathbf{V})} + \frac{assoc(\mathbf{B}, \mathbf{B})}{assoc(\mathbf{B}, \mathbf{V})} &= \frac{x + 1}{2(N - 1)} + \frac{N -x + 1}{2(N - 1)} \\[1em] &= \frac{N + 1}{2(N - 1)} \end{aligned}\]

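The expressions above count the graph edges that enter the computation of \(Nassoc\).
The final expression contains no \(x\), which shows that the problem with the original \(assoc\) function, where the association value shrinks as the number of edges shrinks, has been resolved.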

Computing the Optimal Partition

The graph partitioning problem is solved by finding the sets \(\mathbf{A}, \mathbf{B}\) that minimize \(Nassoc\).
However, finding the exact minimum of \(Nassoc\) is NP-complete.
If the problem is relaxed to the real-valued domain, though, an approximate solution can be found efficiently.

\[\begin{aligned} &\mathbf{x} : \text{an } N = |\mathbf{V}| \text{ dimensional vector, } x_i = 1 \text{ if node } i \text{ is in } \mathbf{A} \text{ and } -1 \text{ otherwise} \\[1em] &\mathbf{d}(i) = \sum_{j}w(i, j) \\[1em] &\mathbf{D} : \text{N x N diagonal matrix with d on its diagonal} \\[1em] &\mathbf{W} : \text{N x N symmetrical matrix with } W(i, j) = w_{ij} \\[1em] &k = \frac{\sum_{x_i>0} \mathbf{d}_i}{\sum_i \mathbf{d}_i} = \frac{assoc(\mathbf{A},\mathbf{V})}{assoc(\mathbf{V},\mathbf{V})} \end{aligned}\]

With this notation, the partitioning problem of finding the sets \(\mathbf{A}, \mathbf{B}\) can be replaced by the problem of finding the vector \(\mathbf{x}\).
The \(Nassoc\) function can be rewritten in these terms.

\[\begin{aligned} 4Nassoc(\mathbf{A}, \mathbf{B}) &= 4(\frac{assoc(\mathbf{A}, \mathbf{B})}{assoc(\mathbf{A}, \mathbf{V})} + \frac{assoc(\mathbf{B}, \mathbf{A})}{assoc(\mathbf{B}, \mathbf{V})}) \\[1em] &= 4(\frac{\sum_{(x_i>0, x_j<0)} -w_{ij}\mathbf{x}_i\mathbf{x}_j}{\sum_{\mathbf{x}_i>0}\mathbf{d}_i} + \frac{\sum_{(x_i<0, x_j>0)} -w_{ij}\mathbf{x}_i\mathbf{x}_j}{\sum_{\mathbf{x}_i<0}\mathbf{d}_i}) \\[1em] &= \frac{(\mathbf{1} + \mathbf{x})^T(\mathbf{D} - \mathbf{W})(\mathbf{1} + \mathbf{x})}{k\mathbf{1}^T\mathbf{D}\mathbf{1}} + \frac{(\mathbf{1} - \mathbf{x})^T(\mathbf{D} - \mathbf{W})(\mathbf{1} - \mathbf{x})}{(1 - k)\mathbf{1}^T\mathbf{D}\mathbf{1}} \\[1em] &= \frac{\mathbf{x}^T(\mathbf{D} - \mathbf{W})\mathbf{x} + \mathbf{1}^T(\mathbf{D} - \mathbf{W})\mathbf{1}}{k(1-k)\mathbf{1}^T\mathbf{D}\mathbf{1}} + \frac{2(1-2k)\mathbf{1}^T(\mathbf{D} - \mathbf{W})\mathbf{x}}{k(1-k)\mathbf{1}^T\mathbf{D}\mathbf{1}} \end{aligned}\]

As shown above, \(Nassoc\) can be written as a quadratic expression in \(\mathbf{x}\). Once something is written as a quadratic, it is only natural to want to complete the square.

\[\begin{aligned} \alpha(\mathbf{x}) &= \mathbf{x}^T(\mathbf{D} - \mathbf{W})\mathbf{x}, \\[1em] \beta(\mathbf{x}) &= \mathbf{1}^T(\mathbf{D} - \mathbf{W})\mathbf{x}, \\[1em] \gamma &= \mathbf{1}^T(\mathbf{D} - \mathbf{W})\mathbf{1}, \\[1em] M &= \mathbf{1}^T\mathbf{D}\mathbf{1} \end{aligned}\] \[\begin{aligned} 4Nassoc(\mathbf{x}) &= \frac{(\alpha(\mathbf{x}) + \gamma) + 2(1 - 2k)\beta(\mathbf{x})}{k(1-k)M} \\[1em] &= \frac{(\alpha(\mathbf{x}) + \gamma) + 2(1 - 2k)\beta(\mathbf{x})}{k(1-k)M} - \frac{2(\alpha(\mathbf{x}) + \gamma)}{M} + \frac{2\alpha(\mathbf{x})}{M} + \frac{2\gamma}{M} \\[1em] &= \frac{(\alpha(\mathbf{x}) + \gamma) + 2(1 - 2k)\beta(\mathbf{x})}{k(1-k)M} - \frac{2(\alpha(\mathbf{x}) + \gamma)}{M} + \frac{2\alpha(\mathbf{x})}{M} \quad (\gamma = 0) \\[1em] &= \frac{(1 - 2k + 2k^2)(\alpha(\mathbf{x}) + \gamma) + 2(1 - 2k) \beta(\mathbf{x}) }{k(1 - k)M} + \frac{2\alpha(\mathbf{x})}{M} \\[1em] &= \frac{\frac{(1-2k+2k^2)}{(1-k)^2}(\alpha(\mathbf{x}) + \gamma) + \frac{2(1-2k)}{(1-k)^2}\beta(\mathbf{x})}{\frac{k}{1-k}M} + \frac{2\alpha(\mathbf{x})}{M} \end{aligned}\]

Letting \(b = \frac{k}{1-k}\),

\[\begin{aligned} 4Nassoc(\mathbf{x}) &= \frac{(1 + b^2)(\alpha(\mathbf{x}) + \gamma) + 2(1 - b^2)\beta(\mathbf{x})}{bM} + \frac{2b\alpha(\mathbf{x})}{bM} \\[1em] &= \frac{(1 + b^2)(\alpha(\mathbf{x}) + \gamma)}{bM} + \frac{2(1 - b^2)\beta(\mathbf{x})}{bM} + \frac{2b\alpha(\mathbf{x})}{bM} - \frac{2b\gamma}{bM} \\[1em] &= \frac{(1 + b^2)(\mathbf{x}^T(\mathbf{D} - \mathbf{W})\mathbf{x} + \mathbf{1}^T(\mathbf{D} - \mathbf{W})\mathbf{1})}{b\mathbf{1}^T\mathbf{D}\mathbf{1}} + \frac{2(1 - b^2)\mathbf{1}^T(\mathbf{D} - \mathbf{W})\mathbf{x}}{b\mathbf{1}^T\mathbf{D}\mathbf{1}} + \frac{2b\mathbf{x}^T(\mathbf{D} - \mathbf{W})\mathbf{x}}{b\mathbf{1}^T\mathbf{D}\mathbf{1}} - \frac{2b\mathbf{1}^T(\mathbf{D} - \mathbf{W})\mathbf{1}}{b\mathbf{1}^T\mathbf{D}\mathbf{1}} \\[1em] &= \frac{(\mathbf{1} + \mathbf{x})^T(\mathbf{D} - \mathbf{W})(\mathbf{1} + \mathbf{x})}{b\mathbf{1}^T\mathbf{D}\mathbf{1}} + \frac{b^2(\mathbf{1} - \mathbf{x})^T(\mathbf{D} - \mathbf{W})(\mathbf{1} - \mathbf{x})}{b\mathbf{1}^T\mathbf{D}\mathbf{1}} - \frac{2b(\mathbf{1} - \mathbf{x})^T(\mathbf{D} - \mathbf{W})(\mathbf{1} + \mathbf{x})}{b\mathbf{1}^T\mathbf{D}\mathbf{1}} \\[1em] &= \frac{[(\mathbf{1} + \mathbf{x}) - b(\mathbf{1} - \mathbf{x})]^T(\mathbf{D} - \mathbf{W})[(\mathbf{1} + \mathbf{x}) - b(\mathbf{1} - \mathbf{x})]}{b\mathbf{1}^T\mathbf{D}\mathbf{1}} \\[3em] \end{aligned}\]

Introducing a new variable in terms of \(\mathbf{x}\), \(\mathbf{y} = (\mathbf{1} + \mathbf{x}) - b(\mathbf{1} - \mathbf{x})\),

\[\begin{aligned} \mathbf{y}^T\mathbf{D}\mathbf{y} &= 4\displaystyle\sum_{x_i>0}\mathbf{d}_i + 4b^2\displaystyle\sum_{x_i<0}\mathbf{d}_i \\[1em] &= 4b\displaystyle\sum_{x_i<0}\mathbf{d}_i + 4b^2\displaystyle\sum_{x_i<0}\mathbf{d}_i \\[1em] &= 4b(\displaystyle\sum_{x_i<0}\mathbf{d}_i + b\displaystyle\sum_{x_i<0}\mathbf{d}_i) \\[1em] &= 4b\mathbf{1}^T\mathbf{D}\mathbf{1} \end{aligned}\] \[\begin{aligned} \therefore min_\mathbf{x}Nassoc(\mathbf{x}) &= min_\mathbf{y}\frac{\mathbf{y}^T(\mathbf{D}-\mathbf{W})\mathbf{y}}{\mathbf{y}^T\mathbf{D}\mathbf{y}} \end{aligned}\]

In the final, simplified expression \(\mathbf{y}\) takes real values, and \(\mathbf{D} - \mathbf{W}\) is real, symmetric, and positive semidefinite.
The expression above is called a Rayleigh quotient.
Minimizing it is the same as minimizing \(\mathbf{y}^T(\mathbf{D} - \mathbf{W})\mathbf{y}\) subject to the constraint \(\mathbf{y}^T\mathbf{D}\mathbf{y} = 1\).
This problem is easy to solve with the method of Lagrange multipliers.

\[\begin{aligned} &\text{minimize} \quad \mathbf{y}^T(\mathbf{D} - \mathbf{W})\mathbf{y} \\[1em] &\text{subject to} \quad \mathbf{y}^T\mathbf{D}\mathbf{y} = 1 \end{aligned}\] \[\begin{aligned} 0 &= \frac{\partial}{\partial\mathbf{y}}\mathbf{y}^T(\mathbf{D} - \mathbf{W})\mathbf{y} - \lambda\frac{\partial}{\partial\mathbf{y}}\mathbf{y}^T\mathbf{D}\mathbf{y} \\[1em] &= 2(\mathbf{D} - \mathbf{W})\mathbf{y} - 2\lambda\mathbf{D}\mathbf{y} \\[3em] \end{aligned}\] \[\begin{aligned} (\mathbf{D} - \mathbf{W})\mathbf{y} = \lambda\mathbf{D}\mathbf{y} \end{aligned}\]

The resulting equation is called a generalized eigensystem; solving it yields \(\lambda\) and the corresponding \(\mathbf{y}\).
To solve it, introduce a new variable \(\mathbf{z} = \mathbf{D}^\frac{1}{2}\mathbf{y}\) and substitute it into the equation.

\[\begin{aligned} \mathbf{D}^{-\frac{1}{2}}(\mathbf{D} - \mathbf{W})\mathbf{D}^{-\frac{1}{2}}\mathbf{z} = \lambda\mathbf{z} \end{aligned}\]

\(\mathbf{D}^{-\frac{1}{2}}(\mathbf{D} - \mathbf{W})\mathbf{D}^{-\frac{1}{2}}\) is symmetric positive semidefinite, so \(\lambda\) is a real number greater than or equal to 0.
In fact \(\lambda = 0\) is the smallest eigenvalue, and its eigenvector is the smallest eigenvector. But \(\lambda = 0\) corresponds to the constant vector \(\mathbf{y} = \mathbf{1}\), which puts every node in the same set (\(k = 1\)) and so does not satisfy the premise of a partition.
The paper therefore takes the second smallest eigenvalue as the minimum of the objective function and the corresponding second smallest eigenvector as the solution of the equation.

Grouping Algorithm

Given an image or image sequence, set up a weighted graph \(\mathbf{G} = (\mathbf{V}, \mathbf{E})\) and set the weight on the edge connecting two nodes to be a measure of the similarity between the two nodes.
Solve \(\mathbf{D}^{-\frac{1}{2}}(\mathbf{D} - \mathbf{W})\mathbf{D}^{-\frac{1}{2}}\mathbf{x} = \lambda\mathbf{x}\) for eigenvectors with the smallest eigenvalues.
Use the eigenvector with the second smallest eigenvalue to bipartition the graph.
Decide if the current partition should be subdivided and recursively repartition the segmented parts if necessary.
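The bipartition step can be sketched on a tiny graph without any linear algebra library. To keep the example dependency-free, the code below swaps in plain power iteration with deflation in place of a proper eigensolver (the paper uses a Lanczos-type method); the weight matrix, the iteration count, and the starting vector are all assumptions made for illustration.

```javascript
// Two triangles {0,1,2} and {3,4,5} joined by one weak edge (2,3); made-up weights.
const W = [
  [0, 1, 1, 0, 0, 0],
  [1, 0, 1, 0, 0, 0],
  [1, 1, 0, 0.1, 0, 0],
  [0, 0, 0.1, 0, 1, 1],
  [0, 0, 0, 1, 0, 1],
  [0, 0, 0, 1, 1, 0],
];
const d = W.map((row) => row.reduce((s, w) => s + w, 0)); // d(i) = sum_j w(i, j)

const dot = (a, b) => a.reduce((s, ai, i) => s + ai * b[i], 0);
const normalize = (v) => {
  const len = Math.sqrt(dot(v, v));
  return v.map((x) => x / len);
};

// L = D^{-1/2}(D - W)D^{-1/2} has eigenvalues in [0, 2]. Power iteration finds the
// largest eigenvalue, so iterate on M = 2I - L = I + D^{-1/2} W D^{-1/2} instead.
const M = W.map((row, i) =>
  row.map((w, j) => (i === j ? 1 : 0) + w / Math.sqrt(d[i] * d[j]))
);

// The trivial eigenvector of L (eigenvalue 0) is z0 = D^{1/2} 1; project it out
// so the iteration converges to the second smallest eigenvector of L.
const z0 = normalize(d.map((di) => Math.sqrt(di)));
let z = normalize(d.map((_, i) => i + 1)); // arbitrary deterministic start
for (let iter = 0; iter < 200; iter++) {
  const proj = dot(z, z0);
  z = z.map((zi, i) => zi - proj * z0[i]);
  z = normalize(M.map((row) => dot(row, z)));
}

// Undo the substitution z = D^{1/2} y and split at 0.
const y = z.map((zi, i) => zi / Math.sqrt(d[i]));
const labels = y.map((yi) => (yi > 0 ? "A" : "B"));
console.log(labels); // nodes 0-2 share one label, nodes 3-5 the other
```

For real images the matrices are large and sparse, so a sparse eigensolver is the practical choice; the loop above is only meant to make the steps visible.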

Example: Brightness Images

Construct a weighted graph \(\mathbf{G} = (\mathbf{V}, \mathbf{E})\) by taking each pixel as a node and connecting each pair of pixels by an edge. \(F(i), X(i)\) are the pixel value and spatial location of node i, respectively.

\[\begin{aligned} w_{ij} = e^{\frac{-||\mathbf{F}_{(i)} - \mathbf{F}_{(j)}||_2^2}{\sigma_I^2}} * \begin{cases} e^{\frac{-||\mathbf{X}_{(i)} - \mathbf{X}_{(j)}||_2^2}{\sigma_X^2}} & \quad \text{if } ||\mathbf{X}_{(i)} - \mathbf{X}_{(j)}||_2 < r \\[1em] 0 & \quad \text{otherwise} \end{cases} \end{aligned}\]

Solve the generalized eigensystem for the eigenvectors with the smallest eigenvalues of the system. \(\mathbf{D}^{-\frac{1}{2}}(\mathbf{D} - \mathbf{W})\mathbf{D}^{-\frac{1}{2}}\mathbf{x} = \lambda\mathbf{x}\)
Once the eigenvectors are computed, we can partition the graph into two pieces using the second smallest eigenvector. In practice, however, the eigenvector takes continuous values, so we need to choose a splitting point to partition it into two parts. Normally we can take 0 as the splitting point.
After the graph is broken into two pieces, we can recursively run the algorithm on the two partitioned parts. Alternatively, we could take advantage of the other small eigenvectors to partition the graph into more than two pieces. However, the greater the eigenvalue, the lower the stability of the partition boundary. As seen in the eigenvectors with the seventh to ninth smallest eigenvalues in the bottom picture, these eigenvectors take on the shape of a continuous function rather than the discrete indicator we seek. So we simply ignore all eigenvectors whose values vary smoothly, by measuring the degree of smoothness in the eigenvector values and thresholding.

Review

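This paper clearly lays out the problem with the earlier graph cut criterion, proposes a good Normalized Cut algorithm, and gives a way to compute it.
In particular, deriving a Rayleigh quotient form from the proposed normalized cut is what makes it an excellent paper.
I do have doubts about the minimization, however. The equation used to solve the Rayleigh quotient, \(\mathbf{D}^{-\frac{1}{2}}(\mathbf{D} - \mathbf{W})\mathbf{D}^{-\frac{1}{2}}\mathbf{x} = \lambda\mathbf{x}\), yields critical points, not the minimum. That is, the range of objective function values coincides with the range of the eigenvalues. So I do not think we can conclude that the second smallest eigenvalue, rather than the smallest, is the minimum of this function. In other words, I doubt the reliability of a solution obtained under the constraint. In applications, though, the vector represents all the pixels of an image, so a very high-dimensional vector is being estimated. The second smallest eigenvalue may or may not be the minimum, but it can be said to be sufficiently small, and so can the third through ninth that follow it. Also, since all the eigenvectors are orthogonal, each one can be expected to express a distinct, non-overlapping feature.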

DINO

2022-07-05T00:00:00+00:00

References
[1] https://arxiv.org/pdf/2104.14294.pdf

Introduction
SSL with Knowledge Distillation
Algorithm

Introduction

Vision Transformers have recently performed well enough on vision tasks to replace convolutions.
The paper asks whether the success of Transformers on vision tasks can be explained by the supervision used in their pretraining; the motivation is that self-supervised pretraining is one of the ingredients of the Transformer's success in NLP.
In NLP, self-supervised pretraining proceeds by language modeling or by generating the missing words in a sentence, which yields a richer understanding of sentences than training to predict a single label per sentence.
Likewise for images, supervised training as in ViT does not capture the rich visual information contained in an image, so the paper aims to capture it through self-supervised learning.
The paper studies the effect of self-supervised pretraining using the ViT model.
The result is that a self-supervised ViT, unlike a supervised ViT, explicitly captures the scene layout (object boundaries). This can be seen in the last block of the trained ViT.
The emergence of these segmentation-like features appears to be a result shared across self-supervised methods.
Good classification results with k-NN, however, can be attributed to the momentum encoder and multi-crop augmentation.
The paper uses knowledge distillation with no labels, in which a student model and a teacher model carry out self-supervised learning, and trains with a momentum encoder.
Unlike earlier work, it prevents training from collapsing using only centering and sharpening of the teacher model's outputs.
In self-supervised knowledge distillation, training uses only the outputs of the student and the teacher, with no labels, so during training the outputs for all data can drift toward one particular value. This phenomenon is called collapse. For example, the model may output 0.5, 0.5, …, 0.5 or 1, 0, …, 0 for every image.

SSL with Knowledge Distillation

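DINO, the model proposed in this paper, trains a student and a teacher model based on knowledge distillation.
The student and the teacher use the same architecture.
The student is trained through the [CLS] token at the final layer, and the outputs of both the student and the teacher are divided by a fixed value (the temperature) before the softmax. This controls the sharpness of the output distribution: the smaller the value, the sharper the distribution.
My impression is that it would make more sense to sharpen only the teacher's output and not the student's. If both are sharpened, it seems the outputs would simply be trained back toward a flat distribution, although it might at least prevent flattening at the very start of training.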
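The effect of dividing by a temperature before the softmax is easy to see numerically. This is a minimal sketch; the logits and the two temperature values are made up and are not the settings used in the paper.

```javascript
// Softmax with temperature: divide the logits by `temperature` before normalizing.
function softmax(logits, temperature) {
  const scaled = logits.map((v) => v / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((v) => Math.exp(v - max));
  const sum = exps.reduce((s, e) => s + e, 0);
  return exps.map((e) => e / sum);
}

const logits = [2.0, 1.0, 0.5];
const soft = softmax(logits, 1.0); // plain softmax
const sharp = softmax(logits, 0.1); // smaller temperature, sharper distribution

console.log(soft.map((p) => p.toFixed(3)));
console.log(sharp.map((p) => p.toFixed(3)));
console.log(Math.max(...soft) < Math.max(...sharp)); // true
```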

Training student model

The student and the teacher do not receive the same image as input. If they did, their outputs would differ little, collapse would become more likely, and classification performance would drop.
Images are cropped before being fed to the student and the teacher. The crops are divided into relatively large global views \(x_1^g, x_2^g\) and smaller local views, and only the large crops go to the teacher. This encourages the model to find ‘local-to-global’ correspondences, which suits classification.
Global views are 224x224 images that should cover more than half of the original image. Local views are 96x96 images that should cover less than half of the original image.
Training uses stochastic gradient descent. Note that no momentum is used.
I think momentum is omitted in order to strengthen sharpening. With momentum, the influence of past results could cause a collapse toward the center.

Teacher network

Unlike other knowledge distillation setups, here there are no labels and the student and the teacher share the same network structure, so there is no teacher in the real sense.
Freezing the teacher network over an epoch gave better results, while simply copying the student weights to the teacher failed to converge.
Rather than copying the student weights, using an exponential moving average to carry momentum led to more stable training.
The teacher weights are updated by the rule \(\theta_t \leftarrow \lambda\theta_t + (1 - \lambda)\theta_s\), where \(\lambda\) follows a cosine schedule that increases from 0.996 to 1. Only a very small fraction of the student weights feeds into each update.
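The update rule can be sketched on plain arrays of weights. The cosine form used for \(\lambda\) below is an assumption (a common way to write a schedule that rises from 0.996 to 1); the weights, the step count, and the function names are made up for illustration.

```javascript
const BASE_MOMENTUM = 0.996;

// Cosine schedule: lambda rises from BASE_MOMENTUM at step 0 to 1 at the last step.
function momentumAt(step, totalSteps) {
  const cosine = (Math.cos((Math.PI * step) / totalSteps) + 1) / 2; // goes 1 -> 0
  return 1 - (1 - BASE_MOMENTUM) * cosine;
}

// theta_t <- lambda * theta_t + (1 - lambda) * theta_s, applied per weight.
const emaUpdate = (teacher, student, lambda) =>
  teacher.map((t, i) => lambda * t + (1 - lambda) * student[i]);

let teacher = [1.0, -1.0];
const student = [0.0, 0.0];
teacher = emaUpdate(teacher, student, momentumAt(0, 100));
console.log(teacher); // each weight moves only 0.4% of the way toward the student
console.log(momentumAt(100, 100)); // at the end of training the teacher stops moving
```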

Avoiding collapse

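To avoid collapse in self-supervised training, the paper uses two techniques, sharpening and centering.
Sharpening, as seen earlier, divides the model output by a small value before the softmax to make the distribution more peaked.
Centering adds a momentum (running) average of past outputs to the teacher's output before the softmax, which prevents any single dimension from dominating.
The two techniques act as a trade-off against each other and together keep training from collapsing.
In the figure above, the KL divergence is computed from \(H(P_t, P_s) = h(P_t) + D_{KL}(P_t\|P_s)\).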

\[\begin{aligned} g_t(x) \leftarrow g_t(x) + c \\[1em] c \leftarrow mc + (1-m)\frac{1}{B}\sum_{i=1}^{B} g_{\theta_t}(x_i) \end{aligned}\]

Algorithm

Node- Socket

2022-07-01T00:00:00+00:00

References
[1] https://socket.io/docs/v4

Introduction
socket.io
io 객체
socket 객체

Introduction

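Among the ways (protocols) of exchanging information over a network, HTTP is one-directional: only the client can send requests to the server. Some problems, however, such as real-time chat, require the server to send data to the client. The traditional approach kept the HTTP protocol and used polling.
With polling, the client periodically sends the server a request over HTTP asking whether anything has changed. Because these check requests must be sent periodically, they consume a lot of network resources, and as the number of clients grows, the flood of check requests increasingly bottlenecks the server.
Unlike HTTP, a socket protocol allows communication in both directions.
It does so through a connection between the server and the client.
Sockets are classified as TCP, UDP, and so on, according to how the server and the client are connected.
The connection request is always sent from the client to the server.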

socket.io

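socket.io is a package that makes sockets convenient to use in Node.
In Node.js, sockets exchange data through events.
With the socket.io package, the socket server can be attached to an HTTP server; that is, the socket server is built on top of the HTTP server. Socket connection requests are therefore sent to the HTTP server as well.
In the code below, the http_server object is the return value of app.listen() and holds the information about the HTTP server.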

const SocketIO = require('socket.io');
const io = SocketIO(http_server, { path: '/socket.io' }); // attach the socket server to the HTTP server; both use the same port
io.on('connection', (socket) => {}); // the callback runs once a connection is established

io 객체

Sockets on the server are organized hierarchically; the first level of division is called a namespace.
A client must send its connection request to a namespace of the socket server.

const socket1 = io.of('/socket1');
const socket2 = io.of('/socket2');

socket1.on('connection', (socket) => {}); // clients must send their connection request to /socket1
socket2.on('connection', (socket) => {}); // clients must send their connection request to /socket2

The io.emit method sends data to every socket currently connected to the server.

Because the connection between client and server is established over HTTP request and response, middleware can be added to the socket server as well; it runs only at connection time.

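// Register the middleware once, outside the connection handler; it runs for every incoming connection.
// isValid stands for the application's own validation function.
io.use((socket, next) => {
    if (isValid(socket.request)) {
        next();
    } else {
        next(new Error("invalid"));
    }
});

io.on('connection', (socket) => {
    // reached only when the middleware called next() without an error
});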

The io.socketsJoin method makes sockets inside the socket server join rooms.

// make all sockets in the socket server join "room1"
io.socketsJoin("room1");

// make all sockets in "room1" join "room2" and "room3"
io.in("room1").socketsJoin(["room2", "room3"]);

// make all sockets in "room1" of the "admin" namespace join "room2"
io.of("/admin").in("room1").socketsJoin("room2");

// make the socket with the given socket id join "room1"
io.in(theSocketId).socketsJoin("room1");

The io.socketsLeave method makes sockets leave rooms.

// make all Socket instances leave the "room1" room
io.socketsLeave("room1");

// make all Socket instances in the "room1" room leave the "room2" and "room3" rooms
io.in("room1").socketsLeave(["room2", "room3"]);

// make all Socket instances in the "room1" room of the "admin" namespace leave the "room2" room
io.of("/admin").in("room1").socketsLeave("room2");

// this also works with a single socket ID
io.in(theSocketId).socketsLeave("room1");

The io.disconnectSockets method disconnects sockets. This differs from socketsLeave, which only removes sockets from particular rooms.

// make all Socket instances disconnect
io.disconnectSockets();

// make all Socket instances in the "room1" room disconnect (and discard the low-level connection)
io.in("room1").disconnectSockets(true);

// make all Socket instances in the "room1" room of the "admin" namespace disconnect
io.of("/admin").in("room1").disconnectSockets();

// this also works with a single socket ID
io.of("/admin").in(theSocketId).disconnectSockets();

The io.fetchSockets method looks up particular sockets. The result comes back as a list, so iterate over it with a for loop.

// return all Socket instances of the main namespace
const sockets = await io.fetchSockets();

// return all Socket instances in the "room1" room of the main namespace
const sockets = await io.in("room1").fetchSockets();

// return all Socket instances in the "room1" room of the "admin" namespace
const sockets = await io.of("/admin").in("room1").fetchSockets();

// this also works with a single socket ID
const sockets = await io.in(theSocketId).fetchSockets();

for (const socket of sockets) {
    console.log(socket.id);
    console.log(socket.handshake);
    console.log(socket.rooms);
    console.log(socket.data);
    socket.emit(/* ... */);
    socket.join(/* ... */);
    socket.leave(/* ... */);
    socket.disconnect(/* ... */);
}

socket 객체

Once a connection is established, the socket object is passed as the parameter of the connection callback.
Since data is exchanged through events, use the socket's emit method to raise an event and its on method to receive one.

모든 socket은 room을 가지고 있다. room이란 namespace의 하위 구조라고 볼 수 있으며, 하나의 socket이 여러 room에 속할 수 있고, 하나의 room에 여러 socket이 들어있을 수 있다.
이벤트의 emit과 on은 room을 기준으로 하기에 해당 room에 속한 모든 socket에서 이벤트를 동일시 한다.
현재 자기 자신이 속하고 있는 room 외에도 다른 room에 속한 socket을 통해 데이터를 전송할 수 있다.

io.on("connection", (socket) => {
    console.log(socket.rooms); // Set { <socket.id> }
    socket.join(roomId1);
    socket.join(roomId2);
    console.log(socket.rooms); // Set { <socket.id>, roomId1, roomId2 }

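    // emit needs an event name as its first argument; "hello" is a placeholder name.
    socket.emit("hello", { data: "hello" }); // sends only to the client connected through this socket
    socket.to(roomId1).emit("hello", { data: "hello" }); // sends to every socket in roomId1 except this one
    socket.to(roomId3).emit("hello", { data: "hello" }); // sends to every socket in roomId3, a room this socket has not joined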
    /* ... */
    socket.on('disconnect', () => {
        socket.leave(roomId1);
        socket.leave(roomId2);
    })
});
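
The room semantics above can be modelled with a toy in-memory structure. The sketch below is not the Socket.IO API; `ToyServer` and its methods are hypothetical names used only to illustrate that rooms are a many-to-many mapping, that every socket starts in a room named after its own id, and that a room broadcast from a socket skips the sender.

```python
from collections import defaultdict


class ToyServer:
    """In-memory stand-in for a namespace: tracks which socket ids are in which room."""

    def __init__(self):
        self.rooms = defaultdict(set)   # room name -> set of socket ids
        self.inbox = defaultdict(list)  # socket id -> received (event, payload) pairs

    def connect(self, sid):
        # Like Socket.IO, every socket starts in a room named after its own id.
        self.rooms[sid].add(sid)

    def join(self, sid, room):
        self.rooms[room].add(sid)

    def emit_to_room(self, sender, room, event, payload):
        # Mirrors socket.to(room).emit(...): everyone in the room except the sender.
        for sid in self.rooms[room] - {sender}:
            self.inbox[sid].append((event, payload))


server = ToyServer()
for sid in ("a", "b", "c"):
    server.connect(sid)
server.join("a", "room1")
server.join("b", "room1")

server.emit_to_room("a", "room1", "hello", {"data": "hi"})  # reaches only b
server.emit_to_room("a", "c", "private message", "psst")    # c's own room: a direct message
```

Sending to the room named after another socket's id is what makes the private-message pattern work, even though the sender never joined that room.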

Every socket has a unique id, and the id is the same on the server and on the client. When a socket is created it joins, by default, a room named after its own id.

io.on("connection", (socket) => {
  console.log(socket.id); // ojIckSD2jqNzOqIrAGzL
});

// client-side
socket.on("connect", () => {
  console.log(socket.id); // ojIckSD2jqNzOqIrAGzL
});

io.on("connection", socket => {
    socket.to(anotherSocketId).emit("private message", socket.id, msg); // sends the message only to the socket with id anotherSocketId
});

The socket also holds details of the handshake that took place when the server and client connected (socket.handshake).

{
  headers: /* the headers of the initial request */
  query: /* the query params of the initial request */
  auth: /* the authentication payload */
  time: /* the date of creation (as string) */
  issued: /* the date of creation (unix timestamp) */
  url: /* the request URL string */
  address: /* the ip of the client */
  xdomain: /* whether the connection is cross-domain */
  secure: /* whether the connection is secure */
}

Information stored on the socket.data object can be read back when the socket is retrieved through io.fetchSockets.

// server A
io.on("connection", (socket) => {
  socket.data.username = "alice";
});

// server B
const sockets = await io.fetchSockets();
console.log(sockets[0].data.username); // "alice"

Node- OAuth 2.0

2022-06-06T00:00:00+00:00

References
[1] https://developers.google.com/identity/protocols/oauth2
[2] https://blog.naver.com/mds_datasecurity/222182943542

Introduction
OAuth 2.0 protocol
Authorization Code Grant
Authorization Code Grant procedure
Getting tokens for the Gmail API with OAuth 2.0
Sending email with an access token and the Nodemailer package

Introduction

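OAuth is a protocol in which a Client (a web application) obtains authentication and authorization from an Authorization Server (Google) so that it can use the resources of a Resource Owner (the web application's users) held by a Resource Server (Google). In other words, it is the procedure by which my server proves itself and the user to Google or Naver in order to use the user's Google login, Naver login, Naver mail, and so on.
For email verification my server used Gmail to send mail to users. Previously the server simply stored the Gmail ID and password, logged in to Google's mail server with them, and sent the mail, but Google recently started blocking this method.
When a server stores a user's ID and password and logs in to Google on the user's behalf, the server is impersonating the user to pull user data out of Google, which is poor security practice. From Google's point of view, the client server has not been verified in any trustworthy way.
Once verified, the Client server gains the right to use the user's resources: it uses an access token and a refresh token to fetch the user's resources from the Resource Server.
The key point is that the Authorization Server authenticates both the user and the Client, and then grants permission to the Client server.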

OAuth 2.0 protocol

The parties in OAuth 2.0 are the Resource Owner, who owns the resource (called the Client's user here for convenience), the Client that wants to use the user's resource, the Resource Server that holds the user's resource, and the Authorization Server that authenticates the user and the Client and grants permission.
OAuth defines several (four) ways to grant a Client permission to use a user's resource on the Resource Server. This post covers the most common one, the Authorization Code Grant.

Authorization Code Grant

This is the most widely used OAuth grant. The Authorization Server verifies the user (authentication), fixes the scope of resources to be used, and packs that information into an authorization code that it passes to the Client. The Client then sends the code back to the Authorization Server together with its own credentials, and once the Client is authenticated it may use the user's resources.
The authorization code carries the authenticated user, the scope, and a Client that has not been authenticated yet, so the Client must authenticate itself to the Authorization Server along with the code. Security is strong because the Authorization Server, not the Client, verifies the user, and the Client never needs to hold the user's credentials. In addition, authenticating the user is not enough to use the resource: the Client must also be verified. If only the user were verified, anyone could pose as the Client and obtain a code, because the client ID is public information.
This grant uses an access token and a refresh token.

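Authorization Code Grant procedure

First register the Client with the Authorization Server to obtain a client ID and a client secret. The Authorization Server uses them later to authenticate the Client.
When a user requests an action for which the Client needs the user's resource, the Client sends the Authorization Server a request containing the client ID, the resource scope, and the redirect URL to which the authorization code should be delivered once authentication completes.
The Authorization Server verifies the user on its own, for example through a login page, obtains consent for the requested scope, and then creates an authorization code carrying the user and scope information and passes it to the Client.
Having received the response, the Client sends the authorization code, client ID, and client secret back to the Authorization Server.
The Authorization Server authenticates the Client with the client ID and client secret and issues an access token and a refresh token that give access to that user's resources.
The access token carries the user, scope, and Client information.
Reissuing an access token requires three things: the client ID, the client secret, and the refresh token.
Because access and refresh tokens are used, the user does not have to authenticate again next time.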
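
The requests in this procedure can be sketched with the standard library alone. The snippet below only builds the URL and form bodies and sends nothing over the network. The endpoints are Google's published OAuth 2.0 endpoints; the client ID, client secret, redirect URI, and helper function names are hypothetical placeholders.

```python
from urllib.parse import urlencode

AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"

# Placeholder credentials (hypothetical values, not real ones).
CLIENT_ID = "my-client-id.apps.googleusercontent.com"
CLIENT_SECRET = "my-client-secret"
REDIRECT_URI = "https://example.com/oauth/callback"


def build_authorization_url(scope: str) -> str:
    """Step 2: send the user to the Authorization Server with client ID, scope, redirect URL."""
    params = {
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "response_type": "code",    # ask for an authorization code
        "scope": scope,
        "access_type": "offline",   # Google-specific: also request a refresh token
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"


def code_exchange_form(code: str) -> dict:
    """Step 4: form body POSTed to TOKEN_ENDPOINT to trade the code for tokens."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "redirect_uri": REDIRECT_URI,
    }


def refresh_form(refresh_token: str) -> dict:
    """Reissuing an access token needs the client ID, client secret, and refresh token."""
    return {
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
    }


url = build_authorization_url("https://mail.google.com/")
```

The client secret appears only in the two token requests, which go from server to server; the authorization URL that passes through the user's browser carries just the public client ID.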

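Getting tokens for the Gmail API with OAuth 2.0

Gmail is a Google service, so the Client has to be registered with Google.
Register the Client under APIs & Services in Google Cloud Platform to obtain a client ID and secret.
For the Gmail API, the user is the sender of the email. In a typical OAuth 2.0 login service the authorization code is returned to the Client, and a different code has to be obtained and verified for every user. For a mail-sending service, by contrast, one sender address is registered and all mail goes out from it, so that address only has to be authorized once through a web browser. Google's oauthplayground is used for this.
It is easiest to think of oauthplayground as an Authorization Server that handles the OAuth steps in the browser.
In oauthplayground, enter the client ID and the Gmail scope and authenticate the user through Google login to obtain an authorization code.
Enter the authorization code together with the client ID and client secret to obtain an access token and a refresh token.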

Sending email with an access token and the Nodemailer package

As described in the post on sending email with Nodemailer, sending mail with the package goes through a Transporter object.

The access token does not have to be provided. Nodemailer obtains one itself from the refresh token, client ID, and client secret.

let transporter = nodemailer.createTransport({
  host: "smtp.gmail.com",
  port: 465,
  secure: true,
  auth: {
    type: "OAuth2",
    user: "user@example.com",
    clientId: "000000000000-xxx0.apps.googleusercontent.com",
    clientSecret: "XxxxxXXxX0xxxxxxxx0XXxX0",
    refreshToken: "1/XXxXxsss-xxxXXXXXxXxx0XXXxxXXx0x00xxx",
  },
});

about requires_grad

2022-06-03T00:00:00+00:00

References
[1] https://pytorch.org/docs/stable/notes/autograd.html
[2] https://medium.com/@mrityu.jha/understanding-the-grad-of-autograd-fc8d266fd6cf

Introduction
How autograd works
Saved Tensor
Requires_grad flag
Gradient context
in-place operation
tip

Introduction

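While coding I was always confused by the requires_grad flag on Tensor objects, so I am taking this chance to write down what I know.
PyTorch's autograd builds a directed acyclic graph of gradient functions during the forward pass and computes gradients from it when backward() is called.
During the forward pass, autograd stores on each gradient function the tensors that backward will need to compute gradients.
Modifying in place a leaf tensor whose requires_grad flag is True raises an error. We look at the difference between leaf tensors and intermediate tensors.
Gradients are computed only for tensors whose requires_grad flag is True. Even when a gradient function exists, autograd computes gradients only for the inputs with requires_grad=True. Likewise, the forward pass saves only the tensors those gradients need.
There are three contexts for gradient computation: grad mode, no-grad mode, and inference mode. Inference mode records no gradient information at all.
A saved tensor must not change its value, so in-place operations need care. PyTorch raises an error when an in-place operation is applied to a leaf tensor that requires grad.
An in-place operation on an intermediate tensor does not raise an error by itself. PyTorch does not clone the tensor for this; it keeps a version counter on every tensor, and backward() raises an error if a tensor saved for backward was modified in place afterwards. So the restriction on leaf tensors is not about being unable to keep saved tensors: a leaf tensor must not have a gradient function, and an in-place operation would turn it into an intermediate tensor. Working around this by cloning the leaf and operating on the clone is, in effect, an out-of-place operation.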

How autograd works

autograd is a reverse-mode automatic differentiation system: it walks the forward computation backwards and computes derivatives automatically.
While the forward pass runs, autograd builds a graph whose nodes are the Function objects that compute gradients. The graph records the Function nodes, the tensors on the edges, and the direction (inputs and outputs).
Calling backward() evaluates this graph in the reverse direction of the forward pass. The leaf nodes that were inputs of the forward pass become outputs of the backward pass, and the output tensor of the forward pass (the root) becomes the input of the backward pass.
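
To make the forward-then-reverse idea concrete, here is a toy scalar version written with the standard library only. It is a sketch of the general technique, not PyTorch's implementation; the `Var` class and its fields are hypothetical names.

```python
class Var:
    """Scalar value that records how it was produced (a toy stand-in for a tensor)."""

    def __init__(self, value, parents=(), backward_fn=None):
        self.value = value
        self.grad = 0.0
        self.parents = parents          # inputs of the operation that produced this Var
        self.backward_fn = backward_fn  # None for leaves, similar in spirit to grad_fn

    def __mul__(self, other):
        # The closure keeps references (not copies) to both inputs for the backward pass.
        def backward_fn(upstream):
            return (upstream * other.value, upstream * self.value)
        return Var(self.value * other.value, (self, other), backward_fn)

    def __add__(self, other):
        def backward_fn(upstream):
            return (upstream, upstream)
        return Var(self.value + other.value, (self, other), backward_fn)

    def backward(self):
        # Collect nodes in topological order, then walk them in reverse from the root.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn is None:
                continue  # leaf: its gradient has simply been accumulated
            for parent, g in zip(node.parents, node.backward_fn(node.grad)):
                parent.grad += g


x = Var(3.0)
y = Var(4.0)
z = x * y + x        # forward pass builds the graph
z.backward()         # backward pass walks it in reverse
print(x.grad, y.grad)  # 5.0 3.0
```

The leaves `x` and `y` are inputs of the forward pass and receive the final gradients, while the root `z` is where the backward pass starts.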

Saved Tensor

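Computing a gradient sometimes needs the tensors that were inputs to the forward operation. autograd stores such tensors together with the operation's gradient function; in a custom Function this storage is exposed through the ctx argument. The tensors are stored by reference, not copied, so a saved tensor must not be modified in place afterwards. Doing so makes backward() raise an error.
In a custom torch.autograd.Function, ctx.save_for_backward() saves tensors and ctx.saved_tensors retrieves them.

The gradient function attached to an intermediate tensor shows which tensors were saved.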

import torch

x = torch.randn(5, requires_grad=True)
y = x.pow(2)  # computing dy/dx = 2x needs the value of x
print(x is y.grad_fn._saved_self)  # True: the tensor is saved by reference

Requires_grad flag

requires_grad is a flag on every tensor and defaults to False. A tensor wrapped in nn.Parameter defaults to True.
During the forward pass, if at least one input of an operation has requires_grad=True, every output tensor gets a gradient function and becomes an intermediate tensor. The gradient function is recorded, but not every input is saved with it: only the tensors needed for the gradients of inputs with requires_grad=True are saved. For a forward pass f(a, b) = a * b with requires_grad True for a and False for b, the grad_fn of the result saves only b, which is what the gradient of a needs. If no input has requires_grad=True, the outputs have no gradient function and are leaf tensors.

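In the backward pass, gradients accumulate in .grad only for leaf tensors (tensors without a gradient function) with requires_grad=True. To keep the gradient of an intermediate tensor as well, call .retain_grad() on it before backward. This is different from retain_graph=True, which is an argument of backward() that keeps the graph alive for another backward call. I expect to need it rarely.
autograd treats an intermediate tensor as always having requires_grad=True. Trying to set requires_grad to False on an intermediate tensor raises an error in PyTorch.
When an intermediate tensor no longer needs gradients, that is, when you want to cut the backward graph at that tensor, use tensor.detach(). detach() is not an in-place method: it returns a new tensor that shares the same storage but is cut off from the graph, so the original tensor and its graph remain intact.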
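
A small sketch of the flag propagation, retain_grad(), and detach() described above. It assumes PyTorch is installed; the tensor names are arbitrary.

```python
import torch

a = torch.randn(3, requires_grad=True)   # leaf that asks for a gradient
b = torch.randn(3)                        # leaf with requires_grad=False

c = a * b          # one input requires grad -> c is an intermediate tensor
d = b * 2          # no input requires grad -> d is a leaf without grad_fn
print(c.requires_grad, c.is_leaf, c.grad_fn is not None)  # True False True
print(d.requires_grad, d.is_leaf, d.grad_fn)              # False True None

c.retain_grad()    # opt in to keeping the gradient of an intermediate tensor
c.sum().backward()
print(torch.equal(a.grad, b))               # True: d(sum(a*b))/da = b
print(torch.equal(c.grad, torch.ones(3)))   # True
print(b.grad)                               # None: b never required a gradient

e = c.detach()     # new tensor object, same storage, cut off from the graph
print(e.requires_grad, e.data_ptr() == c.data_ptr())      # False True
```

Because `e` shares storage with `c`, modifying `e` in place would also change `c`; use `c.detach().clone()` when an independent copy is needed.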

To avoid training a particular model, set requires_grad=False on all of its parameters. nn.Module.requires_grad_(False) does this in one call.

Gradient context

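Independently of the requires_grad flags of tensors, there are context blocks that control how the gradient graph is handled.
The first is grad mode, the default context, in which a gradient graph is recorded for tensors with requires_grad=True.
The second is no-grad mode. Tensor operations inside the context block record no gradient graph: no gradient function and no new intermediate tensor is created, only leaf tensors. Computation proceeds as if every input had requires_grad=False, and the flags of the input tensors themselves are left as they were, so a tensor with requires_grad=True still has it after the block. In other words, only values change. Note that this applies to the same tensor object: a new tensor created inside the block has requires_grad=False even if it is bound to the same name.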

The last is inference mode. In no-grad mode the gradient flow is suppressed only inside the context, and a tensor produced there can later take part in a new gradient graph in grad mode. Inference mode does not allow even that: a tensor computed inside the context cannot be used to build a gradient graph after it is moved back to grad mode.
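
The difference between the two contexts can be checked directly. This is a minimal sketch assuming a PyTorch version that provides torch.inference_mode (1.9 or later).

```python
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y = x * 2                 # no graph is recorded inside the block
print(y.requires_grad, x.requires_grad)   # False True

# Outside the block, y can still take part in a new graph as a constant.
(x * y).sum().backward()
print(x.grad)                 # tensor([2., 2., 2.])

with torch.inference_mode():
    w = x * 2
print(w.is_inference())       # True

try:
    (x * w).sum().backward()  # autograd would have to save w for backward
    reused = True
except RuntimeError:
    reused = False
print(reused)                 # False
```

If a tensor produced in inference mode is needed in a later graph, cloning it outside the context gives a normal tensor.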

Separately, a module's evaluation mode has nothing to do with requires_grad. Calling model.eval() does not make computations run with requires_grad=False. It exists for operations such as batch normalization and nn.Dropout, which must behave differently in training and evaluation.

in-place operation

An in-place operation does not create a new object; it writes the result into the operand itself. An example is an addition such as x += 5.
autograd allows in-place operations, but they are fragile in many cases, and out-of-place operations are recommended.
Be careful with in-place operations on a leaf tensor with requires_grad=True. When a tensor with requires_grad=True takes part in an operation, PyTorch records the gradient graph and returns an intermediate tensor with a grad_fn. A leaf tensor is by definition a tensor without a grad_fn, and PyTorch does not allow an in-place operation to turn a leaf tensor into an intermediate tensor. In no-grad mode, however, computation proceeds as if requires_grad were False and no grad_fn is created, so the in-place operation is allowed.

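A typical use of in-place operations together with no-grad mode is initializing model parameters. Here it is safest to use only in-place operations. For example, if you create a model, register its parameters with an optimizer, and then initialize the parameters with out-of-place operations, the tensors held by the optimizer and the tensors held by the model are different objects and training makes no progress. That is why parameter initialization uses in-place operations.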

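An in-place operation on an intermediate tensor does not raise an error in PyTorch by itself, although PyTorch still does not recommend it. So how can in-place operations work on intermediate tensors?
A tensor with requires_grad=True records the gradient graph when an operation runs, and the gradient function stores its saved tensors by reference. An in-place operation overwrites the values of its own tensor, so if that tensor was saved for backward by an earlier operation, the saved tensor and the updated tensor are the same object even though backward needs the old values. PyTorch does not clone the tensor to resolve this. Instead every tensor carries a version counter that each in-place operation increments, and backward() raises an error when it reads a saved tensor that was modified after it was saved. If the modified tensor was not saved by any earlier operation, the in-place operation is safe and its gradient function simply becomes the tensor's new grad_fn.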
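
The four situations discussed in this section can be reproduced in a few lines. This is a sketch assuming PyTorch is installed; the flag names `leaf_ok` and `saved_ok` are only for illustration.

```python
import torch

x = torch.ones(3, requires_grad=True)

# 1) In-place on a leaf that requires grad is rejected immediately.
try:
    x.add_(1)
    leaf_ok = True
except RuntimeError:
    leaf_ok = False

# 2) The same kind of update is fine under no_grad (typical for parameter init).
with torch.no_grad():
    x.mul_(2)               # x is still the same object, now filled with 2.0

# 3) In-place on an intermediate tensor that was NOT saved for backward works.
y = x * 3
y.add_(1)
y.sum().backward()          # x.grad becomes 3.0 everywhere

# 4) exp() saves its own output; modifying it in place invalidates the graph.
z = x.exp()
z.add_(1)                   # increments the version counter of z
try:
    z.sum().backward()
    saved_ok = True
except RuntimeError:
    saved_ok = False

print(leaf_ok, saved_ok)    # False False
```

Case 4 fails only when backward() runs, which is why such bugs tend to surface far from the line that caused them.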

tip

Model parameters should be nn.Parameter objects rather than plain tensors. nn.Parameter wraps a tensor and behaves like one, and model.parameters() later collects the nn.Parameter objects of the model as a generator.
The optimizer receives the model parameters when it is constructed, so it can see their values and their grads. Since it has to update them, it holds references, not copies.
optim.zero_grad() clears the grad of every parameter held by the optimizer. It must therefore be called before the gradients are computed (before backward) or after the parameters are updated (after optim.step()). If it is called between backward and optim.step(), the grads are cleared and the parameters are not updated.

Cross Entropy

2022-05-31T00:00:00+00:00

References
[1] https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
[2] https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html

Introduction
Cross entropy
Binary Cross Entropy
cross entropy
Why the input is an unnormalized tensor

Introduction

This is a short note about a serious mistake I had been making with PyTorch's cross entropy functions without knowing it. I found it while tracking down why my own implementation of a paper produced results far from those of the official code.
Cross entropy is an objective function derived from the KL divergence. Its arguments therefore represent probabilities and should be normalized to the range 0 to 1.
PyTorch's cross entropy function, however, takes unnormalized data and normalizes it internally with softmax.
PyTorch's binary cross entropy function takes normalized data.
PyTorch's binary cross entropy with logits function takes unnormalized data and normalizes it internally with sigmoid.

Cross entropy

Cross entropy is derived from the KL divergence.
If \(p(x)\) is the ground-truth label distribution, the KL divergence without its constant term (the term that depends only on \(p\)) is called the cross entropy, so minimizing the KL divergence is equivalent to minimizing the cross entropy.
Cross entropy is the usual objective function for classification problems.

\[\begin{aligned} D_{KL}(p||q) = E_{x \sim p}[log(p(x))] - E_{x \sim p}[log(q(x))] \\[1em] CrossEntropy : H(p, q) = - E_{x \sim p}[log(q(x))] \\[1em] = - \sum p(x)log(q(x)) \end{aligned}\]
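
A quick numeric check of this relation, using only the standard library. The two distributions are made-up values.

```python
import math


def entropy(p):
    """H(p) = -sum p(x) log p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) (log p(x) - log q(x))."""
    return sum(px * (math.log(px) - math.log(qx)) for px, qx in zip(p, q) if px > 0)


p = [0.7, 0.2, 0.1]   # made-up label distribution
q = [0.5, 0.3, 0.2]   # made-up model distribution

gap = kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))
print(abs(gap) < 1e-12)   # True: the two differ only by the constant H(p)
```

Since H(p) does not depend on the model distribution q, the gradient with respect to the model is the same for both quantities.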

Binary Cross Entropy

For binary classification, where the outcome is either True or False, the binary cross entropy function is used as the objective.
In a machine learning problem, call the input data \(x\) and the model output y.
The ground truth is \(p(y|x)\); since the random variable y takes only the values 0 and 1, \(p(y|x)\) is a Bernoulli distribution.
We assume the model's likelihood function also follows a Bernoulli distribution \(q(y|x)\), and training with binary cross entropy fits the Bernoulli parameter p = \(q(y=1|x)\).
Because this is usually a classification problem, the ground truth \(p(y|x)\) satisfies \(p(y=1|x) = 1 \text{ or } 0\).
When \(P(y=1|x) = 1\):

\[\begin{aligned} CrossEntropy : H(p, q) = - E_{y|x \sim p}[log(q(y|x))] \\[1em] = - \sum p(y|x)log(q(y|x)) \\[1em] = -log(q(y=1|x)) \\[1em] \end{aligned}\]

When \(P(y=1|x) = 0\):

\[\begin{aligned} CrossEntropy : H(p, q) = - E_{y|x \sim p}[log(q(y|x))] \\[1em] = - \sum p(y|x)log(q(y|x)) \\[1em] = -log(q(y=0|x)) \\[1em] = -log(1 - q(y=1|x)) \\[1em] \end{aligned}\]

PyTorch has two functions for binary cross entropy.
torch.nn.functional.binary_cross_entropy() requires the input tensor and the target tensor to have the same shape, and the input tensor must be normalized (probabilities in [0, 1]).
torch.nn.functional.binary_cross_entropy_with_logits() also requires the same shape for input and target, but the input tensor is unnormalized and the function applies sigmoid internally. Do not apply sigmoid to the model output yourself.
The input and target tensors must share a shape because every element of the tensor is treated as an independent random variable with its own distribution. The same idea carries over to the cross entropy function, where every position outside the class dimension is an independent distribution.
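
The two cases above and the relation between the probability version and the logits version can be written out by hand. This is a plain-Python sketch of the math, not PyTorch's implementation; the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(q, t):
    """Binary cross entropy on a probability q = q(y=1|x) and a target t in {0, 1}."""
    return -(t * math.log(q) + (1 - t) * math.log(1 - q))


def bce_with_logits(z, t):
    """Same loss computed directly from the logit z, without forming sigmoid(z)."""
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))


# Target 1 gives -log q(y=1|x); target 0 gives -log(1 - q(y=1|x)).
print(math.isclose(bce(0.8, 1), -math.log(0.8)))      # True
print(math.isclose(bce(0.8, 0), -math.log(1 - 0.8)))  # True

# Applying sigmoid first and then bce matches the logits version.
print(math.isclose(bce(sigmoid(0.3), 1), bce_with_logits(0.3, 1)))    # True
print(math.isclose(bce(sigmoid(-1.2), 0), bce_with_logits(-1.2, 0)))  # True
```

Passing an already-sigmoided value to the logits version applies sigmoid twice, which is the mistake this post is about.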

cross entropy

This objective is used mainly for classification problems with more than two outcomes.
Because there are more than two outcomes, the ground truth follows a categorical distribution, the multi-outcome generalization of the Bernoulli distribution.
The ground truth is \(p(y|x)\), where the random variable y takes an integer class index from 0 to N-1. The ground truth should not be one-hot encoded.
We assume the model's likelihood function also follows a categorical distribution \(q(y|x)\). Unlike the Bernoulli distribution, which has a single parameter, it has several (N) parameters, so the model must output N values.
As in the binary case, this is usually a classification problem, so the ground truth puts probability 1 on a single value and 0 on all others.
When \(P(y=n|x) = 1\):

\[\begin{aligned} CrossEntropy : H(p, q) = - E_{y|x \sim p}[log(q(y|x))] \\[1em] = - \sum p(y|x)log(q(y|x)) \\[1em] = -log(q(y=n|x)) \\[1em] \end{aligned}\]

torch.nn.functional.cross_entropy() takes an input tensor and a target tensor of different shapes. The input tensor holds one value per class (C classes, the parameters of the categorical distribution), so its shape is (C), (N, C), or (N, C, d_1, d_2, ..., d_n). For classification the target only needs to say which class has probability 1, so its shape is (), (N), or (N, d_1, d_2, ..., d_n). Alternatively, when the target is a general distribution over classes rather than a single class, the target has the same shape as the input. In both cases the shapes must match apart from C, and note that the C dimension is the second one, not the last. The function takes an unnormalized input tensor and applies softmax internally.
In the Bernoulli case every element of the input is an independent distribution, whereas in the categorical case all entries along the C dimension together describe a single distribution.
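
The shape rules can be verified with a small example. This sketch assumes PyTorch is installed; the sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

N, C, H, W = 2, 5, 4, 4
logits = torch.randn(N, C, H, W)            # class dimension is dim 1, not the last
target = torch.randint(0, C, (N, H, W))     # class indices: no one-hot, no C dimension

loss = F.cross_entropy(logits, target)      # softmax is applied internally

# Same value by hand: log-softmax over dim 1, then pick the target class.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs.gather(1, target.unsqueeze(1)).mean()
print(torch.allclose(loss, manual))         # True

# Binary case: input and target share one shape, every element is its own Bernoulli.
z = torch.randn(N, H, W)
t = torch.randint(0, 2, (N, H, W)).float()
with_logits = F.binary_cross_entropy_with_logits(z, t)
with_probs = F.binary_cross_entropy(torch.sigmoid(z), t)
print(torch.allclose(with_logits, with_probs, atol=1e-6))  # True
```

A model that outputs (N, H, W, C) has to be permuted to (N, C, H, W) before calling cross_entropy.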

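Why the input is an unnormalized tensor

Logically it seems right for the user to normalize with sigmoid or softmax before passing the tensor to cross entropy, but the current PyTorch functions take an unnormalized input tensor and normalize it internally.
I am not sure of the exact reason. It may be to reduce user mistakes, or because merging the normalization with the cross entropy computation allows a formulation that is more stable for training. The PyTorch documentation for BCEWithLogitsLoss does state that combining the two steps is more numerically stable because it can use the log-sum-exp trick.
Either way, keep it in mind.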
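
The stability argument can be seen in plain Python. The sketch below compares a naive softmax-then-log computation with the log-sum-exp form; it illustrates the idea only and is not PyTorch's actual code path.

```python
import math


def naive_cross_entropy(logits, target):
    """Softmax first, then log: exp() overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))


def stable_cross_entropy(logits, target):
    """log-sum-exp form: subtract the max logit before exponentiating."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


small = [2.0, 1.0, 0.1]
print(math.isclose(naive_cross_entropy(small, 0), stable_cross_entropy(small, 0)))  # True

large = [1000.0, 0.0, -1000.0]
try:
    naive_cross_entropy(large, 0)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)                       # True
print(stable_cross_entropy(large, 0))   # 0.0
print(stable_cross_entropy(large, 1))   # 1000.0
```

The fused form is only possible when the function receives the raw logits, which is one practical reason to accept unnormalized input.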

Mask2Former

2022-05-26T00:00:00+00:00

References
[1] https://arxiv.org/abs/2112.01527
[2] https://github.com/facebookresearch/Mask2Former

Code
https://github.com/tinnunculus/Mask2Former/blob/master/mask2former.ipynb

Introduction
Contribution
Mask classification preliminaries
masked attention
Multi-scale features
Optimization improvements
Improving training efficiency

Introduction

This is the follow-up paper to MaskFormer, with a nearly identical model architecture.
Like MaskFormer, it handles several segmentation tasks with a single universal architecture.
MaskFormer outperformed earlier universal models but still had drawbacks compared with task-specific models.
In accuracy MaskFormer was only slightly better or worse than task-specific models, and it was very inefficient in time and memory.
In particular, MaskFormer needs 32 GB of GPU memory to train on a single (800, 600) image.
MaskFormer also trains more slowly and converges with more difficulty than task-specific models.
Mask2Former is proposed to address these problems of accuracy, training and convergence, and memory efficiency.

Contribution

To improve accuracy, Mask2Former proposes masked attention for the Transformer decoder. When computing attention scores between queries and keys, masked attention does not cover every location; it attends only within the mask region predicted by the previous layer. As implemented it does not improve efficiency (time or memory), but it gives faster convergence and better accuracy.
Instead of a single feature map from the pixel decoder, several (three) of its layers are used as keys and values for the Transformer decoder. MaskFormer used only the low-resolution feature map of the last layer.
The order of the self-attention layer and the cross-attention layer within a Transformer layer is swapped, which improves training.
Dropout is removed from the Transformer. Dropout did not help accuracy and actually hindered convergence.
During training, the mask loss is computed on a randomly sampled group of pixels rather than on all pixels, which keeps accuracy while improving memory efficiency.
Unlike MaskFormer, the pixel decoder is a Deformable Transformer instead of an FPN.

Mask classification preliminaries

Mask2Former has the same overall structure as MaskFormer.
The goal is to predict N binary masks and their class labels.
A backbone extracts multi-scale feature maps from the image.
The pixel decoder turns the backbone's multi-scale feature maps into refined multi-scale feature maps and per-pixel embeddings. The per-pixel embeddings are later combined with the mask embedding vectors to produce the mask segmentation.
The Transformer decoder takes a fixed set of N object queries and the multi-scale feature maps from the pixel decoder and outputs N embedding vectors. A linear transform of these gives the N class predictions, and combining them with the per-pixel embedding matrix gives the mask segmentation.

masked attention

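Masked attention is a new attention algorithm that replaces standard cross-attention.
Unlike cross-attention, it does not compute attention scores over every location.
It extracts a mask from the output of the previous Transformer decoder layer and computes attention only over the masked (foreground) region.
The amount of computation is the same, but accuracy improves.
Because this is a later layer, the output of the previous layer is already reflected in its input. Applying the mask anyway acts somewhat like dropout, except that instead of dropping at random it drops the uninteresting regions, so attention is concentrated more strongly.
One might worry that restricting the region this much could discard important areas, but the self-attention that follows compensates for that.
Since a mask is extracted and used at every intermediate layer, adding auxiliary training (a loss on each layer's output) should help.
The masked attention layer is described with the following quantities.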

\(X_l \in R^{N \times C}\) is the output of the \(l\)-th layer, and \(X_0\) is the set of query vectors fed into the Transformer decoder.
\(K_l, V_l \in R^{H_lW_l \times C}\) are the image features passed from the pixel decoder to the \(l\)-th layer of the Transformer decoder.
\(M_{l-1}\) is the mask obtained from the output of the previous Transformer decoder layer (together with the per-pixel embeddings). It has the spatial size of the per-pixel embeddings, so it is resized to \(H_l \times W_l\).
\(M_0\) is extracted from the query vectors that enter the Transformer decoder. This may look meaningless, but the query vectors are learnable, so the mask becomes more meaningful as training progresses.
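
With these symbols, the masked attention update as I recall it from the paper [1] is the following, where \(Q_l\) is a linear projection of \(X_{l-1}\):

\[\begin{aligned} X_l = softmax(\mathcal{M}_{l-1} + Q_l K_l^{T}) V_l + X_{l-1}, \qquad \mathcal{M}_{l-1}(x, y) = \begin{cases} 0 & \text{if } M_{l-1}(x, y) = 1 \\ -\infty & \text{otherwise} \end{cases} \end{aligned}\]

The plain-Python sketch below implements this for small lists. It is a simplification, not the official code: the learned projections are omitted (the queries are used directly as \(Q_l\)), there is a single head, and every query is assumed to have at least one foreground position.

```python
import math


def masked_attention(queries, keys, values, mask):
    """queries: N x C, keys/values: HW x C, mask: N x HW of 0/1 (1 = foreground)."""
    outputs = []
    for q, m in zip(queries, mask):
        # Attention logits; background positions get -inf so softmax gives them weight 0.
        logits = [
            sum(qc * kc for qc, kc in zip(q, k)) if keep else -math.inf
            for k, keep in zip(keys, m)
        ]
        peak = max(logits)
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        attended = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(q))
        ]
        # Residual connection: attention output + X_{l-1}.
        outputs.append([a + qc for a, qc in zip(attended, q)])
    return outputs


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
mask = [[1, 1, 0]]          # the third location is background for this query

out = masked_attention(queries, keys, values, mask)
```

Even though the third key has the largest dot product with the query, it contributes nothing because the mask removes it before the softmax.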

Multi-scale features

Unlike MaskFormer, Mask2Former uses not only the final output feature map of the pixel decoder but also the feature maps from its intermediate layers. MaskFormer used only a low-resolution feature map.
Feature maps at 1/32, 1/16, and 1/8 resolution are taken from the pixel decoder. A 1/4 feature map is not taken from it directly; the 1/8 feature map is simply upsampled to form the per-pixel embeddings. The 1/4 feature map therefore never enters the Transformer decoder and is used only to produce masks.
A fixed sinusoidal positional embedding \(e_{pos} \in R^{H_lW_l \times C}\) is added to each feature map.
A scale-level embedding \(e_{lvl} \in R^{1 \times C}\) is added as well.
These three feature maps are fed into the Transformer decoder in turn, repeated L times, so the Transformer decoder has 3L layers in total.
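
The resulting layer schedule can be written down in a couple of lines. This is a sketch assuming the maps are visited from lowest to highest resolution in each round; `decoder_schedule` is a hypothetical helper name.

```python
SCALES = ["1/32", "1/16", "1/8"]   # feature maps from the pixel decoder, low to high resolution


def decoder_schedule(num_rounds):
    """Return which feature map each Transformer decoder layer attends to."""
    return [SCALES[i % len(SCALES)] for i in range(num_rounds * len(SCALES))]


schedule = decoder_schedule(3)
print(len(schedule))   # 9
print(schedule[:4])    # ['1/32', '1/16', '1/8', '1/32']
```

With L = 3 rounds this gives nine decoder layers, each seeing a single resolution.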

Optimization improvements

Unlike the standard Transformer decoder, the positions of self-attention and masked attention are swapped. This is arguably more logical, because the initial query vectors carry no meaningful information yet.
Dropout is removed from every layer.

Improving training efficiency

The overall training procedure is the same as in MaskFormer.
The mask loss used in Hungarian matching and in the objective function is improved.
The original mask loss computed the distance over all pixels; here pixels are sampled and the distance is computed only on those.
For the matching cost, the same set of uniformly sampled pixel locations is used for all masks. For the training loss, the paper uses importance sampling instead of uniform sampling, and a different set of pixels is sampled for each mask.
Sampling uses the grid_sample function. The same pixel may be selected more than once: the paper samples a fixed number of points rather than a fraction of the image, so the image can contain fewer pixels than the number of samples. Each point is therefore drawn independently.
This cuts memory use to about one third without affecting accuracy.
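
The idea of a point-sampled mask loss can be sketched without any framework. The version below draws integer pixel locations uniformly and with replacement, which corresponds to the matching-cost case; the real implementation samples continuous coordinates through grid_sample, so treat this as an illustration only. All names are hypothetical.

```python
import math
import random


def bce_with_logits(z, t):
    """Binary cross entropy on a single logit z and target t in {0, 1}."""
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))


def sampled_mask_loss(pred_logits, gt_mask, num_points, rng):
    """Average the loss over num_points pixels drawn independently (with replacement)."""
    height, width = len(gt_mask), len(gt_mask[0])
    total = 0.0
    for _ in range(num_points):
        row, col = rng.randrange(height), rng.randrange(width)
        total += bce_with_logits(pred_logits[row][col], gt_mask[row][col])
    return total / num_points


rng = random.Random(0)
gt = [[1, 0], [0, 1]]
pred = [[0.0, 0.0], [0.0, 0.0]]          # logit 0 means probability 0.5 everywhere

# 16 points on a 4-pixel mask: more samples than pixels works because draws are independent.
loss = sampled_mask_loss(pred, gt, num_points=16, rng=rng)
print(math.isclose(loss, math.log(2)))   # True
```

The cost of the loss now depends on the number of sampled points rather than on the image size, which is where the memory saving comes from.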

MaskFormer

2022-05-12T00:00:00+00:00

References
[1] https://arxiv.org/abs/2107.06278
[2] https://github.com/facebookresearch/MaskFormer

Code
https://github.com/tinnunculus/MaskFormer/blob/master/maskformer.ipynb

Introduction
Mask classification formulation
Mask classification inference
- General inference
- Semantic inference

Introduction

MaskFormer can be viewed as DETR, an object detection model, adapted to the segmentation task.

Its core ideas, the model architecture and the training method, are therefore almost the same as DETR's.

Semantic, instance, and panoptic segmentation used to be approached with different methods, but in this paper a single trained MaskFormer model solves all of them by changing only the inference procedure per task.

Mask classification formulation

DETR is an object detection model, so the Transformer query tokens represent object box information; in MaskFormer they represent embedding vectors of segmentation information.

Each embedding vector is later multiplied with the per-pixel embedding tensor to give the mask, and passed through a linear mapping to give the class of the mask.

The rest is the same as DETR. The classes include "no object", and bipartite matching pairs predictions with ground truths one to one before training.

Hungarian (bipartite) matching uses \(-p_i(c^{gt}_j) + L_{mask}(m_i, m^{gt}_j)\) as the cost.

The training objective combines a classification term and the mask loss over the matched pairs.
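
The figure with the formula is not reproduced here. As I recall it from the paper [1], with \(\sigma\) denoting the matching between predictions and ground truths, the objective has the form:

\[\begin{aligned} L_{mask-cls}(z, z^{gt}) = \sum_{j=1}^{N} \left[ -\log p_{\sigma(j)}(c^{gt}_j) + 1_{c^{gt}_j \neq \varnothing} L_{mask}(m_{\sigma(j)}, m^{gt}_j) \right] \end{aligned}\]

The indicator means the mask loss is applied only to predictions matched to a real segment, while the classification term also covers predictions matched to "no object".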

Note that H and W in the figure above are not those of the original image; they are the sizes after downscaling by 0.25.

As in DETR, the mask loss is a linear combination of focal loss and dice loss.

The ground-truth mask is binary, so the predicted mask goes through a sigmoid. A cross-entropy-style cost therefore suits it better than an L1 or L2 distance, and focal loss is the specific choice here.

Dice loss, an IoU-style cost, is used alongside it.

As with other Transformer models, an auxiliary loss (a loss on the output of every Transformer decoder layer) can be added when needed.

Mask classification inference

By itself the MaskFormer model only outputs N binary segmentations (per-pixel sigmoid) and N class distributions (softmax).

Inference has to be adapted to each task.

General inference

The most basic approach is to pick the class with the highest probability for each pixel.

For semantic segmentation, one class per pixel is enough.

For instance-level segmentation, instances of the same class are distinguished by their mask index.

For panoptic segmentation, an additional step is applied to reduce the false positive rate; I have not worked out the details yet.

For each \(mask_i\), pick the most probable class \(c_i\). ( \(c_i = argmax_{c\in\{1,...,K,\varnothing\}}{p_i(c)}\) )

For every pixel [h, w] of the image, pick the mask with the highest predicted probability and assign its class. ( \(argmax_{i:c_i\neq\varnothing}p_i(c_i) \cdot m_i[h, w]\) )

Semantic inference

This is the inference method for semantic segmentation.

Instead of fixing one class per mask as in general inference, it marginalizes over the masks to get a combined score per class and then selects the class.

The rule is \(argmax_{c\in\{1,...,K\}}\sum_{i=1}^{N}p_i(c) \cdot m_i[h, w]\), which leaves out "no object".
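
Both inference rules can be compared on toy data. The sketch below uses nested lists instead of tensors and assumes the "no object" class is stored at the last index; the numbers are made up.

```python
def general_inference(class_probs, masks, no_object):
    """class_probs[i][c] = p_i(c); masks[i][h][w] = m_i[h, w]. Returns a class per pixel."""
    # Fix one class per mask, then drop masks whose class is "no object".
    chosen = [max(range(len(p)), key=lambda c: p[c]) for p in class_probs]
    kept = [i for i, c in enumerate(chosen) if c != no_object]
    height, width = len(masks[0]), len(masks[0][0])
    result = []
    for h in range(height):
        row = []
        for w in range(width):
            best = max(kept, key=lambda i: class_probs[i][chosen[i]] * masks[i][h][w])
            row.append(chosen[best])
        result.append(row)
    return result


def semantic_inference(class_probs, masks, num_classes):
    """Marginalize over masks: score(c) = sum_i p_i(c) * m_i[h, w], no-object excluded."""
    height, width = len(masks[0]), len(masks[0][0])
    result = []
    for h in range(height):
        row = []
        for w in range(width):
            scores = [
                sum(p[c] * m[h][w] for p, m in zip(class_probs, masks))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=lambda c: scores[c]))
        result.append(row)
    return result


# Two masks, K = 2 real classes, index 2 is "no object"; the image is 1 x 2 pixels.
class_probs = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]
masks = [[[0.9, 0.2]], [[0.1, 0.7]]]

print(general_inference(class_probs, masks, no_object=2))    # [[0, 1]]
print(semantic_inference(class_probs, masks, num_classes=2)) # [[0, 1]]
```

On this input the two rules agree; they can differ when several masks carry probability mass for the same class at one pixel, because semantic inference adds those contributions together.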

Semantic inference gave good results for semantic segmentation, but showed low performance otherwise.