Service challenges can often be reached with netcat or PuTTY, but solving them will sometimes require a scripting language like Python. CTF players often use Python alongside pwntools.

You can run pwntools right in your browser by using repl.it.

Using netcat

netcat is a networking utility found on macOS and Linux that allows for easy connections to CTF challenges. Service challenges will commonly give you an address and a port to connect to. The syntax for connecting to a service challenge with netcat is nc <ip> <port>.

Using ConEmu

Windows users can connect to service challenges using ConEmu, which can be downloaded here. Connecting to service challenges with ConEmu is done by running nc <ip> <port>.
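
When a challenge needs scripting rather than typing, the connection that nc makes can also be opened from Python. The following is a minimal sketch using only the standard library socket module (pwntools offers a higher-level interface for the same job). The helper names recv_until and talk are hypothetical, and talk assumes the service closes the connection once it has answered.

```python
import socket


def recv_until(sock, marker):
    """Read from the socket until `marker` (bytes) has been seen or the peer closes."""
    data = b""
    while marker not in data:
        chunk = sock.recv(4096)
        if not chunk:  # connection closed by the service
            break
        data += chunk
    return data


def talk(host, port, prompt, reply):
    """Connect like `nc host port`, wait for `prompt`, send `reply`, read the rest."""
    with socket.create_connection((host, port), timeout=5) as sock:
        banner = recv_until(sock, prompt)
        sock.sendall(reply + b"\n")
        response = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # assumes the service hangs up when it is done
                break
            response += chunk
    return banner, response
```

A call such as talk("<ip>", 1337, b"> ", b"hello") would mirror typing hello into an nc session; the port and prompt here are placeholders.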
Occasionally, certain kinds of exploits will require a server to connect back to. Some examples are connect-back shellcode, cross-site request forgery (CSRF), or blind cross-site scripting (XSS).

I just need a web server

If you just need a web server to host simple static websites or check access logs, we recommend using PythonAnywhere to host a simple web application. You can program a simple web application in popular Python web frameworks (e.g. Flask) and host it there for free.

I need a real server

If you need a real server (perhaps to run complex calculations or for shellcode to connect back to), we recommend DigitalOcean. DigitalOcean has a cheap $4-6/month plan for a small server that can be freely configured to do whatever you need.
Generally in cyber security competitions, it is up to you and your team to determine what software to use. In some cases you may even end up creating new tools to give you an edge! That being said, here are some applications that we recommend for most competitors for most competitions.
Ghidra is a disassembler and decompiler that is open source and free to use. Released by the NSA, Ghidra is a capable tool and is the recommended disassembler for most use cases. An alternative is IDA Pro (a cyber security industry standard); however, IDA Pro is not free and licenses are very expensive.
Binary Ninja is a commercial disassembler (with a free demo application) that provides an aesthetic and easy-to-use interface for binary reverse engineering. It also has a web UI which can be used freely. Binary Ninja's API and intermediate language make it superior to other disassemblers for certain use cases.
Pwndbg is a plugin for the GNU Debugger (gdb) which makes it easier to dynamically reverse an application by stepping through its execution. In order to use pwndbg you will first need to have gdb installed via a Linux virtual machine or similar.
Burp Suite is an HTTP proxy and set of tools which allow you to view, edit and replay your HTTP requests. While Burp Suite is a commercial tool, it offers a free version which is very capable and usually all that's needed.
sqlmap is a penetration testing tool that automates the process of detecting and exploiting SQL injection flaws. It's open source and freely available.
Google Chrome is a web browser with a suite of developer tools and extensions. These tools and extensions can be useful when investigating a web application.
VMware is a company that creates virtualization software that allows you to run other operating systems within your existing operating system. While their products are not generally free, their software is best in class for virtualization.

VMware Fusion, VMware Workstation, and VMware Player are three of their virtualization products that can be used on your computer to run other OSes. VMware Player is free to use for Windows and Linux.
VirtualBox is open source virtualization software which allows you to virtualize other operating systems. It's very similar to VMware products but free for all OSes. It is generally slower than VMware but works well enough for most people.
Python is an easy-to-learn, widely used programming language which supports complex applications as well as small scripts. It has a large community which provides thousands of useful packages. Python is widely used in the cyber security industry and is generally the recommended language to use in CTF competitions.
Pwntools is a Python package which makes interacting with processes and networks easy. It is a recommended library for interacting with binary exploitation and networking-based CTF challenges.

Note

You can run pwntools right in your browser by using repl.it. Create a new Python repl and install the pwntools package. After that you'll be able to use pwntools directly from your browser without having to install anything.
Capture the Flags, or CTFs, are computer security competitions.
Teams of competitors (or just individuals) are pitted against each other in various challenges across multiple security disciplines, competing to earn the most points.
CTFs are often the beginning of one's cyber security career due to their team building nature and competitive aspect. In addition, there isn't a lot of commitment required beyond a weekend.
Info
For information about ongoing CTFs, check out CTFTime.
In this handbook you'll learn the basics™ behind the methodologies and techniques needed to succeed in Capture the Flag competitions.
Address Space Layout Randomization (or ASLR) is the randomization of the place in memory where the program, shared libraries, the stack, and the heap are. This can make it harder for an attacker to exploit a service, as knowledge about where the stack, heap, or libc is located can't be reused between program launches. This is a partially effective way of preventing an attacker from jumping to, for example, libc without a leak.
Typically, only the stack, heap, and shared libraries are ASLR enabled. It is still somewhat rare for the main program to have ASLR enabled, though it is being seen more frequently and is slowly becoming the default.
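
To make the role of a leak concrete: ASLR moves a whole library by one random base, so the distance between any two symbols inside it stays fixed. The sketch below shows that arithmetic. Every offset and the leaked address are made-up values for illustration (real offsets depend on the specific libc build), and the function name libc_base_from_leak is hypothetical.

```python
# Hypothetical offsets of two functions inside one particular libc build.
PUTS_OFFSET = 0x0809C0
SYSTEM_OFFSET = 0x04F440


def libc_base_from_leak(leaked_puts):
    """Recover the randomized load address from one leaked pointer."""
    base = leaked_puts - PUTS_OFFSET
    # Libraries are mapped on page boundaries, so a correct base is
    # page-aligned (4 KiB pages assumed here).
    if base & 0xFFF:
        raise ValueError("leak does not match the assumed libc build")
    return base


leaked_puts = 0x7F3A1C2809C0        # pretend the program printed this pointer
base = libc_base_from_leak(leaked_puts)
system_addr = base + SYSTEM_OFFSET  # now known despite ASLR
print(hex(base), hex(system_addr))
```

One leaked pointer is enough to locate everything else in the same library for the lifetime of that process.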
The simplest and most common buffer overflow is one where the buffer is on the stack. Let's look at an example.
#include <stdio.h>
#include <unistd.h>  /* declares read() */

int main() {
    int secret = 0xdeadbeef;
    char name[100] = {0};
    read(0, name, 0x100);  /* bug: reads up to 0x100 (256) bytes into a 100-byte buffer */
    if (secret == 0x1337) {
        puts("Wow! Here's a secret.");
    } else {
        puts("I guess you're not cool enough to see my secret");
    }
}
There's a tiny mistake in this program which will allow us to see the secret. name is 100 bytes long, but we read in up to 0x100 bytes (256 in decimal)! Let's see how we can use this to our advantage.
If the compiler chose to lay out the stack like this:
The least significant byte of secret has been overwritten! If we follow the next 3 bytes to be read in, we'll see the entirety of secret is "clobbered" with our 'A's.
The remaining 152 bytes would continue clobbering values up the stack.
Passing an impossible check
How can we use this to pass the seemingly impossible check in the original program? Well, if we carefully line up our input so that the bytes that overwrite secret happen to be the bytes that represent 0x1337 in little-endian, we'll see the secret message.
A small Python 2 one-liner will work nicely: python -c "print 'A'*100 + '\x37\x13\x00\x00'"
This will fill the name buffer with 100 'A's, then overwrite secret with the 32-bit little-endian encoding of 0x1337.
Going one step further
As discussed on the stack page, the instruction that the current function should jump to when it is done is also saved on the stack (denoted as "Saved EIP" in the above stack diagrams). If we can overwrite this, we can control where the program jumps after main finishes running, giving us the ability to control what the program does entirely.
Usually, the end objective in binary exploitation is to get a shell (often called "popping a shell") on the remote computer. The shell provides us with an easy way to run anything we want on the target computer.
Say there happens to be a nice function that does this defined somewhere else in the program that we normally can't get to:
#include <stdlib.h>  /* declares system() */

void give_shell() {
    system("/bin/sh");
}
Well, with our buffer overflow knowledge, now we can! All we have to do is overwrite the saved EIP on the stack with the address of give_shell. Then, when main returns, it will pop that address off of the stack and jump to it, running give_shell, and giving us our shell.
Assuming give_shell is at 0x08048fd0, we could use something like this: python -c "print 'A'*108 + '\xd0\x8f\x04\x08'"
We send 108 'A's to overwrite the 100 bytes that are allocated for name, the 4 bytes for secret, and the 4 bytes for the saved EBP. Then we simply send the little-endian form of give_shell's address, and we would get a shell!
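
The one-liners on this page use Python 2's print statement. In Python 3 the same payloads are easier to build as bytes, with struct.pack handling the little-endian encoding. This is a minimal sketch, assuming the stack layout described above (100-byte name, then secret, then the saved EBP and the saved EIP) and the example address 0x08048fd0 for give_shell. The helper name pack32 is hypothetical.

```python
import struct

NAME_SIZE = 100  # size of the name buffer in the example program


def pack32(value):
    """Encode a 32-bit value in little-endian byte order."""
    return struct.pack("<I", value)


# Payload 1: fill name, then overwrite secret with 0x1337.
secret_payload = b"A" * NAME_SIZE + pack32(0x1337)

# Payload 2: fill name (100) + secret (4) + saved EBP (4), then the saved EIP.
GIVE_SHELL = 0x08048FD0  # example address from the text, not a real target
ret_payload = b"A" * (NAME_SIZE + 4 + 4) + pack32(GIVE_SHELL)

print(secret_payload[-4:].hex(), ret_payload[-4:].hex())
```

To feed a payload to a program, write it as raw bytes, for example with sys.stdout.buffer.write(secret_payload); print() would re-encode the bytes as text.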
This idea is extended in Return Oriented Programming.
Much like a stack buffer overflow, a heap overflow is a vulnerability where more data than can fit in the allocated buffer is read in. This could lead to heap metadata corruption, or corruption of other heap objects, which could in turn provide new attack surface.
Use After Free (UAF)
Once free is called on an allocation, the allocator is free to re-allocate that chunk of memory in future calls to malloc if it so chooses. However, if the program author isn't careful and uses the freed object later on, the contents may be corrupt (or even attacker controlled). This is called a use after free or UAF.
In this example, we have a string structure with a length and a pointer to the actual string data. We properly allocate, fill, and then free an instance of this structure. Then we make another allocation, fill it, and then improperly reference the freed string. Due to how glibc's allocator works, s2 will actually get the same memory as the original s allocation, which in turn gives us the ability to control the s->data pointer. This could be used to leak program data.
Not only can the heap be exploited by the data in allocations, but exploits can also use the underlying mechanisms in malloc, free, etc. to exploit a program. This is beyond the scope of CTF 101, but here are a few recommended resources:
sploitFUN's glibc overview
Shellphish's how2heap
No eXecute (NX Bit)
The No eXecute or the NX bit (also known as Data Execution Prevention or DEP) marks certain areas of the program as not executable, meaning that stored input or data cannot be executed as code. This is significant because it prevents attackers from being able to jump to custom shellcode that they've stored on the stack or in a global variable.
Binaries, or executables, are machine code for a computer to execute. For the most part, the binaries that you will face in CTFs are Linux ELF files or the occasional Windows executable. Binary Exploitation is a broad topic within Cyber Security which really comes down to finding a vulnerability in the program and exploiting it to gain control of a shell or modify the program's functions.
Common topics addressed by Binary Exploitation or 'pwn' challenges include:
Partial RELRO is the default setting in GCC, and nearly all binaries you will see have at least partial RELRO.
From an attacker's point of view, partial RELRO makes almost no difference, other than it forces the GOT to come before the BSS in memory, eliminating the risk of a buffer overflow on a global variable overwriting GOT entries.
Full RELRO makes the entire GOT read-only, which removes the ability to perform a "GOT overwrite" attack, where the GOT address of a function is overwritten with the location of another function or a ROP gadget an attacker wants to run.
Full RELRO is not a default compiler setting as it can greatly increase program startup time since all symbols must be resolved before the program is started. In large programs with thousands of symbols that need to be linked, this could cause a noticeable delay in startup time.
Return Oriented Programming (or ROP) is the idea of chaining together small snippets of assembly with stack control to cause the program to do more complex things.
As we saw in buffer overflows, having stack control can be very powerful since it allows us to overwrite saved instruction pointers, giving us control over what the program does next. Most programs don't have a convenient give_shell function however, so we need to find a way to manually invoke system or another exec function to get us our shell.
Imagine we have a program similar to the following:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>  /* declares read() */

char name[32];

int main() {
    printf("What's your name? ");
    read(0, name, 32);

    printf("Hi %s\n", name);

    printf("The time is currently ");
    system("/bin/date");

    char echo[100];
    printf("What do you want me to echo back? ");
    read(0, echo, 1000);  /* bug: up to 1000 bytes into a 100-byte buffer */
    puts(echo);

    return 0;
}
We obviously have a stack buffer overflow on the echo variable which can give us EIP control when main returns. But we don't have a give_shell function! So what can we do?
We can call system with an argument we control! Since arguments are passed in on the stack in 32-bit Linux programs (see calling conventions), if we have stack control, we have argument control.
When main returns, we want our stack to look like something had normally called system. Recall what is on the stack after a function has been called:
This is a good start, but we need to pass an argument to system for anything to happen. As mentioned in the page on ASLR, the stack and dynamic libraries "move around" each time a program is run, which means we can't easily use data on the stack or a string in libc for our argument. In this case however, we have a very convenient name global which will be at a known location in the binary (in the BSS segment).
Putting it together
Our exploit will need to do the following:
Enter \"sh\" or another command to run as name
Fill the stack with
Garbage up to the saved EIP
The address of system's PLT entry
A fake return address for system to jump to when it's done
The address of the name global to act as the first argument to system
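
The steps above can be sketched as two byte strings, one per read call. All addresses and the offset below are hypothetical placeholders, since the real values come from inspecting the target binary, and the helper name p32 is just a local convenience.

```python
import struct

# Hypothetical values for illustration only.
OFFSET_TO_SAVED_EIP = 112   # bytes from the start of echo to the saved EIP
SYSTEM_PLT = 0x08048390     # address of system's PLT entry
NAME_ADDR = 0x0804A040      # address of the name global in the BSS
FAKE_RETURN = 0x41414141    # where system would return to; irrelevant once we have a shell


def p32(value):
    """Little-endian 32-bit encoding."""
    return struct.pack("<I", value)


# First read(): the command to run, stored in the name global.
first_input = b"sh\x00"

# Second read(): garbage, then a stack that looks like a call to system(name).
second_input = (
    b"A" * OFFSET_TO_SAVED_EIP
    + p32(SYSTEM_PLT)    # overwrites the saved EIP
    + p32(FAKE_RETURN)   # return address as seen by system
    + p32(NAME_ADDR)     # first argument to system
)
```

The order matters: system expects its return address directly above the overwritten saved EIP, with its first argument one slot higher.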
In 64-bit binaries we have to work a bit harder to pass arguments to functions. The basic idea of overwriting the saved RIP is the same, but as discussed in calling conventions, arguments are passed in registers in 64-bit programs. In the case of running system, this means we will need to find a way to control the RDI register.
To do this, we'll use small snippets of assembly in the binary, called "gadgets." These gadgets usually pop one or more registers off of the stack, and then execute ret, which allows us to chain them together by making a large fake call stack.
For example, if we needed control of both RDI and RSI, we might find two gadgets in our program that look like this (using a tool like rp++ or ROPgadget):
0x400c01: pop rdi; ret
0x400c03: pop rsi; pop r15; ret
We can set up a fake call stack with these gadgets to sequentially execute them, popping values we control into registers, and then end with a jump to system.
       0xffff0028: 0x400d00    // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled
       0xffff0020: 0x1337beef  // value we want in r15 (probably garbage)
       0xffff0018: 0x1337beef  // value we want in rsi
       0xffff0010: 0x400c03    // address that the rdi gadget's ret will return to - the pop rsi gadget
       0xffff0008: 0xdeadbeef  // value to be popped into rdi
RSP -> 0xffff0000: 0x400c01    // address of rdi gadget
Stepping through this one instruction at a time, main returns, jumping to our pop rdi gadget:
RIP = 0x400c01 (pop rdi)
RDI = UNKNOWN
RSI = UNKNOWN

       0xffff0028: 0x400d00    // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled
       0xffff0020: 0x1337beef  // value we want in r15 (probably garbage)
       0xffff0018: 0x1337beef  // value we want in rsi
       0xffff0010: 0x400c03    // address that the rdi gadget's ret will return to - the pop rsi gadget
RSP -> 0xffff0008: 0xdeadbeef  // value to be popped into rdi
pop rdi is then executed, popping the top of the stack into RDI:
RIP = 0x400c02 (ret)
RDI = 0xdeadbeef
RSI = UNKNOWN

       0xffff0028: 0x400d00    // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled
       0xffff0020: 0x1337beef  // value we want in r15 (probably garbage)
       0xffff0018: 0x1337beef  // value we want in rsi
RSP -> 0xffff0010: 0x400c03    // address that the rdi gadget's ret will return to - the pop rsi gadget
The RDI gadget then rets into our RSI gadget:
RIP = 0x400c03 (pop rsi)
RDI = 0xdeadbeef
RSI = UNKNOWN

       0xffff0028: 0x400d00    // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled
       0xffff0020: 0x1337beef  // value we want in r15 (probably garbage)
RSP -> 0xffff0018: 0x1337beef  // value we want in rsi
RSI and R15 are popped:
RIP = 0x400c06 (ret)
RDI = 0xdeadbeef
RSI = 0x1337beef

RSP -> 0xffff0028: 0x400d00    // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled
And finally, the RSI gadget rets, jumping to whatever function we want, but now with RDI and RSI set to values we control.
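
The walk-through above can be replayed in a few lines of Python. This is a toy model rather than an emulator: the GADGETS table simply records which registers each example gadget pops, and the simulate function is a hypothetical helper that follows the same pop and ret steps. The gadget addresses and values are the ones from the example.

```python
import struct

POP_RDI = 0x400C01      # pop rdi; ret
POP_RSI_R15 = 0x400C03  # pop rsi; pop r15; ret
TARGET = 0x400D00       # function to reach once the registers are set

# Registers each gadget pops, in order, before its ret.
GADGETS = {POP_RDI: ["rdi"], POP_RSI_R15: ["rsi", "r15"]}

# The fake call stack, lowest address first (the order it is sent in).
chain = [POP_RDI, 0xDEADBEEF, POP_RSI_R15, 0x1337BEEF, 0x1337BEEF, TARGET]
payload = b"".join(struct.pack("<Q", word) for word in chain)


def simulate(chain):
    """Walk the fake call stack the way the CPU would, starting at main's ret."""
    stack = list(chain)
    regs = {}
    rip = stack.pop(0)                # main's ret pops the first gadget address
    while rip in GADGETS:
        for reg in GADGETS[rip]:
            regs[reg] = stack.pop(0)  # each pop takes the next stack slot
        rip = stack.pop(0)            # the gadget's ret pops the next address
    regs["rip"] = rip
    return regs


print({name: hex(value) for name, value in simulate(chain).items()})
```

In a real exploit, payload would follow the padding that reaches the saved RIP.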
Stack Canaries are a secret value placed on the stack which changes every time the program is started. Prior to a function return, the stack canary is checked and if it appears to be modified, the program exits immediately.
Stack Canaries seem like a clear-cut way to mitigate any stack smashing, as it is practically impossible to just guess a random 64-bit value. However, leaking the canary and bruteforcing the canary are two methods which would allow us to get through the canary check.
If we can read the data in the stack canary, we can send it back to the program later because the canary stays the same throughout execution. However, Linux makes this slightly tricky by making the first byte of the stack canary a NULL, meaning that string functions will stop when they hit it. A method around this would be to partially overwrite and then put the NULL back or find a way to leak bytes at an arbitrary stack offset.
A few situations where you might be able to leak a canary:
User-controlled format string
User-controlled length of an output
“Hey, can you send me 1000000 bytes? thx!”
Bruteforcing a Stack Canary
The canary is determined when the program starts up for the first time which means that if the program forks, it keeps the same stack cookie in the child process. This means that if the input that can overwrite the canary is sent to the child, we can use whether it crashes as an oracle and brute-force 1 byte at a time!
This method can be used on fork-and-accept servers where connections are spun off to child processes, but only under certain conditions such as when the input accepted by the program does not append a NULL byte (read or recv).
Buffer (N Bytes) ?? ?? ?? ?? ?? ?? ?? ?? RBP RIP
Fill the buffer N Bytes + 0x00 results in no crash
Buffer (N Bytes) 00 ?? ?? ?? ?? ?? ?? ?? RBP RIP
Fill the buffer N Bytes + 0x00 + 0x00 results in a crash
N Bytes + 0x00 + 0x01 results in a crash
N Bytes + 0x00 + 0x02 results in a crash
...
N Bytes + 0x00 + 0x51 results in no crash
Buffer (N Bytes) 00 51 ?? ?? ?? ?? ?? ?? RBP RIP
Repeat this bruteforcing process for 6 more bytes...
Buffer (N Bytes) 00 51 FE 0A 31 D2 7B 3C RBP RIP
Now that we have the stack cookie, we can overwrite the saved RIP while keeping the canary intact, and take control of the program!
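
The byte-at-a-time search can be sketched offline. Here child_survives is a stand-in for the real oracle: an actual exploit would send the padding plus the guess over a fresh connection and watch whether the forked child crashes. The secret value reuses the example bytes from the diagrams above, and both function names are hypothetical.

```python
SECRET_CANARY = bytes.fromhex("0051fe0a31d27b3c")  # example bytes from the diagrams


def child_survives(guess):
    """Simulated oracle: True if the guessed prefix leaves the canary unchanged."""
    return SECRET_CANARY.startswith(guess)


def brute_force_canary(oracle, length=8):
    """Recover the canary one byte at a time using a crash/no-crash oracle."""
    known = b""
    attempts = 0
    for _ in range(length):
        for candidate in range(256):
            attempts += 1
            if oracle(known + bytes([candidate])):
                known += bytes([candidate])  # this byte did not crash the child
                break
        else:
            raise RuntimeError("no byte survived; the oracle is unreliable")
    return known, attempts


canary, attempts = brute_force_canary(child_survives)
print(canary.hex(), attempts)
```

Guessing each byte separately needs at most 8 * 256 = 2048 attempts, compared with 2^64 for guessing the whole value at once.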
A buffer is any allocated space in memory where data (often user input) can be stored. For example, in the following C program name would be considered a stack buffer:
Given that buffers commonly hold user input, mistakes when writing to them could result in attacker controlled data being written outside of the buffer's space. See the page on buffer overflows for more.
To be able to call functions, there needs to be an agreed-upon way to pass arguments. If a program is entirely self-contained in a binary, the compiler would be free to decide the calling convention. However in reality, shared libraries are used so that common code (e.g. libc) can be stored once and dynamically linked in to programs that need it, reducing program size.
In Linux binaries, there are really only two commonly used calling conventions: cdecl for 32-bit binaries, and SysV for 64-bit.
Any method of passing arguments could be used as long as the compiler is aware of what the convention is. As a result, there have been many calling conventions in the past that aren't used frequently anymore. See Wikipedia for a comprehensive list.
A register is a location within the processor that is able to store data, much like RAM. Unlike RAM however, accesses to registers are effectively instantaneous, whereas reads from main memory can take hundreds of CPU cycles to return.
Registers can hold any value: addresses (pointers), results from mathematical operations, characters, etc. Some registers are reserved however, meaning they have a special purpose and are not "general purpose registers" (GPRs). On x86, the only 2 reserved registers are rip and rsp, which hold the address of the next instruction to execute and the address of the stack respectively.
On x86, the same register can have different sized accesses for backwards compatibility. For example, the rax register is the full 64-bit register, eax is the low 32 bits of rax, ax is the low 16 bits, al is the low 8 bits, and ah is the high 8 bits of ax (bits 8-15 of rax).
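
The relationship between these views is plain bit masking, which a short sketch can make concrete. The function name sub_registers and the sample value are arbitrary choices for illustration.

```python
def sub_registers(rax):
    """Split a 64-bit rax value into the narrower views x86 exposes."""
    eax = rax & 0xFFFFFFFF   # low 32 bits
    ax = rax & 0xFFFF        # low 16 bits
    al = rax & 0xFF          # low 8 bits
    ah = (rax >> 8) & 0xFF   # bits 8-15, the high half of ax
    return {"eax": eax, "ax": ax, "al": al, "ah": ah}


views = sub_registers(0x1122334455667788)
print({name: hex(value) for name, value in views.items()})
```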
A format string vulnerability is a bug where user input is passed as the format argument to printf, scanf, or another function in that family.
The format argument has many different specifiers which could allow an attacker to leak data if they control the format argument to printf. Since printf and similar are variadic functions, they will continue popping data off of the stack according to the format.
For example, if we can make the format argument "%x.%x.%x.%x", printf will pop off four stack values and print them in hexadecimal, potentially leaking sensitive information.
printf can also index to an arbitrary "argument" with the following syntax: "%n$x" (where n is the decimal index of the argument you want).
While these bugs are powerful, they're very rare nowadays, as all modern compilers warn when printf is called with a non-constant string.
#include <stdio.h>
#include <unistd.h>

int main() {
    int secret_num = 0x8badf00d;

    char name[64] = {0};
    read(0, name, 64);
    printf("Hello ");
    printf(name);
    printf("! You'll never get my secret!\n");
    return 0;
}
Due to how GCC decided to lay out the stack, secret_num is actually at a lower address on the stack than name, so we only have to go to the 7th "argument" in printf to leak the secret:
$ ./fmt_string
%7$llx
Hello 8badf00d3ea43eef
! You'll never get my secret!
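
When scripting this kind of leak, it helps to generate the positional specifier and to pull the interesting half out of the 64-bit value that comes back. The sketch below does both; the helper names are hypothetical, and the hex string is the output from the session above.

```python
def positional_leak(index, length_modifier="ll"):
    """Build a format string that prints the index-th printf "argument" in hex."""
    if index < 1:
        raise ValueError("printf positional arguments start at 1")
    return "%{}${}x".format(index, length_modifier)


def split_qword(hex_text):
    """Split a leaked 64-bit hex string into its high and low 32-bit halves."""
    value = int(hex_text, 16)
    return value >> 32, value & 0xFFFFFFFF


payload = positional_leak(7)
high, low = split_qword("8badf00d3ea43eef")
print(payload, hex(high), hex(low))
```

Here the high half is secret_num and the low half is whatever happened to sit next to it on the stack.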
Binary Security is using tools and methods in order to secure programs from being manipulated and exploited. These tools are not infallible, but when used together and implemented properly, they can raise the difficulty of exploitation greatly.
The Global Offset Table (or GOT) is a section inside of programs that holds addresses of functions that are dynamically linked. As mentioned in the page on calling conventions, most programs don't include every function they use to reduce binary size. Instead, common functions (like those in libc) are "linked" into the program so they can be saved once on disk and reused by every program.
Unless a program is marked full RELRO, the resolution of functions to addresses in dynamic libraries is done lazily. All dynamic libraries are loaded into memory along with the main program at launch; however, functions are not mapped to their actual code until they're first called. For example, in the following C snippet puts won't be resolved to an address in libc until after it has been called once:
#include <stdio.h>

int main() {
    puts("Hi there!");
    puts("Ok bye now.");
    return 0;
}
To avoid searching through shared libraries each time a function is called, the result of the lookup is saved into the GOT so future function calls "short circuit" straight to their implementation, bypassing the dynamic resolver.
This has two important implications:
The GOT contains pointers to libraries which move around due to ASLR
The GOT is writable
These two facts will become very useful in Return Oriented Programming.
Before a function's address has been resolved, the GOT points to an entry in the Procedure Linkage Table (PLT). This is a small "stub" function which is responsible for calling the dynamic linker with (effectively) the name of the function that should be resolved.
The Heap
The heap is a place in memory which a program can use to dynamically create objects. Creating objects on the heap has some advantages compared to using the stack:
Heap allocations can be dynamically sized
Heap allocations "persist" when a function returns
There are also some disadvantages however:
Heap allocations can be slower
Heap allocations must be manually cleaned up
Using the heap
In C, there are a number of functions used to interact with the heap, but we're going to focus on the two core ones:
This program reads in a size from the user, creates an allocation of that size on the heap, reads in that many bytes, then prints it back out to the user.
The Stack
In computer architecture, the stack is a hardware manifestation of the stack data structure (a Last In, First Out queue).
In x86, the stack is simply an area in RAM that was chosen to be the stack - there is no special hardware to store stack contents. The esp/rsp register holds the address in memory where the bottom of the stack resides. When something is pushed to the stack, esp decrements by 4 (or 8 on 64-bit x86), and the value that was pushed is stored at that location in memory. Likewise, when a pop instruction is executed, the value at esp is retrieved (i.e. esp is dereferenced), and esp is then incremented by 4 (or 8).
N.B. The stack "grows" down to lower memory addresses!
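
The push and pop behaviour described above can be modelled in a few lines. This is a toy sketch of the 32-bit case, with a dictionary standing in for RAM; the class name Stack32 and the starting esp value are arbitrary choices.

```python
class Stack32:
    """Toy model of the 32-bit x86 stack: a dict as memory plus an esp value."""

    def __init__(self, esp=0xFFFF0000):
        self.memory = {}
        self.esp = esp

    def push(self, value):
        self.esp -= 4                  # the stack grows toward lower addresses
        self.memory[self.esp] = value  # store the value at the new esp

    def pop(self):
        value = self.memory[self.esp]  # dereference esp
        self.esp += 4                  # then move esp back up
        return value


stack = Stack32()
stack.push(0x1111)
stack.push(0x2222)
print(hex(stack.esp))    # two pushes moved esp down by 8 bytes
print(hex(stack.pop()))  # the last value pushed is the first one popped
```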
Conventionally, ebp/rbp contains the address of the top of the current stack frame, and so sometimes local variables are referenced as an offset relative to ebp rather than an offset to esp. A stack frame is essentially just the space used on the stack by a given function.
Skipping over the bulk of main, you'll see that at 0x8048452 main's name local is pushed to the stack because it's the first argument to say_hi. Then, a call instruction is executed. call instructions first push the current instruction pointer to the stack, then jump to their destination. So when the processor begins executing say_hi at 0x0804840b, the stack looks like this:
And finally, ret pops the saved instruction pointer into eip which causes the program to return to main with the same esp, ebp, and stack contents as when say_hi was initially called.
Cryptography is the reason we can use banking apps, transmit sensitive information over the web, and in general protect our privacy. However, a large part of CTFs is breaking widely used encryption schemes which are improperly implemented. The math may seem daunting, but more often than not, a simple understanding of the underlying principles will allow you to find flaws and crack the code.
The word “cryptography” technically means the art of writing codes. When it comes to digital forensics, it’s a method you can use to understand how data is constructed for your analysis.
What is cryptography used for?
Uses in every day software
Securing web traffic (passwords, communication, etc.)
A Block Cipher is an algorithm which is used in conjunction with a cryptosystem in order to package a message into evenly distributed 'blocks' which are encrypted one at a time.
In this case i represents an index over the number of blocks in the plaintext. F() and g() represent the functions used to convert plaintext into ciphertext.
ECB is the most basic block cipher mode: it simply chunks up the plaintext into blocks, encrypts each block independently, and concatenates the results into a ciphertext.
Because ECB independently encrypts the blocks, patterns in data can still be seen clearly, as shown in the ECB Penguin image below.
[Images: Original Image, ECB Image]
Other Block Cipher Modes
Cipher Block Chaining (CBC)
CBC is an improvement upon ECB where an Initialization Vector is used in order to add randomness. The encrypted previous block is used as the IV for each sequential block meaning that the encryption process cannot be parallelized. CBC has been declining in popularity due to a variety of
Note
Even though the encryption process cannot be parallelized, the decryption process can be parallelized. If the wrong IV is used for decryption it will only affect the first block as the decryption of all other blocks depends on the ciphertext not the plaintext.
PCBC is a less used cipher mode which modifies CBC so that decryption is also not parallelizable. It also cannot be decrypted from any point as changes made during the decryption and encryption process "propagate" throughout the blocks, meaning that both the plaintext and ciphertext are used when encrypting or decrypting as seen in the images below.
Counter mode (CTR) is also known as CM, integer counter mode (ICM), and segmented integer counter (SIC).
CTR mode makes the block cipher similar to a stream cipher and it functions by adding a counter with each block in combination with a nonce and key to XOR the plaintext to produce the ciphertext. Similarly, the decryption process is the exact same except instead of XORing the plaintext, the ciphertext is XORed. This means that the process is parallelizable for both encryption and decryption and you can begin from anywhere as the counter for any block can be deduced easily.
If the nonce chosen is non-random, it is important to concatenate the nonce with the counter (high 64 bits to the nonce, low 64 bits to the counter) as adding or XORing the nonce with the counter would break security as an attacker can cause a collision with the nonce and counter. An attacker with access to providing a plaintext, nonce and counter can then decrypt a block by using the ciphertext as seen in the decryption image.
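
The structure of CTR mode fits in a short sketch. To stay within the standard library, the block cipher is replaced here by a truncated SHA-256 of key plus counter block. That substitution is purely an assumption for illustration and is not a secure cipher; a real implementation would use AES or similar. All function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 16  # bytes, matching a 128-bit block cipher


def toy_block_encrypt(key, block):
    """Stand-in for a real block cipher such as AES (NOT secure, illustration only)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def keystream_block(key, nonce, counter):
    # Concatenate the 64-bit nonce (high half) with the 64-bit counter (low half).
    counter_block = nonce.to_bytes(8, "big") + counter.to_bytes(8, "big")
    return toy_block_encrypt(key, counter_block)


def ctr_xor(key, nonce, data):
    """Encrypt or decrypt: CTR applies the same XOR in both directions."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK_SIZE):
        stream = keystream_block(key, nonce, offset // BLOCK_SIZE)
        chunk = data[offset:offset + BLOCK_SIZE]
        out.extend(c ^ s for c, s in zip(chunk, stream))
    return bytes(out)


key = b"hypothetical key"
message = b"counter mode turns a block cipher into a stream cipher"
ciphertext = ctr_xor(key, nonce=42, data=message)
recovered = ctr_xor(key, nonce=42, data=ciphertext)
print(recovered == message)
```

Because each keystream block depends only on the key, nonce, and block index, any block can be processed independently of the others.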
A Padding Oracle Attack sounds complex, but essentially means abusing a block cipher by changing the length of input and being able to determine the plaintext.
Hashing functions are one-way functions which theoretically provide a unique output for every input. MD5, SHA-1, and other hashes which were once considered secure are now known to have collisions, that is, two different pieces of data which produce the same supposedly unique output.
A string hash is a number or string generated using an algorithm that runs on text or data.
The idea is that each hash should be unique to the text or data (although sometimes it isn’t). For example, the hash for “dog” should be different from other hashes.
You can use command line tools or online resources such as this one. Example:
$ echo -n password | md5
5f4dcc3b5aa765d61d8327deb882cf99
Here, “password” is hashed with different hashing algorithms:
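
The same digests can be computed from Python's standard hashlib module. This is a small sketch; the choice of algorithms and the helper name hash_text are arbitrary.

```python
import hashlib


def hash_text(text, algorithm):
    """Return the hex digest of `text` (bytes) under the named algorithm."""
    return hashlib.new(algorithm, text).hexdigest()


for algorithm in ("md5", "sha1", "sha256"):
    print(algorithm, hash_text(b"password", algorithm))
```

The md5 line reproduces the digest from the command line example, while the other algorithms give outputs of different lengths for the same input.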
A file hash is a number or string generated using an algorithm that is run on text or data. The premise is that it should be unique to the text or data. If the file or text changes in any way, the hash will change.
What is it used for?
- File and data identification
- Password/certificate storage comparison
How can we determine the hash of a file?
You can use the md5sum command (or similar).
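If you would rather script it, a minimal sketch in Python could look like the following. The helper name `file_hash` and the temporary demo file are our own, and the file is read in chunks so that large files do not need to fit in memory.

```python
import hashlib
import os
import tempfile


def file_hash(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    # Feed the file to the hash function one chunk at a time.
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo: write a small temporary file and hash it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"password")
print(file_hash(tmp.name))
os.remove(tmp.name)
```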
A collision is when two pieces of data or text have the same cryptographic hash. This is very rare.
What’s significant about collisions is that they can be used to crack password hashes. Passwords are usually stored as hashes on a computer, since it’s hard to get the passwords from hashes.
If you bruteforce by trying every possible piece of text or data, eventually you’ll find something with the same hash. Enter it, and the computer accepts it as if you entered the actual password.
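The idea can be sketched in a few lines of Python. The target hash is the MD5 of "password" from the earlier example, while the tiny `wordlist` is a hypothetical stand-in for the much larger lists used in practice.

```python
import hashlib

target = "5f4dcc3b5aa765d61d8327deb882cf99"  # stored MD5 hash we want to match
wordlist = ["123456", "letmein", "password", "qwerty"]  # hypothetical candidates


def crack_md5(target_hex, candidates):
    # Hash each guess and compare it with the stored hash.
    for guess in candidates:
        if hashlib.md5(guess.encode()).hexdigest() == target_hex:
            return guess
    return None


print(crack_md5(target, wordlist))
```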
Two different files on the same hard drive with the same cryptographic hash can be very interesting.
“It’s now well-known that the cryptographic hash function MD5 has been broken,” said Peter Selinger of Dalhousie University. “In March 2005, Xiaoyun Wang and Hongbo Yu of Shandong University in China published an article in which they described an algorithm that can find two different sequences of 128 bytes with the same MD5 hash.”
For example, he cited this famous pair:
and
Each of these blocks has MD5 hash 79054025255fb1a26e4bc422aef54eb4.
Selinger said that “the algorithm of Wang and Yu can be used to create files of arbitrary length that have identical MD5 hashes, and that differ only in 128 bytes somewhere in the middle of the file. Several people have used this technique to create pairs of interesting files with identical MD5 hashes.”
Ben Laurie has a nice website that visualizes this MD5 collision. For a non-technical, though slightly outdated, introduction to hash functions, see Steve Friedl’s Illustrated Guide. And here’s a good article from DFI News that explores the same topic.
A Stream Cipher is used for symmetric key cryptography, where the same key is used to encrypt and decrypt data. Stream ciphers combine a pseudorandom sequence with the bits of the plaintext, usually with XOR, in order to generate ciphertext. A good way to think about stream ciphers is as generators of one-time pads from a given state.
A keystream is a sequence of pseudorandom digits which extends to the length of the plaintext in order to uniquely encrypt each character based on the corresponding digit in the keystream.
One Time Pads
A one time pad is an encryption mechanism whereby the entire plaintext is XOR'd with a random sequence of numbers in order to generate a random ciphertext. The advantage of the one time pad is that it offers an immense amount of security BUT in order for it to be useful, the randomly generated key must be distributed on a separate secure channel, meaning that one time pads have little use in modern day cryptographic applications on the internet. Stream ciphers extend upon this idea by using a key, usually 128 bit in length, in order to seed a pseudorandom keystream which is used to encrypt the text.
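A one time pad is easy to sketch in Python. The message below is made up, and `secrets.token_bytes` stands in for whatever truly random source the two parties would share over their secure channel.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    # XOR two byte strings position by position.
    return bytes(x ^ y for x, y in zip(a, b))


plaintext = b"ATTACK AT DAWN"
pad = secrets.token_bytes(len(plaintext))  # random key as long as the message

ciphertext = xor_bytes(plaintext, pad)
recovered = xor_bytes(ciphertext, pad)  # XORing with the pad again undoes it
print(ciphertext.hex())
print(recovered)
```

The pad has to be as long as the message and must never be reused, which is what makes distributing it impractical.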
A Synchronous Stream Cipher generates a keystream based on internal states not related to the plaintext or ciphertext. This means that the stream is generated pseudorandomly outside of the context of what is being encrypted. A binary additive stream cipher is the term used for a stream cipher which XORs the bits of the keystream with the bits of the plaintext. Encryption and decryption require that the synchronous stream cipher be in the same state, otherwise the message cannot be decrypted.
A Self-synchronizing Stream Cipher, also known as an asynchronous stream cipher or ciphertext autokey (CTAK), is a stream cipher which uses the previous N ciphertext digits in order to compute the keystream used for the next N characters.
Note
Seems a lot like block ciphers, doesn't it? That's because block cipher feedback mode (CFB) is an example of a self-synchronizing stream cipher.
The key tenet of using stream ciphers securely is to NEVER repeat key use because of the commutative property of XOR. If C~1~ and C~2~ have both been XOR'd with the same key K, the key cancels out, because C~1~ XOR C~2~ = P~1~ XOR P~2~. Since the result is the XOR of two English-language plaintexts, cryptanalysis tools such as character frequency analysis will work well due to the low entropy of the English language.
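The cancellation is easy to check in Python. The keystream and both plaintexts below are made up for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


keystream = bytes(range(40, 56))  # hypothetical keystream, wrongly used twice
p1 = b"attack at dawn!!"
p2 = b"retreat at noon!"

c1 = xor_bytes(p1, keystream)
c2 = xor_bytes(p2, keystream)

# The keystream cancels out: C1 ^ C2 == P1 ^ P2.
print(xor_bytes(c1, c2) == xor_bytes(p1, p2))

# Knowing (or guessing) one plaintext then reveals the other.
print(xor_bytes(xor_bytes(c1, c2), p1))
```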
Another key tenet of using stream ciphers securely is remembering that just because a message has been decrypted, it does not mean the message has not been tampered with. Because decryption is based on state, if an attacker knows the layout of the plaintext, a Man in the Middle (MITM) attack can flip a bit during transit, altering the underlying ciphertext. If a ciphertext decrypts to 'Transfer $1000', then a middleman can flip a single bit in order for the ciphertext to decrypt to 'Transfer $9000', because changing a single character in the ciphertext does not affect the state in a synchronous stream cipher.
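Here is a small sketch of that bit flip in Python. The keystream is invented, and the attacker is assumed to know only that the first digit of the amount sits at offset 10.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


keystream = bytes(range(100, 114))  # hypothetical keystream, unknown to the attacker
plaintext = b"Transfer $1000"
ciphertext = xor_bytes(plaintext, keystream)

# The attacker flips one bit of the ciphertext without knowing the keystream.
# '1' is 0x31 and '9' is 0x39, so the two characters differ only in bit 0x08.
tampered = bytearray(ciphertext)
tampered[10] ^= 0x08

print(xor_bytes(bytes(tampered), keystream))
```

This is why stream ciphers are normally paired with a message authentication code.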
RSA, which is an abbreviation of the authors' names (Rivest–Shamir–Adleman), is a cryptosystem which allows for asymmetric encryption. Asymmetric cryptosystems are also commonly referred to as Public Key Cryptography, where a public key is used to encrypt data and only a secret, private key can be used to decrypt the data.
The message is represented as m and is converted into a number
The encrypted message or ciphertext is represented by c
p and q are prime numbers which make up n
e is the public exponent
n is the modulus and its length in bits is the bit length (i.e. 1024 bit RSA)
d is the private exponent
The totient λ(n) is used to compute d and is equal to lcm(p-1, q-1). Equivalently, λ(pq) = lcm(λ(p), λ(q)).
What makes RSA viable?
If the public n, public e, and private d are all very large numbers and a message m satisfies 0 < m < n, then we can say:
(m^e^)^d^ ≡ m (mod n)
Note
The triple equals sign in this case refers to modular congruence which in this case means that there exists an integer k such that (m^e^)^d^ = kn + m
RSA is viable because it is incredibly hard to find d even with m, n, and e because factoring large numbers is an arduous process.
We are going to follow along Wikipedia's small numbers example in order to make this idea a bit easier to understand.
Note
In this example we are using Carmichael's totient function where λ(n) = lcm(λ(p), λ(q)), but Euler's totient function is perfectly valid to use with RSA. Euler's totient is φ(n) = (p − 1)(q − 1).
Choose two prime numbers such as:
p = 61 and q = 53
Find n:
n = pq = 3233
Calculate λ(n) = lcm(p-1, q-1)
λ(3233) = lcm(60, 52) = 780
Choose a public exponent e such that 1 < e < λ(n) and e is coprime with λ(n) (the two share no common factor other than 1). The standard in most cases is 65537, but we will be using:
e = 17
Calculate d as the modular multiplicative inverse of e modulo λ(n). In plain English, find d such that d x e mod λ(n) = 1:
d x 17 mod 780 = 1
d = 413
Now we have a public key of (3233, 17) and a private key of (3233, 413)
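The whole walkthrough can be checked with a few lines of Python. This is only a sketch of textbook RSA with the toy numbers above. The message m = 65 is an assumption borrowed from the same Wikipedia example, and the three-argument `pow` with a negative exponent needs Python 3.8 or newer.

```python
from math import gcd

p, q = 61, 53
n = p * q
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
e = 17
d = pow(e, -1, lam)  # modular multiplicative inverse of e modulo lam

m = 65                # example message (assumption, see above)
c = pow(m, e, n)      # encrypt with the public key (n, e)
print(n, lam, d)
print(c, pow(c, d, n))  # decrypting with the private key gives m back
```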
An XOR or eXclusive OR is a bitwise operation indicated by ^ and shown by the following truth table:
    A  B  A ^ B
    0  0    0
    0  1    1
    1  0    1
    1  1    0
So XOR'ing the bytes 0xA0 ^ 0x2C translates to:
      1 0 1 0 0 0 0 0
    ^ 0 0 1 0 1 1 0 0
    = 1 0 0 0 1 1 0 0
0b10001100 is equivalent to 0x8C. A cool property of XOR is that it is reversible, meaning 0x8C ^ 0x2C = 0xA0 and 0x8C ^ 0xA0 = 0x2C.
What does this have to do with CTF?
XOR is a cheap way to encrypt data with a password. Any data can be encrypted using XOR as shown in this Python example:

    >>> data = 'CAPTURETHEFLAG'
    >>> key = 'A'
    >>> encrypted = ''.join([chr(ord(x) ^ ord(key)) for x in data])
    >>> encrypted
    '\x02\x00\x11\x15\x14\x13\x04\x15\t\x04\x07\r\x00\x06'
    >>> decrypted = ''.join([chr(ord(x) ^ ord(key)) for x in encrypted])
    >>> decrypted
    'CAPTURETHEFLAG'
This can be extended using a multibyte key by iterating in parallel with the data.
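A minimal sketch of that extension, using `itertools.cycle` to repeat the key alongside the data. The key `KEY` is just an example.

```python
from itertools import cycle


def xor_with_key(data: bytes, key: bytes) -> bytes:
    # Repeat the key for as long as there is data, XORing byte by byte.
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


encrypted = xor_with_key(b"CAPTURETHEFLAG", b"KEY")
print(encrypted)
print(xor_with_key(encrypted, b"KEY"))  # applying the same key again decrypts
```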
Multibyte XOR gets exponentially harder to bruteforce the longer the key is, but if the encrypted text is long enough, character frequency analysis is a viable method to find the key. Character frequency analysis means that we split the ciphertext into groups based on the number of characters in the key, so that every byte in a group was XORed with the same key byte. These groups are then bruteforced using the idea that some letters appear more frequently in English text than others.
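Each group is effectively a single-byte XOR, so the per-group step can be sketched as below. The scoring function is a deliberately crude stand-in for real frequency analysis: it only counts lowercase letters and spaces, which is enough for this made-up all-lowercase example but would need proper letter frequencies for real data.

```python
def score(candidate: bytes) -> int:
    # Crude stand-in for frequency analysis: count lowercase letters and spaces.
    return sum(1 for b in candidate if b == 0x20 or 0x61 <= b <= 0x7A)


def crack_single_byte_xor(ciphertext: bytes):
    # Try all 256 key bytes and keep the one whose output looks most like text.
    best_key = max(range(256), key=lambda k: score(bytes(b ^ k for b in ciphertext)))
    return best_key, bytes(b ^ best_key for b in ciphertext)


secret = b"the quick brown fox jumps over the lazy dog"
ciphertext = bytes(b ^ 0x5A for b in secret)  # 0x5A is an arbitrary example key
print(crack_single_byte_xor(ciphertext))
```

For a multibyte key of length N, the same function would be run once on every Nth byte of the ciphertext, starting at offsets 0 through N-1.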
Forensics is the art of recovering the digital trail left on a computer. There are plenty of methods to find data which is seemingly deleted, not stored, or worse, covertly recorded.
An important part of forensics is having the right tools, as well as being familiar with the following topics:
File extensions are not the sole way to identify the type of a file. Files have certain leading bytes called file signatures which allow programs to parse the data in a consistent manner. Files can also contain additional "hidden" data called metadata, which can be useful in finding out information about the context of a file's data.
File signatures (also known as File Magic Numbers) are bytes within a file used to identify the format of the file. Generally they’re 2-4 bytes long, found at the beginning of a file.
What is it used for?
Files can sometimes come without an extension, or with incorrect ones. We use file signature analysis to identify the format (file type) of the file. Programs need to know the file type in order to open it properly.
How do you find the file signature?
You need to be able to look at the binary data that constitutes the file you’re examining. To do this, you’ll use a hexadecimal editor. Once you find the file signature, you can check it against file signature repositories such as Gary Kessler’s.
The file above, when opened in a hex editor, begins with the bytes FFD8FFE0 00104A46 494600, or in ASCII ÿØÿà JFIF, where \x00 and \x10 lack symbols.
Searching in Gary Kessler’s database shows that this file signature belongs to a JPEG/JFIF graphics file, exactly what we suspected.
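The same check can be scripted. The sketch below compares the start of a file against a handful of well-known signatures. The `SIGNATURES` table is a tiny hand-picked sample, not a replacement for a full signature database.

```python
# A few well-known file signatures (magic numbers) and what they indicate.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF8": "GIF image",
    b"%PDF": "PDF document",
    b"PK\x03\x04": "ZIP archive",
}


def identify(header: bytes) -> str:
    # Compare the first bytes of the file against each known signature.
    for magic, description in SIGNATURES.items():
        if header.startswith(magic):
            return description
    return "unknown"


header = bytes.fromhex("FFD8FFE000104A46494600")  # the bytes from the example above
print(identify(header), header[6:10])
```

To use it on a real file, read the first few bytes with `open(path, "rb").read(16)` and pass them to `identify`.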
A hexadecimal (hex) editor (also called a binary file editor or byte editor) is a computer program you can use to manipulate the fundamental binary data that constitutes a computer file. The name “hex” comes from “hexadecimal,” a standard numerical format for representing binary data. A typical computer file occupies multiple areas on the platter(s) of a disk drive, whose contents are combined to form the file. Hex editors that are designed to parse and edit sector data from the physical segments of floppy or hard disks are sometimes called sector editors or disk editors. A hex editor is used to see or edit the raw, exact contents of a file. Hex editors may be used to correct data corrupted by a system or application. A list of editors can be found on the forensics Wiki. You can download one and install it on your system.
A forensic image is an electronic copy of a drive (e.g. a hard drive, USB, etc.). It’s a bit-by-bit or bitstream file that’s an exact, unaltered copy of the media being duplicated.
Wikipedia said that the most straightforward disk imaging method is to read a disk from start to finish and write the data to a forensics image format. “This can be a time-consuming process, especially for disks with a large capacity,” Wikipedia said.
To prevent write access to the disk, you can use a write blocker. It’s also common to calculate a cryptographic hash of the entire disk when imaging it. “Commonly-used cryptographic hashes are MD5, SHA1 and/or SHA256,” said Wikipedia. “By recalculating the integrity hash at a later time, one can determine if the data in the disk image has been changed. This by itself provides no protection against intentional tampering, but it can indicate that the data was altered, e.g. due to corruption.”
Why image a disk? Forensic imaging:
- Prevents tampering with the original data evidence
- Allows you to play around with the copy, without worrying about messing up the original
There are plenty of traces of someone's activity on a computer, but perhaps some of the most valuable information can be found within memory dumps, that is, images taken of RAM. These dumps of data are often very large, but can be analyzed using a tool called Volatility.
In order to properly use Volatility you must supply a profile with --profile=PROFILE, therefore before any sleuthing, you need to determine the profile using imageinfo:
    $ python vol.py -f ~/image.raw imageinfo
    Volatility Foundation Volatility Framework 2.4
    Determining profile based on KDBG search...

         Suggested Profile(s) : Win7SP0x64, Win7SP1x64, Win2008R2SP0x64, Win2008R2SP1x64
                    AS Layer1 : AMD64PagedMemory (Kernel AS)
                    AS Layer2 : FileAddressSpace (/Users/Michael/Desktop/win7_trial_64bit.raw)
                     PAE type : PAE
                          DTB : 0x187000L
                         KDBG : 0xf80002803070
         Number of Processors : 1
    Image Type (Service Pack) : 0
               KPCR for CPU 0 : 0xfffff80002804d00L
            KUSER_SHARED_DATA : 0xfffff78000000000L
          Image date and time : 2012-02-22 11:29:02 UTC+0000
    Image local date and time : 2012-02-22 03:29:02 -0800
Metadata is data about data. Different types of files have different metadata. The metadata on a photo could include dates, camera information, GPS location, comments, etc. For music, it could include the title, author, track number and album.
What kind of file metadata is useful?
Potentially, any file metadata you can find could be useful.
How do I find it?
Note
EXIF Data is metadata attached to photos which can include location, time, and device information.
One of our favorite tools is exiftool, which displays metadata for an input file, including:
- File size
- Dimensions (width and height)
- File type
- Programs used to create (e.g. Photoshop)
- OS used to create (e.g. Apple)
Run command line: exiftool(-k).exe [filename] and you should see something like this:
Timestamps are data that indicate the time of certain events (MAC):
- Modification – when a file was modified
- Access – when a file or entries were read or accessed
- Creation – when files or entries were created
Types of timestamps
Modified
Accessed
Created
Date Changed (MFT)
Filename Date Created (MFT)
Filename Date Modified (MFT)
Filename Date Accessed (MFT)
INDX Entry Date Created
INDX Entry Date Modified
INDX Entry Date Accessed
INDX Entry Date Changed
Why do we care?
Certain events such as creating, moving, copying, opening, editing, etc. might affect the MAC times. If the MAC timestamps can be obtained, a timeline of events can be created.
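On a live system, the basic timestamps can be read with Python's `os.stat`. This is only a sketch: the helper name `mac_times` is ours, it reports the three timestamps the operating system exposes rather than the MFT or INDX entries listed above, and the meaning of `st_ctime` depends on the platform.

```python
import os
from datetime import datetime, timezone


def mac_times(path: str) -> dict:
    info = os.stat(path)

    def to_utc(timestamp: float) -> datetime:
        return datetime.fromtimestamp(timestamp, tz=timezone.utc)

    return {
        "modified": to_utc(info.st_mtime),
        "accessed": to_utc(info.st_atime),
        # st_ctime is the creation time on Windows but the
        # metadata-change time on Unix-like systems.
        "changed_or_created": to_utc(info.st_ctime),
    }


# Demo on the current directory; any file path works the same way.
for label, stamp in mac_times(".").items():
    print(f"{label:>18}: {stamp.isoformat()}")
```

Sorting the collected timestamps for a set of files is a simple way to start building the timeline.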
There are plenty more patterns than the ones introduced below, but these are the basics you should start with to get a good understanding of how it works, and to complete this challenge.
We know that the BMP files fileA and fileD are the same, but that the JPEG files fileB and fileC are different somehow. So how can we find out what went on with these files?
By using time stamp information from the file system, we can learn that the BMP fileD was the original file, with fileA being a copy of the original. Afterward, fileB was created by modifying fileD, and fileC was created by modifying fileA in a different way.
Follow along as we demonstrate.
We’ll start by analyzing images in AccessData FTK Imager, where there’s a Properties window that shows you some information about the file or folder you’ve selected.
Here are the extracted MAC times for fileA, fileB, fileC and fileD. Note that AccessData FTK Imager assumes that the file times on the drive are in UTC (Coordinated Universal Time). I subtracted four hours, since the USB was set up in Eastern Standard Time. This isn’t necessary, but it helps me understand the times a bit better.
Highlight timestamps that are the same. If timestamps are off by a few seconds, they should be counted as the same. This lets you see a clear difference between different timestamps. Then, highlight oldest to newest to help put them in order.
Steganography is the practice of hiding data in plain sight. Steganography is often embedded in images or audio.
You could send a picture of a cat to a friend and hide text inside. Looking at the image, there’s nothing to make anyone think there’s a message hidden inside it.
You could also hide a second image inside the first.
So we can hide text and an image, how do we find out if there is hidden data?
FileA and FileD appear the same, but they’re different. Also, FileD was modified after it was copied, so it’s possible there might be steganography in it.
FileB and FileC don’t appear to have been modified after being created. That doesn’t rule out the possibility that there’s steganography in them, but you’re more likely to find it in fileD. This brings up two questions:
Can we determine that there is steganography in fileD?
Let’s say we have an image, and part of it contains a run of bytes.
And let’s say we want to hide the character y inside.
First, we need to convert the hidden message to binary. In ASCII, y is 0x79, or 01111001.
Now we take each bit from the hidden message and replace the LSB of the corresponding byte with it, one bit per byte, until all eight bits of the character have been stored.
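The bit-by-bit replacement can be sketched in Python as follows. The eight `cover` bytes are made-up pixel values, and the helper names are our own.

```python
def hide_byte(cover: bytes, secret: int) -> bytes:
    # Write the 8 bits of `secret`, most significant bit first,
    # into the least significant bits of the first 8 cover bytes.
    out = bytearray(cover)
    for i in range(8):
        bit = (secret >> (7 - i)) & 1
        out[i] = (out[i] & 0xFE) | bit
    return bytes(out)


def reveal_byte(stego: bytes) -> int:
    # Collect the LSB of each of the first 8 bytes back into one value.
    value = 0
    for b in stego[:8]:
        value = (value << 1) | (b & 1)
    return value


cover = bytes([0b10110100, 0b11100101, 0b10001011, 0b10101100,
               0b11010001, 0b10010111, 0b00010101, 0b01101000])
stego = hide_byte(cover, ord("y"))

print([format(b, "08b") for b in stego])
print(chr(reveal_byte(stego)))
```

Each cover byte changes by at most one, which is why the altered image looks identical to the eye.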
Decoding LSB steganography is exactly the same as encoding, but in reverse. For each byte, grab the LSB and add it to your decoded message. Once you’ve gone through each byte, convert all the LSBs you grabbed into text or a file. (You can use your file signature knowledge here!)
What other types of steganography are there?
Steganography is hard for the defense side, because there’s practically an infinite number of ways it could be carried out. Here are a few examples:
- LSB steganography: different bits, different bit combinations
- Encode in every certain number of bytes
- Use a password
- Hide in different places
- Use encryption on top of steganography
"Wireshark saved me hours on my last tax return!" - David
"[Wireshark] is great for ruining your weekend and fixing pesky networking problems!" - Max
"Wireshark is the powerhouse of the cell." - Joe
"Does this cable do anything?" - Ayyaz
Wireshark is a network protocol analyzer which is often used in CTF challenges to look at recorded network traffic. Wireshark uses a filetype called PCAP to record traffic. PCAPs are often distributed in CTF challenges to provide recorded traffic history.
Upon opening Wireshark, you are greeted with the option to open a PCAP or begin capturing network traffic on your device.
The network traffic displayed initially shows the packets in the order in which they were captured. You can filter packets by protocol, source IP address, destination IP address, length, etc.
In order to apply filters, simply enter the constraining factor, for example 'http', in the display filter bar.
Filters can be chained together using '&&' notation. In order to filter by IP, ensure a double equals '==' is used.
The most pertinent part of a packet is its data payload and protocol information.
In order for a network session to be encrypted properly, the client and server must share a common secret which they can use to encrypt and decrypt data without someone in the middle being able to guess it. The SSL Handshake loosely follows this format:
The client sends a list of available cipher suites it can use along with a random set of bytes referred to as client_random
The server sends back the cipher suite that will be used, such as TLS_DHE_RSA_WITH_AES_128_CBC_SHA, along with a random set of bytes referred to as server_random
The client generates a pre-master secret, encrypts it, then sends it to the server.
The server and client then generate a common master secret using the selected cipher suite
The client and server begin communicating using this common secret
There are several ways to be able to decrypt traffic.
If you have the client and server random values and the pre-master secret, the master secret can be generated and used to decrypt the traffic
If you have the master secret, traffic can be decrypted easily
If the cipher-suite uses RSA, you can factor n in the key in order to break the encryption on the encrypted pre-master secret and generate the master secret with the client and server randoms
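The last option can be sketched with a toy key. Everything here is an assumption for illustration: the modulus 3233 is far too small to be realistic, trial division only works because of that, and the number 1234 merely stands in for an encrypted pre-master secret. Real moduli need specialised factoring tools and only fall when the key is weak.

```python
def factor(n: int):
    # Trial division: only feasible because this modulus is tiny.
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 1
    raise ValueError("no factor found")


n, e = 3233, 17                  # toy public key (assumption for illustration)
p, q = factor(n)
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)              # recovered private exponent (Python 3.8+)

ciphertext = pow(1234, e, n)     # stands in for the encrypted pre-master secret
print(p, q, d)
print(pow(ciphertext, d, n))     # the recovered secret
```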
Reverse Engineering in a CTF is typically the process of taking a compiled (machine code, bytecode) program and converting it back into a more human readable format.
Very often the goal of a reverse engineering challenge is to understand the functionality of a given program such that you can identify deeper issues.
If we are given a binary compiled from that source and we want to figure out how the source looks, we can use a decompiler to get C pseudocode which we can then use to reconstruct the function. The sample decompilation can look like:

    printSpacer:
    int __fastcall printSpacer(int a1)
    {
      int i; // [rsp+8h] [rbp-8h]

      for ( i = 0; i < a1; ++i )
        printf("-");
      return printf("\n");
    }

    main:
    int __cdecl main(int argc, const char **argv, const char **envp)
    {
      int v4; // [rsp+18h] [rbp-18h]
      signed int i; // [rsp+1Ch] [rbp-14h]

      for ( i = 0; i < 13; ++i )
      {
        v4 = i + 1;
        printf("%c", (unsigned int)aHelloWorld[i], envp);
        while ( v4 < 13 )
          printf("%c", (unsigned int)aHelloWorld[v4++]);
        printf("\n");
        printSpacer(13 - i);
      }
      return 0;
    }
A good method of getting a good representation of the source is to convert the decompilation into Python, since Python is basically pseudocode that runs. Starting with main often allows you to gain a good overview of what the program is doing and will help you translate the other functions.
We know we will start with a main function and some variables. If you trace how the variables are used, you can oftentimes determine their types. Because i is being used as an index, we know it's an int, and because v4 is used as one later on, it too is an index. We can also see that a variable aHelloWorld is being printed with "%c", so we can determine that it represents the 'Hello, World!' string. Let's define all these variables in our Python main function:

    def main():
        string = "Hello, World!"
        i = 0
        v4 = 0
        for i in range(0, 13):
            v4 = i + 1
            print(string[i], end='')
            while v4 < 13:
                print(string[v4], end='')
                v4 += 1
            print()
            printSpacer(13-i)
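main calls printSpacer, which the decompilation shows as a loop that prints a1 dashes followed by a newline. One possible Python translation is sketched below. Splitting it into a `spacer` helper that builds the string is our own choice, made so the result is easy to check.

```python
def spacer(a1: int) -> str:
    # The C loop prints "-" a1 times; a zero or negative count prints nothing.
    return "-" * a1


def printSpacer(a1: int) -> None:
    # print() adds the trailing newline that the C code emits with printf("\n").
    print(spacer(a1))


printSpacer(13)
```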
The Interactive Disassembler (IDA) is the industry standard for binary disassembly. IDA is capable of disassembling "virtually any popular file format". This makes it very useful to security researchers and CTF players who often need to analyze obscure files without knowing what they are or where they came from. IDA also features the industry leading Hex-Rays decompiler, which can convert assembly code back into a pseudocode-like format.
IDA also has a plugin interface which has been used to create some successful plugins that can make reverse engineering easier:
Binary Ninja is an up and coming disassembler that attempts to bring a new, more programmatic approach to reverse engineering. Binary Ninja brings an improved plugin API and modern features to reverse engineering. While it's not as popular or as old as IDA, Binary Ninja (often called binja) is quickly gaining ground and has a small community of dedicated users and followers.
Binja also has some community contributed plugins which are collected here: https://github.com/Vector35/community-plugins
The GNU Debugger is a free and open source debugger which also disassembles programs. It's capable as a disassembler, but most notably it is used by CTF players for its debugging and dynamic analysis capabilities.
gdb is often used in tandem with enhancement scripts like peda, pwndbg, and GEF.
Machine code is code which has been formatted for direct execution by a CPU, and assembly is its human-readable representation. Machine code is the reason why readable programming languages like C, when compiled, cannot be reversed into source code (well, decompilers can, sort of, but more on that later).
From Source to Compilation
Godbolt shows the differences in machine code generated by various compilers.
This is a one way process for compiled languages as there is no way to generate source from machine code. While the machine code may seem unintelligible, the extremely basic functions can be interpreted with some practice.
x86-64 (also called amd64 or x64) is a 64-bit Complex Instruction Set Computing (CISC) architecture. This basically means that its registers are 64 bits wide, extending the 32-bit registers of Intel's x86 architecture. CISC means that a single instruction can do a bunch of different things at once, such as memory accesses, register reads, etc. It is also a variable-length instruction set, which means different instructions can be different sizes, ranging from 1 to 15 bytes long. And finally, x86-64 allows for multi-sized register access, which means that you can access certain parts of a register which are different sizes.
x86-64 registers behave similarly to other architectures. A key component of x86-64 registers is multi-sized access, which means the register RAX can have its lower 32 bits accessed with EAX. The lower 16 bits can be accessed with AX and the lowest 8 bits can be accessed with AL, which allows the compiler to make optimizations which boost program execution.
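The relationship between these names can be modelled with bit masks. The value placed in `rax` below is arbitrary, and the Python variables only imitate what the CPU does in hardware.

```python
rax = 0x1122334455667788  # arbitrary example value

eax = rax & 0xFFFFFFFF    # lower 32 bits
ax = rax & 0xFFFF         # lower 16 bits
al = rax & 0xFF           # lowest 8 bits

print(hex(eax), hex(ax), hex(al))
```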
x86-64 has plenty of registers to use, including rax, rbx, rcx, rdx, rdi, rsi, rsp, rip, r8-r15, and more! But some registers serve special purposes.
The special registers include:
- RIP: the instruction pointer
- RSP: the stack pointer
- RBP: the base pointer
An instruction represents a single operation for the CPU to perform.
There are different types of instructions including:
Data movement: mov rax, [rsp - 0x40]
Arithmetic: add rbx, rcx
Control-flow: jne 0x8000400
Because x86-64 is a CISC architecture, instructions can be quite complex for machine code, such as repne scasb which repeats up to ECX times over memory at EDI looking for a NULL byte (0x00), decrementing ECX each byte (essentially strlen() in a single instruction!).
It is important to remember that an instruction really is just memory; this idea will become useful with Return Oriented Programming or ROP.
Note
Instructions, numbers, strings: everything is ultimately stored as bytes, which are usually written in hex!

    add rax, rbx          == 48 01 d8
    mov rax, 0xdeadbeef   == 48 c7 c0 ef be ad de
    mov rax, [0xdeadbeef] == 67 48 8b 05 ef be ad de
    "Hello"               == 48 65 6c 6c 6f
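You can reproduce some of these byte sequences in Python. The sketch below shows the string bytes and why 0xdeadbeef appears as ef be ad de inside the encodings: x86-64 stores multi-byte values in little-endian order. The separator argument to `hex()` needs Python 3.8 or newer.

```python
import struct

# The bytes of the string, shown in hex.
print(b"Hello".hex(" "))

# A 32-bit value packed in little-endian order, as it appears inside an instruction.
print(struct.pack("<I", 0xDEADBEEF).hex(" "))

# An instruction is just bytes too: these three encode `add rax, rbx`.
print(bytes.fromhex("48 01 d8"))
```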
What should the CPU execute? This is determined by the RIP register where IP means instruction pointer. Execution follows the pattern: fetch the instruction at the address in RIP, decode it, run it.
Here the operation mov is moving the "immediate" 0xdeadbeef into the register RAX
mov rax, [0xdeadbeef + rbx * 4]
Here the operation mov is moving the data at the address of [0xdeadbeef + RBX*4] into the register RAX. When brackets are used, you can think of the program as getting the content from that effective address.
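A toy model may help here. The dictionary below plays the role of memory, and its contents and the value of `rbx` are made up. The point is only that the expression inside the brackets is computed first and then used as an address.

```python
memory = {0xDEADBEEF + 3 * 4: 0x1337}  # hypothetical memory contents
rbx = 3

address = 0xDEADBEEF + rbx * 4  # the effective address inside the brackets
rax = memory[address]           # the brackets mean "load from this address"

print(hex(address), hex(rax))
```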
How can we express conditionals in x86-64? We use conditional jumps such as:
jnz <address>
je <address>
jge <address>
jle <address>
etc.
They jump if their condition is true, and just go to the next instruction otherwise. These conditionals check EFLAGS, a special register which stores flags set by certain instructions. For example, add rax, rbx sets the o (overflow) flag if the sum is greater than a 64-bit register can hold and wraps around. You can jump based on that with a jo instruction. The most important thing to remember is the cmp instruction:

    cmp rax, rbx
    jle error

This assembly jumps if RAX <= RBX.
Addresses
Memory acts similarly to a big array where the indices of this "array" are memory addresses. Remember from earlier:
mov rax, [0xdeadbeef]
The square brackets mean "get the data at this address". This is analogous to the C/C++ syntax: rax = *0xdeadbeef;
What is bytecode
The C Programming Language
History
The C programming language was written by Dennis Ritchie in the 1970s while he was working at Bell Labs. It was first used to reimplement the Unix operating system, which had been written purely in assembly language. At first, the Unix developers were considering using a language called "B", but because B wasn't optimized for the target computer, the C language was created.
Note
C is the letter and the programming language after B!
C was designed to be close to assembly and is still widely used in lower level programming where speed and control are needed (operating systems, embedded systems). C was also very influential to other programming languages used today. Notable languages include C++, Objective-C, Golang, Java, JavaScript, PHP, Python, and Rust.
Today C is widely used either as a low level programming language or as the base language that other programming languages are implemented in.
While it can be difficult to see, the C language compiles down directly into machine code. The compiler is programmed to process the provided C code and emit assembly that's targeted to whatever operating system and architecture the compiler is set to use.
Some common compilers include:
gcc
clang
A good way to explore this relationship is to use this online GCC Explorer from Matt Godbolt.
In regards to CTF, many reverse engineering and exploitation CTF challenges are written in C because the language compiles down directly to assembly and there are little to no safeguards in the language. This means developers must handle things like memory management and safety checks themselves. Of course, this can lead to mistakes which can sometimes lead to security issues.
Note
Other higher level languages like Python manage memory and garbage collection for you. Google's Golang was inspired by C, but adds in functionality like garbage collection and memory safety.
There are some examples of famously vulnerable functions in C which are still available and can still result in vulnerabilities:
C uses an idea known as pointers. A pointer is a variable which contains the address of another variable.
To understand this idea we should first understand that memory is laid out in terms of addresses and data gets stored at these addresses.
Take the following example of defining an integer in C:
    int x = 4;
To the programmer this is the variable x receiving the value of 4. The computer stores this value in some location in memory. For example we can say that address 0x1000 now holds the value 4. The computer knows to directly access the memory and retrieve the value 4 whenever the programmer tries to use the x variable. If we were to say x + 4, the computer would give you 8 instead of 0x1004.
But in C we can retrieve the memory address being used to hold the 4 value (i.e. 0x1000) by using the & character and using * to create an \"integer pointer\" type.
int* y = &x;\n
The y variable will store the address of the x variable (0x1000).
Note
The * character allows us to declare pointer variables but also allows us to access the value stored at a pointer. For example, entering *y allows us to access the 4 value instead of 0x1000.
Whenever we use the y variable we are using the memory address, but if we use the x variable we use the value stored at the memory address.
Arrays allow programmers to group data into logical containers.
To access the individual elements of an array we access the contents by their \"index\". Most programming languages today start counting from 0. So to take our previous example:
"}, {"location": "reverse-engineering/what-is-c/#how-do-arrays-work", "title": "How do arrays work?", "text": "
Arrays are a clever combination of multiplication, pointers, and programming.
Because the computer knows the data type used for every element in the array, the computer needs to simply multiply the size of the data type by the index you are looking for and then add this value to the address of the beginning of the array.
For example if we know that the base address of an array is 1000 and we know that each integer takes 8 bytes, we know that if we have 8 integers right next to each other, we can get the integer at the 4th index with the following math:
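A minimal Python sketch of that calculation, using the numbers from the paragraph above (the helper name element_address is hypothetical and only used for illustration):

```python
def element_address(base_address, element_size, index):
    """Return the address of array[index] for a contiguous array."""
    # address = base + (size of one element * index)
    return base_address + element_size * index


# Values from the text: base address 1000, 8-byte integers, index 4.
address = element_address(1000, 8, 4)
print(address)  # 1032
```

The same formula explains why index 0 is the first element: multiplying by zero leaves the base address unchanged.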
"}, {"location": "reverse-engineering/what-is-c/#memory-management", "title": "Memory Management", "text": ""}, {"location": "reverse-engineering/what-is-gdb/", "title": "The GNU Debugger (GDB)", "text": "
The GNU Debugger or GDB is a powerful debugger which allows for step-by-step execution of a program. It can be used to trace program execution and is an important part of any reverse engineering toolkit.
GDB without any modifications is unintuitive and obscures a lot of useful information. The plug-in pwndbg solves a lot of these problems and makes for a much more pleasant experience. But if you are constrained and have to use vanilla gdb, here are several things to make your life easier.
In order to view the state of registers with vanilla gdb, you need to run the command info registers which will display the state of all the registers:
As before, in order to delete a breakpoint, you can list the available breakpoints using (gdb) info breakpoints (don't forget about GDB's autocomplete, you don't always need to type out every command!) which will display all breakpoints:
Num Type Disp Enb Address What\n1 breakpoint keep y 0x0804852f <main>\n3 breakpoint keep y 0x0804864d <__libc_csu_init+61>\n
Then simply execute (gdb) delete 1
Note
GDB creates breakpoints chronologically and does NOT reuse numbers.
What good is a debugger if you can't control where you are going? To begin execution of a program, use the command r [arguments], just as you would run ./program [arguments] from the shell. If no breakpoints are set, the program runs normally. If breakpoints are set, execution stops at the first one that is hit.
(gdb) continue [# of breakpoints]: Resumes the execution of the program until it finishes or until another breakpoint is hit (shorthand c)
(gdb) step [# of instructions]: Steps into an instruction the specified number of times, default is 1 (shorthand s)
(gdb) nexti [# of instructions]: Steps over an instruction, meaning it will not delve into called functions (shorthand ni)
(gdb) finish: Finishes a function and breaks after it gets returned (shorthand fin)
Examining data in GDB is also very useful for seeing how the program is affecting data. The notation may seem complex at first, but it is flexible and provides powerful functionality.
If the program happens to be an accept-and-fork server, gdb will have issues following the child or parent processes. In order to specify which process gdb should follow after a fork, you can use the command set follow-fork-mode [parent/child]
Another useful feature of GDB is to attach to processes which are already running. Simply launch gdb using gdb, then find the process id of the program you would like to attach to and execute attach [pid].
Websites all around the world are programmed using various programming languages. While there are specific vulnerabilities in each programming language that the developer should be aware of, there are issues fundamental to the internet that can show up regardless of the chosen language or framework.
These vulnerabilities often show up in CTFs as web security challenges where the user needs to exploit a bug to gain some kind of higher level privilege.
Command Injection is a vulnerability that allows an attacker to submit system commands to a computer running a website. This happens when the application fails to encode user input that goes into a system shell. It is very common to see this vulnerability when a developer uses the system() command or its equivalent in the programming language of the application.
Because of the additional semicolon, the os.system() function is instructed to run two commands.
It looks to the program as:
ping ; ls\n
Note
The semicolon terminates a command in bash and allows you to put another command after it.
Because the ping command is being terminated and the ls command is being added on, the ls command will be run in addition to the empty ping command!
This is the core concept behind command injection. The ls command could of course be switched with another command (e.g. wget, curl, bash, etc.)
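As a rough sketch of the idea, the Python below builds the command string the way a vulnerable application might, then shows two common ways to avoid the problem. The function names are hypothetical, and nothing is actually executed; the code only constructs the strings and argument lists that would be handed to the system.

```python
import shlex


def build_ping_command(user_input):
    # Vulnerable pattern: user input is concatenated straight into a
    # string that would later be passed to a shell (e.g. os.system()).
    return "ping " + user_input


def build_quoted_ping_command(user_input):
    # shlex.quote() wraps the input so a POSIX shell treats it as one
    # literal argument instead of interpreting ';' as a separator.
    return "ping " + shlex.quote(user_input)


def build_ping_argv(user_input):
    # Passing a list to subprocess.run() (without shell=True) avoids the
    # shell entirely; the input is only ever a single argument to ping.
    return ["ping", user_input]


print(build_ping_command("; ls"))         # ping ; ls
print(build_quoted_ping_command("; ls"))  # ping '; ls'
print(build_ping_argv("; ls"))            # ['ping', '; ls']
```

In the first form the shell sees two commands; in the other two, the semicolon is just part of a (nonsensical) hostname.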
Command injection is a very common means of privilege escalation within web applications and applications that interface with system commands. Many kinds of home routers take user input and directly append it to a system command. For this reason, many of those home router models are vulnerable to command injection.
A Cross Site Request Forgery or CSRF Attack, pronounced see surf, is an attack on an authenticated user which uses the user's existing session to perform state-changing actions like a purchase, a transfer of funds, or a change of email address.
The entire premise of CSRF is based on session hijacking, usually by injecting malicious elements within a webpage through an <img> tag or an <iframe> where references to external resources are unverified.
GET requests are often used by websites to get user input. Say a user signs in to a banking site which assigns their browser a cookie which keeps them logged in. If they transfer some money, the URL that is sent to the server might have the pattern:
Knowing this format, an attacker can send an email with a hyperlink to be clicked on or they can include an image tag of 0 by 0 pixels which will automatically be requested by the browser such as:
"}, {"location": "web-exploitation/cross-site-scripting/what-is-cross-site-scripting/", "title": "Cross Site Scripting (XSS)", "text": "
Cross Site Scripting or XSS is a vulnerability where one user of an application can send JavaScript that is executed by the browser of another user of the same application.
This is a vulnerability because JavaScript has a high degree of control over a user's web browser.
For example JavaScript has the ability to:
Modify the page (called the DOM)
Send more HTTP requests
Access cookies
By combining all of these abilities, XSS can maliciously use JavaScript to extract a user's cookies and send them to an attacker controlled server. XSS can also modify the DOM to phish users for their passwords. This only scratches the surface of what XSS can be used to do.
XSS is typically broken down into three categories:
You can see the XSS exploit provided in the data GET parameter. If the application is vulnerable to reflected XSS, the application will take this data parameter value and inject it into the DOM.
Depending on where the exploit gets injected, it may need to be constructed differently.
Also, the exploit payload can change to fit whatever the attacker needs it to do, whether that is to extract cookies and submit them to an external server, or to simply modify the page to deface it.
One of the deficiencies of reflected XSS however is that it requires the victim to access the vulnerable page from an attacker controlled resource. Notice that if the data parameter wasn't provided, the exploit wouldn't work.
In many situations, reflected XSS is detected by the browser because it is very simple for a browser to detect malicious XSS payloads in URLs.
Stored XSS is different from reflected XSS in one key way. In reflected XSS, the exploit is provided through a GET parameter. But in stored XSS, the exploit is provided from the website itself.
Imagine a website that allows users to post comments. If a user can submit an XSS payload as a comment, and then have others view that malicious comment, it would be an example of stored XSS.
This is because the website itself is serving up the XSS payload to other users. This makes it very difficult to detect from the browser's perspective and no browser is capable of generically preventing stored XSS from exploiting a user.
DOM XSS is XSS that is due to the browser itself injecting an XSS payload into the DOM. While the server itself may properly prevent XSS, it's possible that the client side scripts may accidentally take a payload and insert it into the DOM and cause the payload to trigger.
The server itself is not to blame, but the client side JavaScript files are causing the issue.
Here the user is submitting ../../../../../../../../etc/passwd.
This will result in the PHP interpreter leaving the directory that it is coded to look in ('/var/www/html') and instead be forced up to the root folder.
Ultimately this will become /etc/passwd because the computer will not go a directory above its top directory.
Thus the application will load the /etc/passwd file and emit it to the user like so:
root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\nbin:x:2:2:bin:/bin:/usr/sbin/nologin\nsys:x:3:3:sys:/dev:/usr/sbin/nologin\nsync:x:4:65534:sync:/bin:/bin/sync\ngames:x:5:60:games:/usr/games:/usr/sbin/nologin\nman:x:6:12:man:/var/cache/man:/usr/sbin/nologin\nlp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin\nmail:x:8:8:mail:/var/mail:/usr/sbin/nologin\nnews:x:9:9:news:/var/spool/news:/usr/sbin/nologin\nuucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin\nproxy:x:13:13:proxy:/bin:/usr/sbin/nologin\nwww-data:x:33:33:www-data:/var/www:/usr/sbin/nologin\nbackup:x:34:34:backup:/var/backups:/usr/sbin/nologin\nlist:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin\nirc:x:39:39:ircd:/var/run/ircd:/usr/sbin/nologin\ngnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/usr/sbin/nologin\nnobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin\nsystemd-timesync:x:100:102:systemd Time Synchronization,,,:/run/systemd:/bin/false\nsystemd-network:x:101:103:systemd Network Management,,,:/run/systemd/netif:/bin/false\nsystemd-resolve:x:102:104:systemd Resolver,,,:/run/systemd/resolve:/bin/false\nsystemd-bus-proxy:x:103:105:systemd Bus Proxy,,,:/run/systemd:/bin/false\n_apt:x:104:65534::/nonexistent:/bin/false\n
This same concept can be applied to applications where some input is taken from a user and then used to access a file or path or similar. This vulnerability very often can be used to leak sensitive data or extract application source code to find other vulnerabilities.
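The path resolution described above can be sketched in Python. This is only an illustration of how the ../ sequences collapse, not the PHP application's actual code; BASE_DIR matches the directory named in the text, while the helper names resolve and is_inside_base are hypothetical.

```python
import posixpath

BASE_DIR = "/var/www/html"


def resolve(user_path):
    # Join the user-supplied path onto the base directory, then collapse
    # every "../" the same way the operating system would.
    return posixpath.normpath(posixpath.join(BASE_DIR, user_path))


def is_inside_base(resolved_path):
    # One possible check: the resolved path must still live under BASE_DIR.
    return posixpath.commonpath([BASE_DIR, resolved_path]) == BASE_DIR


traversal = resolve("../../../../../../../../etc/passwd")
print(traversal)                                    # /etc/passwd
print(is_inside_base(traversal))                    # False
print(is_inside_base(resolve("pages/about.html")))  # True
```

Extra ../ components beyond the root are simply dropped, which is why attackers can pad the payload with more of them than strictly needed.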
PHP is one of the most used languages for back-end web development and therefore it has become a target for hackers. PHP makes it painful to write secure code in many cases, making it every hacker's dream target.
PHP is a C-like language which uses tags enclosed by <?php ... ?> (sometimes just <? ... ?>). It is inlined into HTML. A word of advice is to keep the PHP docs open because function names are strange: the length of a function name was used as the key in PHP's internal dictionary, so function names were shortened/lengthened to make the lookup faster. Other things include:
<?php\n if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['email']) && isset($_POST['password'])) {\n $db = new mysqli('127.0.0.1', 'cs3284', 'cs3284', 'logmein');\n $email = $_POST['email'];\n $password = sha1($_POST['password']);\n $res = $db->query(\"SELECT * FROM users WHERE email = '$email' AND password = '$password'\");\n if ($row = $res->fetch_assoc()) {\n $_SESSION['id'] = $row['id'];\n header('Location: index.php');\n die();\n }\n }\n?>\n<html>...\n
This example PHP simply checks the POST data for an email and password. If the hashed password matches the one stored in the database for that email, the user is logged in and redirected to the index page.
The line email = '$email' uses automatic string interpolation in order to convert $email into a string to compare with the database.
PHP will do just about anything to make a loose comparison (==) match, which means things can be 'equal' (==) or really equal (===). The implicit conversion between strings and integers is the root cause of a lot of issues in PHP.
PHP has multiple ways to include other source files such as require, require_once and include. These can take a dynamic string such as require $_GET['page'] . \".php\"; which is usually seen in templating.
PHP has its own URL scheme: php://... and its main purpose is to filter output automatically. It can automatically remove certain HTML tags and can base64 encode as well.
Server Side Request Forgery or SSRF is where an attacker is able to cause a web application to send a request that the attacker defines.
For example, say there is a website that lets you take a screenshot of any site on the internet.
Under normal usage a user might ask it to take a screenshot of a page like Google, or The New York Times. But what if a user does something more nefarious? What if they asked the site to take a picture of http://localhost ? Or perhaps tries to access something more useful like http://localhost/server-status ?
Note
127.0.0.1 (also known as localhost or loopback) represents the computer itself. Accessing localhost means you are accessing the computer's own internal network. Developers often use localhost as a way to access the services they have running on their own computers.
Depending on what the response from the site is the attacker may be able to gain additional information about what's running on the computer itself.
In addition, the requests originating from the server would come from the server's IP, not the attacker's IP. Because of that, it is possible that the attacker might be able to access internal resources that they wouldn't normally be able to access.
Another usage for SSRF is to create a simple port scanner to scan the internal network looking for internal services.
SQL Injection is a vulnerability where an application takes input from a user and doesn't validate that the user's input doesn't contain additional SQL.
<?php\n $username = $_GET['username']; // kchung\n $result = mysql_query(\"SELECT * FROM users WHERE username='$username'\");\n?>\n
If we look at the $username variable, under normal operation we might expect the username parameter to be a real username (e.g. kchung).
But a malicious user might submit a different kind of data. For example, consider if the input was '?
The application would crash because the resulting SQL query is incorrect.
SELECT * FROM users WHERE username='''\n
Note
Notice the extra single quote at the end.
With the knowledge that a single quote will cause an error in the application we can expand a little more on SQL Injection.
What if our input was ' OR 1=1?
SELECT * FROM users WHERE username='' OR 1=1\n
1 is indeed equal to 1. This equates to true in SQL. If we reinterpret this the SQL statement is really saying
SELECT * FROM users WHERE username='' OR true\n
This will return every row in the table because the WHERE condition now evaluates to true for every row that exists.
We can also inject comments and termination characters like -- or /* or ;. This allows you to terminate SQL queries after your injected statements. For example '-- is a common SQL injection payload.
SELECT * FROM users WHERE username=''-- '\n
This payload sets the username parameter to an empty string to break out of the query and then adds a comment (--) that effectively hides the second single quote.
Using this technique of adding SQL statements to an existing query we can force databases to return data that they were not meant to return.
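To see the effect end to end, here is a small self-contained sketch using Python's built-in sqlite3 module with an in-memory database. It is not the PHP/MySQL code from above; the table contents and function names are made up for illustration. The first function builds the query by string formatting (vulnerable), the second uses a parameterized query.

```python
import sqlite3

# In-memory database with two made-up users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("kchung", "hunter2"), ("admin", "s3cret")],
)


def find_user_vulnerable(username):
    # User input is pasted directly into the SQL text.
    query = "SELECT username FROM users WHERE username='%s'" % username
    return conn.execute(query).fetchall()


def find_user_parameterized(username):
    # The driver passes the value separately, so it is never parsed as SQL.
    return conn.execute(
        "SELECT username FROM users WHERE username=?", (username,)
    ).fetchall()


payload = "' OR 1=1 -- "
print(find_user_vulnerable("kchung"))    # [('kchung',)]
print(len(find_user_vulnerable(payload)))  # 2, every row is returned
print(find_user_parameterized(payload))  # []
```

The trailing comment marker in the payload hides the closing quote that the application appends, which keeps the injected query syntactically valid.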
"}]}
\ No newline at end of file
+{"config": {"lang": ["en"], "separator": "[\\s\\-]+", "pipeline": ["stopWordFilter"]}, "docs": [{"location": "", "title": "Capture The Flag 101", "text": ""}, {"location": "#overview", "title": "Overview", "text": "
Capture the Flags, or CTFs, are computer security competitions.
Teams of competitors (or just individuals) are pitted against each other in various challenges across multiple security disciplines, competing to earn the most points.
CTFs are often the beginning of one's cyber security career due to their team building nature and competitive aspect. In addition, there isn't a lot of commitment required beyond a weekend.
Info
For information about ongoing CTFs, check out CTFTime.
In this handbook you'll learn the basics\u2122 behind the methodologies and techniques needed to succeed in Capture the Flag competitions.
Address Space Layout Randomization (or ASLR) is the randomization of the place in memory where the program, shared libraries, the stack, and the heap are. This can make it harder for an attacker to exploit a service, as knowledge about where the stack, heap, or libc is located can't be re-used between program launches. This is a partially effective way of preventing an attacker from jumping to, for example, libc without a leak.
Typically, only the stack, heap, and shared libraries are ASLR enabled. It is still somewhat rare for the main program to have ASLR enabled, though it is being seen more frequently and is slowly becoming the default.
The simplest and most common buffer overflow is one where the buffer is on the stack. Let's look at an example.
#include <stdio.h>\n#include <unistd.h>\n\nint main() {\n int secret = 0xdeadbeef;\n char name[100] = {0};\n read(0, name, 0x100);\n if (secret == 0x1337) {\n puts(\"Wow! Here's a secret.\");\n } else {\n puts(\"I guess you're not cool enough to see my secret\");\n }\n}\n
There's a tiny mistake in this program which will allow us to see the secret. name is decimal 100 bytes, however we're reading in hex 100 bytes (=256 decimal bytes)! Let's see how we can use this to our advantage.
If the compiler chose to lay out the stack like this:
The least significant byte of secret has been overwritten! If we follow the next 3 bytes to be read in, we'll see the entirety of secret is \"clobbered\" with our 'A's.
The remaining 152 bytes would continue clobbering values up the stack.
"}, {"location": "binary-exploitation/buffer-overflow/#passing-an-impossible-check", "title": "Passing an impossible check", "text": "
How can we use this to pass the seemingly impossible check in the original program? Well, if we carefully line up our input so that the bytes that overwrite secret happen to be the bytes that represent 0x1337 in little-endian, we'll see the secret message.
A small Python one-liner will work nicely: python -c \"print 'A'*100 + '\\x37\\x13\\x00\\x00'\"
This will fill the name buffer with 100 'A's, then overwrite secret with the 32-bit little-endian encoding of 0x1337.
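For readers on Python 3, a hedged equivalent is sketched below. It uses struct.pack to produce the little-endian bytes rather than typing them by hand, and assumes (as the text does) that secret sits immediately after the 100-byte name buffer; a real binary's layout should be confirmed in a debugger.

```python
import struct

BUFFER_SIZE = 100  # size of the name buffer in the example program

# "<I" means little-endian, unsigned 32-bit integer.
secret_bytes = struct.pack("<I", 0x1337)

# 100 filler bytes to reach secret, then the value we want it to hold.
payload = b"A" * BUFFER_SIZE + secret_bytes

print(secret_bytes)  # b'7\x13\x00\x00'  ('7' is the byte 0x37)
print(len(payload))  # 104
```

Letting struct.pack handle the byte order avoids transcription mistakes when writing the escaped bytes manually.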
"}, {"location": "binary-exploitation/buffer-overflow/#going-one-step-further", "title": "Going one step further", "text": "
As discussed on the stack page, the instruction that the current function should jump to when it is done is also saved on the stack (denoted as \"Saved EIP\" in the above stack diagrams). If we can overwrite this, we can control where the program jumps after main finishes running, giving us the ability to control what the program does entirely.
Usually, the end objective in binary exploitation is to get a shell (often called \"popping a shell\") on the remote computer. The shell provides us with an easy way to run anything we want on the target computer.
Say there happens to be a nice function that does this defined somewhere else in the program that we normally can't get to:
void give_shell() {\n system(\"/bin/sh\");\n}\n
Well with our buffer overflow knowledge, now we can! All we have to do is overwrite the saved EIP on the stack to the address where give_shell is. Then, when main returns, it will pop that address off of the stack and jump to it, running give_shell, and giving us our shell.
Assuming give_shell is at 0x08048fd0, we could use something like this: python -c \"print 'A'*108 + '\\xd0\\x8f\\x04\\x08'\"
We send 108 'A's to overwrite the 100 bytes that is allocated for name, the 4 bytes for secret, and the 4 bytes for the saved EBP. Then we simply send the little-endian form of give_shell's address, and we would get a shell!
This idea is extended in Return Oriented Programming.
Much like a stack buffer overflow, a heap overflow is a vulnerability where more data than can fit in the allocated buffer is read in. This could lead to heap metadata corruption, or corruption of other heap objects, which could in turn provide new attack surface.
"}, {"location": "binary-exploitation/heap-exploitation/#use-after-free-uaf", "title": "Use After Free (UAF)", "text": "
Once free is called on an allocation, the allocator is free to re-allocate that chunk of memory in future calls to malloc if it so chooses. However if the program author isn't careful and uses the freed object later on, the contents may be corrupt (or even attacker controlled). This is called a use after free or UAF.
In this example, we have a string structure with a length and a pointer to the actual string data. We properly allocate, fill, and then free an instance of this structure. Then we make another allocation, fill it, and then improperly reference the freed string. Due to how glibc's allocator works, s2 will actually get the same memory as the original s allocation, which in turn gives us the ability to control the s->data pointer. This could be used to leak program data.
Not only can the heap be exploited by the data in allocations, but exploits can also use the underlying mechanisms in malloc, free, etc. to exploit a program. This is beyond the scope of CTF 101, but here are a few recommended resources:
sploitFUN's glibc overview
Shellphish's how2heap
"}, {"location": "binary-exploitation/no-execute/", "title": "No eXecute (NX Bit)", "text": "
The No eXecute or the NX bit (also known as Data Execution Prevention or DEP) marks certain areas of the program as not executable, meaning that stored input or data cannot be executed as code. This is significant because it prevents attackers from being able to jump to custom shellcode that they've stored on the stack or in a global variable.
Binaries, or executables, are machine code for a computer to execute. For the most part, the binaries that you will face in CTFs are Linux ELF files or the occasional Windows executable. Binary Exploitation is a broad topic within Cyber Security which really comes down to finding a vulnerability in the program and exploiting it to gain control of a shell or modify the program's functions.
Common topics addressed by Binary Exploitation or 'pwn' challenges include:
Partial RELRO is the default setting in GCC, and nearly all binaries you will see have at least partial RELRO.
From an attacker's point of view, partial RELRO makes almost no difference, other than it forces the GOT to come before the BSS in memory, eliminating the risk of a buffer overflow on a global variable overwriting GOT entries.
Full RELRO makes the entire GOT read-only which removes the ability to perform a \"GOT overwrite\" attack, where the GOT address of a function is overwritten with the location of another function or a ROP gadget an attacker wants to run.
Full RELRO is not a default compiler setting as it can greatly increase program startup time since all symbols must be resolved before the program is started. In large programs with thousands of symbols that need to be linked, this could cause a noticeable delay in startup time.
Return Oriented Programming (or ROP) is the idea of chaining together small snippets of assembly with stack control to cause the program to do more complex things.
As we saw in buffer overflows, having stack control can be very powerful since it allows us to overwrite saved instruction pointers, giving us control over what the program does next. Most programs don't have a convenient give_shell function however, so we need to find a way to manually invoke system or another exec function to get us our shell.
Imagine we have a program similar to the following:
#include <stdio.h>\n#include <stdlib.h>\n#include <unistd.h>\n\nchar name[32];\n\nint main() {\n printf(\"What's your name? \");\n read(0, name, 32);\n\n printf(\"Hi %s\\n\", name);\n\n printf(\"The time is currently \");\n system(\"/bin/date\");\n\n char echo[100];\n printf(\"What do you want me to echo back? \");\n read(0, echo, 1000);\n puts(echo);\n\n return 0;\n}\n
We obviously have a stack buffer overflow on the echo variable which can give us EIP control when main returns. But we don't have a give_shell function! So what can we do?
We can call system with an argument we control! Since arguments are passed in on the stack in 32-bit Linux programs (see calling conventions), if we have stack control, we have argument control.
When main returns, we want our stack to look like something had normally called system. Recall what is on the stack after a function has been called:
This is a good start, but we need to pass an argument to system for anything to happen. As mentioned in the page on ASLR, the stack and dynamic libraries \"move around\" each time a program is run, which means we can't easily use data on the stack or a string in libc for our argument. In this case however, we have a very convenient name global which will be at a known location in the binary (in the BSS segment).
"}, {"location": "binary-exploitation/return-oriented-programming/#putting-it-together", "title": "Putting it together", "text": "
Our exploit will need to do the following:
Enter \"sh\" or another command to run as name
Fill the stack with
Garbage up to the saved EIP
The address of system's PLT entry
A fake return address for system to jump to when it's done
The address of the name global to act as the first argument to system
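The steps above can be sketched in Python as follows. Every address and the offset below are hypothetical placeholders chosen for illustration; in practice they would be read out of the target binary with a disassembler or debugger. The helper p32 is defined locally and simply packs a 32-bit little-endian value.

```python
import struct


def p32(value):
    # Pack an unsigned 32-bit value in little-endian byte order.
    return struct.pack("<I", value)


# All of these values are assumptions for the sake of the example.
OFFSET_TO_SAVED_EIP = 112  # bytes of garbage before the saved EIP
SYSTEM_PLT = 0x08048400    # address of system's PLT entry
FAKE_RETURN = 0xDEADBEEF   # where system "returns" to; we don't care
NAME_ADDR = 0x0804A060     # address of the name global in the BSS

# First input: the command we want system() to run, stored in name.
name_input = b"sh\x00"

# Second input: overflow echo and lay out a fake call to system(name).
echo_input = (
    b"A" * OFFSET_TO_SAVED_EIP
    + p32(SYSTEM_PLT)   # overwrites the saved EIP
    + p32(FAKE_RETURN)  # return address seen by system
    + p32(NAME_ADDR)    # first argument to system
)

print(len(echo_input))  # 124
```

The order of the last three values mirrors what the stack looks like right after a normal call: return address first, then the arguments.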
In 64-bit binaries we have to work a bit harder to pass arguments to functions. The basic idea of overwriting the saved RIP is the same, but as discussed in calling conventions, arguments are passed in registers in 64-bit programs. In the case of running system, this means we will need to find a way to control the RDI register.
To do this, we'll use small snippets of assembly in the binary, called \"gadgets.\" These gadgets usually pop one or more registers off of the stack, and then execute ret, which allows us to chain them together by making a large fake call stack.
For example, if we needed control of both RDI and RSI, we might find two gadgets in our program that look like this (using a tool like rp++ or ROPgadget):
0x400c01: pop rdi; ret\n0x400c03: pop rsi; pop r15; ret\n
We can set up a fake call stack with these gadgets to sequentially execute them, popping values we control into registers, and then end with a jump to system.
0xffff0028: 0x400d00 // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled\n 0xffff0020: 0x1337beef // value we want in r15 (probably garbage)\n 0xffff0018: 0x1337beef // value we want in rsi\n 0xffff0010: 0x400c03 // address that the rdi gadget's ret will return to - the pop rsi gadget\n 0xffff0008: 0xdeadbeef // value to be popped into rdi\nRSP -> 0xffff0000: 0x400c01 // address of rdi gadget\n
Stepping through this one instruction at a time, main returns, jumping to our pop rdi gadget:
RIP = 0x400c01 (pop rdi)\nRDI = UNKNOWN\nRSI = UNKNOWN\n\n 0xffff0028: 0x400d00 // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled\n 0xffff0020: 0x1337beef // value we want in r15 (probably garbage)\n 0xffff0018: 0x1337beef // value we want in rsi\n 0xffff0010: 0x400c03 // address that the rdi gadget's ret will return to - the pop rsi gadget\nRSP -> 0xffff0008: 0xdeadbeef // value to be popped into rdi\n
pop rdi is then executed, popping the top of the stack into RDI:
RIP = 0x400c02 (ret)\nRDI = 0xdeadbeef\nRSI = UNKNOWN\n\n 0xffff0028: 0x400d00 // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled\n 0xffff0020: 0x1337beef // value we want in r15 (probably garbage)\n 0xffff0018: 0x1337beef // value we want in rsi\nRSP -> 0xffff0010: 0x400c03 // address that the rdi gadget's ret will return to - the pop rsi gadget\n
The RDI gadget then rets into our RSI gadget:
RIP = 0x400c03 (pop rsi)\nRDI = 0xdeadbeef\nRSI = UNKNOWN\n\n 0xffff0028: 0x400d00 // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled\n 0xffff0020: 0x1337beef // value we want in r15 (probably garbage)\nRSP -> 0xffff0018: 0x1337beef // value we want in rsi\n
RSI and R15 are popped:
RIP = 0x400c06 (ret)\nRDI = 0xdeadbeef\nRSI = 0x1337beef\n\nRSP -> 0xffff0028: 0x400d00 // where we want the rsi gadget's ret to jump to now that rdi and rsi are controlled\n
And finally, the RSI gadget rets, jumping to whatever function we want, but now with RDI and RSI set to values we control.
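The walkthrough above can be replayed with a toy simulator. This is not an emulator, only a sketch that models each gadget as "pop these registers, then ret", using the gadget addresses and stack values from the example; the function name run_chain is hypothetical.

```python
def run_chain(stack):
    """Replay a ROP chain; stack[0] is where RSP points when main returns."""
    # Each gadget is modeled as the list of registers it pops before its ret.
    gadgets = {
        0x400C01: ["rdi"],         # pop rdi; ret
        0x400C03: ["rsi", "r15"],  # pop rsi; pop r15; ret
    }
    registers = {"rdi": None, "rsi": None, "r15": None}
    stack = list(stack)

    rip = stack.pop(0)  # main's ret pops the first address into RIP
    while rip in gadgets:
        for register in gadgets[rip]:
            registers[register] = stack.pop(0)  # each pop consumes one slot
        rip = stack.pop(0)  # the gadget's ret pops the next address
    return rip, registers


# Fake call stack from the example, lowest address first.
fake_stack = [0x400C01, 0xDEADBEEF, 0x400C03, 0x1337BEEF, 0x1337BEEF, 0x400D00]
final_rip, registers = run_chain(fake_stack)

print(hex(final_rip))         # 0x400d00
print(hex(registers["rdi"]))  # 0xdeadbeef
print(hex(registers["rsi"]))  # 0x1337beef
```

Execution stops at 0x400d00 because it is not a known gadget, which corresponds to jumping into the function we wanted to call.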
Stack Canaries are a secret value placed on the stack which changes every time the program is started. Prior to a function return, the stack canary is checked and if it appears to be modified, the program exits immediately.
Stack Canaries seem like a clear cut way to mitigate any stack smashing as it is nearly impossible to just guess a random 64-bit value. However, leaking the canary and bruteforcing the canary are two methods which would allow us to get through the canary check.
If we can read the data in the stack canary, we can send it back to the program later because the canary stays the same throughout execution. However Linux makes this slightly tricky by making the first byte of the stack canary a NULL, meaning that string functions will stop when they hit it. A method around this would be to partially overwrite and then put the NULL back or find a way to leak bytes at an arbitrary stack offset.
A few situations where you might be able to leak a canary:
User-controlled format string
User-controlled length of an output
\u201cHey, can you send me 1000000 bytes? thx!\u201d
"}, {"location": "binary-exploitation/stack-canaries/#bruteforcing-a-stack-canary", "title": "Bruteforcing a Stack Canary", "text": "
The canary is determined when the program starts up for the first time which means that if the program forks, it keeps the same stack cookie in the child process. This means that if the input that can overwrite the canary is sent to the child, we can use whether it crashes as an oracle and brute-force 1 byte at a time!
This method can be used on fork-and-accept servers where connections are spun off to child processes, but only under certain conditions such as when the input accepted by the program does not append a NULL byte (read or recv).
Buffer (N Bytes) ?? ?? ?? ?? ?? ?? ?? ?? RBP RIP
Fill the buffer N Bytes + 0x00 results in no crash
Buffer (N Bytes) 00 ?? ?? ?? ?? ?? ?? ?? RBP RIP
Fill the buffer N Bytes + 0x00 + 0x00 results in a crash
N Bytes + 0x00 + 0x01 results in a crash
N Bytes + 0x00 + 0x02 results in a crash
...
N Bytes + 0x00 + 0x51 results in no crash
Buffer (N Bytes) 00 51 ?? ?? ?? ?? ?? ?? RBP RIP
Repeat this bruteforcing process for 6 more bytes...
Buffer (N Bytes) 00 51 FE 0A 31 D2 7B 3C RBP RIP
Now that we have the stack cookie, we can overwrite the saved RIP on the stack and take control of the program!
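The byte-at-a-time search can be sketched as below. The oracle here is a local stand-in that compares against the example canary from the diagram; against a real forking server it would instead send N filler bytes plus the guessed canary prefix and report whether the child crashed. All names are hypothetical.

```python
# Example canary from the diagram above, used only by the fake oracle.
SECRET_CANARY = bytes([0x00, 0x51, 0xFE, 0x0A, 0x31, 0xD2, 0x7B, 0x3C])


def child_survives(guess):
    # Stand-in for "send the overflow and see whether the child crashed".
    # A partial overwrite survives only if every guessed byte is correct.
    return SECRET_CANARY.startswith(guess)


def brute_force_canary(oracle, length=8):
    known = b""
    for _ in range(length):
        for candidate in range(256):
            attempt = known + bytes([candidate])
            if oracle(attempt):
                known = attempt  # this byte is correct, move to the next
                break
        else:
            raise RuntimeError("no byte value survived; oracle may be unreliable")
    return known


recovered = brute_force_canary(child_survives)
print(recovered.hex())  # 0051fe0a31d27b3c
```

Guessing one byte at a time needs at most 8 * 256 = 2048 attempts, compared with 2^64 for guessing the whole value at once.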
A buffer is any allocated space in memory where data (often user input) can be stored. For example, in the following C program name would be considered a stack buffer:
Given that buffers commonly hold user input, mistakes when writing to them could result in attacker controlled data being written outside of the buffer's space. See the page on buffer overflows for more.
To be able to call functions, there needs to be an agreed-upon way to pass arguments. If a program is entirely self-contained in a binary, the compiler would be free to decide the calling convention. However in reality, shared libraries are used so that common code (e.g. libc) can be stored once and dynamically linked in to programs that need it, reducing program size.
In Linux binaries, there are really only two commonly used calling conventions: cdecl for 32-bit binaries, and SysV for 64-bit binaries.
Any method of passing arguments could be used as long as the compiler is aware of what the convention is. As a result, there have been many calling conventions in the past that aren't used frequently anymore. See Wikipedia for a comprehensive list.
A register is a location within the processor that is able to store data, much like RAM. Unlike RAM however, accesses to registers are effectively instantaneous, whereas reads from main memory can take hundreds of CPU cycles to return.
Registers can hold any value: addresses (pointers), results from mathematical operations, characters, etc. Some registers are reserved however, meaning they have a special purpose and are not \"general purpose registers\" (GPRs). On x86, the only 2 reserved registers are rip and rsp which hold the address of the next instruction to execute and the address of the stack respectively.
On x86, the same register can have different sized accesses for backwards compatibility. For example, the rax register is the full 64-bit register, eax is the low 32 bits of rax, ax is the low 16 bits, al is the low 8 bits, and ah is the high 8 bits of ax (bits 8-15 of rax).
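The relationship between these sub-registers is just masking and shifting. A minimal sketch, using an arbitrary example value for rax:

```python
rax = 0x1122334455667788  # arbitrary example value

eax = rax & 0xFFFFFFFF    # low 32 bits
ax = rax & 0xFFFF         # low 16 bits
al = rax & 0xFF           # low 8 bits
ah = (rax >> 8) & 0xFF    # bits 8-15, the high byte of ax

print(hex(eax), hex(ax), hex(al), hex(ah))
# prints: 0x55667788 0x7788 0x88 0x77
```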
A format string vulnerability is a bug where user input is passed as the format argument to printf, scanf, or another function in that family.
The format argument has many different specifiers which could allow an attacker to leak data if they control the format argument to printf. Since printf and similar are variadic functions, they will continue popping data off of the stack according to the format.
For example, if we can make the format argument \"%x.%x.%x.%x\", printf will pop off four stack values and print them in hexadecimal, potentially leaking sensitive information.
printf can also index to an arbitrary \"argument\" with the following syntax: \"%n$x\" (where n is the decimal index of the argument you want).
While these bugs are powerful, they're very rare nowadays, as all modern compilers warn when printf is called with a non-constant string.
#include <stdio.h>\n#include <unistd.h>\n\nint main() {\n int secret_num = 0x8badf00d;\n\n char name[64] = {0};\n read(0, name, 64);\n printf(\"Hello \");\n printf(name);\n printf(\"! You'll never get my secret!\\n\");\n return 0;\n}\n
Due to how GCC decided to lay out the stack, secret_num is actually at a lower address on the stack than name, so we only have to go to the 7th \"argument\" in printf to leak the secret:
$ ./fmt_string\n%7$llx\nHello 8badf00d3ea43eef\n! You'll never get my secret!\n
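The leaked value is a full 64-bit stack slot, so the 32-bit secret still has to be picked out of it. A short sketch, assuming the exact output of the sample run above (the argument index 7 and the leaked hex string are taken from that run and will differ on other builds):

```python
# Build the payload that asks printf for its 7th "argument" as a 64-bit hex value.
index = 7
payload = "%{}$llx".format(index)

# Hex string printed by the sample run above.
leak = int("8badf00d3ea43eef", 16)

# In this run the secret sits in the upper half of the 8-byte slot;
# the lower half is whatever else happened to be on the stack.
secret_num = leak >> 32
neighbor = leak & 0xFFFFFFFF

print(payload, hex(secret_num), hex(neighbor))
# prints: %7$llx 0x8badf00d 0x3ea43eef
```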
Binary Security is the use of tools and methods to secure programs from being manipulated and exploited. These tools are not infallible, but when used together and implemented properly, they can raise the difficulty of exploitation greatly.
The Global Offset Table (or GOT) is a section inside of programs that holds addresses of functions that are dynamically linked. As mentioned in the page on calling conventions, most programs don't include every function they use to reduce binary size. Instead, common functions (like those in libc) are \"linked\" into the program so they can be saved once on disk and reused by every program.
Unless a program is marked full RELRO, the resolution of a function to its address in a dynamic library is done lazily. All dynamic libraries are loaded into memory along with the main program at launch; however, functions are not mapped to their actual code until they're first called. For example, in the following C snippet puts won't be resolved to an address in libc until after it has been called once:
int main() {\n puts(\"Hi there!\");\n puts(\"Ok bye now.\");\n return 0;\n}\n
To avoid searching through shared libraries each time a function is called, the result of the lookup is saved into the GOT so future function calls \"short circuit\" straight to their implementation bypassing the dynamic resolver.
This has two important implications:
The GOT contains pointers to libraries which move around due to ASLR
The GOT is writable
These two facts become very useful in Return Oriented Programming.
Before a function's address has been resolved, the GOT points to an entry in the Procedure Linkage Table (PLT). This is a small \"stub\" function which is responsible for calling the dynamic linker with (effectively) the name of the function that should be resolved.
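Lazy binding can be modelled with a dictionary standing in for the GOT. This is a conceptual toy, not how the dynamic linker is implemented: `LIBC`, `make_plt_stub`, and `call` are hypothetical names invented for the illustration.

```python
# Stand-in for the shared library: name -> real implementation.
LIBC = {"puts": lambda s: "libc puts printed: " + s}

resolver_calls = []  # records every trip through the "dynamic resolver"


def make_plt_stub(name, got):
    """Return a stub that resolves `name` on first use and patches the GOT."""
    def stub(*args):
        resolver_calls.append(name)
        got[name] = LIBC[name]        # save the lookup result in the GOT
        return got[name](*args)       # then run the real function
    return stub


got = {}
got["puts"] = make_plt_stub("puts", got)  # unresolved: GOT points at the stub


def call(name, *args):
    # Every call is an indirect jump through the GOT entry.
    return got[name](*args)


first = call("puts", "Hi there!")     # goes through the resolver
second = call("puts", "Ok bye now.")  # short-circuits straight to libc
resolved = got["puts"] is LIBC["puts"]

# Because the table is writable, overwriting an entry redirects later calls.
got["puts"] = lambda s: "hijacked"
third = call("puts", "anything")
print(resolver_calls, resolved, third)
```

The resolver runs once, after which the GOT entry holds the real address; the final overwrite mirrors why a writable GOT is an attractive target.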
"}, {"location": "binary-exploitation/what-is-the-heap/", "title": "The Heap", "text": "
The heap is a place in memory which a program can use to dynamically create objects. Creating objects on the heap has some advantages compared to using the stack:
Heap allocations can be dynamically sized
Heap allocations \"persist\" when a function returns
There are also some disadvantages however:
Heap allocations can be slower
Heap allocations must be manually cleaned up
"}, {"location": "binary-exploitation/what-is-the-heap/#using-the-heap", "title": "Using the heap", "text": "
In C, there are a number of functions used to interact with the heap, but we're going to focus on the two core ones:
This program reads in a size from the user, creates an allocation of that size on the heap, reads in that many bytes, then prints it back out to the user.
"}, {"location": "binary-exploitation/what-is-the-stack/", "title": "The Stack", "text": "
In computer architecture, the stack is a hardware manifestation of the stack data structure (a Last In, First Out queue).
In x86, the stack is simply an area in RAM that was chosen to be the stack - there is no special hardware to store stack contents. The esp/rsp register holds the address in memory where the bottom of the stack resides. When something is pushed to the stack, esp decrements by 4 (or 8 on 64-bit x86), and the value that was pushed is stored at that location in memory. Likewise, when a pop instruction is executed, the value at esp is retrieved (i.e. esp is dereferenced), and esp is then incremented by 4 (or 8).
N.B. The stack \"grows\" down to lower memory addresses!
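The push and pop behaviour described above can be modelled in a few lines. This is a toy 32-bit model; the class name `Stack32` and the starting address are made up for the example.

```python
class Stack32:
    """Toy model of a 32-bit x86 stack: a dict as RAM plus an esp value."""

    def __init__(self, top: int = 0xFFFFD000):
        self.esp = top       # arbitrary starting address for illustration
        self.memory = {}     # address -> 32-bit value

    def push(self, value: int) -> None:
        self.esp -= 4                              # stack grows down
        self.memory[self.esp] = value & 0xFFFFFFFF

    def pop(self) -> int:
        value = self.memory[self.esp]              # dereference esp
        self.esp += 4
        return value


stack = Stack32()
start = stack.esp
stack.push(0x41414141)
stack.push(0x42424242)
after_pushes = stack.esp
top_value = stack.pop()
print(hex(start), hex(after_pushes), hex(top_value))
```

Two pushes move esp down by 8 bytes, and the first pop returns the most recently pushed value.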
Conventionally, ebp/rbp contains the address of the top of the current stack frame, and so sometimes local variables are referenced as an offset relative to ebp rather than an offset to esp. A stack frame is essentially just the space used on the stack by a given function.
Skipping over the bulk of main, you'll see that at 0x8048452 main's name local is pushed to the stack because it's the first argument to say_hi. Then, a call instruction is executed. call instructions first push the current instruction pointer to the stack, then jump to their destination. So when the processor begins executing say_hi at 0x0804840b, the stack looks like this:
And finally, ret pops the saved instruction pointer into eip which causes the program to return to main with the same esp, ebp, and stack contents as when say_hi was initially called.
Cryptography is the reason we can use banking apps, transmit sensitive information over the web, and in general protect our privacy. However, a large part of CTFs is breaking widely used encryption schemes which are improperly implemented. The math may seem daunting, but more often than not, a simple understanding of the underlying principles will allow you to find flaws and crack the code.
The word \u201ccryptography\u201d technically means the art of writing codes. When it comes to digital forensics, it\u2019s a method you can use to understand how data is constructed for your analysis.
"}, {"location": "cryptography/overview/#what-is-cryptography-used-for", "title": "What is cryptography used for?", "text": "
Uses in everyday software
Securing web traffic (passwords, communication, etc.)
A Block Cipher is an algorithm which is used in conjunction with a cryptosystem in order to package a message into evenly distributed 'blocks' which are encrypted one at a time.
In this case ~i~ represents an index over the number of blocks in the plaintext. F() and g() represent the functions used to convert plaintext into ciphertext.
ECB is the most basic block cipher mode. It simply chunks up the plaintext into blocks, independently encrypts those blocks, and concatenates them into a ciphertext.
Because ECB independently encrypts the blocks, patterns in data can still be seen clearly, as shown in the ECB Penguin image below.
Original Image ECB Image Other Block Cipher Modes"}, {"location": "cryptography/what-are-block-ciphers/#cipher-block-chaining-cbc", "title": "Cipher Block Chaining (CBC)", "text": "
CBC is an improvement upon ECB where an Initialization Vector is used in order to add randomness. The encrypted previous block is used as the IV for each sequential block meaning that the encryption process cannot be parallelized. CBC has been declining in popularity due to a variety of
Note
Even though the encryption process cannot be parallelized, the decryption process can be parallelized. If the wrong IV is used for decryption it will only affect the first block as the decryption of all other blocks depends on the ciphertext not the plaintext.
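The difference between ECB and CBC, and the effect of a wrong IV, can be checked with a toy cipher. The sketch below is not AES: `encrypt_block` and `decrypt_block` are a small 4-round Feistel network built on SHA-256, invented here only so the example runs with the standard library. The key, IV, and plaintext are arbitrary, and the plaintext length is assumed to be a multiple of the block size (no padding).

```python
import hashlib

BLOCK = 16          # block size in bytes
HALF = BLOCK // 2


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _f(key: bytes, rnd: int, half: bytes) -> bytes:
    # Round function of the toy Feistel cipher.
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:HALF]


def encrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:HALF], block[HALF:]
    for rnd in range(4):
        left, right = right, xor(left, _f(key, rnd, right))
    return left + right


def decrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:HALF], block[HALF:]
    for rnd in reversed(range(4)):
        left, right = xor(right, _f(key, rnd, left)), left
    return left + right


def blocks(data: bytes):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Every block is encrypted independently.
    return b"".join(encrypt_block(key, b) for b in blocks(plaintext))


def cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    out, prev = [], iv
    for b in blocks(plaintext):
        prev = encrypt_block(key, xor(b, prev))  # chain on previous ciphertext
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    out, prev = [], iv
    for c in blocks(ciphertext):
        out.append(xor(decrypt_block(key, c), prev))  # needs only ciphertext
        prev = c
    return b"".join(out)


key = b"toy key"
iv = bytes(range(BLOCK))
plaintext = b"ATTACK AT DAWN!!" * 2   # two identical 16-byte blocks

ecb = ecb_encrypt(key, plaintext)
cbc = cbc_encrypt(key, iv, plaintext)
wrong_iv = cbc_decrypt(key, bytes(BLOCK), cbc)

print(ecb[:BLOCK] == ecb[BLOCK:])   # ECB repeats the pattern
print(cbc[:BLOCK] == cbc[BLOCK:])   # CBC hides it
print(wrong_iv[BLOCK:] == plaintext[BLOCK:])  # only block 1 is damaged
```

Each CBC decryption step depends only on two ciphertext blocks, which is why decryption can run in parallel and why a wrong IV corrupts only the first block.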
PCBC is a less used cipher mode which modifies CBC so that decryption is also not parallelizable. It also cannot be decrypted from any point as changes made during the decryption and encryption process \"propagate\" throughout the blocks, meaning that both the plaintext and ciphertext are used when encrypting or decrypting as seen in the images below.
Counter (CTR) mode is also known as counter mode (CM), integer counter mode (ICM), and segmented integer counter (SIC) mode.
CTR mode makes the block cipher similar to a stream cipher and it functions by adding a counter with each block in combination with a nonce and key to XOR the plaintext to produce the ciphertext. Similarly, the decryption process is the exact same except instead of XORing the plaintext, the ciphertext is XORed. This means that the process is parallelizable for both encryption and decryption and you can begin from anywhere as the counter for any block can be deduced easily.
If the nonce chosen is non-random, it is important to concatenate the nonce with the counter (high 64 bits to the nonce, low 64 bits to the counter), as adding or XORing the nonce with the counter would break security: an attacker can cause a collision between nonce and counter values. An attacker who is able to provide a plaintext, nonce and counter can then decrypt a block by using the ciphertext as seen in the decryption image.
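A rough sketch of CTR mode with the nonce and counter concatenated as described above. The block cipher is replaced by a SHA-256 based stand-in (`keystream_block`), which is an assumption made so the example needs no third-party library; a real implementation would encrypt the counter block with a cipher such as AES.

```python
import hashlib

BLOCK = 16  # keystream block size in bytes


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Counter block: high 64 bits are the nonce, low 64 bits the counter.
    counter_block = nonce + counter.to_bytes(8, "big")
    # Toy stand-in for block-cipher encryption of the counter block.
    return hashlib.sha256(key + counter_block).digest()[:BLOCK]


def ctr_xor(key: bytes, nonce: bytes, data: bytes, first_counter: int = 0) -> bytes:
    """Encrypt or decrypt: both are the same XOR with the keystream."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK):
        ks = keystream_block(key, nonce, first_counter + offset // BLOCK)
        chunk = data[offset:offset + BLOCK]
        out.extend(c ^ k for c, k in zip(chunk, ks))
    return bytes(out)


key = b"toy key"
nonce = b"\x00\x01\x02\x03\x04\x05\x06\x07"   # 8 bytes = 64 bits
message = b"counter mode turns a block cipher into a stream cipher"

ciphertext = ctr_xor(key, nonce, message)
roundtrip = ctr_xor(key, nonce, ciphertext)

# Random access: decrypt starting at the third block without touching the rest.
tail = ctr_xor(key, nonce, ciphertext[2 * BLOCK:], first_counter=2)
print(roundtrip == message, tail == message[2 * BLOCK:])
```

Because each keystream block depends only on the key, nonce, and counter, any block can be processed independently and in any order.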
A Padding Oracle Attack sounds complex, but essentially means abusing a block cipher by changing the length of input and being able to determine the plaintext.
Hashing functions are one-way functions which theoretically provide a unique output for every input. MD5, SHA-1, and other hashes which were once considered secure are now known to have collisions: two different pieces of data which produce the same supposedly unique output.
A string hash is a number or string generated using an algorithm that runs on text or data.
The idea is that each hash should be unique to the text or data (although sometimes it isn\u2019t). For example, the hash for \u201cdog\u201d should be different from other hashes.
You can use command line tools or online resources such as this one. Example: $ echo -n password | md5 5f4dcc3b5aa765d61d8327deb882cf99 Here, \u201cpassword\u201d is hashed with different hashing algorithms:
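The same digests can be computed in Python with the standard `hashlib` module. A minimal sketch; the choice of the three algorithms shown is just for illustration:

```python
import hashlib

data = b"password"

# Hash the same input with several algorithms.
digests = {name: hashlib.new(name, data).hexdigest()
           for name in ("md5", "sha1", "sha256")}

for name, digest in digests.items():
    print(name, digest)
# The md5 line matches the command-line example above:
# md5 5f4dcc3b5aa765d61d8327deb882cf99
```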
A file hash is a number or string generated using an algorithm that is run on text or data. The premise is that it should be unique to the text or data. If the file or text changes in any way, the hash will change.
What is it used for? - File and data identification - Password/certificate storage comparison
How can we determine the hash of a file? You can use the md5sum command (or similar).
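Where md5sum is not available, a file can be hashed from Python instead. A small sketch; `file_digest` is a helper name chosen here, and reading in chunks is simply a way to avoid loading large files into memory at once:

```python
import hashlib


def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

For example, `file_digest("evidence.img", "sha256")` would return the SHA-256 of a file named evidence.img (a hypothetical filename).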
A collision is when two pieces of data or text have the same cryptographic hash. This is very rare.
What\u2019s significant about collisions is that they can be used to crack password hashes. Passwords are usually stored as hashes on a computer, since it\u2019s hard to get the passwords from hashes.
If you bruteforce by trying every possible piece of text or data, eventually you\u2019ll find something with the same hash. Enter it, and the computer accepts it as if you entered the actual password.
Two different files on the same hard drive with the same cryptographic hash can be very interesting.
\u201cIt\u2019s now well-known that the cryptographic hash function MD5 has been broken,\u201d said Peter Selinger of Dalhousie University. \u201cIn March 2005, Xiaoyun Wang and Hongbo Yu of Shandong University in China published an article in which they described an algorithm that can find two different sequences of 128 bytes with the same MD5 hash.\u201d
For example, he cited this famous pair:
and
Each of these blocks has MD5 hash 79054025255fb1a26e4bc422aef54eb4.
Selinger said that \u201cthe algorithm of Wang and Yu can be used to create files of arbitrary length that have identical MD5 hashes, and that differ only in 128 bytes somewhere in the middle of the file. Several people have used this technique to create pairs of interesting files with identical MD5 hashes.\u201d
Ben Laurie has a nice website that visualizes this MD5 collision. For a non-technical, though slightly outdated, introduction to hash functions, see Steve Friedl\u2019s Illustrated Guide. And here\u2019s a good article from DFI News that explores the same topic.
A Stream Cipher is used for symmetric key cryptography, where the same key is used to encrypt and decrypt data. Stream Ciphers combine a pseudorandom sequence with the bits of the plaintext, usually with XOR, in order to generate ciphertext. A good way to think about Stream Ciphers is to think of them as generating one-time pads from a given state.
A keystream is a sequence of pseudorandom digits which extends to the length of the plaintext in order to uniquely encrypt each character based on the corresponding digit in the keystream.
"}, {"location": "cryptography/what-are-stream-ciphers/#one-time-pads", "title": "One Time Pads", "text": "
A one time pad is an encryption mechanism whereby the entire plaintext is XOR'd with a random sequence of numbers in order to generate a random ciphertext. The advantage of the one time pad is that it offers an immense amount of security BUT in order for it to be useful, the randomly generated key must be distributed on a separate secure channel, meaning that one time pads have little use in modern day cryptographic applications on the internet. Stream ciphers extend upon this idea by using a key, usually 128 bit in length, in order to seed a pseudorandom keystream which is used to encrypt the text.
A Synchronous Stream Cipher generates a keystream based on internal states not related to the plaintext or ciphertext. This means that the stream is generated pseudorandomly outside of the context of what is being encrypted. A binary additive stream cipher is the term used for a stream cipher which XORs the bits of the keystream with the bits of the plaintext. Encryption and decryption require that the synchronous stream cipher be in the same state, otherwise the message cannot be decrypted.
A Self-synchronizing Stream Cipher, also known as an asynchronous stream cipher or ciphertext autokey (CTAK), is a stream cipher which uses the previous N digits in order to compute the keystream used for the next N characters.
Note
Seems a lot like block ciphers, doesn't it? That's because block cipher feedback mode (CFB) is an example of a self-synchronizing stream cipher.
The key tenet of using stream ciphers securely is to NEVER reuse a key, because XOR is commutative and self-inverse. If P~1~ and P~2~ have both been XOR'd with the same key K to produce C~1~ and C~2~, the key cancels out: C~1~ XOR C~2~ = P~1~ XOR P~2~. Because the English language has low entropy, cryptanalysis tools such as character frequency analysis work well on the XOR of two English plaintexts, and once either plaintext is known, retrieving K is trivial.
Another key tenet of using stream ciphers securely is considering that just because a message has been decrypted, it does not mean the message has not been tampered with. Because decryption is based on state, if an attacker knows the layout of the plaintext, a Man in the Middle (MITM) attacker can flip a bit of the ciphertext in transit, which flips the same bit of the decrypted plaintext. If a ciphertext decrypts to 'Transfer $1000', then a middleman can flip a single bit in order for the ciphertext to decrypt to 'Transfer $9000' because changing a single character in the ciphertext does not affect the state in a synchronous stream cipher.
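The key-reuse problem is easy to verify. In this sketch the two messages and the random keystream are arbitrary examples:

```python
import os


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


p1 = b"attack at dawn"
p2 = b"retreat at ten"
keystream = os.urandom(len(p1))   # the same keystream is reused for both

c1 = xor(p1, keystream)
c2 = xor(p2, keystream)

# The keystream cancels out when the two ciphertexts are XOR'd together.
combined = xor(c1, c2)
print(combined == xor(p1, p2))

# Knowing (or guessing) one plaintext reveals the other, and then the key.
recovered_p2 = xor(combined, p1)
recovered_key = xor(c1, p1)
print(recovered_p2, recovered_key == keystream)
```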
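The 'Transfer $1000' tampering can be reproduced directly. A sketch in which a random keystream stands in for the cipher's output; the attacker only touches the ciphertext and never sees the keystream:

```python
import os


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


plaintext = b"Transfer $1000"
keystream = os.urandom(len(plaintext))   # known only to sender and receiver
ciphertext = xor(plaintext, keystream)

# The attacker knows the message layout, so they know where the digit is.
position = plaintext.index(b"1")
flip = ord("1") ^ ord("9")               # 0x31 ^ 0x39 = 0x08, a single bit

tampered = bytearray(ciphertext)
tampered[position] ^= flip               # modify the ciphertext in transit

decrypted = xor(bytes(tampered), keystream)
print(decrypted)
# prints: b'Transfer $9000'
```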
RSA, which is an abbreviation of the authors' names (Rivest\u2013Shamir\u2013Adleman), is a cryptosystem which allows for asymmetric encryption. Asymmetric cryptosystems are also commonly referred to as Public Key Cryptography, where a public key is used to encrypt data and only a secret, private key can be used to decrypt the data.
The message is represented as m and is converted into a number
The encrypted message or ciphertext is represented by c
p and q are prime numbers which make up n
e is the public exponent
n is the modulus and its length in bits is the bit length (i.e. 1024 bit RSA)
d is the private exponent
The totient \u03bb(n) is used to compute d and is equal to the lcm(p-1, q-1), another definition for \u03bb(n) is that \u03bb(pq) = lcm(\u03bb(p), \u03bb(q))
"}, {"location": "cryptography/what-is-rsa/#what-makes-rsa-viable", "title": "What makes RSA viable?", "text": "
If public n, public e, private d are all very large numbers and a message m holds true for 0 < m < n, then we can say:
(m^e^)^d^ \u2261 m (mod n)
Note
The triple equals sign refers to modular congruence, which in this case means that there exists an integer k such that (m^e^)^d^ = kn + m
RSA is viable because it is incredibly hard to find d even with m, n, and e because factoring large numbers is an arduous process.
We are going to follow along Wikipedia's small numbers example in order to make this idea a bit easier to understand.
Note
In this example we are using Carmichael's totient function where \u03bb(n) = lcm(\u03bb(p), \u03bb(q)), but Euler's totient function is perfectly valid to use with RSA. Euler's totient is \u03c6(n) = (p \u2212 1)(q \u2212 1)
Choose two prime numbers such as:
p = 61 and q = 53
Find n:
n = pq = 3233
Calculate \u03bb(n) = lcm(p-1, q-1)
\u03bb(3233) = lcm(60, 52) = 780
Choose a public exponent e such that 1 < e < \u03bb(n) and e is coprime to \u03bb(n) (they share no common factor other than 1). The standard in most cases is 65537, but we will be using:
e = 17
Calculate d as the modular multiplicative inverse of e modulo \u03bb(n), or in plain English, find d such that: d x e mod \u03bb(n) = 1
d x 17 mod 780 = 1
d = 413
Now we have a public key of (3233, 17) and a private key of (3233, 413)
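The numbers in this walkthrough can be reproduced with Python's built-in big-integer arithmetic. A sketch; the message m = 65 is an illustrative value chosen here, and `pow(e, -1, lam)` requires Python 3.8 or newer:

```python
from math import gcd

p, q = 61, 53
n = p * q                                        # 3233
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)     # lcm(60, 52) = 780

e = 17
assert gcd(e, lam) == 1                          # e must be coprime to lambda(n)

d = pow(e, -1, lam)                              # modular inverse: 413

m = 65                                           # example message, 0 <= m < n
c = pow(m, e, n)                                 # encrypt with the public key
recovered = pow(c, d, n)                         # decrypt with the private key

print(n, lam, d, c, recovered)
# prints: 3233 780 413 2790 65
```

The three-argument form of `pow` performs modular exponentiation efficiently, so the same code works unchanged for realistically sized keys.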
An XOR or eXclusive OR is a bitwise operation indicated by ^ and shown by the following truth table:
| A | B | A ^ B |
|---|---|-------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
So XOR'ing the bytes 0xA0 ^ 0x2C works out, bit by bit, as:
0xA0: 1 0 1 0 0 0 0 0
0x2C: 0 0 1 0 1 1 0 0
XOR: 1 0 0 0 1 1 0 0
0b10001100 is equivalent to 0x8C. A cool property of XOR is that it is reversible, meaning 0x8C ^ 0x2C = 0xA0 and 0x8C ^ 0xA0 = 0x2C
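The same calculation can be checked in a Python shell, where ^ is the XOR operator:

```python
a, b = 0xA0, 0x2C
result = a ^ b

print(f"{a:08b} ^ {b:08b} = {result:08b} ({result:#04x})")
# prints: 10100000 ^ 00101100 = 10001100 (0x8c)

# XOR is its own inverse: applying either operand again recovers the other.
print(hex(result ^ b), hex(result ^ a))
# prints: 0xa0 0x2c
```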
"}, {"location": "cryptography/what-is-xor/#what-does-this-have-to-do-with-ctf", "title": "What does this have to do with CTF?", "text": "
XOR is a cheap way to encrypt data with a password. Any data can be encrypted using XOR as shown in this Python example:
>>> data = 'CAPTURETHEFLAG'\n>>> key = 'A'\n>>> encrypted = ''.join([chr(ord(x) ^ ord(key)) for x in data])\n>>> encrypted\n'\\x02\\x00\\x11\\x15\\x14\\x13\\x04\\x15\\t\\x04\\x07\\r\\x00\\x06'\n>>> decrypted = ''.join([chr(ord(x) ^ ord(key)) for x in encrypted])\n>>> decrypted\n'CAPTURETHEFLAG'\n
This can be extended using a multibyte key by iterating in parallel with the data.
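A multibyte (repeating-key) version might look like the sketch below, which works on bytes rather than strings. The function name `xor_with_key` and the sample key are illustrative:

```python
from itertools import cycle


def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR data with a key that repeats for as long as the data lasts."""
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


data = b"CAPTURETHEFLAG"
key = b"KEY"

encrypted = xor_with_key(data, key)
decrypted = xor_with_key(encrypted, key)   # the same call decrypts
print(encrypted, decrypted)
```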
Multibyte XOR gets exponentially harder to bruteforce the longer the key, but if the encrypted text is long enough, character frequency analysis is a viable method to find the key. Character Frequency Analysis means that we split the cipher text into groups based on the number of characters in the key. Each group is then bruteforced separately using the idea that some letters appear more frequently in the English language than others.
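A rough sketch of that approach, assuming the key length is already known and the plaintext is lowercase English. The scoring function is a deliberately crude heuristic (count common letters and spaces, penalise unprintable bytes), and the sample text and key are made up for the demonstration:

```python
from itertools import cycle

COMMON = b" etaoinshrdlu"   # space plus the most frequent English letters


def xor_with_key(data: bytes, key: bytes) -> bytes:
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


def score(candidate: bytes) -> int:
    """Higher means 'looks more like lowercase English'."""
    total = 0
    for byte in candidate:
        if byte in COMMON:
            total += 1
        elif not 32 <= byte < 127:
            total -= 5          # unprintable bytes are very unlikely
    return total


def recover_key(ciphertext: bytes, key_len: int) -> bytes:
    key = bytearray()
    for offset in range(key_len):
        # Every key_len-th byte was XOR'd with the same key byte.
        column = ciphertext[offset::key_len]
        best = max(range(256),
                   key=lambda k: score(bytes(b ^ k for b in column)))
        key.append(best)
    return bytes(key)


plaintext = (b"the quick brown fox jumps over the lazy dog while the rain in "
             b"spain stays mainly in the plain and the other hounds are sent "
             b"to rest in the shade near the old stone house on the hill")
ciphertext = xor_with_key(plaintext, b"CTF")

guessed = recover_key(ciphertext, 3)
print(guessed, xor_with_key(ciphertext, guessed)[:19])
```

Each key byte is searched independently, so the work is 256 tries per key byte rather than 256 to the power of the key length. Short ciphertexts or text with unusual character distributions may need a better scoring function.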
"}, {"location": "faq/connecting-to-services/", "title": "How to connect to services", "text": "
Note
While service challenges are often connected to with netcat or PuTTY, solving them will sometimes require using a scripting language like Python. CTF players often use Python alongside pwntools.
You can run pwntools right in your browser by using repl.it.
netcat is a networking utility found on macOS and Linux operating systems that allows for easy connections to CTF challenges. Service challenges will commonly give you an address and a port to connect to. The syntax for connecting to a service challenge with netcat is nc <ip> <port>.
Windows users can connect to service challenges using ConEmu, which can be downloaded here. Connecting to service challenges with ConEmu is done by running nc <ip> <port>.
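When a challenge needs scripting rather than typing, the same connection can be made from Python. The sketch below uses only the standard `socket` module so it runs without pwntools installed; the helper names (`connect`, `read_line`, `answer_prompt`) are invented for this example, and a real service's prompt format will differ.

```python
import socket


def connect(host: str, port: int, timeout: float = 10.0) -> socket.socket:
    """Open a TCP connection, the scripted equivalent of `nc <ip> <port>`."""
    return socket.create_connection((host, port), timeout=timeout)


def read_line(sock: socket.socket) -> bytes:
    """Read one newline-terminated line from the service."""
    data = bytearray()
    while not data.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:          # connection closed by the service
            break
        data.extend(chunk)
    return bytes(data)


def answer_prompt(sock: socket.socket, reply: bytes) -> bytes:
    """Read the service's prompt line, send a reply, and return the prompt."""
    prompt = read_line(sock)
    sock.sendall(reply + b"\n")
    return prompt
```

A typical use would be `sock = connect(ip, port)` followed by one or more `answer_prompt(sock, b"...")` calls, with the address and port taken from the challenge description.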
"}, {"location": "faq/i-need-a-server/", "title": "I need a server", "text": "
Occasionally, certain kinds of exploits will require a server to connect back to. Some examples are connect back shellcode, cross site request forgery (CSRF), or blind cross site scripting (XSS).
"}, {"location": "faq/i-need-a-server/#i-just-a-web-server", "title": "I just a web server", "text": "
If you just need a web server to host simple static websites or check access logs, we recommend using PythonAnywhere to host a simple web application. You can program a simple web application in popular Python web frameworks (e.g. Flask) and host it there for free.
"}, {"location": "faq/i-need-a-server/#i-need-a-real-server", "title": "I need a real server", "text": "
If you need a real server (perhaps to run complex calculations or for shellcode to connect back to), we recommend DigitalOcean. DigitalOcean has a cheap $4-6/month plan for a small server that can be freely configured to do whatever you need.
Generally in cyber security competitions, it is up to you and your team to determine what software to use. In some cases you may even end up creating new tools to give you an edge! That being said, here are some applications that we recommend for most competitors for most competitions.
Ghidra is a disassembler and decompiler that is open source and free to use. Released by the NSA, Ghidra is a capable tool and is the recommended disassembler for most use cases. An alternative is IDA Pro (a cyber security industry standard), however IDA Pro is not free and licenses are very expensive.
Binary Ninja
Binary Ninja is a commercial disassembler (with a free demo application) that provides an aesthetic and easy to use interface for binary reverse engineering. It also has a Web-UI which can be used freely. Binary Ninja's API and intermediate language make it superior to other disassemblers for certain use cases.
Pwndbg is a plugin for the GNU Debugger (gdb) which makes it easier to dynamically reverse an application by stepping through its execution. In order to use pwndbg you will first need to have gdb installed via a Linux virtual machine or similar.
Burp Suite is an HTTP proxy and set of tools which allow you to view, edit and replay your HTTP requests. While Burp Suite is a commercial tool, it offers a free version which is very capable and usually all that's needed.
sqlmap
sqlmap is a penetration testing tool that automates the process of detecting and exploiting SQL injection flaws. It's open source and freely available.
Google Chrome
Google Chrome is a web browser with a suite of developer tools and extensions. These tools and extensions can be useful when investigating a web application.
Wireshark
Wireshark is a PCAP analysis tool which allows you to analyze and record network traffic.
VMware is a company that creates virtualization software that allows you to run other operating systems within your existing operating system. While their products are not generally free, their software is best in class for virtualization.
VMware Fusion, VMware Workstation, and VMware Player are three of their virtualization products that can be used on your computer to run other OSes. VMware Player is free to use for Windows and Linux.
VirtualBox
VirtualBox is open source virtualization software which allows you to virtualize other operating systems. It's very similar to VMware products but free for all OSes. It is generally slower than VMware but works well enough for most people.
Python is an easy-to-learn, widely used programming language which supports complex applications as well as small scripts. It has a large community which provides thousands of useful packages. Python is widely used in the cyber security industry and is generally the recommended language to use in CTF competitions.
pwntools
Pwntools is a Python package which makes interacting with processes and networks easy. It is a recommended library for interacting with binary exploitation and networking based CTF challenges.
Note
You can run pwntools right in your browser by using repl.it. Create a new Python repl and install the pwntools package. After that you'll be able to use pwntools directly from your browser without having to install anything.
Forensics is the art of recovering the digital trail left on a computer. There are plenty of methods to find data which is seemingly deleted, not stored, or worse, covertly recorded.
An important part of forensics is having the right tools, as well as being familiar with the following topics:
File Extensions are not the sole way to identify the type of a file. Files have certain leading bytes called file signatures which allow programs to parse the data in a consistent manner. Files can also contain additional \"hidden\" data called metadata which can be useful in finding out information about the context of a file's data.
File signatures (also known as File Magic Numbers) are bytes within a file used to identify the format of the file. Generally they\u2019re 2-4 bytes long, found at the beginning of a file.
"}, {"location": "forensics/what-are-file-formats/#what-is-it-used-for", "title": "What is it used for?", "text": "
Files can sometimes come without an extension, or with incorrect ones. We use file signature analysis to identify the format (file type) of the file. Programs need to know the file type in order to open it properly.
"}, {"location": "forensics/what-are-file-formats/#how-do-you-find-the-file-signature", "title": "How do you find the file signature?", "text": "
You need to be able to look at the binary data that constitutes the file you\u2019re examining. To do this, you\u2019ll use a hexadecimal editor. Once you find the file signature, you can check it against file signature repositories such as Gary Kessler\u2019s.
The file above, when opened in a Hex Editor, begins with the bytes FFD8FFE0 00104A46 494600. Rendered as text, only the JFIF portion is readable: the leading bytes FF D8 FF E0 are not ASCII characters, and \\x00 and \\x10 lack printable symbols.
Searching in Gary Kessler\u2019s database shows that this file signature belongs to a JPEG/JFIF graphics file, exactly what we suspect.
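Signature matching is simple to automate. The sketch below checks a file's leading bytes against a handful of well-known signatures; the table is intentionally tiny and the function names are chosen for the example, so treat it as a starting point rather than a replacement for a full signature database.

```python
# (magic bytes, description) pairs for a few common formats.
SIGNATURES = [
    (bytes.fromhex("FFD8FF"), "JPEG image"),
    (bytes.fromhex("89504E470D0A1A0A"), "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
]


def identify(header: bytes) -> str:
    """Return a description for the first matching signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first bytes of a file and identify its format."""
    with open(path, "rb") as f:
        return identify(f.read(16))


# The leading bytes quoted above identify as JPEG.
print(identify(bytes.fromhex("FFD8FFE000104A46494600")))
# prints: JPEG image
```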
A hexadecimal (hex) editor (also called a binary file editor or byte editor) is a computer program you can use to manipulate the fundamental binary data that constitutes a computer file. The name \u201chex\u201d comes from \u201chexadecimal,\u201d a standard numerical format for representing binary data. A typical computer file occupies multiple areas on the platter(s) of a disk drive, whose contents are combined to form the file. Hex editors that are designed to parse and edit sector data from the physical segments of floppy or hard disks are sometimes called sector editors or disk editors. A hex editor is used to see or edit the raw, exact contents of a file. Hex editors may be used to correct data corrupted by a system or application. A list of editors can be found on the forensics Wiki. You can download one and install it on your system.
A forensic image is an electronic copy of a drive (e.g. a hard drive, USB, etc.). It\u2019s a bit-by-bit or bitstream file that\u2019s an exact, unaltered copy of the media being duplicated.
Wikipedia said that the most straightforward disk imaging method is to read a disk from start to finish and write the data to a forensics image format. \u201cThis can be a time-consuming process, especially for disks with a large capacity,\u201d Wikipedia said.
To prevent write access to the disk, you can use a write blocker. It\u2019s also common to calculate a cryptographic hash of the entire disk when imaging it. \u201cCommonly-used cryptographic hashes are MD5, SHA1 and/or SHA256,\u201d said Wikipedia. \u201cBy recalculating the integrity hash at a later time, one can determine if the data in the disk image has been changed. This by itself provides no protection against intentional tampering, but it can indicate that the data was altered, e.g. due to corruption.\u201d
Why image a disk? Forensic imaging:
- Prevents tampering with the original data evidence
- Allows you to play around with the copy, without worrying about messing up the original
There are plenty of traces of someone's activity on a computer, but perhaps some of the most valuable information can be found within memory dumps, that is, images taken of RAM. These dumps of data are often very large, but can be analyzed using a tool called Volatility.
In order to properly use Volatility you must supply a profile with --profile=PROFILE, therefore before any sleuthing, you need to determine the profile using imageinfo:
$ python vol.py -f ~/image.raw imageinfo\nVolatility Foundation Volatility Framework 2.4\nDetermining profile based on KDBG search...\n\n Suggested Profile(s) : Win7SP0x64, Win7SP1x64, Win2008R2SP0x64, Win2008R2SP1x64\n AS Layer1 : AMD64PagedMemory (Kernel AS)\n AS Layer2 : FileAddressSpace (/Users/Michael/Desktop/win7_trial_64bit.raw)\n PAE type : PAE\n DTB : 0x187000L\n KDBG : 0xf80002803070\n Number of Processors : 1\n Image Type (Service Pack) : 0\n KPCR for CPU 0 : 0xfffff80002804d00L\n KUSER_SHARED_DATA : 0xfffff78000000000L\n Image date and time : 2012-02-22 11:29:02 UTC+0000\n Image local date and time : 2012-02-22 03:29:02 -0800\n
Metadata is data about data. Different types of files have different metadata. The metadata on a photo could include dates, camera information, GPS location, comments, etc. For music, it could include the title, author, track number and album.
"}, {"location": "forensics/what-is-metadata/#what-kind-of-file-metadata-is-useful", "title": "What kind of file metadata is useful?", "text": "
Potentially, any file metadata you can find could be useful.
"}, {"location": "forensics/what-is-metadata/#how-do-i-find-it", "title": "How do I find it?", "text": "
Note
EXIF Data is metadata attached to photos which can include location, time, and device information.
One of our favorite tools is exiftool, which displays metadata for an input file, including:
- File size
- Dimensions (width and height)
- File type
- Programs used to create (e.g. Photoshop)
- OS used to create (e.g. Apple)
Run command line: exiftool(-k).exe [filename] and you should see something like this:
Timestamps are data that indicate the time of certain events (MAC):
- Modification \u2013 when a file was modified
- Access \u2013 when a file or entries were read or accessed
- Creation \u2013 when files or entries were created
"}, {"location": "forensics/what-is-metadata/#types-of-timestamps", "title": "Types of timestamps", "text": "
Modified
Accessed
Created
Date Changed (MFT)
Filename Date Created (MFT)
Filename Date Modified (MFT)
Filename Date Accessed (MFT)
INDX Entry Date Created
INDX Entry Date Modified
INDX Entry Date Accessed
INDX Entry Date Changed
"}, {"location": "forensics/what-is-metadata/#why-do-we-care", "title": "Why do we care?", "text": "
Certain events such as creating, moving, copying, opening, editing, etc. might affect the MAC times. If the MAC timestamps can be attained, a timeline of events could be created.
There are plenty more patterns than the ones introduced below, but these are the basics you should start with to get a good understanding of how it works, and to complete this challenge.
We know that the BMP files fileA and fileD are the same, but that the JPEG files fileB and fileC are different somehow. So how can we find out what went on with these files?
By using time stamp information from the file system, we can learn that the BMP fileD was the original file, with fileA being a copy of the original. Afterward, fileB was created by modifying fileB, and fileC was created by modifying fileA in a different way.
Follow along as we demonstrate.
We\u2019ll start by analyzing images in AccessData FTK Imager, where there\u2019s a Properties window that shows you some information about the file or folder you\u2019ve selected.
Here are the extracted MAC times for fileA, fileB, fileC and fileD: Note, AccessData FTK Imager assumes that the file times on the drive are in UTC (Universal Coordinated Time). I subtracted four hours, since the USB was set up in Eastern Standard Time. This isn\u2019t necessary, but it helps me understand the times a bit better.
Highlight timestamps that are the same; if timestamps are off by only a few seconds, count them as the same. This lets you see a clear difference between different timestamps. Then, highlight oldest to newest to help put them in order.
Steganography is the practice of hiding data in plain sight. The hidden data is often embedded in images or audio.
You could send a picture of a cat to a friend and hide text inside. Looking at the image, there\u2019s nothing to make anyone think there\u2019s a message hidden inside it.
You could also hide a second image inside the first.
So we can hide text and an image, how do we find out if there is hidden data?
FileA and FileD appear the same, but they\u2019re different. Also, FileD was modified after it was copied, so it\u2019s possible there might be steganography in it.
FileB and FileC don\u2019t appear to have been modified after being created. That doesn\u2019t rule out the possibility that there\u2019s steganography in them, but you\u2019re more likely to find it in fileD. This brings up two questions:
Can we determine that there is steganography in fileD?
Let\u2019s say we have an image, and part of it contains the following binary:
And let\u2019s say we want to hide the character y inside.
First, we need to convert the hidden message to binary.
Now we take each bit from the hidden message and replace the LSB of the corresponding byte with it.
And again:
And again:
And again:
And again:
And again:
And again:
And once more:
Decoding LSB steganography is exactly the same as encoding, but in reverse. For each byte, grab the LSB and add it to your decoded message. Once you\u2019ve gone through each byte, convert all the LSBs you grabbed into text or a file. (You can use your file signature knowledge here!)
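The encode and decode steps described above can be sketched in Python. This is a minimal illustration, not a full tool: the eight cover bytes are made up for the example, and the helper names hide_byte and recover_byte are hypothetical.

```python
def hide_byte(cover, secret):
    # Split the secret byte into 8 bits, most significant bit first.
    bits = [(secret >> (7 - i)) & 1 for i in range(8)]
    # Clear the LSB of each cover byte, then set it to the matching secret bit.
    return bytes((c & 0xFE) | b for c, b in zip(cover, bits))


def recover_byte(stego):
    # Collect the LSB of each of the first 8 bytes and rebuild the hidden byte.
    value = 0
    for c in stego[:8]:
        value = (value << 1) | (c & 1)
    return value


# Made-up cover bytes standing in for part of an image.
cover = bytes([0b11010010, 0b01001010, 0b10010111, 0b10001100,
               0b00010101, 0b01010111, 0b00100110, 0b01000011])

stego = hide_byte(cover, ord('y'))
print(chr(recover_byte(stego)))  # y
```

Each cover byte changes by at most 1, which is why the altered image looks the same to the eye.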
"}, {"location": "forensics/what-is-stegonagraphy/#what-other-types-of-steganography-are-there", "title": "What other types of steganography are there?", "text": "
Steganography is hard for the defense side, because there\u2019s practically an infinite number of ways it could be carried out. Here are a few examples: - LSB steganography: different bits, different bit combinations - Encode in every certain number of bytes - Use a password - Hide in different places - Use encryption on top of steganography
\"Wireshark saved me hours on my last tax return! - David\"
\"[Wireshark] is great for ruining your weekend and fixing pesky networking problems!\" - Max\"
\"Wireshark is the powerhouse of the cell. - Joe\"
\"Does this cable do anything? - Ayyaz\"
Wireshark is a network protocol analyzer which is often used in CTF challenges to look at recorded network traffic. Wireshark uses a filetype called PCAP to record traffic. PCAPs are often distributed in CTF challenges to provide recorded traffic history.
Upon opening Wireshark, you are greeted with the option to open a PCAP or begin capturing network traffic on your device.
The network traffic displayed initially shows the packets in the order in which they were captured. You can filter packets by protocol, source IP address, destination IP address, length, etc.
In order to apply filters, simply enter the constraining factor, for example 'http', in the display filter bar.
Filters can be chained together using '&&' notation. In order to filter by IP, ensure a double equals '==' is used.
The most pertinent part of a packet is its data payload and protocol information.
In order for a network session to be encrypted properly, the client and server must share a common secret which they can use to encrypt and decrypt data without someone in the middle being able to guess it. The SSL handshake loosely follows this format:
The client sends a list of available cipher suites it can use along with a random set of bytes referred to as client_random
The server sends back the cipher suite that will be used, such as TLS_DHE_RSA_WITH_AES_128_CBC_SHA, along with a random set of bytes referred to as server_random
The client generates a pre-master secret, encrypts it, then sends it to the server.
The server and client then generate a common master secret using the selected cipher suite
The client and server begin communicating using this common secret
There are several ways to be able to decrypt traffic.
If you have the client and server random values and the pre-master secret, the master secret can be generated and used to decrypt the traffic
If you have the master secret, traffic can be decrypted easily
If the cipher-suite uses RSA, you can factor n in the key in order to break the encryption on the encrypted pre-master secret and generate the master secret with the client and server randoms
Reverse Engineering in a CTF is typically the process of taking a compiled (machine code, bytecode) program and converting it back into a more human readable format.
Very often the goal of a reverse engineering challenge is to understand the functionality of a given program such that you can identify deeper issues.
If we are given a binary compiled from that source and we want to figure out how the source looks, we can use a decompiler to get c pseudocode which we can then use to reconstruct the function. The sample decompilation can look like:
printSpacer:\nint __fastcall printSpacer(int a1)\n{\n int i; // [rsp+8h] [rbp-8h]\n\n for ( i = 0; i < a1; ++i )\n printf(\"-\");\n return printf(\"\\n\");\n}\n\nmain:\nint __cdecl main(int argc, const char **argv, const char **envp)\n{\n int v4; // [rsp+18h] [rbp-18h]\n signed int i; // [rsp+1Ch] [rbp-14h]\n\n for ( i = 0; i < 13; ++i )\n {\n v4 = i + 1;\n printf(\"%c\", (unsigned int)aHelloWorld[i], envp);\n while ( v4 < 13 )\n printf(\"%c\", (unsigned int)aHelloWorld[v4++]);\n printf(\"\\n\");\n printSpacer(13 - i);\n }\n return 0;\n}\n
A good method of getting a good representation of the source is to convert the decompilation into Python, since Python is basically pseudocode that runs. Starting with main often allows you to gain a good overview of what the program is doing and will help you translate the other functions.
We know we will start with a main function and some variables. If you trace how the variables are used, you can often determine their types. Because i is used as an index, we know it's an int, and because v4 is used as one later on, it too is an index. We can also see that the variable aHelloWorld is printed with \"%c\", so we can determine that it represents the 'Hello, World!' string. Let's define all these variables in our Python main function:
def main():\n string = \"Hello, World!\"\n i = 0\n v4 = 0\n for i in range(0, 13):\n v4 = i + 1\n print(string[i], end='')\n while v4 < 13:\n print(string[v4], end='')\n v4 += 1\n print()\n printSpacer(13-i)\n
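The main function above calls printSpacer, so it also needs a translation before the script will run. A possible Python version, following the decompiled loop line by line, might look like this:

```python
def printSpacer(a1):
    # Mirrors the decompiled loop: print a1 dashes, then a newline.
    for i in range(0, a1):
        print('-', end='')
    print()


printSpacer(5)
```

With both functions defined, calling main() prints the string one character shorter on each line, each followed by a shrinking row of dashes.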
The Interactive Disassembler (IDA) is the industry standard for binary disassembly. IDA is capable of disassembling \"virtually any popular file format\". This makes it very useful to security researchers and CTF players who often need to analyze obscure files without knowing what they are or where they came from. IDA also features the industry leading Hex Rays decompiler which can convert assembly code back into a pseudo code like format.
IDA also has a plugin interface which has been used to create some successful plugins that can make reverse engineering easier:
Binary Ninja is an up and coming disassembler that attempts to bring a new, more programmatic approach to reverse engineering. Binary Ninja brings an improved plugin API and modern features to reverse engineering. While it is neither as popular nor as old as IDA, Binary Ninja (often called binja) is quickly gaining ground and has a small community of dedicated users and followers.
Binja also has some community contributed plugins which are collected here: https://github.com/Vector35/community-plugins
The GNU Debugger is a free and open source debugger which also disassembles programs. It's capable as a disassembler, but most notably it is used by CTF players for its debugging and dynamic analysis capabilities.
gdb is often used in tandem with enhancement scripts like peda, pwndbg, and GEF.
Machine Code or Assembly is code which has been formatted for direct execution by a CPU. Machine Code is the reason why readable programming languages like C, when compiled, cannot be reversed into source code (well Decompilers can sort of, but more on that later).
"}, {"location": "reverse-engineering/what-is-assembly-machine-code/#from-source-to-compilation", "title": "From Source to Compilation", "text": "
Godbolt shows the differences in machine code generated by various compilers.
This is a one way process for compiled languages as there is no way to generate source from machine code. While the machine code may seem unintelligible, the extremely basic functions can be interpreted with some practice.
x86-64 or amd64 or i64 is a 64-bit Complex Instruction Set Computing (CISC) architecture. This basically means that the registers used for this architecture extend the 32-bit registers of Intel's x86 architecture by an extra 32 bits. CISC means that a single instruction can do a bunch of different things at once, such as memory accesses, register reads, etc. It is also a variable-length instruction set, which means different instructions can be different sizes, ranging from 1 to 15 bytes long. And finally, x86-64 allows for multi-sized register access, which means that you can access certain parts of a register which are different sizes.
x86-64 registers behave similarly to other architectures. A key component of x86-64 registers is multi-sized access, which means the register RAX can have its lower 32 bits accessed with EAX. The lower 16 bits can be accessed with AX, and the lowest 8 bits can be accessed with AL, which allows the compiler to make optimizations which boost program execution.
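To make the multi-sized access concrete, here is a small Python sketch that models the sub-registers as bit masks over a 64-bit value. The value placed in rax is arbitrary and chosen only so each part is easy to spot.

```python
rax = 0x1122334455667788  # arbitrary example value

eax = rax & 0xFFFFFFFF  # lower 32 bits
ax = rax & 0xFFFF       # lower 16 bits
al = rax & 0xFF         # lowest 8 bits

print(hex(eax), hex(ax), hex(al))  # 0x55667788 0x7788 0x88
```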
x86-64 has plenty of registers to use, including rax, rbx, rcx, rdx, rdi, rsi, rsp, rip, r8-r15, and more! But some registers serve special purposes.
The special registers include: - RIP: the instruction pointer - RSP: the stack pointer - RBP: the base pointer
An instruction represents a single operation for the CPU to perform.
There are different types of instructions including:
Data movement: mov rax, [rsp - 0x40]
Arithmetic: add rbx, rcx
Control-flow: jne 0x8000400
Because x86-64 is a CISC architecture, instructions can be quite complex for machine code, such as repne scasb which repeats up to ECX times over memory at EDI looking for a NULL byte (0x00), decrementing ECX each byte (essentially strlen() in a single instruction!).
It is important to remember that an instruction really is just memory; this idea will become useful with Return Oriented Programming or ROP.
Note
Instructions, numbers, strings, everything are always represented in hex!
add rax, rbx == 48 01 d8\nmov rax, 0xdeadbeef == 48 c7 c0 ef be ad de\nmov rax, [0xdeadbeef] == 67 48 8b 05 ef be ad de\n\"Hello\" == 48 65 6c 6c 6f\n
What should the CPU execute? This is determined by the RIP register where IP means instruction pointer. Execution follows the pattern: fetch the instruction at the address in RIP, decode it, run it.
Here the operation mov is moving the \"immediate\" 0xdeadbeef into the register RAX
mov rax, [0xdeadbeef + rbx * 4]
Here the operation mov is moving the data at the address of [0xdeadbeef + RBX*4] into the register RAX. When brackets are used, you can think of the program as getting the content from that effective address.
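One way to picture the bracket syntax is to model memory as a lookup table. The Python sketch below is only a toy model: the memory dict, the value stored in it, and the value of rbx are all invented for illustration.

```python
# Toy memory: maps an address to the value stored there (contents are invented).
memory = {0xdeadbeef + 2 * 4: 0x1337}

rbx = 2
effective_address = 0xdeadbeef + rbx * 4  # the expression inside the brackets
rax = memory[effective_address]           # the brackets mean: read from that address

print(hex(effective_address), hex(rax))  # 0xdeadbef7 0x1337
```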
How can we express conditionals in x86-64? We use conditional jumps such as:
jnz <address>
je <address>
jge <address>
jle <address>
etc.
They jump if their condition is true, and just go to the next instruction otherwise. These conditionals are checking EFLAGS, which are special registers which store flags on certain instructions such as add rax, rbx which sets the o (overflow) flag if the sum is greater than a 64-bit register can hold, and wraps around. You can jump based on that with a jo instruction. The most important thing to remember is the cmp instruction:
cmp rax, rbx\njle error\n
This assembly jumps if RAX <= RBX"}, {"location": "reverse-engineering/what-is-assembly-machine-code/#addresses", "title": "Addresses", "text": "
Memory acts similarly to a big array where the indices of this \"array\" are memory addresses. Remember from earlier:
mov rax, [0xdeadbeef]
The square brackets mean \"get the data at this address\". This is analogous to the C/C++ syntax: rax = *0xdeadbeef;
"}, {"location": "reverse-engineering/what-is-bytecode/", "title": "What is bytecode", "text": ""}, {"location": "reverse-engineering/what-is-c/", "title": "The C Programming Language", "text": ""}, {"location": "reverse-engineering/what-is-c/#history", "title": "History", "text": "
The C programming language was written by Dennis Ritchie in the 1970s while he was working at Bell Labs. It was first used to reimplement the Unix operating system which was purely written in assembly language. At first, the Unix developers were considering using a language called \"B\" but because B wasn't optimized for the target computer, the C language was created.
Note
C is the letter and the programming language after B!
C was designed to be close to assembly and is still widely used in lower level programming where speed and control are needed (operating systems, embedded systems). C was also very influential to other programming languages used today. Notable languages include C++, Objective-C, Golang, Java, JavaScript, PHP, Python, and Rust.
Today C is widely used either as a low level programming language or is the base language that other programming languages are implemented in.
While it can be difficult to see, the C language compiles down directly into machine code. The compiler is programmed to process the provided C code and emit assembly that's targeted to whatever operating system and architecture the compiler is set to use.
Some common compilers include:
gcc
clang
A good way to explore this relationship is to use this online GCC Explorer from Matt Godbolt.
In regards to CTF, many reverse engineering and exploitation CTF challenges are written in C because the language compiles down directly to assembly and there are little to no safeguards in the language. This means developers must handle memory management and safety checks manually. Of course, this can lead to mistakes which can sometimes lead to security issues.
Note
Other higher level languages like Python manage memory and garbage collection for you. Google Golang was inspired by C, but adds in functionality like garbage collection and memory safety.
There are some examples of famously vulnerable functions in C which are still available and can still result in vulnerabilities:
C uses an idea known as pointers. A pointer is a variable which contains the address of another variable.
To understand this idea we should first understand that memory is laid out in terms of addresses and data gets stored at these addresses.
Take the following example of defining an integer in C:
int x = 4;\n
To the programmer this is the variable x receiving the value of 4. The computer stores this value in some location in memory. For example we can say that address 0x1000 now holds the value 4. The computer knows to directly access the memory and retrieve the value 4 whenever the programmer tries to use the x variable. If we were to say x + 4, the computer would give you 8 instead of 0x1004.
But in C we can retrieve the memory address being used to hold the 4 value (i.e. 0x1000) by using the & character and using * to create an \"integer pointer\" type.
int* y = &x;\n
The y variable will store the address of the x variable (0x1000).
Note
The * character allows us to declare pointer variables but also allows us to access the value stored at a pointer. For example, entering *y allows us to access the 4 value instead of 0x1000.
Whenever we use the y variable we are using the memory address, but if we use the x variable we use the value stored at the memory address.
Arrays allow programmers to group data into logical containers.
To access the individual elements of an array we access the contents by their \"index\". Most programming languages today start counting from 0. So to take our previous example:
"}, {"location": "reverse-engineering/what-is-c/#how-do-arrays-work", "title": "How do arrays work?", "text": "
Arrays are a clever combination of multiplication, pointers, and programming.
Because the computer knows the data type used for every element in the array, the computer needs to simply multiply the size of the data type by the index you are looking for and then add this value to the address of the beginning of the array.
For example if we know that the base address of an array is 1000 and we know that each integer takes 8 bytes, we know that if we have 8 integers right next to each other, we can get the integer at the 4th index with the following math:
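The calculation can be written out as a short Python sketch, using only the numbers given in the text (base address 1000, 8 bytes per integer, index 4). The variable names are chosen here for illustration.

```python
base_address = 1000  # start of the array
element_size = 8     # bytes per integer, as assumed in the text
index = 4

# address = base + index * size of one element
address = base_address + index * element_size
print(address)  # 1032
```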
"}, {"location": "reverse-engineering/what-is-c/#memory-management", "title": "Memory Management", "text": ""}, {"location": "reverse-engineering/what-is-gdb/", "title": "The GNU Debugger (GDB)", "text": "
The GNU Debugger or GDB is a powerful debugger which allows for step-by-step execution of a program. It can be used to trace program execution and is an important part of any reverse engineering toolkit.
GDB without any modifications is unintuitive and obscures a lot of useful information. The plug-in pwndbg solves a lot of these problems and makes for a much more pleasant experience. But if you are constrained and have to use vanilla gdb, here are several things to make your life easier.
In order to view the state of registers with vanilla gdb, you need to run the command info registers which will display the state of all the registers:
As before, in order to delete a breakpoint, you can list the available breakpoints using (gdb) info breakpoints (don't forget about GDB's autocomplete, you don't always need to type out every command!) which will display all breakpoints:
Num Type Disp Enb Address What\n1 breakpoint keep y 0x0804852f <main>\n3 breakpoint keep y 0x0804864d <__libc_csu_init+61>\n
Then simply execute (gdb) delete 1
Note
GDB creates breakpoints chronologically and does NOT reuse numbers.
What good is a debugger if you can't control where you are going? In order to begin execution of a program, use the command r [arguments], similar to how you would execute it with dot-slash notation: ./program [arguments]. If no breakpoints are set, the program will run normally. If you have breakpoints set, execution will stop at the first one that is hit.
(gdb) continue [# of breakpoints]: Resumes the execution of the program until it finishes or until another breakpoint is hit (shorthand c)
(gdb) step [# of instructions]: Steps into an instruction the specified number of times, default is 1 (shorthand s)
(gdb) nexti [# of instructions]: Steps over an instruction meaning it will not delve into called functions (shorthand ni)
(gdb) finish: Finishes a function and breaks after it gets returned (shorthand fin)
Examining data in GDB is also very useful for seeing how the program is affecting data. The notation may seem complex at first, but it is flexible and provides powerful functionality.
If the program happens to be an accept-and-fork server, gdb will have issues following the child or parent processes. In order to specify which process gdb should follow after a fork, you can use the command set follow-fork-mode [parent/child]
Another useful feature of GDB is to attach to processes which are already running. Simply launch gdb using gdb, then find the process id of the program you would like to attach to and execute attach [pid].
Websites all around the world are programmed using various programming languages. While there are specific vulnerabilities in each programming language that the developer should be aware of, there are issues fundamental to the internet that can show up regardless of the chosen language or framework.
These vulnerabilities often show up in CTFs as web security challenges where the user needs to exploit a bug to gain some kind of higher level privilege.
Command Injection is a vulnerability that allows an attacker to submit system commands to a computer running a website. This happens when the application fails to encode user input that goes into a system shell. It is very common to see this vulnerability when a developer uses the system() command or its equivalent in the programming language of the application.
Because of the additional semicolon, the os.system() function is instructed to run two commands.
It looks to the program as:
ping ; ls\n
Note
The semicolon terminates a command in bash and allows you to put another command after it.
Because the ping command is being terminated and the ls command is being added on, the ls command will be run in addition to the empty ping command!
This is the core concept behind command injection. The ls command could of course be switched with another command (e.g. wget, curl, bash, etc.)
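The string handling behind this can be sketched in Python without running any command. The helper names below are hypothetical, and quoting with shlex.quote is shown only as one possible mitigation, not as what any particular application does.

```python
import shlex


def build_ping_command(user_input):
    # Vulnerable pattern: user input is pasted straight into a shell command string.
    return 'ping ' + user_input


def build_ping_command_quoted(user_input):
    # Possible mitigation: quote the input so the shell sees it as a single argument.
    return 'ping ' + shlex.quote(user_input)


print(build_ping_command('; ls'))         # ping ; ls
print(build_ping_command_quoted('; ls'))  # ping '; ls'
```

In the quoted version the semicolon is just part of an argument to ping, so the shell no longer treats it as the end of a command.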
Command injection is a very common means of privilege escalation within web applications and applications that interface with system commands. Many kinds of home routers take user input and directly append it to a system command. For this reason, many of those home router models are vulnerable to command injection.
A Cross Site Request Forgery or CSRF Attack, pronounced see surf, is an attack on an authenticated user which uses a state session in order to perform state changing attacks like a purchase, a transfer of funds, or a change of email address.
The entire premise of CSRF is based on session hijacking, usually by injecting malicious elements within a webpage through an <img> tag or an <iframe> where references to external resources are unverified.
GET requests are often used by websites to get user input. Say a user signs in to a banking site which assigns their browser a cookie which keeps them logged in. If they transfer some money, the URL that is sent to the server might have the pattern:
Knowing this format, an attacker can send an email with a hyperlink to be clicked on or they can include an image tag of 0 by 0 pixels which will automatically be requested by the browser such as:
"}, {"location": "web-exploitation/cross-site-scripting/what-is-cross-site-scripting/", "title": "Cross Site Scripting (XSS)", "text": "
Cross Site Scripting or XSS is a vulnerability where one user of an application can send JavaScript that is executed by the browser of another user of the same application.
This is a vulnerability because JavaScript has a high degree of control over a user's web browser.
For example JavaScript has the ability to:
Modify the page (called the DOM)
Send more HTTP requests
Access cookies
By combining all of these abilities, XSS can maliciously use JavaScript to extract user's cookies and send them to an attacker controlled server. XSS can also modify the DOM to phish users for their passwords. This only scratches the surface of what XSS can be used to do.
XSS is typically broken down into three categories:
You can see the XSS exploit provided in the data GET parameter. If the application is vulnerable to reflected XSS, the application will take this data parameter value and inject it into the DOM.
Depending on where the exploit gets injected, it may need to be constructed differently.
Also, the exploit payload can change to fit whatever the attacker needs it to do. Whether that is to extract cookies and submit it to an external server, or to simply modify the page to deface it.
One of the deficiencies of reflected XSS however is that it requires the victim to access the vulnerable page from an attacker controlled resource. Notice that if the data parameter wasn't provided, the exploit wouldn't work.
In many situations, reflected XSS is detected by the browser because it is very simple for a browser to detect malicious XSS payloads in URLs.
Stored XSS is different from reflected XSS in one key way. In reflected XSS, the exploit is provided through a GET parameter. But in stored XSS, the exploit is provided from the website itself.
Imagine a website that allows users to post comments. If a user can submit an XSS payload as a comment, and then have others view that malicious comment, it would be an example of stored XSS.
The reason being that the web site itself is serving up the XSS payload to other users. This makes it very difficult to detect from the browser's perspective and no browser is capable of generically preventing stored XSS from exploiting a user.
DOM XSS is XSS that is due to the browser itself injecting an XSS payload into the DOM. While the server itself may properly prevent XSS, it's possible that the client side scripts may accidentally take a payload and insert it into the DOM and cause the payload to trigger.
The server itself is not to blame, but the client side JavaScript files are causing the issue.
Here the user is submitting ../../../../../../../../etc/passwd.
This will result in the PHP interpreter leaving the directory that it is coded to look in ('/var/www/html') and instead be forced up to the root folder.
Ultimately this will become /etc/passwd because the computer will not go a directory above its top directory.
Thus the application will load the /etc/passwd file and emit it to the user like so:
root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\nbin:x:2:2:bin:/bin:/usr/sbin/nologin\nsys:x:3:3:sys:/dev:/usr/sbin/nologin\nsync:x:4:65534:sync:/bin:/bin/sync\ngames:x:5:60:games:/usr/games:/usr/sbin/nologin\nman:x:6:12:man:/var/cache/man:/usr/sbin/nologin\nlp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin\nmail:x:8:8:mail:/var/mail:/usr/sbin/nologin\nnews:x:9:9:news:/var/spool/news:/usr/sbin/nologin\nuucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin\nproxy:x:13:13:proxy:/bin:/usr/sbin/nologin\nwww-data:x:33:33:www-data:/var/www:/usr/sbin/nologin\nbackup:x:34:34:backup:/var/backups:/usr/sbin/nologin\nlist:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin\nirc:x:39:39:ircd:/var/run/ircd:/usr/sbin/nologin\ngnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/usr/sbin/nologin\nnobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin\nsystemd-timesync:x:100:102:systemd Time Synchronization,,,:/run/systemd:/bin/false\nsystemd-network:x:101:103:systemd Network Management,,,:/run/systemd/netif:/bin/false\nsystemd-resolve:x:102:104:systemd Resolver,,,:/run/systemd/resolve:/bin/false\nsystemd-bus-proxy:x:103:105:systemd Bus Proxy,,,:/run/systemd:/bin/false\n_apt:x:104:65534::/nonexistent:/bin/false\n
This same concept can be applied to applications where some input is taken from a user and then used to access a file or path or similar. This vulnerability very often can be used to leak sensitive data or extract application source code to find other vulnerabilities.
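The path resolution described above can be reproduced with Python's path helpers. This sketch only manipulates strings and never opens a file; the containment check at the end is an illustrative assumption about how a developer might guard against traversal, not part of the original application.

```python
import posixpath

base_dir = '/var/www/html'
user_input = '../../../../../../../../etc/passwd'

# Joining and normalizing shows where the path really ends up.
resolved = posixpath.normpath(posixpath.join(base_dir, user_input))
print(resolved)  # /etc/passwd

# A simple containment check (hypothetical mitigation).
is_inside = resolved.startswith(base_dir + '/')
print(is_inside)  # False
```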
PHP is one of the most used languages for back-end web development and therefore it has become a target for hackers. PHP makes it painful to write secure code in most cases, making it every hacker's dream target.
PHP is a C-like language which uses tags enclosed by <?php ... ?> (sometimes just <? ... ?>). It is inlined into HTML. A word of advice: keep the PHP docs open, because function names are strange. The length of a function name was once used as the key in PHP's internal dictionary, so function names were shortened or lengthened to make the lookup faster. Other things include:
<?php\n if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['email']) && isset($_POST['password'])) {\n $db = new mysqli('127.0.0.1', 'cs3284', 'cs3284', 'logmein');\n $email = $_POST['email'];\n $password = sha1($_POST['password']);\n $res = $db->query(\"SELECT * FROM users WHERE email = '$email' AND password = '$password'\");\n if ($row = $res->fetch_assoc()) {\n $_SESSION['id'] = $row['id'];\n header('Location: index.php');\n die();\n }\n }\n?>\n<html>...\n
This example PHP simply checks the POST data for an email and password. If the password is equal to the hashed password in the database, the user is logged in and redirected to the index page.
The line email = '$email' uses automatic string interpolation in order to convert $email into a string to compare with the database.
PHP will do just about anything to match with a loose comparison (==), which means things can be 'equal' (==) or really equal (===). The implicit conversion between strings and integers is the root cause of a lot of issues in PHP.
PHP has multiple ways to include other source files such as require, require_once and include. These can take a dynamic string such as require $_GET['page'] . \".php\"; which is usually seen in templating.
PHP has its own URL scheme: php://... and its main purpose is to filter output automatically. It can automatically remove certain HTML tags and can base64 encode as well.
Server Side Request Forgery or SSRF is where an attacker is able to cause a web application to send a request that the attacker defines.
For example, say there is a website that lets you take a screenshot of any site on the internet.
Under normal usage a user might ask it to take a screenshot of a page like Google, or The New York Times. But what if a user does something more nefarious? What if they asked the site to take a picture of http://localhost ? Or perhaps tries to access something more useful like http://localhost/server-status ?
Note
127.0.0.1 (also known as localhost or loopback) represents the computer itself. Accessing localhost means you are accessing the computer's own internal network. Developers often use localhost as a way to access the services they have running on their own computers.
Depending on what the response from the site is the attacker may be able to gain additional information about what's running on the computer itself.
In addition, the requests originating from the server would come from the server's IP, not the attacker's IP. Because of that, it is possible that the attacker might be able to access internal resources that they wouldn't normally be able to access.
Another usage for SSRF is to create a simple port scanner to scan the internal network looking for internal services.
SQL Injection is a vulnerability where an application takes input from a user and doesn't validate that the user's input doesn't contain additional SQL.
<?php\n $username = $_GET['username']; // kchung\n $result = mysql_query(\"SELECT * FROM users WHERE username='$username'\");\n?>\n
If we look at the $username variable, under normal operation we might expect the username parameter to be a real username (e.g. kchung).
But a malicious user might submit a different kind of data. For example, consider if the input was '?
The application would crash because the resulting SQL query is incorrect.
SELECT * FROM users WHERE username='''\n
Note
Notice the extra single quote at the end.
With the knowledge that a single quote will cause an error in the application we can expand a little more on SQL Injection.
What if our input was ' OR 1=1?
SELECT * FROM users WHERE username='' OR 1=1\n
1 is indeed equal to 1. This equates to true in SQL. If we reinterpret this the SQL statement is really saying
SELECT * FROM users WHERE username='' OR true\n
This will return every row in the table because the WHERE condition evaluates to true for every row.
We can also inject comments and termination characters like -- or /* or ;. This allows you to terminate SQL queries after your injected statements. For example '-- is a common SQL injection payload.
SELECT * FROM users WHERE username=''-- '\n
This payload sets the username parameter to an empty string to break out of the query and then adds a comment (--) that effectively hides the second single quote.
Using this technique of adding SQL statements to an existing query, we can force databases to return data that they were not meant to return.
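The same behavior can be reproduced safely with an in-memory database. The sketch below uses Python's sqlite3 module instead of the PHP and MySQL setup from the example above, and the table contents are made up; it contrasts building the query by string concatenation with a parameterized query.

```python
import sqlite3

QUOTE = chr(39)  # a single quote character

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE users (username TEXT)')
db.executemany('INSERT INTO users VALUES (?)', [('kchung',), ('admin',)])

payload = QUOTE + ' OR 1=1-- '

# Vulnerable: the payload becomes part of the SQL text itself.
query = 'SELECT * FROM users WHERE username=' + QUOTE + payload + QUOTE
injected = db.execute(query).fetchall()
print(len(injected))  # 2, every row is returned

# Parameterized: the payload is treated purely as data.
safe = db.execute('SELECT * FROM users WHERE username=?', (payload,)).fetchall()
print(len(safe))  # 0, no user has that literal name
```

Parameterized queries keep the query structure fixed, so quotes and comment markers in the input cannot change what the statement does.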
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index d302cc79..dbfdbfcb 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ
diff --git a/web-exploitation/command-injection/what-is-command-injection/index.html b/web-exploitation/command-injection/what-is-command-injection/index.html
index 1394d0f8..66de7e27 100644
--- a/web-exploitation/command-injection/what-is-command-injection/index.html
+++ b/web-exploitation/command-injection/what-is-command-injection/index.html
@@ -3067,6 +3067,113 @@
+
+
+
+
+
+
+
+
+
+
+
+