Improve URL checks to reduce false-negatives
This commit improves the URL health checking mechanism to reduce false
negatives.

- Treat all 2XX status codes as successful, addressing issues with codes
  like `204`.
- Exclude URLs within Markdown inline code blocks.
- Send the Host header for improved handling of webpages behind proxies.
- Improve formatting and context for output messages.
- Fix the defaulting options for redirects and cookie handling.
- Add URL exclusion support for non-responsive URLs.
- Update the user agent pool to modern browsers and platforms.
- Improve CI/CD workflow to respond to modifications in the
  `tests/checks/external-urls` directory, offering immediate feedback on
  potential impacts to the external URL test after changes.
- Add support for randomizing the TLS fingerprint to better mimic various
  clients, improving the effectiveness of checks. However, Node.js's HTTP
  client does not fully support this; see nodejs/undici#1983.
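
The first two items in the list above can be illustrated with a minimal TypeScript sketch. The names `isSuccessStatus`, `stripInlineCode`, and `extractUrls` are hypothetical and do not come from this commit, and the regular expressions are simplifying assumptions (for example, only single-backtick inline code is handled).

```typescript
// Hypothetical helpers for illustration; not the commit's actual implementation.

/** Treats every 2XX status code (200-299) as successful, including 204. */
function isSuccessStatus(code: number): boolean {
  return code >= 200 && code <= 299;
}

/** Removes single-backtick inline code spans so URLs inside them are ignored. */
function stripInlineCode(markdown: string): string {
  return markdown.replace(/`[^`\n]*`/g, '');
}

/** Extracts http(s) URLs that appear outside inline code. */
function extractUrls(markdown: string): string[] {
  const urlPattern = /https?:\/\/[^\s<>()\[\]"'`]+/g;
  return stripInlineCode(markdown).match(urlPattern) ?? [];
}
```

Stripping the inline code before matching keeps the URL pattern simple, at the cost of not handling multi-backtick code spans or fenced blocks.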
undergroundwires committed Mar 13, 2024
1 parent 4ac1425 commit 0acc440
Showing 19 changed files with 333 additions and 216 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/checks.external-urls.yaml
@@ -3,6 +3,9 @@ name: checks.external-urls
on:
schedule:
- cron: '0 0 * * 0' # at 00:00 on every Sunday
push:
paths:
- tests/checks/external-urls/**

jobs:
run-check:
5 changes: 4 additions & 1 deletion src/infrastructure/Threading/AsyncSleep.ts
@@ -1,7 +1,10 @@
export type SchedulerCallbackType = (...args: unknown[]) => void;
export type SchedulerType = (callback: SchedulerCallbackType, ms: number) => void;

export function sleep(time: number, scheduler: SchedulerType = setTimeout) {
export function sleep(
time: number,
scheduler: SchedulerType = setTimeout,
): Promise<void> {
return new Promise((resolve) => {
scheduler(() => resolve(undefined), time);
});
@@ -1,4 +1,4 @@
import { splitTextIntoLines, indentText } from '../utils/text';
import { indentText, splitTextIntoLines } from '@tests/shared/Text';
import { log, die } from '../utils/log';
import { readAppLogFile } from './app-logs';
import { STDERR_IGNORE_PATTERNS } from './error-ignore-patterns';
@@ -1,7 +1,7 @@
import { filterEmpty } from '@tests/shared/Text';
import { runCommand } from '../../utils/run-command';
import { log, LogLevel } from '../../utils/log';
import { SupportedPlatform, CURRENT_PLATFORM } from '../../utils/platform';
import { filterEmpty } from '../../utils/text';

export async function captureWindowTitles(processId: number) {
if (!processId) { throw new Error('Missing process ID.'); }
@@ -1,3 +1,4 @@
import { indentText } from '@tests/shared/Text';
import { logCurrentArgs, CommandLineFlag, hasCommandLineFlag } from './cli-args';
import { log, die } from './utils/log';
import { ensureNpmProjectDir, npmInstall, npmBuild } from './utils/npm';
@@ -15,7 +16,6 @@ import {
APP_EXECUTION_DURATION_IN_SECONDS,
SCREENSHOT_PATH,
} from './config';
import { indentText } from './utils/text';
import type { ExtractionResult } from './app/extractors/common/extraction-result';

export async function main(): Promise<void> {
@@ -1,5 +1,6 @@
import { exec, type ExecOptions, type ExecException } from 'node:child_process';
import { indentText } from './text';
import { exec } from 'child_process';
import { indentText } from '@tests/shared/Text';
import type { ExecOptions, ExecException } from 'child_process';

const TIMEOUT_IN_SECONDS = 180;
const MAX_OUTPUT_BUFFER_SIZE = 1024 * 1024; // 1 MB
66 changes: 32 additions & 34 deletions tests/checks/external-urls/StatusChecker/BatchStatusChecker.ts
@@ -1,64 +1,62 @@
import { sleep } from '@/infrastructure/Threading/AsyncSleep';
import { getUrlStatus, type IRequestOptions } from './Requestor';
import { groupUrlsByDomain } from './UrlPerDomainGrouper';
import type { IUrlStatus } from './IUrlStatus';
import { getUrlStatus, type RequestOptions } from './Requestor';
import { groupUrlsByDomain } from './UrlDomainProcessing';
import type { FollowOptions } from './FetchFollow';
import type { UrlStatus } from './UrlStatus';

export async function getUrlStatusesInParallel(
urls: string[],
options?: IBatchRequestOptions,
): Promise<IUrlStatus[]> {
// urls = [ 'https://privacy.sexy' ]; // Here to comment out when testing
options?: BatchRequestOptions,
): Promise<UrlStatus[]> {
// urls = ['https://archive.ph/2023.10.07-112359/https://apps.microsoft.com/detail/9NCBCSZSJRSB?hl=en-us&gl=US']; // Debug override: uncomment to test a single URL
const uniqueUrls = Array.from(new Set(urls));
const defaultedOptions = { ...DefaultOptions, ...options };
console.log('Options: ', defaultedOptions);
const results = await request(uniqueUrls, defaultedOptions);
const defaultedDomainOptions = { ...DefaultDomainOptions, ...options?.domainOptions };
console.log('Batch request options applied:', defaultedDomainOptions);
const results = await request(uniqueUrls, defaultedDomainOptions, options);
return results;
}

export interface IBatchRequestOptions {
domainOptions?: IDomainOptions;
requestOptions?: IRequestOptions;
export interface BatchRequestOptions {
readonly domainOptions?: Partial<DomainOptions>;
readonly requestOptions?: Partial<RequestOptions>;
readonly followOptions?: Partial<FollowOptions>;
}

interface IDomainOptions {
sameDomainParallelize?: boolean;
sameDomainDelayInMs?: number;
interface DomainOptions {
readonly sameDomainParallelize?: boolean;
readonly sameDomainDelayInMs?: number;
}

const DefaultOptions: Required<IBatchRequestOptions> = {
domainOptions: {
sameDomainParallelize: false,
sameDomainDelayInMs: 3 /* sec */ * 1000,
},
requestOptions: {
retryExponentialBaseInMs: 5 /* sec */ * 1000,
requestTimeoutInMs: 60 /* sec */ * 1000,
additionalHeaders: {},
},
const DefaultDomainOptions: Required<DomainOptions> = {
sameDomainParallelize: false,
sameDomainDelayInMs: 3 /* sec */ * 1000,
};

function request(
urls: string[],
options: Required<IBatchRequestOptions>,
): Promise<IUrlStatus[]> {
if (!options.domainOptions.sameDomainParallelize) {
domainOptions: Required<DomainOptions>,
options?: BatchRequestOptions,
): Promise<UrlStatus[]> {
if (!domainOptions.sameDomainParallelize) {
return runOnEachDomainWithDelay(
urls,
(url) => getUrlStatus(url, options.requestOptions),
options.domainOptions.sameDomainDelayInMs,
(url) => getUrlStatus(url, options?.requestOptions, options?.followOptions),
domainOptions.sameDomainDelayInMs,
);
}
return Promise.all(urls.map((url) => getUrlStatus(url, options.requestOptions)));
return Promise.all(
urls.map((url) => getUrlStatus(url, options?.requestOptions, options?.followOptions)),
);
}

async function runOnEachDomainWithDelay(
urls: string[],
action: (url: string) => Promise<IUrlStatus>,
action: (url: string) => Promise<UrlStatus>,
delayInMs: number | undefined,
): Promise<IUrlStatus[]> {
): Promise<UrlStatus[]> {
const grouped = groupUrlsByDomain(urls);
const tasks = grouped.map(async (group) => {
const results = new Array<IUrlStatus>();
const results = new Array<UrlStatus>();
/* eslint-disable no-await-in-loop */
for (const url of group) {
const status = await action(url);
@@ -1,27 +1,33 @@
import { sleep } from '@/infrastructure/Threading/AsyncSleep';
import type { IUrlStatus } from './IUrlStatus';
import { indentText } from '@tests/shared/Text';
import { type UrlStatus, formatUrlStatus } from './UrlStatus';

const DefaultBaseRetryIntervalInMs = 5 /* sec */ * 1000;

export async function retryWithExponentialBackOff(
action: () => Promise<IUrlStatus>,
action: () => Promise<UrlStatus>,
baseRetryIntervalInMs: number = DefaultBaseRetryIntervalInMs,
currentRetry = 1,
): Promise<IUrlStatus> {
): Promise<UrlStatus> {
const maxTries = 3;
const status = await action();
if (shouldRetry(status)) {
if (currentRetry <= maxTries) {
const exponentialBackOffInMs = getRetryTimeoutInMs(currentRetry, baseRetryIntervalInMs);
console.log(`Retrying (${currentRetry}) in ${exponentialBackOffInMs / 1000} seconds`, status);
console.log([
`Attempt ${currentRetry}: Retrying in ${exponentialBackOffInMs / 1000} seconds.`,
'Details:',
indentText(formatUrlStatus(status)),
].join('\n'));
await sleep(exponentialBackOffInMs);
return retryWithExponentialBackOff(action, baseRetryIntervalInMs, currentRetry + 1);
}
console.warn('💀 All retry attempts failed. Final failure to retrieve URL:', indentText(formatUrlStatus(status)));
}
return status;
}

function shouldRetry(status: IUrlStatus) {
function shouldRetry(status: UrlStatus): boolean {
if (status.error) {
return true;
}
@@ -32,14 +38,14 @@ function shouldRetry(status: IUrlStatus) {
|| status.code === 429; // Too Many Requests
}

function isTransientError(statusCode: number) {
function isTransientError(statusCode: number): boolean {
return statusCode >= 500 && statusCode <= 599;
}

function getRetryTimeoutInMs(
currentRetry: number,
baseRetryIntervalInMs: number = DefaultBaseRetryIntervalInMs,
) {
): number {
const retryRandomFactor = 0.5; // Retry intervals are between 50% and 150%
// of the exponentially increasing base amount
const minRandom = 1 - retryRandomFactor;
34 changes: 19 additions & 15 deletions tests/checks/external-urls/StatusChecker/FetchFollow.ts
@@ -1,19 +1,17 @@
import { fetchWithTimeout } from './FetchWithTimeout';
import { getDomainFromUrl } from './UrlDomainProcessing';

export function fetchFollow(
url: string,
timeoutInMs: number,
fetchOptions: RequestInit,
followOptions: IFollowOptions | undefined,
fetchOptions?: Partial<RequestInit>,
followOptions?: Partial<FollowOptions>,
): Promise<Response> {
const defaultedFollowOptions = {
...DefaultFollowOptions,
...followOptions,
};
const defaultedFollowOptions = { ...DefaultFollowOptions, ...followOptions };
if (followRedirects(defaultedFollowOptions)) {
return fetchWithTimeout(url, timeoutInMs, fetchOptions);
}
fetchOptions = { ...fetchOptions, redirect: 'manual' /* handled manually */ };
fetchOptions = { ...fetchOptions, redirect: 'manual' /* handled manually */, mode: 'cors' };
const cookies = new CookieStorage(defaultedFollowOptions.enableCookies);
return followRecursivelyWithCookies(
url,
@@ -24,13 +22,15 @@ export function fetchFollow(
);
}

export interface IFollowOptions {
followRedirects?: boolean;
maximumRedirectFollowDepth?: number;
enableCookies?: boolean;
// "cors" | "navigate" | "no-cors" | "same-origin";

export interface FollowOptions {
readonly followRedirects?: boolean;
readonly maximumRedirectFollowDepth?: number;
readonly enableCookies?: boolean;
}

export const DefaultFollowOptions: Required<IFollowOptions> = {
const DefaultFollowOptions: Required<FollowOptions> = {
followRedirects: true,
maximumRedirectFollowDepth: 20,
enableCookies: true,
@@ -64,6 +64,10 @@ async function followRecursivelyWithCookies(
if (cookieHeader) {
cookies.addHeader(cookieHeader);
}
options.headers = {
...options.headers,
Host: getDomainFromUrl(nextUrl),
};
return followRecursivelyWithCookies(nextUrl, timeoutInMs, options, newFollowDepth, cookies);
}

@@ -77,7 +81,7 @@ class CookieStorage {
constructor(private readonly enabled: boolean) {
}

public hasAny() {
public hasAny(): boolean {
return this.enabled && this.cookies.length > 0;
}

@@ -88,12 +92,12 @@ class CookieStorage {
this.cookies.push(header);
}

public getHeader() {
public getHeader(): string {
return this.cookies.join(' ; ');
}
}

function followRedirects(options: IFollowOptions) {
function followRedirects(options: FollowOptions): boolean {
if (!options.followRedirects) {
return false;
}
5 changes: 4 additions & 1 deletion tests/checks/external-urls/StatusChecker/FetchWithTimeout.ts
@@ -8,7 +8,10 @@ export async function fetchWithTimeout(
...(init ?? {}),
signal: controller.signal,
};
const promise = fetch(url, options);
const promise = fetch(
url,
options,
);
const timeout = setTimeout(() => controller.abort(), timeoutInMs);
return promise.finally(() => clearTimeout(timeout));
}
5 changes: 0 additions & 5 deletions tests/checks/external-urls/StatusChecker/IUrlStatus.ts

This file was deleted.

21 changes: 6 additions & 15 deletions tests/checks/external-urls/StatusChecker/README.md
@@ -13,7 +13,10 @@ A CLI and SDK for checking the availability of external URLs.
- 😇 **Rate Limiting**: Queues requests by domain to be polite.
- 🔁 **Retries**: Implements retry pattern with exponential back-off.
- **Timeouts**: Configurable timeout for each request.
- 🎭️ **User-Agent Rotation**: Change user agents for each request.
- 🎭️ **Impersonation**: Impersonate different browsers for each request.
- **🌐 User-Agent Rotation**: Change user agents.
- **🔑 TLS Handshakes**: Perform TLS and HTTP handshakes that are identical to those of a real browser.
- 🫙 **Cookie jar**: Preserve cookies during redirects to mimic a real browser.

## CLI

@@ -54,6 +57,7 @@ const statuses = await getUrlStatusesInParallel([ 'https://privacy.sexy', /* ...
- **`sameDomainDelayInMs`** (*number*), default: `3000` (3 seconds)
- Sets the delay between requests to the same domain.
- `requestOptions` (*object*): See [request options](#request-options).
- `followOptions` (*object*): See [follow options](#follow-options).
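
As a rough illustration of the per-domain queueing behind `sameDomainParallelize` and `sameDomainDelayInMs`, the sketch below groups URLs by host so that each group could be requested sequentially with a delay between requests. `groupByHost` is a hypothetical name; the project's own `groupUrlsByDomain` is not shown in this diff and may behave differently.

```typescript
// Hypothetical sketch; the project's `groupUrlsByDomain` may differ.
function groupByHost(urls: readonly string[]): string[][] {
  // A Map keeps insertion order, so groups appear in first-seen order.
  const groups = new Map<string, string[]>();
  for (const url of urls) {
    const host = new URL(url).host; // e.g. "example.com" or "example.com:8080"
    const group = groups.get(host) ?? [];
    group.push(url);
    groups.set(host, group);
  }
  return Array.from(groups.values());
}
```

Each inner array can then be processed one URL at a time, while different hosts are processed concurrently.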

### `getUrlStatus`

@@ -72,7 +76,6 @@ console.log(`Status code: ${status.code}`);
- The longer the base time, the greater the intervals between retries.
- **`additionalHeaders`** (*object*), default: `{}`
- Additional HTTP headers to send along with the default headers. Overrides default headers if specified.
- **`followOptions`** (*object*): See [follow options](#follow-options).
- **`requestTimeoutInMs`** (*number*), default: `60000` (60 seconds)
- Time limit to abort the request if no response is received within the specified time frame.
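
To make the retry timing concrete, here is a minimal sketch of an exponential back-off delay with jitter. The 50% to 150% random range follows the comment in this commit's retry code; the function name `retryDelayInMs` and the doubling of the base on each attempt are assumptions, since the actual calculation is not shown in the diff.

```typescript
// Hypothetical sketch; doubling the base per attempt is an assumption.
function retryDelayInMs(
  attempt: number, // 1-based retry attempt
  baseInMs: number,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const randomFactor = 0.5; // delays fall between 50% and 150% of the base amount
  const jitter = (1 - randomFactor) + random() * 2 * randomFactor;
  const exponentialBaseInMs = baseInMs * Math.pow(2, attempt - 1);
  return Math.round(exponentialBaseInMs * jitter);
}
```

Under these assumptions, a 5-second base gives nominal waits of about 5, 10, and 20 seconds for the first three retries, each scaled by the random factor.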

@@ -83,19 +86,7 @@ Follows `3XX` redirects while preserving cookies.
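Uses the same API as `fetch`, except that the second parameter sets the request timeout in milliseconds and the fourth parameter specifies [follow options](#follow-options). The fetch option `redirect: 'follow' | 'manual' | 'error'` is discarded in favor of the follow options.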

```js
const status = await fetchFollow('https://privacy.sexy', {
// First argument is same options as fetch API, except `redirect` options
// that's discarded in favor of next argument follow options
headers: {
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
},
}, {
// Second argument sets the redirect behavior
followRedirects: true,
maximumRedirectFollowDepth: 20,
enableCookies: true,
}
);
const response = await fetchFollow('https://privacy.sexy', 1000 /* timeout in milliseconds */);
console.log(`Status code: ${response.status}`);
```
