Xybern-Reasoning-7B: A Domain-Focused 7B-Parameter Reasoning Model for Law, Finance, and General Intelligence
Abstract
We present Xybern-Reasoning-7B, a 7-billion-parameter transformer model designed from scratch for high-stakes reasoning in law and finance, while maintaining strong performance on general-purpose tasks. Despite its compact size, Xybern-Reasoning-7B outperforms all other 7B-class models on a broad suite of reasoning benchmarks and is competitive with substantially larger frontier systems, including Claude 3.5 Sonnet, GPT-4o, and DeepSeek V3.
On standard public benchmarks, Xybern-Reasoning-7B achieves 85.0 MMLU, 86.37 GSM8K, 82.0 MATH, 97.5 HumanEval, 84.0 ARC, 66.67 GPQA, and 53.52 AIME-2024, setting a new bar for 7B-scale models. In internal evaluations on legal and financial tasks—such as contract understanding, case-law style question answering, and multi-step financial statement analysis—the model demonstrates robust, interpretable reasoning and consistently surpasses open 7B baselines.
These results are enabled by:
- A custom transformer architecture optimized for long-context, step-by-step reasoning;
- A multi-stage training pipeline that combines large-scale pretraining with law/finance-specialized corpora, supervised instruction tuning, and process-aware reinforcement learning from human and automated feedback; and
- Extensive synthetic reasoning data, including DeepSeek-generated multi-step explanations, debates, and domain-specific case analyses.
We describe the architecture, domain-focused training methodology, evaluation results, safety considerations, and deployment characteristics of Xybern-Reasoning-7B, and outline the roadmap for the broader Xybern model family.
1. Introduction
High-stakes decision-making in law and finance requires models that can:
- Interpret long, complex documents such as contracts, regulations, and regulatory filings.
- Perform multi-step numerical and logical reasoning.
- Maintain traceable, auditable chains of thought.
- Communicate uncertainty and legal/financial caveats clearly.
Most systems that achieve strong performance in these settings operate at tens or hundreds of billions of parameters, or are proprietary black boxes, making them expensive and difficult to deploy on-premise in regulated environments.
Xybern-Reasoning-7B addresses this gap:
- Domain-focused: optimized for legal and financial reasoning, drafting, analysis, and QA.
- Compact: 7B parameters, practical for in-house deployment on a single high-end GPU or modest multi-GPU setups.
- Transparent: designed to generate structured, inspectable reasoning chains and references.
- Versatile: despite its domain emphasis, it achieves frontier-level performance on math, coding, and general reasoning benchmarks.
This white paper contributes:
- A detailed description of the Xybern-Reasoning-7B architecture, including its long-context and reasoning-specific components.
- A domain-aware training pipeline that mixes large-scale pretraining, DeepSeek-based synthetic reasoning data, and process-aware RLHF.
- A benchmark evaluation showing strong performance relative to both 7B and much larger models.
- Discussion of safety, alignment, and deployment considerations for use in legal and financial institutions.
2. Model Architecture
2.1 High-Level Design
Xybern-Reasoning-7B is a decoder-only transformer with the following configuration:
- Total parameters: ~7.0B
- Layers: 32 transformer blocks
- Hidden dimension: 4096
- Attention heads: 32
- Head dimension: 128
- FFN inner dimension: 11,008 (SwiGLU)
- Vocabulary size: ~128k tokens
- Context window (trained): 32k tokens
- Maximum supported context (extrapolated): up to 128k tokens with RoPE scaling
The architecture follows the general decoder-only paradigm of modern open models (e.g., LLaMA/Qwen-style) but includes several modifications tailored to long-document reasoning and domain-specific workloads.
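For concreteness, the configuration above can be summarized as a small Python dataclass. This is an illustrative sketch with field names of our own choosing, not the training code:

```python
from dataclasses import dataclass

@dataclass
class XybernConfig:
    """Illustrative summary of the Xybern-Reasoning-7B configuration (Section 2.1)."""
    n_layers: int = 32              # transformer blocks
    d_model: int = 4096             # hidden dimension
    n_heads: int = 32               # attention heads (head_dim = d_model / n_heads = 128)
    d_ffn: int = 11_008             # SwiGLU inner dimension
    vocab_size: int = 128_000       # ~128k BPE tokens
    max_seq_len: int = 32_768       # trained context window
    rope_scaled_max: int = 131_072  # extrapolated context via RoPE scaling
```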
2.2 Domain-Aware Tokenization
We train a custom byte-pair encoding (BPE) tokenizer on a mixture of general, legal, financial, math, and code corpora.
Key properties:
- Legal and financial vocabulary coverage: the tokenizer efficiently represents terms such as “indemnification”, “material adverse effect”, “convertible note”, “EBITDA”, “CAGR”, “Tier 1 capital”, and typical citation formats (“Art. 5(2)(b) GDPR”, “Section 10(b) Exchange Act”).
- Citation and reference patterns: structured tokens for section references, case citations, dates, and numeric ranges reduce fragmentation in contracts and filings.
- Reasoning control tokens: reserved tokens such as `<reason>`, `<verify>`, `<final>`, `<citation>`, and `<clause_ref>` allow the model to separate internal reasoning from final answers and to point back to specific text spans.
This tokenization significantly improves efficiency and accuracy on legal and financial documents, while also supporting general-purpose text and code.
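The sketch below illustrates how such a tokenizer could be trained with the Hugging Face `tokenizers` library, reserving the control tokens before BPE merges. The corpus file paths and trainer settings are placeholders, not our production pipeline:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Reserved reasoning-control tokens from Section 2.2; registering them as
# special tokens keeps them out of BPE merges, so each maps to a single ID.
CONTROL_TOKENS = ["<reason>", "<verify>", "<final>", "<citation>", "<clause_ref>"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=128_000,             # ~128k tokens, matching Section 2.1
    special_tokens=CONTROL_TOKENS,  # reserved before regular merges
)

# File paths are placeholders for the mixed general/legal/financial corpora.
tokenizer.train(["corpus/general.txt", "corpus/legal.txt", "corpus/financial.txt"],
                trainer)
```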
2.3 Positional Encoding and Attention Layout
To support long-context reasoning over contracts, filings, and multi-document bundles:
- We use Rotary Positional Embeddings (RoPE) with a scaled base to maintain stability beyond 8k tokens and to support context-window extension.
- We adopt a hybrid attention pattern:
  - A subset of layers uses full global attention.
  - Other layers use sliding-window local attention combined with a set of global “summary tokens” that aggregate document-level context.
This design lets the model focus on local details (e.g., a specific clause) while still attending to definitions, recitals, annexes, and other context scattered across the document.
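The following PyTorch sketch shows one common way to scale RoPE’s base frequency for context extension (an “NTK-style” adjustment). The base and scaling factor shown are illustrative, not the model’s exact values:

```python
import torch

def rope_frequencies(head_dim: int, max_pos: int, base: float = 10_000.0,
                     scale: float = 4.0) -> torch.Tensor:
    """Rotary embedding angles with a scaled base ("NTK-style" extension).

    Raising the base stretches the lowest frequencies so that positions beyond
    the trained window remain distinguishable. `scale=4.0` is illustrative.
    """
    scaled_base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (scaled_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float()
    return torch.outer(positions, inv_freq)  # shape: (max_pos, head_dim // 2)
```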
2.4 Normalization and Activation
- RMSNorm is applied in a pre-norm configuration before attention and MLP blocks, improving training stability for deep models.
- SwiGLU activations in the FFN layers provide better expressivity and gradient flow for reasoning-heavy workloads than standard ReLU or GELU.
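For reference, both components can be written compactly in PyTorch; this is a generic LLaMA-style sketch rather than our exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated FFN: W_down(SiLU(x W_gate) * x W_up), dimensions from Section 2.1."""
    def __init__(self, dim: int = 4096, hidden: int = 11_008):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```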
2.5 Reasoning, Verification, and Domain Heads
Beyond the core transformer, Xybern-Reasoning-7B includes several auxiliary structures used during training and optionally surfaced in advanced deployments.
- Reasoning Mode Controller: a small classifier predicts whether the model should:
  - Use Direct Mode (short, efficient answers), or
  - Engage Deliberate Mode (multi-step chain-of-thought) for tasks that appear mathematical, legal, financial, or code-oriented.
- Multi-Path Reasoning Head: in Deliberate Mode, the model is encouraged during training to explore multiple candidate reasoning paths:
  - We sample different chains-of-thought for the same input.
  - A selection head learns to estimate which path is most likely correct, given the original question and intermediate steps.
- Self-Verification Head: a verification head is trained to judge:
  - Logical consistency of the reasoning trace.
  - Correctness of key numeric or symbolic computations.
  - Faithfulness to the source text (e.g., whether cited clauses actually support the conclusion).
- Domain-Specific Auxiliary Heads: additional heads are used during training for:
  - Clause type classification (e.g., liability, termination, confidentiality, IP).
  - Risk and sentiment grading of financial text.
  - Highlighting of relevant clauses or sections to support retrieval and explanation.
These heads help train the backbone to encode rich domain structure; production systems can optionally expose them to build more structured tools (e.g., contract reviewers, risk dashboards).
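The sketch below shows one hypothetical way such heads could attach to the backbone’s final hidden states. The interface, pooling choice, and number of clause types are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class AuxiliaryHeads(nn.Module):
    """Hypothetical sketch of the Section 2.5 heads, attached to a pooled
    backbone hidden state (d_model = 4096). Not the released interface."""
    def __init__(self, d_model: int = 4096, n_clause_types: int = 16):
        super().__init__()
        self.mode_controller = nn.Linear(d_model, 2)    # Direct vs. Deliberate
        self.path_selector = nn.Linear(d_model, 1)      # score per candidate chain
        self.verifier = nn.Linear(d_model, 1)           # P(reasoning trace is sound)
        self.clause_classifier = nn.Linear(d_model, n_clause_types)

    def forward(self, h: torch.Tensor) -> dict:
        # h: (batch, d_model), e.g., the last-token hidden state
        return {
            "mode_logits": self.mode_controller(h),
            "path_score": self.path_selector(h).squeeze(-1),
            "verify_logit": self.verifier(h).squeeze(-1),
            "clause_logits": self.clause_classifier(h),
        }
```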
3. Training Data
3.1 Overall Pretraining Corpus
The base model is pretrained on approximately 3 trillion tokens of text and code. The corpus composition is roughly:
- General web, books, and encyclopedic text: ~45%
- Code (multi-language): ~10%
- Math, science, and technical documents: ~10%
- Legal text: ~17%
  - Public-domain court opinions and case law
  - Statutes, regulations, and regulatory guidance
  - Contracts, terms of service, privacy policies, and policy documents
- Financial text: ~18%
  - Annual and quarterly reports (e.g., 10-K, 10-Q)
  - Earnings call transcripts and investor presentations
  - Economic reports and financial research commentary
All sources go through:
- Deduplication and near-duplicate removal.
- Language detection and quality filtering.
- Heuristics and classifiers to remove low-information boilerplate and adversarial content.
- Efforts to minimize leakage of evaluation test sets where practical (e.g., by removing exact matches to benchmark question text).
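As a concrete example of the last step, exact-match decontamination can be as simple as normalizing text and checking for verbatim benchmark questions. The sketch below is a minimal illustration, not our full pipeline:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide matches."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedup_key(document: str) -> str:
    """Hash of the normalized document, usable for exact-duplicate removal."""
    return hashlib.sha256(normalize(document).encode()).hexdigest()

def is_contaminated(document: str, benchmark_questions: list[str]) -> bool:
    """Exact-match decontamination: drop documents that contain a benchmark
    question verbatim (after normalization)."""
    doc = normalize(document)
    return any(normalize(q) in doc for q in benchmark_questions)
```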
3.2 Synthetic Data Generation with DeepSeek
To strengthen reasoning, particularly chain-of-thought in law and finance, we generate a large synthetic dataset using DeepSeek models (e.g., DeepSeek-Chat and DeepSeek-R1-like systems).
We use these models to produce:
- Long-form legal explanations: clause-by-clause analyses, case summaries, statutory interpretation.
- Financial analysis walkthroughs: cash-flow projections, ratio analysis, risk factor breakdowns, and scenario narratives.
- Multi-step math and coding solutions with explicit chain-of-thought.
- Debate-style exchanges where two “agents” critique each other’s reasoning, providing rich examples of self-correction.
In total, we generate on the order of hundreds of millions of synthetic examples. A multi-stage filtering pipeline then:
- Validates simple numeric computations and logical consistency where possible.
- Scores structure, clarity, and domain relevance.
- Removes low-confidence or stylistically poor explanations.
Only the highest-quality fraction of this synthetic data is used for supervised finetuning and RLHF, ensuring that synthetic supervision improves reasoning rather than amplifying noise.
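One of the simplest automated checks in such a pipeline verifies inline arithmetic inside a generated chain-of-thought. The sketch below is illustrative (the regex pattern and tolerance are our own choices), flagging steps whose stated results do not check out:

```python
import re

# Matches inline computations of the form "a <op> b = c", e.g. "120 / 100 = 1.2".
ARITH = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

def arithmetic_errors(chain_of_thought: str, tol: float = 1e-6) -> int:
    """Count inline computations whose claimed result is wrong."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b if b else float("nan")}  # div-by-zero: skip
    errors = 0
    for a, op, b, claimed in ARITH.findall(chain_of_thought):
        if abs(ops[op](float(a), float(b)) - float(claimed)) > tol:
            errors += 1
    return errors

# A correct step passes; a wrong one is flagged.
assert arithmetic_errors("Revenue grew: 120 / 100 = 1.2, so 20% growth.") == 0
assert arithmetic_errors("Margin: 30 / 200 = 0.25") == 1
```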
3.3 Human & Expert-Annotated Datasets
Synthetic supervision is complemented by curated, human-annotated datasets:
- Legal tasks:
  - Issue spotting, clause classification, and legal QA over real contracts and policies.
  - Labeling of obligations, rights, and risk-relevant clauses.
- Financial tasks:
  - Question answering over financial statements and earnings transcripts.
  - Multi-step quantitative reasoning about business performance and capital structure.
- Safety & compliance:
  - Examples of harmful, misleading, or non-compliant legal and financial advice, paired with safe, properly caveated responses.
  - Prompts designed to elicit hallucinations or overconfident behavior, labeled to encourage uncertainty and deferral to human experts.
These curated datasets ground the model’s behavior in realistic professional scenarios, which synthetic data alone cannot fully capture.
4. Training Procedure
4.1 Stage 1 – Base Pretraining
We pretrain Xybern-Reasoning-7B on the mixed corpus using standard next-token prediction:
- Objective: cross-entropy next-token loss.
- Optimizer: AdamW with decoupled weight decay.
- Effective batch size: millions of tokens per step (achieved via gradient accumulation and distributed data-parallel training).
- Learning rate schedule: linear warmup followed by cosine decay.
- Precision: mixed precision (bfloat16 or FP16) with dynamic loss scaling.
Stabilization techniques include gradient clipping, careful initialization, and pre-norm architecture, enabling stable training at large batch sizes.
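A minimal sketch of this optimizer and schedule setup in PyTorch follows; every hyperparameter value shown is a placeholder rather than the actual training configuration:

```python
import math
import torch

def make_optimizer_and_schedule(model: torch.nn.Module,
                                peak_lr: float = 3e-4,
                                weight_decay: float = 0.1,
                                warmup_steps: int = 2_000,
                                total_steps: int = 500_000):
    """Illustrative AdamW + linear-warmup/cosine-decay setup (Section 4.1).
    All values here are placeholders, not the actual training hyperparameters."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=weight_decay)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```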
4.2 Stage 2 – Supervised Instruction & Domain SFT
After pretraining, we perform supervised finetuning (SFT) on instruction-style datasets that emphasize:
- General assistant behavior: following instructions, summarization, translation, basic reasoning.
- Legal tasks: contract Q&A, statutory interpretation, case-style questions, compliance scenarios.
- Financial tasks: multi-step numerical problems, narrative analysis of filings, and explanation of financial concepts.
- Code and math: algorithmic reasoning, code generation and explanation, competition-style math.
We explicitly train the model to:
- Use `<reason>` to mark detailed reasoning and `<final>` to mark concise user-facing answers when requested.
- Present legal and financial answers in structured formats (e.g., bullet points, numbered steps, clear caveats).
- Provide references to relevant clauses, sections, or portions of a document when information is drawn directly from a supplied text.
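A hypothetical SFT record illustrating this format is shown below; the schema and field names are ours, not the released data format:

```python
# Hypothetical SFT record illustrating the control-token format from Section 4.2.
# The schema, field names, and clause numbers are invented for illustration.
sft_example = {
    "prompt": (
        "Under the attached services agreement, can the customer terminate "
        "for convenience, and with how much notice?"
    ),
    "response": (
        "<reason>Clause 12.2 grants either party termination for convenience "
        "on 60 days' written notice; Clause 12.3 carves out prepaid fees, "
        "which are non-refundable.</reason>"
        "<final>Yes. <clause_ref>Clause 12.2</clause_ref> allows termination "
        "for convenience with 60 days' written notice; note that prepaid fees "
        "are non-refundable under <clause_ref>Clause 12.3</clause_ref>. This "
        "is not legal advice; confirm with qualified counsel.</final>"
    ),
}
```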
4.3 Stage 3 – RLHF with Outcome and Process Rewards
To align the model’s behavior with human preferences and professional expectations in law and finance, we apply reinforcement learning from human and automated feedback.
We train two main reward models:
- Outcome Reward Model: scores final answers on:
  - Correctness and relevance.
  - Helpfulness, clarity, and structure.
  - Safety, including avoidance of overconfident legal or investment advice and inclusion of appropriate caveats.
- Process Reward Model: scores the chain-of-thought on:
  - Step-wise logical coherence.
  - Numeric and symbolic correctness for intermediate calculations.
  - Faithful use of the provided documents and data.
  - Avoidance of circular or purely rhetorical reasoning.
Using a PPO-style algorithm, we finetune the policy (Xybern-Reasoning-7B) to maximize a combination of these rewards, with additional regularization to keep the model close to its supervised SFT behavior. This encourages explanations that are both correct and readable, rather than terse or opaque.
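Schematically, the scalar reward optimized by PPO can be written as a weighted sum of the two reward models minus a KL penalty toward the SFT policy. The weights and KL coefficient below are illustrative, not the tuned values:

```python
import torch

def combined_reward(outcome_r: torch.Tensor,      # per-sequence outcome RM score
                    process_r: torch.Tensor,      # per-sequence process RM score
                    logp_policy: torch.Tensor,    # per-token log-probs, current policy
                    logp_sft: torch.Tensor,       # per-token log-probs, frozen SFT policy
                    w_outcome: float = 1.0,
                    w_process: float = 0.5,
                    kl_coef: float = 0.05) -> torch.Tensor:
    """Illustrative per-sequence reward for the PPO stage (Section 4.3).

    The weights and KL coefficient are placeholders. The KL term penalizes
    drift from the SFT policy, keeping generations close to supervised behavior.
    """
    kl = (logp_policy - logp_sft).sum(dim=-1)  # sequence-level KL estimate
    return w_outcome * outcome_r + w_process * process_r - kl_coef * kl
```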
4.4 Stage 4 – Domain-Targeted Post-Training
Finally, we perform domain-targeted post-training:
- We collect difficult examples from benchmarks and from internal evaluation sets (e.g., tricky clauses, ambiguous regulatory questions, numerically challenging financial problems).
- For many of these, the model first generates an answer, then a self-critique, and finally a refined answer. The refined answers and critiques are folded back into the SFT pool.
- We periodically refresh the training data to reflect changing regulations and evolving financial reporting practices, helping mitigate time-based drift.
5. Evaluation
5.1 Public Reasoning Benchmarks
We evaluate Xybern-Reasoning-7B on widely used public benchmarks to compare with open and proprietary models.
| Model | MMLU | GSM8K | MATH | HumanEval | ARC | GPQA | AIME-2024 |
|---|---|---|---|---|---|---|---|
| Xybern-Reasoning-7B | 85.0 | 86.37 | 82.0 | 97.5 | 84.0 | 66.67 | 53.52 |
| Claude 3 Haiku | 71.7 | 85.0 | 34.2 | 75.9 | – | 35.0 | 15.0 |
| GPT-3.5 Turbo | 70.0 | 80.0 | 34.1 | 48.1 | – | 32.0 | 10.0 |
| Grok 2 | 75.0 | 86.2 | 73.0 | 74.1 | – | 40.0 | 20.0 |
| Mistral 7B | 62.5 | 39.6 | 12.7 | 30.5 | – | 30.0 | 15.0 |
| Llama 2 7B | 46.8 | 14.6 | 6.9 | 12.8 | 30.7 | 25.0 | 10.0 |
| Falcon 7B | 43.2 | 19.6 | 5.5 | 10.5 | – | 20.0 | 8.0 |
| Qwen2-7B | 70.3 | 79.9 | 52.9 | 64.6 | – | 37.9 | 20.0 |
| DeepSeek-R1-Qwen3-8B | 75.5 | 82.0 | 75.5 | 84.8 | – | 50.0 | 54.2 |
| Gemma 2 9B | 71.3 | 68.6 | 36.6 | 44.5 | 44.7 | 35.0 | 25.0 |
| Qwen3-7B | 75.5 | 85.0 | 75.5 | 84.8 | – | 50.0 | 54.2 |
| DeepSeek V3 | 75.9 | 95.0 | 90.2 | 51.6 | – | 59.1 | 39.2 |
| DeepSeek-V2.5 | 66.2 | 75.0 | 74.7 | 35.6 | – | 41.3 | 16.7 |
| Qwen2.5-72B-Instruct | 71.6 | 78.0 | 80.0 | 24.8 | – | 49.0 | 23.3 |
| Llama-3.1-405B-Instruct | 73.3 | 85.0 | 73.8 | 25.3 | 64.0 | 51.1 | 23.3 |
| GPT-4o-0513 | 72.6 | 53.6 | 74.6 | 23.6 | 73.7 | 49.9 | 9.3 |
| Claude-3.5-Sonnet-1022 | 78.0 | 95.0 | 78.3 | 20.3 | 74.0 | 65.0 | 16.0 |

A dash (“–”) marks an ARC score that was not collected for that model.
Key observations:
- Among 7B-class models, Xybern-Reasoning-7B leads across most benchmarks, particularly MMLU, MATH, HumanEval, GPQA, and AIME-2024.
- Despite its smaller size, it approaches or exceeds the performance of much larger models on some reasoning benchmarks, especially those emphasizing mathematical problem solving and code correctness.
- Strong general reasoning performance provides a solid foundation for specialized law and finance workloads.
5.2 Legal Evaluations
We evaluate the model on a range of legal-style tasks, including public benchmarks (e.g., LegalBench-style tasks) and internal evaluations based on real-world contracts and policy documents. Typical tasks include:
- Classifying clauses into categories (e.g., confidentiality, termination, liability).
- Answering questions that require locating and interpreting specific provisions.
- Identifying potential risk factors or unusual terms.
Xybern-Reasoning-7B demonstrates:
- High accuracy in locating relevant clauses and summarizing their effect.
- The ability to track definitions and cross-references across long documents.
- Clear explanations that separate facts, interpretations, and uncertainty.
Exact scores depend on dataset composition and prompt configuration and are therefore deferred to a separate technical appendix; qualitatively, the model consistently outperforms open 7B baselines under the same evaluation conditions.
5.3 Financial Evaluations
For finance, we consider tasks such as:
- Question answering over financial statements, including tables and multi-year comparisons.
- Computing and explaining metrics like revenue growth, margins, leverage, and coverage ratios.
- Summarizing earnings calls and identifying key drivers and risks.
Xybern-Reasoning-7B is able to:
- Parse complex narrative text and structured numerical data.
- Perform multi-step computations correctly and describe each step in natural language.
- Provide balanced, caveated commentary rather than overconfident predictions.
As with legal tasks, internal evaluations show consistent improvements over other 7B-class models and competitive performance relative to much larger systems when normalized for prompt and context length.
5.4 Ablation Insights
Internal ablation experiments indicate that:
- Removing process-aware rewards significantly degrades math and logic performance and leads to less coherent chains-of-thought.
- Disabling Deliberate Mode and self-verification reduces performance on the hardest problems (e.g., AIME-style questions and complex contract reasoning), suggesting that multi-path reasoning and verification materially contribute to final accuracy.
- Synthetic DeepSeek-based data is especially important for training the model to produce well-structured explanations; without it, the model’s answers tend to be shorter and less transparent.
6. Safety, Alignment, and Compliance
Legal and financial applications impose stringent safety and compliance requirements. Xybern-Reasoning-7B incorporates several mechanisms to address these.
6.1 Domain-Aware Guardrails
- The model is trained not to present itself as a licensed lawyer, financial advisor, or investment professional.
- Responses to legal or financial questions include appropriate caveats and often encourage consultation with qualified experts, especially when decisions have regulatory or financial consequences.
- The model avoids making specific investment recommendations or definitive legal conclusions where information is incomplete.
6.2 Hallucination Mitigation
- Training rewards emphasize faithful use of source documents, encouraging the model to quote or paraphrase relevant text rather than inventing details.
- The self-verification head and process rewards penalize inconsistent or obviously incorrect reasoning chains.
- The model is encouraged to say “I do not know” or to describe uncertainty and missing information when appropriate.
6.3 Post-Generation Filtering
Deployment stacks can incorporate additional filters:
- Toxicity and safety classifiers.
- Domain-specific compliance checks (e.g., filters for inappropriate investment or legal advice).
- Logging and audit trails so that human reviewers can trace the model’s reasoning on important decisions.
Despite these measures, no model is perfectly safe. Xybern-Reasoning-7B can still produce incorrect, outdated, or biased content. It must be used with human oversight, especially in high-stakes contexts.
7. Inference and Deployment
Xybern-Reasoning-7B is engineered to be practical for enterprise deployment, particularly in regulated sectors.
7.1 Efficiency and Quantization
- The model supports 8-bit and 4-bit quantization with minimal loss in accuracy on most tasks.
- A 4-bit quantized version can run on a single high-end GPU (e.g., 24–48 GB VRAM), enabling on-premise deployments within law firms, banks, and asset managers.
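As an illustration, a 4-bit deployment might be loaded as follows with Hugging Face transformers and bitsandbytes; the model identifier is a placeholder, not a published checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder identifier; not a published checkpoint.
MODEL_ID = "xybern/xybern-reasoning-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 weight quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in bf16
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",                       # place weights on available GPU(s)
)
```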
7.2 Runtime Optimizations
- KV-cache reuse and efficient attention implementations reduce latency for long-context inference.
- Speculative decoding with a smaller draft model can further improve throughput when needed.
- Request batching and multi-tenant scheduling are supported at the server layer.
7.3 Deployment Scenarios
Typical deployments include:
- On-premise: for organizations requiring strict data residency and confidentiality.
- Private cloud: with VPC isolation and strict access control.
- Integrated systems:
- As the core reasoning engine in retrieval-augmented generation (RAG) pipelines over document repositories.
- As the brain of specialized agents (e.g., contract review assistants, financial analysis bots).
- As a backend for user-facing products that provide drafting, review, or analytical capabilities.
8. Limitations and Future Work
8.1 Current Limitations
- Xybern-Reasoning-7B is text-only; tasks involving images, scanned PDFs, or visual charts require external OCR and vision models.
- Jurisdiction-specific legal nuances and local regulatory regimes may not be fully captured without additional fine-tuning on local data.
- The model’s knowledge is static up to its training cutoff; it does not have live access to the latest regulations, filings, or market data unless integrated with external tools.
- Like all LLMs, it may still hallucinate or misinterpret ambiguous inputs and should not be used as the sole decision-maker in legal or financial contexts.
8.2 Future Directions
Future work on the Xybern model family includes:
- Larger variants (e.g., Xybern-Reasoning-34B and Xybern-Reasoning-70B) that offer even stronger performance while remaining practical to deploy for organizations with more resources.
- Multimodal extensions that integrate document images, tables, and charts directly into the reasoning process.
- Tool-integrated versions that can call search APIs, calculators, contract databases, and financial data feeds as part of their reasoning loop.
- Jurisdiction- and sector-specific finetunes, such as models tailored to EU data protection law, US securities regulation, banking supervision, or insurance contracts.
- Continual learning and RLHF using non-identifying feedback from real deployments, steadily refining behavior while respecting privacy and compliance constraints.
9. Conclusion
Xybern-Reasoning-7B demonstrates that carefully designed and trained 7B-parameter models can deliver advanced reasoning capabilities in law and finance while remaining lightweight enough for practical, private deployment. By combining:
- A reasoning-optimized transformer backbone,
- Domain-aware tokenization and long-context attention,
- DeepSeek-based synthetic reasoning data,
- Human- and process-aware RLHF, and
- Self-verification mechanisms,
Xybern-Reasoning-7B establishes a new performance point for compact legal and financial reasoning models without sacrificing general-purpose capabilities in math, coding, and open-domain question answering.
This white paper presents the core design and experimental evidence; the broader Xybern platform builds on this foundation to deliver end-to-end solutions for document analysis, drafting, and decision support in regulated industries.