Therefore, username related conditions with "this", "a", "you", and so on are completely ignored, even if they are valid username components.
To prevent username related conditions from removing "stop words", this revision explicitly specifies "analyzer" for username related fields of SCHEMA and CHGSETS_SCHEMA.
Difference between EMAILADDRANALYZER and default analyzer of TEXT is whether "stop words" are preserved or not. Tokenization is still applied on usernames.
For future changing, this revision doesn't make EMAILADDRANALYZER share analyzer definition with PATHANALYZER, even though their definitions are identical with each other at this revision.
This revision requires full re-building index tables, because indexing schemas are changed.
Original patch has been modified by Mads Kiilerich - tests of 'owner' will be addressed separately.
Therefore, pathname related conditions with "this", "a", "you", and so on are completely ignored, even if they are valid pathname components.
To prevent pathname related conditions from removing "stop words", this revision explicitly specifies "analyzer" for pathname related fields of SCHEMA and CHGSETS_SCHEMA.
Difference between PATHANALYZER and default analyzer of TEXT is whether "stop words" are preserved or not. Tokenization is still applied on pathnames.
This revision requires full re-building index tables, because indexing schemas are changed.
search: make "repository:" condition work case-insensitively as expected
Before this revision, "repository:" condition at searching for "Commit messages" never shows revisions in a repository, of which name uses upper case letter.
Using ID for "repository" of CHGSETS_SCHEMA preserves case of repository name at indexing. On the other hand, search condition itself is forcibly lowered before parsing.
- files in repository "FOO" is indexed as "FOO" in "repository" field - "repository:FOO" condition is treated as "repository:foo:
Then, indexing search itself is executed case-sensitively. Therefore, "repository:FOO" condition never show revisions in repository "FOO".
But just making "repository" of CHGSETS_SCHEMA case-insensitive isn't reasonable enough, because it breaks assumptions below, if there is case-insensitive name collision between repositories, even though Kallithea itself can manage such repositories at same time.
- combination of "raw_id" (= revision hash ID) and "repository" is unique between all known revisions under Kallithea
CHGSETS_SCHEMA assumes this.
This unique-ness is required by Whoosh library to determine whether index table should be updated or not for that repository.
- searching in a repository shows only revisions in that repository
Before this revision, this filtering is achieve by "repository:" condition with case-preserved repository name from requested URL.
To make "repository:" search condition work case-insensitively as expected (without any violation of assumptions above), this revision does:
- make "repository" of CHGSETS_SCHEMA case-insensitive by "analyzer=ICASEIDANALYZER"
- introduce "repository_rawname" into SCHEMA and CHGSETS_SCHEMA, to ensure assumptions described above, by preserving case of repository name
"repository_rawname" of SCHEMA uses not ID but TEXT, because the former disable "positions" feature, which is required for highlight-ing file content (see previous revision for detail).
This revision requires full re-building index tables, because indexing schemas are changed.
search: make "repository:" condition work as expected
Before this revision, "repository:foo" condition at searching for "File contents" or "File names" shows files in repositories below.
- foo - foo/bar - foo-bar - and so on ...
Whoosh library, which is used to parse text for indexing and seaching, does:
- treat almost all non-alphanumeric characters as delimiter both at indexing search items and at parsing search condition - make each fields for a search item be indexed by multiple values
For example, files in "foo/bar" repository are indexed by "foo" and "bar" in "repository" field. This tokenization make "repository:foo" search condition match against files in "foo/bar" repository, too.
In addition to it, using plain TEXT also causes unintentional ignorance of "stop words" in search conditions. For example, "this", "a", "you", and so on are ignored at indexing and parsing, because these are too generic words (from point of view of generic "text search").
This issue can't be resolved by using ID instead of TEXT for "repository" of SCHEMA, like as previous revisions for JOURNAL_SCHEMA, because:
- highlight-ing file content requires SCHEMA to support "positions" feature, but using ID instead of TEXT disables it - using ID violates current case-insensitive search policy, because it preserves case of text
To make "repository:" condition work as expected, this revision explicitly specifies "analyzer", which does:
- avoid tokenization - match case-insensitively - avoid removing "stop words" from text
This revision requires full re-building index tables, because indexing schema is changed.
BTW, "repository:" condition at searching for "Commit messages" uses CHGSETS_SCHEMA instead of SCHEMA. The former uses ID for "repository", and it does:
- avoid issues by tokenization and removing "stop words"
- disable "positions" feature of CHGSETS_SCHEMA
But highlight-ing file content isn't needed at searching for "Commit messages". Therefore, this can be ignored.
- preserve case of text
This violates current case-insensitive search policy, This issue will be fixed by subsequent revision, because fixing it isn't so simple.
journal: make "repository:" filtering condition work as expected (Issue #261)
Before this revision, journal filtering conditions like as below never match against any entry, even if there are corresponded repositories.
- repository:foo/bar - repository:foo-bar
Whoosh library, which is used to parse filtering condition, does:
- treat almost all non-alphanumeric characters as delimiter at parsing condition - join each conditions at filtering by "AND", by default
For example, filtering condition "repository:foo/bar" is translated as "repository:foo AND repository:bar". This combined condition never matches against any entry, because it is impossible that "repository" field in DBMS table "user_logs" has both "foo" and "bar" values at same time.
Using TEXT for "repository" of JOURNAL_SCHEMA causes this issue, because TEXT assumes tokenization at parsing.
In addition to it, using TEXT also causes unintentional ignorance of "stop words" in filtering conditions. For example, "this", "a", "you", and so on are ignored at parsing, because these are too generic words (from point of view of generic "text search").
To make "repository:" filtering condition work as expected, this revision uses ID instead of TEST for "repository" of JOURNAL_COLUMN. ID avoids both tokenization and removing "stop words".
This replacement should be safe with already existing DBMS instance, because:
- JOURNAL_SCHEMA is used only to parse filtering condition - DBMS table "user_logs" itself is defined by UserLog class (in kallithea/model/db.py)
BTW, using ID also avoids normalization by lowercase-ing. But this doesn't violate current case-insensitive search policy, because LOWER-ing in actual SQL query is achieved by get_filterion() or so in kallithea/controllers/admin/admin.py.
tests: introduce more test coverage of whoosh filtering
This has been extracted from other changesets by Mads Kiilerich to establish a test baseline so we clearly can see what the following fixes are fixing.
Some of these tests will thus demonstrate bad behaviour - that will be fixed later.
* drop support for 'cache.' prefix - it is only 'beaker.config.' that ever has been documented or used in examples * make prefix stripping more safe (if the prefix should appear twice in the key) * trust exactly what has been configured - drop stripping of strings * drop int conversion of 'expire' - beaker documentation suggests strings are fine
The second argument to the guess_instance classmethod is in practice fixed for Gist, Permission, RepoGroup, Repository, User, UserGroup; so might as well move this logic into per-class specializations of guess_instance.
Since the test repositories are created in the database by scanning the file system, their IDs depends on the (non-deterministic) order in which files are listed by the file system.
The solution, of course, is to ask the database for the ID instead of assuming it will be ID 1.
This refactoring simply emphasizes that the origin reference used to generate the default title is not always the same as the one actually used to create the pull request; for PRs not from a branch head, the branch name is stored so it can be used for creating new iterations.
auth: simplify the double invoked auth classes used for permission checking
Also avoid storing request state in the Has Permission instances. The instances er temporary and only used once and there is thus not real problem, but it is simpler and cleaner this way.
controllers: avoid setting request state in controller instances - set it in the thread global request variable
In TurboGears, controllers are singletons and we should avoid using instance variables for any volatile data. Instead, use the "global thread local" request context.
With everything in request, some use of c is dropped.
Note: kallithea/controllers/api/__init__.py still use instance variables that will cause problems with TurboGears.
controllers: avoid setting constants as controller instance variables in __before__
Setting constants in __before__ is a weird pattern and we can do fine without doing it. That makes it more clear that there is no state in the controller instances.
Real constants can just be set at the module level.
Some values depend on configuration and can thus probably not be set as constants at module import time but could perhaps be set in __init__. But reading configuration directly when needed will probably be just as good.
This type of code examples are not present in other files. It is unclear whether this example is still accurate. And it is equally unclear whether they provide real value compared to 'real' code.
The markup in base.html is a bit misleading. The "quick login" drop down menu is not just the quick login box but also the "current user info" box.
When the user is logged in, the "real" quick login box isn't shown and the drop menu's aria-describedby="quick_login_h" thus becomes an invalid reference.
To fix the reference, also call the user info for quick_login_h - that is approximately correct.
hg: return 400 Bad Request for hg commands that not are commands
Avoid throwing bare Exceptions which requires framework specific testing. Instead, return a reasonable http error code and make the test more framework independent.
The "helpful" message will just be a description of the http exception and not sent to the client.
auth: add support for "Bearer" auth scheme (API key variant)
This allows the API key to be passed in a header instead of the query string, reducing the risk of accidental API key leaks:
Authorization: Bearer <api key>
The Bearer authorization scheme is standardized in RFC 6750, though used here outside the full OAuth 2.0 authorization framework. (Full OAuth can still be added later without breaking existing users.)
auth: track API key used for authentication in AuthUser
This allows us to define only once how an API key is passed to the app. We might e.g. allow API keys to be passed in an HTTP header; with this change, we only need to update the code in one place.
Also change the code to verify up front that the API key resolved to a valid and active user, so LoginRequired doesn't need to do that.
Also return plain 403 Forbidden for bad API keys instead of redirecting to the login form, which makes more sense for non-interactive clients (the typical users of API keys).
auth: add AuthUser.is_anonymous, along with some exposition
This reveals the name of the NotAnonymous decorator to be misleading, an unfortunate detail only documented here, but which must be properly resolved in a later changeset.
Note that NotAnonymous behaves as advertised as long as it is used together with LoginRequired, which is always the case in the current code, so there's no actual security issue here, the code is just weird, hard to read and fragile.
---
Some thoughts on cleaning this up in a future changeset: As it turns out, every controller (except the login page!) should be LoginRequired decorated (since it doesn't actually block anonymous users, as long as anonymous access is enabled in the Kallithea config). Thus the most obvious solution would be to move the LoginRequired functionality into BaseController (with an override for LoginController), and delete the decorator entirely. However, LoginRequired does one other thing: it carries information about whether API access is enabled for individual controller methods ("@LoginRequired(api_key=True)"), and also performs the check for this, something which is not easily moved into the base controller class, since the base controller doesn't know which method is about to be called. Possibly that can be determined by poking Pylons, but such code is likely to break with the upcoming TurboGears 2 move. Thus such cleanup is probably better revisited after the switch to TG2.
auth: perform basic HTTP security checks already in BaseController
There's no reason to postpone these to a LoginRequired decorated controller function. This way, they run unconditionally for all subclasses of BaseController (so everything except JSON-RPC and VCS controllers).